1 The use-case data
- Understand the aim and design of the use-case DIA expression proteomics experiment
- Know how label-free relative quantitation is achieved in Data-Independent Acquisition (DIA) proteomics
- Understand what DIA-NN does and what data it outputs for downstream analysis
1.1 Exploring changes in protein abundance in monkeypox infection
As a use case, we will analyse a label-free DIA plasma proteomics dataset comparing patients infected with Monkeypox virus (MPXV), patients with COVID-19, and healthy controls. The aim of the experiment was to identify proteins that are differentially abundant between these groups, providing insight into the host response to MPXV infection and how it compares to another viral infection.
Plasma samples were collected from three groups:
- MPXV: patients with confirmed Monkeypox virus infection
- COVID-19: patients with confirmed COVID-19
- Control: healthy control individuals
Proteins were extracted and digested to peptides for bottom-up mass spectrometry analysis. Samples were run individually (label-free) on the mass spectrometer using a Data-Independent Acquisition (DIA) method.
1.2 Data-Independent Acquisition (DIA)
Unlike Data-Dependent Acquisition (DDA), where the mass spectrometer selects individual precursor ions for fragmentation based on their intensity, DIA systematically acquires fragment ion data for all precursors within defined m/z windows, without intensity-based selection. DIA data can exhibit more missing values between runs than TMT-labelled DDA data, since proteins must be detected independently in each sample rather than being quantified as part of a multiplexed pool. However, label-free DIA offers more complete and robust quantification than label-free DDA workflows. Because samples are run individually rather than in multiplexed batches, it also scales to very large studies of tens of thousands of samples or more. DIA workflows designed for comprehensive coverage can quantify proteins across a much larger range of abundances than DDA workflows, which is particularly valuable for detecting and quantifying low-abundance proteins.
1.3 DIA-NN: the identification and quantification software
The raw DIA mass spectrometry data was processed using DIA-NN, a freely available neural-network-based software tool for DIA data analysis. DIA-NN performs spectral library searching and quantification, outputting a report file (in .parquet format) containing precursor- and protein-level identifications and quantities for each sample. DIA-NN can use spectral libraries provided as input or create in-silico spectral libraries from all theoretical peptides derived from the protein sequences in the provided FASTA file. In both cases, DIA-NN can be instructed to use the raw data from an experiment to refine the expected retention times and ion mobility information in the library, whether that library is empirical or generated in silico.
Precursors are identified in quantitative proteomics experiments by comparing the mass spectra acquired during an experiment with the expected spectra in a spectral library, so statistical control of false matches between the acquired and expected spectra is critical. This is done by estimating a “false discovery rate” (FDR). FDR is reported as q-values, which are computed per sample and per experimental dataset at both the precursor and the protein group levels. Applying a q-value filter of ≤ 0.01 means that, amongst all precursors or protein groups retained at this threshold, approximately 1% are expected to be false identifications. In large quantitative proteomics experiments, FDR is usually controlled at both the sample and dataset level. DIA-NN reports contain the following q-value columns:
- Q.Value: run-level precursor FDR. This tells you how confident DIA-NN is that a specific precursor was correctly identified in a particular LC-MS sample injection (a “run”).
- PG.Q.Value: run-level protein group FDR. This tells you how confident DIA-NN is that a protein group was correctly identified in a particular run.
- Global.Q.Value: experiment-level precursor FDR. This describes the confidence DIA-NN has that a precursor was correctly identified across the whole experiment (all files that were processed together in DIA-NN), rather than in just one run.
- Global.PG.Q.Value: experiment-level protein group FDR. This describes the confidence DIA-NN has that a protein group was correctly identified across the whole experiment.
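To make the 0.01 threshold concrete, the short sketch below applies all four q-value columns to a handful of made-up rows. It is an illustration only: the q-value column names are those listed above, while the "Run" and "Precursor.Id" columns, the row values and the helper function passes_q_filters are assumptions introduced for this example. In practice the report would be read from the .parquet file into a data frame rather than typed in by hand.

```python
# Toy rows mimicking a few columns of a DIA-NN report (all values are invented).
report = [
    {"Run": "S1", "Precursor.Id": "PEPTIDEA2", "Q.Value": 0.001,
     "PG.Q.Value": 0.002, "Global.Q.Value": 0.001, "Global.PG.Q.Value": 0.001},
    {"Run": "S1", "Precursor.Id": "PEPTIDEB2", "Q.Value": 0.020,
     "PG.Q.Value": 0.002, "Global.Q.Value": 0.001, "Global.PG.Q.Value": 0.001},
    {"Run": "S2", "Precursor.Id": "PEPTIDEA2", "Q.Value": 0.005,
     "PG.Q.Value": 0.030, "Global.Q.Value": 0.001, "Global.PG.Q.Value": 0.001},
    {"Run": "S2", "Precursor.Id": "PEPTIDEC2", "Q.Value": 0.004,
     "PG.Q.Value": 0.006, "Global.Q.Value": 0.050, "Global.PG.Q.Value": 0.001},
    {"Run": "S3", "Precursor.Id": "PEPTIDED3", "Q.Value": 0.010,
     "PG.Q.Value": 0.010, "Global.Q.Value": 0.010, "Global.PG.Q.Value": 0.010},
]

# Run-level and experiment-level q-value columns described in the text.
Q_COLUMNS = ["Q.Value", "PG.Q.Value", "Global.Q.Value", "Global.PG.Q.Value"]
THRESHOLD = 0.01


def passes_q_filters(row, columns=Q_COLUMNS, threshold=THRESHOLD):
    """Return True if every listed q-value is at or below the threshold.

    Hypothetical helper for this sketch, not part of DIA-NN.
    """
    return all(row[column] <= threshold for column in columns)


# Keep only rows that are confident at both run and experiment level.
filtered = [row for row in report if passes_q_filters(row)]

for row in filtered:
    print(row["Run"], row["Precursor.Id"])
```

A row is dropped as soon as any one of the four q-values exceeds the threshold, so a precursor that is confidently identified in one run can still be removed if its protein group, or its experiment-level evidence, is weak.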
When DIA-NN creates a spectral library, whether an in-silico theoretical library or an empirical library refined from the raw data (as used by the “match-between-runs” (MBR) setting, which improves data completeness), it also outputs q-values describing how confident it is that each precursor or protein group was correctly assigned in the library. These are reported in the columns “Lib.Q.Value” and “Lib.PG.Q.Value”. Filtering at Lib.Q.Value ≤ 0.01 or Lib.PG.Q.Value ≤ 0.01 means that approximately 1% of the retained library entries are expected to be incorrect at the precursor or protein group level, respectively.
These library-level filters are especially important when MBR is switched on for DIA-NN data processing:
- Lib.Q.Value: library-level precursor FDR.
- Lib.PG.Q.Value: library-level protein group FDR.
We will use these values to filter our data to high-confidence identifications.
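As a rough sketch of how the library-level columns could be combined with the run-level and experiment-level ones, the example below adds Lib.Q.Value and Lib.PG.Q.Value to the filter only when MBR was used. This is one possible policy, not a DIA-NN recommendation: the mbr_enabled flag, the helper functions select_q_columns and filter_report, and the toy rows are all assumptions made for illustration.

```python
# q-value columns reported by DIA-NN, grouped as in the text.
BASE_Q_COLUMNS = ["Q.Value", "PG.Q.Value", "Global.Q.Value", "Global.PG.Q.Value"]
LIB_Q_COLUMNS = ["Lib.Q.Value", "Lib.PG.Q.Value"]


def select_q_columns(mbr_enabled):
    """Choose which q-value columns to filter on (hypothetical helper).

    Library-level columns are added only when MBR was switched on.
    """
    if mbr_enabled:
        return BASE_Q_COLUMNS + LIB_Q_COLUMNS
    return list(BASE_Q_COLUMNS)


def filter_report(rows, mbr_enabled, threshold=0.01):
    """Keep rows whose selected q-values are all at or below the threshold."""
    columns = select_q_columns(mbr_enabled)
    return [row for row in rows if all(row[c] <= threshold for c in columns)]


# Two invented rows: the second has a weak library-level precursor q-value.
toy_rows = [
    {"Precursor.Id": "PEPTIDEA2", "Q.Value": 0.001, "PG.Q.Value": 0.001,
     "Global.Q.Value": 0.001, "Global.PG.Q.Value": 0.001,
     "Lib.Q.Value": 0.002, "Lib.PG.Q.Value": 0.002},
    {"Precursor.Id": "PEPTIDEB2", "Q.Value": 0.002, "PG.Q.Value": 0.002,
     "Global.Q.Value": 0.002, "Global.PG.Q.Value": 0.002,
     "Lib.Q.Value": 0.030, "Lib.PG.Q.Value": 0.002},
]

without_lib = filter_report(toy_rows, mbr_enabled=False)
with_lib = filter_report(toy_rows, mbr_enabled=True)
print(len(without_lib), len(with_lib))
```

With these toy values, both rows survive the run-level and experiment-level filters, but the second row is removed once the library-level columns are included.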
1.4 Summary
The use-case data we will process and analyse is a DIA label-free bottom-up plasma proteomics dataset. The aim is to quantify relative protein abundances between patient groups and apply statistical tests to determine whether any proteins differ significantly in abundance.