dia_qf[["precursors_filtered_missing"]] %>%
  assay() %>%
  longForm() %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  theme_bw() +
  xlab("Abundance (raw)")

4 Normalisation and aggregation
- Know how to log2 transform DIA precursor data within a QFeatures object
- Understand how to use robustSummary to summarise precursor-level data to protein-level abundance, including in the presence of missing values
- Be able to normalise protein-level data using the diff.median method
- Understand the importance of visualising missing values at the protein level
Having filtered our DIA-NN data to retain only high-confidence identifications and removed precursors with excessive missing values, we are now ready to continue processing. In this lesson we will log2 transform the precursor-level data, aggregate to protein-level abundances, and normalise the protein-level data ready for downstream statistical analysis.
4.1 Logarithmic transformation
If we look at the distribution of raw precursor intensities, we will see that the abundance values are dramatically skewed towards zero.
This is to be expected, since the majority of proteins exist at low abundance within the cell or in serum and only a few are highly abundant. As a consequence, raw abundances are not normally distributed, which means that parametric statistical tests cannot be applied to them directly.
Why does the skewed distribution matter? Consider a protein with abundance values of 0.25, 1, and 4 across three samples A, B, and C:
- Each step represents a 4-fold increase — the fold changes are equal.
- Yet on a linear scale, samples A and B (difference of 0.75) appear much closer together than B and C (difference of 3).
- A parametric test would treat these as unequal differences, introducing a systematic bias.
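After a log2 transformation the values become −2, 0, and +2, which are evenly spaced, so equal fold changes now correspond to equal differences. Applied to the whole dataset, the transformation converts the skewed distribution into a more symmetrical, approximately Gaussian distribution suitable for downstream statistical analysis.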
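The three-sample example above can be checked in a few lines of base R. This is only a minimal sketch: the vector name abundance and the sample labels are made up for illustration and are not part of the course dataset.

```r
# Hypothetical abundances of one protein in samples A, B and C
abundance <- c(A = 0.25, B = 1, C = 4)

# On the linear scale the two 4-fold steps give unequal differences
linear_steps <- diff(abundance)

# After log2 transformation both steps have the same size
log2_abundance <- log2(abundance)
log2_steps <- diff(log2_abundance)

print(linear_steps)
print(log2_steps)
```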
Although there is no mathematical reason for applying a log2 transformation rather than using a higher base such as log10, the log2 scale makes fold changes easy to interpret. Any protein that halves in abundance between conditions has a fold change of 0.5, which translates into a log2 fold change of -1. Any protein that doubles in abundance has a fold change of 2 and a log2 fold change of +1.
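The point that the choice of base is a matter of convenience can be illustrated with a short base R sketch. The fold-change values are illustrative only. It shows that log2 and log10 values differ by a constant factor, so they carry the same information, while only the log2 scale maps a halving or doubling to -1 or +1.

```r
# Illustrative fold changes: halving, no change, doubling, 4-fold increase
fold_change <- c(0.5, 1, 2, 4)

log2_fc <- log2(fold_change)
log10_fc <- log10(fold_change)

# Ratio of the two scales (skipping fold change 1, where both logs are 0)
changed <- fold_change != 1
scale_factor <- log10_fc[changed] / log2_fc[changed]

print(log2_fc)
print(scale_factor)  # the same constant, log10(2), for every entry
```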
Logarithmic transformation should be applied at the appropriate stage of the processing pipeline for the aggregation method being used. Here we apply it before aggregation to protein level, because we use robustSummary for summarisation. robustSummary fits a linear model to the precursor-level data, which assumes approximately Gaussian distributed residuals — a requirement that is only met after log transformation.
We apply a log2 transformation to the filtered precursor data using the logTransform function, which creates a new set in our QFeatures object.
dia_qf <- logTransform(
  dia_qf,
  base = 2,
  i = "precursors_filtered_missing",
  name = "precursors_filtered_missing_log")

4.2 Summarising to protein-level abundance
We now aggregate from precursor level to protein level using the aggregateFeatures function. For DIA data with missing values, we use the robustSummary method (Sticker et al. 2020), which models each log-transformed precursor-level quantification as the sum of a protein-level abundance in that sample and a precursor-specific effect. This modelling-based approach can handle missing values without imputation, as it only considers finite data. The only requirement is that a precursor must be quantified in at least two samples.
dia_qf <- aggregateFeatures(dia_qf,
                            i = "precursors_filtered_missing_log",
                            fcol = "Protein.Ids",
                            name = "proteins",
                            fun = MsCoreUtils::robustSummary,
                            maxit = 10000)
dia_qf

An instance of class QFeatures (type: bulk) with 5 sets:
[1] precursors: SummarizedExperiment with 4692 rows and 31 columns
[2] precursors_no_cont: SummarizedExperiment with 4417 rows and 31 columns
[3] precursors_filtered_missing: SummarizedExperiment with 4137 rows and 31 columns
[4] precursors_filtered_missing_log: SummarizedExperiment with 4137 rows and 31 columns
[5] proteins: SummarizedExperiment with 368 rows and 31 columns
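To see why a model-based summary copes with missing values, the idea can be sketched on a toy matrix in base R. This is a simplified illustration, not the robustSummary implementation: it swaps in ordinary least squares (lm) for the robust regression that robustSummary uses, and the matrix x, its row and column names, and the object names are all hypothetical.

```r
# Toy log2 intensities: 3 precursors (rows) x 4 samples (columns), 2 values missing
x <- matrix(c(10, 11, 12, 13,
              12, NA, 14, 15,
               9, 10, NA, 12),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("prec", 1:3), paste0("S", 1:4)))

# Long format; as.vector() reads the matrix column by column
long <- data.frame(
  value = as.vector(x),
  precursor = factor(rep(rownames(x), times = ncol(x))),
  sample = factor(rep(colnames(x), each = nrow(x)))
)

# Keep only finite observations: no imputation is needed
long <- long[is.finite(long$value), ]

# value = sample effect + precursor effect; sum contrasts make each sample
# effect the abundance at the average precursor effect
# (ordinary least squares here, as a stand-in for robust regression)
fit <- lm(value ~ 0 + sample + precursor, data = long,
          contrasts = list(precursor = "contr.sum"))
protein_model <- coef(fit)[paste0("sample", colnames(x))]

# Naive alternative: per-sample mean of whatever precursors were observed
protein_mean <- colMeans(x, na.rm = TRUE)

print(round(protein_model, 3))
print(round(protein_mean, 3))
```

In this toy matrix every precursor rises by one log2 unit from each sample to the next. The model-based estimates recover that trend, whereas the per-sample means in S2 and S3 are pulled up or down depending on which precursor happens to be missing.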
Robust protein identification and quantification depend heavily on the number and quality of peptides detected for each protein. Whilst this workflow does not explicitly demonstrate filtering of single-peptide identifications (“one-hit wonders”), it is considered good practice to assess these carefully during downstream analysis. As a general rule of thumb, the use of at least two unique proteotypic peptides per protein is widely considered best practice. Major peer-reviewed proteomics journals, including Molecular & Cellular Proteomics and Journal of Proteome Research, commonly expect this standard in discovery-based studies to help control false discovery rates (FDR) and improve confidence in protein identification.
Although quantification based on a single peptide can be acceptable in targeted workflows such as SRM or PRM, this generally requires that the peptide be well characterised, reproducibly detected, and manually validated. In contrast, discovery-driven proteomics experiments should typically adhere to the two-peptide guideline to maximise confidence, reproducibility, and peer-review acceptance.
However, strict enforcement of the two-peptide rule may not always be feasible in studies involving non-model or poorly annotated organisms. In such cases, incomplete genomic or proteomic databases, limited sequence coverage, and high sequence homology can restrict confident identification to a single peptide. Researchers must then apply additional validation strategies, such as manual spectral inspection, orthogonal evidence, or cross-species homology assessment, to support protein identification and quantification.
4.3 Normalisation
We now have log2-transformed protein-level abundance data to which we could apply a parametric statistical test. However, before testing whether any proteins differ in abundance between conditions, we first need to account for non-biological variance that could otherwise appear as differential abundance. Such variance can arise from experimental error or technical variation, although the latter is much more prominent in label-free DDA data than in TMT DDA or LFQ DIA data.
Normalisation is the process by which we account for non-biological variation in protein abundance between samples and attempt to return our quantitative data back to its ‘normal’ condition i.e., representative of how it was in the original biological system. There are various methods that exist to normalise expression proteomics data and it is necessary to consider which of these to apply on a case-by-case basis. Unfortunately, there is not currently a single normalisation method which performs best for all quantitative proteomics datasets.
In QFeatures we can use the normalize function to apply normalisation. To see which methods are supported, type ?normalize to access the function’s help page.
Of the supported methods, median-based methods work well for most quantitative proteomics data. Unlike mean-based methods, median-based normalisation is less sensitive to the extreme values and outliers that are commonly present in proteomics datasets, making it a more robust choice for correcting sample-level systematic differences in loading or ionisation efficiency.
Median-based normalisation works by calculating a single correction factor per sample — the difference between that sample’s median abundance and a reference value — and applying it as a constant shift to every protein in that sample. This means all proteins in a given sample are shifted by the same amount, preserving the relative differences between proteins within each sample while bringing the overall distributions into alignment across samples.
We apply the diff.median method, which shifts each sample’s intensity distribution so that all sample medians match the grand median across all samples.
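The shift described above can be sketched in base R on a toy matrix. This is a minimal illustration of the idea under stated assumptions, not the code that normalize runs internally: the matrix x and its protein and sample names are hypothetical, and the reference value is taken as the median of all finite values.

```r
# Toy log2 protein abundances: 4 proteins (rows) x 3 samples (columns)
x <- cbind(S1 = c(10, 11, 12, 13),
           S2 = c(11, 12, 13, 14),
           S3 = c(9, 10, NA, 12))
rownames(x) <- paste0("prot", 1:4)

# Per-sample medians and the grand median over all finite values
sample_medians <- apply(x, 2, median, na.rm = TRUE)
grand_median <- median(x, na.rm = TRUE)

# One correction factor per sample
shift <- sample_medians - grand_median

# Subtract each sample's shift from every protein in that sample
x_norm <- sweep(x, 2, shift, FUN = "-")

print(shift)
print(apply(x_norm, 2, median, na.rm = TRUE))
```

Because each column is shifted by a single constant, differences between proteins within a sample are unchanged and missing values stay missing.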
More sophisticated normalisation approaches may be appropriate depending on the data. For example, the authors of the original study from which this dataset is derived (Wang et al. 2022) used cyclic loess normalisation, available via limma::normalizeCyclicLoess(), which corrects for intensity-dependent biases rather than applying a simple global shift. However, applying it within the QFeatures pipeline requires extracting and replacing the assay matrix manually, and because cyclic loess is sensitive to missing values, additional care is needed — complexity beyond the scope of this course.
dia_qf <- normalize(dia_qf,
                    i = "proteins",
                    name = "norm_proteins",
                    method = "diff.median")

We can visualise the effect of normalisation by plotting density curves of protein abundance per sample before and after normalisation.
pre_norm <- longForm(dia_qf[,,'proteins'], colvars = 'group') %>%
  ggplot(aes(x = value, colour = group, group = colname)) +
  geom_density() +
  theme_classic() +
  xlab("log2 (Abundance)") +
  ggtitle('Pre-normalisation')

harmonizing input:
  removing 155 sampleMap rows not in names(experiments)
post_norm <- longForm(dia_qf[,,'norm_proteins'], colvars = 'group') %>%
  ggplot(aes(x = value, colour = group, group = colname)) +
  geom_density() +
  theme_classic() +
  xlab("log2 (Abundance)") +
  ggtitle('Post-normalisation')

harmonizing input:
  removing 155 sampleMap rows not in names(experiments)
# Combining ggplot objects with + requires the patchwork package to be loaded
pre_norm + post_norm

- Log2 transformation is applied to precursor-level data before aggregation to protein level, giving data a more Gaussian distribution suitable for downstream statistical analysis
- robustSummary is the recommended summarisation method for DIA data, as it can handle missing values at the precursor level without requiring imputation
- Proteins where all precursors are missing in a given sample will still have missing protein-level abundances, even with robustSummary; these should be explored and handled appropriately
- Upset plots are a useful way to visualise patterns of missingness across samples, helping to identify whether missing values are random or associated with particular groups
- diff.median normalisation corrects for sample-level systematic differences by shifting each sample’s intensity distribution to have the same median



