3 Data cleaning

Learning Objectives

Know how to use filterFeatures to filter DIA-NN precursor data by Q-value thresholds
Know how to use joinAssays to combine per-sample SummarizedExperiment objects into a single combined set
Know how to identify and remove contaminant proteins
Be able to explore and visualise missing values in DIA proteomics data
Understand how to filter sparse precursors based on a missing value threshold

Now that we have our precursor-level data stored in a QFeatures object, the next step is to carry out data cleaning — filtering to retain only high-confidence identifications and removing precursors with excessive missing values across samples.

3.1 Filtering by Q-value

Discussion

Before reading on, consider the following:

Why do we need to filter the DIA-NN output at all? What would happen if we kept all identifications?
What do the Q-value columns represent, and what is the difference between run-level and library-level Q-values?
What threshold would you choose?
What impact could this filtering have on the downstream analysis and results?

DIA-NN assigns a Q-value to each precursor and protein group identification, representing the local false discovery rate (FDR) at that score threshold — i.e., the expected proportion of identifications at that level that are incorrect. Filtering by Q-value ensures we retain only high-confidence identifications before proceeding to downstream analysis.

DIA-NN outputs several Q-value columns:

Q.Value — run-level precursor Q-value (FDR within a single run)
PG.Q.Value — run-level protein group Q-value
Lib.Q.Value — library-level precursor Q-value (based on the spectral library used in matching)
Lib.PG.Q.Value — library-level protein group Q-value

Since our data was processed with match-between-runs (MBR) enabled, we additionally filter on the library-level Q-values (Lib.Q.Value and Lib.PG.Q.Value). When MBR is used, DIA-NN rescores identifications across runs using an empirical library built in a first pass; the library-level Q-values reflect the confidence of those cross-run matches and are therefore the most stringent and appropriate filter for MBR data. We apply a standard 1% FDR threshold (Q-value ≤ 0.01) to all four columns.

To remove features based on variables within our rowData we make use of the filterFeatures function. This function takes a QFeatures object as its input and filters it against a condition based on the rowData (indicated by the ~ operator). If the condition evaluates to TRUE for a feature, that feature will be kept; features returning FALSE are removed.

The filterFeatures function provides the option to apply a filter either (i) to the whole QFeatures object and all of its SummarizedExperiments, or (ii) to specific SummarizedExperiments within the QFeatures object, by passing the name or index of the desired set to the i argument.

Since we want to apply the Q-value filter across all per-sample SummarizedExperiments simultaneously, we do not pass an i argument — this causes filterFeatures to apply the filter to every set in the QFeatures object.

dia_qf

An instance of class QFeatures (type: bulk) with 31 sets:

 [1] COVID_E1: SummarizedExperiment with 3834 rows and 1 columns 
 [2] COVID_E10: SummarizedExperiment with 3914 rows and 1 columns 
 [3] COVID_E2: SummarizedExperiment with 4016 rows and 1 columns 
 ...
 [29] MPX_D4: SummarizedExperiment with 3950 rows and 1 columns 
 [30] MPX_D5: SummarizedExperiment with 4043 rows and 1 columns 
 [31] MPX_D6: SummarizedExperiment with 3861 rows and 1 columns

dia_qf <- dia_qf %>%
  filterFeatures(~ Q.Value <= 0.01) %>% # Run-level precursor Q-value
  filterFeatures(~ PG.Q.Value <= 0.01) %>% # Run-level protein group Q-value
  filterFeatures(~ Lib.Q.Value <= 0.01) %>% # Library-level precursor Q-value
  filterFeatures(~ Lib.PG.Q.Value <= 0.01) # Library-level protein group Q-value

'Q.Value' found in 31 out of 31 assay(s).

'PG.Q.Value' found in 31 out of 31 assay(s).

'Lib.Q.Value' found in 31 out of 31 assay(s).

'Lib.PG.Q.Value' found in 31 out of 31 assay(s).

dia_qf

An instance of class QFeatures (type: bulk) with 31 sets:

 [1] COVID_E1: SummarizedExperiment with 3751 rows and 1 columns 
 [2] COVID_E10: SummarizedExperiment with 3851 rows and 1 columns 
 [3] COVID_E2: SummarizedExperiment with 3910 rows and 1 columns 
 ...
 [29] MPX_D4: SummarizedExperiment with 3882 rows and 1 columns 
 [30] MPX_D5: SummarizedExperiment with 3971 rows and 1 columns 
 [31] MPX_D6: SummarizedExperiment with 3788 rows and 1 columns

We can see how many precursors remain per sample after filtering by inspecting the number of rows in each per-sample set.

hist(
  nrows(dia_qf),
  xlab = "Precursors",
  main = "Precursor counts per sample\nafter Q-value filtering"
)

3.2 Joining per-sample sets into a single precursor set

After filtering, our QFeatures object contains one SummarizedExperiment per sample run. To work with the data combined across all samples, we use the joinAssays function. This merges all per-sample sets into a single combined set called "precursors", aligning features (precursors) by their Precursor.Id. Where a precursor was not detected in a particular sample, the corresponding entry will be NA.

dia_qf <- joinAssays(
  x = dia_qf,
  i = names(dia_qf),
  name = "precursors",
  fcol = "Precursor.Id"
)

Using 'Precursor.Id' to join assays.

Note that joinAssays only retains rowData columns that are shared across all sets and have consistent values — run-specific metrics such as Q.Value, PEP, and Run are therefore dropped from the combined set. This is expected: these per-run quality metrics were used for filtering in the previous step and are no longer needed. We can confirm which columns are dropped by comparing the rowData names before and after joining.

# rowData columns in the per-sample sets
sample_level_rdata_names <- rowDataNames(dia_qf)[[1]]

# rowData columns in the combined set
joined_rdata_names <- rowDataNames(dia_qf)[["precursors"]]

# Columns present in sample-level sets but dropped from the combined set
setdiff(sample_level_rdata_names, joined_rdata_names)

 [1] "Run"                         "Run.Index"                  
 [3] "RT"                          "Predicted.RT"               
 [5] "Predicted.iRT"               "Precursor.Normalised"       
 [7] "Ms1.Area"                    "Ms1.Normalised"             
 [9] "Ms1.Apex.Area"               "Ms1.Apex.Mz.Delta"          
[11] "Normalisation.Factor"        "Quantity.Quality"           
[13] "Empirical.Quality"           "Normalisation.Noise"        
[15] "Ms1.Profile.Corr"            "Evidence"                   
[17] "Mass.Evidence"               "Channel.Evidence"           
[19] "Ms1.Total.Signal.Before"     "Ms1.Total.Signal.After"     
[21] "RT.Start"                    "RT.Stop"                    
[23] "FWHM"                        "PG.MaxLFQ"                  
[25] "Genes.MaxLFQ"                "Genes.MaxLFQ.Unique"        
[27] "PG.MaxLFQ.Quality"           "Genes.MaxLFQ.Quality"       
[29] "Genes.MaxLFQ.Unique.Quality" "Q.Value"                    
[31] "PEP"                         "PG.Q.Value"                 
[33] "PG.PEP"                      "GG.Q.Value"                 
[35] "Protein.Q.Value"             "Best.Fr.Mz"                 
[37] "Best.Fr.Mz.Delta"            "runCol"

3.3 Removing individual sample-level assays

Now that the per-sample data has been joined into the "precursors" set, we no longer need to keep the individual per-sample SummarizedExperiment objects. To reduce the size of our QFeatures object, we identify the sets for individual samples and remove them. This isn’t strictly necessary for this dataset, but it may be if you have a very large experiment. It also helps to keep the object tidy and focused on the combined set for downstream analysis.

The sampleMap is an internal table in a QFeatures object that records which samples are present in each assay. We can inspect it before removing the per-sample sets — at this point it contains one row per sample per assay, so with 31 samples, it has 62 rows - one row each for the original per-sample assay and for the "precursors" assay.

sampleMap(dia_qf)

DataFrame with 62 rows and 3 columns
         assay     primary     colname
      <factor> <character> <character>
1    COVID_E1     COVID_E1    COVID_E1
2    COVID_E10   COVID_E10   COVID_E10
3    COVID_E2     COVID_E2    COVID_E2
4    COVID_E3     COVID_E3    COVID_E3
5    COVID_E4     COVID_E4    COVID_E4
...        ...         ...         ...
58  precursors      MPX_D2      MPX_D2
59  precursors      MPX_D3      MPX_D3
60  precursors      MPX_D4      MPX_D4
61  precursors      MPX_D5      MPX_D5
62  precursors      MPX_D6      MPX_D6

After calling removeAssay, the 31 per-sample assays are dropped and their corresponding sampleMap entries are removed, leaving only the rows for "precursors". You will see a message indicating that sampleMap rows have been harmonised — this is expected behaviour.

sets_to_rm <- which(names(dia_qf) != "precursors")
dia_qf <- removeAssay(dia_qf, i = sets_to_rm)

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 31 sampleMap rows not in names(experiments)

sampleMap(dia_qf)

DataFrame with 31 rows and 3 columns
         assay     primary     colname
      <factor> <character> <character>
1   precursors    COVID_E1    COVID_E1
2   precursors   COVID_E10   COVID_E10
3   precursors    COVID_E2    COVID_E2
4   precursors    COVID_E3    COVID_E3
5   precursors    COVID_E4    COVID_E4
...        ...         ...         ...
27  precursors      MPX_D2      MPX_D2
28  precursors      MPX_D3      MPX_D3
29  precursors      MPX_D4      MPX_D4
30  precursors      MPX_D5      MPX_D5
31  precursors      MPX_D6      MPX_D6

3.4 Removing contaminant proteins

In any proteomics experiment, some identified peptides will derive from contaminant proteins rather than the sample of biological interest. Common contaminants include proteins introduced during sample preparation — for example, human keratins from skin or hair, bovine serum albumin from cell culture media, and trypsin used for proteolytic digestion. These proteins are not informative for the biological question and should be removed before downstream analysis.

It is standard practice to include a curated contaminants database alongside the primary sequence database during the database search. For this data, we used a modified version of the contaminants fasta from Hao Group’s Protein Contaminant Libraries for DDA and DIA Proteomics Frankenfield et al. (2022), where we removed the contaminants from Fetal Bovine Serum, which are not relevant here. The entries in this contaminant database are named with a Cont_ prefix, which is carried through into the Protein.Ids column of the DIA-NN output, making them straightforward to identify computationally.

We can inspect the rowData to check the contaminants can be identified as expected.

It’s important to note that because DIA-NN uses a normalisation based on total sample intensity under the assumption that the same amount of peptides were injected in each sample, precursors mapping to contaminants included in the FASTA database are excluded from quantification.

# Extract the rowData
rd <- rowData(dia_qf[['precursors']]) %>%
  data.frame()

table(grepl('Cont_', rd$Protein.Ids))


FALSE  TRUE 
 4417   275

3.5 Filtering contaminants

We now remove precursors with a contaminant match from our data.

As with all data analyses, it is sensible to keep a copy of the data before applying filters, in case we need to refer back to the unfiltered data later. We therefore extract a copy of the set using addAssay, giving it a new name that reflects the processing step we are about to apply. We will only apply filters to this new copy, leaving the original set untouched.

Assay links

QFeatures maintains hierarchical links between quantitative levels, allowing easy access to all data levels for individual features of interest. This is fundamental to the QFeatures infrastructure and will be exemplified throughout this course. When using addAssay to add a copy of an existing set, the newly created SummarizedExperiment does not automatically have a link to the set it was copied from. Where a link is required, it can be added explicitly using addAssayLink.

We create a copy of the "precursors" set called "precursors_no_cont". After filtering, this set will contain only precursors do not originate from contaminants. We use plot to visualise the assay structure before and after adding the link.

dia_qf <- addAssay(dia_qf, dia_qf[["precursors"]], name = "precursors_no_cont")

plot(dia_qf)

dia_qf <- addAssayLink(dia_qf, from = "precursors", to = "precursors_no_cont")

plot(dia_qf)

We then use filterFeatures to retain only precursors whose Protein.Ids field contains no Cont_ entry. We specifying the i argument to apply the filter only to the "precursors_no_cont" set. This retains the original "precursors" assay with all precursors for reference, while the new set contains only those precursors that pass the missing value threshold.

dia_qf <- filterFeatures(
  dia_qf,
  ~ !grepl("Cont_", Protein.Ids),
  i = "precursors_no_cont"
)

'Protein.Ids' found in 2 out of 2 assay(s).

dia_qf

An instance of class QFeatures (type: bulk) with 2 sets:

 [1] precursors: SummarizedExperiment with 4692 rows and 31 columns 
 [2] precursors_no_cont: SummarizedExperiment with 4417 rows and 31 columns

3.6 Exploring missing values

DIA proteomics data can exhibit substantial missing values because proteins must be detected independently in each sample run. Before filtering, it is important to understand the extent and distribution of missing values in the data.

We use the nNA function to calculate the proportion of missing values per precursor (row) and per sample (column).

precursors_missing <- nNA(dia_qf, i = "precursors_no_cont")

print(precursors_missing)

$nNA
DataFrame with 1 row and 3 columns
          assay       nNA       pNA
    <character> <integer> <numeric>
1 precursors...     20365  0.148729

$nNArows
DataFrame with 4417 rows and 4 columns
             assay          name       nNA       pNA
       <character>   <character> <integer> <numeric>
1    precursors...    FEVQVTVPK2         0         0
2    precursors... MTSNFPVDLS...         0         0
3    precursors... IEIPLPFGGK...         0         0
4    precursors...      YPLYVLK2         0         0
5    precursors... DQEVLLQTFL...         0         0
...            ...           ...       ...       ...
4413 precursors... TFSEWLESVK...        20  0.645161
4414 precursors... DAHLTWEVAG...        30  0.967742
4415 precursors... LVLVGDGGTG...        30  0.967742
4416 precursors... ILNNGHAFNV...        30  0.967742
4417 precursors... RFNTANDDNV...        30  0.967742

$nNAcols
DataFrame with 31 rows and 4 columns
            assay        name       nNA       pNA
      <character> <character> <integer> <numeric>
1   precursors...    COVID_E1       715  0.161875
2   precursors...   COVID_E10       608  0.137650
3   precursors...    COVID_E2       535  0.121123
4   precursors...    COVID_E3       581  0.131537
5   precursors...    COVID_E4       541  0.122481
...           ...         ...       ...       ...
27  precursors...      MPX_D2       641  0.145121
28  precursors...      MPX_D3       684  0.154856
29  precursors...      MPX_D4       663  0.150102
30  precursors...      MPX_D5       498  0.112746
31  precursors...      MPX_D6       682  0.154403

Exercise 1 - Challenge: Visualising missing values

Level:

Using the precursors_missing object, create plots to explore the distribution of missing values in the data.

Plot a histogram for the proportion of missing values across precursors.
Plot the proportion of missing values per sample, coloured by group.

Based on your plots:

Do most precursors have missing values?
Are there any precursors or samples with an unusually high proportion of missing values?
Does the proportion of missing values differ between groups?

Hint

The precursors_missing object contains two data frames: nNArows (one row per precursor) and nNAcols (one row per sample), each with a pNA column giving the proportion of missing values.

Answer

Task 1

hist(
  precursors_missing$nNArows$pNA,
  xlab = "Proportion missing values",
  main = "Missing values per precursor"
)

Task 2

precursors_missing$nNAcols %>%
  as_tibble() %>%
  merge(data.frame(colData(dia_qf)), by.x = "name", by.y = "runCol") %>%
  ggplot(aes(x = name, y = pNA, group = group, fill = group)) +
  geom_bar(stat = "identity") +
  labs(x = "Sample", y = "Proportion missing values") +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2") +
  theme_classic()

Precursors that are missing across the majority of samples provide little useful quantitative information and may introduce noise into downstream analyses. We filter out precursors with excessive missing values using the filterNA function, where the pNA argument specifies the maximum proportion of missing values allowed per precursor.

The choice of threshold involves a trade-off: stricter thresholds (lower pNA) remove more precursors and produce a cleaner but smaller dataset; more permissive thresholds retain more precursors but include those with very sparse quantification, which are more likely to be inaccurately quantified. Here we use pNA = 0.75, meaning a precursor must be observed in at least 25% of samples — with 31 samples, this corresponds to a minimum of approximately 8 observations. Further filtering will be applied at the protein level in the statistical analysis, where we will require proteins to be quantified in at least 3 replicates per group before testing.

3.7 Filtering sparse precursors

We create a copy called "precursors_filtered_missing" — after filtering, this set will contain only precursors that meet the missing value threshold.

dia_qf <- addAssay(
  dia_qf,
  dia_qf[["precursors_no_cont"]],
  name = "precursors_filtered_missing"
)

dia_qf <- addAssayLink(
  dia_qf,
  from = "precursors_no_cont",
  to = "precursors_filtered_missing"
)

Now we perform the missing value filtering on the new assay, again specifying the i argument to apply the filter to only the desired set.

dia_qf <- dia_qf %>%
  filterNA(pNA = 0.75, i = "precursors_filtered_missing")

3.8 Save data

Key Points

DIA-NN Q-value columns (Q.Value, PG.Q.Value, Lib.Q.Value, Lib.PG.Q.Value) represent the local FDR for each identification; filtering at ≤ 0.01 retains only identifications at 1% FDR
When match-between-runs is enabled, library-level Q-values (Lib.Q.Value, Lib.PG.Q.Value) are the most appropriate filter as they reflect the confidence of cross-run matches
filterFeatures applies row-level filters based on rowData variables, and can be applied to the whole QFeatures object or to specific sets
Contaminant proteins (e.g., keratins, trypsin, BSA) should be removed before downstream analysis; including a curated contaminant database in the search and filtering on a Cont_ prefix in Protein.Ids is a straightforward way to do this
joinAssays combines per-sample SummarizedExperiment objects into a single combined set, with missing values (NA) where a precursor was not detected in a given sample
nNA provides a convenient summary of missing values per feature (row) and per sample (column)
DIA data typically has a relatively high proportion of missing values, as proteins must be detected independently in each sample
filterNA removes precursors that are missing above a specified proportion of samples, with the threshold chosen based on the experimental context

References

Frankenfield, Ashley M., Jiawei Ni, Mustafa Ahmed, and Ling Hao. 2022. “Protein Contaminants Matter: Building Universal Protein Contaminant Libraries for DDA and DIA Proteomics.” Journal of Proteome Research 21 (9): 2104–13. https://doi.org/10.1021/acs.jproteome.2c00145.