2  Import and infrastructure

TipLearning Objectives
  • Understand the structure of SummarizedExperiment and QFeatures objects and how the two are related
  • Understand how DIA-NN output data is structured and how it can be imported into R
  • Know how to use readQFeaturesFromDIANN to create a QFeatures object from DIA-NN output
  • Be able to access sample metadata via colData in a QFeatures object
  • Be able to explore the data within a QFeatures object using assay, rowData, and colData
library("arrow")
library("readxl")
library("QFeatures")
library("tidyr")
library("dplyr")
library("ggplot2")
library("patchwork")

2.1 The data structure

The raw DIA mass spectrometry data was processed using DIA-NN, which outputs a report file in .parquet format. Unlike a simple abundance matrix, this is a long-format table — each row corresponds to a single precursor–sample combination, rather than one row per precursor across all samples simultaneously. The file contains precursor-level identifications and quantities for every sample, along with quality metrics from the DIA-NN search.

We read this file into R using the read_parquet function from the arrow package.

diann_df <- read_parquet("data/monkeypox_plasma_proteomes.parquet")

We can inspect the dimensions and column names to understand what the data contains.

dim(diann_df)
[1] 120505     71
names(diann_df)
 [1] "Run.Index"                    "Run"                         
 [3] "Channel"                      "Precursor.Id"                
 [5] "Modified.Sequence"            "Stripped.Sequence"           
 [7] "Precursor.Charge"             "Precursor.Lib.Index"         
 [9] "Decoy"                        "Proteotypic"                 
[11] "Precursor.Mz"                 "Protein.Ids"                 
[13] "Protein.Group"                "Protein.Names"               
[15] "Genes"                        "RT"                          
[17] "iRT"                          "Predicted.RT"                
[19] "Predicted.iRT"                "IM"                          
[21] "iIM"                          "Predicted.IM"                
[23] "Predicted.iIM"                "Precursor.Quantity"          
[25] "Precursor.Normalised"         "Ms1.Area"                    
[27] "Ms1.Normalised"               "Ms1.Apex.Area"               
[29] "Ms1.Apex.Mz.Delta"            "Normalisation.Factor"        
[31] "Quantity.Quality"             "Empirical.Quality"           
[33] "Normalisation.Noise"          "Ms1.Profile.Corr"            
[35] "Evidence"                     "Mass.Evidence"               
[37] "Channel.Evidence"             "Ms1.Total.Signal.Before"     
[39] "Ms1.Total.Signal.After"       "RT.Start"                    
[41] "RT.Stop"                      "FWHM"                        
[43] "PG.TopN"                      "PG.MaxLFQ"                   
[45] "Genes.TopN"                   "Genes.MaxLFQ"                
[47] "Genes.MaxLFQ.Unique"          "PG.MaxLFQ.Quality"           
[49] "Genes.MaxLFQ.Quality"         "Genes.MaxLFQ.Unique.Quality" 
[51] "Q.Value"                      "PEP"                         
[53] "Global.Q.Value"               "Lib.Q.Value"                 
[55] "Peptidoform.Q.Value"          "Global.Peptidoform.Q.Value"  
[57] "Lib.Peptidoform.Q.Value"      "PTM.Site.Confidence"         
[59] "Site.Occupancy.Probabilities" "Protein.Sites"               
[61] "Lib.PTM.Site.Confidence"      "Translated.Q.Value"          
[63] "Channel.Q.Value"              "PG.Q.Value"                  
[65] "PG.PEP"                       "GG.Q.Value"                  
[67] "Protein.Q.Value"              "Global.PG.Q.Value"           
[69] "Lib.PG.Q.Value"               "Best.Fr.Mz"                  
[71] "Best.Fr.Mz.Delta"            

The key columns we will use in our analysis are:

  • Precursor.Id: a unique identifier for each precursor — typically the modified peptide sequence and charge state (e.g. "PEPTM[ox]IDEQ..2")
  • Protein.Group: the canonical protein group identifier used for aggregation to protein level. Proteins that cannot be distinguished from each other based on proteotypic peptides detected in the data are grouped together under one Protein Group.
  • Protein.Ids: the UniProt accession(s) of the protein(s) to which the precursor is assigned
  • Genes: the gene symbol(s) associated with the identified proteins
  • Run: the identifier for the sample run — one value per LC-MS sample injection
  • Precursor.Quantity: the quantified abundance of the precursor in that run
  • Q.Value: the precursor-level Q-value, representing the estimated false discovery rate at the precursor level; lower values indicate more confident identifications
  • Lib.Q.Value: the library-level Q-value — the confidence that the precursor was matched to the correct entry in the spectral library
  • Protein.Q.Value: the protein-level Q-value, representing the estimated FDR at the protein level

The report also contains chromatographic quality metrics per precursor per run. For example:

  • FWHM: Full-Width at Half Maximum of the chromatographic peak, a measure of peak shape quality — narrow, symmetrical peaks are preferred

We will use these Q-value columns and quality metrics to filter the data in the next lesson.

We will be using the QFeatures Bioconductor package to import, store, and manipulate our data. Before we import the data it is first necessary to understand the structure of a QFeatures object and its underlying SummarizedExperiment objects.

2.2 The structure of a SummarizedExperiment

To simplify the storage of quantitative proteomics data we can store the data as a SummarizedExperiment object (as shown in Figure 10.1). SummarizedExperiment objects can be conceptualised as data containers made up of three different parts:

  1. The assay - a matrix which contains the quantitative data from a proteomics experiment. Each row represents a feature (a precursor, peptide or protein) and each column contains the quantitative data (measurements) from one experimental sample.

  2. The rowData - a table (data frame) which contains all remaining information associated with each feature (i.e. every column from your search output that was not a quantification column). Rows represent features but columns inform about different attributes of the feature (e.g., its sequence, name, modifications).

  3. The colData - a table (data frame) to store sample metadata that would not appear in the output of your search software. This could be, for example, the condition, replicate, or batch each sample belongs to. Again, this information is stored in a matrix-like data structure.

Finally, there is also an additional container called metadata which is a place for users to store experimental metadata, such as which instrument the samples were run on or the date of data acquisition. We will focus on populating and understanding the three main containers above.

Figure 2.1: Diagramatic representation of the structure of a SummarizedExperiment object in R (modified from the SummarizedExperiment package)

Data stored in these three main areas can be easily accessed using the assay(), rowData() and colData() functions, as we will see later.

2.3 The structure of a QFeatures object

Whilst a SummarizedExperiment is able to neatly store quantitative proteomics data at a single data level (i.e., precursor, peptide or protein), a typical data analysis workflow requires us to look across multiple levels. For example, it is common to start an analysis at a lower data level and then aggregate upward towards a final protein-level dataset. Doing this allows for greater flexibility and understanding of the processes required to clean and aggregate the data.

A QFeatures object is essentially a list of SummarizedExperiment objects. However, the main benefit of using a QFeatures object over storing each data level as an independent SummarizedExperiment is that the QFeatures infrastructure maintains explicit links between the SummarizedExperiments that it stores. This allows for maximum traceability when processing data across multiple levels — for example, tracking which precursors or peptides contribute to each protein (Figure 10.2, modified from the QFeatures vignette with permission).

Figure 2.2: Graphic representation of the explicit links between data levels when stored in a QFeatures object.
NoteQFeatures nomenclature

When talking about a QFeatures object, each dataset (individual SummarizedExperiment) can be referred to as a set (short for dataset). Previously, each dataset was referred to as an experimental assay, and some older resources may still use this term. However, the experimental assay should not be confused with the quantitative matrix section of a SummarizedExperiment, which is called the assay data.

In order to generate the explicit links between data levels, we need to import the lowest desired data level into a QFeatures object and aggregate upwards within the QFeatures infrastructure using the aggregateFeatures function. If two SummarizedExperiments are generated separately and then added into the same QFeatures object, there will not automatically be links between them. In this case, if links are required, we can manually add them using the addAssayLink() function.

The best way to get our head around the QFeatures infrastructure is to import our data into a QFeatures object and start exploring.

2.4 Importing sample metadata

Sample metadata is information related to each sample that helps us interpret, organise and analyse our proteomics data. This data is something you must create yourself, related to your experiment. You can use a spreadsheet program such as Excel, or it can be coded in R directly.

Here, we have created a sample metadata file in Excel. This file is called monkeypox_metadata.xlsx. It provides information about each sample, including which group (MPXV, COVID-19, or Control) each run belongs to. We use the read_excel function from the readxl package to import the data into R.

sample_metadata <- read_excel("data/monkeypox_metadata.xlsx")
print(sample_metadata)
# A tibble: 31 × 5
   sample_id                                   runCol  group   sex   age.group
   <chr>                                       <chr>   <chr>   <chr> <chr>    
 1 20220624_Z1_ZW_001_30-0066_Ctr-A1_heat_DIA  Ctr_A1  control m     21-30    
 2 20220624_Z1_ZW_001_30-0066_Ctr-A10_heat_DIA Ctr_A10 control m     31-40    
 3 20220624_Z1_ZW_001_30-0066_Ctr-A11_heat_DIA Ctr_A11 control m     31-40    
 4 20220624_Z1_ZW_001_30-0066_Ctr-A12_heat_DIA Ctr_A12 control m     41-50    
 5 20220624_Z1_ZW_001_30-0066_Ctr-A2_heat_DIA  Ctr_A2  control m     21-30    
 6 20220624_Z1_ZW_001_30-0066_Ctr-A3_heat_DIA  Ctr_A3  control m     21-30    
 7 20220624_Z1_ZW_001_30-0066_Ctr-A4_heat_DIA  Ctr_A4  control m     21-30    
 8 20220624_Z1_ZW_001_30-0066_Ctr-A5_heat_DIA  Ctr_A5  control m     21-30    
 9 20220624_Z1_ZW_001_30-0066_Ctr-A6_heat_DIA  Ctr_A6  control m     21-30    
10 20220624_Z1_ZW_001_30-0066_Ctr-A7_heat_DIA  Ctr_A7  control m     21-30    
# ℹ 21 more rows

There are several columns in this file including the sample_id which is the name of the sample as well as the group, sex and age group. We have also created a column called runCol which is a concise, human-readable label for each sample which we add to our diann_df to in the next section to use in our analysis.

NoteThe pipe operator

In this course we will frequently use the pipe operator (%>%) from the dplyr package and tidyverse. The pipe passes the output of one function directly as the input to the next, allowing you to chain together multiple operations in a linear, readable way — rather than nesting function calls or creating many intermediate variables. For more information see Hadley Wickham’s Tidyverse at https://www.tidyverse.org.

2.4.1 Cleaning run names

In DIA-NN the names in the Run column often inherit the full raw file path pasting together the name of the .wiff2 folder and the .wiff.scan files (.wiff files are raw data formats generated by SCIEX mass spectrometers containing unprocessed metadata and scan data from LM-MS/MS runs). We need to clean up these run names so that they match the sample_id field in our metadata.

# Clean Run names
diann_df[, "Run"] %>% 
  unique() %>% 
  head(2)
# A tibble: 2 × 1
  Run                                                                           
  <chr>                                                                         
1 20220624_Z1_ZW_001_30-0066_COVID-E9_heat_DIA.wiff2-20220624_Z1_ZW_001_30-0066…
2 20220624_Z1_ZW_001_30-0066_MPX-D2_heat_DIA.wiff2-20220624_Z1_ZW_001_30-0066_M…
# DIANN has pasted the name of the .wiff2 folder as well as the .wiff.scan 
# files together. So we need to clean up the Run names so that they match 
# the metadata `sample_id` field.
diann_df <- diann_df %>%
  mutate(Run = sub('\\..*', '', Run))

# Check we have cleaned the Run names
diann_df[, "Run"] %>% 
  unique() %>% 
  head(2)
# A tibble: 2 × 1
  Run                                         
  <chr>                                       
1 20220624_Z1_ZW_001_30-0066_COVID-E9_heat_DIA
2 20220624_Z1_ZW_001_30-0066_MPX-D2_heat_DIA  

In the following code chunk we add a new column called runCol to the diann_df data. This column will serve as a link between the metadata and the diann_df proteomics data as well as being useful for visualisation and annotating our data at a later stage in the analysis.

diann_df <- diann_df %>%
  merge(sample_metadata[, c("sample_id", "runCol")], 
        by.x = "Run", 
        by.y = "sample_id")

If we look at the column dimensions and names of the diann_df we now see the runCol column has been added.

dim(diann_df)
[1] 120505     72
names(diann_df)
 [1] "Run"                          "Run.Index"                   
 [3] "Channel"                      "Precursor.Id"                
 [5] "Modified.Sequence"            "Stripped.Sequence"           
 [7] "Precursor.Charge"             "Precursor.Lib.Index"         
 [9] "Decoy"                        "Proteotypic"                 
[11] "Precursor.Mz"                 "Protein.Ids"                 
[13] "Protein.Group"                "Protein.Names"               
[15] "Genes"                        "RT"                          
[17] "iRT"                          "Predicted.RT"                
[19] "Predicted.iRT"                "IM"                          
[21] "iIM"                          "Predicted.IM"                
[23] "Predicted.iIM"                "Precursor.Quantity"          
[25] "Precursor.Normalised"         "Ms1.Area"                    
[27] "Ms1.Normalised"               "Ms1.Apex.Area"               
[29] "Ms1.Apex.Mz.Delta"            "Normalisation.Factor"        
[31] "Quantity.Quality"             "Empirical.Quality"           
[33] "Normalisation.Noise"          "Ms1.Profile.Corr"            
[35] "Evidence"                     "Mass.Evidence"               
[37] "Channel.Evidence"             "Ms1.Total.Signal.Before"     
[39] "Ms1.Total.Signal.After"       "RT.Start"                    
[41] "RT.Stop"                      "FWHM"                        
[43] "PG.TopN"                      "PG.MaxLFQ"                   
[45] "Genes.TopN"                   "Genes.MaxLFQ"                
[47] "Genes.MaxLFQ.Unique"          "PG.MaxLFQ.Quality"           
[49] "Genes.MaxLFQ.Quality"         "Genes.MaxLFQ.Unique.Quality" 
[51] "Q.Value"                      "PEP"                         
[53] "Global.Q.Value"               "Lib.Q.Value"                 
[55] "Peptidoform.Q.Value"          "Global.Peptidoform.Q.Value"  
[57] "Lib.Peptidoform.Q.Value"      "PTM.Site.Confidence"         
[59] "Site.Occupancy.Probabilities" "Protein.Sites"               
[61] "Lib.PTM.Site.Confidence"      "Translated.Q.Value"          
[63] "Channel.Q.Value"              "PG.Q.Value"                  
[65] "PG.PEP"                       "GG.Q.Value"                  
[67] "Protein.Q.Value"              "Global.PG.Q.Value"           
[69] "Lib.PG.Q.Value"               "Best.Fr.Mz"                  
[71] "Best.Fr.Mz.Delta"             "runCol"                      
head(diann_df)
                                           Run Run.Index Channel Precursor.Id
1 20220624_Z1_ZW_001_30-0066_COVID-E1_heat_DIA         0           FEVQVTVPK2
  Modified.Sequence Stripped.Sequence Precursor.Charge Precursor.Lib.Index
1         FEVQVTVPK         FEVQVTVPK                2                1172
  Decoy Proteotypic Precursor.Mz Protein.Ids Protein.Group Protein.Names Genes
1     0           1     523.7977      P01023        P01023    A2MG_HUMAN   A2M
        RT      iRT Predicted.RT Predicted.iRT IM iIM Predicted.IM
1 11.96337 36.66281     11.97756      36.64785  0   0            0
  Predicted.iIM Precursor.Quantity Precursor.Normalised Ms1.Area Ms1.Normalised
1             0            1653568              1421358   966520       830792.3
  Ms1.Apex.Area Ms1.Apex.Mz.Delta Normalisation.Factor Quantity.Quality
1        941713      -0.001831055            0.8595707        0.9687976
  Empirical.Quality Normalisation.Noise Ms1.Profile.Corr Evidence Mass.Evidence
1         0.9761357          -0.1229831        0.9922304 6.766142      4.969353
  Channel.Evidence Ms1.Total.Signal.Before Ms1.Total.Signal.After RT.Start
1        0.9730694                11857848               10346640 11.84572
   RT.Stop       FWHM PG.TopN PG.MaxLFQ Genes.TopN Genes.MaxLFQ
1 12.05762 0.09868256       0   2093698          0      2093698
  Genes.MaxLFQ.Unique PG.MaxLFQ.Quality Genes.MaxLFQ.Quality
1             1581720         0.9925535            0.9925535
  Genes.MaxLFQ.Unique.Quality     Q.Value         PEP Global.Q.Value
1                   0.9920511 0.001565558 0.004092143    3.15557e-06
   Lib.Q.Value Peptidoform.Q.Value Global.Peptidoform.Q.Value
1 3.412969e-06                   0               2.423655e-06
  Lib.Peptidoform.Q.Value PTM.Site.Confidence Site.Occupancy.Probabilities
1            2.032158e-06                   1                             
  Protein.Sites Lib.PTM.Site.Confidence Translated.Q.Value Channel.Q.Value
1                                     1                  0               0
   PG.Q.Value     PG.PEP GG.Q.Value Protein.Q.Value Global.PG.Q.Value
1 0.004016064 0.01531236 0.00406504     0.004444445       0.002747253
  Lib.PG.Q.Value Best.Fr.Mz Best.Fr.Mz.Delta   runCol
1    0.002525253   376.1867     -0.003051758 COVID_E1
 [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

2.5 Reading into a QFeatures object

We use the readQFeaturesFromDIANN function to import the cleaned DIA-NN data into a QFeatures object. This function creates one SummarizedExperiment per run, with each set containing the precursor-level quantification data for that run. The runCol argument specifies which column in the input data frame identifies the run.

dia_qf <- 
  readQFeaturesFromDIANN(diann_df,
                         quantCols = "Precursor.Quantity",
                         fnames = "Precursor.Id",
                         runCol = "runCol",
                         colData = sample_metadata)
Checking arguments.
Loading data as a 'SummarizedExperiment' object.
Splitting data in runs.
Formatting sample annotations (colData).
Formatting data as a 'QFeatures' object.
Setting assay rownames.

2.6 Exploring the data

When we inspect our QFeatures object, we see that it contains 31 sets, each corresponding to a different sample run. Each set contains the precursor-level quantification data for that run, with feature names given by the Precursor.Id column from the original DIA-NN output.

dia_qf
An instance of class QFeatures (type: bulk) with 31 sets:

 [1] COVID_E1: SummarizedExperiment with 3834 rows and 1 columns 
 [2] COVID_E10: SummarizedExperiment with 3914 rows and 1 columns 
 [3] COVID_E2: SummarizedExperiment with 4016 rows and 1 columns 
 ...
 [29] MPX_D4: SummarizedExperiment with 3950 rows and 1 columns 
 [30] MPX_D5: SummarizedExperiment with 4043 rows and 1 columns 
 [31] MPX_D6: SummarizedExperiment with 3861 rows and 1 columns 

By default if a QFeatures object has more than 6 samples, only the first three and last three sample summaries are printed to the screen. To see all samples and their respective data summary, we can use the experiments function.

experiments(dia_qf)
ExperimentList class object of length 31:
 [1] COVID_E1: SummarizedExperiment with 3834 rows and 1 columns
 [2] COVID_E10: SummarizedExperiment with 3914 rows and 1 columns
 [3] COVID_E2: SummarizedExperiment with 4016 rows and 1 columns
 [4] COVID_E3: SummarizedExperiment with 4028 rows and 1 columns
 [5] COVID_E4: SummarizedExperiment with 3978 rows and 1 columns
 [6] COVID_E5: SummarizedExperiment with 3662 rows and 1 columns
 [7] COVID_E6: SummarizedExperiment with 3915 rows and 1 columns
 [8] COVID_E7: SummarizedExperiment with 3950 rows and 1 columns
 [9] COVID_E8: SummarizedExperiment with 3882 rows and 1 columns
 [10] COVID_E9: SummarizedExperiment with 3906 rows and 1 columns
 [11] Ctr_A1: SummarizedExperiment with 3878 rows and 1 columns
 [12] Ctr_A10: SummarizedExperiment with 3894 rows and 1 columns
 [13] Ctr_A11: SummarizedExperiment with 3800 rows and 1 columns
 [14] Ctr_A12: SummarizedExperiment with 3876 rows and 1 columns
 [15] Ctr_A2: SummarizedExperiment with 3864 rows and 1 columns
 [16] Ctr_A3: SummarizedExperiment with 4105 rows and 1 columns
 [17] Ctr_A4: SummarizedExperiment with 3823 rows and 1 columns
 [18] Ctr_A5: SummarizedExperiment with 3759 rows and 1 columns
 [19] Ctr_A6: SummarizedExperiment with 3903 rows and 1 columns
 [20] Ctr_A7: SummarizedExperiment with 3774 rows and 1 columns
 [21] Ctr_A8: SummarizedExperiment with 3936 rows and 1 columns
 [22] Ctr_A9: SummarizedExperiment with 3856 rows and 1 columns
 [23] Ctr_B1: SummarizedExperiment with 3936 rows and 1 columns
 [24] Ctr_B2: SummarizedExperiment with 3819 rows and 1 columns
 [25] Ctr_B3: SummarizedExperiment with 3788 rows and 1 columns
 [26] MPX_D1: SummarizedExperiment with 3845 rows and 1 columns
 [27] MPX_D2: SummarizedExperiment with 3858 rows and 1 columns
 [28] MPX_D3: SummarizedExperiment with 3852 rows and 1 columns
 [29] MPX_D4: SummarizedExperiment with 3950 rows and 1 columns
 [30] MPX_D5: SummarizedExperiment with 4043 rows and 1 columns
 [31] MPX_D6: SummarizedExperiment with 3861 rows and 1 columns

We can quickly see how many precursors were measure per sample for example, we see for the first sample called COVID_E1 there are 3834 precursors.

We can access individual samples (sets, or SummarizedExperiments) using standard double bracket nomenclature (this is how you would normally access items of a list in R). As with all indexing we can use the list position or dataset name.

dia_qf[[1]]
class: SummarizedExperiment 
dim: 3834 1 
metadata(0):
assays(1): ''
rownames(3834): FEVQVTVPK2 MTSNFPVDLSDYPK2 ... VLTPPMGTVMDVLK2
  LSHNAIASLRPR3
rowData names(71): Run Run.Index ... Best.Fr.Mz.Delta runCol
colnames(1): COVID_E1
colData names(0):

Is the same as,

dia_qf[["COVID_E1"]]
class: SummarizedExperiment 
dim: 3834 1 
metadata(0):
assays(1): ''
rownames(3834): FEVQVTVPK2 MTSNFPVDLSDYPK2 ... VLTPPMGTVMDVLK2
  LSHNAIASLRPR3
rowData names(71): Run Run.Index ... Best.Fr.Mz.Delta runCol
colnames(1): COVID_E1
colData names(0):

The quantitation data for each sample is store in the assay container and can be extracted using the assay function.

assay(dia_qf[[1]]) %>% 
  head()
                       COVID_E1
FEVQVTVPK2          1653567.750
MTSNFPVDLSDYPK2       90330.430
IEIPLPFGGK2          261497.078
YPLYVLK2              28442.559
SGSLPDDVLLHFAGAGK2     6343.696
DQEVLLQTFLDDASPGDK3    4285.400

Discussion

Why does readQFeaturesFromDIANN create one SummarizedExperiment per run, rather than a single combined set across all runs from the outset?

The DIA-NN output contains per-run quality metrics stored in the rowData of each precursor. These are run-specific values and would be ambiguous or impossible to retain if all runs were immediately merged into a single matrix. By keeping one SummarizedExperiment per run, we can inspect and filter on these per-run metrics before joining the data. We will join the per-sample sets into a single combined precursor set in the next lesson, once filtering is complete.

2.6.1 How many precursors per sample?

Each sample will have a different number of precursors identified and quantified. We can explore this by inspecting the number of rows in each per-sample set, which corresponds to the number of precursors quantified in that sample.

ExerciseExercise 1 - Challenge: Precursors per sample

Level:

  1. Find a function to determine how many precursors are quantified in each sample. Use it to plot a histogram of precursor counts across samples.

  2. Create a bar plot showing the number of precursors identified in each sample, coloured by group. Based on your plot:

  • Which group tends to have the most precursors identified?
  • Is there any sample that looks like an outlier?
Hint Browse the QFeatures documentation with ?QFeatures to find a suitable function for obtaining per-sample feature counts.

Task 1

hist(nrows(dia_qf),
     xlab="Precursors",
     main="Precursor counts per sample")

Task 2

data.frame(
  n_precursors = nrows(dia_qf),
  sample = names(nrows(dia_qf))) %>%
  merge(data.frame(colData(dia_qf)), by.x = "sample", by.y = "runCol") %>%
  ggplot(aes(x = sample, y = n_precursors, fill = group)) +
  geom_bar(stat = "identity") +
  labs(x = "Sample", y = "Number of precursors") +
  coord_flip() +
  theme_classic()

2.6.2 Exploring the row data

The rowData of each per-sample SummarizedExperiment contains quality metrics from DIA-NN. This information can be accessed on a sample by sample level by using the rowData function.

rowData(dia_qf[[1]]) %>% 
  head(2)
DataFrame with 2 rows and 71 columns
                          Run Run.Index     Channel  Precursor.Id
                  <character> <integer> <character>   <character>
                Modified.Sequence Stripped.Sequence Precursor.Charge
                      <character>       <character>        <integer>
                Precursor.Lib.Index     Decoy Proteotypic Precursor.Mz
                          <integer> <integer>   <integer>    <numeric>
                Protein.Ids Protein.Group Protein.Names       Genes        RT
                <character>   <character>   <character> <character> <numeric>
                      iRT Predicted.RT Predicted.iRT        IM       iIM
                <numeric>    <numeric>     <numeric> <numeric> <numeric>
                Predicted.IM Predicted.iIM Precursor.Normalised  Ms1.Area
                   <numeric>     <numeric>            <numeric> <numeric>
                Ms1.Normalised Ms1.Apex.Area Ms1.Apex.Mz.Delta
                     <numeric>     <numeric>         <numeric>
                Normalisation.Factor Quantity.Quality Empirical.Quality
                           <numeric>        <numeric>         <numeric>
                Normalisation.Noise Ms1.Profile.Corr  Evidence Mass.Evidence
                          <numeric>        <numeric> <numeric>     <numeric>
                Channel.Evidence Ms1.Total.Signal.Before Ms1.Total.Signal.After
                       <numeric>               <numeric>              <numeric>
                 RT.Start   RT.Stop      FWHM   PG.TopN PG.MaxLFQ Genes.TopN
                <numeric> <numeric> <numeric> <numeric> <numeric>  <numeric>
                Genes.MaxLFQ Genes.MaxLFQ.Unique PG.MaxLFQ.Quality
                   <numeric>           <numeric>         <numeric>
                Genes.MaxLFQ.Quality Genes.MaxLFQ.Unique.Quality     Q.Value
                           <numeric>                   <numeric>   <numeric>
                        PEP Global.Q.Value Lib.Q.Value Peptidoform.Q.Value
                  <numeric>      <numeric>   <numeric>           <numeric>
                Global.Peptidoform.Q.Value Lib.Peptidoform.Q.Value
                                 <numeric>               <numeric>
                PTM.Site.Confidence Site.Occupancy.Probabilities Protein.Sites
                          <numeric>                  <character>   <character>
                Lib.PTM.Site.Confidence Translated.Q.Value Channel.Q.Value
                              <numeric>          <numeric>       <numeric>
                PG.Q.Value    PG.PEP GG.Q.Value Protein.Q.Value
                 <numeric> <numeric>  <numeric>       <numeric>
                Global.PG.Q.Value Lib.PG.Q.Value Best.Fr.Mz Best.Fr.Mz.Delta
                        <numeric>      <numeric>  <numeric>        <numeric>
                     runCol
                <character>
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

However, we want to access this information across all the samples. To do this we can use the rbindRowData function.

rdata <- rbindRowData(dia_qf, i = names(dia_qf))

The rbindRowData functions returns a DFrame table that contains the row binded rowData tables from the selected assays. In this context, i is a character(), integer() or logical() object for subsetting assays. Only rowData variables that are common to all assays are kept.

Recall, there is also a slot in the QFeatures object that stores information on the samples. This is the colData slot.

colData(dia_qf)
DataFrame with 31 rows and 5 columns
              sample_id      runCol       group         sex   age.group
            <character> <character> <character> <character> <character>
COVID_E1  20220624_Z...    COVID_E1     Covid19           m       21-30
COVID_E10 20220624_Z...   COVID_E10     Covid19           m       41-50
COVID_E2  20220624_Z...    COVID_E2     Covid19           m       21-30
COVID_E3  20220624_Z...    COVID_E3     Covid19           m       41-50
COVID_E4  20220624_Z...    COVID_E4     Covid19           m       31-40
...                 ...         ...         ...         ...         ...
MPX_D2    20220624_Z...      MPX_D2        MPXV           m       21-30
MPX_D3    20220624_Z...      MPX_D3        MPXV           m       31-40
MPX_D4    20220624_Z...      MPX_D4        MPXV           m       31-40
MPX_D5    20220624_Z...      MPX_D5        MPXV           m       41-50
MPX_D6    20220624_Z...      MPX_D6        MPXV           m       21-30

In the below code chunk we create a data.frame called cdata containing the sample information and merge this with the rdata so we get group information for each sample. This will help with our data exploration.

# 1. Create a data.frame of the sample information

cdata <- data.frame(colData(dia_qf))

# 2. Merge this with the row data
# Note: Our data is in separate sets, so the sample names are in the 'assay' 
# column. We merge with the column data (using the runCol column) to get group
# information for each sample

rdata <- rdata %>%
  data.frame() %>%
  merge(cdata, by.x = "assay", by.y = "runCol")

Now we have one data.frame with precursor-level and sample-level information we can now visualise quality metrics from DIA-NN including the Full-Width at Half Maximum (FWHM) of the chromatographic peaks. In the code below we plot the distribution of FWHM values per sample, coloured by group.

rdata %>%
  ggplot(aes(FWHM, group = Run, colour = group)) +
  geom_density() +
  theme_classic()

ExerciseExercise 2 - Challenge: Exploring rowData quality metrics

Level:

The rdata object contains many columns from the DIA-NN rowData beyond FWHM.

  1. Use str(rdata) to identify other numerical quality metric columns.
  2. Choose one metric and create a plot to visualise its distribution across samples, coloured by group.
  3. Do any metrics appear to differ systematically between groups, or between individual samples?

2.7 Save data

TipKey Points
  • DIA-NN outputs a .parquet format report file containing precursor- and protein-level quantification data for each sample
  • The readQFeaturesFromDIANN function imports DIA-NN output into a QFeatures object, creating one SummarizedExperiment per sample run
  • Sample metadata is stored in the colData of the QFeatures object and can be used to annotate samples with experimental group information
  • The rowData of each per-sample SummarizedExperiment contains quality metrics from DIA-NN (e.g. FWHM, Q-values) that can be used to explore data quality and guide downstream filtering decisions

References