7  Biological interpretation

Learning Objectives
  • Be aware of different analyses that can be done to gain a biological understanding of expression proteomics results
  • Understand the concept of Gene Ontology (GO) enrichment analyses
  • Complete GO enrichment analyses using the enrichGO function from the clusterProfiler package

7.1 Adding metadata to our results using dplyr

Before we can look any further into the biological meaning of any protein abundance changes we need to extract these proteins from our overall results. It is also useful to re-add information about the master protein descriptions since this is lost in the output of limma analysis.

It is important to note that the results table we have generated from limma is not in the same order as the input data. Therefore, to add information from our original data e.g., from the rowData such as the Master.Protein.Descriptions we must match the protein names between them.

To do this, let’s extract information the Master.Protein.Descriptions from the original data we have created called all_proteins.

Recall that all_protiens is SummarizedExperiment object,

all_proteins
class: SummarizedExperiment 
dim: 3823 6 
metadata(0):
assays(2): assay aggcounts
rownames(3823): A0A0B4J2D5 A0A2R8Y4L2 ... Q9Y6X9 Q9Y6Y8
rowData names(21): Checked Tags ... Protein.FDR.Confidence .n
colnames(6): M_1 M_2 ... G1_2 G1_3
colData names(4): sample rep condition tag

We wish to extract information from the rowData regarding the Master.Protein.Descriptions,

## Add master protein descriptions back
protein_info <- all_proteins %>%
  rowData() %>%
  as_tibble() %>%
  select(Protein = Master.Protein.Accessions, 
         protein_description = Master.Protein.Descriptions)

protein_info %>% head()
# A tibble: 6 × 2
  Protein    protein_description                                                
  <chr>      <chr>                                                              
1 A0A0B4J2D5 Putative glutamine amidotransferase-like class 1 domain-containing…
2 A0A2R8Y4L2 Heterogeneous nuclear ribonucleoprotein A1-like 3 OS=Homo sapiens …
3 A0AVT1     Ubiquitin-like modifier-activating enzyme 6 OS=Homo sapiens OX=960…
4 A0MZ66     Shootin-1 OS=Homo sapiens OX=9606 GN=SHTN1 PE=1 SV=4               
5 A1L0T0     2-hydroxyacyl-CoA lyase 2 OS=Homo sapiens OX=9606 GN=ILVBL PE=1 SV…
6 A1X283     SH3 and PX domain-containing protein 2B OS=Homo sapiens OX=9606 GN…

Note, we also extract the Master.Protein.Accessions column so we can use this to match to the protein column in our limma results.

Now we can use the left_join function from dplyr to match the protein descriptions to the protein IDs,

limma_results <- limma_results %>% 
  left_join(protein_info, by = "Protein")

# Verify
limma_results %>%
  head()
  Protein     logFC   AveExpr         t      P.Value    adj.P.Val        B
1  Q9NQW6 -2.077921  2.641270 -40.40812 1.139582e-15 4.356623e-12 25.79468
2  Q9BW19 -2.343592  1.430210 -36.79774 4.073530e-15 7.786553e-12 24.71272
3  P49454 -1.991326  2.598481 -34.88623 8.411188e-15 9.266757e-12 24.07905
4  Q9ULW0 -2.152503  2.476087 -34.52309 9.695796e-15 9.266757e-12 23.95341
5  Q562F6 -3.205603 -1.304111 -32.91865 1.849860e-14 1.414403e-11 23.37677
6  O14965 -2.098029  0.797902 -31.22180 3.791132e-14 2.415583e-11 22.72604
  direction significance
1      down          sig
2      down          sig
3      down          sig
4      down          sig
5      down          sig
6      down          sig
                                                    protein_description
1                     Anillin OS=Homo sapiens OX=9606 GN=ANLN PE=1 SV=2
2 Kinesin-like protein KIFC1 OS=Homo sapiens OX=9606 GN=KIFC1 PE=1 SV=2
3       Centromere protein F OS=Homo sapiens OX=9606 GN=CENPF PE=1 SV=3
4 Targeting protein for Xklp2 OS=Homo sapiens OX=9606 GN=TPX2 PE=1 SV=2
5                 Shugoshin 2 OS=Homo sapiens OX=9606 GN=SGO2 PE=1 SV=2
6            Aurora kinase A OS=Homo sapiens OX=9606 GN=AURKA PE=1 SV=3
Manipulating data with dplyr and tidyverse

There is lots of information online about getting started with dplyr and using the tidyverse. We really like this lesson from the Data Carpentry if you are new to the tidyverse.

7.2 Subset differentially abundant proteins

Let’s subset our results and only keep proteins which have been flagged as exhibiting significant abundance changes,

sig_changing <- limma_results %>% 
  as_tibble() %>%
  filter(significance == "sig")

sig_up <- sig_changing %>%
  filter(direction == "up")

sig_down <- sig_changing %>%
  filter(direction == "down")

7.3 Biological interpretation of differentially abundant proteins

Our statistical analyses provided us with a list of proteins that are present with significantly different abundances between M-phase and G1-phase of the cell cycle. We can get an initial idea about what these proteins are and do by looking at the protein descriptions.

## Look at descriptions of proteins upregulated in M relative to deysnchronized cells
sig_up %>%
  pull(protein_description) %>%
  head()
[1] "Golgi reassembly-stacking protein 2 OS=Homo sapiens OX=9606 GN=GORASP2 PE=1 SV=3"
[2] "Selenoprotein S OS=Homo sapiens OX=9606 GN=SELENOS PE=1 SV=3"                    
[3] "Anamorsin OS=Homo sapiens OX=9606 GN=CIAPIN1 PE=1 SV=2"                          
[4] "Krueppel-like factor 16 OS=Homo sapiens OX=9606 GN=KLF16 PE=1 SV=1"              
[5] "Histone H1.10 OS=Homo sapiens OX=9606 GN=H1-10 PE=1 SV=1"                        
[6] "Importin subunit alpha-4 OS=Homo sapiens OX=9606 GN=KPNA3 PE=1 SV=2"             

Whilst we may recognise some of the changing proteins, this might be the first time that we are coming across others. Moreover, some protein descriptions contain useful information, but this is very limited. We still want to find out more about the biological role of the statistically significant proteins so that we can infer the potential effects of their abundance changes.

There are many functional analyses that could be done on the proteins with differential abundance:

  • Investigate the biological pathways that the proteins function within (KEGG etc.)
  • Identify potential interacting partners (IntAct, STRING)
  • Determine the subcellular localisation in which the changing proteins are found
  • Understand the co-regulation of their mRNAs (Expression Atlas)
  • Compare our changing proteins to those previously identified in other proteomic studies of the cell cycle

7.3.1 Gene Ontology (GO) enrichment analysis

One of the common methods used to probe the biological relevance of proteins with significant changes in abundance between conditions is to carry out Gene Ontology (GO) enrichment, or over-representation, analysis.

The Gene Ontology consortium have defined a set of hierarchical descriptions to be assigned to genes and their resulting proteins. These descriptions are split into three categories: cellular components (CC), biological processes (BP) and molecular function (MF). The idea is to provide information about a protein’s subcellular localisation, functionality and which processes it contributes to within the cell. Hence, the overarching aim of GO enrichment analysis is to answer the question:

“Given a list of proteins found to be differentially abundant in my phenotype of interest, what are the cellular components, molecular functions and biological processes involved in this phenotype?”.

Unfortunately, just looking at the GO terms associated with our differentially abundant proteins is insufficient to draw any solid conclusions. For example, if we find that 120 of the 453 proteins significantly downregulated in M phase are annotated with the GO term “kinase activity”, it may seem intuitive to conclude that reducing kinase activity is important for the M-phase phenotype. However, if 90% of all proteins in the cell were kinases (an extreme example), then we might expect to discover a high representation of the “kinase activity” GO term in any protein list we end up with.

This leads us to the concept of an over-representation analysis. We wish to ask whether any GO terms are over-represented (i.e., present at a higher frequency than expected by chance) in our lists of differentially abundant proteins. In other words, we need to know how many proteins with a GO term could have shown differential abundance in our experiment vs. how many proteins with this GO term did show differential abundance in our experiment.

We are going to use a function in R called enrichGO from the the clusterProfiler Yu et al. (2012) Bioconductor R package to perform GO enrichment analysis. The package vignette can be found here. and full tutorials for using the package here

Annotation packages in Bioconductor

The enrichGO function uses the org.Hs.eg.db Bioconductor package that has genome wide annotation for human, primarily based on mapping using Entrez Gene identifiers. It also uses the GO.db package which is a set of annotation maps describing the entire Gene Ontology assembled using data from GO.

In the next code chunk we call the enrichGO function,

ego_down <- enrichGO(gene = sig_down$Protein,               # list of down proteins
                     universe = limma_results$Protein,      # all proteins 
                     OrgDb = org.Hs.eg.db,                  # database to query
                     keyType = "UNIPROT",                   # protein ID encoding
                     pvalueCutoff = 0.05, 
                     qvalueCutoff = 0.05,
                     ont = "MF",                            # can be CC, MF, BP, or ALL
                     readable = TRUE)

Let’s take a look at the output.

ego_down
#
# over-representation test
#
#...@organism    Homo sapiens 
#...@ontology    MF 
#...@keytype     UNIPROT 
#...@gene    chr [1:453] "Q9NQW6" "Q9BW19" "P49454" "Q9ULW0" "Q562F6" "O14965" "Q02224" ...
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...65 enriched terms found
'data.frame':   65 obs. of  9 variables:
 $ ID         : chr  "GO:0016491" "GO:0005201" "GO:0005509" "GO:0003756" ...
 $ Description: chr  "oxidoreductase activity" "extracellular matrix structural constituent" "calcium ion binding" "protein disulfide isomerase activity" ...
 $ GeneRatio  : chr  "65/447" "18/447" "37/447" "9/447" ...
 $ BgRatio    : chr  "220/3745" "24/3745" "99/3745" "11/3745" ...
 $ pvalue     : num  2.97e-13 1.20e-12 3.20e-11 2.01e-07 2.01e-07 ...
 $ p.adjust   : num  1.29e-10 2.60e-10 4.63e-09 1.43e-05 1.43e-05 ...
 $ qvalue     : num  1.06e-10 2.15e-10 3.82e-09 1.18e-05 1.18e-05 ...
 $ geneID     : chr  "QSOX1/P4HA2/PDIA4/P4HB/PLOD2/HSD17B4/PDIA3/P4HA1/HADHA/DNAJC10/ALDH18A1/POR/MTHFD2/PDIA6/LOXL2/PLOD1/ACADM/TXND"| __truncated__ "CCN1/THBS1/SPARC/FN1/TNC/COL12A1/LAMC1/COL23A1/EDIL3/LAMC2/TUFT1/COL4A2/EFEMP1/AGRN/HAPLN1/COL1A1/FBN2/COL6A3" "THBS1/SPARC/ITGB1/HSPA5/FSTL1/CALR/EDIL3/HSP90B1/SPTAN1/SDF4/LOXL2/FKBP10/EFEMP1/ESYT1/ASPH/NOTCH2/CANX/PRKCSH/"| __truncated__ "QSOX1/PDIA4/P4HB/PDIA3/PDIA6/TXNDC5/TMX1/ERP44/CRELD2" ...
 $ Count      : int  65 18 37 9 9 23 23 16 12 26 ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

The output of the enrichGO function is an object of class enrichResult that contains the ID and Description of all enriched GO terms. There is also information about which geneIDs from our significantly downregulated proteins are annotated with each of the enriched GO terms. Let’s take a look at the descriptions.

ego_down$Description %>% 
  head(10)
 [1] "oxidoreductase activity"                                      
 [2] "extracellular matrix structural constituent"                  
 [3] "calcium ion binding"                                          
 [4] "protein disulfide isomerase activity"                         
 [5] "intramolecular oxidoreductase activity, transposing S-S bonds"
 [6] "monoatomic cation transmembrane transporter activity"         
 [7] "inorganic cation transmembrane transporter activity"          
 [8] "glycosaminoglycan binding"                                    
 [9] "collagen binding"                                             
[10] "active transmembrane transporter activity"                    

There is a long list because of the hierarchical nature of GO terms. The results of GO enrichment analysis can be visualised in many different ways. For a full overview of GO enrichment visualisation tools see Visualization of functional enrichment result.

dotplot(ego_down, 
        x = "Count", 
        showCategory = 10, 
        font.size = 10,
        label_format = 100,
        color = "p.adjust")

Key Points
  • Gene ontology (GO) terms described the molecular function (MF), biological processes (BP) and cellular component (CC) of proteins.
  • GO terms are hierarchical and generic. They do not relate to specific biological systems e.g., cell type or condition.
  • GO enrichment analysis aims to identify GO terms that are present in a list of proteins of interest (foreground) at a higher frequency than expected by chance based on their frequency in a background list of proteins (universe). The universe should be a list of all proteins included identified and quantified in your experiment.
  • The enrichGO function from the clusterProfiler package provides a convenient way to carry out reproducible GO enrichment analysis in R.

References

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.