7 Biological interpretation
- Be aware of different analyses that can be done to gain a biological understanding of expression proteomics results
- Understand the concept of Gene Ontology (GO) enrichment analyses
- Complete GO enrichment analyses using the
enrichGO
function from theclusterProfiler
package
7.1 Adding metadata to our results using dplyr
Before we can look any further into the biological meaning of any protein abundance changes we need to extract these proteins from our overall results. It is also useful to re-add information about the master protein descriptions since this is lost in the output of limma
analysis.
It is important to note that the results table we have generated from limma
is not in the same order as the input data. Therefore, to add information from our original data e.g., from the rowData
such as the Master.Protein.Descriptions
we must match the protein names between them.
To do this, let’s extract information the Master.Protein.Descriptions
from the original data we have created called all_proteins
.
Recall that all_protiens
is SummarizedExperiment
object,
all_proteins
class: SummarizedExperiment
dim: 3823 6
metadata(0):
assays(2): assay aggcounts
rownames(3823): A0A0B4J2D5 A0A2R8Y4L2 ... Q9Y6X9 Q9Y6Y8
rowData names(21): Checked Tags ... Protein.FDR.Confidence .n
colnames(6): M_1 M_2 ... G1_2 G1_3
colData names(4): sample rep condition tag
We wish to extract information from the rowData
regarding the Master.Protein.Descriptions
,
## Add master protein descriptions back
protein_info <- all_proteins %>%
rowData() %>%
as_tibble() %>%
select(Protein = Master.Protein.Accessions,
protein_description = Master.Protein.Descriptions)
protein_info %>% head()
# A tibble: 6 × 2
Protein protein_description
<chr> <chr>
1 A0A0B4J2D5 Putative glutamine amidotransferase-like class 1 domain-containing…
2 A0A2R8Y4L2 Heterogeneous nuclear ribonucleoprotein A1-like 3 OS=Homo sapiens …
3 A0AVT1 Ubiquitin-like modifier-activating enzyme 6 OS=Homo sapiens OX=960…
4 A0MZ66 Shootin-1 OS=Homo sapiens OX=9606 GN=SHTN1 PE=1 SV=4
5 A1L0T0 2-hydroxyacyl-CoA lyase 2 OS=Homo sapiens OX=9606 GN=ILVBL PE=1 SV…
6 A1X283 SH3 and PX domain-containing protein 2B OS=Homo sapiens OX=9606 GN…
Note, we also extract the Master.Protein.Accessions
column so we can use this to match to the protein
column in our limma
results.
Now we can use the left_join
function from dplyr
to match the protein descriptions to the protein IDs,
limma_results <- limma_results %>%
left_join(protein_info, by = "Protein")
# Verify
limma_results %>%
head()
Protein logFC AveExpr t P.Value adj.P.Val B
1 Q9NQW6 -2.077921 2.641270 -40.40812 1.139582e-15 4.356623e-12 25.79468
2 Q9BW19 -2.343592 1.430210 -36.79774 4.073530e-15 7.786553e-12 24.71272
3 P49454 -1.991326 2.598481 -34.88623 8.411188e-15 9.266757e-12 24.07905
4 Q9ULW0 -2.152503 2.476087 -34.52309 9.695796e-15 9.266757e-12 23.95341
5 Q562F6 -3.205603 -1.304111 -32.91865 1.849860e-14 1.414403e-11 23.37677
6 O14965 -2.098029 0.797902 -31.22180 3.791132e-14 2.415583e-11 22.72604
direction significance
1 down sig
2 down sig
3 down sig
4 down sig
5 down sig
6 down sig
protein_description
1 Anillin OS=Homo sapiens OX=9606 GN=ANLN PE=1 SV=2
2 Kinesin-like protein KIFC1 OS=Homo sapiens OX=9606 GN=KIFC1 PE=1 SV=2
3 Centromere protein F OS=Homo sapiens OX=9606 GN=CENPF PE=1 SV=3
4 Targeting protein for Xklp2 OS=Homo sapiens OX=9606 GN=TPX2 PE=1 SV=2
5 Shugoshin 2 OS=Homo sapiens OX=9606 GN=SGO2 PE=1 SV=2
6 Aurora kinase A OS=Homo sapiens OX=9606 GN=AURKA PE=1 SV=3
dplyr
and tidyverse
There is lots of information online about getting started with dplyr
and using the tidyverse
. We really like this lesson from the Data Carpentry if you are new to the tidyverse
.
7.2 Subset differentially abundant proteins
Let’s subset our results and only keep proteins which have been flagged as exhibiting significant abundance changes,
7.3 Biological interpretation of differentially abundant proteins
Our statistical analyses provided us with a list of proteins that are present with significantly different abundances between M-phase and G1-phase of the cell cycle. We can get an initial idea about what these proteins are and do by looking at the protein descriptions.
## Look at descriptions of proteins upregulated in M relative to deysnchronized cells
sig_up %>%
pull(protein_description) %>%
head()
[1] "Golgi reassembly-stacking protein 2 OS=Homo sapiens OX=9606 GN=GORASP2 PE=1 SV=3"
[2] "Selenoprotein S OS=Homo sapiens OX=9606 GN=SELENOS PE=1 SV=3"
[3] "Anamorsin OS=Homo sapiens OX=9606 GN=CIAPIN1 PE=1 SV=2"
[4] "Krueppel-like factor 16 OS=Homo sapiens OX=9606 GN=KLF16 PE=1 SV=1"
[5] "Histone H1.10 OS=Homo sapiens OX=9606 GN=H1-10 PE=1 SV=1"
[6] "Importin subunit alpha-4 OS=Homo sapiens OX=9606 GN=KPNA3 PE=1 SV=2"
Whilst we may recognise some of the changing proteins, this might be the first time that we are coming across others. Moreover, some protein descriptions contain useful information, but this is very limited. We still want to find out more about the biological role of the statistically significant proteins so that we can infer the potential effects of their abundance changes.
There are many functional analyses that could be done on the proteins with differential abundance:
- Investigate the biological pathways that the proteins function within (KEGG etc.)
- Identify potential interacting partners (IntAct, STRING)
- Determine the subcellular localisation in which the changing proteins are found
- Understand the co-regulation of their mRNAs (Expression Atlas)
- Compare our changing proteins to those previously identified in other proteomic studies of the cell cycle
7.3.1 Gene Ontology (GO) enrichment analysis
One of the common methods used to probe the biological relevance of proteins with significant changes in abundance between conditions is to carry out Gene Ontology (GO) enrichment, or over-representation, analysis.
The Gene Ontology consortium have defined a set of hierarchical descriptions to be assigned to genes and their resulting proteins. These descriptions are split into three categories: cellular components (CC), biological processes (BP) and molecular function (MF). The idea is to provide information about a protein’s subcellular localisation, functionality and which processes it contributes to within the cell. Hence, the overarching aim of GO enrichment analysis is to answer the question:
“Given a list of proteins found to be differentially abundant in my phenotype of interest, what are the cellular components, molecular functions and biological processes involved in this phenotype?”.
Unfortunately, just looking at the GO terms associated with our differentially abundant proteins is insufficient to draw any solid conclusions. For example, if we find that 120 of the 453 proteins significantly downregulated in M phase are annotated with the GO term “kinase activity”, it may seem intuitive to conclude that reducing kinase activity is important for the M-phase phenotype. However, if 90% of all proteins in the cell were kinases (an extreme example), then we might expect to discover a high representation of the “kinase activity” GO term in any protein list we end up with.
This leads us to the concept of an over-representation analysis. We wish to ask whether any GO terms are over-represented (i.e., present at a higher frequency than expected by chance) in our lists of differentially abundant proteins. In other words, we need to know how many proteins with a GO term could have shown differential abundance in our experiment vs. how many proteins with this GO term did show differential abundance in our experiment.
We are going to use a function in R called enrichGO
from the the clusterProfiler
Yu et al. (2012) Bioconductor R package to perform GO enrichment analysis. The package vignette can be found here. and full tutorials for using the package here
The enrichGO
function uses the org.Hs.eg.db
Bioconductor package that has genome wide annotation for human, primarily based on mapping using Entrez Gene identifiers. It also uses the GO.db
package which is a set of annotation maps describing the entire Gene Ontology assembled using data from GO.
In the next code chunk we call the enrichGO
function,
ego_down <- enrichGO(gene = sig_down$Protein, # list of down proteins
universe = limma_results$Protein, # all proteins
OrgDb = org.Hs.eg.db, # database to query
keyType = "UNIPROT", # protein ID encoding
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
ont = "MF", # can be CC, MF, BP, or ALL
readable = TRUE)
Let’s take a look at the output.
ego_down
#
# over-representation test
#
#...@organism Homo sapiens
#...@ontology MF
#...@keytype UNIPROT
#...@gene chr [1:453] "Q9NQW6" "Q9BW19" "P49454" "Q9ULW0" "Q562F6" "O14965" "Q02224" ...
#...pvalues adjusted by 'BH' with cutoff <0.05
#...65 enriched terms found
'data.frame': 65 obs. of 9 variables:
$ ID : chr "GO:0016491" "GO:0005201" "GO:0005509" "GO:0003756" ...
$ Description: chr "oxidoreductase activity" "extracellular matrix structural constituent" "calcium ion binding" "protein disulfide isomerase activity" ...
$ GeneRatio : chr "65/447" "18/447" "37/447" "9/447" ...
$ BgRatio : chr "220/3745" "24/3745" "99/3745" "11/3745" ...
$ pvalue : num 2.97e-13 1.20e-12 3.20e-11 2.01e-07 2.01e-07 ...
$ p.adjust : num 1.29e-10 2.60e-10 4.63e-09 1.43e-05 1.43e-05 ...
$ qvalue : num 1.06e-10 2.15e-10 3.82e-09 1.18e-05 1.18e-05 ...
$ geneID : chr "QSOX1/P4HA2/PDIA4/P4HB/PLOD2/HSD17B4/PDIA3/P4HA1/HADHA/DNAJC10/ALDH18A1/POR/MTHFD2/PDIA6/LOXL2/PLOD1/ACADM/TXND"| __truncated__ "CCN1/THBS1/SPARC/FN1/TNC/COL12A1/LAMC1/COL23A1/EDIL3/LAMC2/TUFT1/COL4A2/EFEMP1/AGRN/HAPLN1/COL1A1/FBN2/COL6A3" "THBS1/SPARC/ITGB1/HSPA5/FSTL1/CALR/EDIL3/HSP90B1/SPTAN1/SDF4/LOXL2/FKBP10/EFEMP1/ESYT1/ASPH/NOTCH2/CANX/PRKCSH/"| __truncated__ "QSOX1/PDIA4/P4HB/PDIA3/PDIA6/TXNDC5/TMX1/ERP44/CRELD2" ...
$ Count : int 65 18 37 9 9 23 23 16 12 26 ...
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
The Innovation. 2021, 2(3):100141
The output of the enrichGO
function is an object of class enrichResult
that contains the ID
and Description
of all enriched GO terms. There is also information about which geneID
s from our significantly downregulated proteins are annotated with each of the enriched GO terms. Let’s take a look at the descriptions.
ego_down$Description %>%
head(10)
[1] "oxidoreductase activity"
[2] "extracellular matrix structural constituent"
[3] "calcium ion binding"
[4] "protein disulfide isomerase activity"
[5] "intramolecular oxidoreductase activity, transposing S-S bonds"
[6] "monoatomic cation transmembrane transporter activity"
[7] "inorganic cation transmembrane transporter activity"
[8] "glycosaminoglycan binding"
[9] "collagen binding"
[10] "active transmembrane transporter activity"
There is a long list because of the hierarchical nature of GO terms. The results of GO enrichment analysis can be visualised in many different ways. For a full overview of GO enrichment visualisation tools see Visualization of functional enrichment result.
dotplot(ego_down,
x = "Count",
showCategory = 10,
font.size = 10,
label_format = 100,
color = "p.adjust")
- Gene ontology (GO) terms described the molecular function (MF), biological processes (BP) and cellular component (CC) of proteins.
- GO terms are hierarchical and generic. They do not relate to specific biological systems e.g., cell type or condition.
- GO enrichment analysis aims to identify GO terms that are present in a list of proteins of interest (foreground) at a higher frequency than expected by chance based on their frequency in a background list of proteins (universe). The universe should be a list of all proteins included identified and quantified in your experiment.
- The
enrichGO
function from theclusterProfiler
package provides a convenient way to carry out reproducible GO enrichment analysis in R.