7 Biological interpretation
- Be aware of different analyses that can be done to gain a biological understanding of expression proteomics results
- Understand the concept of Gene Ontology (GO) enrichment analyses
- Complete GO enrichment analyses using the
enrichGO
function from theclusterProfiler
package
7.1 Adding metadata to our results using dplyr
Before we can look any further into the biological meaning of any protein abundance changes we need to extract these proteins from our overall results. It is also useful to re-add information about the master protein descriptions since this is lost in the output of limma
analysis.
It is important to note that the results table we have generated from limma
is not in the same order as the input data. Therefore, to add information from our original data e.g., from the rowData
such as the Master.Protein.Descriptions
we must match the protein names between them.
To do this, let’s extract information the Master.Protein.Descriptions
from the original data we have created called all_proteins
.
Recall that all_proteins
is a SummarizedExperiment
object,
all_proteins
class: SummarizedExperiment
dim: 3823 6
metadata(0):
assays(2): assay aggcounts
rownames(3823): A0A0B4J2D5 A0A2R8Y4L2 ... Q9Y6X9 Q9Y6Y8
rowData names(21): Checked Tags ... Protein.FDR.Confidence .n
colnames(6): M_1 M_2 ... G1_2 G1_3
colData names(4): sample rep condition tag
We wish to extract information from the rowData
regarding the Master.Protein.Descriptions
,
## Add master protein descriptions back
protein_info <- all_proteins %>%
rowData() %>%
as_tibble() %>%
select(Protein = Master.Protein.Accessions,
Protein_description = Master.Protein.Descriptions)
protein_info %>% head()
# A tibble: 6 × 2
Protein Protein_description
<chr> <chr>
1 A0A0B4J2D5 Putative glutamine amidotransferase-like class 1 domain-containing…
2 A0A2R8Y4L2 Heterogeneous nuclear ribonucleoprotein A1-like 3 OS=Homo sapiens …
3 A0AVT1 Ubiquitin-like modifier-activating enzyme 6 OS=Homo sapiens OX=960…
4 A0MZ66 Shootin-1 OS=Homo sapiens OX=9606 GN=SHTN1 PE=1 SV=4
5 A1L0T0 2-hydroxyacyl-CoA lyase 2 OS=Homo sapiens OX=9606 GN=ILVBL PE=1 SV…
6 A1X283 SH3 and PX domain-containing protein 2B OS=Homo sapiens OX=9606 GN…
Note, we also extract the Master.Protein.Accessions
column so we can use this to match to the protein
column in our limma
results.
Now we can use the left_join
function from dplyr
to match the protein descriptions to the protein IDs,
limma_results <- limma_results %>%
left_join(protein_info, by = "Protein", suffix = c(".left", ".right"))
# Verify
limma_results %>%
head()
Protein logFC AveExpr t P.Value adj.P.Val B
1 Q9NQW6 -2.077921 2.641270 -40.40812 1.139582e-15 4.356623e-12 25.79468
2 Q9BW19 -2.343592 1.430210 -36.79774 4.073530e-15 7.786553e-12 24.71272
3 P49454 -1.991326 2.598481 -34.88623 8.411188e-15 9.266757e-12 24.07905
4 Q9ULW0 -2.152503 2.476087 -34.52309 9.695796e-15 9.266757e-12 23.95341
5 Q562F6 -3.205603 -1.304111 -32.91865 1.849860e-14 1.414403e-11 23.37677
6 O14965 -2.098029 0.797902 -31.22180 3.791132e-14 2.415583e-11 22.72604
direction significance
1 down sig
2 down sig
3 down sig
4 down sig
5 down sig
6 down sig
Protein_description
1 Anillin OS=Homo sapiens OX=9606 GN=ANLN PE=1 SV=2
2 Kinesin-like protein KIFC1 OS=Homo sapiens OX=9606 GN=KIFC1 PE=1 SV=2
3 Centromere protein F OS=Homo sapiens OX=9606 GN=CENPF PE=1 SV=3
4 Targeting protein for Xklp2 OS=Homo sapiens OX=9606 GN=TPX2 PE=1 SV=2
5 Shugoshin 2 OS=Homo sapiens OX=9606 GN=SGO2 PE=1 SV=2
6 Aurora kinase A OS=Homo sapiens OX=9606 GN=AURKA PE=1 SV=3
dplyr
and tidyverse
There is lots of information online about getting started with dplyr
and using the tidyverse
. We really like this lesson from the Data Carpentry if you are new to the tidyverse
.
7.2 Subset differentially abundant proteins
Let’s subset our results and only keep proteins which have been flagged as exhibiting significant abundance changes,
7.3 Biological interpretation of differentially abundant proteins
Our statistical analyses provided us with a list of proteins that are present with significantly different abundances between M-phase and G1-phase of the cell cycle. We can get an initial idea about what these proteins are and do by looking at the protein descriptions.
## Look at descriptions of proteins upregulated in M relative to G1
sig_up %>%
pull(Protein_description) %>%
head()
[1] "Golgi reassembly-stacking protein 2 OS=Homo sapiens OX=9606 GN=GORASP2 PE=1 SV=3"
[2] "Selenoprotein S OS=Homo sapiens OX=9606 GN=SELENOS PE=1 SV=3"
[3] "Anamorsin OS=Homo sapiens OX=9606 GN=CIAPIN1 PE=1 SV=2"
[4] "Krueppel-like factor 16 OS=Homo sapiens OX=9606 GN=KLF16 PE=1 SV=1"
[5] "Histone H1.10 OS=Homo sapiens OX=9606 GN=H1-10 PE=1 SV=1"
[6] "Importin subunit alpha-4 OS=Homo sapiens OX=9606 GN=KPNA3 PE=1 SV=2"
Whilst we may recognise some of the changing proteins, this might be the first time that we are coming across others. Moreover, some protein descriptions contain useful information, but this is very limited. We still want to find out more about the biological role of the statistically significant proteins so that we can infer the potential effects of their abundance changes.
There are many functional analyses that could be done on the proteins with differential abundance:
- Investigate the biological pathways that the proteins function within (KEGG etc.)
- Identify potential interacting partners (IntAct, STRING)
- Determine the subcellular localisation in which the changing proteins are found
- Understand the co-regulation of their mRNAs (Expression Atlas)
- Compare our changing proteins to those previously identified in other proteomic studies of the cell cycle
7.3.1 Gene Ontology (GO) enrichment analysis
One of the common methods used to probe the biological relevance of proteins with significant changes in abundance between conditions is to carry out Gene Ontology (GO) enrichment, or over-representation, analysis.
The Gene Ontology consortium have defined a set of hierarchical descriptions to be assigned to genes and their resulting proteins. These descriptions are split into three categories: cellular components (CC), biological processes (BP) and molecular function (MF). The idea is to provide information about a protein’s subcellular localisation, functionality and which processes it contributes to within the cell. Hence, the overarching aim of GO enrichment analysis is to answer the question:
“Given a list of proteins found to be differentially abundant in my phenotype of interest, what are the cellular components, molecular functions and biological processes involved in this phenotype?”.
Unfortunately, just looking at the GO terms associated with our differentially abundant proteins is insufficient to draw any solid conclusions. For example, if we find that 120 of the 453 proteins significantly downregulated in M phase are annotated with the GO term “kinase activity”, it may seem intuitive to conclude that reducing kinase activity is important for the M-phase phenotype. However, if 90% of all proteins in the cell were kinases (an extreme example), then we might expect to discover a high representation of the “kinase activity” GO term in any protein list we end up with.
This leads us to the concept of an over-representation analysis. We wish to ask whether any GO terms are over-represented (i.e., present at a higher frequency than expected by chance) in our lists of differentially abundant proteins. In other words, we need to know how many proteins with a GO term could have shown differential abundance in our experiment vs. how many proteins with this GO term did show differential abundance in our experiment.
We are going to use a function in R called enrichGO
from the the clusterProfiler
Yu et al. (2012) Bioconductor R package to perform GO enrichment analysis. The package vignette can be found here. and full tutorials for using the package here
The enrichGO
function uses the org.Hs.eg.db
Bioconductor package that has genome wide annotation for human. It also uses the GO.db
package which is a set of annotation maps describing the entire Gene Ontology assembled using data from GO.
Unfortunately, because GO annotations are by design at the gene level, performing analyses with proteomics data can be more tricky.
In the next code chunk we call the enrichGO
function.
ego_down <- enrichGO(gene = sig_down$Protein, # list of down proteins
universe = limma_results$Protein, # all proteins
OrgDb = org.Hs.eg.db, # database to query
keyType = "UNIPROT", # protein ID encoding
qvalueCutoff = 0.05,
ont = "BP", # can be CC, MF, BP, or ALL
readable = TRUE)
Let’s take a look at the output.
head(ego_down@result)
ID Description GeneRatio
GO:0030198 GO:0030198 extracellular matrix organization 24/444
GO:0043062 GO:0043062 extracellular structure organization 24/444
GO:0045229 GO:0045229 external encapsulating structure organization 24/444
GO:0055085 GO:0055085 transmembrane transport 62/444
GO:0045333 GO:0045333 cellular respiration 36/444
GO:0030199 GO:0030199 collagen fibril organization 13/444
BgRatio RichFactor FoldEnrichment zScore pvalue
GO:0030198 52/3683 0.4615385 3.828482 7.604487 8.227443e-10
GO:0043062 52/3683 0.4615385 3.828482 7.604487 8.227443e-10
GO:0045229 52/3683 0.4615385 3.828482 7.604487 8.227443e-10
GO:0055085 247/3683 0.2510121 2.082157 6.518386 3.253147e-09
GO:0045333 109/3683 0.3302752 2.739648 6.825355 4.230860e-09
GO:0030199 19/3683 0.6842105 5.675557 7.564154 1.319636e-08
p.adjust qvalue
GO:0030198 7.742023e-07 6.550199e-07
GO:0043062 7.742023e-07 6.550199e-07
GO:0045229 7.742023e-07 6.550199e-07
GO:0055085 2.295909e-06 1.942471e-06
GO:0045333 2.388744e-06 2.021015e-06
GO:0030199 6.208888e-06 5.253078e-06
geneID
GO:0030198 CCN1/QSOX1/COL12A1/ITGB1/SERPINH1/APP/LAMC1/PLOD2/COL23A1/P4HA1/FLOT1/LOXL2/COL4A2/FKBP10/PLOD1/MMP2/CRTAP/PRDX4/COLGALT1/PLOD3/TGFB2/ERO1A/COL1A1/DAG1
GO:0043062 CCN1/QSOX1/COL12A1/ITGB1/SERPINH1/APP/LAMC1/PLOD2/COL23A1/P4HA1/FLOT1/LOXL2/COL4A2/FKBP10/PLOD1/MMP2/CRTAP/PRDX4/COLGALT1/PLOD3/TGFB2/ERO1A/COL1A1/DAG1
GO:0045229 CCN1/QSOX1/COL12A1/ITGB1/SERPINH1/APP/LAMC1/PLOD2/COL23A1/P4HA1/FLOT1/LOXL2/COL4A2/FKBP10/PLOD1/MMP2/CRTAP/PRDX4/COLGALT1/PLOD3/TGFB2/ERO1A/COL1A1/DAG1
GO:0055085 THBS1/ITGB1/HSPA5/RACGAP1/TOMM70/MAGT1/ATP5F1A/SLC3A2/ATP1B1/SLC39A14/SLC2A1/HSPD1/SLC29A1/TIMM50/HSPA9/ASPH/TIMM21/ITGAV/NDUFS2/AIFM1/USP9X/TMEM165/SLC12A2/NNT/FLNA/PEX14/MDH2/AFG3L2/UQCRC1/SLC25A3/NDUFS7/RALBP1/YES1/ATP5F1B/ERO1A/ARPP19/FMR1/ABCB7/GHITM/COX2/ATP2A2/GRPEL1/ATP2B1/SLC25A24/TIMM44/ATP1A1/ATP6AP2/LETM1/SLC7A1/GOPC/ATP5PB/RPS6KB1/NDUFB3/UQCRFS1/TOMM40/ACTN4/COX4I1/SLC30A1/PMPCB/NDUFB8/STXBP3/SLC1A5
GO:0045333 PLEC/ATP5F1A/CS/SOD2/SDHA/ACO2/SHMT2/NDUFS2/NNT/NDUFB11/ETFA/MDH2/SDHB/SUCLA2/UQCRC1/UQCRC2/CDK1/NDUFS7/SUCLG2/ATP5F1B/GHITM/COX2/DLST/DLD/OXA1L/IDH3A/ATP5PB/STOML2/NDUFB3/CAT/HCCS/UQCRFS1/PDHB/COX4I1/GPD2/NDUFB8
GO:0030199 COL12A1/SERPINH1/PLOD2/P4HA1/LOXL2/FKBP10/PLOD1/CRTAP/COLGALT1/PLOD3/TGFB2/ERO1A/COL1A1
Count
GO:0030198 24
GO:0043062 24
GO:0045229 24
GO:0055085 62
GO:0045333 36
GO:0030199 13
The output of the enrichGO
function is an object of class enrichResult
that contains the ID
and Description
of all enriched GO terms. There is also information about which geneID
s from our significantly downregulated proteins are annotated with each of the enriched GO terms. Let’s take a look at the descriptions.
ego_down$Description %>%
head(10)
[1] "extracellular matrix organization"
[2] "extracellular structure organization"
[3] "external encapsulating structure organization"
[4] "transmembrane transport"
[5] "cellular respiration"
[6] "collagen fibril organization"
[7] "cell division"
[8] "cell adhesion"
[9] "aerobic respiration"
[10] "mitochondrial transmembrane transport"
There is a long list because of the hierarchical nature of GO terms. The results of GO enrichment analysis can be visualised in many different ways. For a full overview of GO enrichment visualisation tools see Visualization of functional enrichment result.
Here, we’ll use a ‘dotplot’ first
p <- dotplot(ego_down,
x = "Count",
showCategory = 20,
font.size = 10,
label_format = 100,
color = "p.adjust")
print(p)
The dotplot gives a good overview of the over-representation results for the top GO terms, but it’s not clear how much overlap there is between the proteins annotated with each term. For this, we can use an ‘upset’ plot.
upsetplot(ego_down, n=10)
It’s usually informative to explore the proteins with the over-represented GO terms further. For this, we need to know which proteins have a particular GO term. This is where is gets a little more tricky, since we need to map between Uniprot IDs and Entrez gene IDs for this.
We start by using the toTable
method to turn the UNIPROT map into a data.frame
.
## Obtain a gene to protein mapping for the proteins in our QFeatures
## org.Hs.egUNIPROT is part of org.Hs.eg.db
g2p_map_df <- toTable(org.Hs.egUNIPROT) %>%
filter(uniprot_id %in% rownames(cc_qf[['log_norm_proteins']]))
## Obtain the GO terms for each gene
go_submap <- org.Hs.egGO[g2p_map_df$gene_id]
## Merge the gene to protein map and GO terms for
## genes to get GO terms for proteins
p2go <- toTable(go_submap) %>%
left_join(g2p_map_df, by='gene_id')
Warning in left_join(., g2p_map_df, by = "gene_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 33151 of `x` matches multiple rows in `y`.
ℹ Row 186 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Now we can visualise the proteins with a particular GO term. Here, we will use a heatmap to see the patterns in protein abundance differences between the conditions for a given GO term.
- Gene ontology (GO) terms described the molecular function (MF), biological processes (BP) and cellular component (CC) of genes (and their protein products).
- GO terms are hierarchical and generic. They do not relate to specific biological systems e.g., cell type or condition.
- GO enrichment analysis aims to identify GO terms that are present in a list of proteins of interest (foreground) at a higher frequency than expected by chance based on their frequency in a background list of proteins (universe). The universe should be a list of all proteins included identified and quantified in your experiment.
- The
enrichGO
function from theclusterProfiler
package provides a convenient way to carry out reproducible GO enrichment analysis in R. - There are many ways to visualise the results from functional enrichment analyses. With GO terms in particular, it’s important to consider the relationships between the gene sets we are testing for over-representation.