Analysis of expression proteomics data in R

Overview

These materials focus on expression proteomics, which aims to characterise the protein diversity and abundance in a particular system. You will learn about the bioinformatic analysis steps involved when working with this kind of data, in particular using several dedicated proteomics packages from the Bioconductor project for R. We will use a real-world dataset obtained from a tandem mass tag (TMT) mass spectrometry experiment. We cover the basic data structures used to store and manipulate protein abundance data, how to perform quality control and filtering of the data, and several visualisations. Finally, we include statistical analysis of differential abundance across sample groups (e.g. control vs. treated) and further evaluation and biological interpretation of the results via gene ontology analysis.

Learning Objectives

You will learn about:

How mass spectrometry can be used to quantify protein abundance and some of the methods used for peptide quantitation.
The bioinformatics steps involved in processing and analysing expression proteomics data.
How to assess the quality of your data, deal with missing values and summarise peptide-spectrum match (PSM) level data to protein level.
How to perform differential expression analysis to compare protein abundances between different groups of samples.

Target Audience

Proteomics practitioners or data analysts/bioinformaticians who would like to learn how to use R to analyse proteomics data.

Prerequisites

Basic understanding of mass spectrometry.
- Watch this video for an excellent overview.
A working knowledge of R and the tidyverse.
Familiarity with other Bioconductor data classes, such as those used for RNA-seq analysis, is useful but not required.

Exercises

Exercises in these materials are labelled according to their level of difficulty:

Level	Description
	Exercises in level 1 are simpler and designed to get you familiar with the concepts and syntax covered in the course.
	Exercises in level 2 combine different concepts and apply them to a given task.
	Exercises in level 3 require going beyond the concepts and syntax introduced to solve new problems.

Instructors

This workshop will be run by:

Lisa Breckels - Cambridge Centre for Proteomics, University of Cambridge
Tom Smith - MRC Laboratory of Molecular Biology, Cambridge
Alistair Hines - Cambridge Centre for Proteomics, University of Cambridge
Oliver Crook - Kavli Institute for NanoScience Discovery, University of Oxford

Previous instructors include:

Charlotte Hutchings - Cambridge Centre for Proteomics, University of Cambridge
Thomas Krueger - Department of Biochemistry, University of Cambridge
Charlotte S. Dawson - Cambridge Centre for Proteomics, University of Cambridge

Authors

About the authors:

Charlotte Hutchings
Affiliation: Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge
Roles: writing - original draft; conceptualisation; coding
Lisa Breckels
Affiliation: Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge
Roles: writing - original draft; conceptualisation; coding
Tom Smith
Affiliation: MRC Laboratory of Molecular Biology, Cambridge
Roles: writing-review-editing; coding
Charlotte Dawson
Affiliation: Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge
Roles: writing-review-editing; coding

Citation

Please cite these materials if:

You adapted or used any of them in your own teaching.
These materials were useful for your research work.

For example, you can cite us in the methods section of your paper: “We carried out our analyses based on the recommendations in Hutchings and Breckels (2024)”.

You can cite these materials as:

Hutchings C, Breckels LM (2024) “CambridgeCentreForProteomics/course_expression_proteomics: Analysis of expression proteomics data in R”, https://cambridgecentreforproteomics.github.io/course_expression_proteomics

Or in BibTeX format:

@Misc{,
  author = {Charlotte Hutchings and Lisa M Breckels},
  title = {CambridgeCentreForProteomics/course_expression_proteomics: Analysis of expression proteomics data in R},
  month = {November},
  year = {2024},
  url = {https://cambridgecentreforproteomics.github.io/course_expression_proteomics}
}

Other key references

Data analysis workflow

Hutchings C, Dawson CS, Krueger T, Lilley KS, Breckels LM. A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data [version 2; peer review: 3 approved]. F1000Research 2024, 12:1402 https://f1000research.com/articles/12-1402/v2

The QFeatures R/Bioconductor package

Gatto L, Vanderaa C: QFeatures: Quantitative features for mass spectrometry data. R package version 1.12.0. 2023.

The limma R/Bioconductor package

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.

Case-study data

Queiroz, R.M.L., Smith, T., Villanueva, E., Marti-Solano, M., Monti, M., Pizzinga, M., Mirea, D.-M., Ramakrishna, M., Harvey, R.F., Dezi, V., Thomas, G.H., Willis, A.E. & Lilley, K.S. (2019) Comprehensive identification of RNA–protein interactions in any organism using orthogonal organic phase separation (OOPS). Nature Biotechnology. 37 (2), 169–178. doi:10.1038/s41587-018-0001-2.

Mass spectrometry-based proteomics

Dupree, E.J., Jayathirtha, M., Yorkey, H., Mihasan, M., Petre, B.A. & Darie, C.C. (2020) A Critical Review of Bottom-Up Proteomics: The Good, the Bad, and the Future of This Field. Proteomes. 8 (3), 14. doi:10.3390/proteomes8030014.

Obermaier, C., Griebel, A. & Westermeier, R. (2021) Principles of protein labeling techniques. In: A. Posch (ed.). Proteomic Profiling: Methods and Protocols. Methods in Molecular Biology. New York, NY, Springer US. pp. 549–562. doi:10.1007/978-1-0716-1186-9_35.

Gatto, L., Gibb, S. & Rainer, J. (n.d.) Chapter 5 Quantitative data | R for Mass Spectrometry. https://rformassspectrometry.github.io/book/sec-quant.html.

Acknowledgements

Thank you to Hugo Tavares for coordinating this course and his valuable input in developing and testing this material.
Thank you to Thomas Krueger and Charlotte S. Dawson for their input and guidance in writing this material and the F1000Research workflow “A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data”.
Thank you to Prof. Kathryn Lilley, group head and director of the Cambridge Centre for Proteomics at the Department of Biochemistry, University of Cambridge.
The QFeatures and limma R/Bioconductor packages are fundamental to this workflow; please cite them alongside the course if you use this material. Thank you to Laurent Gatto and Christophe Vanderaa for providing exemplary software for proteomics.
Thank you to the R for Mass Spectrometry team for providing excellent material, in particular the R for Mass Spectrometry Book by Laurent Gatto, Sebastian Gibb and Johannes Rainer.