Analysis of expression proteomics data in R

Overview

These materials focus on expression proteomics, which aims to characterise the protein diversity and abundance in a particular system. You will learn about the bioinformatic analysis steps involved when working with this kind of data, in particular using several dedicated proteomics packages from the Bioconductor project for the R programming language. We will use a real-world dataset obtained from a tandem mass tag (TMT) mass spectrometry experiment. We cover the basic data structures used to store and manipulate protein abundance data, how to do quality control and filtering of the data, and several visualisations. Finally, we include statistical analysis of differential abundance across sample groups (e.g. control vs. treated) and further evaluation and biological interpretation of the results via gene ontology analysis.

Learning Objectives

You will learn about:

  • How mass spectrometry can be used to quantify protein abundance and some of the methods used for peptide quantitation.
  • The bioinformatics steps involved in processing and analysing expression proteomics data.
  • How to assess the quality of your data, deal with missing values and summarise peptide-spectrum match (PSM) level data to the protein level.
  • How to perform differential expression analysis to compare protein abundances between different groups of samples.

Target Audience

Proteomics practitioners or data analysts/bioinformaticians who would like to learn how to use R to analyse proteomics data.

Prerequisites

  • Basic understanding of mass spectrometry.
  • A working knowledge of R and the tidyverse.
  • Familiarity with other Bioconductor data classes, such as those used for RNA-seq analysis, is useful but not required.

Exercises

Exercises in these materials are labelled according to their level of difficulty:

  • Level 1: Exercises are simpler and designed to get you familiar with the concepts and syntax covered in the course.
  • Level 2: Exercises combine different concepts and apply them to a given task.
  • Level 3: Exercises require going beyond the concepts and syntax introduced to solve new problems.

Instructors

This workshop will be run by:

  • Charlotte Hutchings - Cambridge Centre for Proteomics, University of Cambridge
  • Lisa M. Breckels - Cambridge Centre for Proteomics, University of Cambridge
  • Tom Smith - Bioinformatics Facility, MRC Toxicology Unit, University of Cambridge
  • Charlotte S. Dawson - Cambridge Centre for Proteomics, University of Cambridge

Previous instructors include:

  • Thomas Krueger - Department of Biochemistry, University of Cambridge

Authors

About the authors:

  • Charlotte Hutchings
    Affiliation: Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge
    Roles: writing - original draft; conceptualisation; coding
  • Lisa Breckels
    Affiliation: Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge
    Roles: writing - original draft; conceptualisation; coding
  • Tom Smith
    Affiliation: Bioinformatics Facility, MRC Toxicology Unit, University of Cambridge
    Roles: writing-review-editing; coding
  • Charlotte Dawson
    Affiliation: Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge
    Roles: writing-review-editing; coding

Citation

Please cite these materials if:

  • You adapted or used any of them in your own teaching.
  • These materials were useful for your research work. For example, you can cite us in the methods section of your paper: “We carried out our analyses based on the recommendations in Hutchings and Breckels (2023)”.

You can cite these materials as:

Hutchings C, Breckels LM (2023) “CambridgeCentreForProteomics/course_expression_proteomics: Analysis of expression proteomics data in R”, https://cambridgecentreforproteomics.github.io/course_expression_proteomics

Or in BibTeX format:

@Misc{,
  author = {Charlotte Hutchings and Lisa M Breckels},
  title = {CambridgeCentreForProteomics/course_expression_proteomics: Analysis of expression proteomics data in R},
  month = {November},
  year = {2023},
  url = {https://cambridgecentreforproteomics.github.io/course_expression_proteomics}
}

Other key references

Data analysis workflow

Hutchings C, Dawson CS, Krueger T, Lilley KS, Breckels LM. A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data [version 1; peer review: awaiting peer review]. F1000Research 2023, 12:1402 https://doi.org/10.12688/f1000research.139116.1

The QFeatures R/Bioconductor package

Gatto L, Vanderaa C. QFeatures: Quantitative features for mass spectrometry data. R package version 1.12.0. 2023.

The limma R/Bioconductor package

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.

Case-study data

Queiroz, R.M.L., Smith, T., Villanueva, E., Marti-Solano, M., Monti, M., Pizzinga, M., Mirea, D.-M., Ramakrishna, M., Harvey, R.F., Dezi, V., Thomas, G.H., Willis, A.E. & Lilley, K.S. (2019) Comprehensive identification of RNA–protein interactions in any organism using orthogonal organic phase separation (OOPS). Nature Biotechnology. 37 (2), 169–178. doi:10.1038/s41587-018-0001-2.

Mass spectrometry-based proteomics

Dupree, E.J., Jayathirtha, M., Yorkey, H., Mihasan, M., Petre, B.A. & Darie, C.C. (2020) A Critical Review of Bottom-Up Proteomics: The Good, the Bad, and the Future of This Field. Proteomes. 8 (3), 14. doi:10.3390/proteomes8030014.

Obermaier, C., Griebel, A. & Westermeier, R. (2021) Principles of protein labeling techniques. In: A. Posch (ed.). Proteomic Profiling: Methods and Protocols. Methods in Molecular Biology. New York, NY, Springer US. pp. 549–562. doi:10.1007/978-1-0716-1186-9_35.

Gatto, L., Gibb, S. & Rainer, J. (n.d.) Chapter 5 Quantitative data | R for Mass Spectrometry. https://rformassspectrometry.github.io/book/sec-quant.html.

Acknowledgements

  • Thank you to Hugo Tavares for coordinating this course and his valuable input in developing and testing this material.
  • Thank you to Thomas Krueger and Charlotte S. Dawson for their input and guidance in writing this material and the F1000Research workflow "A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data".
  • Thank you to Prof. Kathryn Lilley, group head and director of the Cambridge Centre for Proteomics at the Department of Biochemistry, University of Cambridge.
  • The QFeatures and limma R/Bioconductor packages are fundamental to this workflow; please cite them alongside the course if you use this material. Thank you to Laurent Gatto and Christophe Vanderaa for providing exemplary software for proteomics.
  • Thank you to the R for Mass Spectrometry team for providing excellent material, in particular the R for Mass Spectrometry book by Laurent Gatto, Sebastian Gibb and Johannes Rainer.