cRAP databases

library(Biostrings)
library(camprotR)
library(httr)

Introduction

Peptide-spectrum matching requires two main inputs: a collection of MS2 spectra and a FASTA database of protein/peptide sequences. Search algorithms then match the supplied MS2 spectra to theoretical spectra generated from an in-silico digest of the FASTA database (typically we use the SwissProt protein database for the organism of interest). The FASTA database should be chosen carefully, as a very large database can reduce the number of PSMs obtained (Jeong et al., 2012). Using an unnecessarily large database may also affect protein quantification.
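To make the idea of an in-silico digest concrete, here is a minimal base R sketch. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P) and ignores missed cleavages and peptide length filters. The function name digest_trypsin() is hypothetical and is not how any particular search engine implements digestion.

```r
# Simplified tryptic digest: cleave after K or R, but not before P.
# digest_trypsin() is a hypothetical helper written for illustration only.
digest_trypsin <- function(protein) {
  # Mark each cleavage site with a comma, then split on the commas
  marked <- gsub("([KR])(?!P)", "\\1,", protein, perl = TRUE)
  strsplit(marked, ",", fixed = TRUE)[[1]]
}

# Start of the bovine albumin sequence shown later in this document
peptides <- digest_trypsin("MKWVTFISLLLLFSSAYSRGVF")
peptides
```

Every protein in the FASTA database contributes peptides in this way, so a larger database means more candidate peptides for each spectrum to be matched against.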

Certain proteinaceous contaminants are commonly introduced by accident during sample preparation for proteomics. To avoid misidentifying MS2 spectra that come from these contaminants, we use an additional (much smaller) FASTA database containing the sequences of these common contaminants. This is generally called a contaminants database, or is colloquially known as a cRAP database in our lab.

It is very important that you know what proteins are in your contaminants database, as PSMs/peptides from these proteins are deliberately filtered out in many of the functions in camprotR. See this blog post for some experimental situations where you may be interested in proteins that are usually deemed contaminants.

The 3 cRAP databases

There are two widely used FASTA databases containing common contaminants:

The Global Proteome Machine (GPM) common Repository of Adventitious Proteins (cRAP)
The contaminants.fasta file which is distributed in every single installation of MaxQuant.

In addition, in the CCP we use our own cRAP database which is similar but not identical to the GPM cRAP. I highly recommend you use the CCP cRAP database, as the other two are quite out of date. The download_ccp_crap() function in camprotR provides a quick and easy way to download the CCP cRAP database (see the section below).

common Repository of Adventitious Proteins (cRAP)

This list of proteins was collated by the Global Proteome Machine (GPM) organisation. To quote the cRAP website,

cRAP is an attempt to create a list of proteins commonly found in proteomics experiments either by accident or through unavoidable contamination of protein samples.

cRAP includes 5 main categories of proteins:

Laboratory proteins used somewhere in the sample preparation process.
Proteins added from dust or accidental contact with the sample
Proteins used as molecular weight markers
Proteins in the Sigma Universal Protein Standard (UPS)
Common viral contaminants

According to the cRAP website, it has not been updated since 2012. Therefore, I would not recommend using this FASTA for your peptide searches.

Nevertheless, we can download this FASTA database through R. The following chunk is not run by default as the URL is not always stable.

# download cRAP
crap_path <- tempfile(fileext = ".fasta")
download.file("ftp://ftp.thegpm.org/fasta/cRAP/crap.fasta",
              destfile = crap_path)
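Because the URL is not always stable, one option is to wrap the download so that a failure does not stop the rest of a script. The sketch below uses only base R; safe_download() is a hypothetical helper, not a camprotR function, and the file URL in the example is deliberately non-existent so that the failure path can be shown without a network connection.

```r
# Hypothetical helper: returns TRUE if the download succeeded, FALSE otherwise
safe_download <- function(url, destfile) {
  tryCatch(
    {
      utils::download.file(url, destfile = destfile, quiet = TRUE)
      TRUE
    },
    error = function(e) FALSE,
    warning = function(w) FALSE
  )
}

# A deliberately invalid location, used here only to demonstrate the failure path
ok <- safe_download("file:///no/such/directory/crap.fasta",
                    tempfile(fileext = ".fasta"))
ok
```

If the download fails, a locally stored copy of the FASTA can be used instead.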

The GPM cRAP database is also installed with camprotR. You can access it with the following command.

crap_path <- system.file("extdata", "cRAP_GPM.fasta.gz", package = "camprotR")

Then we can take a look at it using the useful Biostrings package. With Biostrings we can load FASTA files into R and easily compare and manipulate them.

# load cRAP FASTA
crap <- Biostrings::readAAStringSet(crap_path)
crap
#> AAStringSet object of length 116:
#>       width seq                                             names               
#>   [1]   607 MKWVTFISLLLLFSSAYSRGVF...DKEACFAVEGPKLVVSTQTALA sp|ALBU_BOVIN|
#>   [2]   511 MKLFWLLFTIGFCWAQYSSNTQ...AHFSISNSAEDPFIAIHAESKL sp|AMYS_HUMAN|
#>   [3]   214 MKLLILTCLVAVALARPKHPIK...SFSDIPNPIGSENSEKTTMPLW sp|CAS1_BOVIN|
#>   [4]   222 MKFFIFTCLLAVALAKNTMEHV...HQKAMKPWIQPKTKVIPYVRYL sp|CAS2_BOVIN|
#>   [5]   224 MKVLILACLVALALARELEELN...QAFLLYQEPVLGPVRGPFPIIV sp|CASB_BOVIN|
#>   ...   ... ...
#> [112]   238 MSKGEELFTGVVPILVELDGDV...HMVLLEFVTAAGITHGMDELYK sp|GFP_AEQVI|
#> [113]   204 MAEEVEEERLKYLDFVRAAGVY...VSSYLPLLPTEKITKVFGDEAS sp|SRPP_HEVBR|
#> [114]   138 MAEDEDNQQGQGEGLKYLGFVQ...RSLASSLPGQTKILAKVFYGEN sp|REF_HEVBR|
#> [115]   348 MFSSVMVALVSLAVAVSANPGL...PDKAVMNADNHEYFSENNPAQS sp|PLMP_GRIFR|
#> [116]   271 MSHIQRETSCSRPRLNSNLDAD...KYGIDNPDMNKLQFHLMLDEFF KKA1_ECOLX

We can see that there are 116 sequences in cRAP.

head(names(crap))
#> [1] "sp|ALBU_BOVIN|" "sp|AMYS_HUMAN|" "sp|CAS1_BOVIN|" "sp|CAS2_BOVIN|"
#> [5] "sp|CASB_BOVIN|" "sp|CASK_BOVIN|"

The headers for each sequence in this FASTA file contain very little information, which is another reason not to use the GPM cRAP as your contaminants database.

Cambridge Centre for Proteomics cRAP (CCP cRAP)

The cRAP database we use in the CCP is largely based on the GPM cRAP database, with a few extra sequences added to the end. It also has much more informative headers, and the sequences for the commercial proteases Endoproteinase GluC (NEB, P8100S) and recombinant Lys-C (Promega, V167A) have been added.

You can download the latest version of the CCP cRAP using the download_ccp_crap() function. It is important to note the date on which you download the CCP cRAP database and the current UniProt release, as the sequences and accessions can change (slightly) over time. Generally I like to include the UniProt release in the file name e.g. 2021-01_CCP_cRAP.fasta, but for this example we’ll just use a temporary file.

ccp_tmp <- tempfile(fileext = ".fasta")
download_ccp_crap(ccp_tmp, is_crap = TRUE, verbose = TRUE)
#> Downloading from UniProtKB release: 2024_03

We can load this FASTA file into R and take a look.

ccp_crap <- Biostrings::readAAStringSet(ccp_tmp)
ccp_crap
#> AAStringSet object of length 125:
#>       width seq                                             names               
#>   [1]   348 MSIPETQKGVIFYESHGKLEYK...PEIYEKMEKGQIVGRYVVDTSK sp|cRAP001|P00330...
#>   [2]   609 MKWVTFISLLFLFSSAYSRGVF...KETCFAEEGKKLVAASQAALGL sp|cRAP002|P02768...
#>   [3]   364 MPHSHPALTPEQKKELSDIAHR...YTPSGQAGAAASESLFISNHAY sp|cRAP003|P00883...
#>   [4]   511 MKLFWLLFTIGFCWAQYSSNTQ...AHFSISNSAEDPFIAIHAESKL sp|cRAP004|P0DUB6...
#>   [5]   511 MKLFWLLFTIGFCWAQYSSNTQ...AHFSISNSAEDPFIAIHAESKL sp|cRAP005|P0DTE7...
#>   ...   ... ...
#> [121]   387 MNVLLLLVLCTLAMGCGATSPP...KEQRSAECPGPAQKGYPFILPS sp|cRAP121|Q58D62...
#> [122]   266 MLRLLVFTSLVLYGHSTQDFPE...KPTVFTRVSAYISWINNAIASN sp|cRAP122|Q28153...
#> [123]   269 MIRALLLSTLVAGALSCGVPTY...KPSVFTRVSNYNDWISSVIENN sp|cRAP123|Q29461...
#> [124]   274 VILPNNDRHQITDTTNGHYAPV...NPDNGDNNNSDNPDAAHHHHHH sp|cRAP126|000000...
#> [125]   261 AGYRDGFGASGSCEVDAVCATQ...GVYSQISRYFAPHQHQHQHQHQ sp|cRAP127|000000...

We can see there are 125 sequences. Now let's have a look at some of the sequence headers.

head(names(ccp_crap))
#> [1] "sp|cRAP001|P00330|ADH1_YEAST Alcohol dehydrogenase 1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=ADH1 PE=1 SV=5"
#> [2] "sp|cRAP002|P02768|ALBU_HUMAN Albumin OS=Homo sapiens OX=9606 GN=ALB PE=1 SV=2"                                                            
#> [3] "sp|cRAP003|P00883|ALDOA_RABIT Fructose-bisphosphate aldolase A OS=Oryctolagus cuniculus OX=9986 GN=ALDOA PE=1 SV=2"                       
#> [4] "sp|cRAP004|P0DUB6|AMY1A_HUMAN Alpha-amylase 1A OS=Homo sapiens OX=9606 GN=AMY1A PE=1 SV=1"                                                
#> [5] "sp|cRAP005|P0DTE7|AMY1B_HUMAN Alpha-amylase 1B OS=Homo sapiens OX=9606 GN=AMY1B PE=1 SV=1"                                                
#> [6] "sp|cRAP006|P0DTE8|AMY1C_HUMAN Alpha-amylase 1C OS=Homo sapiens OX=9606 GN=AMY1C PE=1 SV=1"

These sequence headers are actually complete, and follow the standard structure of UniProt FASTA headers. Importantly they contain both the UniProt accession e.g. P00330 and the sequence version e.g. SV=5. Each time the amino acid sequence of a UniProt entry is modified, the sequence version is incremented by 1. With the UniProt accession and the SV number, one can know exactly which sequence is being referred to in the FASTA file.
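As a rough illustration of how the accession and sequence version can be pulled out of these headers, here is a base R sketch using two of the headers shown above. It assumes the `sp|cRAPnnn|accession|...` layout seen in the CCP cRAP and is not a camprotR function.

```r
# Two CCP cRAP headers copied from the output above
headers <- c(
  "sp|cRAP001|P00330|ADH1_YEAST Alcohol dehydrogenase 1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=ADH1 PE=1 SV=5",
  "sp|cRAP002|P02768|ALBU_HUMAN Albumin OS=Homo sapiens OX=9606 GN=ALB PE=1 SV=2"
)

# The UniProt accession is the third pipe-delimited field
fields <- strsplit(headers, "|", fixed = TRUE)
accession <- vapply(fields, `[`, character(1), 3)

# The sequence version is the number following "SV="
seq_version <- as.integer(sub(".*SV=([0-9]+).*", "\\1", headers))

data.frame(accession, seq_version)
```

Headers without an `SV=` field, such as the two commercial protease entries, would need separate handling, since the pattern above only works when `SV=` is present.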

This isn’t exactly important for everyday use, but may prove important when reanalysing a proteomics experiment after a long time has passed.

MaxQuant contaminants.fasta

Included in every installation of MaxQuant there is a file called contaminants.fasta. This is the contaminants database that MaxQuant will use by default for peptide searches; however, it is possible to provide your own contaminants FASTA in MaxQuant if you want.

This is conjecture, but I’m fairly sure this FASTA database hasn’t changed since the MaxQuant paper was released in 2008, making it possibly even older than the GPM cRAP database. Therefore, I also would not recommend using this FASTA for your peptide searches.

Nevertheless, we can download this FASTA database through R and explore it like the GPM cRAP FASTA above. The following chunk is not run by default as the URL is not always stable.

# download MaxQuant contaminants
mq_path <- tempfile(fileext = ".fasta")
download.file("http://lotus1.gwdg.de/mpg/mmbc/maxquant_input.nsf/7994124a4298328fc125748d0048fee2/$FILE/contaminants.fasta",
              destfile = mq_path)

This FASTA is also installed with camprotR and you can access it in a similar way to the GPM cRAP FASTA.

# get file path
mq_path <- system.file("extdata", "contaminants.fasta.gz", package = "camprotR")

# load MaxQuant contaminants FASTA
mq <- Biostrings::readAAStringSet(mq_path)
mq
#> AAStringSet object of length 245:
#>       width seq                                             names               
#>   [1]   231 FPTDDDDKIVGGYTCAANSIPY...KPGVYTKVCNYVNWIQQTIAAN P00761 SWISS-PROT...
#>   [2]   540 MSRQFTYKSGAAAKGGFSGCSA...SEFRDSQGKTLALSSPTKKTMR Q32MB2 TREMBL:Q32...
#>   [3]   594 MTSVGVFSDMLNGCGKDGLVPR...GGSVSGSSSSKIISTTTLNKRR P19013 SWISS-PROT...
#>   [4]   521 MSLSPCRAQRGFSARSACSARS...GSSCHTILKKTVESSLKTSITY Q7RTT2 TREMBL:Q7R...
#>   [5]   653 MKRICGSLLLLGLSISAALAAP...IIQTFTKDLSSEAAQRAPGSCG P15636 SWISS-PROT...
#>   ...   ... ...
#> [241]   199 MKPVSGRKSPVLYLLGILTVLL...LSLEIEKLEQEKRKKLQYQYAS ENSEMBL:ENSBTAP00...
#> [242]   452 MWAVLSLPLACLLAQAWLVPGS...IIYEETSRTVLFLGRVVDPTLL ENSEMBL:ENSBTAP00...
#> [243]   591 MSFGNWKPTVVVQGILWILYGL...RRASSRSKRRPPPTILLSTDLQ ENSEMBL:ENSBTAP00...
#> [244]  1477 MGKNKLLYPSLTLLLLLLLPTD...YETDEFAVAEYSAPCSKDIGNA ENSEMBL:ENSBTAP00...
#> [245]   604 MKQLKLTGFVIFFFFLTESLTL...AQEPEACFKEESPKLAAKSQAA REFSEQ:XP_585019 ...

head(names(mq))
#> [1] "P00761 SWISS-PROT:P00761|TRYP_PIG Trypsin - Sus scrofa (Pig)."                           
#> [2] "Q32MB2 TREMBL:Q32MB2;Q86Y46 Tax_Id=9606 Gene_Symbol=KRT73 Keratin-73"                    
#> [3] "P19013 SWISS-PROT:P19013 Tax_Id=9606 Gene_Symbol=KRT4 keratin 4"                         
#> [4] "Q7RTT2 TREMBL:Q7RTT2 Tax_Id=9606 Gene_Symbol=KRT78 Keratin-78"                           
#> [5] "P15636 SWISS-PROT:P15636 Protease I precursor Lysyl endopeptidase Achromobacter lyticus."
#> [6] "P09870 SWISS-PROT:P09870 Arg-C (Clostripain) - Clostridium histolyticum."
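These headers follow a different layout from the UniProt-style headers in the CCP cRAP. As a rough sketch, the leading accession and the source database can be extracted with base R, using three of the headers shown above. This assumes the "accession source:accession ..." layout of the first entries only.

```r
# Three MaxQuant contaminants headers copied from the output above
mq_headers <- c(
  "P00761 SWISS-PROT:P00761|TRYP_PIG Trypsin - Sus scrofa (Pig).",
  "Q32MB2 TREMBL:Q32MB2;Q86Y46 Tax_Id=9606 Gene_Symbol=KRT73 Keratin-73",
  "P19013 SWISS-PROT:P19013 Tax_Id=9606 Gene_Symbol=KRT4 keratin 4"
)

# The accession is everything before the first space
mq_accession <- sub(" .*$", "", mq_headers)

# The source database is the text between the first space and the first colon
mq_source <- sub(":.*$", "", sub("^[^ ]+ ", "", mq_headers))

data.frame(mq_accession, mq_source)
```

Entries near the end of the file begin directly with a source prefix (e.g. ENSEMBL: or REFSEQ:), so they do not fit this pattern and would need their own rule.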

Make your own cRAP database

You can add sequences easily enough to a cRAP FASTA database by just copying and pasting. However, for reproducibility purposes it is better to use an R script to generate your own custom cRAP database.

Here we will generate a cRAP database based on the CCP cRAP, but with some extra protease sequences added to the end.

First you would download the CCP cRAP FASTA. The following chunk is not run by default as we have already downloaded it above. In this chunk we show the “best practice” of naming the CCP cRAP FASTA file with the current UniProt release.

# download CCP cRAP
download_ccp_crap(paste0(check_uniprot_release(), "_CCP_cRAP.fasta"))

Then you download the sequences you want to add and save them into a FASTA file. Here we’ll get some protease sequences from the bacterium Streptomyces griseus.

We can download sequences from UniProt automatically using the make_fasta() function. This function takes a character vector of UniProt accessions, downloads their sequences, and then saves the sequences into a FASTA file. We don’t want to add the cRAP00X numbering to the sequence headers just yet so we set is_crap = FALSE.

griseus_tmp <- tempfile(fileext = ".fasta")
make_fasta(accessions = c("P00776", "P00777", "P80561"),
           file = griseus_tmp,
           is_crap = FALSE)
#> Downloading from UniProtKB release: 2024_03

Before we add these to our CCP cRAP FASTA, we just need to take note of what cRAP number the sequence headers stop at.

tail(names(ccp_crap))
#> [1] "sp|cRAP120|P12763|FETUA_BOVIN Alpha-2-HS-glycoprotein OS=Bos taurus OX=9913 GN=AHSG PE=1 SV=2"                      
#> [2] "sp|cRAP121|Q58D62|FETUB_BOVIN Fetuin-B OS=Bos taurus OX=9913 GN=FETUB PE=1 SV=1"                                    
#> [3] "sp|cRAP122|Q28153|CELA1_BOVIN Chymotrypsin-like elastase family member 1 OS=Bos taurus OX=9913 GN=CELA1 PE=2 SV=1"  
#> [4] "sp|cRAP123|Q29461|CEL2A_BOVIN Chymotrypsin-like elastase family member 2A OS=Bos taurus OX=9913 GN=CELA2A PE=2 SV=1"
#> [5] "sp|cRAP126|000000|ENDOP_GLUC Endoproteinase GluC NEB P8100S"                                                        
#> [6] "sp|cRAP127|000000|RECOM_LYSC rLys-C Promega V167A"
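The last cRAP number can also be worked out programmatically rather than read by eye. Here is a minimal base R sketch using three of the headers shown above; in practice the input would be names(ccp_crap). The variable names are illustrative only.

```r
# Three CCP cRAP headers copied from the output above
headers <- c(
  "sp|cRAP123|Q29461|CEL2A_BOVIN Chymotrypsin-like elastase family member 2A OS=Bos taurus OX=9913 GN=CELA2A PE=2 SV=1",
  "sp|cRAP126|000000|ENDOP_GLUC Endoproteinase GluC NEB P8100S",
  "sp|cRAP127|000000|RECOM_LYSC rLys-C Promega V167A"
)

# Extract the "cRAPnnn" identifier from each header and keep the numeric part
crap_ids <- regmatches(headers, regexpr("cRAP[0-9]+", headers))
crap_numbers <- as.integer(sub("cRAP", "", crap_ids, fixed = TRUE))

# The highest number in use, and the number the next sequence should take
last_crap <- max(crap_numbers)
next_crap <- last_crap + 1L
c(last = last_crap, next_start = next_crap)
```

Taking the maximum rather than counting the sequences matters here, because the numbering has gaps (there is no cRAP124 or cRAP125 in the headers above).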

The final number is 127, so we want the cRAP numbering of the additional sequences to start at 128. Finally, we use the append_fasta() function from camprotR to add our S. griseus sequences to our CCP cRAP FASTA. This function is used to add the sequences from one FASTA (file1) onto the end of another FASTA (file2).

append_fasta(
  file1 = griseus_tmp,
  file2 = ccp_tmp,
  is_crap = TRUE,
  crap_start = 128
)

Let's just check that this worked properly. There should be 128 sequences now (the original 125 plus the 3 we added).

Biostrings::readAAStringSet(ccp_tmp)
#> AAStringSet object of length 128:
#>       width seq                                             names               
#>   [1]   348 MSIPETQKGVIFYESHGKLEYK...PEIYEKMEKGQIVGRYVVDTSK sp|cRAP001|P00330...
#>   [2]   609 MKWVTFISLLFLFSSAYSRGVF...KETCFAEEGKKLVAASQAALGL sp|cRAP002|P02768...
#>   [3]   364 MPHSHPALTPEQKKELSDIAHR...YTPSGQAGAAASESLFISNHAY sp|cRAP003|P00883...
#>   [4]   511 MKLFWLLFTIGFCWAQYSSNTQ...AHFSISNSAEDPFIAIHAESKL sp|cRAP004|P0DUB6...
#>   [5]   511 MKLFWLLFTIGFCWAQYSSNTQ...AHFSISNSAEDPFIAIHAESKL sp|cRAP005|P0DTE7...
#>   ...   ... ...
#> [124]   274 VILPNNDRHQITDTTNGHYAPV...NPDNGDNNNSDNPDAAHHHHHH sp|cRAP126|000000...
#> [125]   261 AGYRDGFGASGSCEVDAVCATQ...GVYSQISRYFAPHQHQHQHQHQ sp|cRAP127|000000...
#> [126]   297 MTFKRFSPLSSTSRYARLLAVA...TGGTTFYQPVTEALSAYGATVL sp|cRAP128|P00776...
#> [127]   299 MRIKRTSNRSNAARRVRTTAVL...SGGTTFFQPVTEALSAYGVSVY sp|cRAP129|P00777...
#> [128]   445 MRPNRFSLRRSPTAVAAVALAA...KLRVQDIARQDTGYIDSWKLTF sp|cRAP130|P80561...
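The renumbering visible in the last three headers can be illustrated with a few lines of base R. This is only a sketch of the header pattern and not camprotR's implementation. The accessions are the ones used above, but the entry names and descriptions are placeholders, and the input layout "sp|accession|..." is an assumption.

```r
# Placeholder headers: real accessions, invented entry names and descriptions
new_headers <- c(
  "sp|P00776|ENTRY_A placeholder description",
  "sp|P00777|ENTRY_B placeholder description",
  "sp|P80561|ENTRY_C placeholder description"
)

# Consecutive, zero-padded cRAP identifiers starting at the chosen number
crap_start <- 128L
crap_ids <- sprintf("cRAP%03d",
                    seq(crap_start, length.out = length(new_headers)))

# Insert the cRAP identifier between the database prefix and the accession
db_prefix <- sub("\\|.*$", "", new_headers)
remainder <- sub("^[^|]*\\|", "", new_headers)
renumbered <- paste0(db_prefix, "|", crap_ids, "|", remainder)
renumbered
```

Keeping the cRAP identifier in a fixed position in the header makes it straightforward to recognise contaminant entries later by pattern matching on the protein identifier.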