Dattatreya Mellacheruvu, Zachary Wright, Amber L Couzens, Jean-Philippe Lambert, Nicole A St-Denis, Tuo Li, Yana V Miteva, Simon Hauri, Mihaela E Sardiu, Teck Yew Low, Vincentius A Halim, Richard D Bagshaw, Nina C Hubner, Abdallah Al-Hakim, Annie Bouchard, Denis Faubert, Damian Fermin, Wade H Dunham, Marilyn Goudreault, Zhen-Yuan Lin, Beatriz Gonzalez Badillo, Tony Pawson, Daniel Durocher, Benoit Coulombe, Ruedi Aebersold, Giulio Superti-Furga, Jacques Colinge, Albert J R Heck, Hyungwon Choi, Matthias Gstaiger, Shabaz Mohammed, Ileana M Cristea, Keiryn L Bennett, Mike P Washburn, Brian Raught, Rob M Ewing, Anne-Claude Gingras, Alexey I Nesvizhskii.
Abstract
Affinity purification coupled with mass spectrometry (AP-MS) is a widely used approach for the identification of protein-protein interactions. However, for any given protein of interest, determining which of the identified polypeptides represent bona fide interactors versus those that are background contaminants (for example, proteins that interact with the solid-phase support, affinity reagent or epitope tag) is a challenging task. The standard approach is to identify nonspecific interactions using one or more negative-control purifications, but many small-scale AP-MS studies do not capture a complete, accurate background protein set when available controls are limited. Fortunately, negative controls are largely bait independent. Hence, aggregating negative controls from multiple AP-MS studies can increase coverage and improve the characterization of background associated with a given experimental protocol. Here we present the contaminant repository for affinity purification (the CRAPome) and describe its use for scoring protein-protein interactions. The repository (currently available for Homo sapiens and Saccharomyces cerevisiae) and computational tools are freely accessible at http://www.crapome.org/.
Year: 2013 PMID: 23921808 PMCID: PMC3773500 DOI: 10.1038/nmeth.2557
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Figure 1. The CRAPome at a glance. (a) Creation of the CRAPome. (1) Contributors to the CRAPome submit raw MS files for negative control runs, detailed experimental protocols and mapping information. (2) Raw MS files are first converted to mzXML and analyzed by X!Tandem and the Trans-Proteomic Pipeline; counts are extracted for protein quantification and the CRAPome administrator performs a quality control check (see Methods). (3) Released high quality runs (data) are associated with experimental descriptions and protocols (metadata) by the CRAPome administrator in consultation with the data provider. (4) Query of the CRAPome database by external users via the web interface. (b) Overview of the first CRAPome workflow. (1) Proteins are queried against the CRAPome by inputting one of several identifiers (Supplementary Note) which enable mapping to Gene ID. Different views enable exploration of the contaminant profile of each queried protein, either as a summary table (2) or in graphical formats (3). (c) Overview of the third CRAPome workflow (note that the second workflow is similar, except that no user data is uploaded; the second workflow generates lists of contaminant proteins). (1) Desired controls are selected, with the help of CVs. (2) Users upload their own data (test experiments and controls if available) to the CRAPome and (3) select parameters for data analysis. Data is displayed in a table format and in different graphical formats, which include the detection of a given interaction in the public repository iRefIndex (4).
Figure 2. Composition of the CRAPome (human data). (a) Relationship between the detection of a given protein in the CRAPome and its protein abundance (all entries are mapped to official gene identification numbers and displayed as corresponding gene symbols). The abundance distribution in HEK293 cells was calculated from shotgun mass spectrometry data (see Methods). The left axis indicates the number of proteins identified at each of the spectral count abundances (green circles; green dashed line shows fit to data); the right axis indicates the fraction of the proteins at a given binned abundance in the CRAPome database (blue triangles). (b) Similarity clusters of all experiments. All experiments in the CRAPome were scored for similarity in their contaminant profiles based on a cosine function: the size of the clusters represents the number of the experiments with strong similarity. Selected similarity clusters are indicated, alongside their composition. (c) Cluster ix, described in b as FLAG agarose in HeLa cells, can be further defined as two sub-clusters based on subcellular fractionation performed prior to the affinity purification (cytosolic and nuclear fractions); other clusters can also be further refined. (d) Example of epitope-tag specificity for selected proteins/genes. (e) Spectral count distribution of the proteins shown in d across the entire dataset. Spectral count bins are shown for all non-zero experiments. The highest spectral count boundary for each bin is shown.
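The cosine-based similarity scoring mentioned in panel b can be illustrated with a minimal sketch. It assumes each experiment is represented as a mapping from gene symbol to spectral count; the function name and the example counts are hypothetical, and the caption does not say how the CRAPome itself weights or normalizes profiles before comparison.

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two contaminant profiles.

    Each profile is a dict mapping gene symbol -> spectral count
    (an assumed representation, not taken from the source).
    """
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[g] * profile_b[g] for g in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical spectral counts for three negative-control runs.
flag_hela_1 = {"HSPA8": 30, "TUBB": 12, "ACTB": 20}
flag_hela_2 = {"HSPA8": 28, "TUBB": 10, "ACTB": 25}
other_run = {"KRT1": 15, "KRT10": 9, "HSPA8": 2}

same_protocol = cosine_similarity(flag_hela_1, flag_hela_2)
different_protocol = cosine_similarity(flag_hela_1, other_run)
```

In this toy example the two runs with overlapping contaminants score close to 1, while the run dominated by keratins scores close to 0, which is the kind of separation a clustering step would build on.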
General overview of the frequency of detection across the CRAPome (H. sapiens data). Two gene counts are reported at each frequency threshold: (i) “Redundant” gene counts are based on a generous estimation of shared peptides: in this case, each protein/gene to which a given peptide is matched is counted as a contaminant; (ii) “Reduced” gene counts are based on a more stringent definition of protein/gene parsimony, as described in Methods.
| Frequency in CRAPome | Redundant gene counts | Reduced gene counts |
|---|---|---|
| > 90% | 15 | 14 |
| > 75% | 37 | 30 |
| > 50% | 110 | 89 |
| > 20% | 504 | 463 |
| > 10% | 898 | 878 |
| ≤ 10% | 6884 | 3571 |
| TOTAL | 7782 | 4449 |
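The rows above are cumulative: each "greater than" row includes the genes of the rows above it, and the total equals the "> 10%" row plus the "≤ 10%" row (898 + 6884 = 7782 and 878 + 3571 = 4449). A minimal sketch of how such a table could be tallied follows. It assumes each control run is given as a collection of detected gene symbols; the function name and the toy data are hypothetical and do not reproduce the parsimony rules described in Methods.

```python
from collections import Counter

def frequency_table(control_runs, thresholds=(90, 75, 50, 20, 10)):
    """Count genes detected in more than each percentage of control runs.

    control_runs: list of iterables of gene symbols, one per run (assumed input).
    thresholds: percentages in descending order; the last one also defines
    the closing "at most" row.
    """
    n = len(control_runs)
    counts = Counter()
    for detected in control_runs:
        counts.update(set(detected))  # count each gene once per run

    rows = []
    for pct in thresholds:
        # Integer comparison avoids floating-point edge cases at the cutoff.
        rows.append((f"> {pct}%", sum(1 for c in counts.values() if c * 100 > pct * n)))
    lowest = thresholds[-1]
    rows.append((f"<= {lowest}%", sum(1 for c in counts.values() if c * 100 <= lowest * n)))
    rows.append(("TOTAL", len(counts)))
    return rows

# Hypothetical data: 10 control runs, 4 genes detected at different rates.
runs = [{"HSPA8"} for _ in range(10)]
for i in range(6):
    runs[i].add("TUBB")   # 60% of runs
for i in range(3):
    runs[i].add("KRT1")   # 30% of runs
runs[0].add("RAF1")       # 10% of runs

table = frequency_table(runs)
```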
List of the most frequently detected protein families across the CRAPome, alongside some of the most frequently detected representative genes (H. sapiens data).
| Gene family | Example gene symbols |
|---|---|
| Heat shock proteins | HSPA1A, HSPA8, HSPA2 |
| Keratins | KRT1, KRT10, KRT2 |
| Tubulins | TUBA1B, TUBA3C, TUBB |
| Actins | ACTB, ACTA2, ACTBL2 |
| Elongation factors | EEF1A, EEF1A2 |
| Histones | HIST1H1C, H2AFX, HIST2H2B |
| Ribonucleoproteins | HNRNPK, HNRNPU, HNRNPH1 |
| Ribosomal proteins | RPS3, RPS18, RPL23 |
Figure 3. Scoring functions in the CRAPome illustrated on a four-bait dataset (MEPCE, EIF4A2, WASL, RAF1; 8 experiments). (a) Comparison between the primary Fold Change score (FC-A) and SAINT for scoring known interactions using negative control runs (n = 6) provided by the user; ROC based on the interactions in iRefIndex. Note that when SAINT scores are identical, ties are broken by the FC-A score. Selected SAINT probability or FC-A score thresholds are represented by triangles and circles, respectively. (b) The relationship between SAINT probability and FC score is well represented by a sigmoid function (dashed curve). (c, d) Histogram visualization of the data presented in (b) can help with data exploration and threshold selection. (e, f) Scoring protein interactions using controls from the CRAPome with SAINT (e) and FC-A (f): User controls (n = 6) are compared to two sets of controls from the CRAPome, selected based on the CVs (Set 1 = 10 controls; Set 2 = 11 controls).
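The ranking rule in panel a (SAINT probability first, FC-A as tie-breaker) can be sketched as follows. This is an illustration under stated assumptions: the input tuples, the function name, and the treatment of every prey absent from the reference set as a negative are hypothetical choices, not details confirmed by the caption.

```python
def roc_points(scored, known):
    """ROC points for preys ranked by SAINT probability, ties broken by FC-A.

    scored: list of (prey, saint_probability, fc_a) tuples (assumed layout).
    known: set of preys reported as interactors in a reference set such as
    iRefIndex; everything else is treated as a negative (an assumption).
    Returns a list of (false positive rate, true positive rate) pairs.
    """
    ranked = sorted(scored, key=lambda r: (r[1], r[2]), reverse=True)
    n_pos = sum(1 for r in ranked if r[0] in known)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for prey, _saint, _fc_a in ranked:
        if prey in known:
            tp += 1
        else:
            fp += 1
        fpr = fp / n_neg if n_neg else 0.0
        tpr = tp / n_pos if n_pos else 0.0
        points.append((fpr, tpr))
    return points

# Hypothetical scores: preys A and B tie on SAINT, so FC-A orders them.
scored = [("A", 1.0, 5.0), ("B", 1.0, 12.0), ("C", 0.8, 3.0), ("D", 0.2, 1.5)]
known = {"B", "C"}
curve = roc_points(scored, known)
```

Because B has the higher FC-A score, it is ranked ahead of A despite the identical SAINT probability, so the curve rises before it moves right.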
Figure 4. Use of a more stringent Fold Change score (FC-B) to recover true interacting partners for ORC2L. (a) Schematic illustration of the consequences of averaging all spectral counts as opposed to selecting the top three maximal values for scoring protein-protein interactions. Here, protein X represents a contaminant in the purification scheme that is detected with variable counts across the 15 selected controls (the intensity of shading is proportional to the spectral counts). By contrast, protein Y is a contaminant detected with similar counts across all selected controls. The standard primary Fold Change calculation (FC-A) averages the counts across all controls while the more stringent secondary Fold Change score (FC-B) takes the average of the top 3 highest spectral counts for the abundance estimate. The resulting FC-A and FC-B scores are represented schematically where a larger circle indicates a higher fold change, with FC-A and FC-B assigning a similar score to protein Y, but not to protein X. (b) Comparison of SAINT scoring and stringent FC-B with good bait samples. Note here that only the top of the map (the interactions with SAINT probability ≥ 0.9) are displayed. (c) Same as b for bait samples (ORC2L) contaminated with myosin: the more stringent fold change score FC-B helps in discriminating between true interaction partners (labeled “ORC complex”) and contaminants (labeled “myosins”).
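The difference between the two background estimates in panel a can be made concrete with a small sketch. It implements only what the caption states: FC-A averages the counts across all controls, and FC-B averages the three highest counts. The pseudocount, the function name and the example counts are assumptions added for illustration, and any normalization applied by the CRAPome itself is not reproduced here.

```python
def fold_change(bait_count, control_counts, top_n=None, pseudocount=1.0):
    """Fold change of a prey's bait count over a background estimate.

    top_n=None averages all controls (FC-A-like);
    top_n=3 averages the three highest control counts (FC-B-like).
    The pseudocount is an assumption to avoid division by zero.
    """
    if not control_counts:
        raise ValueError("at least one control count is required")
    if top_n is None:
        used = list(control_counts)
    else:
        used = sorted(control_counts, reverse=True)[:top_n]
    background = sum(used) / len(used)
    return (bait_count + pseudocount) / (background + pseudocount)

# Hypothetical spectral counts across 15 selected controls.
protein_x = [0] * 12 + [20, 25, 30]   # variable contaminant
protein_y = [10] * 15                 # consistent contaminant
bait_count = 30

fc_a_x = fold_change(bait_count, protein_x)            # 31 / 6
fc_b_x = fold_change(bait_count, protein_x, top_n=3)   # 31 / 26
fc_a_y = fold_change(bait_count, protein_y)            # 31 / 11
fc_b_y = fold_change(bait_count, protein_y, top_n=3)   # 31 / 11
```

With these toy numbers the two scores agree for protein Y, while protein X drops from roughly 5.2 under the averaged background to roughly 1.2 under the top-three background, matching the behaviour the schematic describes.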