Smriti R Ramakrishnan, Christine Vogel, John T Prince, Zhihua Li, Luiz O Penalva, Margaret Myers, Edward M Marcotte, Daniel P Miranker, Rong Wang.
Abstract
MOTIVATION: Tandem mass spectrometry (MS/MS) offers fast and reliable characterization of complex protein mixtures, but suffers from low sensitivity in protein identification. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other information available, e.g. the probability of a protein's presence is likely to correlate with its mRNA concentration.
Year: 2009 PMID: 19318424 PMCID: PMC2682515 DOI: 10.1093/bioinformatics/btp168
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Boosting protein identifications with prior information on mRNA concentration. A complex protein sample, e.g. cellular extract, is enzymatically digested into peptides and subjected to MS/MS. Raw MS/MS spectra are searched against a database of sequences using primary protein identification software, e.g. Bioworks (ThermoFinnigan), PeptideProphet (Keller et al., 2002) and ProteinProphet (Nesvizhskii et al., 2003), which produces a list of proteins and scores that signify the probability of correct identification. In a secondary analysis, MSpresso reexamines protein identification scores with respect to their mRNA abundance. MSpresso boosts the protein identification score given sufficient mRNA concentration. Proteins are then labeled ‘present’ if their MSpresso probability is larger than a newly determined cutoff. The MSpresso score, P(K = 1|S, M), estimates the probability of protein presence as the posterior probability of K = 1 given mRNA abundance M and MS protein identification score S. MSpresso uses three probabilities, P(K|S), P(K) and P(K|M), and their estimation is discussed in the text. K, protein presence; S, MS/MS identification score; M, mRNA concentration.
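The caption names the three inputs to the MSpresso score but not the formula that combines them. The sketch below shows one standard way to do so, under the assumption (not stated in this record) that S and M are conditionally independent given K, which gives P(K|S, M) ∝ P(K|S) P(K|M) / P(K). The function name `combine_presence_probabilities` and its parameters are hypothetical; the published model may differ.

```python
def combine_presence_probabilities(p_k_given_s, p_k_given_m, p_k):
    """Estimate P(K=1 | S, M) from P(K=1 | S), P(K=1 | M) and the prior P(K=1).

    Assumption (illustrative, not taken from the source): S and M are
    conditionally independent given K, so
        P(K | S, M) is proportional to P(K | S) * P(K | M) / P(K).
    """
    if not 0.0 < p_k < 1.0:
        raise ValueError("prior P(K=1) must lie strictly between 0 and 1")
    for p in (p_k_given_s, p_k_given_m):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")

    # Unnormalized weight for K = 1 (protein present).
    present = p_k_given_s * p_k_given_m / p_k
    # Unnormalized weight for K = 0 (protein absent).
    absent = (1.0 - p_k_given_s) * (1.0 - p_k_given_m) / (1.0 - p_k)

    total = present + absent
    if total == 0.0:
        raise ValueError("inputs are contradictory (one says certain, one says impossible)")
    return present / total
```

Under this assumption, an mRNA-based probability equal to the prior leaves the MS/MS score unchanged, while an mRNA-based probability above the prior raises it, which matches the "boosting" behaviour described in the caption.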
Fig. 2. Experimental data describe how the probability of protein presence depends on the abundance at which the corresponding mRNA is observed, P(K|M = m). The relationship is modeled by a histogram of the fraction of proteins present in the protein reference set per bin of mRNA concentration, generated from a rank-ordered list of mRNA abundances using 225 proteins per mRNA bin. The protein reference dataset contains four MS-based proteomics datasets (Chi et al., 2007; de Godoy et al., 2006; Peng et al., 2003; Washburn et al., 2001); the mRNA data are an average of three datasets (Holstege et al., 1998; Velculescu et al., 1995; Wang et al., 2002).
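The binning procedure in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `presence_fraction_by_mrna_bin`, the dictionary and set inputs, and the handling of a smaller final bin are all assumptions made for the example.

```python
def presence_fraction_by_mrna_bin(mrna_abundance, reference_present, bin_size=225):
    """Estimate P(K=1 | M) as the fraction of reference proteins per mRNA bin.

    mrna_abundance: dict mapping protein id -> mRNA abundance (hypothetical input).
    reference_present: set of protein ids found in the protein reference set.
    bin_size: proteins per bin; 225 is the value quoted in the caption.

    Returns a list of (min_abundance, max_abundance, fraction_present),
    ordered from the lowest to the highest abundance bin.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")

    # Rank-order proteins by mRNA abundance, lowest first.
    ranked = sorted(mrna_abundance.items(), key=lambda item: item[1])

    bins = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]  # the final bin may be smaller
        n_present = sum(1 for protein_id, _ in chunk if protein_id in reference_present)
        bins.append((chunk[0][1], chunk[-1][1], n_present / len(chunk)))
    return bins
```

Equal-count bins keep the number of proteins behind each estimate constant, so sparse high-abundance ranges do not yield noisier fractions than dense low-abundance ones.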
MSpresso performance in different experiments
| Experiment | Test set | AUC (area under the ROC): MS/MS | AUC: MSpresso | AUC: percentage increase | Proteins identified at 5% FPR: MS/MS | Proteins identified at 5% FPR: MSpresso | Proteins identified at 5% FPR: percentage increase |
|---|---|---|---|---|---|---|---|
| Yeast-YPD-LCQ | Cell lysate, rich medium (YPD), LCQ (five injections) | 0.75 | 0.89 | 19 | 234 | 327 | 40 |
| Yeast-YPD-ORBI | Cell lysate, rich medium (YPD), ORBI (eight injections) | 0.80 | 0.84 | 5 | 428 | 618 | 63 |
| Yeast-YMD-LCQ | Cell lysate, minimal medium (YMD), LCQ (six injections) | 0.73 | 0.84 | 15 | 229 | 278 | 21 |
| Yeast-Fraction-LCQ | Cell lysate, fractionated in polysomal gradient, rich medium (YPD), LCQ (three injections) | 0.72 | 0.77 | 7 | 21 | 34 | 62 |
| | Cell lysate, minimal medium (MOPS9), ORBI (three injections) | 0.69 | 0.80 | 16 | 63 | 87 | 38 |
| Human-LCQ | Cell lysate from Daoy, LCQ (two injections) | 0.71 | 0.75 | 6 | 99 | 121 | 22 |
| Human-ORBI | Cell lysate from Daoy, ORBI (one injection) | 0.79 | 0.81 | 3 | 105 | 125 | 19 |
In each experiment, we generated MSpresso scores for each protein with observed mRNA abundance and MS/MS identification score. The better the MSpresso-based scoring, the higher the ‘Percentage AUC increase’ and ‘Percentage increase in number of identified proteins’. These experiments use the ‘self’ MSpresso model: trained and evaluated on experiment-specific reference data. MSpresso results using the ‘reuse’ model are presented in the Supplementary Material (Table S10).
a Data extrapolated from the ROC curve where there was no data point at 5% FPR.
Fig. 3. Performance of MSpresso in yeast grown in rich medium. We evaluate MSpresso, the original MS/MS identifications (ProteinProphet), and MSpresso with a random P(K|M) model against a ground-truth reference set that determines true and false identifications. (A) ROC plot (TPR versus FPR): MSpresso identifies more true positives at a given FPR than the MS/MS identifications, and has a 19% higher AUC. (B) Precision–recall plot (TPR versus precision): MSpresso increases precision at fixed recall across different score thresholds.
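The two metrics reported here and in the table (AUC and identifications at 5% FPR) can be computed from per-protein scores and reference labels roughly as below. This is a generic sketch in plain Python, not the authors' evaluation code; the function names are hypothetical, labels are assumed to be 1 for proteins in the reference set and 0 otherwise, and the count at 5% FPR is taken to mean true positives, which is an assumption about how the table was built.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) as the score threshold is lowered.

    Proteins with tied scores are added together so that ties do not
    create artificial steps in the curve.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept every protein scoring exactly at this threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_roc(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def true_positives_at_fpr(scores, labels, max_fpr=0.05):
    """Largest number of true positives reachable with FPR <= max_fpr."""
    n_pos = sum(labels)
    best_tpr = max(tpr for fpr, tpr in roc_points(scores, labels) if fpr <= max_fpr)
    return round(best_tpr * n_pos)
```

Running both the original MS/MS scores and the rescored values through the same functions against the same reference labels gives the paired numbers from which a percentage increase can be derived.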