Gelio Alves, Aleksey Y Ogurtsov, Yi-Kuo Yu.
Abstract
BACKGROUND: The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.
Year: 2007 PMID: 17961253 PMCID: PMC2211744 DOI: 10.1186/1745-6150-2-25
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Figure 1. Comparison of score histogram versus theoretical distribution. A randomly picked query spectrum is used to score peptides in NCBI's nr database. For this query spectrum, nine hundred unit-intensity peaks were added to the processed spectrum to match Sus. In panel (A), the red staircase represents the histogram of scores computed using Eq. (1) with w = 1, while the blue line represents the theoretical distribution predicted from peptides with n = 44 theoretical peaks. In panel (B), scores computed using Eq. (1) with w(m) = exp(-Δm) for peptides with different numbers of theoretical peaks are collected, resulting in the overall score histogram represented by the red staircase. The solid curve plots our fit of the histogram using Eq. (17), where the fitting variables are β, γ ≡ n/(6⟨x²⟩β²) and .
Figure 2. Average cumulative number of false positives versus E-values. Theoretically speaking, the average number of false positives with E-values less than or equal to a cutoff E should be E, provided that the number of trials is large enough. The accuracy of E-values assigned by RAId_DbS is tested along with three other methods, X! Tandem(v1.0), Mascot(v2.1) and OMSSA(v2.0). For X! Tandem, Mascot and OMSSA searches, the default parameters of each program are used except the maximum number of miscleavages, which is set to 3 uniformly for this test. The diagonal solid lines in each panel are the theoretical lines. There are two curves associated with each method. The dashed line corresponds to the results using regular nr. The solid line corresponds to the results using nr with cluster removal, which we anticipate to be a better representative of a random database. See text for additional details.
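The calibration property stated in the caption can be illustrated with a small simulation. This is a minimal sketch, not the procedure used in the paper: it assumes that null P-values are independent and uniform on [0, 1] and that E = N·P for a database of N candidates. The function name `mean_false_positives` and all parameter values are hypothetical.

```python
import random

def mean_false_positives(cutoffs, n_candidates=200, n_spectra=2000, seed=0):
    """Average number of null hits per spectrum with E-value <= each cutoff.

    Assumption: under the null, each candidate's P-value is uniform on [0, 1]
    and its E-value is n_candidates * P.
    """
    rng = random.Random(seed)
    totals = [0] * len(cutoffs)
    for _ in range(n_spectra):
        for _ in range(n_candidates):
            e_value = n_candidates * rng.random()
            for i, cutoff in enumerate(cutoffs):
                if e_value <= cutoff:
                    totals[i] += 1
    return [t / n_spectra for t in totals]

cutoffs = [0.1, 1.0, 10.0]
averages = mean_false_positives(cutoffs)
for cutoff, avg in zip(cutoffs, averages):
    print(f"cutoff {cutoff:>5}: average false positives {avg:.3f}")
```

Under these assumptions each average should lie close to its cutoff, which is what the diagonal theoretical line in each panel expresses.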
Example output of RAId_DbS containing multiple significant peptide hits. The contents in the "DEFINITION" and "GI-LIST" columns have been shortened to fit the page. The first two hits correspond to the same peptide MYLGYEYVTAIR, while the third to the fifth hits correspond to the same peptide LGEYGFQNALLVR if we follow the mass spectrometry convention not to distinguish Leucine from Isoleucine. After that, the next peptide has an E-value of 1.5, indicating a false hit. It is worth noticing that there is a clean separation between the significant hits and the rest of the peptide hits.
| E-VALUE | PEPTIDE | MASS | DEFINITION | GI-LIST |
| 4.423375e-05 | KMYLGYEYVTAIR | 1478.720 | ..|ref|NP_001054.1| transferrin [Homo sapiens] | [4557871,94717618,15021381,31415705,...... |
| 4.423375e-05 | RMYLGYEYVTAIR | 1478.720 | ..|emb|CAH91543.1| hypothetical protein [Pongo | [55729628] |
| 1.488740e-04 | KLGEYGFQNAILVR | 1479.780 | ..|ref|NP_033784.1| albumin 1 [Mus musculus] | [33859506,55391508,191765,19353306, ....... |
| 1.488740e-04 | KLGEYGFQNALLVR | 1479.780 | ..|emb|CAA59279.1| albumin precursor [Felis | [886485,57977283,633938,30962111, ...... |
| 1.488740e-04 | KLGEYGFQNALIVR | 1479.780 | ..|gb|AAT98610.1| albumin [Sus scrofa] | [51235682,52353352,15808978,76445989,...... |
| 1.526504e+00 | KTTLALQFLMEGVR | 1478.800 | ..|ref|YP_466151.1| putative circadian | [86159366] |
| 3.710973e+00 | [MFKANMKQLIVR | 1478.820 | ..|dbj|BAD64473.1| cell wall lytic activ | [56909946] |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
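The grouping described in the caption (hits that differ only by Leucine/Isoleucine count as the same peptide) can be sketched as follows. This is an illustration only. It assumes that the first character of each printed sequence is the flanking residue preceding the peptide, which is inferred from the caption rather than stated by it. The E-value cutoff of 1.0 and the names `canonical` and `group_hits` are hypothetical.

```python
from collections import defaultdict

# (E-value, sequence as printed) for the first six rows of the table above.
hits = [
    (4.423375e-05, "KMYLGYEYVTAIR"),
    (4.423375e-05, "RMYLGYEYVTAIR"),
    (1.488740e-04, "KLGEYGFQNAILVR"),
    (1.488740e-04, "KLGEYGFQNALLVR"),
    (1.488740e-04, "KLGEYGFQNALIVR"),
    (1.526504e+00, "KTTLALQFLMEGVR"),
]

def canonical(printed: str) -> str:
    """Drop the first character (assumed flanking residue) and map I to L."""
    return printed[1:].replace("I", "L")

def group_hits(rows, e_cutoff=1.0):
    """Group hits at or below the cutoff by their I/L-collapsed sequence."""
    groups = defaultdict(list)
    for e_value, printed in rows:
        if e_value <= e_cutoff:
            groups[canonical(printed)].append(printed)
    return dict(groups)

groups = group_hits(hits)
for peptide, members in groups.items():
    print(peptide, len(members))
```

With the rows above, the five hits below the cutoff collapse into two distinct peptides, matching the caption's description.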
Theoretical peaks of two peptides MYLGYEYVTAIR and LGEYGFQNALLVR. Both peptides are found to be significant by RAId_DbS for a given query spectrum and were found to be partial sequences of proteins originally put in for the experiment. The right column lists the b ∪ y peaks of both peptides in ascending m/z order. The two sets of theoretical peaks only have two pairs that are within three daltons of each other. They are (175.12, 175.12) and (1019.45, 1017.58). This negligible overlap between theoretical peaks reinforces the possibility of co-elution of the two peptides during the experiment.
| Peptide / Mass | b ∪ y peaks (m/z, ascending) |
| MYLGYEYVTAIR 1478.72 | 132.04, 175.12, 288.2, 295.11, 359.24, 408.2, 460.29, 465.22, 559.36, 628.28, 722.42, 757.32, 851.46, 920.39, 1014.53, 1019.45, 1071.55, 1120.5, 1184.63, 1304.62, 1347.69 |
| LGEYGFQNALLVR 1479.79 | 114.08, 171.11, 175.12, 274.19, 300.16, 387.27, 463.22, 500.36, 520.24, 571.39, 667.31, 685.44, 795.37, 813.38, 909.41, 960.56, 980.45, 1017.58, 1093.53, 1180.65, 1206.62, 1305.68, 1309.69, 1366.71 |
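Theoretical b ∪ y peak lists of this kind can be generated from residue masses. The sketch below assumes singly charged fragments, standard monoisotopic residue masses, and no modifications; these are common conventions rather than settings confirmed by the source, so values may differ from the table in the last digit. The function name `by_peaks` is hypothetical.

```python
# Standard monoisotopic residue masses (Da); I and L are isobaric.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def by_peaks(peptide: str) -> list:
    """Return singly charged b and y fragment m/z values in ascending order."""
    masses = [MONO[aa] for aa in peptide]
    peaks = []
    running = 0.0
    for m in masses[:-1]:            # b ions: N-terminal prefixes
        running += m
        peaks.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y ions: C-terminal suffixes
        running += m
        peaks.append(running + WATER + PROTON)
    return sorted(peaks)

peaks = by_peaks("MYLGYEYVTAIR")
print(", ".join(f"{p:.2f}" for p in peaks))
```

Because I and L share one mass entry, the two spellings of the albumin peptide in the first table would produce identical peak lists.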
Figure 3. Performance analysis of methods tested. Performance analysis of RAId_DbS, X! Tandem(v1.0), Mascot(v2.1), OMSSA(v2.0), and SEQUEST(v3.2). Panels (A) and (C) display the results from 6,734 spectra in profile format, while panels (B) and (D) display the results from 6,592 centroidized spectra obtained from [19]. In panels (A) and (B), typical ROC curves are shown with the number of false positives (FP) plotted along the abscissa and the number of true positives (TP) plotted along the ordinate. Thus, a curve closer to the upper-left corner implies better performance. To reveal the information in the region of small numbers of false positives, usually the region of most interest, we have plotted the abscissa in log scale. In panels (C) and (D), a different type of ROC curve is shown. Defining the cumulative number of true negatives as TN and the cumulative number of false negatives as FN, the ROC curves in panels (C) and (D) plot "1 – specificity" (FP/(FP + TN)) along the abscissa (also in log scale) and the sensitivity (TP/(TP + FN)) along the ordinate. For each method tested, the area under the curve (AUC) of this type of ROC curve, when both axes are plotted in linear scale, is also shown in parentheses in the figure legend. All AUC values have an uncertainty of about ±0.005. Note that ROC curves of this type do not reflect the total number of correct hits, and a method that reports very few negatives may show a lower specificity and superficially seem inferior. For example, X! Tandem may be penalized when evaluated using this type of ROC curve. Also note that in panel (D) the trend of AUC for Mascot, X! Tandem, and SEQUEST is consistent with previously reported results [14]. For X! Tandem, Mascot, OMSSA, and SEQUEST, the default parameters for each method were used in every search, except that the maximum number of miscleavages was set to 3 uniformly.
It is observed that analysis using profile data gives rise to better ROC curves than analysis using centroidized data. Although this may be because the profile data contain more information, it may also be caused by variations in spectral quality and sample concentration.
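The second type of ROC curve, and its linear-scale AUC, follow directly from the definitions in the caption. The sketch below is a generic illustration on made-up toy data, not the evaluation pipeline used for the figure. It assumes that a higher score means a more confident hit (for E-values one could rank by the negative logarithm), that the data contain at least one true and one false hit, and that there are no tied scores. The names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scored):
    """scored: list of (score, is_true). Returns (1 - specificity, sensitivity) points."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    total_pos = sum(1 for _, is_true in ranked if is_true)
    total_neg = len(ranked) - total_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        fn = total_pos - tp          # true hits not yet reported
        tn = total_neg - fp          # false hits not yet reported
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc(points):
    """Trapezoidal area under the curve with both axes in linear scale."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data for illustration only.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False), (0.4, False)]
toy_auc = auc(roc_points(toy))
print(f"AUC = {toy_auc:.4f}")
```

Note that FP + TN and TP + FN are fixed totals here, so a method that reports few negatives changes the denominator of the abscissa, which is the caveat the caption raises.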
An example for computing fitting confidence. A randomly chosen spectrum is used to demonstrate the computation of the fitting confidence in detail. In each of the N = 28 numerical rows, the first entry is the score, the second entry records ln[Dpdf], and the third entry records ln[Mpdf]. Using ln[Dpdf] as the x-coordinate and ln[Mpdf] as the y-coordinate, we perform a least-squares linear regression and find an intercept a = -0.00421 and a slope b = 0.9992. Eq. (18) is then used to compute t1 (t1 = 0.0421), and the goodness number, 1 - A(t1|N - 2), is found to be 0.96674 through (19). To test the strength of correlation between the second column and the third column, we use (20) to compute r and through (21) we find the t2 value to be 0.99567. Given r = 0.99567 and ν = 25, through (22) we find the P-value to be 2.58 × 10⁻²⁷.
| S | ln [Dpdf(S)] | ln [Mpdf(S)] |
| 0.0284661 | 0.479518 | 0.438266 |
| 0.0691319 | 0.431753 | 0.407608 |
| 0.109798 | 0.369235 | 0.351511 |
| 0.150463 | 0.2708 | 0.270076 |
| 0.191129 | 0.163419 | 0.163403 |
| 0.231795 | 0.014358 | 0.031592 |
| 0.272461 | -0.156812 | -0.125259 |
| 0.313127 | -0.340242 | -0.307054 |
| 0.353792 | -0.551264 | -0.513698 |
| 0.394458 | -0.79275 | -0.745095 |
| 0.435124 | -1.04746 | -1.00115 |
| 0.47579 | -1.34063 | -1.28178 |
| 0.516456 | -1.63587 | -1.58688 |
| 0.557121 | -1.96251 | -1.91636 |
| 0.597787 | -2.2322 | -2.27015 |
| 0.638453 | -2.72001 | -2.64814 |
| 0.679119 | -3.00809 | -3.05025 |
| 0.719785 | -3.52319 | -3.4764 |
| 0.76045 | -3.94211 | -3.92649 |
| 0.801116 | -4.31754 | -4.40045 |
| 0.841782 | -4.72005 | -4.89819 |
| 0.882448 | -5.27305 | -5.41962 |
| 0.923114 | -5.73387 | -5.96467 |
| 0.963779 | -7.04955 | -6.53326 |
| 1.00445 | -6.55707 | -7.1253 |
| 1.04511 | -7.368 | -7.74071 |
| 1.08578 | -9.44744 | -8.37942 |
| 1.12644 | -8.75429 | -9.04134 |
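The regression and correlation steps described in the caption can be sketched on the tabulated values. This is a plain unweighted least-squares fit and Pearson correlation. Eqs. (18) to (22) are not shown in this section, so the t statistics, goodness number and P-value are not reproduced, and an unweighted fit need not match the caption's intercept and slope exactly. The function name `linear_fit` is hypothetical.

```python
import math

# Columns 2 and 3 of the table above (28 rows).
ln_dpdf = [0.479518, 0.431753, 0.369235, 0.2708, 0.163419, 0.014358, -0.156812,
           -0.340242, -0.551264, -0.79275, -1.04746, -1.34063, -1.63587, -1.96251,
           -2.2322, -2.72001, -3.00809, -3.52319, -3.94211, -4.31754, -4.72005,
           -5.27305, -5.73387, -7.04955, -6.55707, -7.368, -9.44744, -8.75429]
ln_mpdf = [0.438266, 0.407608, 0.351511, 0.270076, 0.163403, 0.031592, -0.125259,
           -0.307054, -0.513698, -0.745095, -1.00115, -1.28178, -1.58688, -1.91636,
           -2.27015, -2.64814, -3.05025, -3.4764, -3.92649, -4.40045, -4.89819,
           -5.41962, -5.96467, -6.53326, -7.1253, -7.74071, -8.37942, -9.04134]

def linear_fit(x, y):
    """Unweighted least squares y = a + b*x; returns (a, b, Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

a, b, r = linear_fit(ln_dpdf, ln_mpdf)
print(f"intercept a = {a:.5f}, slope b = {b:.4f}, r = {r:.5f}")
```

A slope near 1 and an intercept near 0 indicate that the model pdf tracks the data pdf across the whole score range, which is the sense in which the fit quality is being quantified.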
Figure 4. Quantification of goodness of score model used for statistical significance assignment. A global study of the Mpdf accuracy using 10,000 spectra (profile mode). Panel (A) shows the histogram of the goodness number. Panel (B) shows a scatter plot of ν versus r obtained from our spectra, together with a number of curves, each corresponding to a fixed P-value. Panel (C) displays the histogram of log10(P).