Abstract
BACKGROUND: Mass spectrometry has become a standard method for characterizing the proteomic profile of cell or tissue samples. To take full advantage of tandem mass spectrometry (MS/MS) techniques in large-scale protein characterization studies, robust and consistent data analysis procedures are crucial. In this work we present a machine-learning-based protocol for identifying correct peptide-spectrum matches in SEQUEST database search results, improving on previously published protocols.
Year: 2010 PMID: 21138573 PMCID: PMC3013103 DOI: 10.1186/1471-2105-11-591
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Graphical representation of the alternating decision tree (ADTree) learned from. Prediction nodes are represented by ellipses and splitter nodes by rectangles. Each splitter node is associated with a real-valued number indicating the rule condition: if the feature represented by the node is less than or equal to the condition value, the prediction path goes through the left child node; otherwise it goes through the right child node. The numbers after the feature names in the prediction nodes indicate the order in which the different base rules were discovered; this ordering can to some extent indicate the relative importance of the base rules. A detailed explanation of how to interpret the ADTree is given in the main text, along with a discussion of the colored paths outlined.
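The splitter rule described in the caption can be sketched in a few lines of Python. This is a minimal illustration of the standard ADTree scoring scheme (an instance follows every splitter attached to a prediction node it reaches, and the visited prediction values are summed), not the tree learned in the paper: the class names, the function `adtree_score`, and every threshold and prediction value in `toy_tree` are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Prediction:
    """Prediction node (ellipse): a real value plus any splitters below it."""
    value: float
    splitters: List["Splitter"] = field(default_factory=list)


@dataclass
class Splitter:
    """Splitter node (rectangle): tests one feature against a threshold."""
    feature: str
    threshold: float
    left: Prediction   # followed when feature <= threshold
    right: Prediction  # followed otherwise


def adtree_score(node: Prediction, x: Dict[str, float]) -> float:
    """Sum the prediction values on every path the instance x follows."""
    total = node.value
    for s in node.splitters:
        child = s.left if x[s.feature] <= s.threshold else s.right
        total += adtree_score(child, x)
    return total


# Hypothetical toy tree; all numbers are illustrative, not from the paper.
toy_tree = Prediction(-0.5, [
    Splitter("XCorr", 2.0,
             Prediction(-1.0),
             Prediction(1.0, [
                 Splitter("deltCn", 0.1, Prediction(-0.25), Prediction(0.75)),
             ])),
    Splitter("SP rank", 1.0, Prediction(0.5), Prediction(-0.5)),
])

psm = {"XCorr": 3.1, "deltCn": 0.3, "SP rank": 1.0}
score = adtree_score(toy_tree, psm)
label = "correct" if score > 0 else "incorrect"
print(score, label)
```

In this formulation the sign of the summed score gives the class, and its magnitude can be read as a rough confidence.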
Table 1. Features used in the machine learning formulation
| Group | Name | Meaning | Origin |
|---|---|---|---|
| SEQUEST | XCorr | Rank score from the SEQUEST search | SEQUEST |
| SEQUEST | deltaMH | Difference between the mass of the parent ion and the identified peptide mass | SEQUEST |
| SEQUEST | deltCn | Difference between the XCorr of the highest-ranked peptide and that of the peptide in question | SEQUEST |
| SEQUEST | SP score | Preliminary score of the peptide in the search procedure | SEQUEST |
| SEQUEST | SP rank | Initial rank of the peptide based on the SP score | SEQUEST |
| SEQUEST | Ion fraction | Percentage of ions in the mass spectra that could be correlated with the spectrum | SEQUEST |
| Published | Number of tryptic | Number of tryptic cleavage sites in the peptide targets (NTT) | Calculated |
| Published | Peptide length | Residue count of the peptide | Calculated |
| Published | Summed intensity | Sum of the peak intensities in the spectrum | Calculated |
| Published | Mobile proton factor (MPF) | Measure of the proton mobility in the peptide | Calculated |
| Published | C-terminal residue | Amino acid residue at the C-terminus (Arg = 1, Lys = 2, Other = 3) | Calculated |
| Published | Mass-window peptides | Number of database peptides within a prespecified mass window of the parent ion | Calculated |
| Published | Proline count | Number of Pro residues in the peptide | Calculated |
| Published | Arginine count | Number of Arg residues in the peptide | Calculated |
| Novel | Intensity mean | Mean of the peak intensities | Calculated |
| Novel | Intensity std. | Standard deviation of the peak intensities | Calculated |
| Novel | Intensity bins | Distribution of intensities in 20% bins | Calculated |
| Novel | Protein Hit Count (PHC) | Probability score of observing x number of peptides from the parent protein | Calculated |
| Novel | Potential coverage | The potential sequence coverage | Calculated |
| Novel | PTM percentage | Percentage of possible PTMs found in a peptide | Calculated |
For each feature we give a brief description and indicate whether it was obtained from the output of the SEQUEST algorithm or calculated from the identified peptide, the mass spectrum, or database statistics. The features are divided into three subgroups (SEQUEST, Published, and Novel), denoting features that can be derived directly from the SEQUEST algorithm output, features used in published studies of the identification problem, and features introduced in this work, respectively.
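Several of the calculated features follow directly from the peptide sequence and the list of peak intensities. The sketch below shows one way they could be computed. The function names, the dictionary keys, and the use of the population standard deviation are assumptions made for illustration; the source does not specify how the authors implemented them. Only the C-terminal encoding (Arg = 1, Lys = 2, Other = 3) is taken from the table.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def c_terminal_code(peptide: str) -> int:
    """Encode the C-terminal residue as in the table: Arg=1, Lys=2, other=3."""
    return {"R": 1, "K": 2}.get(peptide[-1], 3)


def simple_features(peptide: str, intensities: Sequence[float]) -> Dict[str, float]:
    """Compute a few sequence- and intensity-based features.

    `peptide` uses one-letter amino acid codes; `intensities` are peak heights.
    """
    return {
        "peptide_length": len(peptide),
        "proline_count": peptide.count("P"),
        "arginine_count": peptide.count("R"),
        "c_terminal_residue": c_terminal_code(peptide),
        "summed_intensity": sum(intensities),
        "intensity_mean": mean(intensities),
        # Assumption: population standard deviation; the source does not say
        # whether the sample or population form was used.
        "intensity_std": pstdev(intensities),
    }


# Toy inputs, not taken from the paper's data.
example = simple_features("PEPTIDER", [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
for name, value in example.items():
    print(f"{name}: {value}")
```

Features such as the mobile proton factor, mass-window peptides, or the Protein Hit Count need additional inputs (charge state, the search database) and are left out of this sketch.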
Validation metrics for a collection of machine learning algorithm runs over test sets containing features from the groups denoted in Table 1
| Feature groups | Algorithm | Accuracy | Sensitivity | Specificity | AUC ROC | |
|---|---|---|---|---|---|---|
| | ABWillow | 0.97505 | 0.56504 | | 0.96379 | 0.77945 |
| | ABC4.5 | 0.97361 | 0.58815 | 0.99269 | 0.94821 | 0.79042 |
| | RFC4.5 | 0.97276 | 0.57212 | 0.99259 | 0.87901 | 0.78235 |
| | ADtree | | | 0.98988 | | |
| | ABWillow | 0.96951 | 0.48762 | 0.99336 | 0.90723 | 0.74050 |
| | ABC4.5 | | 0.57018 | 0.99250 | 0.907084 | 0.78139 |
| | RFC4.5 | 0.97228 | | 0.99122 | | |
| | ADtree | 0.96925 | 0.48762 | | 0.90604 | 0.74032 |
| - | PeptideProphet | 0.9688 | 0.54 | 0.99 | - | 0.765 |
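For reference, accuracy, sensitivity, and specificity follow from the confusion-matrix counts as sketched below. The function name and the toy counts are hypothetical. The second half checks an observation drawn only from the numbers above, not a statement made by the source: in the rows where every cell survives, the unlabeled final column agrees with the mean of sensitivity and specificity to within rounding.

```python
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true-positive rate (recall)
    specificity = tn / (tn + fp)          # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mean_sens_spec": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for illustration only.
toy = confusion_metrics(tp=60, fp=10, tn=890, fn=40)
print(toy)

# (sensitivity, specificity, final column) copied from the complete table rows.
complete_rows = {
    "ABC4.5, first block": (0.58815, 0.99269, 0.79042),
    "RFC4.5, first block": (0.57212, 0.99259, 0.78235),
    "ABWillow, second block": (0.48762, 0.99336, 0.74050),
    "PeptideProphet": (0.54, 0.99, 0.765),
}
deviations = {
    name: abs((sens + spec) / 2 - last)
    for name, (sens, spec, last) in complete_rows.items()
}
for name, dev in deviations.items():
    print(f"{name}: |mean(sens, spec) - final column| = {dev:.6f}")
```

With sensitivities near 0.5 and specificities near 0.99, accuracy is dominated by the much larger negative class, which is why the table reports sensitivity and specificity separately.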
Figure 2. Receiver operating characteristic (ROC) curves (left) and precision/recall curves (PRC) (right). Classifiers trained with the novel set of features have the suffix all; otherwise the suffix S+P is used (this does not apply to the curve for PeptideProphet shown in the ROC plot). The ROC shows how the true-positive rate (TPR) varies with the false-positive rate (FPR), indicating what percentage of true hits one can expect to obtain at a given false-positive rate. The PRC gives an alternative view of the classification, depicting the precision as a function of the recall (note that PeptideProphet results are shown only in the ROC plot).
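Both kinds of curve can be traced by sweeping a decision threshold over the classifier scores. The following is a minimal pure-Python sketch under stated assumptions: the function names are hypothetical, the scores and labels are toy values rather than data from the paper, and the scores are assumed to be distinct (ties would need grouping).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def curve_points(scores: Sequence[float],
                 labels: Sequence[int]) -> Tuple[List[Point], List[Point]]:
    """Return ROC points (FPR, TPR) and PRC points (recall, precision).

    Labels are 1 for a correct match and 0 otherwise. The threshold is lowered
    one instance at a time, from the highest score to the lowest.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    roc: List[Point] = [(0.0, 0.0)]
    prc: List[Point] = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        roc.append((fp / neg, tp / pos))
        prc.append((tp / pos, tp / (tp + fp)))
    return roc, prc


def auc_trapezoid(points: Sequence[Point]) -> float:
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy scores and labels for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
roc, prc = curve_points(scores, labels)
auc = auc_trapezoid(roc)
print("ROC points:", roc)
print("PRC points:", prc)
print("AUC ROC:", auc)
```

On heavily imbalanced data such as peptide-spectrum matches, the PRC tends to separate classifiers more clearly than the ROC, because precision reacts directly to the number of false positives among the predicted hits.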