Honglan Li, Seungjin Na, Kyu-Baek Hwang, Eunok Paek.
Abstract
BACKGROUND: In shotgun proteomics, database search engines assign peptides to tandem mass (MS/MS) spectra, and post-processing (or rescoring) approaches applied to the search results have been proposed to increase the number of confident peptide identifications. The most popular post-processing approaches, such as Percolator and PeptideProphet, improve peptide identification rates by combining multiple scores from database search engines using machine learning techniques. Existing post-processing approaches, however, are limited when dealing with results from new search engines because their machine learning features must be optimized specifically for each search engine.
Keywords: Data-dependent; Machine learning; Mass spectrometry; PSM rescoring; Peptide identification; Tool-independent
Year: 2022 PMID: 35354356 PMCID: PMC8969291 DOI: 10.1186/s12859-022-04640-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features used to represent PSMs in TIDD model
| Index | Name | Description |
|---|---|---|
| 1 | XCorr | cross correlation between theoretical and experimental spectra |
| 2 | delta XCorr | difference in XCorr score between the rank 1 and rank 2 hits (if a rank 2 hit exists) |
| 3 | charge | vector: 1 to 6 (charges above 6 are treated as 6) |
| 4 | pepLen | the length of stripped peptide sequence |
| 5 | tryptic | vector: 0 c-term tryptic; 1 n-term tryptic; 2 fully-tryptic |
| 6 | #missed cleavage | the number of missed cleavages in the peptide sequence |
| 7 | precursorM | observed precursor mass of the spectrum |
| 8 | massDiff | the mass difference between calculated and observed mass |
| 9 | -absolutMassDiff | the absolute value of the difference between calculated and observed mass |
| 10 | calPepM | calculated mass of the matched peptide |
| 11–13 | sum_intensity_all/y/b | logarithm of the summed intensity of all peaks in the spectrum (TIC) / of the matched y ions (or b ions) |
| 14–15 | frac_intensity_y/b | the fraction of sum_intensity_y (or sum_intensity_b) in sum_intensity_all |
| 16–18 | max_intensity_all/y/b | logarithm of the maximum intensity in the spectrum (base peak intensity) / of the matched y ions (or b ions) |
| 19–20 | seq_cover y/b | sequence coverage of y ions (or b ions) |
| 21–22 | num_consecutive_y/b | the number of consecutive y ions (or b ions) |
| 23–24 | mean/sd_fragMassErr | mean (or standard deviation) of the distribution of mass differences between observed fragment ions and theoretical ions |
| 25 | #AnnoPeaks | the number of annotated peaks |
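A few of the simpler features in the table can be sketched in Python. This is a minimal illustration under stated assumptions, not the TIDD implementation: the function names, the bracketed mass-shift notation for modifications, the trypsin rule (cleave after K or R unless followed by P), the sign of massDiff (calculated minus observed), the natural logarithm, and taking the intensity fraction over raw (not log) sums are all assumptions made here for illustration.

```python
import math
import re


def clamp_charge(charge: int) -> int:
    # Feature 3: charges above 6 are treated as 6.
    return min(charge, 6)


def strip_peptide(annotated: str) -> str:
    # Assumption: modifications are written as bracketed mass shifts,
    # e.g. "C[+57.02]", so keeping uppercase letters yields the residues.
    return re.sub(r"[^A-Z]", "", annotated)


def missed_cleavages(stripped: str) -> int:
    # Assumption: trypsin cleaves after K or R unless the next residue is P.
    # The final residue is skipped because it is the peptide's C-terminus.
    return sum(
        1
        for i in range(len(stripped) - 1)
        if stripped[i] in "KR" and stripped[i + 1] != "P"
    )


def safe_log(x: float) -> float:
    # Assumption: natural log; 0.0 is returned when there is no signal.
    return math.log(x) if x > 0 else 0.0


def psm_features(xcorr_rank1, xcorr_rank2, charge, annotated_peptide,
                 observed_mass, calculated_mass, all_peaks, y_peaks, b_peaks):
    """Compute a subset of the tabulated features for one PSM (hypothetical helper)."""
    stripped = strip_peptide(annotated_peptide)
    total = sum(all_peaks)
    mass_diff = calculated_mass - observed_mass  # sign convention is assumed
    return {
        # Left as None without a rank 2 hit; the table does not say how
        # this case is filled in.
        "delta_xcorr": None if xcorr_rank2 is None else xcorr_rank1 - xcorr_rank2,
        "charge": clamp_charge(charge),
        "pepLen": len(stripped),
        "missed_cleavages": missed_cleavages(stripped),
        "massDiff": mass_diff,
        "absMassDiff": abs(mass_diff),
        "sum_intensity_all": safe_log(total),
        "sum_intensity_y": safe_log(sum(y_peaks)),
        "sum_intensity_b": safe_log(sum(b_peaks)),
        # Assumption: fractions are taken over raw sums, not log values.
        "frac_intensity_y": sum(y_peaks) / total if total > 0 else 0.0,
        "frac_intensity_b": sum(b_peaks) / total if total > 0 else 0.0,
    }
```

Features that depend on fragment annotation (sequence coverage, consecutive ions, fragment mass error) need a peak-matching step with a mass tolerance and are left out of this sketch.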
Fig. 1. The target and decoy distributions of TIDD’s top 4 features. The results are based on the A549 dataset searched by a–d Comet; e–h MS-GF+; i–l MSFragger. Solid and dashed lines show the target and decoy distributions, respectively
Fig. 2. Performance comparison in terms of PSM identifications using 11 cell line datasets. Four PSM rescoring methods were applied: Percolator, TIDD (iterative SVM learning with the tool-independent feature set), TIDD with tool-dependent features (iterative SVM learning with TIDD features augmented with tool-dependent scores), and the iterative SVM using X!Tandem-Percolator features (iterative SVM learning with the feature set used by Percolator on X!Tandem data; ‘deltascore’ is omitted because MSFragger does not provide this score). a The number of identified PSMs based on TD when the three tools were applied; b–d the percent increase in the number of PSM identifications compared with the TD results
Fig. 3. Performance comparison on PSM identification for HEK data. Four PSM rescoring methods were applied: Percolator, TIDD (iterative SVM learning with the tool-independent feature set), TIDD with tool-dependent features (iterative SVM learning with TIDD features augmented with tool-dependent scores), and the iterative SVM using X!Tandem-Percolator features (iterative SVM learning with the feature set used by Percolator on X!Tandem data; ‘deltascore’ is omitted because MSFragger does not provide this score). a The number of identified PSMs based on TD when the three tools were applied; b the percent increase in the number of PSM identifications compared with the TD results
The number of identifications in modification searches
| Data | TD | Percolator or FDR_by_MODplus | TIDD |
|---|---|---|---|
| HeLa (phospho modification) | 152,186 | 167,925 (+10.34%, Percolator) | 172,742 (+13.51%) |
| HEK293 (946 variable modifications) | 605,103 | 653,660 (+8.02%, FDR_by_MODplus) | 667,034 (+10.23%) |
FDR_by_MODplus denotes the FDR approach provided by MODplus. Percentages in parentheses give the increase relative to TD.
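The percentages in the table follow from the identification counts as a relative increase over the TD baseline. The short check below reproduces them. The function and variable names are chosen here for illustration, and the counts are copied from the table.

```python
def percent_increase(baseline: int, rescored: int) -> float:
    # Relative gain over the TD baseline, expressed in percent.
    return (rescored - baseline) / baseline * 100.0


# (dataset, method) -> (TD count, rescored count), copied from the table.
counts = {
    ("HeLa", "Percolator"): (152_186, 167_925),
    ("HeLa", "TIDD"): (152_186, 172_742),
    ("HEK293", "FDR_by_MODplus"): (605_103, 653_660),
    ("HEK293", "TIDD"): (605_103, 667_034),
}

for (dataset, method), (td, rescored) in counts.items():
    print(f"{dataset} {method}: +{percent_increase(td, rescored):.2f}%")
```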
Fig. 4. Graphical user interface of TIDD. a Example of a TIDD input file. b Graphical user interface of TIDD