
Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases.

Rovshan G Sadygov, Hongbin Liu, John R Yates.

Abstract

The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. We recently presented a probabilistic model for peptide identification that uses the hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. To this end, we formulate the protein identification problem in terms of two independent models, a two-hypothesis binomial model and a multinomial model, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from the relative length of the protein in the database (the null hypothesis) or from the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (the number of database matches to its peptides) given the size of the data set (the number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities for each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from the results of database searches using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models relate to cases in which protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.
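
The two models described in the abstract can be illustrated with a minimal Python sketch. This is not the PROT_PROBE implementation, and every function name and number below is hypothetical. The abstract does not say how the alternative-hypothesis probability is derived from the hypergeometric peptide scores, so p_alt is simply taken as an input here. The bin probabilities follow one plausible reading of the abstract (a tail fraction per score bin, renormalized to sum to one), and the marginalized multinomial model for small scores is not covered.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n Bernoulli trials."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)


def log_likelihood_ratio(k, n, p_null, p_alt):
    """Log ratio of the binomial probabilities under the two hypotheses.

    k      : number of database matches to the protein's peptides
    n      : number of spectra in the data set
    p_null : relative length of the protein in the database (null hypothesis)
    p_alt  : per-event probability under the alternative hypothesis
             (assumed to be supplied by the caller)
    """
    return log_binom_pmf(k, n, p_alt) - log_binom_pmf(k, n, p_null)


def normalized_bin_probabilities(bin_counts):
    """Turn an observed score histogram into normalized bin probabilities.

    bin_counts is ordered from the lowest to the highest score bin. Each bin
    first gets a tail fraction (share of scores in that bin or a higher one);
    the tail fractions are then rescaled so that they sum to one.
    This reading of the abstract is an assumption.
    """
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("histogram is empty")
    tails = []
    running = 0
    for count in reversed(bin_counts):
        running += count
        tails.append(running / total)
    tails.reverse()
    norm = sum(tails)
    return [t / norm for t in tails]


def log_multinomial_pmf(counts, probs):
    """Log multinomial probability of observing the given per-bin counts."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    log_p = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        log_p -= math.lgamma(c + 1)
        if c > 0:
            log_p += c * math.log(p)
    return log_p


if __name__ == "__main__":
    # Hypothetical protein: 5 peptide matches among 1000 spectra.
    print(log_likelihood_ratio(k=5, n=1000, p_null=0.001, p_alt=0.005))

    # Hypothetical score histogram (low to high bins) for the whole search,
    # and the per-bin counts of one protein's peptide scores.
    probs = normalized_bin_probabilities([600, 300, 100])
    print(log_multinomial_pmf([0, 1, 2], probs))
```

A positive log ratio favors the alternative hypothesis. Working in log space with math.lgamma avoids overflow in the binomial and multinomial coefficients when the number of spectra is large.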

Year:  2004        PMID: 15018565     DOI: 10.1021/ac035112y

Source DB:  PubMed          Journal:  Anal Chem        ISSN: 0003-2700            Impact factor:   6.986


  35 in total

1.  Generic comparison of protein inference engines.

Authors:  Manfred Claassen; Lukas Reiter; Michael O Hengartner; Joachim M Buhmann; Ruedi Aebersold
Journal:  Mol Cell Proteomics       Date:  2011-11-04       Impact factor: 5.911

2.  Software Analysis of Uncorrelated MS1 Peaks for Discovery of Post-Translational Modifications.

Authors:  Bruce D Pascal; Graham M West; Catherina Scharager-Tapia; Ricardo Flefil; Tina Moroni; Pablo Martinez-Acedo; Patrick R Griffin; Anthony C Carvalloza
Journal:  J Am Soc Mass Spectrom       Date:  2015-08-12       Impact factor: 3.109

3.  Informatics strategies for large-scale novel cross-linking analysis.

Authors:  Gordon A Anderson; Nikola Tolic; Xiaoting Tang; Chunxiang Zheng; James E Bruce
Journal:  J Proteome Res       Date:  2007-08-03       Impact factor: 4.466

4.  Mass spectrometry-based strategies for characterization of histones and their post-translational modifications. (Review)

Authors:  Xiaodan Su; Chen Ren; Michael A Freitas
Journal:  Expert Rev Proteomics       Date:  2007-04       Impact factor: 3.940

5.  Monte Carlo simulation-based algorithms for analysis of shotgun proteomic data.

Authors:  Hua Xu; Michael A Freitas
Journal:  J Proteome Res       Date:  2008-06-11       Impact factor: 4.466

6.  MassMatrix: a database search program for rapid characterization of proteins and peptides from tandem mass spectrometry data.

Authors:  Hua Xu; Michael A Freitas
Journal:  Proteomics       Date:  2009-03       Impact factor: 3.984

7.  Motif-specific sampling of phosphoproteomes.

Authors:  Cristian I Ruse; Daniel B McClatchy; Bingwen Lu; Daniel Cociorva; Akira Motoyama; Sung Kyu Park; John R Yates
Journal:  J Proteome Res       Date:  2008-05       Impact factor: 4.466

8.  Simplified validation of borderline hits of database searches.

Authors:  Henrik Thomas; Andrej Shevchenko
Journal:  Proteomics       Date:  2008-10       Impact factor: 3.984

9.  A Multivariate Mixture Model to Estimate the Accuracy of Glycosaminoglycan Identifications Made by Tandem Mass Spectrometry (MS/MS) and Database Search.

Authors:  Yulun Chiu; Paul Schliekelman; Ron Orlando; Joshua S Sharp
Journal:  Mol Cell Proteomics       Date:  2016-12-09       Impact factor: 5.911

10.  Pivotal role of computers and software in mass spectrometry - SEQUEST and 20 years of tandem MS database searching.

Authors:  John R Yates
Journal:  J Am Soc Mass Spectrom       Date:  2015-08-19       Impact factor: 3.109
