Yong Fuga Li, Predrag Radivojac.
Abstract
Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high-throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem, and a number of methods have been proposed to address its challenges. In this review we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programming and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area.
Year: 2012 PMID: 23176300 PMCID: PMC3489551 DOI: 10.1186/1471-2105-13-S16-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
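As an illustration of step (2) of the task described in the abstract, the following minimal sketch treats protein inference as a combinatorial optimization problem: find a small set of proteins that explains every identified peptide. It uses the standard greedy approximation to minimum set cover, not an exact integer program, and it is not taken from the paper. The function name `greedy_parsimony` and the toy protein and peptide labels are hypothetical.

```python
from typing import Dict, List, Set


def greedy_parsimony(protein_to_peptides: Dict[str, Set[str]],
                     identified: Set[str]) -> List[str]:
    """Greedy approximation to a minimal protein set covering the identified peptides.

    Hypothetical helper for illustration; an exact solution would require
    solving the corresponding integer program.
    """
    uncovered = set(identified)
    # Keep only the peptides of each protein that were actually identified.
    candidates = {prot: peps & identified
                  for prot, peps in protein_to_peptides.items()}
    chosen: List[str] = []
    while uncovered and candidates:
        # Pick the protein explaining the most still-uncovered peptides.
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(candidates),
                   key=lambda prot: len(candidates[prot] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # the remaining peptides map to no protein in the database
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy example: peptide p3 is shared by A and B, p4 by B and C.
toy_db = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}, "C": {"p4"}, "D": {"p2"}}
selected = greedy_parsimony(toy_db, {"p1", "p2", "p3", "p4"})
```

The greedy choice does not guarantee the smallest possible protein set, and it says nothing about confidence; quantifying confidence is what probabilistic inference methods add.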
Summary of notation and abbreviations used throughout this paper.
| Notation | Description |
|---|---|
| | Set of all fragmentation spectra outputted by mass spectrometer |
| | Set of spectra identified for peptide |
| | A single fragmentation spectrum, |
| | Protein |
| | Peptide |
| | Peptide |
| | Protein database, a set of proteins used for peptide and protein identification |
| | Peptide database, the set of all (tryptic) peptides derived from |
| | Set of peptides derived from protein |
| | Indicator variable, set to 1 if peptide is |
| | Set of peptides that are confidently identified |
| | Indicator variable, set to 1 if |
| | Indicator variable, set to 1 if |
| | Indicator vector representing all peptides in |
| | Indicator vector representing all proteins in |
| | Set of peptides mapped to protein |
| | Set of proteins that contain peptide |
| | Indicator vector representing peptides in |
| | Peptide identification probability, the probability that peptide |
| | The probability of the PSM matching to be correct when peptide |
| | Protein posterior probabilities, the probability that protein |
| | Detectability of peptide |
| | Detectability of peptide |
| | Detectability of peptide |
| | The estimated number of (identified) sibling peptides of peptide |
| PSM | Peptide-spectrum match; when it is clear from the context, we use PSM to also refer to the top-scoring PSM per spectrum |
| FDR | False discovery rate; the fraction of incorrect peptide identifications in |
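
Several entries in the notation table (the tryptic peptide database, the set of peptides derived from each protein, the set of proteins containing each peptide, and the FDR of a confidently identified peptide set) can be made concrete with a short sketch. The code below is illustrative only and rests on stated assumptions: digestion follows the common trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages, and the FDR is estimated as the mean of one minus the peptide identification probability over the accepted peptides, which is one common probability-based estimator rather than necessarily the one used in the paper. All function names and sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tryptic_digest(sequence: str, min_len: int = 1) -> List[str]:
    """In silico digestion: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_len]


def build_maps(protein_db: Dict[str, str]
               ) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (protein -> its peptides, peptide -> proteins containing it)."""
    protein_to_peptides = {name: set(tryptic_digest(seq))
                           for name, seq in protein_db.items()}
    peptide_to_proteins: Dict[str, Set[str]] = defaultdict(set)
    for name, peptides in protein_to_peptides.items():
        for pep in peptides:
            peptide_to_proteins[pep].add(name)
    return protein_to_peptides, dict(peptide_to_proteins)


def estimated_fdr(peptide_prob: Dict[str, float],
                  threshold: float) -> Tuple[Set[str], float]:
    """Accept peptides with probability >= threshold; estimate the FDR of that set."""
    accepted = {pep for pep, p in peptide_prob.items() if p >= threshold}
    if not accepted:
        return accepted, 0.0
    fdr = sum(1.0 - peptide_prob[pep] for pep in accepted) / len(accepted)
    return accepted, fdr


# Made-up sequences: both proteins yield the shared peptide GGRPLLK.
toy_db = {"P1": "MAKGGRPLLKDE", "P2": "GGRPLLKTTR"}
prot_to_pep, pep_to_prot = build_maps(toy_db)
accepted, fdr = estimated_fdr({"a": 0.99, "b": 0.9, "c": 0.5}, threshold=0.9)
```

Shared peptides such as `GGRPLLK` in this toy database are what make protein inference ambiguous: an identification of that peptide alone cannot distinguish `P1` from `P2`.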
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
A comparison between different probabilistic protein inference algorithms.
| Methods | ProteinProphet | MSBayesPro | Fido | MIPGEM |
|---|---|---|---|---|
| Underlying graph structure | Bipartite graph with identified peptides and matching proteins¹ | Bayesian network with all peptides from proteins with at least one identified peptide | Bayesian network with identified peptides and matching proteins | k-partite graph with identified peptides, matching proteins and (optionally) matching gene models² |
| Inference algorithm | EM-like (expectation maximization) | 1) Exact³; | 1) Exact³; | 1) Exact³; |
| Input | Probabilities for peptides with user-defined cutoff for | Likelihood ratios for peptides with | Likelihood ratios for peptides | Probabilities for peptides with user-defined cutoff for |
| Output | 1) Protein probabilities; | 1) MAP solution, protein abundances and probabilities; | 1) Protein probabilities; | 1) Protein probabilities; |
| Protein prior estimation | No protein priors | Direct frequency estimation based on protein posterior probabilities in one run of MSBayesPro | Grid search optimizing cross- | Grid search optimizing model likelihood through multiple runs of MIPGEM with different priors |
| Peptide probability adjustment by | NSP from a parent protein | Protein quantity adjusted peptide detectability | Two detectability-like parameters | Treating peptide identifications as random variables |
| Protein grouping | Yes | No (indistinguishable proteins are resolved) | Yes | No (indistinguishable proteins are not resolved) |
| Peptide charge | Considered | Ignored | Considered | Considered |
| Novel aspects | 1) First probabilistic protein inference algorithm; | 1) A Bayesian network; | 1) Using a noise model to remedy inaccurate peptide probabilities; | Gene model probabilities⁴ |
| Availability | http://tools.proteomecenter.org | http://darwin.informatics.indiana.edu/yonli/ | http://noble.gs.washington.edu/proj/fido | - |
1. For ProteinProphet, the underlying bipartite graph does not correspond to a Bayesian network, although it guides the EM-like algorithm through inference.
2. MIPGEM uses a rule-based protein removal scheme to simplify the network structure.
3. Exact computation is used only for small connected components.
4. Gene-centric proteomics was proposed in [77], and implemented earlier in a deterministic way in [67].
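
To give a feel for what an "EM-like" inference over a peptide-protein bipartite graph can look like, the sketch below alternates between two updates: a protein's probability is one minus the probability that all of its weighted peptide identifications are wrong, and each shared peptide is then re-apportioned among its parent proteins in proportion to their current probabilities. This is a deliberately simplified illustration loosely in the spirit of the ProteinProphet column of the table; it is not that tool's implementation, and it omits the NSP adjustment, protein grouping and peptide charge handling listed there. The function name, the fixed iteration count and the toy inputs are assumptions.

```python
from typing import Dict, Set


def protein_probabilities(peptide_prob: Dict[str, float],
                          peptide_to_proteins: Dict[str, Set[str]],
                          n_iter: int = 50) -> Dict[str, float]:
    """Simplified EM-like protein inference on a bipartite graph (illustrative)."""
    proteins = sorted({q for parents in peptide_to_proteins.values()
                       for q in parents})
    # Start by splitting each peptide evenly among its parent proteins.
    weight = {(pep, q): 1.0 / len(parents)
              for pep, parents in peptide_to_proteins.items()
              for q in parents}
    prob = {q: 0.0 for q in proteins}
    for _ in range(n_iter):
        # Update protein probabilities from weighted peptide probabilities.
        for q in proteins:
            all_wrong = 1.0
            for pep, parents in peptide_to_proteins.items():
                if q in parents:
                    all_wrong *= 1.0 - weight[(pep, q)] * peptide_prob.get(pep, 0.0)
            prob[q] = 1.0 - all_wrong
        # Re-apportion each peptide according to current protein probabilities.
        for pep, parents in peptide_to_proteins.items():
            total = sum(prob[q] for q in parents)
            if total > 0:
                for q in parents:
                    weight[(pep, q)] = prob[q] / total
    return prob


# Toy graph: peptide "b" is shared by X and Y; X also has a unique peptide "a".
toy_probs = {"a": 0.9, "b": 0.8, "c": 0.9}
toy_graph = {"a": {"X"}, "b": {"X", "Y"}, "c": {"Z"}}
result = protein_probabilities(toy_probs, toy_graph)
```

In this toy graph the shared peptide is progressively assigned to `X`, which has independent evidence, so the probability of `Y` shrinks toward zero. That behavior resembles a parsimony preference, and it is one reason methods in the table differ in how they treat shared peptides and indistinguishable proteins.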