Abstract
Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. Traditional search engines, which match peptide sequences with tandem mass spectra to identify the samples' proteins, use protein sequence databases to suggest peptide candidates for consideration. Although the acquisition of tandem mass spectra is not biased toward well-understood protein isoforms, this computational strategy is failing to identify peptides from alternative splicing and coding SNP protein isoforms despite the acquisition of good-quality tandem mass spectra. We propose, instead, that expressed sequence tags (ESTs) be searched. Ordinarily, such a strategy would be computationally infeasible due to the size of EST sequence databases; however, we show that a sophisticated sequence database compression strategy, applied to human ESTs, reduces the sequence database size approximately 35-fold. Once compressed, our EST sequence database is comparable in size to other commonly used protein sequence databases, making routine EST searching feasible. We demonstrate that our EST sequence database enables the discovery of novel peptides in a variety of public data sets.
Year: 2007 PMID: 17437027 PMCID: PMC1865584 DOI: 10.1038/msb4100142
Source DB: PubMed Journal: Mol Syst Biol ISSN: 1744-4292 Impact factor: 11.429
Novel peptides found in LC/MS/MS data sets from public data repositories
| Data set | Peptide | Type | E value | ESTs | mRNA | SPV | IPI | Straddle intron? | Gene |
|---|---|---|---|---|---|---|---|---|---|
| Raftflow | LQGSATAAEAQVGHQTAR | Novel splice | <10⁻⁸ | 11 | Y | N | N | Y | LIME1 |
| Raftaug | TAGSPLCLPTPGAAPGSAGSCSHR | Novel frame | ∼10⁻⁴ | 34 | Y | N | N | N | LIME1 |
| Raftflow | LQTASDESYKDPTNIQLSK | Micro-exon | <10⁻⁶ | 10 | N | Y | N | Y | SPTAN1 |
| A8 IP | HEQASNVLSDISEFR | Novel start | <10⁻⁹ | 86 | Y | N | N | Y | THOC2 |
| PPP 29 | KADDTWEPFASGK | Novel mutation | <10⁻⁷ | 2 | N | N | N | Y | TTR |
| PPP 40 | DTEEEDFHVDQATTVK | Known cSNP | <10⁻⁹ | 54 | N | Y | N | N | SERPINA1 |
| PPP 40 | DTEEEDFHVDQVTTVK | Wild type | <10⁻⁹ | 337 | Y | N | Y | N | SERPINA1 |
| PPP 28 | LQHLVNELTHDIITK | Known cSNP | <10⁻⁹ | 4 | N | Y | N | N | SERPINA1 |
| PPP 28 | LQHLENELTHDIITK | Wild type | <10⁻⁶ | 351 | Y | N | Y | N | SERPINA1 |
Raftflow, Raftaug: human lipid raft T-cell study from PeptideAtlas (von Haller et al, 2003).
A8 IP: human erythroleukemia K562 cell line study from PeptideAtlas (Resing et al, 2004).
PPP: HUPO Plasma Proteome Project data set from the numbered laboratory (Omenn et al, 2005).
E values computed by X!Tandem.
SPV: Swiss-Prot variant annotation.
Figure 1. (A) MS/MS spectrum of the novel peptide LQGSATAAEAQVGHQTAR, found in PeptideAtlas data set ‘raftflow’, and (B) UCSC Genome Browser (http://genome.ucsc.edu/) screenshot of the genomic region.
Figure 2. (A) Basic structure of the minimum cost network flow instance, showing supply in the nodes of S, demand in the nodes of T, and shortcut edges; (B) graph widget to select the Eulerian path start and end nodes; (C) the dense bipartite subgraph to account for restart edges; and (D) the graph widget that replaces it.
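
The Figure 2 caption refers to an Eulerian path, and the abstract to sequence database compression. As a rough illustration of how the two can fit together, the sketch below assumes (this record does not say so) a de Bruijn-style graph in which every distinct k-mer of the input sequences is an edge between its (k-1)-mer prefix and suffix. An Eulerian path through such a graph spells one string that contains each k-mer exactly once. The function names, the toy sequences, and k = 4 are all hypothetical, and the sketch only handles the easy case in which a single Eulerian path exists; it does not model the network flow instance, shortcut edges, or restart edges named in the caption.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def eulerian_path(edges):
    """Hierholzer's algorithm on a directed graph given as (u, v) pairs.

    Raises ValueError if no single path uses every edge exactly once.
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    if len(path) != len(edges) + 1:
        raise ValueError("graph has no single Eulerian path")
    return path[::-1]


def compress(sequences, k):
    """Spell one string containing every distinct k-mer of the input once."""
    distinct = sorted(set().union(*(kmers(s, k) for s in sequences)))
    # Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    edges = [(km[:-1], km[1:]) for km in distinct]
    path = eulerian_path(edges)
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical, heavily overlapping toy sequences (not real ESTs).
toy_sequences = ["ACDEFGHIK", "EFGHIKLMN", "DEFGHIKLM"]
compressed = compress(toy_sequences, 4)
print(compressed, len(compressed), sum(len(s) for s in toy_sequences))
```

In this toy case the three overlapping sequences, 27 residues in total, collapse to a single 12-residue string that still contains every 4-mer of the input, so any candidate window of that length remains searchable. Real EST collections would not reduce to one path, which is why the sketch raises an error rather than guessing how to proceed.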