Brandi L Cantarel, Alison R Erickson, Nathan C VerBerkmoes, Brian K Erickson, Patricia A Carey, Chongle Pan, Manesh Shah, Emmanuel F Mongodin, Janet K Jansson, Claire M Fraser-Liggett, Robert L Hettich.
Abstract
Accurate protein identification in large-scale proteomics experiments relies upon a detailed, accurate protein catalogue, which is derived from predictions of open reading frames based on genome sequence data. Integration of mass spectrometry-based proteomics data with computational proteome predictions from environmental metagenomic sequences has been challenging because of the variable overlap between proteomic datasets and corresponding short-read nucleotide sequence data. In this study, we have benchmarked several strategies for increasing microbial peptide spectral matching in metaproteomic datasets using protein predictions generated from matched metagenomic sequences from the same human fecal samples. Additionally, we investigated the impact of mass spectrometry-based filters (high mass accuracy, delta correlation), and de novo peptide sequencing on the number and robustness of peptide-spectrum assignments in these complex datasets. In summary, we find that high mass accuracy peptide measurements searched against non-assembled reads from DNA sequencing of the same samples significantly increased identifiable proteins without sacrificing accuracy.
Year: 2011 PMID: 22132090 PMCID: PMC3223167 DOI: 10.1371/journal.pone.0027173
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Creation of protein sequence databases.
Protein sequence databases were created from metagenomic sequence reads using a variety of methods for assembly and gene finding.
Performance and comparison of the metagenomic predicted protein sequence databases.
| Metagenomic Predicted Protein Sequence Database | | Celera Assembler, FastX, Metagene | Newbler, Metagene | Newbler, Metagene + Kurokawa/Gill | Raw Reads, Metagene | Raw Reads, FastX, Metagene | Raw Reads, FastX, Metagene + Kurokawa/Gill | Raw Reads, Metagene Paired Search |
| Database Acronym | | CAFM | NM | NM_KG | RM | RFM | RFM_KG | RMPS |
| Number of Sequences (thousand) | | 1,844 | 190 | 540 | 1,903 | 1,520 | 1,907 | 2,146 |
| Number of Amino Acids (million bp) | | 200 | 45 | 115 | 189 | 173 | 262 | 191 |
| Compute Time Per Run (minutes) | | 670 | 80 | 320 | 750 | 1,060 | 1,030 | 435 |
| Number of Non-redundant Spectra | 6a Run 2 | 5,179 | 6,235 | 10,441 | 9,100 | 9,074 | 10,975 | 13,806 |
| | 6a Run 3 | 4,326 | 5,376 | 9,272 | 8,152 | 8,538 | 10,330 | 18,401 |
| | 6b Run 1 | 4,092 | 5,615 | 10,830 | 8,639 | 8,480 | 11,254 | 12,363 |
| | 6b Run 2 | 3,873 | 5,800 | 10,724 | 8,775 | 8,573 | 11,167 | 12,212 |
| | Total Spectra | | | | | | | |
| Total number of PSMs within ±10 ppm | | | | | | | | |
| Number of Non-redundant Peptides | 6a Run 2 | 4,383 | 3,093 | 5,678 | 4,710 | 4,669 | 5,911 | 7,592 |
| | 6a Run 3 | 3,655 | 2,403 | 4,617 | 3,804 | 3,963 | 5,068 | 6,303 |
| | 6b Run 1 | 3,404 | 2,426 | 5,409 | 3,919 | 3,879 | 5,549 | 5,923 |
| | 6b Run 2 | 3,216 | 2,297 | 5,088 | 3,747 | 3,690 | 5,238 | 5,605 |
| | Total Peptides | | | | | | | |
| | Total NR Peptides | | | | | | | |
The database composition and SEQUEST/DTASelect search results (compute time, identified non-redundant spectra and peptides), obtained with a 2-peptide filter and a deltCN filter of 0.08, are shown for samples 6a (Runs 2 and 3) and 6b (Runs 1 and 2).
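The filters named in this legend and in the table (a deltCN cutoff of 0.08, a 2-peptide requirement per protein, and a ±10 ppm precursor mass window) can be illustrated with a minimal Python sketch. This is not the DTASelect implementation: the `PSM` record, its field names, and the function names are hypothetical, and the order in which the filters are applied is an assumption made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """One peptide-spectrum match. All field names are hypothetical."""
    spectrum_id: str
    peptide: str
    protein: str
    observed_mass: float     # measured precursor mass (Da)
    theoretical_mass: float  # calculated peptide mass (Da)
    delt_cn: float           # SEQUEST delta correlation score


def ppm_error(observed: float, theoretical: float) -> float:
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def filter_psms(psms, max_ppm=10.0, min_delt_cn=0.08, min_peptides=2):
    """Apply score, mass accuracy, and peptide-count filters (assumed order)."""
    # Step 1: keep PSMs that pass the per-spectrum deltCN and ppm filters.
    kept = [
        p for p in psms
        if p.delt_cn >= min_delt_cn
        and abs(ppm_error(p.observed_mass, p.theoretical_mass)) <= max_ppm
    ]
    # Step 2: collect the distinct peptides supporting each protein.
    peptides = defaultdict(set)
    for p in kept:
        peptides[p.protein].add(p.peptide)
    # Step 3: keep only PSMs whose protein has enough distinct peptides.
    return [p for p in kept if len(peptides[p.protein]) >= min_peptides]
```

Calling `filter_psms(psms, min_peptides=1)` relaxes the protein-level requirement to a single peptide while leaving the score and mass filters unchanged.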
Figure 2. Comparison of identified peptides using sequence similarity techniques.
Percentage of matches found when comparing identified peptides from sample 6a (left panel) or 6b (right panel) to predicted proteins using FASTS (gray bars) and raw sequencing reads using TFASTS (white striped bars).
Comparison of RFM and RMPS database results with different filtering metrics and a post-database mapping strategy.
| Protein Database | RFM | | RMPS | |
| 6a Run 2 | 3,246 | 1,154 | 6,542 | 1,761 |
| 6a Run 3 | 3,091 | 1,010 | 6,237 | 1,544 |
| 6b Run 1 | 2,639 | 637 | 5,212 | 973 |
| 6b Run 2 | 2,552 | 630 | 4,870 | 955 |
| Total | 11,528 | 3,431 | 22,861 | 5,233 |
| | ≥2 peptide | | ≥1 peptide | |
| 6a Run 2 | 3,541 | 1,252 | 7,497 | 2,069 |
| 6a Run 3 | 3,346 | 1,088 | 7,048 | 1,808 |
| 6b Run 1 | 2,879 | 686 | 5,881 | 1,182 |
| 6b Run 2 | 2,786 | 680 | 5,502 | 1,127 |
| Total | 12,552 | 3,706 | 25,928 | 6,186 |
Comparison of SEQUEST/DTASelect database search results (non-redundant spectra and protein counts) obtained with different filtering parameters and with HM, the post-database mapping of identified peptides to a protein dataset generated from assembled reads for the same metagenomic sample.
Figure 3. Performance and comparison of de novo peptide sequencing results.
Distribution of assigned spectra per de novo algorithm with a predicted consensus sequence (partial and/or exact sequence match) among all three algorithms: PEAKS, PepNovo+, and SEQUEST. Identified peptides from SEQUEST and the RMPS sequence database were compared to the de novo predicted peptides for (A) 6a Run 2, (B) 6a Run 3, (C) 6b Run 1, and (D) 6b Run 2.
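The caption does not state how a partial or exact sequence match was scored, so the following Python sketch shows only one plausible way to classify agreement among three peptide calls for the same spectrum. Every name here is hypothetical, the partial-match rule (a shared run of at least `min_partial` residues) is an assumption, and isoleucine and leucine are treated as equivalent because they have identical masses.

```python
from itertools import combinations


def normalize(peptide: str) -> str:
    """Uppercase and merge I/L, which mass spectrometry cannot tell apart."""
    return peptide.upper().replace("I", "L")


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of residues shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def classify_match(a: str, b: str, min_partial: int = 3) -> str:
    """Return 'exact', 'partial', or 'none' for two peptide sequences."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return "exact"
    if longest_common_substring(a, b) >= min_partial:
        return "partial"
    return "none"


def consensus(calls: dict, min_partial: int = 3) -> str:
    """Classify agreement across all pairs of calls (algorithm -> sequence)."""
    labels = [
        classify_match(x, y, min_partial)
        for x, y in combinations(calls.values(), 2)
    ]
    if all(label == "exact" for label in labels):
        return "exact"
    if all(label != "none" for label in labels):
        return "partial"
    return "none"
```

Under this rule, a spectrum counts toward the consensus only when every pair of calls agrees at least partially. A looser majority-vote rule would be an equally reasonable assumption.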