Ari M Frank, Matthew E Monroe, Anuj R Shah, Jeremy J Carver, Nuno Bandeira, Ronald J Moore, Gordon A Anderson, Richard D Smith, Pavel A Pevzner.
Abstract
Tandem mass spectrometry (MS/MS) experiments yield multiple, nearly identical spectra of the same peptide in various laboratories, but proteomics researchers typically do not leverage the unidentified spectra produced in other labs to decode the spectra they generate. We propose a spectral archives approach that clusters MS/MS datasets, representing similar spectra by a single consensus spectrum. Spectral archives extend spectral libraries by analyzing identified and unidentified spectra in the same way and by maintaining information about peptide spectra that are common across species and conditions. Archives thus offer both the traditional similarity-based search capabilities of spectral libraries and new ways to analyze the data. By developing a clustering tool, MS-Cluster, we generated a spectral archive from ∼1.18 billion spectra that greatly exceeds the size of existing spectral repositories. We advocate that publicly available data should be organized into spectral archives rather than analyzed as disparate datasets, as is mostly the case today.
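
The idea of grouping similar spectra and representing each group by one consensus spectrum can be illustrated with a small sketch. This is not the MS-Cluster algorithm: the greedy single-pass strategy, the cosine similarity on binned peaks, the 1 Da bin width, the 0.7 threshold, and all function names below are assumptions chosen for illustration. A real tool would also compare precursor masses and filter low-quality spectra first.

```python
import math
from collections import defaultdict

BIN_WIDTH = 1.0       # assumed m/z bin width in Da (illustrative)
SIM_THRESHOLD = 0.7   # assumed similarity cutoff (illustrative)


def unit(vec):
    """Scale a sparse {bin: value} vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def binned(peaks):
    """Turn a list of (m/z, intensity) peaks into a unit-length binned vector."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz // BIN_WIDTH)] += intensity
    return unit(vec)


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def cluster_spectra(spectra, threshold=SIM_THRESHOLD):
    """Greedy clustering: each spectrum joins the most similar cluster
    whose consensus is at least `threshold` similar, else starts a new one."""
    clusters = []
    for idx, peaks in enumerate(spectra):
        vec = binned(peaks)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["consensus"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [idx], "total": dict(vec), "consensus": vec})
        else:
            best["members"].append(idx)
            for k, v in vec.items():
                best["total"][k] = best["total"].get(k, 0.0) + v
            # Consensus spectrum: normalized sum of the member vectors.
            best["consensus"] = unit(best["total"])
    return clusters
```

With this layout, a cluster whose `members` list has length one corresponds to a singleton, and a downstream search only needs to score one `consensus` vector per cluster rather than every member spectrum.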
Year: 2011 PMID: 21572408 PMCID: PMC3128193 DOI: 10.1038/nmeth.1609
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Figure 1. Clustering of the PNNL dataset. 581 million spectra from the PNNL dataset that passed quality filtration were assigned into 299 million clusters of different sizes. (a) While most spectra form multi-clusters (i.e., clusters containing at least two spectra), most clusters consist of a single spectrum. (b) A breakdown of the clusters according to the number of organisms whose spectra participated in each cluster for each of 21.5 million multi-clusters.
Search results for an archive of 14.5 million spectra of Shewanella oneidensis MR-1. For each fraction of the full dataset, the table compares the identifications (number of proteins, peptides, and spectra/cluster annotations) made by a regular database search and by searching the clusters in the archive. The searches were done against a database of S. oneidensis protein sequences with a false discovery rate of 2%.
| Dataset fraction | Regular database search: num. spectra searched | Regular database search: num. protein ids | Regular database search: num. peptide ids | Regular database search: num. spectra annotated | Archive clusters search: num. clusters searched | Archive clusters search: num. protein ids | Archive clusters search: num. peptide ids | Archive clusters search: num. clusters annotated | Archive clusters search: num. spectra annotated |
|---|---|---|---|---|---|---|---|---|---|
| 1/5 | 2.9M | 2,257 | 28,083 | 0.5M | 0.61M | 2,304 | 29,948 | 0.18M | 0.75M |
| 2/5 | 5.8M | 2,435 | 33,866 | 0.95M | 1.06M | 2,471 | 35,648 | 0.28M | 1.52M |
| 3/5 | 8.7M | 2,518 | 37,093 | 1.42M | 1.49M | 2,566 | 39,355 | 0.37M | 2.34M |
| 4/5 | 11.6M | 2,561 | 39,205 | 1.84M | 1.89M | 2,611 | 41,418 | 0.44M | 3.09M |
| 5/5 | 14.5M | 2,608 | 40,680 | 2.28M | 2.29M | 2,660 | 43,415 | 0.51M | 3.96M |
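
The trend in the table is easier to read as two derived ratios per row: the relative gain in peptide identifications from searching the archive, and how many spectra each searched cluster stands in for. The snippet below is only arithmetic on the values printed above; the names `ROWS` and `summarize` are introduced here for illustration.

```python
# (fraction, spectra searched, peptide ids from regular search,
#  clusters searched, peptide ids from archive search), copied from the table
ROWS = [
    ("1/5", 2.9e6, 28083, 0.61e6, 29948),
    ("2/5", 5.8e6, 33866, 1.06e6, 35648),
    ("3/5", 8.7e6, 37093, 1.49e6, 39355),
    ("4/5", 11.6e6, 39205, 1.89e6, 41418),
    ("5/5", 14.5e6, 40680, 2.29e6, 43415),
]


def summarize(rows):
    """Return (fraction, peptide gain in %, spectra per searched cluster)."""
    out = []
    for fraction, spectra, pep_regular, clusters, pep_archive in rows:
        peptide_gain_pct = 100.0 * (pep_archive - pep_regular) / pep_regular
        spectra_per_cluster = spectra / clusters
        out.append((fraction, round(peptide_gain_pct, 1), round(spectra_per_cluster, 1)))
    return out


for row in summarize(ROWS):
    print(row)
```

For the full dataset this gives about 6.7% more peptide identifications while searching roughly 6.3 times fewer items than the regular search.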
Clustering HEK293 and Plasma datasets. The table shows the number of peptide and protein identifications made in each of the individual datasets, and the identifications made with the combined archive: the number of ids that were common to both datasets and the number that were added to each dataset because its spectra were joined into clusters with identified spectra from the other dataset.
| | Num. peptides | Num. proteins |
|---|---|---|
| HEK 293 | 61,380 | 8,066 |
| Plasma | 18,473 | 1,498 |
| Common to both | 1,003 | 600 |
| Ids added to HEK using archive | 954 | 114 |
| Ids added to Plasma using archive | 584 | 207 |
Figure 2. Identification of peptides across different species. The diagram compares the number of peptide identifications made with the Shewanella oneidensis (Sone) data using three methods: no clustering (standard MS/MS search), single-species clustering followed by MS/MS search, and multi-species clustering followed by MS/MS search. Results were processed to maintain a 2% FDR.