Steven Lewis, Attila Csordas, Sarah Killcoyne, Henning Hermjakob, Michael R Hoopmann, Robert L Moritz, Eric W Deutsch, John Boyle.
Abstract
BACKGROUND: In shotgun mass spectrometry-based proteomics, the most computationally expensive step is matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Solutions that improve our ability to perform these searches are therefore needed.
Year: 2012 PMID: 23216909 PMCID: PMC3538679 DOI: 10.1186/1471-2105-13-324
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. MapReduce jobs to generate a list of peptides to score at a specified m/z ratio. The first mapper generates all possible sequences and modified sequences defined in the search parameters for a given FASTA database. The reducer eliminates duplicates, records all source proteins, and emits each peptide with its m/z as the key. The next set of reducers collects all peptides to be scored against a given m/z and stores them in the database.
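The peptide-generation jobs described in this caption can be sketched in plain Python, without Hadoop. This is a minimal illustration and not Hydra's implementation: the function names, the tryptic cleavage rule (cleave after K or R unless followed by P, no missed cleavages), the singly charged m/z, and the integer m/z key are all assumptions made for the example, and modifications are left out.

```python
from collections import defaultdict

# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def digest(sequence):
    """Assumed rule: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def mz_key(peptide):
    """Singly charged m/z, rounded to an integer key (an assumption)."""
    mass = sum(MONO[residue] for residue in peptide) + WATER
    return int(round(mass + PROTON))


def map_protein(protein_id, sequence):
    """Mapper: emit (peptide, protein_id) for every generated peptide."""
    for peptide in digest(sequence):
        yield peptide, protein_id


def build_peptide_index(proteins):
    """Run the mapper, then the two reduce steps, in memory."""
    # Shuffle: group mapper output by peptide sequence.
    by_peptide = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide, source in map_protein(protein_id, sequence):
            by_peptide[peptide].add(source)
    # First reduce: one record per unique peptide, keyed by m/z.
    # Second reduce: collect all peptides that share an m/z key.
    by_mz = defaultdict(list)
    for peptide in sorted(by_peptide):
        by_mz[mz_key(peptide)].append((peptide, sorted(by_peptide[peptide])))
    return dict(by_mz)


if __name__ == "__main__":
    index = build_peptide_index({"P1": "AKGR", "P2": "AKMK"})
    for key in sorted(index):
        print(key, index[key])
```

Grouping by peptide before grouping by m/z mirrors the caption: duplicates are removed once, and each unique peptide carries its full list of source proteins into the m/z-keyed store.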
Figure 2. MapReduce jobs to score measured spectra. Spectra are scored against the contents of the peptide database over a series of m/z values. In the next job, all scores are combined to find the best scores. Because a single file is the desired output, the last job has a single reducer, so that all output goes to one file.
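The three scoring jobs in this caption can be sketched as ordinary Python functions. This is an illustration under stated assumptions rather than Hydra's code: spectra and theoretical peptide spectra are represented as hypothetical dictionaries mapping a fragment bin to an intensity, the peptide database is a dictionary keyed by an integer m/z, and the score is a plain dot product over shared bins.

```python
def dot_product(measured, theoretical):
    """Score one spectrum against one peptide over their shared bins."""
    return sum(
        intensity * theoretical.get(bin_id, 0.0)
        for bin_id, intensity in measured.items()
    )


def score_job(spectra, peptide_db):
    """Job 1: emit (spectrum_id, (score, peptide)) for each candidate
    stored under the spectrum's precursor m/z key."""
    for spectrum_id, (precursor_key, peaks) in spectra.items():
        for peptide, theoretical in peptide_db.get(precursor_key, []):
            yield spectrum_id, (dot_product(peaks, theoretical), peptide)


def best_score_job(scored):
    """Job 2: combine all scores, keeping the best hit per spectrum."""
    best = {}
    for spectrum_id, hit in scored:
        if spectrum_id not in best or hit > best[spectrum_id]:
            best[spectrum_id] = hit
    return best


def write_job(best):
    """Job 3: a single reducer that produces one ordered output."""
    return [
        f"{spectrum_id}\t{peptide}\t{score:.1f}"
        for spectrum_id, (score, peptide) in sorted(best.items())
    ]


if __name__ == "__main__":
    # Hypothetical toy data: two candidate peptides at m/z key 218.
    peptide_db = {
        218: [("AK", {72: 1.0, 147: 1.0}), ("KA", {129: 1.0, 90: 1.0})],
    }
    spectra = {"s1": (218, {72: 10.0, 147: 5.0, 90: 1.0})}
    for line in write_job(best_score_job(score_job(spectra, peptide_db))):
        print(line)
```

Only the last step is inherently serial, which matches the caption's point that a single reducer is used solely to produce a single output file.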
Figure 3. Search times as a function of job complexity. Complexity is measured in dot products, where one dot product is the score of one spectrum against one peptide. Complexity depends on the number of spectra, the size of the protein database, and the modifications and cleavages searched. The measured spectra files used for benchmarking our implementation were selected from the public experiments in the PRIDE (Proteomics Identifications Database) proteomics repository. The PRIDE accession numbers of the three experiments used for Figure 3 are 7962, 15459, and 10295. The PRIDE XML files containing the spectra were downloaded from the PRIDE website and opened in PRIDE Inspector [19]. The MGF export functionality of PRIDE Inspector was used to generate the MGF files used in the searches; only human tissue samples or cell lines were used to generate the mass spectra.
Comparison of search times for standard X!Tandem and Hydra
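| Implementation | Scans | Nodes (cores) | Database | Database proteins | | Dot products | Time |
|---|---|---|---|---|---|---|---|
| Hadoop | 16000 | 43 (344) | ecoli | 5.4 | 1.3 | 164 | 9.8 |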
| Hadoop | 256000 | 43 (344) | ecoli | 5.4 | 1.3 | 23395 | 338 |
| Tandem | 4663 | 1 (4) | human | 222 | 168 | 477 | 29 |
| Hadoop | 4663 | 43 (344) | human | 222 | 168 | 477 | 4.7 |
| Tandem | 184880 | 1 (4) | nr | 4370 | 692 | 3291 | 2280 |
| Hadoop | 184880 | 43 (344) | nr | 4370 | 692 | 3291 | 15.4 |
| Tandem | 184880 | 1 (4) | nr | 16392 | 1248 | 13167 | 8410 |
| Hadoop | 184880 | 43 (344) | nr | 16392 | 1248 | 13167 | 52.7 |
Example comparison of run times for searches of different complexity using the standard X!Tandem implementation and Hydra (the rows labeled Hadoop). The Scans column gives the number of spectra searched; the Nodes column gives the resources used (the first number is the number of machines, the second is the total number of cores); the database name is the species database used; Database Proteins is the number of proteins in the database; and the dot product is the number of calculations actually performed. The times show that Hydra, unlike X!Tandem, is able to scale nearly linearly with the size of the problem. However, because of its startup costs, Hydra is not suited to small searches. The PRIDE accession numbers for the spectra used were 10295 and 7962.
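The paired rows in the table allow a rough speedup estimate. The sketch below assumes that the last column of each row is the run time, in the same unit for both implementations, and that the column before it is the dot-product count; the table itself does not state units. Only rows with both a Tandem and a Hadoop entry for the same search are used.

```python
# (database, dot products, Tandem time, Hadoop time), copied from the table.
PAIRED_RUNS = [
    ("human", 477, 29, 4.7),
    ("nr", 3291, 2280, 15.4),
    ("nr", 13167, 8410, 52.7),
]


def speedups(runs):
    """Return (database, dot products, Tandem time / Hadoop time)."""
    return [
        (database, dots, tandem_time / hadoop_time)
        for database, dots, tandem_time, hadoop_time in runs
    ]


if __name__ == "__main__":
    for database, dots, ratio in speedups(PAIRED_RUNS):
        print(f"{database:>5}  dot products={dots:>6}  speedup={ratio:.1f}x")
```

The ratio is much smaller for the small human search than for the two nr searches, which is consistent with the caption's remark that startup costs dominate small jobs.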
Figure 4. Database build time as a function of the number of peptides cataloged. The figure plots the time taken to build a tryptic database against the number of peptides. The data are for a tryptic database with limited modifications. Build times are higher for semitryptic builds or with more modifications. Build times for tryptic digests range from a few minutes, largely representing setup time, to under an hour for the largest databases with over a million proteins.