Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search.

Debojyoti Dutta, Ting Chen.

Abstract

MOTIVATION: Recent advances in mass spectrometry technology have led to an exponential increase in the amount of data generated over the past few years. Database searches have not been able to keep up with this data explosion, so speeding up these searches is increasingly important in mass-spectrometry-based applications. Traditional database search methods compare a query spectrum one-against-all with a very large number of peptides generated by in silico digestion of the protein sequences in a database. This comparison filters potential candidates from the database, and the filtered candidates are then scored and ranked in detail.
RESULTS: In this article, we show that the one-against-all comparisons can be avoided. The basic idea is to design a set of hash functions that pre-process the peptides in the database, so that for each query spectrum the hash functions retrieve only a small subset of peptide sequences that are most likely to match the spectrum. Each hash function is constructed from a random spectrum, and the hash value of a peptide is the normalized shared peak count score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide as a unit vector in a high-dimensional metric space. Each random spectrum is represented by a random vector, and we use these random vectors to construct a set of locality sensitive hashing (LSH) functions for preprocessing. We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or achieve a 111-fold speedup by filtering out 99.64% of the spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be used effectively for other mass spectra mining applications, such as finding clusters of spectra efficiently and accurately.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
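
The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple binned embedding of peak lists into sparse unit vectors, and it swaps in sign-of-dot-product random-hyperplane hashing, a standard LSH family for cosine similarity, because the abstract does not give the exact hash construction. All names (`embed`, `cosine`, `HyperplaneLSH`) and parameters (`bin_width`, `n_bits`, `n_tables`) are hypothetical.

```python
import math
import random
from collections import defaultdict


def embed(peaks, bin_width=1.0):
    """Map peak m/z values to a sparse unit vector {bin index: weight}.

    Assumption: each occupied bin gets equal weight, so the dot product of
    two embeddings equals the normalized shared peak count (cosine).
    """
    bins = {int(mz / bin_width) for mz in peaks}
    norm = math.sqrt(len(bins))
    return {b: 1.0 / norm for b in bins}


def cosine(u, v):
    """Cosine similarity of two sparse unit vectors."""
    return sum(w * v[b] for b, w in u.items() if b in v)


class HyperplaneLSH:
    """Random-hyperplane LSH index (illustrative stand-in, see lead-in)."""

    def __init__(self, n_bins, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        # One set of n_bits random Gaussian vectors per hash table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(n_bins)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    @staticmethod
    def _key(planes, vec):
        # One bit per random vector: which side of the hyperplane vec lies on.
        return tuple(
            sum(r[b] * w for b, w in vec.items()) >= 0.0 for r in planes
        )

    def add(self, item_id, vec):
        """Pre-process one database entry into every hash table."""
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vec)].append(item_id)

    def candidates(self, vec):
        """Return ids sharing a bucket with vec in at least one table."""
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._key(planes, vec), ()))
        return found


# Toy database of hypothetical peptide spectra (m/z values are made up).
database = {
    "pepA": [100.1, 200.2, 300.3, 400.4],
    "pepB": [150.5, 250.6, 350.7, 450.8],
}
index = HyperplaneLSH(n_bins=2000)
for name, peaks in database.items():
    index.add(name, embed(peaks))

# Only the returned candidates would go on to detailed scoring.
query = embed([100.3, 200.4, 300.1, 400.9])
print(sorted(index.candidates(query)))
```

Using several tables with a few bits each trades filtering strength against the chance of missing a true match: more bits per table shrink the buckets, while more tables give a similar vector more chances to collide with the query.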

Year: 2007; PMID: 17237061; DOI: 10.1093/bioinformatics/btl645

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

