Wout Bittremieux (1,2), Kris Laukens (2), William Stafford Noble (3), Pieter C Dorrestein (1).
Affiliations: (1) Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States. (2) Department of Computer Science, University of Antwerp, Antwerp, Belgium. (3) Department of Genome Sciences, University of Washington, Seattle, Washington, United States.
Abstract
RATIONALE: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. In this study, we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra.
METHODS: falcon efficiently clusters large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters.
RESULTS: Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome data set consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger, highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing.
CONCLUSIONS: falcon is a highly efficient spectrum clustering tool, which is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.
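The pipeline outlined under METHODS can be illustrated with a minimal, standard-library-only Python sketch. It is not taken from falcon's code: all function names, the use of `zlib.crc32` as the hash function, and the parameter values (bin width, vector dimension, precursor tolerance, distance threshold) are assumptions chosen for illustration. In particular, a plain scan over spectra sorted by precursor m/z stands in for the nearest neighbor index, so this toy version does not have the efficiency the abstract describes.

```python
import math
import zlib
from collections import defaultdict


def hash_spectrum(mz, intensity, bin_width=0.05, dim=400):
    """Bin peaks by m/z and fold the bins into a unit-length sparse vector."""
    vec = defaultdict(float)
    for m, i in zip(mz, intensity):
        bin_idx = int(m / bin_width)
        # Assumption: crc32 is only a stand-in for the hash function.
        slot = zlib.crc32(bin_idx.to_bytes(8, "little")) % dim
        vec[slot] += i
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else {}


def cosine_distance(u, v):
    """Cosine distance between two unit-length sparse vectors (dicts)."""
    if len(u) > len(v):
        u, v = v, u
    return 1.0 - sum(val * v.get(k, 0.0) for k, val in u.items())


def sparse_neighbors(precursor_mz, vectors, tol_ppm=20.0, eps=0.1):
    """Neighbor sets restricted to the precursor tolerance and distance <= eps.

    Assumption: a scan over the sorted precursor window replaces the
    nearest neighbor index described in the abstract.
    """
    order = sorted(range(len(vectors)), key=lambda i: precursor_mz[i])
    neighbors = [set() for _ in vectors]
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            ppm = (precursor_mz[j] - precursor_mz[i]) / precursor_mz[i] * 1e6
            if ppm > tol_ppm:
                break  # sorted by precursor m/z, so later spectra are farther
            if cosine_distance(vectors[i], vectors[j]) <= eps:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def dbscan(neighbors, min_samples=2):
    """Density-based clustering on precomputed neighbor sets; -1 marks noise."""
    labels = [None] * len(neighbors)
    cluster = -1
    for p in range(len(neighbors)):
        if labels[p] is not None:
            continue
        if len(neighbors[p]) + 1 < min_samples:  # +1 counts the point itself
            labels[p] = -1
            continue
        cluster += 1
        labels[p] = cluster
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neighbors[q]) + 1 >= min_samples:
                queue.extend(neighbors[q])  # q is a core point: keep expanding
    return labels


# Toy data: two near-identical spectra and one with a distant precursor m/z.
precursors = [500.0, 500.001, 620.0]
spectra = [
    ([100.02, 200.02, 300.02], [1.0, 2.0, 3.0]),
    ([100.03, 200.03, 300.03], [1.1, 2.0, 2.9]),
    ([150.02, 250.02, 350.02], [3.0, 2.0, 1.0]),
]
vectors = [hash_spectrum(mz, inten) for mz, inten in spectra]
labels = dbscan(sparse_neighbors(precursors, vectors))
print(labels)  # [0, 0, -1]
```

The first two spectra fall into the same m/z bins and lie within the precursor tolerance, so they form one cluster. The third is never compared because its precursor m/z lies outside the tolerance window, and it is labeled as noise.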