
Crux: rapid open source protein tandem mass spectrometry analysis.

Sean McIlwain, Kaipo Tamura, Attila Kertesz-Farkas, Charles E Grant, Benjamin Diament, Barbara Frewen, J Jeffry Howbert, Michael R Hoopmann, Lukas Käll, Jimmy K Eng, Michael J MacCoss, William Stafford Noble.

Abstract

Efficiently and accurately analyzing big protein tandem mass spectrometry data sets requires robust software that incorporates state-of-the-art computational, machine learning, and statistical methods. The Crux mass spectrometry analysis software toolkit (http://cruxtoolkit.sourceforge.net) is an open source project that aims to provide users with a cross-platform suite of analysis tools for interpreting protein mass spectrometry data.

Year:  2014        PMID: 25182276      PMCID: PMC4184452          DOI: 10.1021/pr500741y

Source DB:  PubMed          Journal:  J Proteome Res        ISSN: 1535-3893            Impact factor:   4.466


Modern mass spectrometers produce massive amounts of data. For example, a Thermo Fusion mass spectrometer produces >24 GB of compressed data per day. Keeping pace with such a machine requires balancing three competing needs. Analysis software must be robust, ensuring that the program executes successfully and that the results are valid; efficient, keeping pace with the rapid rate of data acquisition; and state-of-the-art, gleaning as much information as possible from the data by bringing to bear the latest algorithms and statistical methods. To simultaneously address these three needs, we created an open source software toolkit called Crux (http://cruxtoolkit.sourceforge.net, Figure 1) that is capable of efficiently and accurately analyzing a variety of types of shotgun proteomics data. Originally, Crux consisted of a single search engine.[1] In Crux v2.0, the original search engine has been replaced by two search engines, Comet[2] and Tide,[3] both of which implement SEQUEST-style searching.[4] In addition, a specialized search engine provides the capability to identify cross-linked peptides.[5] The Bullseye preprocessor assigns high-resolution masses to fragmentation spectra,[6] and the Percolator[7] and Barista[8] postprocessors use machine learning techniques to identify and assign statistical confidence estimates to spectra, peptides, and proteins. Peptide and protein quantification can be carried out using a spectral counting tool.[9]
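To make the spectral counting step concrete, here is a minimal Python sketch of one widely used spectral-count measure, the normalized spectral abundance factor (NSAF). The text above does not say which measures the Crux tool computes, so the choice of NSAF, the function name `nsaf`, and the protein names and numbers in the usage lines are illustrative assumptions, not a description of Crux's implementation.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor for each protein.

    spectral_counts: dict mapping protein name -> number of identified spectra
    lengths:         dict mapping protein name -> protein length in residues
    Returns a dict mapping protein name -> NSAF; values sum to 1 whenever
    at least one protein has a nonzero count.
    """
    # Length-normalized counts: longer proteins yield more peptides, so each
    # count is divided by the protein's length.
    rates = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(rates.values())
    if total == 0:
        return {p: 0.0 for p in rates}
    # Divide by the sum over all proteins so values are comparable across runs.
    return {p: r / total for p, r in rates.items()}


# Hypothetical input: two proteins with equal counts but different lengths.
example = nsaf({"protA": 10, "protB": 10}, {"protA": 100, "protB": 200})
```

In this toy input both proteins have the same spectral count, but `protA` is half as long, so it receives twice the relative abundance of `protB`.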
Figure 1

Crux analysis workflow and sample results. Crux provides tools for identifying spectra derived from single peptides or from cross-linked peptides as well as tools for postprocessing the resulting identifications to yield peptide- and protein-level identifications.

Robust parsing of diverse file formats is an ongoing challenge in computational proteomics. Accordingly, we have adopted the open source ProteoWizard library,[10] which enables Crux to parse a wide variety of file formats (Table 1). In particular, ProteoWizard allows the parsing of vendor-specific raw files when Crux runs under Windows. Furthermore, support for various open file formats allows interoperability between Crux and other search engines as well as toolkits such as the Trans-Proteomic Pipeline[11] and MSDaPl[12] that provide summarization and visualization functionality.
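Table 1

File Formats in Crux

| command | MS1 (a) | MS2 | various (a, b) | FASTA | Tide index | TSV | pepXML | PIN | mzIdentML | SQT | Barista XML |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bullseye | in | in/out | in | | | | | | | | |
| Tide index | | | | in | out | | | | | | |
| Tide search | | in | in | | in | out | out | out | out | out | |
| Comet | | in | in | in | | out | out | out | out | out | |
| Percolator | | | | | | in/out | in/out | in | out | in | in/out |
| Barista | | in | in | | | in/out | out | | | in | out |
| spectral counts | | in | in | | | in/out | in | | in | in | |

(a) Additional vendor proprietary formats for MS1 and MS2 data are supported on Windows: Agilent MassHunter .d, Waters RAW, Thermo RAW, Applied Biosciences Wiff, and Bruker Compass .d/YEP/BAF/FID.

(b) Supported open MS2 file formats include BMS2, CMS2, MGF, mzML, and mzXML.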

A variety of other mass spectrometry analysis toolkits have been produced, including commercial products (Scaffold, LabKey Server, Mascot tools) and academic software (pFind Studio,[13] Bumbershoot,[14] the Trans-Proteomic Pipeline,[11] MaxQuant,[15] OpenMS,[16] the Global Proteome Machine,[17] and the Central Proteomics Facilities Pipeline[18]). Each of these toolkits offers distinct features (Table 2). Crux offers extensive confidence estimates, including false discovery rate and posterior error probability estimates at the spectrum, peptide, and protein levels and has recently added functionality (to the Tide search engine) to compute exact p values using a dynamic programming approach.[19]
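The idea of computing an exact p value by dynamic programming can be illustrated with a toy model. The sketch below is not the published Tide algorithm: it assumes integer residue masses, an integer-valued `evidence` vector indexed by prefix mass, and a score defined as the sum of evidence at a candidate's proper prefix masses. All names (`score_counts`, `exact_p_value`) and the tiny alphabet in the usage lines are hypothetical. Under those assumptions, the program counts every residue sequence of a given total mass by score without enumerating the sequences, and the p value is the fraction scoring at least as well as the observed match.

```python
from collections import defaultdict


def score_counts(evidence, residue_masses, total_mass):
    """Count residue sequences of integer mass `total_mass`, grouped by score.

    A sequence's score is the sum of evidence[m] over its proper prefix
    masses m (0 < m < total_mass). `evidence` needs at least `total_mass`
    entries, and residue masses must be positive integers.
    """
    if any(r <= 0 for r in residue_masses):
        raise ValueError("residue masses must be positive integers")
    # counts[m][s] = number of prefixes with mass m and accumulated score s
    counts = [defaultdict(int) for _ in range(total_mass + 1)]
    counts[0][0] = 1  # the empty prefix: mass 0, score 0
    for mass in range(total_mass):
        for residue in residue_masses:
            nxt = mass + residue
            if nxt > total_mass:
                continue
            # The full-length sequence is not a proper prefix, so it adds 0.
            gain = evidence[nxt] if nxt < total_mass else 0
            for score, n in counts[mass].items():
                counts[nxt][score + gain] += n
    return dict(counts[total_mass])


def exact_p_value(observed, evidence, residue_masses, total_mass):
    """Fraction of all candidate sequences scoring >= `observed`."""
    dist = score_counts(evidence, residue_masses, total_mass)
    total = sum(dist.values())
    if total == 0:
        raise ValueError("no sequence has the requested total mass")
    return sum(n for s, n in dist.items() if s >= observed) / total


# Toy alphabet with residue masses 1 and 2; candidates of total mass 3.
dist = score_counts([0, 1, 2, 0], [1, 2], 3)
p = exact_p_value(2, [0, 1, 2, 0], [1, 2], 3)
```

With this toy input there are three candidate sequences (1+1+1, 1+2, and 2+1) with scores 3, 1, and 2, so a match with score 2 gets a p value of 2/3. The table is filled in order of increasing mass, so the work grows with the number of (mass, score) states rather than with the number of candidate sequences.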
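Table 2

Comparison of Mass Spectrometry Analysis Toolkits (a)

| feature | TPP | MaxQuant | OpenMS | GPM | CPFP | Scaffold | LabKey Server | pFind Studio | Bumbershoot | Mascot tools | Crux |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tools | | | | | | | | | | | |
| high-res mass assignment | × | × | × | × | | × | | × | | × | × |
| peptide database search | × | × | × | × | × | × | × | × | × | × | × |
| machine learning postprocessor | × | × | | | × | × | × | | × | × | × |
| protein cross-link searching | | | | | | | | × | | | × |
| RNA cross-link searching | | | × | | | | | | | | |
| spectral counting | | × | | × | | × | × | | | × | × |
| isobaric tag quantification | × | × | × | | | × | | × | | × | × |
| peak area quantification | × | × | × | | | × | | × | | × | |
| Statistical Confidence Estimates | | | | | | | | | | | |
| decoy-based estimates | × | × | × | × | × | × | × | × | | × | × |
| parametric PSM p values | | | | × | | × | | × | | | × |
| exact PSM p values | | | | | | | | | | | × |
| PSM q values | × | | × | | × | × | × | × | × | × | × |
| PSM PEPs | × | × | × | | | × | | | | × | × |
| peptide q values | × | | | | × | × | × | × | × | | × |
| peptide PEPs | × | × | | | | | | | | × | × |
| protein q values | × | | | | × | × | × | | × | × | × |
| protein PEPs | × | × | | | | × | | | | | × |
| Input Spectrum File Formats | | | | | | | | | | | |
| Thermo .RAW | × | × | × | | | | × | × | × | × | × |
| Waters .RAW | × | | × | | | | × | | × | × | × |
| MDS/Sciex .wiff | × | × | × | | | | × | | × | × | × |
| Agilent .d | × | | × | | | | × | | × | × | × |
| Bruker .d | × | | × | | | | × | | × | × | × |
| MS1 | | | | | | | | | × | | × |
| MS2 | | | × | | | | | × | × | | × |
| mzML | × | | × | × | × | | × | | × | × | × |
| mzXML | × | × | × | × | | | × | | × | × | × |
| MGF | | | × | × | × | | | × | × | × | × |
| Input PSM File Formats | | | | | | | | | | | |
| PepXML | × | | × | | | | × | | | | × |
| mzIdentML | × | | × | | | × | | | × | | |
| mzQuantML | | | × | | | | | | | | |
| .dat (Mascot) | × | | | | | × | | | | | |
| .out (SEQUEST) | × | | | | | × | | | | | |
| .sqt (SEQUEST) | × | | | | | × | | | | | × |
| .srf (SEQUEST) | | | | | | × | | | | | |
| other tool-specific formats | | | | | | × | | | | | |
| Output File Formats | | | | | | | | | | | |
| tab-delimited | × | × | × | × | | × | × | × | × | × | × |
| mzTab | | × | × | | | | | | | × | |
| PepXML | × | | × | | × | | | | | × | × |
| ProtXML | × | | | | | × | | | | | |
| mzIdentML | × | | × | | | × | | | × | × | × |
| mzQuantML | | | × | | | | | | | | |
| Implementation | | | | | | | | | | | |
| free | × | × | × | × | × | | × | × | × | | × |
| source code available | × | | × | × | × | | × | | × | | × |
| open source license | × | | × | × | × | | × | | × | | × |
| Linux binaries | | | × | | | × | × | × | × | × | × |
| MacOS binaries | | | × | | | × | × | | | | × |
| native Windows binaries | × | × | × | | | × | × | × | × | × | × |
| command line interface | × | × | × | × | | × | | | × | × | × |
| graphical user interface | × | × | × | × | × | × | × | × | × | × | |
| application programming interface | | | × | | | | × | | | × | |

(a) “Mascot tools” refers to Mascot Server and Mascot Distiller, which are licensed separately. GPM is Perl-based, so no binaries are needed. Scaffold parses tool-specific PSM formats produced by Proteome Discoverer, MS Amanda, Byonic, OMSSA, MaxQuant, SpectrumMill, X!Tandem, Waters Identity E, and Phenyx. Note that as of August 2014 CPFP is no longer actively maintained.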

Crux supports a variety of workflows, providing users with flexibility to tailor their analysis to their experimental goals. The choice of search engine (Comet versus Tide) is a matter of personal preference and processing considerations and is not likely to substantially affect the final results. Tide is faster on a single thread, but, unlike Comet, does not yet operate in multithreaded mode. Exact p values, which are only available in Tide, provide significantly improved statistical power at the expense of some computational overhead (roughly 0.2 s per spectrum). The two primary postprocessors, Percolator and Barista, offer more substantial differences. Both use a target-decoy machine learning approach. However, Percolator first learns to rerank peptide-spectrum matches (PSMs) and then performs a probabilistic protein-level inference,[20] whereas Barista formulates both tasks jointly in a single discriminative learning procedure. Which approach performs better in practice is an open question that deserves further exploration.

To demonstrate the efficiency and accuracy of our software, we downloaded 224 GB of compressed data from a recent study of genetic control of protein abundance in humans[21] (details in the Supporting Information). Searching these >9 million fragmentation spectra using Tide against a human protein database containing ∼90,000 proteins and a matched set of decoys required 20.2 h of CPU time on a single thread, for a rate of 121 spectra/s. Postprocessing with Percolator required an additional 20.5 min. At 1% false discovery rate (FDR) thresholds for PSMs, peptides, and proteins, respectively, this analysis identified 2 576 283 PSMs, 79 976 peptides, and 11 432 proteins (Figure 2a–c). These results are comparable to the published analysis, which reported 2 726 242 PSMs corresponding to 71 800 distinct peptides at a 1% peptide-level FDR threshold. We also used Comet and Percolator to analyze a collection of 348 157 high-resolution spectra from the erythrocytic cycle of the malaria parasite Plasmodium falciparum,[100] identifying at 1% FDR 74 974 PSMs, 30 640 peptides, and 2618 proteins (Figure 2d–f).
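The FDR thresholds quoted above rest on target-decoy estimation. As a rough illustration of that idea only, the following Python sketch converts a list of scored target and decoy PSMs into q values using the simple estimate FDR ≈ (decoys accepted) / (targets accepted). It is a simplified sketch under stated assumptions: it presumes equal-sized target and decoy databases and a score where higher is better, and it omits the machine-learned rescoring and other refinements that Percolator and Barista perform. The function name and the toy scores are hypothetical.

```python
def target_decoy_qvalues(psms):
    """Estimate q values for target PSMs by target-decoy competition.

    psms: iterable of (score, is_decoy) pairs; a higher score is a better match.
    Returns a list of (score, q_value) for the target PSMs, best score first.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            # Estimated FDR if the threshold were set at this target's score.
            fdrs.append((score, min(1.0, decoys / targets)))
    # The q value is the smallest FDR at this threshold or any looser one,
    # which makes q values non-decreasing as the score gets worse.
    qvalues = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append((score, running_min))
    qvalues.reverse()
    return qvalues


# Toy example: seven PSMs, two of them decoys (is_decoy=True).
toy = [(10, False), (9, False), (8, True), (7, False),
       (6, False), (5, True), (4, False)]
accepted_at_1pct = sum(1 for _, q in target_decoy_qvalues(toy) if q <= 0.01)
```

In the toy list, only the two targets ranked above the first decoy have a q value of 0, so two PSMs are accepted at a 1% threshold. Counting accepted PSMs across a range of thresholds gives curves of the kind plotted in Figure 2.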
Figure 2

(a–c) We used Tide+Percolator to analyze 9 092 380 fragmentation spectra from 95 different human samples. The figure plots the number of spectra, peptides and proteins identified as a function of false discovery rate threshold. (d–f) Each panel plots, from Comet+Percolator analysis of 348 157 Plasmodium falciparum fragmentation spectra, the number of (respectively) spectra, peptides and proteins identified as a function of false discovery rate threshold. Total analysis time was 61.2 min (34.4 min for Comet and 26.8 min for Percolator). The number of proteins identified at 1% FDR (2618) by Comet+Percolator compares favorably with the published analysis (2767 proteins).

Crux is a command line tool, written in C++ and distributed as a single binary executable supporting a variety of commands. Users wishing to compile their own version of Crux can download the source code, which is covered by an Apache license. All Crux code undergoes code review and revisions to reflect our documented coding standards, and the software is automatically tested using a continuous integration system, which compiles Crux on three operating systems (Windows, MacOS, and Linux), thereby providing up-to-date binary executables. Crux is under active development, with several important improvements and additions planned for the near future. In addition, we encourage community members to contribute to the toolkit.
  22 in total

1.  Open source system for analyzing, validating, and storing protein identification data.

Authors:  Robertson Craig; John P Cortens; Ronald C Beavis
Journal:  J Proteome Res       Date:  2004 Nov-Dec       Impact factor: 4.466

2.  pFind: a novel database-searching software system for automated peptide and protein identification via tandem mass spectrometry.

Authors:  Dequan Li; Yan Fu; Ruixiang Sun; Charles X Ling; Yonggang Wei; Hu Zhou; Rong Zeng; Qiang Yang; Simin He; Wen Gao
Journal:  Bioinformatics       Date:  2005-04-07       Impact factor: 6.937

3.  Global analysis of protein expression and phosphorylation of three stages of Plasmodium falciparum intraerythrocytic development.

Authors:  Brittany N Pease; Edward L Huttlin; Mark P Jedrychowski; Eric Talevich; John Harmon; Timothy Dillman; Natarajan Kannan; Christian Doerig; Ratna Chakrabarti; Steven P Gygi; Debopam Chakrabarti
Journal:  J Proteome Res       Date:  2013-08-26       Impact factor: 4.466

4.  CPFP: a central proteomics facilities pipeline.

Authors:  David C Trudgian; Benjamin Thomas; Simon J McGowan; Benedikt M Kessler; Mogjiborahman Salek; Oreste Acuto
Journal:  Bioinformatics       Date:  2010-02-25       Impact factor: 6.937

5.  Detecting cross-linked peptides by searching against a database of cross-linked peptide pairs.

Authors:  Sean McIlwain; Paul Draghicescu; Pragya Singh; David R Goodlett; William Stafford Noble
Journal:  J Proteome Res       Date:  2010-05-07       Impact factor: 4.466

6.  Computing exact p-values for a cross-correlation shotgun proteomics score function.

Authors:  J Jeffry Howbert; William Stafford Noble
Journal:  Mol Cell Proteomics       Date:  2014-06-02       Impact factor: 5.911

7.  Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data.

Authors:  Oliver Serang; Michael J MacCoss; William Stafford Noble
Journal:  J Proteome Res       Date:  2010-10-01       Impact factor: 4.466

8.  Comparison of database search strategies for high precursor mass accuracy MS/MS data.

Authors:  Edward J Hsieh; Michael R Hoopmann; Brendan MacLean; Michael J MacCoss
Journal:  J Proteome Res       Date:  2010-02-05       Impact factor: 4.466

9.  Estimating relative abundances of proteins from shotgun proteomics data.

Authors:  Sean McIlwain; Michael Mathews; Michael S Bereman; Edwin W Rubel; Michael J MacCoss; William Stafford Noble
Journal:  BMC Bioinformatics       Date:  2012-11-19       Impact factor: 3.169

10.  A cross-platform toolkit for mass spectrometry and proteomics.

Authors:  Matthew C Chambers; Brendan Maclean; Robert Burke; Dario Amodei; Daniel L Ruderman; Steffen Neumann; Laurent Gatto; Bernd Fischer; Brian Pratt; Jarrett Egertson; Katherine Hoff; Darren Kessner; Natalie Tasman; Nicholas Shulman; Barbara Frewen; Tahmina A Baker; Mi-Youn Brusniak; Christopher Paulse; David Creasy; Lisa Flashner; Kian Kani; Chris Moulding; Sean L Seymour; Lydia M Nuwaysir; Brent Lefebvre; Frank Kuhlmann; Joe Roark; Paape Rainer; Suckau Detlev; Tina Hemenway; Andreas Huhmer; James Langridge; Brian Connolly; Trey Chadick; Krisztina Holly; Josh Eckels; Eric W Deutsch; Robert L Moritz; Jonathan E Katz; David B Agus; Michael MacCoss; David L Tabb; Parag Mallick
Journal:  Nat Biotechnol       Date:  2012-10       Impact factor: 54.908

