
mspire: mass spectrometry proteomics in Ruby.

Abstract

Mass spectrometry-based proteomics stands to gain from additional analysis of its data, but its large, complex datasets place demands on speed and memory usage that require special consideration from scripting languages. The software library 'mspire', developed in the Ruby programming language, offers quick and memory-efficient readers for standard XML proteomics formats, converters for intermediate file types in typical proteomics spectral-identification workflows (including the Bioworks .srf format), and modules for the calculation of peptide false identification rates. AVAILABILITY: Freely available at http://mspire.rubyforge.org. Additional data models, usage information, and methods are available at http://bioinformatics.icmb.utexas.edu/mspire

Year: 2008 PMID: 18930952 PMCID: PMC2639276 DOI: 10.1093/bioinformatics/btn513

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The analysis of mass spectrometry (MS) proteomics data is challenging on many fronts. Datasets are complex, with information spanning multi-level hierarchies, and they are also very large: files are often close to a gigabyte in size. Access to MS proteomics data is increasing with the advent of standardized formats, such as mzXML, and repositories, such as PeptideAtlas (Desiere et al., 2006), but its analysis remains no less daunting. Strongly typed languages (e.g. C/C++ and Java) are well suited for intensive computational tasks, but less so for exploring landscapes of computational possibilities. Scripting languages (e.g. Python, Perl and Ruby) are ideal for quick prototyping and the exploration of new ideas, but can be too slow or memory inefficient for large datasets. Thus, a need exists for scripting language tools capable of dealing with the size and complexity of MS proteomics data.

Ruby is a full-featured programming language created with inspiration from Perl, Python, Smalltalk and Lisp. It is object oriented and remarkably consistent in its design. Ruby's syntax encourages the use of blocks and closures, which lend flexibility and conciseness to programming style. Also, while it is powerful, Ruby is relatively easy to learn, making it a natural first programming language for budding bioinformaticians. Ruby does not have the same degree of support for scientific computation as Python (e.g. NumPy and PyLab), but it is building significant momentum in this area (e.g. SciRuby at http://sciruby.codeforpeople.com). These features encouraged our use of Ruby in the creation of a high-level library supporting MS proteomics analysis.

A few libraries/tools exist for working with MS proteomics data outside of Ruby. InSilicoSpectro, the only other scripting language library, is an open-source library written in Perl for ‘implementing recurrent computations that are necessary for proteomics data analysis’. While there is some overlap with the work described here (e.g. in silico protein digestion), that library is currently geared towards the support of the Phenyx and Mascot search engines and low-level spectral computation (Colinge et al., 2006), while mspire is geared towards supporting Thermo's Bioworks software (SEQUEST) and downstream analysis, such as false identification rate (FIR) determination. The ProteomeCommons.org IO framework also has the ability to read/write and convert common data formats (Falkner et al., 2007), but this library is written in Java and does not provide any higher level language tools.

2 FEATURES

mspire is a software package for working with MS proteomics data as outlined in Figure 1A.

Fig. 1.

(A) Overview of mspire functionality. Black arrows and gray boxes depict mspire functionality. From left to right, mspire creates randomized databases (DBs) for FIR determination. MS::MSRun is a unified model for working with LC-MS/MS datasets. The Bioworks search engine produces peptide spectral matches (PSMs) in a .srf binary file or XML format. mspire extracts PSMs and presents them via a simple interface, SpecID, while preserving access to the underlying data structures. FIRs can be determined with various downstream software tools and reread into SpecID objects. SBV, sample bias validation. (B) mspire uses Arrayclass objects for efficient memory usage. GC, garbage collection; AC, Arrayclass; AF, Arrayfields; class, a traditional Ruby object; SStruct, SuperStruct. (C) Lazy evaluation of spectra allows very large files to be read quickly. Shown are the times to read all 7830 well-formed mzXML files from PeptideAtlas and access two spectra for the ‘io’ and ‘string’ lazy evaluation methods. A total of 181 files >350 MB in size were not read with the ‘string’ option. (D) Object model for capturing MS runs. (E) 3: an MSRun object can be instantiated with several lazy evaluation schemes. 4: typical instantiation. 6–8: total number of scans, the number of MS scans, and the number of MS/MS scans. 9: retrieves the start and end m/z values for all MS/MS scans. 11: a Ruby block that selects only MS/MS scans. 13–16: the scans are mapped to intensities; the block (designated between the ‘do’ and ‘end’) receives the scan object and returns the value of the last line, which is collected as an array (list_of_intensities). 14–15: chained method calls (equivalent to calling prc.intensity).

2.1 Memory usage and speed

mspire relies on several memory-saving techniques that are critical for working with large data files. Classes that are instantiated in large numbers are implemented as Arrayclass (http://arrayclass.rubyforge.org) objects, providing highly efficient memory usage (Fig. 1B) while preserving the accessor behavior common to typical Ruby objects. By default, spectra from MS file formats (mzXML and mzData) are read into memory-efficient strings and are only fully decoded when spectral information is accessed. An option is also available for storing only the byte indices of spectral information, which can be used for fast, random access of spectra or for reading files of essentially unlimited size. REXML, Ruby's standard library XML parser, can be far too slow when reading the large XML files generated in MS proteomics. mspire can use either XMLParser or LibXML (both of which have C/C++ bindings) for rapid parsing of large files. Performance when reading and then accessing two spectra across thousands of mzXML files from PeptideAtlas is shown in Figure 1C. Late evaluation of a spectrum allows files to be read at ∼20 MB/s with no file-size limit.
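The lazy evaluation described above can be illustrated with a minimal sketch. It is not mspire's implementation: the class name LazySpectrum and its methods are hypothetical, and the peak string is assumed to be uncompressed base64 text holding 32-bit floats in network byte order as (m/z, intensity) pairs, which is a common mzXML encoding.

```ruby
require 'base64'

# Hypothetical class for illustration only; not part of mspire.
# It keeps the compact base64 peak string and decodes it on first access.
class LazySpectrum
  def initialize(encoded_peaks)
    @encoded = encoded_peaks # stored as-is until peak data are requested
    @pairs = nil
  end

  # True once the peak string has been decoded into floats.
  def decoded?
    !@pairs.nil?
  end

  # Array of [m/z, intensity] pairs, decoded on the first call and cached.
  # 'g*' unpacks single-precision floats in network (big-endian) order.
  def pairs
    @pairs ||= Base64.decode64(@encoded).unpack('g*').each_slice(2).to_a
  end

  def mzs
    pairs.map(&:first)
  end

  def intensities
    pairs.map(&:last)
  end
end

# Build a small example peak list with two (m/z, intensity) pairs.
encoded = Base64.strict_encode64([100.0, 5.0, 200.5, 10.0].pack('g*'))
spectrum = LazySpectrum.new(encoded)

puts spectrum.decoded?            # no decoding has happened yet
puts spectrum.intensities.inspect # triggers the decode
puts spectrum.decoded?
```

Because most spectra in a large run are never inspected by a given script, deferring the decode keeps both read time and memory proportional to the data actually used.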

2.2 Reading MS proteomics data formats

mspire parses mzXML and mzData formats into a unified object model to simplify working with liquid chromatography (LC) MS and MS/MS runs. Figure 1D shows the basic class hierarchy and Figure 1E demonstrates a simple ‘use case’.
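The block-based style of working with scans that the Figure 1E caption describes can be sketched in plain Ruby. The Scan and Precursor structs below are hypothetical stand-ins, not mspire's classes, and the scan values are made up for illustration.

```ruby
# Hypothetical stand-ins for scan objects; not mspire's object model.
Precursor = Struct.new(:mz, :intensity)
Scan = Struct.new(:num, :ms_level, :precursor)

# Illustrative data: two MS scans and two MS/MS scans.
scans = [
  Scan.new(1, 1, nil),
  Scan.new(2, 2, Precursor.new(445.12, 150_000.0)),
  Scan.new(3, 2, Precursor.new(612.30, 80_000.0)),
  Scan.new(4, 1, nil)
]

# A block that selects only MS/MS scans.
msms_scans = scans.select { |scan| scan.ms_level == 2 }

# Map each MS/MS scan to its precursor intensity. The value of the block's
# last expression is collected into the resulting array.
list_of_intensities = msms_scans.map do |scan|
  prc = scan.precursor
  prc.intensity
end

puts scans.size
puts msms_scans.size
puts list_of_intensities.inspect
```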

2.3 Bioworks SEQUEST results files (.srf)

Bioworks previously produced separate text files for each spectrum, but now outputs a single SEQUEST results file (.srf) for each set of searches. This increases the speed of a search, decreases disk space usage and is much easier to work with in file system operations. Unfortunately, because the output is binary, accessing its contents can be difficult and downstream analysis tools (outside of Bioworks) do not currently support this format. We created a reader for .srf files using the Ruby ‘unpack’ function. It extracts both spectral information and SEQUEST results. The reader is fast and also works across platforms because it does not rely on any vendor software libraries.
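As a rough illustration of how Ruby's String#unpack can decode binary records without vendor libraries, the sketch below reads fixed-width records from an IO object. The record layout, the Hit struct and the read_hits helper are all hypothetical; this is not the actual .srf layout, which is not described here.

```ruby
require 'stringio'

# Hypothetical fixed-width record, NOT the real .srf layout:
#   V   unsigned 32-bit little-endian integer (scan number)
#   C   unsigned 8-bit integer (charge)
#   E   64-bit little-endian float (precursor mass)
#   Z16 null-padded 16-byte string (peptide sequence)
RECORD_FORMAT = 'VCEZ16'
RECORD_SIZE = 4 + 1 + 8 + 16

Hit = Struct.new(:scan, :charge, :mass, :peptide)

# Read records until the stream is exhausted; a trailing partial
# record is ignored.
def read_hits(io)
  hits = []
  while (bytes = io.read(RECORD_SIZE))
    break if bytes.bytesize < RECORD_SIZE
    hits << Hit.new(*bytes.unpack(RECORD_FORMAT))
  end
  hits
end

# Build an in-memory example "file" with two records.
rows = [[1201, 2, 1479.75, 'PEPTIDEK'], [1202, 3, 2210.5, 'SAMPLER']]
data = rows.map { |row| row.pack(RECORD_FORMAT) }.join

hits = read_hits(StringIO.new(data))
hits.each { |h| puts "#{h.scan} #{h.charge}+ #{h.mass} #{h.peptide}" }
```

Stating the byte order explicitly in the format string (here 'V' and 'E') is what lets a reader of this kind behave the same way on any platform.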

2.4 Reading/writing spectral identification formats

Even when derived from the same upstream data source, formats for working with spectral identifications can vary widely. We designed readers and writers for common downstream spectral-identification software formats for SEQUEST-based data: pepXML files, which are used in the trans-proteomic pipeline (ProteinProphet), and the .sqt format, which can be used with DTASelect and Percolator (Käll et al., 2007). Each reader is tailored to its format so that users can easily extract format-specific information, and all readers implement a common interface so that information shared across these formats can be extracted in the same way.

2.5 Determining FIRs

Bioworks software support for determining FIRs is currently non-existent, and so downstream tools are necessary. mspire supports peptide FIR determination from target-decoy database searches (both the creation of decoy databases and the summary of search results), PeptideProphet and Percolator. Known biases in sample content can also be used to establish an FIR.
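The target-decoy approach can be sketched as follows. This is a simplified illustration and not mspire's code: sequence reversal is used as one simple way to build decoys (shuffling is another), the 'DECOY_' prefix and both helper names are hypothetical, and the FIR is estimated as decoy hits divided by target hits above a score threshold, which is one common estimate for a combined target and decoy search.

```ruby
# Build one decoy entry per protein by reversing its sequence.
# The 'DECOY_' prefix is an illustrative naming convention.
def decoy_entries(proteins)
  proteins.map { |name, seq| ["DECOY_#{name}", seq.reverse] }
end

# Estimate the FIR among hits scoring at or above the threshold as
# (number of decoy hits) / (number of target hits).
def estimate_fir(psms, threshold)
  passing = psms.select { |psm| psm[:score] >= threshold }
  decoys = passing.count { |psm| psm[:protein].start_with?('DECOY_') }
  targets = passing.size - decoys
  return 0.0 if targets.zero?

  decoys.to_f / targets
end

# Made-up sequences and scores for illustration.
proteins = { 'P1' => 'MKTAYIAK', 'P2' => 'GSHMLEDP' }
decoys = decoy_entries(proteins)

psms = [
  { protein: 'P1', score: 3.2 },
  { protein: 'P2', score: 2.9 },
  { protein: 'P2', score: 2.7 },
  { protein: 'DECOY_P1', score: 2.6 },
  { protein: 'P1', score: 2.5 },
  { protein: 'DECOY_P2', score: 1.1 }
]

puts decoys.inspect
puts estimate_fir(psms, 2.5)
```

In practice the threshold is usually lowered or raised until the estimated FIR reaches an acceptable level for the study.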

Funding

National Science Foundation; the National Institutes of Health; the Welch Foundation (F1515); Packard Fellowship (to E.M.M.). NIH grant numbers: GM067779, GM076536. Conflict of Interest: none declared.

References

1. ProteomeCommons.org IO Framework: reading and writing multiple proteomics data formats.

Authors: J A Falkner; J W Falkner; P C Andrews
Journal: Bioinformatics Date: 2006-11-22 Impact factor: 6.937

2. Semi-supervised learning for peptide identification from shotgun proteomics datasets.

Authors: Lukas Käll; Jesse D Canterbury; Jason Weston; William Stafford Noble; Michael J MacCoss
Journal: Nat Methods Date: 2007-10-21 Impact factor: 28.547

3. InSilicoSpectro: an open-source proteomics library.

Authors: Jacques Colinge; Alexandre Masselot; Pablo Carbonell; Ron D Appel
Journal: J Proteome Res Date: 2006-03 Impact factor: 4.466

4. The PeptideAtlas project.

Authors: Frank Desiere; Eric W Deutsch; Nichole L King; Alexey I Nesvizhskii; Parag Mallick; Jimmy Eng; Sharon Chen; James Eddes; Sandra N Loevenich; Ruedi Aebersold
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
