PGP: parallel prokaryotic proteogenomics pipeline for MPI clusters, high-throughput batch clusters and multicore workstations.

Andrey Tovchigrechko, Pratap Venepally, Samuel H Payne.

Abstract

SUMMARY: We present the first public release of our proteogenomic annotation pipeline. We have previously used our original unreleased implementation to improve the annotation of 46 diverse prokaryotic genomes by discovering novel genes, post-translational modifications and correcting the erroneous annotations by analyzing proteomic mass-spectrometry data. This public version has been redesigned to run in a wide range of parallel Linux computing environments and provided with the automated configuration, build and testing facilities for easy deployment and portability.
AVAILABILITY AND IMPLEMENTATION: Source code is freely available from https://bitbucket.org/andreyto/proteogenomics under the GPL license. It is implemented in Python and C++. It bundles the Makeflow engine to execute the workflows.
CONTACT: atovtchi@jcvi.org.

Year: 2014 PMID: 24470574 PMCID: PMC4016709 DOI: 10.1093/bioinformatics/btu051

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Our pipeline is a tool for improving existing genomic annotations using available proteomics mass spectrometry data. As most genome annotation pipelines consist of automated gene finding, they lack experimental validation of primary structure (Aziz et al.; Markowitz et al.) and have to rely on DNA-centric sources of data such as sequence homology, transcriptome mapping and codon frequency. By incorporating an orthogonal set of data, proteogenomics is able to discover novel genes and post-translational modifications and to correct erroneous primary sequence annotations. The protocol and the large-scale application of our original pipeline to 46 taxonomically diverse genomes were reported in Venter et al. The implementation was tightly coupled with the internal computation services framework (VICS) at the J. Craig Venter Institute (JCVI). VICS has never been deployed outside of the JCVI, and the pipeline itself required manual configuration and building by the developers. It could only use the Sun Grid Engine (SGE) batch queuing system configured for high-throughput computing (HTC) mode, in which large numbers of serial jobs could be efficiently scheduled on a compute cluster. For these reasons, the original pipeline has not been made public.

To create the first open source release presented here, we have redesigned the pipeline to run in a wide range of parallel Linux computing environments:

1. High-performance computing (HPC) clusters, which are set up to efficiently schedule only large (100s+ of cores) parallel Message Passing Interface (MPI) jobs under the control of a batch queuing system such as SGE and its clones, Simple Linux Utility for Resource Management (SLURM) or Portable Batch System (PBS)/Torque. Our primary targets for this use case were compute clusters of XSEDE (https://www.xsede.org/), the federation of supercomputers supported by the US National Science Foundation. XSEDE allocates its resources to outside researchers through a peer-reviewed proposal system. Biologists will be able to use our software on this major computational resource.
2. High-throughput computing (HTC) clusters, widely used as local bioinformatics computing resources. These clusters are configured to efficiently schedule large numbers of serial jobs under the control of a batch queuing system.
3. A single multi-core workstation without a batch queuing system (including the extreme case of a single-core machine).

The volume of computations in proteogenomics is relatively high, with ~100 CPU hours for a typical bacterial genome. Our pipeline performs such annotation in ~3 h of wall clock time on an HTC cluster. We have now designed a fully automated installation procedure preconfigured for several types of specific target systems and easily adaptable to others through editing of a few configuration files.

Although several other proteogenomic packages (Chapman et al.; Kumar et al.; Risk et al.; Sanders et al.) have been developed in recent years, they were designed for execution on a single workstation. None of the other publications matched the breadth of application reported for our pipeline in Venter et al. The output files from that study are available at http://omics.pnl.gov/pgp/overview.php. The contributed RefSeq updates can be seen in the GenBank flat files (.gbk) of the corresponding genomes at the NCBI wherever the proteomics data are listed as experimental evidence. One example is the Mycobacterium tuberculosis H37Rv genome (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv_uid57777/NC_000962.gbk), which contains CDS features with the attribute /experiment="EXISTENCE: identified in proteomics study".
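As an illustration of how the /experiment evidence attribute quoted above might be located in a GenBank flat file, the following is a minimal Python sketch. It is not part of the PGP software. It assumes only the standard flat-file layout (feature keys indented by five spaces, qualifiers indented further) and the attribute text quoted above; the function name, the sample record and its DEMO locus tags are hypothetical and invented for the example.

```python
import re
from typing import List

# Evidence string quoted in the text above.
EVIDENCE = "EXISTENCE: identified in proteomics study"


def proteomics_supported_cds(gbk_text: str) -> List[str]:
    """Return locus tags of CDS features that carry the proteomics evidence.

    Hypothetical helper for illustration; it is not a PGP function.
    """
    hits = []
    # Feature keys begin after exactly five spaces; qualifier lines are
    # indented more deeply, so this split separates whole features.
    for feature in re.split(r"\n {5}(?=\S)", gbk_text):
        if not feature.startswith("CDS"):
            continue
        # Qualifier values may wrap across lines, so collapse whitespace.
        flat = " ".join(feature.split())
        if f'/experiment="{EVIDENCE}"' in flat:
            match = re.search(r'/locus_tag="([^"]+)"', flat)
            hits.append(match.group(1) if match else "(no locus_tag)")
    return hits


# Made-up fragment in GenBank feature-table layout (not real annotation data).
SAMPLE = """FEATURES             Location/Qualifiers
     gene            1..300
                     /locus_tag="DEMO_0001"
     CDS             1..300
                     /locus_tag="DEMO_0001"
                     /experiment="EXISTENCE: identified in
                     proteomics study"
     CDS             400..900
                     /locus_tag="DEMO_0002"
"""

if __name__ == "__main__":
    print(proteomics_supported_cds(SAMPLE))
```

For a real genome file, the text would be read from the downloaded .gbk file and passed to the same function; a full GenBank parser would be a more robust choice for anything beyond a quick check.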

2 ARCHITECTURE AND IMPLEMENTATION

2.1 Parallelization strategy

In the present work, our main goal was to make the same pipeline protocol portable across the different parallel execution environments that users are likely to encounter. The original algorithm is embarrassingly parallel for the most part. It processes each spectrum file independently throughout all computationally intensive stages of the algorithm. There is a global synchronization point in the middle to build a histogram of all scores for P-value computation. Thus, the pipeline corresponds to a distributed workflow where multiple serial processes are executed concurrently following a dependency graph defined by required input and output files. This model is compatible with a wide variety of execution environments such as standalone multicore machines, HTC clusters and, with extra effort, MPI clusters. The original unreleased implementation used the HTC model tied into VICS and SGE. We have now achieved portability across execution environments by generating and running the same workflow under the Makeflow engine (http://nd.edu/~ccl/software/makeflow/) (Thrasher et al.), which provides parallel execution on multiple types of batch queuing systems as well as on standalone multicore nodes. On MPI clusters, Makeflow uses a 'glide-in' approach that we describe in the PGP software manual. In short, the 'glide-in' approach emulates an HTC cluster inside a single large MPI job.

It would also be trivial to deploy our pipeline behind any Web services front-end such as Galaxy (Giardine et al.) or Taverna (Wolstencroft et al.). Each run of the pipeline appears to the caller as a single command-line invocation of the entry point script, which exits only once it finishes executing its parallel workflow. Backend options (batch queue or local multicore) are passed through the command arguments. No permanently running server components are used by Makeflow. Deployment in Galaxy, for example, would be the same as deployment of a simple serial tool, requiring creation of a single XML tool description file.
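The file-driven dependency graph described in this section can be sketched as a set of Make-style rules of the kind that Makeflow consumes (a target, its source files, and a tab-indented command). The Python sketch below only generates such rules as text for a list of spectrum files. It is an illustration under stated assumptions and not the workflow that PGP actually generates: the script names score_spectra.py, build_histogram.py and assign_pvalues.py and all file suffixes are hypothetical placeholders.

```python
from typing import List


def makeflow_rules(spectrum_files: List[str]) -> str:
    """Build Make-style rules for a two-stage workflow with one barrier.

    All script names and file suffixes are hypothetical placeholders.
    """
    rules = []
    score_files = [spec + ".scores" for spec in spectrum_files]

    # Stage 1: one independent rule per spectrum file, so all of these
    # can run concurrently (the embarrassingly parallel part).
    for spec, scores in zip(spectrum_files, score_files):
        rules.append(
            f"{scores}: {spec} score_spectra.py\n"
            f"\tpython score_spectra.py {spec} > {scores}\n"
        )

    # Global synchronization point: the histogram rule lists every score
    # file as a source, so it cannot start until stage 1 has finished.
    all_scores = " ".join(score_files)
    rules.append(
        f"histogram.txt: {all_scores} build_histogram.py\n"
        f"\tpython build_histogram.py {all_scores} > histogram.txt\n"
    )

    # Stage 2: per-file again; each rule needs only its own score file
    # and the shared histogram.
    for spec, scores in zip(spectrum_files, score_files):
        pvalues = spec + ".pvalues"
        rules.append(
            f"{pvalues}: {scores} histogram.txt assign_pvalues.py\n"
            f"\tpython assign_pvalues.py {scores} histogram.txt > {pvalues}\n"
        )

    return "\n".join(rules)


if __name__ == "__main__":
    print(makeflow_rules(["run1.mzXML", "run2.mzXML"]))
```

Because the dependencies are expressed purely through input and output files, the same rule set describes the workflow regardless of whether the engine dispatches the commands to a batch queue or to local cores; only the backend selection changes.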

2.2 Installation and execution

The newly developed installation procedure and documentation are part of the source code repository. The step-by-step installation and usage manual (also shown on the landing page at BitBucket) covers the execution environments, specific examples of configuration files for each environment and instructions for adapting these files to new compute clusters. The manual also covers sample run-time parameters for different environments, Quick Start instructions for testing the pipeline on a small dataset included in the repository and an example of interpreting the pipeline's output to discover a novel gene. The automated configuration and build procedure is driven by CMake (http://www.cmake.org/). Our installation procedure builds its own local copy of Makeflow and of several proteomics tools from http://proteomics.ucsd.edu.

Funding: National Science Foundation awards (EF-0949047 and 1048199), and XSEDE allocation (DEB100001) on the Texas Advanced Computing Center Ranger. The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.

Conflict of Interest: none declared.

REFERENCES

1. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

2. Plant proteogenomics: from protein extraction to improved gene predictions.

Authors: Brett Chapman; Natalie Castellana; Alex Apffel; Ryan Ghan; Grant R Cramer; Matthew Bellgard; Paul A Haynes; Steven C Van Sluyter
Journal: Methods Mol Biol Date: 2013

3. Proteogenomic analysis of Bradyrhizobium japonicum USDA110 using GenoSuite, an automated multi-algorithmic pipeline.

Authors: Dhirendra Kumar; Amit Kumar Yadav; Puneet Kumar Kadimi; Shivashankar H Nagaraj; Sean M Grimmond; Debasis Dash
Journal: Mol Cell Proteomics Date: 2013-07-23 Impact factor: 5.911

4. Peppy: proteogenomic search software.

Authors: Brian A Risk; Wendy J Spitzer; Morgan C Giddings
Journal: J Proteome Res Date: 2013-05-06 Impact factor: 4.466

5. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud.

Authors: Katherine Wolstencroft; Robert Haines; Donal Fellows; Alan Williams; David Withers; Stuart Owen; Stian Soiland-Reyes; Ian Dunlop; Aleksandra Nenadic; Paul Fisher; Jiten Bhagat; Khalid Belhajjame; Finn Bacall; Alex Hardisty; Abraham Nieva de la Hidalga; Maria P Balcazar Vargas; Shoaib Sufi; Carole Goble
Journal: Nucleic Acids Res Date: 2013-05-02 Impact factor: 16.971

6. Proteogenomic analysis of bacteria and archaea: a 46 organism case study.

Authors: Eli Venter; Richard D Smith; Samuel H Payne
Journal: PLoS One Date: 2011-11-17 Impact factor: 3.240

7. The proteogenomic mapping tool.

Authors: William S Sanders; Nan Wang; Susan M Bridges; Brandon M Malone; Yoginder S Dandass; Fiona M McCarthy; Bindu Nanduri; Mark L Lawrence; Shane C Burgess
Journal: BMC Bioinformatics Date: 2011-04-22 Impact factor: 3.307

8. The RAST Server: rapid annotations using subsystems technology.

Authors: Ramy K Aziz; Daniela Bartels; Aaron A Best; Matthew DeJongh; Terrence Disz; Robert A Edwards; Kevin Formsma; Svetlana Gerdes; Elizabeth M Glass; Michael Kubal; Folker Meyer; Gary J Olsen; Robert Olson; Andrei L Osterman; Ross A Overbeek; Leslie K McNeil; Daniel Paarmann; Tobias Paczian; Bruce Parrello; Gordon D Pusch; Claudia Reich; Rick Stevens; Olga Vassieva; Veronika Vonstein; Andreas Wilke; Olga Zagnitko
Journal: BMC Genomics Date: 2008-02-08 Impact factor: 3.969

9. The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions.

Authors: Victor M Markowitz; Ernest Szeto; Krishna Palaniappan; Yuri Grechkin; Ken Chu; I-Min A Chen; Inna Dubchak; Iain Anderson; Athanasios Lykidis; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2007-10-12 Impact factor: 16.971
