Jochen Singer1,2, Hans-Joachim Ruscheweyh1,2,3, Ariane L Hofmann1,2, Thomas Thurnherr1,2, Franziska Singer2,4, Nora C Toussaint2,4, Charlotte K Y Ng5,6,7, Salvatore Piscuoglio8, Christian Beisel1, Gerhard Christofori5, Reinhard Dummer9, Michael N Hall10, Wilhelm Krek11, Mitchell P Levesque9, Markus G Manz12, Holger Moch13, Andreas Papassotiropoulos14,15,16,17, Daniel J Stekhoven2,4, Peter Wild13, Thomas Wüst2,3, Bernd Rinn2,3, Niko Beerenwinkel1,2. 1. Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland. 2. SIB Swiss Institute of Bioinformatics, Basel, Switzerland. 3. Scientific IT Services, ETH Zurich, Basel, Switzerland. 4. NEXUS Personalized Health Technologies, Zurich, Switzerland. 5. Department of Biomedicine, University of Basel, Basel, Switzerland. 6. Institute of Pathology. 7. Division of Gastroenterology and Hepatology, University Hospital Basel, Basel, Switzerland. 8. Institute of Pathology, University Hospital Basel, Basel, Switzerland. 9. Department of Dermatology, University Hospital Zurich, Zurich, Switzerland. 10. Biozentrum, University of Basel, Basel, Switzerland. 11. Institute for Molecular Health Sciences, ETH Zurich, Zurich, Switzerland. 12. Division of Hematology, University Hospital Zurich, Zurich, Switzerland. 13. Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland. 14. Division of Molecular Neuroscience, Department of Psychology. 15. Transfaculty Research Platform Molecular and Cognitive Neurosciences. 16. Psychiatric University Clinics University of Basel, Basel, Switzerland. 17. Department Biozentrum, Life Sciences Training Facility, University of Basel, Basel, Switzerland.
Abstract
Motivation: Next-generation sequencing is now an established method in genomics, and massive amounts of sequencing data are being generated on a regular basis. Analysis of the sequencing data is typically performed by lab-specific in-house solutions, but the agreement of results from different facilities is often small. General standards for quality control, reproducibility and documentation are missing. Results: We developed NGS-pipe, a flexible, transparent and easy-to-use framework for the design of pipelines to analyze whole-exome, whole-genome and transcriptome sequencing data. NGS-pipe facilitates the harmonization of genomic data analysis by supporting quality control, documentation, reproducibility, parallelization and easy adaptation to other NGS experiments. Availability and implementation: https://github.com/cbg-ethz/NGS-pipe. Contact: niko.beerenwinkel@bsse.ethz.ch.
1 Introduction
Advances in next-generation sequencing (NGS) have led to technologies capable of producing massive amounts of data at low cost. However, the analysis of these data is usually carried out using lab-specific in-house solutions. As a consequence, many different workflows are implemented for the same type of data, so that results are not easily comparable and are often hard to reproduce. Several studies have shown that individual pipelines often have limited overlap in their results (Alioto et al.; Denroche et al.; Hofmann et al.), which hampers both the identification of true biological signals and clinical applications. The developers of the Genome Analysis Toolkit attempt to stratify genome analysis by providing best practices (https://software.broadinstitute.org/gatk/best-practices/), but these recommendations are currently not fully implemented computationally.
Here, we introduce NGS-pipe, an automated and user-friendly framework for the design of pipelines for the analysis of large-scale sequencing data, such as cancer genomics data. NGS-pipe makes it easy to develop tailored workflows for the analysis of whole-exome (WES), whole-genome (WGS) and transcriptome (RNA-seq) sequencing data by providing building blocks that execute state-of-the-art tools and handle errors appropriately. An important goal of NGS-pipe is to overcome the common lack of automated procedures to ensure reproducibility. This is particularly important for clinical applications, where well-documented and standardized protocols are a requirement (Aziz et al.).
2 Features of NGS-pipe
NGS-pipe incorporates tools for detecting single nucleotide variants (SNVs), insertions and deletions (indels) and copy number variants (CNVs), as well as for estimating gene expression levels. In addition to the primary read data analysis, NGS-pipe also generates runtime statistics and quality reports. It can be launched on a single computer or a cluster, where independent steps are executed in parallel. A practical introduction and examples can be found in the GitHub repository.
Modularity. NGS-pipe is implemented using the workflow management system Snakemake (Köster and Rahmann, 2012). Combined with a modular backbone, in which the execution of each analysis step is controlled by a rule, this makes NGS-pipe a flexible, easily extendable and highly configurable framework for NGS analysis. By modifying a configuration file, users can adjust the parameters of each rule without changing its implementation, and can include or exclude complete analysis steps to adapt the pre-configured workflows to the specific needs of their own experiment.
Workflows for WES, WGS and RNA-seq data. To illustrate NGS-pipe and to assist users inexperienced in data analysis or pipeline design, we have implemented and tested predefined workflows for the automated analysis of cancer WES, WGS and RNA-seq data (Fig. 1). A description of these workflows, including the computational tools they integrate, can be found in the GitHub repository. Similar workflows can be implemented using NGS-pipe for other NGS applications.
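The configuration-driven design described under Modularity can be illustrated with a small plain-Python sketch. It is not NGS-pipe code and does not use Snakemake. The step functions (`trim`, `deduplicate`), the `CONFIG` layout and `run_pipeline` are hypothetical names chosen for illustration. The sketch only shows the general idea that a configuration, rather than the step implementation, decides which steps run and with which parameters.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical toy steps; real NGS-pipe rules wrap external tools instead.
def trim(reads: List[str], min_length: int = 20) -> List[str]:
    """Drop reads shorter than min_length."""
    return [r for r in reads if len(r) >= min_length]

def deduplicate(reads: List[str]) -> List[str]:
    """Remove exact duplicate reads, keeping first occurrences in order."""
    return list(dict.fromkeys(reads))

# Registry mapping step names to their implementations.
STEPS: Dict[str, Callable[..., List[str]]] = {
    "trim": trim,
    "deduplicate": deduplicate,
}

# Hypothetical configuration: each step can be switched on or off and
# parameterized without touching the functions above.
CONFIG = {
    "trim": {"enabled": True, "params": {"min_length": 5}},
    "deduplicate": {"enabled": False, "params": {}},
}

def run_pipeline(reads: List[str], config: dict,
                 order: Sequence[str] = ("trim", "deduplicate")) -> List[str]:
    """Run the enabled steps in order, passing configured parameters."""
    for name in order:
        settings = config.get(name, {})
        if not settings.get("enabled", False):
            continue  # step excluded by configuration
        reads = STEPS[name](reads, **settings.get("params", {}))
    return reads

reads = ["ACGTACGT", "ACG", "ACGTACGT", "TTTTTT"]
result = run_pipeline(reads, CONFIG)

# Enabling a step is a configuration change only.
dedup_config = {**CONFIG, "deduplicate": {"enabled": True, "params": {}}}
with_dedup = run_pipeline(reads, dedup_config)
```

With the first configuration only the short read is removed. With `deduplicate` enabled the repeated read is also collapsed, although neither step function was edited.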
Fig. 1
Schematic overview of the different pre-configured pipelines available in NGS-pipe
Quality control and statistics. NGS-pipe supports quality control and provides statistics on each step of the analysis. Users can assess the quality of each sequencing file in the output of FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or Qualimap2 (Okonechnikov et al.), and inspect basic statistics such as how many reads passed the individual analysis steps.
Performance and scalability. With NGS-pipe, samples can be analyzed independently of each other, providing full parallelization. For instance, we analyzed WES data from a tumor and matched normal sample comprising 60 million paired-end reads in 20 h, and 10 such pairs in 22 h, on a compute cluster [HPE ProLiant BL460c Gen9, two 12-core Intel Xeon E5-2680v3 processors (2.5–3.3 GHz)], where the two-hour overhead is due to waiting times of the local batch queuing system. Similarly, one RNA-seq dataset consisting of 80 million single-end reads was analyzed in 2.5 h, and 10 such datasets in 3 h.
Reproducibility, documentation and error handling. A high level of automation, clear documentation of the pipeline and strict error handling facilitate reproducibility, a major goal of NGS-pipe. All parameters for all tools included in the analysis of an NGS experiment are documented in a configuration file. Snakemake functionality provides several additional layers of documentation within NGS-pipe, e.g. logging the executed commands and generating graphical representations of the workflows. As NGS-pipe has been designed to analyze a large number of datasets in parallel, automated error handling is a fundamental requirement. If one of the steps of the pipeline fails and produces incomplete or no results, the computation of all dependent steps is halted and an error message is issued, using Snakemake intrinsics. After the issue is resolved, the pipeline resumes the analysis without repeating the steps that already completed.
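The error-handling behavior described above (dependent steps halted on failure, analysis resumed after the fix) can be sketched in plain Python. This is a simplified model under stated assumptions, not Snakemake's or NGS-pipe's implementation. The step names, the `DEPENDENCIES` graph and `run_workflow` are hypothetical. The sketch uses the standard-library `graphlib` module (Python 3.9 or later) to order the steps.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical dependency graph: step -> steps it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "align": [],
    "qc": [],
    "call_variants": ["align"],
    "annotate": ["call_variants"],
}

def run_workflow(actions: Dict[str, Callable[[], None]],
                 done: Set[str]) -> Dict[str, str]:
    """Run steps in dependency order and report a status per step.

    Steps already in `done` are not repeated ("cached"). A step whose
    dependency failed or was skipped is itself skipped.
    """
    status: Dict[str, str] = {}
    for step in TopologicalSorter(DEPENDENCIES).static_order():
        if step in done:
            status[step] = "cached"
            continue
        if any(status[dep] in ("failed", "skipped")
               for dep in DEPENDENCIES[step]):
            status[step] = "skipped"  # halted: an upstream step did not finish
            continue
        try:
            actions[step]()
        except Exception as err:
            print(f"error in step {step!r}: {err}")
            status[step] = "failed"
            continue
        done.add(step)
        status[step] = "ok"
    return status

# Simulate an aligner that crashes until the problem is fixed.
broken = {"value": True}

def align() -> None:
    if broken["value"]:
        raise RuntimeError("aligner crashed")

actions = {
    "align": align,
    "qc": lambda: None,
    "call_variants": lambda: None,
    "annotate": lambda: None,
}

done: Set[str] = set()
first = run_workflow(actions, done)    # align fails, downstream steps skipped
broken["value"] = False                # the issue is resolved
second = run_workflow(actions, done)   # only unfinished steps are run
```

In the first run the independent `qc` step still completes. In the second run it is reported as cached and only the failed step and the steps that depend on it are executed.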
3 Conclusion
NGS has become a standard genomics method in research labs and is currently being implemented in clinical settings to aid patient diagnostics and treatment. NGS-pipe provides a Snakemake-based framework for analyzing such NGS data in a transparent and reproducible manner. The pre-configured workflows are easy to extend and adapt, which broadens the range of possible applications beyond cancer genomics.
Funding
This work was supported by the European Research Council [ERC Synergy Grant No. 609883]; SystemsX.ch [RTD Grant 2013/150, IPhD Grant SXPHI0_142005 and SyBIT]; the Swiss Cancer League [KLS-2892-02-2012]; and the Swiss National Science Foundation [Ambizione grant number PZ00P3_168165 to S.P.].
Conflict of Interest: none declared.
Authors: Nazneen Aziz; Qin Zhao; Lynn Bry; Denise K Driscoll; Birgit Funke; Jane S Gibson; Wayne W Grody; Madhuri R Hegde; Gerald A Hoeltge; Debra G B Leonard; Jason D Merker; Rakesh Nagarajan; Linda A Palicki; Ryan S Robetorye; Iris Schrijver; Karen E Weck; Karl V Voelkerding Journal: Arch Pathol Lab Med Date: 2014-08-25 Impact factor: 5.534
Authors: Ariane L Hofmann; Jonas Behr; Jochen Singer; Jack Kuipers; Christian Beisel; Peter Schraml; Holger Moch; Niko Beerenwinkel Journal: BMC Bioinformatics Date: 2017-01-03 Impact factor: 3.169
Authors: Robert E Denroche; Laura Mullen; Lee Timms; Timothy Beck; Christina K Yung; Lincoln Stein; John D McPherson; Andrew M K Brown Journal: BMC Res Notes Date: 2015-12-26
Authors: Tyler S Alioto; Ivo Buchhalter; Sophia Derdak; Barbara Hutter; Matthew D Eldridge; Eivind Hovig; Lawrence E Heisler; Timothy A Beck; Jared T Simpson; Laurie Tonon; Anne-Sophie Sertier; Ann-Marie Patch; Natalie Jäger; Philip Ginsbach; Ruben Drews; Nagarajan Paramasivam; Rolf Kabbe; Sasithorn Chotewutmontri; Nicolle Diessl; Christopher Previti; Sabine Schmidt; Benedikt Brors; Lars Feuerbach; Michael Heinold; Susanne Gröbner; Andrey Korshunov; Patrick S Tarpey; Adam P Butler; Jonathan Hinton; David Jones; Andrew Menzies; Keiran Raine; Rebecca Shepherd; Lucy Stebbings; Jon W Teague; Paolo Ribeca; Francesc Castro Giner; Sergi Beltran; Emanuele Raineri; Marc Dabad; Simon C Heath; Marta Gut; Robert E Denroche; Nicholas J Harding; Takafumi N Yamaguchi; Akihiro Fujimoto; Hidewaki Nakagawa; Víctor Quesada; Rafael Valdés-Mas; Sigve Nakken; Daniel Vodák; Lawrence Bower; Andrew G Lynch; Charlotte L Anderson; Nicola Waddell; John V Pearson; Sean M Grimmond; Myron Peto; Paul Spellman; Minghui He; Cyriac Kandoth; Semin Lee; John Zhang; Louis Létourneau; Singer Ma; Sahil Seth; David Torrents; Liu Xi; David A Wheeler; Carlos López-Otín; Elías Campo; Peter J Campbell; Paul C Boutros; Xose S Puente; Daniela S Gerhard; Stefan M Pfister; John D McPherson; Thomas J Hudson; Matthias Schlesner; Peter Lichter; Roland Eils; David T W Jones; Ivo G Gut Journal: Nat Commun Date: 2015-12-09 Impact factor: 14.919