
NGS-pipe: a flexible, easily extendable and highly configurable framework for NGS analysis.

Jochen Singer^1,2, Hans-Joachim Ruscheweyh^1,2,3, Ariane L Hofmann^1,2, Thomas Thurnherr^1,2, Franziska Singer^2,4, Nora C Toussaint^2,4, Charlotte K Y Ng^5,6,7, Salvatore Piscuoglio^8, Christian Beisel^1, Gerhard Christofori^5, Reinhard Dummer^9, Michael N Hall^10, Wilhelm Krek^11, Mitchell P Levesque^9, Markus G Manz^12, Holger Moch^13, Andreas Papassotiropoulos^14,15,16,17, Daniel J Stekhoven^2,4, Peter Wild^13, Thomas Wüst^2,3, Bernd Rinn^2,3, Niko Beerenwinkel^1,2.

Abstract

Motivation: Next-generation sequencing is now an established method in genomics, and massive amounts of sequencing data are being generated on a regular basis. Analysis of the sequencing data is typically performed by lab-specific in-house solutions, but agreement between results from different facilities is often poor. General standards for quality control, reproducibility and documentation are missing.
Results: We developed NGS-pipe, a flexible, transparent and easy-to-use framework for the design of pipelines to analyze whole-exome, whole-genome and transcriptome sequencing data. NGS-pipe facilitates the harmonization of genomic data analysis by supporting quality control, documentation, reproducibility, parallelization and easy adaptation to other NGS experiments.
Availability and implementation: https://github.com/cbg-ethz/NGS-pipe.
Contact: niko.beerenwinkel@bsse.ethz.ch.

Year: 2018 PMID: 28968639 PMCID: PMC5870795 DOI: 10.1093/bioinformatics/btx540

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Advances in next-generation sequencing (NGS) have led to technologies capable of producing massive amounts of data at low cost. However, the analysis of these data is usually carried out using lab-specific in-house solutions. As a consequence, many different workflows are implemented for the same type of data, such that results are not easily comparable and are often hard to reproduce. Several studies have shown that individual pipelines often have limited overlap in their results (Alioto et al.; Denroche et al.; Hofmann et al.), which impedes both the identification of true biological signals and clinical applications. The developers of the Genome Analysis Toolkit attempt to stratify genome analysis by providing best practices (https://software.broadinstitute.org/gatk/best-practices/), but these recommendations are currently not fully implemented computationally.

Here, we introduce NGS-pipe, an automated and user-friendly framework for the design of pipelines for the analysis of large-scale sequencing data, such as cancer genomics data. NGS-pipe allows users to easily develop tailored workflows for the analysis of whole-exome (WES), whole-genome (WGS) and transcriptome (RNA-seq) sequencing data by providing building blocks to execute state-of-the-art tools, as well as appropriate error handling. An important goal of NGS-pipe is to overcome the common lack of automated procedures to ensure reproducibility. This is particularly important for clinical applications, where well-documented and standardized protocols are a requirement (Aziz et al.).

2 Features of NGS-pipe

NGS-pipe incorporates tools for detecting single nucleotide variants (SNVs), insertions and deletions (indels) and copy number variants (CNVs), as well as for estimating gene expression levels. In addition to the primary read data analysis, NGS-pipe also generates runtime statistics and quality reports. It can be launched on a single computer or on a cluster, where independent steps are executed in parallel. A practical introduction and examples can be found in the GitHub repository.

Modularity. NGS-pipe is implemented using the workflow management system Snakemake (Köster and Rahmann, 2012). It has a modular backbone in which the execution of each analysis step is controlled by a rule, which makes NGS-pipe a flexible, easily extendable and highly configurable framework for NGS analysis. By modifying a configuration file, users can adjust the parameters of each rule without changing its implementation, and can include or exclude complete analysis steps in order to adapt the pre-configured workflows to the specific needs of their own experiment.

Workflows for WES, WGS and RNA-seq data. To illustrate NGS-pipe, we have implemented and tested predefined workflows for the automated analysis of cancer WES, WGS and RNA-seq data (Fig. 1) to assist users inexperienced in data analysis or pipeline design. A description of these workflows, including the computational tools they integrate, can be found in the GitHub repository. Similar workflows can be implemented using NGS-pipe for other NGS applications.
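The configuration-driven design described under Modularity can be illustrated with a minimal Python sketch. It is not taken from the NGS-pipe repository: the step names, the `enabled` flag, the parameter names and the command templates (`trim_tool`, `align_tool`, `cnv_tool`) are all hypothetical, and a real Snakemake workflow would express the same idea with rules and a configuration file rather than plain dictionaries.

```python
# Hypothetical configuration: step names, flags and parameters are
# illustrative assumptions, not the actual NGS-pipe configuration schema.
CONFIG = {
    "trimming": {"enabled": True, "params": {"min_length": 36}},
    "alignment": {"enabled": True, "params": {"threads": 8}},
    "cnv_calling": {"enabled": False, "params": {}},
}

# Each "rule" is a fixed command template (tool names are placeholders).
# The templates never change; only the configuration supplies the values.
RULES = {
    "trimming": "trim_tool --min-length {min_length} {sample}.fastq",
    "alignment": "align_tool --threads {threads} {sample}.trimmed.fastq",
    "cnv_calling": "cnv_tool {sample}.bam",
}


def build_commands(config, rules, sample):
    """Return the commands for all enabled steps, in rule order."""
    commands = []
    for step, template in rules.items():
        settings = config.get(step, {})
        # A step that is missing from the config or disabled is skipped.
        if not settings.get("enabled", False):
            continue
        params = settings.get("params", {})
        commands.append(template.format(sample=sample, **params))
    return commands


if __name__ == "__main__":
    for command in build_commands(CONFIG, RULES, "tumor"):
        print(command)
```

In this sketch, switching `cnv_calling` to `enabled: True` or changing `threads` alters the generated commands without touching the rule templates, which mirrors the separation between rule implementation and configuration that the text describes.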

Fig. 1. Schematic overview of the different pre-configured pipelines available in NGS-pipe

Quality control and statistics. NGS-pipe supports quality control and provides statistics on each step of the analysis. Users can assess the quality of each sequencing file in the output of FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or Qualimap2 (Okonechnikov et al.), and inspect basic statistics such as how many reads passed the individual analysis steps.

Performance and scalability. With NGS-pipe, samples can be analyzed independently of each other, providing full parallelization. For instance, we analyzed WES data from a tumor and matched normal sample comprising 60 million paired-end reads in 20 h, and 10 such pairs in 22 h, on a compute cluster [HPE ProLiant BL460c Gen9 with two 12-core Intel Xeon E5-2680v3 processors (2.5–3.3 GHz)], where the two-hour overhead is due to waiting times of the local batch queuing system. Similarly, a single RNA-seq dataset of 80 million single-end reads was analyzed in 2.5 h, and 10 such datasets in 3 h.

Reproducibility, documentation and error handling. A high level of automation, clear documentation of the pipeline and strict error handling facilitate reproducibility, a major goal of NGS-pipe. All parameters for all tools included in the analysis of an NGS experiment are documented in a configuration file. Snakemake functionality provides several additional layers of documentation within NGS-pipe, e.g. logging the executed commands and generating graphical representations of the workflows. As NGS-pipe has been designed to analyze a large number of datasets in parallel, automated error handling is a fundamental requirement. If one of the steps of the pipeline fails and produces incomplete or no results, the computation of all dependent steps is halted and an error message is raised, using Snakemake intrinsics. After the issue is resolved, the pipeline independently resumes the analysis.
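The halt-and-resume behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not Snakemake's implementation: Snakemake tracks input and output files, whereas the hypothetical `run_pipeline` function below tracks completed step names in a set, and the step names in the usage example are invented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(dependencies, actions, done=None):
    """Run steps in dependency order and halt everything downstream of a failure.

    dependencies: maps each step to the set of steps it depends on.
    actions: maps each step to a zero-argument callable that raises on failure.
    done: steps completed in an earlier run; these are not executed again.
    Returns three sets: (done, failed, halted).
    """
    done = set(done or ())
    failed, halted = set(), set()
    for step in TopologicalSorter(dependencies).static_order():
        if step in done:
            continue  # resume: skip work that already succeeded
        upstream = dependencies.get(step, ())
        if any(dep in failed or dep in halted for dep in upstream):
            halted.add(step)  # an upstream step failed, so do not run this one
            continue
        try:
            actions[step]()
            done.add(step)
        except Exception as err:
            print(f"step {step!r} failed: {err}")
            failed.add(step)
    return done, failed, halted


if __name__ == "__main__":
    # Hypothetical four-step workflow in which alignment fails on the first run.
    state = {"align_broken": True}

    def align():
        if state["align_broken"]:
            raise RuntimeError("truncated input file")

    deps = {"trim": set(), "align": {"trim"}, "call": {"align"}, "qc": {"trim"}}
    acts = {"trim": lambda: None, "align": align, "call": lambda: None, "qc": lambda: None}

    done, failed, halted = run_pipeline(deps, acts)
    print(sorted(done), sorted(failed), sorted(halted))

    state["align_broken"] = False  # the issue is resolved
    done, failed, halted = run_pipeline(deps, acts, done)
    print(sorted(done), sorted(failed), sorted(halted))
```

Steps that do not depend on the failed one (here `qc`) still complete, and the second call executes only the steps that did not finish in the first run.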

3 Conclusion

NGS has become a standard genomics method in research labs and is currently being implemented in clinical settings to aid patient diagnostics and treatment. NGS-pipe provides a Snakemake-based framework for analyzing such NGS data in a transparent and reproducible manner. The pre-configured workflows are easy to extend and adapt, which broadens the range of possible applications beyond cancer genomics.

Funding

This work was supported by the European Research Council [ERC Synergy Grant No. 609883]; SystemsX.ch [RTD Grant 2013/150, IPhD Grant SXPHI0_142005 and SyBIT]; the Swiss Cancer League [KLS-2892-02-2012]; and the Swiss National Science Foundation [Ambizione grant number PZ00P3_168165 to S.P.].

Conflict of Interest: none declared.

6 in total

1. Snakemake--a scalable bioinformatics workflow engine.

Authors: Johannes Köster; Sven Rahmann
Journal: Bioinformatics Date: 2012-08-20 Impact factor: 6.937

2. College of American Pathologists' laboratory standards for next-generation sequencing clinical tests.

Authors: Nazneen Aziz; Qin Zhao; Lynn Bry; Denise K Driscoll; Birgit Funke; Jane S Gibson; Wayne W Grody; Madhuri R Hegde; Gerald A Hoeltge; Debra G B Leonard; Jason D Merker; Rakesh Nagarajan; Linda A Palicki; Ryan S Robetorye; Iris Schrijver; Karen E Weck; Karl V Voelkerding
Journal: Arch Pathol Lab Med Date: 2014-08-25 Impact factor: 5.534

3. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data.

Authors: Konstantin Okonechnikov; Ana Conesa; Fernando García-Alcalde
Journal: Bioinformatics Date: 2015-10-01 Impact factor: 6.937

4. Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers.

Authors: Ariane L Hofmann; Jonas Behr; Jochen Singer; Jack Kuipers; Christian Beisel; Peter Schraml; Holger Moch; Niko Beerenwinkel
Journal: BMC Bioinformatics Date: 2017-01-03 Impact factor: 3.169

5. A cancer cell-line titration series for evaluating somatic classification.

Authors: Robert E Denroche; Laura Mullen; Lee Timms; Timothy Beck; Christina K Yung; Lincoln Stein; John D McPherson; Andrew M K Brown
Journal: BMC Res Notes Date: 2015-12-26

6. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing.

Authors: Tyler S Alioto; Ivo Buchhalter; Sophia Derdak; Barbara Hutter; Matthew D Eldridge; Eivind Hovig; Lawrence E Heisler; Timothy A Beck; Jared T Simpson; Laurie Tonon; Anne-Sophie Sertier; Ann-Marie Patch; Natalie Jäger; Philip Ginsbach; Ruben Drews; Nagarajan Paramasivam; Rolf Kabbe; Sasithorn Chotewutmontri; Nicolle Diessl; Christopher Previti; Sabine Schmidt; Benedikt Brors; Lars Feuerbach; Michael Heinold; Susanne Gröbner; Andrey Korshunov; Patrick S Tarpey; Adam P Butler; Jonathan Hinton; David Jones; Andrew Menzies; Keiran Raine; Rebecca Shepherd; Lucy Stebbings; Jon W Teague; Paolo Ribeca; Francesc Castro Giner; Sergi Beltran; Emanuele Raineri; Marc Dabad; Simon C Heath; Marta Gut; Robert E Denroche; Nicholas J Harding; Takafumi N Yamaguchi; Akihiro Fujimoto; Hidewaki Nakagawa; Víctor Quesada; Rafael Valdés-Mas; Sigve Nakken; Daniel Vodák; Lawrence Bower; Andrew G Lynch; Charlotte L Anderson; Nicola Waddell; John V Pearson; Sean M Grimmond; Myron Peto; Paul Spellman; Minghui He; Cyriac Kandoth; Semin Lee; John Zhang; Louis Létourneau; Singer Ma; Sahil Seth; David Torrents; Liu Xi; David A Wheeler; Carlos López-Otín; Elías Campo; Peter J Campbell; Paul C Boutros; Xose S Puente; Daniela S Gerhard; Stefan M Pfister; John D McPherson; Thomas J Hudson; Matthias Schlesner; Peter Lichter; Roland Eils; David T W Jones; Ivo G Gut
Journal: Nat Commun Date: 2015-12-09 Impact factor: 14.919

