NGSANE: a lightweight production informatics framework for high-throughput data analysis.

Fabian A Buske, Hugh J French, Martin A Smith, Susan J Clark, Denis C Bauer.

Abstract

SUMMARY: The initial steps in the analysis of next-generation sequencing data can be automated by way of software 'pipelines'. However, individual components depreciate rapidly because of the evolving technology and analysis methods, often rendering entire versions of production informatics pipelines obsolete. Constructing pipelines from Linux bash commands enables the use of hot swappable modular components as opposed to the more rigid program call wrapping by higher level languages, as implemented in comparable published pipelining systems. Here we present Next Generation Sequencing ANalysis for Enterprises (NGSANE), a Linux-based, high-performance-computing-enabled framework that minimizes overhead for set up and processing of new projects, yet maintains full flexibility of custom scripting when processing raw sequence data.
AVAILABILITY AND IMPLEMENTATION: NGSANE is implemented in bash and publicly available under the BSD (3-Clause) licence via GitHub at https://github.com/BauerLab/ngsane.
CONTACT: Denis.Bauer@csiro.au
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2014 PMID: 24470576 PMCID: PMC4016703 DOI: 10.1093/bioinformatics/btu036

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The initial steps of analyzing next-generation sequencing (NGS) data can be automated in standardized pipelines, e.g. the many steps in SNP calling and RNA-Seq analysis (Anders et al.). This is critical: further decreasing sequencing costs and the expanding use of replicates to assess biological variability (Auer and Doerge, 2010) will substantially increase future study sizes, making the automated, documented and reproducible processing of large numbers of samples across diverse projects on high-performance computing (HPC) clusters paramount. Yet, because of constantly evolving technology, software and application areas, maintaining such production informatics pipelines can be labor intensive.

To address this issue, several software packages have been published in recent years. However, currently available tools are either web-based services, e.g. Galaxy (Goecks et al.), where even API-based access to the web service functionality is not readily amenable to production-scale analysis practices, or heavyweight frameworks written in user-friendly languages, such as snakemake and nestly (Python) (Köster and Rahmann, 2012; McCoy et al.), GATK’s Queue (Scala; https://github.com/broadgsa/gatk/) or Bpipe (Groovy) (Sadedin et al.), which encapsulate the actual program call in wrapper-script-specific syntax, hindering the development of pipeline extensions.

NGSANE is a lightweight, Linux-based, HPC-enabled framework that minimizes overhead for set up and processing of new projects, yet maintains full flexibility of custom scripting for processing raw sequence data. NGSANE allows end users and developers to construct pipelines from call statements that can be tested directly on the command line without syntax alterations or wrapper script involvement. This provides flexibility in software usage, a substantial advantage when analysis pipelines are constantly revised as new algorithms are developed. We describe NGSANE’s aims below.

Data security and reusability. The framework separates project-specific files from reference data, scripts and software suites that are reusable in other projects (Fig. 1a). Access to confidential data is handled transparently via the underlying Linux permission system. Communication between projects and the framework is handled by a project-specific configuration file that defines paths to reference data as well as the analysis tasks to perform. NGSANE supports systems with hierarchical storage management, specifically Data Migration Facility, by ensuring files are online when needed.
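As a minimal sketch of what such a project-specific configuration file could look like, assuming it is a plain shell file that the framework sources: the variable names (`PROJECT_DIR`, `REFERENCE_FASTA`, `RUN_TASKS`), the paths and the `list_tasks` helper are hypothetical illustrations, not NGSANE’s actual settings.

```shell
# Hypothetical project configuration; all names and paths are illustrative
# assumptions, not NGSANE's actual variables.
CONFIG_FILE="$(mktemp)"
cat > "$CONFIG_FILE" <<'EOF'
# Project-specific data lives under the project directory.
PROJECT_DIR="/data/projects/example_project"
# Reference data is shared and reusable across projects.
REFERENCE_FASTA="/data/reference/genome.fasta"
# Analysis tasks to perform, in order.
RUN_TASKS="trim map call_variants"
EOF

# Source the configuration and print one line per requested task.
list_tasks() {
    . "$1"
    for task in $RUN_TASKS; do
        echo "task=$task project=$PROJECT_DIR reference=$REFERENCE_FASTA"
    done
}

list_tasks "$CONFIG_FILE"
```

Because the configuration is ordinary shell, the same file can be sourced interactively to inspect exactly what a run would use.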

Fig. 1.

(a) Separation of project data from NGSANE core. (b) Workflow of NGSANE. (c) Example of automatically created project summary

HPC and parallel execution. NGSANE supports Sun Grid Engine and Portable Batch System job scheduling and can be operated in different modes for development and production, thus enabling flexible processing of NGS data. HPC job partitioning and submission are independent of the program calls, so new technologies (e.g. Hadoop) can be incorporated.

Hot swapping and adaptability. Individual task blocks (e.g. read mapping) are packaged in bash script modules, which can be executed locally or on data subsets to test module code, submission parameters and the compute node environment in stages. During production, NGSANE automatically submits separate module calls for each input file or set of files to the HPC queue. Different existing modules, parameter settings or software versions can therefore be selected by changing the project-specific configuration file rather than the software code (hot swapping).

Reproducibility and checkpoint recovery. A full audit trail is generated, recording the tasks performed, reference data used, timestamps and software versions, as well as HPC log files, including any errors. NGSANE gracefully recovers from unsuccessfully executed jobs, whether caused by failed commands, missing or incorrect input or under-resourced HPC jobs, by cleanly restarting after the most recent successfully executed checkpoint.

Robust execution and full monitoring. In our experience, modular workflows are executed in stages with optional human quality control; NGSANE hence focuses on providing robust checkpointing and intuitive report generation (Fig. 1b). However, workflows can be fully automated by using NGSANE’s control over HPC queuing systems and by leveraging the customizable interfaces between modules when submitting multiple dependent stages at once.

Automated project summary creation. NGSANE generates a high-level summary (Project Card, Fig. 1b and c) to enable informed decisions about the experimental success. This interactive HTML report provides an access point for new lab members or collaborators. Furthermore, the Project Card can be used as a gold standard for software development when using a continuous integration server.

Complete customization. NGSANE’s configuration file contains details about the submission system, typical HPC resource allocations and the location of third-party software. However, NGSANE’s credo is that every parameter can be overwritten; hence, default parameters can be adjusted in the project-specific configuration file to indicate different software versions, additional resources or an altered output location. Additional parameters, such as a specific HPC queue or new parameters in a software release, can be provided to each program via a special ‘free form’ variable in the configuration file.

Repeated calls. As stated by McCoy et al., pipelines often have to be rerun on the full data or a subset of it, possibly with altered parameter settings. NGSANE facilitates and documents this by allowing multiple (automatically created) configuration files.

Knowledge transfer. NGSANE provides a unified framework (i.e. folder structure) for processing data from different experimental protocols. This allows co-investigators and reviewers to easily understand and reproduce work from NGSANE’s log and report files.

NGSANE is open source and available via GitHub. Currently implemented workflows include those for adapter trimming, read mapping, peak calling, motif discovery, transcript assembly, variant calling and chromatin conformation analysis. These workflows use publicly available published software, yet allow end users to add their own code and create new workflows as required.
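The checkpoint recovery described above can be sketched in a few lines of shell. This is an illustration under stated assumptions, not NGSANE’s implementation: the `run_step` helper, the `CHECKPOINT_DIR` variable and the `.done` marker-file layout are all hypothetical.

```shell
# Hypothetical checkpoint helper; the marker-file layout is an assumption.
CHECKPOINT_DIR="$(mktemp -d)"

# run_step NAME COMMAND...: run COMMAND unless NAME has already completed.
run_step() {
    step="$1"
    shift
    if [ -e "$CHECKPOINT_DIR/$step.done" ]; then
        echo "skip $step"
        return 0
    fi
    if "$@"; then
        # Record success only after the command finished without error.
        touch "$CHECKPOINT_DIR/$step.done"
        echo "done $step"
    else
        echo "failed $step"
        return 1
    fi
}

run_step trim true            # succeeds, checkpoint written
run_step map false || true    # fails, no checkpoint written
# A rerun skips the completed step and retries only the failed one.
run_step trim true
run_step map true
```

Here `true` and `false` stand in for real program calls. Because the marker is written only after a command succeeds, a rerun resumes after the most recent successful step instead of starting over.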
NGSANE is also available as an Amazon Machine Image and can be deployed to the Amazon Elastic Compute Cloud (EC2) using StarCluster, allowing on-demand processing of samples without software installation or HPC maintenance.
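The principle that every default can be overwritten from the project configuration, including through a free-form variable, might look as follows in shell. The tool name `mapper-1.0`, its flags and the variables `MAPPER`, `THREADS` and `MAPPER_ADDPARAM` are hypothetical placeholders chosen for this sketch; they are not taken from NGSANE.

```shell
# Hypothetical defaults and overrides; tool, flags and variable names are
# illustrative assumptions, not NGSANE's actual settings.
unset MAPPER THREADS MAPPER_ADDPARAM

# build_command INPUT: print the call a mapping module would issue.
# Framework defaults apply unless the project configuration set a variable.
build_command() {
    mapper="${MAPPER:-mapper-1.0}"
    threads="${THREADS:-4}"
    # Free-form variable: passed through to the program unchanged.
    extra="${MAPPER_ADDPARAM:-}"
    # tr squeezes the double space left when no extra parameters are set.
    echo "$mapper --threads $threads $extra $1" | tr -s ' '
}

# With no project overrides, the defaults are used.
build_command sample.fastq

# A project configuration swaps the software version, raises the thread
# count and passes an additional parameter without touching module code.
MAPPER="mapper-2.0"
THREADS=16
MAPPER_ADDPARAM="--seed 42"
build_command sample.fastq
```

The module code stays the same in both calls; only the variables change, which is what allows software versions to be swapped from the configuration alone.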

REFERENCES

1. Bpipe: a tool for running and managing bioinformatics pipelines.

Authors: Simon P Sadedin; Bernard Pope; Alicia Oshlack
Journal: Bioinformatics Date: 2012-04-12 Impact factor: 6.937

2. Statistical design and analysis of RNA sequencing data.

Authors: Paul L Auer; R W Doerge
Journal: Genetics Date: 2010-05-03 Impact factor: 4.562

3. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor.

Authors: Simon Anders; Davis J McCarthy; Yunshun Chen; Michal Okoniewski; Gordon K Smyth; Wolfgang Huber; Mark D Robinson
Journal: Nat Protoc Date: 2013-08-22 Impact factor: 13.491

4. Nestly--a framework for running software with nested parameter choices and aggregating results.

Authors: Connor O McCoy; Aaron Gallagher; Noah G Hoffman; Frederick A Matsen
Journal: Bioinformatics Date: 2012-12-06 Impact factor: 6.937

5. Snakemake--a scalable bioinformatics workflow engine.

Authors: Johannes Köster; Sven Rahmann
Journal: Bioinformatics Date: 2012-08-20 Impact factor: 6.937

6. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583
