Baekdoo Kim, Thahmina Ali, Carlos Lijeron, Enis Afgan, Konstantinos Krampis.
Abstract
Processing of next-generation sequencing (NGS) data requires significant technical skills, involving installation, configuration, and execution of bioinformatics data pipelines, in addition to specialized postanalysis visualization and data mining software. To address some of these challenges, developers have leveraged virtualization containers for seamless deployment of preconfigured bioinformatics software and pipelines on any computational platform. We present an approach for abstracting the complex data operations of multistep bioinformatics pipelines for NGS data analysis. As examples, we have deployed 2 pipelines, for RNA sequencing and chromatin immunoprecipitation sequencing, preconfigured within Docker virtualization containers that we call Bio-Docklets. Each Bio-Docklet exposes a single data input and output endpoint, so that from a user perspective running a pipeline is as simple as running a single bioinformatics tool. This is achieved using a "meta-script" that automatically starts the Bio-Docklets and controls the pipeline execution through the BioBlend software library and the Galaxy Application Programming Interface. The pipeline output is postprocessed by integration with the Visual Omics Explorer framework, providing interactive data visualizations that users can access through a web browser. Our goal is to enable easy access to NGS data analysis pipelines for nonbioinformatics experts on any computing environment, whether a laboratory workstation, a university computer cluster, or a cloud service provider. Beyond end users, Bio-Docklets also enable developers to programmatically deploy and run a large number of pipeline instances for concurrent analysis of multiple datasets.
Keywords: CHIPseq; NGS; RNAseq; bioinformatics; docker
Year: 2017 PMID: 28854616 PMCID: PMC5569920 DOI: 10.1093/gigascience/gix048
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Benchmark run times of the Bio-Docklet pipeline containers with the CHIPseq and RNAseq pipelines, using as input large-scale NGS data downloaded from public databases
| | CHIP-seq (total: 31 GB) | RNAseq (total: 43 GB) |
|---|---|---|
| Dataset location | • | • |
| Dataset details | • ERR411994.fastq: 192 465 714 single-end reads | • SRR1797219_1.fastq: 47 209 075 forward reads, cancer cells |
| Running times (HH:MM:SS) | | |
| Lab server | 7:16:34 | 20:10:38 |
| AWS | 6:09:16 | 16:50:11 |
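
The run times in the table can be compared more easily once converted to seconds. The following is a minimal sketch, not part of the original work: it copies the four run times and the two total input sizes from the table, and the function names (`to_seconds`, `summarize`) are illustrative choices. It computes the ratio of lab-server time to AWS time and a rough AWS throughput in GB per hour.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS or HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Run times and total input sizes copied from the benchmark table.
BENCHMARKS = {
    "CHIP-seq": {"size_gb": 31, "Lab server": "7:16:34", "AWS": "6:09:16"},
    "RNAseq": {"size_gb": 43, "Lab server": "20:10:38", "AWS": "16:50:11"},
}


def summarize(benchmarks: dict) -> dict:
    """Return per-pipeline run times in seconds plus two derived figures."""
    rows = {}
    for pipeline, data in benchmarks.items():
        lab = to_seconds(data["Lab server"])
        aws = to_seconds(data["AWS"])
        rows[pipeline] = {
            "lab_seconds": lab,
            "aws_seconds": aws,
            # How many times longer the lab server took than AWS.
            "lab_over_aws": lab / aws,
            # Input size divided by AWS wall-clock time in hours.
            "aws_gb_per_hour": data["size_gb"] / (aws / 3600),
        }
    return rows


if __name__ == "__main__":
    for pipeline, row in summarize(BENCHMARKS).items():
        print(
            f"{pipeline}: lab/AWS = {row['lab_over_aws']:.2f}, "
            f"AWS throughput = {row['aws_gb_per_hour']:.2f} GB/h"
        )
```

With the values in the table, the lab-server runs take about 1.18 times (CHIP-seq) and 1.20 times (RNAseq) as long as the AWS runs.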
Figure 1: The Bio-Docklets environment with an (a) interactive meta-script that enables users to start the pipelines (b), select analysis parameters (c), and set input (d) and output (e) directories. Shell scripts and Python code were used for connecting to the Galaxy API, retrieving required data such as reference genomes, initializing environment variables in the containers, and starting and monitoring the pipeline execution (f). The pipeline output is postprocessed and loaded into Visual Omics Explorer interactive visualizations, which are saved as HTML/JavaScript files that can be opened in a web browser at any time after pipeline completion and container shutdown. Using the visualizations, the output can be mined for clusters of differentially expressed genes or histone interaction peaks, and users can export the graphics in vectorized SVG format for use in manuscripts.
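
The Galaxy API steps named in the Figure 1 caption (connecting, starting, and monitoring a pipeline) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' meta-script: the workflow name, the input step index `"0"`, the polling interval, and the helper names `connect` and `run_pipeline` are all hypothetical. The BioBlend calls used are `GalaxyInstance`, `histories.create_history`, `tools.upload_file`, `workflows.get_workflows`, `workflows.invoke_workflow`, and `histories.show_history`.

```python
import time


def connect(url: str, api_key: str):
    """Open a Galaxy API session (BioBlend is imported lazily)."""
    from bioblend.galaxy import GalaxyInstance

    return GalaxyInstance(url=url, key=api_key)


def run_pipeline(gi, workflow_name, fastq_path, poll_seconds=30, max_polls=10000):
    """Upload one FASTQ file, invoke a named workflow, and wait for it.

    Assumptions: the workflow is already installed in the Galaxy server
    and its first input step (index "0") takes the uploaded dataset.
    """
    history = gi.histories.create_history(name=f"{workflow_name} run")
    upload = gi.tools.upload_file(fastq_path, history["id"])
    dataset_id = upload["outputs"][0]["id"]

    matches = gi.workflows.get_workflows(name=workflow_name)
    if not matches:
        raise LookupError(f"no workflow named {workflow_name!r}")
    workflow_id = matches[0]["id"]

    # Map the workflow's input step to the uploaded history dataset (hda).
    inputs = {"0": {"src": "hda", "id": dataset_id}}
    gi.workflows.invoke_workflow(workflow_id, inputs=inputs, history_id=history["id"])

    # Simplified monitoring: poll the history state until it settles.
    # A production script would likely track individual jobs as well.
    for _ in range(max_polls):
        state = gi.histories.show_history(history["id"])["state"]
        if state in ("ok", "error"):
            return {"history_id": history["id"], "state": state}
        time.sleep(poll_seconds)
    raise TimeoutError("pipeline did not finish within the polling budget")
```

A call might look like `run_pipeline(connect("http://localhost:8080", "API_KEY"), "RNAseq", "reads.fastq")`, where the URL, key, workflow name, and file name are placeholders.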
Figure 2: (a) Galaxy workflow canvas running inside the Bio-Docklets, with the composed RNAseq and CHIPseq pipelines, respectively (b). User interface of the “meta-script” interactively guides the users to select which pipeline to run, input and output file directories, and reference genome indices (c, d). Postprocessed pipeline output, loaded on interactive HTML/JavaScript-D3 visualizations using the Visual Omics Explorer framework, can be opened in a web browser and also exported as high-resolution, manuscript-ready graphics.
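
The selections the meta-script collects (pipeline, input directory, output directory) map naturally onto a `docker run` command with bind-mounted directories. The sketch below only builds such commands and does not execute them. The image names, the container paths `/data/input` and `/data/output`, the `results` subdirectory, and the function names are hypothetical placeholders, not the published Bio-Docklet configuration.

```python
import os

# Hypothetical image names; the real Bio-Docklet images may differ.
IMAGES = {
    "rnaseq": "example/bio-docklet-rnaseq",
    "chipseq": "example/bio-docklet-chipseq",
}


def docker_run_command(pipeline, input_dir, output_dir, name=None):
    """Build a `docker run` argument list for one pipeline container."""
    cmd = ["docker", "run", "--rm", "-d"]
    if name:
        cmd += ["--name", name]
    # Bind mounts need absolute host paths; input is mounted read-only.
    cmd += ["-v", f"{os.path.abspath(input_dir)}:/data/input:ro"]
    cmd += ["-v", f"{os.path.abspath(output_dir)}:/data/output"]
    cmd.append(IMAGES[pipeline])
    return cmd


def commands_for_datasets(pipeline, dataset_dirs):
    """One command per dataset directory, each with a unique container name."""
    return [
        docker_run_command(
            pipeline,
            input_dir=path,
            output_dir=os.path.join(path, "results"),
            name=f"{pipeline}-{index}",
        )
        for index, path in enumerate(dataset_dirs)
    ]


if __name__ == "__main__":
    for cmd in commands_for_datasets("rnaseq", ["/data/run1", "/data/run2"]):
        print(" ".join(cmd))
```

Each command could then be passed to `subprocess.run(cmd, check=True)` to start one container per dataset, which is one way the concurrent analysis of multiple datasets mentioned in the abstract might be scripted.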