ChangHyuk Kwon, Jason Kim, Jaegyoon Ahn.
Abstract
Keywords: Bioinformatics; DNA pipeline; DNA-Seq; Docker; Dockerbio; Mygenomebox; NGS pipeline; RNA pipeline; RNA-Seq
Year: 2018; PMID: 30515360; PMCID: PMC6266945; DOI: 10.7717/peerj.5954
Source DB: PubMed; Journal: PeerJ; ISSN: 2167-8359; Impact factor: 2.984
Pros and cons of existing platforms for biomedical data analysis.
| Platform | Pros (characteristics) | Cons |
|---|---|---|
| DockerBIO | Easy installation on local environment; provides easy-to-use GUI; easy to analyze NGS data | No community for support yet |
| Galaxy | Provides easy-to-use GUI; good community for support | Slow running time on hosted servers; relatively difficult to install on local servers |
| BioContainers | Framework or infrastructure for software standardization | GUI is not provided; pre-registered data is not provided |
| RUbioSeq+ | Automated and parallelized workflows to analyze NGS data | Limited to NGS data analysis |
| Bioconda | Provides an efficient way to install and manage most bioinformatics tools | GUI is not provided |
Figure 1. Overview of the workflow.
DockerBIO is composed of RegisterDocker and RunDocker. In RegisterDocker, users can use Docker images already registered in DockerBIO or search for Docker images on Docker Hub. They can also use data registered in DockerBIO or search for data in other data repositories. After the options are set and tested, a pipeline is created and registered to DockerBIO in RunDocker. In RunDocker, users can upload their own data, change options, run the registered pipeline and check the results.
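The idea of a pipeline made of registered Docker images can be illustrated with a minimal sketch. It is not DockerBIO's implementation: the `Step` class, the helper functions, the host directory and the file names are all hypothetical, and the tool commands are placeholders that assume each image exposes the tool on its path. Only the general `docker run --rm -v host:container image command` form is taken as given.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One pipeline stage: a Docker image plus the command run inside it."""
    image: str          # image name, e.g. one listed on Docker Hub
    command: List[str]  # tool invocation executed inside the container


def docker_run_args(step: Step, host_dir: str, mount: str = "/data") -> List[str]:
    """Build the argument list for `docker run`, sharing host_dir with the container."""
    return ["docker", "run", "--rm", "-v", f"{host_dir}:{mount}",
            step.image, *step.command]


def build_pipeline(steps: List[Step], host_dir: str) -> List[List[str]]:
    """Return one `docker run` argument list per step, in execution order."""
    return [docker_run_args(step, host_dir) for step in steps]


# Hypothetical two-step pipeline; image names come from the pre-registered
# list, but the commands and file names are illustrative placeholders.
steps = [
    Step("comics/bwa", ["bwa", "mem", "/data/ref.fa", "/data/reads.fq"]),
    Step("biocontainers/samtools",
         ["samtools", "sort", "-o", "/data/sorted.bam", "/data/aligned.sam"]),
]

for argv in build_pipeline(steps, "/home/user/job1"):
    # Print instead of executing, so the sketch runs without Docker installed.
    print(" ".join(argv))
```

Passing each argument list to `subprocess.run` in order would execute the steps one after another, with the shared mounted directory carrying intermediate files between containers.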
Pre-registered Dockers.
| Docker name | Tool lists | Reference |
|---|---|---|
| netbuyer/wgs | bwa, picard, gatk | |
| netbuyer/rna_seq | hisat2, samtools, stringtie, gffcompare | |
| conradstoerker/fastqc | FastQC | |
| comics/bwa | BWA(mem), BWA | |
| alexcoppe/picard | Picard(sort) | |
| comics/bowtie2 | bowtie2 | |
| netbuyer/rna_seq:0.1 | hisat2 | |
| alexcoppe/picard | picard | |
| biocontainers/samtools | samtools | |
| biodckrdev/gatk | GATK3.5 | |
| alexcoppe/snpsift | SnpSift annotate | |
Pre-registered datasets.
| Data name | Description | Reference |
|---|---|---|
| Reference DNA(DNA-Seq): hg | Reference whole human genome sequence for running DNA-Seq | |
| Reference RNA(RNA-Seq): RNA_hg19, RNA_hg38 | Reference transcriptome sequence for running RNA-Seq | |
| dbSNP(hg38): dbSNP141, 142, 144, 146, 147, 150 | SNP identified from hg38 | |
| dbSNP(hg19): dbSNP138, 141, 142, 144, 146, 147, 150 | SNP identified from hg19 | |
| Annotation RNA(hg38): RNA_hg38_annotated.gtf | Exon and intron annotations based on hg38 | |
| Annotation RNA(hg19): RNA_hg19_annotated.gtf | Exon and intron annotations based on hg19 | |
Note:
hg: human genome.
Figure 2. (A) Docker LIST, (B) Docker Info Register and (C) SIMULATE in RegisterDocker.
(A) Docker LIST: menus for editing and testing options. (B) Docker Info Register: menus for searching Docker images on Docker Hub, registering datasets and setting options. (C) SIMULATE: menus for testing the registered Docker and its options.
Figure 3. Options and menus on the RunDocker page.
UPLOAD USER FILE: menu for uploading user data files for analysis. DOCKER RUN: menus for running a registered pipeline (refer to the UserManual for a detailed description of each command). JOB REQUEST LIST: menu for checking the results.
Data description.
| Experiment | Sample ID | Reference | #Reads | Description |
|---|---|---|---|---|
| DNA-seq1 | NA12750 | 1000 Genomes Project | 11,964,008 | Lymphoblastoid cell line from the 1000 Genomes Project |
| DNA-seq2 | NA12878 | GIAB | 28,991,397 | Randomly sampled from a 300× high-depth file of NA12878 |
| RNA-seq1 | SRX1952336 | GEO | 36,313,342 | Transcriptomic differences associated with loss of TSC2 gene expression in lymphangioleiomyomatosis (human cells) |
| RNA-seq2 | GSM3244545 | GEO | 129,881,552 | Role of AHR in lymphoblastoid cell lines |
Running time for DNA-seq1 (average of six run times; hh:mm:ss).
| Step | OneD | MultiD | LocalM (java1.7) | LocalM (java1.8) |
|---|---|---|---|---|
| Alignment | 0:27:53 (±3) | 0:27:24 (±62) | 0:27:18 (±35) | 0:27:20 (±25) |
| Sorting | 0:04:57 (±12) | 0:04:26 (±13) | 0:04:32 (±18) | 0:04:32 (±18) |
| RemoveDuplicate | 0:05:57 (±29) | 0:05:20 (±6) | 0:05:50 (±19) | 0:05:48 (±20) |
| BaseRecal | 0:19:31 (±62) | 0:19:12 (±46) | 0:19:12 (±62) | 0:19:11 (±62) |
| BaseRecal_post | 0:25:53 (±4) | 0:26:22 (±23) | 0:26:05 (±34) | 0:26:33 (±53) |
| PrintReads | 0:12:55 (±3) | 0:13:20 (±13) | 0:14:10 (±11) | 0:14:18 (±17) |
| HaplotypeCaller | 2:13:56 (±42) | 2:15:17 (±28) | 2:13:36 (±252) | 2:52:05 (±498) |
| Mean overall time | 3:51:02 (±123) | 3:51:22 (±152) | 3:50:43 (±335) | 4:29:47 (±526) |
Notes:
1. 11,964,008 DNA-seq reads from NA12750.
2. Numbers in parentheses are standard deviations in seconds.
3. Alignment: BWA alignment, Sorting: Picard sorting, BaseRecal: GATK BaseRecalibrator, BaseRecal_post: GATK BaseRecalibrator second step.
4. LocalM, Local Machine: the pipeline is run directly on a local machine; MultiD, Multiple Docker: multiple Dockers are connected as a pipeline; OneD, One Docker: the whole pipeline is implemented in one Docker image.
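The "Mean overall time" row can be checked against the per-step averages with a short calculation. The sketch below sums the OneD column of the DNA-seq1 table; the helper names `to_seconds` and `to_hms` are introduced here for illustration only.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Convert a number of seconds back to 'h:mm:ss'."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# OneD column of the DNA-seq1 table (average time per step).
oned_steps = {
    "Alignment": "0:27:53",
    "Sorting": "0:04:57",
    "RemoveDuplicate": "0:05:57",
    "BaseRecal": "0:19:31",
    "BaseRecal_post": "0:25:53",
    "PrintReads": "0:12:55",
    "HaplotypeCaller": "2:13:56",
}

total_seconds = sum(to_seconds(t) for t in oned_steps.values())
print(to_hms(total_seconds))  # 3:51:02, matching the reported OneD total
```

Other columns can differ from the reported total by about a second (the MultiD steps sum to 3:51:21 against a reported 3:51:22), presumably because the per-step averages are rounded to whole seconds.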
Running time for DNA-seq2 (average of six run times; hh:mm:ss).
| Step | OneD | MultiD | LocalM (java1.7) | LocalM (java1.8) |
|---|---|---|---|---|
| Alignment | 2:26:00 (±212) | 2:24:45 (±180) | 2:24:53 (±130) | 2:26:47 (±326) |
| Sorting | 0:28:16 (±49) | 0:28:59 (±71) | 0:28:14 (±50) | 0:28:05 (±35) |
| RemoveDuplicate | 0:26:14 (±58) | 0:25:40 (±19) | 0:25:29 (±40) | 0:25:22 (±53) |
| BaseRecal | 1:55:24 (±95) | 2:05:48 (±71) | 1:54:05 (±87) | 1:54:12 (±44) |
| BaseRecal_post | 2:57:38 (±2,292) | 3:25:45 (±1,867) | 3:11:06 (±1,348) | 3:29:24 (±746) |
| PrintReads | 2:16:45 (±48) | 2:14:23 (±75) | 2:19:04 (±108) | 2:18:50 (±116) |
| HaplotypeCaller | 4:41:57 (±43) | 4:55:41 (±125) | 4:40:11 (±81) | 7:38:42 (±79) |
| Mean overall time | 15:12:15 (±2,473) | 16:01:01 (±1,989) | 15:23:01 (±1,364) | 18:41:23 (±1,024) |
Notes:
1. 28,991,397 DNA-seq reads from NA12878.
2. Numbers in parentheses are standard deviations in seconds.
3. Alignment: BWA alignment, Sorting: Picard sorting, BaseRecal: GATK BaseRecalibrator, BaseRecal_post: GATK BaseRecalibrator second step.
4. LocalM, Local Machine: the pipeline is run directly on a local machine; MultiD, Multiple Docker: multiple Dockers are connected as a pipeline; OneD, One Docker: the whole pipeline is implemented in one Docker image.
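To compare the configurations on one scale, the overall times can be expressed as a percentage difference from a baseline. The sketch below uses the DNA-seq2 totals and takes LocalM (java1.7) as the baseline; that choice of baseline and the helper name are assumptions made for illustration, not part of the reported analysis.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# "Mean overall time" row of the DNA-seq2 table.
totals = {
    "OneD": "15:12:15",
    "MultiD": "16:01:01",
    "LocalM (java1.7)": "15:23:01",
    "LocalM (java1.8)": "18:41:23",
}

baseline = to_seconds(totals["LocalM (java1.7)"])  # assumed reference point

# Percentage difference of each configuration relative to the baseline.
relative = {
    name: 100.0 * (to_seconds(t) - baseline) / baseline
    for name, t in totals.items()
}

for name, pct in relative.items():
    print(f"{name}: {pct:+.1f}%")
```

With these totals, the two Docker configurations land within a few percent of the java1.7 local run, while the java1.8 local run is roughly a fifth slower, a gap that comes almost entirely from the HaplotypeCaller step.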
Running time for RNA-seq1 (average of five run times; hh:mm:ss).
| Step | OneD | MultiD | LocalM |
|---|---|---|---|
| Alignment | 0:21:02 (±117) | 0:21:10 (±119) | 0:20:52 (±118) |
| Sorting | 0:13:35 (±82) | 0:14:42 (±90) | 0:14:16 (±81) |
| ST_gtf | 0:06:29 (±35) | 0:06:29 (±37) | 0:06:08 (±36) |
| ST_Mer_gtf | 0:00:11 (±1) | 0:00:11 (±1) | 0:00:07 (±1) |
| GT_Mer_gtf | 0:00:14 (±2) | 0:00:14 (±2) | 0:00:19 (±2) |
| BG_Mer_gtf | 0:08:58 (±49) | 0:08:58 (±50) | 0:08:48 (±50) |
| Mean overall time | 0:50:29 (±283) | 0:51:44 (±293) | 0:50:31 (±283) |
Notes:
1. 36,313,342 RNA-seq reads from SRX1952336.
2. Numbers in parentheses are standard deviations in seconds.
3. Alignment: BWA alignment, Sorting: Picard sorting, ST_gtf: StringTie gtf generation step, ST_Mer_gtf: StringTie merge step, GT_Mer_gtf: gffcompare step, BG_Mer_gtf: Ballgown step.
4. LocalM, Local Machine: the pipeline is run directly on a local machine; MultiD, Multiple Docker: multiple Dockers are connected as a pipeline; OneD, One Docker: the whole pipeline is implemented in one Docker image.
Running time for RNA-seq2 (average of five run times; hh:mm:ss).
| Step | OneD | MultiD | LocalM |
|---|---|---|---|
| Alignment | 0:46:42 (±34) | 0:46:20 (±21) | 0:47:39 (±32) |
| Sorting | 0:44:16 (±69) | 0:44:18 (±70) | 0:42:08 (±61) |
| ST_gtf | 0:05:55 (±4) | 0:05:48 (±2) | 0:06:26 (±17) |
| ST_Mer_gtf | 0:00:12 (±1) | 0:00:11 (±0) | 0:00:13 (±1) |
| GT_Mer_gtf | 0:00:16 (±0) | 0:00:16 (±0) | 0:00:16 (±0) |
| BG_Mer_gtf | 0:05:32 (±2) | 0:05:27 (±3) | 0:05:35 (±2) |
| Mean overall time | 1:42:53 (±100) | 1:42:21 (±78) | 1:42:18 (±66) |
Notes:
1. 129,811,552 RNA-seq reads from GSM3244545.
2. Numbers in parentheses are standard deviations in seconds.
3. Alignment: BWA alignment, Sorting: Picard sorting, ST_gtf: StringTie gtf generation step, ST_Mer_gtf: StringTie merge step, GT_Mer_gtf: gffcompare step, BG_Mer_gtf: Ballgown step.
4. LocalM, Local Machine: the pipeline is run directly on a local machine; MultiD, Multiple Docker: multiple Dockers are connected as a pipeline; OneD, One Docker: the whole pipeline is implemented in one Docker image.