Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier.
Abstract
MOTIVATION: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
Year: 2015 PMID: 25819078 PMCID: PMC4514927 DOI: 10.1093/bioinformatics/btv179
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of the Halvade framework. The entries of pairs of input FASTQ files (containing paired-end reads) are interleaved and stored as smaller chunks. Map tasks are executed in parallel, each task taking a single chunk as input and aligning the reads to a reference genome using an existing tool. The map tasks emit
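The interleaving and chunking step mentioned in the Fig. 1 caption can be sketched as follows. This is a minimal illustration and not Halvade's actual implementation: the names `read_records` and `interleave_fastq`, the `pairs_per_chunk` parameter, and the in-memory line-list representation are all hypothetical. The only thing taken for granted is the standard four-line FASTQ record layout, and that both inputs list the mates in the same order.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (header, bases, '+', qualities)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave_fastq(
    reads1: Iterable[str],
    reads2: Iterable[str],
    pairs_per_chunk: int,
) -> Iterator[List[str]]:
    """Interleave mate 1 and mate 2 records and group them into chunks.

    Hypothetical helper: each chunk holds at most `pairs_per_chunk` read
    pairs, with the two mates of a pair stored next to each other.
    Assumes both inputs contain the same number of records; zip() stops
    silently at the shorter input.
    """
    chunk: List[str] = []
    pairs = 0
    for mate1, mate2 in zip(read_records(reads1), read_records(reads2)):
        chunk.extend(mate1)
        chunk.extend(mate2)
        pairs += 1
        if pairs == pairs_per_chunk:
            yield chunk
            chunk, pairs = [], 0
    if chunk:  # emit the final, possibly smaller, chunk
        yield chunk
```

Keeping both mates of a pair in the same chunk means that a task reading a single chunk sees complete pairs, which is what a paired-end aligner needs.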
Overview of the steps and tools involved in the DNA-sequencing pipeline according to the GATK Best Practices recommendations described by Van der Auwera
| step | program | input | output |
|---|---|---|---|
| align reads | BWA | FASTQ | SAM |
| convert SAM to BAM | Picard | SAM | BAM |
| sort reads | Picard | BAM | BAM |
| mark duplicates | Picard | BAM | BAM |
| identify realignment intervals | GATK | BAM | Intervals |
| realign intervals | GATK | BAM and intervals | BAM |
| build BQSR table | GATK | BAM | table |
| recalibrate base quality scores | GATK | BAM and table | BAM |
| call variants | GATK | BAM | VCF |
Fig. 2. The parallel speedup (multithreading) of five GATK modules used in the Best Practices pipeline on a 16-core node with 94 GB of RAM. The limited speedup prevents the efficient use of this node with more than a handful of CPU cores. Option -nt denotes data threads while option -nct denotes CPU threads (cf. GATK manual).
Runtime as a function of the number of parallel tasks (mappers/reducers) on the Intel Big Data cluster and Amazon EMR
| Cluster | No. worker nodes | No. parallel tasks | No. CPU cores | Runtime |
|---|---|---|---|---|
| Intel Big Data cluster | 1 | 3 | 18 | 47 h 59 min |
| Intel Big Data cluster | 4 | 15 | 90 | 9 h 54 min |
| Intel Big Data cluster | 8 | 31 | 186 | 4 h 50 min |
| Intel Big Data cluster | 15 | 59 | 354 | 2 h 39 min |
| Amazon EMR | 1 | 4 | 32 | 38 h 38 min |
| Amazon EMR | 2 | 8 | 64 | 20 h 19 min |
| Amazon EMR | 4 | 16 | 128 | 10 h 20 min |
| Amazon EMR | 8 | 32 | 256 | 5 h 13 min |
| Amazon EMR | 16 | 64 | 512 | 2 h 44 min |
The time for uploading data to S3 over the internet is not included in the runtimes for Amazon EMR.
Fig. 3. The speedup (primary y-axis) and parallel efficiency (secondary y-axis) of Halvade as a function of number of parallel tasks (cluster size) on both an Intel Big Data cluster and Amazon EMR.
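The relation between the runtime table and the speedup and efficiency curves of Fig. 3 can be illustrated with a short calculation. The sketch below is hedged: it assumes that speedup is measured relative to the smallest configuration of each cluster and that parallel efficiency is the speedup divided by the relative increase in parallel tasks. The baseline actually used for the figure is not stated here, so the resulting values need not match the plot. The names `RUNS`, `to_minutes` and `relative_scaling` are hypothetical; the runtimes are those reported in the table.

```python
from typing import Dict, List, Tuple


def to_minutes(hours: int, minutes: int) -> int:
    """Convert a runtime given as hours and minutes to minutes."""
    return 60 * hours + minutes


# (number of parallel tasks, runtime in minutes), copied from the runtime table.
RUNS: Dict[str, List[Tuple[int, int]]] = {
    "Intel Big Data cluster": [
        (3, to_minutes(47, 59)),
        (15, to_minutes(9, 54)),
        (31, to_minutes(4, 50)),
        (59, to_minutes(2, 39)),
    ],
    "Amazon EMR": [
        (4, to_minutes(38, 38)),
        (8, to_minutes(20, 19)),
        (16, to_minutes(10, 20)),
        (32, to_minutes(5, 13)),
        (64, to_minutes(2, 44)),
    ],
}


def relative_scaling(runs: List[Tuple[int, int]]) -> List[Tuple[int, float, float]]:
    """Return (tasks, speedup, efficiency) relative to the first entry.

    Assumption: the first entry (fewest parallel tasks) is the baseline.
    """
    base_tasks, base_runtime = runs[0]
    rows = []
    for tasks, runtime in runs:
        speedup = base_runtime / runtime
        efficiency = speedup / (tasks / base_tasks)
        rows.append((tasks, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for cluster, runs in RUNS.items():
        print(cluster)
        for tasks, speedup, efficiency in relative_scaling(runs):
            print(f"  {tasks:3d} tasks: speedup {speedup:5.2f}, efficiency {efficiency:.2f}")
```

Under these assumptions, an efficiency close to 1 means that the runtime shrinks almost in proportion to the number of parallel tasks added.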