Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier.
Abstract
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
Year: 2017 PMID: 28358893 PMCID: PMC5373595 DOI: 10.1371/journal.pone.0174575
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
RNA-seq variant calling pipeline used in this work.
| Step | Tool | Input | Output |
|---|---|---|---|
| Read mapping (1st pass) | STAR | FASTQ + index | SAM + splice junctions |
| Rebuild genome index | STAR | ref. genome + splice junctions | new index |
| Read mapping (2nd pass) | STAR | FASTQ + new index | SAM |
| Add readgroups and sort | Picard | SAM | BAM |
| Mark duplicates | Picard | BAM | BAM |
| Split ‘N’ Trim | GATK | BAM | BAM |
| Indel realignment | GATK | BAM | BAM |
| Base quality score recalibration | GATK | BAM | BAM |
| Variant calling | GATK | BAM | VCF |
Fig 1. Overview of the RNA-seq pipeline in Halvade-RNA.
In the first job, reads are aligned in parallel in order to identify splice junctions and the reference genome index is rebuilt using this information. In the second job, final alignments are produced and after sorting and grouping the aligned reads by genomic region, the different Picard and GATK steps are executed in parallel.
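The map, group-by-region, and reduce structure described above can be illustrated with a minimal, single-process sketch. This is not Halvade-RNA's code (which is written in Java on Hadoop). The names `AlignedRead`, `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the fixed-width `REGION_SIZE` bins are an assumption made for illustration; the source does not state how regions are delimited. Reads that span a region boundary are ignored here.

```python
from collections import defaultdict
from typing import NamedTuple

REGION_SIZE = 1_000_000  # hypothetical bin width in bases (assumption, not from the source)


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    pos: int  # 1-based leftmost mapping position


def map_phase(reads):
    """Emit a (region key, read) pair for every aligned read."""
    for read in reads:
        yield (read.chrom, (read.pos - 1) // REGION_SIZE), read


def shuffle(pairs):
    """Group reads by region key and sort each group by position."""
    groups = defaultdict(list)
    for key, read in pairs:
        groups[key].append(read)
    return {key: sorted(group, key=lambda r: r.pos)
            for key, group in sorted(groups.items())}


def reduce_phase(groups):
    """Stand-in for the per-region Picard and GATK steps.

    Each region is handled independently, which is what allows the
    regions to be processed in parallel on a real cluster.
    """
    return {key: [read.name for read in group] for key, group in groups.items()}


reads = [
    AlignedRead("r1", "chr1", 1_500_000),
    AlignedRead("r2", "chr1", 200),
    AlignedRead("r3", "chr2", 42),
    AlignedRead("r4", "chr1", 1_000_001),
]
result = reduce_phase(shuffle(map_phase(reads)))
print(result)
```

In this toy run, `r2` falls in the first bin of `chr1`, `r4` and `r1` share the second bin (sorted by position), and `r3` forms its own group on `chr2`.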
Benchmarks of the RNA-seq variant calling pipeline per sample. Cells give the runtime per sample, with the speedup over the single-core classical pipeline in parentheses.
| Sample | Classical pipeline, 1 node × single core | Classical pipeline, 1 node × 20 cores | Halvade-RNA, 1 node × 20 cores | Halvade-RNA, 2 nodes × 20 cores |
|---|---|---|---|---|
| SNU-1033 | 26h 6min (n/a) | 12h 53min (2.03×) | 3h 25min (7.65×) | 2h 44min (9.56×) |
| SNU-1041 | 27h 48min (n/a) | 12h 51min (2.16×) | 3h 3min (9.09×) | 1h 48min (15.46×) |
| SNU-1214 | 34h 40min (n/a) | 13h 38min (2.54×) | 3h 33min (9.77×) | 2h 23min (14.56×) |
| SNU-213 | 27h 11min (n/a) | 12h 59min (2.09×) | 2h 50min (9.61×) | 1h 44min (15.66×) |
| SNU-216 | 27h 21min (n/a) | 13h 1min (2.10×) | 2h 46min (9.87×) | 1h 44min (15.81×) |
| SNU-308 | 27h 48min (n/a) | 13h 25min (2.07×) | 2h 48min (9.95×) | 1h 43min (16.26×) |
| SNU-489 | 27h 10min (n/a) | 12h 30min (2.17×) | 3h 8min (8.69×) | 2h 16min (11.98×) |
| SNU-601 | 26h 48min (n/a) | 12h 59min (2.07×) | 2h 59min (9.01×) | 2h 5min (12.91×) |
| SNU-668 | 25h 59min (n/a) | 12h 7min (2.14×) | 2h 49min (9.24×) | 1h 52min (13.97×) |
| average | 27h 52min (n/a) | 12h 56min (2.16×) | 3h 2min (9.18×) | 2h 2min (13.72×) |
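The speedups in parentheses are the ratio of the single-core runtime to the runtime of the configuration in question. The sketch below recomputes them from the runtimes as printed. The helper names `to_minutes`, `speedup`, and `efficiency` are hypothetical, and parallel efficiency (speedup divided by the number of cores) is an added, standard metric that the table itself does not report. Because the printed runtimes are rounded to the minute, recomputed values can differ from the table in the last digit.

```python
import re


def to_minutes(text):
    """Convert a runtime such as '12h 53min' or '4min' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)\s*min)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised runtime: {text!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return 60 * hours + minutes


def speedup(baseline, runtime):
    """Ratio of the baseline runtime to the runtime being evaluated."""
    return to_minutes(baseline) / to_minutes(runtime)


def efficiency(baseline, runtime, cores):
    """Parallel efficiency: speedup divided by the total number of cores."""
    return speedup(baseline, runtime) / cores


# SNU-1033, classical pipeline on 20 cores versus a single core.
print(f"{speedup('26h 6min', '12h 53min'):.2f}x")

# Average runtime, Halvade-RNA on 2 nodes x 20 cores (40 cores in total).
print(f"{efficiency('27h 52min', '2h 2min', 40):.0%}")
```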
Average runtimes per phase of the RNA-seq pipeline, with the speedup over the single-core classical pipeline in parentheses.
| Pipeline | No. of nodes and cores | Pass 1 map | Rebuild genome | Pass 2 map | Variant calling steps |
|---|---|---|---|---|---|
| Classical pipeline | 1 node × single core | 1h 19min (n/a) | 4min (n/a) | 3h 29min (n/a) | 23h 1min (n/a) |
| Classical pipeline | 1 node × 20 cores | 6min (14.24×) | 2min (2.18×) | 22min (9.69×) | 12h 27min (1.85×) |
| Halvade-RNA | 1 node × 20 cores | 14min (5.69×) | 4min (1.02×) | 39min (5.29×) | 2h 3min (11.21×) |
| Halvade-RNA | 2 nodes × 20 cores | 8min (9.29×) | 4min (1.01×) | 22min (9.49×) | 1h 26min (16.04×) |
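To see where the time goes in each configuration, the per-phase runtimes can be converted to shares of the total. The sketch below does this with the averages transcribed from the table above, in minutes. The names `PHASES`, `RUNTIMES`, and `phase_shares` are hypothetical, and the totals are simple sums of the four phases, so they may differ by a minute or so from the per-sample averages reported elsewhere.

```python
PHASES = ("Pass 1 map", "Rebuild genome", "Pass 2 map", "Variant calling steps")

# Average runtimes in minutes, transcribed from the per-phase table.
RUNTIMES = {
    "Classical, 1 node x single core": (79, 4, 209, 1381),
    "Classical, 1 node x 20 cores": (6, 2, 22, 747),
    "Halvade-RNA, 1 node x 20 cores": (14, 4, 39, 123),
    "Halvade-RNA, 2 nodes x 20 cores": (8, 4, 22, 86),
}


def phase_shares(minutes):
    """Return each phase's fraction of the summed runtime."""
    total = sum(minutes)
    return {phase: value / total for phase, value in zip(PHASES, minutes)}


for config, minutes in RUNTIMES.items():
    shares = phase_shares(minutes)
    summary = ", ".join(f"{phase}: {share:.0%}" for phase, share in shares.items())
    print(f"{config} -> {summary}")
```

With these numbers, the variant calling steps account for roughly five sixths of the single-core runtime and for an even larger share of the multi-threaded classical run, where mapping speeds up well but variant calling gains only 1.85×.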
Fig 2. Runtime distribution of Map and Reduce tasks of both jobs.
Note that the average runtime of the pass 2 map phase increases slightly due to the sparse index. Also note that the Reduce phase of the first job is not displayed, as it comprises only a single task.
Runtime for the batch processing of all 9 RNA-seq samples.
| Pipeline | No. of nodes and cores | Sum of per-sample runtime (speedup) | Batch processing runtime (speedup) |
|---|---|---|---|
| Classical pipeline | 1 node × single core | 250h 52min (n/a) | 250h 52min (n/a) |
| Classical pipeline | 1 node × 20 cores | 116h 24min (2.16×) | 40h 7min (6.25×) |
| Classical pipeline | 2 nodes × 20 cores | 63h 21min (3.96×) | 21h 3min (11.92×) |
| Halvade-RNA | 1 node × 20 cores | 27h 20min (9.18×) | 22h 17min (11.26×) |
| Halvade-RNA | 2 nodes × 20 cores | 18h 17min (13.72×) | 11h 48min (21.26×) |
Per sample overlap and average quality score.
| Sample | Overlapping variants (%) | Avg. qual score overlapping variants | Avg. qual score Halvade-unique variants | Avg. qual score reference-unique variants |
|---|---|---|---|---|
| SNU-1033 | 93.9 | 651.6 | 80.3 | 92.4 |
| SNU-1041 | 93.4 | 803.2 | 85.6 | 92.7 |
| SNU-1214 | 93.4 | 741.3 | 82.6 | 92.7 |
| SNU-213 | 94.3 | 612.7 | 74.0 | 87.2 |
| SNU-216 | 94.2 | 660.8 | 83.7 | 94.1 |
| SNU-308 | 93.4 | 522.7 | 71.5 | 71.8 |
| SNU-489 | 94.4 | 773.4 | 81.9 | 98.3 |
| SNU-601 | 93.3 | 742.1 | 94.4 | 116.4 |
| SNU-668 | 93.9 | 671.0 | 73.9 | 88.0 |
Average per-sample variant quality score (QUAL) for i) variants called by both the single-threaded pipeline and Halvade-RNA on 2 nodes × 20 cores, ii) variants called only by Halvade-RNA (‘Halvade-unique’), iii) variants called only by the single-threaded pipeline (‘reference-unique’).
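A comparison of this kind can be sketched with plain set operations on two call sets. This is not the authors' evaluation script, and the sketch rests on assumptions the source does not confirm: variants are matched on `(chrom, pos, ref, alt)`, the overlap percentage is taken relative to the union of both call sets, and the QUAL of an overlapping variant is read from the Halvade-RNA call set. The function name `compare_calls` and the toy data are hypothetical.

```python
from statistics import mean


def average(values):
    """Mean of the values, or None when there are none."""
    values = list(values)
    return mean(values) if values else None


def compare_calls(halvade, reference):
    """Compare two call sets given as {(chrom, pos, ref, alt): QUAL} dicts."""
    shared = halvade.keys() & reference.keys()
    halvade_only = halvade.keys() - reference.keys()
    reference_only = reference.keys() - halvade.keys()
    union = halvade.keys() | reference.keys()
    return {
        # Assumption: overlap is expressed relative to the union of both sets.
        "overlap_pct": 100 * len(shared) / len(union) if union else None,
        # Assumption: QUAL of a shared variant is taken from the Halvade-RNA call.
        "avg_qual_overlap": average(halvade[key] for key in shared),
        "avg_qual_halvade_unique": average(halvade[key] for key in halvade_only),
        "avg_qual_reference_unique": average(reference[key] for key in reference_only),
    }


# Toy call sets, invented for illustration only.
halvade_calls = {
    ("chr1", 100, "A", "G"): 700.0,
    ("chr1", 200, "C", "T"): 600.0,
    ("chr2", 50, "G", "A"): 80.0,
}
reference_calls = {
    ("chr1", 100, "A", "G"): 700.0,
    ("chr1", 200, "C", "T"): 600.0,
    ("chr3", 10, "T", "C"): 90.0,
}
summary = compare_calls(halvade_calls, reference_calls)
print(summary)
```

In the toy data, two of the four distinct variants are shared, and the shared variants carry much higher QUAL values than the variants unique to either call set, mirroring the pattern in the table.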
Fig 3. Comparison of variant quality between Halvade and the reference pipeline.
The figure shows the quality score distribution of all variants, pooled across all 9 samples.