Karl R Franke, Erin L Crowgey.
Abstract
Advancements in next generation sequencing (NGS) technologies have significantly increased the translational use of genomics data in the medical field, as well as the demand for computational infrastructure capable of processing that data. To enhance the current understanding of the software and hardware used to compute large-scale human genomic datasets (NGS), the performance and accuracy of optimized versions of GATK algorithms, including Parabricks and Sentieon, were compared to the results of the original application (GATK v4.1.0, Intel x86 CPUs). Parabricks was able to process a 50× whole-genome sequencing library in under 3 h and Sentieon finished in under 8 h, whereas GATK v4.1.0 needed nearly 24 h. These results were achieved while maintaining greater than 99% accuracy and precision compared to stock GATK. Sentieon's somatic pipeline achieved similar results, also greater than 99%. Additionally, the IBM POWER9 CPU performed well on bioinformatic workloads when tested with 10 different tools for alignment/mapping.
Keywords: GPUs; Genome Analysis Toolkit; clinical genomics; next generation sequencing; variant detection
Year: 2020 PMID: 32224843 PMCID: PMC7120354 DOI: 10.5808/GI.2020.18.1.e10
Source DB: PubMed Journal: Genomics Inform ISSN: 1598-866X
Variant calling pipelines' speed
| Germline pipeline (times, h:mm:ss) | GATK HaplotypeCaller, x86 baseline | GATK HaplotypeCaller, x86 32-interval | GATK HaplotypeCaller, POWER9 64-interval | Sentieon DNAseq | Sentieon DNAscope | Parabricks |
|---|---|---|---|---|---|---|
| BWA | 8:30:54 | 8:28:32 | 4:41:49 | 5:15:37 | 5:23:05 | 1:47:04 |
| MarkDupes | 6:35:29 | 5:35:05 | 3:05:17 | 0:39:03 | 0:38:55 | |
| Samtools Index | 1:51:02 | 1:33:05 | 1:16:05 | |||
| BaseRecalibrator | 10:10:20 | 0:38:20 | 0:18:01 | 0:19:34 | 0:19:45 | |
| ApplyBQSR | 7:59:57 | 0:43:08 | 0:23:55 | 0:22:31 | 0:22:45 | 0:34:25 |
| HaplotypeCaller | 71:02:51 | 5:30:32 | 2:28:12 | 1:05:11 | 1:03:12 | |
| CombineGVCFs | - | 0:51:42 | 0:59:12 | - | - | - |
| GenotypeGVCFs | 0:40:38 | 0:29:11 | 0:30:50 | 0:01:12 | 0:01:16 | 0:33:13 |
| Total | 106:51:11 | 23:49:35 | 13:43:21 | 7:43:08 | 7:48:58 | 2:54:42 |
| Somatic pipeline (times, h:mm:ss) | Mutect2 | Mutect2 | Mutect2 | TNseq | TNscope | N/A |
| Mutect2 | 109:43:31 | 14:46:15 | 9:06:52 | 7:31:46 | 3:30:38 | - |
| MergeVCF | - | 0:00:30 | 0:00:44 | - | - | - |
| GetPileupSummaries | 8:47:05 | 8:40:35 | 9:17:00 | 0:04:52 | - | - |
| CalculateContamination | 0:00:11 | 0:00:11 | 0:00:13 | - | - | |
| FilterMutectCalls | 0:02:22 | 0:03:26 | 0:02:51 | - | - | |
| Total | 118:33:09 | 23:30:57 | 18:27:40 | 7:36:38 | 3:30:38 | - |
N/A, not available.
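The totals and relative speedups in the germline section of the table can be reproduced from the per-stage times. The following is a minimal sketch, assuming the times are in h:mm:ss and that each total is the plain sum of the listed stages; the helper names (`parse_hms`, `format_hms`, `pipeline_total`, `speedup`) are illustrative and not from the paper.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'h:mm:ss' string (hours may exceed 24) to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta):
    """Render a timedelta back as 'h:mm:ss' without rolling hours into days."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Per-stage germline times copied from the table; blank and "-" cells are omitted.
GERMLINE_STAGES = {
    "GATK x86 baseline": [
        "8:30:54", "6:35:29", "1:51:02", "10:10:20",
        "7:59:57", "71:02:51", "0:40:38",
    ],
    "GATK x86 32-interval": [
        "8:28:32", "5:35:05", "1:33:05", "0:38:20",
        "0:43:08", "5:30:32", "0:51:42", "0:29:11",
    ],
    "Sentieon DNAseq": [
        "5:15:37", "0:39:03", "0:19:34", "0:22:31", "1:05:11", "0:01:12",
    ],
    "Parabricks": ["1:47:04", "0:34:25", "0:33:13"],
}


def pipeline_total(stage_times):
    """Sum a list of 'h:mm:ss' stage times into one timedelta."""
    return sum((parse_hms(t) for t in stage_times), timedelta())


def speedup(reference, other):
    """Return how many times faster `other` is than `reference`."""
    return reference / other  # timedelta / timedelta yields a float


if __name__ == "__main__":
    reference = pipeline_total(GERMLINE_STAGES["GATK x86 32-interval"])
    for name, stages in GERMLINE_STAGES.items():
        total = pipeline_total(stages)
        print(f"{name}: {format_hms(total)} ({speedup(reference, total):.2f}x)")
```

Using the x86 32-interval GATK run as the reference mirrors the abstract's "nearly 24 h" comparison point; a different reference column can be swapped in by changing the dictionary key.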
Fig. 1. Variant calling pipelines’ scalability. (A) To determine the most efficient way of parallelizing GATK’s HaplotypeCaller, different combinations of scattered intervals and PairHMM OpenMP threads were tested on x86 and POWER9 systems. The recorded times also include the CombineGVCFs step. (B) The scalability of GATK’s HaplotypeCaller pipeline on x86 and POWER9 was tested with varying amounts of compute resources alongside Sentieon’s DNAseq and Parabricks. (C) The scalability of GATK’s Mutect2 pipeline on x86 and POWER9 was tested with varying amounts of compute resources alongside Sentieon’s TNseq and TNscope. At the time of this analysis, Parabricks had not yet ported its somatic pipeline to POWER9; therefore, it could not be tested.
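The scatter-gather pattern described in panel (A) can be sketched as a set of per-interval HaplotypeCaller commands followed by one CombineGVCFs command. This is an illustration only, not the authors' scripts: the file names, the output directory, and the two helper functions are hypothetical, and GVCF mode (`-ERC GVCF`) is assumed because the pipeline includes CombineGVCFs and GenotypeGVCFs steps. The sketch only builds argument lists and does not run GATK.

```python
def haplotypecaller_scatter_commands(reference, bam, interval_files,
                                     pairhmm_threads, out_dir="gvcf_shards"):
    """Build one HaplotypeCaller command per scattered interval file."""
    commands = []
    for index, interval in enumerate(interval_files):
        shard = f"{out_dir}/shard_{index:04d}.g.vcf.gz"  # hypothetical naming
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-L", interval,                    # restrict this job to one interval
            "-ERC", "GVCF",                    # assumed: emit a per-shard GVCF
            "--native-pair-hmm-threads", str(pairhmm_threads),
            "-O", shard,
        ])
    return commands


def combine_gvcfs_command(reference, shard_gvcfs, output):
    """Build the gather step that merges the per-interval GVCFs."""
    command = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in shard_gvcfs:
        command += ["-V", gvcf]
    command += ["-O", output]
    return command


if __name__ == "__main__":
    # Hypothetical inputs: 4 interval lists, 2 PairHMM threads per job.
    intervals = [f"intervals/part_{i}.interval_list" for i in range(4)]
    scatter = haplotypecaller_scatter_commands(
        "ref.fasta", "sample.bam", intervals, pairhmm_threads=2)
    shards = [cmd[cmd.index("-O") + 1] for cmd in scatter]
    gather = combine_gvcfs_command("ref.fasta", shards, "sample.g.vcf.gz")
    for cmd in scatter + [gather]:
        print(" ".join(cmd))
```

The two tuning knobs in the figure map onto `len(interval_files)` (number of scattered jobs) and `pairhmm_threads` (threads per job); the best combination depends on the hardware, which is what panel (A) measures.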
Variant calling pipelines' accuracy
| | Germline: GATK HaplotypeCaller | Germline: Sentieon DNAseq | Germline: Sentieon DNAscope | Germline: Parabricks | Somatic: GATK Mutect2 | Somatic: Sentieon TNseq | Somatic: Sentieon TNscope |
|---|---|---|---|---|---|---|---|
| VS Baseline VCF | |||||||
| SNP | |||||||
| True-positive | - | 3,827,008 | 3,782,857 | 3,830,446 | - | 980,680 | 1,036,385 |
| False-positive | - | 9,703 | 149,987 | 7,500 | - | 28,850 | 174,668 |
| False-negative | - | 11,202 | 55,353 | 7,764 | - | 75,037 | 19,332 |
| Sensitivity | - | 0.99708 | 0.98558 | 0.99798 | - | 0.92892 | 0.98169 |
| Precision | - | 0.99747 | 0.96186 | 0.99805 | - | 0.97142 | 0.85577 |
| INDEL | - | ||||||
| True-positive | - | 815,205 | 752,611 | 818,642 | - | 67,636 | 82,766 |
| False-positive | - | 11,314 | 75,083 | 7,431 | - | 20,197 | 71,222 |
| False-negative | - | 10,756 | 73,350 | 7,319 | - | 28,441 | 13,311 |
| Sensitivity | - | 0.98698 | 0.91119 | 0.99114 | - | 0.70398 | 0.86145 |
| Precision | - | 0.98631 | 0.90929 | 0.99100 | - | 0.77005 | 0.53748 |
| VS Truthset VCF | |||||||
| SNP | |||||||
| True-positive | 3,486,614 | 3,486,443 | 3,493,799 | 3,486,520 | 827,366 | 814,549 | 910,320 |
| False-positive | 2,345 | 2,344 | 6,541 | 2,360 | 983 | 2,586 | 9,212 |
| False-negative | 108,558 | 108,729 | 101,373 | 108,652 | 125,113 | 137,930 | 42,159 |
| Sensitivity | 0.9698 | 0.96976 | 0.97180 | 0.96978 | 0.86864 | 0.85519 | 0.95574 |
| Precision | 0.99933 | 0.99933 | 0.99813 | 0.99932 | 0.99881 | 0.99684 | 0.98998 |
| INDEL | |||||||
| True-positive | 548,276 | 548,574 | 548,368 | 548,247 | 59,112 | 59,941 | 82,501 |
| False-positive | 9,496 | 8,987 | 9,635 | 9,393 | 5,654 | 3,035 | 12,343 |
| False-negative | 24,451 | 24,153 | 24,359 | 24,480 | 71,354 | 70,525 | 47,965 |
| Sensitivity | 0.95731 | 0.95783 | 0.95747 | 0.95726 | 0.45308 | 0.45944 | 0.63236 |
| Precision | 0.98298 | 0.98388 | 0.98273 | 0.98316 | 0.9127 | 0.95181 | 0.86986 |
GATK, Genome Analysis Toolkit; VCF, Variant Call Format; SNP, single nucleotide polymorphism.
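The sensitivity and precision rows follow from the true-positive, false-positive, and false-negative counts in the same column. A minimal sketch, assuming the standard definitions sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), which reproduce the tabulated values to five decimals; the dictionary below holds only the germline SNP counts measured against the baseline VCF, and its name is illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of reference variants that the pipeline recovered."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Fraction of the pipeline's calls that match the reference."""
    return true_pos / (true_pos + false_pos)


# Germline SNP counts versus the baseline VCF, copied from the table.
GERMLINE_SNP_VS_BASELINE = {
    "Sentieon DNAseq": {"tp": 3_827_008, "fp": 9_703, "fn": 11_202},
    "Sentieon DNAscope": {"tp": 3_782_857, "fp": 149_987, "fn": 55_353},
    "Parabricks": {"tp": 3_830_446, "fp": 7_500, "fn": 7_764},
}

if __name__ == "__main__":
    for name, c in GERMLINE_SNP_VS_BASELINE.items():
        sens = sensitivity(c["tp"], c["fn"])
        prec = precision(c["tp"], c["fp"])
        print(f"{name}: sensitivity={sens:.5f} precision={prec:.5f}")
```

In all three columns TP + FN equals 3,838,210, which is consistent with each pipeline being scored against the same set of baseline SNP calls.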
Fig. 2. x86/POWER9 performance comparison of aligners/mappers. The performance of 10 different tools for alignment/mapping was compared between POWER9 and x86 systems. Jobs were run in triplicate across different days; averaged results are shown in the graphs with the error bars representing standard deviation.