Katherine I Kendig1, Saurabh Baheti2, Matthew A Bockol3, Travis M Drucker3, Steven N Hart4, Jacob R Heldenbrand1, Mikel Hernaez5, Matthew E Hudson5,6, Michael T Kalmbach3, Eric W Klee4, Nathan R Mattson3, Christian A Ross3, Morgan Taschuk7, Eric D Wieben8, Mathieu Wiepert3, Derek E Wildman5,9, Liudmila S Mainzer1,5.
Abstract
As reliable, efficient genome sequencing becomes ubiquitous, the need for similarly reliable and efficient variant calling becomes increasingly important. The Genome Analysis Toolkit (GATK), maintained by the Broad Institute, is currently the widely accepted standard for variant calling software. However, alternative solutions may provide faster variant calling without sacrificing accuracy. One such alternative is Sentieon DNASeq, a toolkit analogous to GATK but built on a highly optimized backend. We conducted an independent evaluation of the DNASeq single-sample variant calling pipeline in comparison to that of GATK. Our results support the near-identical accuracy of the two software packages, showcase optimal scalability and great speed from Sentieon, and describe computational performance considerations for the deployment of DNASeq.
Keywords: DNASeq; GATK; Sentieon; benchmarking; variant calling
Year: 2019 PMID: 31481971 PMCID: PMC6710408 DOI: 10.3389/fgene.2019.00736
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Sentieon DNASeq vs. GATK pipelines.
| Pipeline Step | Sentieon | GATK 3.8/4.0 |
|---|---|---|
| Alignment | BWA MEM† | BWA MEM |
| Sorting | Sort utility | NovoSort |
| Deduplication | LocusCollector and Dedup | MarkDuplicates |
| Realignment | Realigner | |
| Quality score recalibration | QualCal | BaseRecalibrator |
| Apply new quality scores | | PrintReads (3.8)/ApplyBQSR (4.0) |
| Variant calling | Haplotyper | HaplotypeCaller |
Sentieon’s optimized BWA MEM marked with †.
Figure 1. Sentieon DNASeq pipeline: demonstrated scaling across threads on Skylake architecture vs. optimal (linear) scaling. Sample: NA12878, WGS, 20X. Data points reflect averages over two replicates, highlighting (A) post-alignment steps only and (B) the full pipeline including alignment.
Figure 2. Sentieon DNASeq scalability as a function of sequencing coverage depth (A) by tool and (B) across the entire pipeline. Sample: NA24694, WGS, 25X-100X. Data points reflect averages over two replicates. Error bars are included in (B) but are too small to be visible.
Summary of optimized parameter values for GATK3.8 and GATK4.0 in reference to parallel garbage collection (PGC) threads, tool threads, async I/O, and AVX threads.
| Tool name | GATK3.8 PGC | GATK3.8 tool threads | GATK4.0 PGC | GATK4.0 async I/O | GATK4.0 AVX threads |
|---|---|---|---|---|---|
| MarkDuplicates | 2 threads | 1 | 2 threads | N/A | N/A |
| BaseRecalibrator | 20 threads | -nct 40 | 20 threads | Yes for Samtools, No for Tribble | N/A |
| ApplyBQSR | Off | -nct 3 | Off | N/A | |
| HaplotypeCaller | Off | -nt 1 -nct 39 | Off | | 8 |
Reproduced from Heldenbrand et al. (2018) with permission.
Speed comparison: Sentieon DNASeq vs. GATK.
| Pipeline | Walltime (h) | Sentieon Speedup |
|---|---|---|
| DNASeq | 0.49 | – |
| GATK3.8 Baseline | 21.7 | x44 |
| GATK3.8 Optimized | 15.3 | x31 |
| GATK4.0 Baseline | 24.9 | x51 |
| GATK4.0 Optimized | 20.7 | x42 |
The speedup factor is the GATK walltime divided by the DNASeq walltime, i.e., how many times faster DNASeq completed the same analysis. Sample: NA12878, WGS, 20X.
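The speedup column can be reproduced directly from the walltimes in the table. The following is a minimal Python sketch of that arithmetic; the variable and function names are illustrative and do not come from the paper.

```python
# Walltimes in hours, copied from the speed comparison table.
DNASEQ_WALLTIME_H = 0.49
GATK_WALLTIMES_H = {
    "GATK3.8 Baseline": 21.7,
    "GATK3.8 Optimized": 15.3,
    "GATK4.0 Baseline": 24.9,
    "GATK4.0 Optimized": 20.7,
}


def speedup(gatk_walltime_h, dnaseq_walltime_h=DNASEQ_WALLTIME_H):
    """Return the n-fold speedup of DNASeq relative to a GATK run."""
    return gatk_walltime_h / dnaseq_walltime_h


# Round to the nearest integer, matching the precision used in the table.
speedups = {name: round(speedup(t)) for name, t in GATK_WALLTIMES_H.items()}

for name, factor in speedups.items():
    print(f"{name}: x{factor}")
```

Rounded to the nearest integer, the computed factors match the values reported in the table.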
Variant detection accuracy in Sentieon DNASeq and GATK4: F1 scores.
| Dataset | Synthetic WGS, chr 20-22 | NA12878 |
|---|---|---|
| Sentieon vs. GATK4 | 0.99 | 0.997 w/ Realigner; 0.997 w/o Realigner |
| Sentieon vs. Truth set | 0.96 | 0.96 |
| GATK4 vs. Truth set | 0.95 | 0.96 |
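The F1 score is the harmonic mean of precision and recall. As a minimal sketch of how such a score is computed from variant counts, assuming one call set is treated as the reference: the counts below are hypothetical, the function name is illustrative, and the comparison tooling the authors used is not described in this summary.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_pos == 0:
        # No correct calls: define F1 as zero and avoid division by zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only, not taken from the paper.
example = f1_score(true_pos=960, false_pos=40, false_neg=40)
print(f"F1 = {example:.2f}")
```

Because F1 penalizes both false positives and false negatives, two callers with similar scores against the same truth set can still differ in which variants they miss or miscall.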
Figure 3. CPU utilization, memory usage, and I/O of the Sentieon DNASeq tools, excluding BWA MEM. The pipeline steps are labeled in the middle panel, following the --algo options used in the script. CPU utilization in the top panel corresponds to the sum total across the 40 cores on the node. RAM utilization in the middle panel was measured as resident set size (VmRSS) and total RAM reserved for computation (VmSize). I/O rates in the bottom panel were measured in reads and writes per second. Sample: NA12878, WGS, 20X.
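On Linux, VmRSS and VmSize are the field names reported in `/proc/<pid>/status`. The monitoring tool used for the figure is not named in this summary, so the following is only a minimal sketch of one way to sample those two fields; the function names and the sample text are hypothetical.

```python
def parse_vm_fields(status_text, fields=("VmRSS", "VmSize")):
    """Extract selected memory fields (in kB) from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in fields:
            # Lines look like "VmRSS:\t  123456 kB"; keep the integer part.
            values[key] = int(rest.split()[0])
    return values


def read_process_memory(pid="self"):
    """Read VmRSS and VmSize for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_fields(handle.read())


# Sample text in the /proc status format, with made-up numbers.
sample = "Name:\tdriver\nVmSize:\t 2048000 kB\nVmRSS:\t  512000 kB\n"
print(parse_vm_fields(sample))
```

Calling `read_process_memory(pid)` at a fixed interval while a pipeline step runs would give a time series comparable in kind to the RAM panel of the figure.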