Dries Decap, Louise de Schaetzen van Brienen, Maarten Larmuseau, Pascal Costanza, Charlotte Herzeel, Roel Wuyts, Kathleen Marchal, Jan Fostier.
Abstract
BACKGROUND: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and, hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample.
Keywords: Apache Spark; GATK/Mutect2; Strelka2; somatic variant calling
Year: 2022 PMID: 35022699 PMCID: PMC8756192 DOI: 10.1093/gigascience/giab094
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Somatic variant calling pipeline implemented in Halvade Somatic. Strelka2 can be run as an alternative or complementary tool to Mutect2.
Figure 2: Overview of the somatic variant calling framework in Spark. The workflow consists of 3 Spark jobs where the data at the end of Jobs 1 and 2 are persisted. During the first job, the reads of the tumor sample are aligned to the reference genome and N chromosomal regions are determined such that each region contains roughly an equal number of aligned tumor reads. In the second job, the reads of the normal sample are aligned. The aligned reads (tumor and normal) are grouped per chromosomal region. Next, for each genomic region independently, reads are sorted according to the position to which they align and read duplicates are marked. This output is again persisted. Per genomic region, partial BQSR statistics are computed and merged into a genome-wide table. The last job uses this merged table to apply the BQSR to each read and call the somatic variants in all regions. The variants are merged into a single VCF output file. Note that certain tools in the workflow also require the (indexed) reference genome or dbSNP database. For simplicity, these input files are not shown.
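The region-determination step in the caption above (N chromosomal regions, each holding roughly the same number of aligned tumor reads) can be illustrated with a small sketch. This is not the Halvade Somatic implementation: it assumes the genome has already been summarized as read counts per fixed-size bin, and the names `balanced_regions` and `bin_counts` are hypothetical.

```python
from typing import List, Tuple


def balanced_regions(bin_counts: List[int], n_regions: int) -> List[Tuple[int, int]]:
    """Cut consecutive bins into at most n_regions half-open (start, end) ranges.

    Assumption: bin_counts[i] is the number of aligned tumor reads in the
    i-th fixed-size genomic bin. A cut is placed as soon as the running read
    count reaches the next multiple of total / n_regions.
    """
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")

    total = sum(bin_counts)
    regions: List[Tuple[int, int]] = []
    start = 0
    running = 0

    for i, count in enumerate(bin_counts):
        running += count
        # Cumulative read count that the region being built should reach.
        target = total * (len(regions) + 1) / n_regions
        if running >= target and len(regions) < n_regions - 1:
            regions.append((start, i + 1))
            start = i + 1

    # The last region takes whatever bins remain (possibly none).
    if start < len(bin_counts):
        regions.append((start, len(bin_counts)))
    return regions


if __name__ == "__main__":
    counts = [10, 10, 10, 10, 40, 40]
    for begin, end in balanced_regions(counts, 3):
        print(begin, end, sum(counts[begin:end]))
```

Because cuts follow read density rather than genomic length, densely covered stretches end up in shorter regions, which is what keeps the per-region workload comparable.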
Runtime of the original pipeline and Halvade Somatic for different combinations of samples (WGS or WES), input (FASTQ or BAM), and somatic variant calling tools (Mutect2, Strelka2, or both).
| Input | Variant caller | Original pipeline, 1 node (h) | Halvade Somatic, 1 node (h) | Halvade Somatic, 2 nodes (h) | Halvade Somatic, 4 nodes (h) | Halvade Somatic, 8 nodes (h) | Halvade Somatic, 12 nodes (h) | Halvade Somatic, 16 nodes (h) |
|---|---|---|---|---|---|---|---|---|
| WGS | | | | | | | | |
| FASTQ | Mutect2 | 84.57 | 19.45 | 9.35 | 4.81 | 2.47 | 1.74 | 1.36 |
| FASTQ | Strelka2 | 55.66 | 18.74 | 9.19 | 4.45 | 2.31 | 1.57 | 1.21 |
| FASTQ | Both | 86.03 | 21.89 | 10.50 | 5.22 | 2.74 | 1.90 | 1.51 |
| BAM | Mutect2 | 71.53 | 10.09 | 5.28 | 2.47 | 1.21 | 0.94 | 0.73 |
| BAM | Strelka2 | 42.62 | 9.94 | 5.24 | 2.28 | 1.07 | 0.83 | 0.61 |
| BAM | Both | 72.99 | 12.77 | 6.91 | 2.99 | 1.53 | 1.13 | 0.96 |
| WES | | | | | | | | |
| FASTQ | Mutect2 | 12.59 | 2.38 | 1.21 | | | | |
| FASTQ | Strelka2 | 7.03 | 1.63 | 0.82 | | | | |
| FASTQ | Both | 12.66 | 2.65 | 1.36 | | | | |
| BAM | Mutect2 | 10.72 | 1.70 | 0.85 | | | | |
| BAM | Strelka2 | 5.16 | 0.86 | 0.42 | | | | |
| BAM | Both | 10.79 | 1.90 | 1.04 | | | | |
Figure 3: Comparison and breakdown of the runtime of Halvade Somatic and the original Mutect2 pipeline on a single node. Owing to efficient multi-threading support in BWA, the reduction in runtime for Job 1 is limited. Jobs 2 and 3 show a significant reduction in runtime due to the limited support for multi-core architectures in GATK/Mutect2.
Figure 4: Runtime and parallel speedup for the WGS sample using FASTQ input.
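As a worked example of the quantities in Figure 4, the sketch below derives speedup and parallel efficiency from the WGS, FASTQ, Mutect2 row of the runtime table. The choice of baseline is an assumption: parallel speedup is taken here relative to Halvade Somatic on 1 node, which may differ from the definition used in the figure. The function names are illustrative.

```python
# Runtimes in hours, copied from the WGS / FASTQ / Mutect2 row of the runtime table.
ORIGINAL_1_NODE_H = 84.57
HALVADE_H = {1: 19.45, 2: 9.35, 4: 4.81, 8: 2.47, 12: 1.74, 16: 1.36}


def speedup(baseline_h: float, runtime_h: float) -> float:
    """How many times faster runtime_h is than baseline_h."""
    return baseline_h / runtime_h


def parallel_efficiency(runtimes: dict, nodes: int) -> float:
    """Speedup over the 1-node run divided by the node count (1.0 = ideal scaling).

    Assumption: the 1-node Halvade Somatic run is the baseline.
    """
    return speedup(runtimes[1], runtimes[nodes]) / nodes


if __name__ == "__main__":
    print(f"vs. original pipeline on 1 node: "
          f"{speedup(ORIGINAL_1_NODE_H, HALVADE_H[1]):.2f}x")
    for nodes, hours in HALVADE_H.items():
        print(f"{nodes:>2} nodes: speedup {speedup(HALVADE_H[1], hours):5.2f}x, "
              f"efficiency {parallel_efficiency(HALVADE_H, nodes):.2f}")
```

With these numbers, 16 nodes run roughly 14 times faster than a single Halvade Somatic node, which corresponds to a parallel efficiency of about 0.89.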
Runtime of Halvade Somatic for the Mutect2 pipeline on Amazon EMR.
| Input | No. of nodes | Halvade Somatic runtime (h) | Cost (USD) |
|---|---|---|---|
| WGS | |||
| FASTQ | 8 | 3.25 | 83.81 |
| BAM | 8 | 1.43 | 41.90 |
| WES | |||
| FASTQ | 1 | 2.75 | 8.80 |
| FASTQ | 2 | 1.42 | 11.02 |
| BAM | 1 | 1.88 | 5.87 |
| BAM | 2 | 1.08 | 11.02 |
The cost is calculated using standard pricing of region eu-west-1 (Ireland) at the time of writing.
Figure 5: Cumulative number of corresponding and discordant somatic variants between the original, sequential pipeline and Halvade Somatic as a function of the tumor variant allele frequency (VAF) for FASTQ input (left) and BAM input (right). “Corresponding” refers to somatic variants identified by both methods; “Original only” refers to somatic variants called only by the original, sequential pipeline; “Halvade only” refers to somatic variants identified only by Halvade Somatic. In all cases, the Mutect2 variant caller was used.
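The three categories in Figure 5 amount to a set comparison between two call sets, followed by a cumulative count over VAF. The following is a minimal sketch of that idea, not the authors' evaluation code. It assumes each call set is a mapping from a variant key `(chrom, pos, ref, alt)` to its tumor VAF, and that the VAF reported by the original pipeline is used for variants found by both; all names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def classify(original: Dict[Variant, float],
             halvade: Dict[Variant, float]) -> Dict[str, List[float]]:
    """Return the sorted tumor VAFs of each category named in the caption."""
    shared = original.keys() & halvade.keys()
    return {
        # Assumption: for shared variants, take the VAF from the original pipeline.
        "Corresponding": sorted(original[v] for v in shared),
        "Original only": sorted(vaf for v, vaf in original.items() if v not in shared),
        "Halvade only": sorted(vaf for v, vaf in halvade.items() if v not in shared),
    }


def cumulative_count(sorted_vafs: List[float], max_vaf: float) -> int:
    """Number of variants with VAF <= max_vaf (input must be sorted ascending)."""
    return bisect_right(sorted_vafs, max_vaf)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    original = {("chr1", 100, "A", "T"): 0.05,
                ("chr1", 200, "G", "C"): 0.30,
                ("chr2", 50, "C", "T"): 0.10}
    halvade = {("chr1", 100, "A", "T"): 0.05,
               ("chr1", 200, "G", "C"): 0.31,
               ("chr3", 7, "T", "G"): 0.02}
    for name, vafs in classify(original, halvade).items():
        print(name, [cumulative_count(vafs, t) for t in (0.05, 0.10, 0.50)])
```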