Po-Jung Huang1,2,3, Jui-Huan Chang2, Hou-Hsien Lin4, Yu-Xuan Li2, Chi-Ching Lee5, Chung-Tsai Su1,6, Yun-Lung Li1,6, Ming-Tai Chang1,6, Sid Weng1,6, Wei-Hung Cheng7, Cheng-Hsun Chiu3, Petrus Tang2,3,7.
Abstract
Although sequencing a human genome has become affordable, identifying genetic variants from whole-genome sequence data is still a hurdle for researchers without adequate computing equipment or bioinformatics support. GATK is a gold-standard method for identifying genetic variants and has been widely used in genome projects and population genetic studies for many years. More recently, the Google Brain team developed a new method, DeepVariant, which uses deep neural networks to build an image classification model that identifies genetic variants. However, the superior accuracy of DeepVariant comes at the cost of computational intensity, which largely constrains its applications. Accordingly, we present DeepVariant-on-Spark to optimize resource allocation, enable multi-GPU support, and accelerate the processing of the DeepVariant pipeline. To make DeepVariant-on-Spark more accessible, we have deployed it to the Google Cloud Platform (GCP). By following our instructions, users can deploy DeepVariant-on-Spark on the GCP within 20 minutes and start analyzing at least ten whole-genome sequencing datasets using the free credits provided by the GCP. DeepVariant-on-Spark is freely available for small-scale genome analysis using a cloud-based computing framework, which is suitable for pilot testing or preliminary studies, while retaining the flexibility and scalability needed for large-scale sequencing projects.
Year: 2020 PMID: 32952600 PMCID: PMC7481958 DOI: 10.1155/2020/7231205
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1. Framework of DeepVariant-on-Spark. DeepVariant-on-Spark is based on the Google Dataproc service. After the BAM file is imported into the DeepVariant-on-Spark cluster, it is segmented into 1 Mbp blocks in the “Adam Transform” step, and these blocks are merged into 155 small BAM files in the “Select BAM” step. The 1 Mbp blocks and small BAM files are stored in the HDFS. PiedPiper pipes the path of each BAM file to SeqPiper, which launches DeepVariant to produce a VCF file. Finally, in the “Merge VCFs” step, the individual VCF files are merged into a complete VCF file.
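The block-then-group idea in the caption can be illustrated with a small sketch. This is not the ADAM or DeepVariant-on-Spark implementation: the contig names and lengths are toy values, and `make_blocks` and `group_blocks` are hypothetical helpers whose contiguous grouping policy is an assumption made only for illustration. Only the 1 Mbp block size comes from the caption.

```python
from typing import Dict, List, Tuple

# (contig, start, end) with 0-based, half-open coordinates
Block = Tuple[str, int, int]

BLOCK_SIZE = 1_000_000  # 1 Mbp, as stated in the Figure 1 caption


def make_blocks(contig_lengths: Dict[str, int],
                block_size: int = BLOCK_SIZE) -> List[Block]:
    """Split every contig into consecutive blocks of at most block_size bases."""
    blocks: List[Block] = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, block_size):
            blocks.append((contig, start, min(start + block_size, length)))
    return blocks


def group_blocks(blocks: List[Block], n_groups: int) -> List[List[Block]]:
    """Assign consecutive blocks to n_groups groups of near-equal block count.

    Hypothetical policy: the real "Select BAM" step may balance differently.
    """
    groups: List[List[Block]] = [[] for _ in range(n_groups)]
    for i, block in enumerate(blocks):
        groups[i * n_groups // len(blocks)].append(block)
    return groups


# Toy contigs (assumed values, not a real reference genome)
toy_contigs = {"chrA": 2_500_000, "chrB": 1_200_000}
blocks = make_blocks(toy_contigs)
groups = group_blocks(blocks, n_groups=2)

for idx, group in enumerate(groups):
    print(f"group {idx}: {group}")
```

In the pipeline described above, each group would correspond to one small BAM file that is processed independently, which is what allows the per-file VCFs to be merged at the end.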
Comparison of variant calling results of DeepVariant and DeepVariant-on-Spark with different combinations of CPUs/GPUs.
| Variant calling pipeline | Variant type | CPU (a) | GPU (b) | F1 (c) | Recall | Precision | True positive | False negative | False positive | Genotype mismatch | Total number of SNV calls |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepVariant | SNP | 16 | 0 | 0.99940 | 0.99937 | 0.99943 | 3040855 | 1928 | 1744 | 363 | 3886287 |
| DeepVariant | SNP | 32 | 0 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886337 |
| DeepVariant | SNP | 64 | 0 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886366 |
| DeepVariant | SNP | 96 | 0 | 0.99940 | 0.99937 | 0.99943 | 3040855 | 1928 | 1744 | 363 | 3886339 |
| DeepVariant | SNP | 16 | 1 | 0.99940 | 0.99937 | 0.99943 | 3040855 | 1928 | 1744 | 363 | 3886287 |
| DeepVariant | SNP | 16 | 4 | 0.99940 | 0.99937 | 0.99943 | 3040855 | 1928 | 1744 | 363 | 3886287 |
| DeepVariant | SNP | 32 | 2 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886337 |
| DeepVariant | SNP | 64 | 4 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886366 |
| DeepVariant-on-Spark | SNP | 32 | 0 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886403 |
| DeepVariant-on-Spark | SNP | 64 | 0 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886403 |
| DeepVariant-on-Spark | SNP | 128 | 0 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886403 |
| DeepVariant-on-Spark | SNP | 32 | 2 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886403 |
| DeepVariant-on-Spark | SNP | 64 | 4 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886404 |
| DeepVariant-on-Spark | SNP | 128 | 8 | 0.99940 | 0.99937 | 0.99943 | 3040856 | 1927 | 1744 | 363 | 3886403 |
| DeepVariant | Indel | 16 | 0 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868527 |
| DeepVariant | Indel | 32 | 0 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868535 |
| DeepVariant | Indel | 64 | 0 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868520 |
| DeepVariant | Indel | 96 | 0 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868535 |
| DeepVariant | Indel | 16 | 1 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868527 |
| DeepVariant | Indel | 16 | 4 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868528 |
| DeepVariant | Indel | 32 | 2 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868535 |
| DeepVariant | Indel | 64 | 4 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868520 |
| DeepVariant-on-Spark | Indel | 32 | 0 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868541 |
| DeepVariant-on-Spark | Indel | 64 | 0 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868541 |
| DeepVariant-on-Spark | Indel | 128 | 0 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868541 |
| DeepVariant-on-Spark | Indel | 32 | 2 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868542 |
| DeepVariant-on-Spark | Indel | 64 | 4 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868542 |
| DeepVariant-on-Spark | Indel | 128 | 8 | 0.96168 | 0.95711 | 0.96628 | 478265 | 21432 | 17373 | 11151 | 868541 |
(a) CPU: the number of CPU cores. (b) GPU: the number of NVIDIA Tesla P100 GPUs. (c) F1: the F1 score, calculated as 2 × (recall × precision)/(recall + precision).
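The F1 definition in the footnote can be checked directly against the tabulated recall and precision. The sketch below is a minimal illustration, with `f1_score` as a hypothetical helper name; because its inputs are the rounded five-decimal values from the table, the result should be expected to match the reported F1 only to within rounding.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision, per the table footnote."""
    if recall + precision == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * (recall * precision) / (recall + precision)


# Recall and precision as reported in the table (rounded values)
snp_f1 = f1_score(recall=0.99937, precision=0.99943)
indel_f1 = f1_score(recall=0.95711, precision=0.96628)

print(f"SNP F1   ~ {snp_f1:.4f}")
print(f"Indel F1 ~ {indel_f1:.4f}")
```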
Figure 2. Wall-clock time and speedup of DeepVariant and DeepVariant-on-Spark with different combinations of CPU/GPU. (a) DeepVariant runs on the pure CPU machine. (b) DeepVariant runs on the CPU/GPU hybrid machine. (c) DeepVariant-on-Spark runs on the pure CPU cluster. (d) DeepVariant-on-Spark runs on the CPU/GPU hybrid cluster. AdamTransform, SelectBAM, Make_Examples, Call_Variants, Postprocess_Variants, and Merge VCF represent the steps in DeepVariant or DeepVariant-on-Spark. Speedup represents how many times faster each configuration is than DeepVariant in its 16-CPU mode. The speed improvement of DeepVariant-on-Spark over DeepVariant is provided above. DeepVariant-on-Spark using the 128-CPU and 8-GPU configuration improved the wall-clock time by 11.58x compared to DeepVariant using 16 CPUs.
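As a rough illustration of how the speedup in the caption is defined, the sketch below divides the 16-CPU DeepVariant total time by the total time of a few other configurations, using the rounded "Total time (hr)" values from the wall-clock table. The configuration labels and the `speedup` helper are illustrative names. With these rounded inputs the 128-CPU/8-GPU ratio comes out near 11.5 rather than exactly the 11.58x quoted, presumably because the published figure was computed from unrounded timings.

```python
BASELINE_HOURS = 17.49  # DeepVariant, 16 CPUs, no GPU (Total time row)

# Rounded total wall-clock hours for a few configurations from the table
total_hours = {
    "DeepVariant, 96 CPU": 5.51,
    "DeepVariant, 64 CPU + 4 GPU": 3.55,
    "DeepVariant-on-Spark, 128 CPU": 2.5,
    "DeepVariant-on-Spark, 128 CPU + 8 GPU": 1.52,
}


def speedup(baseline_hours: float, hours: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_hours / hours


speedups = {name: speedup(BASELINE_HOURS, h) for name, h in total_hours.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```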
Comparison of the wall-clock time of DeepVariant and DeepVariant-on-Spark with different combinations of CPUs/GPUs.
| Variant caller | DeepVariant | DeepVariant | DeepVariant | DeepVariant | DeepVariant | DeepVariant | DeepVariant | DeepVariant-on-Spark | DeepVariant-on-Spark | DeepVariant-on-Spark | DeepVariant-on-Spark | DeepVariant-on-Spark | DeepVariant-on-Spark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Machine model | CPU only | CPU only | CPU only | CPU only | CPU+GPU | CPU+GPU | CPU+GPU | CPU only | CPU only | CPU only | CPU+GPU | CPU+GPU | CPU+GPU |
| CPU (a) | 16 | 32 | 64 | 96 | 16 | 32 | 64 | 32 | 64 | 128 | 32 | 64 | 128 |
| GPU (b) | 0 | 0 | 0 | 0 | 1 | 2 | 4 | 0 | 0 | 0 | 2 | 4 | 8 |
| Spark (c) | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes |
| AdamTransform (hr) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.56 | 0.32 | 0.2 | 0.58 | 0.31 | 0.2 |
| SelectBAM (hr) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0.33 | 0.23 | 0.48 | 0.29 | 0.2 |
| Make_examples (hr) | 6.13 | 3.15 | 1.73 | 1.2 | 5.93 | 3.1 | 1.6 | 2.72 | 1.6 | 1 | 2.82 | 1.48 | 0.83 |
| Call_variants (hr) | 10.8 | 6.53 | 5.35 | 3.83 | 1.51 | 1.52 | 1.5 | 3.66 | 2.02 | 0.98 | 0.7 | 0.38 | 0.21 |
| Postprocess_variants (hr) | 0.56 | 0.54 | 0.53 | 0.48 | 0.46 | 0.46 | 0.45 | 0.2 | 0.13 | 0.07 | 0.2 | 0.1 | 0.06 |
| Merge VCF (hr) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
| Total time (hr) | 17.49 | 10.22 | 7.61 | 5.51 | 7.9 | 5.08 | 3.55 | 7.66 | 4.42 | 2.5 | 4.8 | 2.58 | 1.52 |
| Cost per genome (USD) | 14.02 | 15.94 | 20.77 | 25.31 | 17.86 | 22.72 | 31.76 | 23.25 | 23.98 | 25.54 | 28.57 | 29.23 | 33.17 |
| #genomes/300USD (d) | 21 | 18 | 14 | 11 | 16 | 13 | 9 | 12 | 12 | 11 | 10 | 10 | 9 |
(a) CPU: the number of CPU cores. (b) GPU: the number of NVIDIA Tesla P100 GPUs. (c) Spark: whether Apache Spark is used. (d) #genomes/300USD: the number of whole-genome sequencing jobs that can be completed with the 300 USD trial credit.
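The #genomes/300USD row appears consistent with dividing the 300 USD trial credit by the per-genome cost and rounding down. That reading is an inference from the tabulated numbers rather than a method stated in the source, and `genomes_within_budget` below is a hypothetical helper that simply reproduces the arithmetic.

```python
import math

TRIAL_CREDIT_USD = 300  # trial credit stated in the table footnote

# Per-genome cost (USD) for the 13 configurations, in table column order
cost_per_genome = [14.02, 15.94, 20.77, 25.31, 17.86, 22.72, 31.76,
                   23.25, 23.98, 25.54, 28.57, 29.23, 33.17]

# Values reported in the #genomes/300USD row, in the same order
reported = [21, 18, 14, 11, 16, 13, 9, 12, 12, 11, 10, 10, 9]


def genomes_within_budget(cost_usd: float,
                          budget_usd: float = TRIAL_CREDIT_USD) -> int:
    """Number of whole genomes that fit in the budget (assumed: round down)."""
    return math.floor(budget_usd / cost_usd)


computed = [genomes_within_budget(c) for c in cost_per_genome]

for cost, n_computed, n_reported in zip(cost_per_genome, computed, reported):
    print(f"{cost:6.2f} USD/genome -> {n_computed} genomes (reported: {n_reported})")
```

Read alongside the total-time row, this makes the trade-off explicit: the fastest configuration (128 CPUs with 8 GPUs) is also the most expensive per genome, while the slowest (DeepVariant on 16 CPUs) stretches the trial credit furthest.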