Anna Supernat, Oskar Valdimar Vidarsson, Vidar M Steen, Tomasz Stokowy.
Abstract
Testing of patients with genetics-related disorders is shifting from single-gene assays to gene panel sequencing, whole-exome sequencing (WES) and whole-genome sequencing (WGS). Since WGS is unquestionably becoming a new foundation for molecular analyses, we decided to compare three currently used tools for variant calling of human whole-genome sequencing data. We tested DeepVariant, a new TensorFlow machine learning-based variant caller, and compared this tool to GATK 4.0 and SpeedSeq, using 30×, 15× and 10× WGS data of the well-known NA12878 DNA reference sample. According to our comparison, the performance on SNV calling was nearly identical in 30× data, with all three variant callers reaching F-Scores (i.e. the harmonic mean of recall and precision) of approximately 0.98. In contrast, DeepVariant performed better in indel calling than GATK and SpeedSeq, as demonstrated by F-Scores of 0.94, 0.90 and 0.84, respectively. We conclude that the DeepVariant tool has great potential and usefulness for analysis of WGS data in medical genetics.
Year: 2018 PMID: 30552369 PMCID: PMC6294778 DOI: 10.1038/s41598-018-36177-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Possible applications of human whole genome sequencing (WGS) with respect to the source of biological material. Abbreviations: FF – Fresh Frozen Tissue; FFPE – Formalin Fixed Paraffin Embedded; LCM – Laser Capture Microdissection; FACS – Fluorescence Activated Cell Sorting; HLA – Human Leukocyte Antigen; CTCs – Circulating Tumor Cells; cfDNA – Circulating Free DNA; ctDNA – Circulating Tumor DNA (*detectable also in other body fluids).
Figure 2. Current gold standard workflow for analysis of whole genome sequencing data.
Variant calling statistics computed using RTG Tools for the three different variant calling methods.
| Sample | NA12878 DeepVariant | NA12878 GATK | NA12878 SpeedSeq |
|---|---|---|---|
| Failed Filters | 4,453,285 | 129,228 | 0 |
| Passed Filters | 4,544,442 | 4,434,965 | 4,324,047 |
| SNPs | 3,753,358 | 3,819,071 | 3,627,315 |
| Short Insertions | 375,878 | 293,187 | 263,120 |
| Short Deletions | 399,843 | 315,637 | 292,050 |
| Other Complex Indels | 15,363 | 7,070 | 49,685 |
| Same as reference | 0 | 0 | 2,546 |
| SNP Transitions/Transversions | 2.01 (3477625/1734085) | 1.98 (3491448/1762615) | 2.03 (3350062/1650203) |
| Total Het/Hom ratio | 1.64 (2819897/1724545) | 1.69 (2787845/1647120) | 1.60 (2657943/1663558) |
| SNP Het/Hom ratio | 1.58 (2296426/1456932) | 1.66 (2385446/1433625) | 1.64 (2255860/1371455) |
| Insertion Het/Hom Ratio | 1.79 (241230/134648) | 1.68 (183941/109246) | 1.06 (135399/127721) |
| Deletion Het/Hom ratio | 2.01 (267064/132779) | 2.03 (211388/104249) | 1.41 (170791/121259) |
| Insertion/Deletion ratio | 0.94 (375878/399843) | 0.93 (293187/315637) | 0.90 (263120/292050) |
Values were computed for the raw VCF files produced by the callers.
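The ratio rows in the table above are simple quotients of the counts given in parentheses. As a minimal sketch (not RTG Tools itself; the dictionary layout and the `ratio` helper are illustrative assumptions), the transition/transversion and insertion/deletion ratios can be recomputed like this:

```python
# Counts copied from the parenthesised values in the table above.
# The dictionary layout and helper name are illustrative, not part of RTG Tools.
counts = {
    "DeepVariant": {"ti": 3477625, "tv": 1734085, "ins": 375878, "del": 399843},
    "GATK": {"ti": 3491448, "tv": 1762615, "ins": 293187, "del": 315637},
    "SpeedSeq": {"ti": 3350062, "tv": 1650203, "ins": 263120, "del": 292050},
}


def ratio(numerator, denominator):
    """Return numerator / denominator rounded to two decimals, as in the table."""
    return round(numerator / denominator, 2)


summary = {
    caller: {
        "ti_tv": ratio(c["ti"], c["tv"]),      # SNP transitions / transversions
        "ins_del": ratio(c["ins"], c["del"]),  # short insertions / short deletions
    }
    for caller, c in counts.items()
}

for caller, values in summary.items():
    print(f"{caller}: Ti/Tv = {values['ti_tv']}, Ins/Del = {values['ins_del']}")
```

The same helper applies to the Het/Hom rows by substituting the heterozygous and homozygous counts.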
Comparison of variant calling pipelines.
| SNV | True positive SNV calls | False negative SNV calls | False positive SNV calls | Genotype mismatch | Total number of SNV calls | SNV calling precision | SNV recall | F1 Score |
|---|---|---|---|---|---|---|---|---|
| SpeedSeq. 30× | 2,942,217 | 100,572 | 38,107 | 11,869 | 3,802,913 | 0.987223 | 0.966947 | 0.97698 |
| SpeedSeq. 15× | 2,814,843 | 227,946 | 57,654 | 31,131 | 3,613,466 | 0.97994 | 0.925086 | 0.951724 |
| SpeedSeq. 10× | 2,589,184 | 453,605 | 84,123 | 53,955 | 3,334,440 | 0.968548 | 0.850925 | 0.905934 |
| DeepVariant 0.4.1 30× | 2,948,290 | 94,499 | 22,902 | 19,595 | 3,714,945 | 0.992294 | 0.968943 | 0.98048 |
| DeepVariant 0.4.1 15× | 2,903,519 | 139,270 | 55,261 | 41,999 | 3,674,970 | 0.981328 | 0.954229 | 0.967589 |
| DeepVariant 0.4.1 10× | 2,809,014 | 233,775 | 84,054 | 61,314 | 3,573,547 | 0.970952 | 0.923171 | 0.946459 |
| GATK 4.0 – WDL 30× | 2,952,605 | 90,184 | 41,684 | 12,579 | 3,814,443 | 0.986082 | 0.970361 | 0.978159 |
| GATK 4.0 – WDL 15× | 2,891,815 | 150,974 | 59,476 | 31,151 | 3,698,103 | 0.979851 | 0.950383 | 0.964892 |
| GATK 4.0 – WDL 10× | 2,763,913 | 278,876 | 82,452 | 57,639 | 3,526,795 | 0.971036 | 0.908349 | 0.938647 |
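| Indel | True positive indel calls | False negative indel calls | False positive indel calls | Genotype mismatch | Total number of indel calls | Indel calling precision | Indel recall | F1 Score |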
| SpeedSeq. 30× | 383,930 | 115,767 | 32,263 | 13,310 | 619,159 | 0.923499 | 0.768326 | 0.838796 |
| SpeedSeq. 15× | 337,815 | 161,882 | 34,635 | 16,172 | 542,025 | 0.907915 | 0.67604 | 0.775005 |
| SpeedSeq. 10× | 290,678 | 209,019 | 35,029 | 18,179 | 466,079 | 0.893253 | 0.581709 | 0.704578 |
| DeepVariant 0.4.1 30× | 460,271 | 39,426 | 16,122 | 8,147 | 816,456 | 0.967406 | 0.9211 | 0.943685 |
| DeepVariant 0.4.1 15× | 428,557 | 71,140 | 29,651 | 15,010 | 748,972 | 0.937303 | 0.857634 | 0.8957 |
| DeepVariant 0.4.1 10× | 387,075 | 112,622 | 38,695 | 20,121 | 668,593 | 0.911332 | 0.774619 | 0.837433 |
| GATK 4.0 – WDL 30× | 429,859 | 69,838 | 24,191 | 9,251 | 764,422 | 0.948269 | 0.860239 | 0.902112 |
| GATK 4.0 – WDL 15× | 380,932 | 118,765 | 30,603 | 11,918 | 655,658 | 0.927084 | 0.762326 | 0.836671 |
| GATK 4.0 – WDL 10× | 335,446 | 164,251 | 34,626 | 14,030 | 569,141 | 0.90753 | 0.671299 | 0.771742 |
Variants were called from 30×, 15× and 10× coverage of the NA12878 sample (HiSeq4000, Genomics Core Facility, Bergen, Norway) and compared to GIAB NISTv3.3.2 (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/). The GIAB true variant set included 3,042,789 SNVs and 499,697 indels. Variant counts and performance scores were estimated using hap.py, an Illumina haplotype comparison/benchmarking tool.
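The recall and F1 columns for the 30× rows can be checked directly from the table. The sketch below is not hap.py; the data layout and function names are illustrative assumptions. Recall is taken as true positives divided by the size of the GIAB truth set (true positives plus false negatives), and F1 as the harmonic mean of the reported precision and recall. Precision is deliberately not recomputed: dividing the listed true positives by true positives plus false positives gives slightly different values (for DeepVariant 30× indels, about 0.966 rather than the reported 0.967), presumably because hap.py counts matched calls on the query side separately.

```python
# 30x rows copied from the comparison table above:
# (variant type, caller) -> (TP, FN, reported precision, reported recall, reported F1)
rows_30x = {
    ("SNV", "SpeedSeq"): (2942217, 100572, 0.987223, 0.966947, 0.97698),
    ("SNV", "DeepVariant"): (2948290, 94499, 0.992294, 0.968943, 0.98048),
    ("SNV", "GATK"): (2952605, 90184, 0.986082, 0.970361, 0.978159),
    ("indel", "SpeedSeq"): (383930, 115767, 0.923499, 0.768326, 0.838796),
    ("indel", "DeepVariant"): (460271, 39426, 0.967406, 0.9211, 0.943685),
    ("indel", "GATK"): (429859, 69838, 0.948269, 0.860239, 0.902112),
}

# Size of the GIAB truth set, as stated in the table footnote.
giab_truth = {"SNV": 3_042_789, "indel": 499_697}


def recall(tp, fn):
    """Fraction of truth-set variants that were recovered."""
    return tp / (tp + fn)


def f1_score(precision, rec):
    """Harmonic mean of precision and recall."""
    return 2 * precision * rec / (precision + rec)


checks = {}
for (vtype, caller), (tp, fn, p, r, f1) in rows_30x.items():
    checks[(vtype, caller)] = {
        "truth_total_ok": tp + fn == giab_truth[vtype],
        "recall": recall(tp, fn),
        "f1": f1_score(p, r),
        "reported_recall": r,
        "reported_f1": f1,
    }

for key, c in checks.items():
    print(key, f"recall={c['recall']:.6f}", f"F1={c['f1']:.6f}")
```

For every 30× row, true positives plus false negatives equals the stated truth-set size, and the recomputed recall and F1 agree with the table to within rounding.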