Justin M Zook, Jennifer McDaniel, Nathan D Olson, Justin Wagner, Hemang Parikh, Haynes Heaton, Sean A Irvine, Len Trigg, Rebecca Truty, Cory Y McLean, Francisco M De La Vega, Chunlin Xiao, Stephen Sherry, Marc Salit.
Abstract
Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a 'first of its kind' resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.
Year: 2019 PMID: 30936564 PMCID: PMC6500473 DOI: 10.1038/s41587-019-0074-6
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1: Arbitration process used to form our benchmark set from multiple technologies and callsets. (a) The arbitration process has two cycles. The first cycle ignores “Filtered Outliers”. Calls that are supported by at least 2 technologies in the first cycle are used to train a model that identifies variants from each callset with any annotation value that is an “outlier” compared to these 2-technology calls. In the second cycle, the outlier variants and surrounding 50 bp are excluded from the callable regions for that callset. (b) For each variant calling method, we delineate callable regions by subtracting regions around filtered/outlier variants as at locus (1), regions with low coverage or mapping quality (MQ) as at locus (2), and “difficult regions” prone to systematic miscalling or missing variants for the particular method as at locus (3). For callsets in gVCF format, we exclude homozygous reference regions and variants with genotype quality (GQ) < 60. Difficult regions include different categories of tandem repeats (TR) and segmental duplications. (c) Four arbitration examples with two arbitrary input methods. (1) Both methods have the same genotype and variant and it is in their callable regions, so the variant and region are included in the benchmark set. (2) Method 1 calls a heterozygous variant and Method 2 implies homozygous reference, and it is in both methods’ callable regions, so the discordant variant is not included in the benchmark calls and 50 bp on each side is excluded from the benchmark regions. (3) The methods have discordant genotypes, but the site is only inside Method 2’s callable regions, so the heterozygous genotype from Method 2 is trusted and is included in the benchmark regions. (4) The two methods’ calls are identical, but they are outside both methods’ callable regions, so the site is excluded from benchmark variants and regions.
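The four arbitration examples in Figure 1(c) can be illustrated with a minimal Python sketch. This is not the GIAB integration pipeline: the `MethodCall` type, the `arbitrate` function and the genotype strings are hypothetical names introduced here, and the sketch handles only a single site with two methods, ignoring the outlier model and the 50 bp padding.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MethodCall:
    """One method's view of a site (hypothetical type for illustration)."""
    genotype: str       # e.g. "0/1"; "0/0" stands for implied homozygous reference
    in_callable: bool   # True if the site lies in this method's callable regions


def arbitrate(m1: MethodCall, m2: MethodCall) -> Tuple[Optional[str], bool]:
    """Return (benchmark genotype or None, site is in benchmark regions)."""
    # Only methods whose callable regions cover the site get a vote.
    trusted = [m for m in (m1, m2) if m.in_callable]
    if not trusted:
        # Example 4: outside every method's callable regions, so excluded.
        return None, False
    genotypes = {m.genotype for m in trusted}
    if len(genotypes) == 1:
        # Examples 1 and 3: all trusted methods agree (or only one is trusted).
        return genotypes.pop(), True
    # Example 2: trusted methods disagree, so the site is excluded
    # (the real process also drops 50 bp on each side).
    return None, False


example_1 = arbitrate(MethodCall("0/1", True), MethodCall("0/1", True))
example_2 = arbitrate(MethodCall("0/1", True), MethodCall("0/0", True))
example_3 = arbitrate(MethodCall("0/0", False), MethodCall("0/1", True))
example_4 = arbitrate(MethodCall("0/1", False), MethodCall("0/1", False))
```

In this sketch a returned genotype of `"0/0"` with `True` would mean the site is inside the benchmark regions as homozygous reference, with no variant emitted.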
Summary statistics of GIAB benchmark calls and regions for HG001 from v2.18 to v3.3.2, and their comparison to Illumina Platinum Genomes 2016-v1.0 calls (PG). Note that the PG bed files were contracted by 50 bp to minimize partial complex variant calls in the PG calls.
| Integration Version | v2.18 | v2.19 | v3.2 | v3.2.2 | v3.3 | v3.3.1 | v3.3.2 | v3.3.1 | v3.3.2 |
|---|---|---|---|---|---|---|---|---|---|
| Reference | GRCh37 | GRCh37 | GRCh37 | GRCh37 | GRCh37 | GRCh37 | GRCh37 | GRCh38 | GRCh38 |
| Integration Date | Sep 2014 | Apr 2015 | May 2016 | June 2016 | Aug 2016 | Oct 2016 | Nov 2016 | Oct 2016 | Nov 2016 |
| Number of bases in benchmark regions (chromosomes 1-22 + X) | 2.20 Gb | 2.22 Gb | 2.54 Gb | 2.53 Gb | 2.57 Gb | 2.58 Gb | 2.58 Gb | 2.45 Gb | 2.44 Gb |
| Fraction of non-N bases covered (chromosomes 1-22 + X) | 77.4% | 78.1% | 89.6% | 89.2% | 90.5% | 91.1% | 90.8% | 84.2% | 83.8% |
| Fraction of RefSeq coding sequence covered | 73.9% | 74.0% | 87.8% | 87.9% | 89.9% | 90.0% | 89.9% | 83.4% | 83.3% |
| Total number of calls in benchmark regions | 2915731 | 3153247 | 3433656 | 3512990 | 3566076 | 3746191 | 3691156 | 3617168 | 3542487 |
| SNPs | 2741359 | 2787291 | 3084406 | 3154259 | 3191811 | 3221456 | 3209315 | 3058368 | 3042789 |
| Insertions | 86204 | 172671 | 171866 | 176511 | 171715 | 243856 | 225097 | 269331 | 241176 |
| Deletions | 87161 | 189932 | 169389 | 173976 | 189807 | 266386 | 245552 | 275041 | 247178 |
| Block substitutions | 1005 | 2532 | 7476 | 7716 | 10364 | 13332 | 11192 | 13976 | 11344 |
| Transition/transversion ratio for SNVs | 2.12 | 2.12 | 2.14 | 2.14 | 2.11 | 2.10 | 2.10 | 2.10 | 2.11 |
| Fraction phased globally or locally | 0.0% | 0.3% | 3.9% | 3.9% | 8.8% | 99.0% | 99.6% | 98.5% | 99.5% |
| Number of GIAB calls concordant with PG in both PG and GIAB beds | 2825803 | 3030703 | 3312580 | 3391783 | 3441361 | 3550914 | 3529641 | 3459674 | 3431752 |
| Number of PG-only calls in both beds | 194 | 404 | 81 | 52 | 60 | 67 | 61 | 202 | 180 |
| Number of GIAB-only calls in both beds | 49 | 87 | 56 | 57 | 40 | 50 | 47 | 105 | 94 |
| Number of PG-only calls: | 1223697 605142 | 1018795 | 274671 | 138894 | 550982 | 445563 | 469202 | 659870 | 690887 |
| Number of GIAB-only calls in GIAB benchmark bed | 90722 | 122544 | 121076 | 121207 | 124715 | 195277 | 163467 | 157494 | 111787 |
| Number of concordant calls that are filtered by the GIAB benchmark bed | 12 | 0 | 736918 | 657715 | 608137 | 53460 | 51255 | 48696 | 45779 |
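To put the PG comparison rows on a common scale, the following Python sketch computes the number of discordant calls (PG-only plus GIAB-only) per million concordant calls inside both beds. The counts are copied from the three "in both beds" rows of the table; the metric itself and the variable names are illustrative choices made here, not a statistic reported by the authors.

```python
# Column labels follow the table header (integration version and reference).
versions = [
    "v2.18 GRCh37", "v2.19 GRCh37", "v3.2 GRCh37", "v3.2.2 GRCh37",
    "v3.3 GRCh37", "v3.3.1 GRCh37", "v3.3.2 GRCh37",
    "v3.3.1 GRCh38", "v3.3.2 GRCh38",
]

# Counts restricted to the intersection of the PG and GIAB beds.
concordant = [2825803, 3030703, 3312580, 3391783, 3441361,
              3550914, 3529641, 3459674, 3431752]
pg_only = [194, 404, 81, 52, 60, 67, 61, 202, 180]
giab_only = [49, 87, 56, 57, 40, 50, 47, 105, 94]

# Discordant calls per million concordant calls, keyed by version.
discordant_per_million = {
    version: (pg + giab) / conc * 1_000_000
    for version, conc, pg, giab in zip(versions, concordant, pg_only, giab_only)
}

for version, rate in discordant_per_million.items():
    print(f"{version}: {rate:.1f} discordant calls per million concordant")
```

For v3.3.2 on GRCh37 this gives roughly 31 discordant calls per million concordant calls, compared with roughly 86 for v2.18.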
Figure 2: Complex variant discordant between GIAB and Illumina Platinum Genomes. Compound heterozygous insertion and deletion in HG001 in a tandem repeat at 2:207404940 (GRCh37), for which Illumina Platinum Genomes only calls a heterozygous deletion. When a callset with the true compound heterozygous variant is compared to Platinum Genomes, it is counted as both a false positive (FP) and a false negative (FN). Both the insertion and deletion are supported by PCR-free Illumina (bottom) and Moleculo assembled long reads (middle), and reads assigned haplotype 1 in 10x support the insertion and reads assigned haplotype 2 in 10x support the deletion (top).
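The double counting described in Figure 2 can be reproduced with a toy genotype-level comparison in Python. This is a simplified sketch, not the comparison tool used in the study: the `compare` function is hypothetical, and the allele labels `"INS"`, `"DEL"` and `"REF"` are placeholders because the actual allele sequences are not given here.

```python
from typing import FrozenSet, Set, Tuple

# A site-level call: (chromosome, position, unordered pair of haplotype alleles).
SiteCall = Tuple[str, int, FrozenSet[str]]


def compare(query: Set[SiteCall], benchmark: Set[SiteCall]) -> Tuple[int, int, int]:
    """Count (TP, FP, FN) when a call matches only if its full genotype matches."""
    true_positives = query & benchmark
    false_positives = query - benchmark
    false_negatives = benchmark - query
    return len(true_positives), len(false_positives), len(false_negatives)


# Query callset with the compound heterozygous insertion and deletion.
query_calls = {("2", 207404940, frozenset({"INS", "DEL"}))}

# Benchmark that records only a heterozygous deletion at the same site.
pg_calls = {("2", 207404940, frozenset({"REF", "DEL"}))}

tp, fp, fn = compare(query_calls, pg_calls)
print(f"TP={tp} FP={fp} FN={fn}")
```

Because the genotypes differ, the single site contributes one FP (the query genotype is absent from the benchmark) and one FN (the benchmark genotype is absent from the query), even though the query is the correct call.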