Chen-Shan Chin1, Justin M Zook2, Fritz J Sedlazeck3, Justin Wagner4, Nathan D Olson4, Lindsay Harris4, Jennifer McDaniel4, Haoyu Cheng5, Arkarachai Fungtammasan6, Yih-Chii Hwang6, Richa Gupta6, Aaron M Wenger7, William J Rowell7, Ziad M Khan8, Jesse Farek8, Yiming Zhu8, Aishwarya Pisupati8, Medhat Mahmoud8, Chunlin Xiao9, Byunggil Yoo10, Sayed Mohammad Ebrahim Sahraeian11, Danny E Miller12,13, David Jáspez14, José M Lorenzo-Salazar14, Adrián Muñoz-Barrera14, Luis A Rubio-Rodríguez14, Carlos Flores14,15,16, Giuseppe Narzisi17, Uday Shanker Evani17, Wayne E Clarke17, Joyce Lee18, Christopher E Mason19, Stephen E Lincoln20, Karen H Miga21, Mark T W Ebbert22,23,24, Alaina Shumate25,26, Heng Li5.
Abstract
The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
Year: 2022 PMID: 35132260 PMCID: PMC9117392 DOI: 10.1038/s41587-021-01158-1
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 68.164
Figure 1: GIAB developed a process to create new phased small variant and structural variant benchmarks for 273 challenging, medically relevant genes. (A) We developed a list of 4,701 autosomal potentially medically relevant genes. We generated a new benchmark for 273 of the 4,701 genes that were completely resolved by our hifiasm haplotype-resolved diploid assembly and ≤90% included in the v4.2.1 GIAB small variant benchmark for HG002 (V4.2.1 Regions). (B) We required that the entire gene region (pink) and 20 kb of flanking sequence on each side (blue) were completely resolved by both haplotypes in the assembly (hifiasm Hap1 and hifiasm Hap2), as indicated by the hifiasm Dipcall Bed track. In addition, we required that any segmental duplications overlapping the gene were completely resolved. From the small variant benchmark regions (CMRG Small Variant blue bars), we excluded SVs and any tandem repeats or homopolymers overlapping SVs (right TR and Homopol. region in brown). The left TR and Homopol. region in brown is excluded from the small variant benchmark regions because the larger tandem repeat contains an imperfect homopolymer longer than 20 bp, which we exclude because long homopolymers have a higher error rate in the assembly. All regions of this gene were included in the SV benchmark regions (CMRG Structural Variant blue bar). The vertical red lines in CMRG Small Variant and CMRG Structural Variant indicate locations of benchmark small variants and SVs, respectively. Finally, we evaluated the small variant and structural variant benchmarks with manual curation and long-range PCR, and also ensured that they accurately identify false positives and false negatives after excluding errors found during curation.
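The gene-selection rule in panels A and B can be sketched as an interval computation. This is a minimal illustration, not the consortium's pipeline: the function names, the half-open interval convention, and the simplification that "completely resolved" means full coverage of the padded gene by the dipcall regions are all assumptions, and the separate check on overlapping segmental duplications is omitted.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome (assumed convention)

FLANK_BP = 20_000          # flanking sequence required on each side (from the caption)
MAX_V421_FRACTION = 0.90   # genes <=90% included in v4.2.1 qualify (from the caption)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bp(region: Interval, intervals: List[Interval]) -> int:
    """Count the bases of `region` that are covered by `intervals`."""
    r_start, r_end = region
    total = 0
    for start, end in merge(intervals):
        total += max(0, min(end, r_end) - max(start, r_start))
    return total


def is_candidate(gene: Interval,
                 v421_regions: List[Interval],
                 dipcall_regions: List[Interval]) -> bool:
    """Hypothetical helper: apply the two caption criteria to one gene."""
    # Criterion 1: at most 90% of the gene lies in the v4.2.1 benchmark regions.
    fraction_in_v421 = covered_bp(gene, v421_regions) / (gene[1] - gene[0])
    # Criterion 2: gene plus 20 kb flanks is fully covered by the dipcall regions
    # (a simplification of "completely resolved by both haplotypes").
    padded = (max(0, gene[0] - FLANK_BP), gene[1] + FLANK_BP)
    fully_resolved = covered_bp(padded, dipcall_regions) == padded[1] - padded[0]
    return fraction_in_v421 <= MAX_V421_FRACTION and fully_resolved
```

For example, a gene at `(100_000, 110_000)` that is half covered by v4.2.1 and sits inside a dipcall region spanning `(50_000, 200_000)` passes both checks, while the same gene fails if the dipcall region starts at 90,000 and so leaves part of the left flank unresolved.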
Figure 2: The new CMRG benchmark contains more challenging variants and regions than previous benchmarks. (A) Fraction of each gene region (blue) and exonic regions (red) included in the new CMRG small variant or SV benchmark regions. (B) Comparison of fraction of challenging sequences and variants for genes included in the new CMRG benchmark vs. the previous v4.2.1 HG002 benchmark vs. genes excluded from both benchmarks. 99% of CMRG benchmark genes have at least 15% of the gene region with challenging sequences or variants. The catalog of repetitive challenging sequences comes from GIAB and the Global Alliance for Genomics and Health (see text). Challenging variants for HG002 are defined as complex variants (i.e., more than one variant within 10 bp) as well as putative SVs and putative duplications excluded from the HG002 v4.2.1 benchmark regions. (C) Size distribution of INDELs in the small variant benchmark, which includes some larger INDELs in introns (light blue) and exons (dark blue). (D) Size distribution of large insertions and deletions in the SV benchmark in introns (light blue) and exons (dark blue).
Numbers of bases and variants in different HG002 GIAB benchmark sets included in the 273 genes in the CMRG benchmark. The percentage of base pairs or variants that fall in exons is given in parentheses. Difficult context is defined as the union of all tandem repeats, all homopolymers >6 bp, all imperfect homopolymers >10 bp, all difficult-to-map regions, all segmental duplications, GC <25% or >65%, “Bad Promoters”, and “OtherDifficultregions”. Challenging variants for HG002 are defined as complex variants (i.e., more than one variant within 10 bp) as well as putative SVs and putative duplications excluded from the HG002 v4.2.1 benchmark regions. Numbers of bases and variants are provided for benchmarks on GRCh38, except for v0.6, where only a GRCh37 benchmark is available.
| Benchmark Set | bp in CMRG Benchmark Genes | bp in Difficult Context | # of Variants |
|---|---|---|---|
| CMRG Small Var. | 11,719,200 (11.5%) | 4,500,129 (13.4%) | 27,178 (5.2%) |
| v4.2.1 Small Var. | 9,763,722 (12.0%) | 2,637,132 (16.3%) | 16,804 (6.3%) |
| CMRG SV | 12,020,518 (11.4%) | 4,792,096 (13.0%) | 217 (5.1%) |
| v0.6 SV | 10,569,811 (9.6%) | 3,215,766 (13.1%) | 170 (4.7%) |
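
The "complex variant" definition used above (more than one variant within 10 bp) can be illustrated with a short sketch. The function name is hypothetical, and the sketch assumes that all positions lie on one chromosome and that distance is measured between variant start positions only, ignoring allele length; the source does not specify these details.

```python
from typing import List


def flag_complex(positions: List[int], window: int = 10) -> List[bool]:
    """Mark each variant that has at least one other variant within `window` bp.

    `positions` are variant start coordinates on a single chromosome.
    The result is in the same order as the input.
    """
    # Visit variants in coordinate order so only neighbours need comparing.
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    flags = [False] * len(positions)
    for left, right in zip(order, order[1:]):
        if positions[right] - positions[left] <= window:
            flags[left] = True
            flags[right] = True
    return flags
```

With positions `[100, 105, 200, 211, 300]`, only the first two variants are flagged, because 200 and 211 are 11 bp apart.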
Figure 3: The new benchmark covers the gene SMN1, which was previously excluded due to mapping challenges for all technologies in the highly identical segmental duplication. (A) Dotplot of GRCh38 against GRCh38 in the SMA region, showing a complex set of inverted repeats that make it challenging to assemble. (B) IGV view showing that only a small portion of SMN1 was included in v4.2.1, and that all technologies have challenges mapping in the region, but 10x Genomics and ultralong ONT reads support the variants called in the new CMRG benchmark. For the CMRG and v4.2.1 benchmarks, thick blue bars indicate regions included by each benchmark and orange and light blue lines indicate positions of homozygous and heterozygous benchmark variants, respectively. CMRG variants were called from the trio-based hifiasm assembly of paternal and maternal haplotypes (Hifiasm-pat and Hifiasm-mat, respectively). Coverage tracks are shown for 60x PCR-free Illumina 2×150 bp reads (Illumina-60x), 10x Genomics linked reads (10X Genomics), 50x PacBio HiFi 15 kbp and 20 kbp reads (PB Hifi-50x), and 60x Oxford Nanopore ultralong reads (ONT-UL-60x).
Figure 4: (A) The benchmark resolves the gene CBS, which has a highly homologous gene CBSL due to a false duplication in GRCh38 that is not in HG002 or GRCh37. The duplication in GRCh38 causes Illumina and PacBio HiFi reads from one haplotype to mismap to CBSL instead of CBS. The ultralong ONT reads, 10x Genomics linked reads, and assembled PacBio HiFi contigs map properly to this region for both haplotypes because they contain sufficient flanking sequence. When the falsely duplicated sequence is masked using our new version of GRCh38, variant calls from a standard Illumina-GATK pipeline (ILMN-GATK w/ Mask VCF) are completely concordant with the new benchmark. The pink shaded box indicates CMRG benchmark regions; only variants within the benchmark regions are included in the benchmark. (B) Comparison of variant accuracy for GRCh38 before and after masking false duplications on chromosome 21. The new benchmark demonstrates decreases in false negative and false positive errors for 3 callsets in the falsely duplicated genes CBS, CRYAA, and KCNE1 when mapping to the masked GRCh38.
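Masking a falsely duplicated sequence amounts to replacing its bases with `N` so that reads no longer align there. The sketch below shows the idea on an in-memory string; the function name and the half-open coordinate convention are assumptions, the coordinates in the example are invented, and it is not the procedure used to build the masked GRCh38.

```python
from typing import Iterable, Tuple


def mask_regions(sequence: str, regions: Iterable[Tuple[int, int]]) -> str:
    """Replace bases inside each half-open [start, end) region with 'N'."""
    masked = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range regions are harmless.
        start = max(start, 0)
        end = min(end, len(masked))
        if end > start:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)
```

For a toy sequence, `mask_regions("ACGTACGTAC", [(2, 5)])` yields `"ACNNNCGTAC"`: the sequence length and all downstream coordinates are unchanged, which is why masking does not shift variant positions elsewhere on the chromosome.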
Figure 5: The new CMRG small variant benchmark includes more challenging variants and identifies more false negatives in a standard short-read callset (Illumina-bwamem-GATK) than the previous v4.2.1 benchmark in these challenging genes. While the false negative rate (circles) is similar in easier regions (purple “Not In All Difficult” points), the false negative rate is much higher overall (green “All CMRG Benchmark Regions” points). The fraction of variants excluded from the benchmark regions (triangles) is much higher for the v4.2.1 benchmark in all stratifications. This information is also presented in “summary stats NYGC” in Supplementary Data 4.
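
The two quantities plotted in this figure follow standard definitions, sketched below. The function names are hypothetical and the counts in the example are illustrative only, not values from the paper: the false negative rate is FN / (TP + FN), recall is its complement, and the excluded fraction is the share of variants falling outside the benchmark regions.

```python
def false_negative_rate(true_positives: int, false_negatives: int) -> float:
    """FN / (TP + FN): share of benchmark variants the callset missed."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no benchmark variants in this stratification")
    return false_negatives / total


def recall(true_positives: int, false_negatives: int) -> float:
    """TP / (TP + FN), i.e. 1 minus the false negative rate."""
    return 1.0 - false_negative_rate(true_positives, false_negatives)


def fraction_excluded(variants_outside: int, variants_total: int) -> float:
    """Share of variants that fall outside the benchmark regions."""
    if variants_total == 0:
        raise ValueError("no variants in this stratification")
    return variants_outside / variants_total
```

As an illustration with made-up counts, a callset that finds 92 of 100 benchmark variants has a false negative rate of 0.08 and a recall of 0.92.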