Chen-Shan Chin, Justin Wagner, Qiandong Zeng, Erik Garrison, Shilpa Garg, Arkarachai Fungtammasan, Mikko Rautiainen, Sergey Aganezov, Melanie Kirsche, Samantha Zarate, Michael C Schatz, Chunlin Xiao, William J Rowell, Charles Markello, Jesse Farek, Fritz J Sedlazeck, Vikas Bansal, Byunggil Yoo, Neil Miller, Xin Zhou, Andrew Carroll, Alvaro Martinez Barrio, Marc Salit, Tobias Marschall, Alexander T Dilthey, Justin M Zook.
Abstract
Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful: the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align them to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22,368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets, and enables performance assessment in regions with much denser and more complex variation than regions covered by previous benchmarks.
Year: 2020 PMID: 32963235 PMCID: PMC7508831 DOI: 10.1038/s41467-020-18564-9
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Assembling a single contig for each haplotype.
(a) We regenotyped DeepVariant (DV) heterozygous SNVs with WhatsHap using Oxford Nanopore Technologies (ONT) and PacBio HiFi (CCS) reads to find a confident set of SNVs with concordant genotypes from DV/CCS, WhatsHap/ONT, and WhatsHap/CCS (our Confident HETs for phasing). We selected 10x Genomics (10X) variants with phased blocks from the 10X VCF. For phasing, we used WhatsHap to combine phased blocks from 10X with ONT reads to get a single phased block across the MHC. (b) We binned PacBio HiFi reads into two haplotypes, which are denoted as orange and blue reads, using WhatsHap. (c) We performed diploid assembly using the Peregrine Assembler with the haplotype-binned HiFi reads. (d) We generated the benchmark variant callset from the assembled haplotigs using dipcall, and defined benchmark regions excluding SVs, exceptionally divergent regions, low-quality regions in the assembly, and long homopolymers.
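The region definition in panel d amounts to interval subtraction: start from the full MHC span and remove every excluded interval. The following is a minimal sketch of that idea, assuming half-open `(start, end)` coordinates; the function name `subtract_intervals` and the toy coordinates are hypothetical and are not taken from the paper's pipeline.

```python
def subtract_intervals(region, excluded):
    """Return the parts of `region` not covered by any interval in `excluded`.

    Intervals are half-open (start, end) tuples. Excluded intervals may
    overlap each other or extend past the region boundaries.
    """
    start, end = region
    kept = []
    cursor = start  # left edge of the not-yet-excluded part of the region
    for ex_start, ex_end in sorted(excluded):
        if ex_end <= cursor:
            continue  # exclusion lies entirely behind the cursor
        if ex_start >= end:
            break  # exclusion lies entirely past the region
        if ex_start > cursor:
            kept.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < end:
        kept.append((cursor, end))
    return kept


def covered_fraction(region, kept):
    """Fraction of `region` retained after exclusions."""
    total = region[1] - region[0]
    return sum(e - s for s, e in kept) / total


# Toy coordinates (hypothetical), standing in for SVs, divergent regions,
# low-quality assembly regions, and long homopolymers.
toy_region = (0, 100)
toy_excluded = [(10, 20), (15, 30), (90, 120)]
toy_kept = subtract_intervals(toy_region, toy_excluded)
```

With these toy inputs, `toy_kept` is `[(0, 10), (30, 90)]` and `covered_fraction(toy_region, toy_kept)` is `0.7`. In practice this step is usually done on BED files with a dedicated interval tool rather than hand-written code.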
Fig. 2 Alignments of the two main haplotigs to the primary GRCh37 MHC region.
We compute the local divergence (estimated difference) of the HG002 MHC haplotigs from the GRCh37 MHC by performing local alignment. The differences between the assembled contigs and the reference are computed using sequence blocks anchored with minimizers and aligned locally using an O(ND) alignment algorithm[33].
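For readers unfamiliar with O(ND) alignment, the sketch below shows the classic greedy formulation usually attributed to Myers. It is an assumption that reference [33] is this formulation. The code returns only the number of single-base insertions and deletions needed to turn one sequence into the other, and it is not the authors' implementation. The name `myers_edit_distance` is illustrative.

```python
def myers_edit_distance(a, b):
    """Length D of the shortest insert/delete edit script from a to b.

    Greedy O(ND) search: for each edit count d, track the furthest x
    reached on every diagonal k = x - y, extending along matches for free.
    """
    n, m = len(a), len(b)
    furthest = {1: 0}  # furthest[k] = largest x reached on diagonal k
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Step down (insertion) or right (deletion), whichever starts
            # from the further-reaching neighbouring diagonal.
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]
            else:
                x = furthest[k - 1] + 1
            y = x - k
            # Follow the diagonal while the sequences match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

For example, `myers_edit_distance("ABCABBA", "CBABAC")` returns `5`. The run time grows with the number of differences D rather than with the product of the sequence lengths, which suits blocks of haplotig and reference sequence that are mostly identical.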
Fig. 3 Evaluation of benchmark’s ability to reliably identify FNs and FPs across technologies.
(a) Proportion of 10 randomly selected FPs and 10 randomly selected FNs from 11 callsets from Illumina (Ill), 10x Genomics (10x), PacBio HiFi (PB), and Oxford Nanopore (ONT) that were determined to be fully correct in the benchmark and incorrect or only partially correct in the query callset. (b) Breakdown of variants potentially incorrect in the benchmark or correct in the query, where curation of the benchmark determined it to be incorrect (no), correct (yes), or unclear (unsure).
Fig. 4 Example of partially called complex variant counted as both false positives and false negatives.
The CCS-DeepVariant VCF from PacBio HiFi reads incorrectly filters the 2-bp deletion and 9 of the 13 SNVs in the region (filtered variants are light gray boxes). The benchmark correctly calls this complex variant, and represents it as a 26-bp insertion of a TG tandem repeat followed by a 29-bp deletion of adjacent tandem repeats. When comparing this VCF to our MHC benchmark, the benchmark insertion and deletion variants are counted as false negatives, while the 5 SNVs called are counter-intuitively counted as false positives because the other variants are incorrectly filtered. If the CCS-DeepVariant VCF had not filtered the other variants, all variants would be counted as true positives.
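The counting behaviour described in this caption can be illustrated with a toy haplotype comparison: apply each callset's variants to the reference and compare the resulting sequences, rather than matching VCF records one by one. The sketch below is a deliberate simplification made under that assumption. Real comparison tools are more granular, and the function names and toy data here are hypothetical.

```python
def apply_variants(ref, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) variants to ref.

    Positions are 0-based. Raises ValueError on overlapping variants or on
    a ref allele that does not match the reference sequence.
    """
    pieces = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError("ref allele does not match reference")
        pieces.append(ref[cursor:pos])
        pieces.append(alt_allele)
        cursor = pos + len(ref_allele)
    pieces.append(ref[cursor:])
    return "".join(pieces)


def compare_region(ref, truth, query):
    """Count TP/FP/FN for one region by comparing whole haplotypes.

    Simplification: if the two haplotype sequences differ, every truth
    variant is a false negative and every query variant a false positive.
    """
    if apply_variants(ref, truth) == apply_variants(ref, query):
        return {"TP": len(truth), "FP": 0, "FN": 0}
    return {"TP": 0, "FP": len(query), "FN": len(truth)}


# Toy region (hypothetical): two SNVs and a 1-bp deletion in the truth set.
toy_ref = "ACGTACGTAC"
toy_truth = [(1, "C", "T"), (4, "AC", "A"), (8, "A", "G")]
toy_partial = [(1, "C", "T")]  # query calls only one of the three variants

partial_counts = compare_region(toy_ref, toy_truth, toy_partial)
full_counts = compare_region(toy_ref, toy_truth, toy_truth)
```

Here `partial_counts` is `{"TP": 0, "FP": 1, "FN": 3}`, even though the one called SNV is individually correct, while `full_counts` is `{"TP": 3, "FP": 0, "FN": 0}`. This mirrors the caption: a correct but incomplete call of a complex variant yields a different haplotype, so its component calls are scored as errors.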
PacBio HiFi reads used in evaluation of benchmark.
| Instrument | Insert Size | SRA | FTP |
|---|---|---|---|
| Sequel system | 10 kb | – | |
| Sequel system | 15 kb | SRX5327410 | |
| Sequel II system | 11 kb | SRX5527202 | |
GRCh37 reference used for alignment: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.
GRCh38 reference used for alignment: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.
https://github.com/PacificBiosciences/pbmm2.
https://github.com/broadinstitute/gatk/releases/tag/4.0.10.1.