Heng Li, Jonathan M Bloom, Yossi Farjoun, Mark Fleharty, Laura Gauthier, Benjamin Neale, Daniel MacArthur.
Abstract
Existing benchmark datasets for use in evaluating variant-calling accuracy are constructed from a consensus of known short-variant callers, and they are thus biased toward easy regions that are accessible by these algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two fully homozygous human cell lines, which provides a relatively more accurate and less biased estimate of small-variant-calling error rates in a realistic context.
Year: 2018 PMID: 30013044 PMCID: PMC6341484 DOI: 10.1038/s41592-018-0054-7
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Fig. 1. Constructing the Syndip benchmark dataset. CHM1 and CHM13 cell lines were sequenced with PacBio and de novo assembled independently. Assembly contigs were aligned to the human reference genome. Differences in the alignment were taken as ‘true’ SNPs and INDELs; regions covered by exactly one contig from each CHM assembly were identified as confident regions where true variants can be called to high accuracy. For the evaluation of diploid variant calling with short reads, equal quantities of DNA from the two cell lines were experimentally mixed. A PCR-free library was constructed from the mix and sequenced to ~45-fold coverage with 151 bp paired-end reads. Variants called from the short reads were compared to the PacBio variants as truth to measure variant caller accuracy.
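The confident-region rule in the caption (exactly one contig from each CHM assembly covering a position) can be illustrated with a small sweep over alignment intervals. This is only a sketch under stated assumptions, not the authors' pipeline: the function name `confident_regions`, the half-open `(start, end)` interval representation, and the restriction to a single chromosome are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Assumption: each contig alignment is a half-open interval [start, end)
# on one reference chromosome.
Interval = Tuple[int, int]


def confident_regions(chm1: List[Interval], chm13: List[Interval]) -> List[Interval]:
    """Return maximal intervals covered by exactly one CHM1 contig
    and exactly one CHM13 contig (hypothetical helper)."""
    # Each event is (position, assembly index, depth change).
    events = []
    for start, end in chm1:
        events.append((start, 0, 1))
        events.append((end, 0, -1))
    for start, end in chm13:
        events.append((start, 1, 1))
        events.append((end, 1, -1))
    events.sort(key=lambda e: e[0])

    depth = [0, 0]  # current contig depth for CHM1 and CHM13
    regions: List[Interval] = []
    prev = None
    i = 0
    while i < len(events):
        pos = events[i][0]
        # The segment [prev, pos) has constant depth; keep it if depth is 1 and 1.
        if prev is not None and pos > prev and depth == [1, 1]:
            if regions and regions[-1][1] == prev:
                # Extend the previous region when the two are adjacent.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((prev, pos))
        # Apply every event located at this position before moving on.
        while i < len(events) and events[i][0] == pos:
            _, which, delta = events[i]
            depth[which] += delta
            i += 1
        prev = pos
    return regions


if __name__ == "__main__":
    # Two overlapping CHM1 contigs and one CHM13 contig (made-up coordinates).
    print(confident_regions([(0, 100), (80, 200)], [(10, 150)]))
```

In this made-up example the stretch from 80 to 100 is excluded because two CHM1 contigs overlap there, and positions before 10 or after 150 are excluded because no CHM13 contig covers them.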
Fig. 2. Evaluating variant calling accuracy with Syndip. %FNR denotes the percent false negative rate, and FPPM is the number of false positives per million bases. (a) Comparison of the Syndip, GIAB and PlatGen benchmark datasets on filtered calls. For GIAB and PlatGen, variants were called from the HiSeq X Ten run ‘NA12878_L7_S7’ available from the Illumina BaseSpace. (b) Effect of evaluation regions. Low-complexity regions were identified with the symmetric DUST algorithm. The ‘hard-to-call’ regions include low-complexity regions, regions unmappable with 75 bp single-end reads and regions susceptible to common copy number variations. Panels (c)–(f) only show metrics in ‘coding+conserved’ regions. (c) Effect of variant filters. Green bars show calls with the Platypus built-in filters applied. (d) Effect of the human genome reference build. Decoy sequences[17] are real human sequences that are missing from GRCh37. (e) Effect of the mapping algorithms and post-processing. BWA-MEM* represents alignments post-processed with base quality recalibration and INDEL realignment; other alignments were not processed with these steps. (f) Effect of replicates. Replicates 1–4 were sequenced from four independent libraries, each prepared by mixing equal amounts of DNA prior to library construction. Replicate 5* was generated by computationally subsampling and mixing reads sequenced from the two CHM cell lines separately. Replicate 1 is used in panels (a)–(e). Numerical data and the script to generate the figure are available as Supplementary Data.
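The two metrics named in the caption reduce to simple ratios. The sketch below assumes that %FNR is the share of truth variants missed by the caller and that FPPM is normalized by the number of bases in the evaluated region; the function names, argument names and example counts are hypothetical and are not taken from the paper or its Supplementary Data.

```python
def percent_fnr(true_positives: int, false_negatives: int) -> float:
    """Percent false negative rate: missed truth variants over all truth variants."""
    total_truth = true_positives + false_negatives
    if total_truth == 0:
        raise ValueError("no truth variants in the evaluated region")
    return 100.0 * false_negatives / total_truth


def fppm(false_positives: int, evaluated_bases: int) -> float:
    """False positives per million bases of the evaluated region (assumed denominator)."""
    if evaluated_bases <= 0:
        raise ValueError("evaluated region must contain at least one base")
    return false_positives * 1e6 / evaluated_bases


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(percent_fnr(true_positives=980, false_negatives=20))
    print(fppm(false_positives=50, evaluated_bases=2_500_000_000))
```

Reporting false positives per million bases rather than as a fraction of calls keeps the metric comparable across evaluation regions of different sizes, such as the region sets compared in panel (b).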