Aaron M Wenger1, Paul Peluso1, William J Rowell1, Pi-Chuan Chang2, Richard J Hall1, Gregory T Concepcion1, Jana Ebler3,4,5, Arkarachai Fungtammasan6, Alexey Kolesnikov2, Nathan D Olson7, Armin Töpfer1, Michael Alonge8, Medhat Mahmoud9, Yufeng Qian1, Chen-Shan Chin6, Adam M Phillippy10, Michael C Schatz8, Gene Myers11, Mark A DePristo2, Jue Ruan12, Tobias Marschall3,4, Fritz J Sedlazeck9, Justin M Zook7, Heng Li13, Sergey Koren10, Andrew Carroll2, David R Rank14, Michael W Hunkapiller15.
Abstract
The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the Genome in a Bottle (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.
Year: 2019 PMID: 31406327 PMCID: PMC6776680 DOI: 10.1038/s41587-019-0217-9
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. Sequencing HG002 with highly accurate, long reads.
(a) Circular consensus sequencing (CCS) derives a consensus or CCS read from multiple passes of a single template molecule, producing accurate reads from noisy individual subreads (passes). (b) Accuracy – predicted by CCS software – of reads with different numbers of passes, for sequencing of the human male HG002. At 10 passes, the median read achieves Q30 predicted accuracy. Orange lines are medians; boxes extend from lower to upper quartiles; whiskers extend 1.5 interquartile distances; n=1,000 CCS reads for each number of passes. (c) Length and predicted accuracy of CCS reads.
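The "Q30" in panel (b) and the percentage accuracies quoted in the abstract can be related through the standard Phred scale, Q = -10·log10(1 - accuracy). The sketch below is only an illustration of that conversion; the function names are hypothetical and are not taken from the CCS software.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred-scaled quality value back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Q30 corresponds to one expected error per 1,000 bases.
    print(f"Q30 -> {phred_to_accuracy(30):.4%} accuracy")
    # Figures quoted in the abstract, expressed on the Phred scale.
    print(f"99.8% read accuracy -> Q{accuracy_to_phred(0.998):.1f}")
    print(f"99.997% concordance -> Q{accuracy_to_phred(0.99997):.1f}")
```

On this scale the 99.8% read accuracy quoted in the abstract is roughly Q27, and the 99.997% assembly concordance is roughly Q45.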
Figure 2. Mappability of the human genome with CCS reads.
(a) Percentage of the non-gap GRCh37 human genome covered by at least 10 reads from 28-fold coverage NGS (2×250 bp, HiSeq 2500) and CCS (13.5 kb) datasets at different mapping quality thresholds. (b) Coverage of the congenital deafness gene STRC in HG002 with 2×250 bp NGS reads and 13.5 kb CCS reads at a mapping quality threshold of 10. (c) Improvement in mappability with 13.5 kb CCS reads for 193 human genes previously reported as medically relevant and problematic to map with NGS reads[28].
Performance of small variant calling with CCS reads.
Precision, recall, and F1 of small variant calling measured against the Genome in a Bottle v3.3.2 benchmark using hap.py. Bold indicates the highest value in each column. Underline indicates a value higher than the GATK HaplotypeCaller run on 30-fold Illumina NovaSeq reads. Coverage is 28-fold for PacBio CCS and 30-fold for Illumina NovaSeq. Rows are sorted (“^”) based on F1 for SNVs.
| Platform | Variant caller (training model) | SNV precision | SNV recall | SNV F1 ^ | Indel precision | Indel recall | Indel F1 |
|---|---|---|---|---|---|---|---|
| Illumina (NovaSeq) | DeepVariant (Illumina model) | | | | | | |
| PacBio (CCS) | DeepVariant (CCS model) | | | | 96.901% | 95.980% | 96.438% |
| PacBio (CCS) | DeepVariant (haplotype-sorted CCS model) | | | | 97.835% | 97.141% | 97.486% |
| Illumina (NovaSeq) | GATK HaplotypeCaller (no filter) | 99.852% | 99.910% | 99.881% | 99.371% | 99.156% | 99.264% |
| PacBio (CCS) | GATK HaplotypeCaller (hard filter) | 99.468% | 99.559% | 99.513% | 78.977% | 81.248% | 80.097% |
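F1 is the harmonic mean of precision and recall, so the F1 columns can be recomputed from the other two. The following is a minimal sketch (not hap.py itself; `f1_score` and `REPORTED` are names chosen here for illustration) that recomputes F1 for rows whose values appear in the table. Because the published precision and recall are already rounded, the recomputed F1 should be expected to agree only to about the third decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (label, precision %, recall %, reported F1 %) copied from the table.
REPORTED = [
    ("Illumina GATK HaplotypeCaller, SNVs", 99.852, 99.910, 99.881),
    ("Illumina GATK HaplotypeCaller, indels", 99.371, 99.156, 99.264),
    ("PacBio GATK HaplotypeCaller, SNVs", 99.468, 99.559, 99.513),
    ("PacBio GATK HaplotypeCaller, indels", 78.977, 81.248, 80.097),
    ("PacBio DeepVariant (CCS model), indels", 96.901, 95.980, 96.438),
    ("PacBio DeepVariant (haplotype-sorted), indels", 97.835, 97.141, 97.486),
]

if __name__ == "__main__":
    for label, precision, recall, reported in REPORTED:
        recomputed = f1_score(precision, recall)
        print(f"{label}: recomputed {recomputed:.3f}%, reported {reported:.3f}%")
```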
Figure 3. Variant calling and phasing with CCS reads.
(a) Agreement of DeepVariant SNV and indel calls with Genome in a Bottle v3.3.2 benchmark measured with hap.py. (b) Phasing of heterozygous DeepVariant variant calls with WhatsHap, compared to theoretical phasing of HG002 with 13.5 kb reads. (c) Agreement of integrated CCS structural variant calls with the Genome in a Bottle v0.6 structural variant benchmark measured with Truvari, (d) by variant length. Negative length indicates a deletion; positive length indicates an insertion. The histogram bin size is 50 bp for variants shorter than 1 kb, and 500 bp for variants >1 kb. All comparisons to GIAB are for the benchmark subset of the genome.
Statistics for de novo assembly of CCS reads.
The “mixed” haplotype assemblies use all reads. The “maternal” and “paternal” assemblies use parent-specific reads from trio binning plus unassigned reads. HG002 concordance is measured against the Genome in a Bottle benchmark. BUSCO gene completeness uses the Mammalia ODB9 gene set. Ensembl genes is the percentage of genes from Ensembl R94 that are full-length, single-copy in the assembly relative to the full-length, single-copy count for GRCh38. Contigs shorter than 13 kb were excluded from genome size and contiguity measurements; contigs shorter than 100 kb were excluded from the concordance measurement. “*” indicates polishing with Arrow.
| Canu | 3.42 | 18,006 | 22.78 | 25.02 | 108.46 | 30.16 | 31.1 | 92.3% | 93.2% |
| FALCON | 2.91 | 2,541 | 28.95 | 24.51 | 110.21 | 38.04 | 25.8 | 87.6% | 97.6% |
| wtdbg2 | 2.79 | 1,554 | 15.43 | 12.62 | 84.67 | 22.61 | 44.6 | 94.2% | 96.1% |
| Canu* | 3.04 | 5,854 | 18.02 | 17.04 | 48.81 | 19.78 | 47.2 | 94.1% | 98.1% |
| FALCON* | 2.80 | 924 | 19.99 | 15.54 | 74.33 | 24.07 | 43.5 | 95.1% | 97.8% |
| wtdbg2 | 2.75 | 2,637 | 12.10 | 9.29 | 66.34 | 16.55 | 43.5 | 93.8% | 95.6% |
| Canu* | 2.96 | 6,868 | 16.14 | 14.90 | 64.83 | 20.19 | 47.7 | 93.4% | 98.2% |
| FALCON* | 2.70 | 1,489 | 16.40 | 14.06 | 95.34 | 25.61 | 43.5 | 93.6% | 97.7% |
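Contig N50, one of the contiguity measures referred to in the caption and the abstract, is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below illustrates that definition together with a minimum-length filter like the 13 kb cutoff described in the caption. The function name and the toy contig lengths are hypothetical and are not data from the paper.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int], min_length: int = 0) -> int:
    """Return the N50 of the contigs that are at least min_length long.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the total size.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)
    running = 0
    for length in kept:
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs passed the filter


if __name__ == "__main__":
    # Toy contig lengths in kb (illustrative only).
    toy_contigs_kb = [100, 60, 40, 10, 10, 10, 10]
    print(contig_n50(toy_contigs_kb))                 # all contigs
    print(contig_n50(toy_contigs_kb, min_length=13))  # drop contigs < 13 kb
```

In the toy data the filter raises N50 from 60 kb to 100 kb, which shows why the length cutoff has to be stated alongside any reported contiguity figure.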
Figure 4. Impact of read accuracy on de novo assembly.
(a) The concordance of seven assemblies to the Genome in a Bottle (GIAB) v3.3.2 benchmark (Supplementary Table 8). Contigs longer than 100 kb were segmented into 100 kb chunks and aligned to GRCh37. Concordance was measured per chunk, and chunks with no discordances were assigned concordance of Q51. PB=PacBio, ONT=Oxford Nanopore, CLR=continuous (“noisy”) long reads. (b) Predicted contiguity of a human assembly based on ability to resolve repeats of different lengths (x-axis) and percent identities (colored lines)[21]. The solid line indicates the contiguity of GRCh38. The 97.0% identity line is representative of CLR assemblies using standard read-to-read error correction. The points show example CCS and CLR[46] assemblies using Canu. Repeat identity and length are proxies for read accuracy and length.
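The per-chunk concordance in panel (a) can be sketched as a Phred-scaled discordance rate over each 100 kb chunk, with chunks that have no discordances assigned Q51 as stated in the caption. The caption does not give the exact formula, so the rate definition below (discordances divided by chunk length) is an assumption, and `chunk_concordance_q` is a hypothetical name.

```python
import math

CHUNK_SIZE = 100_000  # chunk length in bases, from the caption
PERFECT_Q = 51.0      # value assigned to chunks with no discordances


def chunk_concordance_q(discordances: int,
                        chunk_size: int = CHUNK_SIZE,
                        perfect_q: float = PERFECT_Q) -> float:
    """Phred-scaled concordance of one chunk (assumed definition)."""
    if discordances < 0 or chunk_size <= 0:
        raise ValueError("discordances must be >= 0 and chunk_size > 0")
    if discordances == 0:
        # log10(0) is undefined, so perfect chunks get the fixed value.
        return perfect_q
    return -10.0 * math.log10(discordances / chunk_size)


if __name__ == "__main__":
    for count in (0, 1, 10, 100):
        q = chunk_concordance_q(count)
        print(f"{count} discordances per 100 kb -> Q{q:.1f}")
```

Under this assumed definition a single discordance in a 100 kb chunk gives Q50, so the Q51 assigned to perfect chunks sits just above the best value a chunk with any discordance could score.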