Rory Bowden, Robert W Davies, Andreas Heger, Alistair T Pagnamenta, Mariateresa de Cesare, Laura E Oikkonen, Duncan Parkes, Colin Freeman, Fatima Dhalla, Smita Y Patel, Niko Popitsch, Camilla L C Ip, Hannah E Roberts, Silvia Salatino, Helen Lockstone, Gerton Lunter, Jenny C Taylor, David Buck, Michael A Simpson, Peter Donnelly.
Abstract
Whole-genome sequencing (WGS) is becoming widely used in clinical medicine in diagnostic contexts and to inform treatment choice. Here we evaluate the potential of the Oxford Nanopore Technologies (ONT) MinION long-read sequencer for routine WGS by sequencing the reference sample NA12878 and the genome of an individual with ataxia-pancytopenia syndrome and severe immune dysregulation. We develop and apply a novel reference panel-free analytical method to infer and then exploit phase information, which improves single-nucleotide variant (SNV) calling performance from otherwise modest levels. In the clinical sample, we identify and directly phase two non-synonymous de novo variants in SAMD9L (OMIM #159550), inferring that they lie on the same paternal haplotype. Whilst consensus SNV-calling error rates from ONT data remain substantially higher than those from short-read methods, we demonstrate the substantial benefits of analytical innovation. Ongoing improvements to base-calling and SNV-calling methodology must continue for nanopore sequencing to establish itself as a primary method for clinical WGS.
Year: 2019 PMID: 31015479 PMCID: PMC6478738 DOI: 10.1038/s41467-019-09637-5
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Characteristics of sequencing using ONT for NA12878. a Yield per flow cell, with flow cells organized left to right by run date. The total size of the bar represents the number of reads from each flow cell and is split into the proportion of reads that have been mapped in a single alignment (single alignment), mapped in multiple alignments (multiple alignments), have been base-called but not mapped (unmapped), and reads that have not been base-called (not basecalled). b Average read length per flow cell. c Yield (base pairs) per flow cell. d Distribution of per-read substitution, insertion and deletion error rates in the high-quality read set. e Distribution of genomic coverage. f Proportion of sites with a read depth less than 40, binned by G + C content of the surrounding window. Shown are read depth in windows of size 100 bp and 6000 bp.
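The quantity plotted in panel f can be approximated with a short sketch. The function name, the non-overlapping windows, and the ten equal-width G + C bins are assumptions made for illustration; the caption does not say how windows or bins were defined, and the toy inputs are not data from the study.

```python
from collections import defaultdict


def low_depth_fraction_by_gc(reference, depth, window=100, threshold=40, n_bins=10):
    """Return {gc_bin_index: fraction of sites with depth < threshold}.

    reference: reference sequence as a string
    depth:     per-base read depth, same length as reference
    Windows are non-overlapping (an assumption of this sketch).
    """
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have the same length")

    low = defaultdict(int)    # sites below the depth threshold, per G+C bin
    total = defaultdict(int)  # all sites, per G+C bin

    for start in range(0, len(reference) - window + 1, window):
        seq = reference[start:start + window].upper()
        called = sum(seq.count(base) for base in "ACGT")
        if called == 0:
            continue  # skip windows made up entirely of N or other symbols
        gc = (seq.count("G") + seq.count("C")) / called
        # Clamp so that a window with 100% G+C falls in the last bin.
        bin_index = min(int(gc * n_bins), n_bins - 1)
        window_depth = depth[start:start + window]
        total[bin_index] += len(window_depth)
        low[bin_index] += sum(1 for d in window_depth if d < threshold)

    return {b: low[b] / total[b] for b in sorted(total)}


# Toy input: one AT-only window at good depth, one GC-only window
# where half of the sites fall below 40x.
toy_reference = "AT" * 50 + "GC" * 50
toy_depth = [50] * 100 + [30] * 50 + [45] * 50
print(low_depth_fraction_by_gc(toy_reference, toy_depth))
```

Passing `window=6000` would give the second window size mentioned in the caption, on a sequence long enough to contain such windows.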
SNV discovery and filtering
| Filtering approach | Chromosomes | F1 score | FDR | FNR |
|---|---|---|---|---|
| QUAL + contamination | 22 | 86.4% | 12.8% | 14.4% |
| QUAL + contamination | All | 88.3% | 10.9% | 12.5% |
| Phasing + heuristics | 22 | 91.8% | 7.9% | 8.5% |
| Phasing + heuristics | All | 93.4% | 5.3% | 7.8% |
Filtering approaches are either pre-phasing, optimising over contamination parameters in FreeBayes and QUAL score, or post-phasing, optimising over use of phasing metrics given a fixed QUAL score, strand bias, and contamination parameters. Results are shown both for chromosome 22, on which parameter cutoffs were derived, and for the full set of autosomes.
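The F1 scores in the table can be checked against the reported FDR and FNR. The sketch below assumes the standard definitions, which the source does not spell out: precision = 1 − FDR, recall = 1 − FNR, and F1 as the harmonic mean of the two. The function name is illustrative.

```python
def f1_from_error_rates(fdr: float, fnr: float) -> float:
    """F1 score from false discovery rate and false negative rate (as fractions)."""
    precision = 1.0 - fdr  # share of called SNVs that are true (assumed definition)
    recall = 1.0 - fnr     # share of true SNVs that were called (assumed definition)
    return 2.0 * precision * recall / (precision + recall)


# (filtering approach, chromosomes, FDR, FNR) as reported in the table
rows = [
    ("QUAL + contamination", "22", 0.128, 0.144),
    ("QUAL + contamination", "All", 0.109, 0.125),
    ("Phasing + heuristics", "22", 0.079, 0.085),
    ("Phasing + heuristics", "All", 0.053, 0.078),
]

for approach, chromosomes, fdr, fnr in rows:
    f1 = 100.0 * f1_from_error_rates(fdr, fnr)
    print(f"{approach:22s} {chromosomes:>3s}  F1 = {f1:.1f}%")
# QUAL + contamination    22  F1 = 86.4%
# QUAL + contamination   All  F1 = 88.3%
# Phasing + heuristics    22  F1 = 91.8%
# Phasing + heuristics   All  F1 = 93.4%
```

Under these definitions all four computed values match the F1 column to one decimal place.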
Fig. 2 Investigation of residual errors in NA12878 data. Annotation of called or truth SNPs using genomic features or sequencing context in NA12878. Results across columns give the different sets of SNPs, either pre- or post-phasing, and for post-phasing, optionally all SNPs or those at high local depth (≥60× coverage). Results across rows give SNP classes of true positives, false positives and false negatives. Bars are broken horizontally to reflect multiple possible annotations, while vertical splits represent SNPs with multiple annotations. Annotations are: homopolymer, SNP intersects a homopolymer of length at least 5 bases; Coverage <40×, per-base coverage of less than 40×; 40%
Fig. 3 Phasing clinical sample using ONT. Top of figure shows Illumina unphased genotypes for the mother (M I), father (F I) and proband (P I), as well as phased genotypes for the proband using ONT, at bi-allelic PASS SNVs identified by Illumina sequencing that have a heterozygous genotype in at least one member of the trio. Unphased genotypes are represented with triangles in boxes where blue = alt and orange = ref. Phased proband genotypes (P N) are represented by two rows of vertical bars, where each row is an arbitrarily labelled haplotype, and each bar is split by colour according to the probability of that haplotype having the reference or alternate base. Middle of figure shows two rows with the reads for haplotype 1 or haplotype 2, where for each read, bases are rectangles, and read span is given by a horizontal line. Gaps represent either a deletion or a base that corresponds to neither the reference nor the alternate allele. Bottom shows physical position, with sites of interest in red. Note that some of the phase set containing the sites of interest extends another 150 kb distally but is not shown in the interests of clarity. Based on GRCh37 and NM_152703.3, 92761932 T>C corresponds to c.3353A>G whilst 92764209 C>T corresponds to c.1076G>A.