Neil A Miller, Emily G Farrow, Margaret Gibson, Laurel K Willig, Greyson Twist, Byunggil Yoo, Tyler Marrs, Shane Corder, Lisa Krivohlavek, Adam Walter, Josh E Petrikin, Carol J Saunders, Isabelle Thiffault, Sarah E Soden, Laurie D Smith, Darrell L Dinwiddie, Suzanne Herd, Julie A Cakici, Severine Catreux, Mike Ruehle, Stephen F Kingsmore.
Abstract
While the cost of whole genome sequencing (WGS) is approaching the realm of routine medical tests, it remains too tardy to help guide the management of many acute medical conditions. Rapid WGS is imperative in light of growing evidence of its utility in acute care, such as in diagnosis of genetic diseases in very ill infants, and genotype-guided choice of chemotherapy at cancer relapse. In such situations, delayed, empiric, or phenotype-based clinical decisions may meet with substantial morbidity or mortality. We previously described a rapid WGS method, STATseq, with a sensitivity of >96 % for nucleotide variants that allowed a provisional diagnosis of a genetic disease in 50 h. Here improvements in sequencing run time, read alignment, and variant calling are described that enable 26-h time to provisional molecular diagnosis with >99.5 % sensitivity and specificity of genotypes. STATseq appears to be an appropriate strategy for acutely ill patients with potentially actionable genetic diseases.
Year: 2015 PMID: 26419432 PMCID: PMC4588251 DOI: 10.1186/s13073-015-0221-8
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Table 1. Breakdown of times of principal steps for rapid diagnostic whole genome sequencing
| Method | Sample | Site | DNA isolation, QC and shearing | PCR-free library prep | WGS library QC | SBS | Yield (GB) | % > Q30 | Alignment | Variant calling | RUNES variant annotation | Provisional diagnosis | Total time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published WGS50 | Multiple^a | Both | 2:30 | 3:15 | 1:30 | 25:30 | 139 | 90 | 14:40 (combined) | (in alignment) | 2:30 | 0:05 | 50:00 |
| SBS18, GSNAP/GATK/noVQSR | 5006-01 | CMH | 2:30 | 3:15 | 1:30 | 19:45 | 128 | 91 | 22:30 (combined) | (in alignment) | 0:29 | n.a. | 49:59 |
| WGS26, SBS18, and Dragen v1.2 | UDT_173 | Essex | 2:30 | 3:02 | 1:30 | 17:58 | 106 | 92 | 0:15 | 0:15 | 0:34 | 0:04 | 26:08 |
| WGS26, SBS18, and Dragen v1.2 | UDT_103 | Essex | 2:30 | 3:05 | 1:30 | 18:25 | 130 | 90 | 0:19 | 0:22 | 0:31 | 0:05 | 26:47 |
| WGS26, SBS18, and Dragen v1.2 | NA12878 | Essex | 2:30 | 3:15 | 1:30 | 18:00 | 143 | 85^b | 0:19 | 0:22 | 0:33 | n.a. | 26:28 |
| WGS26, SBS18, and Dragen v1.2 | NA12878 | CMH | 2:30 | 3:15 | 1:30 | 18:36 | 65^c | 85^b | 0:10 | 0:11 | 0:35 | n.a. | 26:47 |
Times are in hours:minutes. For the first two rows, alignment and variant calling are reported as a single combined time. GB, gigabases; Q, Phred-like quality score; QC, quality control; SBS, 2 × 101 cycle sequencing-by-synthesis
^a Reference 12
^b Prior to SBS18, after failing tiles were removed
^c Single flowcell
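The totals in the time breakdown can be checked by summing the per-step durations. The following is a minimal sketch of that arithmetic for the WGS26 run of sample UDT_173; the helper names (`to_minutes`, `to_hmm`, `total_time`) are illustrative and not part of any pipeline described in the article.

```python
def to_minutes(hmm: str) -> int:
    """Convert an 'h:mm' duration string into whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hmm(total_minutes: int) -> str:
    """Convert whole minutes back into an 'h:mm' duration string."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


def total_time(steps: dict) -> str:
    """Sum the 'h:mm' durations of all steps and return the total as 'h:mm'."""
    return to_hmm(sum(to_minutes(t) for t in steps.values()))


# Step times (h:mm) for the WGS26 run of sample UDT_173, copied from the table.
udt_173_steps = {
    "DNA isolation, QC and shearing": "2:30",
    "PCR-free library prep": "3:02",
    "WGS library QC": "1:30",
    "SBS": "17:58",
    "Alignment": "0:15",
    "Variant calling": "0:15",
    "RUNES variant annotation": "0:34",
    "Provisional diagnosis": "0:04",
}

print(total_time(udt_173_steps))  # 26:08, matching the reported total
```

Sequencing-by-synthesis accounts for 17:58 of the 26:08 total in this row, while alignment and variant calling together take 0:30.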
Fig. 1 Comparison of quality metrics of 18-h and 26-h 2 × 100 nt runs. The runs were WGS of sample UDT_173 [12]. a–d. Base composition was not materially different in the 18-h and 26-h runs. However, the % non-AGTC reads was lower in the 18-h run. This may either reflect better sequence quality or lower cluster density. e–h. Frequency distribution of GC content of 18-h and 26-h runs. While the number of reads (y-axis) differed between runs, 18-h and 26-h runs had identical GC content distributions, with sequence representation between GC content of 15 % and 75 %. GC content varies widely across the human genome (the isochore structure of the human genome) [24, 35]. The median genome GC content estimated by 18-h and 26-h WGS (35–40 %) agreed with the estimated median from the 1,000 genomes project [36] (38.6 %), and is slightly lower than estimates by cesium density gradient centrifugation [42, 43] (39.6–40.3 %). i–l. Quality scores of nucleotide calls as a function of cycle were indistinguishable in 18-h and 26-h runs.
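A per-read GC content distribution of the kind shown in panels e–h can be sketched as follows. This is an illustrative example only: the function names, the 5 % bin width, and the choice to ignore non-ACGT calls are assumptions, not the method used to produce the figure.

```python
from collections import Counter


def gc_percent(read: str):
    """Return the GC content of a read in percent, ignoring non-ACGT calls.

    Returns None when the read contains no A, C, G, or T bases.
    """
    bases = [b for b in read.upper() if b in "ACGT"]
    if not bases:
        return None
    return 100 * sum(b in "GC" for b in bases) / len(bases)


def gc_histogram(reads, bin_width: int = 5) -> dict:
    """Count reads per GC-content bin; keys are the lower edge of each bin."""
    hist = Counter()
    for read in reads:
        gc = gc_percent(read)
        if gc is None:
            continue  # skip reads made up entirely of ambiguous calls
        # Fold a GC content of exactly 100 % into the top bin.
        bin_start = min(int(gc // bin_width) * bin_width, 100 - bin_width)
        hist[bin_start] += 1
    return dict(sorted(hist.items()))


# Toy reads, made up for illustration.
toy_reads = ["ATGC", "ATAT", "GGCC", "ATGN"]
print(gc_histogram(toy_reads))
```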
Table 2. Comparison of the analytic performance of a conventional alignment and variant calling pipeline (GSNAP with GATK, without VQSR) with a novel, extremely rapid method (DRAGEN)
| Sample | SBS18 yield (GB) | Site | Pipeline | Reads aligned | Alignments with mapping quality >20 | Variants called | Mismatch rate | Indel rate | % Paired Reads | Strand balance | % Chimeric Reads | Rare, potentially pathogenic variants | Analytic sensitivity (GeT-RM or SNP array) | Analytic specificity (GeT-RM or SNP array) | Analytic sensitivity (full GIAB) | Analytic specificity (full GIAB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA12878 | 133 | Essex | DRAGEN | 99.4 % | 95.48 % | 4,782,970 | 0.0029 | 0.00017 | 99.55 % | 0.500 | 0.69 % | 658 | 99.93 % | 99.87 % | 99.69 % | 99.99 % |
| NA12878 | 133 | Essex | GSNAP/GATK-1.6/noVQSR | 98.5 % | 96.33 % | 5,343,988 | 0.0056 | 0.00017 | 98.55 % | 0.496 | 0.82 % | 783 | 99.54 % | 98.57 % | 98.21 % | 99.99 % |
| NA12878 | 65^a | CMH | DRAGEN | 97.7 % | 91.31 % | 4,633,357 | 0.0060 | 0.00023 | 99.18 % | 0.501 | 1.89 % | 775 | 99.42 % | 99.46 % | 98.63 % | 99.99 % |
| NA12878 | 65^a | CMH | GSNAP/GATK-3.2/noVQSR | 96.2 % | 92.86 % | 4,571,157 | 0.0079 | 0.00021 | 97.55 % | 0.499 | 1.75 % | 593 | 97.29 % | 95.35 % | 95.74 % | 99.99 % |
| UDT_173 | 106 | Essex | DRAGEN | 99.5 % | 94.92 % | 4,742,150 | 0.0034 | 0.00020 | 99.80 % | 0.500 | 1.12 % | 620 | 96.13 % | 97.74 % | n.a. | n.a. |
| UDT_173 | 106 | Essex | GSNAP/GATK-1.6/noVQSR | 99.3 % | 96.88 % | 4,294,504 | 0.0034 | 0.00019 | 99.34 % | 0.500 | 0.90 % | 512 | 88.54 % | 98.06 % | n.a. | n.a. |
All runs were 18-h WGS. The NA12878 reference genotypes were NIST High Confidence calls from GeT-RM/NA12878.NIST-GIAB_v.2.18 (labeled ‘GeT-RM’) or the full GIAB dataset (labeled ‘full GIAB’). The UDT_173 reference genotypes were results of hybridization to the Omni4 SNP array. GSNAP was version 2012.07.12, with default parameters, and GATK was version 1.6.13 or 3.2, without VQSR. DRAGEN was version 1.2. % paired, percentage of reads whose mate was also aligned; Strand balance, reads aligned to the forward strand divided by total reads aligned; % chimeric, percentage of chimeric alignments (mates >100 kb apart or on different chromosomes).
^a Single flowcell
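The three alignment metrics defined in the table legend (% paired, strand balance, % chimeric) can be illustrated with a small sketch. The `Alignment` record and `alignment_metrics` function are hypothetical simplifications, not the data model of GSNAP, GATK, or DRAGEN, and using total aligned reads as the denominator for % chimeric is an assumption, since the legend does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Hypothetical, simplified record for one aligned read."""
    chrom: str
    pos: int
    forward: bool                 # True if aligned to the forward strand
    mate_chrom: Optional[str]     # None if the mate did not align
    mate_pos: Optional[int]


def alignment_metrics(alns, chimeric_distance: int = 100_000) -> dict:
    """Compute % paired, strand balance, and % chimeric per the table legend."""
    total = len(alns)
    if total == 0:
        raise ValueError("no alignments supplied")
    # Paired: the read's mate was also aligned.
    paired = [a for a in alns if a.mate_chrom is not None and a.mate_pos is not None]
    # Chimeric: mates on different chromosomes or more than 100 kb apart.
    chimeric = [
        a for a in paired
        if a.mate_chrom != a.chrom or abs(a.mate_pos - a.pos) > chimeric_distance
    ]
    return {
        "pct_paired": 100 * len(paired) / total,
        "strand_balance": sum(a.forward for a in alns) / total,
        # Assumption: denominator is all aligned reads.
        "pct_chimeric": 100 * len(chimeric) / total,
    }


# Toy alignments, made up for illustration.
toy = [
    Alignment("chr1", 1_000, True, "chr1", 1_300),
    Alignment("chr1", 1_300, False, "chr1", 1_000),
    Alignment("chr2", 5_000, True, "chr2", 500_000),   # mates >100 kb apart
    Alignment("chr3", 700, False, None, None),         # mate not aligned
]
print(alignment_metrics(toy))
```

A strand balance near 0.500, as reported for every run in the table, means reads align to the forward and reverse strands in roughly equal numbers.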
Fig. 2 Improving the sensitivity of nucleotide variant identification for diagnosis of rare genetic diseases in approximately 35X human WGS. a. Venn diagram comparing nucleotide variants identified in WGS of sample UDT_173 (HiSeq 2500, 2 × 100 nt, 18-h run time) with previously disclosed methods for 50-h diagnostic WGS (Published WGS50 pipeline) [12], or with parameters described herein to improve sensitivity (GSNAP/GATK-VQSR). b. Pie charts showing the distribution of allele frequencies and pathogenicity of nucleotide variants reported by the three pipelines (Published WGS50, GSNAP/GATK-VQSR, and DRAGEN) in WGS of the same sample. Rare variants had allele frequencies <0.01, based on genomic sequences of approximately 3,000 internal samples. Previously reported disease causing variants are ACMG Category 1 mutations. Likely pathogenic variants are ACMG Category 2 variants (loss of initiation, premature stop codon, disruption of stop codon, whole gene deletion, frameshifting indel, disruption of splicing). Possibly pathogenic variants are ACMG Category 3 (non-synonymous substitution, in-frame indel, disruption of polypyrimidine tract, overlap with 5’ exonic, 5’ flank or 3’ exonic splice contexts, and intragenic mitochondrial variants). c. Graphs of variant density versus variant allele frequency. Values for the two pipelines are plotted. Results represent the sum of approximately 40X WGS in three samples. Upper panel shows results for all variants. Lower panel shows results for ACMG Category 1–3 variants.
Fig. 3 Variation in the sensitivity and specificity of nucleotide variant calls and genotypes as a function of sequencing depth. Several 2 × 100 nt runs of WGS of sample NA12878 were generated and the sensitivity (red diamonds) and specificity (blue squares) of variant calls (a) or genotypes (b) by GSNAP/GATK-VQSR were examined by comparison with a reference set (GeT-RM/NA12878.NIST-GIAB_v.2.18) at depth of coverage of 10X to 100X.
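Sensitivity and specificity against a reference set are conventionally computed from true/false positive and negative counts. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the exact site-counting rules used in the article are not given in this record, so the set-based comparison and all names and toy positions here are assumptions.

```python
def confusion_counts(assessed_sites: set, reference_variants: set, called_variants: set) -> dict:
    """Classify each assessed site by comparing calls against the reference set."""
    tp = len(called_variants & reference_variants)   # called and in reference
    fn = len(reference_variants - called_variants)   # in reference but missed
    fp = len(called_variants - reference_variants)   # called but not in reference
    tn = len(assessed_sites - called_variants - reference_variants)
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn}


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of reference variants that were called."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-variant sites that were correctly left uncalled."""
    return tn / (tn + fp)


# Toy positions, made up for illustration.
sites = set(range(1, 11))        # ten assessed positions
reference = {2, 4, 6, 8}         # variants in the reference set
called = {2, 4, 6, 9}            # variants reported by a pipeline

c = confusion_counts(sites, reference, called)
print(sensitivity(c["tp"], c["fn"]))  # 0.75
print(specificity(c["tn"], c["fp"]))
```

Because non-variant sites vastly outnumber variant sites in a genome, specificity computed this way tends to sit very close to 100 %, which is consistent with the 99.99 % values reported against the full GIAB dataset.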
Fig. 4 Comparison of the number and rate of true positive variant calls with GSNAP/GATK-VQSR and DRAGEN. The three samples and reference datasets are as in Tables 1 and 2. Numbers are variant calls. TP: Variants in the NA12878 CDC/GeT-RM clinical validation set in which true positive variant calls were made. %TP variants in the larger NIST/GIAB reference set were similar to those in the GeT-RM set (NA12878-Essex, DRAGEN only 92.3 % of 143,385 TP, GATK only 19.8 % of 96,003 TP; NA12878-Gill, DRAGEN only 98.0 % of 1,335,504 TP, GATK only 91.8 % of 58,571 TP).