| Literature DB >> 21478889 |
Mark A DePristo1, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler, Mark J Daly.
Abstract
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.Entities:
Mesh:
Year: 2011 PMID: 21478889 PMCID: PMC3083463 DOI: 10.1038/ng.806
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Figure 1Framework for variation discovery and genotyping from next-generation DNA sequencing. See text for a detailed description.
Next-generation DNA sequencing data sets analyzed
| HiSeq | Exome | Low-pass | |
|---|---|---|---|
| Samples | NA12878 | NA12878 | NA12878 + 60 unrelated CEPH individuals |
| Sequencing technologies | Whole genome shotgun; Illumina HiSeq 2000 | Agilent exome hybrid capture | Whole genome shotgun; Illumina GenomeAnalyzer |
| Coverage per sample | ~60× | ~150×; 93% of bases at >20× coverage | ~4× |
| Read architecture | 101bp paired end | 76/101bp paired end | 25, 36, 51, 76, ~250 (454) bp single and paired ends |
| Targeted area | 2.85 Gb of autosomes and chrX | 28 Mb | 2.85 Gb of autosomes and chrX |
| Data set source | Novel, generated for this article | Novel, generated for this article | 1000 Genomes Project |
| Aligner(s) | BWA | MAQ | MAQ |
Raw to recalibrated, imputed SNP calls HiSeq, Exome, and 61 sample low-pass data sets. Part one of each section summarizes the impact of local realignment and base quality recalibration by comparing SNP calls on reads with raw quality scores and alignments to those made on the realigned, recalibrated reads.
| Site discovery | Comparison to NA12878 variants | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| No. of SNPs | Ti/Tv | HM3 concordance | HM3 concordance | |||||||
| All | Known | Novel | dbSNP% | Known | Novel | NR | NRD rate | NR | NRD rate | |
| Raw reads, all calls | 4.43M | 3.49M | 941K | 78.77 | 2.05 | 1.29 | 99.74 | 0.10 | 99.57 | 0.20 |
| Unique to raw read calls | 263K | 37K | 226K | 13.95 | 1.37 | 0.70 | 0.02 | 37.97 | 0.09 | 12.64 |
| Unique to +recal/+MSA calls | 9.8K | 1.8K | 8.0K | 18.08 | 1.38 | 1.39 | 0.00 | 18.18 | 0.00 | 9.93 |
| +recal/+MSA, all calls | 4.18M | 3.45M | 722K | 82.71 | 2.06 | 1.57 | 99.72 | 0.09 | 99.48 | 0.19 |
| Filtered by variant recalibration | 595K | 235K | 360K | 39.44 | 1.19 | 1.21 | 0.67 | 3.00 | 2.2 | 4.31 |
| Raw reads, all calls | 13.4M | 6.5M | 6.9M | 48.77 | 2.05 | 1.13 | 83.97 | 20.34 | 80.45 | 22.53 |
| Unique to raw read calls | 670K | 32K | 638K | 4.74 | 1.19 | 0.67 | 0.01 | 49.21 | 0.02 | 52.57 |
| Unique to +recal/+MSA calls | 45K | 2.5K | 42K | 5.62 | 0.94 | 0.68 | 0.00 | N/A | 0.00 | 38.89 |
| +recal/+MSA, all calls | 12.8M | 6.5M | 6.3M | 50.92 | 2.06 | 1.18 | 83.97 | 20.33 | 80.43 | 22.52 |
| Filtered by variant recalibration | 5.5M | 706K | 4.8M | 12.84 | 1.31 | 1.01 | 0.95 | 26.54 | 3.44 | 32.91 |
| | ||||||||||
| Variant recalibrated NGS reads only | 2.44M | 2.30M | 140K | 94.28 | 2.15 | 2.06 | 83.02 | 20.26 | 76.99 | 22.01 |
| Recalibrated with Beagle imputation | 3.20M | 3.01M | 191K | 94.03 | 2.18 | 2.09 | 96.72 | 3.32 | 91.21 | 3.35 |
| Raw reads, all calls | 18.9K | 16.8K | 2.1K | 88.83 | 3.20 | 1.16 | 99.10 | 0.09 | 99.12 | 0.12 |
| Unique to raw read calls | 483 | 39 | 444 | 8.07 | 2.55 | 0.31 | 0.04 | 25.00 | 0.03 | 33.33 |
| Unique to +recal/+MSA calls | 81 | 40 | 41 | 49.38 | 3.44 | 1.73 | 0.01 | 0.00 | 0.04 | 16.67 |
| +recal/+MSA, all calls | 18.5K | 16.8K | 1.7K | 90.77 | 3.20 | 1.61 | 99.07 | 0.08 | 99.13 | 0.11 |
| Filtered by variant recalibration | 1274 | 609 | 665 | 47.8 | 1.85 | 0.84 | 0.59 | N/A | 0.76 | N/A |
Figure 4(a) Relationship in the HiSeq call set between strand bias and quality by depth, for genomic locations in HapMap3 (red) and dbSNP (yellow) used for training the variant quality score recalibrator (left) and the same annotations applied to differentiate likely true positive (green) from false positive (purple) novel SNPs. (b,c,d) Quality tranches in the recalibrated HiSeq (b), exome (c), and low-pass CEU (d) calls beginning with (top) the highest-quality but smallest call set with an estimated false positive rate among novel SNP calls of <1/1000 to a more comprehensive call set (bottom) that includes effectively all true positives in the raw call set along with more false positive calls for a cumulative false positive rate of 10%. Each successive call set contains within it the previous tranche’s true and false positive calls (shaded bars) as well as tranche-specific calls of both classes (solid bars). The tranche selected for further analyses here is indicated.
Figure 5Variation discovered among 60 individuals from the CEPH population from 1000 Genomes pilot phase plus low-pass NA12878. (a) Discovered SNPs by non-reference allele count in the 61 CEPH cohort, colored by known (light blue, striped) and novel (dark blue, filled) variation, along with non-reference sensitivity to CEU HapMap3 and 1000 Genomes low-pass variants. (b) Quality and certainty of discovered SNPs by non-reference allele count. The histogram depicts the certainty of called variation broken out into 0.1, 1, and 10% novel FDR tranches. The Ti/Tv ratio is shown for known and novel variation for each allele count, aggregating the novel calls with allele count > 74 due to their limited numbers. (c,d) Genotyping accuracy for NA12878 from reads alone (blue circles) and following genotype-likelihood based imputation (pink squares) called in the 61 sample call set as assessed by the NRD rate to HiSeq genotypes, as a function of allele count (c) and sequencing depth (d).
Figure 6Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples. (a) Receiver operating characteristic (ROC) curves for SNP calls relating specificity and sensitivity to discover non-reference sites from the NA12878 HiSeq call set. The maximum callable sensitivity, 66%, is the percent of sites from the HiSeq call set where at least one read carries the alternate allele in the low-pass data for NA12878; it reflects both differences in the sequencing technologies (36–76bp GAII for the low-pass NA12878 sample vs. 101bp HiSeq) as well as the vagaries of sampling at 4× coverage. Because most of these missed sites are common and are consequently called in the other samples, imputation recovers ~50% of these sites. (b,c) Increasing power to identify strand-biased, likely false positive SNP calls with additional samples. Histograms of the Strand Bias annotation at raw variant calls discovered in the low-pass CEU data using NA12878 at 4× combined with one other CEU individual (b) and with 60 other individuals (c) stratified into sites present (green) and not (purple) in the 1000 Genomes CEU trio.
Figure 3Raw (violet) and recalibrated (blue) base quality scores for NGS paired end read sets of NA12878 of (a) Illumina/GA (b) Life/SOLiD and (c) Roche/454 lanes from 1000 Genomes, and (d) Illumina/HiSeq. For each technology: top panel: shows reported base quality scores compared to the empirical estimates (Methods); middle panel: the difference between the average reported and empirical quality score for each machine cycle, with positive and negative cycle values given for the first and second read in the pair, respectively; bottom panel: the difference between reported and empirical quality scores for each of the 16 genomic dinucleotide contexts. For example, the AG context occurs at all sites in a read where G is the current nucleotide and A is the preceding one in the read. Root-mean-square errors (RMSE) are given for the pre- and post-recalibration curves.