Neil I Weisenfeld, Shuangye Yin, Ted Sharpe, Bayo Lau, Ryan Hegarty, Laurie Holmes, Brian Sogoloff, Diana Tabbaa, Louise Williams, Carsten Russ, Chad Nusbaum, Eric S Lander, Iain MacCallum, David B Jaffe.
Abstract
Complete knowledge of the genetic variation in individual human genomes is a crucial foundation for understanding the etiology of disease. Genetic variation is typically characterized by sequencing individual genomes and comparing reads to a reference. Existing methods do an excellent job of detecting variants in approximately 90% of the human genome; however, calling variants in the remaining 10% of the genome (largely low-complexity sequence and segmental duplications) is challenging. To improve variant calling, we developed a new algorithm, DISCOVAR, and examined its performance on improved, low-cost sequence data. Using a newly created reference set of variants from the finished sequence of 103 randomly chosen fosmids, we find that some standard variant call sets miss up to 25% of variants. We show that the combination of new methods and improved data increases sensitivity severalfold, with the greatest impact in challenging regions of the human genome.
Year: 2014 PMID: 25326702 PMCID: PMC4244235 DOI: 10.1038/ng.3121
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Some categories of challenging variants
| Fosmid reference variants | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| category | number | % of each category missing from call set | | | | | % of total missing variants in each category | | | | |
| | | Platinum-100 | GATK-250 | Cortex-250 | DISCOVAR-250 | union | Platinum-100 | GATK-250 | Cortex-250 | DISCOVAR-250 | union |
| low complexity | 1115 | 53.6 | 32.1 | 63.9 | 16.2 | 11.6 | 53.5 | 64.5 | 40.4 | 66.5 | 64.8 |
| near A/T homopolymers | 360 | 68.3 | 45.0 | 74.4 | 17.5 | 13.9 | 22.0 | 29.2 | 15.2 | 23.2 | 25.1 |
| segmental duplication | 351 | 78.1 | 37.3 | 90.3 | 17.7 | 16.5 | 24.5 | 20.4 | 18.0 | 22.8 | 29.1 |
| long insertion | 9 | 100.0 | 88.9 | 100.0 | 77.8 | 77.8 | 0.8 | 1.4 | 0.5 | 2.6 | 3.5 |
| extremely high GC | 5 | 100.0 | 60.0 | 80.0 | 60.0 | 60.0 | 0.4 | 0.5 | 0.2 | 1.1 | 1.5 |
| challenging | 1402 | 58.8 | 33.0 | 69.1 | 16.8 | 12.9 | 73.6 | 83.2 | 55.0 | 86.4 | 91.0 |
| ordinary | 3084 | 9.5 | 3.0 | 25.7 | 1.2 | 0.6 | 26.4 | 16.8 | 45.0 | 13.6 | 9.0 |
Several categories of challenging variants are described and compared to the Fosmid reference set. The “number” column gives the raw count of events in regions defined by the Fosmid reference set. Subsequent columns give the false negative rate within each category of variant (“% of each category missing from call set”) and the breakdown by category of all false negatives (“% of total missing variants in each category”). Variants are categorized by type and/or region of the genome as follows. Low complexity: bases identified as having low complexity by symmetric DUST with default parameters[15]. Near A/T homopolymers: bases lying within 5 bases of a run of 15 or more identical A or T bases; of 360 such events, all but 8 were labeled low complexity by DUST. Segmental duplication: a segmental duplication as defined in the Segmental Duplication DB[16] (see URL section). Long insertion: an insertion of > 100 bases. Extremely high GC: defined[17] by bases occurring in 200-base windows whose middle 100 bases have %GC ≥ 85. Challenging: union of the low complexity, segmental duplication, long insertion and extremely high GC categories. Ordinary: the complement of challenging. This table has been adjusted to reflect the manual corrections of Supplementary Tables 1a, b.
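The two groups of columns in the table above are linked by simple arithmetic, which the following minimal sketch illustrates for the DISCOVAR-250 columns. The counts and miss rates are copied from the table; the function name and the dictionary layout are hypothetical, and the total number of missed variants is assumed to be the sum over the challenging and ordinary rows, since those two categories partition the reference set. Because the published percentages are rounded, the recomputed shares agree with the table only to within a few tenths of a percent.

```python
# Counts and per-category miss rates (%) for DISCOVAR-250, copied from the table.
COUNTS = {
    "low complexity": 1115,
    "near A/T homopolymers": 360,
    "segmental duplication": 351,
    "long insertion": 9,
    "extremely high GC": 5,
    "challenging": 1402,
    "ordinary": 3084,
}
PCT_MISSING_DISCOVAR = {
    "low complexity": 16.2,
    "near A/T homopolymers": 17.5,
    "segmental duplication": 17.7,
    "long insertion": 77.8,
    "extremely high GC": 60.0,
    "challenging": 16.8,
    "ordinary": 1.2,
}


def share_of_total_missed(counts, pct_missing):
    """Convert per-category miss rates into each category's share of all misses.

    Assumption: "challenging" and "ordinary" partition the reference variants,
    so their missed counts add up to the total number of missed variants.
    """
    missed = {k: counts[k] * pct_missing[k] / 100.0 for k in counts}
    total_missed = missed["challenging"] + missed["ordinary"]
    return {k: 100.0 * missed[k] / total_missed for k in counts}


shares = share_of_total_missed(COUNTS, PCT_MISSING_DISCOVAR)
for category, share in shares.items():
    print(f"{category}: {share:.1f}% of missed variants")
```

The individual challenging categories overlap (for example, most near-homopolymer events are also low complexity), so their shares are not expected to sum to the share of the challenging row.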
Estimated sensitivity and specificity of variant call sets
| call set | read length | %FN | #heterozygous/#homozygous | %FP, heterozygous variants | %FP, homozygous variants | %FP, all variants |
|---|---|---|---|---|---|---|
| Platinum | 100 | 25.0 ± 2.5 | 1.49 | 0.83 ± 0.07 | 0.85 ± 0.26 | 0.84 ± 0.11 |
| GATK | 250 | 12.3 ± 1.8 | 1.54 | 1.82 ± 0.45 | 0.74 ± 0.72 | 1.39 ± 0.39 |
| Cortex | 250 | 39.3 ± 2.6 | 1.39 | 0.33 ± 0.18 | 3.46 ± 0.61 | 1.64 ± 0.28 |
| DISCOVAR | 250 | 6.0 ± 1.2 | 1.57 | 1.44 ± 0.23 | 1.94 ± 0.40 | 1.63 ± 0.21 |
For each of four variant call sets, we estimated the percentage of false negatives (%FN) and false positives (%FP). False negative rates were estimated using 100 randomly selected Fosmids, as described in the text. False positive rates for homozygous variants were estimated by computing the fraction of homozygous calls that were not in the Fosmid reference set. In both cases standard errors were obtained by bootstrapping, using 1000 bootstrap samples from the set of 100 Fosmids[21]. False positive rates for heterozygous variants were estimated by dividing the number of heterozygous events observed in the 100 Mb region of X from 10–110 Mb on NA12878’s father by the number observed on the same region for NA12878, then dividing by 1.8 = (heterozygous calls per Mb of genome)/(heterozygous calls per Mb of X), taken from the Platinum-100 call set for NA12878, thus correcting for the difference between X and the genome. Standard errors were obtained using 1000 bootstrap samples from the set of 100 regions of size 1 Mb obtained by segmenting the 100 Mb region. Intermediate calculations for false negatives and positives are shown in Supplementary Table 6. #heterozygous/#homozygous: genome-wide ratio of the number of heterozygous calls to the number of homozygous calls. %FP for all variants: the average of %FP for heterozygous and homozygous variants, weighted by #heterozygous/#homozygous. The FN and FP (homozygous) values in this table are corrected values, after taking account of the manual corrections from Supplementary Table 1b. Data for the Platinum-100 call set had 48x PF Q30 coverage, while data for the 250-base analyses had 40x raw Q30 coverage and 39x PF Q30 coverage.
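The estimators described in this caption can be sketched in a few lines. This is an illustrative reading, not the authors' code: all function names are hypothetical, the heterozygous estimate is assumed to be expressed as a percentage, the weighted average is assumed to use weights #heterozygous/#homozygous and 1, and the bootstrap is assumed to resample whole units (Fosmids, or 1 Mb segments) with replacement.

```python
import random


def het_fp_rate(father_het_on_x, proband_het_on_x, x_correction=1.8):
    """%FP for heterozygous calls from counts on the X region.

    The father carries one copy of X, so heterozygous calls there are taken
    as errors. Dividing by x_correction (1.8 in the caption) adjusts for the
    lower heterozygous call density on X. Input counts are hypothetical.
    """
    return 100.0 * father_het_on_x / proband_het_on_x / x_correction


def combined_fp(fp_het, fp_hom, het_hom_ratio):
    """Average of the two %FP values, weighted by #heterozygous/#homozygous."""
    return (het_hom_ratio * fp_het + fp_hom) / (het_hom_ratio + 1.0)


def fn_rate(units):
    """%FN from per-unit (missed, total) pairs, e.g. one pair per Fosmid."""
    missed = sum(m for m, _ in units)
    total = sum(t for _, t in units)
    return 100.0 * missed / total


def bootstrap_se(units, statistic, n_boot=1000, seed=0):
    """Standard error of `statistic` from resampling units with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(units) for _ in units]
        estimates.append(statistic(sample))
    mean = sum(estimates) / n_boot
    variance = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return variance ** 0.5


# Recompute the "all variants" %FP column from the other columns of the table.
rows = {
    "Platinum": (0.83, 0.85, 1.49),
    "GATK": (1.82, 0.74, 1.54),
    "Cortex": (0.33, 3.46, 1.39),
    "DISCOVAR": (1.44, 1.94, 1.57),
}
for name, (fp_het, fp_hom, ratio) in rows.items():
    print(name, round(combined_fp(fp_het, fp_hom, ratio), 2))
```

Under these assumptions the weighted average reproduces the published "all variants" values to two decimals. In practice `bootstrap_se(per_fosmid_counts, fn_rate)` would be called with one (missed, total) pair per Fosmid, where `per_fosmid_counts` is a placeholder for data not given here.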
Classification of Fosmid variants
| Called by D | Called by G1 | Called by G2 | Called by C | Total | Substitutions | Insertions, 1 bp | Insertions, 2–10 bp | Insertions, 11–100 bp | Insertions, >100 bp | Deletions, 1 bp | Deletions, 2–10 bp | Deletions, 11–100 bp | Deletions, >100 bp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| • | • | • | • | 2542 | 2151 | 117 | 67 | 6 | 0 | 109 | 87 | 5 | 0 |
| | | | | 200 | 109 | 9 | 21 | 16 | 7 | 17 | 14 | 4 | 3 |
| • | | | | 253 | 145 | 25 | 22 | 14 | 1 | 32 | 10 | 4 | 0 |
| | • | | | 9 | 6 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| | | • | | 39 | 25 | 1 | 6 | 2 | 0 | 2 | 1 | 2 | 0 |
| | | | • | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| • | • | | | 36 | 20 | 5 | 1 | 0 | 0 | 7 | 3 | 0 | 0 |
| • | | • | | 489 | 373 | 21 | 42 | 12 | 1 | 7 | 17 | 16 | 0 |
| • | | | • | 14 | 6 | 0 | 1 | 0 | 0 | 4 | 0 | 2 | 1 |
| | • | • | | 10 | 5 | 0 | 1 | 0 | 0 | 3 | 1 | 0 | 0 |
| | • | | • | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| | | • | • | 3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| • | • | • | | 725 | 542 | 52 | 37 | 1 | 0 | 41 | 43 | 9 | 0 |
| • | • | | • | 38 | 26 | 6 | 0 | 0 | 0 | 5 | 1 | 0 | 0 |
| • | | • | • | 118 | 65 | 12 | 10 | 9 | 0 | 4 | 10 | 8 | 0 |
| | • | • | • | 5 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
| | | | | 4486 | 3476 | 248 | 209 | 60 | 9 | 240 | 190 | 50 | 4 |
We classify all the variants in the Fosmid truth call set, which are obtained from the Fosmid reference sequences (and thus represent single haplotypes). For each of four call sets, D = DISCOVAR-250, G1 = Platinum-100, G2 = GATK-250 and C = Cortex, and each of the 2^4 = 16 possible call combinations, variants are classified as substitutions, insertions or deletions, and by size. This table has been modified to reflect the manual corrections in Supplementary Table 1a.
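As a minimal sketch of this classification, each truth variant can be keyed by the subset of call sets that contain it, giving 16 possible keys. The function names and the representation of variants as hashable identifiers are hypothetical. The row totals below are copied from the table, but their assignment to combinations is an assumption: it takes the rows to be ordered by number of callers and then by D, G1, G2, C, an ordering chosen because it reproduces the %FN values of the sensitivity table to within rounding.

```python
from collections import Counter

CALL_SETS = ("D", "G1", "G2", "C")


def combination(variant, calls):
    """Tuple of call-set names (in fixed order) that contain the variant."""
    return tuple(name for name in CALL_SETS if variant in calls[name])


def classify(truth, calls):
    """Count truth variants for each of the 16 possible call combinations."""
    return Counter(combination(v, calls) for v in truth)


def pct_missed(table, name):
    """Percentage of truth variants absent from the named call set."""
    total = sum(table.values())
    missed = sum(n for combo, n in table.items() if name not in combo)
    return 100.0 * missed / total


# Row totals from the table; the combination assigned to each is an assumption.
TOTALS = {
    ("D", "G1", "G2", "C"): 2542,
    (): 200,
    ("D",): 253,
    ("G1",): 9,
    ("G2",): 39,
    ("C",): 2,
    ("D", "G1"): 36,
    ("D", "G2"): 489,
    ("D", "C"): 14,
    ("G1", "G2"): 10,
    ("G1", "C"): 3,
    ("G2", "C"): 3,
    ("D", "G1", "G2"): 725,
    ("D", "G1", "C"): 38,
    ("D", "G2", "C"): 118,
    ("G1", "G2", "C"): 5,
}

for name in CALL_SETS:
    print(name, round(pct_missed(TOTALS, name), 1))
```

The same `classify` function applies to real data once each call set is reduced to a set of normalized variant identifiers; how variants are normalized and matched to the truth set is outside the scope of this sketch.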