Enrique Ramos, Benjamin T Levinson, Sara Chasnoff, Andrew Hughes, Andrew L Young, Katherine Thornton, Allie Li, Francesco L M Vallania, Michael Province, Todd E Druley.
Abstract
BACKGROUND: Rare genetic variation in the human population is a major source of pathophysiological variability and has been implicated in a host of complex phenotypes and diseases. Finding disease-related genes harboring disparate functional rare variants requires sequencing many individuals across many genomic regions and comparing them against unaffected cohorts. However, despite persistent declines in sequencing costs, population-based rare variant detection across large genomic target regions remains cost-prohibitive for most investigators. In addition, DNA samples are often precious, and hybridization methods typically require large amounts of input DNA. Pooled sample DNA sequencing is a cost- and time-efficient strategy for surveying populations of individuals for rare variants. We set out to (1) create a scalable multiplexing method for custom capture, with or without individual DNA indexing, that is amenable to low amounts of input DNA and (2) expand the functionality of the SPLINTER algorithm for calling substitutions, insertions and deletions across either candidate genes or the entire exome by integrating the variant calling algorithm with the dynamic programming aligner Novoalign.
Year: 2012 PMID: 23216810 PMCID: PMC3534616 DOI: 10.1186/1471-2164-13-683
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Allele frequency by individual versus pooled exome sequencing. Correlation plot comparing a total of 2,937 positions found in both the Agilent SureSelect hybridization exome capture and the Affymetrix 6.0 array, for individual versus pooled exomes. The difference in size between the spheres represents the relative number of variant positions with the same minor allele frequency.
Figure 2. Pooled capture SNV minor allele frequency correlation. Correlation plot comparing a total of 499 positions that overlapped between the custom hybridization targeted regions and the Illumina Omni-2.5-8 genome-wide SNV array and had at least one variant called by the array or by the pooled analysis. Of these 499 positions, 477 (95.6%) had at least one variant allele call by both SPLINTER and the array, 20 were called as SNVs by SPLINTER but not by the array, and 2 were called as SNVs by the array but not by SPLINTER.
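The counts in the Figure 2 caption can be checked with a few lines of arithmetic. The sketch below only restates the three numbers given in the caption; the variable names are illustrative and not taken from the paper.

```python
# Counts as stated in the Figure 2 caption.
both_called = 477      # variant called by both SPLINTER and the array
splinter_only = 20     # called by SPLINTER but not by the array
array_only = 2         # called by the array but not by SPLINTER

# The three categories should account for every compared position.
total_positions = both_called + splinter_only + array_only

# Share of positions where both methods report a variant.
concordance_pct = 100 * both_called / total_positions

print(f"{total_positions} positions, {concordance_pct:.1f}% called by both")
```

The three categories sum to the 499 positions reported, and the share called by both methods rounds to the 95.6% given in the caption.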
Comparing sensitivity and specificity to coverage achieved for three sets of multiplexes
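| Multiplex size | Coverage threshold | % target bases at threshold | Hom. WT sites surveyed | Het. sites surveyed | Hom. variant sites surveyed | Sensitivity, all allele frequencies (%) | Specificity, all allele frequencies (%) | Hom. WT sites, ≤5% MAF | Het. sites, ≤5% MAF | Hom. variant sites, ≤5% MAF | Sensitivity, ≤5% MAF (%) | Specificity, ≤5% MAF (%) | Variant sites, ≤2% MAF | Sensitivity, ≤2% MAF (%) | Specificity, ≤2% MAF (%) | Variant sites, ≤0.5% MAF | Sensitivity, ≤0.5% MAF (%) | Specificity, ≤0.5% MAF (%) |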
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22-24 | ≥3 | 84.4(1.29) | 229483 | 39100 | 24855 | 95.02(1.42) | 99.82(0.14) | 172205 | 2057 | 20 | 92.49 | 99.98 | 656 | 91.6 | 99.98 | 127 | 88.3 | 99.99 |
| | ≥5 | 84.6(1.75) | 221031 | 37332 | 23810 | 96.75(1.23) | 99.82(0.14) | 166050 | 1961 | 19 | 95.30 | 99.98 | 624 | 94.9 | 99.99 | 119 | 90.2 | 99.99 |
| | ≥10 | 75.0(2.65) | 201268 | 33626 | 21451 | 98.22(0.98) | 99.86(0.13) | 151395 | 1768 | 17 | 97.42 | 99.98 | 563 | 96.6 | 99.99 | 111 | 93.8 | 100 |
| | ≥15 | 67.1(3.57) | 182332 | 30279 | 19510 | 98.60(0.93) | 99.88(0.12) | 137327 | 1579 | 17 | 98.50 | 99.98 | 507 | 98.0 | 99.99 | 100 | 96.0 | 100 |
| | ≥20 | 59.8(4.51) | 163986 | 27202 | 17643 | 98.94(0.83) | 99.89(0.12) | 123572 | 1418 | 14 | 98.81 | 99.98 | 460 | 98.7 | 100 | 90 | 95.6 | 100 |
| 48 | ≥3 | 89.0(2.30) | 239300 | 39218 | 26310 | 95.80(1.42) | 99.90(0.08) | 187077 | 2109 | 25 | 92.97 | 99.97 | 651 | 92.9 | 99.99 | 131 | 94.7 | 100 |
| | ≥5 | 84.1(3.14) | 230685 | 37434 | 25232 | 96.43(0.95) | 99.91(0.08) | 179900 | 1997 | 25 | 95.95 | 99.97 | 619 | 95.8 | 99.99 | 125 | 96.8 | 100 |
| | ≥10 | 74.8(4.54) | 209960 | 33448 | 22739 | 98.96(0.51) | 99.94(0.07) | 163128 | 1805 | 22 | 98.30 | 99.97 | 566 | 98.4 | 99.99 | 111 | 100 | 100 |
| | ≥15 | 67.1(5.79) | 190652 | 30010 | 20491 | 99.33(0.42) | 99.95(0.07) | 147843 | 1630 | 19 | 98.85 | 99.97 | 517 | 99.0 | 99.99 | 100 | 100 | 100 |
| | ≥20 | 60.3(6.98) | 172373 | 26937 | 18505 | 99.60(0.37) | 99.96(0.06) | 133259 | 1449 | 18 | 99.05 | 99.97 | 467 | 99.6 | 100 | 88 | 100 | 100 |
| 30-32 | ≥3 | 89.2(2.55) | 245952 | 40732 | 26487 | 96.52(1.08) | 99.97(0.03) | 184498 | 1903 | 16 | 93.54 | 99.97 | 664 | 95.6 | 99.98 | 102 | 97.1 | 99.99 |
| | ≥5 | 84.4(3.49) | 235796 | 38696 | 25251 | 98.23(0.69) | 99.97(0.03) | 177279 | 1809 | 16 | 96.33 | 99.97 | 644 | 97.1 | 99.99 | 102 | 97.1 | 100 |
| | ≥10 | 74.8(5.35) | 212716 | 34440 | 22505 | 99.43(0.40) | 99.97(0.03) | 160352 | 1600 | 14 | 98.88 | 99.97 | 567 | 99.1 | 99.99 | 99 | 96.8 | 100 |
| | ≥15 | 66.8(7.25) | 192642 | 30957 | 20272 | 99.80(0.19) | 99.98(0.03) | 145504 | 1403 | 12 | 99.58 | 99.98 | 497 | 99.4 | 100 | 82 | 97.6 | 100 |
| | ≥20 | 59.5(9.08) | 173212 | 27775 | 18237 | 99.90(0.15) | 99.98(0.03) | 130959 | 1260 | 10 | 99.76 | 99.98 | 448 | 99.6 | 100 | 72 | 98.8 | 100 |
The percentage of bases in the targeted intervals that reached the specified coverage threshold is listed. The homozygous wild type, heterozygous, and homozygous variant sites surveyed are the numbers of positions of each type, as genotyped by the Illumina Omni 2.5-8 array, that were used to determine sensitivity and specificity. Sensitivity is the percentage of heterozygous and homozygous variant sites correctly called heterozygous and homozygous variant, respectively. Specificity is the percentage of homozygous wild type sites called as homozygous wild type. For “All allele frequencies”, these values were averaged among all non-excluded samples. For the ≤5%, ≤2% and ≤0.5% minor allele frequency categories, these values were determined from the cumulative metrics of all non-excluded samples at sites with 9 or fewer, 4 or fewer, or 1 variant allele in the entire pool, respectively.
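The sensitivity and specificity definitions above can be written as a short function. This is a minimal sketch, not the authors' pipeline: it assumes genotypes are encoded as the number of variant alleles (0 = homozygous wild type, 1 = heterozygous, 2 = homozygous variant), and both the function name and the toy genotype pairs are hypothetical.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(pairs: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (sensitivity %, specificity %) from (array_gt, sequencing_gt) pairs.

    Genotype encoding (an assumption of this sketch): 0 = homozygous wild type,
    1 = heterozygous, 2 = homozygous variant. The array is treated as truth.
    """
    variant_total = variant_correct = 0
    wild_type_total = wild_type_correct = 0
    for array_gt, seq_gt in pairs:
        if array_gt == 0:
            # Specificity: hom. wild type sites called hom. wild type.
            wild_type_total += 1
            wild_type_correct += int(seq_gt == 0)
        else:
            # Sensitivity: het sites called het, hom. variant called hom. variant.
            variant_total += 1
            variant_correct += int(seq_gt == array_gt)
    sensitivity = 100 * variant_correct / variant_total if variant_total else float("nan")
    specificity = 100 * wild_type_correct / wild_type_total if wild_type_total else float("nan")
    return sensitivity, specificity


# Made-up genotype pairs for illustration only (not data from the study).
toy_pairs = [(0, 0), (0, 0), (0, 1), (0, 0), (1, 1), (2, 2), (1, 0), (2, 1)]
sens, spec = sensitivity_specificity(toy_pairs)
print(f"sensitivity={sens:.1f}%  specificity={spec:.1f}%")
```

Note that under this definition a heterozygous site called homozygous variant (or the reverse) counts as a miss, since each variant genotype must be called as its own class.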
Figure 3. Sensitivity as a function of total paired-end read counts at different coverage thresholds for all variants in the sets of 48 multiplexed samples. This graph shows how sensitivity increases with the number of reads for a given sample, with only a shallow increase at 5-fold coverage once a sample reaches 1.8 million reads. Sensitivity is defined as the percentage of heterozygous or homozygous variant genotypes as seen by the array that were called correctly as heterozygous or homozygous variant by sequencing. Red symbols: ≥3x coverage. Green symbols: ≥5x coverage. Blue symbols: ≥10x coverage. Brown symbols: ≥15x coverage. Black symbols: ≥20x coverage. Trend lines were generated using the “lowess” function in R with default parameters. The sensitivity appears to plateau at around 1.8 million reads; this count is taken before excluding duplicate reads and reads that align off-target. PE = paired-end.
Figure 4. Specificity as a function of total paired-end read counts at different coverage thresholds for the sets of 48 multiplexed samples. Coverage has little impact on specificity, which starts at ≥99.8% for most individuals with >1.25 million reads. Specificity is defined as the percentage of homozygous wild type genotypes as seen by the array that were called correctly as homozygous wild type by sequencing. Red symbols: ≥3x coverage. Green symbols: ≥5x coverage. Blue symbols: ≥10x coverage. Brown symbols: ≥15x coverage. Black symbols: ≥20x coverage. Trend lines were generated using the “lowess” function in R with default parameters. The dip in specificity at around 2.25 million reads is likely due to noise, as a single extra false positive can cause the observed downward shift. PE = paired-end.