Vikas Bansal, Ryan Tewhey, Emily M Leproust, Nicholas J Schork.
Abstract
High-throughput sequencing of targeted genomic loci in large populations is an effective approach for evaluating the contribution of rare variants to disease risk. We evaluated the feasibility of using in-solution hybridization-based target capture on pooled DNA samples to enable cost-efficient population sequencing studies. For this, we performed pooled sequencing of 100 HapMap samples across ~600 kb of DNA sequence using the Illumina GAIIx. Using our accurate variant calling method for pooled sequence data, we were able to not only identify single nucleotide variants with a low false discovery rate (<1%) but also accurately detect short insertion/deletion variants. In addition, with sufficient coverage per individual in each pool (30-fold) we detected 97.2% of the total variants and 93.6% of variants below 5% in frequency. Finally, allele frequencies for single nucleotide variants (SNVs) estimated from the pooled data and the HapMap genotype data were tightly correlated (correlation coefficient ≥ 0.995).
Year: 2011 PMID: 21479135 PMCID: PMC3068187 DOI: 10.1371/journal.pone.0018353
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Capture Efficiency.
| Pool | Total Sequence | On-Target Sequence | Mean Coverage | % of Bases Between 1/5 and 5x Mean Coverage |
| CEU | 722 Mb | 401.8 Mb (55.7%) | 676 | 95.7% |
| CHB | 709.6 Mb | 388.7 Mb (54.8%) | 654 | 94% |
| TSI | 694.7 Mb | 376.8 Mb (54.2%) | 634 | 94.1% |
| YRI 1 | 606.4 Mb | 327.8 Mb (54.1%) | 552 | 92.6% |
| YRI 2 | 744.6 Mb | 415.2 Mb (55.8%) | 699 | 94.6% |
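The on-target percentages in the table follow directly from the two sequence totals, and the pool-level mean coverage can be converted to an approximate per-individual depth. The following is a minimal sketch of that arithmetic, not the authors' pipeline. The function names are hypothetical, and the per-individual figure assumes that all 20 individuals contribute equally to each pool.

```python
# Minimal sketch: capture-efficiency arithmetic for the pools in the table.
# Function names are hypothetical; values are copied from the table.

POOL_SIZE = 20  # individuals per pool (100 HapMap samples across 5 pools)


def on_target_percent(total_mb: float, on_target_mb: float) -> float:
    """Percentage of sequenced bases that map to the capture targets."""
    return 100.0 * on_target_mb / total_mb


def per_individual_coverage(pool_mean_coverage: float,
                            pool_size: int = POOL_SIZE) -> float:
    """Approximate depth per individual.

    Assumption: every individual contributes an equal share of DNA to the pool.
    """
    return pool_mean_coverage / pool_size


# (total sequence in Mb, on-target sequence in Mb, pool mean coverage)
pools = {
    "CEU": (722.0, 401.8, 676),
    "CHB": (709.6, 388.7, 654),
    "TSI": (694.7, 376.8, 634),
    "YRI 1": (606.4, 327.8, 552),
    "YRI 2": (744.6, 415.2, 699),
}

for name, (total_mb, on_mb, coverage) in pools.items():
    pct = on_target_percent(total_mb, on_mb)
    depth = per_individual_coverage(coverage)
    print(f"{name}: {pct:.1f}% on target, ~{depth:.1f}x per individual")
```

Under the equal-contribution assumption, every pool averages close to or above the 30-fold per-individual depth that the abstract describes as sufficient.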
Variant Detection Statistics for Pooled Sequencing.
| Pool | Total Variants (HM & 1KG) at Targets | Detected Variants in Pool | Coverage ≥ 10: Detected Variants in Pool | Coverage ≥ 10: False Positives, Absent from dbSNP 130 | Coverage ≥ 10: False Positives, Included in dbSNP 130 | Coverage ≥ 10: False Negatives | Coverage ≥ 10: False Negatives (%) | Coverage ≥ 30: Detected Variants | Coverage ≥ 30: False Positives, Absent from dbSNP 130 | Coverage ≥ 30: False Positives, Included in dbSNP 130 | Coverage ≥ 30: False Negatives | Coverage ≥ 30: False Negatives (%) |
| CEU | 826 (283) | 744 (213) | 702 (213) | 22 (15) | 11 (6) | 47 (45) | 6.4% (19.2%) | 479 (158) | 16 (12) | 10 (6) | 17 (17) | 3.4% (9.7%) |
| CHB | 731 (268) | 645 (194) | 599 (191) | 34 (29) | 9 (7) | 56 (56) | 8.7% (24.7%) | 410 (150) | 25 (22) | 8 (6) | 22 (22) | 5% (12.4%) |
| TSI | 850 (377) | 748 (287) | 700 (280) | 37 (33) | 26 (15) | 54 (53) | 7.1% (16.2%) | 460 (198) | 29 (26) | 24 (13) | 10 (10) | 1.9% (4.2%) |
| YRI 1 | 1194 (419) | 1084 (338) | 995 (324) | 56 (50) | 13 (6) | 54 (50) | 5.1% (13.2%) | 551 (192) | 31 (29) | 11 (5) | 17 (16) | 2.9% (7.1%) |
| YRI 2 | 1216 (573) | 1106 (482) | 1041 (470) | 33 (27) | 15 (8) | 65 (62) | 6% (12.3%) | 712 (332) | 23 (19) | 15 (8) | 16 (16) | 2.1% (4.5%) |
| Sum | 4817 (1920) | 4327 (1514) | 4037 (1478) | 182 (154) | 74 (42) | 276 (266) | 6% (13.7%) | 2612 (1030) | 124 (108) | 68 (38) | 82 (81) | 2.8% (6.4%) |
The number in parentheses represents only variants at 5% or lower frequency in the dataset.
1- Statistics for variant sites that were sequenced to a depth of 10- or 30-fold per individual in the pooled dataset.
2- Variants called in the pooled dataset that are not present in either HapMap or the 1000 Genomes Project. Variants were further classified as being included in or absent from dbSNP v130.
3- Variants called in HapMap or the 1000 Genomes Project that were not called in our pooled dataset.
Figure 1. Comparison of pooled allele frequency estimates with actual allele frequencies.
Scatter plots for each of the 5 pools (CEU, TSI, CHB, YRI 1, YRI 2) of the estimated allele frequency, calculated from read counts in the sequence data, plotted against the actual allele frequency from either the HapMap or the 1000 Genomes Project. Only sites with genotype information for all 20 individuals in that particular pool are included. The inset displays the area of the graph representing 1–3 copies of the alternate allele as a jitter plot. In both graphs, the points are shaded to represent overall read coverage in our sequencing data at that site.
Figure 2. Error in the pooled allele frequency estimate for each variant.
Histogram of the estimated error in measurement of allele frequency from the pooled sequencing data. For each variant, the absolute difference between the pooled allele frequency estimate and the actual allele frequency derived from the 1000 Genomes or HapMap data was computed.
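The quantities plotted in Figures 1 and 2 can be illustrated with a short sketch. It assumes the simplest read-count estimator (alternate-allele reads divided by total reads at the site) and diploid samples, so a pool of 20 individuals carries 40 chromosomes. The function names and the example site are hypothetical, and the authors' variant caller may compute its estimates differently.

```python
# Minimal sketch of the per-site quantities behind Figures 1 and 2.
# Assumptions: naive read-count estimator, diploid samples, 20 per pool.


def pooled_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Estimate the alternate allele frequency from pooled read counts."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def actual_allele_frequency(alt_allele_copies: int,
                            n_individuals: int = 20,
                            ploidy: int = 2) -> float:
    """Allele frequency implied by known genotypes of the pooled individuals."""
    return alt_allele_copies / (n_individuals * ploidy)


def absolute_error(estimated: float, actual: float) -> float:
    """Absolute difference between the pooled estimate and the actual frequency."""
    return abs(estimated - actual)


# Hypothetical site: 18 alternate reads out of 600, one alternate copy in the pool.
estimate = pooled_allele_frequency(alt_reads=18, total_reads=600)
actual = actual_allele_frequency(alt_allele_copies=1)
print(f"estimate={estimate:.3f} actual={actual:.3f} "
      f"error={absolute_error(estimate, actual):.3f}")
```

With 40 chromosomes per pool, the 1–3 alternate-allele copies shown in the Figure 1 inset correspond to actual frequencies of 0.025, 0.05, and 0.075.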
Cost Estimates for Pooled Sequencing Projects.
| Number of Samples | Library Prep Cost ($1000), Single Plex | Library Prep Cost ($1000), Pooled | 750 Kb Capture: Sequencing Cost ($1000) | 750 Kb Capture: Total Project Cost ($1000), Single Plex | 750 Kb Capture: Total Project Cost ($1000), Pooled | 750 Kb Capture: Cost Difference (fold) | 3 Mb Capture: Sequencing Cost ($1000) | 3 Mb Capture: Total Project Cost ($1000), Single Plex | 3 Mb Capture: Total Project Cost ($1000), Pooled | 3 Mb Capture: Cost Difference (fold) |
| 400 | 110 | 5.5 | 2.7 | 112.7 | 8.2 | | 10.7 | 120.7 | 16.2 | |
| 4000 | 1100 | 55 | 26.8 | 1126.8 | 81.8 | 13.8 | 107 | 1207 | 162 | 7.5 |
| 10000 | 2750 | 137.5 | 66.9 | 2816.9 | 204.4 | | 267.6 | 3017.6 | 405.1 | |
Cost estimates are based on a 100 bp HiSeq paired-end run with Illumina's published reagent costs ($11,150 per flowcell) and an average throughput of 200 Gb per run. Sample preparation includes $75 per sample for library prep and $200 per sample for solution-based capture. All calculations for pooled sequencing assume 20 individuals per pool.
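The library preparation and total project figures in the table can be reproduced from the per-sample costs above. The sketch below is illustrative rather than the authors' calculation. It assumes that a pooled design needs one library and one capture per pool of 20, which matches the tabulated pooled prep costs, and it takes the sequencing cost as an input copied from the table rather than deriving it from flowcell throughput. The function name is hypothetical.

```python
# Minimal sketch: reproduce the prep and total costs in the cost table.
# Assumption: one library prep plus one capture per pool in the pooled design.
import math

PREP_COST_PER_LIBRARY = 75 + 200  # dollars: library prep + solution-based capture
POOL_SIZE = 20                    # individuals per pool


def project_costs(n_samples: int, sequencing_cost_k: float,
                  pool_size: int = POOL_SIZE):
    """Return (single-plex total, pooled total, fold difference).

    Totals are in thousands of dollars. sequencing_cost_k is the sequencing
    cost in thousands of dollars, taken from the table.
    """
    single_prep_k = n_samples * PREP_COST_PER_LIBRARY / 1000
    n_pools = math.ceil(n_samples / pool_size)
    pooled_prep_k = n_pools * PREP_COST_PER_LIBRARY / 1000
    single_total_k = single_prep_k + sequencing_cost_k
    pooled_total_k = pooled_prep_k + sequencing_cost_k
    return single_total_k, pooled_total_k, single_total_k / pooled_total_k


# 4000 samples: 750 Kb capture ($26.8k sequencing) and 3 Mb capture ($107k).
for label, seq_cost_k in (("750 Kb", 26.8), ("3 Mb", 107.0)):
    single_k, pooled_k, fold = project_costs(4000, seq_cost_k)
    print(f"{label}: single plex ${single_k:.1f}k, "
          f"pooled ${pooled_k:.1f}k, {fold:.1f}-fold difference")
```

In this model the sequencing cost is the same for both designs, so the saving comes entirely from preparing one library per pool instead of one per sample. That saving shrinks as the target region, and with it the sequencing share of the budget, grows.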