Aaron G Day-Williams, Kirsten McLay, Eleanor Drury, Sarah Edkins, Alison J Coffey, Aarno Palotie, Eleftheria Zeggini.
Abstract
Pooled sequencing can be a cost-effective approach to disease variant discovery, but its applicability in association studies remains unclear. We compare sequence enrichment methods coupled to next-generation sequencing in non-indexed pools of 1, 2, 10, 20 and 50 individuals and assess their ability to discover variants and to estimate their allele frequencies. We find that pooled resequencing is most usefully applied as a variant discovery tool due to limitations in estimating allele frequency with high enough accuracy for association studies, and that in-solution hybrid-capture performs best among the enrichment methods examined regardless of pool size.
Year: 2011 PMID: 22069447 PMCID: PMC3206031 DOI: 10.1371/journal.pone.0026279
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Target sequence enrichment success before duplicate removal.
| Pool | Number of Lanes | Total Number of Reads | % Reads Mapped to Reference | % Reads Mapped to Target | % Reads Mapped to Target w/ Q20 |
| --- | --- | --- | --- | --- | --- |
| 1 PCR | 1 | 44,232,852 | 48.97 | 46.05 | 44.27 |
| 1 aPD | 1 | 61,487,334 | 95.80 | 21.82 | 21.58 |
| 1 sPD | 1 | 35,813,898 | 97.90 | 46.55 | 45.95 |
| 2 PCR | 1 | 30,843,770 | 97.92 | 85.97 | 79.61 |
| 2 aPD | 1 | 58,352,664 | 92.19 | 13.07 | 12.91 |
| 2 sPD | 1 | 29,554,192 | 97.50 | 46.96 | 46.36 |
| 10 PCR | 2 | 55,278,922 | 84.51 | 73.44 | 67.02 |
| 10 aPD | 2 | 90,319,688 | 96.44 | 18.62 | 18.15 |
| 10 sPD | 2 | 85,783,964 | 97.83 | 48.13 | 47.48 |
| 20 PCR | 3 | 121,378,560 | 89.33 | 80.88 | 75.37 |
| 20 aPD | 3 | 103,231,280 | 97.24 | 34.05 | 33.44 |
| 20 sPD | 3 | 111,444,476 | 97.11 | 45.91 | 45.31 |
| 50 PCR | 7 | 132,547,082 | 99.74 | 70.90 | 67.42 |
| 50 aPD | 7 | 251,257,124 | 96.02 | 22.62 | 22.27 |
| 50 sPD | 7 | 295,115,044 | 97.52 | 49.97 | 49.30 |
For each pool and sequence enrichment method, this table gives the total number of reads generated for the pool, the percentage of total reads mapped to the reference genome, the percentage of total reads mapped to the target regions, and the percentage of reads mapped to the target regions with a mapping quality of at least 20 (Q20). The total number of reads for a pool is calculated from the fastq file(s) generated for each lane of sequencing. The percentage of reads mapped to the reference is calculated from the BAM file generated by merging all the Maq map files for each lane of a pool. The percentage of reads mapped to the target regions is the number of reads with at least one base overlapping a target region divided by the total number of reads. The percentage of reads mapped to the target regions with Q20 is the number of reads with at least one base overlapping a target region and a mapping quality of at least 20, divided by the total number of reads.
Read counts: calculated by samtools view -c.
Q20 read counts: calculated by samtools view -c -q 20.
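The percentages described above can be reproduced from raw counts with a few lines of Python. This is a minimal sketch, not the authors' pipeline: the helper names, the file names and the use of the `-L` option (which restricts `samtools view` to alignments overlapping a BED file) are assumptions added for illustration; only `-c` and `-q 20` are named in the source.

```python
from typing import List, Optional


def count_command(bam: str, targets_bed: Optional[str] = None,
                  min_mapq: Optional[int] = None) -> List[str]:
    """Build a `samtools view -c` command line that counts alignments.

    Hypothetical helper: the BED restriction via -L is an assumption,
    the source only mentions `-c` and `-c -q 20`.
    """
    cmd = ["samtools", "view", "-c"]
    if min_mapq is not None:
        cmd += ["-q", str(min_mapq)]   # skip alignments with MAPQ below min_mapq
    if targets_bed is not None:
        cmd += ["-L", targets_bed]     # keep alignments overlapping the BED regions
    cmd.append(bam)
    return cmd


def percent_of_total(count: int, total_reads: int) -> float:
    """Express a read count as a percentage of all reads generated for the pool."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * count / total_reads


if __name__ == "__main__":
    # File names are placeholders, not files from the study.
    print(" ".join(count_command("pool.bam", "targets.bed", min_mapq=20)))
    # Total reads for the pool of 1 with PCR enrichment come from the table;
    # the on-target count is back-calculated from the reported 46.05%.
    print(round(percent_of_total(20_369_228, 44_232_852), 2))
```

Dividing by the total number of reads, rather than by the number of mapped reads, follows the definition given in the table legend.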
Figure 1. Target coverage per individual in pool before duplicate removal.
This shows a cumulative relative frequency plot of the percentage of target bases with X coverage depth normalized by the number of individuals sequenced for: (A) Pool of 2, (B) Pool of 10, (C) Pool of 20 and (D) Pool of 50 individuals. The x-axis is in increments of 10× coverage. The black squares/lines illustrate the data for PCR enrichment, the blue squares/lines illustrate the data for aHC enrichment and the orange squares/lines illustrate the data for sHC enrichment. The first square represents the percentage of target bases with 10× coverage per individual in the pool, and so on for each square in increments of 10×. This analysis assumes equal representation of each individual in the pool of DNA.
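The normalization behind this figure can be sketched as follows, assuming the per-base depth of every target base is already available as a list of integers. The function name and the reading of "X coverage" as "at least X" are assumptions; the equal-representation assumption is the one stated in the legend.

```python
from typing import Dict, Sequence


def cumulative_coverage(depths: Sequence[int], pool_size: int,
                        thresholds: Sequence[int] = (10, 20, 30, 40, 50)
                        ) -> Dict[int, float]:
    """Percentage of target bases with per-individual depth >= each threshold.

    Per-individual depth is the pooled depth divided by the number of
    individuals, i.e. every individual is assumed to contribute equally.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    if not depths:
        raise ValueError("depths must not be empty")
    per_individual = [d / pool_size for d in depths]
    n_bases = len(per_individual)
    return {
        t: 100.0 * sum(1 for d in per_individual if d >= t) / n_bases
        for t in thresholds
    }


if __name__ == "__main__":
    # Toy depths for four target bases in a pool of 10 (illustrative only).
    print(cumulative_coverage([100, 200, 400, 50], pool_size=10))
```

Each key of the returned dictionary corresponds to one square on the x-axis of the plot.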
HapMap variation detection sensitivity after duplicate removal.
| | Pool of 1 (1089) | Pool of 2 (1459) | Pool of 10 (1999) | Pool of 20 (2067) | Pool of 50 (2145) |
| --- | --- | --- | --- | --- | --- |
| PCR | 26.26 | 87.46 | 92.35 | 96.27 | 95.80 |
| aHC | 97.15 | 85.33 | 96.60 | 97.82 | 94.41 |
| sHC | 94.12 | 95.07 | 98.30 | 98.16 | 96.88 |
This table contains the percentage of the known HapMap variants with at least one non-reference allele in the pool that each pool and enrichment method discovered (true positives). The false negative rate is 100 minus this value.
Numbers in parentheses: number of non-reference HapMap variants in pool.
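The relation between the reported percentages, the variant counts in parentheses and the false negative rate can be made concrete with a short sketch. The function names are hypothetical, and the true positive count is only approximate because it is back-calculated from a rounded percentage.

```python
def sensitivity_percent(true_positives: int, known_variants: int) -> float:
    """Percentage of known non-reference variants that were discovered."""
    if known_variants <= 0:
        raise ValueError("known_variants must be positive")
    return 100.0 * true_positives / known_variants


def false_negative_percent(sensitivity: float) -> float:
    """False negative rate, defined in the legend as 100 minus sensitivity."""
    return 100.0 - sensitivity


def approx_true_positives(sensitivity: float, known_variants: int) -> int:
    """Back-calculate an approximate true positive count from a rounded percentage."""
    return round(sensitivity / 100.0 * known_variants)


if __name__ == "__main__":
    # sHC, pool of 50: 96.88% of 2145 non-reference HapMap variants.
    tp = approx_true_positives(96.88, 2145)
    sens = sensitivity_percent(tp, 2145)
    print(tp, round(sens, 2), round(false_negative_percent(sens), 2))
```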
HapMap variation detection specificity after duplicate removal.
| | Pool of 1 (1722) | Pool of 2 (1353) | Pool of 10 (683) | Pool of 20 (590) |
| --- | --- | --- | --- | --- |
| PCR | 99.88 | 98.97 | 97.66 | 96.95 |
| aHC | 98.84 | 98.67 | 97.22 | 96.61 |
| sHC | 99.07 | 98.74 | 97.22 | 96.95 |
This table contains the percentage of the known HapMap variants with no non-reference alleles and no missing genotypes in the pool that each pool and enrichment method correctly did not call as variants (true negatives). The false positive rate is 100 minus this value.
Numbers in parentheses: number of reference HapMap variants in pool.
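The specificity definition above, which counts only sites where every pooled individual is homozygous reference and no genotype is missing, can be sketched as set logic. The data layout (a dictionary from site identifier to per-individual non-reference allele counts, with `None` for a missing genotype) and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Sequence, Set

# Non-reference allele count per individual (0, 1 or 2); None marks a missing genotype.
Genotype = Optional[int]


def reference_only_sites(genotypes: Dict[str, Sequence[Genotype]]) -> Set[str]:
    """Sites where all individuals carry zero non-reference alleles and none is missing."""
    return {
        site for site, calls in genotypes.items()
        if all(g == 0 for g in calls)   # None == 0 is False, so missing data excludes the site
    }


def specificity_percent(genotypes: Dict[str, Sequence[Genotype]],
                        called_sites: Iterable[str]) -> float:
    """Percentage of reference-only sites that were not called as variants."""
    negatives = reference_only_sites(genotypes)
    if not negatives:
        raise ValueError("no reference-only sites to evaluate")
    false_positives = negatives & set(called_sites)
    return 100.0 * (len(negatives) - len(false_positives)) / len(negatives)


if __name__ == "__main__":
    # Toy data for a pool of 2; site identifiers are made up.
    toy = {
        "site1": [0, 0], "site2": [0, 1], "site3": [0, None],
        "site4": [0, 0], "site5": [0, 0], "site6": [0, 0],
    }
    print(specificity_percent(toy, called_sites={"site2", "site4"}))
```

In the toy data, `site2` is a true variant and `site3` has a missing genotype, so neither counts toward the denominator.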
1KG support for HapMap false positive loci after duplicate removal.
| | Pool of 1 | Pool of 2 | Pool of 10 | Pool of 20 |
| --- | --- | --- | --- | --- |
| PCR | 2(50%) | 14(100%) | 15(93.33%) | 14(92.86%) |
| aHC | 19(94.74%) | 17(94.12%) | 16(100%) | 15(100%) |
| sHC | 16(100%) | 16(100%) | 16(100%) | 16(100%) |
This table contains the number of loci considered false positives based on HapMap data that are also present in 1KG, and the percentage of these overlapping loci for which the 1KG data support the presence of non-reference alleles in the pool.
Figure 2. Accuracy of non-reference allele frequency estimation at HapMap/58C intersection variants for the Pool of 50 after duplicate removal.
An analysis of the correlation between the non-reference allele frequency estimates from the sequencing-based variant caller and the allele frequency calculated from the reference genotypes. The analysis includes the true positive variants called by the sequencing-based variant caller for which there were missing genotypes in the reference genotypes. The correlation coefficient is Pearson's correlation coefficient. The figure shows the analysis for: (A) PCR enrichment, (B) aHC enrichment and (C) sHC enrichment.
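The comparison in this figure can be illustrated with a small, self-contained sketch. Estimating the pooled allele frequency as the fraction of non-reference reads is an assumption made here for illustration; the source does not describe how the variant caller derives its estimate. All function names are hypothetical.

```python
import math
from typing import Sequence


def genotype_allele_frequency(non_ref_counts: Sequence[int]) -> float:
    """Non-reference allele frequency from diploid genotypes (0, 1 or 2 per individual)."""
    if not non_ref_counts:
        raise ValueError("at least one genotype is required")
    return sum(non_ref_counts) / (2 * len(non_ref_counts))


def read_allele_frequency(non_ref_reads: int, total_reads: int) -> float:
    """Assumed estimator: fraction of reads at the site carrying the non-reference allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return non_ref_reads / total_reads


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy values for three variants (illustrative, not data from the study).
    expected = [genotype_allele_frequency(g) for g in ([0, 1, 2, 1], [0, 0, 1, 0], [2, 2, 1, 2])]
    estimated = [read_allele_frequency(a, t) for a, t in ((48, 100), (15, 100), (90, 100))]
    print(pearson_r(estimated, expected))
```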
Figure 3. Accuracy of non-reference allele frequency estimation at HapMap/58C intersection variants for the Pool of 50 before duplicate removal.
An analysis of the correlation between the non-reference allele frequency estimates from the sequencing-based variant caller and the allele frequency calculated from the reference genotypes. The analysis includes the true positive variants called by the sequencing-based variant caller for which there were missing genotypes in the reference genotypes. The correlation coefficient is Pearson's correlation coefficient. The figure shows the analysis for: (A) PCR, (B) aHC and (C) sHC enrichment.