Abstract
MOTIVATION: Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing.
Year: 2010 PMID: 20529923 PMCID: PMC2881398 DOI: 10.1093/bioinformatics/btq214
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Illustration of how comparison of allele counts across multiple DNA pools can be used to distinguish rare variants from sequencing errors. (a) Four sequenced pools are represented as boxes, with each base call shown as a circle. All five of the alternate base calls are present in a single pool. The P-value of the contingency table corresponding to the four pools is 0.002, suggesting that the five base calls represent a rare SNP rather than sequencing errors. (b) Five of the nine alternate base calls are present in a single pool. The P-value of the corresponding contingency table is 0.24, indicating that the presence of five alternate base calls in a single pool is likely due to sequencing errors alone.
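The idea in Fig. 1 can be sketched with an exact test on a 2 x k table of alternate and reference base calls, keeping the row and column totals fixed. This is a minimal illustration, not necessarily the computation CRISP performs: the function names are hypothetical, and the per-pool read depth of 25 used in the example is an assumption, since the caption does not give the number of base calls per pool.

```python
from math import comb


def table_probability(alt_counts, pool_depths):
    """Probability of one 2 x k table under the multivariate hypergeometric
    distribution, i.e. with all row and column totals held fixed."""
    numerator = 1
    for alt, depth in zip(alt_counts, pool_depths):
        numerator *= comb(depth, alt)
    return numerator / comb(sum(pool_depths), sum(alt_counts))


def enumerate_tables(total_alt, pool_depths):
    """Yield every way to distribute total_alt alternate calls over the pools
    without exceeding any pool's depth."""
    if len(pool_depths) == 1:
        if total_alt <= pool_depths[0]:
            yield (total_alt,)
        return
    for alt in range(min(total_alt, pool_depths[0]) + 1):
        for rest in enumerate_tables(total_alt - alt, pool_depths[1:]):
            yield (alt,) + rest


def contingency_p_value(alt_counts, pool_depths):
    """Sum the probabilities of all tables that are no more likely than the
    observed one (small tolerance guards against float rounding)."""
    observed = table_probability(alt_counts, pool_depths)
    p_value = 0.0
    for table in enumerate_tables(sum(alt_counts), pool_depths):
        prob = table_probability(table, pool_depths)
        if prob <= observed * (1 + 1e-9):
            p_value += prob
    return min(p_value, 1.0)


# Hypothetical depths: four pools with 25 base calls each.
depths = (25, 25, 25, 25)
print(contingency_p_value((5, 0, 0, 0), depths))  # all alternate calls in one pool
print(contingency_p_value((2, 1, 1, 1), depths))  # alternate calls spread evenly
```

Alternate calls concentrated in one pool give a small P-value, while calls spread across pools give a large one. Full enumeration is only practical for small alternate-allele counts; for deeper coverage a permutation or asymptotic test would be the usual substitute.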
Fig. 2. Description of the algorithm CRISP for detection of SNPs using sequencing data from k DNA pools.
Fig. 3. Empirical distribution of the sequence coverage per haplotype (one pool) in the two pooled sequencing datasets: (a) 50 individuals in two pools and (b) 48 individuals in six pools.
Fig. 4. (a) Comparison of SNPs identified from the second pooled sequencing dataset using two independent statistics: contingency table P-value and quality values-based P-value. Only SNPs that were also identified from the individual sequencing of the 48 samples are shown. (b) Precision–recall curve for SNPs identified by CRISP from the second pooled dataset using different thresholds for the two P-values: contingency table P-value and the quality values-based P-value. The P-value thresholds (log base 10) are shown for each point on the curve.
Comparison of the number of false positive and false negative SNP calls using CRISP, SNPSeeker, VarScan and MAQ (pooled) for the two datasets
| | 50 samples in two pools | | | 48 samples in six pools | | |
|---|---|---|---|---|---|---|
| Method | No. of SNPs | False positives | False negatives | No. of SNPs | False positives | False negatives |
| CRISP | 665 | 38 (5.6%) | 190/817 | 541 | 16 (3%) | 162/687 |
| SNPSeeker | 739 | 307 (41%) | 385/817 | 508 | 199 (39%) | 378/687 |
| VarScan | 1849 | 1244 (67%) | 212/817 | 715 | 234 (33%) | 206/687 |
| MAQ (pooled) | 367 | 279 (76%) | 729/817 | 948 | 681 (71%) | 420/687 |
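The counts in the comparison table can be turned into precision and recall with a few lines of arithmetic. The sketch below assumes that the denominators 817 and 687 are the sizes of the reference SNP sets for the two datasets; the function and variable names are illustrative only.

```python
# Size of the reference SNP set per dataset (assumed from the "x/817" and
# "x/687" false negative entries in the table).
REFERENCE_SNPS = {"two_pools": 817, "six_pools": 687}

# (number of SNPs called, false positives, false negatives), copied from the table.
CALLS = {
    "CRISP": {"two_pools": (665, 38, 190), "six_pools": (541, 16, 162)},
    "SNPSeeker": {"two_pools": (739, 307, 385), "six_pools": (508, 199, 378)},
    "VarScan": {"two_pools": (1849, 1244, 212), "six_pools": (715, 234, 206)},
    "MAQ (pooled)": {"two_pools": (367, 279, 729), "six_pools": (948, 681, 420)},
}


def precision_recall(n_called, false_pos, false_neg, n_reference):
    """Return (precision, recall) and check that the two ways of counting
    true positives agree."""
    true_pos = n_called - false_pos
    if true_pos != n_reference - false_neg:
        raise ValueError("false positive and false negative counts disagree")
    return true_pos / n_called, true_pos / n_reference


for method, datasets in CALLS.items():
    for dataset, (n_called, fp, fn) in datasets.items():
        precision, recall = precision_recall(n_called, fp, fn, REFERENCE_SNPS[dataset])
        print(f"{method:13s} {dataset:9s} precision={precision:.3f} recall={recall:.3f}")
```

For every row, the number of called SNPs minus the false positives equals the reference set size minus the false negatives, so the table is internally consistent under this reading.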
| | Col 1 | Col 2 | … | Col k | Total |
|---|---|---|---|---|---|
| Row 1 | | | … | | |
| Row 2 | | | … | | |
| Total | | | … | | |