Quan Long, Daniel C Jeffares, Qingrun Zhang, Kai Ye, Viktoria Nizhynska, Zemin Ning, Chris Tyler-Smith, Magnus Nordborg.
Abstract
With the advance of next-generation sequencing (NGS) technologies, increasingly ambitious applications are becoming feasible. A particularly powerful one is the sequencing of polymorphic, pooled samples. The pool can be naturally occurring, as in the case of multiple pathogen strains in a blood sample, multiple types of cells in a cancerous tissue sample, or multiple isoforms of mRNA in a cell. In these cases, it is difficult or impossible to partition the subtypes experimentally before sequencing, so the subtype frequencies must be inferred. In addition, investigators may occasionally want to artificially pool samples from a large number of individuals for reasons of cost-efficiency, e.g., when carrying out genetic mapping using bulked segregant analysis. Here we describe PoolHap, a computational tool for inferring haplotype frequencies from pooled samples when the haplotypes are known. The key insight into why PoolHap works is that the large number of SNPs that come with genome-wide coverage can compensate for the uneven coverage across the genome. The performance of PoolHap is illustrated and discussed using simulated and real data. We show that PoolHap is able to accurately estimate the proportions of haplotypes with less than 2% error for 34-strain mixtures with 2X total coverage of Arabidopsis thaliana whole-genome polymorphism data. This method should facilitate greater biological insight into heterogeneous samples that are difficult or impossible to isolate experimentally. Software and user's manual are freely available at http://arabidopsis.gmi.oeaw.ac.at/quan/poolhap/.
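The abstract does not spell out PoolHap's estimation procedure, so the following is only a minimal sketch of the general problem it addresses: recovering strain proportions from pooled reads when each strain's alleles at every SNP are known. It substitutes a plain expectation-maximization (EM) estimator for mixture proportions, which is not claimed to be PoolHap's algorithm. The function names, the error-free read model, the fixed per-SNP coverage, and the toy pool of three strains are all assumptions made for illustration.

```python
import random

def simulate_pool(haplotypes, freqs, coverage, rng):
    """Simulate error-free pooled reads (an assumed toy model).

    haplotypes[j][k] is the 0/1 allele of strain k at SNP j.
    Returns one (ref_count, alt_count) pair per SNP.
    """
    strains = list(range(len(freqs)))
    counts = []
    for alleles in haplotypes:
        alt = 0
        for _ in range(coverage):
            k = rng.choices(strains, weights=freqs)[0]  # strain the read came from
            alt += alleles[k]
        counts.append((coverage - alt, alt))
    return counts

def em_frequencies(haplotypes, counts, n_iter=200):
    """Estimate strain frequencies with EM for mixture proportions."""
    n = len(haplotypes[0])
    f = [1.0 / n] * n  # start from a uniform guess
    for _ in range(n_iter):
        expected = [0.0] * n
        for alleles, (ref, alt) in zip(haplotypes, counts):
            for allele, n_reads in ((0, ref), (1, alt)):
                if n_reads == 0:
                    continue
                # Probability mass of strains compatible with this allele.
                denom = sum(f[k] for k in range(n) if alleles[k] == allele)
                if denom == 0.0:
                    continue
                # E-step: share the reads among compatible strains.
                for k in range(n):
                    if alleles[k] == allele:
                        expected[k] += n_reads * f[k] / denom
        # M-step: renormalize expected read counts into frequencies.
        total = sum(expected)
        f = [e / total for e in expected]
    return f

rng = random.Random(0)
true_freqs = [0.6, 0.3, 0.1]  # hypothetical pool composition
haplotypes = [[rng.randint(0, 1) for _ in true_freqs] for _ in range(2000)]
counts = simulate_pool(haplotypes, true_freqs, coverage=10, rng=rng)
estimated = em_frequencies(haplotypes, counts)
print([round(x, 3) for x in estimated])
```

No single SNP identifies a strain in this sketch, yet the estimate is constrained by all 2,000 SNPs jointly, which is in the spirit of the abstract's point that many SNPs can compensate for limited or uneven coverage.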
Year: 2011 PMID: 21264334 PMCID: PMC3016441 DOI: 10.1371/journal.pone.0015292
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Error estimation of PoolHap applied to simulated whole-genome sequencing based on real SNPs of A. thaliana.
The y-axis is the average % difference (real – predicted, absolute value); the x-axis is the number of selected SNPs. Different curves in the same panel correspond to different total coverages of the pool. Panel (a) is the pool of 34 strains with coverage standard deviation SD = 0, and (b) is the pool of 34 strains with coverage standard deviation SD = 0.5. Panel (c) is the pool of 6 strains with coverage standard deviation SD = 0, and (d) is the pool of 6 strains with coverage standard deviation SD = 0.5.
Figure 2. Error estimation of PoolHap applied to simulated RNA-Seq data based on real gene models of A. thaliana.
The x-axis is the total coverage of the pool; the y-axis is the average % difference (real – predicted, absolute value). Different curves correspond to genes with different numbers of isoforms, ranging from 2 to 6.
Inferred frequencies with pooled Illumina reads of six A. thaliana strains and the corresponding results from simulations.
| Strain ID | Real frequency | Inferred frequency | SD of inferred frequency | Inferred frequency in simulation (SD = 0.5) |
|---|---|---|---|---|
| Lom1_1 | 1.80% | 1.90% | +/−0.41% | 1.50% |
| Ull2_5 | 4.60% | 4.70% | +/−0.41% | 4.80% |
| Kavlinge_1 | 14.70% | 13.20% | +/−0.42% | 14.50% |
| Sr_5 | 18.60% | 20.80% | +/−0.41% | 18.30% |
| Vastervik | 21.50% | 24.30% | +/−0.40% | 21.60% |
| Sanna_2 | 38.80% | 35.40% | +/−0.40% | 38.90% |
Coverage = 20x. Selected SNP number = 10,000.
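As a worked check on the six-strain table, the sketch below computes the error measure used in the figure captions, the average absolute difference between real and predicted frequencies. It assumes that measure is expressed in percentage points. The values are copied from the table; the variable and function names are illustrative.

```python
# Frequencies (in %) copied from the six-strain table.
real = {"Lom1_1": 1.8, "Ull2_5": 4.6, "Kavlinge_1": 14.7,
        "Sr_5": 18.6, "Vastervik": 21.5, "Sanna_2": 38.8}
inferred = {"Lom1_1": 1.9, "Ull2_5": 4.7, "Kavlinge_1": 13.2,
            "Sr_5": 20.8, "Vastervik": 24.3, "Sanna_2": 35.4}
simulated = {"Lom1_1": 1.5, "Ull2_5": 4.8, "Kavlinge_1": 14.5,
             "Sr_5": 18.3, "Vastervik": 21.6, "Sanna_2": 38.9}

def mean_abs_diff(truth, estimate):
    """Average absolute difference, in percentage points."""
    return sum(abs(truth[s] - estimate[s]) for s in truth) / len(truth)

measured_error = mean_abs_diff(real, inferred)
simulated_error = mean_abs_diff(real, simulated)
print(f"real reads: {measured_error:.2f}")   # real reads: 1.68
print(f"simulation: {simulated_error:.2f}")  # simulation: 0.20
```

On this measure the estimates from real Illumina reads deviate by about 1.7 percentage points on average, against about 0.2 for the matching simulation.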
Inferred frequencies with pooled Illumina reads of sixteen A. thaliana strains and the corresponding results from simulations.
| Strain ID | Real frequency | Inferred frequency | SD of inferred frequency | Inferred frequency in simulation (SD = 0.5) |
|---|---|---|---|---|
| Nyl_2 | 0.50% | 1.30% | +/−0.33% | 0.70% |
| Lis_2 | 1.30% | 2.70% | +/−0.29% | 1.30% |
| Fab_4 | 3.20% | 3.90% | +/−0.31% | 3.40% |
| Omo2_1 | 4.10% | 3.50% | +/−0.30% | 4.10% |
| Kni_1 | 4.20% | 3.90% | +/−0.32% | 4.10% |
| Eden_1 | 4.80% | 4.30% | +/−0.32% | 4.60% |
| Eden_2 | 4.80% | 4.50% | +/−0.31% | 4.70% |
| Eds_1 | 5.20% | 4.30% | +/−0.32% | 5.00% |
| Rev_1 | 6.00% | 6.70% | +/−0.31% | 6.00% |
| Or_1 | 7.90% | 8.00% | +/−0.29% | 8.10% |
| Spr1_2 | 8.70% | 8.90% | +/−0.30% | 8.80% |
| Bil_7 | 9.20% | 8.20% | +/−0.30% | 9.10% |
| Lov_5 | 9.50% | 10.20% | +/−0.31% | 9.50% |
| Tottarp_2 | 9.50% | 8.80% | +/−0.30% | 9.60% |
| Dra3_1 | 10.10% | 9.90% | +/−0.30% | 10.00% |
| San_2 | 10.80% | 11.00% | +/−0.31% | 10.70% |
Coverage = 20x. Selected SNP number = 10,000.