Emanuele Raineri, Luca Ferretti, Anna Esteve-Codina, Bruno Nevado, Simon Heath, Miguel Pérez-Enciso.
Abstract
BACKGROUND: Performing high-throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may matter much more than in individual SNP calling: while their impact in individual sequencing can be reduced by requiring a minimum number of reads per allele, such a threshold would have a strong and undesired effect in pools, because alleles at low frequency in the pool are unlikely to be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g., cancer tissues), but this is not true in pools: under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f, and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool; conversely, a true singleton can produce more than one read or, more likely, none.
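The third point can be made concrete with a small calculation. The following is only an illustrative sketch, not the authors' method: it assumes that every read is drawn independently and uniformly from the pooled chromosomes and that there are no sequencing errors, so the number of reads carrying a true singleton is binomial with success probability 1/n. The function name `singleton_read_distribution` is hypothetical.

```python
def singleton_read_distribution(n_chromosomes, depth):
    """Return (P(0 reads), P(exactly 1 read), P(more than 1 read)) for a
    true singleton, i.e. an allele carried by 1 of n_chromosomes.

    Assumptions (ours, for illustration): reads are sampled independently
    and uniformly from the pool, and sequencing is error-free.
    """
    p = 1.0 / n_chromosomes                     # chance a read hits the singleton
    p_none = (1.0 - p) ** depth                 # binomial mass at 0
    p_one = depth * p * (1.0 - p) ** (depth - 1)  # binomial mass at 1
    p_more = 1.0 - p_none - p_one               # everything above 1
    return p_none, p_one, p_more


if __name__ == "__main__":
    # Pool sizes and depth chosen to mirror the settings named in the figures.
    for n in (20, 50, 100):
        p_none, p_one, p_more = singleton_read_distribution(n, depth=20)
        print(f"n={n:3d}  P(0)={p_none:.3f}  P(1)={p_one:.3f}  P(>1)={p_more:.3f}")
```

Under these assumptions, the probability of seeing no read at all from a singleton grows with the pool size at fixed depth, which is why a minimum-read filter is costly in pools.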
Year: 2012 PMID: 22992255 PMCID: PMC3475117 DOI: 10.1186/1471-2105-13-239
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Priors for f = 0, f = 1, and 0 < f < 1
| unfolded | ||
| folded |
We discretize the interval [0,1] with N = 100 breakpoints. The numbers 99, 100 and 200 appearing in the formulæ in the table are normalization factors (respectively N − 1, N and 2N). β is a normalization constant for the divergent function 1/f, where γ = 0.57721… is the Euler-Mascheroni constant.
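To see why the Euler-Mascheroni constant enters the normalization of a 1/f prior on a grid, consider the following sketch. It does not reproduce the expression for β from the table, which is not recoverable here. It only assumes a prior proportional to 1/f on the interior grid points f = k/N for k = 1, …, N − 1, whose normalizing sum is the harmonic number H(N − 1) ≈ ln(N − 1) + γ + 1/(2(N − 1)). All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic_exact(m):
    """Exact harmonic number H(m) = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / k for k in range(1, m + 1))


def harmonic_approx(m):
    """Standard asymptotic approximation of H(m) involving gamma."""
    return math.log(m) + EULER_GAMMA + 1.0 / (2 * m)


def one_over_f_prior(n_breakpoints):
    """Discrete prior proportional to 1/f on f = k/N, k = 1..N-1.

    Illustrative assumption: only interior frequencies are covered; any
    probability mass at f = 0 or f = 1 is ignored in this sketch.
    """
    m = n_breakpoints - 1
    norm = harmonic_exact(m)  # since 1/f = N/k, the factor N cancels
    return {k / n_breakpoints: (1.0 / k) / norm for k in range(1, m + 1)}


if __name__ == "__main__":
    N = 100
    print("exact  H(N-1):", harmonic_exact(N - 1))
    print("approx H(N-1):", harmonic_approx(N - 1))
    prior = one_over_f_prior(N)
    print("prior mass at lowest frequency:", prior[1 / N])
```

The lowest grid frequency receives the largest prior mass, consistent with the statement that low-frequency variants are the most common class under the neutral model.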
Figure 1. Average power (left) and false discovery rate (right) for each of the methods considered. Results are shown when using N = 20, 50 and 100 chromosomes (from top to bottom) for average sequencing depth 20X. Bottom row shows the result of unequal contribution of individuals to a pool of 50 chromosomes. With uneven contribution, half of the individuals were sampled 50% more times and the remaining half were sampled 50% fewer times. Average of 100 replicates.
Figure 2. Power and false discovery rate (FDR) according to actual depth and minimum allele frequency when using N = 20, 50 and 100 chromosomes (from top to bottom) obtained with different methods (legend on upper-right panel). Average depth was 20X. Left column panels show power as a function of actual depth, middle column is the false discovery rate as a function of actual depth, and right column, power as a function of true minor allele frequency (MAF). Average of 100 replicates.
Figure 3. Effect of pooling and SNP calling on estimated site frequency spectrum (SFS). The thick black line depicts the true SFS after pooling, but before sequencing, i.e., as if power was 1 and no false discoveries. Boxplots show the estimated SFS after sequencing and SNP calling for reads with exact depth 20. The two best methods are compared: samtools mpileup and snape with informative prior. Results for 100 replicates and N = 100 chromosomes.