Santosh Anand1,2, Eleonora Mangano1, Nadia Barizzone3,4, Roberta Bordoni1, Melissa Sorosina5, Ferdinando Clarelli5, Lucia Corrado3,4, Filippo Martinelli Boneschi5,6, Sandra D'Alfonso3,4, Gianluca De Bellis1.
Abstract
Sequencing a large number of individuals, which is often needed for population genetics studies, is still economically challenging despite the falling costs of Next Generation Sequencing (NGS). Pool-seq is an alternative cost- and time-effective option in which DNA from several individuals is pooled for sequencing. However, pooling of DNA creates new problems and challenges for accurate variant calling and allele frequency (AF) estimation. In particular, sequencing errors are confounded with alleles present at low frequency in the pools, possibly giving rise to false positive variants. We sequenced 996 individuals in 83 pools (12 individuals/pool) in a targeted re-sequencing experiment. We show that Pool-seq AFs are robust and reliable by comparing them with public variant databases and with in-house SNP-genotyping data of the individual subjects in the pools. Furthermore, we propose a simple filtering guideline for the removal of spurious variants based on the Kolmogorov-Smirnov statistical test. We experimentally validated our filters by comparing Pool-seq to individual sequencing data, showing that the filters remove most of the false variants while retaining the majority of true variants. The proposed guideline is fairly generic in nature and could be easily applied in other Pool-seq experiments.
Year: 2016 PMID: 27670852 PMCID: PMC5037392 DOI: 10.1038/srep33735
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. (a) Allele frequency distribution of all variants. (b) Distribution of variants according to the number of pools in which they are found.
Figure 2. Comparison of poolAF with AF of 1000genomes.
(a) Histogram of differences between poolAF and the 1000genomes European population [1 kg(EUR)]. Minimum: −0.494; 1st Quartile: −0.005; Median: 0.000; Mean: −0.002; 3rd Quartile: 0.005; Maximum: 0.308. (b) Boxplot of differences: left panel, 1000genomes_ALL (delta.kg.all); right panel, 1000genomes_EUR (delta.kg.eur). The overall similarity between poolAF and 1000genomes is higher for the 1000genomes_EUR population, as shown by the smaller IQR and the narrower spread of the data.
Figure 3. Pool sequencing AF vs. AF obtained from individual genotyping by ImmunoChip SNP-array.
(a) Correlation scatterplot. The points are colour coded according to the absolute difference (delta) between the two frequencies; the number of points for corresponding ranges of delta is shown in top left inset. (b) Pool-by-pool correlation. A representative scatter plot for one of the pools (12 individuals) for 1535 SNVs is shown.
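The comparison in Figure 3 can be illustrated with a minimal sketch. It assumes diploid autosomal loci, so a pool of 12 individuals carries 24 alleles and the genotype-derived AF is the alternate allele count divided by 24. The SNV names, genotypes, and pool AF values below are hypothetical and are not taken from the study; `genotype_af` and `pearson_r` are illustrative helpers, not the authors' code.

```python
from math import sqrt

PLOIDY = 2  # assumption: diploid, autosomal loci


def genotype_af(alt_allele_counts):
    """Allele frequency from per-individual alternate allele counts (0, 1 or 2)."""
    return sum(alt_allele_counts) / (PLOIDY * len(alt_allele_counts))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical genotypes for three SNVs in one pool of 12 individuals.
genotypes = {
    "snv_a": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 of 24 alleles
    "snv_b": [1, 2, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0],  # 6 of 24 alleles
    "snv_c": [2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 1],  # 20 of 24 alleles
}
# Hypothetical AF estimates from pooled sequencing of the same pool.
pool_af = {"snv_a": 0.05, "snv_b": 0.23, "snv_c": 0.85}

names = sorted(genotypes)
array_af = [genotype_af(genotypes[n]) for n in names]
seq_af = [pool_af[n] for n in names]

# Absolute difference (delta) per SNV, as used for the colour coding in panel (a).
delta = {n: abs(s - a) for n, s, a in zip(names, seq_af, array_af)}
r = pearson_r(array_af, seq_af)
```

With 24 alleles per pool, the smallest non-zero AF that genotyping can report is 1/24 (about 0.042), which is the frequency range where sequencing errors are hardest to separate from true low-frequency alleles.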
Figure 4. QUAL(ity) score distribution of all variants.
The dashed red vertical line denotes the ad-hoc threshold of low-quality (QUAL = 100).
Figure 5. Density distributions of QUAL(ity) scores of variants found in public databases (in.db), and those not found in any database (novel).
(a) Distributions for all variants (QUAL > 0) (b) Distribution for variants having QUAL > 100.
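The abstract states that the filtering guideline is based on the Kolmogorov-Smirnov test, but this record does not describe how the test was applied. As a generic illustration only, the sketch below computes the two-sample KS statistic (the largest gap between two empirical cumulative distribution functions) for two sets of QUAL scores such as the in.db and novel groups of Figure 5. The QUAL values are hypothetical, and `ks_statistic` is an illustrative helper rather than the authors' procedure; `scipy.stats.ks_2samp` could be used instead when a p-value is also needed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The supremum of the ECDF difference is reached at an observed data point.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Hypothetical QUAL scores, for illustration only.
in_db_qual = [150, 400, 900, 1200, 2500, 5000]
novel_qual = [20, 35, 60, 80, 300, 1500]

d_all = ks_statistic(in_db_qual, novel_qual)

# Same comparison restricted to variants above the ad-hoc QUAL = 100 threshold.
d_filtered = ks_statistic(
    [q for q in in_db_qual if q > 100],
    [q for q in novel_qual if q > 100],
)
```

A statistic near 0 means the two QUAL distributions are nearly indistinguishable, while a value near 1 means they barely overlap.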
Summary of comparison of Pool-seq variants with variants obtained from individual sequencing of the same pool (before and after filtering).
| Variant Type | Filters Applied | Pool-seq variants | TP variants | TP variants retained (%) | FPR (%) |
|---|---|---|---|---|---|
| | Original Variants | 8195 | 3772 | 100.00 | 53.97 |
| | After MPF & QF Filters | 3896 | 3636 | 96.39 | 6.67 |
| | Original Variants | 3911 | 3406 | 100.00 | 12.91 |
| | After MPF & QF Filters | 3566 | 3326 | 97.65 | 6.73 |
| | Original Variants | 4284 | 366 | 100.00 | 91.46 |
| | Both MPF and QF | 330 | 310 | 84.70 | 6.06 |
Original Variants = Original number of variants without any filter. MPF = Minimum Percentage Filter; QF = Quality Filter. Pool-seq variants = number of variants called by CRISP in this pool. True Positive (TP) variants = Number of pool-seq variants confirmed by individual sequencing. TP variants retained = % of TP variants retained after applying the respective filters. False Positive Rate (FPR) = Rate of False positive variants in respective data (before or after filters). See methods for details about TP and FPR calculations.
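The percentages in the table can be reproduced from its counts under a simple reading of the definitions above. This is an assumption, since the exact calculations are in the methods, which are not part of this record: TP retained is taken as the TP count after filtering divided by the TP count before filtering, and FPR as the share of called variants not confirmed by individual sequencing. The function names are illustrative, and the rows are identified only by their order in the table.

```python
def tp_retained_pct(tp_after, tp_original):
    """Percentage of originally confirmed (TP) variants that survive the filters."""
    return 100.0 * tp_after / tp_original


def fpr_pct(called, tp):
    """Percentage of called variants not confirmed by individual sequencing."""
    return 100.0 * (called - tp) / called


# (pool-seq variants, TP variants) before and after filtering, in table row order.
rows = [
    {"before": (8195, 3772), "after": (3896, 3636)},
    {"before": (3911, 3406), "after": (3566, 3326)},
    {"before": (4284, 366), "after": (330, 310)},
]

summary = []
for row in rows:
    called_before, tp_before = row["before"]
    called_after, tp_after = row["after"]
    summary.append(
        {
            "fpr_before": round(fpr_pct(called_before, tp_before), 2),
            "fpr_after": round(fpr_pct(called_after, tp_after), 2),
            "tp_retained": round(tp_retained_pct(tp_after, tp_before), 2),
        }
    )
```

Under this reading the computed values match every row of the table, for example an FPR falling from 53.97% to 6.67% with 96.39% of TP variants retained in the first pair of rows.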