Jingwen Wang, Tiina Skoog, Elisabet Einarsdottir, Tea Kaartokallio, Hannele Laivuori, Anna Grauers, Paul Gerdhem, Marjo Hytönen, Hannes Lohi, Juha Kere, Hong Jiao.
Abstract
High-throughput sequencing using pooled DNA samples can facilitate genome-wide studies on rare and low-frequency variants in a large population. Two major questions concerning the pooled sequencing strategy are whether rare and low-frequency variants can be detected reliably, and whether estimated minor allele frequencies (MAFs) can represent the actual values obtained from individually genotyped samples. In this study, we evaluated MAF estimates using three variant detection tools with two sets of pooled whole exome sequencing (WES) data and one set of pooled whole genome sequencing (WGS) data. Both GATK and Freebayes displayed high sensitivity, specificity and accuracy when detecting rare or low-frequency variants. For the WGS study, 56% of the low-frequency variants on the Illumina array had identical MAFs, and 26% differed by one allele, between the sequencing and the individual genotyping data. The MAF estimates from WGS correlated well (r = 0.94) with those from Illumina arrays. The MAFs from the pooled WES data also showed high concordance (r = 0.88) with those from the individual genotyping data. In conclusion, the MAFs estimated from pooled DNA sequencing data reflect the MAFs in individually genotyped samples well. The pooling strategy can thus be a rapid and cost-effective approach for the initial screening in large-scale association studies.
Year: 2016 PMID: 27633116 PMCID: PMC5025741 DOI: 10.1038/srep33256
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Comparison of SNV detection by SAMtools, GATK (ploidy setting) and Freebayes (ploidy setting) in the WES studies.
The blue, red and yellow circles represent the SNVs detected by GATK UnifiedGenotyper (ploidy setting), Freebayes (ploidy setting) and SAMtools, respectively. (a) The SNVs in the Agilent SureSelect target regions in the scoliosis study. (b) The SNVs in the Agilent SureSelect target regions in the pre-eclampsia study.
Comparison of SNV detection by SAMtools, GATK, and Freebayes using multiple pools as simultaneous input.
| Study | Category | SAMtools | GATK (ploidy = 20) | Freebayes (ploidy = 20) |
|---|---|---|---|---|
| Idiopathic scoliosis | On enrichment regions | 79 677 | 171 286 | 219 526 |
| | Annotated in dbSNP 144 | 74 098 (93.0%) | 143 920 (84.0%) | 134 575 (61.3%) |
| | Rare | 2 049 (2.6%) | 20 414 (11.9%) | 15 857 (7.2%) |
| | Low-frequency | 5 884 (7.4%) | 24 790 (14.5%) | 22 625 (10.3%) |
| | Common | 63 505 (79.7%) | 71 560 (41.8%) | 71 064 (32.4%) |
| | Unknown frequency | 8 239 (10.3%) | 54 522 (31.8%) | 109 980 (50.1%) |
| Pre-eclampsia | On enrichment regions | 69 500 | 180 607 | 192 716 |
| | Annotated in dbSNP 144 | 64 665 (93.0%) | 153 740 (85.1%) | 137 961 (71.6%) |
| | Rare | 583 (0.8%) | 26 859 (14.9%) | 20 066 (10.4%) |
| | Low-frequency | 2 329 (3.4%) | 26 242 (14.5%) | 23 576 (12.2%) |
| | Common | 59 320 (85.3%) | 72 432 (40.1%) | 71 984 (37.4%) |
| | Unknown frequency | 7 268 (10.5%) | 55 074 (30.5%) | 77 090 (40.0%) |
| Bull Terrier | Total | 4 736 038 | 7 323 018 | 7 612 527 |
| | Mean depth >30× | 4 253 121 (89.6%) | 6 704 136 (91.5%) | 6 756 976 (88.8%) |
| | On Illumina array | 90 190 | 100 678 | 101 082 |
&The number of SNVs in the target regions captured by the Agilent SureSelect enrichment kit.
*Rare: alternative allele frequency <1%; Low-frequency: alternative allele frequency between 1% and 5%; Common: alternative allele frequency >5%. All of the allele frequencies were retrieved from the 1000 Genomes European population (August 2015).
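The frequency classes in the footnote can be expressed as a small binning function. The following is a minimal sketch, not the authors' pipeline: the function name `classify_frequency` is hypothetical, treating exactly 1% and exactly 5% as low-frequency is an assumption (the footnote does not state how boundary values are handled), and a missing reference frequency is mapped to the "Unknown frequency" row of the table.

```python
from collections import Counter
from typing import Iterable, Optional


def classify_frequency(alt_af: Optional[float]) -> str:
    """Bin an alternative allele frequency (as a fraction, 0..1).

    Thresholds follow the table footnote: <1% rare, 1-5% low-frequency,
    >5% common. Boundary handling (1% and 5% inclusive in the
    low-frequency class) is an assumption of this sketch.
    """
    if alt_af is None:
        return "unknown"  # no frequency available in the reference panel
    if not 0.0 <= alt_af <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if alt_af < 0.01:
        return "rare"
    if alt_af <= 0.05:
        return "low-frequency"
    return "common"


def count_classes(alt_afs: Iterable[Optional[float]]) -> Counter:
    """Tally how many variants fall into each frequency class."""
    return Counter(classify_frequency(af) for af in alt_afs)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    print(count_classes([0.004, 0.03, 0.2, None, 0.049]))
```

Tallying the classes for one tool and dividing by its "On enrichment regions" total would give percentages laid out like the rows of the table above.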
Figure 2. MAF comparison between the WES and the Sequenom genotyping.
The MAFs were estimated with GATK based on allele counts. The blue squares represent the 43 validated SNVs in the scoliosis study and the red dots the 47 validated SNVs in the pre-eclampsia study. The diagonal is shown with the grey dashed line.
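An allele-count-based MAF estimate of the kind mentioned in this caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `pooled_maf` is hypothetical, each pool is assumed to contribute a fixed number of chromosomes equal to the ploidy setting (20 in the tables of this record), and the per-pool alternative allele counts are assumed to come from the genotype calls of the variant caller.

```python
from typing import Sequence


def pooled_maf(alt_counts: Sequence[int], chromosomes_per_pool: int = 20) -> float:
    """Estimate the minor allele frequency across several pools.

    alt_counts: alternative allele count called in each pool.
    chromosomes_per_pool: chromosomes represented by one pool
        (assumed equal to the caller's ploidy setting).
    """
    if not alt_counts:
        raise ValueError("at least one pool is required")
    if chromosomes_per_pool <= 0:
        raise ValueError("chromosomes_per_pool must be positive")
    if any(c < 0 or c > chromosomes_per_pool for c in alt_counts):
        raise ValueError("allele count outside 0..chromosomes_per_pool")

    total_alt = sum(alt_counts)
    total_chromosomes = chromosomes_per_pool * len(alt_counts)
    alt_af = total_alt / total_chromosomes
    # The minor allele is whichever allele is less frequent.
    return min(alt_af, 1.0 - alt_af)


if __name__ == "__main__":
    # Illustrative counts for five pools, not data from the study.
    print(pooled_maf([1, 0, 2, 0, 0]))
```

Under these assumptions the smallest non-zero frequency that can be represented is one allele out of all pooled chromosomes, so adding pools refines the resolution of the estimate.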
Figure 3. Distribution and variation of total MAFs estimated using a variable number of pre-eclampsia pools.
The X-axis represents the number of pools used for estimating MAF from the exome sequencing data and the Y-axis shows the MAF difference between the exome sequencing estimation and the genotyping validation. (a) Low-frequency SNV rs36051194. (b) Rare SNV rs3803339.
Figure 4. SNV detection across variant detection tools and minor allele count comparison between the WGS and the Illumina array in the bull terrier WGS study.
(a) Polymorphic and monomorphic markers in the affected or the unaffected pool detected with the three variant detection tools and the Illumina array. (b) The variants in blue are SNVs with MAF < 5% in the genotyping data and the variants marked in brown are SNVs with MAF > 5% in the genotyping data. (c) Percentage of allele count difference (absolute value) between two platforms among polymorphic loci.
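The quantity in panel (c), the share of loci at each absolute allele count difference between two platforms, could be computed along the following lines. This is a sketch with hypothetical names (`allele_count_diff_percentages`) and made-up inputs; it assumes the two lists hold minor allele counts for the same loci in the same order.

```python
from collections import Counter
from typing import Dict, Sequence


def allele_count_diff_percentages(
    seq_counts: Sequence[int], array_counts: Sequence[int]
) -> Dict[int, float]:
    """Percentage of loci at each absolute minor allele count difference.

    seq_counts: minor allele counts estimated from pooled sequencing.
    array_counts: minor allele counts from individual genotyping,
        aligned locus by locus with seq_counts.
    """
    if len(seq_counts) != len(array_counts):
        raise ValueError("both platforms must cover the same loci")
    if not seq_counts:
        raise ValueError("no loci to compare")

    diffs = Counter(abs(s - a) for s, a in zip(seq_counts, array_counts))
    n_loci = len(seq_counts)
    return {d: 100.0 * c / n_loci for d, c in sorted(diffs.items())}


if __name__ == "__main__":
    # Illustrative counts for four loci, not data from the study.
    print(allele_count_diff_percentages([1, 2, 3, 0], [1, 3, 3, 2]))
```

In the returned dictionary, key 0 corresponds to loci with identical counts on both platforms and key 1 to loci that differ by a single allele.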
Evaluation of variant detection tools in the Bull Terrier study.
| Tool | Sensitivity (%) | Specificity (%) | Precision (%) | Negative Predictive Value (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SAMtools | 91.04 | 88.09 | 79.74 | 95.02 | 89.09 |
| GATK (ploidy = 20) | 96.59 | 97.62 | 96.24 | 97.85 | 97.22 |
| Freebayes (ploidy = 20) | 96.79 | 97.73 | 96.42 | 97.97 | 97.37 |
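The five columns of this table are standard confusion-matrix metrics. The sketch below shows how they are defined; the function name `evaluate_calls` and the example counts are hypothetical, and how true and false positives were defined against the Illumina array in the study is not stated in this record, so the inputs here are simply four counts.

```python
from typing import Dict


def evaluate_calls(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Return standard classification metrics as percentages.

    tp, fp, tn, fn: true positive, false positive, true negative and
    false negative counts of variant calls against a reference platform.
    """
    if min(tp, fp, tn, fn) < 0:
        raise ValueError("counts must be non-negative")
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        raise ValueError("a metric denominator is zero")

    total = tp + fp + tn + fn
    return {
        "sensitivity": 100.0 * tp / (tp + fn),  # true positive rate
        "specificity": 100.0 * tn / (tn + fp),  # true negative rate
        "precision": 100.0 * tp / (tp + fp),    # positive predictive value
        "npv": 100.0 * tn / (tn + fn),          # negative predictive value
        "accuracy": 100.0 * (tp + tn) / total,
    }


if __name__ == "__main__":
    # Made-up counts for illustration, not data from the study.
    for name, value in evaluate_calls(tp=90, fp=10, tn=80, fn=20).items():
        print(f"{name}: {value:.2f}%")
```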