Haojing Shao, Evangelos Bellos, Hanjiudai Yin, Xiao Liu, Jing Zou, Yingrui Li, Jun Wang, Lachlan J M Coin.
Abstract
Insertion and deletion polymorphisms (indels) are an important source of genomic variation in plant and animal genomes, but accurate genotyping from low-coverage and exome next-generation sequence data remains challenging. We introduce an efficient population clustering algorithm for diploids and polyploids, which was tested on a dataset of 2000 exomes. Compared with existing methods, we report a 4-fold reduction in overall indel genotype error rates, with a 9-fold reduction in low-coverage regions.
Year: 2012 PMID: 23221639 PMCID: PMC3562001 DOI: 10.1093/nar/gks1143
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Illustration of the population clustering method on real data. (A–D) Clustering at different putative indel sites, with different depths of coverage and different site-specific error rates. Each point represents one individual in the population, plotted by its total number of aligned reads (X-axis) and its number of indel-aligned reads (Y-axis). Shapes indicate the genotype called by SOAP-popIndel: squares, circles and triangles indicate homozygous reference, heterozygous and homozygous indel, respectively. (A and C) Low-to-medium depth of coverage, low error rate. (B) Medium-to-high depth of coverage, low error rate. (D) Low-to-medium depth of coverage, high error rate.
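To make the axes of Figure 1 concrete, here is a minimal Python sketch that calls a diploid genotype for a single individual from its two read counts. It is not the SOAP-popIndel population clustering algorithm. It is a simplified per-individual binomial model with a flat prior, and the function names, the expected indel-read fractions, the default `error_rate` and the posterior `threshold` are all assumptions made for illustration.

```python
from math import comb


def genotype_posteriors(total_reads, indel_reads, error_rate=0.01):
    """Posterior over diploid genotypes from one individual's read counts.

    Assumption (not from the paper): indel-supporting reads follow a
    binomial distribution whose success probability depends only on the
    genotype, and all three genotypes are equally likely a priori.
    """
    # Hypothetical expected fraction of indel reads for each genotype.
    expected_fraction = {
        "hom_ref": error_rate,
        "het": 0.5,
        "hom_indel": 1.0 - error_rate,
    }
    ref_reads = total_reads - indel_reads
    likelihoods = {
        genotype: comb(total_reads, indel_reads)
        * fraction ** indel_reads
        * (1.0 - fraction) ** ref_reads
        for genotype, fraction in expected_fraction.items()
    }
    normaliser = sum(likelihoods.values())
    return {g: lik / normaliser for g, lik in likelihoods.items()}


def call_genotype(total_reads, indel_reads, error_rate=0.01, threshold=0.90):
    """Return the most probable genotype, or None if it is below threshold."""
    posteriors = genotype_posteriors(total_reads, indel_reads, error_rate)
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= threshold else None
```

Under these assumptions, an individual with 20 reads of which 10 support the indel is called heterozygous, whereas an individual with a single indel-supporting read gets no call at a 0.90 threshold. That ambiguity at low depth is the situation in which pooling information across the population is expected to help.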
Figure 2. Genotyping accuracy and missing rates. Dashed line, solid line, circles and diamonds represent SOAP-popIndel, Dindel, SAMtools and piCALL, respectively. Black: real exome data; red: 4× simulation; green: 20× simulation; blue: 40× simulation. Lines for Dindel and SOAP-popIndel are based on posterior probability thresholds between 0.90 and 0.99. SAMtools and piCALL do not report a probability of assignment, so each is represented by a single point. (A) Results on 44 Sequenom-validated sites. (B) Restricted to sites within samples that had <5× coverage. (C) Results on simulated data.
Comparison of false-discovery and false-negative rates of different methods in detecting indels on simulated data
| Method | Metric | Diploid 4× (%) | Diploid 20× (%) | Diploid 40× (%) | Triploid 40× (%) |
|---|---|---|---|---|---|
| SOAP_popIndel | FN | 0.22 | 0.33 | 0.55 | 0.66 |
| SOAP_popIndel | FD | 0.22 | 0.11 | 0.22 | 0.87 |
| Dindel | FN | 0.99 | 0.99 | 1.20 | NA |
| Dindel | FD | 9.60 | 5.54 | 1.20 | NA |
| SAMtools | FN | 1.20 | 11.8 | 18.84 | NA |
| SAMtools | FD | 64.28 | 63.10 | 63.55 | NA |
| piCALL | FN | 11.83 | 1.42 | 1.42 | NA |
| piCALL | FD | 53.18 | 57.18 | 64.19 | NA |
NA, not applicable; FD, false discovery; FN, false-negative.
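As a reading aid for the FD and FN columns, the sketch below computes both rates from a set of simulated (true) indel sites and a set of called sites. The page does not state the exact formulas used, so the conventional definitions are assumed here: FD = FP / (TP + FP) and FN = FN / (TP + FN), both expressed as percentages. The function name and the site representation are hypothetical.

```python
def detection_error_rates(true_sites, called_sites):
    """Return (false-discovery %, false-negative %) for indel detection.

    Assumption: sites are hashable identifiers, e.g. (chromosome, position)
    tuples, and a call is correct only if it matches a true site exactly.
    """
    true_sites = set(true_sites)
    called_sites = set(called_sites)

    true_positives = len(true_sites & called_sites)
    false_positives = len(called_sites - true_sites)
    false_negatives = len(true_sites - called_sites)

    # Guard against empty denominators when nothing was called or simulated.
    called_total = true_positives + false_positives
    true_total = true_positives + false_negatives
    fd_rate = 100.0 * false_positives / called_total if called_total else 0.0
    fn_rate = 100.0 * false_negatives / true_total if true_total else 0.0
    return fd_rate, fn_rate
```

For example, with four simulated sites and four calls of which three match, both rates come out at 25%.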