Jinzhuang Dou, Xiqiang Zhao, Xiaoteng Fu, Wenqian Jiao, Nannan Wang, Lingling Zhang, Xiaoli Hu, Shi Wang, Zhenmin Bao.
Abstract
BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype large numbers of SNPs in non-model organisms with limited or no genomic resources. Most NGS-based genotyping methods require a reference genome for accurate SNP calling. Little effort, however, has been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome.
Year: 2012 PMID: 22682067 PMCID: PMC3472322 DOI: 10.1186/1745-6150-7-17
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Figure 1. Schematic illustration of how a false SNP can arise after clustering of reads derived from repetitive genomic regions. Both ML and iML perform well in genotyping SNPs derived from single-copy genomic regions (left), but iML is more effective at identifying and excluding false SNPs resulting from repetitive regions (right).
Figure 2. Observed distribution of cluster depth (black) and the fitted mixed Poisson model (purple) at different sequencing depths (10x, 20x, 30x and 40x) for the simulation datasets. The mixed Poisson model fits the observed distribution well, especially at higher sequencing coverage. The depth threshold for genotyping is indicated by a dashed line.
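The mixed Poisson fit described in this caption can be illustrated with a minimal sketch. It assumes that an i-copy cluster has mean depth i × C, which is an interpretation of the caption rather than a detail stated in this record. The function names, the example weights and the rule used to place the depth threshold are all hypothetical and are not taken from the paper.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of depth k, computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mixed_poisson_pmf(k, c, weights):
    """Mixture of Poisson components.

    Assumption: component i (1-copy, 2-copy, ...) has mean i * c and
    mixing coefficient weights[i - 1].
    """
    return sum(a * poisson_pmf(k, i * c)
               for i, a in enumerate(weights, start=1))


def depth_threshold(c, weights, max_depth=None):
    """Hypothetical threshold rule: the smallest depth above c at which the
    multi-copy components together outweigh the single-copy component."""
    if max_depth is None:
        max_depth = int(10 * c)
    for k in range(int(c) + 1, max_depth + 1):
        single = weights[0] * poisson_pmf(k, c)
        multi = sum(a * poisson_pmf(k, i * c)
                    for i, a in enumerate(weights, start=1) if i > 1)
        if multi > single:
            return k
    return max_depth


# Illustrative values only (not fitted to any dataset in the paper).
example_c = 20.0
example_weights = [0.85, 0.10, 0.05]
threshold = depth_threshold(example_c, example_weights)
```

Clusters deeper than `threshold` would be treated as likely repeats under this toy rule. A real analysis would first estimate C and the coefficients from the observed depth distribution.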
Figure 3. Comparison of the performance of three SNP calling approaches based on the simulation datasets (A) and (B). iML outperforms both ML and the threshold approach, markedly improving genotyping accuracy at the cost of slightly decreased sensitivity. ML_ref, reference-based SNP calling using the ML algorithm; iML_denovo, de novo SNP calling using the iML algorithm; ML_denovo, de novo SNP calling using the ML algorithm; TH_denovo, de novo SNP calling using the threshold approach.
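To make the contrast between a maximum-likelihood (ML) call and a threshold call concrete, the following sketch genotypes a single diallelic site from two allele read counts. It uses a simplified two-allele binomial model with a fixed per-read error rate. This is an assumption made for illustration and is not necessarily the likelihood used by ML or iML in the paper. The function names, the default error rate and the minor-allele cutoff are hypothetical.

```python
import math


def genotype_log_likelihoods(n1, n2, error_rate=0.01):
    """Log-likelihood of each genotype given n1 reads of allele A and n2 of allele B.

    Simplified model: under a homozygote, a read shows the other allele
    with probability error_rate; under a heterozygote, each allele is
    read with probability 0.5.
    """
    n = n1 + n2
    log_comb = math.lgamma(n + 1) - math.lgamma(n1 + 1) - math.lgamma(n2 + 1)
    e = error_rate
    return {
        "AA": log_comb + n1 * math.log(1 - e) + n2 * math.log(e),
        "AB": log_comb + n * math.log(0.5),
        "BB": log_comb + n1 * math.log(e) + n2 * math.log(1 - e),
    }


def call_genotype_ml(n1, n2, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    ll = genotype_log_likelihoods(n1, n2, error_rate)
    return max(ll, key=ll.get)


def call_genotype_threshold(n1, n2, min_minor=3):
    """Hypothetical threshold rule: heterozygous if the minor allele has
    at least min_minor reads, otherwise homozygous for the major allele."""
    if min(n1, n2) >= min_minor:
        return "AB"
    return "AA" if n1 >= n2 else "BB"
```

Neither caller checks whether the reads in a cluster come from a single locus, which is why clusters built from repetitive regions can still yield false SNPs with either approach.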
Summary of two real sequencing datasets used for evaluation of the iML algorithm
| Library preparation | 2b-RAD | RAD |
| Restriction enzyme | BsaXI | SbfI |
| Trimmed read length | 27 bp | 55 bp |
| High-quality reads | 5,845,509 | 4,672,098 |
| Mapped reads | 5,339,662 | 4,139,761 |
| Clustered reads | 5,809,558 | 4,220,881 |
| No. of | 39,678 | 45,600 |
| No. of | 35,362 | 40,125 |
| No. of read clusters | 33,877 | 42,352 |
| Reference | [ | [ |
a, the total number of restriction sites that were predicted from the genome assemblies of TAIR8 and BROADS1 for A. thaliana and G. aculeatus, respectively.
Kolmogorov-Smirnov (K-S) test of the fit of four distribution models to the two real datasets
| Dataset | Model | C | a1 | a2 | a3 | σ1 | σ2 | σ3 | P value |
| | Poisson | 137.2 | - | - | - | - | - | - | 0 |
| | Mixed Poisson | 125.0 | 0.83 | 0.09 | 0.09 | - | - | - | 0 |
| | Normal | 137.2 | - | - | - | 74.1 | - | - | 2.5E-7 |
| | Mixed normal | 110.0 | 0.80 | 0.18 | 0.01 | 48.2 | 39.1 | 34.2 | 0.378 |
| | Poisson | 98.1 | - | - | - | - | - | - | 0 |
| | Mixed Poisson | 105.0 | 0.84 | 0.09 | 0.07 | - | - | - | 0 |
| | Normal | 98.1 | - | - | - | 50.5 | - | - | 0.029 |
| | Mixed normal | 100.0 | 0.98 | 0.02 | 0.00 | 45.3 | 24.3 | 24.1 | 0.288 |
C, a and σ represent the mean, the mixture coefficient and the standard deviation, respectively, of i-copy clusters in a given model.
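As a rough illustration of how a K-S comparison between observed cluster depths and a mixed normal model could be set up, the sketch below computes the one-sample K-S statistic D against a mixture CDF. It reads the numbers in the first "Mixed normal" row (110.0; 0.80, 0.18, 0.01; 48.2, 39.1, 34.2) as C, the three coefficients and the three standard deviations, and it assumes that the i-copy component has mean i × C. Both are interpretations, not facts stated in this record. The function names are hypothetical. A library routine such as `scipy.stats.kstest` would additionally convert D into a P value.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixed_normal_cdf(x, c, weights, sigmas):
    """CDF of a normal mixture.

    Assumption: component i has mean i * c. The weights are renormalised
    because the tabulated coefficients are rounded and may not sum to 1.
    """
    total = sum(weights)
    return sum(a * normal_cdf(x, i * c, s)
               for i, (a, s) in enumerate(zip(weights, sigmas), start=1)) / total


def ks_statistic(sample, cdf):
    """One-sample K-S statistic: the largest gap between the empirical CDF
    of the sample and the model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for j, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (j + 1) / n - f, f - j / n)
    return d


def model_cdf(x):
    """Mixed normal CDF with the parameter values read from the table."""
    return mixed_normal_cdf(x, 110.0, [0.80, 0.18, 0.01], [48.2, 39.1, 34.2])


# Toy depths for illustration only; real input would be one depth per cluster.
toy_depths = [62, 85, 97, 104, 111, 118, 130, 149, 205, 231]
d_value = ks_statistic(toy_depths, model_cdf)
```

A small D means the empirical and model CDFs stay close across all depths, which corresponds to a large P value in the table.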
Figure 4. Observed distribution of cluster depth (black) and the fitted mixed normal model (purple) for the real sequencing datasets of A. thaliana and G. aculeatus. The depth threshold for iML genotyping is indicated by a dashed line.
Figure 5. Comparison of the performance of SNP calling approaches based on the two real sequencing datasets (A) and (B). FPR/FNR, false positive or false negative rate.