Jiyuan Hu, Tengfei Li, Zidi Xiu, Hong Zhang.
Abstract
Most existing statistical methods for calling single nucleotide polymorphisms (SNPs) from next-generation sequencing (NGS) data are based on Bayesian frameworks, and no existing SNP caller produces p-values for calling SNPs in a frequentist framework. To fill this gap, we develop a new method, MAFsnp, a Multiple-sample based Accurate and Flexible algorithm for calling SNPs with NGS data. MAFsnp is based on an estimated likelihood ratio test (eLRT) statistic. In practical situations, the parameter involved is very close to the boundary of the parameter space, so standard large-sample theory is not suitable for evaluating the finite-sample distribution of the eLRT statistic. Observing that the distribution of the test statistic is a mixture of a point mass at zero and a continuous part, we propose to model the test statistic with a novel two-parameter mixture distribution. Once the parameters of the mixture distribution are estimated, p-values can be easily calculated for detecting SNPs, and the multiple-testing corrected p-values can be used to control the false discovery rate (FDR) at any pre-specified level. With simulated data, MAFsnp is shown to have much better control of FDR than the existing SNP callers. Through application to two real datasets, MAFsnp is also shown to outperform the existing SNP callers in terms of calling accuracy. An R package "MAFsnp" implementing the new SNP caller is freely available at http://homepage.fudan.edu.cn/zhangh/softwares/.
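The p-value and FDR steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the form of the two-parameter mixture distribution, so the sketch assumes one for illustration only: a point mass at zero with weight `w0`, and otherwise a scaled chi-square variable with one degree of freedom (scale `scale`). The function names, the parameter values, and the use of the Benjamini-Hochberg procedure for multiple-testing correction are all assumptions, not the MAFsnp implementation.

```python
import math


def mixture_pvalue(t, w0, scale):
    """P-value of statistic t under an ASSUMED null distribution:
    point mass at 0 with weight w0, else scale * chi-square(1 df).
    This form is illustrative and is not taken from the paper."""
    if t <= 0.0:
        return 1.0
    # Survival function of chi-square(1 df): P(X > x) = erfc(sqrt(x / 2)).
    return (1.0 - w0) * math.erfc(math.sqrt(t / (2.0 * scale)))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical test statistics for six candidate sites.
stats = [0.0, 0.0, 0.3, 2.1, 9.5, 25.0]
pvals = [mixture_pvalue(t, w0=0.5, scale=1.0) for t in stats]
adj = benjamini_hochberg(pvals)

# Call a site a SNP when its adjusted p-value is at or below the nominal level.
alpha = 0.05
called = [i for i, q in enumerate(adj) if q <= alpha]
print(called)
```

In a real analysis the two mixture parameters would be estimated from data rather than fixed by hand, which is the step the abstract attributes to MAFsnp.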
Year: 2015 PMID: 26309201 PMCID: PMC4550471 DOI: 10.1371/journal.pone.0135332
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Description of real datasets (1000 Genomes Project).
| Dataset | Samples | Populations (samples) | Mean coverage |
|---|---|---|---|
| Whole genome | 156 | CHS(71),CDX(3),CHB(49),JPT(33) | 4.7× |
| Targeted exon | 110 | CHD(32),CHB(17),JPT(61) | 2.4× |
CHB, Han Chinese; CDX, Dai Chinese; CHD, Denver Chinese; CHS, Southern Han Chinese; JPT, Japanese.
Fig 1. The relationship between (, ) and (mean coverage N, error rate e).
(A) Scatterplot of vs. N; (B) Scatterplot of vs. e; (C) Scatterplot of vs. N; (D) Scatterplot of vs. e.
Fig 2. Boxplot of FDRs and powers for MAFsnp0 and MAFsnp at nominal FDR level α = 0.01, 0.05, 0.1.
(A) Boxplot of FDRs with 96 read count datasets; (B) Boxplot of powers with 96 read count datasets; (C) Boxplot of FDRs with 18 sequence read datasets; (D) Boxplot of powers with 18 sequence read datasets.
Fig 3. F1 scores of MAFsnp (α = 0.01) and seqEM for read count data.
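For reference, the F1 score is the harmonic mean of precision and recall. The sketch below uses the standard definition; how true and false calls were counted for the figure is not stated here, so the argument names and the example counts are hypothetical.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Standard F1 score: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0  # no correct calls, so precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct calls, 20 false calls, 20 missed SNPs.
print(f1_score(80, 20, 20))
```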
SNP calling results of seqEM, MAQ, GATK, SAMtools, and MAFsnp for the 1000 Genomes Project data.
| Caller | Called SNPs: All | Called SNPs: Known | Called SNPs: Novel | CA (%)[a] | Ti/Tv ratio[b]: All | Ti/Tv ratio[b]: Known | Ti/Tv ratio[b]: Novel |
|---|---|---|---|---|---|---|---|
| | | | | | | | |
| seqEM | 5351 | 1698 | 3653 | 31.7 | 1.232 | 2.225 | 0.878 |
| MAQ | 2768 | 1632 | 1136 | 59.0 | 1.582 | 2.372 | 0.889 |
| GATK | 1709 | 1544 | 165 | 90.3 | 2.276 | 2.393 | 1.429 |
| SAMtools | 1762 | 1564 | 198 | 88.8 | 2.329 | 2.453 | 1.592 |
| MAFsnp[c] | 1699 | 1550 | 149 | 91.2 | 2.299 | 2.422 | 1.403 |
| | | | | | | | |
| seqEM | 950 | 348 | 602 | 36.6 | 1.263 | 2.48 | 0.612 |
| MAQ | 654 | 356 | 298 | 54.4 | 1.389 | 2.594 | 0.383 |
| GATK | 171 | 156 | 15 | 91.2 | 1.803 | 2.12 | 0.364 |
| SAMtools | 585 | 433 | 152 | 74.0 | 1.763 | 2.305 | 0.875 |
| MAFsnp[c] | 470 | 405 | 65 | 86.2 | 1.749 | 2.375 | 0.275 |
[a] Calling accuracy;
[b] transition/transversion ratio;
[c] nominal FDR level = 0.01.
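The CA column is numerically consistent with the share of called SNPs that are already known (Known / All × 100), and the Novel column with All minus Known. This reading is inferred from the table's numbers rather than stated in the text, so the sketch below is only a consistency check under that assumption; the helper name `calling_accuracy` is hypothetical.

```python
def calling_accuracy(known, total):
    """ASSUMED definition: percentage of called SNPs that are known."""
    return 100.0 * known / total


# (caller, all, known, novel, reported CA %) copied from the table,
# first block of rows followed by the second block.
rows = [
    ("seqEM", 5351, 1698, 3653, 31.7),
    ("MAQ", 2768, 1632, 1136, 59.0),
    ("GATK", 1709, 1544, 165, 90.3),
    ("SAMtools", 1762, 1564, 198, 88.8),
    ("MAFsnp", 1699, 1550, 149, 91.2),
    ("seqEM", 950, 348, 602, 36.6),
    ("MAQ", 654, 356, 298, 54.4),
    ("GATK", 171, 156, 15, 91.2),
    ("SAMtools", 585, 433, 152, 74.0),
    ("MAFsnp", 470, 405, 65, 86.2),
]

for caller, total, known, novel, reported in rows:
    ca = calling_accuracy(known, total)
    # Compare against the one-decimal value printed in the table.
    print(f"{caller:9s} computed CA = {ca:5.1f}  reported = {reported:5.1f}  "
          f"novel matches: {total - known == novel}")
```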