Ziwen He, Xinnian Li, Shaoping Ling, Yun-Xin Fu, Eric Hungate, Suhua Shi, Chung-I Wu.
Abstract
BACKGROUND: Because the error rate in next-generation sequencing (NGS) data is high and errors are distributed non-uniformly across sites, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data.
Year: 2013 PMID: 23919637 PMCID: PMC3750404 DOI: 10.1186/1471-2164-14-535
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Estimating θ with constant sequencing error rate
| Error rate | Filter | Single line (θ = 0.1 per kb) | Single platform (θ = 0.1 per kb) | Dual applications (θ = 0.1 per kb) | Single line (θ = 1 per kb) | Single platform (θ = 1 per kb) | Dual applications (θ = 1 per kb) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | S>0 | 0.099 (0.007) | 0.100 (0.003) | 0.100 (0.007) | 0.999 (0.023) | 1.000 (0.009) | 0.999 (0.022) |
| 0 | S>1 | 0.100 (0.012) | 0.100 (0.005) | 0.100 (0.010) | 0.999 (0.041) | 1.000 (0.016) | 1.000 (0.034) |
| | S>2 | 0.099 (0.016) | 0.100 (0.007) | 0.100 (0.013) | 0.998 (0.050) | 1.000 (0.023) | 1.000 (0.041) |
| | S>0 | 5.992 (0.131) | 22.054 (0.212) | 0.323 (0.026) | 6.884 (0.131) | 22.872 (0.225) | 1.226 (0.033) |
| 0.001 | S>1 | 0.129 (0.017) | 0.507 (0.032) | 0.100 (0.011) | 1.032 (0.040) | 1.409 (0.035) | 1.003 (0.034) |
| | S>2 | 0.101 (0.017) | 0.105 (0.008) | 0.100 (0.013) | 0.999 (0.052) | 1.009 (0.023) | 1.002 (0.042) |
| | S>0 | 28.389 (0.271) | 90.901 (0.357) | 5.269 (0.116) | 29.184 (0.275) | 91.485 (0.358) | 6.165 (0.118) |
| 0.005 | S>1 | 0.810 (0.054) | 9.295 (0.149) | 0.112 (0.012) | 1.716 (0.064) | 10.167 (0.152) | 1.024 (0.035) |
| | S>2 | 0.112 (0.017) | 0.662 (0.039) | 0.100 (0.014) | 1.016 (0.053) | 1.575 (0.046) | 1.010 (0.042) |
| | S>0 | 53.883 (0.351) | 146.007 (0.357) | 18.820 (0.215) | 54.615 (0.331) | 146.329 (1.111) | 19.684 (0.218) |
| 0.01 | S>1 | 2.861 (0.105) | 31.994 (0.258) | 0.257 (0.025) | 3.777 (0.107) | 32.823 (0.821) | 1.179 (0.040) |
| | S>2 | 0.180 (0.027) | 4.061 (0.108) | 0.102 (0.013) | 1.091 (0.057) | 4.993 (0.327) | 1.017 (0.041) |
The average depth is 2X per individual in the single line method, 2X per haploid genome in the single platform method, and 1X per haploid genome in each application in the dual applications method. The means (with standard deviations in parentheses) of θ are estimated from 1000 replicates. The error rate is per site.
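The S>0, S>1 and S>2 rows can be read as estimates that use only variants seen more than 0, 1 or 2 times. A minimal sketch of that idea, not necessarily the authors' estimator: under the standard neutral model the expected number of sites with derived-allele count i among n sequences is θ/i, so the number of retained sites divided by the matching partial harmonic sum gives an estimate of θ. The function names, the unfolded site frequency spectrum as input, and the toy counts are all assumptions made for illustration.

```python
def harmonic_range(lo, hi):
    """Sum of 1/i for i = lo..hi (inclusive)."""
    return sum(1.0 / i for i in range(lo, hi + 1))


def truncated_watterson(sfs, n, min_count=1):
    """Estimate theta from an unfolded site frequency spectrum.

    sfs       -- dict mapping derived-allele count i (1..n-1) to number of sites
    n         -- number of sampled sequences
    min_count -- smallest allele count retained
                 (1 keeps everything, 2 drops singletons, 3 also drops doubletons)

    Assumes the neutral expectation E[sites with count i] = theta / i.
    The result is for the whole region; divide by its length in kb
    to express it per kb.
    """
    if not 1 <= min_count <= n - 1:
        raise ValueError("min_count must lie between 1 and n - 1")
    kept = sum(c for i, c in sfs.items() if min_count <= i <= n - 1)
    return kept / harmonic_range(min_count, n - 1)


# Toy spectrum (hypothetical numbers): the neutral expectation for theta = 12, n = 5.
clean = {1: 12, 2: 6, 3: 4, 4: 3}
# Sequencing errors mostly appear as rare variants; here 30 false singletons are added.
noisy = dict(clean)
noisy[1] += 30

for label, spectrum in (("clean", clean), ("noisy", noisy)):
    for k in (1, 2, 3):
        print(label, "min_count =", k, "theta =", round(truncated_watterson(spectrum, 5, k), 3))
```

In this toy case the false singletons inflate the estimate only when singletons are kept, which mirrors the pattern in the table. The cost of discarding classes is fewer retained sites and therefore a larger standard deviation.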
Estimating θ with Beta distributed sequencing error rate
| | Filter | Single line (θ = 0.1 per kb) | Single platform (θ = 0.1 per kb) | Dual applications (θ = 0.1 per kb) | Single line (θ = 1 per kb) | Single platform (θ = 1 per kb) | Dual applications (θ = 1 per kb) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | S>0 | 19.917 (0.237) | 38.335 (0.274) | 2.534 (0.083) | 20.759 (0.227) | 39.095 (0.267) | 3.444 (0.085) |
| 0.1 | S>1 | 4.709 (0.139) | 17.500 (0.197) | 0.336 (0.032) | 5.605 (0.133) | 18.347 (0.197) | 1.248 (0.043) |
| | S>2 | 1.172 (0.073) | 9.541 (0.159) | 0.130 (0.017) | 2.073 (0.081) | 10.425 (0.156) | 1.041 (0.042) |
| | S>0 | 23.170 (0.255) | 51.539 (0.290) | 3.499 (0.098) | 23.964 (0.230) | 52.237 (0.306) | 4.393 (0.097) |
| 0.2 | S>1 | 3.343 (0.112) | 18.415 (0.203) | 0.250 (0.026) | 4.243 (0.120) | 19.249 (0.208) | 1.160 (0.040) |
| | S>2 | 0.534 (0.049) | 7.555 (0.143) | 0.109 (0.014) | 1.443 (0.071) | 8.444 (0.145) | 1.015 (0.041) |
| | S>0 | 25.398 (0.240) | 64.217 (0.333) | 4.243 (0.105) | 26.210 (0.239) | 64.855 (0.335) | 5.134 (0.108) |
| 0.4 | S>1 | 2.259 (0.095) | 17.193 (0.201) | 0.181 (0.019) | 3.164 (0.097) | 18.017 (0.201) | 1.090 (0.038) |
| | S>2 | 0.267 (0.034) | 4.916 (0.114) | 0.103 (0.013) | 1.175 (0.059) | 5.808 (0.114) | 1.010 (0.042) |
| | S>0 | 26.772 (0.270) | 74.504 (0.355) | 4.717 (0.112) | 27.578 (0.262) | 75.097 (0.340) | 5.609 (0.111) |
| 0.8 | S>1 | 1.591 (0.073) | 14.860 (0.196) | 0.143 (0.015) | 2.492 (0.087) | 15.706 (0.181) | 1.054 (0.035) |
| | S>2 | 0.171 (0.024) | 2.870 (0.084) | 0.101 (0.013) | 1.076 (0.057) | 3.773 (0.088) | 1.010 (0.041) |
The average error rate is 0.005 per site. The average depth is 2X per individual in the single line method, 2X per haploid genome in the single platform method, and 1X per haploid genome in each application in the dual applications method. The means (with standard deviations in parentheses) of θ are estimated from 1000 replicates.
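To make "Beta distributed sequencing error rate" concrete, the sketch below draws one error rate per site from a Beta distribution whose mean is fixed at 0.005. The parameterisation is an assumption: a shape parameter `shape` is chosen freely and the second parameter is derived from the mean. Whether the values 0.1 to 0.8 in the first column of the table correspond to this shape parameter is not stated in this excerpt, so the function and its argument names are illustrative only.

```python
import random


def beta_error_rates(n_sites, shape, mean=0.005, seed=0):
    """Draw one sequencing error rate per site from Beta(shape, b).

    b is chosen so that the distribution mean, shape / (shape + b),
    equals `mean`. Smaller `shape` values concentrate the errors on
    fewer sites while keeping the same average rate.
    """
    if not 0.0 < mean < 1.0 or shape <= 0.0:
        raise ValueError("need 0 < mean < 1 and shape > 0")
    b = shape * (1.0 - mean) / mean
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [rng.betavariate(shape, b) for _ in range(n_sites)]


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


for shape in (0.1, 0.2, 0.4, 0.8):
    rates = beta_error_rates(20000, shape)
    print("shape", shape,
          "mean", round(sum(rates) / len(rates), 4),
          "variance", variance(rates))
```

With the mean held fixed, a smaller shape value gives a more uneven distribution of error rates across sites, which is the kind of non-uniformity that makes error-induced variants recur at the same site.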
Figure 1. Error rate correlation patterns. a) MAF (minor allele frequency) of putative SNPs called by either SOLiD or Illumina GA. b) MAF in two samples (Bangkunsha and Thongnian) sequenced by Illumina HiSeq.
Figure 2. θ estimation from simulated data of a pooled-lines sample with three different sequencing error rates. The θ value of the simulated data is set to 0.1 or 1 per kb. Singletons are discarded in the dual applications method (S>1). Singletons and doubletons are discarded in the single platform method (S>2). The length of each error bar is 2 times the standard deviation. The means (and the standard deviations) of θ are estimated from 1000 replicates.
Figure 3. θ estimation of dual applications for different region lengths. The θ value of the simulated data is set to 1 per kb. The sequencing error rate is set to 0.005. Singletons are discarded in the estimation (S>1). The length of each error bar is 2 times the standard deviation. The means (and the standard deviations) of θ are estimated from 1000 replicates.