Xuan Zhu, Jian Wang, Bo Peng, Sanjay Shete.
Abstract
BACKGROUND: Next-generation sequencing has been used by investigators to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared to conventional sequencing, the error rates for next-generation sequencing are often higher, which impacts the downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach to estimate the error rates for next-generation sequencing data based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data. Therefore, it is necessary to estimate the error rates in a more reliable way without assuming linearity. We proposed an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows.
Keywords: Empirical error rate; Frequency-based simulation; Next-generation sequencing; Short reads; Smoothing spline
Year: 2016; PMID: 27102907; PMCID: PMC4840868; DOI: 10.1186/s12859-016-1052-3
Source DB: PubMed; Journal: BMC Bioinformatics; ISSN: 1471-2105; Impact factor: 3.169
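The abstract argues that a linear read-shadow relationship may not hold for all data. As a minimal sketch of what that means in practice, the snippet below fits a no-intercept least-squares line to two synthetic data sets and reports how far the line misses each one. Everything here is an illustrative assumption: the read and shadow counts are made up, the function names are hypothetical, and the code reproduces neither the shadow regression of Wang et al. nor the smoothing spline estimators described in the abstract.

```python
import math


def fit_through_origin(x, y):
    """Least-squares slope b for the no-intercept model y = b * x."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def max_relative_residual(x, y):
    """Largest |observed - fitted| / observed under the no-intercept line."""
    b = fit_through_origin(x, y)
    return max(abs(yi - b * xi) / yi for xi, yi in zip(x, y))


# Synthetic read counts (hypothetical, not taken from the paper).
reads = [100, 200, 400, 800, 1600, 3200]

# Case 1: shadows exactly proportional to reads (the linear assumption holds).
linear_shadows = [0.05 * n for n in reads]

# Case 2: shadows grow like sqrt(reads), a made-up nonlinear relationship.
nonlinear_shadows = [5.0 * math.sqrt(n) for n in reads]

print("linear case:   ", max_relative_residual(reads, linear_shadows))
print("nonlinear case:", max_relative_residual(reads, nonlinear_shadows))
```

When the relationship really is proportional, the line reproduces every point up to floating-point error. In the nonlinear case, a single slope misses the low-count reads badly, which is the kind of misfit a more flexible curve is meant to avoid.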
Median error rates in MAQC data using shadow linear regression and smoothing spline approaches^a
| Samples | Expected ER | SRER | SRER Bias | EER_CS | EER_CS Bias | EER_RS | EER_RS Bias |
|---|---|---|---|---|---|---|---|
| SRR037452 | 0.3305 | 0.2578 | 0.0727 | 0.3104 | 0.0201 | 0.3096 | 0.0209 |
| SRR037453 | 0.1917 | 0.1584 | 0.0333 | 0.1824 | 0.0093 | 0.1818 | 0.0099 |
| SRR037454 | 0.2354 | 0.1515 | 0.0839 | 0.2060 | 0.0294 | 0.2059 | 0.0295 |
| SRR037455 | 0.1759 | 0.1448 | 0.0311 | 0.1675 | 0.0084 | 0.1668 | 0.0091 |
| SRR037456 | 0.2312 | 0.1622 | 0.0690 | 0.2037 | 0.0275 | 0.2035 | 0.0277 |
| SRR037457 | 0.1841 | 0.1480 | 0.0361 | 0.1777 | 0.0064 | 0.1771 | 0.0070 |
| SRR037458 | 0.2653 | 0.2321 | 0.0332 | 0.2582 | 0.0071 | 0.2575 | 0.0078 |
| SRR037459 | 0.2371 | 0.1943 | 0.0428 | 0.2202 | 0.0169 | 0.2203 | 0.0168 |
| SRR037460 | 0.2530 | 0.2018 | 0.0512 | 0.2503 | 0.0027 | 0.2490 | 0.0040 |
| SRR037461 | 0.2180 | 0.1704 | 0.0476 | 0.2105 | 0.0075 | 0.2104 | 0.0076 |
| SRR037462 | 0.2443 | 0.1734 | 0.0709 | 0.2322 | 0.0121 | 0.2308 | 0.0135 |
| SRR037463 | 0.2154 | 0.1654 | 0.0500 | 0.2023 | 0.0131 | 0.2045 | 0.0109 |
| SRR037464 | 0.2624 | 0.1666 | 0.0958 | 0.2392 | 0.0232 | 0.2403 | 0.0221 |
| SRR037465 | 0.2145 | 0.1742 | 0.0403 | 0.2038 | 0.0107 | 0.2037 | 0.0108 |
^a Based on 1000 replicates. The frequency-based simulation approach was applied. For each replicate, we considered the top 1000 reads with the highest frequencies as the error-free reads and generated 1000 pairs of error-free read counts and shadow counts.
Expected ER: expected error rate in simulation studies
SRER: error rate estimated using shadow regression
SRER Bias: the absolute value of the difference between SRER and Expected ER
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_CS Bias: the absolute value of the difference between EER_CS and Expected ER
EER_RS: empirical error rate estimated using robust smoothing spline
EER_RS Bias: the absolute value of the difference between EER_RS and Expected ER
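The bias columns follow directly from the definitions in the legend. As a small worked check, the sketch below recomputes the three biases for the first three MAQC rows and compares them with the reported values. The rows are copied from the table above. The tolerance is an assumption to allow for the four-decimal rounding of the published medians, and the function names are hypothetical.

```python
# Rows copied from the MAQC simulation table:
# (sample, Expected ER, SRER, SRER Bias, EER_CS, EER_CS Bias, EER_RS, EER_RS Bias)
ROWS = [
    ("SRR037452", 0.3305, 0.2578, 0.0727, 0.3104, 0.0201, 0.3096, 0.0209),
    ("SRR037453", 0.1917, 0.1584, 0.0333, 0.1824, 0.0093, 0.1818, 0.0099),
    ("SRR037454", 0.2354, 0.1515, 0.0839, 0.2060, 0.0294, 0.2059, 0.0295),
]

# Assumed tolerance for rounding to four decimals in the published values.
TOL = 1.5e-4


def bias(estimate, expected):
    """Absolute difference between an estimated and the expected error rate."""
    return abs(estimate - expected)


def check_row(row):
    """True if every reported bias matches the recomputed one within TOL."""
    _, expected, srer, srer_b, cs, cs_b, rs, rs_b = row
    pairs = [(srer, srer_b), (cs, cs_b), (rs, rs_b)]
    return all(abs(bias(est, expected) - rep) <= TOL for est, rep in pairs)


for row in ROWS:
    print(row[0], "consistent" if check_row(row) else "mismatch")
```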
Median error rates in mutation screening data using shadow linear regression and smoothing spline approaches^a
| Samples | Expected ER | SRER | SRER Bias | EER_CS | EER_CS Bias | EER_RS | EER_RS Bias |
|---|---|---|---|---|---|---|---|
| SRR032565 | 0.1991 | 0.0493 | 0.1498 | 0.1080 | 0.0911 | 0.1082 | 0.0910 |
| SRR032566 | 0.0734 | 0.0412 | 0.0323 | 0.0705 | 0.0029 | 0.0705 | 0.0029 |
| SRR032567 | 0.2003 | 0.0542 | 0.1461 | 0.1111 | 0.0892 | 0.1111 | 0.0893 |
| SRR032568 | 0.2040 | 0.0437 | 0.1603 | 0.1057 | 0.0984 | 0.1072 | 0.0968 |
| SRR032569 | 0.1598 | 0.0509 | 0.1089 | 0.1018 | 0.0580 | 0.1014 | 0.0585 |
| SRR032570 | 0.0985 | 0.0641 | 0.0345 | 0.0959 | 0.0026 | 0.0954 | 0.0032 |
| SRR032571 | 0.1236 | 0.0575 | 0.0661 | 0.1406 | 0.0170 | 0.1406 | 0.0170 |
| SRR032572 | 0.1495 | 0.0530 | 0.0965 | 0.1181 | 0.0314 | 0.1173 | 0.0323 |
| SRR032573 | 0.1779 | 0.0912 | 0.0867 | 0.1518 | 0.0261 | 0.1506 | 0.0273 |
| SRR032574 | 0.0986 | 0.0384 | 0.0602 | 0.0626 | 0.0361 | 0.0618 | 0.0368 |
| SRR032575 | 0.1169 | 0.0839 | 0.0330 | 0.1228 | 0.0059 | 0.1227 | 0.0058 |
| SRR032576 | 0.1554 | 0.0945 | 0.0609 | 0.1887 | 0.0333 | 0.1880 | 0.0326 |
| SRR032577 | 0.1052 | 0.0694 | 0.0359 | 0.1054 | 0.0002 | 0.1055 | 0.0002 |
| SRR032578 | 0.1143 | 0.0448 | 0.0695 | 0.1077 | 0.0067 | 0.1076 | 0.0067 |
| SRR032580 | 0.0870 | 0.0619 | 0.0251 | 0.1169 | 0.0298 | 0.1163 | 0.0292 |
| SRR032581 | 0.0770 | 0.0347 | 0.0424 | 0.0752 | 0.0018 | 0.0751 | 0.0019 |
| SRR032582 | 0.0400 | 0.0041 | 0.0359 | 0.0309 | 0.0091 | 0.0310 | 0.0090 |
| SRR032583 | 0.1224 | 0.0707 | 0.0517 | 0.1279 | 0.0055 | 0.1280 | 0.0056 |
| SRR032584 | 0.1290 | 0.0540 | 0.0750 | 0.1052 | 0.0238 | 0.1032 | 0.0259 |
| SRR032586 | 0.0445 | 0.0102 | 0.0343 | 0.0287 | 0.0158 | 0.0287 | 0.0159 |
| SRR032587 | 0.1486 | 0.0786 | 0.0700 | 0.1562 | 0.0076 | 0.1552 | 0.0066 |
| SRR032588 | 0.1240 | 0.0470 | 0.0770 | 0.1024 | 0.0216 | 0.1021 | 0.0220 |
| SRR033543 | 0.1151 | 0.0587 | 0.0564 | 0.0828 | 0.0323 | 0.0832 | 0.0318 |
| SRR033544 | 0.1267 | 0.0524 | 0.0743 | 0.1086 | 0.0181 | 0.1085 | 0.0182 |
^a Based on 1000 replicates. The frequency-based simulation approach was applied. For each replicate, we considered the top 1000 reads with the highest frequencies as the error-free reads and generated 1000 pairs of error-free read counts and shadow counts.
Expected ER: expected error rate in simulation studies
SRER: error rate estimated using shadow regression
SRER Bias: the absolute value of the difference between SRER and Expected ER
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_CS Bias: the absolute value of the difference between EER_CS and Expected ER
EER_RS: empirical error rate estimated using robust smoothing spline
EER_RS Bias: the absolute value of the difference between EER_RS and Expected ER
Median error rates in ENCODE data using shadow linear regression and smoothing spline approaches^a
| Samples | Expected ER | SRER | SRER Bias | EER_CS | EER_CS Bias | EER_RS | EER_RS Bias |
|---|---|---|---|---|---|---|---|
| SRR002053 | 0.5548 | 0.4153 | 0.1395 | 0.4609 | 0.0939 | 0.4565 | 0.0983 |
| SRR002056 | 0.3646 | 0.2906 | 0.0740 | 0.3270 | 0.0376 | 0.3233 | 0.0413 |
| SRR002065 | 0.4578 | 0.3371 | 0.1207 | 0.3740 | 0.0838 | 0.3701 | 0.0877 |
| SRR005092 | 0.6300 | 0.4047 | 0.2253 | 0.4797 | 0.1503 | 0.4727 | 0.1573 |
| SRR005093 | 0.4839 | 0.3928 | 0.0911 | 0.4221 | 0.0618 | 0.4173 | 0.0666 |
^a Based on 1000 replicates. The frequency-based simulation approach was applied. For each replicate, we considered the top 1000 reads with the highest frequencies as the error-free reads and generated 1000 pairs of error-free read counts and shadow counts.
Expected ER: expected error rate in simulation studies
SRER: error rate estimated using shadow regression
SRER Bias: the absolute value of the difference between SRER and Expected ER
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_CS Bias: the absolute value of the difference between EER_CS and Expected ER
EER_RS: empirical error rate estimated using robust smoothing spline
EER_RS Bias: the absolute value of the difference between EER_RS and Expected ER
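One way to read a bias table at a glance is to average each bias column and to note which method has the smallest bias per sample. The sketch below does this for the ENCODE rows. The bias values are copied from the table above. The summary itself is a reader's convenience, not a statistic reported in the source, and the helper names are hypothetical.

```python
from statistics import mean

# Bias columns copied from the ENCODE simulation table:
# sample -> (SRER Bias, EER_CS Bias, EER_RS Bias)
ENCODE_BIAS = {
    "SRR002053": (0.1395, 0.0939, 0.0983),
    "SRR002056": (0.0740, 0.0376, 0.0413),
    "SRR002065": (0.1207, 0.0838, 0.0877),
    "SRR005092": (0.2253, 0.1503, 0.1573),
    "SRR005093": (0.0911, 0.0618, 0.0666),
}
METHODS = ("SRER", "EER_CS", "EER_RS")


def mean_bias_by_method(table):
    """Average each bias column across samples."""
    columns = zip(*table.values())
    return {method: mean(col) for method, col in zip(METHODS, columns)}


def smallest_bias_method(biases):
    """Name of the method with the smallest bias for one sample."""
    best = min(range(len(METHODS)), key=lambda i: biases[i])
    return METHODS[best]


summary = mean_bias_by_method(ENCODE_BIAS)
winners = {s: smallest_bias_method(b) for s, b in ENCODE_BIAS.items()}

for method in METHODS:
    print(f"{method}: mean bias {summary[method]:.4f}")
for sample, method in winners.items():
    print(f"{sample}: smallest bias from {method}")
```

The same two helpers can be pointed at any of the other simulation tables by swapping in that table's bias columns.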
Median error rates in PhiX DNA data using shadow linear regression and smoothing spline approaches^a
| Samples | Expected ER | SRER | SRER Bias | EER_CS | EER_CS Bias | EER_RS | EER_RS Bias |
|---|---|---|---|---|---|---|---|
| 100217 | 0.0323 | 0.0250 | 0.0073 | 0.0315 | 0.0008 | 0.0315 | 0.0008 |
| 100514 | 0.0152 | 0.0143 | 0.0009 | 0.0152 | 0.0000 | 0.0152 | 0.0000 |
^a Based on 1000 replicates. The frequency-based simulation approach was applied. For each replicate, we considered the top 1000 reads with the highest frequencies as the error-free reads and generated 1000 pairs of error-free read counts and shadow counts.
Expected ER: expected error rate in simulation studies
SRER: error rate estimated using shadow regression
SRER Bias: the absolute value of the difference between SRER and Expected ER
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_CS Bias: the absolute value of the difference between EER_CS and Expected ER
EER_RS: empirical error rate estimated using robust smoothing spline
EER_RS Bias: the absolute value of the difference between EER_RS and Expected ER
Error rates in real MAQC data using shadow linear regression and smoothing spline approaches
| Samples | SRER | EER_CS | EER_RS |
|---|---|---|---|
| SRR037452 | 0.2695 | 0.3124 | 0.3362 |
| SRR037453 | 0.1598 | 0.1819 | 0.1822 |
| SRR037454 | 0.1596 | 0.2041 | 0.2196 |
| SRR037455 | 0.1482 | 0.1694 | 0.1700 |
| SRR037456 | 0.1657 | 0.2062 | 0.2162 |
| SRR037457 | 0.1541 | 0.1796 | 0.1793 |
| SRR037458 | 0.2386 | 0.2573 | 0.2575 |
| SRR037459 | 0.1996 | 0.2216 | 0.2233 |
| SRR037460 | 0.2027 | 0.2504 | 0.2647 |
| SRR037461 | 0.1779 | 0.2058 | 0.2093 |
| SRR037462 | 0.1858 | 0.2329 | 0.2319 |
| SRR037463 | 0.1771 | 0.2019 | 0.2072 |
| SRR037464 | 0.1850 | 0.2377 | 0.2448 |
| SRR037465 | 0.1842 | 0.2019 | 0.2070 |
SRER: error rate estimated using shadow regression
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_RS: empirical error rate estimated using robust smoothing spline
Error rates in real mutation screening data using shadow linear regression and smoothing spline approaches
| Samples | SRER | EER_CS | EER_RS |
|---|---|---|---|
| SRR032565 | 0.0753 | 0.1167 | 0.1206 |
| SRR032566 | 0.0584 | 0.0746 | 0.0745 |
| SRR032567 | 0.0846 | 0.1420 | 0.1446 |
| SRR032568 | 0.0686 | 0.1566 | 0.1566 |
| SRR032569 | 0.0597 | 0.1015 | 0.1020 |
| SRR032570 | 0.0691 | 0.0973 | 0.0954 |
| SRR032571 | 0.0724 | 0.1400 | 0.1400 |
| SRR032572 | 0.0818 | 0.1611 | 0.1593 |
| SRR032573 | 0.1602 | 0.2756 | 0.2752 |
| SRR032574 | 0.0557 | 0.1004 | 0.1053 |
| SRR032575 | 0.0882 | 0.1191 | 0.1179 |
| SRR032576 | 0.1101 | 0.1915 | 0.1899 |
| SRR032577 | 0.0762 | 0.1282 | 0.1280 |
| SRR032578 | 0.1365 | 0.1874 | 0.1873 |
| SRR032580 | 0.0727 | 0.1262 | 0.1266 |
| SRR032581 | 0.0981 | 0.0506 | 0.0511 |
| SRR032582 | 0.0941 | 0.1689 | 0.1679 |
| SRR032583 | 0.1141 | 0.2057 | 0.2042 |
| SRR032584 | 0.0849 | 0.0963 | 0.0742 |
| SRR032586 | 0.0623 | 0.3606 | 0.3621 |
| SRR032587 | 0.0857 | 0.1524 | 0.1532 |
| SRR032588 | 0.0701 | 0.1446 | 0.1440 |
| SRR033543 | 0.0802 | 0.1084 | 0.1102 |
| SRR033544 | 0.1175 | 0.1588 | 0.1586 |
SRER: error rate estimated using shadow regression
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_RS: empirical error rate estimated using robust smoothing spline
Error rates in real ENCODE data using shadow linear regression and smoothing spline approaches
| Samples | SRER | EER_CS | EER_RS |
|---|---|---|---|
| SRR002053 | 0.4134 | 0.4469 | 0.4453 |
| SRR002056 | 0.3225 | 0.3020 | 0.3355 |
| SRR002065 | 0.3842 | 0.3918 | 0.3913 |
| SRR005092 | 0.4628 | 0.4884 | 0.4776 |
| SRR005093 | 0.4090 | 0.4724 | 0.4013 |
SRER: error rate estimated using shadow regression
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_RS: empirical error rate estimated using robust smoothing spline
Error rates in real PhiX DNA data using shadow linear regression and smoothing spline approaches
| Samples | SRER | EER_CS | EER_RS |
|---|---|---|---|
| 100217 | 0.0261 | 0.0317 | 0.0315 |
| 100514 | 0.0142 | 0.0155 | 0.0157 |
SRER: error rate estimated using shadow regression
EER_CS: empirical error rate estimated using cubic smoothing spline
EER_RS: empirical error rate estimated using robust smoothing spline