Subrata Saha, Sanguthevar Rajasekaran.
Abstract
BACKGROUND: Highly parallel next-generation sequencing (NGS) techniques produce millions to billions of short reads from a genomic sequence in a single run. Because of limitations in NGS technologies, the reads can contain errors. The error rate can be reduced by trimming the reads and by correcting their erroneous bases. Correction yields higher-quality data, and the computational complexity of many biological applications is greatly reduced if the reads are corrected first. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads.
Year: 2015 PMID: 26678663 PMCID: PMC4674864 DOI: 10.1186/1471-2105-16-S17-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Real Sequencing Datasets.
| Dataset | Name | Accession | Genome length | Read length | Number of reads | Coverage |
|---|---|---|---|---|---|---|
| D1 | SRR088759 | NC_013656.1 | 2,598,144 | 36 | 4,370,050 | 60 |
| D2 | SRR022918 | NC_000913.1 | 4,771,872 | 47 | 6,740,651 | 68 |
| D3 | SRR396536 | NC_000913.2 | 4,639,675 | 75 | 3,453,957 | 55 |
| D4 | DRR000852 | NC_000964.3 | 4,215,606 | 75 | 1,744,210 | 31 |
| D5 | SRR396532 | NC_000913.2 | 4,639,675 | 75 | 4,341,061 | 70 |
| D6 | SRR353563 | NC_004342.2 | 4,338,762 | 100 | 3,530,694 | 81 |
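The coverage column can be sanity-checked against the other columns. The sketch below assumes the usual definition of average coverage (total sequenced bases divided by genome length); the function name `estimated_coverage` is hypothetical, and the table's values may have been computed differently (for example after read filtering), so small discrepancies are expected.

```python
def estimated_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average depth, assuming coverage = (reads * read length) / genome length."""
    return num_reads * read_length / genome_length


# D1 row: 4,370,050 reads of length 36 against a 2,598,144-base reference.
d1_coverage = estimated_coverage(4_370_050, 36, 2_598_144)
print(f"D1 estimated coverage: {d1_coverage:.1f}")  # close to the reported 60
```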
Random Sequencing Datasets.
| Dataset | Name | Accession | Genome length | Number of reads |
|---|---|---|---|---|
| D7 | SRR088759 | NC_013656.1 | 2,598,144 | 2,165,120 |
| D8 | SRR361468 | CP002376.1 | 1,139,417 | 949,514 |
| D9 | SRR396536 | NC_000913.2 | 4,639,675 | 3,868,043 |
| D10 | DRR000852 | NC_000964.3 | 4,215,606 | 3,513,005 |
| D11 | SRR396641 | NC_002516.2 | 6,264,404 | 5,220,336 |
| D12 | SRR353563 | NC_004342.2 | 4,338,762 | 3,615,635 |
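The read counts for the simulated datasets appear to follow from the genome length together with the read length (60) and coverage (50) stated for these datasets in the performance table caption. As a minimal sketch under that assumption (the function name `reads_for_coverage` is hypothetical, and this is not claimed to be the authors' simulation procedure):

```python
def reads_for_coverage(genome_length: int, coverage: int, read_length: int) -> int:
    """Number of reads needed so that reads * read_length ~= coverage * genome_length."""
    return genome_length * coverage // read_length


# D7 and D10 rows, with read length 60 and coverage 50 taken from the caption.
d7_reads = reads_for_coverage(2_598_144, 50, 60)
d10_reads = reads_for_coverage(4_215_606, 50, 60)
print(d7_reads, d10_reads)
```

Both values match the table exactly (2,165,120 and 3,513,005); some other rows differ slightly, so the relationship should be read as approximate.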
Performance evaluation on real sequencing datasets.
| Dataset | Method | % Sensitivity | % Specificity | % Accuracy | % Mapped Reads | CPU Time (m) |
|---|---|---|---|---|---|---|
| D1 | EC | 88.59 | 99.99 | 85.78 | 95.75 | 5.76 |
| Racer | 99.99 | 95.81 | 12.58 | |||
| Musket | 80.99 | 99.99 | 80.94 | |||
| Coral | 71.06 | 99.99 | 70.64 | 95.79 | 40.98 | |
| Reptile | 11.77 | 99.99 | 11.33 | 95.81 | 9.83 | |
| D2 | EC | 99.94 | 93.22 | 80.02 | ||
| Racer | 93.92 | 99.99 | 13.93 | |||
| Musket | 47.82 | 99.99 | 47.79 | 66.33 | 38.62 | |
| Coral | 33.68 | 99.99 | 33.22 | 65.32 | 141.64 | |
| Reptile | 44.76 | 99.99 | 44.73 | 67.51 | 34.02 | |
| D3 | EC | 99.97 | ||||
| Racer | 88.87 | 99.99 | 88.75 | 96.09 | 13.68 | |
| Musket | 69.50 | 99.99 | 69.43 | 94.09 | 19.00 | |
| Coral | 67.53 | 99.99 | 67.25 | 93.78 | 207.13 | |
| Reptile | - | - | - | - | - | |
| D4 | EC | 99.98 | 94.79 | |||
| Racer | 93.42 | 99.99 | 93.30 | 6.40 | ||
| Musket | 74.52 | 99.99 | 74.43 | 93.20 | 8.23 | |
| Coral | 74.40 | 99.99 | 74.07 | 92.54 | 28.47 | |
| Reptile | - | - | - | - | - | |
| D5 | EC | 99.97 | 18.27 | |||
| Racer | 89.97 | 99.99 | 89.77 | 94.99 | ||
| Musket | 63.64 | 99.93 | 63.58 | 91.44 | 27.22 | |
| Coral | 61.56 | 99.99 | 61.23 | 91.08 | 98.89 | |
| Reptile | - | - | - | - | - | |
| D6 | EC | 93.04 | 99.99 | 86.32 | 89.51 | 24.33 |
| Racer | 99.99 | 12.63 | ||||
| Musket | 84.58 | 99.99 | 84.39 | 89.79 | ||
| Coral | 89.34 | 99.99 | 83.28 | 90.14 | 233.33 | |
| Reptile | - | - | - | - | - | |
Best results are shown in bold.
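This excerpt does not define the reported metrics. The sketch below assumes the definitions commonly used in the read error-correction literature, with "accuracy" taken to be the gain, (TP − FP)/(TP + FN); that reading is consistent with accuracy never exceeding sensitivity in the complete rows above, but it is an assumption, not something the source states. The class `Counts` and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # erroneous bases that were correctly fixed
    fp: int  # correct bases that were wrongly changed
    tn: int  # correct bases that were left unchanged
    fn: int  # erroneous bases that were left uncorrected


def sensitivity(c: Counts) -> float:
    return 100 * c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    return 100 * c.tn / (c.tn + c.fp)


def gain(c: Counts) -> float:
    # Assumed meaning of "% Accuracy": newly introduced errors are penalised.
    return 100 * (c.tp - c.fp) / (c.tp + c.fn)


# Made-up counts for illustration only, not taken from the paper.
example = Counts(tp=900, fp=30, tn=98_970, fn=100)
print(sensitivity(example), round(specificity(example), 2), gain(example))
```

Under these definitions a tool can have near-perfect specificity and still lose several points of gain, because correct bases vastly outnumber erroneous ones.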
Performance evaluation on simulated sequencing datasets with read length 60 and coverage 50.
| Dataset | Method | % Sensitivity | % Specificity | % Accuracy | EBA | CHD | CPU Time (m) |
|---|---|---|---|---|---|---|---|
| D7 | EC | 98.13 | 99.99 | 97.99 | 6.49 | ||
| Racer | 99.91 | 94.43 | 9.55 × 10^-4 | 147,171 | |||
| Musket | 95.63 | 99.99 | 95.61 | 6.99 × 10^-5 | 114,474 | 9.42 | |
| Coral | 93.47 | 99.99 | 92.76 | 1.08 × 10^-4 | 188,584 | 23.02 | |
| D8 | EC | 98.11 | 99.99 | 97.79 | 6.88 × 10^-5 | 25,238 | 2.91 |
| Racer | 99.98 | 98.34 | 2.11 × 10^-4 | ||||
| Musket | 95.67 | 99.99 | 95.67 | 49,438 | 6.84 | ||
| Coral | 93.46 | 99.99 | 93.15 | 5.16 × 10^-5 | 78,162 | 9.57 | |
| D9 | EC | 98.05 | 99.98 | 97.10 | 1.50 × 10^-4 | 12.26 | |
| Racer | 99.95 | 96.36 | 5.57 × 10^-4 | 171,542 | |||
| Musket | 95.63 | 99.99 | 95.59 | 204,776 | 16.88 | ||
| Coral | 93.45 | 99.98 | 92.53 | 1.42 × 10^-4 | 347,258 | 38.01 | |
| D10 | EC | 98.14 | 99.99 | 97.78 | 5.99 × 10^-5 | 11.12 | |
| Racer | 99.96 | 97.13 | 4.28 × 10^-4 | 122,798 | |||
| Musket | 95.70 | 99.99 | 95.69 | 181,777 | 22.76 | ||
| Coral | 93.51 | 99.99 | 93.07 | 6.32 × 10^-5 | 292,345 | 34.21 | |
| D11 | EC | 98.08 | 99.99 | 97.69 | 19.64 | ||
| Racer | 99.58 | 77.23 | 4.23 × 10^-3 | 1,446,476 | |||
| Musket | 95.54 | 99.99 | 95.47 | 1.36 × 10^-4 | 284,642 | 29.44 | |
| Coral | 93.48 | 99.98 | 92.75 | 1.10 × 10^-4 | 454,814 | 59.33 | |
| D12 | EC | 99.84 | 89.51 | 1.34 × 10^-3 | 459,980 | 16.43 | |
| Racer | 97.27 | 99.69 | 82.19 | 2.96 × 10^-3 | 782,370 | ||
| Musket | 94.78 | 99.99 | 94.46 | 28.50 | |||
| Coral | 93.31 | 99.91 | 88.79 | 7.00 × 10^-4 | 488,658 | 50.91 | |
Best results are shown in bold.
Figure 1. % Accuracy of different algorithms including EC for real sequencing datasets D1-D6.
Figure 2. Elapsed time of different algorithms including EC for real sequencing datasets D1-D6.
Figure 3. % Accuracy of different algorithms including EC for synthetic datasets D7-D12.
Figure 4. Elapsed time of different algorithms including EC for synthetic datasets D7-D12.