Francis Y L Chin, Henry C M Leung, Wei-Lin Li, Siu-Ming Yiu.
Abstract
BACKGROUND: DNA assembly is the problem of determining the nucleotide sequence of a genome from its substrings, called reads. In experiments, the reads may contain errors, which degrade the performance of DNA assembly algorithms. Existing algorithms, e.g. ECINDEL and SRCorr, correct erroneous reads by considering the number of times each length-k substring of the reads appears in the input. They treat length-k substrings that appear at least M times as correct and correct the erroneous reads based on these substrings. However, since the threshold M is chosen without any solid theoretical analysis, these algorithms cannot guarantee their error-correction performance.
Year: 2009 PMID: 19208114 PMCID: PMC2648749 DOI: 10.1186/1471-2105-10-S1-S15
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
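The thresholding idea described in the abstract (count every length-k substring of the reads and trust those seen at least M times) can be sketched as follows. This is only a minimal illustration, not the implementation of ECINDEL, SRCorr, or the authors' algorithm; the function names `count_ktuples` and `solid_ktuples` and the toy reads are hypothetical.

```python
from collections import Counter


def count_ktuples(reads, k):
    """Count how often each length-k substring (k-tuple) occurs across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_ktuples(counts, m):
    """Return the k-tuples whose multiplicity is at least the threshold m."""
    return {ktuple for ktuple, c in counts.items() if c >= m}


# Toy input (hypothetical): the third read differs from the others by one base.
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
counts = count_ktuples(reads, k=4)
trusted = solid_ktuples(counts, m=2)
```

With these toy reads, the 4-tuples introduced by the differing base occur only once and fall below the threshold, which is the behaviour the threshold M is meant to capture.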
Figure 1. k = 15. Number of false positives, false negatives, and their sum for 15-tuples from real data versus multiplicity.
Figure 2. k = 20. Number of false positives, false negatives, and their sum for 20-tuples from real data versus multiplicity.
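The quantities plotted in the figures and reported in the tables can be illustrated with a short sketch. It assumes, since this record does not spell out the definitions, that a false positive is a k-tuple absent from the reference genome whose multiplicity is at least M, and a false negative is a k-tuple present in the genome and observed in the reads whose multiplicity is below M. The names `threshold_errors` and `best_threshold` and the toy counts are hypothetical.

```python
from collections import Counter


def threshold_errors(counts, genome_ktuples, m):
    """Return (false positives, false negatives, total) at threshold m.

    Assumed definitions (not taken from the source):
      false positive: k-tuple not in the genome, multiplicity >= m
      false negative: observed k-tuple in the genome, multiplicity < m
    """
    fp = sum(1 for kt, c in counts.items() if c >= m and kt not in genome_ktuples)
    fn = sum(1 for kt, c in counts.items() if c < m and kt in genome_ktuples)
    return fp, fn, fp + fn


def best_threshold(counts, genome_ktuples, candidates):
    """Pick the candidate threshold with the fewest total errors (first one on ties)."""
    return min(candidates, key=lambda m: threshold_errors(counts, genome_ktuples, m)[2])


# Toy multiplicities (hypothetical) and the set of true genome k-tuples.
counts = Counter({"ACGT": 5, "CGTA": 4, "GTAC": 1, "ACGA": 2, "TTTT": 1})
genome_ktuples = {"ACGT", "CGTA", "GTAC"}

errors_at_3 = threshold_errors(counts, genome_ktuples, 3)
chosen = best_threshold(counts, genome_ktuples, range(1, 6))
```

Sweeping the threshold this way requires a known reference genome, so it is useful for evaluation only; it is not how a threshold would be chosen on unknown data.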
Comparison of 15-tuples on real data set
| Algorithm | M | # of false positives | # of false negatives | total errors |
| ECINDEL | 12 | 12494 | 176 | 12670 |
| SRCorr | 15 | 7890 | 250 | 8140 |
| Our Algorithm | 32 | 1583 | 1256 | 2839 |
Comparison of corrected reads on real data set
| Algorithm | M | # of false positives | # of false negatives | total errors |
| ECINDEL | 12 | 19334 | 173199 | 192533 |
| SRCorr | 15 | 181628 | 30659 | 212287 |
| Our Algorithm | 32 | 30039 | 3438 | 33477 |
Comparison of 15-tuples on simulated data (coverage = 2.75×)
| Algorithm | M | # of false positives | # of false negatives | total errors |
| ECINDEL | 12 | 4 | 18 | 22 |
| SRCorr | 4 | 5264 | 5 | 5269 |
| Our Algorithm | 11 | 5 | 13 | 18 |
Comparison of corrected reads on simulated data (coverage = 2.75×)
| Algorithm | M | # of false positives | # of false negatives | total errors |
| ECINDEL | 12 | 46459 | 49204 | 95663 |
| SRCorr | 4 | 53036 | 31699 | 84735 |
| Our Algorithm | 11 | 308 | 8564 | 8872 |
Comparison of k-tuples on simulated data (coverage = 12.53×)
| Algorithm | M | # of false positives | # of false negatives | total errors |
| ECINDEL | 12 | 1699 | 4 | 1703 |
| SRCorr | 15 | 1011 | 5 | 1016 |
| Our Algorithm | 52 | 3 | 15 | 18 |
Comparison of corrected reads on simulated data (coverage = 12.53×)
| Algorithm | M | # of false positives | # of false negatives | total errors |
| ECINDEL | 12 | 132850 | 30338 | 163188 |
| SRCorr | 15 | 102288 | 1512 | 103800 |
| Our Algorithm | 52 | 1216 | 49 | 1265 |