Xiao Yang, Srinivas Aluru, Karin S Dorman.
Abstract
BACKGROUND: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications, including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In the case of genomes with high repeat content, an erroneous kmer may be frequently observed if it differs by only a few nucleotides from valid kmers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and this remains a challenging problem for genomes with high repeat content.
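The threshold-based detection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `count_kmers` and `validate_kmers`, the toy reads, and the threshold value are all hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring (kmer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def validate_kmers(counts, threshold):
    """Return the kmers whose observed frequency exceeds the threshold."""
    return {kmer for kmer, n in counts.items() if n > threshold}

# Hypothetical toy reads: the last read carries one substitution (G -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACTTAC"]
counts = count_kmers(reads, k=3)
valid = validate_kmers(counts, threshold=1)
```

In this toy input the three kmers created by the substitution (ACT, CTT, TTA) are each seen once and fall below the threshold. In a repeat-rich genome the same erroneous kmers could be seen many times, which is the failure mode the abstract describes.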
Year: 2011 PMID: 21342585 PMCID: PMC3044310 DOI: 10.1186/1471-2105-12-S1-S52
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1A. The neighborhood of trimer AAA is the collection of trimers in R3 that have a nonvanishing chance of being misread as AAA, in this case trimers with at most one substitution.
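The neighborhood in the caption can be sketched as a Hamming-distance filter. The code below assumes that R3 denotes the set of trimers observed in the reads and that a kmer belongs to its own neighborhood; both are assumptions, and the function name `neighborhood` is hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def neighborhood(kmer, observed, max_subs=1):
    """Observed kmers within max_subs substitutions of kmer (kmer included)."""
    return {other for other in observed
            if len(other) == len(kmer) and hamming(kmer, other) <= max_subs}

# If every possible trimer were observed, AAA would have 1 + 3 * 3 = 10 neighbors.
all_trimers = {"".join(p) for p in product("ACGT", repeat=3)}
full = neighborhood("AAA", all_trimers)

# With a smaller (hypothetical) observed set, only the observed neighbors remain.
partial = neighborhood("AAA", {"AAA", "AAT", "TTT"})
```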
Experimental datasets
| Dataset | Type | Reference genome | Genome length | Repeats | Repeat types (length, multiplicity) | Coverage | Number of reads |
|---|---|---|---|---|---|---|---|
| | 1(a) | - | 1M | 20% | (1000, 200) | 80x | 2.2M |
| | 1(a) | - | 1M | 50% | (500, 400), (1500, 200) | 80x | 2.2M |
| | 1(a) | - | 1M | 80% | (500, 400), (1500, 200), (3000, 100) | 80x | 2.2M |
| | 1(b) | | 2.1M | - | - | 80x | 4.8M |
| | 1(b) | Maize | 418K | - | - | 80x | 0.92M |
| | 2 | | 4.6M | - | - | 160x | 20.7M |
‘-’ denotes the information that is not quantified; K: thousand; M: million.
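The percentages in the Repeats column can be checked against the repeat types with a short calculation. The sketch assumes that each (length, multiplicity) pair describes a repeat of that length occurring that many times, and that the (3000, 100) entry is the third repeat type of the 80% dataset; the helper name `repeat_fraction` is hypothetical.

```python
def repeat_fraction(repeat_types, genome_length):
    """Fraction of the genome made up of repeats, given (length, multiplicity) pairs."""
    return sum(length * mult for length, mult in repeat_types) / genome_length

GENOME_LENGTH = 1_000_000  # the 1M simulated genomes of type 1(a)

fractions = [
    repeat_fraction([(1000, 200)], GENOME_LENGTH),
    repeat_fraction([(500, 400), (1500, 200)], GENOME_LENGTH),
    repeat_fraction([(500, 400), (1500, 200), (3000, 100)], GENOME_LENGTH),
]
```

Under these assumptions the three fractions come out at 20%, 50% and 80%, matching the table.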
Estimated error probabilities q(·, ·), position i = 11
| ×10⁻² | A | C | G | T | ×10⁻² | A | C | G | T |
|---|---|---|---|---|---|---|---|---|---|
| A | 98.96 | 0.63 | 0.18 | 0.23 | A | 96.18 | 2.53 | 0.19 | 1.10 |
| C | 0.15 | 99.60 | 0.10 | 0.15 | C | 0.20 | 99.32 | 0.08 | 0.40 |
| G | 0.05 | 0.17 | 99.25 | 0.53 | G | 0.12 | 0.30 | 97.60 | 1.98 |
| T | 0.05 | 0.19 | 0.18 | 99.58 | T | 0.09 | 0.18 | 0.13 | 99.60 |
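As a rough illustration of how such a matrix might be used, the sketch below loads the left-hand matrix as probabilities, checks that each row sums to one, and multiplies per-base entries to get the chance of reading one kmer as another. It assumes that rows index the true base and columns the base that was read, that positions err independently, and that the same matrix applies at every position. The last two are simplifications, since the table gives estimates for a specific position only.

```python
BASES = "ACGT"

# Left-hand matrix from the table, converted from units of 10^-2 to probabilities.
# Assumption: rows are the true base, columns are the base that was read.
Q = {
    "A": [0.9896, 0.0063, 0.0018, 0.0023],
    "C": [0.0015, 0.9960, 0.0010, 0.0015],
    "G": [0.0005, 0.0017, 0.9925, 0.0053],
    "T": [0.0005, 0.0019, 0.0018, 0.9958],
}

def q(true_base, read_base):
    """Probability that true_base is read as read_base."""
    return Q[true_base][BASES.index(read_base)]

def misread_probability(true_kmer, read_kmer):
    """Chance of reading true_kmer as read_kmer, assuming independent positions
    that all share the matrix Q (a simplification)."""
    p = 1.0
    for a, b in zip(true_kmer, read_kmer):
        p *= q(a, b)
    return p

row_sums = {base: sum(row) for base, row in Q.items()}
error_rate_A = 1.0 - q("A", "A")            # per-base error rate for a true A
p_aac_as_aaa = misread_probability("AAC", "AAA")
```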
A comparison of minimum error rates.
| Data | Minimum (FP + FN) Value | ||||
|---|---|---|---|---|---|
| tIED | wIED | tUED | wUED | ||
| 2212 | 4648 | ||||
| 6392 | 6729 | ||||
| 6809 | |||||
| 216 | 236 | 719 | |||
| 552 | 1346 | ||||
| 14236 | 18793 | ||||
A comparison of the minimum number of wrong predictions (FP + FN) achieved by applying optimum thresholds to the observed occurrences and to the estimates from our model with each of the error distributions tested. Bold numbers indicate that our model outperforms thresholding on the observed occurrences.
Figure 2. Plots of log(FP + FN) vs. threshold for all datasets. In each plot, we compare the results of applying thresholds to the observed occurrences and to the values estimated by our model using the tIED, wIED, tUED and wUED error distributions.
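The threshold sweep behind these comparisons can be sketched as below. The example assumes that a kmer is predicted valid when its count exceeds the threshold, that a false positive is an erroneous kmer predicted valid, and that a false negative is a valid kmer predicted erroneous. The counts, truth labels, and function names are hypothetical.

```python
def wrong_predictions(counts, truth, threshold):
    """FP + FN when kmers with count > threshold are predicted valid."""
    fp = sum(1 for k, n in counts.items() if n > threshold and not truth[k])
    fn = sum(1 for k, n in counts.items() if n <= threshold and truth[k])
    return fp + fn

def best_threshold(counts, truth, thresholds):
    """Return (threshold, FP + FN) with the fewest wrong predictions;
    ties go to the smallest threshold."""
    scored = ((t, wrong_predictions(counts, truth, t)) for t in thresholds)
    return min(scored, key=lambda pair: (pair[1], pair[0]))

# Hypothetical toy data: CTT is an erroneous kmer that is seen fairly often.
counts = {"ACG": 9, "CGT": 8, "GTA": 2, "ACT": 1, "CTT": 3, "TTA": 1}
truth = {"ACG": True, "CGT": True, "GTA": True,
         "ACT": False, "CTT": False, "TTA": False}
result = best_threshold(counts, truth, range(0, 10))
```

In this toy case no threshold separates the two classes perfectly, because the erroneous kmer CTT is observed more often than the valid kmer GTA.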
Error correction results
| Data | Method | Sensitivity | Specificity | Gain | CPU time (min) | Memory (GB) |
|---|---|---|---|---|---|---|
| | SHREC | 81.2% | 99.9% | 80.3% | 23.9 | 5.9 |
| | Reptile | 78.9% | 99.9% | 78.9% | 0.6 | 0.19 |
| | REDEEM | 71.3% | 99.9% | 51.5% | 114.1 | 2.5 |
| | SHREC | 54.0% | 99.9% | 52.7% | 22.7 | 5.8 |
| | Reptile | 57.8% | 99.9% | 57.8% | 0.5 | 0.16 |
| | REDEEM | 78.6% | 99.9% | 64.6% | 72.7 | 1.6 |
| | SHREC | 29.3% | 99.9% | 26.7% | 21.7 | 5.8 |
| | Reptile | 46.8% | 99.9% | 46.8% | 0.5 | 0.13 |
| | REDEEM | 86.4% | 99.9% | 79.4% | 31.2 | 0.63 |
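The three accuracy columns can be related through a small calculation. The definitions below are assumptions, since this record does not state them: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and gain = (TP - FP) / (TP + FN), a definition commonly used for error correction tools. The input counts are hypothetical.

```python
def correction_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, gain) under the assumed definitions."""
    sensitivity = tp / (tp + fn)      # share of erroneous bases that were corrected
    specificity = tn / (tn + fp)      # share of correct bases left unchanged
    gain = (tp - fp) / (tp + fn)      # corrections minus newly introduced errors
    return sensitivity, specificity, gain

# Hypothetical counts, chosen only to show how the three values relate.
sens, spec, gain = correction_metrics(tp=800, fp=100, tn=99_900, fn=200)
```

Under these definitions gain can never exceed sensitivity, and the two coincide when there are no false positives, which would be consistent with rows in the table where the two columns are equal.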
Figure 3. Histogram of estimated T for the E. coli dataset.