Eric Marinier, Daniel G Brown, Brendan J McConkey.
Abstract
BACKGROUND: Second-generation sequencers generate millions of relatively short but error-prone reads. These errors make sequence assembly and other downstream projects more challenging. Correcting them improves the quality of assemblies and of other projects that benefit from error-free reads.
Year: 2015 PMID: 25592313 PMCID: PMC4307147 DOI: 10.1186/s12859-014-0435-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The k-mer counts associated with an Illumina MiSeq read containing two highlighted substitution errors. The k-mers are of length 31. Each data point is the number of times the left-anchored k-mer starting at that position in the read has been observed in the entire read set. The first error is located near the middle of the read and affects the counts of 31 k-mers. The second error is located only four positions in from the 3’ end of the read and affects only 4 k-mer counts.
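The relationship described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the Pollux implementation; the function names and the read length of 150 used in the example are assumptions made only for illustration. The sketch counts k-mers over a read set, builds the per-position count profile of one read, and computes how many k-mers overlap a given base.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer observed across the whole read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def count_profile(read, counts, k):
    """Count of the left-anchored k-mer starting at each position of the read."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]


def kmers_covering(pos, read_len, k):
    """Number of k-mers in a read that contain the 0-based position pos."""
    first_start = max(0, pos - k + 1)   # earliest k-mer start that still reaches pos
    last_start = min(pos, read_len - k)  # latest k-mer start that fits in the read
    return max(0, last_start - first_start + 1)


# Hypothetical read length of 150 with k = 31, as in the caption.
print(kmers_covering(75, 150, 31))       # error near the middle of the read
print(kmers_covering(150 - 4, 150, 31))  # error four positions from the 3' end

# Toy read set: three identical reads and one read with a single substitution.
reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]
print(count_profile("ACGTTCGT", kmer_counts(reads, 4), 4))
```

An error in the interior of a read lowers up to k consecutive counts, while an error close to either end lowers fewer, which is why the second error in the figure affects only 4 counts.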
Figure 2. Algorithm pseudocode. Pseudocode for the error correction algorithm.
The number of corrections reported and low-information reads removed by Pollux
| Data set | Reads | Bases | Substitution | Insertion | Deletion | Homopolymer | Reads Removed |
|---|---|---|---|---|---|---|---|
| 454 GS Junior (1) | 135,992 | 70,999,968 | 24,100 | 164,198 | 56,144 | 17,221 | 1% |
| 454 GS Junior (2) | 137,528 | 71,710,564 | 21,004 | 167,999 | 53,947 | 13,535 | 1% |
| Ion Torrent PGM (1) | 2,483,868 | 303,579,279 | 610,872 | 2,609,205 | 1,863,522 | 1,106,908 | 11% |
| Ion Torrent PGM (2) | 2,154,577 | 260,017,346 | 561,024 | 2,215,086 | 1,765,495 | 968,961 | 13% |
| MiSeq | 1,766,516 | 250,356,566 | 250,896 | 1,893 | 3,349 | 0 | 1% |
All reads are sequenced from the same O104:H4 E. coli isolate. Substitution, insertion, deletion, and homopolymer corrections are performed on all data sets except for MiSeq, for which we do not perform homopolymer corrections. Reads Removed gives the percentage of reads that were removed because more than 50% of their k-mers were unique.
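The read-removal rule in the note above can be sketched as follows. This is an illustrative Python sketch, not Pollux's code. It assumes that a "unique" k-mer is one observed exactly once in the read set, and the function names and the choice to keep reads shorter than k are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def unique_fraction(read, counts, k):
    """Fraction of the read's k-mers that occur only once in the read set."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not kmers:
        # Assumption: a read shorter than k has no k-mers and is never flagged.
        return 0.0
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


def remove_low_information(reads, k, threshold=0.5):
    """Keep reads whose fraction of unique k-mers does not exceed the threshold."""
    counts = count_kmers(reads, k)
    return [r for r in reads if unique_fraction(r, counts, k) <= threshold]


# Toy example: the last read shares no k-mers with the others and is dropped.
reads = ["ACGTACGT"] * 3 + ["TTTTGGGG"]
print(remove_low_information(reads, k=4))
```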
Alignment comparison of corresponding uncorrected and corrected reads against the reference genome
| Data set | Corrected errors | | | Introduced errors | | |
|---|---|---|---|---|---|---|
| GS Junior (1) | 76% (0.33) | 82% (2.28) | 81% (0.81) | 11% (0.048) | 2% (0.065) | 6% (0.063) |
| GS Junior (2) | 79% (0.24) | 86% (2.23) | 83% (0.76) | 15% (0.044) | 2% (0.059) | 6% (0.050) |
| PGM (1) | 82% (1.68) | 91% (7.67) | 86% (7.10) | 2% (0.046) | 2% (0.16) | 5% (0.44) |
| PGM (2) | 80% (1.72) | 90% (7.47) | 84% (7.70) | 2% (0.052) | 2% (0.16) | 6% (0.52) |
| MiSeq | 95% (0.85) | 10% (0.0015) | 78% (0.007) | 1% (0.010) | 4% (0.0007) | 8% (0.0007) |
All reads are sequenced from the same O104:H4 E. coli isolate. Corresponding uncorrected and corrected reads are aligned to the reference genome using SMALT. Incomparable alignments are removed. Corrected errors reflect alignment errors which are found in uncorrected reads but not in corrected reads. Similarly, introduced errors are a consequence of alignment errors found in corrected reads but not in uncorrected reads.
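The definitions of corrected and introduced errors in the note above amount to two set differences. The following Python sketch assumes that the alignment errors of one read have already been extracted from the SMALT alignments into sets of (reference position, error type) pairs. That representation and the function name are hypothetical, and parsing the alignments is not shown.

```python
def compare_alignment_errors(uncorrected, corrected):
    """Return (corrected_errors, introduced_errors) for one read.

    Both arguments are sets of (reference_position, error_type) tuples
    taken from the alignments of the uncorrected and corrected read.
    """
    corrected_errors = uncorrected - corrected   # present before, gone after
    introduced_errors = corrected - uncorrected  # absent before, present after
    return corrected_errors, introduced_errors


# Toy example with made-up positions.
before = {(1042, "sub"), (1090, "del")}
after = {(1090, "del"), (1101, "ins")}
fixed, introduced = compare_alignment_errors(before, after)
print(fixed)       # the substitution at 1042 was corrected
print(introduced)  # the insertion at 1101 was introduced
```

Errors present in both sets, such as the deletion at 1090 in the toy example, count as neither corrected nor introduced.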
Comparison of various error correction software
| Software | Corrected errors | Introduced errors | Reads removed (%) | |
|---|---|---|---|---|
| Pollux | | 1.28 | 0.89 | 3.01 |
| Quake | 58.78 | 0.05 | 0.92 | 4.09 |
| SGA | 78.65 | 0.09 | 1.11 | 15.51 |
| BLESS | 83.21 | 0.10 | 0.00 | 0.86 |
| Musket | 81.75 | 0.15 | 0.00 | 2.37 |
| RACER | 86.59 | 1.60 | 0.00 | 1.07 |

| Software | Corrected errors | Introduced errors | Reads removed (%) | |
|---|---|---|---|---|
| Pollux | | 0.38 | 31.73 | 3.67 |
| Quake | 75.30 | 0.10 | 29.81 | 4.81 |
| SGA | 47.45 | 0.02 | 10.71 | 14.28 |
| BLESS | 55.32 | 0.06 | 0.00 | 0.49 |
| Musket | 45.04 | 0.14 | 0.00 | 6.96 |
| RACER | 75.76 | 0.28 | 0.00 | 0.68 |

| Software | Corrected errors | Introduced errors | Reads removed (%) | |
|---|---|---|---|---|
| Pollux | 96.16 | 0.13 | 6.01 | 14.40 |
| Quake | | 0.00 | 4.25 | 22.25 |
| SGA | 84.61 | 0.03 | 3.87 | 73.82 |
| BLESS | 87.61 | 0.03 | 0.00 | 3.56 |
| Musket | 83.33 | 0.10 | 0.00 | 18.67 |
| RACER | 94.09 | 0.16 | 0.00 | 4.14 |

| Software | Corrected errors | Introduced errors | Reads removed (%) | |
|---|---|---|---|---|
| Pollux | | 0.83 | 6.51 | 4.24 |
| Quake | 68.06 | 0.12 | 3.06 | 5.62 |
| SGA | 31.99 | 0.16 | 0.46 | 17.75 |
| BLESS | 60.01 | 0.13 | 0.00 | 1.04 |
| Musket | 44.68 | 0.80 | 0.00 | 5.93 |
| RACER | 65.14 | 1.28 | 0.00 | 0.79 |

| Software | Corrected errors | Introduced errors | Reads removed (%) | |
|---|---|---|---|---|
| Pollux | 81.02 | 4.14 | 0.82 | 5.45 |
| Quake | 0.21 | 0.00 | 0.07 | 3.62 |
| SGA | 8.94 | 3.63 | 1.64 | 5.33 |
| BLESS | 34.68 | 1.27 | 0.00 | 0.24 |
| Musket | 0.00 | 0.00 | 0.00 | 0.05 |
| RACER | | 24.27 | 0.00 | 0.36 |

| Software | Corrected errors | Introduced errors | Reads removed (%) | |
|---|---|---|---|---|
| Pollux | | 3.43 | 10.90 | 26.75 |
| Quake | 12.35 | 2.03 | 37.60 | 11.11 |
| SGA | 5.43 | 1.12 | 0.16 | 55.93 |
| BLESS | 22.82 | 0.52 | 0.00 | 1.25 |
| Musket | 9.40 | 4.88 | 0.00 | 47.27 |
| RACER | 67.86 | 15.95 | 0.00 | 1.64 |
The evaluation is performed by aligning corresponding uncorrected and corrected reads that were not removed against a reference genome using SMALT. Corrected errors are an aggregate of all alignment errors which are found in uncorrected reads but not in corrected reads. Similarly, introduced errors are an aggregate of all alignment errors found in corrected reads but not in uncorrected reads, and are reported relative to the sum of corrected and uncorrected errors. The percentage of reads removed by each software is noted. We note that Quake, SGA, and Musket were designed to correct only Illumina sequencing data.
Comparison of assemblies using uncorrected and corrected reads
| | Uncorrected reads | | | Corrected reads | | |
|---|---|---|---|---|---|---|
| | 2,120 | 31 | 84 | 1,840 | 37 | 85 |
| | 737 | 192 | 145 | 603 | 1,771 | 202 |
Assemblies of uncorrected and corrected reads using Velvet. E. coli reads are paired and assembled using default parameters. S. aureus reads consist of paired fragment reads with an average insert length of 180 and short jump reads with an average insert length of 3500. These reads are assembled using the parameterization described in GAGE.
Figure 3. The k-mer counts associated with the uncorrected and corrected versions of an Ion Torrent PGM read containing a highlighted deletion error. As is common with PGM data, the counts associated with the deletion error are not all reduced to one. This is a result of multiple similar deletion errors coinciding. The overall depth of the read increases after correction, suggesting that a large number of errors are removed in other reads.