Guillaume Marçais, James A Yorke, Aleksey Zimin.
Abstract
MOTIVATION: Illumina sequencing data can provide high coverage of a genome with relatively short reads (most often 100 bp to 150 bp) at low cost. Even with a low error rate (advertised as 1%), 100× coverage Illumina data has, on average, an error in some read at every base of the genome. These errors complicate handling of the data because they produce a large number of low-count erroneous k-mers in the reads. However, the reads contain enough information to correct most of the sequencing errors, which makes subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term "error correction" to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads while preserving as many true k-mers as possible, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.
Year: 2015 PMID: 26083032 PMCID: PMC4471408 DOI: 10.1371/journal.pone.0130821
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
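The arithmetic behind the abstract's coverage claim, and the way a single sequencing error turns into several low-count k-mers, can be sketched in Python. This is only an illustration under simple assumptions (substitution errors only, no reverse complements); the function names and the toy reads are hypothetical and are not taken from QuorUM.

```python
from collections import Counter


def expected_errors_per_position(coverage, error_rate):
    """Expected number of reads with a wrong base at one genome position."""
    return coverage * error_rate


def max_kmers_hit_by_one_error(read_len, k):
    """Upper bound on the k-mers of one read that contain a single wrong base."""
    # A read has read_len - k + 1 k-mers, and a base lies in at most k of them.
    return max(0, min(k, read_len - k + 1))


def count_kmers(reads, k):
    """Count every k-mer occurrence over all reads (no reverse complements)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy data (made up for illustration): five correct reads and one read with a
# single substitution at index 4 (T -> A).
reads = ["ACGTTGCA"] * 5 + ["ACGTAGCA"]
counts = count_kmers(reads, k=4)

# k-mers seen exactly once; here they are exactly the k-mers covering the error.
singletons = sorted(kmer for kmer, n in counts.items() if n == 1)
```

With 100× coverage and a 1% error rate, `expected_errors_per_position(100, 0.01)` is about one erroneous read per base, and one wrong base in a 100 bp read can affect up to 31 distinct 31-mers. In the toy data the four singleton 4-mers all come from the erroneous read, while the true 4-mers have counts of 5 or 6.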
Runtime of each program in hours:minutes:seconds, using 16 threads, and memory usage in gigabytes.
The number of bases in each genome is reported in the column headers.
| Corrector | Rhodobacter (4.6 Mb): Time | Rhodobacter (4.6 Mb): Mem | Staphylococcus (2.9 Mb): Time | Staphylococcus (2.9 Mb): Mem | Mouse C16 (98.2 Mb): Time | Mouse C16 (98.2 Mb): Mem |
|---|---|---|---|---|---|---|
| Coral | 0:09:46 | 35 | 0:06:18 | 33 | - | - |
| Echo | 2:10:46 | 58 | 1:06:11 | 39 | - | - |
| HiTec | 0:41:51 | 4.0 | 0:22:09 | 2.3 | - | - |
| Quake | 0:03:01 | 0.37 | 0:04:18 | 1.3 | 1:13:30 | 5.7 |
| SGA | 0:05:14 | 0.34 | 0:03:23 | 0.28 | 0:32:33 | 2.1 |
| Racer | 0:01:58 | 2.0 | 0:01:01 | 1.4 | 0:34:35 | 11 |
| Musket | 0:06:54 | | 0:01:49 | | 0:58:11 | |
| QuorUM | | 0.44 | | 0.74 | | 8.8 |
Percent of false 31-mers remaining and true 31-mers missing in error corrected reads.
The numbers for “false remain” and “true missing” in the table are percentages. The denominator used for each percentage is listed in the corresponding column header. For “false remain”, the denominator is the number of false 31-mers in the original reads; for “true missing”, it is the number of 31-mers in the reference. The “score” π is the product of the “false remain” and “true missing” columns. QuorUM’s π score is the best, by a factor of 30, 15, and 3.5 over the second best for the Rhodobacter, Staphylococcus and Mouse C16 data sets, respectively.
| Corrector | Rhodobacter: False remain (55 M) | Rhodobacter: True missing (4.6 M) | Rhodobacter: Score π | Staphylococcus: False remain (33 M) | Staphylococcus: True missing (2.9 M) | Staphylococcus: Score π | Mouse C16: False remain (410 M) | Mouse C16: True missing (87 M) | Mouse C16: Score π |
|---|---|---|---|---|---|---|---|---|---|
| none | 100 | 0.36 | 40 | 100 | 0.037 | 4 | 100 | 0.069 | 7 |
| trim20B | 55 | 0.39 | 20 | 64 | 0.085 | 5 | 50 | | 4 |
| trimQual5 | 9.4 | 0.71 | 7 | 96 | 0.039 | 4 | 34 | 0.10 | 3 |
| Coral | 69 | 0.38 | 30 | 56 | 0.13 | 7 | 52 | 0.22 | 10 |
| Echo | 60 | | 20 | 55 | | 2 | - | - | |
| HiTec | 42 | 1.1 | 50 | 33 | 0.23 | 8 | - | - | |
| Quake | 8.3 | 0.71 | 6 | 3.3 | 0.24 | 0.8 | 4.6 | 0.16 | 0.7 |
| SGA | 2.3 | 1.5 | 3 | 0.49 | 0.61 | 0.3 | 7.1 | 0.16 | 1 |
| Racer | 40 | 0.93 | 40 | 35 | 0.26 | 9 | 30 | 0.27 | 8 |
| Musket | 40 | 0.52 | 20 | 44 | 0.067 | 3 | 29 | 0.15 | 4 |
| QuorUM | | 0.40 | | | 0.087 | | | 0.11 | |
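The two percentages and the score π described in the caption can be computed from k-mer sets along the following lines. This is a minimal sketch, not QuorUM's own evaluation code: the function names are hypothetical, reverse complements are ignored, and the tiny sequences are made up for illustration.

```python
def kmers(seq, k):
    """Set of distinct k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmers_of_reads(reads, k):
    """Set of distinct k-mers over a collection of reads."""
    result = set()
    for read in reads:
        result |= kmers(read, k)
    return result


def pi_score(reference, original_reads, corrected_reads, k):
    """Return (false remain %, true missing %, pi) as defined in the caption.

    Assumes the original reads contain at least one false k-mer; otherwise
    the first denominator would be zero.
    """
    true_kmers = kmers(reference, k)
    false_original = kmers_of_reads(original_reads, k) - true_kmers
    corrected = kmers_of_reads(corrected_reads, k)
    # False k-mers still present after correction, relative to the false
    # k-mers in the original reads.
    false_remain = 100.0 * len(corrected - true_kmers) / len(false_original)
    # Reference k-mers no longer present after correction.
    true_missing = 100.0 * len(true_kmers - corrected) / len(true_kmers)
    return false_remain, true_missing, false_remain * true_missing


# Toy example (made-up data): the corrector trimmed both reads.
result = pi_score(
    reference="ACGTTGCA",
    original_reads=["ACGTTGCA", "ACGTAGCA"],
    corrected_reads=["ACGTTGC", "ACGTAG"],
    k=4,
)
```

In the toy example, two of the four original false 4-mers remain (50%) and one of the five reference 4-mers is missing (20%), so π is 1000. The same product reproduces the table entries up to rounding; for Quake on Rhodobacter, 8.3 × 0.71 ≈ 5.9, which is reported as 6.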
Idealized contig size statistics (in kb).
| Corrector | Rhodobacter: N50 | Rhodobacter: E-size | Staphylococcus: N50 | Staphylococcus: E-size | Mouse C16: N50 | Mouse C16: E-size |
|---|---|---|---|---|---|---|
| none | 2.7 | 4.1 | 43 | 42.4 | 32 | 40.4 |
| trim20B | 3.9 | 5.8 | 17 | 20.4 | 32 | 40.1 |
| trimQual5 | 3.2 | 4.3 | 35 | 42.3 | 38 | 47.8 |
| Coral | 4.7 | 6.9 | 65 | 87.7 | 17 | 22.6 |
| Echo | 5.6 | 8.0 | | 110 | - | - |
| HiTec | 5.7 | 8.1 | 55 | 56.3 | - | - |
| Quake | 3.2 | 4.3 | 21 | 23.1 | 36 | 45.1 |
| SGA | 4.7 | 6.8 | 15 | 16.2 | 38 | 48.1 |
| Racer | 5.7 | 9.3 | 39 | 44.3 | 24 | 30.2 |
| Musket | 5.1 | 7.8 | 61 | 78.1 | 31 | 38.6 |
| QuorUM | | | 86 | | | |
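For readers unfamiliar with the two statistics in this table, the sketch below computes them from a list of contig lengths. It assumes the commonly used definitions (N50 as the length at which the longest contigs reach half of the total, E-size as the sum of squared lengths divided by the total); how the idealized contigs themselves are derived is not described here, and the function names and example lengths are hypothetical.

```python
def n50(lengths, total=None):
    """Largest length L such that contigs of length >= L cover half of total.

    `total` defaults to the sum of the contig lengths; a genome size can be
    passed instead.
    """
    total = sum(lengths) if total is None else total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def e_size(lengths, total=None):
    """Sum of squared contig lengths divided by the total length."""
    total = sum(lengths) if total is None else total
    if total == 0:
        return 0.0
    return sum(length * length for length in lengths) / total


# Made-up contig lengths, for illustration only.
example = [10, 20, 30, 40]
example_n50 = n50(example)
example_e_size = e_size(example)
```

For the made-up lengths above, both statistics equal 30: the contigs of length 40 and 30 together cover 70 of 100 bases, and (100 + 400 + 900 + 1600) / 100 = 30.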
Percentage of the original reads that are perfect after error reduction, and percentage of bases contained in perfect reads compared with bases in original reads.
The numbers in parentheses are the denominators used to compute the percentages: the number of original reads and the amount of sequence in the original reads, respectively.
| Corrector | Rhodobacter: Reads (2 M) | Rhodobacter: Sequence (202 M) | Staphylococcus: Reads (1.2 M) | Staphylococcus: Sequence (120 M) | Mouse C16: Reads (41 M) | Mouse C16: Sequence (4.2 G) |
|---|---|---|---|---|---|---|
| none | 21 | 21 | 33 | 33 | 48 | 48 |
| trim20B | 44 | 36 | 46 | 37 | 79 | 64 |
| trimQual5 | 76 | 51 | 35 | 35 | 78 | 72 |
| Coral | 58 | 58 | 74 | 74 | 81 | 81 |
| Echo | 56 | 56 | 65 | 65 | - | - |
| HiTec | 61 | 61 | 78 | 78 | - | - |
| Quake | 81 | 59 | 69 | 60 | 89 | 81 |
| SGA | 62 | 62 | 75 | 75 | 85 | 85 |
| Racer | 63 | 63 | 78 | 78 | 84 | 84 |
| Musket | 76 | 70 | 80 | 78 | 88 | 86 |
| QuorUM | | | | | | |
Number of chimeric reads per 10,000 after correction.
| Corrector | Rhodobacter | Staphylococcus | Mouse C16 |
|---|---|---|---|
| none | 11 | 7.3 | 59 |
| trim20B | 7.9 | 4.9 | 46 |
| trimQual5 | 2.9 | 7.1 | 26 |
| Coral | 11 | 9.3 | 52 |
| Echo | 9.6 | 7.6 | - |
| HiTec | 35 | 12 | - |
| Quake | | | |
| SGA | 3.8 | 5.7 | 14 |
| Racer | 18 | 8.8 | 65 |
| Musket | 15 | 8.6 | 40 |
| QuorUM | 0.17 | 7.2 | 13 |
Fig 1. Runtime of the error corrector programs on Rhodobacter vs. the number of threads.
The solid lines represent the actual runtime while the dashed lines represent the perfect linear speed-up (except for HiTec which is not multi-threaded). The plot uses a log-log scale.
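The "perfect linear speed-up" reference lines in the caption can be written out as follows. The symbols $T(p)$ for the runtime with $p$ threads and $T_{\text{ideal}}$ for the reference line are notation assumed here, not taken from the source.

```latex
T_{\text{ideal}}(p) = \frac{T(1)}{p}
\quad\Longrightarrow\quad
\log T_{\text{ideal}}(p) = \log T(1) - \log p
```

On a log-log plot this reference is a straight line of slope $-1$, so a measured curve that flattens above it indicates less than linear scaling.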
The assembled NGA50 contig size in kilo-bases for SOAPdenovo.
The “-d0” and “-d1” parameters instruct SOAPdenovo to use all 31-mers or to ignore the 31-mers occurring only once, respectively. For MaSuRCA, which incorporates QuorUM, the result is in parentheses.
| Corrector | Rhodobacter: -d0 | Rhodobacter: -d1 | Staphylococcus: -d0 | Staphylococcus: -d1 | Mouse C16: -d0 | Mouse C16: -d1 |
|---|---|---|---|---|---|---|
| none | 0. | 2.7 | 0. | 4.8 | 0.64 | 1.5 |
| Coral | 0. | 3.4 | 0.67 | 16 | - | - |
| Echo | 0. | 3.1 | 0.92 | 9.1 | - | - |
| HiTec | 0.96 | 2.3 | 3.0 | 8.5 | - | - |
| Quake | 2.9 | 1.6 | 10 | 5.7 | | |
| SGA | 3.4 | 2.3 | 8.4 | 5.9 | 1.4 | |
| Racer | 1.2 | 2.5 | 5.2 | 7.8 | 1.2 | 1.4 |
| Musket | 0.53 | 2.9 | 2.1 | 9.5 | 1.2 | 1.4 |
| QuorUM | | 5.9 | | 16.206 | | |
| MaSuRCA | (19) | | (33) | | (5.7) | |