Mahdi Heydari, Giles Miclotte, Piet Demeester, Yves Van de Peer, Jan Fostier.
Abstract
BACKGROUND: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods.
Keywords: Error correction; Genome assembly; Illumina; Next-generation sequencing
Year: 2017 PMID: 28821237 PMCID: PMC5563063 DOI: 10.1186/s12859-017-1784-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
List of EC tools evaluated in this paper
| EC tool | Algorithm | Data structure | Indel support | Accuracy analysis | Assembly analysis | Year |
|---|---|---|---|---|---|---|
| ACE | k-mer | | | Read level | - | 2015 |
| BayesHammer | k-mer | Hamming graph | | Read level | SPAdes | 2013 |
| BFC | k-mer | Bloom filter | | Read level | Velvet, ABySS | 2015 |
| BLESS 2 | k-mer | Bloom filter | | Read level | Gossamer | 2016 |
| Blue | k-mer | Hash table | | Read level | Velvet | 2014 |
| Fiona | MSA | Suffix tree | | Base level | - | 2014 |
| Karect | MSA | Partially-ordered graph | | Read, base level | Velvet, SGA, Celera | 2015 |
| Lighter | k-mer | Bloom filter | | Read level | Velvet | 2013 |
| Musket | k-mer | Bloom filter | | Base level | SGA | 2013 |
| RACER | k-mer | Hash table | | Read level | - | 2013 |
| SGA-EC | MSA | Suffix array | | Read level | SGA | 2012 |
| Trowel | k-mer | Hash table | | Read, base level | Velvet, SOAPdenovo | 2014 |
The algorithmic approach is either k-mer spectrum based (‘k-mer’) or multiple sequence alignment based (‘MSA’). Tools can be further classified according to the data structure and heuristics used. Some tools are able to correct insertions or deletions. In their accompanying publications, all tools were assessed directly on their ability to reduce the error rate, at either the read or the base level. Most tools were not evaluated through assembly analyses with modern assemblers. SPAdes was used for the evaluation of BayesHammer, but no comparison was made with assembly results from uncorrected data.
NGA50 of contigs (top) and scaffolds (bottom) assembled by SPAdes before and after error correction
| Tools | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
|---|---|---|---|---|---|---|---|---|
| Contig NGA50 | | | | | | | | |
| Uncorrected | 397 392 | 92 570 | 119 253 | 231 409 | 264 881 | 8 559 | 6 429 | 50 484 |
| ACE | 397 392 = | 92 570 = | 125 608 | 231 409 = | 264 881 = | 8 771 | 3 143 | 28 679 |
| BayesHammer | 397 392 = | 92 344 | 132 564 | 231 409 = | 264 881 = | 9 075 | 6 540 | 53 534 |
| BFC | 397 392 = | 92 570 = | 132 876 | 231 409 = | 264 881 = | 9 375 | 6 389 | 49 185 |
| BLESS 2 | 397 392 = | 92 570 = | 119 265 | 231 409 = | 264 881 = | 7 975 | 3 047 | 23 814 |
| Blue | 397 392 = | 92 708 | 132 876 | 231 409 = | 289 353 | 7 628 | 6 191 | 50 486 |
| Fiona | 397 392 = | 92 611 | 119 253 = | 231 409 = | 264 881 = | 9 224 | 5 346 | 45 472 |
| Karect | 397 392 = | 92 611 | 132 876 | 231 409 = | 264 881 = | 9 865 | 6 392 | 54 132 |
| Lighter | 397 392 = | 92 570 = | 132 564 | 231 409 = | 289 353 | 9 609 | 6 423 | 50 440 |
| Musket | 397 392 = | 92 566 | 132 876 | 231 409 = | 264 881 = | 9 293 | 6 170 | 46 377 |
| RACER | 397 392 = | 92 523 | 112 393 | 231 409 = | 264 881 = | 7 336 | 3 244 | 21 538 |
| SGA-EC | 397 392 = | 92 344 | 119 255 | 231 409 = | 264 881 = | 9 296 | 6 435 | 52 105 |
| Trowel | 397 392 = | 92 344 | 119 335 | 231 409 = | 264 881 = | 7 808 | 6 389 | 48 357 |
| Scaffold NGA50 | | | | | | | | |
| Uncorrected | 397 392 | 97 353 | 132 876 | 231 409 | 289 353 | 8 829 | 6 472 | 60 554 |
| ACE | 397 392 = | 97 353 = | 133 713 | 231 409 = | 264 881 | 9 190 | 3 158 | 35 392 |
| BayesHammer | 397 392 = | 97 353 = | 133 309 | 231 409 = | 264 881 | 9 443 | 6 576 | 58 570 |
| BFC | 397 392 = | 97 353 = | 133 088 | 231 409 = | 264 881 | 9 664 | 6 419 | 59 613 |
| BLESS 2 | 397 392 = | 97 353 = | 132 876 = | 231 409 = | 264 881 | 8 441 | 3 073 | 35 638 |
| Blue | 397 392 = | 97 288 | 133 309 | 231 409 = | 289 353 = | 7 841 | 6 183 | 61 289 |
| Fiona | 397 392 = | 97 353 = | 132 876 = | 231 409 = | 264 881 | 9 491 | 5 385 | 54 188 |
| Karect | 397 392 = | 97 353 = | 133 058 | 231 409 = | 264 881 | 10 302 | 6 446 | 62 304 |
| Lighter | 397 392 = | 97 353 = | 133 309 | 231 409 = | 289 353 = | 9 955 | 6 468 | 59 697 |
| Musket | 397 392 = | 97 353 = | 133 088 | 231 409 = | 264 881 | 9 502 | 6 219 | 55 842 |
| RACER | 397 392 = | 97 353 = | 132 876 = | 231 409 = | 264 881 | 7 603 | 3 266 | 23 783 |
| SGA-EC | 397 392 = | 97 353 = | 132 876 = | 231 409 = | 264 881 | 9 640 | 6 483 | 60 636 |
| Trowel | 397 392 = | 97 353 = | 132 876 = | 231 409 = | 264 881 | 8 107 | 6 435 | 57 078 |
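Arrows in the table are based on each value relative to the NGA50 obtained from uncorrected data, in four bands: below -10%; between -10% and 0% (↓); between 0% and +10% (↑); and above +10%. An equals sign (=) marks a value identical to the uncorrected one.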
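The banding described in the legend can be sketched as a small function. This is only an illustration: the function name and band labels are hypothetical, and the treatment of values lying exactly on the ±10% boundary is an assumption, since the legend does not specify it.

```python
def classify_change(corrected: int, uncorrected: int) -> str:
    """Place a corrected NGA50 into a band relative to the uncorrected NGA50.

    Band labels are illustrative. A change of exactly -10% or +10% is
    assigned to the inner band here, which is an assumption.
    """
    if uncorrected <= 0:
        raise ValueError("uncorrected NGA50 must be positive")
    if corrected == uncorrected:
        return "unchanged"
    # Relative change in percent with respect to the uncorrected assembly.
    change = 100.0 * (corrected - uncorrected) / uncorrected
    if change < -10:
        return "decrease > 10%"
    if change < 0:
        return "decrease <= 10%"
    if change <= 10:
        return "increase <= 10%"
    return "increase > 10%"
```

Applied to contig NGA50 values from the table, `classify_change(28679, 50484)` (ACE on D8) falls in the "decrease > 10%" band, while `classify_change(54132, 50484)` (Karect on D8) falls in the "increase <= 10%" band.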
Real datasets used for the evaluation of EC tools
| Abbr. | Organism | Reference ID | Genome size | Cov. | Sequencing platform | Read length | Trimmed reads | Dataset ID | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| D1 | | NC013714.1 | 2.6 Mbp | 373 X | Illumina MiSeq | 251 bp | | SRR1151311 | |
| D2 | | NC010473 | 4.5 Mbp | 418 X | Illumina MiSeq | 150 bp | | Ill. Data library | |
| D3 | | NC000913 | 4.5 Mbp | 612 X | Illumina GAII | 100 bp | | ERA000206 | |
| D4 | | NC011083.1 | 4.7 Mbp | 97 X | Illumina MiSeq | 239 bp | | SRR1206093 | |
| D5 | | ERR330008 | 6.1 Mbp | 169 X | Illumina MiSeq | 120 bp | | ERR330008 | |
| D6 | | HG19 | 45.2 Mbp | 29 X | Illumina HiSeq | 100 bp | | Ill. Data library | |
| D7 | | WS222 | 97.6 Mbp | 58 X | Illumina HiSeq | 101 bp | | SRR543736 | |
| D8 | | Release 5 | 116.4 Mbp | 52 X | Illumina HiSeq | 100 bp | | SRR823377 | |
Fig. 1 Mismatches in read alignment. Classification of (un)corrected reads for D. melanogaster, based on the number of mismatches in their alignment to the reference genome.
Accuracy comparison of EC tools in terms of EC gain, percentage of corrected errors, and number of newly introduced errors per Mbp of read data
| EC tool | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
|---|---|---|---|---|---|---|---|---|
| Error correction gain (%) | | | | | | | | |
| ACE | 96.3 | 97.9 | 98.7 | 96.2 | 91.1 | 41.7 | -3.3 | 25.9 |
| BFC | 78.7 | 84.3 | 80.2 | 81.4 | 78.6 | 52.8 | 63.3 | 24.1 |
| Blue | 98.5 | 98.8 | 98.7 | 96.7 | 95.4 | 51.1 | 65.2 | 28.8 |
| Fiona | 87.4 | 94.6 | 97.5 | 85.5 | 91.4 | 55.0 | 65.8 | 29.8 |
| Karect | 99.4 | 99.8 | 99.7 | 98.5 | 98.2 | 63.1 | 75.5 | 34.3 |
| Lighter | 85.4 | 93.8 | 92.5 | 80.1 | 84.6 | 45.7 | 50.3 | 21.7 |
| Musket | 91.3 | 93.6 | 93.4 | 88.0 | 87.1 | 49.5 | 59.2 | 23.5 |
| RACER | 92.3 | 94.4 | 97.0 | 88.3 | 94.0 | 17.4 | 32.6 | 22.3 |
| SGA-EC | 55.3 | 67.2 | 45.5 | 53.1 | 65.2 | 48.7 | 60.6 | 23.0 |
| Trowel | 38.4 | 49.4 | 38.8 | 40.5 | 46.8 | 13.2 | 1.1 | 10.5 |
| Percentage of corrected errors (sensitivity) | | | | | | | | |
| ACE | 97.7 | 98.5 | 99.2 | 98.0 | 97.0 | 61.3 | 73.8 | 34.5 |
| BFC | 78.8 | 84.4 | 80.2 | 81.4 | 78.7 | 54.1 | 63.8 | 24.7 |
| Blue | 98.7 | 99.3 | 99.1 | 97.0 | 95.7 | 59.9 | 70.6 | 31.4 |
| Fiona | 87.5 | 94.8 | 97.7 | 85.5 | 91.7 | 60.6 | 71.7 | 31.5 |
| Karect | 99.4 | 99.9 | 99.7 | 98.5 | 98.2 | 64.4 | 76.7 | 35.5 |
| Lighter | 85.5 | 94.0 | 92.7 | 80.2 | 86.3 | 48.9 | 59.1 | 24.3 |
| Musket | 91.3 | 93.6 | 93.4 | 88.1 | 87.3 | 52.9 | 65.3 | 26.4 |
| RACER | 92.9 | 95.8 | 98.2 | 89.0 | 94.8 | 59.2 | 68.2 | 34.0 |
| SGA-EC | 55.3 | 67.2 | 45.5 | 53.1 | 65.3 | 50.4 | 61.3 | 23.2 |
| Trowel | 39.0 | 49.9 | 43.4 | 40.9 | 47.6 | 23.6 | 31.2 | 11.8 |
| Number of errors introduced per Mbp | | | | | | | | |
| ACE | 44 | 23 | 40 | 151 | 194 | 1217 | 2375 | 1123 |
| BFC | 2 | 3 | 7 | 2 | 3 | 83 | 15 | 73 |
| Blue | 8 | 20 | 30 | 31 | 10 | 547 | 167 | 341 |
| Fiona | 2 | 7 | 14 | 6 | 9 | 347 | 183 | 218 |
| Karect | 0 | 1 | 3 | 1 | 1 | 80 | 36 | 157 |
| Lighter | 2 | 6 | 14 | 8 | 56 | 202 | 273 | 332 |
| Musket | 1 | 2 | 5 | 3 | 6 | 214 | 190 | 383 |
| RACER | 21 | 62 | 97 | 58 | 27 | 2603 | 1097 | 1524 |
| SGA-EC | 1 | 3 | 6 | 2 | 3 | 105 | 22 | 24 |
| Trowel | 21 | 26 | 376 | 41 | 25 | 647 | 930 | 172 |
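The three quantities reported in the table above can be illustrated with a short sketch. It assumes the commonly used definitions, which this record does not spell out: TP is the number of errors that were corrected, FN the number of errors left uncorrected, and FP the number of newly introduced errors, with gain = (TP - FP) / (TP + FN) and sensitivity = TP / (TP + FN). The function name and the returned keys are hypothetical.

```python
def ec_metrics(tp: int, fp: int, fn: int, total_bases: int) -> dict:
    """Compute error-correction metrics from per-base counts.

    Assumed definitions (not stated in the source):
      tp: erroneous bases that were corrected
      fp: correct bases that were changed into errors
      fn: erroneous bases that were left unchanged
    """
    errors_before = tp + fn  # errors present in the uncorrected reads
    if errors_before <= 0 or total_bases <= 0:
        raise ValueError("need at least one initial error and one base")
    return {
        # Gain becomes negative when more errors are introduced than removed.
        "gain_pct": 100.0 * (tp - fp) / errors_before,
        "sensitivity_pct": 100.0 * tp / errors_before,
        "introduced_per_mbp": fp / (total_bases / 1e6),
    }
```

Under these definitions the gain can never exceed the sensitivity, and a tool that introduces many new errors can show high sensitivity with a low or even negative gain, which matches the pattern of the ACE values for D7 in the table.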
Fig. 2 SPAdes assemblies. SPAdes assembly results for D. melanogaster for (un)corrected data. Scaffolds with length NGAx or larger contain x% of the genome.
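The NGAx definition in the caption can be turned into a short computation. The sketch below assumes its input is already the list of aligned block lengths, that is, scaffolds that have been aligned to the reference and split at misassemblies by some upstream step that is not shown. The function name `ngx` is hypothetical.

```python
def ngx(block_lengths, genome_size, x=50):
    """Return the largest length L such that blocks of length >= L
    together contain at least x% of the genome.

    block_lengths: lengths of aligned blocks (assumed to be precomputed)
    genome_size:   length of the reference genome
    Returns 0 when the blocks never reach x% of the genome.
    """
    if genome_size <= 0 or not 0 < x <= 100:
        raise ValueError("genome_size must be positive and 0 < x <= 100")
    target = genome_size * x / 100.0
    covered = 0
    # Accumulate blocks from longest to shortest until the target is met.
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0
```

For blocks of lengths 400, 300, 200 and 100 on a genome of size 1000, `ngx(..., x=50)` is 300, because the two longest blocks together are the first to reach half of the genome. Sweeping `x` from 1 to 100 yields a curve of the kind described in the caption.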
Fig. 3 Fragmented assembly using corrected data. Contigs assembled from corrected data are aligned to the largest (top) and second largest (bottom) contig obtained from uncorrected data. Different colors denote different contigs. Black bars indicate the location of lost true k-mers in the contigs. This indicates a possible causal relationship between lost true k-mers and the breakpoints in the assemblies of corrected data.
Fig. 4 Lost true 21-mers spectrum. For dataset D8, this figure shows the 21-mer spectrum of the uncorrected data, along with the lost true 21-mer spectrum for all EC tools. EC tools erroneously remove low frequency true 21-mers during error correction.
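The notion of a lost true k-mer can be made concrete with a minimal sketch. It treats a true k-mer as lost when it occurs in the reference and in the uncorrected reads but no longer occurs in the corrected reads. This is a simplification: reverse complements are ignored, everything is held in memory, and all function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def lost_true_kmers(reference, uncorrected, corrected, k):
    """Map each lost true k-mer to its frequency in the uncorrected reads."""
    true_kmers = set(kmers(reference, k))
    before = kmer_counts(uncorrected, k)
    after = kmer_counts(corrected, k)
    return {km: before[km] for km in true_kmers
            if before[km] > 0 and after[km] == 0}


def spectrum(counts):
    """Histogram: frequency -> number of distinct k-mers with that frequency."""
    return Counter(counts.values())
```

Calling `spectrum(lost_true_kmers(...))` gives the kind of histogram the caption describes. A single miscorrected base in a read can remove up to k overlapping k-mers from that read, which is why low-coverage regions are the most exposed.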
Fig. 5 Alignment of uncorrected and ACE-corrected reads in the neighborhood of a contig breakpoint. The first track shows the 21-mer coverage of the uncorrected data. The second track (Ref) contains part of the reference genome, which is assembled into one contig from uncorrected data. A repeated 21-mer is indicated in red. The third track (Uncorrected) shows the alignment of the uncorrected, but error-free reads to the reference. The fourth track (Corrected) uses these same alignment positions, but with the sequence content of the corrected reads. Newly introduced errors are indicated by a character in the reads. The rectangle in the fourth track indicates 75 overlapping 21-mers that are lost as a result of erroneous error correction.
Fig. 6 Alignment of uncorrected reads and reads corrected by Musket and Fiona in the neighborhood of a contig breakpoint. Lost true k-mers can result in two different scenarios. The first track shows the 21-mer coverage of the uncorrected data. The second track (Ref) shows a part of the reference genome, which is assembled into one contig from uncorrected data. A frequently occurring AT-repeat is indicated in red. The third track (Uncorrected) shows the alignment of the uncorrected reads to the reference. The fourth and fifth tracks (Corrected Musket and Corrected Fiona) use these same alignment positions, but with the sequence content of the reads corrected by Musket and Fiona. The sixth track is the contig assembled from the reads corrected by Fiona. The rectangles indicate the regions in the reads corrected by Musket and Fiona that no longer contain any true 21-mers. The coverage is low around an ‘AT’ repeat that has coverage 13750x in the uncorrected data. Musket incorrectly changed two bases, breaking the connection between two groups of reads. In contrast, in the Fiona-corrected reads, the connection is not lost. Instead, the lost true k-mers in Fiona appear as mismatches in the assembled contig.
Fig. 7 Peak memory usage of the EC tools
Fig. 8 Runtime of the EC tools