Isaac Akogwu1, Nan Wang1, Chaoyang Zhang1, Ping Gong2.
Abstract
BACKGROUND: Advances in high-throughput next-generation sequencing (NGS) have opened many opportunities for new genomic research. However, the abundance of NGS data makes it harder to distinguish true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study aims to investigate the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and to provide recommendations for users in selecting suitable methods for specific NGS datasets.
Keywords: Bloom filter; Error correction; Next-generation sequencing (NGS); Sequence analysis; k-mer; k-spectrum
Year: 2016 PMID: 27461106 PMCID: PMC4965716 DOI: 10.1186/s40246-016-0068-0
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Sequence error rates for different NGS platforms in comparison with the traditional Sanger technology (updated as of 2012 at http://www.molecularecologist.com/next-gen-table-3c-2014/). See Glenn (2011) [41] for more details. Single-pass reads are those raw sequences that have not been subject to consensus adjustment incorporated in final base calling
| Instrument | Primary error type | Single-pass error rate (%) | Final error rate (%) |
|---|---|---|---|
| Sanger—ABI 3730XL capillary (benchmark) | Substitution | 0.1–1 | 0.1–1 |
| 454—all models | Indel | 1 | 1 |
| Illumina—all models | Substitution | ~0.1 | ~0.1 |
| Ion Torrent—all chips | Indel | ~1 | ~1 |
| SOLiD—5500XL | A-T bias | ~5 | ≤0.1 |
| Oxford Nanopore | Deletion | ≥4 | 4 |
| PacBio RS | Indel | ~13 | ≤1 |
Fig. 1 General framework of k-spectrum-based error correctors
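As a rough illustration of the k-spectrum idea shared by these correctors, the following minimal Python sketch counts all k-mers in a read set and flags read positions that no frequent k-mer covers. It assumes that a k-mer seen at least `cutoff` times is trusted (solid); the function names and the `cutoff` parameter are illustrative and do not correspond to any of the tools compared here.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer occurring in the reads (the k-spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read, counts, k, cutoff):
    """Return read positions not covered by any solid k-mer.

    A k-mer is treated as solid when its count is >= cutoff;
    `cutoff` is a hypothetical, user-chosen threshold.
    """
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] >= cutoff:
            for j in range(i, i + k):
                covered[j] = True
    return [i for i, ok in enumerate(covered) if not ok]


# Toy input: five identical reads plus one read with a single substitution.
reads = ["ACGTACGTAC"] * 5 + ["ACGTACCTAC"]
spectrum = kmer_spectrum(reads, k=4)
suspect = weak_positions(reads[-1], spectrum, k=4, cutoff=2)
print(suspect)
```

On this toy input the flagged positions are 6 to 9: the substitution at position 6 makes every k-mer overlapping it rare, so the uncovered region starts at the erroneous base. A real corrector would then search for a replacement base that turns the weak k-mers into solid ones.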
Characteristic features of the six k-spectrum-based methods investigated in the present comparative study which distinguish one method from others
| Tools | Algorithm highlight | Data structure | Pros | Cons | Quality score | Target error type |
|---|---|---|---|---|---|---|
| Reptile | Explore multiple alternative | Hamming graph | Contextual information can help resolve errors without increasing | Uses a single core (non-parallelized) | Used | Substitution |
| Musket | Multi-stage correction: two-sided conservative, one-sided aggressive and voting-based refinement | Bloom filter | Multi-threading based on a master–slave model results in high parallel scalability | A single static coverage cut-off to differentiate trusted | Not used | Substitution |
| Bless | Count | Bloom filter | High memory efficiency; handle genome repeats better; correct read ends | Cannot automatically determine the optimal | Not used | Substitution |
| Bloocoo | Parallelized multi-stage correction algorithm (similar to Musket) | Blocked Bloom filter | Faster and lower memory usage than Musket | Not extensively evaluated | Not used | Substitution |
| Trowel | Rely on quality values to identify solid | Hash table | Correct erroneous bases and boost base qualities | Only accept FASTQ files as input | Used | Substitution |
| Lighter | Random sub-fraction sampling; parallelized error correction | Pattern-blocked Bloom filter | No | A user must specify | Used | Substitution |
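Several of the tools above store k-mers in a Bloom filter rather than an exact table. The sketch below shows a plain (non-blocked) Bloom filter in Python, assuming salted SHA-256 digests as the hash functions; the class, its parameters, and the hashing scheme are illustrative assumptions and are not taken from Musket, Bless, Bloocoo, or Lighter.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: set membership with possible false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


solid_kmers = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in ["ACGT", "CGTA", "GTAC"]:
    solid_kmers.add(kmer)
print("ACGT" in solid_kmers)
```

An inserted k-mer is always reported as present, while an absent k-mer may occasionally be reported as present. This trade of exactness for memory is consistent with the memory-efficiency advantages listed in the table.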
Synthetic paired-end Illumina sequencing datasets simulated using ART
| Organism (dataset ID) | Accession number of reference genome assembly | Read length (bp) | Genome coverage | Fragment/insert size | Error rate (%) | Genome size (MB) |
|---|---|---|---|---|---|---|
| EC-1 | GCF_000005845.2 (ASM584v2) | 36 | 70× | 200 | 0.866 | 4.6 |
| EC-2 | GCF_000005845.2 (ASM584v2) | 36 | 20× | 200 | 0.866 | 4.6 |
| EC-3 | GCF_000005845.2 (ASM584v2) | 100 | 20× | 200 | 0.952 | 4.6 |
| BC-1 | GCF_000007825.1 (ASM782v1) | 56 | 50× | 200 | 0.175 | 5.4 |
| BC-2 | GCF_000007825.1 (ASM782v1) | 100 | 120× | 300 | 0.109 | 5.4 |
| DM | GCF_000001215.4 (Release 6) | 100 | 10× | 300 | 0.854 | 143 |

Read length, genome coverage, fragment/insert size, and error rate are ART simulation parameters.
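The simulation parameters determine the approximate size of each dataset. The following Python sketch estimates the number of reads and the expected number of erroneous bases, assuming that 1 MB corresponds to 10^6 bases and that errors are spread uniformly at the stated per-base rate. The function name is hypothetical and the calculation is a back-of-the-envelope estimate, not part of ART.

```python
def simulated_dataset_size(genome_size_mb, coverage, read_length, error_rate_pct):
    """Estimate read count and expected erroneous bases for a simulated dataset."""
    total_bases = genome_size_mb * 1e6 * coverage       # bases sequenced in total
    num_reads = total_bases / read_length                # reads needed for that depth
    expected_error_bases = total_bases * error_rate_pct / 100
    return num_reads, expected_error_bases


# First dataset row: 4.6 MB genome, 70x coverage, 36 bp reads, 0.866 % error rate.
num_reads, error_bases = simulated_dataset_size(4.6, 70, 36, 0.866)
print(round(num_reads), round(error_bases))
```

For the first dataset this gives roughly 8.9 million reads and about 2.79 million erroneous bases, which is in the same range as the TP + FN totals reported for that dataset in the performance table (for example 2335361 + 451889 = 2787250 for Reptile).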
Fig. 2 Workflow of error correction performance analysis using ECET (Error Correction Evaluation Toolkit [15]). See http://aluru-sun.ece.iastate.edu/doku.php?id=ecr for more information
Performance analysis of six k-spectrum-based error correctors as evaluated using six synthetic Illumina datasets
| Dataset | Method | TP | FP | FN | Recall | Gain | Precision | F-score |
|---|---|---|---|---|---|---|---|---|
| EC-1 | Reptile | 2335361 | 144751 | 451889 | 0.8378 | 0.7859 | 0.9416 | 0.8867 |
| 36 bp | Lighter | 2695425 | 72843 | 91825 | 0.9671 | 0.9409 | 0.9737 | 0.9704 |
| 70× | Bless | 2624659 | 48342 | | | | 0.9819 | |
| | Bloocoo | 2411701 | | 375549 | 0.8653 | 0.8573 | | 0.9238 |
| | Musket | | 61096 | 85365 | 0.9694 | 0.9474 | 0.9779 | 0.9736 |
| | Trowel | 1246340 | 705438 | 1539825 | 0.4473 | 0.1941 | 0.6386 | 0.5261 |
| EC-2 | Reptile | 681551 | 140039 | 114910 | 0.8557 | 0.6799 | 0.8296 | 0.8424 |
| 36 bp | Lighter | 108241 | 58579 | 688220 | 0.1359 | 0.0624 | 0.6488 | 0.2247 |
| 20× | Bless | | 18095 | | | | 0.9773 | |
| | Bloocoo | 689322 | | 107139 | 0.8655 | 0.8574 | | 0.9239 |
| | Musket | 767087 | 18182 | 29374 | 0.9631 | 0.9403 | 0.9768 | 0.9699 |
| | Trowel | 434885 | 19167 | 361576 | 0.5460 | 0.5220 | 0.9578 | 0.6955 |
| EC-3 | Reptile | 105 | 461 | 876053 | 0.0001 | -0.0004 | 0.1855 | 0.0002 |
| 100 bp | Lighter | 858125 | 2446 | 18033 | 0.9794 | 0.9766 | 0.9972 | 0.9882 |
| 20× | Bless | 746 | 872860 | 875412 | 0.0008 | -0.9954 | 0.0009 | 0.0009 |
| | Bloocoo | 79790 | 3644539 | 796368 | 0.0911 | -4.0686 | 0.0214 | 0.0347 |
| | Musket | | | | | | | |
| | Trowel | 155 | 178354 | 876003 | 0.0002 | -0.2034 | 0.0009 | 0.0003 |
| BC-1 | Reptile | 382043 | 22303 | 16602 | 0.9584 | | 0.9448 | |
| 56 bp | Lighter | 331759 | | | 0.7008 | 0.6682 | | 0.8086 |
| 50× | Bless | | 34018 | 11943 | | 0.8958 | 0.9265 | 0.9492 |
| | Bloocoo | 410156 | 24127 | 63221 | 0.8664 | 0.8155 | 0.9444 | 0.9038 |
| | Musket | 355015 | 47460 | 118362 | 0.7500 | 0.6497 | 0.8821 | 0.8107 |
| | Trowel | 55277 | 4976 | 26744 | 0.6739 | 0.6133 | 0.9174 | 0.7770 |
| BC-2 | Reptile | 497425 | 116 | 208081 | 0.7051 | 0.7049 | 0.9998 | 0.8269 |
| 100 bp | Lighter | 698089 | 159 | 7417 | 0.9895 | 0.9893 | 0.9998 | 0.9946 |
| 120× | Bless | – | – | – | – | – | – | – |
| | Bloocoo | 27409 | 1278837 | 678097 | 0.0389 | -1.7738 | 0.0210 | 0.0272 |
| | Musket | | | | | | | |
| | Trowel | 652845 | 108 | 52661 | 0.9254 | 0.9252 | 0.9998 | 0.9612 |
| DM | Reptile | | 187733 | | | | 0.9842 | |
| 100 bp | Lighter | 42 | 23055867 | 12224293 | 0.0000 | -1.8861 | 0.0000 | 0.0000 |
| 10× | Bless | 11122683 | | 1101652 | 0.9099 | 0.8995 | | 0.9477 |
| | Bloocoo | – | – | – | – | – | – | – |
| | Musket | 11550483 | 163838 | 673852 | 0.9449 | 0.9315 | 0.9860 | 0.9650 |
| | Trowel | 1197127 | 384403 | 11027208 | 0.0979 | 0.0665 | 0.7569 | 0.1734 |
In the first column, dataset ID, read length, genome coverage, and the optimal k estimated using KmerGenie are shown. The values in TP, FP, and FN columns are numbers of bases. Italicized values denote the best performer with regard to a specific evaluation measure for a dataset. The symbol “–” indicates that a method failed to process a specific dataset
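The derived columns can be recomputed from the base counts. The Python sketch below assumes the definitions commonly used in error-correction benchmarking: recall = TP/(TP + FN), precision = TP/(TP + FP), gain = (TP − FP)/(TP + FN), and F-score as the harmonic mean of precision and recall. These formulas are not stated in this record, and the function name is illustrative, but they reproduce the tabulated values for the rows checked.

```python
def correction_metrics(tp, fp, fn):
    """Compute recall, precision, gain, and F-score from base counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    gain = (tp - fp) / (tp + fn)   # negative when more errors are introduced than removed
    f_score = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision,
            "gain": gain, "f_score": f_score}


# Lighter on EC-1: TP = 2695425, FP = 72843, FN = 91825.
metrics = correction_metrics(2695425, 72843, 91825)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions, a negative gain such as that of Bloocoo on EC-3 means the tool introduced more erroneous bases (FP) than it corrected (TP).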
Fig. 3 Impact of read length (a), coverage depth (b), and genome size (c) on the performance of six k-spectrum-based error correction methods. The six datasets are reordered according to the factor examined in order to show visually the effect of each factor on F-score for each method (see Table 3 for dataset, method, and F-score information)