Keith Mitchell, Jaqueline J Brito, Igor Mandric, Qiaozhen Wu, Sergey Knyazev, Sei Chang, Lana S Martin, Aaron Karlsberg, Ekaterina Gerasimov, Russell Littman, Brian L Hill, Nicholas C Wu, Harry Taegyun Yang, Kevin Hsieh, Linus Chen, Eli Littman, Taylor Shabani, German Enik, Douglas Yao, Ren Sun, Jan Schroeder, Eleazar Eskin, Alex Zelikovsky, Pavel Skums, Mihai Pop, Serghei Mangul.
Abstract
BACKGROUND: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.
Year: 2020 PMID: 32183840 PMCID: PMC7079412 DOI: 10.1186/s13059-020-01988-3
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Study design for benchmarking computational error-correction methods. a Schematic representation of the goal of error correction algorithms. Error correction aims to fix sequencing errors while maintaining the data heterogeneity. b Error-free reads for the gold standard were generated using UMI-based clustering. Reads were grouped based on matching UMIs and corrected by consensus, where an 80% majority was required to correct sequencing errors without affecting naturally occurring single nucleotide variations (SNVs). c Framework for evaluating the accuracy of error-correction methods. Multiple sequence alignment between the error-free, uncorrected (original), and corrected reads was performed to classify bases in the corrected read. Each base falls into one of five categories: trimming, true negative (TN), true positive (TP), false negative (FN), or false positive (FP)
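The UMI-based consensus step described in panel b can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `umi_consensus`, the `min_fraction` parameter, the requirement that reads in a cluster are pre-aligned and of equal length, and the choice to write `N` where no base reaches the 80% majority are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_fraction=0.8):
    """Group (umi, sequence) pairs by UMI and build a per-column consensus.

    Hypothetical helper: a column is resolved only when one base reaches
    min_fraction of the cluster; otherwise 'N' is written (an assumption,
    the source does not say how unresolved positions are handled).
    """
    clusters = defaultdict(list)
    for umi, seq in reads:
        clusters[umi].append(seq)

    consensus = {}
    for umi, seqs in clusters.items():
        # Assumption: reads sharing a UMI are already aligned, same length.
        length = len(seqs[0])
        if any(len(s) != length for s in seqs):
            continue  # skip clusters this simple sketch cannot handle
        bases = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            bases.append(base if count / len(seqs) >= min_fraction else "N")
        consensus[umi] = "".join(bases)
    return consensus


example_reads = (
    [("AAC", "ACGT")] * 4      # four reads agree
    + [("AAC", "ACGA")]        # one read carries an error in the last base
    + [("GGT", "TTTT"), ("GGT", "TTAT")]  # 50/50 split at the third base
)
print(umi_consensus(example_reads))
```

In this toy input the `AAC` cluster reaches the 80% threshold at every position, while the `GGT` cluster has one position with no majority, which the sketch leaves unresolved rather than guessing.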
Overview of the gold standard datasets
| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| Technology | Whole genome sequencing | T cell receptor sequencing | T cell receptor sequencing | Viral sequencing | Viral sequencing |
| Heterogeneity | Low | High | High | High | High |
| Technique | Simulated | UMI-based error-free reads | Simulated | UMI-based error-free reads | Haplotype-based error-free reads |
| Number of samples | 12 | 8 | 6 | 1 | 11 |
Summary of error-correction methods’ parameters and publication details. Error-correction methods are sorted by the year of publication (indicated in the column “Published year”). We documented the underlying algorithm (indicated in the column “Underlying algorithm”), version of the error correction tool used (indicated in the column “Version”), and the name of the software tool (indicated in the column “Software tool”)
| Software tool | Version | Underlying algorithm | Published year | Programming language | Default k-mer size |
|---|---|---|---|---|---|
| Coral | 1.4.1 | Multiple Sequence Alignment (MSA) | 2011 | C | N/A |
| SGA | 0.10.15 | FM-index search | 2012 | C++ | 31 |
| Musket | 1.1 | k-mer spectrum | 2012 | C++ | N/A |
| Racer | 1.0.1 | k-mer spectrum | 2013 | C++ | N/A |
| Bless | 1.02 | k-mer spectrum | 2014 | C++ | N/A |
| Lighter | 1.1.1 | k-mer spectrum | 2014 | C++ | N/A |
| Fiona | 0.2.8 | k-mer spectrum | 2014 | C++ | N/A |
| BFC | 1 | k-mer spectrum | 2015 | C | N/A |
| Pollux | 1.0.2 | k-mer spectrum | 2015 | C | 31 |
| RECKONER | 0.2.1 | k-mer spectrum | 2017 | C++ | N/A |
Summary of technical characteristics of the error-correction methods assessed in this study
| Software tool | Data structure | Types of reads accepted | Organism | Journal | In the publication compared to | Tools webpage |
|---|---|---|---|---|---|---|
| Bless | Bloom filter and hash table | SE/PE | Human, | | SGA, QuorUM, Lighter, BFC, DecGPU, ECHO, HiTEC, Musket, Quake, Reptile | |
| Fiona | Partial suffix array | SE | Human, | | Allpaths-LG, Coral, H-Shrec, ECHO, HiTEC, Quake | |
| Pollux | Hash table | SE/PE | Human, | | Quake, SGA, Bless, Musket, Racer | |
| BFC | Bloom filter and hash table | SE/PE | Human, | | Bless, Bloocoo, fermi2, Lighter, Musket, and SGA | |
| Lighter | Bloom filter | SE/PE | Human, | | Quake, Musket, Bless, Soapec | |
| Musket | Bloom filter and hash table | SE/PE | Human, | | SGA, Quake | |
| Racer | Hash table | SE/PE | Human, | | Coral, HiTEC, Quake, Reptile, SHREC | |
| Coral | Hash table | SE/PE | Human, | | COMPASS 3.0, HHalign 1.5.1.1 and PSI-BLAST | |
| RECKONER | Hash table | SE | Human, | | Ace, BFC, Bless, Blue, Karect, Lighter, Musket, Pollux, Racer, Trowel | |
| SGA | FM-index | SE/PE | Human, | | Velvet, ABySS, SOAPdenovo, Quake, HiTEC | |
Evaluation of the accuracy of error-correction methods
| Metric name | Metric formula |
|---|---|
| Precision | |
| Sensitivity | |
| Gain | |
| Trim percent | |
| Trim efficiency | |
a. Precision evaluates the proportion of proper corrections among the total number of performed corrections. INDEL refers to insertion/deletion polymorphism. b. Sensitivity evaluates the proportion of fixed errors among all existing errors in the data. c. Gain indicates whether an algorithm produces an overall benefit (more TP than FP) or has a negative effect (more FP than TP). Values above 0.0, up to and including 1.0, represent a benefit; 0.0 is neutral; and values below 0.0 represent a negative effect. d. Trim percent is the proportion of nucleotides trimmed out of all nucleotides analyzed. e. Trim efficiency is the proportion of bases trimmed by the tool that were considered to be TP trimming
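The base categories and the first three metrics can be sketched in Python as follows. The formula cells of the table did not survive here, so the expressions below are the conventional definitions that match the footnotes (precision = TP/(TP+FP), sensitivity = TP/(TP+FN), gain = (TP−FP)/(TP+FN)); the paper's exact formulas, in particular any separate handling of INDELs and trimming, may differ. The function names `classify_bases` and `accuracy_metrics` are hypothetical, and the sketch assumes three gap-free, equal-length, already aligned strings.

```python
from collections import Counter


def classify_bases(truth, original, corrected):
    """Count TP/FP/FN/TN over aligned positions (no gaps, no trimming).

    Simplifying assumption: an original error that is still wrong after
    correction counts as FN, whether or not the tool changed it.
    """
    counts = Counter(TP=0, FP=0, FN=0, TN=0)
    for t, o, c in zip(truth, original, corrected):
        if o != t:                      # position held a sequencing error
            counts["TP" if c == t else "FN"] += 1
        else:                           # position was already correct
            counts["TN" if c == t else "FP"] += 1
    return counts


def accuracy_metrics(counts):
    """Conventional precision, sensitivity, and gain from base counts."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "gain": ratio(tp - fp, tp + fn),
    }


counts = classify_bases(
    truth="ACGTACGT",
    original="ACCTACGA",   # errors at positions 2 and 7
    corrected="ACGTTCGA",  # fixes position 2, breaks position 4, misses 7
)
print(dict(counts), accuracy_metrics(counts))
```

With one fixed error, one newly introduced error, and one missed error, the gain in this toy case is 0.0, the neutral value described in footnote c.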
Fig. 2 Correcting errors in whole genome sequencing data (D1 dataset). For each tool, the best k-mer size was selected. a–f WGS human data. g–l WGS E. coli data. a, g Heatmap depicting the gain across various coverage settings. Each row corresponds to an error correction tool, and each column corresponds to a dataset with a given coverage. b, h Heatmap depicting the precision across various coverage settings. Each row corresponds to an error correction tool, and each column corresponds to a dataset with a given coverage. c, i Heatmap depicting the sensitivity across various coverage settings. Each row corresponds to an error correction tool, and each column corresponds to a dataset with a given coverage. d, j Scatter plot depicting the number of TP corrections (x-axis) and FP corrections (y-axis) for datasets with 32x coverage. e, k Scatter plot depicting the number of FP corrections (x-axis) and FN corrections (y-axis) for datasets with 32x coverage. f, l Scatter plot depicting the sensitivity (x-axis) and precision (y-axis) for datasets with 32x coverage
Fig. 3 Correcting errors in TCR-Seq data (D2 dataset). For all plots, the mean value across 8 TCR-Seq samples is reported for each tool. a Bar plot depicting the gain across various error-correction methods. b Scatter plot depicting the number of TP corrections (x-axis) and FP corrections (y-axis). c Scatter plot depicting the number of FP corrections (x-axis) and FN corrections (y-axis). d Scatter plot depicting the sensitivity (x-axis) and precision (y-axis) of each tool
Fig. 4 Correcting errors in viral sequencing data (D4 dataset). For all plots, the best k-mer size was selected. a Bar plot depicting the gain across various error-correction methods. b Scatter plot depicting the sensitivity (x-axis) and precision (y-axis) of each tool