David Laehnemann, Arndt Borkhardt, Alice Carolyn McHardy.
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance on which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
Keywords: bias; error correction; error model; error profile; high-throughput sequencing; next-generation sequencing
Year: 2015 PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Error rates of high-throughput sequencing platforms (per 100 sequenced bases)
| Platform | Subs | SD Subs | Indels | SD Indels | All | SD All |
|---|---|---|---|---|---|---|
| 454 GS FLX | 0.09000 | | 0.90000 | | 0.99000 | |
| 454 GS Junior | 0.05430 | | 0.39055 | | 0.45540 | |
| Complete Genomics | 2.30000 | | 0.01900 | | 2.31900 | |
| Illumina HiSeq | 0.11238 | | 0.02351 | | 0.11875 | |
| Illumina MiSeq | 0.11079 | | 0.01436 | | 0.18867 | |
| Ion Torrent PGM | 0.17253 | | | | | |
| Pacific Biosciences RS | 0.44761 | | 3.29386 | | 3.16667 | |
Data from [10–17]. For all values used from these studies, see the comprehensive Supplementary Table S1.
Only numbers based on a sufficiently large sample size to provide a standard deviation are set in bold.
Subs = substitution errors per 100 bases; Indels = insertion and deletion errors per 100 bases; SD = standard deviation.
(a) Some studies contain only aggregated measures for indels and/or total error. Therefore, the value of All is not necessarily the sum of Subs and Indels.
(b) For these platforms only one sample was available. Therefore, error rates should be considered with care, as SDs are not available.
(c) One study with three samples (out of 12 samples in total) used indel-tolerant mapping, resulting in almost 100% of reads being mapped, but also producing much higher indel error rates. This sample also explains the high SD.
Figure 2. Error rate biases in homopolymers of varying lengths and due to different local GC sequence content. (A) Top panels show the average error rates at homopolymers of different lengths per genome and platform. (B) Bottom panels show error rates across different GC sequence contents of 100-base windows. This figure is aggregated and adapted from Figures 4 and 5 in [13], according to the Creative Commons Attribution license CC-BY 2.0 (http://creativecommons.org/licenses/by/2.0/). A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
Figure 1. Sequencing coverage across different local GC contents in three microbes (P. falciparum, E. coli and R. sphaeroides) and a human genome. The bottom panels show the relative fraction of 100-base windows in the respective genome having a certain GC content. The top panels show the relative sequencing coverage for 100-base windows with a certain GC content compared to the respective platform sample's average. This figure is aggregated and adapted from Figures 2 and 3 in [13], according to the Creative Commons Attribution license CC-BY 2.0 (http://creativecommons.org/licenses/by/2.0/). A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
Figure 4. Deriving a k-mer Spectrum or a Hamming graph from k-mer counts. Some error correction tools work directly with the k-mer frequencies as counted from the read set. Others set a minimum k-mer coverage (2 in this example, green) to consider a k-mer as correct (trusted k-mers, green counts) and then derive a (C) k-mer Spectrum of all trusted k-mers. In this simplified example, this step classifies k-mers from the end of the queried sequence as untrusted (the reference in Figure 3A could be considered the sequence queried by the four reads). By using a Bloom filter, space usage of the k-mer spectrum can be reduced. Another concept used in k-mer approaches is the Hamming graph, where nodes are k-mers from the read set and nodes are connected if the Hamming distance (i.e. the number of base substitutions between them; see also the section ‘Substitutions only versus substitutions plus indels: Hamming versus Levenshtein distance’) is below a given threshold. In this simplified example, k-mers are too short and the Hamming graph therefore connects three correct k-mers. In a real setting, the k-mer length must be chosen with care (section ‘Optimal k-mer length’) and most connected components of the Hamming graph should contain only a single correct k-mer plus k-mers generated from the same sequence with errors. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
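The trusted-k-mer classification and the Hamming graph described in this caption can be sketched in a few lines. The read set, k-mer length and coverage cut-off below are hypothetical (not the figure's example), and the all-pairs Hamming graph construction is a naive illustration; real tools use indexed data structures to avoid comparing every k-mer pair.

```python
from collections import Counter
from itertools import combinations

def kmer_spectrum(reads, k, min_cov=2):
    """Count all k-mers in a read set and split them into trusted
    (coverage >= min_cov) and untrusted k-mers."""
    counts = Counter(read[i:i + k] for read in reads
                     for i in range(len(read) - k + 1))
    trusted = {km for km, c in counts.items() if c >= min_cov}
    return counts, trusted

def hamming(a, b):
    """Number of base substitutions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def hamming_graph(kmers, max_dist=1):
    """Connect k-mers whose Hamming distance is at most max_dist
    (naive all-pairs version, for illustration only)."""
    return [(a, b) for a, b in combinations(sorted(kmers), 2)
            if hamming(a, b) <= max_dist]

# Hypothetical mini read set: two identical reads and one with an error.
reads = ["ACCGT", "ACCGT", "ACCTT"]
counts, trusted = kmer_spectrum(reads, k=4, min_cov=2)
```

With these reads, the error-containing 4-mers occur only once and fall below the coverage cut-off, while the Hamming graph links each of them to a trusted k-mer one substitution away.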
Figure 5. A suffix trie is a tree of all suffixes from the indexed example read set. Every existing suffix can be spelled out by a path from the root node to one of the read indices, indicated by arrowheads and read numbers at corresponding nodes. Numbers at trie edges correspond to the number of suffixes passing through them, i.e. edge weights give the coverage of a sequence from the root down to the following node. For example, the 2-mer ‘CC’ occurs six times in the four example reads.
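A minimal suffix trie with the weights described in this caption might look as follows; the per-node counts play the role of the figure's edge weights, and the reads in the usage example are hypothetical, not the figure's four example reads.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of suffixes passing through this node

def build_suffix_trie(reads):
    """Insert every suffix of every read; each node records how many
    suffixes pass through it (the edge weight of the figure)."""
    root = TrieNode()
    for read in reads:
        for start in range(len(read)):
            node = root
            for base in read[start:]:
                node = node.children.setdefault(base, TrieNode())
                node.count += 1
    return root

def occurrences(root, pattern):
    """Coverage of `pattern` in the read set = weight of the path
    spelling it out from the root."""
    node = root
    for base in pattern:
        if base not in node.children:
            return 0
        node = node.children[base]
    return node.count

# Hypothetical reads; 'CC' occurs once in each read.
root = build_suffix_trie(["ACC", "CCA"])
```

Looking up a string's occurrence count is then a single root-to-node walk, linear in the pattern length.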
Figure 6. Steps for deriving first a suffix array and then the BWT and the FM index from the running example read set. Also given is an example for a string search using BWT and FM index, with the colours purple, red and green tracing corresponding indices and nucleotides. For suffix array construction, a unique termination symbol ($x) is appended to each read and reads are concatenated to a string R in lexicographical order of their termination symbol ($1<$2 < $3 < $4). All possible suffixes are formed and sorted unambiguously, as termination symbols have an order, as do the other symbols $ < A < C < G < T. A suffix array entry at suffix array index i then corresponds to the position in string R at which the i-th (lexicographically) lowest suffix starts. The LCP (longest common prefix) of a suffix array entry and the preceding entry is then recorded and suffix array plus LCP already form an efficient data structure for determining string occurrence frequencies. The BWT enables further compression of the data. Here, an entry at BWT index i corresponds to the symbol before the i-th lowest suffix in R. Together with the FM index, it allows for linear time string searches (and thus also determination of the coverage of a string) in the whole read set. The FM index gives the number of occurrences of each of the symbols up to any index of the BWT and for each symbol counts all occurrences of all lexicographically lower symbols (e.g. 4 + 4 + 10 = 18 for symbol ‘G’, all counts in blue). A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
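The construction and backward-search steps of this caption can be sketched as follows. This is a didactic single-text version rather than the multi-read concatenation with ordered terminators $1 < $2 < $3 < $4 shown in the figure, and it builds the suffix array by naive sorting; real implementations use linear-time construction and compressed occurrence tables.

```python
from collections import Counter

def bwt_via_suffix_array(text):
    """Build the BWT of text + '$' from a naively sorted suffix array."""
    text += "$"                                # unique terminator, sorts first
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)    # symbol before each sorted suffix

def fm_index(bwt):
    """C[c]: count of symbols lexicographically smaller than c;
    occ[c][i]: occurrences of c in bwt[:i]."""
    symbol_counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(symbol_counts):
        C[c] = total
        total += symbol_counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in symbol_counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ

def count_occurrences(bwt, C, occ, pattern):
    """Backward search: occurrence count (= coverage) of pattern in the
    indexed text, in time linear in the pattern length."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + occ[c][lo], C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

bwt = bwt_via_suffix_array("ACCACC")   # hypothetical text, not the figure's R
C, occ = fm_index(bwt)
```

The pattern is consumed right to left, and at each step the (lo, hi) interval of suffix array rows prefixed by the remaining pattern shrinks; its final width is the occurrence count.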
Overview of software implementing the discussed error correction approaches (more details in Supplementary Table S2). This includes software where error correction is only one step of a more complex procedure, such as in software for de novo assembly or haplotype reconstruction. Column Q specifies whether a tool makes use of the base-calling quality scores. Column I specifies whether a tool can correct indels. Values of ‘0.5’ indicate indel correction potential, i.e. that a tool already implicitly corrects some but not all indels or that it could easily be extended to correct them
| Tool | Main approach | Data structure(s) | k choice reasoning | Global | Error model | Correction | Q | I | Further distinctions | |
|---|---|---|---|---|---|---|---|---|---|---|
| Acacia | MSA and clust | Hash of RLE prefixes | 6 (RLE!) | – | Emp hopo uc and oc probs | Iterative MSA, iterative clust refinement using statistical tests, clust cons | – | 0.5 | 6-mers of RLE read prefixes; tests all RL discrepancies using 3 bins: uc, main mode, oc; breaks clust w/ consistent mismatches to cons; includes demultiplexing | |
| AHA | Read alignment | Implicit LR corr in scaffolding | – | 0.5 | Scaffolding of existing contigs from SRs by LRs, initial corr of LRs by short reads | |||||
| ALLPATHS | – | 16, 20 and 24 | – | Dyn: 1st loc min of emp | – | Min Hamming D (Q weighted) corr per read | 1 | – | Read correct only if all | |
| ALLPATHS-LG | – | 24 | Balance uniq versus sensitivity | – | Majority col vote (Q weighted) | 1 | – | 24-mer with 1 base gap in middle (every 13th base corr), contiguous 24-mer: all cols with >6 reads, consistency check of changes within read | ||
| AmpliconNoise | Flowgram and corr reads clust (EM) | – | (i) Emp pyroseq errors, (ii) emp PCR errors (learned confusion matrix) | (i) and (ii): EM of mixture model for flowgram/read generation likelihood of obs freq assuming certain generating sequences | – | 1 | Mixture model with 1 exponential dist component per true seq, learns confusion matrix for PCR error corr from corrected pyroseq reads, aimed at 454 pyrosequencing of amplicons | |||
| ARACHNE | MSA | – | 8–24 | – | Majority col vote (Q weighted), only applied when alt base has strong cov diff | 1 | 1 | Chimera detection if central region with decreased cov (= drop in overlaps with other reads) | ||
| AutoEdit | MSA and chromatogram | – | Resolvedness of chromatogram peaks over regions | Base recall using base freqs from assembly/mapping | 1 | 1 | Base recall, integration of chromatogram context info, aimed at Sanger data | |||
| BayesHammer | Repl sorted | 21 (def) | – | Error prob = Q | Improves Hammer by subclustering of connected components and read-based correction | 1 | – | |||
| BFC | Bloom filter and hash table | 31,55 | Trade-off repeat resolution versus | Man: 3 | – | Walk through read from longest trusted region, find optimal correction by extension with penalties for corr | 1 | – | Refined from fermi; exhaustive search of corr, including corr of trusted | |
| BLESS | Hash table (counting), bloom filter ( | – | (i) # | Dyn: 1st loc min of emp | – | Find solid min edit D path between solid | 1 | 0.5 | Reverses some bloom filter false positive changes, better corr of errors at read ends by read extension | |
| Bloocoo | Bloom filter | 31 (def) | – | Man (def = 3–6) | – | As Musket | – | 0.5 | As Musket; accepts only corr supported by multiple solid | |
| Blue | Hash table (partitioned) | Uniq (manual) | Dyn per-read thr: 1/3 of harmonic mean of | Low-cov, cov drops, end of hopo runs | Left-to-right depth-first traversal of all viable | – | 1 | Decouples | ||
| Coral | MSA | Hash table ( | k ≈ log_4|G| | Uniq, but k < L/2 (existence of correct | – | Corr w/ thr of Q weighted rel base covs | 1 | 1 | Considers MSA Q | |
| CUDA-EC | Counting bloom filter | 20 | ‘Illumina def’ | Man: assume Poisson dist of true | – | Min Hamming D corr, max # | 1 | – | Intro of bloom filter and CUDA parallelization | |
| DecGPU | Bloom filter | 21 (def) | – | Man (def = 6) | – | Majority col vote | – | – | CUDA and MPI parallelization | |
| DeNoiser | Flowgram clust (EM) | – | (AmpliconNoise:(i)) | (AmpliconNoise:(i)) | – | 1 | Fast pre-clust using exact prefixes, aimed at 454 pyrosequencing of amplicons | |||
| ECHO | MSA | Hash table ( | k ≈ L/6 | Trade-off discussion | Read pos specific confusion matrix (EM estimated) | Max a posteriori estimate of base call given base covs at col weighted by confusion matrix | – | – | Heterozygosity model | |
| ECTools | MSA | – | – | Corr of LRs with unitigs from pre-assembled SRs | – | 1 | (i) Unitigs from SRs; (ii) align LRs to unitigs; (iii) optimization using longest increasing subsequence; (iv) correction to unitigs | |||
| EDAR | – | – | – | Dyn: 1st loc min of emp | – | Removes error bases (splits reads) | – | 0.5 | GC content adjustment (per | |
| EULER | – | 20, 100 | – | Man: X | – | Min Hamming D corr, max # | – | – | Initial usage of | |
| EULER | – | 15–20 | Shorter | Man: X | – | Min Levenshtein D corr per read, dyn programming | – | 1 | ||
| EULER-SR | – | 15 | – | Man: assume Poisson dist of true | – | Min Hamming D corr, max # | – | – | Iterative relaxation of | |
| EULER-USR | Repeat graph (simplified de Bruijn) of read prefixes | 20 | – | Dyn: 1st loc min of mixed Poisson and Gaussian dist fit | – | Min Hamming D corr, max # | – | – | ||
| fermi | Suffix array (BWT and FMD index), hash table ( | 23 (def) | – | Man: def=3 | – | Prefix majority vote for low Q base occ | 1 | – | ||
| Fiona | Suffix array (partial) | Various | As HiTEC | Dyn: from | Hierarchical statistical model | Per possible corr: majority vote of all overlapping pos of correct reads, greedy among diff corr pos within read | – | 1 | Details of statistical model: see | |
| FreClu | Clust reads w/ Hamming D | Hash table, repl sorted read lists (Hamming D) | Read pos and Q specific base confusion matrix | Count corr | 1 | – | Cluster reads by Hamming distance, create cluster tree from covs, then maps only root read, extends POLYBAYES error model, aimed at RNAseq | |||
| Hammer | Repl sorted | 55 (def) | Trade-off discussion | – | Corr of | 1 | – | No uniform cov assumption, singletonCutoff for min cov (def = 1), saveCutoff to keep multiple | ||
| HECTOR | k-hopo spec | Bloom filter and hash table | 21 (def, k-hopo) | – | Dyn: 1st loc min of emp | – | As Musket, but using k-hopos to test for base trust | – | 1 | k-hopos instead of |
| HiTEC | Suffix array | Various | Min false negatives or false positives given per-base error rate | Dyn: from |G|, L, N, substitution error prob | – | Unambiguous corr: majority vote, break ties by 2 nt lookahead | – | – | Iterates over various | |
| Hybrid-SHREC | suffix trie | (SHREC) | (SHREC) | Adjusted to current k by tuning parameter alpha | – | (SHREC) plus look up and down one trie level (rerooting for indel corr) | – | 1 | Inspects multiple k | |
| KEC | Hash | 25 (def) | Clear weak/solid separation versus error resolution | Dyn: end of 1st stretch of 0 cov in emp | – | Min edits per read in weak regions with max resulting | – | 1 | Identify error regions as in EDAR; iterate over rounds of error corr; eliminate very low-frequency haplotypes from MSA of unique reads; not benchmarked against other | |
| Lighter | Bloom filter | 23 | Max gain in correct bases | – | as BLESS | 1 | – | Three passes over reads: (i) subsample | ||
| LoRDEC | de Bruijn graph in bloom filters | 17–21 (def) | – | Man (def = 2–3) | – | Corr of weak LR regions by traversal of SR de Bruijn graph | – | 1 | de Bruijn graph from solid | |
| LSC | MSA | – | Hybrid read set cons corr of RLE compressed reads | – | 1 | (i) RLE compress SRs and LRs, (ii) map RLE SRs onto RLE LRs, (iii) correct LRs to MSA cons and trim to SR covered regions | ||||
| MisEd | MSA | Table ( | – | Majority vote if col NOT defined nucleotide pos (prob model to test coincidence of deviations) | 1 | – | Alt base linkage within reads (differentiation repeats/polymorphisms versus errors) | |||
| Musket | Bloom filter into hash table | 21 (def) | – | Dyn: 1st loc min of emp | – | Two-sided conservative, one-sided aggressive and voting-based refinement | – | – | Corr unambiguous errors if change makes leftmost and rightmost overlap | |
| MyHybrid | MSA | suffix array | k ≈ log_4(200*|G|) | (Quake) | Dyn: half the exp | – | Majority col vote | – | 1 | |
| N-corr | Hopo pattern in 454 reads | – | Hopo pattern | – | – | – | ||||
| Nanocorr | MSA | – | – | Cons of SR alignment | – | 1 | Align SRs to LRs, dynamically find optimal alignment | |||
| pacbio_qc | Read filter (SVM) | See LIBSVM package | Mean Q and CCS pass # of read | SVM regression trained on known spike-in | 1 | |||||
| PBcR | MSA | – | Hybrid read set corr | – | 0.5 | (i) Map SRs onto LRs, (ii) correct LRs to MSA cons | ||||
| Potts model | Reads clust w/ Hamming D | – | Error prob = Q | Max likelihood Potts model estimation on Hamming neighbourhoods by clust col pos | 1 | – | ||||
| PREMIER | – | Optimize performance | Uniq versus sufficient cov | Max likelihood of seq generation w/ HMM | 1 | – | ||||
| proovread | Read mapping | Alignment matrix per LR | Mapping w/ penalties portraying error probs | Cons calling by majority vote | 1 | Penalties for mapping reflect PacBio error probs; quality scores from SR base support; chimera detection; iterative w/ increasing sensitivity | ||||
| PSAEC | Suffix array (partial) | 20 (ex) | – | Dyn: from |G|, L, N, substitution error prob | – | (HiTEC) | – | – | Runtime and memory optimization by using only partial suffix array | |
| PyroNoise | Flowgram clust | Emp pyroseq errors | EM of mixture model for flowgram generation likelihood of obs freq assuming certain generating sequences | |||||||
| Quake | Bit array index ( | k ≈ log_4(200*|G|) | Prob = 0.01 for | Dyn: mixed dist model fit and likelihood ratio | Q specific confusion matrix (learned from unambiguous initial corrections) | Max likelihood | 1 | – | 1st proper discussion of | |
| QuorUM | – | 24 (def) | – | – | Trim if no alt base with cov, correct if only one alt base with cov, one-pos lookahead if multiple alts with non-zero cov, break ties by cov continuity check | 1 | – | No uniform cov assumption: correcting | ||
| RACER | Hash table | ‘Computed from |G|’ | – | Dyn: from |G| | – | Unambiguous weak to solid corrs | – | – | ||
| RECOUNT | Read counts (EM) | – | Average of Q values in alignment col gives read and pos specific error prob | Count corr | 1 | EM of exp read counts | ||||
| REDEEM | Sparsehash | 11 (def) | Uniq of non-repeat | Man (def = 20), BUT dyn freq counts | Learned sparsified | Set each read pos to nt with max prob over all covering | – | – | Compute exp | |
| Reptile | Repl sorted | k ≈ log_4|G| or 10 ≤ k ≤ 16 | Uniq of average | Man: high and medium confidence | – | Min Hamming D corr of two consecutive | 1 | – | ||
| SEECER | MSA, col clust and HMM | Hash table ( | 17 (def) | Learn HMM parameters for each MSA contig | One HMM per contig, read assignments to HMMs by log likelihood from Viterbi's alg | – | 1 | Separates polymorphisms from errors by spec clust and spec relaxation of | ||
| SGA | Suffix array (BWT and FM index) | 31 (def) | Emp choice from error corr on read subset with various (‘sga stats’) | Dyn: loc min of emp | – | Leftmost and rightmost overlap | 1 | – | Optional to use only base pos of PHRED Q above 20 for | |
| SGA | MSA | Suffix array (BWT and FM index) | – | Corr conflicting cols if single base above corr cov thr (def=3) | 1 | – | Checks mismatch linkage (conflict cols of read = multiple alt bases above conflict cov thr; exclude reads with mismatch at all conflicts from read's corr MSA) | |||
| ShoRAH | MSA and read clust | Hash tables | – | Loc majority rule within haplotype (in three windows overlapping each seq pos) | – | 1 | Clust reads into haplotypes with Gibbs sampler of posterior dist of Dirichlet process mixture that models haplotypes | |||
| SHREC | Suffix trie | [min {log_4(|G|), log_4(n)} + q] ≤ k ≤ s | n = #reads, q: (1/4)^q < p, s: low cov thr for trie level | Adjusted to current k by tuning parameter x | – | Reroot branch if actual node cov below exp node cov (±SD*tuning param) for current trie level | – | – | Inspects multiple k | |
| SleepEC | Read freq and Hamming graph | Lists of nodes, node properties, node links | Read length | Read pos specific base confusion matrix | Stat test of actual read abundance versus predicted | – | – | For RNAseq reads; builds Hamming graph, learn base substitution matrix from trusted connected components | ||
| SOAPdenovo | Hash table ( | k ≈ log_4|G| | Uniq of average | Man | – | Min Hamming D corr per read by extending high-cov region, majority col vote, dyn programming | – | – | ||
| SOAPdenovo2 | Hash or index table | k ≈ log_4(20*|G|) | Man | – | Col voting on unambiguous errors, voting on possible change paths (rooted at correct pos) for ambiguous errors | – | – | |||
| SysCall | MSA col classifier (logistic regression) | Matrix (rows = pot het pos, cols = features) | Posterior prob based on (i) Q diff to pos neighbours, (ii) seq context, (iii) strand bias | Logistic regression model, distinguish errors from het at each pot het pos (thr 0.5) | 1 | – | ||||
| Trowel | Hash table ( | As Quake | As Quake | – | Error prob = Q | Compare Musket: (i) gapped | 1 | – |
Abbreviations: # = number of, alt = alternative, alg = algorithm, BWT = Burrows Wheeler transform, CCS = circular consensus sequence, clust = cluster(ing), col = column, cons = consensus, corr = correction, cov = coverage, D = distance, def = default, diff = difference, dist = distribution, dyn = dynamic, EM = expectation maximization, emp = empirical, exp = expected, FM = Ferragina Manzini, freq = frequency, |G| = genome size, HMM = hidden Markov model, hopo = homopolymer, I = indel = insertion/deletion, k = k-mer length, L = read length, LR = long read, loc = local, man = manual, max = maximum, min = minimum, MSA = multiple sequence alignment, N = #reads, na = not available, obs = observed, oc = overcall, occ = occurrence, pos = position, pot = potential, prob = probability, Q = quality, rel = relative, repl = replicated, RLE = run-length encoding, spec = spectrum, SR = short read, SVM = support vector machine, thr = threshold, uc = undercall, uniq = uniqueness, X = experience.
Citations and software URLs of error correction tools
| Tool | Author, year | Citation | Software URL |
|---|---|---|---|
| Acacia | (Bragg | [ | |
| AHA | (Bashir | [ | |
| ALLPATHS | (Butler | [ | |
| ALLPATHS-LG | (Gnerre | [ | |
| AmpliconNoise | (Quince | [ | |
| ARACHNE | (Batzoglou, 2002) | [ | |
| AutoEdit | (Gajer, 2004) | [ | not available (any more) |
| BayesHammer | (Nikolenko | [ | |
| BFC | (Li, 2015) | [ | |
| BLESS | (Heo | [ | |
| Bloocoo | (Drezen | [ | |
| Blue | (Greenfield | [ | |
| Coral | (Salmela and Schroder, 2011) | [ | |
| CUDA-EC | (Shi | [ | |
| DecGPU | (Liu | [ | |
| DeNoiser | (Reeder and Knight, 2010) | [ | |
| ECHO | (Kao | [ | |
| ECTools | (Lee | [ | |
| EDAR | (Zhao | [ | not available |
| EULER | (Pevzner | [ | |
| EULER | (Chaisson | [ | |
| EULER-SR | (Chaisson and Pevzner, 2008) | [ | |
| EULER-USR | (Chaisson | [ | |
| fermi | (Li, 2012) | [ | |
| Fiona | (Schulz | [ | |
| FreClu | (Qu | [ | |
| Hammer | (Medvedev | [ | |
| HECTOR | (Wirawan | [ | |
| HiTEC | (Ilie | [ | |
| Hybrid-SHREC | (Salmela, 2010) | [ | |
| KEC | (Skums | [ | |
| Lighter | (Song | [ | |
| LoRDEC | (Salmela and Rivals, 2014) | [ | |
| LSC | (Au | [ | |
| MisEd | (Tammi, 2003) | [ | not available |
| Musket | (Liu | [ | |
| MyHybrid | (Zhao | [ | not available |
| N-corr | (Shin and Park, 2014) | [ | |
| Nanocorr | (Goodwin | [ | |
| pacbio_qc | (Jiao, 2013) | [ | |
| PBcR | (Koren | [ | |
| Potts model | (Aita | [ | not available |
| PREMIER | (Yin | [ | not available |
| proovread | (Hackl | [ | |
| PSAEC | (Zhao | [ | not available |
| PyroNoise | (Quince | [ | |
| Quake | (Kelley | [ | |
| QuorUM | (Marçais | [ | |
| RACER | (Ilie and Molnar, 2013) | [ | |
| RECOUNT | (Wijaya | [ | not available (any more) |
| REDEEM | (Yang | [ | |
| Reptile | (Yang | [ | |
| SEECER | (Le | [ | |
| SGA | (Simpson and Durbin, 2012) | [ | |
| ShoRAH | (Zagordi | [ | |
| SHREC | (Schroder | [ | |
| SleepEC | (Sleep | [ | |
| SOAPdenovo | (Li | [ | |
| SOAPdenovo2 | (Luo | [ | |
| SysCall | (Meacham | [ | |
| Trowel | (Lim | [ |
Figure 3. Overview of how to generate a pileup from a read set, depending on the error correction strategy. (A) If a good and close reference is known, reads can be mapped to it. Otherwise, one of the other approaches is necessary: (B) A MSA of reads can be formed from pairwise alignments of all read pairs, of all reads with an overlap in an initial mapping to an available reference (dashed grey arrow from A to B), of all reads sharing part of a suffix (dashed grey arrow from F to B) or of all read pairs sharing a k-mer seed, identified by a table recording all reads that each k-mer occurs in (grey table and respective dashed grey arrows). Also, a simple recording of the count of all k-mers can be used to derive (C) a k-mer Spectrum or (D) a Hamming graph (Figure 4), and read suffixes of reads augmented with unique symbol ($x) can be used to construct (E) a suffix trie (Figure 5) or (F) a suffix array (Figure 6).
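The grey table of this caption, recording all reads that each k-mer occurs in, can be sketched as a simple inverted index; read identifiers and the k value below are illustrative. Read pairs sharing a seed are the candidates for the pairwise alignments that seed the MSA.

```python
from collections import defaultdict

def kmer_read_index(reads, k):
    """For every k-mer, record the set of read indices it occurs in
    (a sketch of the figure's k-mer-to-reads table)."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    return index

def overlap_candidates(reads, k):
    """Read pairs sharing at least one k-mer seed, i.e. candidates for
    pairwise alignment during MSA construction."""
    index = kmer_read_index(reads, k)
    pairs = set()
    for rids in index.values():
        for a in rids:
            for b in rids:
                if a < b:
                    pairs.add((a, b))
    return pairs
```

With hypothetical reads `["ACGTAC", "GTACGG", "TTTTTT"]` and k = 4, only the first two reads share a seed (`GTAC`), so only that pair would be aligned.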
Figure 7. k-mer coverage histogram with a model fit. The histogram in this plot from the Quake paper [70] gives a nice example of an empirical k-mer coverage distribution. The density tells us which proportion of all existing k-mers in the data set has a particular coverage. The solid line gives the Quake model fit. The first peak of the distribution is formed by very low coverage error k-mers and is usually modelled by a Poisson or a Gamma distribution. The second peak results from the majority of correct k-mers and is usually modelled by a Poisson or a Gaussian distribution. Between these two peaks, a clear local minimum can provide a k-mer trust coverage cut-off. The heavy tail of higher multiplicity k-mers is the result of k-mers from sequence repeats. Quake draws the k-mers' sequence copy numbers from a Zeta distribution and then projects these k-mers into the coverage range of the correct k-mers. Adapted by font change and label addition from [70], according to the Creative Commons Attribution license CC-BY 2.0 (http://creativecommons.org/licenses/by/2.0/).
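Computing an empirical k-mer coverage histogram and locating the local minimum between the error peak and the correct-k-mer peak can be sketched as follows; this is the simple "first local minimum" heuristic used by several tools in the table above, not Quake's mixture-model fit, and the histogram in the usage example is synthetic.

```python
from collections import Counter

def coverage_histogram(reads, k):
    """hist[c] = number of distinct k-mers occurring exactly c times."""
    counts = Counter(read[i:i + k] for read in reads
                     for i in range(len(read) - k + 1))
    multiplicities = Counter(counts.values())
    return [multiplicities.get(c, 0) for c in range(max(multiplicities) + 1)]

def first_local_minimum(hist):
    """Trust cut-off: first multiplicity at which the histogram stops
    falling, separating low-coverage error k-mers from correct k-mers."""
    for c in range(1, len(hist) - 1):
        if hist[c] <= hist[c + 1]:
            return c
    return 1
```

k-mers with coverage below the returned cut-off would be treated as untrusted; on real data the histogram is much smoother and the minimum correspondingly clearer.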
Figure 8. Example of a weighted de Bruijn graph from the example read set. The read set is augmented to include reads that show variation compared to the example read set in the earlier figures: while the Ts (red and orange) could be substitution errors and the G (purple) could be an insertion error, all three could also be reads covering alternative alleles of the same sequence locus or slightly different repeats at other sequence loci. The new reads create graph structures that are commonly removed in graph pruning (and thus correction) steps: (i) the orange T creates a bulge, a cycle in the graph that cannot be traversed by a single path, as edges of the cycle go in opposing directions (e.g. edges ACC and TCC both ending at node CC); (ii) the red T creates a tip, a short dead end of the graph; (iii) the purple G creates a whirl, a cycle in the graph with a possible path going around the whole cycle (i.e. all edges go in the same direction). A repeat graph would have small (and presumably erroneous) whirls, bulges and tips removed and repeats collapsed, but care must be taken to not remove genuine variation. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
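Building the weighted de Bruijn graph of this caption and detecting tips (dead ends) can be sketched as below; bulge and whirl detection require cycle analysis and are omitted, and the reads in the usage example are hypothetical, not the figure's read set.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Weighted de Bruijn graph: nodes are (k-1)-mers, and each edge
    weight counts how often the corresponding k-mer occurs in the reads."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            edges[(km[:-1], km[1:])] += 1
    return edges

def tips(edges):
    """Dead-end nodes: reached by some edge but with no outgoing edge.
    Short, low-weight dead ends are typical error artefacts."""
    sources = {u for u, _ in edges}
    sinks = {v for _, v in edges}
    return sinks - sources
```

With reads `["ACGTC", "ACGTC", "ACGTT"]` and k = 3, the graph branches at node `GT`: the weight-2 edge to `TC` carries the majority path, while the weight-1 edge to `TT` ends in a low-coverage tip, the kind of structure pruning would remove.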
Figure 9. Example of a MSA of a read set, demonstrating consistent mismatches (green nucleotides) in comparison to isolated or low-frequency mismatches and indels (red nucleotides and dashes). The consistency of mismatches can be tested through their linkage within reads (four in this example); such linked mismatches are called DNPs by MisEd, the first tool to use this kind of information [86, 87]. However, genuine polymorphisms can only be distinguished from sequencing errors if several of the linked variant sites fall within the range covered by the average read (pair). A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
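Testing the linkage of alternative bases across two MSA columns, in the spirit of the DNP idea (though not MisEd's actual statistical model), might look like this; the alignment, column indices and support threshold in the usage example are all hypothetical.

```python
def linked_columns(msa, i, j, min_support=2):
    """Return the (base_i, base_j) pairs at columns i and j that co-occur
    in at least min_support reads: consistent linkage suggests genuine
    polymorphism, while scattered singletons suggest sequencing error."""
    pair_counts = {}
    for row in msa:
        if row[i] != "-" and row[j] != "-":   # ignore gapped reads
            key = (row[i], row[j])
            pair_counts[key] = pair_counts.get(key, 0) + 1
    return {p for p, n in pair_counts.items() if n >= min_support}

# Hypothetical MSA: two haplotypes (C...T and T...C at columns 1 and 3)
# plus one read with an isolated mismatch (C...C).
msa = ["ACGTA",
       "ACGTA",
       "ATGCA",
       "ATGCA",
       "ACGCA"]
```

Here columns 1 and 3 yield two well-supported haplotype pairs, while the isolated C/C combination of the last read falls below the support threshold, exactly the distinction the figure illustrates.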
Recommendations on which tools to consider for benchmarking, by data and analysis type
| Criterion | Property | Data sets | Tools to consider |
|---|---|---|---|
| Platform | 454 | Any (homopolymer errors!) | HECTOR, KEC, Acacia, AmpliconNoise, DeNoiser, PyroNoise |
| Platform | Oxford Nanopore | Any (very high error rate!) | Nanocorr |
| Platform | PacBio | Any (high error rate!, little bias) | proovread, LoRDEC, ECTools, pacbio_qc, PBcR, LSC, AHA |
| Data property | Non-uniform coverage | Metagenomics, transcriptomics, whole genome amplified | Trowel, Blue, BayesHammer, QuorUM, fermi, Hammer, ALLPATHS-LG, Reptile |
| Data property | mostly substitution errors | Complete Genomics, Illumina | BFC, Lighter, Trowel, BayesHammer, QuorUM, Musket, RACER, SGA, SOAPdenovo2, fermi, REDEEM, Hammer, SysCall, DecGPU, ECHO, HiTEC, ALLPATHS-LG, Reptile, CUDA-EC, SOAPdenovo, Quake, SHREC, FreClu, EULER-USR |
| Data property | indel errors prevalent | Any (esp. 454, Ion Torrent, PacBio) | Fiona, Blue, SEECER, Coral, ShoRAH, Hybrid-SHREC |
| Data property | many repeats or haplotypes | Metagenomics, complex genomes (e.g. eukaryotes) | SEECER, Acacia, SGA, ShoRAH, EULER-USR |
| Data property | two haplotypes | Diploid genome | ECHO |
| Analysis type | sensitive to single nucleotide errors | e.g. for SNV analysis | Fiona, REDEEM, ECHO, SysCall, Quake, FreClu |
Tools are given in reverse chronological order of publication (newest first). Recommendations are based on this literature review of the different approaches to error correction.
Tools from 2008 or earlier were excluded. Platform-specific tools for 454, Oxford Nanopore and PacBio are only mentioned in the first, second and third row, respectively. The tools listed in the other rows should all be applicable to various data types from different platforms.