| Literature DB >> 31952552 |
Anqi Wang1,2, Kin Fai Au3,4,5.
Abstract
The error-prone third-generation sequencing (TGS) long reads can be corrected by the high-quality second-generation sequencing (SGS) short reads, which is referred to as hybrid error correction. We here investigate the influences of the principal algorithmic factors of two major types of hybrid error correction methods by mathematical modeling and analysis on both simulated and real data. Our study reveals the distribution of accuracy gain with respect to the original long read error rate. We also demonstrate that the original error rate of 19% is the limit for perfect correction, beyond which long reads are too error-prone to be corrected by these methods.Entities:
Mesh:
Year: 2020 PMID: 31952552 PMCID: PMC6966875 DOI: 10.1186/s13059-019-1885-y
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1Illustration of alignment-based and graph-based method; results for model fitness and accuracy gain on simulated data. a Schematic of alignment-based method. b is a certain base on the long read, and b′ is the corresponding base on the reference sequence. The C real short reads are aligned to the long read (with N of them being successfully aligned), and then the consensus is inferred at each base. b Relationship of the successful alignment probability for short reads τ with the mismatch rate p, lower threshold on perfect match k-mer size k and the upper threshold of mismatches m. In spite of the changes of k or/and m, τ is near to one when p < 5%, and is near to zero when p > 30%. This indicates that mismatch rate is the most dominant factor on τ. As m increases from 10 to 20, the curves move upper (from blue to red and green), implying that τ increases with m. Moreover, the divergence between the dashed and solid blue, red, and green lines also shows an increasing tendency, which means the effect of k on τ also increases with m. c Schematic of graph-based error correction method. DBG is built based on short reads. Solid k-mers are detected on the long reads. The fragment between two adjacent solid k-mers is then aligned with the correlated path on the DBG. The path is used to correct the fragment when certain criteria are satisfied. d Accuracy gain at each error rate for simulated long reads corrected by alignment-based method. The boxplots represent the accuracy gain distribution for long reads. The solid lines represent the theoretical values. The dashed gray lines (diagonal lines) correspond to perfect correction. e Proportion of simulated long reads with solid k-mer detected at each error rate level. The solid lines represent the theoretical values. The dashed lines represent the results on simulated long reads. f Accuracy gain at each error rate for simulated long reads corrected by graph-based method. L: long read length; k: size of perfectly matched seed or solid k-mer
Fig. 2Explanation of bimodal accuracy gain for graph-based method; model fitness and accuracy gain on real dataset. a Proportion of long reads with different solid k-mer number. Without loss of generosity, the simulated long reads with length of 1 kb and error rate of 25% are taken as example. A long read is labeled as “high-gain long read” of the accuracy gain is larger than 12.5% (half of the value of error rate), and “low-gain long read” otherwise. b Distribution of the single solid k-mer locations on the high-gain and low-gain long reads. Only the long reads with one solid k-mer are considered. c Distribution of solid k-mer number on the long reads with different lengths. d Accuracy gain distribution at each error rate level for alignment-based method. e Proportion of long reads with solid k-mer detected. Due to the mixture of different long read lengths, an upper boundary and lower boundary is provided. f Accuracy gain distribution at each error rate level for graph-based method. g Length distribution of long reads on which graph-based method (labeled as DBG) has better, equal, or worse performance than the alignment-based method (labeled as ALN). The p value is calculated by Wilcoxon rank sum test