Mustafa Abdallah, Ashraf Mahgoub, Hany Ahmed, Somali Chaterji.
Abstract
The performance of most error-correction (EC) algorithms that operate on genomics reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We do this adaptively, per dataset and per EC tool, because different configuration parameters are optimal for different datasets (i.e., from different platforms and species) and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the "perplexity" metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with very high accuracy for 7 real datasets using 3 different k-mer based EC algorithms: Lighter, Blue, and RACER.
The inverse relation between the perplexity metric and alignment rate holds under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute-force search (i.e., scanning the entire range of k values). Athena's selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models. With the best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
Year: 2019 PMID: 31695060 PMCID: PMC6834855 DOI: 10.1038/s41598-019-52196-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Comparison of Lighter, Blue, and RACER using 7 datasets.
| Tool | Dataset | Exhaustive Search: Selected | Exhaustive Search: Alignment Rate (%) | Exhaustive Search: EC Gain (%) | With Athena (RNN): Selected | With Athena (RNN): Alignment Rate (%) | With Athena (RNN): EC Gain (%) | With Athena (N-gram): Selected | With Athena (N-gram): Alignment Rate (%) | With Athena (N-gram): EC Gain (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Lighter | D1 | | 98.95% | 96.30% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| Lighter | D2 | | 61.42% | 73.80% | | 61.15% | 80.10% | Same as RNN | | |
| Lighter | D3 | | 80.44% | 86.78% | | 80.39% | 95.34% | Same as Exhaustive Search | | |
| Lighter | D4 | | 93.95% | 89.87% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| Lighter | D5 | | 92.15% | 81.70% | | 92.09% | 83.80% | Same as Exhaustive Search | | |
| Lighter | D6 | | 86.16% | | | 85.63% | | Same as RNN | | |
| Lighter | D7 | | 40.53% | 37.58% | Same as Exhaustive Search | | | | 40.24% | 7.70% |
| Blue | D1 | | 99.53% | 99% | | 99.29% | 98.60% | Same as Exhaustive Search | | |
| Blue | D2 | | 57.44% | 4.61% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| Blue | D3 | | 84.17% | 99.20% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| Blue | D4 | | 95.31% | 98.50% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| Blue | D5 | | 92.33% | 88.90% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| Blue | D6 | | 86.18% | | Same as Exhaustive Search | | | | 86.07% | |
| Blue | D7 | | 17.19% | 3.57% | | 16.96% | 1.47% | Same as Exhaustive Search | | |
| RACER | D1 | | 99.26% | 84.80% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| RACER | D2 | | 81.15% | 92.90% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| RACER | D3 | | 84.11% | 88.27% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| RACER | D4 | | 95.33% | 97% | Same as Exhaustive Search | | | Same as Exhaustive Search | | |
| RACER | D5 | | 92.29% | 81.63% | | 92.28% | 80.50% | Same as Exhaustive Search | | |
| RACER | D6 | | 86.36% | | Same as Exhaustive Search | | | | 86.12% | |
| RACER | D7 | | 17.55% | 21.10% | | 17.40% | 26.50% | Same as Exhaustive Search | | |
This table compares the best k-value (GL, the genome length, for RACER) found by the Athena variants against exhaustive search. Athena finds either the optimal value or one within 0.53% (Alignment Rate) and within 8.5% (EC Gain) of the theoretical best in the worst case, consistent with the results reported by Lighter (Figure 5 in [6]). These slightly sub-optimal configurations are due to the impact of sub-sampling. However, with appropriate sampling rate selection, Athena achieves configurations within 0.53% of the oracle best configuration (found with exhaustive searching). We also notice that for RACER, the GL found by Athena is within 3% of the reference GL (except for the RNN model with D5, which still achieves very close performance for both Alignment Rate and EC Gain).
The Gain metric is not shown for D6 because the tool used to compute it could not handle reads of length 250 bp. We notice that the best genome length found by Athena for D7 (human genome) is 20 Mbp, which is very low compared to the actual human genome length (≈3,000 Mbp). This shows that using heuristics to estimate the optimal value of k based on only one parameter (genome length) can produce significantly suboptimal performance, even if the actual value of the genome length is provided. Moreover, the GenomeLength parameter in RACER represents the approximate length of the DNA molecule from which the reads originated. If only parts of a genome were sequenced, then only the total length of those parts should be used, instead of the length of the whole genome. Dataset D7 covers only a part of the genome (24 Mbp), and Athena's genome length selection of 20 Mbp shows the efficacy of Athena for this use case.
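The 0.53% worst-case alignment-rate gap quoted above can be checked directly against the table. The sketch below is only an illustrative recomputation, not part of Athena: it lists the rows where an Athena variant's alignment rate differs from exhaustive search and takes the largest difference. The dataset labels are an assumption (rows are taken to follow the D1 to D7 order within each tool).

```python
# (tool, dataset, exhaustive alignment %, Athena alignment %) for every row
# where an Athena variant differs from exhaustive search. Values are read off
# the table above; dataset labels ASSUME rows are ordered D1..D7 per tool.
rows = [
    ("Lighter", "D2", 61.42, 61.15),
    ("Lighter", "D3", 80.44, 80.39),
    ("Lighter", "D5", 92.15, 92.09),
    ("Lighter", "D6", 86.16, 85.63),
    ("Lighter", "D7", 40.53, 40.24),
    ("Blue", "D1", 99.53, 99.29),
    ("Blue", "D6", 86.18, 86.07),
    ("Blue", "D7", 17.19, 16.96),
    ("RACER", "D5", 92.29, 92.28),
    ("RACER", "D6", 86.36, 86.12),
    ("RACER", "D7", 17.55, 17.40),
]

# Gap in percentage points between exhaustive search and Athena, per row.
gaps = {(tool, ds): round(exh - ath, 2) for tool, ds, exh, ath in rows}
worst_case = max(gaps, key=gaps.get)
print(worst_case, gaps[worst_case])
```

Rows marked "Same as Exhaustive Search" contribute a gap of zero and are left out of the list.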
Figure 1. Overview of Athena’s workflow. First, we train the language model using the entire set of uncorrected reads for the specific dataset. Second, we perform error correction on a subsample of the uncorrected reads using an EC tool (e.g., Lighter or Blue) and a range of k-values. Third, we compute the perplexity of each corrected sample, corrected with a specific EC tool, and decide on the best k-value for the next iteration, i.e., the one corresponding to the lowest perplexity, because EC quality is negatively correlated with the perplexity metric. This process continues until the termination criteria are met. Finally, the complete set of reads is corrected with the best k-value found and then used for evaluation.
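The search loop in this workflow can be sketched as a simple hill climb over k. This is a minimal illustration under stated assumptions, not Athena's implementation: `score_k` is a hypothetical callback that would run the EC tool on the subsample with a given k and return the perplexity of the corrected reads, and the step size, neighbor choice, and stop-when-no-neighbor-improves rule are assumptions, since the caption does not spell out the termination criteria.

```python
from typing import Callable, Dict


def hill_climb_k(score_k: Callable[[int], float], k_min: int, k_max: int,
                 start: int, step: int = 1) -> int:
    """Return a k in [k_min, k_max] that locally minimizes score_k."""
    cache: Dict[int, float] = {}

    def score(k: int) -> float:
        # Each evaluation stands for an EC run plus a perplexity computation,
        # so every k is scored at most once.
        if k not in cache:
            cache[k] = score_k(k)
        return cache[k]

    best_k = start
    while True:
        neighbors = [k for k in (best_k - step, best_k + step)
                     if k_min <= k <= k_max]
        if not neighbors:
            return best_k
        candidate = min(neighbors, key=score)
        if score(candidate) >= score(best_k):
            return best_k  # no neighbor lowers the perplexity: stop
        best_k = candidate


# Toy stand-in for the expensive callback (hypothetical): lowest at k = 17.
def toy_perplexity(k: int) -> float:
    return (k - 17) ** 2 + 20.0


best = hill_climb_k(toy_perplexity, k_min=10, k_max=32, start=25)
print(best)
```

Like any hill climb, this sketch can stop at a local minimum if the perplexity curve is not unimodal over the searched range.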
Comparison of Overall Alignment Rate of Fiona versus RACER (with and without Athena’s tuning).
| Dataset | Correlation to Alignment (N-Gram) | Correlation to Alignment (RNN) | Fiona + Bowtie2 (Alignment Rate) | RACER w/o Athena + Bowtie2 (Alignment Rate) | RACER w/ Athena + Bowtie2 (Alignment Rate) | NG50 of Velvet w/o EC | NG50 of Velvet w/ (RACER + Athena) | Runtime: Athena | Runtime: Bowtie2 |
|---|---|---|---|---|---|---|---|---|---|
| D1 | −0.977 | −0.938 | 99.25% | 85.01% | | 3019 | 6827 (2.26X) | 1 m 38 s | 10 m 5 s |
| D2 | −0.981 | −0.969 | 73.75% | 58.66% | | 47 | 2164 (46X) | 49 s | 3 m 53 s |
| D3 | −0.982 | −0.968 | 83.12% | 80.79% | | 1042 | 4164 (4X) | 1 m 39 s | 7 m 50 s |
| D4 | −0.946 | −0.930 | 93.86% | | | 118 | 858 (7.27X) | 52 s | 3 m 8 s |
| D5 | −0.970 | −0.962 | 90.91% | 92.29% | | 186 | 2799 (15X) | 1 m 40 s | 9 m 42 s |
| D6 | −0.944 | −0.979 | 85.76% | 86.84% | | 1098 | 1237 (1.12X) | 6 m 40 s | 1 h 42 m |
| D7 | −0.723 | −0.862 | NA | 17.17% | 17.55% | 723 | 754 (1.04X) | 16 m | 71 m |
RACER requires the user to enter a value for the “Genome Length”, which has no default value. Therefore, “RACER w/o Athena” is RACER operating with a fixed Genome Length of 1M. The two correlation columns demonstrate the strong anti-correlation between perplexity and Alignment Rate. The NG50 columns show the assembly quality before and after correction by RACER tuned with Athena. Improvements in NG50 are shown in parentheses; NGA50 and the number of assembly errors showed similar improvements and are hence omitted. We also show the search time for estimating the perplexity metric with Athena (N-gram) for one point in the search space vs. estimating the overall alignment rate with Bowtie2.
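The 4.72X geometric-mean NG50 improvement reported in the abstract can be approximately recovered from the rounded ratios printed in parentheses in the table above. The snippet below is an illustrative check only; because the table rounds each ratio, the result agrees with the reported figure to about one decimal place.

```python
from statistics import geometric_mean

# NG50 improvement factors (Velvet with RACER + Athena vs. no EC) for D1..D7,
# as printed in parentheses in the table above.
ng50_improvement = [2.26, 46, 4, 7.27, 15, 1.12, 1.04]

gm = geometric_mean(ng50_improvement)
print(f"geometric mean NG50 improvement: {gm:.2f}X")
```

A geometric mean is the natural summary here because the per-dataset improvements are ratios spanning more than an order of magnitude; an arithmetic mean would be dominated by the 46X entry for D2.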
Figure 2. An example showing how the perplexity metric encodes errors in genomic reads. The read on the left is an erroneous read selected from dataset D3, while the read on the right is the same read after correction with Lighter. When using language modeling to compute the perplexity for both reads, we notice that the read on the right has a lower perplexity value (15.2) than the erroneous read (77.72), as the sequence of k-mers after correction has a higher probability of occurrence. Also notice that the probability of a sequence of k-mers depends on both their frequencies and their relative order in the read, which allows the perplexity metric to capture how likely it is to observe a k-mer given the neighboring k-mers in a given read.
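To make the idea concrete, here is a minimal sketch of an N-gram perplexity computation on reads. It is a simplification under stated assumptions rather than Athena's model: tokens are single bases instead of k-mers, probabilities use add-one smoothing, and the training sequence and reads are hypothetical toy strings. Perplexity is taken as the exponential of the negative mean log-probability of each token given its preceding context.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_ngram(reads, n):
    """Count n-grams and their (n-1)-base contexts over the training reads."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def perplexity(read, model, n):
    """exp of the negative mean log-probability of each base given its context."""
    ngrams, contexts = model
    num_predictions = len(read) - n + 1
    log_prob = 0.0
    for i in range(num_predictions):
        gram = read[i:i + n]
        # Add-one smoothing keeps unseen n-grams at a non-zero probability.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + len(ALPHABET))
        log_prob += math.log(p)
    return math.exp(-log_prob / num_predictions)


# Toy training sequence (hypothetical), standing in for the uncorrected reads.
model = train_ngram(["ACGTAGCTTGACCATGGA"], n=4)
corrected = "GTAGCTTGACCA"  # matches the training sequence
erroneous = "GTAGCTAGACCA"  # same read with one substitution (T -> A)
print(perplexity(corrected, model, 4), perplexity(erroneous, model, 4))
```

The substituted base produces n-grams that the model has rarely or never seen, so the erroneous read scores a higher perplexity than the corrected one, mirroring the direction of the effect described in the figure.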
Datasets’ description with coverage, number of reads, read lengths, genome type, and the Accession number.
| Dataset | Coverage | #Reads | Read Length | Genome Type | Accession Number |
|---|---|---|---|---|---|
| D1 | 80X | 20.8M | 136 bp | | SRR001665 |
| D2 | 71X | 7.1M | 47 bp | | SRR022918 |
| D3 | 173X | 18.1M | 36 bp | | SRR006332 |
| D4 | 62X | 3.5M | 75 bp | | DRR000852 |
| D5 | 166X | 7.1M | 100 bp | | SRR397962 |
| D6 | 70X | 33.6M | 250 bp | | ERR2173372 |
| D7 | 67X | 202M | 101 bp | | SRR1658570 |
Coverage is estimated according to Illumina’s documentation [49].
Figure 3. Impact of sub-sampling on perplexity and gain. We compare the perplexity and gain with samples of sizes 35% and 70% of the D2 dataset. We observe the negative correlation between both metrics and also the positive correlation between the values of each metric on the two samples.
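The correlations discussed here are of the kind a Pearson coefficient measures. The sketch below shows one way such a coefficient could be computed across candidate k values; the perplexity and gain numbers are made-up illustrative values, not measurements from the paper, and the choice of Pearson (rather than another correlation measure) is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL values, one pair per candidate k (not data from the paper).
perplexity_by_k = [42.0, 35.5, 28.1, 24.3, 26.9, 33.0]
gain_by_k = [55.0, 68.0, 81.0, 90.0, 84.0, 71.0]

r = pearson(perplexity_by_k, gain_by_k)
print(f"correlation between perplexity and gain: {r:.3f}")
```

A value of r close to -1 indicates that the k with the lowest perplexity is also the k with the highest gain, which is the property the search relies on.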
Figure 4. N-Gram (panels A, B) and RNN (panels C, D) perplexity metric for different types of synthetic errors: indels, substitution errors, and a mixture of the three (insertion, deletion, and substitution), for the E. coli str reference genome (panels A, C) and the Acinetobacter sp. reference genome (panels B, D). We compare two versions of such errors: high and low error rates.