Atul Sharma1, Pranjal Jain2, Ashraf Mahgoub1, Zihan Zhou1, Kanak Mahadik3, Somali Chaterji4.
Abstract
BACKGROUND: Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., alignment rate or Gain) of a given tool usually relies on a reference genome, but quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads and then, based on this LM, calculates the perplexity metric to evaluate the corrected reads for different parameter choices. Next, it finds the parameter choice that produces the highest alignment rate without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction. Therefore, Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome.
Keywords: Automated configuration tuning; Error correction; Nanopore reads; Natural language processing (NLP); PacBio reads; Parameter search space; Perplexity metric; Transformer networks
Year: 2022 PMID: 34991450 PMCID: PMC8734100 DOI: 10.1186/s12859-021-04547-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Lerna evaluated on 7 Illumina (Table 2) short read datasets
| Dataset | Read length (base pairs) | Exhaustive search: selected k | Exhaustive search: alignment rate (%) | Selected | Alignment | Selected | Alignment | Athena (RNN): selected | Athena (RNN): alignment |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 36 | 17 | 98.95 | Same as exhaustive search | Same as exhaustive search | Same as exhaustive search | |||
| D2 | 47 | 15 | 61.42 | Same as exhaustive search | |||||
| D3 | 36 | 15 | 80.44 | Same as exhaustive search | |||||
| D4 | 75 | 17 | 93.95 | Same as exhaustive search | Same as exhaustive search | ||||
| D5 | 100 | 17 | 92.15 | Same as exhaustive search | Same as exhaustive search | ||||
| D6 | 250 | 25 | 86.16 | Same as exhaustive search | Same as exhaustive search | ||||
| D7 | 101 | 17 | 40.53 | Same as exhaustive search | |||||
We test both the Transformer character- and word-level LMs. The best performance is attained with the Transformer word-level LM with word length 4 (|w| = 4). With this LM, Lerna finds either the optimal value or a value with an alignment rate within 0.31% of the theoretical best, consistent with the results reported for Lighter (Figure 5 in [18]). These slightly sub-optimal configurations are an artefact of sub-sampling.
A comparison between Athena and Lerna on short reads
| Dataset | Read length (bp) | Coverage | Correlation (Athena) | Correlation (Lerna) | NG50 without correction | NG50 with Lerna | NG50 with Athena |
|---|---|---|---|---|---|---|---|
| D1 | 36 | 80 | − 0.93 | − 0.94 | 3019 | 6827 | 6827 |
| D2 | 47 | 71 | − 0.97 | − 0.96 | 47 | 2164 | |
| D3 | 36 | 173 | − 0.92 | − 0.93 | 1042 | 4164 | |
| D4 | 75 | 62 | − 0.86 | − 0.97 | 118 | 858 | 858 |
| D5 | 100 | 166 | − 0.96 | − 0.98 | 186 | 2799 | |
| D6 | 250 | 70 | + 0.95 | − | 1098 | 1237 | |
| D7 | 101 | 67 | − 0.72 | − 0.82 | 723 | 739 | |
The datasets are described in Table 2. We show the correlation between the perplexity metric and the alignment rate of the data after correction for Lerna vis-à-vis its closest competitor, Athena. On dataset D6 (greatest read length of 250 bp), Athena fails and has a positive correlation of + 0.95 (instead of the desired negative correlation), highlighted in the table. This is in line with the fact that RNNs are unable to model longer sequences. We also show the improvement in assembly quality (in NG50) after tuning the EC tool with Lerna versus using the uncorrected reads. The NG50 with Lerna is always higher than, or equal to, that with Athena, other than a small drop for dataset D7. All the superior values are bolded.
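The correlation reported in this table can be illustrated with a plain Pearson coefficient between per-configuration perplexity and alignment rate. The sketch below is only an illustration: the function name `pearson` and the five paired values are hypothetical and are not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five k-mer sizes; NOT taken from the paper.
perplexity = [243.9, 243.1, 242.6, 243.4, 244.2]
alignment_rate = [91.0, 93.5, 95.2, 92.8, 90.1]

r = pearson(perplexity, alignment_rate)
print(f"r = {r:.3f}")  # negative when lower perplexity pairs with higher alignment
```

A coefficient close to −1 is the desired behaviour described in the caption: the configuration with the lowest perplexity is then also the one with the highest alignment rate.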
Lerna and Athena run on 7 Illumina short read datasets
| Dataset | Coverage | Genome | Read length (bp) | #Reads | Athena | Lerna | Speedup |
|---|---|---|---|---|---|---|---|
| D1 | 80 | | 36 | 20.8M | 98s | 5.5s | 17.8 |
| D2 | 71 | | 47 | 7.1M | 49s | 3s | 16.3 |
| D3 | 173 | | 36 | 18.1M | 69s | 4s | 17.25 |
| D4 | 62 | | 75 | 3.5M | 52s | 3s | 17.3 |
| D5 | 166 | | 100 | 7.1M | 100s | 5.5s | 18.2 |
| D6 | 70 | | 250 | 33.6M | 400s | 23s | 17.4 |
| D7 | 67 | | 101 | 202M | 960s | 63s | 15.2 |
The time for calculating perplexity is reported, along with the read lengths and number of reads. We observe that on average our pipeline is 18× faster than Athena. This translates to 80× to 275× faster than estimating the alignment rate with Bowtie2.
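The Speedup column can be reproduced as the ratio of the two timing columns. The sketch below copies the timings from the table and assumes the second timing column (5.5s, 3s, ...) is Lerna's; the dictionary and function names are illustrative only.

```python
# Perplexity-computation times in seconds, copied from the table above.
athena_s = {"D1": 98, "D2": 49, "D3": 69, "D4": 52, "D5": 100, "D6": 400, "D7": 960}
# Assumed to be Lerna's times (the column next to Athena's).
lerna_s = {"D1": 5.5, "D2": 3, "D3": 4, "D4": 3, "D5": 5.5, "D6": 23, "D7": 63}

def speedups(baseline, candidate):
    """Return the baseline/candidate time ratio for each dataset."""
    return {name: baseline[name] / candidate[name] for name in baseline}

ratios = speedups(athena_s, lerna_s)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Each ratio agrees with the tabulated Speedup value after rounding.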
Fig. 1 Workflow of Lerna. Lerna’s high-level intuition is that by reducing the perplexity metric for the generated reads, we increase the alignment rate and the assembly quality (NG50). Our evaluation shows that a word-level Transformer LM with a word length of 4 (|w| = 4) works best across all datasets and tools. Our algorithmic suite works on both NGS short reads and PacBio and Nanopore long reads
NGS short-read datasets’ description with coverage (estimated per Illumina’s documentation), number of reads, read lengths, genome type, and the Accession #
| Dataset | Coverage | #Reads | Read length (bp) | Genome type | Accession number |
|---|---|---|---|---|---|
| D1 | 80 | 20.8M | 36 | | SRR001665 |
| D2 | 71 | 7.1M | 47 | | SRR022918 |
| D3 | 173 | 18.1M | 36 | | SRR006332 |
| D4 | 62 | 3.5M | 75 | | DRR000852 |
| D5 | 166 | 7.1M | 100 | | SRR397962 |
| D6 | 70 | 33.6M | 250 | | ERR2173372 |
| D7 | 67 | 202M | 101 | | SRR1658570 |
Effect of erroneous sequences on perplexity for multiple k-values on reads: these reads are generated by NanoSim using the E. coli reference genome
| k | Perplexity (erroneous reads) | Perplexity (error-free reads) | Perplexity (entire dataset) |
|---|---|---|---|
| 15 | 1073.6 | 943.5 | 952.9 |
| 37 | 1072.8 | 944.5 | 946.8 |
| 81 | 1072.8 | 973.4 | 956.3 |
The three perplexity columns give the scores on the erroneous reads, the error-free reads, and the entire dataset (i.e., erroneous and error-free sequences)
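Why erroneous reads score a higher perplexity can be seen with a much simpler language model than the Transformer used by Lerna. The sketch below swaps in an add-one-smoothed, fixed-order Markov model over bases (an assumption made purely for illustration) and computes perplexity as the exponential of the mean negative log-probability per predicted base. The toy reads and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def train_counts(reads, order=3):
    """Count next-base occurrences for every context of `order` bases."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(order, len(read)):
            counts[read[i - order:i]][read[i]] += 1
    return counts

def perplexity(read, counts, order=3):
    """exp(mean negative log-probability), using add-one smoothing."""
    log_prob, n = 0.0, 0
    for i in range(order, len(read)):
        ctx = counts.get(read[i - order:i], Counter())
        p = (ctx[read[i]] + 1) / (sum(ctx.values()) + len(ALPHABET))
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Toy data: a periodic "genome" and two test reads (hypothetical).
counts = train_counts(["ACGTTGCA" * 20])
clean = "ACGTTGCA" * 5
noisy = "ACGTTGCA" * 2 + "ACCTTGAA" + "ACGTTGCA" * 2  # two substitutions

ppl_clean = perplexity(clean, counts)
ppl_noisy = perplexity(noisy, counts)
print(f"clean: {ppl_clean:.3f}  noisy: {ppl_noisy:.3f}")
```

The substituted bases create contexts and continuations the model has rarely or never seen, so their probabilities drop and the perplexity of the noisy read rises above that of the clean read.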
Effect of word length on correlation between perplexity and the percentage of aligned reads for E. coli PacBio simulated reads
| \|w\| | \|V\| | Mean (PPL) | SD (PPL) | Correlation | Test time (s) |
|---|---|---|---|---|---|
| 1 | 4 | 3.9997 | 0.00004 | − 0.714 | 367.5 |
| 2 | 16 | 15.824 | 0.02298 | | 990.1 |
| 3 | 64 | 62.176 | 0.06331 | | 1439 |
| 4 | 256 | 243.45 | 0.17869 | − 0.965 | 2026 |
| 5 | 1024 | 951.70 | 2.86750 | − 0.904 | 1862 |
| 6 | 4096 | 3724.7 | 21.5649 | − 0.889 | 2472 |
| 7 | 16,384 | 14675 | 133.207 | − 0.882 | 2923 |
| 8 | 65,536 | 59199 | 811.820 | − 0.877 | 7150 |
Strong correlation is observed for higher |w| values, with |w| = 4 producing the strongest correlation. The vocabulary size is represented by |V|
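The vocabulary sizes in this table follow from the four-letter DNA alphabet: |V| = 4^|w|. The sketch below reproduces that relation and shows one way a read could be cut into words. Splitting into non-overlapping words and dropping a short trailing fragment are assumptions made for illustration, and the function names are hypothetical.

```python
def vocabulary_size(word_len, alphabet_size=4):
    """Number of distinct words of length `word_len` over A, C, G, T."""
    return alphabet_size ** word_len

def split_into_words(read, word_len):
    """Cut a read into non-overlapping words (assumed tokenization);
    a trailing fragment shorter than `word_len` is dropped."""
    return [read[i:i + word_len]
            for i in range(0, len(read) - word_len + 1, word_len)]

# Mean perplexities copied from the table for three word lengths.
mean_ppl = {1: 3.9997, 4: 243.45, 8: 59199}
for w, ppl in mean_ppl.items():
    v = vocabulary_size(w)
    print(f"|w|={w}: |V|={v}, mean PPL / |V| = {ppl / v:.3f}")

print(split_into_words("ACGTACGTAC", 4))
```

A model that assigns equal probability to every word has a perplexity of exactly |V|, which gives a reference point for the tabulated means: they sit slightly below |V| for each word length.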
Fig. 2 Variation of perplexity and % of aligned reads with k on simulated E. coli PacBio reads, corrected by LoRDEC and mapped by Minimap2. The perplexity is computed by a Transformer word-level LM. A strong negative correlation between the two metrics is clearly observed
Results of Lerna evaluated on PacBio reads: |w| denotes the word length used for training, k is the selected k-value, and mapped reads is the number of reads, out of 10,000, that aligned with the reference genome after correction by LoRDEC using the selected k-value
| \|w\| | k | Mapped reads |
|---|---|---|
| 1 | 67 | 9432 |
| 2 | 67 | 9432 |
| 3 | 67 | 9432 |
| 5 | 17 | 9982 |
| 6 | 17 | 9982 |
| 7 | 17 | 9982 |
| 8 | 23 | 9656 |
Results of Lerna evaluated on Nanopore simulated reads corrected using Canu
| k | Mapped reads |
|---|---|
| 11 | 9531 |
| 13 | 9549 |
| 15 | 9631 |
| 17 | 9723 |
| 21 | 9724 |
| 23 | 9696 |
| 25 | 9657 |
| 27 | 9553 |
| 29 | 9545 |
We use a word length of 4 (|w| = 4) and find that the best value of MhapMerSize is 19
Lerna results on real PacBio E. coli K-12 reads
| k | Test PPL | NG50 |
|---|---|---|
| 15 | 240.57 | 101,440 |
| | | 133,690 |
| | 240.59 | |
| 21 | 240.62 | 166,229 |
| 23 | 240.65 | 92,537 |
| 25 | 240.67 | 92,859 |
| 27 | 240.69 | 74,776 |
| 31 | 240.70 | 58,332 |
| 37 | 240.64 | 32,160 |
Simulated annealing selects the k value that generates the minimum test perplexity. This value is quite close to the k value that generates the highest NG50 on assembly. Both of these k-values are highlighted in the table.
Lerna results on real Nanopore Acinetobacter baumannii reads
| k | Test PPL | NG50 |
|---|---|---|
| 11 | 221.14 | 39,916 |
| 13 | | |
| 15 | 221.00 | 40,271 |
| 17 | 222.93 | 40,452 |
| 19 | 222.90 | 40,478 |
| 21 | 222.76 | 40,491 |
| 23 | 223.10 | 40,484 |
| 25 | 222.96 | 40,477 |
| 27 | 222.75 | 40,489 |
| 31 | 222.69 | 40,541 |
| 37 | 222.38 | 40,541 |
| 45 | 223.01 | 40,516 |
Simulated annealing here selects a k value that also generates the highest NG50 on assembly; both values are highlighted.
Fig. 3 A JIT compiler, which leverages the qualities of both static compilers and interpreters, has information about the system during execution, enabling better optimizations
Fig. 4 The selected value of k-mer length for Illumina dataset D2 (Table 2). The value of the parameter selected is relatively unaffected by the initial temperature (T). In the worst case, the alignment rate is 61.15%, compared to 61.42% at the optimum value. In this experiment, the remaining simulated-annealing parameter is set to 0.7
Fig. 5 Variation of selected k with the second simulated-annealing parameter, keeping the initial temperature (T) = 2.8. In this case, any value of this parameter greater than 0.6 and less than 1 gives the optimum value for dataset D2
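A search of this kind can be sketched as a small simulated-annealing loop over candidate k values. This is a minimal sketch under stated assumptions, not Lerna's implementation: the defaults reuse the initial temperature 2.8 and the setting 0.7 from the captions, but treating 0.7 as a geometric cooling factor named `alpha` is an assumption, and `toy_perplexity` is a hypothetical stand-in for scoring the corrected reads with the language model.

```python
import math
import random

def anneal_k(candidates, score, t_init=2.8, alpha=0.7,
             steps_per_temp=5, t_min=1e-3, seed=0):
    """Return the candidate with the lowest score seen during annealing.

    `alpha` must lie strictly between 0 and 1 so the temperature decays.
    """
    rng = random.Random(seed)
    idx = rng.randrange(len(candidates))
    current = score(candidates[idx])
    best_idx, best = idx, current
    t = t_init
    while t > t_min:
        for _ in range(steps_per_temp):
            # Propose a neighbouring k, clamped to the candidate range.
            nxt = min(max(idx + rng.choice((-1, 1)), 0), len(candidates) - 1)
            proposed = score(candidates[nxt])
            delta = proposed - current
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t) (Metropolis criterion).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                idx, current = nxt, proposed
                if current < best:
                    best_idx, best = idx, current
        t *= alpha  # geometric cooling (assumed schedule)
    return candidates[best_idx]

def toy_perplexity(k):
    """Hypothetical objective with its minimum at k = 19."""
    return 220.0 + 0.01 * (k - 19) ** 2

k_values = [11, 13, 15, 17, 19, 21, 23, 25, 27, 29]
chosen = anneal_k(k_values, toy_perplexity)
print("selected k:", chosen)
```

In a real pipeline each call to `score` would mean running the EC tool with that k and computing the perplexity of the corrected reads, so caching scores per k would be worthwhile.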