Kasra Zandi, Gregory Butler, Nawwaf Kharma.
Abstract
Computational design of RNA sequences that fold into targeted secondary structures has many applications in biomedicine, nanotechnology and synthetic biology. An RNA molecule is made of different types of secondary structure elements, and one important element, the pseudoknot, plays a key role in stabilizing the functional form of the molecule. However, because of the computational complexity of characterizing pseudoknotted RNA structures, most existing RNA sequence design algorithms ignore this structural element, which limits their applications. In this paper we present a new algorithm to design RNA sequences for pseudoknotted secondary structures. We use NUPACK as the folding algorithm to compute the equilibrium characteristics of the pseudoknotted RNAs, and describe a new adaptive defect-weighted sampling algorithm named Enzymer to design low ensemble defect RNA sequences for targeted secondary structures including pseudoknots. We used a biological data set of 201 pseudoknotted structures from the Pseudobase library to benchmark the performance of our algorithm. We compared the quality characteristics of the RNA sequences designed by Enzymer with the results obtained from the state-of-the-art MODENA and antaRNA. Our results show that our method succeeds more frequently than MODENA and antaRNA do, and generates sequences that have lower ensemble defect, lower probability defect and higher thermostability. Finally, by using Enzymer and constraining the design to a naturally occurring and highly conserved Hammerhead motif, we designed 8 sequences for a pseudoknotted cis-acting Hammerhead ribozyme. Enzymer is available for download at https://bitbucket.org/casraz/enzymer.
Keywords: Pseudobase; RNA secondary structure; hammerhead ribozyme; pseudoknot; sequence design algorithm
Year: 2016 PMID: 27499762 PMCID: PMC4956659 DOI: 10.3389/fgene.2016.00129
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. The design pipeline of Enzymer. Step 1: we generate a random seed sequence that is compatible with the target. Step 2: we evaluate the quality of the sequence; if the stop condition is met, we return the sequence. Step 3: the adaptive defect-weighted sampling process starts here. In 3.1, the mutation operator is selected uniformly at random. If the m-mutation schema is chosen, in step 3.2 we compute the value of m. In 3.3, we sample from the low ensemble defect mutational landscape of the current sequence by applying the mutation operator. Step 4: when the stop condition is reached, we return the designed sequence.
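The four steps in the caption can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the published implementation: `toy_defects` is a hypothetical stand-in for the NUPACK-based evaluation, the rule used for m and the rule for keeping a candidate are assumptions, and only two of the mutation operators are modelled. The target is passed as a length plus a list of 0-based base pairs.

```python
import random

# Watson-Crick and wobble pairs, in both orientations.
PAIRS = [("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")]


def toy_defects(seq, partner):
    """Hypothetical stand-in for the folding step: one defect in [0, 1] per position."""
    strong = {("G", "C"), ("C", "G")}
    out = []
    for i, base in enumerate(seq):
        if i in partner:
            out.append(0.0 if (base, seq[partner[i]]) in strong else 0.5)
        else:
            out.append(0.0 if base == "A" else 0.4)
    return out


def mutate(seq, partner, defects, m, rng):
    """Resample m positions, drawn with probability proportional to their defect."""
    new = list(seq)
    weights = [d + 1e-9 for d in defects]  # keep every weight strictly positive
    for i in rng.choices(range(len(seq)), weights=weights, k=m):
        if i in partner:
            new[i], new[partner[i]] = rng.choice(PAIRS)  # keep the pair compatible
        else:
            new[i] = rng.choice("ACGU")
    return new


def design(length, pairs, target_defect=0.01, max_it=400, seed=0):
    rng = random.Random(seed)
    partner = {}
    for i, j in pairs:
        partner[i], partner[j] = j, i
    # Step 1: random seed sequence compatible with the target.
    seq = [rng.choice("ACGU") for _ in range(length)]
    for i, j in pairs:
        seq[i], seq[j] = rng.choice(PAIRS)
    # Step 2: evaluate the seed.
    defects = toy_defects(seq, partner)
    best = sum(defects) / length
    for _ in range(max_it):
        if best <= target_defect:  # stop condition
            break
        # Step 3.1: pick the mutation operator uniformly at random.
        op = rng.choice(("single", "multi"))
        # Step 3.2: ASSUMED rule for m, scaled by the current defect.
        m = 1 if op == "single" else max(1, round(best * length))
        # Step 3.3: defect-weighted sampling; keeping only non-worsening
        # candidates is an assumption of this sketch.
        cand = mutate(seq, partner, defects, m, rng)
        cand_defects = toy_defects(cand, partner)
        score = sum(cand_defects) / length
        if score <= best:
            seq, defects, best = cand, cand_defects, score
    # Step 4: return the designed sequence and its (toy) normalized defect.
    return "".join(seq), best
```

For example, `design(20, [(2, 14), (3, 13), (4, 12), (7, 19), (8, 18), (9, 17)])` designs against a small target in which the first three pairs cross the last three. Weighting the choice of positions by defect concentrates mutations where the current sequence disagrees most with the target.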
Enzymer(τ, f, max_it, t, 3)
| // input: target structure, target normalized ensemble defect, maximum iterations and the design template |
| ϕ ← … |
| … |
| // adaptive defect weighted sampling process starts here |
| … |
| ϕ ← … |
| … |
| Return ϕ, … |
mutate_single_nucleotide(ϕ, τ, P, t)
| // input: sequence, target structure, matrix of pair probabilities and the design template |
| … |
| continue |
| … |
| ϕ′ ← … |
| … |
| ϕ ← ϕ′ |
| … |
| Return ϕ |
| // This function mutates exactly … |
| … |
| ϕ ← … |
| … |
| Return ϕ |
| // function inputs: sequences, target, nested pair probability, non-nested pair probability, template |
| … |
| continue // The entire pair is locked as specified by the design template |
| … |
| ϕ′ ← … |
| ϕ ← ϕ′ |
| Return ϕ |
| … |
| ϕ′ ← … |
| ϕ ← ϕ′ |
| Return ϕ |
| … |
| ϕ′ ← … |
| ϕ ← ϕ′ |
| … |
| ϕ′ ← … |
| ϕ ← ϕ′ |
| … |
| Return ϕ |
Figure 2. Frequency of the solutions per structure where N(ϕ, τ) ≤ 0.01. For each target τ_k ∈ Pseudo, k = 1…201, the corresponding vertical bar represents the frequency (out of 30 trials) of the generated sequences ϕ_l, l = 1…30, for which N(ϕ_l, τ_k) ≤ 0.01. (A) Enzymer generated at least one sequence ϕ such that N(ϕ, τ) ≤ 0.01 for 188 of the structures. (B) MODENA generated at least one sequence ϕ such that N(ϕ, τ) ≤ 0.01 for 144 of the structures. (C) antaRNA generated at least one sequence ϕ such that N(ϕ, τ) ≤ 0.01 for 24 of the structures. A binomial test at 99% confidence suggests that Enzymer significantly outperforms both MODENA and antaRNA in generating sequences such that N(ϕ, τ) ≤ 0.01. Notably, the binomial test also suggests superior performance of MODENA compared with antaRNA. Structure IDs on the x-axis are sorted by increasing size of the corresponding targets.
Figure 3. MFE defect. For each target τ_k ∈ Pseudo, k = 1…201, the corresponding vertical bar represents the frequency (out of 30 trials) with which MFE(ϕ, τ) = τ was reached. Comparing the performance of Enzymer (A) with that of MODENA (B) and antaRNA (C) shows that Enzymer outperformed the other two methods in 191 and 194 cases, respectively. A binomial test at 99% confidence suggests that Enzymer outperforms both methods in generating sequences with lower MFE defect. Furthermore, MODENA outperforms antaRNA in 127 cases, and the binomial test suggests superior performance of MODENA compared with antaRNA. Structure IDs on the x-axis are sorted by increasing size of the corresponding targets.
Figure 4. Comparing normalized ensemble defect. In each panel, each vertical bar represents the median N(ϕ, τ) obtained for the corresponding target. The results show that Enzymer (A) outperformed MODENA (B) and antaRNA (C) in 200 and 201 cases, respectively. A binomial test at 99% confidence suggests that Enzymer delivers significantly better results than the other two methods. Furthermore, MODENA outperformed antaRNA in 155 cases, and a binomial test suggests that MODENA delivers significantly superior performance compared to antaRNA.
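The quantity N(ϕ, τ) can be illustrated with a small calculation. The sketch below assumes the standard ensemble defect definition used with NUPACK: the expected number of positions whose pairing state differs from the target, divided by the sequence length. The function name, the plain nested-list matrix and the example values are illustrative assumptions; in practice the pair probabilities would come from the folding algorithm.

```python
def normalized_ensemble_defect(P, pairs):
    """P[i][j] is the probability that i pairs with j; pairs lists target base pairs."""
    n = len(P)
    partner = {}
    for i, j in pairs:
        partner[i], partner[j] = j, i
    correct = 0.0
    for i in range(n):
        if i in partner:
            # Probability that i forms exactly its target pair.
            correct += P[i][partner[i]]
        else:
            # Probability that i is unpaired, as the target requires.
            correct += 1.0 - sum(P[i])
    return (n - correct) / n


# Illustrative 4-nucleotide example: the target pairs positions 0 and 3,
# and that pair forms with probability 0.8 in the ensemble.
P = [
    [0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0],
]
defect = normalized_ensemble_defect(P, [(0, 3)])
```

Here each of the two paired positions contributes a defect of 0.2 and the two unpaired positions contribute none, so the normalized value is 0.4 / 4 = 0.1.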
Figure 5. Comparing probability defect values. In each panel, each vertical bar represents the median π(ϕ, τ) obtained for the corresponding target. The results show that Enzymer (A) outperformed MODENA (B) and antaRNA (C) in 196 and 201 cases, respectively. A binomial test at 99% confidence suggests that Enzymer delivers significantly better results than the other two methods. Furthermore, MODENA outperformed antaRNA in 153 cases, and a binomial test suggests that MODENA delivers significantly superior performance compared to antaRNA.
Figure 6. Comparing normalized median free energy. In each panel, each vertical bar represents the median ΔG(ϕ, τ) obtained for the corresponding target. The results show that Enzymer (A) generated sequences with lower free energy than MODENA (B) and antaRNA (C) in 102 and 198 cases, respectively. A binomial test at 99% confidence suggests that Enzymer delivers significantly better results than antaRNA but performance similar to MODENA. Furthermore, MODENA outperformed antaRNA in 195 cases, and a binomial test suggests that MODENA delivers significantly superior performance compared to antaRNA.
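The win counts in this caption show how such a comparison can separate a significant result from a non-significant one. The sketch below is one plausible reading, not necessarily the exact test that was run: it assumes a one-sided test of the null hypothesis that either method is equally likely to win on a given target (p = 0.5), with all 201 targets counted.

```python
from math import comb


def binomial_tail(wins, n, p=0.5):
    """Exact P(X >= wins) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1))


# Win counts quoted in the caption, out of 201 targets.
p_vs_modena = binomial_tail(102, 201)
p_vs_antarna = binomial_tail(198, 201)

# 99% confidence corresponds to a significance level of 0.01.
significant_vs_modena = p_vs_modena < 0.01
significant_vs_antarna = p_vs_antarna < 0.01
```

Under these assumptions, 102 wins out of 201 is close to the 100.5 expected by chance and does not reach significance, whereas 198 wins does, which matches the caption's conclusion.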
Figure 7. Comparing median Boltzmann frequency. In each panel, each vertical bar represents the median Boltzmann frequency obtained for the corresponding target. The results show that Enzymer (A) outperformed MODENA (B) and antaRNA (C) in 197 and 201 cases, respectively. A binomial test at 99% confidence suggests that Enzymer delivers significantly better results than the other two methods. Furthermore, MODENA outperformed antaRNA in 153 cases, and a binomial test suggests that MODENA delivers significantly superior performance compared to antaRNA.
Figure 8. Comparing sequence identity. In each panel, each vertical bar represents the median sequence identity obtained for the corresponding target. In all 197 of the 201 cases where antaRNA (C) returned solutions, its median sequence identity was lower than that of both Enzymer (A) and MODENA (B). In 193 cases, Enzymer generated sequences with lower sequence identity than MODENA. A binomial test at 99% confidence suggests that antaRNA outperforms the other methods in generating more diverse sequence populations, while MODENA generates the least diverse sequences.
Figure 9. Run-time performance of Enzymer. (A) Comparing run-time performance of Enzymer and MODENA. (B) Enzymer reached the stop condition in fewer than 200 iterations for 179 out of 201 cases.
Figure 10. Effect of adaptive sampling on defect. The adaptive sampling strategy lowered the median normalized ensemble defect in 199 cases (A) and lowered the median probability defect of the sequences in 181 cases (B). A binomial test at 99% confidence suggests that the adaptive sampling strategy improves both the normalized ensemble defect and the probability defect of the sequences generated by Enzymer. For both panels the data were generated by setting max_it = 400.
Figure 11. Secondary structure of the hammerhead ribozyme from mouse gut metagenome. The stems are in blue and free nucleotides are in red. The 5-nucleotide-long pseudoknot starts at position 3 on stem 1. The sequence shown is the HHB1 sequence designed by Enzymer. The secondary structure in standard dot-bracket notation is “..[[[[[.....(((((......(((..]]]]].......)))..(((((((......))))))).)))))..........” and is taken from Perreault et al. (2011). We generated this figure using PseudoViewer3 (Byun and Han, 2009).
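The dot-bracket string in this caption can be read mechanically into base pairs, which also makes the pseudoknot explicit as a set of crossing pairs. The sketch below assumes that every run of unpaired positions in the caption is written with plain ASCII dots; the function names are illustrative.

```python
OPEN_TO_CLOSE = {"(": ")", "[": "]"}


def parse_dot_bracket(structure):
    """Return sorted (i, j, bracket) triples, 0-based, one per base pair."""
    close_to_open = {c: o for o, c in OPEN_TO_CLOSE.items()}
    stacks = {o: [] for o in OPEN_TO_CLOSE}  # one stack per bracket type
    pairs = []
    for j, ch in enumerate(structure):
        if ch in stacks:
            stacks[ch].append(j)
        elif ch in close_to_open:
            opener = close_to_open[ch]
            if not stacks[opener]:
                raise ValueError(f"unmatched {ch!r} at index {j}")
            pairs.append((stacks[opener].pop(), j, opener))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at index {j}")
    if any(stacks.values()):
        raise ValueError("unclosed bracket in structure")
    return sorted(pairs)


def crossing_pairs(pairs):
    """Pairs (i, j) and (k, l) cross when i < k < j < l."""
    return [(a, b) for a in pairs for b in pairs if a[0] < b[0] < a[1] < b[1]]


# Structure from the caption, with runs of unpaired positions as ASCII dots.
HAMMERHEAD = (
    "..[[[[[.....(((((......(((..]]]]]"
    ".......)))..(((((((......))))))).))))).........."
)

pairs = parse_dot_bracket(HAMMERHEAD)
knot = [p for p in pairs if p[2] == "["]
crossings = crossing_pairs(pairs)
```

Keeping a separate stack for each bracket type is what allows the square-bracket helix to interleave with the round-bracket stems. The first square-bracket pair opens at 0-based index 2, which corresponds to position 3 in the caption.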
The data generated for the hammerhead ribozyme.
| | 4.01 | 5.41 | −3.21 | 400 |
| | 4.97 | 6.33 | −2.13 | 400 |
| | 5.02 | 6.66 | −2.47 | 400 |
| | 4.34 | 5.85 | −2.66 | 400 |
| | 4.43 | 5.76 | −2.33 | 400 |
| | 4.99 | 6.44 | −2.49 | 400 |
| | 4.29 | 5.73 | −2.19 | 400 |
| | 5.38 | 7.05 | −2.65 | 400 |
| Mean | 4.68 | 6.16 | −2.52 | 400 |
| Median | 4.70 | 6.09 | −2.48 | 400 |
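The Mean and Median rows can be recomputed from the eight data rows as a quick arithmetic check. The column headers are not reproduced in the table, so this sketch refers to the first three numeric columns by position only.

```python
from statistics import mean, median

# The eight data rows: three measured columns plus the constant fourth column.
rows = [
    (4.01, 5.41, -3.21, 400),
    (4.97, 6.33, -2.13, 400),
    (5.02, 6.66, -2.47, 400),
    (4.34, 5.85, -2.66, 400),
    (4.43, 5.76, -2.33, 400),
    (4.99, 6.44, -2.49, 400),
    (4.29, 5.73, -2.19, 400),
    (5.38, 7.05, -2.65, 400),
]

columns = list(zip(*rows))
# (mean, median) of each of the first three columns, rounded to two decimals.
summary = [(round(mean(c), 2), round(median(c), 2)) for c in columns[:3]]
```

Recomputed this way, the first and third columns reproduce the listed means and medians. The second column's mean comes out as 6.15 rather than the listed 6.16, which would be consistent with the summary having been computed from unrounded values (an assumption).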