Literature DB >> 31572840

Helix-Based RNA Landscape Partition and Alternative Secondary Structure Determination.

Fengfei Wang¹, Li-Zhen Sun², Tingting Sun³, Shan Chang¹, Xiaojun Xu¹.

Abstract

RNA is a versatile macromolecule with the ability to fold into and interconvert between multiple functional conformations. The elucidation of the RNA folding landscape, especially the knowledge of alternative structures, is critical to uncover the physical mechanism of RNA functions. Here, we introduce a helix-based strategy for RNA folding landscape partition and alternative secondary structure determination. The benchmark test of 27 RNAs involving alternative stable structures shows that the model has the ability to divide the whole landscape into distinct partitions at the secondary structure level and predict the representative structures for each partition. Furthermore, the predicted structures and equilibrium populations of metastable conformations for the 2'dG-sensing riboswitch reveal the allosteric conformational switch on transcript length, which is consistent with the experimental study, indicating the importance of metastable states for RNA-based gene regulation. The model delivers a starting point for the landscape-based strategy toward the RNA folding mechanism and functions.

Entities: Chemical Species

Year: 2019 PMID： 31572840 PMCID： PMC6761681 DOI： 10.1021/acsomega.9b01430

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

RNAs perform critical cellular functions at the level of gene expression and regulation. RNA functions are determined not only by specific structures or structural motifs but also by dynamics and kinetics of conformational changes. For example, riboswitches adopt different conformations in response to specific cellular conditions to regulate gene expression.[1,2] The elucidation of RNA folding landscapes, especially the knowledge of alternative structures, is critical to uncover physical mechanisms of the RNA function. Experimental techniques, such as single molecule methods, chemical mapping, and nuclear magnetic resonance, have been widely used to probe folding landscapes and detect alternative structures of RNAs.[3−14] However, most techniques require significant infrastructure investment and expert intuition. RNA folding landscapes at the secondary structural level, on the other hand, have been computationally predicted for decades.[15−35] For example, by using a dynamic programming approach to track calculation of the partition function, Vfold2D[25] recursively calculates base pairing probabilities and probable structures. RintW[27] applies discrete Fourier transformation to detect essential alternative secondary structures by decomposing base pairing probabilities over the Hamming distance. SwiSpot[28] models alternative structures of riboswitches by spotting out switching sequences with the McCaskill algorithm[29] for base pairing probabilities. Since base pairing probabilities are sensitive to base stacking and loop energy parameters, any small change of parameters may produce dramatic changes in predicted base pairing probability distributions, which lead to limited tolerance of uncertainty for energy parameters. On the other hand, evolution-based RNA sequence analysis provides promising insights but is limited by the divergency of homologous sequences across species.[30] Beyond base pair resolution, RNAshapes formalizes the concept of abstract shapes and predicts optimal and suboptimal structures in the shape space.[31,32] Furthermore, RNAHeliCes introduces a position-specific abstraction based on helices, providing better resolution with the cost of a slightly increased search space.[33] RNA profiling denoises the set of observed base pairs in the Boltzmann sampled structures (sampling according to the Boltzmann distribution) to identify significant combinations of base pairs, which dominate low-energy RNA secondary structures.[34] The model highlights critical relations at the substructure level, yielding crucial information for molecular biologists. An RNA secondary structure can be divided into helices and single-strand loops. Helices are canonical double-stranded regions involving favorable base paring and stacking interactions. From a kinetics point of view, folding of an RNA molecule is usually initiated by the formation of helical regions at the secondary structure level followed by global folding associated with tertiary interactions. The stability of RNA helices determines RNA secondary structural ensemble thermodynamics and the kinetics of the RNA folding energy landscape. Here, we introduce a helix-based strategy for partitioning RNA folding landscapes and determining alternative secondary structures. The model uses the stable helical regions to define the main features of distinct structural ensembles in RNA folding landscapes; constraints from the selected stable helices are used to predict the minimal free energy structures to better address the conformational heterogeneity in RNAs. The benchmark test of 27 RNAs, as well as the prediction of transcriptional intermediates, indicates that the model has the ability to partition RNA folding landscapes and predict alternative stable secondary structures, which may deliver a starting point for a landscape-based strategy toward the RNA folding mechanism and functions.

Results and Discussion

Base Pairing Probability Distribution

To validate the helix-based strategy for RNA folding landscape partition, we randomly generate RNA sequences (i.e., samples A, C, G, and U with equiprobability at each position) of different sizes and predict the base pairing probability distributions with Vfold2D.[25] As shown in Figure A, the triangular relationship between the averaged probabilities (over the individual base pairing probabilities in each helix) and the helix stabilities (free energies from the Turner parameters[36]) for the RNA of 150-nt length with 754 total helices indicates that stable helices may have large probabilities of existing in stable structures with high base pairing probabilities. Other RNAs with similar distributions are shown in Figure S1. Therefore, we may use stable helices to divide the whole RNA landscape into distinct structural ensembles for landscape partitioning.

Figure 1

(A) Dependence of the averaged base pairing probabilities on the helix stabilities (in kcal/mol) of a 150-nt RNA sequence with 754 total helices. (B) Base pairing probability distributions of the top 10 most stable helices from panel (A). On the other hand, the calculated individual base pairing probabilities for each helix are usually not identical, as shown by the top 10 stable helices in Figure B and Figure S1. Helices in different structures may be involved in different folding/unfolding processes, leading to different patterns of base pairing probabilities. For example, the kinetics of a hairpin loop is usually rate-limited by the folding of the loop-closing base pairs followed by a fast zipping process for the rest of the base pairs. The process may become more complicated when the hairpin loop is base-paired with its single-strand tail to fold into a pseudoknotted structure. As shown in Figure B, the base pairing probabilities for the base pairs in both helix terminal ends are usually lower than those in the middle. Due to the dynamic nature of RNA molecules in solution, it is reasonable to assume that base pairs in the middle of stable helices define the main feature of a local minimum in the folding landscape and the base pairs in helix terminal ends and other less stable helices shape the local energy landscape around each minimum basin.

Partitioning the Landscape

To benchmark test the predictive ability of the helix-based strategy used to partition the RNA folding landscape and determine alternative secondary structures, we collect 27 RNAs with multiple reported alternative stable structures, as listed in Table S1. Among them, 21 are bistable systems and 6 are multistable RNAs. The RNA size ranges from 17 to 121 nucleotides. The reported alternative stable structures are extracted from the corresponding published papers. The prediction accuracy is evaluated by four parameters: sensitivity (SE), positive predictive value (PPV), Matthews correlation coefficient (MCC), and ranking. Here, SE and PPV are defined as the ratios between the correctly predicted canonical base pairs and the total number of canonical base pairs in the reported and in the predicted structures, respectively. MCC is approximated by the arithmetic mean of the SE and PPV.[14] Ranking is calculated as the ratio between the stability-based rank of best prediction and the total number of partitions. Figure gives the total number of folding partitions and the prediction accuracy of sensitivity and ranking as a function of the size of the selective helix pool for the two cases of RNA-15 (17-nt in size and 10 saturated helices) and RNA-17 (121-nt in size and 596 saturated helices) (see Figure S2 for other cases). As expected, the number of folding partitions increases exponentially with the size of the selective helix pool. However, the sensitivity for the prediction of alternative structures approaches 100% for small pool sizes and remains fairly constant during the pool enlargement. Meanwhile, the ranking of the prediction decreases dramatically with the pool size and reaches its limit, that is, top ranks, especially for the large-sized RNAs. Since we use the MFE structure as the representative for each folding partition, the prediction accuracy increases if the subalternative structures (not only the MFE structure) are considered for each partition. More discrete partitions could provide more detailed insights on the RNA folding mechanism and functions. However, there is a balance between the prediction accuracy, the ranking for best predictions, and the computational efficiency.

Figure 2

Total number of partitions (in red), sensitivity (in black), and ranking (in blue) of prediction accuracy for each alternative structure as a function of the size of the selective helix pool for (A) RNA-15 of 17-nt size with 10 total helices and (B) RNA-17 of 121-nt size with 596 total helices. Table S2 lists the detailed benchmark test for the prediction of alternative secondary structures. Here, we select helices from most to least stable in the saturated helix pool until the total number of assembled partitions reaches 100. For example, for the case of RNA-15 with 17 nucleotides in length, we include all saturated helices for landscape partitioning. For the case of RNA-17 with a length of 121 nucleotides, we only select the top 11 out of 596 total saturated helices to assemble 196 folding partitions (<100 partitions, if only top 10 helices are selected). As we see in Table S2, most of the alternative structures can be correctly predicted with all of the canonical base pairs in the structures of best predictions. The average SE is 0.95. Furthermore, for the test cases in Table S2 with 61 alternative structures in total, 20, 15, and 8 cases of the best predictions are ranked as the first, second, and third prediction by stabilities, respectively. Only three (out of 61) cases are not ranked in the top 10 predictions. All the predicted and reported structures, as well as the number of helices, are listed in Table S2 and Figure S3. We also use the online servers of Sfold (Srna module),[23] Mfold,[18] RNAstructure,[19,20] RNAshapes,[32] and RNA profiling[34] with the default parameters to predict the alternative secondary structures for the above 61 tested cases. Figure (Figures S4–S7) gives the direct comparison for the prediction accuracy of SE, PPV, MCC, and ranks. Sfold, Mfold, and RNAstructure predict alternative structures in the structural ensemble with single base pair resolution and achieve similar performances (in the range of (0.83, 0.94) for the averaged SE, PPV, and MCC). RNAshapes predicts alternative structures at the abstract shape resolution. Although its prediction accuracy (for the test cases) is slightly lower than others, the model is especially useful if a known topology (shape) is suspected. Like our helix-based RNA folding model, RNA profiling also constructs clusters of structures from the combinations of helices. However, profiling generates helix classes from Boltzmann sampling. When the free energy difference between alternative structures is large, as in the example of RNA-10 shown in Figure S3 (one of the reported structures is a hairpin structure with 7 base pairs, while the other one is a 4-bp hairpin), the Boltzmann sampling with a limited number of structures (1000 for the profiling server by default) may not include structures with the 4-bp helix, leading to poor predictions for the less stable helix.

Figure 3

Comparison of the Sfold, Mfold, RNAstructure, RNAshapes, RNA profiling, and our model predictions for (A) ranks, (B) SE, (C) PPV, and (D) MCC. All cases are listed in Table S1.

Comparison of the Sfold, Mfold, RNAstructure, RNAshapes, RNA profiling, and our model predictions for (A) ranks, (B) SE, (C) PPV, and (D) MCC. All cases are listed in Table S1. In contrast, our model selects helices based on the stabilities, partitions the landscape, and predicts alternative structures from the complete secondary structural ensemble at the (one-third) helix level. The prediction results listed in Figure , Figure S3, and Table S2 indicate that only a small number (<20) of stable saturated helices are needed to partition the landscape and place most of the best predictions of alternative structures in the top 10 predictions.

Determination of Transcriptional Intermediates

To further test the helix-based RNA folding model, we predict alternative structures for the I-A 2′-deoxyguanosine-sensing (2′dG-sensing) riboswitch from Mesoplasma florum at different sequence lengths, which shows a strict fine-tuning of binding and sequence-dependent alterations of the conformational space by structural analysis of all relevant transcription intermediates at single-nucleotide resolution from NMR studies.[37] Since the energy terms for ligand-binding interactions at the secondary structural level are currently unavailable, we only predict the transcriptional intermediates of the riboswitch in the absence of ligands and compare the predicted results with the corresponding experimental data. The 2′dG-sensing riboswitch has two dominant functional states: ON and OFF. The NMR studies revealed that, depending on the sequence length during the transcriptional process as shown in Figure A, the OFF state adopts two distinct secondary structural elements. OFF (short) is a three-way junction (aptamer domain) attached by a single-strand 3′-tail. However, the three-way junction of OFF (long) is attached by a stem-loop motif (terminator). The ON state also has two forms of structures. ON(M) has a multiloop junction containing at least three stems. ON(I) is a stem-loop structure with an internal loop. Both ON(M) and ON(I) share the same antiterminator stem and equivalent function. We use Vfold2D to calculate the equilibrium populations of the three structural ensembles, namely, OFF (short or long), ON(M), and ON(I), with the base pair constraints from the corresponding functional structures shown in Figure A for different sequence lengths to mimic the transcriptional process.

Figure 4

(A) Conformational rearrangement during the sequence elongation reveals the alternative structure-guided regulation mechanism of 2′dG-sensing riboswitch. (B) Predicted population changes of the three transcriptional intermediates of 2′dG-sensing riboswitch, namely, OFF (short or long), ON(M), and ON(I), along with the sequence length. As shown in Figure B for the predicted population changes of the three functional states, namely, OFF, ON(M), and ON(I), along with the sequence length, there are two conformational rearrangements that happen during sequence elongation: OFF → ON at (110, 114) and ON → OFF at (122, 139) of sequence length. OFF (short) dominates the early time windows of transcription for the nascent RNAs with 3′ ends at the location of (a, d). OFF (long) dominates the late time windows for the 3′ ends at (h, i). The dominant ON state during the middle time windows adopts two bistable antiterminator conformations in slow exchange: ON(M) and ON(I). ON(M) is slightly more stable than ON(I) with larger population in equilibrium. The predicted population changes agree well with the NMR studies, revealing the importance of metastable states for RNA-based gene regulation. It should be noted that the population of OFF (short) for the 3′ ends at the location of a is very low (≈0.03, shown in Figure B), which is inconsistent with the experimental results. From the free energy calculations, we find that there is an energetically more favorable structure dominating the populations in equilibrium. As shown by the cyan part of the aptamer domain in Figure , the P1 stem of the three-way junction is disrupted to fold into a stable four-way junction where the single-strand 5′-tail attached to the P1 stem is base-paired with the 3′-tail, leading to the low population of the OFF state shown in Figure B. With the increase of sequence length, the formation of the aptamer domain dominates over the other structures since the P1 stem becomes longer and stabilizes the three-way junction of the aptamer domain. The truncation of the first five 5′-nucleotides, as simulated by the Barmap approach, eliminates the possibility of folding into alternative structures, such as the four-way junction structure, and gives results consistent with the NMR studies even for the case with the 3′ ends at position a.

Figure 5

Alternative structures of the aptamer domain of 2′dG-sensing riboswitch: (A) a four-way junction, (B) a three-way junction attached by a 19-nt-long 5′-tail. The P1 stem of the three-way junction (cyan) in (B) is disrupted to fold into a stable four-way junction where the single-strand 5′-tail attached to the P1 stem is base-paired with the 3′-tail. The P2 stem of the three-way junction (magenta) in panel (B) is partially disrupted in order to fold a stem-loop structure in panel (A). On the other hand, our model also reveals an alternative structure for the P2 stem of the aptamer domain, which is consistent with simulation results obtained by truncating the first five 5′-nucleotides with Barmap. As highlighted by the magenta part of the aptamer domain in Figure , stem P2 of the three-way junction is partially disrupted in order to form a stem-loop structure. The alternative structure, which shares the same hairpin loop as closed by the P2 stem, contains a bulge loop with an additional coaxial stacking interaction between the two helices and decreases the size of the (three- or four-way) junction. Energetically, the two alternative structures of the P2 stem shown in Figure have similar stabilities, resulting in ≈0.35 for the population of OFF shown in Figure B.

Conclusions

Using stable helices to effectively divide the RNA folding landscape into discrete structural ensembles, the helix-based RNA folding model partitions the folding landscape and determines alternative secondary structures with high accuracy. The benchmark test of 27 RNAs involving multiple alternative structures shows that our model accurately predicts alternative structures with top ranking structures corresponding to our best predictions. Furthermore, the test of the metastable transcriptional intermediates of the 2′dG-sensing riboswitch indicates that the predicted population changes associated with sequence elongation agree well with the experimental results. The model correctly predicts the conformational rearrangement between ON and OFF states and the coexistence of two ON states in slow exchange, which is not predictable by simulations from the Barmap approach. Furthermore, reducing the large conformational space associated with the whole folding landscape at single base pair resolution to a limited number of selective helix pool size-dependent landscape partitions provides us with a feasible way to decipher the dynamics/kinetics properties and structure–function relationship of RNA molecules from folding landscape studies. For the free energy-based RNA folding models, the simulation results are usually sensitive to the choice of energy parameters, especially for the prediction of MFE structures. Partitioning the landscape to determine alternative structures with the helix-based RNA folding model may effectively increase the tolerance of uncertainty for energy parameters and better reveal the structure–function relationship of RNA molecules. Furthermore, the model may also provide a way to facilitate the experimental determination of functional structures through a step-by-step screening of stable helices. However, the current model is limited to predicting folding at the secondary structure level. Further development of the model should include more complicated structures, such as pseudoknotted and loop–loop kissing motifs.

Algorithm and Methods

Different RNA sequences may have different sets of helices with different folding stabilities. Helices with lower free energies (higher stabilities) may have higher probabilities of initiating the folding process and surviving in stable structures. Therefore, we can use stable helices to effectively divide the whole RNA folding landscape into discrete partitions and determine alternative structures to better decipher the structure–function relationship of RNAs. As shown in Figure , the helix-based strategy for partitioning the RNA folding landscape works with the following steps.

Figure 6

Schematic of the helix-based RNA folding landscape partition: (1) build a saturated helix pool from a given RNA sequence; (2) select top helices according to the helix stabilities; (3) partition the folding landscape from the selective helix pool. Each helix is denoted by (i, j, k). The folding partition in the dashed rectangle represents a structural ensemble of all structures with the helix H1 and without the helices H2 and H3.

A saturated helix pool is built from a given RNA sequence. We only consider the canonical base pairs, such as A-U, G-C, and G-U, in this step. Helices with only one base pair are removed. All sub-helices are not included in the saturated helix pool. As shown in Figure , each saturated helix is denoted by (i, j, k), with (i, j) a starting base pair at nucleotide positions i and j, and k is the number of base pairs in the helix. To enumerate all possible saturated helices, we first identify all starting pairs of (i, j), such that (i, j) is a canonical base pair, while (i – 1, j + 1) is noncanonical (except for i = 1 or j = N, size of a given sequence). For each starting pair of (i, j), we then scan for k consecutive canonical base pairs, such that (i + k, j – k) is noncanonical or the size of the fragment within (i + k, j – k) is less than 3 (i.e., minimum hairpin loop). Top helices are selected according to the helix stabilities. The stability of each helix is calculated by the base stacking energies from the Turner parameters.[36] The number of selected helices is arbitrary and is related to the total number of folding landscape partitions, as well as the computational time. As shown in the example in Figure , the selective helix pool contains the three most stable helices (H1, H2, and H3). The folding landscape is partitioned from the selective helix pool. By enumerating all the possible helix combinations and deleting those with helix overlaps (two nucleotides are base-paired with the same one) or crossing (two base pairs of (i, j) and (m, n) in the conditions of m > i, j > m, and n > j, as in the case of a pseudoknotted structure), we obtain a list of folding partitions featured by the selected helices. Each partition defines a structural ensemble in the overall folding landscape containing all the secondary structures with base pairs from the included helices but without base pairs from the excluded helices in the selective helix pool. Schematic of the helix-based RNA folding landscape partition: (1) build a saturated helix pool from a given RNA sequence; (2) select top helices according to the helix stabilities; (3) partition the folding landscape from the selective helix pool. Each helix is denoted by (i, j, k). The folding partition in the dashed rectangle represents a structural ensemble of all structures with the helix H1 and without the helices H2 and H3. Since folding partitions are defined by the inclusion and exclusion of selected helices, there is at least one helix difference between any two of them. The size of the selective helix pool, that is, the threshold of helix stability, determines the discriminative ability of the partitioned landscape. Larger selective helix pool size leads to smaller differences between different partitions.

Alternative Structure Determination

We use the Vfold2D[25] model with the corresponding base pair constraints from the selective helix pool to calculate the total free energy of each folding partition and treat the minimal free energy (MFE) structure as the representative of each partition. Other RNA secondary structure prediction models, such as Mfold,[18] RNAstructure,[19,20] and RNAfold,[21] may also be applied to calculate the free energy and determine the MFE structure with similar performances. As shown by the example of the dashed rectangle in Figure , the assembled partition represents a structural ensemble of all secondary structures with the helix H1 (included) and without the helices H2 and H3 (excluded) in the three-helix selective helix pool. Since base pairs in both helix terminal ends are usually unstable and involve spontaneous dynamics of opening and closing in solution, we only consider base pairs in the middle of helices (one-third in length) for the included helices, as shown by the four red ones (h1) of a 10-bp helix H1 in Figure . The removal of terminal base pairs would allow two partially overlapping saturated helices to coexist, which often happens in junction motifs, where nucleotides may be base-paired in different helices. For the excluded helices, however, we use all the base pairs as the base pair constraints in Vfold2D for structure enumeration.

37 in total

1. Sfold web server for statistical folding and rational design of nucleic acids.

Authors: Ye Ding; Chi Yu Chan; Charles E Lawrence
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

2. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features.

Authors: Yiliang Ding; Yin Tang; Chun Kit Kwok; Yu Zhang; Philip C Bevilacqua; Sarah M Assmann
Journal: Nature Date: 2013-11-24 Impact factor: 49.962

Review 3. Insights into RNA structure and function from genome-wide studies.

Authors: Stefanie A Mortimer; Mary Anne Kidwell; Jennifer A Doudna
Journal: Nat Rev Genet Date: 2014-05-13 Impact factor: 53.242

4. Modeling RNA secondary structure folding ensembles using SHAPE mapping data.

Authors: Aleksandar Spasic; Sarah M Assmann; Philip C Bevilacqua; David H Mathews
Journal: Nucleic Acids Res Date: 2018-01-09 Impact factor: 16.971

5. RNAstructure: software for RNA secondary structure prediction and analysis.

Authors: Jessica S Reuter; David H Mathews
Journal: BMC Bioinformatics Date: 2010-03-15 Impact factor: 3.169

Review 6. New insights from cluster analysis methods for RNA secondary structure prediction.

Authors: Emily Rogers; Christine Heitsch
Journal: Wiley Interdiscip Rev RNA Date: 2016-03-11 Impact factor: 9.957

7. Statistical modeling of RNA structure profiling experiments enables parsimonious reconstruction of structure landscapes.

Authors: Hua Li; Sharon Aviran
Journal: Nat Commun Date: 2018-02-09 Impact factor: 14.919