Fengfei Wang1, Li-Zhen Sun2, Tingting Sun3, Shan Chang1, Xiaojun Xu1. 1. Institute of Bioinformatics and Medical Engineering, School of Mathematics and Physics, Jiangsu University of Technology, Changzhou, Jiangsu 213001, China. 2. Department of Applied Physics, Zhejiang University of Technology, Hangzhou, Zhejiang 310023, China. 3. Department of Physics, Zhejiang University of Science and Technology, Hangzhou, Zhejiang 310023, China.
Abstract
RNA is a versatile macromolecule with the ability to fold into and interconvert between multiple functional conformations. The elucidation of the RNA folding landscape, especially the knowledge of alternative structures, is critical to uncover the physical mechanism of RNA functions. Here, we introduce a helix-based strategy for RNA folding landscape partition and alternative secondary structure determination. The benchmark test of 27 RNAs involving alternative stable structures shows that the model has the ability to divide the whole landscape into distinct partitions at the secondary structure level and predict the representative structures for each partition. Furthermore, the predicted structures and equilibrium populations of metastable conformations for the 2'dG-sensing riboswitch reveal the allosteric conformational switch on transcript length, which is consistent with the experimental study, indicating the importance of metastable states for RNA-based gene regulation. The model delivers a starting point for the landscape-based strategy toward the RNA folding mechanism and functions.
RNA is a versatile macromolecule with the ability to fold into and interconvert between multiple functional conformations. The elucidation of the RNA folding landscape, especially the knowledge of alternative structures, is critical to uncover the physical mechanism of RNA functions. Here, we introduce a helix-based strategy for RNA folding landscape partition and alternative secondary structure determination. The benchmark test of 27 RNAs involving alternative stable structures shows that the model has the ability to divide the whole landscape into distinct partitions at the secondary structure level and predict the representative structures for each partition. Furthermore, the predicted structures and equilibrium populations of metastable conformations for the 2'dG-sensing riboswitch reveal the allosteric conformational switch on transcript length, which is consistent with the experimental study, indicating the importance of metastable states for RNA-based gene regulation. The model delivers a starting point for the landscape-based strategy toward the RNA folding mechanism and functions.
RNAs perform critical
cellular functions at the level of gene expression
and regulation. RNA functions are determined not only by specific
structures or structural motifs but also by dynamics and kinetics
of conformational changes. For example, riboswitches adopt different
conformations in response to specific cellular conditions to regulate
gene expression.[1,2] The elucidation of RNA folding
landscapes, especially the knowledge of alternative structures, is
critical to uncover physical mechanisms of the RNA function. Experimental
techniques, such as single molecule methods, chemical mapping, and
nuclear magnetic resonance, have been widely used to probe folding
landscapes and detect alternative structures of RNAs.[3−14] However, most techniques require significant infrastructure investment
and expert intuition.RNA folding landscapes at the secondary
structural level, on the
other hand, have been computationally predicted for decades.[15−35] For example, by using a dynamic programming approach to track calculation
of the partition function, Vfold2D[25] recursively
calculates base pairing probabilities and probable structures. RintW[27] applies discrete Fourier transformation to detect
essential alternative secondary structures by decomposing base pairing
probabilities over the Hamming distance. SwiSpot[28] models alternative structures of riboswitches by spotting
out switching sequences with the McCaskill algorithm[29] for base pairing probabilities. Since base pairing probabilities
are sensitive to base stacking and loop energy parameters, any small
change of parameters may produce dramatic changes in predicted base
pairing probability distributions, which lead to limited tolerance
of uncertainty for energy parameters. On the other hand, evolution-based
RNA sequence analysis provides promising insights but is limited by
the divergency of homologous sequences across species.[30]Beyond base pair resolution, RNAshapes
formalizes the concept of
abstract shapes and predicts optimal and suboptimal structures in
the shape space.[31,32] Furthermore, RNAHeliCes introduces
a position-specific abstraction based on helices, providing better
resolution with the cost of a slightly increased search space.[33] RNA profiling denoises the set of observed base
pairs in the Boltzmann sampled structures (sampling according to the
Boltzmann distribution) to identify significant combinations of base
pairs, which dominate low-energy RNA secondary structures.[34] The model highlights critical relations at the
substructure level, yielding crucial information for molecular biologists.An RNA secondary structure can be divided into helices and single-strand
loops. Helices are canonical double-stranded regions involving favorable
base paring and stacking interactions. From a kinetics point of view,
folding of an RNA molecule is usually initiated by the formation of
helical regions at the secondary structure level followed by global
folding associated with tertiary interactions. The stability of RNA
helices determines RNA secondary structural ensemble thermodynamics
and the kinetics of the RNA folding energy landscape. Here, we introduce
a helix-based strategy for partitioning RNA folding landscapes and
determining alternative secondary structures. The model uses the stable
helical regions to define the main features of distinct structural
ensembles in RNA folding landscapes; constraints from the selected
stable helices are used to predict the minimal free energy structures
to better address the conformational heterogeneity in RNAs. The benchmark
test of 27 RNAs, as well as the prediction of transcriptional intermediates,
indicates that the model has the ability to partition RNA folding
landscapes and predict alternative stable secondary structures, which
may deliver a starting point for a landscape-based strategy toward
the RNA folding mechanism and functions.
Results and Discussion
Base Pairing
Probability Distribution
To validate the
helix-based strategy for RNA folding landscape partition, we randomly
generate RNA sequences (i.e., samples A, C, G, and U with equiprobability
at each position) of different sizes and predict the base pairing
probability distributions with Vfold2D.[25] As shown in Figure A, the triangular relationship between the averaged probabilities
(over the individual base pairing probabilities in each helix) and
the helix stabilities (free energies from the Turner parameters[36]) for the RNA of 150-nt length with 754 total
helices indicates that stable helices may have large probabilities
of existing in stable structures with high base pairing probabilities.
Other RNAs with similar distributions are shown in Figure S1. Therefore, we may use stable helices to divide
the whole RNA landscape into distinct structural ensembles for landscape
partitioning.
Figure 1
(A) Dependence of the averaged base pairing probabilities
on the
helix stabilities (in kcal/mol) of a 150-nt RNA sequence with 754
total helices. (B) Base pairing probability distributions of the top
10 most stable helices from panel (A).
(A) Dependence of the averaged base pairing probabilities
on the
helix stabilities (in kcal/mol) of a 150-nt RNA sequence with 754
total helices. (B) Base pairing probability distributions of the top
10 most stable helices from panel (A).On the other hand, the calculated individual base pairing probabilities
for each helix are usually not identical, as shown by the top 10 stable
helices in Figure B and Figure S1. Helices in different
structures may be involved in different folding/unfolding processes,
leading to different patterns of base pairing probabilities. For example,
the kinetics of a hairpin loop is usually rate-limited by the folding
of the loop-closing base pairs followed by a fast zipping process
for the rest of the base pairs. The process may become more complicated
when the hairpin loop is base-paired with its single-strand tail to
fold into a pseudoknotted structure. As shown in Figure B, the base pairing probabilities
for the base pairs in both helix terminal ends are usually lower than
those in the middle. Due to the dynamic nature of RNA molecules in
solution, it is reasonable to assume that base pairs in the middle
of stable helices define the main feature of a local minimum in the
folding landscape and the base pairs in helix terminal ends and other
less stable helices shape the local energy landscape around each minimum
basin.
Partitioning the Landscape
To benchmark test the predictive
ability of the helix-based strategy used to partition the RNA folding
landscape and determine alternative secondary structures, we collect
27 RNAs with multiple reported alternative stable structures, as listed
in Table S1. Among them, 21 are bistable
systems and 6 are multistable RNAs. The RNA size ranges from 17 to
121 nucleotides. The reported alternative stable structures are extracted
from the corresponding published papers. The prediction accuracy is
evaluated by four parameters: sensitivity (SE), positive predictive
value (PPV), Matthews correlation coefficient (MCC), and ranking.
Here, SE and PPV are defined as the ratios between the correctly predicted
canonical base pairs and the total number of canonical base pairs
in the reported and in the predicted structures, respectively. MCC
is approximated by the arithmetic mean of the SE and PPV.[14] Ranking is calculated as the ratio between the
stability-based rank of best prediction and the total number of partitions.Figure gives the
total number of folding partitions and the prediction accuracy of
sensitivity and ranking as a function of the size of the selective
helix pool for the two cases of RNA-15 (17-nt in size and 10 saturated
helices) and RNA-17 (121-nt in size and 596 saturated helices) (see Figure S2 for other cases). As expected, the
number of folding partitions increases exponentially with the size
of the selective helix pool. However, the sensitivity for the prediction
of alternative structures approaches 100% for small pool sizes and
remains fairly constant during the pool enlargement. Meanwhile, the
ranking of the prediction decreases dramatically with the pool size
and reaches its limit, that is, top ranks, especially for the large-sized
RNAs. Since we use the MFE structure as the representative for each
folding partition, the prediction accuracy increases if the subalternative
structures (not only the MFE structure) are considered for each partition.
More discrete partitions could provide more detailed insights on the
RNA folding mechanism and functions. However, there is a balance between
the prediction accuracy, the ranking for best predictions, and the
computational efficiency.
Figure 2
Total number of partitions (in red), sensitivity
(in black), and
ranking (in blue) of prediction accuracy for each alternative structure
as a function of the size of the selective helix pool for (A) RNA-15
of 17-nt size with 10 total helices and (B) RNA-17 of 121-nt size
with 596 total helices.
Total number of partitions (in red), sensitivity
(in black), and
ranking (in blue) of prediction accuracy for each alternative structure
as a function of the size of the selective helix pool for (A) RNA-15
of 17-nt size with 10 total helices and (B) RNA-17 of 121-nt size
with 596 total helices.Table S2 lists the detailed
benchmark test for the prediction of
alternative secondary structures. Here, we select helices from most
to least stable in the saturated helix pool until the total number
of assembled partitions reaches 100. For example, for the case of
RNA-15 with 17 nucleotides in length, we include all saturated helices
for landscape partitioning. For the case of RNA-17 with a length of
121 nucleotides, we only select the top 11 out of 596 total saturated
helices to assemble 196 folding partitions (<100 partitions, if
only top 10 helices are selected). As we see in Table S2, most of the alternative structures can be correctly
predicted with all of the canonical base pairs in the structures of
best predictions. The average SE is 0.95. Furthermore, for the test
cases in Table S2 with 61 alternative structures
in total, 20, 15, and 8 cases of the best predictions are ranked as
the first, second, and third prediction by stabilities, respectively.
Only three (out of 61) cases are not ranked in the top 10 predictions.
All the predicted and reported structures, as well as the number of
helices, are listed in Table S2 and Figure S3.We also use the online servers of Sfold (Srna module),[23] Mfold,[18] RNAstructure,[19,20] RNAshapes,[32] and RNA profiling[34] with the default parameters to predict the alternative
secondary structures for the above 61 tested cases. Figure (Figures S4–S7) gives the direct comparison for the prediction
accuracy of SE, PPV, MCC, and ranks. Sfold, Mfold, and RNAstructure
predict alternative structures in the structural ensemble with single
base pair resolution and achieve similar performances (in the range
of (0.83, 0.94) for the averaged SE, PPV, and MCC). RNAshapes predicts
alternative structures at the abstract shape resolution. Although
its prediction accuracy (for the test cases) is slightly lower than
others, the model is especially useful if a known topology (shape)
is suspected. Like our helix-based RNA folding model, RNA profiling
also constructs clusters of structures from the combinations of helices.
However, profiling generates helix classes from Boltzmann sampling.
When the free energy difference between alternative structures is
large, as in the example of RNA-10 shown in Figure S3 (one of the reported structures is a hairpin structure with
7 base pairs, while the other one is a 4-bp hairpin), the Boltzmann
sampling with a limited number of structures (1000 for the profiling
server by default) may not include structures with the 4-bp helix,
leading to poor predictions for the less stable helix.
Figure 3
Comparison of the Sfold,
Mfold, RNAstructure, RNAshapes, RNA profiling,
and our model predictions for (A) ranks, (B) SE, (C) PPV, and (D)
MCC. All cases are listed in Table S1.
Comparison of the Sfold,
Mfold, RNAstructure, RNAshapes, RNA profiling,
and our model predictions for (A) ranks, (B) SE, (C) PPV, and (D)
MCC. All cases are listed in Table S1.In contrast, our model selects helices based on
the stabilities,
partitions the landscape, and predicts alternative structures from
the complete secondary structural ensemble at the (one-third) helix
level. The prediction results listed in Figure , Figure S3, and Table S2 indicate that only a small number (<20)
of stable saturated helices are needed to partition the landscape
and place most of the best predictions of alternative structures in
the top 10 predictions.
Determination of Transcriptional Intermediates
To further
test the helix-based RNA folding model, we predict alternative structures
for the I-A 2′-deoxyguanosine-sensing (2′dG-sensing)
riboswitch from Mesoplasma florum at
different sequence lengths, which shows a strict fine-tuning of binding
and sequence-dependent alterations of the conformational space by
structural analysis of all relevant transcription intermediates at
single-nucleotide resolution from NMR studies.[37] Since the energy terms for ligand-binding interactions
at the secondary structural level are currently unavailable, we only
predict the transcriptional intermediates of the riboswitch in the
absence of ligands and compare the predicted results with the corresponding
experimental data.The 2′dG-sensing riboswitch has two
dominant functional states: ON and OFF. The NMR studies revealed that,
depending on the sequence length during the transcriptional process
as shown in Figure A, the OFF state adopts two distinct secondary structural elements.
OFF (short) is a three-way junction (aptamer domain) attached by a
single-strand 3′-tail. However, the three-way junction of OFF
(long) is attached by a stem-loop motif (terminator). The ON state
also has two forms of structures. ON(M) has a multiloop junction containing
at least three stems. ON(I) is a stem-loop structure with an internal
loop. Both ON(M) and ON(I) share the same antiterminator stem and
equivalent function. We use Vfold2D to calculate the equilibrium populations
of the three structural ensembles, namely, OFF (short or long), ON(M),
and ON(I), with the base pair constraints from the corresponding functional
structures shown in Figure A for different sequence lengths to mimic the transcriptional
process.
Figure 4
(A) Conformational rearrangement during the sequence elongation
reveals the alternative structure-guided regulation mechanism of 2′dG-sensing
riboswitch. (B) Predicted population changes of the three transcriptional
intermediates of 2′dG-sensing riboswitch, namely, OFF (short
or long), ON(M), and ON(I), along with the sequence length.
(A) Conformational rearrangement during the sequence elongation
reveals the alternative structure-guided regulation mechanism of 2′dG-sensing
riboswitch. (B) Predicted population changes of the three transcriptional
intermediates of 2′dG-sensing riboswitch, namely, OFF (short
or long), ON(M), and ON(I), along with the sequence length.As shown in Figure B for the predicted population changes of the three
functional states,
namely, OFF, ON(M), and ON(I), along with the sequence length, there
are two conformational rearrangements that happen during sequence
elongation: OFF → ON at (110, 114) and ON → OFF at (122,
139) of sequence length. OFF (short) dominates the early time windows
of transcription for the nascent RNAs with 3′ ends at the location
of (a, d). OFF (long) dominates
the late time windows for the 3′ ends at (h, i). The dominant ON state during the middle time
windows adopts two bistable antiterminator conformations in slow exchange:
ON(M) and ON(I). ON(M) is slightly more stable than ON(I) with larger
population in equilibrium. The predicted population changes agree
well with the NMR studies, revealing the importance of metastable
states for RNA-based gene regulation.It should be noted that
the population of OFF (short) for the 3′
ends at the location of a is very low (≈0.03,
shown in Figure B),
which is inconsistent with the experimental results. From the free
energy calculations, we find that there is an energetically more favorable
structure dominating the populations in equilibrium. As shown by the
cyan part of the aptamer domain in Figure , the P1 stem of the three-way junction is
disrupted to fold into a stable four-way junction where the single-strand
5′-tail attached to the P1 stem is base-paired with the 3′-tail,
leading to the low population of the OFF state shown in Figure B. With the increase of sequence
length, the formation of the aptamer domain dominates over the other
structures since the P1 stem becomes longer and stabilizes the three-way
junction of the aptamer domain. The truncation of the first five 5′-nucleotides,
as simulated by the Barmap approach, eliminates the possibility of
folding into alternative structures, such as the four-way junction
structure, and gives results consistent with the NMR studies even
for the case with the 3′ ends at position a.
Figure 5
Alternative structures of the aptamer domain of 2′dG-sensing
riboswitch: (A) a four-way junction, (B) a three-way junction attached
by a 19-nt-long 5′-tail. The P1 stem of the three-way junction
(cyan) in (B) is disrupted to fold into a stable four-way junction
where the single-strand 5′-tail attached to the P1 stem is
base-paired with the 3′-tail. The P2 stem of the three-way
junction (magenta) in panel (B) is partially disrupted in order to
fold a stem-loop structure in panel (A).
Alternative structures of the aptamer domain of 2′dG-sensing
riboswitch: (A) a four-way junction, (B) a three-way junction attached
by a 19-nt-long 5′-tail. The P1 stem of the three-way junction
(cyan) in (B) is disrupted to fold into a stable four-way junction
where the single-strand 5′-tail attached to the P1 stem is
base-paired with the 3′-tail. The P2 stem of the three-way
junction (magenta) in panel (B) is partially disrupted in order to
fold a stem-loop structure in panel (A).On the other hand, our model also reveals an alternative structure
for the P2 stem of the aptamer domain, which is consistent with simulation
results obtained by truncating the first five 5′-nucleotides
with Barmap. As highlighted by the magenta part of the aptamer domain
in Figure , stem P2
of the three-way junction is partially disrupted in order to form
a stem-loop structure. The alternative structure, which shares the
same hairpin loop as closed by the P2 stem, contains a bulge loop
with an additional coaxial stacking interaction between the two helices
and decreases the size of the (three- or four-way) junction. Energetically,
the two alternative structures of the P2 stem shown in Figure have similar stabilities,
resulting in ≈0.35 for the population of OFF shown in Figure B.
Conclusions
Using stable helices to effectively divide the RNA folding landscape
into discrete structural ensembles, the helix-based RNA folding model
partitions the folding landscape and determines alternative secondary
structures with high accuracy. The benchmark test of 27 RNAs involving
multiple alternative structures shows that our model accurately predicts
alternative structures with top ranking structures corresponding to
our best predictions. Furthermore, the test of the metastable transcriptional
intermediates of the 2′dG-sensing riboswitch indicates that
the predicted population changes associated with sequence elongation
agree well with the experimental results. The model correctly predicts
the conformational rearrangement between ON and OFF states and the
coexistence of two ON states in slow exchange, which is not predictable
by simulations from the Barmap approach.Furthermore, reducing
the large conformational space associated
with the whole folding landscape at single base pair resolution to
a limited number of selective helix pool size-dependent landscape
partitions provides us with a feasible way to decipher the dynamics/kinetics
properties and structure–function relationship of RNA molecules
from folding landscape studies.For the free energy-based RNA
folding models, the simulation results
are usually sensitive to the choice of energy parameters, especially
for the prediction of MFE structures. Partitioning the landscape to
determine alternative structures with the helix-based RNA folding
model may effectively increase the tolerance of uncertainty for energy
parameters and better reveal the structure–function relationship
of RNA molecules. Furthermore, the model may also provide a way to
facilitate the experimental determination of functional structures
through a step-by-step screening of stable helices. However, the current
model is limited to predicting folding at the secondary structure
level. Further development of the model should include more complicated
structures, such as pseudoknotted and loop–loop kissing motifs.
Algorithm
and Methods
Different RNA sequences
may have different sets of helices with different folding stabilities.
Helices with lower free energies (higher stabilities) may have higher
probabilities of initiating the folding process and surviving in stable
structures. Therefore, we can use stable helices to effectively divide
the whole RNA folding landscape into discrete partitions and determine
alternative structures to better decipher the structure–function
relationship of RNAs. As shown in Figure , the helix-based strategy for partitioning
the RNA folding landscape works with the following steps.
Figure 6
Schematic of
the helix-based RNA folding landscape partition: (1)
build a saturated helix pool from a given RNA sequence; (2) select
top helices according to the helix stabilities; (3) partition the
folding landscape from the selective helix pool. Each helix is denoted
by (i, j, k). The
folding partition in the dashed rectangle represents a structural
ensemble of all structures with the helix H1 and without the helices
H2 and H3.
A saturated helix
pool is built from
a given RNA sequence. We only consider the canonical base pairs, such
as A-U, G-C, and G-U, in this step. Helices with only one base pair
are removed. All sub-helices are not included in the saturated helix
pool.As shown in Figure , each saturated helix is denoted by (i, j, k), with (i, j) a starting base pair at nucleotide positions i and j, and k is the
number of base pairs in the helix. To enumerate all possible saturated
helices, we first identify all starting pairs of (i, j), such that (i, j) is a canonical base pair, while (i – 1, j + 1) is noncanonical (except for i =
1 or j = N, size of a given sequence).
For each starting pair of (i, j),
we then scan for k consecutive canonical base pairs,
such that (i + k, j – k) is noncanonical or the size of the
fragment within (i + k, j – k) is less than 3 (i.e., minimum
hairpin loop).Top
helices are selected according
to the helix stabilities. The stability of each helix is calculated
by the base stacking energies from the Turner parameters.[36] The number of selected helices is arbitrary
and is related to the total number of folding landscape partitions,
as well as the computational time. As shown in the example in Figure , the selective helix
pool contains the three most stable helices (H1, H2, and H3).The folding landscape
is partitioned
from the selective helix pool. By enumerating all the possible helix
combinations and deleting those with helix overlaps (two nucleotides
are base-paired with the same one) or crossing (two base pairs of
(i, j) and (m, n) in the conditions of m > i, j > m, and n > j, as in the case of a pseudoknotted
structure),
we obtain a list of folding partitions featured by the selected helices.
Each partition defines a structural ensemble in the overall folding
landscape containing all the secondary structures with base pairs
from the included helices but without base pairs from the excluded
helices in the selective helix pool.Schematic of
the helix-based RNA folding landscape partition: (1)
build a saturated helix pool from a given RNA sequence; (2) select
top helices according to the helix stabilities; (3) partition the
folding landscape from the selective helix pool. Each helix is denoted
by (i, j, k). The
folding partition in the dashed rectangle represents a structural
ensemble of all structures with the helix H1 and without the helices
H2 and H3.Since folding partitions are defined
by the inclusion and exclusion
of selected helices, there is at least one helix difference between
any two of them. The size of the selective helix pool, that is, the
threshold of helix stability, determines the discriminative ability
of the partitioned landscape. Larger selective helix pool size leads
to smaller differences between different partitions.
Alternative
Structure Determination
We use the Vfold2D[25] model with the corresponding base pair constraints
from the selective helix pool to calculate the total free energy of
each folding partition and treat the minimal free energy (MFE) structure
as the representative of each partition. Other RNA secondary structure
prediction models, such as Mfold,[18] RNAstructure,[19,20] and RNAfold,[21] may also be applied to
calculate the free energy and determine the MFE structure with similar
performances.As shown by the example of the dashed rectangle
in Figure , the assembled
partition represents a structural ensemble of all secondary structures
with the helix H1 (included) and without the helices H2 and H3 (excluded)
in the three-helix selective helix pool. Since base pairs in both
helix terminal ends are usually unstable and involve spontaneous dynamics
of opening and closing in solution, we only consider base pairs in
the middle of helices (one-third in length) for the included helices,
as shown by the four red ones (h1) of a 10-bp helix H1 in Figure . The removal of
terminal base pairs would allow two partially overlapping saturated
helices to coexist, which often happens in junction motifs, where
nucleotides may be base-paired in different helices. For the excluded
helices, however, we use all the base pairs as the base pair constraints
in Vfold2D for structure enumeration.