Literature DB >> 21685058

A conditional random fields method for RNA sequence-structure relationship modeling and conformation sampling.

Abstract

UNLABELLED: Accurate tertiary structures are very important for the functional study of non-coding RNA molecules. However, predicting RNA tertiary structures is extremely challenging, because of a large conformation space to be explored and lack of an accurate scoring function differentiating the native structure from decoys. The fragment-based conformation sampling method (e.g. FARNA) bears shortcomings that the limited size of a fragment library makes it infeasible to represent all possible conformations well. A recent dynamic Bayesian network method, BARNACLE, overcomes the issue of fragment assembly. In addition, neither of these methods makes use of sequence information in sampling conformations. Here, we present a new probabilistic graphical model, conditional random fields (CRFs), to model RNA sequence-structure relationship, which enables us to accurately estimate the probability of an RNA conformation from sequence. Coupled with a novel tree-guided sampling scheme, our CRF model is then applied to RNA conformation sampling. Experimental results show that our CRF method can model RNA sequence-structure relationship well and sequence information is important for conformation sampling. Our method, named as TreeFolder, generates a much higher percentage of native-like decoys than FARNA and BARNACLE, although we use the same simple energy function as BARNACLE. CONTACT: zywang@ttic.edu; j3xu@ttic.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
RNA

Year: 2011 PMID： 21685058 PMCID： PMC3117333 DOI： 10.1093/bioinformatics/btr232

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

RNA has become an important research subject in recent years, and there is an increasing study of non-coding RNA in biology and health. Its growing important role appears in various life domains and processes, including regulating gene expression (Backofen ; Hiller ; Ray ; Solnick, 1985), interaction with other ligands (Badorrek ; Buck ) and stabilizing itself (Reymond ). To elucidate the function of an RNA molecule, it is essential to determine its 3D structure. However, there are a great number of RNA sequences without solved structures. Experimental methods for RNA 3D structure determination are time-consuming, expensive and sometimes technically challenging. By far, there are ~29 million RNA molecules with (predicted) secondary structure in the Rfam database (Gardner ), but only 4816 of them have tertiary structures in the nucleotide database (Berman ). Therefore, we have to fill this large gap by predicting the 3D structure of an RNA using computational methods. RNA tertiary structure prediction does not gain as much attention as secondary structure prediction (Akutsu, 2000; Alkan ; Backofen ; Bindewald and Shapiro, 2006; Eddy and Durbin, 1994; Ferretti and Sankoff, 1989; Gardner and Giegerich, 2004; Hamada ; Havgaard ; Hofacker, 2003; Knudsen and Hein, 2003; Mathews, 2006; Mathews and Turner, 2002; Mathews and Turner, 2006; Poolsap ; Will ; Zhang ; Zuker, 2003; Zuker and Sankoff, 1984). Both molecular dynamic methods (Bindewald and Shapiro, 2006; Hajdin ; Sharma ) and knowledge-based statistical methods (Das and Baker, 2007; Das ; Frellsen ) have been proposed to fold RNA molecules. The knowledge-based statistical methods for RNA tertiary structure prediction consist of two major components: an algorithm for conformation sampling and an energy function for differentiating the native structure from decoys. Fragment assembly, a knowledge-based method widely used for protein structure prediction (Haspel ; Lee ; Simons ), has been implemented in FARNA (Das and Baker, 2007) for RNA 3D structure prediction. However, this method has a couple of limitations: (i) there is no guarantee that any region of an RNA structure can be accurately covered by structure fragments in the RNA solved structure database, which currently contains only a limited number of non-redundant solved RNA structures; and (ii) sequence information is not employed in FARNA for conformation sampling. MC-Sym (Parisien and Major, 2008) is a motif assembly method for RNA 3D structure prediction, which uses a library of nucleotides cyclic motifs (NCM) to construct an RNA structure. MC-Sym has a time complexity exponential with respect to RNA length (i.e. the number of nucleotides), so MC-Sym may not be used to predict the tertiary structure for a very large RNA. As reported in Laing and Schlick (2010), MC-Sym also fails in the case when the secondary structure of RNA lacks cyclic motifs. Recently, Frellsen )have proposed a probabilistic model (BARNACLE) of RNA conformation space. BARNACLE uses a dynamic Bayesian network (DBN) to model RNA structures, but this DBN method does not take into consideration any sequence information. In addition, BARNACLE models the interdependency between the local conformations of only two adjacent nucleotides, but not of more nucleotides. Other RNA three dimensional structure prediction methods can be found in Abraham ); Das and Baker (2007); Das ); Ding ); Flores ); Frellsen ); Gillespie ); Hajdin ); Jonikas ); Laing and Schlick (2010); Parisien and Major (2008); Sharma ); Tang ); Wexler ). This article presents a novel probabilistic method conditional random fields (CRFs) (Lafferty ) to model RNA sequence–structure relationship. Different from BARNACLE modeling only RNA structures, our CRF method models the sophisticated relationship among primary sequence, secondary structure and 3D structure, which enables us to more accurately estimate the probability of RNA conformations from its primary sequence and thus sample RNA conformations more efficiently. We have already successfully applied CRF to model protein sequence–structure relationship and conformation sampling (Zhao , 2009, 2010). However, our CRF method for proteins cannot be directly applied to RNA. In order to apply CRF to RNA modeling, we have to employ a different method to represent an RNA 3D structure and model RNA bond torsion angles. We also have to face the challenge that there are a lot fewer solved RNA structures than the solved protein structures for CRF model training. By exploiting the secondary structure information of an RNA molecule, we have also developed a novel tree-based sampling scheme that can simultaneously sample conformations for two segments far away from each other along the RNA sequence. In contrast, our protein conformation sampling method can sample conformations for only one short segment at a given time. Finally, we also have to employ a totally different energy function for RNA folding. To the best of our knowledge, CRF has also been applied to RNA secondary structure prediction (Do ) and alignment (Sato and Sakakibara, 2005), but not modeling the relationship between RNA sequence and 3D structure. Our method TreeFolder is more effective in sampling native-like decoys than FARNA and BARNACLE, although we use the same simple energy function as BARNACLE, which contains only base-pairing information. Tested on 11 RNA molecules, TreeFolder obtains much better decoys for most of them. Our results imply that TreeFolder models RNA sequence–structure relationship well, which it is feasible to sample RNA conformations without using fragments and that sequence information is important for RNA conformation sampling. Experiments also show that TreeFolder works well with predicted secondary structures generated by tools such as CONTRAfold (Do ).

2 METHODS

2.1 Representation of an RNA structure and conformation state

We can represent an RNA 3D structure using a sequence of torsion angles, as shown in Figure 1. Every nucleotide has in total seven bonds that rotate freely. Six of them lie on the backbone: P–O5′, O5′−C5′, C5′−C4′, C4′−C3′, C3′−O3′ and O3′−P. The seventh bond connects a base to atom C1′. As shown in Figure 2 torsion χ around the seventh bond has a small variance, so we assume that it is independent of the other angles and has a normal distribution. The planar angles between two adjacent bonds on the backbone are almost constants, so are the lengths of the bonds.

Fig. 1.

Conformation of a nucleotide is represented by angles.

Fig. 2.

Empirical distribution of the torsion angle χ collected from the all representative RNA structures (see Section 2.4).

Conformation of a nucleotide is represented by angles. Empirical distribution of the torsion angle χ collected from the all representative RNA structures (see Section 2.4). We use a simplified representation so that we can reduce the number of torsion angles needed for the local conformation of a nucleotide (Cao and Chen, 2005; Duarte and Pyle, 1998; Hershkovitz ; Zhang ). In particular, we use the torsions τ1 and τ2 on pseudo-bonds P−C4′ and C4′ –P (see pink lines in Figure 1). However, to determine coordinates of the six backbone atoms of a nucleotide, we also need two planar angles θ, ψ and another torsion α on bond P−O5′. Overall, we use a five tuple (τ1, τ2, θ, ψ, α) to represent the local conformation of a nucleotide. The torsion angles are separated in several groups in the whole angle space, as shown in Figure 3. Although there are many different methods to represent an RNA conformation, this simplified representation enables us to rapidly rebuild backbone atoms from angles. Similar representations have also been extensively adopted by previous works (Cao and Chen, 2005; Duarte and Pyle, 1998; Hershkovitz ; Zhang ).

Fig. 3.

(A) Empirical distribution density of the torsion (τ1) on the pseudo−bond C4′-−P and α. (B) Distribution density of the torsion (τ2) on the pseudo−bond P-C4′ and α. The empirical distributions are built from all representative RNA structures (see Section 2.4). Our simplified representation does not lose much accuracy: given the torsion angles, we can rebuild the atom coordinates of an RNA molecule with very good accuracy. As shown in Table 1, the structures rebuilt from the native angle values (assuming the bond lengths are constants) have RMSD <1 Å from; their natives.

Table 1.

Accuracy of the structures rebuilt from the native torsion angles, assuming the bond lengths are constants

PDB ID	RMSD	PDB ID	RMSD
1esy	0.77	1xjr	0.55
1kka	0.35	1zih	0.22
1l2x	0.41	28sp	1.00
1q9a	0.28	2a43	0.34

Accuracy of the structures rebuilt from the native torsion angles, assuming the bond lengths are constants Conformation state: we use a Gaussian distribution to describe the local conformation preference of one nucleotide. First, we cluster all the angles collected from the experimental structures into dozens of groups (20~100). Then, we calculate the mean and variance in each group and model the angle distribution, using Gaussian distribution. Each group (or cluster) and its Gaussian distribution are identified by an index, which is also denoted as a conformation state. Given the conformation state of a nucleotide, we can sample its real-valued angles from the corresponding distribution. Note that to make angle sampling easy and fast, we assume the torsion angles are independent of one another in Gaussian distribution. Later we will show how to empirically determine the best number of conformation states to achieve the best sampling performance.

2.2 CRF model for RNA sequence–structure relationship

Our CRF method can estimate the probability of an RNA conformation from the primary sequence and secondary structure. A CRF model consists of two major components: input features and output labels. The input features at each nucleotide include its nucleotide types, base pairing states and its neighbor nucleotide types. The input features are encoded as a vector of binary variables. The base pairing states can be predicted using some secondary structure prediction programs (Akutsu, 2000; Do ; Eddy and Durbin, 1994; Gardner and Giegerich, 2004; Knudsen and Hein, 2003; Mathews and Turner, 2006; Poolsap ; Zuker, 2003) with reasonable accuracy. The base pairing information can also be obtained using some experimental methods (Gewirth ; Wohnert ; Zwahlen ), which are much less expensive than those methods determining RNA tertiary structures. The output label at each nucleotide is a conformation state (also called label in CRF). It is also the index of the cluster which the angles at this nucleotide belongs to. In contrast to BARNACLE (Frellsen ) estimating the generative probability of an RNA structure, our CRF model estimates the conditional probability of an RNA structure, represented as a conformation state vector y, from the input feature vector x as follows. Meanwhile, Z(x) is the partition function; x is the feature vector at position i; y is the label at position i; W is the weight for transition from state i−j; V is the weight factor for predicting state i from an input feature x; L is the length of RNA, i.e. the number of nucleotides. The function ψ describes dependency between a conformation state and the input features and thus, called a label feature function. The function Φ describes dependency between two adjacent states and thus called an edge feature function. Figure 4 shows a linear-chain CRF model for the sequence–structure relationship of an artificial RNA with five nucleotides. We also extend ψ to a linear combination of features of the adjacent nucleotides in a sliding window. That is, ψ is a linear function of , WL is the window size to be determined later.

Fig. 4.

A linear-chain CRF model describes the RNA sequence–structure relationship. The input feature vector X contains sequence information and the label (state) vector Y contains local conformation states. Once the CRF model is trained, we can calculate the (marginal) probability of a conformation state at a given position, using the forward–backward algorithm as follows. We train our CRF model by maximizing the occurring probability of a set of training RNAs with solved structures. In order to avoid overfitting, we also enforce regularization on the model parameters. As such, we train the model parameters by maximizing the following regularized log-likelihood. Meanwhile, y and x are the conformation state vector and input feature vector of the k-th training RNA, W and V are model parameters defined in Equation (1) and λ and μ are the regularization factors. This maximization problem can be solved to optimal using the L-BFGS algorithm (Liu and Nocedal, 1989). We also extend the first-order CRF model to the second-order model so that we can capture dependency among three adjacent nucleotides. As in Figure 5, two adjacent positions are combined to a single super node. All the algorithms for the first-order CRF model can be easily extended to the second-order model.

Fig. 5.

The second-order CRF model describes RNA sequence–structure relationship. A super node in this model contains the conformation states in two adjacent positions.

2.3 A tree-guided conformation sampling algorithm

Once the CRF model is trained, we can use it to sample conformations for a segment in an RNA molecule. By combining this segment conformation sampling algorithm with a tree representation of the RNA base pairing information, we can have a tree-guided conformation sampling scheme, which enables us to sample conformations for two segments far away from each other along the sequence. Building a guide tree for conformation sampling: the guide tree represents the base pairing information in an RNA, which can be predicted using a secondary structure prediction method or determined by experimental methods. In the case of pseudo-knots, we remove the minimal number of base pairings so that a tree can be built. Since the pseudo-knots do not occur frequently, removal of a small number of base pairings does not impact our method. Note that all the base pairings are taken into consideration in calculating the energy of a sampled conformation. Therefore, removal of some base pairs in tree construction will not impact the formation of pseudo-knots, since we also use energy function to guide the folding simulation. Given the base pairing information, we build a guide tree as follows. The root node in the tree corresponds to the whole RNA. Given a base pair (i, j), we have one node in the tree corresponding to the segment between i and j. One node A is the child of the other node B if and only if the segment corresponding to B is the minimal segment containing the segment corresponding to A. In case that one node has more than two child nodes, we can always add some intermediate nodes so that any node has at most two child nodes. For example, supposing node B, corresponding to segment (i, j), has three child nodes A1, A2 and A3, where A corresponds to segment (i, j) and iA1 and A2 and B has only two child nodes A3 and C. Segment conformation sampling algorithm: This sampling algorithm consists of two steps: sampling a label for each nucleotide, in the segment, by the probability calculated from the CRF model and sampling real-valued angles from Gaussian distribution corresponding to a label. We use a forward–backward algorithm to sample the label sequence of a segment from position i to j. The algorithm iteratively draws a conformation label of the last position from the conditional probability as follows. Meanwhile, Z(x) is the partition function and can be calculated using the forward–backward algorithm. After the conformation state at position j is sampled, the algorithm replaces j by j−1 and repeats the sampling process until position i is sampled. Once the labels of the segment are sampled, we can sample the real-valued angles from the Gaussian distribution associated with a label. Folding simulation: the folding simulation begins with a heating up process, in which we repeatedly sample conformations for the whole RNA using the above-mentioned segment conformation sampling algorithm. This heating up procedure terminates if one conformation without steric clashes is generated. In our experiments, we usually can obtain a conformation without clashes very quickly, which is used as the initial conformation of the simulated annealing optimization (Andrieu ; Zhao ). To resample conformations of an RNA, we build a conformation sampling guide tree based upon the base pairing information in the RNA and all the nodes in the tree are marked as ‘undone’. The torsion angles of the RNA are resampled using a bottom-up method along the tree as follows. We randomly pick up an ‘undone’ node A in the tree, which is either a leaf node or a node with all the child nodes being marked as ‘done’. If A is a leaf node, we resample the angles for the segment corresponding to A using the segment conformation sampling algorithm. If A has one or two child nodes, by cutting out the segments corresponding to the child nodes, we have at most three separate segments left in A, for which we use the segment conformation sampling algorithm to generate angles separately. The new conformation is accepted if its energy is lower. Otherwise it is accepted by a probability exp (ΔE/T), where ΔE is the energy difference between current and the new conformations and T is the annealing temperature. This sampling procedure is repeated 3000 times and then node A is marked as ‘done’. The folding simulation process ends when the root node is marked as ‘done’. Energy function: different from the complex energy function in FARNA, we adopt a simple energy function used by BARNACLE (Frellsen ) as follows. where H is the number of hydrogen bonds formed in the secondary structure (every A–U and G–U pair contributes two distances, and every C–G pair contributes three distances), is the distance between the donor and the acceptor of the k-th hydrogen bond and d is the average length of hydrogen bonds of the same type. The smaller this value is, the more the decoy is consistent with its secondary structure. The energy is measured in Å, and the ideal base pair energy of 0 Å is only obtained for conformations with perfect base paring. We employ such a simple energy function (without any tuned parameters) so that we can carefully examine the performance of our sampling algorithm and perform a well-controlled comparison with other sampling methods such as BARNACLE. More sophisticated energy items, such as Mg2+ ion interaction and stacking effect of base pairs, can be taken into account in future study.

2.4 CRF model training

Training data: we build our training dataset from the RNA structure classification database DARTS (Abraham ), which collects 244 structures representing 1333 solved RNA structures and groups them into 94 clusters. Our training set comes from the 94 cluster representative structures, which have ~6000 nucleotides in total. We use all 94 cluster representative structures to build empirical distributions of bond torsion angles. To make sure our training dataset does not overlap with the 11 benchmark RNA molecules, we exclude the representative structures in the same cluster as the 11 benchmark RNA. With the remaining 83 structures, we use 3-fold cross-validation to determine the CRF model regularization factors λ and μ and the proper window size. In each fold validation, two thirds of structures are used for training and the remaining for test. Model selection: the (training/test) accuracy of a second-order CRF model is defined as the number of correctly predicted states divided by the total number of positions. Fixing the number of conformation states in a CRF model, we search for the appropriate regularization factors and window size using a grid search strategy. As shown in the Supplement Figure 1, the CRF model with 50 conformation states has the best performance when λ=5, μ=10 and window size=5. We choose these parameters to maximize accuracy and avoid overfitting. Supplement Figure 1 shows that a larger window size does not improve the test accuracy significantly, but increase the accuracy gap between the training and test data, which might indicate overfitting. We also investigate the sampling performance of the CRF model with respect to the number of conformation states. We tested our CRF models with 20, 30, 50, 80 and 100 conformation states. For each CRF model, we generate 3000 decoys for each of the five RNAs: 2a43, 28sp, 2f88, 1zih and 1xjr. Figure 6 shows the 5% quantiles of the RMSD distributions for decoys generated by four different CRF models. As shown in Figure 6, the model with 50 states generates better decoys than others.

Fig. 6.

The 5% quantiles of the RMSD distributions for decoys sampled from the CRF models with different number of conformation states. Y-axis is the RMSD value.

The 5% quantiles of the RMSD distributions for decoys sampled from the CRF models with different number of conformation states. Y-axis is the RMSD value. Using different methods to model the distribution of torsion, χ , makes a slight difference on the quality of sampled decoys. Figure 7 shows the 5% quantiles of RMSD values for 300 decoys sampled using four different χ distributions with a well-trained CRF model. In Model 1, we fix χ as the mean of the training data. Model 2 samples χ from a log normal distribution. Model 3 samples χ from a normal distribution. Model 4 uses sample χ directly from the training data without using any mathematical modeling. Finally, we decide to use the normal distribution for χ, to yield a bit of variance.

Fig. 7.

The 5% quantiles of the RMSD distributions for decoys sampled from models with different distributions of torsion χ. Model 1 uses a fixed value of χ. Model 2 uses log normal distribution. Model 3 is the normal distribution. Model 4 is the empirical distribution of all values in training data. Models 2 and 3 are fit from all training data.

3 RESULTS

We use 11 RNAs tested by both BARNACLE and FARNA to benchmark our method TreeFolder. These RNAs contain 12~46 nucleotides and are not homologous to any structures in our training dataset. In case an RNA has multiple NMR structures, we use the first structure in the PDB file as its native structure. It is not very reliable to compare two methods simply using the decoys with the lowest RMSD, since they may be generated by chance and also depend on the number of decoys to be generated. The more decoys are generated, the more likely the lowest-RMSD decoy has lower RMSD from the native. Therefore, a better strategy is to compare the RMSD distributions of decoys. Our TreeFolder generates better decoys than FARNA: we compare FARNA and TreeFolder in terms of the quality of the decoy clustering centroids. Similar to FARNA clustering only on the top 1% decoys with the lowest energy, we run MaxCluster to cluster the top 1% of our decoys with the lowest energy into five clusters. As shown in Table 2, TreeFolder can generate decoys with better cluster centroids for nine RNAs: 1a4d, 1esy, 1kka, 1q9a, 1xjr, 1zih, 28sp, 2a43 and 2f88. By the way, even if a significantly smaller number of decoys is generated by us, the lowest RMSD decoys by our TreeFolder for 1a4d, 1zih and 28sp still have smaller RMSD than those by FARNA.

Table 2.

Comparison between FARNA and our method TreeFolder

			FARNA			TreeFolder
PDB ID	Method	Len	Best cluster centroid	Lowest RMSD decoy	No. of decoys	Best cluster centroid	Lowest RMSD decoy	#Decoys
1a4d	NMR	41	6.48	3.43	28 949	3.65	2.69	7168
1esy	NMR	19	3.98	1.44	69 103	2.00	1.52	22 529
1kka	NMR	17	4.14	2.08	81 492	3.71	2.4	24 934
1l2x	X-ray	27	3.88	3.11	47 958	8.07	3.97	15 360
1q9a	X-ray	27	6.11	2.65	48 817	4.76	3.5	15 415
1qwa	NMR	21	3.71	2.01	65 977	3.77	2.49	18 838
1xjr	X-ray	46	9.82	6.25	24 646	9.26	7.05	7168
1zih	NMR	12	1.71	1.03	117 104	1.19	0.73	40 960
28sp	NMR	28	3.2	2.31	46 034	2.96	1.91	17 117
2a43	X-ray	26	4.93	2.79	49 972	4.52	3.47	18 432
2f88	NMR	34	3.63	2.41	36 664	3.33	2.7	12 230

The results of FARNA are taken from Table 1 in Das and Baker (2007). Column ‘Best cluster centroid’ lists the RMSD of the best cluster centroid of the top 1% decoys with the lowest energy. Column ‘No. of decoys’ is the number of decoys generated by the methods. Bold fonts indicate better results.

Comparison between FARNA and our method TreeFolder The results of FARNA are taken from Table 1 in Das and Baker (2007). Column ‘Best cluster centroid’ lists the RMSD of the best cluster centroid of the top 1% decoys with the lowest energy. Column ‘No. of decoys’ is the number of decoys generated by the methods. Bold fonts indicate better results. Our TreeFolder generates better decoys than BARNACLE: Table 3 displays the 5% and 25% quantiles of the RMSD distributions for decoys generated by BARNACLE and TreeFolder. The quantiles by BARNACLE are taken from Supplementary Table S4 in Frellsen ). BARNACLE considers only decoys with energy <1, since this kind of decoys are likely to have more correct base pairings. We use exactly the same energy function as BARNACLE, so we also consider only decoys with energy <1 to ensure a fair comparison. We did not generate as many decoys as BARNACLE and thus for some test RNAs we do not have many decoys with energy <1. In this case, we use decoys with energy <2. On the 10 RNAs shown in Table 3, TreeFolder yields better RMSD distributions for eight of them: 1esy, 1kka, 1q9a, 1qwa, 1xjr, 1zih, 28sp, 2a43 and 2f88.

Table 3.

The 5 and 25% quantiles of the RMSD distributions for decoys generated by our method TreeFolder and BARNACLE

			BARNACLE		TreeFolder
PDB ID	Len	Bps	5%	25%	5%	25%	# Energy <1	5%	25%	# Energy <2
1esy	19	6	2.99	3.28	2.19	2.60	577	2.25	2.78	1102
1kka	17	6	4.40	5.02	3.75	4.30	349	3.8	4.39	776
1l2x	27	8	5.43	6.88	–	–	0	5.44	8.08	5
1q9a	27	6	4.80	5.42	4.55	5.05	486	4.61	5.07	1025
1qwa	21	8	4.06	4.64	3.65	4.26	407	3.9	4.51	884
1xjr	46	15	10.41	11.01	8.50	9.43	22	8.84	9.79	540
1zih	12	4	1.72	2.16	1.32	1.84	1721	1.36	1.88	1931
28sp	28	8	3.23	3.76	2.88	3.43	152	2.93	3.58	563
2a43	26	7	4.72	6.08	–	–	0	4.64	5.48	26
2f88	34	13	3.82	4.41	3.73	3.73	1	3.85	4.57	130

Bold numbers indicate better distributions. Columns ‘#energy < 1’ and ‘#energy < 2’ list the number of decoys with energy <1 and <2, respectively. ‘Bps’ is the number of base pairings.

The 5 and 25% quantiles of the RMSD distributions for decoys generated by our method TreeFolder and BARNACLE Bold numbers indicate better distributions. Columns ‘#energy < 1’ and ‘#energy < 2’ list the number of decoys with energy <1 and <2, respectively. ‘Bps’ is the number of base pairings. Sequence information is important for RNA conformation sampling: different from other two state-of-art methods, FARNA and BARNACLE, our TreeFolder makes use of sequence information to significantly improve conformation sampling, as measured by the median RMSD values of decoys. The result is shown in Table 4, in which we compare two CRF models: one using sequence to sample conformations and the other not. Without using sequence information, our CRF method is similar to BARNACLE. That is, it models only angle state transitions in a RNA structure. Both CRF models use 50 conformation states. For the CRF model without sequence features, the regularization factor is set to 5 (i.e. λ=5). While for the CRF model utilizing sequence information, the regularization factor are set to 5 and 10 (i.e. λ=5, μ=10). To calculate the median RMSD, for each RNA we generate 300 decoys using the two CRF models.

Table 4.

Comparison between the CRF models using or without using sequence information

	Median RMSD value			Median RMSD value
PDB ID	With seq. feature	Without seq. feature	PDB ID	With seq. feature	Without seq. feature
1zih	2.68	4.56	28sp	6.02	10.27
1esy	3.73	6.17	1a4d	7.79	11.60
1kka	5.49	6.67	2a43	10.62	12.25
1qwa	5.58	5.99	1l2x	11.01	10.74
1q9a	5.91	6.84	1xjr	10.92	12.70
2f88	6.36	9.55

For 10 of the 11 tested RNAs, the model using sequence information yields decoys with much smaller median RMSD. Bold numbers indicate smaller RMSD values.

Comparison between the CRF models using or without using sequence information For 10 of the 11 tested RNAs, the model using sequence information yields decoys with much smaller median RMSD. Bold numbers indicate smaller RMSD values. Sampling real-valued angles generates better decoys: in order to show the detailed difference between our TreeFolder and FARNA, we look into the decoys of 1esy. We choose it because that FARNA and TreeFolder yield the largest difference on this RNA among all the 11 tested RNA molecules. As shown in Figure 8. TreeFolder can generate a much larger percentage of decoys with RMSD <4 Å than FARNA. We also compute local RMSD of each position in the decoys, which is defined as the RMSD of the segment of four consecutive nucleotides starting with this position, as compared to the native structure. We calculate the correlation between the local RMSD of each position with the global RMSD, as shown in Figure 9. Among the decoys generated by both FARNA and TreeFolder, the local RMSD at position 13 has the highest correlation with the global RMSD. We also calculate the angle error at each position by Error=‖v−v0‖2 , where v is the angle vector of a decoy at one position and v0 is the native angle vector at the same position.

Fig. 8.

The RMSD histograms of the 3000 decoys generated by our method TreeFolder (A) and FARNA (B) for 1esy.

Fig. 9.

Correlation between the local RMSD at each position and the global RMSD. The X-axis is the start position of a segment.

The RMSD histograms of the 3000 decoys generated by our method TreeFolder (A) and FARNA (B) for 1esy. Correlation between the local RMSD at each position and the global RMSD. The X-axis is the start position of a segment. Figure 10 shows the angle error histograms in three positions 13, 14 and 15. The angles at these three positions determine the conformation of the segment starting at position 13. At positions 13 and 15, the angle errors by our method TreeFolder are significantly smaller than those by FARNA. As Figure 10 shows, the angle errors by FARNA are distributed around several separated peaks, which may be caused by the limited number of fragments used in FARNA. In contrast, the angle errors by TreeFolder are distributed more smoothly, possibly because we can sample real-valued angles.

Fig. 10.

The angle error histograms at positions 13, 14 and 15. At positions 13 and 15, the decoys by our TreeFolder have much smaller angle errors than those by FARNA.

The angle error histograms at positions 13, 14 and 15. At positions 13 and 15, the decoys by our TreeFolder have much smaller angle errors than those by FARNA. Folding RNA using predicted secondary structures: we use the secondary structures predicted by CONTRAfold (Do ) and sample 1000 decoys for each RNA. The quantiles of their RMSD values are shown in Table 5. On 6 of the 10 tested RNA, decoys generated from native secondary structures are better than those from predicted secondary structures. On the other four RNAs, the difference between the two types of decoys is small, because of accurate secondary structure prediction. The results for 1l2x and 2a43 from predicted secondary structures are quite bad, since all of their base pairs are contained in a H-type pseudoknot and only half of their base pairs are recovered by CONTRAfold. However, our TreeFolder generates decent conformations for half of the pseudoknot with predicted base pairs, as shown in brackets. In particular, TreeFolder generates decent structures for 2a43 from nucleotides 1 to 14 and for 1l2x from nucleotides 1 to 18, respectively. In order to improve sampling performance on the whole structures of 1l2x and 2a43, we need an energy function like what is used in FARNA to guide the folding simulation.

Table 5.

Comparison between folding with native and predicted secondary structure

PDB ID	Distribution of RMSD values
	Native SS		Predicted SS
	5%	25%	5%	25%
1esy	2.25	2.78	3.90	4.35
1kka	3.80	4.39	4.57	5.46
1l2x	5.44	8.08	15.23 (3.53)	17.32 (3.88)
1q9a	4.61	5.07	4.65	5.01
1qwa	3.90	4.51	3.45	4.31
1xjr	8.84	9.79	9.17	9.79
1zih	1.36	1.88	3.56	4.02
28sp	2.93	3.58	2.71	3.63
2a43	4.64	5.48	21.22 (3.89)	21.99 (4.35)
2f88	3.85	4.57	3.58	4.21

The four numerical columns list the RMSD values of the 5 and 25% quantiles of the decoys with energy values <2. Bold numbers indicate better results.

Comparison between folding with native and predicted secondary structure The four numerical columns list the RMSD values of the 5 and 25% quantiles of the decoys with energy values <2. Bold numbers indicate better results. Comparison with MC-Sym on the large RNA molecules: our TreeFolder is much faster than the MC-Fold and MC-Sym pipeline (Parisien and Major, 2008) for folding large RNA molecules, as shown in Table 6. The running times in this table were obtained on a workstation with 96 GB RAM and 24 computing cores [2.67 GHz Intel(R) Xeon(R)].

Table 6.

Running time comparison between MC-Sym and our TreeFolder on large RNA molecules

PDB ID	Length	MC-Sym (h)	TreeFolder (s)
1l8v	152	48	1919
2gis	94	32	564
1vc7	74	46	400

Running time comparison between MC-Sym and our TreeFolder on large RNA molecules Overlay examples: Figure 11 shows three overlay examples of 1q9a, 2a43 and 1xjr with length of 27 nt, 26 nt and 49 nt, respectively. Pictures in blue display native, while in red the best centroids produced by our algorithm. As shown in this figure, our algorithm recovered a pseudoknot for 2a43.

Fig. 11.

Overlay representation of the best centroids (red) of 1q9a, 2a43 and 1xjr (from left to right) with their native structures (blue). These three RNA molecules have lengths of 27 nt, 26 nt and 49 nt.

4 CONCLUSIONS

We have presented a new method TreeFolder for modeling RNA sequence–structure relationship and conformation sampling using CRFs and a tree-guided sampling scheme. Our CRF method not only captures the relationship between sequence and angles, but also models the interdependency among the angles of three adjacent nucleotides. Our conformation sampling method distinguishes from FARNA in that we do not use fragments to build RNA conformations, so that we do not need to worry about if there are a sufficient number of structure fragments to cover all the possible local conformations. Our TreeFolder also differs from both FARNA and BARNACLE, in that we use primary sequence to estimate the probability of backbone angles, while the latter two do not. In addition, we also use a tree, built from (predicted) secondary structure, to guide conformation sampling so that at one moment we can simultaneously sample conformations for two segments far away from each other along the RNA sequence. In contrast, both FARNA and BARNACLE can only sample conformations for a single short segment at any time. The results indicate that our TreeFolder indeed models sequence–structure relationship well and compares favorably to both FARNA and BARNACLE, even if we use only the same simple energy function as BARNACLE. We will extend our TreeFolder further. For example, we can incorporate information in sequence homologs into our CRF model so that we can estimate the conformation probability more accurately and thus improve the sampling accuracy. Information in homologs has been successfully used in RNA secondary structure and should be useful for 3D structure prediction. Information in homologs has also been used for protein conformation sampling (Zhao ). Currently TreeFolder works well when the native base pairing information is used to calculate the energy function (same as BARNACLE) and to build the sampling guide tree. Not all the RNAs without 3D structures have the native base pairing information. Our next step is to further improve TreeFolder with the predicted base pairings. In particular, we need to design an energy function similar to what is used in FARNA to guide the folding simulation so that TreeFolder works well even if the predicted secondary structure is not very accurate. To tolerate errors in the predicted base pairing information, we will use the predicted confidence as the weight of each item in the energy function and only use those base pairings with high confidence to build the conformation sampling guide tree. We can also take another strategy to circumvent possible impact of errors in the predicted base pairings. In particular, we will extend our CRF method so that we can simultaneously sample base pairings and 3D conformations so that errors in the predicted base pairings will be corrected in the folding simulation process. Currently, we use a very simple energy function to guide the folding simulation. We will develop a more sophisticated energy function to guide the formation of hydrogen bonds in a better way, just like what FARNA does. Thus, we can not only generate decoys with better RMSD, but also with better hydrogen bonds.

51 in total

1. CONTRAfold: RNA secondary structure prediction without physics-based models.

Authors: Chuong B Do; Daniel A Woods; Serafim Batzoglou
Journal: Bioinformatics Date: 2006-07-15 Impact factor: 6.937

2. Statistical analysis of RNA backbone.

Authors: Eli Hershkovitz; Guillermo Sapiro; Allen Tannenbaum; Loren Dean Williams
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2006 Jan-Mar Impact factor: 3.710

3. Ab initio RNA folding by discrete molecular dynamics: from structure prediction to folding mechanisms.

Authors: Feng Ding; Shantanu Sharma; Poornima Chalasani; Vadim V Demidov; Natalia E Broude; Nikolay V Dokholyan
Journal: RNA Date: 2008-05-02 Impact factor: 4.942

4. Stepping through an RNA structure: A novel approach to conformational analysis.

Authors: C M Duarte; A M Pyle
Journal: J Mol Biol Date: 1998-12-18 Impact factor: 5.469

5. A continuous analog for RNA folding.

Authors: V Ferretti; D Sankoff
Journal: Bull Math Biol Date: 1989 Impact factor: 1.758

6. Secondary structure of 5S RNA: NMR experiments on RNA molecules partially labeled with nitrogen-15.

Authors: D T Gewirth; S R Abo; N B Leontis; P B Moore
Journal: Biochemistry Date: 1987-08-11 Impact factor: 3.162

7. Discriminative learning for protein conformation sampling.

Authors: Feng Zhao; Shuaicheng Li; Beckett W Sterner; Jinbo Xu
Journal: Proteins Date: 2008-10

8. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering.

Authors: Sebastian Will; Kristin Reiche; Ivo L Hofacker; Peter F Stadler; Rolf Backofen
Journal: PLoS Comput Biol Date: 2007-02-22 Impact factor: 4.475

9. RNA folding on the 3D triangular lattice.

Authors: Joel Gillespie; Martin Mayne; Minghui Jiang
Journal: BMC Bioinformatics Date: 2009-11-05 Impact factor: 3.169

10. A comprehensive comparison of comparative RNA structure prediction approaches.

Authors: Paul P Gardner; Robert Giegerich
Journal: BMC Bioinformatics Date: 2004-09-30 Impact factor: 3.169

5 in total

1. Modeling and Predicting RNA Three-Dimensional Structures.

Authors: Vladimir Reinharz; Roman Sarrazin-Gendron; Jérôme Waldispühl
Journal: Methods Mol Biol Date: 2021

2. RNA-MoIP: prediction of RNA secondary structure and local 3D motifs from sequence data.

Authors: Jason Yao; Vladimir Reinharz; François Major; Jérôme Waldispühl
Journal: Nucleic Acids Res Date: 2017-07-03 Impact factor: 16.971

3. Drug-target interaction prediction by integrating chemical, genomic, functional and pharmacological data.

Authors: Fan Yang; Jinbo Xu; Jianyang Zeng
Journal: Pac Symp Biocomput Date: 2014

4. Towards 3D structure prediction of large RNA molecules: an integer programming framework to insert local 3D motifs in RNA secondary structure.

Authors: Vladimir Reinharz; François Major; Jérôme Waldispühl
Journal: Bioinformatics Date: 2012-06-15 Impact factor: 6.937

Review 5. Systems biology in the context of big data and networks.

Authors: Md Altaf-Ul-Amin; Farit Mochamad Afendi; Samuel Kuria Kiboi; Shigehiko Kanaya
Journal: Biomed Res Int Date: 2014-05-27 Impact factor: 3.411

5 in total