
What is the best reference state for building statistical potentials in RNA 3D structure evaluation?

Ya-Lan Tan1, Chen-Jie Feng1, Lei Jin1, Ya-Zhou Shi2, Wenbing Zhang1, Zhi-Jie Tan1.   

Abstract

Knowledge-based statistical potentials have been shown to be efficient in protein structure evaluation/prediction, and the core difference between various statistical potentials is attributed to the choice of reference states. However, for RNA 3D structure evaluation, a comprehensive examination on reference states is still lacking. In this work, we built six statistical potentials based on six reference states widely used in protein structure evaluation, including averaging, quasi-chemical approximation, atom-shuffled, finite-ideal-gas, spherical-noninteracting, and random-walk-chain reference states, and we examined the six reference states against three RNA test sets including six subsets. Our extensive examinations show that, overall, for identifying native structures and ranking decoy structures, the finite-ideal-gas and random-walk-chain reference states are slightly superior to others, while for identifying near-native structures, there is only a slight difference between these reference states. Our further analyses show that the performance of a statistical potential is apparently dependent on the quality of the training set. Furthermore, we found that the performance of a statistical potential is closely related to the origin of test sets, and for the three realistic test subsets, the six statistical potentials have overall unsatisfactory performance. This work presents a comprehensive examination on the existing reference states and statistical potentials for RNA 3D structure evaluation.
© 2019 Tan et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.

Keywords:  RNA 3D structure; knowledge-based potential; reference states

Year:  2019        PMID: 30996105      PMCID: PMC6573789          DOI: 10.1261/rna.069872.118

Source DB:  PubMed          Journal:  RNA        ISSN: 1355-8382            Impact factor:   4.942


INTRODUCTION

RNA molecules play vital roles in cell life activities such as gene regulations and catalysis (Dethoff et al. 2012; Guttman and Rinn 2012), and their functions are generally relevant to their structures (Watson et al. 2003; Montange and Batey 2008). Therefore, understanding RNA structures, especially RNA three-dimensional (3D) structures, would help to understand their biological functions. RNA 3D structures can be derived through several experimental techniques such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy (Aviv et al. 2006; Baird et al. 2010). However, it is still very expensive and time consuming to experimentally obtain high-resolution RNA 3D structures and consequently, RNA structures deposited in the PDB database (Rose et al. 2017) are still limited up to now. To complement experimental methods, various computational models have been proposed (Shi et al. 2014a; Miao and Westhof 2017; Schlick and Pyle 2017; Sun et al. 2017), aiming to predict RNA 3D structures in silico, including knowledge-based and physics-based models (Major et al. 1991; Das and Baker 2007; Ding et al. 2008; Parisien and Major 2008; Das et al. 2010; Jossinet et al. 2010; Cao and Chen 2011; Rother et al. 2011; Popenda et al. 2012; Zhang et al. 2012; Zhao et al. 2012; Cragnolini et al. 2013; Xia et al. 2013; Kim et al. 2014; Shi et al. 2014b, 2015, 2018; Xu et al. 2014; Bian et al. 2015; Boniecki et al. 2016; Li et al. 2016; Magnus et al. 2016; Bell et al. 2017; Jain and Schlick 2017; Wang et al. 2017; Jin et al. 2018). Generally, a predictive model can produce an ensemble of folded candidate structures, and consequently, a reliable knowledge-based statistical potential is required to evaluate predicted candidate structures. Furthermore, a reliable statistical potential can be used to guide RNA 3D structure folding (Jonikas et al. 2009; Zhang and Chen 2018). 
Knowledge-based statistical potential has been proved to be efficient and effective for evaluating protein tertiary structures (Sippl 1990; DeBolt and Skolnick 1996; Thomas and Dill 1996; Samudrala and Moult 1998; Lu and Skolnick 2001; Zhou and Zhou 2002; Shen and Sali 2006; Rykunov and Fiser 2007; Zhang and Zhang 2010; Huang and Zou 2011), protein–protein (Huang and Zou 2008), and protein–ligand docking (Huang and Zou 2006a, 2006b), and there have been six representative statistical potentials developed for protein tertiary structure evaluation, i.e., RAPDF (Samudrala and Moult 1998), KBP (Lu and Skolnick 2001), HA_SRS (Rykunov and Fiser 2007), Dfire (Zhou and Zhou 2002), Dope (Shen and Sali 2006), and RW (Zhang and Zhang 2010). These six statistical potentials for proteins are built based on six different reference states, i.e., averaging (Samudrala and Moult 1998), quasi-chemical approximation (Lu and Skolnick 2001), atom-shuffled (Rykunov and Fiser 2007), finite-ideal-gas (Zhou and Zhou 2002), spherical-noninteracting (Shen and Sali 2006), and random-walk-chain (Zhang and Zhang 2010) reference states, respectively, and the core difference between these statistical potentials mainly originates from the choice of different reference states. For RNA 3D structure evaluation, several statistical potentials have been built based on different reference states. Bernauer et al. (2011) have derived fully differentiable statistical potentials of KB at both all-atom and coarse-grained levels, based on the quasi-chemical approximation reference state. Capriotti et al. (2011) also have built all-atom and coarse-grained statistical potentials of RASP based on the averaging reference state. Recently, for RNA 3D structure evaluation, Wang et al. (2015) obtained a combined statistical potential of 3dRNAscore based on averaging reference state, which is composed of distance-dependent and torsion angle-dependent potentials. 
Simultaneously, knowledge-based statistical potentials have also been built to simulate RNA structural folding (Jonikas et al. 2009; Zhang and Chen 2018). Jonikas et al. (2009) proposed a nucleotide-level coarse-grained potential with bond, angle, dihedral, and nonbonded terms based on statistical analyses of RNA structure information, and used the potential to model 3D structures of large RNAs based on secondary structure and tertiary contact predictions. Very recently, Zhang and Chen proposed a set of correlated energy functions through an iterative method, and such energy functions can produce RNA structural parameters very close to those from experimental structures in the PDB database (Zhang and Chen 2018). However, compared to proteins, only the averaging and quasi-chemical approximation reference states have been used to construct statistical potentials for RNA 3D structure evaluation, and a comprehensive understanding of how the reference states widely used for proteins perform for RNAs is still lacking. Therefore, we performed a comprehensive examination of these reference states to determine which one is best suited to RNA 3D structure evaluation. In this work, based on six representative reference states, namely the averaging (Samudrala and Moult 1998), quasi-chemical approximation (Lu and Skolnick 2001), atom-shuffled (Rykunov and Fiser 2007), finite-ideal-gas (Zhou and Zhou 2002), spherical-noninteracting (Shen and Sali 2006), and random-walk-chain (Zhang and Zhang 2010) reference states, we have built six statistical potentials for RNA 3D structure evaluation. Furthermore, we conducted an extensive examination of the six statistical potentials against three RNA test sets, including six subsets, by comparing their ability to identify native structures, identify near-native structures, and rank whole decoy sets. 
Additionally, we made extensive comparisons with the existing statistical potentials for RNAs and further examined the effect of training sets on the performance of statistical potentials. In order to get a reliable understanding of the reference states, the six statistical potentials were trained by a uniform nonredundant RNA training set and with the same parameters, such as bin width and distance cutoff.

RESULTS

Evaluation metrics

In general, there are two aspects for assessing the performance of a statistical potential: the ability to correctly identify the native structure from a pool of decoys and the ability to rank the near-native structures reasonably. Thus, in this work, we count the native structures in the test sets to which a statistical potential assigns the minimum energy, and we also calculate the ES (enrichment score) (Tsai et al. 2003; Bernauer et al. 2011; Wang et al. 2015) and PCC (Pearson correlation coefficient) (Capriotti et al. 2011) as metrics for near-native structures. ES is defined as (Bernauer et al. 2011; Wang et al. 2015)

$$\mathrm{ES} = \frac{\left|E_{10\%} \cap R_{10\%}\right|}{0.1 \times 0.1 \times N_{\mathrm{decoys}}} \qquad (1)$$

where $\left|E_{10\%} \cap R_{10\%}\right|$ is the number of structures with energy in the lowest 10% energy range whose rmsd is also in the lowest 10% rmsd range, and $N_{\mathrm{decoys}}$ is the total number of decoy structures for one native RNA. If the energy is perfectly correlated with the rmsd, ES is equal to 10, and if it is completely unrelated to the rmsd, ES is equal to 1. PCC is given as (DeBolt and Skolnick 1996)

$$\mathrm{PCC} = \frac{\sum_{n=1}^{N}\left(E_n - \bar{E}\right)\left(R_n - \bar{R}\right)}{\sqrt{\sum_{n=1}^{N}\left(E_n - \bar{E}\right)^2}\,\sqrt{\sum_{n=1}^{N}\left(R_n - \bar{R}\right)^2}} \qquad (2)$$

where $E_n$ and $R_n$ are the energy and rmsd of the nth structure, respectively, $\bar{E}$ and $\bar{R}$ are the average energy and average rmsd of decoys, respectively, and $N$ is the total number of decoy structures for one native RNA. Equation 2 indicates that the closer the value of PCC is to 1, the more linear the relationship between the rmsds and energies. If PCC is equal to 1, the relationship between the energies and rmsds is completely linear and the performance of the statistical potential is perfect. From the above, the number of identified native structures, the ES value, and the PCC value describe the abilities of a statistical potential in identifying native structures, in identifying near-native structures, and in ranking whole decoy structures, respectively. In the following, we evaluate the six statistical potentials using these three evaluation metrics against three different RNA test sets including six test subsets.
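As an illustration of the ES and PCC metrics described above, the following Python sketch computes both for one decoy set. It is not the authors' code: the function names are hypothetical, and it assumes that "the lowest 10% range" means the 10% of decoys with the lowest values (selected by count), which is one possible reading of the definition.

```python
import math


def enrichment_score(energies, rmsds, fraction=0.1):
    """ES for one decoy set: overlap of the lowest-energy and lowest-rmsd
    subsets, divided by fraction * fraction * N_decoys.

    Assumption: the "lowest 10%" subsets are chosen by count, not by value range.
    """
    n = len(energies)
    if n == 0 or n != len(rmsds):
        raise ValueError("energies and rmsds must be non-empty and equally long")
    k = max(1, round(fraction * n))  # number of decoys in each 10% subset
    lowest_energy = set(sorted(range(n), key=lambda i: energies[i])[:k])
    lowest_rmsd = set(sorted(range(n), key=lambda i: rmsds[i])[:k])
    return len(lowest_energy & lowest_rmsd) / (fraction * fraction * n)


def pearson_cc(energies, rmsds):
    """Pearson correlation coefficient between decoy energies and rmsds."""
    n = len(energies)
    if n < 2 or n != len(rmsds):
        raise ValueError("need at least two paired values")
    mean_e = sum(energies) / n
    mean_r = sum(rmsds) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(energies, rmsds))
    var_e = sum((e - mean_e) ** 2 for e in energies)
    var_r = sum((r - mean_r) ** 2 for r in rmsds)
    return cov / math.sqrt(var_e * var_r)
```

With energies that increase linearly with rmsd, this sketch gives ES close to 10 and PCC close to 1, the ideal values stated in the text; with energies that decrease with rmsd, the two 10% subsets do not overlap and ES is 0.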

Performance on test set I

Test set I, called randstr decoy set (Capriotti et al. 2011), consists of 85 RNAs with decoy structures generated by MODELER (Šali and Blundell 1993) with a set of Gaussian restraints for atom distances and dihedral angles from 85 native structures, which can be downloaded at http://melolab.org/supmat/RNApot/Sup._Data.html. In test set I, there are 500 decoy structures for each RNA native structure, and the rmsds of decoy structures for test set I are mainly distributed in the range of 0–6 Å; see Figure 1A. Firstly, we examined the six statistical potentials through identifying native structures from decoy structures for test set I, and the numbers of native structures identified by them are summarized in Table 1. As shown in Table 1, all the statistical potentials identify 83 native structures out of the decoys of 85 RNAs. Afterward, we calculated the ES and PCC values for test set I by the six statistical potentials. As shown in Supplemental Table S2 in the Supplemental Material, the average ES values of 85 decoys obtained by Avg-REF, QChA-REF, ASh-REF, FIG-REF, SNI-REF, and RWC-REF are 9.0, 8.9, 9.0, 8.9, 9.0, and 8.9, respectively, and the average PCC values are 0.87, 0.86, 0.87, 0.85, 0.87, and 0.87, respectively. This indicates that for test set I, the correlations between rmsds and energies from the six statistical potentials are all very strong and all reach high performance. Thus, overall, the six statistical potentials all exhibit high performance and are not significantly different in structure evaluation for the test set I. The rmsd-energy scatterplots for all the 85 RNAs in test set I by the six statistical potentials can be found in the Supplemental Material.
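The native-identification count reported here can be expressed compactly. The sketch below assumes, as the metric definition suggests, that a native structure counts as identified when its energy is strictly lower than that of every decoy for the same RNA; the data layout and the toy energies are hypothetical.

```python
def native_is_identified(native_energy, decoy_energies):
    """True if the native structure has the minimum energy among its decoys.

    Assumption: ties with a decoy do not count as identification.
    """
    return all(native_energy < e for e in decoy_energies)


def count_identified(test_set):
    """Count RNAs whose native structure is ranked first by energy.

    `test_set` maps an RNA identifier to (native_energy, decoy_energies).
    """
    return sum(
        native_is_identified(native, decoys)
        for native, decoys in test_set.values()
    )


# Toy, made-up energies for two hypothetical RNAs (not data from the paper).
toy_set = {
    "rna_a": (-120.0, [-110.5, -98.2, -101.7]),   # native is lowest
    "rna_b": (-80.0, [-85.3, -70.1, -60.9]),      # one decoy scores lower
}
identified = count_identified(toy_set)
```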
FIGURE 1.

(A) The rmsd probability distribution of decoys in test set I (Bernauer et al. 2011). (B) The rmsd probability distributions of decoys in test set II, which is composed of MD (Bernauer et al. 2011), NM (Bernauer et al. 2011), and FARNA (Das and Baker 2007) subsets within 16 Å. (C) The rmsd probability distributions of test set III, which is composed of FARFAR (Das et al. 2010) and RNA-Puzzles (Miao et al. 2017) subsets within 34 Å.

TABLE 1.

The numbers of native structures identified by the different statistical potentials


Performance on test set II

Test set II is comprised of the decoy structures built by Bernauer et al. (2011) and Das and Baker (2007). The former includes two subsets: decoys for five RNAs generated by replica-exchange molecular dynamics simulations with atom position restrained, called MD subset (Bernauer et al. 2011); and decoys for 15 RNAs generated by normal mode perturbation method, called NM subset (Bernauer et al. 2011). These two subsets can be downloaded from http://csb.stanford.edu/rna/. There are 3500 decoy structures for four RNAs and 2600 decoy structures for one RNA (PDB ID: 1msy) in the MD subset, and the rmsds of the decoy structures are mainly in the range of 0–3 Å; see Figure 1B. Furthermore, there are 500 decoy structures for each RNA in the NM subset, and the majority rmsds of decoy structures are in the range of 1–5 Å; see Figure 1B. The latter is called the FARNA subset (Das and Baker 2007), which includes decoys for 20 RNAs and was generated by the FARNA method (Das and Baker 2007). For each RNA in the FARNA subset, there are about 500 decoy structures, and the rmsds of decoy structures for the FARNA subset are quite dispersed in the range of 4–15 Å. The FARNA subset can be downloaded from https://daslab.stanford.edu/site_data/pub_data/. For the MD subset in test set II, as shown in Table 1, ASh-REF, FIG-REF, and RWC-REF can identify five native structures out of the decoys of five RNAs, while Avg-REF, QChA-REF, and SNI-REF identified four native structures. As shown in Table 2, the ES values from the statistical potentials derived by six reference states are all around 8.0 and do not differ much: ASh-REF and SNI-REF are very slightly higher than others and FIG-REF is slightly lower than the other five. However, for the PCC value, FIG-REF has the best performance with 0.85 and RWC-REF is very slightly lower than FIG-REF. The performances of Avg-REF, QChA-REF, ASh-REF, and SNI-REF are similar, and are slightly lower than FIG-REF and RWC-REF. 
Therefore, overall, FIG-REF and RWC-REF slightly outperform others for the MD subset, and the higher performance on PCC values of FIG-REF as well as RWC-REF may mainly come from their high performance on the decoy structures with relatively large rmsds (see Figure 2 and Supplemental Figure S2 in the Supplemental Material for the rmsd-energy scatter-plots from the six potentials for RNAs of PDB IDs 1f27 and 434d).
TABLE 2.

The ES and PCC values calculated by the different statistical potentials for test set II

FIGURE 2.

The rmsd-energy scatter-plot of 1f27 in RNA test set II_MD. Here, the energy was calculated by the statistical potentials based on six reference states: Avg-REF, the averaging reference state (Samudrala and Moult 1998); QChA-REF, the quasi-chemical approximation reference state (Lu and Skolnick 2001); ASh-REF, the atom-shuffled reference state (Rykunov and Fiser 2007); FIG-REF, the finite-ideal-gas reference state (Zhou and Zhou 2002); SNI-REF, the spherical-noninteracting reference state (Shen and Sali 2006); RWC-REF, the random-walk-chain reference state (Zhang and Zhang 2010). The native structure is highlighted by an empty circle, and ES and PCC values are shown in the respective panels.

For the NM subset, as shown in Table 1, all the statistical potentials can identify 13 native structures out of the decoys for 15 RNAs, and the two unidentified native structures are the RNAs with PDB IDs 1esy and 1kka, whose native structures were solved by NMR spectroscopy at low salt (Amarasinghe et al. 2000; Cabello-Villegas et al. 2002). The experimental conditions may be the main reason that these two native structures cannot be identified, since RNA structure can be strongly influenced by the cation concentration of the solution and becomes less compact at lower salt owing to the polyanionic nature of RNA (Jin et al. 2018; Shi et al. 2018). As shown in Table 2, the average ES value of RWC-REF is slightly higher than those of other potentials, and the average ES of FIG-REF is slightly lower than those of others. However, for average PCC values shown in Table 2, FIG-REF and RWC-REF have the best performances, and the performances of Avg-REF, QChA-REF, ASh-REF, and SNI-REF are similar and very slightly lower than those of FIG-REF and RWC-REF. 
Therefore, based on the overall assessment of both ES and PCC values, RWC-REF has a very slightly better performance for the NM subset. For the two largest RNAs (PDB IDs 1x9k and 1i9v), on which RWC-REF has the best PCC value, the rmsd-energy scatter-plots by these six potentials are shown in Figure 3 and in Supplemental Figure S3 in the Supplemental Material, respectively.
FIGURE 3.

The rmsd-energy scatter-plot of 1x9k in RNA test set II_NM. Here, the energy was calculated by the statistical potentials based on six reference states: Avg-REF, the averaging reference state (Samudrala and Moult 1998); QChA-REF, the quasi-chemical approximation reference state (Lu and Skolnick 2001); ASh-REF, the atom-shuffled reference state (Rykunov and Fiser 2007); FIG-REF, the finite-ideal-gas reference state (Zhou and Zhou 2002); SNI-REF, the spherical-noninteracting reference state (Shen and Sali 2006); RWC-REF, the random-walk-chain reference state (Zhang and Zhang 2010). The native structure is highlighted by an empty circle, and ES and PCC values are shown in the respective panels.

For the FARNA subset, as shown in Table 1, the numbers of native structures identified by Avg-REF, QChA-REF, ASh-REF, FIG-REF, SNI-REF, and RWC-REF are 15, 15, 15, 16, 15, and 17 out of the decoys for 20 RNAs, respectively, and RWC-REF identifies the most native structures for the FARNA subset. As shown in Table 2 for ES and PCC values, the six statistical potentials all have unsatisfactory performances, with mean ES values <3 and PCC values <0.4. QChA-REF performs slightly better than the others, with a mean ES value of 2.56 and a PCC value of 0.38, and FIG-REF has the worst performance, with a mean ES value of 1.83 and a PCC value of 0.20. Overall, QChA-REF slightly outperforms other potentials on evaluation metrics for near-native structures in the FARNA subset. Besides, the rmsds of decoys in this subset are generally large (e.g., between ∼8 Å and ∼15 Å for 1j6s), and this may be the major reason that the statistical potentials all have unsatisfactory performance for the FARNA subset (see Fig. 4 and Supplemental Fig. S4 in the Supplemental Material for the rmsd-energy scatter-plots for RNAs of PDB IDs 1j6s and 1a4d). The rmsd-energy scatterplots for all 40 RNAs in test set II by the six statistical potentials can be found in the Supplemental Material.
FIGURE 4.

The rmsd-energy scatterplot of 1j6s in RNA test set II_FARNA. Here, the energy was calculated by the statistical potentials based on six reference states: Avg-REF, the averaging reference state (Samudrala and Moult 1998); QChA-REF, the quasi-chemical approximation reference state (Lu and Skolnick 2001); ASh-REF, the atom-shuffled reference state (Rykunov and Fiser 2007); FIG-REF, the finite-ideal-gas reference state (Zhou and Zhou 2002); SNI-REF, the spherical-noninteracting reference state (Shen and Sali 2006); RWC-REF, the random-walk-chain reference state (Zhang and Zhang 2010). The native structure is highlighted by an empty circle, and ES and PCC values are shown in the respective panels.


Performance on test set III

Test set III is composed of the FARFAR subset (Das et al. 2010) and the RNA-Puzzles subset (Cruz et al. 2012; Miao et al. 2017). The former subset was obtained by RNA modeling with the FARFAR method (Das et al. 2010), and it contains the five lowest energy clusters of structure models for each of the 32 RNA motifs containing noncanonical base pairs (Das et al. 2010). Thus, FARFAR decoys can be used to assess the ability of statistical potentials to evaluate RNAs with noncanonical base pairs. As shown in Figure 1C, the rmsds for decoy structures in the FARFAR subset are mainly in the range of 1–11 Å. The latter subset was obtained from RNA-Puzzles (Miao et al. 2017), which is a CASP-like evaluation of blind 3D RNA structure predictions (Miao et al. 2017). Thus, the RNA-Puzzles subset contains various predicted decoy structures from different RNA prediction models and can be a realistic test set for demonstrating the performance of a statistical potential in evaluating RNA 3D structures. There are dozens of predicted decoy structures for 18 different RNAs in the RNA-Puzzles subset, and the rmsds are distributed in the wide range of ∼2–34 Å (see Fig. 1C). FARFAR and RNA-Puzzles subsets can be downloaded from https://daslab.stanford.edu/site_data/pub_data/ and https://github.com/RNA-Puzzles/RNA-Puzzles-Normalized-submissions, respectively. Since there are only a dozen to several dozen predicted structures for each RNA in the FARFAR and RNA-Puzzles subsets, we did not calculate ES values for them; instead, we calculated the rmsd of the predicted structure with the lowest energy for each RNA, together with the PCC values. For the FARFAR subset, as shown in Table 1, the numbers of native structures identified by Avg-REF, QChA-REF, ASh-REF, FIG-REF, SNI-REF, and RWC-REF are 19, 19, 19, 22, 21, and 22 for 32 RNAs, respectively. FIG-REF and RWC-REF slightly outperform SNI-REF, while Avg-REF, QChA-REF, and ASh-REF perform slightly worse than the others. 
As shown in Table 3, the six statistical potentials give similar average rmsds for the predicted structures with the lowest energy. Moreover, the different statistical potentials also have very similar PCC values, with ASh-REF, SNI-REF, and RWC-REF very slightly better than the others. Overall, for the FARFAR subset, the six statistical potentials have similar and unsatisfactory performances.
TABLE 3.

The rmsds of predicted structures with minimum energy and PCCs between energy and rmsd of decoy structures calculated by the different statistical potentials for test set III_FARFAR

For the RNA-Puzzles subset, as shown in Table 1, FIG-REF still performs the best in identifying native structures, and it can identify six native structures out of the decoys of 18 RNAs. Next, QChA-REF, ASh-REF, and SNI-REF can identify five native structures for 18 RNAs, and Avg-REF and RWC-REF can only identify four native structures. Furthermore, as shown in Table 4, FIG-REF has the best performance, with the lowest average rmsd of the predicted structures with the lowest energy and the maximum average PCC value of 0.47, and RWC-REF has a slightly lower performance, with an average PCC value of 0.43. However, for the RNA-Puzzles subset, none of the six statistical potentials performs satisfactorily in identifying and ranking near-native structures. As described above, 10 RNA structures in the RNA-Puzzles subset are also in the training set; thus we benchmarked the effect of these 10 RNA structures on the performance of the six reference states for their decoys by using the leave-one-out or jackknife method (Capriotti et al. 2011). In other words, for each one of the 10 RNAs, we removed that RNA structure from the training set, rebuilt a statistical potential from the remaining 107 native RNA structures, and used it to assess the decoy structures of the removed RNA. The results obtained by this leave-one-out method are shown in Supplemental Tables S3 and S4. Compared with those from the training set of 108 RNA native structures, the rmsds of predicted structures with minimum energy and the PCC values are almost exactly the same, except for a subtle change in the average PCC value of ASh-REF. Such a subtle effect of the leave-one-out method is not surprising, since each one of these 10 RNAs accounts for <2.6% of the nucleotides in our training set.
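The leave-one-out bookkeeping described above can be sketched as follows. This only shows how the reduced training sets would be formed; rebuilding and applying a potential is outside the sketch. The identifiers are placeholders, not the actual training-set entries.

```python
def leave_one_out_sets(training_ids, overlap_ids):
    """Yield (held_out_id, reduced_training_ids) for each overlapping RNA.

    Each reduced set is the full training set minus the one RNA whose
    decoys are about to be evaluated.
    """
    training = list(training_ids)
    for held_out in overlap_ids:
        if held_out not in training:
            raise ValueError(f"{held_out} is not in the training set")
        yield held_out, [rna for rna in training if rna != held_out]


# Placeholder identifiers standing in for the 108 training structures,
# 10 of which are assumed here to overlap with the test subset.
training_ids = [f"rna_{i:03d}" for i in range(108)]
overlap_ids = training_ids[:10]

splits = list(leave_one_out_sets(training_ids, overlap_ids))
```

Each of the 10 splits leaves 107 structures for retraining, which matches the counts given in the text.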
TABLE 4.

The rmsds of predicted structures with lowest energy and PCCs between energy and rmsd of decoy structures calculated by the different statistical potentials for test set III_RNA-Puzzles


Overall performance on test sets

Identifying native structures

As shown in Table 1, FIG-REF and RWC-REF each identify the most native structures for five subsets. Next, ASh-REF identifies the most native structures for three subsets. After that, Avg-REF, QChA-REF, and SNI-REF each identify the most native structures for two subsets. Overall, Avg-REF, QChA-REF, ASh-REF, FIG-REF, SNI-REF, and RWC-REF recognize 138, 139, 140, 145, 141, and 144 native structures for 175 RNAs, respectively. Therefore, the performances of the different statistical potentials based on six reference states in identifying the native structures follow the order: FIG-REF ≳ RWC-REF > SNI-REF ≳ ASh-REF ≳ QChA-REF ≳ Avg-REF. It should be noted that the ability of the six statistical potentials to identify RNA native structures is still weak; e.g., for the RNA-Puzzles subset, even FIG-REF, which has the best performance, can only identify six native structures out of the decoys of 18 RNAs. Therefore, a good statistical potential is still highly desired for identifying native structures among the candidates predicted by computational models for RNA 3D structures.
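The totals and the per-subset "best" counts quoted above follow directly from the per-subset numbers reported in the text. A short Python check, using only those reported counts (the variable names are illustrative):

```python
POTENTIALS = ["Avg-REF", "QChA-REF", "ASh-REF", "FIG-REF", "SNI-REF", "RWC-REF"]

# Native structures identified per subset, in the order of POTENTIALS,
# as reported in the text for each subset.
IDENTIFIED = {
    "test set I":  [83, 83, 83, 83, 83, 83],
    "MD":          [4, 4, 5, 5, 4, 5],
    "NM":          [13, 13, 13, 13, 13, 13],
    "FARNA":       [15, 15, 15, 16, 15, 17],
    "FARFAR":      [19, 19, 19, 22, 21, 22],
    "RNA-Puzzles": [4, 5, 5, 6, 5, 4],
}
N_RNAS = {"test set I": 85, "MD": 5, "NM": 15,
          "FARNA": 20, "FARFAR": 32, "RNA-Puzzles": 18}

# Total identified natives per potential across all six subsets.
totals = dict(zip(POTENTIALS, (sum(col) for col in zip(*IDENTIFIED.values()))))

# Number of subsets in which each potential ties for or holds the top count.
best_counts = {
    name: sum(row[i] == max(row) for row in IDENTIFIED.values())
    for i, name in enumerate(POTENTIALS)
}

# Ranking by total identified natives, highest first.
ranking = sorted(POTENTIALS, key=lambda name: totals[name], reverse=True)
total_rnas = sum(N_RNAS.values())
```

Sorting by the totals reproduces the stated order, with FIG-REF first and Avg-REF last.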

Identifying near-native structures

Equation 1 indicates that the ES value reflects the ability of a statistical potential to identify the top 10% of near-native structures among all decoy structures. Because there are only a dozen to several dozen decoys for each native RNA in the FARFAR and RNA-Puzzles subsets, we cannot calculate the ES value for these two subsets, and we use the rmsd of the predicted structure with the lowest energy instead. As shown in Table 2, QChA-REF, ASh-REF, SNI-REF, and RWC-REF each have the highest mean ES value, by a very slight margin, for one of the three subsets in test set II (the FARNA subset for QChA-REF, the MD subset for ASh-REF and SNI-REF, and the NM subset for RWC-REF). Nevertheless, the overall difference between the various potentials in ES value is rather slight. For example, the maximum difference in mean ES value between different statistical potentials is 0.11 (between 7.92 for FIG-REF and 8.03 for ASh-REF and SNI-REF), 0.41 (between 5.69 for FIG-REF and 6.10 for RWC-REF), and 0.73 (between 1.83 for FIG-REF and 2.56 for QChA-REF) for the MD, NM, and FARNA subsets, respectively. In addition, the statistical potentials all have relatively low ES values for the NM and FARNA subsets compared with the MD subset, especially for the FARNA subset, where the mean ES value is below 2.6. For the FARFAR subset, as shown in Table 3, there is still no distinctive difference between different statistical potentials in the average rmsd of the predicted structure with the lowest energy. Besides, in identifying the nearest-native structure (the decoy, excluding the native structure, that has both the minimum rmsd and the minimum energy), RWC-REF identifies eight nearest-native structures from the decoys of 32 RNAs, while the other statistical potentials each identify nine. 
For the RNA-Puzzles subset, as shown in Table 4, all six statistical potentials have very unsatisfactory performance, and FIG-REF has the relatively better performance compared to the others. FIG-REF can identify three nearest-native structures from the decoys of 18 RNAs. This also suggests that a statistical potential of high performance is still highly desired for identifying near-native structures for predicted RNA 3D structures.
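The two small-decoy-set measures used in this subsection, the rmsd of the lowest-energy predicted structure and the nearest-native test, can be written down directly. This is a minimal sketch with hypothetical function names; it assumes the native structure has already been excluded from the lists and that ties are broken by list order.

```python
def lowest_energy_index(energies):
    """Index of the decoy with the minimum energy (first one on ties)."""
    return min(range(len(energies)), key=lambda i: energies[i])


def rmsd_of_lowest_energy(energies, rmsds):
    """rmsd of the decoy that the potential ranks first by energy."""
    return rmsds[lowest_energy_index(energies)]


def identifies_nearest_native(energies, rmsds):
    """True if the lowest-energy decoy is also the lowest-rmsd decoy.

    Assumption: the native structure is not included in the input lists.
    """
    nearest = min(range(len(rmsds)), key=lambda i: rmsds[i])
    return lowest_energy_index(energies) == nearest


# Toy, made-up values for one hypothetical RNA with three decoys.
toy_rmsds = [4.5, 2.1, 7.8]
hit = identifies_nearest_native([-3.2, -5.1, -4.0], toy_rmsds)
miss = identifies_nearest_native([-6.0, -5.1, -4.0], toy_rmsds)
```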

Ranking decoy structures in test sets

An important use of a statistical potential is to guide RNA folding or structure prediction, and thus there should be a positive and strong correlation between rmsds and the corresponding energies evaluated by a statistical potential of high performance. As shown in Tables 2–4, for PCC values, FIG-REF performs the best for three subsets (MD, NM, and RNA-Puzzles subsets). After that, RWC-REF has the best performance for two subsets (NM and FARFAR subsets), and QChA-REF, ASh-REF, and SNI-REF each have the best performance for one subset (FARNA subset for QChA-REF, FARFAR subset for ASh-REF and SNI-REF). Furthermore, for MD and RNA-Puzzles subsets, PCC values from RWC-REF are only slightly smaller than those from FIG-REF, and the other four statistical potentials have visibly lower performance than FIG-REF and RWC-REF. It is noted that FIG-REF has the best performance for test set III_RNA-Puzzles, which is from blind RNA 3D structure predictions using extensive computational models (Miao et al. 2017). Therefore, the performances of the statistical potentials in ranking decoy structures follow the order: FIG-REF ≳ RWC-REF > QChA-REF ∼ ASh-REF ∼ SNI-REF ∼ Avg-REF. It is also noted that the six statistical potentials globally have unsatisfactory performances for FARNA, FARFAR, and RNA-Puzzles subsets, with PCC values <0.5. From the above performances on identifying native structures and ranking decoy structures, we can roughly conclude that FIG-REF and RWC-REF are slightly superior to other statistical potentials, although for three subsets (FARNA, FARFAR, and RNA-Puzzles), the performances of FIG-REF and RWC-REF do not reach a satisfactory level.

Ability of capturing base-base interactions

Base-pairing and base-stacking interactions are critical to the stability of RNA 3D structure (Wang et al. 2016, 2018). The ability to capture the base-pairing and base-stacking interactions is therefore also an important criterion for assessing the quality of a statistical potential. Figure 5 shows the potentials between the N2 atom of guanine and the O2 atom of cytosine derived from the six reference states. There are two apparent wells in all six potentials: the first well, at a distance of ∼3.0 Å, corresponds to the base-pairing interaction, and the second, at ∼8.0 Å, corresponds to the indirect base-stacking interaction between next-nearest residues. However, only FIG-REF, SNI-REF, and RWC-REF capture the significant base-stacking interaction between nearest bases at ∼3.6 Å, and the wells of FIG-REF and SNI-REF at 3.6 Å are slightly more distinctive. Similar phenomena were found in the potentials between the N1 atom of adenine and the N3 atom of uracil derived from the six reference states, which are shown in Supplemental Figure S5. Therefore, FIG-REF, RWC-REF, and SNI-REF (especially FIG-REF and RWC-REF) can capture all of the base-pairing, nearest-neighbor base-stacking, and next-nearest-neighbor base-stacking interactions, whereas Avg-REF, QChA-REF, and ASh-REF cannot. This may be the reason why FIG-REF, RWC-REF, and SNI-REF are the top three statistical potentials in identifying native structures.
FIGURE 5.

(A) The statistical potentials between N2 atom of guanine and O2 atom of cytosine based on different reference states: Avg-REF, the averaging reference state (Samudrala and Moult 1998); QChA-REF, the quasi-chemical approximation reference state (Lu and Skolnick 2001); ASh-REF, the atom-shuffled reference state (Rykunov and Fiser 2007); FIG-REF, the finite-ideal-gas reference state (Zhou and Zhou 2002); SNI-REF, the spherical-noninteracting reference state (Shen and Sali 2006); RWC-REF, the random-walk-chain reference state (Zhang and Zhang 2010). For the situation in which atom pairs were not observed within a certain bin width, the statistical potentials in these distance bins were set as the most unfavorable value over the whole range of the corresponding statistical potential. (B) a, b, and c illustrate the three representative distances for the base-pairing, the nearest base-stacking, and the next-nearest base-stacking interactions between N2 atom of guanine and O2 atom of cytosine, respectively.

Comparison with existing statistical potentials

Here, we make a comparison with three well-known existing statistical potentials that perform relatively well for RNA 3D structure evaluation: the all-atom version of the KB potential, based on the quasi-chemical approximation reference state (Bernauer et al. 2011); the all-atom version of RASP, based on the averaging reference state (Capriotti et al. 2011); and 3dRNAscore, which involves a torsion angle potential and is based on the averaging reference state (Wang et al. 2015). The data of 3dRNAscore, RASP, and KB potentials for the comparisons on test set I and test set II are directly taken from Bernauer et al. (2011) and Wang et al. (2015). Since the existing statistical potentials have not been examined completely against test set III and there is no complete data available for comparisons on test set III, we only compare their performances with QChA-REF, FIG-REF, SNI-REF, and RWC-REF for test set I and test set II, including four subsets. As shown in Figure 6A, the numbers of identified native structures for test set I by various statistical potentials follow the order: 3dRNAscore ≳ QChA-REF = FIG-REF = SNI-REF = RWC-REF > KB > RASP. Overall, these statistical potentials all have excellent performance except for KB and RASP, which can identify 80 and 79 native structures, respectively, for the 85 RNAs in test set I. For test set II, the numbers of identified native structures by the statistical potentials follow the order: RWC-REF ≳ 3dRNAscore = KB = RASP = FIG-REF > QChA-REF = SNI-REF. Furthermore, we examined the statistical potentials through calculating ES values. As shown in Figure 6B, the mean ES value of 3dRNAscore is very similar to that of QChA-REF, FIG-REF, SNI-REF, and RWC-REF for the MD subset, and RASP and KB potential have apparently lower ES values. For the NM subset, 3dRNAscore has a very slightly higher ES value than QChA-REF, SNI-REF, and RWC-REF, while KB and RASP both have visibly lower ES values. 
For the FARNA subset, QChA-REF and SNI-REF have very slightly higher ES values than 3dRNAscore, and the ES values of KB, RASP, and FIG-REF are slightly lower than others. The detailed data in Figure 6 are also shown in Supplemental Tables S5 and S6.
FIGURE 6.

(A) The numbers of identified native structures for test set I and test set II and (B) the average ES values for test set I and test set II calculated by three existing statistical potentials: 3dRNAscore (Wang et al. 2015), KB potential (Bernauer et al. 2011), and RASP (Capriotti et al. 2011), and four current statistical potentials: QChA-REF, the quasi-chemical approximation reference state (Lu and Skolnick 2001); FIG-REF, the finite-ideal-gas reference state (Zhou and Zhou 2002); SNI-REF, the spherical-noninteracting reference state (Shen and Sali 2006); RWC-REF, the random-walk-chain reference state (Zhang and Zhang 2010). The data of 3dRNAscore, KB potential, and RASP are taken from Bernauer et al. (2011) and Wang et al. (2015). Test set I consists of decoy structures of 85 RNAs (Capriotti et al. 2011) and test set II is composed of decoy structures of 40 RNAs (Das et al. 2010; Bernauer et al. 2011).

From the above, 3dRNAscore is similar to the present statistical potentials FIG-REF and RWC-REF in identifying native structures for test sets I and II, and the other existing statistical potentials, KB and RASP, perform slightly worse than the others. For identifying near-native structures, 3dRNAscore is similar to QChA-REF and SNI-REF, which have a very slightly better performance than the other potentials. It is noted that RASP and KB potential have visibly lower ES values. Therefore, 3dRNAscore has a similar performance to the top statistical potentials derived from the six reference states described above, while RASP and KB have visibly lower performance than the others. In the following, we try to understand the difference in performance between the existing statistical potentials (KB, RASP, and 3dRNAscore) and those built in the current work. First, as mentioned above, RASP, 3dRNAscore, and Avg-REF are all based on the averaging reference state. 
However, RASP of the all-atom version is only for 23 clustered atom types, while 3dRNAscore and Avg-REF both are for 85 heavy atom types. The significantly lower resolution of atom types might be the main reason that the performance of RASP is visibly lower than others. In contrast, the (relatively good) performance of 3dRNAscore (Wang et al. 2015) might be attributed to the explicit emphasis on the local torsional structure feature of RNA backbone, which is important for RNA 3D structures. Second, KB and QChA-REF are both based on the quasi-chemical approximation reference state and for 85 heavy atom types, while their training sets consist of 77 and 108 RNA native structures, respectively. Consequently, the lower performance of KB than QChA-REF possibly arises from the different training sets, and the effect of the training set will be discussed in the following subsection.

Effect of training set

The training set involved in this work is a nonredundant set of 108 RNAs, excluding RNA structures in RNA–protein and RNA–DNA complexes. We examined the effect of the training set on the performance of a statistical potential by also using the 3dRNAscore training set, which is composed of 317 RNAs and can be downloaded from http://biophy.hust.edu.cn/3dRNAscore.html (Wang et al. 2015). As shown in Figure 7A, the statistical potentials trained on the 3dRNAscore and the present training sets can identify similar numbers of native structures for test set I, while for test set II, the statistical potentials from the present training set can identify slightly more native structures. Furthermore, we examined the effect of training sets by calculating ES values for test set II. As shown in Figure 7B, the statistical potentials based on the current training set generally have slightly higher ES values than those based on the 3dRNAscore training set, except for SNI-REF for the FARNA subset. For the MD subset, the mean ES value can decrease by ∼0.45 if the current training set is replaced by the 3dRNAscore training set. Such a decrease in mean ES value becomes ∼0.3 for the NM subset and ∼0.08 for the FARNA subset, respectively. It is not surprising that the statistical potentials from the current training set with 108 RNAs are slightly better than those from the 3dRNAscore training set of 317 RNAs. This may be because our training set excludes those RNAs complexed with protein or DNA, although the number of RNAs in our training set is much smaller than that in the 3dRNAscore training set. Note that after removing those RNA structures complexed with DNA or protein in the 3dRNAscore training set, 42 RNA structures remained. However, it needs to be noted that the current nonredundant training set of 108 RNAs may still be inadequate for training a satisfactory statistical potential. Additionally, with the increasing number of RNA structures in the PDB database (Rose et al. 2017), the statistical potentials can be further improved in the future. Finally, a high-quality training set is required for generating a high-performance statistical potential, and it is also necessary to examine the minimal number of RNA structures for a satisfactory training set when there are plenty of available RNA 3D structures in the PDB database. The detailed data in Figure 7 are also presented in Supplemental Tables S7 and S8.
FIGURE 7.

(A) The numbers of identified native structures for test set I and test set II and (B) the average ES values for test set I and test set II calculated by the statistical potentials from the 3dRNAscore training set with 317 RNAs (Wang et al. 2015) and the present training set with 108 RNAs, respectively: QChA-REF, the quasi-chemical approximation reference state (Lu and Skolnick 2001); FIG-REF, the finite-ideal-gas reference state (Zhou and Zhou 2002); SNI-REF, the spherical-noninteracting reference state (Shen and Sali 2006); RWC-REF, the random-walk-chain reference state (Zhang and Zhang 2010). Here, for simplicity, we use 317 and 108 to denote 3dRNAscore and the current training sets, respectively. In addition, for FIG_317, the parameter α was taken as 1.0 to build a relatively uniform atom distribution for 317 RNA spheres (Zhou and Zhou 2002), and for RWC_317, the Kuhn length l was also set to 310. Test set I consists of decoy structures of 85 RNAs (Capriotti et al. 2011), and test set II is composed of decoy structures of 40 RNAs (Das et al. 2010; Bernauer et al. 2011).

DISCUSSION

Significance of reference states

The ideal reference state for statistical potentials refers to the state in which there are no interactions between atoms, and such a state should contain all possible RNA chain conformations in phase space, including extended and compact ones (Rykunov and Fiser 2007). The purpose of a statistical potential involving a reference state is for extracting the structural features to distinguish the native structure from a set of nonnative conformations (Samudrala and Moult 1998), or for guiding RNA folding/structure predictions. The six reference states examined in this work can be classified into two types: based on measured RNA 3D structures in the PDB database and based on physical modeling. The former includes Avg-REF (averaging reference state) (Samudrala and Moult 1998), QChA-REF (quasi-chemical approximation reference state) (Lu and Skolnick 2001), and ASh-REF (atom-shuffled reference state) (Rykunov and Fiser 2007), which use the structures in the PDB database to simulate reference states through summation on atom types, involving molar fraction of atom types and shuffling atom/residue types, respectively. In contrast, the latter includes FIG-REF (finite-ideal-gas reference state) (Zhou and Zhou 2002), SNI-REF (spherical-noninteracting reference state) (Shen and Sali 2006), and RWC-REF (random-walk-chain reference state) (Zhang and Zhang 2010), which discard the 3D structures in the PDB database and use an ideal gas model, spherical cluster model, and random-walk-chain to simulate reference states. The performances of the six statistical potentials should be tightly related to the corresponding reference states. The principles of the 3D-structure-based reference states are similar to each other by using various methods of mixing atom types for 3D structures deposited in the PDB database, and consequently the corresponding performances are also similar. 
In the physical-model-based reference states, various physical models such as random-walk-chain and finite-ideal-gas are approximated as reference states, which can better capture the conformation space than the 3D-structure-based reference states despite the simplification of local structure details. Furthermore, the physical-model-based reference states involve undetermined parameters in simulating reference states, e.g., dimension parameter α in FIG-REF and Kuhn length l in RWC-REF, which may also contribute to the higher performance of FIG-REF and RWC-REF than other statistical potentials. However, the above described 3D-structure-based reference states are based only on the ensemble of native structures, which naturally excludes a huge number of nonnative conformations in phase space. Thus, the 3D-structure-based reference states would differ significantly from the ideal reference states in conformation space, and consequently the corresponding statistical potentials do not work very well. It is also noted that the performance of FIG-REF and RWC-REF does not reach a satisfactory level (e.g., for the latter three test subsets), which may be attributed to the oversimplified structure models, e.g., ideal gas or random-walk-chain.

Effect of the origin of test sets

As shown in the Results section, a statistical potential can have different performances for different test sets, and as shown in Supplemental Table S9, even for the decoys from different test sets with the same native structures, the same statistical potential can have distinctly different performances. For example, for some RNAs (PDB IDs: 1kka, 1qwa, 1xjr, 28sp, and 2f88), the statistical potentials have overall much better performance for the NM subset than for the FARNA subset. This may come from the different origins for generating test sets (e.g., perturbation method for the NM subset [Bernauer et al. 2011] and fragment assembly for the FARNA subset [Das and Baker 2007]). For test set I, MD and NM subsets in test set II, the six statistical potentials overall have relatively good performance in identifying native structures and ranking near-native structures. These three subsets were produced by MODELLER with Gaussian constraints (Capriotti et al. 2011), replica-exchange MD (Bernauer et al. 2011), and the normal mode perturbation method (Bernauer et al. 2011), respectively, and thus the produced decoy structures are generally very close to the native structures. For example, as shown in Figure 1A,B, the rmsds of decoy structures in test set I, test set II_MD, and test set II_NM are mainly in the range of 0–6, 0–3, and 1–5 Å, respectively. In contrast, the rmsds of decoy structures in test set II_FARNA produced by the fragment assembly with FARNA (Das and Baker 2007) are quite dispersed in the range of 4–15 Å. Thus, all six statistical potentials overall have unsatisfactory performance for test set II_FARNA. Test set III_FARFAR from the fragment assembly with FARFAR (Das et al. 2010) is composed of small RNA segments of 6–23 nt, and the rmsds of the decoy structures are rather dispersed relative to their length (Fig. 1C). 
Test set III_RNA-Puzzles is a special test set with a small number of decoys of large RNAs that were from the blind CASP-like RNA 3D structure predictions of various computational models (Miao et al. 2017), and the rmsds of the decoy structures are very dispersed, e.g., the rmsd distribution of this subset extends to ∼34 Å. Thus, test set III is very challenging and the six statistical potentials overall have unsatisfactory performances on test set III_FARFAR and test set III_RNA-Puzzles. Since the purpose of a statistical potential is for realistic application, the test set from realistic predictions such as the RNA-Puzzles subset can be a more realistic examination for a statistical potential, rather than other test sets by near-native perturbation methods.

Limitation of current statistical potentials and perspective

First, the number of RNA molecules in a nonredundant RNA training set is still limited, which leads to insufficient data for training a statistical potential. This is an inevitable problem for building all kinds of statistical potentials, and the problem can be gradually overcome in the future with the increase of the number of RNA structures deposited in the PDB database. Second, as mentioned above, the six reference states are either based on RNA native structures deposited in PDB or based on ideal physical models. Such oversimplified approximations may cause the produced reference states to significantly deviate from the ideal reference state, which may be the main reason for the unsatisfactory performance on the decoys with dispersed rmsd distributions. A more advanced approximation for a reference state is still highly required. For example, for circumventing the reference states, Huang and Zou developed an iterative method to build a scoring function for protein–ligand docking, and such an iterative method may be useful for building a statistical potential for RNA structure evaluation (Huang and Zou 2006a,b, 2008, 2011). Third, knowledge-based statistical potentials have been widely used and proven to be effective in protein structure evaluation. However, RNA structure is distinctively different from protein, and the involvement of RNA structure characteristics in a statistical potential may improve its performance. For example, RNAs are strongly charged polyanionic chains, whose structure can be affected by intrachain Coulombic repulsion. Thus, RNA 3D structures can be highly sensitive to ion conditions (Tan and Chen 2010, 2011; Wu et al. 2015; Xi et al. 2018), and the involvement of ion-electrostatic interaction might be helpful for the performance of a statistical potential. Finally, the statistical potentials are generally pairwise for different atom pair types, while in principle the effect of other atoms is already involved. 
Consequently, the summation on such pairwise potentials for calculating total energy would bring double-counting. The development of many-body statistical potentials (e.g., through developing many-body contact potentials to supplement the pairwise statistical potentials; Li and Liang 2010) or removal of the effect of other atoms in building a pairwise statistical potential might bring the improvement of the statistical potential on RNA 3D structure evaluation.

Conclusions

In this work, we used six representative reference states widely used for proteins to construct statistical potentials for RNA 3D structure evaluation, and we made extensive comparisons between them against three test sets including six subsets. We found that, overall, in identifying native structures and ranking decoy structures, the performances of FIG-REF and RWC-REF are slightly better than those of the others, while in identifying near-native structures, only a very slight difference exists between the six reference states. In addition, compared with three existing RNA statistical potentials, the top statistical potentials derived from the six reference states have a similar performance to 3dRNAscore (Wang et al. 2015), while RASP (Capriotti et al. 2011) and KB (Bernauer et al. 2011) have a visibly lower performance than the others. Furthermore, the performance of a statistical potential could be apparently dependent on the training set. Finally, we found that the performance of a statistical potential is closely related to the origin of the test sets. However, the overall performance of the six statistical potentials is still not at a high level for realistic test sets from structure prediction models, and thus an applicable statistical potential with high performance still remains to be developed, through proposing more realistic reference states, circumventing the problem of reference states, or combining with a physical potential. Besides, previous studies for proteins show that the combination with structural clustering may improve the performance of statistical potentials (Zhang and Skolnick 2004; Zhang 2009; Xu et al. 2011; Deng et al. 2012). Furthermore, involving the unique characteristics of RNA, such as local structure features (Wang et al. 2015) or ion electrostatic interactions, in a statistical potential can possibly improve its performance. Moreover, a multibody statistical potential (Singh et al. 1996; Feng et al. 2007; Masso 2018) can possibly capture more structural features than conventional pairwise ones, while generally involving a higher computational cost. Finally, machine-learning methods can be applied in building the statistical potential to extract critical information not easily detected for RNA structures (Li et al. 2018). Nevertheless, this work presents a comprehensive and critical survey on the performances of the existing reference states and statistical potentials for RNA 3D structure evaluation. Therefore, the present study can be considered as a benchmark work and can serve as a basis for further development of advanced knowledge-based statistical potentials of high performance for RNA 3D structure evaluation.

MATERIALS AND METHODS

Knowledge-based statistical potential

A knowledge-based statistical potential is generally derived based on Boltzmann or Bayesian formulations, and any kind of structure feature that is able to distinguish the native conformation from a set of structural decoys can be used to derive a statistical potential (Samudrala and Moult 1998). Here, we still focus on a conventional all-heavy-atom distance-dependent statistical potential, which can be given by (Deng et al. 2012)

$$\Delta E_{ij}(r) = -k_{B}T \ln \frac{P^{\mathrm{obs}}_{ij}(r)}{P^{\mathrm{ref}}_{ij}(r)},$$

where $k_{B}$ is the Boltzmann constant, $T$ is the Kelvin temperature, $P^{\mathrm{obs}}_{ij}(r)$ is the observed probability for the pair of atom types $i$ and $j$ residing within the distance bin of $[r, r + dr]$, and $P^{\mathrm{ref}}_{ij}(r)$ is the probability for the pair of atom types $i$ and $j$ within the distance bin of $[r, r + dr]$ from a conformation ensemble of the so-called reference state (Deng et al. 2012). In principle, an ideal reference state can be obtained from a nonredundant and complete reference decoy conformation ensemble where interactions between atoms are assumed to vanish. Unfortunately, an ideal reference decoy database might not exist (Deng et al. 2012). Hence, people generally use various approximations based on experimental structure databases or statistical physics models to build the reference states (Samudrala and Moult 1998; Lu and Skolnick 2001; Zhou and Zhou 2002; Shen and Sali 2006; Rykunov and Fiser 2007; Zhang and Zhang 2010; Deng et al. 2012). For building statistical potentials for proteins, six reference states have been widely used, which are introduced briefly as follows: averaging (Samudrala and Moult 1998), quasi-chemical approximation (Lu and Skolnick 2001), atom-shuffled (Rykunov and Fiser 2007), finite-ideal-gas (Zhou and Zhou 2002), spherical-noninteracting (Shen and Sali 2006), and random-walk-chain (Zhang and Zhang 2010) reference states.

Reference states

Averaging reference state

The averaging reference state was developed by Samudrala and Moult (1998). They used the average distribution over all kinds of atom pair types in experimental structures to approximately simulate the distribution of different atom pair types in the reference state. Thus, the reference-state probability $P^{\mathrm{ref}}_{ij}(r)$ can be approximated as (Samudrala and Moult 1998)

$$P^{\mathrm{ref}}_{ij}(r) \approx \frac{N_{\mathrm{obs}}(r)}{N_{\mathrm{obs}}},$$

where $N_{\mathrm{obs}}(r)$ is the observed number of atom pairs within the distance bin of $[r, r + dr]$ regardless of atom types, and $N_{\mathrm{obs}}$ is the total number of atom pairs over all distance bins. Given an RNA structure database, the averaging reference state can be used straightforwardly to derive a statistical potential.
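As an illustration of the averaging approximation, the sketch below converts a table of binned pair counts into energies in units of kBT. It assumes that the observed probability of a pair type is its count in a bin divided by its total count over all bins (the conditional-probability form of Samudrala and Moult 1998); that normalization, the function name `averaging_potential`, the atom-type labels, and the toy counts are assumptions made for illustration and are not taken from this work.

```python
import math

def averaging_potential(pair_counts, kT=1.0):
    """Distance-dependent potential with an averaging reference state.

    pair_counts maps a pair-type label to a list of counts per distance bin.
    Returns the same mapping with energies per bin (in units of kT);
    bins with no observed pairs are returned as None.
    """
    n_bins = len(next(iter(pair_counts.values())))
    # N_obs(r): counts per bin summed over all pair types.
    n_r = [0] * n_bins
    for counts in pair_counts.values():
        for b, c in enumerate(counts):
            n_r[b] += c
    n_total = sum(n_r)  # N_obs: all pairs over all bins
    potentials = {}
    for pair, counts in pair_counts.items():
        n_pair = sum(counts)
        energies = []
        for b, c in enumerate(counts):
            if c == 0:
                energies.append(None)  # unobserved bin, handled separately
                continue
            p_obs = c / n_pair          # assumed normalization of P_obs
            p_ref = n_r[b] / n_total    # averaging reference state
            energies.append(-kT * math.log(p_obs / p_ref))
        potentials[pair] = energies
    return potentials

# Toy counts for two hypothetical pair types and two distance bins.
toy_counts = {("G:N2", "C:O2"): [8, 2], ("A:N1", "U:N3"): [2, 8]}
toy_potential = averaging_potential(toy_counts)
```

In this toy case, the G:N2/C:O2 pair is enriched in the first bin relative to the average over all pair types, so its energy there is negative (favorable), and it is depleted in the second bin, where the energy is positive.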

Quasi-chemical approximation reference state

Considering that the counts of a certain atom pair type in the ideal reference state should be proportional to the mole fractions of the corresponding atom types in the experimental structures, Lu and Skolnick (2001) proposed the quasi-chemical approximation reference state, in which the reference-state probability is calculated from the following quantities (Lu and Skolnick 2001): $x_i$, the mole fraction of atom type $i$, which can be obtained from the whole experimental structure database; $N^{\mathrm{obs}}_{ij}$, the number of pairs of atom types $i$ and $j$ over all the distance bins; and $N_{\mathrm{obs}}$, the total number of atom pairs over all distance bins.

Atom-shuffled reference state

Shuffled reference states were proposed by Rykunov and Fiser (2007) to simulate the reference decoys. There are three shuffling modes: residue-shuffled, sequence-shuffled, and atom-shuffled (Rykunov and Fiser 2007). Here, we used the atom-shuffled reference state, in which all the atoms are fixed in coordinates while their identities are randomly exchanged, and we randomly shuffled every experimental structure more than 3 million times. The reference-state probability is then obtained following Rykunov and Fiser (2007), with the required pair counts acquired from all the structures after being shuffled. This reference state provides an extremely stochastic reference conformation space and eliminates the effect of chemical-bond connectivity.
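The atom-shuffling procedure can be sketched as follows: the coordinates stay fixed, only the atom-type labels are permuted, and the binned pair counts are accumulated over all shuffles. This is a minimal illustration only; the function name `shuffled_pair_counts`, the default bin width and cutoff, the toy four-atom structure, and the small number of shuffles are assumptions for the example, and the way the counts are subsequently normalized into a probability is not shown.

```python
import math
import random

def shuffled_pair_counts(coords, types, n_shuffles,
                         bin_width=0.3, cutoff=20.0, seed=0):
    """Accumulate (pair type, bin) counts over random permutations of atom labels."""
    rng = random.Random(seed)
    n = len(coords)
    # Coordinates never move, so distances are binned only once.
    binned_pairs = []
    for a in range(n):
        for b in range(a + 1, n):
            d = math.dist(coords[a], coords[b])
            if d < cutoff:
                binned_pairs.append((a, b, int(d / bin_width)))
    labels = list(types)
    counts = {}
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # exchange atom identities, keep positions
        for a, b, k in binned_pairs:
            pair = tuple(sorted((labels[a], labels[b])))
            counts[(pair, k)] = counts.get((pair, k), 0) + 1
    return counts

# Toy structure: three atoms close together and one beyond the cutoff.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (30.0, 0.0, 0.0)]
toy_types = ["G:N2", "C:O2", "A:N1", "U:N3"]
shuffle_counts = shuffled_pair_counts(toy_coords, toy_types, n_shuffles=10)
```

Because the geometry is untouched, the total number of counted pairs per shuffle is constant; only the assignment of pair types to distance bins is randomized.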

Finite-ideal-gas reference state

Zhou and Zhou (2002) proposed the finite-ideal-gas reference state by applying the pair distribution function to the protein macromolecule system. For a pair of atom types $i$ and $j$, the pair distribution function is written as (Friedman 1985)

$$g(r) = \frac{V}{N_i N_j}\,\frac{N^{\mathrm{obs}}_{ij}(r)}{4\pi r^{2}\,dr},$$

where $N^{\mathrm{obs}}_{ij}(r)$ is the observed number of pairs of atom types $i$ and $j$ within the spherical shell of the radius bin $[r, r + dr]$, $N_i$ and $N_j$ are the total numbers of atom types $i$ and $j$ over all the distance bins, respectively, and $V$ is the volume of a protein system. The atomic pairwise potential $u(r)$ is equal to $-k_{B}T \ln g(r)$ (Friedman 1985), and $u(r)$ can be expressed as follows:

$$u(r) = -k_{B}T \ln \left[ \frac{V}{N_i N_j}\,\frac{N^{\mathrm{obs}}_{ij}(r)}{4\pi r^{2}\,dr} \right].$$

In general, when the distance between two considered atoms is longer than the cutoff distance $r_{\mathrm{cut}}$, the interaction between them decreases to zero; that is, $u(r) \approx 0$ for $r \ge r_{\mathrm{cut}}$. The reference-state probability is then obtained from this condition (Zhou and Zhou 2002), and it involves a dimension parameter α, since the systems of macromolecules are not ideal gases even at high temperature. In our RNA system, α was taken as 1.39 to build a relatively uniform atom distribution for all RNA spheres (Zhou and Zhou 2002), and the relative fluctuation of the atom distribution function is shown in Supplemental Figure S1.
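To make the role of α concrete, the sketch below uses the expected-count form published for the protein potential of Zhou and Zhou (2002), in which the expected number of pairs at distance r is scaled from the observed number at the cutoff as (r/r_cut)^α (Δr/Δr_cut). Whether the present RNA potentials use exactly this expression, and whether r denotes the bin center, are assumptions here; the function names and the numbers in the example are hypothetical.

```python
import math

def fig_expected_count(r, dr, r_cut, dr_cut, n_obs_cut, alpha=1.39):
    """Expected pair count at distance r, scaled from the count at the cutoff."""
    return (r / r_cut) ** alpha * (dr / dr_cut) * n_obs_cut

def fig_energy(n_obs_r, r, dr, r_cut, dr_cut, n_obs_cut, alpha=1.39, kT=1.0):
    """Energy (in units of kT) from observed versus expected pair counts."""
    expected = fig_expected_count(r, dr, r_cut, dr_cut, n_obs_cut, alpha)
    return -kT * math.log(n_obs_r / expected)

# Hypothetical counts for one pair type: 12 pairs observed near 3.0 Å,
# 50 pairs observed in the bin at the 20 Å cutoff, equal bin widths.
energy_at_3 = fig_energy(12, r=3.0, dr=0.3, r_cut=20.0, dr_cut=0.3, n_obs_cut=50)
energy_at_cut = fig_energy(50, r=20.0, dr=0.3, r_cut=20.0, dr_cut=0.3, n_obs_cut=50)
```

By construction, the energy vanishes at the cutoff, which is the condition u(r) ≈ 0 for r ≥ r_cut stated above; an exponent α = 2 would correspond to an infinite ideal gas, and a smaller value reflects the finite size of the molecule.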

Spherical-noninteracting reference state

Shen and Sali (2006) developed the spherical-noninteracting reference state for proteins, in which a native structure is represented by a sample sphere with the same radius of gyration as the native structure. Thus, this reference state takes the native structures of different sizes into account. The reference-state probability is expressed (Shen and Sali 2006) in terms of $a$, the radius of an effective sphere that has the same radius of gyration ($R_g$) as the experimental protein structure, and $r_{\mathrm{cut}}$, which again denotes the cutoff distance. Based on this reference state, one needs to collect the statistics from the experimental structures one by one, and $p$ denotes a sampled protein structure. Thus, the final statistical potential is calculated with each sampled structure weighted (Shen and Sali 2006), where the weight $w_p$ of the sampled experimental structure $p$ is given by the ratio between the number of all atom pairs in this sampled structure and the number of atom pairs in all samples, regardless of the pair types.
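Two geometric facts underlie the effective sphere. A uniform solid sphere of radius a has Rg² = 3a²/5, so a = sqrt(5/3) Rg, and the distance r between two points placed uniformly and independently inside that sphere has the density 3r²(r − 2a)²(r + 4a)/(16a⁶) for 0 ≤ r ≤ 2a. The sketch below implements both; these are standard results quoted here for illustration rather than expressions copied from this work, and the unweighted radius of gyration (all atoms given equal mass) and the function names are assumptions.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a set of 3D points (equal masses assumed)."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)

def effective_sphere_radius(rg):
    """Radius a of the uniform solid sphere whose radius of gyration equals rg."""
    return math.sqrt(5.0 / 3.0) * rg

def sphere_pair_density(r, a):
    """Density of the distance r between two uniform random points in a sphere of radius a."""
    if r < 0.0 or r > 2.0 * a:
        return 0.0
    return 3.0 * r ** 2 * (r - 2.0 * a) ** 2 * (r + 4.0 * a) / (16.0 * a ** 6)

# Toy structure: the eight corners of a cube with side 2 Å.
cube = [(x, y, z) for x in (0.0, 2.0) for y in (0.0, 2.0) for z in (0.0, 2.0)]
rg_cube = radius_of_gyration(cube)
a_cube = effective_sphere_radius(rg_cube)
```

The density integrates to one over [0, 2a], which provides a simple numerical check on the expression and shows why larger structures, with larger a, spread their noninteracting pair distances over a wider range.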

Random-walk-chain reference state

The random-walk-chain reference state was proposed by Zhang and Zhang (2010) to simulate the inherent connectivity of protein chains. According to the polymer theory of freely jointed chain models, the reference-state distribution for a sampled structure is written (Zhang and Zhang 2010) in terms of $N$, the number of residues in an experimental structure, and $l$, the Kuhn length, which is considered as an adjustable parameter to match the scale of the freely jointed chain to that of a realistic protein chain. In addition, $p$ again denotes a sampled protein structure, and the overall quantity is equal to the sum of the per-structure quantities over all the experimental protein structures. Similarly, $u(r) \approx 0$ at $r = r_{\mathrm{cut}}$, from which the reference-state probability is obtained (Zhang and Zhang 2010). In this work for RNAs, the value of $l^{2}$ is set to 310, in which case the potential based on the random-walk-chain reference state has the best performance for RNAs.

Training set and parameters

In this work, we built our nonredundant training set from the RNA 3D Hub nonredundant set (Release 2.121, 2017-03-31), which ensures that the sequence identity between any two chains in the set is <95% (Leontis and Zirbel 2012). First, we collected the 1245 representative RNA chains of all the different clusters with X-ray resolution <3.5 Å from the RNA 3D Hub list (Release 2.121, 2017-03-31), which can be downloaded from http://rna.bgsu.edu/rna3dhub/nrlist. In this list, the redundant RNA chains are divided into clusters, and each cluster has a representative RNA chain. Next, we discarded each representative chain whose structure is complexed with protein or DNA and replaced it with another member of the same cluster whose structure contains no protein or DNA, to avoid the possibly significant influence of complexed proteins or DNAs on RNA structures. Afterward, we removed the RNA structures with sequence identity >80% and coverage >80% using the BLASTN program (Altschul et al. 1990). However, since sequence identity is not equal to structure identity, we still kept those RNAs that have different 2D structures at sequence matching regions, even when their sequence identity reaches the criterion. The 2D structures of different RNAs can be viewed at http://rnafrabase.cs.put.poznan.pl/ (Popenda et al. 2010). Finally, after these steps, our training set contained 108 RNA structures, which we downloaded from the Protein Data Bank (PDB) (Rose et al. 2017) in the form of biological assemblies, which are believed to be the functional form of the macromolecule (Leontis and Zirbel 2012). It should be pointed out that our training set does not contain RNAs in test set I, test set II, or the FARFAR subset in test set III, whereas it does contain 10 complicated native structures from the RNA-Puzzles subset in test set III. These 10 RNAs are riboswitches (3OX0, 4GXY, 4L81, 4QLM, 4XWF), ribozymes (4R4V, 5EAQ), exonuclease-resistant RNA (5TPY), RNA nanosquare (3P59), and a regulatory motif from mRNA (3MEI). The PDB IDs of these 108 RNAs are presented in Supplemental Table S1, and the PDB IDs of the 10 RNAs in the RNA-Puzzles subset are bolded.

In building the six statistical potentials, 85 heavy atom types were considered. The distance cutoff was set to 20 Å and the distance bin width was taken as 0.3 Å, according to a previous study (Wang et al. 2015). When no atom pairs were observed in a given distance bin, the potential for that bin was set to the highest value found in the whole potential. $k_B T$ was taken as the unit of potential energy in this work. Also, for convenience, we used the following abbreviations to represent the six different reference states: Avg-REF (averaging reference state) (Samudrala and Moult 1998), QChA-REF (quasi-chemical approximation reference state) (Lu and Skolnick 2001), ASh-REF (atom-shuffled reference state) (Rykunov and Fiser 2007), FIG-REF (finite-ideal-gas reference state) (Zhou and Zhou 2002), SNI-REF (spherical-noninteracting reference state) (Shen and Sali 2006), and RWC-REF (random-walk-chain reference state) (Zhang and Zhang 2010).
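The binning parameters and the treatment of unobserved bins can be illustrated with a short sketch. It assumes a 20 Å cutoff with 0.3 Å bins as stated in the text, counts pairs with distances in $[0, r_c)$, and replaces missing values with the highest value in whatever list of potential values is passed in. Whether "the whole potential" spans all atom-type pairs or a single pair is not specified here, so that scope is left to the caller. The helper names are hypothetical.

```python
import math

R_CUT = 20.0      # distance cutoff in angstroms (from the text)
BIN_WIDTH = 0.3   # distance bin width in angstroms (from the text)


def bin_index(distance, r_cut=R_CUT, bin_width=BIN_WIDTH):
    """Return the distance bin index, or None if outside [0, r_cut)."""
    if distance < 0.0 or distance >= r_cut:
        return None
    return int(distance / bin_width)


def pair_histogram(coords_i, coords_j, r_cut=R_CUT, bin_width=BIN_WIDTH):
    """Count atom pairs per distance bin for two lists of (x, y, z)."""
    counts = [0] * math.ceil(r_cut / bin_width)
    for a in coords_i:
        for b in coords_j:
            k = bin_index(math.dist(a, b), r_cut, bin_width)
            if k is not None:
                counts[k] += 1
    return counts


def fill_unobserved(potential):
    """Replace None entries (unobserved bins) with the highest value."""
    observed = [u for u in potential if u is not None]
    if not observed:
        raise ValueError("no observed bins to take the maximum from")
    top = max(observed)
    return [top if u is None else u for u in potential]
```

Assigning the highest potential value to unobserved bins penalizes distances that never occur in the training set without introducing infinite energies.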

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.