Literature DB >> 31395671

A structure-based model for the prediction of protein-RNA binding affinity.

Chandran Nithin¹, Sunandan Mukherjee¹, Ranjit Prasad Bahadur¹.

Abstract

Protein-RNA recognition is highly affinity-driven and regulates a wide array of cellular functions. In this study, we have curated a binding affinity data set of 40 protein-RNA complexes, for which at least one unbound partner is available in the docking benchmark. The data set covers a wide affinity range of eight orders of magnitude as well as four different structural classes. On average, we find the complexes with single-stranded RNA have the highest affinity, whereas the complexes with the duplex RNA have the lowest. Nevertheless, free energy gain upon binding is the highest for the complexes with ribosomal proteins and the lowest for the complexes with tRNA with an average of -5.7 cal/mol/Å2 in the entire data set. We train regression models to predict the binding affinity from the structural and physicochemical parameters of protein-RNA interfaces. The best fit model with the lowest maximum error is provided with three interface parameters: relative hydrophobicity, conformational change upon binding and relative hydration pattern. This model has been used for predicting the binding affinity on a test data set, generated using mutated structures of yeast aspartyl-tRNA synthetase, for which experimentally determined ΔG values of 40 mutations are available. The predicted ΔG empirical values highly correlate with the experimental observations. The data set provided in this study should be useful for further development of the binding affinity prediction methods. Moreover, the model developed in this study enhances our understanding on the structural basis of protein-RNA binding affinity and provides a platform to engineer protein-RNA interfaces with desired affinity.

Entities: Species

Keywords: binding affinity; conformation change; dissociation constant; protein–RNA interaction; regression model

Mesh：

Substances：

Year: 2019 PMID： 31395671 PMCID： PMC6859855 DOI： 10.1261/rna.071779.119

Source DB: PubMed Journal: RNA ISSN： 1355-8382 Impact factor: 4.942

INTRODUCTION

Biomolecular assemblies involving proteins and RNAs are essential for many cellular functions, and the stability of these assemblies is mediated by the noncovalent interactions (Pauling and Pressman 1945; Kauzmann 1959; Chothia and Janin 1975; Janin 1995; Draper 1999; Nadassy et al. 1999; Jones 2001; Treger and Westhof 2001; Bahadur et al. 2008). These noncovalent forces are responsible for the binding processes as well as for the folding of the biomolecules; however, specificity plays an important role in the recognition. Although the conformation of the chemical groups, constituting the biomolecules in three dimensions (3D), determine the specificity of binding, the energy required for the biomolecules to adopt this conformation is the determinant factor for the affinity of complex formation and is termed as the free energy of binding. At equilibrium, the change in Gibbs free energy upon binding (ΔG) can be determined from the reaction kinetics in terms of the dissociation constant Kd. Although numerous biophysical and biochemical methods are available to determine the Kd (and consequently the ΔG) values, the gap between the experimentally determined atomic structures of the complexes available in the Protein Data Bank (PDB) (Berman et al. 2002) and their corresponding free energy of binding is still enormous. To bridge this gap, many physics-based methods have been developed to determine the ΔG values theoretically (Kollman 1993), which are often correlated with their corresponding experimental values when available in the literature (Horton and Lewis 1992; Murphy et al. 1993; Vajda et al. 1994; Janin 1995; Ma et al. 2002; Audie and Scarlata 2007; Su et al. 2009; Kastritis and Bonvin 2010). Although these methods are extensively used on the protein–ligand and on the protein–protein complexes (Ballester and Mitchell 2010; Moal et al. 2011), they are yet to be tested on the protein–RNA complexes. One of the major difficulties in testing them on the protein–RNA complexes is the scarcity of the available experimental data of protein–RNA binding affinity. However, the growing interest in the field of studying protein–RNA interactions made available a handful of experimentally determined Kd values for a diverse set of protein–RNA complexes. Recently, we have assembled a protein–RNA docking benchmark v2.0, which contains 126 complexes (Nithin et al. 2017) as an update to the v1.0 of protein–RNA docking benchmarks (Barik et al. 2012b; Pérez-Cano et al. 2012; Huang and Zou 2013). Of these complexes, the binding affinity values are available for 40 cases in the literature. The curated data set represents a wide array of biological functions with affinities covering eight orders of magnitude (range between 10−4 and 10−11 M), as well as four different structural classes. We train a mathematical model to predict the protein–RNA binding affinity using structural and physicochemical parameters of the protein–RNA interfaces derived from this data set. In addition, we have successfully tested this model on a data set of mutated structures of yeast Asp-tRNA synthetase complexed with its cognate tRNA, for which experimental ΔG values of 40 mutations are available in the literature (Eriani and Gangloff 1999). Our findings will provide a basis for the quantitative description of protein–RNA binding affinity and may further be extended to engineer protein–RNA interfaces with desired affinity. Additionally, the curated data set can be used for further development and testing of algorithms to predict the protein–RNA binding affinity.

RESULTS

The data set of protein–RNA binding affinity

The starting point of the present study is the protein–RNA docking benchmark v2.0 compiled by Nithin et al. (2017). We performed an exhaustive literature survey to curate the binding affinity data of the protein–RNA complexes, for which at least one unbound structure is available. The binding affinity data set contains 40 complexes, for which the Kd values were determined by any of the six methods described in the Materials and Methods section. For each complex in this data set, we report the following parameters in Table 1: PDB entry code and the chain identifiers for the complex and its constituents, the length of the RNA molecule in the crystal structure and in the solution, the Kd value and the derived ΔG, the temperature and the pH at which the measurements were done, the experimental methodologies used, and the references reporting the publications. In addition, we also report the values of interface area (B), free energy gain upon binding of the partner molecules, c-rmsd, p-rmsd, i-rmsd, and the flexibility of the interface.

TABLE 1.

The data set for the protein–RNA binding affinity

The data set for the protein–RNA binding affinity There are different data sets available for protein–RNA binding affinity values curated by multiple groups (Nithin et al. 2018). With 73 cases, the first data set of affinity values was developed in 2013 by the Liu group (Yang et al. 2013). The PDBBind data set developed by the Wang group reports affinity values for six protein–RNA complexes (Liu et al. 2015). Similarly, the Bahadur group had curated the data set of affinity values for alanine substitutions in protein components of protein–RNA complexes, which reports 94 experimental affinity values for 14 native structures (Barik et al. 2016). This data set was later expanded by the Deng group to include 49 RNP complexes (Pan et al. 2018), which reports 334 experimental affinity values. More recently, the Mitchell and Zhu groups developed the dbAMEPNI data set, which reports experimentally determined affinity values for 51 protein–RNA complexes (Liu et al. 2018) and for 193 alanine substitutions in these RNA-binding proteins. In the current study, we have used the data set of affinity values curated for complexes available in the protein–RNA docking benchmark, which is highly nonredundant, making the choice justified.

Experimental techniques used to measure the affinity values

An important aspect to be considered about the binding affinity data set is the methodological difference among different experimental techniques used for affinity measurements (Kastritis and Bonvin 2013). Protein–RNA binding affinity is highly dependent on temperature, pH, and ionic strength, as well as the presence of a high concentration of other macromolecules. This has been carefully checked for all the cases present in this data set and is reported in Table 1. The affinity values reported in Table 1 were measured by the following experimental methods: EMSA (Ryder et al. 2008), filter binding assay (Rio 2012), fluorescence spectroscopy (Vivian and Callis 2001), kinetic study (Goodrich and Kugel 2015), ITC (Feig 2009), and SPR (Katsamba et al. 2002). All the experiments that measured the reported dissociation constants in the present data set have been performed in the temperature range between 273 K and 338 K. In 13 cases, the Kd values were determined by EMSA, which is routinely used to visualize the protein–RNA interactions (Ryder et al. 2008). This method relies on the electronic property of the RNA surface, which helps them to migrate toward an anode upon application of an electric field in an agarose or polyacrylamide gel matrix. The average Kd value determined by EMSA is in the micromolar range, although it reports both high affinity (Kd = 5.4 × 10−11 M) in the helicase dbpA-23S rRNA fragment complex (Hardin et al. 2010) and low affinity (Kd = 1.0 × 10−6 M) in elongation factor Sel B and the RNA complex (Yoshizawa et al. 2005). The dissociation constants measured by EMSA were at different temperatures, and the average ΔG is −10.55 kcal/mol (Table 1). EMSA captures more stable interactions, which can tolerate electrophoresis conditions and heat generated during the process (Hellman and Fried 2007). Fluorescent spectroscopy (Vivian and Callis 2001) makes use of either the intrinsic tryptophan fluorescence (Ghisaidoobe and Chung 2014) or fluorophore-labeled oligonucleotide probes (Pagano et al. 2011) to quantitatively monitor the protein–nucleic acid interactions. This method is the second major method that reports nine Kd values with an average of 1.25 × 10−5 M, and the corresponding average for ΔG is −8.81 kcal/mol. The value ranges from 1.1 × 10−4 M in the complex between Nab3-RRM and the UCUU recognition sequence (Lunde et al. 2011) to 4.0 × 10−8 M in the complex between the ZnF domain and pre-mRNA (Teplova and Patel 2008). The affinity measurements by the fluorescence titration method could be affected by the use of a labeling that may alter the binding behavior of the complex (Klotz 1985). Filter binding assay (Rio 2012) is the third major method that reports Kd values for eight complexes. This methodology is based on the assumption that proteins bind to nucleic acids, and if a protein is associated with a nucleic acid, then the complex can also be retained on a nitrocellulose filter (Hall and Kranz 1999). This method is fully informative when there is a single binding site, but it fails at the presence of multiple binding sites. The average Kd value reported by this method is 6.44 × 10−7 M (corresponding average ΔG value is −11.30 kcal/mol) with a range from 5.0 × 10−6 M in the complex between the Rho transcription termination factor and cytosine-rich RNA (Martinez et al. 1996) to 3.33 × 10−11 M in the complex between SRP and 4.5S RNA (Batey et al. 2001). These values were measured either at 298 K or at 310 K. The association between the protein and the nucleic acid must be tight enough to survive the filtration, and the protein must be able to retain the bound nucleic acid when it is in turn bound to the filter (Oehler et al. 1999). ITC measures the heat taken up or given off during interaction between protein and RNA (Feig 2009). The enthalpy changes associated with the binding are measured by calorimetry, which makes it the only method available for directly determining the thermodynamic parameters associated with an interaction (Ladbury and Chowdhry 1996). ITC was used in the determination of Kd values for seven complexes with an average Kd of 4.67 × 10−7 M, and the values range from 1.2 × 10−6 M in the complex between NS1 and dsRNA (Cheng et al. 2009) to 1.26 × 10−8 M in the complex between GTPase ERA and 16S rRNA (Tu et al. 2011). Even though ITC has high sensitivity, the extraction of heat effects of complex formation can be challenging under certain conditions. For instance, when the association of protein and RNA exhibits small binding enthalpy, the signal-to-noise ratio is relatively low. Similarly, when the complex formation is a rather slow process, the kinetically low process can get neglected (Du et al. 2016). The other two methods, SPR and kinetic study, report Kd values for two (9.8 × 10−10 M and 5.3 × 10−9 M) and one (3.6 × 10−7 M) complex, respectively. SPR is an optical technique that measures the refractive index near the surface of the sensor (Yang et al. 2008). SPR (Biacore) is able to detect more transient and less stable interactions in the solution phase when the ligand molecule is provided at a very low concentration ranging from picomolar to nanomolar (Katsamba et al. 2002). However, the measurements performed by SPR could be affected by mass transfer limitations and the heterogeneity of the binding surface (Schuck and Zhao 2010).

Affinity in different structural classes

In the present data set, nine out of 40 complexes have both the protein and the RNA components in their free forms. Table 2 represents the data set divided into four structural classes as provided in the protein–RNA docking benchmark: (A) complexes with tRNA, (B) complexes with ribosomal protein, (C) complexes with duplex RNA, and (D) complexes with single-stranded RNA. They represent 12.5%, 5%, 37.5%, and 45% of the cases, respectively. In addition, Table 2 provides the mean value and the standard deviation for ΔG, B, and i-rmsd in each class. All the structures reported in this data set are of medium affinity class as described in the affinity benchmark of protein–protein complexes by Kastritis et al. (2011).

TABLE 2.

Affinity and structural classes of the complexes

Affinity and structural classes of the complexes The structural class A contains five complexes with aminoacyl tRNA synthetases and their corresponding tRNAs. The Kd is in the range from 3 × 10−8 M to 3.6 × 10−7 M. The average ΔG in this class is −9.97 kcal/mol, and the corresponding average interface area (B) is 4022 Å2. The data set contains only two Kd values for the class B complexes involving ribosomal protein and rRNA. For both of them, the dissociation constant is ∼7 × 10−8 M. The average ΔG in this class is −9.4 kcal/mol, and the average B is 1404 Å2. Class C, complexes with duplex RNA, exhibits a wide Kd range from 1.2 × 10−6 M to 3.33 × 10−11 M with an average of 1.04 × 10−7 M (corresponding average ΔG value is −11.47 kcal/mol). In this class, the highest affinity is found in the complex between the SRP and 4.5S RNA (Kd is 3.33 ×10−11 M). Here, the minor groove of 4.5S RNA recognizes the M domain of SRP and the recognition is highly sequence-specific (Batey et al. 2001). The buried surface area between the protein and the RNA is 1364 Å2 (Fig. 1A) with a fully flexible interface (i-rmsdRNA = 30.4 Å). The average ΔG in class C complexes is −11.5 kcal/mol, and the average B is 2313 Å2. Class D involves complexes with single-stranded RNA with a wide Kd ranging from 9.80 × 10−10 M to 1.10 × 10−4 M. The average ΔG is −8.98 kcal/mol, and the average B is 1950 Å2. This structural class contains the lowest affinity complex involving Nab3-RRM and the UCUU recognition sequence (Kd is 1.1 × 10−4 M) in which the protein–RNA recognition process is sequence-specific (Lunde et al. 2011). The buried surface area between the protein and the RNA is very small (926 Å2) (Fig. 1B) with a rigid interface (i-rmsd = 1.2 Å). Table 2 shows that average ΔG is the highest for complexes involving single-stranded RNA and the lowest for complexes involving duplex RNA with significant standard deviation in each class.

FIGURE 1.

Test cases in the protein–RNA affinity data set. (A) Structure of an SRP in complex with 4.5 S RNA (PDB ID: 1HQ1; structural class C), and (B) structure of the Rho transcription termination factor in complex with cytosine-rich RNA (PDB ID: 2A8V; structural class D). In each case, the bound form of the protein is in green, and the unbound form is in cyan. The bound form of RNA is in gray, and the unbound is in olive.

Conformational changes and the binding affinity

Protein–RNA interactions are often associated with conformational changes in which both the protein and the RNA molecules undergo significant rearrangements in their 3D structures upon binding (Ellis et al. 2007). At the protein side, the frequent changes include the side-chain rotation, small adjustments of the main chain, or large movements of the domains. Nevertheless, large domain movement exhibited in the ribosomal protein L1-mRNA complex (Tishchenko et al. 2006) or the transition of the polypeptide chain from disordered to ordered state exhibited in the P22 N-peptide bound to boxB RNA (Bahadur et al. 2009) and in the pseudouridine synthase TruB-tRNA complex (Pan et al. 2003) are also associated with RNA binding. Being more flexible than protein, RNA undergoes larger conformational changes than its partner protein at the protein–RNA binding site (Barik et al. 2012b). The data set displays examples with varying flexibility at the protein–RNA binding sites. While at one end it has a full flexible protein exemplified in the O-phosphoseryl-tRNA kinase-tRNA-Sec complex (i-rmsd: 27.3 Å; PDB ID: 3ADB), at the other end it has a rigid complex formed between exosome and substrate RNA (i-rmsd: 0.3 Å; PDB ID: 3IEV). The corresponding ΔG values are −11.31 kcal/mol and −7.98 kcal/mol, respectively, with a significant difference in their interface area, which are 3014 Å2 and 1533 Å2, respectively (Table 1). Average i-rmsd is the lowest in the complexes with single-stranded RNA and the highest in the complexes with double-stranded RNA (Table 2); however, the overall correlation with the ΔG is poor (Table 3).

TABLE 3.

Pearson correlation coefficient between binding affinity and structural parameters

Variable length of the RNA and affinity measurements

For each complex, Table 1 shows the length of the RNA in the crystal structure, as well as in the solution prepared for the Kd measurement. In most cases, the length of the RNA in the solution used for determining the Kd values are different from those available in the crystal structures. In the structural class A, in all but one case, the length of the RNAs from crystal and from solution is identical. This exception is found in the complex CCA-adding enzyme bound to 35-mer tRNA. Here, the partial tRNA used in the crystal is a synthetic construct (Tomita et al. 2006). In the structural class B, the RNA molecules in the crystal are shorter than those in the solution for Kd measurements. In the structural class C, significant RNA length variability is observed in three out of 15 cases. In the structural class D, significant RNA length variability is observed in nine out of 18 cases. In all of these cases, where the length variation exists, most of the crystal structures contain a part of the full-length RNA or synthetic RNA constructs, which are generally biologically irrelevant. In such cases, to predict the binding affinity theoretically, predictors should start with the longer RNA and model its conformation (Sim et al. 2012).

Mathematical model for the prediction of binding affinity

The data set of affinity values curated in this study has been used to generate a regression model to predict the binding affinity from various structural and physicochemical parameters (Supplemental Table S1). We have developed mathematical models as described in the Materials and Methods section using Eureqa (Schmidt and Lipson 2009). Models with R2 (the coefficient of determination) below 0.8 were discarded. The models fitted with the lowest error are selected and listed in Supplemental Table S2. The best fit model with minimum values for maximum error reported in the regression fitting is where is the fraction of nonpolar interface atoms on the protein side. The parameter dr quantifies the relative hydration of interfaces. It is defined as the ratio of the average distance of interface waters to the average distance of interface atoms (Barik and Bahadur 2014). All the distances were measured from the center of mass of the interface. The above model has a goodness of fit (R2) value of 0.92 and a correlation coefficient of 0.96, with a mean absolute error of 0.27 and mean squared error of 0.30. The goodness of fit is represented graphically as a plot of experimental and predicted values of ΔG (Fig. 2).

FIGURE 2.

Observed ΔG (in kcal/mol) versus predicted ΔGempirical (in kcal/mol) for the training data set.

Observed ΔG (in kcal/mol) versus predicted ΔGempirical (in kcal/mol) for the training data set. The intrinsic variability embedded in different experimental techniques for measuring the binding affinity may account for the difference between the predicted and the experimental values. To test the validity of the model building process, different models were trained on data derived from a single experimental method. The experimental methods A, B, C, and E were used to train the models. The number of data points available for experimental methods D and F was not sufficient to train mathematical models. Equations 2 to 5 represent the best models from training using data from A, B, C, and E, respectively: The models trained were tested on the data set and the results are shown in Figure 3. The variable sensitivity report for all the models is available in Supplemental Table S3.

FIGURE 3.

A comparison of predictions for models trained on four data sets segregated based on the experimental method as well as the overall training data set. The four segregated data sets include Kd values determined by filtration assay (A), fluorescence titration (B), EMSA (C), and ITC (E). The number of data points available for Kd determined by two experimental methods, binding kinetics (D) and SPR (F), was not sufficient to train independent mathematical models.

Validation of the mathematical model on the test data set

The model was tested on a data set of ΔG values curated from literature for yeast aspartyl-tRNA synthetase (AspRS) (Eriani and Gangloff 1999). Experimentally determined ΔG values for 40 mutations are available for this complex. Yeast AspRS (PDB ID: 1ASY) representing the various mutated residues at interface and noninterface regions is shown in Figure 4. We modeled each of these mutated structures using the native structure as the template. The various structural and physicochemical parameters were calculated for the mutated structures and the ΔGempirical values were calculated using the developed model (Supplemental Table S4). The predicted ΔGempirical and experimental ΔG are presented as a scatter plot in Figure 5. The standard error, the mean absolute error, and the percentage relative error observed in this prediction are 1.69%, 1.42%, and 16.95% (Supplemental Table S4), respectively.

FIGURE 4.

FIGURE 5.

Observed ΔG (in kcal/mol) versus predicted ΔGempirical (in kcal/mol) for the substitutions in yeast Asp-tRNA synthetase.

Mutated residues in yeast Asp-tRNA synthetase (PDB ID: 1ASY). The interface and the noninterface residues are shown in blue and cyan, respectively. The residues mutated at the interface are shown in red, whereas those at the noninterface region are shown in olive. Observed ΔG (in kcal/mol) versus predicted ΔGempirical (in kcal/mol) for the substitutions in yeast Asp-tRNA synthetase.

DISCUSSION

The availability of experimental binding affinity values facilitates correlating them with the structural data available from the 3D structure determination methods. The binding affinity measures the strength of the association between the biomolecules, and because of the difficulty to measure them directly from the experiments in some specific systems, efforts have been made to estimate them from the correlation derived from the structural data. This effort is very successful in the case of protein–protein complexes correlating the binding free energy with the buried surface area (Chothia and Janin 1975; Guharoy and Chakrabarti 2005). The affinity benchmark for protein–protein complexes (Kastritis et al. 2011; Vreven et al. 2015) facilitates the progress of the theoretical models that uses 3D atomic structures to predict the binding free energy. This was started almost 29 years ago by Horton and Lewis (1992), with a handful of experimental dissociation constants curated from the literature. Since then, many computational methods have been developed based on empirical, semi-empirical, or knowledge-based approaches, which predict the binding affinity of protein–protein (Baker and Murphy 1998; Ma et al. 2002; Audie and Scarlata 2007; Kastritis et al. 2011; Tian et al. 2012; Janin 2014; Xue et al. 2016), protein–ligand (Böhm 1994; Eldridge et al. 1997; Mitchell et al. 1999; Gilson and Zhou 2007; Kim and Skolnick 2008; Ballester and Mitchell 2010), and protein–DNA complexes (Selvaraj et al. 2002; Zhang et al. 2005) with varied success. However, because of lack of availability of data, implementation of these methods on protein–RNA complexes is still in its infancy (Yang et al. 2014). The success of all these binding affinity prediction methods partially depends on the quality of the data, which can often be contaminated by incorrect Kd values or by the association of Kd with incorrect PDB entries due to human error during the literature search. However, even for a high-quality data set, these methods fail to predict the binding affinity correctly because of the lack of accountability for the conformational changes during the binding process. In this current study, we have developed a structure-based mathematical model for the prediction of protein–RNA binding affinity values. The model was trained on a curated data set of protein–RNA binding affinity available in the literature. In addition, the model was further validated on a protein–RNA complex, for which the binding affinity values of 40 mutations are available in the literature. The affinity benchmark for the protein–protein complexes includes 179 entries (Vreven et al. 2015); however, because of the scarcity of data on protein–RNA complexes, we are restricted here to only 40 entries. Our data set is not only diverse in terms of the RNA structural classes with complexes involving tRNA and ribosomal RNA to duplex and single-stranded RNA, but also diverse in terms of the partners’ affinity with Kd ranging between 10−4 M and 10−11 M. This data set also represents a wide variety of complexes, resembling rigid body, semiflexible, and full flexible based on the conformational changes upon complex formation (Bahadur et al. 2008). In the present data set of protein–RNA binding affinity, the linear correlations (Pearson correlation coefficient r) between ΔG and i-rmsd and between ΔG and p-rmsd are mediocre, −0.41 and −0.51, respectively. However, there is no observable correlation between ΔG and interface area (B) and between ΔG and c-rmsd for the entire data set (Table 3). In the unbound–unbound class, there exists a good correlation between ΔG and i-rmsd (r = −0.91). Except in class A, a mixed correlation is observed between the affinity and the structural parameters among different structural classes. We have calculated the free energy gain upon complex formation (the free energy per unit buried surface area of the protein–RNA complex), which is −5.7 cal/mol/Å2 for the entire data set (Table 2). This gain is maximum for the complexes with ribosomal proteins (−7.0 cal/mol/Å2) and minimum for the complexes with tRNA (−2.7 cal/mol/Å2). For the other two structural classes, this value is −5.6 cal/mol/Å2 (complexes with duplex RNA) and −5.9 cal/mol/Å2 (complexes with single-stranded RNA). We did not obtain a correlation coefficient of >0.73 in any class or in any temperature category. This may be attributed to the effect of temperature, pH, and ionic strength on the measurement of Kd. This is evident in our data set as the temperature, pH, and ionic strength are different for different experimental measurements. The curated data set was trained to develop a mathematical model to predict the values of binding free energy of protein–RNA complexes from the structural parameters. The mathematical model developed in this study depends on structural and physicochemical parameters of both the bound and the unbound structures, namely, the fraction of nonpolar protein atoms at the interface, the interface rmsd, and the distribution of water molecules at the interface. Previous reports show that these three parameters play crucial roles in determining protein–RNA recognition (Bahadur et al. 2008; Barik and Bahadur 2014; Barik et al. 2015, 2016). Furthermore, these parameters were also used quite efficiently by many research groups to address various issues in macromolecular recognition (Jayaram and Jain 2004; Rodier et al. 2005; Li and Lazaridis 2007; Teyra and Pisabarro 2007; Janin and Bahadur 2008; Reichmann et al. 2008; Hou et al. 2011; Barik and Bahadur 2014; Barik et al. 2015, 2016). For example, they are quite efficient in discriminating the biological interfaces from the crystal contacts in protein–protein complexes (Bahadur et al. 2004; Rodier et al. 2005; Janin et al. 2008; Terribilini 2008; Iwakiri et al. 2012) and are useful in efficient scoring function for macromolecular docking (Chen et al. 2004; Zheng et al. 2007; Setny and Zacharias 2011; Tuszynska and Bujnicki 2011; Zhao et al. 2011; Li et al. 2012; Huang et al. 2013, 2016; Huang and Zou 2014) and the prediction of binding affinity and binding hotspots (Yang et al. 2014; Barik et al. 2016; Krüger et al. 2018; Pan et al. 2018). The model depends on the fraction of nonpolar interface area, , which measures the hydrophobicity at the binding sites. It is evident that the protein–RNA binding is often driven by the conformational changes, and the mathematical model includes the effect of conformational flexibility in the empirical estimation of binding free energy. The model has two terms with this parameter. In the first term, the relationship is inverse, and this shows that a highly polar interface is preferred for RNA binding. This is typical of protein–RNA interfaces, as the polar residues contribute significantly to the binding to the negatively charged phosphate of RNA (Bahadur et al. 2008; Barik et al. 2015). Moreover, the nonpolar region of the interface undergoes significant conformational changes and higher changes in accessible surface area (Mukherjee and Bahadur 2018). The dependence of binding affinity on the parameter dr indicates the important role played by the water molecules in protein–RNA recognition (Barik and Bahadur 2014; Barik et al. 2016). The dr values indicate the relative hydration of protein–RNA recognition sites. Previous studies have demonstrated that the water-binding sites at the protein–RNA interfaces vary significantly from the protein–protein interfaces (Rodier et al. 2005; Barik and Bahadur 2014; Mukherjee et al. 2019). The contribution of water-mediated interactions at the protein–RNA interfaces is significantly higher than at the protein–protein interfaces, and a substantial amount of these interactions are mediated by the 2′ OH group of the ribose sugar. The mathematical model trained in this study shows that the “dr” values contribute significantly to the ΔGempirical values. This can be attributed to the higher number of water-mediated interactions at the protein–RNA interface. The model was tested on AspRS (PDB ID: 1ASY), for which experimentally determined affinity values of 40 mutations are available (Eriani and Gangloff 1999), which were not used in the training data set. Of these 40 mutations, 29 are single, 10 are double, and one is a triple mutant (Supplemental Table S4). All the affinity values for the native and mutant structures were measured using nitrocellulose binding assay (Eriani and Gangloff 1999). The majority (35) of these mutations are at the interface, whereas only five are at the noninterface region. Our model predicted the affinity values of 29 (75%) mutants with an absolute error of < 2 kcal/mol. In the case of eight mutants, the absolute error is within a limit of 3.5 kcal/mol. The remaining two cases, in which the absolute error is >3.5 kcal/mol, are double mutants at the interface. In these two cases, the ΔΔG values are significantly higher, 2.68 and 2.97 kcal/mol. The different amino acid substitutions in AspRS were at four different binding regions: the terminal A binding region, the acceptor arm binding region, the anticodon loop binding region and the central core binding region (Eriani and Gangloff 1999). The terminal A binding region does not influence the binding of tRNA to AspRS in the ground state and is involved in acylation. The prediction model shows the lowest root mean squared error (RMSE)—0.83 kcal/mol—for these cases. The predicted values in substitutions on the anticodon binding region and central core binding region show comparable RMSEs with 1.99 kcal/mol and 1.91 kcal/mol, respectively, whereas the predictions for substitutions at the acceptor arm binding region show a slightly lower RMSE of 1.66 kcal/mol. The predicted values for double substitutions involving both anticodon loop binding and acceptor arm binding regions show the highest RMSE—2.64 kcal/mol (Supplemental Table S4). While testing the mathematical model using the mutant structures of yeast AspRS, the mean standard error, the mean absolute error, and the mean percentage relative error observed are 1.69%, 1.42%, and 16.95%, respectively. However, the correlation coefficient between the experimental ΔG and predicted ΔGempirical values is poor (R2 = 0.12). This poor correlation coefficient may be attributed to two reasons: the errors introduced by modeling and the poor estimate of dr values from the modeled complexes. The template-based modeling technique is unable to recapitulate the structural changes introduced by the alanine substitutions (Fiser 2010). The higher RMSE values observed for the double mutants may be attributed to the modeling error. However, for the mutants with higher ΔΔG values, we observe higher relative error in the predicted ΔGempirical. The changes in binding affinity can be quantified in terms of ΔΔG and are indicative of the structural changes in the protein that affects the RNA binding. The correlation coefficient between the experimental and predicted ΔG values is 0.29 for mutants with ΔΔG ≤ 1.5 kcal/mol. To estimate the dr value with better accuracy, we chose a subset of eight mutants with alanine substitutions at the protein–RNA interface with ΔΔG ≤ 2.0 kcal/mol and solvated them. The solvated structures were further minimized and equilibrated and used to estimate the dr values (Supplemental Table S5). The current force fields available for simulations of protein–RNA complexes have their own limitations (Šponer et al. 2018), which might add additional errors to the modeled structures. Despite this, the correlation between experimental ΔG and predicted ΔGempirical values improves from 0.17 to 0.75 for this subset of mutants. Moreover, for this subset of eight mutants, the RMSE values improve from 1.41 kcal/mol to 0.51 kcal/mol. The prediction accuracy is very much dependent on the quality and accuracy of the experimental data. The temperature, pH, ionic condition, and intrinsic variability embedded in different techniques may also account for the difference between predicted ΔGempirical and experimental ΔG values.

Conclusion

The protein–RNA binding affinity data set curated in this study contains a wide variety of complexes in terms of their structural classes as well as their cellular functions. Moreover, the complexes have a wide range of affinities in which the dissociation constant spans eight orders of magnitude. This data set should be a valuable resource for the computational structural biologists attempting to predict the binding affinity from atomic structures and will also stimulate the development of novel methods accounting for the flexibility in the assembly formation. The major challenges in making this data set are the paucity of binding affinity data and their reliable curation from the literature. The curated data set of affinity values was used to develop a structure-based mathematical model for predicting the binding affinity of protein–RNA complexes. The model developed in this study is highly accurate and can be deployed for finding the affinity of all known protein–RNA complexes. This model will enhance our understanding of the structural basis of protein–RNA binding affinity and may be valuable to the experimentalists aiming to engineer protein–RNA interfaces with desired affinity.

MATERIALS AND METHODS

The data set of binding affinity

The protein–RNA docking benchmark was used to get the data set of protein–RNA complexes for which bound and unbound structures are available (Nithin et al. 2017). The binding affinity values were manually curated from the available literature. The publications available as references for the atomic structures submitted in the PDB were checked for the corresponding Kd values. When the values were not found in the primary citation, they were found by conducting an exhaustive search by clicking the button “Search Related Articles in PubMed.” The experimental Kd values were taken only if the protein from the publication was the same as that in the published crystal structure and if there was a match in the organism name. In most of the cases, when the reported Kd values were taken from sources other than the published X-ray structure in the PDB, the affinity measurements were done on protein samples or genetic constructs that were different from the X-ray studies and under different experimental conditions. The various experimental methods used to calculate the dissociation constants are the filter binding assay (Rio 2012), EMSA (Ryder et al. 2008), fluorescence spectroscopy (Vivian and Callis 2001), kinetic study (Goodrich and Kugel 2015), ITC (Feig 2009), and SPR (Katsamba et al. 2002). The temperature, pH, and the experimental conditions were recorded for all 40 cases from the same publication if available or from the references or protocols followed by the authors. The Gibbs free energy of dissociation was calculated by taking the temperature stated and using the following equation (c0 = 1M standard state):

Structural and physicochemical parameters used in the prediction of binding affinity

The following parameters of protein–RNA interfaces were considered in the prediction model: number of amino acids, number of nucleotides, number of interface water molecules, number of preserved interface waters, number of water bridges, dr (parameter to quantify the dry and wet protein–RNA interfaces), number of hydrogen bonds, fraction of nonpolar area on the protein and the RNA sides, fraction of buried atoms on the protein and the RNA sides, local density (LD) indices on the protein and the RNA sides, number of salt bridges and stacking interactions, shape complementarity index, and gap volume (GV) index. All these parameters are calculated following Barik et al. (2015) and Barik and Bahadur (2014). The size of the protein–RNA interfaces was estimated by measuring the solvent accessible surface area (SASA) buried in the contact. The interface area (B) is the sum of the SASA of the two components less that of the complex and was calculated using the PRince web server (Barik et al. 2012a). The SASA values were measured with the program Naccess (Hubbard and Thornton 1993), which implements the Lee and Richards (1971) algorithm with a probe radius of 1.4 Å and default group radii: During complex formation, an atom may lose its SASA completely and thus become fully buried in the interface. Those atoms that lose SASA upon complexation were identiﬁed as the interface atoms. The fraction of such buried atoms (fbu) was calculated at the binding regions on both the protein and the RNA side of the interface using the following equation: Although fbu measures the compactness of the atomic packing at the interface, the LD index is used to measure the atomic density at each point of the interface (Bahadur et al. 2004). In brief, LD is deﬁned as the mean number of interface atoms that are within 12 Å of another interface atom. If an interface has N atoms, and if ni atoms are within a distance of 12 Å from a given interface atom i, the LD for that subunit is calculated as below: Hydrogen bonds (H-bonds) at the protein–RNA interfaces were calculated using the software HBPLUS (McDonald and Thornton 1994) with default parameters. The salt bridges at protein–RNA interfaces were calculated when the distance between the side-chain nitrogen atoms of positively charged residues and the negatively charged phosphate group of the nucleotides is within 4 Å (Barlow and Thornton 1983; Xu et al. 1997). Stacking interactions at protein–RNA interfaces are usually deﬁned as the π–π interactions that can occur between the side chains of Tyr, Trp, Phe, His, and the RNA bases. Moreover, the π–π and π-cation stacking of Arg through its guanidinium moiety with nucleosides were included in the calculation of stacking interactions (Allers and Shamoo 2001). The planes were defined at both sides by considering the atoms constituting the aromatic rings, and the center of the plane was calculated as the midpoint of all these atoms. The cutoff distance between the centers of both the planes was kept at ≤5 Å, and the dihedral angle between the two planes was constrained to ≤30° (Allers and Shamoo 2001). The stagger angle is deﬁned as the angle between the normal to the ﬁrst plane and the vector joining the centers of the two planes. The shape correlation index (Sc) (Lawrence and Colman 1993; Allers and Shamoo 2001) was used to quantify the shape complementarity at protein–RNA interfaces. Atomic packing at protein–RNA interfaces was also evaluated using the gap volume index (GV) (Jones and Thornton 1996) given by the following equation: The GV for each complex was calculated using the SURFNET program (Laskowski 1995). The structural parameter dr was used to quantitatively define the distribution of the water molecules at the protein–RNA interface. Briefly, dr is a ratio of the average distance of interface water molecules to the average distance of interface atoms contributed by the protein and the RNA chains. In both cases, distances were measured from the center of mass of the interface: The superposition of the structures was performed using the distance matrix alignment method implemented in the Dali server (Holm and Laakso 2016). The root-mean-squared displacement (rmsd) values were calculated from the interface amino acids and nucleotides after superposing the respective bound and unbound structures of the protein and the RNA components when available in their free form. For each polypeptide chain, c-rmsd was calculated, which is the displacement of all the equivalent Cα atoms between the superposed bound and the unbound proteins. For the RNA chains, the p-rmsd values were calculated in a similar way over the superposed equivalent backbone phosphorus (P) atoms. Interface rmsd (i-rmsd) values were calculated for each complex considering only the equivalent Cα atoms of the interface residues and/or the equivalent P atoms of the interface nucleotides. The different structural and physicochemical parameters used in this study are available in Supplemental Table S1.

Development of the mathematical model for prediction of binding affinity

The mathematical models were generated using regression analysis following Schmidt and Lipson (2009) implemented in Eureqa software. The different physicochemical and structural parameters for the protein–RNA complexes were provided as inputs to the software. The data set was split randomly into a training data set and a validation data set by the Eureqa software used in this study. The training set is used to generate and optimize solutions, and the validation set is used to test how well those models generalize to new data. Eureqa also uses the validation data to filter out the best models. The following mathematical operations were applied to the input variables while searching for equations that fit the input data: addition, subtraction, multiplication, division, sine, cosine, and constants. For each pair of variables, the numerical partial derivatives were calculated, and these partial derivatives were used to evaluate the symbolic functions generated in the subsequent steps. Candidate symbolic functions were generated to represent the relationship between different parameters provided as input. Initially, these functions generated were random; however, the subsequent filtering of the generated equations allows them to converge to the mathematical representation of input data. In symbolic regression, many initial random symbolic equations compete to model input data in the most parsimonious way. New equations are formed by recombining previous equations and probabilistically varying their subexpressions. The equations that model the input data well were retained, whereas unpromising solutions were removed. To evaluate the predictive ability, the partial derivatives for each pair of variables were computed for the candidate functions, and cross-validation analysis was performed with the partial derivatives derived from the input data. The best matching equations were selected, and this process was repeated iteratively until the predictive ability of these equations reached sufficient accuracy. The most parsimonious equation generated from this process was returned as the best mathematical model (Fig. 6). For each equation, the predictive ability is measured in terms of goodness of fit (R2), correlation coefficient, maximum error, mean squared error, and mean absolute error. The equations with R2 ≥ 0.80 were selected as plausible models (Supplemental Table S2). The best fit model was selected from these plausible models, and used for predicting the ΔGempirical.

FIGURE 6.

Schematic representation of the workflow followed in mathematical modeling for prediction of binding affinity in protein–RNA complexes.

Validation of the mathematical model on a test data set

The model was tested on a data set generated using mutated structures of yeast AspRS, which reports experimentally determined ΔG values of 40 mutations (Eriani and Gangloff 1999). The experimental Kd values were used to calculate both ΔG (using Equation 6) and ΔΔG values of the mutants (using Equation 12): The wild-type PDB structures (PDB ID: 1ASY, IEOV) were used as the templates for modeling all the bound and unbound forms of the proteins with alanine substitutions. The modeling and energy minimization were performed using Modeler (Webb and Sali 2016). The energy-minimized modeled structures are used to calculate structural and physicochemical parameters described above for evaluating the mathematical model for predicting ΔG. For each of the 40 mutant structures, the ΔGempirical was calculated using the equation modeled in this study. To evaluate the accuracy of predictions, the RMSE was calculated using the following equation: A subset of eight mutants with ΔΔG ≤ 2.0 kcal/mol was further optimized using AMBER. The complexes and their unbound structures were solvated using the TIP3P (Sun and Kollman 1995) water model with a truncated octahedral box of 10 Å. The solvated structures were energy-minimized for 10,000 cycles with restraint, followed by 10,000 cycles without restraint. The minimized structures were subjected to heating, density equilibration, and short runs of equilibration. The heating was done from 100 K to 300 K for 500 psec with restraints on the entire structure. The density equilibration was performed for 500 psec with restraints on the entire structure. The equilibration of the structures was run for four short rounds. The first three rounds of equilibration were run for 200 psec each with main chain atoms constrained. The final round of equilibration was performed for 2 nsec. Amber ff14sb force field (Maier et al. 2015) and χOL3 (Banáš et al. 2010; Zgarbová et al. 2011) were used for this study. The minimization was performed using Sander, and the subsequent steps were performed using the CUDA version of PMEMD available in the AMBER package (Götz et al. 2012; Le Grand et al. 2013; Salomon-Ferrer et al. 2013).

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

145 in total

1. Structural insights into RNA quality control: the Ro autoantigen binds misfolded RNAs via its central cavity.

Authors: Adam J Stein; Gabriele Fuchs; Chunmei Fu; Sandra L Wolin; Karin M Reinisch
Journal: Cell Date: 2005-05-20 Impact factor: 41.582

2. A knowledge-based potential function predicts the specificity and relative binding energy of RNA-binding proteins.

Authors: Suxin Zheng; Timothy A Robertson; Gabriele Varani
Journal: FEBS J Date: 2007-11-12 Impact factor: 5.542

3. Quantitative prediction of protein-protein binding affinity with a potential of mean force considering volume correction.

Authors: Yu Su; Ao Zhou; Xuefeng Xia; Wen Li; Zhirong Sun
Journal: Protein Sci Date: 2009-12 Impact factor: 6.725

4. The Era GTPase recognizes the GAUCACCUCC sequence and binds helix 45 near the 3' end of 16S rRNA.

Authors: Chao Tu; Xiaomei Zhou; Sergey G Tarasov; Joseph E Tropea; Brian P Austin; David S Waugh; Donald L Court; Xinhua Ji
Journal: Proc Natl Acad Sci U S A Date: 2011-06-06 Impact factor: 11.205

Review 5. Protein-protein recognition.

Authors: J Janin
Journal: Prog Biophys Mol Biol Date: 1995 Impact factor: 3.667

Review 6. Sensing the heat: the application of isothermal titration calorimetry to thermodynamic studies of biomolecular interactions.

Authors: J E Ladbury; B Z Chowdhry
Journal: Chem Biol Date: 1996-10

7. Molecular architecture of protein-RNA recognition sites.

Authors: Amita Barik; Nithin C; Smita P Pilla; Ranjit Prasad Bahadur
Journal: J Biomol Struct Dyn Date: 2015-02-11

8. The interpretation of protein structures: estimation of static accessibility.

Authors: B Lee; F M Richards
Journal: J Mol Biol Date: 1971-02-14 Impact factor: 5.469

9. Divergent evolutions of trinucleotide polymerization revealed by an archaeal CCA-adding enzyme structure.

Authors: Mayuko Okabe; Kozo Tomita; Ryuichiro Ishitani; Ryohei Ishii; Nono Takeuchi; Fumio Arisaka; Osamu Nureki; Shigeyuki Yokoyama
Journal: EMBO J Date: 2003-11-03 Impact factor: 11.598

10. Characterization of interfacial solvent in protein complexes and contribution of wet spots to the interface description.

Authors: Joan Teyra; M T Pisabarro
Journal: Proteins Date: 2007-06-01

2 in total

1. A Novel Family of RNA-Binding Proteins Regulate Polysaccharide Metabolism in Bacteroides thetaiotaomicron.

Authors: Amanda N D Adams; Muhammad S Azam; Zachary A Costliow; Xiangqian Ma; Patrick H Degnan; Carin K Vanderpool
Journal: J Bacteriol Date: 2021-07-12 Impact factor: 3.490

2. ProNAB: database for binding affinities of protein-nucleic acid complexes and their mutants.

Authors: Kannan Harini; Ambuj Srivastava; Arulsamy Kulandaisamy; M Michael Gromiha
Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971

2 in total