Literature DB >> 35969673

3D-RISM-AI: A Machine Learning Approach to Predict Protein-Ligand Binding Affinity Using 3D-RISM.

Kazu Osaki¹, Toru Ekimoto¹, Tsutomu Yamane², Mitsunori Ikeguchi^1,2.

Abstract

Hydration free energy (HFE) is a key factor in improving protein-ligand binding free energy (BFE) prediction accuracy. The HFE itself can be calculated using the three-dimensional reference interaction model (3D-RISM); however, the BFE predictions solely evaluated using 3D-RISM are not correlated to the experimental BFE for abundant protein-ligand pairs. In this study, to predict the BFE for multiple sets of protein-ligand pairs, we propose a machine learning approach incorporating the HFEs obtained using 3D-RISM, termed 3D-RISM-AI. In the learning process, structural metrics, intra-/intermolecular energies, and HFEs obtained via 3D-RISM of ∼4000 complexes in the PDBbind database (ver. 2018) were used. The BFEs predicted using 3D-RISM-AI were well correlated to the experimental data (Pearson's correlation coefficient of 0.80 and root-mean-square error of 1.91 kcal/mol). As important factors for the prediction, the difference in the solvent accessible surface area between the bound and unbound structures and the hydration properties of the ligands were detected during the learning process.

Entities: Chemical

Mesh：

Substances：
Ligands
Proteins

Year: 2022 PMID： 35969673 PMCID： PMC9421647 DOI： 10.1021/acs.jpcb.2c03384

Source DB: PubMed Journal: J Phys Chem B ISSN： 1520-5207 Impact factor: 3.466

Introduction

Developing accurate and efficient methods for predicting the binding affinity of ligands to target proteins is required in computer-aided drug discovery.[1] The binding affinity experimentally evaluated using the half-maximal inhibitory concentration (IC50) or the dissociation constant (Kd) can be converted to binding free energy (BFE).[2,3] Currently, BFE calculations based on atomic structures are widely performed in pharmaceutical processes daily. For example, in in silico screening processes, to rank many ligands in terms of affinity, BFEs of the ligands are calculated quickly. Thus, empirical scores of docking simulations (e.g., Glide[4]) are frequently used. After the screening process, ligand optimization processes are conducted to increase the ligand activity. Therefore, the BFEs of a few tens of ligands must be calculated accurately to examine the structure–activity relationships. Thus, free energy calculations based on molecular dynamics (MD) simulations are useful.[1] To improve the accuracy of BFE calculations, the accurate treatment of the hydration effects is a key factor because many water molecules are involved in the ligand-binding process. Upon ligand binding, water molecules in the binding site are replaced by the ligand and are forced to rearrange in the bound state.[5] The hydration water molecules around the ligand in the isolated state are dispelled upon ligand binding. In MD simulations using an explicit solvent, the hydration effects can be considered explicitly. However, MD simulations have high computational demands, leading to a loss in computational efficiency. Therefore, precisely handling the hydration effects is a trade-off between accuracy and efficiency. Various computational methods for BFE calculation based on physicochemical approaches[1,4,6−22] using the thermodynamic cycle (Figure ) or data-driven approaches using machine learning[23−31] have been proposed to date. The required computational burdens of physicochemical approaches differ depending on the incorporation methods of the hydration effects. In most physicochemical approaches, the interaction energy between a protein and a ligand can be calculated using force fields, whereas the hydration effects are handled differently. For example, in docking simulations with abundant protein–ligand pairs, the receptor structure is often fixed, and the hydration effect is implicitly and approximately evaluated using empirical score functions (e.g., Glide score[4]) to quickly evaluate BFEs. In molecular mechanics (MM) and Poisson–Boltzmann surface area (PB/SA) or MM and generalized Born surface area (GB/SA) methods, the interaction energy between a protein and a ligand and the configurational entropies are calculated using MM, where the hydration free energies (HFEs) of the protein, ligand, and complex are approximately calculated using a continuous dielectric model (i.e., implicit solvent model).[9,10] In continuous dielectric models, the detailed molecular features of water molecules such as hydrogen bonding, the hydrophobic effect, and the rearrangement of the water molecules upon solute insertion are missing. Recently, instead of continuum models, a solution statistical-mechanics theory-based method, called the three-dimensional reference interaction model (3D-RISM), was combined with MM.[11−16] The 3D-RISM method can evaluate HFEs, preserving the molecular features.[34−37] The 3D-RISM method provides a thermally averaged distribution of solvent molecules around solute molecules, and HFE can be calculated through an integral equation using the distribution function. The MM/3D-RISM method can successfully predict the BFE of a group of ligands with different activities for the same protein. The 3D-RISM method has also been applied to the statistical analysis of hydration water based on a large number of 3D-RISM calculations,[38] development of an efficient method to calculate HFEs,[39] or binding-site searches by extending the theory to the distribution of atoms constituting the ligand.[40] The 3D-RISM method, combined with MM or extended protocols described above, improves the prediction of HFEs.[11−16,38−40]

Figure 1

Thermodynamic cycle of the binding free energy. The protein (gray) and ligand (green) are shown under the upper and lower conditions representing vacuum and solution, respectively. The binding free energy (ΔGbind) is obtained from the binding free energy in vacuum (ΔGbindvacuum) and the hydration free energies for the protein (Δμprotein), ligand (Δμligand), and complex (Δμcomplex). As a more rigorous approach, free energy calculations using all-atom MD simulations have been used to obtain the BFE between the bound and unbound states, such as the alchemical approach,[17] the potential of mean force (PMF)-based approach,[18] free energy perturbation (FEP)+,[19] generalized replica exchange with solute tempering (gREST)+FEP,[20] pmx,[21] and massively parallel computation of absolute BFE with well-equilibrated states (MP-CAFEE).[22] The free energy calculations have been applied to the relative or absolute BFE calculations, and they exhibit a good correlation with experimental BFEs. However, because of the high computational demands of MD simulations, free energy calculations for abundant protein–ligand pairs are practically difficult. In contrast to the physicochemical approach described above, machine learning approaches predict BFEs by learning the correlation between experimental BFE data and input features, such as structural metrics and scoring functions.[23−31] Recently, a large amount of experimental BFE data and crystal structures of the protein–ligand complex have been stored in databases (e.g., the PDBbind database[32,33]), and machine learning models using the data for BFE prediction, for example, KDEEP,[24] XGB-Score,[25] and SFCscoreRF[26], have been proposed. As the input features, the descriptors given by structural metrics, such as atomic coordinates, distances between atoms, and amino acid sequences, or energetic metrics, such as scoring functions, were employed,[23] in which the relationships between the descriptors and the experimental BFE data were learned and a regression model was built. The selection of the appropriate features describing the experimental BFE data is one of the key points for accuracy.[23] Therefore, as descriptors, incorporating well-defined physicochemical quantities related to BFE, as well as conventional structural metrics, has the potential for accurate predictions of abundant protein–ligand pairs. Thus, incorporating hydration effects as an input feature may improve the accuracy of BFE predictions. Herein, we propose a machine learning approach combined with the 3D-RISM method for BFE predictions. First, we calculated the BFEs for 3993 protein–ligand pairs in the PDBbind database using MM/3D-RISM (Figure ). However, the BFEs evaluated using the 3D-RISM method exhibited a poor correlation with the experimental BFEs. Although the MM/3D-RISM strategy predicted the BFEs of a few ligands with a similar scaffold for the same protein, the strategy failed with multiple types of protein–ligand pairs. We also attempted to apply the improved version of 3D-RISM to BFE calculations. According to Palmer et al., the HFEs for 185 neutral small molecules calculated using the 3D-RISM method deviated from the experimental HFEs, and the difference was proportional to the partial molar volume.[41] Therefore, they proposed a universal correction, in which the contribution of the partial molar volume was simply subtracted from the HFE obtained using the 3D-RISM method. However, in our calculation of the BFEs, their correction did not improve the correlation between the calculated and experimental BFEs. Therefore, we developed a machine learning approach using thermodynamic quantities obtained from the 3D-RISM method as principal descriptors, termed 3D-RISM-AI. By introducing a machine learning algorithm for regression, we aimed to predict BFEs for abundant protein–ligand pairs, which cannot be expressed simply by energy addition and subtraction operations in the thermodynamic cycle. Regression models were constructed from the structural features and thermodynamic quantities calculated using 3D-RISM for the 3993 protein–ligand pairs in the PDBbind database. Four machine learning algorithms were used for the regression, and their performance was verified. The best-performing learning model exhibited a high correlation between the predicted and experimental BFEs: Pearson’s correlation coefficient (R) of 0.80, Spearman’s rank correlation coefficient (ρ) of 0.77, and root-mean- square error (RMSE) of 1.91 kcal/mol. Although the performance of 3D-RISM-AI is comparable to that of other machine learning models (R = 0.753–0.806, ρ = 0.647–0.796, and RMSE = 1.80–2.22 kcal/mol), the advantage of 3D-RISM-AI is that the feature importance analysis allows us to determine thermodynamic quantities that are effective in the BFE predictions.

Theoretical Background

The BFE (ΔGbind) between a protein and a ligand is defined aswhere Gcomplex, Gprotein, and Gligand are the free energies of the complex, protein, and ligand, respectively. The free energy is decomposed into the following three terms: the internal energy of the solute molecule (E), configurational entropy of the solute molecule (S), and HFE (Δμ), as follows:where T is the temperature and X is one of the complexes, proteins, or ligands. According to the thermodynamic cycle of BFE (Figure ), ΔGbind can be obtained by the sum of the BFE in vacuum (ΔGbindvacuum) and the difference in the HFE (ΔΔμ) as follows:Here, we focus on one conformation of the solute complex (e.g., the crystal structure) and suppose that the conformations of both protein and ligand do not change from those in the complex, and the contributions of the entropy term can be neglected. Under this assumption, the terms on the right-hand side of eq areThe internal energies in eq can be calculated using a force field. The HFE (eq ) is obtained using the 3D-RISM method.[34−37] On the basis of the statistical solution theory, thermodynamic quantities are obtained through the pair distribution function, g(r), which can be obtained by solving the Ornstein–Zernike (OZ) integral equation as a function of the total correlation function, h(r) = g(r) – 1, together with incorporating closure approximations. In the case of molecular liquids, the OZ equation includes degrees of freedom for both configuration and rotation, and it is difficult to solve such high-dimensional equations for complicated molecules. In contrast, using the RISM approximation, the molecules are described as a set of atom sites corresponding to atom types, and the molecular OZ equation can be approximately rewritten as the equation of site–site distance, which is called 1D-RISM.[42,43] Because the 1D-RISM approach uses the spherically symmetric site–site correlation function, the three-dimensional distribution of the solvent molecules around the solute molecule cannot be described. Therefore, an extension of RISM to three dimensions, called 3D-RISM, was introduced. In 3D-RISM, the total correlation functions of the solute and solvent sites are obtained using the 3D-RISM equation:where hα(r) is the total correlation function of the solute site and the solvent site of the atom type for α, cξ(r′) is the direct correlation function of the solvent atom type for ξ, and χ (|-’ |) is the susceptibility function of the solvent sites in the bulk solvent given by eq . In eq , ωξα(r) is the intramolecular correlation function of the solvent molecule, ρ is the site-number density for atom type α in the bulk solvent, and hξα(r) is the total correlation function of the intramolecular sites calculated from the 1D-RISM. To solve eq , the Kovalenko–Hirata (KH) closure, which is an approximate relationship between the total and direct correlation functions, is introduced.[37] The KH closure is given bywhere β is 1/kBT, kB is the Boltzmann constant, and uα(r) is the solute–solvent site potential calculated from the force field. Using eqs –9, HFE is calculated as follows:where Θ is the Heaviside step function. In the supervised machine learning, the relationship between objective and explanatory variables is expressed as a regression model.[23] Using the training data set comprising the experimental BFE values for the n protein–ligand pairs as the objective variable and the thermodynamic quantities and the structural indices calculated from each pair as the explanatory valuables, a regression model is given bywhere ΔGbind_exp( is the experimental BFE for the nth protein–ligand pair, {x(, y(, ..., z(} represents the input feature vector comprising multiple descriptors calculated from the nth pair, θ is the hyperparameter, and the function f is determined by optimizing the objective function L asUsing the training regression model f, the BFE for the new (n + 1)th pair, which is not included in the training data set, can be predicted asThe settings of function f, the hyperparameters, and the objective function to be optimized depend on the regression algorithm. In this study, four types of regression algorithms were employed: ridge regression (RR),[44] support vector regression (SVR),[45] random forest regression (RFR),[46] and extreme gradient boosting regression (XGBR).[47] RR and SVR use a linear function with regularization term or the support vector and kernel trick, and RFR and XGBR use the decision tree for the regression process. The machine learning approach is expected to provide functions of BFEs that cannot be expressed by a simple linear operation of energies (eqs –5 and 10), as defined by the thermodynamic cycle, because the function given by the regression algorithm is represented by a nonlinear optimized function depending on the data set and explanatory variables used.

Methods

Data Sets

The experimental BFE values and the crystal structures for the protein–ligand complex in the PDBbind database (ver. 2018) were used.[32,33] In the PDBbind database, from 16151, the refined set (ver. 2018, the number of complexes: n = 4463) was defined according to the quality of the crystal structures.[48] In addition, the core set (ver. 2016, n = 285) was selected from the refined set, 57 clusters were determined using amino acid sequence similarities, and five representative proteins were selected from each cluster. Crystal structures in the refined set containing unnatural amino acids or atomic collisions that could not be treated in the computational preprocess were excluded from the calculations. Finally, we used 3933 complexes in the refined set, including 217 complexes as the core set. In the machine learning process, 217 complexes for the core set were used as test data and 3716 complexes (the refined set except for the core set) were used as training data. The complexes used as training data did not contain complexes in the test data.

3D-RISM Calculations

Before the 3D-RISM calculations, structural modeling was performed on 3993 crystal structures. All crystal water molecules were removed. With multiple protein complexes, a protein located within 10 Å of the ligand was used. For missing residues, the acetyl cap and N-methyl cap were added to the missing N- and C-termini, respectively. The protonation states of histidine were determined using the ProtAssign module implemented in Maestro.[49] Energy minimizations for the modeling structures were performed using the GB solvent model implemented in Amber18 and AmberTools19.[50] The force fields for proteins and ligands were ff14SB and the generalized amber force field (GAFF), respectively.[51,52] The partial charge of the ligand was determined using the AM1-BCC method.[53,54] Two-step minimizations were performed: a 250-step steepest descent method and a 250-step conjugate gradient method. In the energy minimizations, position constraints with a force constant of 5 kcal/mol were added to the heavy atoms. The cutoff distance of the long-range interactions was set to 12 Å. 3D-RISM calculations were performed using the rism3d.snglpnt command in AmberTools19.[50] The complex structure after the energy minimizations was used as the input structure. For closure, the KH closure was used.[37] For 1D-RISM, the dielectrically consistent reference interaction site model (DRISM) theory was used.[36] The water model was the SPC/E model,[55] the water density was set to 0.999 g/cm3, and the temperature was 298 K. The buffer distance between the solute molecule and the boundary in the calculation box was set to 20 Å, and the grid spacing on the three-dimensional grid was set to 0.5 × 0.5 × 0.5 Å3. The other parameters were set to default values.

Descriptors Used in 3D-RISM-AI

Thirteen thermodynamic quantities described below were calculated from complex structures. Using the Amber force field and 3D-RISM method, the internal energy (E, eq ) and HFE (Δμ, eqs and 10) were calculated for the complex, protein, and ligand, as well as BFE (eq ). The protein and ligand structures were extracted from the complex structures. From the 3D-RISM calculations, thermodynamic quantities related to the HFE, such as the partial molar volume (V), enthalpy term (ε), and entropy term (−TS), were calculated. In addition, the Δμ, V, ε, and TS terms were decomposed into polar and apolar contributions. Consequently, 13 thermodynamic quantities (E, Δμ, Δμapolar, Δμpolar, V, Vapolar, Vpolar, −TS, −TSapolar, −TSpolar, ε, εapolar, and εpolar) were obtained. The structural indices were also calculated. Because the BFEs are correlated to the solvent accessible surface area (SASA),[1] the SASA values for the complex, protein, and ligand were calculated using CppTraj in AmberTools19.[50] To incorporate the conformational entropy of the ligand, the number of rotatable bonds was calculated using Rdkit,[56] because the entropy is correlated to the rotatable bonds.[57] In addition, the difference in each quantity between the complex and the isolated protein or ligand was calculated as XBind = Xcomplex – Xprotein – Xligand. Hereinafter, the difference was denoted as “Bind” and the Bind descriptors were calculated for the 13 thermodynamic quantities and SASA. In summary, 58 descriptors were used as input features in the 3D-RISM-AI. The descriptors comprised four types (i.e., complex, protein, ligand, and Bind) of the 13 thermodynamic quantities and SASA, the number of rotatable bonds of the ligand, and BFE. The overall procedure and summary of the descriptors are shown in Figure . Each descriptor value was standardized to a mean of zero and a variance of one. All training data are available on GitHub (https://github.com/IkeguchiLab/3D-RISM-AI).

Figure 2

Overall procedure of 3D-RISM-AI. Quantities of the complex, protein, ligand, and binding are denoted as Com_A, Prt_A, Lig_A, and Bind_A, respectively. The descriptors denoted as A are the internal energy (E), HFE (Δμ), partial molar volume (V), enthalpy term of the HFE (ε), entropy term of the HFE (−TS), and solvent accessible surface area (SASA). The index i represents the total value and the polar and apolar decomposed values. The number of rotatable bonds of the ligand is Lig_Nrotatable. The BFE is Bind_Gtheory. The regression algorithms used are the extreme gradient boosting regression (XGBR), random forest regression (RFR), support vector regression (SVR), and ridge regression (RR).

Machine Learning

Four regression algorithms were used in supervised machine learning: RR, SVR, RFR, and XGBR. The models for RR, SVR, and RFR were built using scikit-learn,[58] and the model for XGBR was built using xgboost.[47] A Gaussian kernel is used in SVR. The hyperparameters and feature selection were optimized by 5-fold cross-validation using the training data set. A grid search is used in the hyperparameter search. The objective variables—the experimental BFE data—were calculated from Kd as ΔG = kBT ln Kd. The optimization was evaluated using the averaged root-mean-squared error (RMSE) from the 5-fold cross-validation. After the optimization, a regression model with all training data (n = 3716) was built using a combination of the hyperparameters and the set of descriptors that exhibited the lowest mean RMSE. Using the training regression model, BFEs were predicted for the test data set (n = 217). Compared to the experimental and predicted BFEs, the accuracy of the learning model was evaluated using Pearson’s correlation coefficient (R), Spearman’s rank correlation coefficient (ρ), and the RMSE. In the evaluation using RMSE, data with an absolute value of the difference between the experimental and predicted values within 2 kcal/mol were classified as a low absolute error (LAE), and the data with an error larger than 2 kcal/mol were classified as a high absolute error (HAE). In the decision tree-based algorithms (RFR and XGBR), the feature importance was evaluated using the information gain, a measure of improvement in the objective function when creating a branch in the decision tree, and the default measure was used in scikit-lean (RFR) and xgboost (XGBR).[47,58] All test data together with learning and prediction codes are available on GitHub (https://github.com/IkeguchiLab/3D-RISM-AI).

Results and Discussion

Binding Free Energies Calculated Using 3D-RISM and the Thermodynamic Cycle

First, we calculated the BFEs for 3993 protein–ligand pairs using the 3D-RISM method and the thermodynamic cycle (Figure ). The calculated BFEs exhibited a poor correlation with the experimental BFEs (Figure ), with a Pearson’s correlation coefficient (R) of 0.047, Spearman’s rank correlation coefficient (ρ) of 0.069, and RMSE of 49.25 kcal/mol. The calculated BFEs were one digit larger than the experimental BFEs. We also attempted an improved version of 3D-RISM in which the contribution of the partial molar volume was subtracted from the HFE obtained using the original 3D-RISM method.[41] However, the correlation between the calculated and experimental BFEs did not improve (Figure S1), with R = 0.049, ρ = 0.124, and RMSE = 187.68 kcal/mol.

Figure 3

Correlation between the experimental and calculated BFEs using the 3D-RISM method and thermodynamic cycle for 3993 protein–ligand pairs. The Pearson’s correlation coefficient (R), Spearman’s rank correlation coefficient (ρ), and root-mean-squared error (RMSE) are 0.047, 0.069, and 49.25 kcal/mol, respectively. Although the MM/3D-RISM strategy predicted the BFEs of a few ligands with a similar scaffold for the same protein, the strategy failed with multiple types of protein–ligand pairs. Therefore, we developed a machine learning method, termed 3D-RISM-AI, using thermodynamic quantities obtained by the 3D-RISM method as principal descriptors. Because the machine learning method can handle nonlinear relationships between an objective variable (BFEs) and explanatory variables (thermodynamic and structural features), it is possible to improve the BFE prediction using 3D-RISM.

Training Process

To build an optimal regression model, feature selection was performed using the training data set (n = 3716). The best combination of descriptor types was searched for in each regression algorithm (Figure ). In this feature selection, eight combinations of the descriptor types involved in Bind and the three remaining types (i.e., complex, protein, and ligand) were examined, and the mean RMSE was evaluated with 5-fold cross-validation (Table S1). For all algorithms, combining the descriptor types of the ligand, protein, and complex, besides the Bind, resulted in a lower mean RMSE, suggesting that the Bind-type descriptors alone were not effective for the BFE predictions.

Figure 4

Performance depends on features included in the regression algorithms. The feature combination patterns denoted as A to H are represented in the table.

Performance depends on features included in the regression algorithms. The feature combination patterns denoted as A to H are represented in the table. Consequently, we selected the best combination exhibiting the lowest mean RMSE and employed all descriptor types (denoted as H in Figure ) for XGBR, RFR, and SVR and the Bind, ligand, and complex descriptors (denoted as F in Figure ) for the RR. Before examining the performance of the test data set, as shown in the next section, using all training data sets, a learning model was constructed for each algorithm with the optimized hyperparameters and the best combination of the feature types.

Performance of the 3D-RISM-AI

Using the test data set (n = 217), BFEs were predicted using each learning model (Figure ). In contrast to the BFE prediction based solely on the 3D-RISM method (Figure ), the BFEs predicted from all regression models in 3D-RISM-AI were well correlated to the experimental BFEs. In a comparison of the four algorithms, XGBR exhibited the best performance for all indicators: R = 0.80, ρ = 0.77, RMSE = 1.91 kcal/mol, and 154 data in LAE (Figure a). Here, the data in LAE represents the number of predicted BFEs within a 2 kcal/mol deviation from the experimental BFE. RFR, which is a decision tree-based method similar to XGBR, also showed a better performance (R = 0.78, ρ = 0.75, RMSE = 2.02 kcal/mol, and 141 data in LAE) (Figure b). In contrast, the performance of RR (R = 0.65, ρ = 0.66, RMSE = 2.36 kcal/mol, and 132 data points in LAE) and SVR (R = 0.68, ρ = 0.64, RMSE = 2.21 kcal/mol, and 150 data points in LAE) were relatively poor in this examination (Figure c,d).

Figure 5

Comparison of the predicted BFEs using four learning models. Blue and orange dots represent the low absolute error (LAE) data and the high absolute error (HAE) data, respectively. LAE means that the difference between the predicted and experimental BFE is smaller than 2 kcal/mol. The difference for HAE is larger than 2 kcal/mol. Decision tree-type regressions, such as XGBR and RFR, worked well for predicting BFE. Because RR uses a linear function for the regression model, its capability to express the relationship between the experimental BFE and the input features may be poor. As for SVR, the mean RMSE in the training process for only Bind descriptors (denoted as A in Figure ) showed similar values of XGBR or RFR; however, the mean RMSE of SVR did not decrease when other descriptors were included, which was a different behavior from the decision tree-type algorithms. The performance of XGBR and RFR in 3D-RISM-AI was compared with that of other machine learning approaches. For example, the performance of the KDEEP based on the three-dimensional convolutional neural network was R = 0.82, ρ = 0.82, and RMSE = 1.73 kcal/mol.[24] As for the decision tree type learning model with structure and interaction-energy features, such as RF-Score, ID-Score, SFCscoreRF, XGB-Score, and ΔvinaXGB, their performances were R = 0.753–0.806, ρ = 0.647–0.796, and RMSE = 1.80–2.22 kcal/mol for the best model for each approach.[25−29] The performance of the 3D-RISM-AI was comparable to that of previous machine learning models using different descriptors and algorithms. However, because 3D-RISM-AI was based on well-defined thermodynamic quantities, the factor analysis of the decision tree method allowed us to analyze descriptors that were important for predicting the BFE, as shown in the next section.

Feature Importance in the BFE Prediction

To understand descriptors that were effective for BFE prediction, the feature importance was analyzed using information gain in the XGBR learning model. The contribution to reducing the loss function was evaluated by determining the information gain of each descriptor. According to the ratio of the information gain of each descriptor to the total gain, the difference in SASA upon ligand binding (Bind_SASA) and the descriptors related to the HFE components of ligands (Lig_–TStotal, Lig_Vtotal, Lig_μtotal, Lig_Vpol, and Lig_εapol) were important in the learning process (Figure ). Among the top seven descriptors with ratios above 0.02, besides Bind_SASA, only Bind_εapol was the Bind-type descriptor. In RFR, the most important descriptors were almost the same as the top important descriptors in XGBR. In particular, Bind_SASA and the HFE components of the ligands were assigned as important descriptors (Figure S2). These results suggest that, except for Bind_SASA, the descriptors related to the HFE components of the ligands calculated from 3D-RISM notably contributed to the BFE prediction.

Figure 6

Rate of information gain in the training process of XGBR. The top descriptor is the difference in SASA upon ligand binding (Bind_SASA). The second to fifth descriptors are the HFE components of ligands: the entropy term (Lig_–TStotal), partial molar volume (Lig_Vtotal), HFE (Lig_μtotal), and polar component of the partial molar volume (Lig_Vpol). The sixth and seventh descriptors are the apolar component of the enthalpy (Lig_εapol) and the apolar component of the enthalpy difference in HFE upon ligand binding (Bind_εapol), respectively. To further understand the important physicochemical descriptors for prediction, the correlation among the descriptors was analyzed. The top seven descriptors of XGBR described above were classified into two groups exhibiting high correlations among the descriptors, which were hydrophobic and hydrophilic features (Figure ). Although the descriptors related to hydrophobicity (Bind_SASA, Lig_–TStotal, and Lig_Vtotal) were the top three (Figure ), the remaining important descriptors were hydrophilic features. Because both hydrophobic and hydrophilic features are important for BFE prediction, the physicochemical features of hydrophobicity and hydrophilicity should be learned in a well-balanced manner.

Figure 7

Correlation matrix of the top seven descriptors. The polar and apolar components, that is, hydrophilic and hydrophobic features, are denoted by Pol and Apol, respectively.

Correlation matrix of the top seven descriptors. The polar and apolar components, that is, hydrophilic and hydrophobic features, are denoted by Pol and Apol, respectively. The correlations between the descriptors and experimental BFEs were compared with the machine learning results (Figure ). The largest correlation coefficient of the individual descriptors with the experimental BFEs was less than 0.5. Considering that the correlation coefficients of the machine learning results were 0.6–0.8 (Figure ), the machine learning algorithms could develop regression models showing higher correlations with the experimental BFEs. The top descriptors correlated to the experimental BFEs were similar to those that were important in the learning process of the decision tree (Figure ). The top descriptors were Bind_SASA, Bind_εapol, and the ligand HFE-related descriptors. The top seven descriptors were hydrophobic features that correlated well with each other in the correlation matrix (Figure S3). Interestingly, the correlations between the hydrophilic features and the experimental BFEs were relatively small (no hydrophilic descriptor in the top seven), although their feature importance was not small (three in the top seven). These results indicate that in the machine learning algorithms the weights of the hydrophobic and hydrophilic descriptors were learned in a well-balanced manner.

Figure 8

Correlations between the descriptors and experimental BFEs. The correlation is evaluated using the absolute value of Pearson’s correlation coefficient (R).

Trends in Predicted BFEs by XGBR and Data Bias

To investigate data bias, the distributions of the predicted BFEs using XGBR and the experimental BFEs were compared (Figure ). In a comparison of the distributions across all test data (Figure a), the predicted BFEs were biased toward the mean of BFEs about −9 kcal/mol. In the LAE data, the distributions were similar to each other (Figure b). By contrast, in the HAE data, the experimental BFEs were broadly distributed and the predicted BFEs were biased toward the mean (Figure c). This inaccuracy could be attributed to the experimental BFE data used in the training process were not uniformly distributed and biased toward the mean (Figure S4). Therefore, the amount of data largely deviating from the mean was small, and the construction of the regression model failed to predict the data deviating from the mean.

Figure 9

Histograms of BFEs. The distributions are (a) all test data sets, (b) data for the low absolute error (LAE) with |ΔGXGBR – ΔGexp)| ≤ 2 kcal/mol, and (c) data for the high absolute error (HAE) with |ΔGXGBR – ΔGexp)| > 2 kcal/mol. The solid and dashed lines represent the predicted BFEs by XGBR and the experimental BFEs, respectively. Finally, we should discuss future approaches to address the limitations in the above analyses and to improve accuracy by incorporating new features not considered in the 3D-RISM-AI. First, because the training data set in the PDBbind database was biased around the mean, more training data, including uniformly distributed BFEs over a wide energy range, are necessary. The BFE data for multiple types of ligands frequently used in the structure–activity relationship analyses would be effective. Second, the performance of 3D-RISM-AI depended on the performance of the 3D-RISM method and the force field used. Because the important features were the ligand HFEs and their components are given by the 3D-RISM method, solving the challenges inherent to the theoretical framework and numerical calculations, such as the closure and force field, would directly improve 3D-RISM-AI. In addition, the interactions of ligand molecules include weak interactions, such as cation−π, CH−π, and interactions related to halogen atoms, such as halogen bonding and sigma holes, which cannot be represented by conventional force fields but captured by quantum chemical (QM) calculations. Therefore, it would be a good strategy to improve the accuracy of the ligand force field or to add the interactions given by QM calculations as new descriptors. Third, the 3D-RISM-AI does not incorporate dynamic features. Both structural fluctuations in the solution and the structural diversity of the unbound states leading to configurational entropy changes could be considered combined with MD simulations. Fourth, 3D-RISM-AI is complementary to other machine learning approaches, such as KDEEP, which employs structural features as descriptors. The 3D-RISM-AI uses thermodynamic quantities based on HFE as descriptors and can incorporate structural information with different properties in a complementary manner. Here, we discuss several issues to be resolved in the future before applying 3D-RISM-AI to in silico pharmaceutical processes, such as high-throughput virtual screening (HTVS) and structure–activity relationship analyses for abundant protein–ligand pairs. First, in these processes, complex models are obtained using docking simulations; however, the models usually deviate from experimental ones. As only experimental structures were used in the current 3D-RISM-AI, it is not clear how sensitive the machine learning model is to the quality of the structures of the protein–ligand complexes. Next, the computational time required to calculate the descriptors is an issue in HTVS. In this study, it required approximately 6 h using one central processing unit (CPU) core to conduct the 3D-RISM calculation for a protein or a complex, depending on the size of the protein: approximately 2 and 42 h were the minimum and maximum, respectively, and 6 h was the median. To perform the 3D-RISM calculations for abundant pairs in HTVS, which typically involves at least 100000s of molecules and often millions, a large number of CPU cores such as supercomputers are required. Alternatively, a graphics processing unit (GPU) version of the 3D-RISM method could be used for speeding up the 3D-RISM calculation.[59] Furthermore, the recently developed deep learning model for the 3D-RISM results,[60] which is capable of quickly predicting water distributions around proteins, could potentially work with 3D-RISM-AI. Finally, it is required to examine whether 3D-RISM-AI is a better approach to find hit compounds because it is not clear whether the machine learning model is significantly more accurate than the docking scores. Solving these issues and applying 3D-RISM-AI to pharmaceutical processes are future challenges.

Conclusions

Using the HFE based on the 3D-RISM method as the principal input feature, we proposed a machine learning approach to predict the BFE for abundant protein–ligand pairs, termed 3D-RISM-AI. Whereas the BFEs solely evaluated using the 3D-RISM method through the thermodynamic cycle were not correlated to the experimental data, the BFEs predicted using 3D-RISM-AI showed a good correlation with the experimental BFEs: R = 0.80, ρ = 0.77, and RMSE = 1.91 kcal/mol. Although the performance was comparable to that of other machine learning approaches using other input features, the important factor analysis allowed us to understand the important features for predicting BFEs, such as the difference in the SASA between the unbound and bound structures and the ligand HFE-related descriptors. The physicochemical features described by the most important descriptors were hydrophobic, but the rest were hydrophilic, indicating that the balance between them is important for the predictions. Although the importance of both hydrophobic and hydrophilic interactions is well-known, the fine balance between them could be automatically detected using the machine learning approach of the physicochemical-based 3D-RISM-AI. In addition, the machine learning framework of 3D-RISM-AI can incorporate both structural diversity sampled from MD simulations and other structural or energetic input features.

42 in total

1. The PDBbind database: methodologies and updates.

Authors: Renxiao Wang; Xueliang Fang; Yipin Lu; Chao-Yie Yang; Shaomeng Wang
Journal: J Med Chem Date: 2005-06-16 Impact factor: 7.446

2. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes.

Authors: Richard A Friesner; Robert B Murphy; Matthew P Repasky; Leah L Frye; Jeremy R Greenwood; Thomas A Halgren; Paul C Sanschagrin; Daniel T Mainz
Journal: J Med Chem Date: 2006-10-19 Impact factor: 7.446

3. Massively parallel computation of absolute binding free energy with well-equilibrated states.

Authors: Hideaki Fujitani; Yoshiaki Tanida; Azuma Matsuura
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2009-02-26

4. Fast and accurate predictions of binding free energies using MM-PBSA and MM-GBSA.

Authors: Giulio Rastelli; Alberto Del Rio; Gianluca Degliesposti; Miriam Sgobba
Journal: J Comput Chem Date: 2010-03 Impact factor: 3.376

5. Computationally predicting binding affinity in protein-ligand complexes: free energy-based simulations and machine learning-based scoring functions.

Authors: Debby D Wang; Mengxu Zhu; Hong Yan
Journal: Brief Bioinform Date: 2021-05-20 Impact factor: 11.622