Sahar Cain1, Ali Risheh2, Negin Forouzesh1. 1. Department of Computer Science, California State University, Los Angeles, CA 90032, USA. 2. Department of Computer Engineering, Amirkabir University of Technology, Tehran 15914, Iran.
Abstract
Calculation of protein-ligand binding affinity is a cornerstone of drug discovery. Classic implicit solvent models, which have been widely used to accomplish this task, lack accuracy compared to experimental references. Emerging data-driven models, on the other hand, are often accurate yet not fully interpretable and also likely to be overfitted. In this research, we explore the application of Theory-Guided Data Science in studying protein-ligand binding. A hybrid model is introduced by integrating Graph Convolutional Network (data-driven model) with the GBNSR6 implicit solvent (physics-based model). The proposed physics-data model is tested on a dataset of 368 complexes from the PDBbind refined set and 72 host-guest systems. Results demonstrate that the proposed Physics-Guided Neural Network can successfully improve the "accuracy" of the pure data-driven model. In addition, the "interpretability" and "transferability" of our model have boosted compared to the purely data-driven model. Further analyses include evaluating model robustness and understanding relationships between the physical features.
Calculation of protein-ligand binding affinity is a cornerstone of drug discovery. Classic implicit solvent models, which have been widely used to accomplish this task, lack accuracy compared to experimental references. Emerging data-driven models, on the other hand, are often accurate yet not fully interpretable and also likely to be overfitted. In this research, we explore the application of Theory-Guided Data Science in studying protein-ligand binding. A hybrid model is introduced by integrating Graph Convolutional Network (data-driven model) with the GBNSR6 implicit solvent (physics-based model). The proposed physics-data model is tested on a dataset of 368 complexes from the PDBbind refined set and 72 host-guest systems. Results demonstrate that the proposed Physics-Guided Neural Network can successfully improve the "accuracy" of the pure data-driven model. In addition, the "interpretability" and "transferability" of our model have boosted compared to the purely data-driven model. Further analyses include evaluating model robustness and understanding relationships between the physical features.
Entities:
Keywords:
binding free energy; graph convolutional network; implicit solvent model
Proteins perform biological functions through interaction with other biomolecules, including ligands that are (small) molecules capable of binding to a target protein, often with high affinity and specificity [1]. Protein–ligand interaction is central to several biological processes, e.g., cellular signal transduction, viral invasion, DNA replication, and cellular energy production [2]. It also has vast applications in the early stages of drug discovery [3]. One key component of protein–ligand interactions is the binding free energy change, , occurring between the protein and the ligand upon the ligand’s attachment. This physiochemical feature heavily dictates how strongly a protein and ligand interact and is particularly useful to understand for drug design [4]. While wet-lab experiments accurately estimate , they are significantly slow, costly, and laborious. On the other hand, computational simulations enable significantly faster estimation of and shed light on the binding mechanism of various structures [5].A wide range of computational methods trades off between physical rigor and computational time for calculations. On one side of the spectrum, there is molecular docking [6,7] that provides on-the-fly determination of the best poses of a ligand based on different measurements, including . Such methods have vast usage in high-throughput virtual screening and lead optimization, where a quick ranking of candidate drugs is central. However, molecular docking fails to calculate accurately due to many rough approximations, e.g., receptor flexibility, strain energies, and various entropies. On the other side of the spectrum, there exist alchemical free energy methods which evaluate ratios of partition functions and, therefore, calculate entropic and enthalpic components of accurately [8,9,10]. The drawback, though, is the high computational cost needed to run such simulations. In the middle of the spectrum, there exist molecular mechanics Poisson–Boltzmann surface area (MM/PBSA) and molecular mechanics generalized Born surface area (MM/GBSA) [11,12,13,14] methods that balance the aforementioned accuracy vs. computational cost. MM/PB(GB)SA have been employed in drug design challenges [15], studying large benchmarks of protein–ligand complexes [16], and examining virus–receptor interactions [17,18]. A key factor determining the accuracy of these methods is the underlying implicit solvent.Implicit solvent modeling is one of the most popular computational methods that consider solvent (usually water) as one continuum component [19]. Within this framework, the calculation of could be conducted more efficiently compared to explicit solvents [20,21]. Practical implicit solvent models have broad applications in computer-aided drug design, particularly in protein folding and molecular dynamics simulation [11]. Poisson–Boltzmann (PB) [22,23] and generalized Born (GB) [24,25,26] models are the two main classes of implicit solvent models that have been used widely in static and dynamic simulations of protein–ligand interactions [27,28]. Despite many years of research, there are serious concerns regarding the lack of accuracy in implicit solvent modeling [29] that is to some extent unavoidable due to the underlying physical approximations, such as the elimination of water response beyond the dipole and polar to non-polar coupling [30]. Recently, cutting-edge technology has provided opportunities to compensate for this issue.With the ever-increasing determination of protein–ligand structures and the emergence of powerful hardware, machine learning (ML) techniques have been extensively utilized to identify structural patterns and predict binding profiles [31]. While these methods demonstrate promising results, two major concerns question their credibility: first and foremost, the sole dependence of ML on data raises criticism of overfitting and the lack of transferability, knowing that the available (experimental) data do not represent the entire system. This concern becomes more critical when the training and test sets are small; which is the case in many real-world applications, including drug discovery, where generating clean and accurately labeled data is costly and time-consuming. Secondly, the “black-box” characteristic of ML leads to uninterpretable results that, in many cases, do not agree with well-known physical models. Apparently, such models cannot be accepted in the scientific community even if they work perfectly accurately on the given data. What can bridge the gap between uninterpretable and potentially over-fitted data-driven models and relatively inaccurate theoretical models is a combination of the two known as Theory-Guided Data Science [32,33]. This new paradigm has shown accurate, interpretable, and transferrable results in modeling diverse scientific realms, including quantum chemistry [34], genetics [35], fluid mechanics [36], and material sciences [37,38,39].The main objective of this work is to propose a transferable and interpretable model that can estimate more accurately compared to the reference physics-based model. Our approach is to incorporate experimental data into an implicit solvent so that, with adherence to the physical model, new features extracted from the input structures could improve the accuracy of calculations. The proposed hybrid model consists of two main components: a GB model called GBNSR6 [40] and a state-of-the-art neural network. Compared to other flavors of GB, GBNSR6 [40] calculates of protein–ligand complexes efficiently and accurately when the reference is explicit solvent models [41]. However, the ∼5 kcal/mol error compared to the standard TIP3P water model necessitates improved accuracy [42,43] (chemical accuracy: 1 kcal/mol). With that, the second component is designed based on Graph Convolutional Network, which has demonstrated remarkable performance in extracting spatial and structural features in different domains, including computational structural biology [44]. The proposed Physics-Guided Neural Network was first introduced in [45]. Here, we analyze and test it on a PDBbind (refined) set of a comprehensive collection of protein–ligand complexes and small host–guest systems. Results of the proposed hybrid model have been compared with reference experiments as well as pure data-driven and physics-based models.
2. Background
2.1. Physics-Based Model: GBNSR6
Accurate calculation of free energy is highly dependent on precise measurements of electrostatic interactions in an aqueous environment. To accomplish this, biomolecules are often placed in a solvent that is usually water accompanied by counter-ions to ensure a neutral environment. This model is called the explicit solvent model, where the water molecules are explicitly measured in calculating the electrostatic energy of the system. Although this method proved to be highly accurate, it suffers from substantial computational costs, especially for large structures. One alternative solution is implicit solvent modeling, where explicit water molecules are replaced with an infinite continuum medium that has equivalent dielectric properties as water [19].Binding free energy () of a molecular system is calculated as
where is the enthalpy change, is the entropy change of the system, and T is the temperature in Kelvin. Enthalpy is a property of a thermodynamic system and is defined as the sum of the system’s internal energy and the product of its pressure and volume. In the implicit solvent framework, is decomposed into the gas-phase molecular mechanics energy, , and solvation free energy, . consists of the changes in internal energy, van der Waals energy, and the change in electrostatic energy. consists of polar and nonpolar components. (the largest component of the total energy for biomolecules) is calculated with either a PB or GB model. GB is chosen in this research since it has shown to be an efficient approximation of PB with quite accurate results [40]. A GB model with ALPB correction [46] has the following formula:
where and are the dielectric constants of the solute and the solvent, respectively, , , and A is the electrostatic size of the molecule, which is essentially the overall size of the structure which can be computed analytically [46]. We employ the most widely used functional form [47] of : , where is the distance between atomic charges and , and , are the so-called effective Born radii of atoms i and j, which represent each atom’s degree of burial within the solute. In GBNSR6, effective born radii are calculated with “R6” equation [40]. The grid-based implementation of GBNSR6 is freely available in the AMBER suite of biomolecular simulation programs [48].Entropy is a measure of the molecular disorder or randomness of a system. It is associated with conformational energy loss when a free-state ligand binds to the corresponding unbound-state receptor. Standard methods for estimating entropic component are normal mode analysis (NMA) [49] and quasi-harmonic approximation [50]. Due to its computational complexity, though, the entropy calculation in many studies of free energy is ignored. In this study, entropy is estimated as a new feature by subtracting the enthalpy calculated by the physics-based component from the experimental reference values. See Materials and Methods for more information.
2.2. Data-Driven Model: GCN
Graph convolutional network (GCN) is a specialized convolutional neural network that accepts graphs as input and applies several layers of filters to learn patterns [51,52]. Particularly, GCN takes as input a feature description for every node i summarized in an feature matrix , where N is the number of nodes and D is the input features. In addition, a representative description of the graph structure in matrix form, which is typically in the form of an adjacency matrix . The output is a node-level matrix that is an feature matrix, where F is the number of output features per node. Graph-level output, z, can be modeled by introducing some form of pooling operation. With that, every neural network layer is written as a nonlinear propagation function with and (or z for graph-level outputs), L being the number of layers. The specific models then differ only in how is chosen and parameterized.In this work, two GCNs available in Deepchem [31,53] are employed: GraphConv model and AtomicConv model. GraphConv model implements the graph convolutional model presented in [54]. These graph convolutions start with a per-atom set of descriptors for each atom in a molecule, then combine and recombine the descriptors over convolutional layers. AtomicConv model functions as a variant of graph convolution [31]. The difference is that the “graph” in this model is the nearest neighbors graph in 3D space. The AtomicConv model leverages these connections in 3D space to train models that learn to predict energetic states starting from the spatial geometry of the model. These two models have been utilized as the reference data-driven models in comparison with our hybrid model. Due to its flexibility, the GraphConv model has been chosen as the GCN component in the proposed PGNN model.
3. Materials and Methods
3.1. Featurization and Parameterization
Model training is performed with a combination of structure-based and physics-based features. Structure-based features are directly extracted from PDB files. ConvMolFeaturizer [31] from Deepchem is employed to represent atom features in the form of a graph which is an implementation of Duvenaud graph convolutions [54]. Every protein–ligand complex is featurized based on each atom’s neighborhood and is transformed to a 2D matrix, , where N is the number of atoms in a complex and D is the number of features for each atom. ConvMolFeaturizer extracts 75 features for each atom that is a binary representation of atom type, atom hybridization type, implicit valence, aromaticity, atom degree, number of hydrogens, number of radical electrons, and formal charges of each atom.Physics-based features, P, of each molecule are calculated using GBNSR6 available in Ambertools 2020 [55]. In short, the GBNSR6 model is executed on the complex, protein, and ligand structures. Electrostatic energy (EELEC), electrostatic energy for 1–4 bonded atoms (1–4-eel), non-polar solvation energy (ESURF), polar solvation energy (EGB), and Van der Waals (VDWAALS) energy are extracted accordingly. (PBSA [56,57] model is employed to calculate Van der Waals energy since this calculation is not implemented in GBNSR6). Total enthalpy is subtracted from the experimental values to account for entropy estimation. This number is added to the model as the last physics-based feature. See Table 1 for more details.
Table 1
Physics-based features calculated for complex, protein and ligand structures using GBNSR6.
Parameter
Description
Count
1–4-eel
1–4 Electrostatic energy
3
VDWAALS
Van der Waals energy
3
EELEC
Electrostatic energy
3
ESURF
Non-polar solvation energy
3
EGB
Polar solvation energy
3
Entropy
Entropy
1
Total
16
3.2. Hybrid Model: PGNN
The proposed Physics-Guided Neural Networks (PGNN) is a GCN with integrated physics-based features. The architecture of this model is shown in Figure 1. The model employs a GCN [31] to capture spatial features of the structures in the 3D space. The PGNN model consists of a couple of GraphConv layers of fixed channel size (training epoch: 100, learning rate: 0.001). The activation function used for GraphConv layers is the function to provide output in the continuous range of . This layer combines the features of each atom with ten nearest neighbor atoms and creates a new feature vector for each atom. The output of GraphConv is a matrix of order . A batch normalization layer is applied to improve the learning process, followed by a single-dimensional max pooling to minimize the feature space. Finally, the GraphGather layer is used to combine the data from all different nodes. The output of this layer is the model variable M that later concatenates with vector P containing physics-based features. The new vector, , is fed to the final dense layer to estimate the . The activation function used in the last two dense layers is . The loss function is designed to minimize the empirical error or, in other words, to minimize the RMSE in calculating compared to the experiments.
Figure 1
Workflow of the proposed PGNN model. The sample input structure, the complex of the BIR domain of MLIAP and GDC0152, is selected from the PDBbind refined set [58]. After featurization, the structure-based features along with the adjacency matrix A enter the network. The results of the first network iteration are the model variable M. The output of the physics-based model is vector P, which is concatenated with M. The resulting vector goes through several iterations before entering the final dense layer.
The following thermodynamic equation is integrated into the learning process of the proposed PGNN model by initializing the weights of to 1, −1, −1 for the complex, protein and ligand, and −1 for entropy:
3.3. Datasets
PDBbind Refined Set v.2015. The primary dataset used in this work is acquired from PDBbind refined set v.2015 [58]. This dataset is a comprehensive collection of experimentally measured values for 3706 protein–ligand complexes. However, due to the limited scalability of the DeepChem featurizer, a subset of 368 complexes of various sizes were picked for training and testing the proposed model. As illustrated in Figure 2, a subset has been selected with reference to the values of the original PDBbind dataset to ensure a comprehensive sampling. The size of the protein–ligand complexes in the sample dataset varies between 1200 to 43,000 atoms, which replicates the wide range of structure sizes in the original dataset. Extensive structure optimization and force field assignments are carried out using the protocol explained in [59]. Water molecules are eliminated from the structural (PQR) files using the CPPTRAJ library in Ambertools 2020. ANTECHAMBER program is used to generate custom residue topology files (prep files) using the general Amber force field (GAFF) [60] for ligands and FF14SB [61,62] for protein structures. The TLEAP program is employed to create the coordinate and topology files, and these files are then used to run GBNSR6. The dataset was split into training and test sets with a ratio of 3:1. Model training and testing were performed on San Diego Supercomputing Center (SDSC) clusters with 20 CPU cores and 242 GB of memory which took approximately 40 min.
Figure 2
Distribution of values of PDBbind refined set vs. our sample dataset.
Host–Guest Systems. Another dataset used in this work is a collection of small structures acquired from the host–guest benchmark [4] and SAMPL challenges [63]. This dataset consist of seven distinct hosts named Octa Acid (OA), tetra-endomethyl octa acid (TEMOA), Alpha-Cyclodextrin (aCD) and Beta-Cyclodextin (bCD) which are from the host–guest benchmark system [4], OctaAcidH (OAH) [64] and OctaAcidMe (OAMe) [65] from SAMPL5 challenge and Cucurbit[8]uril (CB8) from the SAMPL6 challenge [66]. The hosts are small molecules containing less than 100 atoms. These hosts bind their guests the same way proteins bind their ligands, so they can be considered as simple test cases for computational models of noncovalent binding. In addition, their small size and, in many cases, their rigidity makes it feasible to efficiently estimate values. Several guests are provided for each host, comprising 72 rigid complexes in total. The raw PDB files and pre-processed topology and coordinate files of the host–guest benchmark are freely available at https://github.com/mobleylab/benchmarksets (accessed on 7 April 2017). SAMPL5 and SAMPL6 structure files are taken from SAMPL Challenges GitHub repository. The enthalpic and entropic components of for these molecules are experimentally measured. Complex topology and coordinate files were further processed to strip water molecules and counterions and then split into host and guest using CPPTRAJ library on Ambertools 2020. The topology and coordinate files of hosts, guests, and complexes were then used to run GBNSR6. The datasets were split into training and test sets with a ratio of 3:1.
4. Results and Discussion
4.1. Accuracy of the PGNN Model
This section examines the accuracy of the proposed PGNN model compared to the two data-driven models: GraphConv model and AtomicConv model. It should be noted that calculating total through the selected implicit solvent is not possible since GBSNR6 merely accounts for the enthalpic component. Since enthalpic and entropic values have not been measured separately in the PDBbind refined set, the accuracy of the PGNN model is just compared with the data-driven models. Both of these models have been trained for 100 epochs with 4-fold cross-validation, which leads to 276 data points for training and 92 data points for validation and testing. The mean squared error (MSE) is defined as the regression loss function for the learning process of the two models. RMSE is used to measure the accuracy of each model; see Table 2.
Table 2
Error in calculating values of the GraphConv model (data-driven), the AtomicConv model (data-driven), and the PGNN model (hybrid) on the PDBbind dataset. RMSE values are in kcal/mol.
GraphConv
AtomicConv
PGNN
Training set
2.93 ± 0.08
16.37 ± 8.3
3.88 ± 0.13
Test set
6.90 ± 0.86
5.23 ± 0.40
4.08 ± 0.46
The following conclusions can be drawn from Table 2: First, the GraphConv model outperforms the PGNN model on the training set but fails to do so on the test set. This observation implies overfitting of GraphConv on the training set. It also specifies that the physics-based features in PGNN played an essential role in guiding the neural network on unseen data. The AtomicConv model performs significantly better on the test set than on the training set. The inconsistency in estimating values on the two sets, though, raises concern about the transferability of this model. PGNN, on the other hand, shows more accurate results consistently on the training and test sets.The average loss per epoch of the 4-fold cross-validation is illustrated in Figure 3: Figure 3a shows that the GraphConv model converges more quickly than the PGNN model, with a significant gap between the training and validation results. This observation, again, implies the overfitting of this data-driven model during the training process. Despite applying the Early Stopping rule in a different scenario (not demonstrated here), this gap never closed. Figure 3b demonstrates the poor performance of the AtomicConv model on the training set and a large gap between the training loss and the validation loss, which does not close after 100 epochs. In Figure 3c, the PGNN model shows more fluctuations throughout the learning process, indicating that the model can explore the solution space more effectively and learn the relation between the features more accurately. The model is more successful than the data-driven models in closing the gap between training and validation loss and minimizing the error. This observation demonstrates that the PGNN model is less likely to be overfitted on the training set and can predict the more accurately on unseen datasets.
Figure 3
Validation loss (val loss) and training loss (train loss) per epoch for GraphConv, AtomicConv (data-driven) and PGNN (hybrid) models on the PDBbind dataset. (a) GraphConv model; (b) AtomicConv model; (c) PGNN model.
4.2. Transferability of the PGNN Model
To evaluate the transferability of the proposed model, in addition to the PDBbind dataset, PGNN was trained and tested on the host–guest systems. According to Table 3, compared to GraphConv, PGNN is slightly less accurate on the training set but more accurate on the test set. Aligned with results in Table 2, this observation implies overfitting of the GraphConv model. GBNSR6 was utilized for calculating enthalpic values of binding free energy. Entropic values were borrowed from the experiment to account for total values. Since host–guest systems are small and rigid, this strategy does not significantly affect the accuracy of calculations. It can be seen that the physics-based model has poor performance in comparison to the other two models. This inaccuracy is the main motivation for proposing the PGNN model, which estimates values of the complexes about 6 kcal/mol more accurately. It should be noted that the AtomicConv model was not tested on this dataset since the input PDBbind dataset is hard coded in this model and could not be replaced with the host–guest systems.
Table 3
Error in calculating values of the GraphConv model (data-driven), the PGNN model (hybrid), and the GBRNSR6 model (physics-based) on the host–guest dataset. RMSE values are in kcal/mol.
GraphConv
PGNN
GBNSR6
Training set
1.61 ± 0.10
1.71 ± 0.11
8.22 ± 0.11
Test set
2.43 ± 1.27
2.05 ± 0.27
8.35 ± 0.34
The average loss per epoch of the 4-fold cross-validation is illustrated in Figure 4. Similar to Figure 3, it is observed that GraphConv model (Figure 4a) converges more quickly than the PGNN model (Figure 4b) with a significantly larger gap between the training and validation results, which implies overfitting.
Figure 4
Validation loss (val loss) and training loss (train loss) per epoch for the GraphConv model (data-driven) and the PGNN model (hybrid) on the host–guest dataset. (a) GraphConv model; (b) PGNN model.
4.3. Interpretability of the PGNN Model
Calculating the binding free energy of biomolecules is a physics-based problem by nature. Therefore, it is crucial to interpret the relevant ML models by analyzing the model parameters compared to the physical laws. In the GraphConv model, the sum of the binary (atomic) features is directly related to values. However, the exact relation between these features and binding free energy is not straightforward due to the complicated architecture of the neural network. The proposed PGNN model utilizes physics-based features to learn the relationship between structural features. These features are used to analyze the interpretability of the proposed model.According to Equation (3), physics-based features of the complex have positive coefficients, physics-based features of the protein and the ligand have negative coefficients, and entropy has a negative coefficient. Accordingly, we initialized the weights of complex parameters to +1, protein, ligand, and entropy parameters to −1, and the model variable to 0.5. It is shown in Table 4 that the model has retained the thermodynamic equation (i.e., Equation (3)) since the coefficients of the physics-based features are almost the same as before and after training. In addition, the coefficient of the model variable, which has been derived from binary features, has decreased from 0.5 to 0.45. In other words, the presented PGNN model has decreased the weight of the model variable in order to converge to a smaller gap between the predicted and actual values of .
Table 4
Coefficients of the physics-based features and the model variable in the last dense layer before and after training.
Physics-Based Parameters
Before Training
After Training
VDWAALS
−1,−1,1
−1.01,−0.99,0.99
EELEC
−1,−1,1
−0.99,−1.00,1.00
ESURF
−1,−1,1
−0.99,−0.99,1.00
EGB
−1,−1,1
−0.99,−1.00,1.00
1–4-eel
−1,−1,1
−0.99,−1.00,1.00
Entropy
−1
−0.99
Model variable
0.5
0.45
4.4. Robustness Analysis
Noise injection [67] is a standard practice in machine learning to study the robustness of a model. This technique can also be used in the training phase of a neural network when adding noise prevents memorizing the training samples and leads to a more robust model with lower generalization error. In this study, we tested the robustness of the PGNN model when it was trained and tested on the original dataset and the one with noise. Entropy was selected for robustness analysis since GBNSR6 does not calculate it directly. Instead, entropy values are given to the model as a difference between the experimental values and the enthalpic component calculated by GBSNR6. Gaussian noise, , was added to the entropy feature such that represents the mean of entropic values over the entire dataset, and shows the standard deviation. According to Figure 5, it is observed that the accuracy of the model changes only 0.1 kcal/mol in the (noisy) training set and 0.5 kcal/mol in the (noisy) test set. These small changes in the RMSE of values confirm that the PGNN model is robust against small noises in entropy. Therefore, it is concluded that accurate entropy values, i.e., those calculated with NMA or Quasi-harmonic approximation, will not significantly affect the final results.
Figure 5
Error in calculating values on data with added noise compared to the original data.
4.5. Feature Analysis
The correlation heatmap between the physics-based features of protein–ligand complexes in the PDBbind dataset is shown in Figure 6. This figure provides insight into the relationship between the selected features, mainly whether they could be grouped or eliminated to avoid redundancy. It is demonstrated that van der Waals energy (vdwaals) has no linear relationship with other features. Therefore, keeping this feature as an independent indicator of short-range non-covalent energy is essential. The other features, though, show strong correlations: polar (egb) and non-polar (esurf) free energy components are negatively correlated. For future studies, it is worth merging the two and considering the total solvation free energy as a new feature. In addition, electrostatic energy for 1–4 bonded atoms (1–4-eel) strongly correlates with all the features, except for vdwaals. It is recommended to eliminate this feature and focus on other atomic interactions.
Figure 6
Correlation heatmap between the physics–based features of PDBbind structures.
5. Conclusions
In this study, we have proposed a hybrid data-physics model to estimate the binding free energy of protein–ligand complexes with a wide range of structure sizes. This novel model inherits “interpretability” and “transferability” from the underlying physics-based model and “accuracy” from the data-driven model. As the first step, we have trained and tested the model on 368 protein–ligand structures selected from the PDBbind refined set and 72 host–guest systems. Results show that the combination of structural features extracted by the GraphConv model and physics-based features calculated by GBSNR6 enables the model to predict the binding free energy more accurately than the two data-driven models, i.e., the GraphConv model and the AtomicConv model. Compared to these models, the PGNN model consistently performs better on both training and test sets, reflecting its transferability between different datasets. Furthermore, analyzing the coefficients of physics-based features before and after training makes it possible to compare the results with physical laws and interpret them accordingly.The robustness of the PGNN model has been evaluated by adding a noise function to a selected feature. Similar results before and after noise injection ascertain the robustness of the model. A closer look at the physical features and their relations demonstrates that van der Waals energy is the only one that does not correlate with other features. While it is recommended to keep this feature, eliminating or grouping other features should be tested in future works. The main objective of this paper was to introduce physics-data hybrid models and the corresponding neural network architecture. Extensive testing of this model on larger datasets, including new versions of PDBbind, is our immediate next step. Further extension of this work will be conducted by running short molecular dynamics (MD) simulations embedded in MM/GBSA to bring the dynamics of protein–ligand interactions into account.
Authors: James A Maier; Carmenza Martinez; Koushik Kasavajhala; Lauren Wickstrom; Kevin E Hauser; Carlos Simmerling Journal: J Chem Theory Comput Date: 2015-07-23 Impact factor: 6.006