Literature DB >> 24528282

Does a more precise chemical description of protein-ligand complexes lead to more accurate prediction of binding affinity?

Pedro J Ballester¹, Adrian Schreyer, Tom L Blundell.

Abstract

Predicting the binding affinities of large sets of diverse molecules against a range of macromolecular targets is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for exploiting and analyzing the outputs of docking, which is in turn an important tool in problems such as structure-based drug design. Classical scoring functions assume a predetermined theory-inspired functional form for the relationship between the variables that describe an experimentally determined or modeled structure of a protein-ligand complex and its binding affinity. The inherent problem of this approach is in the difficulty of explicitly modeling the various contributions of intermolecular interactions to binding affinity. New scoring functions based on machine-learning regression models, which are able to exploit effectively much larger amounts of experimental data and circumvent the need for a predetermined functional form, have already been shown to outperform a broad range of state-of-the-art scoring functions in a widely used benchmark. Here, we investigate the impact of the chemical description of the complex on the predictive power of the resulting scoring function using a systematic battery of numerical experiments. The latter resulted in the most accurate scoring function to date on the benchmark. Strikingly, we also found that a more precise chemical description of the protein-ligand complex does not generally lead to a more accurate prediction of binding affinity. We discuss four factors that may contribute to this result: modeling assumptions, codependence of representation and regression, data restricted to the bound state, and conformational heterogeneity in data.

Entities: Chemical

Mesh：

Substances：
Ligands
Proteins

Year: 2014 PMID： 24528282 PMCID： PMC3966527 DOI： 10.1021/ci500091r

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Docking can play a key role in addressing a number of important problems such as protein-function prediction[1,2] or drug-lead identification and optimization.[3,4] This technique can be regarded as a two-stage process. The first is pose generation, which starts with the determination of the position, orientation, and conformation of a molecule as docked to the target’s binding site. The second stage is scoring, which predicts how strongly the docked pose of such a putative ligand binds to the target. While pose generation is relatively well handled by current algorithms, the inaccuracies of current scoring functions still constitute the main barrier to achieving reliability in docking.[5−7] Indeed, despite intensive research over more than two decades, accurate prediction of the binding affinities of large sets of diverse protein–ligand complexes remains one of the most important open problems in computational bioscience. Three classes of scoring functions have emerged over the years: force field,[8,9] knowledge-based,[10−14] and empirical.[15−19] For the sake of efficiency, scoring functions do not attempt to simulate certain physical processes that influence the process of binding. This has an impact on their ability to rank-order and selects small molecules by predicted binding affinity. Thus, two major sources of error in scoring functions arise from their limited description of protein flexibility and the implicit treatment of solvent. Instead of scoring functions, other computational methodologies based on molecular dynamics or Monte Carlo simulations can be used to model protein flexibility and desolvation upon binding. In principle, a more accurate prediction of binding affinity than that from scoring functions is obtained in those cases amenable to these techniques.[20,21] However, such expensive free energy calculations are not feasible for the evaluation of large numbers of protein–ligand complexes, and their application is generally limited to predicting binding affinity in series of congeneric molecules binding to a single target.[22] In addition to these two enabling simplifications, there is a third factor in scoring function development that, despite its importance, has received little attention until recently.[23] Each scoring function assumes a predetermined functional form relating the variables that describe the complex, which may also include a set of weights that are fitted to experimental or simulation data, and its predicted binding affinity. Such a relationship typically takes the form of a sum of weighted physicochemical contributions to binding in the case of empirical scoring functions or a reverse Boltzmann methodology in the case of knowledge-based scoring functions. As previously discussed,[23] the inherent drawback of this approach is that those complexes not conforming to this strong modeling assumption will be predicted poorly. As an alternative to these classical scoring functions, nonparametric machine learning can be used to capture implicitly the binding interactions that are challenging to model explicitly. By not imposing a particular functional form for the scoring function, the collective effect of intermolecular interactions in binding can be directly inferred from experimental data. The latter should lead to scoring functions with greater generality and prediction accuracy. Indeed, this unconstrained approach had to result in performance improvement given sufficient data, as it is well-known that the strong assumption of a predetermined functional form for a scoring function constitutes an additional source of error (e.g., imposing an additive form for the considered energetic contributions[24]). On the other hand, recent experimental results have resulted in novelties in the definition of molecular interactions such as the hydrogen bond[25] and the hydrophobic interaction,[26] implying that previously proposed expressions for these energetic contributions might need to be revised accordingly. While a few classifiers exploiting X-ray crystal structural data for discriminating between binders and nonbinders of a protein target have been presented,[27,28] it is only recently that machine learning for nonlinear regression has been shown[23,29] to be a particularly powerful approach to build generic scoring functions. This approach has been highlighted[30−33] as very promising for the improvement of scoring functions. Indeed, a growing number of studies showing the benefits of machine learning scoring functions have been presented.[23,29,34−37] However, these initial models are relatively coarse in the description of the complex, and thus the question remains as to whether the incorporation of additional chemical information relevant for binding would improve performance further. Here, we investigate the impact of a more precise chemical characterization of the protein–ligand complex on the predictive power of the resulting scoring function. This includes the use of structural interaction fingerprints,[38] using atom and interaction type definitions from the CREDO structural interactomics database.[39] We show that the new version of RF-Score performs much better than classical scoring functions on the same test set. The RF-Score performed best when describing a complex using a 12 Å distance cutoff between atom pairs, suggesting that there is a minor contribution from long-range atom pairs. In the light of the improved performance obtained and considering the uncertainty introduced by the static nature of crystal structures, we discuss the role of interatomic distance cutoffs and binning as well as protonation states in binding affinity prediction. As a byproduct of this systematic battery of numerical experiments, the most accurate scoring function to date on a widely used pre-existing benchmark is presented. An important conclusion of our study is that a more chemically precise description of the protein–ligand complex does not generally lead to more accurate prediction of binding affinity. We discuss four factors that may contribute to this result: modeling assumptions, codependence of representation and regression, data restricted to the bound state, and conformational heterogeneity in data.

Methods

Defining Descriptors

Each complex was described by a vector of integer-valued descriptors or features. Three description schemes were implemented: the Element scheme uses the combination of the element symbols of the interacting atoms to classify the interaction, e.g., C–C or N–O. The fingerprint of this scheme has a position for each pairwise combination of element symbols, and the directionality is preserved; i.e., N–O is distinct from O–N. Here, all of the heavy atoms commonly observed in PDB complexes (C, N, O, F, P, S, Cl, Br, I) are considered. The Sybyl scheme uses SYBYL atom types instead of the element symbols to define the range of considered protein–ligand atom pairs. These atom types permit deconvoluting the element into hybridization state and bonding environment. For instance, instead of having a single C element atom type, the Sybyl scheme considers the following subtypes: C+, C1, C2, C3, Cac, and Car (a description of SYBYL atom types can be found at http://www.tripos.com/mol2/atom_types.html). The latter leads to 36 distinct C–C descriptors in the Sybyl scheme in contrast to a single C–C descriptor in the Element scheme. The credo scheme uses Structural Interaction Fingerprints (SIFts)[38] to encode protein–ligand interactions. Here, interatomic pairs are categorized as interactions if particular geometrical and atom type constraints are satisfied. Atom types were defined through a set of SMARTS patterns that are completely customizable through a configuration file. The atom types of atoms belonging to standard amino acids as well as nonstandard binding site residues that occurred in the used test sets were precalculated because determining them “on the fly” was not feasible with the Open Babel toolkit. These atom types were stored in a separate configuration file and can therefore be easily changed by the user. The determination of standard and weak hydrogen bonds required the protonation state to be known, and there the complexes to rescore must be already protonated. Twelve different contact types are encoded in SIFt, of which four are solely distance-based. The latter are covalent bonds, van der Waals clashes and contacts, and finally proximal interactions. The other eight “feature” contact types are hydrogen bonds, weak hydrogen bonds, halogen bonds, ionic, metal complexes, aromatic, hydrophobic, and carbonyl. The definition of these including the source of the SMARTS definitions and other constraints are described in the original CREDO publication[39] with the following exception: the carbonyl interaction type has since then been implemented on the basis of the ab initio molecular-orbital calculations of Allen et al.[40] who have shown that carbonyl–carbonyl interactions can have similar strengths to those from hydrogen bonds. Appendix A3 in the Supporting Information summarizes the exact classification scheme for all interactions that was used for the SIFt descriptor. Once the scheme is selected, descriptors are generated by counting interatomic pairs between protein binding sites and their ligand molecules. These possibly interacting atoms were assigned a specific interaction type if their distance was within a distance threshold and if a combination of possible atom type and geometry constraints was satisfied. The software to calculate these three description schemes uses the open source chemistry toolkit Open Babel[41] in version 2.3.0 (through Python bindings) and the SciPy library.[42] The molecular structures of protein targets and their ligands comprising the used data sets from PDBbind are read-in separately. Interatomic contacts for a given distance cutoff are determined using the KDTree structure in the scipy.spatial module with the atomic coordinates as input (the KDTree is an efficient space-partitioning algorithm that limits the search space for interatomic contacts in order to prevent expensive all-by-all searches). The interacting atoms are then analyzed depending on the descriptor and the appropriate position on the fingerprint incremented. This descriptor-generating software is also capable of binning the identified interatomic pairs into arbitrary distance ranges. For each normal feature on a descriptor, a number of columns equal to the number of distance bins are created. Using a distance cutoff of 6 Å and a bin size of 1 Å for example would create six bins for each feature: from 0 to 1 Å, 1–2 Å, and so on. The correct bin for each interatomic pair that has to be incremented is determined using the numpy.digitize function in the NumPy package (http://numpy.scipy.org). The complete source code of the software that was used to generate the described results was released at https://bitbucket.org/aschreyer/rfscore under the MIT license.

Regression Model

RF-Score uses Random Forest[43] (RF) as the regression model. A RF for regression is an ensemble of P regression trees randomly generated from the same training data. In building these trees, RF determines the best split at each node of the tree from a subset of randomly selected features of size mtry (the recommended value of this control parameter is a third of the number of features). Note that although the selected features are generally different at each node, the same mtry value is applied to each node across the P trees of the forest. Here, we operate the RF with the default value P = 500, as its performance does not generally improve significantly beyond this threshold. The performance of each tree on predicting the Out-Of-Bag (OOB) set, here protein–ligand complexes that were not in the bootstrap sample used to train that tree, collectively provides an internal validation measure of RF. Further details about RF and its application to build RF-Score can be found in ref (23). The RF-Score software is available at http://pedroballester.com/software.

Training and Test Data

The PDBbind benchmark[44] has become a de facto standard in the validation of scoring functions. On the basis of the refined set from the 2007 release of the PDBbind database, it comprises 1300 diverse protein–ligand complexes with high quality structural and binding data (the protonation states of both proteins and ligands were already calculated by these authors[44]). From this refined set, Cheng et al.[44] constructed a test set, named the core set, with 195 diverse complexes spanning more than 12 orders of magnitude in measured binding affinities. The PDBbind benchmark consists of testing the predictions of scoring functions on the core set, whereas the remaining 1105 refined set complexes are used as the training set. A discussion on the composition and suitability of this benchmark can be found in refs (23) and (29).

Results

Preamble

We start this section with a concise description of the approach to predicting binding affinity using machine learning.[23] This process starts with the characterization of each protein–ligand complex as a set of intermolecular features or descriptors relevant to binding affinity prediction. The sketch in Figure 1 shows an example of how descriptors are generated from the X-ray crystal structure of a complex (PDB: 2p33). Each descriptor is given by the occurrence count of a particular protein–ligand atom pair within a predetermined distance range of each other. For example, a descriptor could be defined as the number of times that protein nitrogen and ligand oxygen are separated by less than a distance cutoff (dcutoff). As in previous studies,[23] nine atom types commonly observed in PDB complexes were selected by considering atomic number only (C, N, O, F, P, S, Cl, Br, I) to give rise to 81 protein–ligand atom pairs which are considered as descriptors (this descriptor scheme is named here “element”).

Figure 1

Sketch of the process of characterizing a protein–ligand complex (PDB: 2p33) as a set of structure-derived descriptors (C.C to I.I). The discontinuous green lines connect the ligand chlorine atom with all protein carbon atoms within the distance cutoff represented by the green sphere, with the number of these pairs giving value to the C.Cl descriptor. The rest of the descriptors are calculated in an analogous manner. Once the descriptor scheme (i.e., considered atom types, binning, and cutoff) is chosen, the next step is to select a source of curated structural and interaction data suitable for training and testing scoring functions. The PDBbind database[45] is an excellent choice for this purpose, with the additional advantage that a large number of scoring functions have already been benchmarked on a common PDBbind test set,[44] which permits comparing new developments against the state of the art. Moreover, some scoring functions[23,44] have not only been tested on this common data set but also calibrated on the same training set (further details can be found in the Methods section). This is important to avoid the often large bias introduced by using a different training set for each scoring function (such bias makes comparisons among scoring functions unreliable, even if compared on the same test set[29]). Therefore, we will be focusing here on these common training and test sets. Last, a regression model is needed to predict the binding affinity of test set complexes from the structural and interaction data in the training set. Here, we build upon RF-Score,[23] a machine learning scoring function using RF[43] for regression. RF is typically tuned using a single control parameter (mtry, which controls the number of features that are considered for the split at each tree node) and may be subjected to a feature selection strategy intended to remove descriptors with low information content as a way to improve performance (that is, in addition to the common practice of removing all those descriptors that have zero values across training complexes). As usual, predictive performance is measured as the difference between predicted and measured binding affinity across test set complexes. Figure 2 summarizes the process of training and testing machine learning scoring functions. Full details on descriptors, data, and regression protocols can be found in the Methods section.

Figure 2

Training and testing RF-Score workflow. Top: descriptors are generated from two nonoverlapping data sets with 1105 and 195 complexes for training and testing, respectively. Bottom: training Random Forest to learn the nonlinear relationship between this atomic-level description of the complex and its binding affinity (pKd or pKi; pKd/i denotes both without distinction). The resulting scoring function (RF-Score) is used to predict the binding affinity of the test set. We have seen that not only can a protein–ligand complex be described in various ways, but also the process of building a predictive model from these descriptors involves a number of choices. In a way, the overall process of training a scoring function can be regarded as a quest for finding an optimal combination of these design variables. An exhaustive evaluation of all possible combinations is impractical, as this would involve a prohibitively large number of RF training runs (2 runs, one for each possible subset of m features, even if we fix mtry to its recommended value). Thus, we were forced to assume the independence of these design variables and searched for the optimal value of each variable in a sequential manner as explained in the next subsections.

Optimal Interatomic Distance Cutoff

The first question we addressed was which distance cutoff leads to the best performance on the independent test set (henceforth referred to as simply the test set). Different distance cutoffs have been previously used in the literature, some as large as 12 Å (PMF[12]) and 15 Å (Fresno[46]). Here, we addressed this question empirically by generating element descriptors for four cutoffs (6 Å, 9 Å, 12 Å, and 15 Å), which gave rise to four different numerical characterizations of the training set with their corresponding test set counterparts. Thereafter, RF was calibrated on each of these training set versions and the resulting model used to predict the binding affinity of the corresponding test set complexes. As scoring function calibration is a stochastic process, a slightly different model is obtained with a different random seed. To assess the variability in RF prediction due to this factor, we repeated the training 10 times for each cutoff, each time with a different random seed. Such assessment is needed to establish whether the improvement in prediction is due to using another distance cutoff or just comes from variability in model calibration (this procedure is more accurate than basing the analysis on a single model calibration as has been the case so far). Lastly, we considered three commonly used metrics for quantifying the difference between predicted and measured binding affinity across the test set of protein–ligand complexes: Rp (Pearson’s correlation coefficient), Rs (Spearman’s correlation coefficient), and SD (Standard Deviation in log Kd/i units). Figure 3 illustrates the results of this numerical experiment.

Figure 3

Test set performance of RF-Score with element descriptors for each of the four distance cutoffs (6 Å, 9 Å, 12 Å, and 15 Å). Performance is measured as the difference between observed and predicted binding affinity in the test set using three metrics: Pearson’s correlation coefficient (Rp; left plot), Spearman rank-correlation coefficient (Rs; middle plot), and standard deviation (SD; right plot). Ten models are built from each of these four versions of the training data sets (6 Å, 9 Å, 12 Å and 15 Å), each time using a different random seed (the boxplot summarizes the performance on the test set achieved by each of the 10 models). Results showed that the best median performance, i.e., that with the highest correlations and lowest standard deviation, is obtained with the 12 Å cutoff in all three performance metrics. It is worth noting that optimizing the distance cutoff only led to a modest performance improvement (+0.017 in median Rp and −0.05 log K units in median SD).

Role of Protonation States and Bonding Neighborhood

The element descriptors used in the previous experiment constitute a coarse representation of the complex. Distinguishing between atoms of the same element in different local environments leads to a more chemically precise characterization (e.g., deconvoluting the occurrence counts of a carbon–carbon intermolecular pair into pairs that specify the hybridization state of both atoms), and thus the resulting model would in principle be expected to perform better. To test this hypothesis, we used Sybyl atom types with these characteristics. Further, one could also incorporate additional information into the descriptors such as the angle between hydrogen bond donors, acceptors, and hydrogen atoms as well as covalent and van der Waals radius. These are the Credo descriptors, which measure the abundance of a range of intermolecular interactions such as hydrogen bonds, hydrophobic interactions, or van der Waals clashes. Using the same training and test set, each description scheme gives rise to a different set of features that are used to characterize every complex. The performance of each of these three description schemes is presented in Figure 4.

Figure 4

Test set performance of RF-Score using the optimal 12 Å interatomic distance cutoff with element, Sybyl, and credo descriptors. Interestingly, the model based on Credo descriptors obtained much lower performance than that using Sybyl and element descriptors. Element descriptors led to a small improvement over Sybyl descriptors. These results hint at a trade-off between the predictability and interpretability of the model, which we will discuss later in this paper.

Incorporating Interatomic Distance

The strength of the interatomic interactions that collectively form the noncovalent intermolecular bond depends on the separation between the interacting atoms. Therefore, it is reasonable to think that partitioning the descriptors into a number of interatomic distance bins should lead to a model with more predictivity. Consequently, we generated element descriptors with a 12 Å cutoff, i.e. using all the optimal values, for six bin sizes (a 12 Å bin size with a 12 Å cutoff simply corresponds to the case without binning, which was previously shown in the first boxplot in Figure 4 and the third boxplot in Figure 3). Figure 5 shows the results for each bin size, where the best median performance is achieved by models with lower bin sizes (1 Å, 2 Å, and 3 Å), representing a moderate improvement with respect to the model based on single-binned descriptors (12 Å). The experiments were repeated using the same bin sizes but now with Sybyl and Credo descriptors instead of elements descriptors (the maximum cutoff for a Credo interaction type is 4.5 Å, all other atom pairs in this description scheme are labeled as “proximal”). It was observed that the performance was not as high as that with element descriptors (the best median performance for Sybyl was Rp = 0.779, Rs = 0.771, and SD = 1.59, whereas that for Credo was Rp = 0.739, Rs = 0.742, and SD = 1.68).

Figure 5

Test set performance of RF-Score with element descriptors and 12 Å interatomic distance cutoff using six different bin sizes (1 Å, 2 Å, 3 Å, 4 Å, 6 Å, and 12 Å). The best median performance is achieved by models with lower bin sizes (1 Å, 2 Å, and 3 Å), representing a modest improvement with respect to the model based on single-binned descriptors (12 Å).

Feature Selection

In addition to exploring the impact of different ways to describe the complexes, we also applied basic feature selection strategies intended to remove sparse features that increased the complexity of the model without improving performance. Here, the sparsity (spr) of a descriptor is defined as the average number of occurrence counts per training complex. In previous versions of RF-Score, only features with sparsity higher than the zero threshold (spr = 0), i.e., those that are nonzero for at least one training complex, were considered. Here, we also considered two additional sparsity thresholds (spr = 1 and spr = 2). We conducted this experiment for the three best bin sizes in Figure 5 (1 Å, 2 Å, and 3 Å), which had spr = 1 as the optimal value on all three bin sizes. Figure 6 presents the results for the best bin size across the three spr values (2 Å).

Figure 6

Test set performance of RF-Score with element descriptors, 12 Å cutoff, and 2 Å bin size using three values of the feature selection threshold (spr). Best median performance is obtained by spr = 1, which corresponds to only considering descriptors that have an average of at least one atom–atom pair in the considered distance range per training complex. The latter represents a moderate improvement with respect to the models using descriptors with spr = 0 and spr = 2.

Model Selection

Last, we have been using the recommended value for the RF mtry parameter so far. However, interval validation strategies can be used to select an optimal value for mtry. One of these strategies is called Out-Of-Bag (OOB) validation and essentially consists of training the model for each possible value of mtry and selecting the model that best performs on the internal validation set (a subset of the training set, as further explained in the Methods section). Figure 7 shows that this model selection strategy carries a small improvement in performance at the cost of much higher computational expense in model selection (one RF training run per considered mtry value).

Figure 7

Test set performance of RF-Score with element descriptors, 12 Å cutoff, 2 Å bin size, and feature selection threshold (spr=1) for the recommended mtry (model selection=0) and mtry selected by OOB validation (model selection=1), which requires 123 times more training than just using the recommended setting (as many RF training runs as features were selected to describe each complex).

Predictive Performance

This systematic battery of numerical experiments led to the new scoring function RF-Score::Elem(c12,b2)_spr1_oob (RF-Score::Elem-v2 for short). As we have seen, the descriptors come from partitioning occurrence counts of each element atom type pairs into six interatomic distance bins of 2 Å size, and the RF model is built with the 123 descriptors that are sufficiently dense (spr = 1) using the internally validated mtry value (mtry = 14 in this case). Figure 8 shows the predictive power of RF-Score::Elem-v2 on the test set.

Figure 8

RF-Score::Elem-v2 predicted versus measured binding affinity on the independent test set (195 complexes). Pearson’s correlation coefficient Rp = 0.803, Spearman’s correlation coefficient Rs = 0.797, standard deviation SD = 1.54 log Kd/i units, and Root Mean Square Error RMSE = 1.53 log Kd/i units. This plot can be visually compared to those for the best performing scoring functions in Cheng et al.’s[44] Figure 6. Performance comparisons on the same test set are presented here in Figure 9.

Figure 9

Performance of 18 scoring functions on the PDBbind benchmark as measured by Pearson’s correlation coefficient (Rp), Spearman’s correlation coefficient (Rs), and standard deviation of the difference between predicted and measured binding affinity (SD). The three plots on the right visually show the relative predictive power of RF-Score (“x” signs) against that of the other 17 scoring functions (“+” signs). NHA is the performance of a linear regression model with the number of heavy atoms of the ligand as only variable (this baseline is shown as a horizontal discontinuous line in the plots).

In terms of efficiency, RF-Score::Elem-v2 scored all 195 protein–ligand complexes in 0.01 s (all of the computation in this study was carried out with a single processing core Intel Core i7–2920XM at 2.50 GHz with 16 GB RAM). In addition, the time to generate these features for the 195 complexes was 8 s, and hence this is the most expensive part of the calculation. Therefore, the average time to score one protein–ligand complex is about 0.04 s if the features have not been calculated before, which makes RF-Score suitable to rescore a high number of docking poses in virtual screening applications. The predictive power of RF-Score::Elem-v2 was also compared against that of a wide selection of scoring functions on the PDBbind benchmark.[44] By using a pre-existing benchmark, the danger of constructing a benchmark complementary to the presented scoring function is avoided. It also has the advantage of ensuring that previously tested scoring functions were provided with optimal settings by their authors. Several of the scoring functions tested in the PDBbind benchmark have different versions or multiple options. However, for the sake of practicality, only the version/option of each scoring function that performs best on the PDBbind benchmark was reported by Cheng et al.[44] In addition to these 16 scoring functions, we also tested a more recent function called IMP::RankScore.[47] Figure 9 reports the performance of all scoring functions on the test set, with RF-Score::Elem-v2 obtaining the best performance with Rp = 0.803 (the performance of the original version of RF-Score[23] is also included). In contrast, classical scoring functions tested on the same test set obtained a lower Rp spanning from 0.216 to 0.644. This trend was also observed with the other two performance measures (Rs, SD). It is worth noting that the root-mean-square error of the free energy of binding on such a diverse test set is just 2.1 kcal/mol. Performance of 18 scoring functions on the PDBbind benchmark as measured by Pearson’s correlation coefficient (Rp), Spearman’s correlation coefficient (Rs), and standard deviation of the difference between predicted and measured binding affinity (SD). The three plots on the right visually show the relative predictive power of RF-Score (“x” signs) against that of the other 17 scoring functions (“+” signs). NHA is the performance of a linear regression model with the number of heavy atoms of the ligand as only variable (this baseline is shown as a horizontal discontinuous line in the plots). When introducing a scoring function, only the scoring function built with the random seed that provides the best performance is generally reported. We have followed here a more precise way to assess performance differences between scoring functions by comparing median performances from a set of independent trials. Moreover, in order to address the question of how significant is the reported improvement over the original version of RF-Score, we have trained and tested the original RF-Score using 10 different random seeds. Thereafter, we have repeated the process, using the same random seeds, with the new version of RF-Score. The resulting boxplots are compared in Figure 10. Lastly, we carried out a two-sample t test for each performance measure to find out that all differences are statistically significant (Rp p value= 6.0 × 10–12, Rs p value=1.5 × 10–11, and SD p value = 3.8 × 10–4).

Figure 10

Performance comparison between the original version and the new version of RF-Score.

Discussion

The new version of RF-Score performs much better than classical scoring functions on the same test set. In fact, this performance gain must be actually larger in most cases since only RF-Score and X-Score, among all scoring functions represented in Figure 9, use training sets that do not overlap with the test set. Having training complexes in the test set artificially enhances the performance of a scoring function, as it is not exclusively predicting unseen complexes but merely reporting the lower training error of those overlapping complexes. Indeed, adding a third of the test set to the training set makes RF-Score::Elem-v2’s Rp rise from 0.803 (no overlap between training and test sets) to 0.872 (65 overlapping complexes). Clearly, the same training and test set must be used when comparing scoring functions, but unfortunately this has not been required in recent community benchmarks.[48] Furthermore, it could be argued that there is something particular about the training/test partition selected by Cheng et al. in the PDBbind benchmark. This partition was chosen to compare RF-Score against the best scoring function in that study under exactly the same experimental conditions. An experiment to investigate this question was already carried out in the original RF-Score paper[23] (Appendix A4 in the supplementary data of that paper[23]) and further discussed in a subsequent commentary[29] (see Figure 1 therein). The performances of RF-Score for 25 randomly generated training/test partitions with the same sizes as the benchmark partition (1105/195) were calculated. The experiment demonstrated that there is a minor difference in RF-Score performance between the benchmark partition and the median of these 25 alternative partitions. While our study focuses on generic scoring functions, we would like to briefly comment on how RF-Score would perform on subgroups of the test set (e.g., complexes whose proteins belong to the same family). Clearly, the better the performance of RF-Score over another scoring function on the full test set, the more test subsets will be better predicted by RF-Score. To illustrate this, Appendix A1 presents the performance of the new version of RF-Score (RF-Score::Elem-v2; full test set RMSE = 1.54) and RF using Credo intermolecular interaction features (RF-Score::Credo; full test set RMSE = 1.72) on the four subsets resulting from partitioning the test set by binding affinity ranges. Appendix A2 presents another experiment where two small subsets of 23 complexes are generated, one containing those with the most similar ligands and the other subset with the most dissimilar ligands in terms of chemical structure. RF-Score::Elem-v2 outperforms RF-Score::Credo in all subsets but the one with the most dissimilar ligands, where RF-Score::Credo performs slightly better. These results illustrate the fact that, despite RF-Score::Elem-v2 generally performing better, there could be a few complexes (e.g., ligand-bound structures of a particular target) where other scoring functions perform better. We intend to study this issue further in the future. On the other hand, it is noteworthy that RF-Score performed best when describing a complex using a 12 Å distance cutoff between atom pairs, a distance well beyond direct interatomic contacts. The improvement over RF-Score with a more common 6 Å cutoff is, however, modest (+0.017 in median Rp, +0.003 in median Rs, and −0.05 log K units in median SD; see Figure 3). This result suggests that there is a minor favorable contribution from atom pairs separated by a distance between 6 and 12 Å over the 1300 considered complexes. Such long-range contributions to protein–ligand binding affinity have been attributed to the electronic properties of the binding site and ligand being affected by all protein atoms[49] and also to long-range electrostatics interactions.[50] Increasing the noncovalent cutoff to 12 Å has also been found beneficial in protein folding[51] and DNA molecular dynamics[52] simulations. It seems therefore that RF-Score is able to capture long-range effects implicitly to some extent. The main reason why RF-Score works much better than classical scoring functions at predicting binding affinity of most complexes is due to the circumvention of the strong assumption of a predetermined functional form. All classical scoring functions consist of a sum of typically nonlinear terms with respect to selected interatomic distances, such as van der Waals terms in empirical scoring functions or particular atom–atom potentials in knowledge-based scoring functions. For instance, the Scoring Function Consortium (SFC), in a concerted effort between 10 pharmaceutical companies and academic institutions, generated an empirical scoring function (SFCscore),[53] which by the time of its development was clearly superior to most of the then available scoring functions. Very recently, one of the leading SFC authors has demonstrated[54] that, by using Random Forest regression instead of SFCscore’s additive functional form and keeping all other modeling choices unaltered (training data, test data, and descriptors), performance rises from RMSE = 1.84 to RMSE = 1.56 (0.683 to 0.788 in the case of Rs). This is a very large improvement for a single modification in a generic model, especially taking into account that scoring functions are highly optimized due to intense work over the years in this area. Another study demonstrating that assuming an additive functional form is detrimental for the performance of empirical scoring functions is by Kinnings et al.[30] As force-field and knowledge-based scoring functions make the same assumption, these studies strongly suggest that a machine-learning version of other classical scoring functions will also result in significant improvement. Another important conclusion of our study is that a more precise chemical description of the protein–ligand complex does not generally lead to more accurate prediction of binding affinity (see Figure 4). In the first study, Li et al.[55] present a scoring function tested on exactly the same test set as us, with a much larger training set that includes ours and the use of a very precise description consisting of 50 calculated descriptors falling into nine interaction categories: van der Waals, hydrogen-bonding, electrostatic, pi-system, metal–ligand bonding, desolvation effect, entropic loss effect, shape matching, and surface property matching (Table 1 on page 593 of Li et al.’s paper). Li et al. obtained SD = 1.63 and Rs = 0.779 (Table 4 at page 597 of Li et al.’s paper), whereas RF-Score originally obtained SD = 1.58 and Rs = 0.762 on the same test set. Interestingly, these authors referenced RF-Score but did not include it in the comparison or comment on why its performance was better in some performance measures despite using much simpler descriptors and less data for training. The second independent study provides an even more direct comparison. Zilian and Sotriffer[54] used the same training set, test set, and regression model as RF-Score. The only difference between their scoring function and ours is in the 63 used descriptors, which was one of the outcomes of the industry–academia Scoring Function Consortium. These descriptors include the number of rotatable bonds in the ligand, hydrogen bonds, aromatic interactions, and polar and hydrophobic contact surfaces, among others (a full list can be found in Table 1 of page 398 of the original SFC paper[53]). Their best scoring function achieved RMSE = 1.56 and Rs = 0.788 (Table 1 of Zilian and Sotriffer’s paper), which is slightly better than the original version of RF-Score (RMSE = 1.58 and Rs = 0.762). If the modeling assumptions implied in the calculation of chemical properties were generally accurate, we should have seen many scoring functions performing much better than RF-Score thanks to using a more precise chemical description. But we have actually seen the opposite in these two independent studies, once we compare the performances achieved by Li et al. (SD = 1.63 and Rs = 0.779) and Zilian and Sotriffer (RMSE = 1.56 and Rs = 0.788) to that of the new version of RF-Score (SD = 1.54, RMSE = 1.54 and Rs = 0.797) on the same diverse test set. The new version differs from the original version of RF-Score in that features are distance-dependent but still do not explicitly incorporate calculated protonation states. We discuss next four convoluted factors that may contribute to this result: modeling assumptions, codependence of representation and regression, data restricted to the bound state, and conformational heterogeneity in data. The first factor is that more precise descriptors often mean making modeling assumptions that introduce additional error. For example, the protonation state of an atom needs to be estimated in order to assign its Sybyl type, but the local change in pH induced by hydrogen bond donors/acceptors in nearby residues and water molecules is usually not incorporated into scoring functions for binding affinity prediction. Similar arguments can be constructed regarding the calculation of donor-hydrogen-acceptor angles to perceive hydrogen bonds. The question remains as to how large the impact of this error is compared to that of not considering protonation states at all (the element descriptor scheme). The second factor, often neglected, is the optimality of problem representation (description scheme) for the applied solution construction method (regression technique). From a purely chemical perspective, deconvoluting elemental atom types into their various hybridization states constitutes a more precise description of the complex. However, this scheme also results in a higher number of features and thus more sparse features. The latter are detrimental for random forest regression because as many data as possible are needed to characterize the interaction between each pair of atom types, best achieved by minimizing the number of different types defined. In practice, the definition of atom types must reflect a compromise between these two conflicting objectives, so as to ensure that the features are backed up by sufficient data to be statistically as well as chemically meaningful. This situation gives rise to a trade-off between the predictability and interpretability of the model, which is not uncommon in regression problems[56] and has also been observed here (see Figure 4). For the sake of efficiency, scoring functions only exploit the information contained in the bound state of the complex, as represented by a crystal structure. However, binding affinity also depends on the energetic contributions from ligand and protein desolvation as well as induced fit upon binding. The third contributing factor is therefore the uncertainty about how well a particular description of the bound complex is also describing the complex just before desolvation and induced fit takes place. We speculate that descriptors whose values change less during the binding process might be more suitable for predicting binding affinity using only data about the bound state. For instance, element descriptors do not change much in general during the binding process, as a fixed cutoff will include roughly the same protein and ligand atoms just before and after binding. In contrast, protonation states will generally change significantly upon binding because of desolvation. The last contributing factor comes from the uncertainty arising from the fact that the crystallographers deposit a single structural model in the PDB while several different models may fit the electron density equally well.[57] The conformational heterogeneity of a complex (i.e., several bound states are possible for this complex, at least within experimental uncertainty) means that different sets of descriptors would be generated for exactly the same binding affinity. Access to the multiple structural models of a complex that are significantly different at the binding site level is likely to be helpful in deciding how to best address this issue. In particular, it would be interesting to investigate whether combining the predictions from each structure is a better strategy than simply predicting from the deposited structure. Our finding that binding affinity can be better predicted when calculated protonation states are not explicitly incorporated into the scoring function will be certainly seen as a controversial result by most molecular modellers. We are providing next an intuitive explanation for this result. In machine learning nomenclature, the chemical description of complexes constitutes a data representation. Representations present an opportunity to incorporate domain knowledge into the problem, which in principle can help to disentangle the different explanatory factors for variation of the predicted variable (binding affinity here) and thus lead to better performance by simplifying the regression problem. However, as domain knowledge is affected by confounding factors and implemented with various degrees of efficacy, it is entirely possible to obtain better performance by incorporating less domain knowledge (i.e., introducing less noise) and hence relying more on pure inference from the data. In our problem, binding affinity is experimentally determined in solution along a trajectory in the codependent conformational spaces of the interacting molecules, whereas the structure represents a possible final state of that process in a crystallized environment. Consequently, very precise descriptors calculated from the structure are not necessarily more representative of the dynamics of binding than less precise descriptors. This means that a more precise description will not necessarily lead to a better prediction of binding affinity, as it has been proven here using Random Forest. Because the information content of a set of variables is independent of the adopted regression model, the use of an alternative regression technique should lead to the same conclusion, although this point is still to be confirmed experimentally. We cannot stress enough that we are not making any claim about the importance of protonation for pose generation in docking. Pose generation and rescoring are different problems, and so are the objectives that the corresponding scoring functions must fulfill. In summary, we have seen that one can be easily fooled by uncertainty when investigating more accurate scoring functions. Given the unavoidable uncertainty, we believe that rigorous and systematic numerical studies are the most reliable way to make progress in predicting intermolecular binding affinity. We hope that the availability of the RF-Score software (links are provided in the Methods section) will encourage experts in the area to try to perform better on the PDBbind benchmark using alternative chemical descriptions as a way to investigate this issue further. The code permits reproducing the results obtained by RF-Score::Elem-v2 and can also be used as a template to test alternative regression techniques implemented in R. Without any modification, the RF-Score software can be employed to rescore ligands in crystal structures or docking poses. There is a range of applications in which more accurate prediction of binding affinity of a complex would be very useful, some of them new such as replacing force-fields in molecular dynamics simulations. Other applications include structure-based virtual screening and lead optimization. In fact, applying a simpler variant of RF-Score::Elem-v1[23] to prospective virtual screening has already been found[37] to excel at discovering innovative inhibitors of antibacterial targets. Very recently,[58] RF-Score::Elem-v1 has been incorporated into an easy-to-set large-scale docking Web server (http://istar.cse.cuhk.edu.hk/idock) to carry out virtual screening of up to 17 million purchasable molecules from the ZINC database,[59] which should be upgraded soon to RF-Score::Elem-v2.

52 in total

1. On the truncation of long-range electrostatic interactions in DNA.

Authors: J Norberg; L Nilsson
Journal: Biophys J Date: 2000-09 Impact factor: 4.033

Review 2. Towards the development of universal, fast and highly accurate docking/scoring methods: a long way to go.

Authors: N Moitessier; P Englebienne; D Lee; J Lawandi; C R Corbeil
Journal: Br J Pharmacol Date: 2007-11-26 Impact factor: 8.739

3. Mechanism of the hydrophobic effect in the biomolecular recognition of arylsulfonamides by carbonic anhydrase.

Authors: Phillip W Snyder; Jasmin Mecinovic; Demetri T Moustakas; Samuel W Thomas; Michael Harder; Eric T Mack; Matthew R Lockett; Annie Héroux; Woody Sherman; George M Whitesides
Journal: Proc Natl Acad Sci U S A Date: 2011-10-19 Impact factor: 11.205

4. Statistical potential for modeling and ranking of protein-ligand interactions.

Authors: Hao Fan; Dina Schneidman-Duhovny; John J Irwin; Guangqiang Dong; Brian K Shoichet; Andrej Sali
Journal: J Chem Inf Model Date: 2011-11-21 Impact factor: 4.956

Review 5. Current trends in virtual high throughput screening using ligand-based and structure-based methods.

Authors: Nagamani Sukumar; Sourav Das
Journal: Comb Chem High Throughput Screen Date: 2011-12 Impact factor: 1.339

6. A Mixed QM/MM Scoring Function to Predict Protein-Ligand Binding Affinity.

Authors: Seth A Hayik; Roland Dunbrack; Kenneth M Merz
Journal: J Chem Theory Comput Date: 2010-09-01 Impact factor: 6.006

Review 7. Efficient drug lead discovery and optimization.

Authors: William L Jorgensen
Journal: Acc Chem Res Date: 2009-06-16 Impact factor: 22.384

8. ZINC: a free tool to discover chemistry for biology.

Authors: John J Irwin; Teague Sterling; Michael M Mysinger; Erin S Bolstad; Ryan G Coleman
Journal: J Chem Inf Model Date: 2012-06-15 Impact factor: 4.956

9. Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit.

Authors: Noel M O'Boyle; Chris Morley; Geoffrey R Hutchison
Journal: Chem Cent J Date: 2008-03-09 Impact factor: 4.215

10. istar: a web platform for large-scale protein-ligand docking.

Authors: Hongjian Li; Kwong-Sak Leung; Pedro J Ballester; Man-Hon Wong
Journal: PLoS One Date: 2014-01-24 Impact factor: 3.240

38 in total

1. A comparative study of family-specific protein-ligand complex affinity prediction based on random forest approach.

Authors: Yu Wang; Yanzhi Guo; Qifan Kuang; Xuemei Pu; Yue Ji; Zhihang Zhang; Menglong Li
Journal: J Comput Aided Mol Des Date: 2014-12-20 Impact factor: 3.686

2. Docking pose selection by interaction pattern graph similarity: application to the D3R grand challenge 2015.

Authors: Inna Slynko; Franck Da Silva; Guillaume Bret; Didier Rognan
Journal: J Comput Aided Mol Des Date: 2016-08-01 Impact factor: 3.686

3. Improving scoring-docking-screening powers of protein-ligand scoring functions using random forest.

Authors: Cheng Wang; Yingkai Zhang
Journal: J Comput Chem Date: 2016-11-17 Impact factor: 3.376

4. Incorporating Explicit Water Molecules and Ligand Conformation Stability in Machine-Learning Scoring Functions.

Authors: Jianing Lu; Xuben Hou; Cheng Wang; Yingkai Zhang
Journal: J Chem Inf Model Date: 2019-10-31 Impact factor: 4.956

5. Redesign of LAOBP to bind novel l-amino acid ligands.

Authors: Jesús Banda-Vázquez; Sooruban Shanmugaratnam; Rogelio Rodríguez-Sotres; Alfredo Torres-Larios; Birte Höcker; Alejandro Sosa-Peinado
Journal: Protein Sci Date: 2018-03-22 Impact factor: 6.725

6. CSM-lig: a web server for assessing and comparing protein-small molecule affinities.

Authors: Douglas E V Pires; David B Ascher
Journal: Nucleic Acids Res Date: 2016-05-05 Impact factor: 16.971

7. Protein-ligand docking using FFT based sampling: D3R case study.

Authors: Dzmitry Padhorny; David R Hall; Hanieh Mirzaei; Artem B Mamonov; Mohammad Moghadasi; Andrey Alekseenko; Dmitri Beglov; Dima Kozakov
Journal: J Comput Aided Mol Des Date: 2017-11-03 Impact factor: 3.686

Review 8. Application of Computational Biology and Artificial Intelligence Technologies in Cancer Precision Drug Discovery.

Authors: Nagasundaram Nagarajan; Edward K Y Yapp; Nguyen Quoc Khanh Le; Balu Kamaraj; Abeer Mohammed Al-Subaie; Hui-Yuan Yeh
Journal: Biomed Res Int Date: 2019-11-11 Impact factor: 3.411

9. Machine learning on ligand-residue interaction profiles to significantly improve binding affinity prediction.

Authors: Beihong Ji; Xibing He; Jingchen Zhai; Yuzhao Zhang; Viet Hoang Man; Junmei Wang
Journal: Brief Bioinform Date: 2021-09-02 Impact factor: 11.622

10. Deep Learning in Drug Design: Protein-Ligand Binding Affinity Prediction.

Authors: Mohammad A Rezaei; Yanjun Li; Dapeng Wu; Xiaolin Li; Chenglong Li
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2022-02-03 Impact factor: 3.710