Literature DB >> 33496578

General Purpose Structure-Based Drug Discovery Neural Network Score Functions with Human-Interpretable Pharmacophore Maps.

Benjamin P Brown¹, Jeffrey Mendenhall², Alexander R Geanes², Jens Meiler^2,3,4.

Abstract

The BioChemical Library (BCL) is an academic open-source cheminformatics toolkit comprising ligand-based virtual high-throughput screening (vHTS) tools such as quantitative structure-activity/property relationship (QSAR/QSPR) modeling, small molecule flexible alignment, small molecule conformer generation, and more. Here, we expand the capabilities of the BCL to include structure-based virtual screening. We introduce two new score functions, BCL-AffinityNet and BCL-DockANNScore, based on novel distance-dependent signed protein-ligand atomic property correlations. Both metrics are conventional feed-forward dropout neural networks trained on the new descriptors. We demonstrate that BCL-AffinityNet is one of the top performing score functions on the comparative assessment of score functions 2016 affinity prediction and affinity ranking tasks. We also demonstrate that BCL-AffinityNet performs well on the CSAR-NRC HiQ I and II test sets. Furthermore, we demonstrate that BCL-DockANNScore is competitive with multiple state-of-the-art methods on the docking power and screening power tasks. Finally, we show how our models can be decomposed into human-interpretable pharmacophore maps to aid in hit/lead optimization. Altogether, our results expand the utility of the BCL for structure-based scoring to aid small molecule discovery and design. BCL-AffinityNet, BCL-DockANNScore, and the pharmacophore mapping application, as well as the remainder of the BCL cheminformatics toolkit, are freely available with an academic license at the BCL Commons site hosted on http://meilerlab.org/.

Entities: Chemical Disease Gene Species

Year: 2021 PMID： 33496578 PMCID： PMC7903419 DOI： 10.1021/acs.jcim.0c01001

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Computer-aided drug discovery (CADD) is a broad category of methods that can be employed to increase the efficiency of the drug discovery process. Broadly, CADD methods can be subdivided into two categories: ligand-based (LB) and structure-based (SB). LB methods predominantly employ similarity metrics to compare ligands with known biological activity or chemical attributes to a library of prospective small molecules. Among the most widely used LB methods are quantitative structure–activity relationship (QSAR) models, which relate quantitative chemical descriptors of molecules to known biological activities.[1,2] QSAR models lend themselves to supervised machine learning methods, such as artificial neural networks (ANNs) and random forest (RF).[3−8] Indeed, over the last two decades, we have demonstrated the efficacy of ANNs in LB classification tasks compared to other methods, such as support vector machines, and employed them to identify multiple G-protein-coupled receptor (GPCR) allosteric modulators.[3,9−12] At that time, we have contributed to multiple aspects of QSAR method development, including early efforts to expedite model training with graphics processing unit (GPU) programming,[13] chemical descriptor, and toolkit development,[14−16] improving QSAR ANN architectures with dropout,[5] and dataset assembly for community benchmarking.[17,18] We have accomplished this largely with the development of the BioChemical Library (BCL), a primarily ligand-based academic open-source cheminformatics toolkit. LB methods can often rank compounds many orders of magnitude faster than SB methods. Despite being very rapid and easily deployed on large databases for virtual high-throughput screening (vHTS), ligand-based methods have inherent limitations. Most notably, LB methods make predictions in the absence of binding pocket information. As a result, predictions made from LB methods must be target-specific, and generating LB models for a given target, especially QSAR models, may require a large amount of model training data.[1−8] Thus, there is considerable interest in developing target agnostic, rapid SB methods for vHTS. SB methods provide information about small molecule interactions with the binding pocket. Critically, this should allow SB methods to be target agnostic and provide chemically meaningful insight with which to guide hit optimization. Unfortunately, the most accurate SB methods come with a computational cost prohibitive for vHTS. Accurate prediction of small molecule binding affinities to target proteins is a key challenge in SB CADD. Structure-based alchemical free energy approaches, such as free energy perturbation (FEP) and thermodynamic integration (TI), are widely considered to be the most accurate.[19−21] Other approaches, such as molecular mechanics Poisson–Boltzmann or generalized-Born surface area (MM/PB(GB)SA), or protein–ligand docking semiempirical scoring functions, can also provide reliable relative binding free energies, but with overall performance seemingly being more system dependent.[22−25] Faster but less accurate docking score functions are being increasingly scaled to medium- and high-throughput virtual screening.[26,27] In the last decade, many machine learning approaches have been developed to increase the speed and accuracy of SB virtual screening approaches. As early as 2010, random forest (RF) rescoring of docked poses demonstrated that machine learning algorithms could provide rapid and competitive prediction of protein–ligand binding affinities (RF-Score).[28] A variation on RF as a modeling tool for protein–ligand binding affinity prediction is ΔVinaRF20, which uses random forest (RF) to predict an error correction term for the AutoDock Vina docking score function.[29] More recently, deep learning with convolutional neural networks (CNNs) has been widely investigated to predict binding affinities. For example, DeepVS is a CNN that attempts to generalize binding mode information by encoding local atomic neighborhoods around each selected ligand atom using simple descriptors (i.e., atom types, charges, distances, and interacting amino acid identity).[30] Multiple grid-based CNNs have also been developed, such as KDEEP and a CNN, by Ragoza et al., which treat protein–ligand complexes as three-dimensional (3D) images colored by specific atom type and pharmacophore properties.[31,32] AtomNet is another grid-based CNN that also includes features derived from protein–ligand interaction fingerprints.[33] It is well known that cheminformatics machine learning algorithms can be strongly limited in their domain of applicability by the chosen training set and descriptors.[34−40] There is concern that some newer CNN techniques demonstrating exceptional performance may suffer from the lack of generalizability owing to the dataset and training biases.[31,41] Even in cases where machine learning models make accurate predictions, the chemical basis of these predictions is not easily interpreted without substantial input sensitivity and feature analysis. This infamously gives rise to the “black box” problem of machine learning algorithms, especially deep neural networks (DNNs). Finally, a major motivation for the current project is to incorporate a modular and customizable SB score function into the BCL for use in the ongoing development of SB design algorithms. Currently, the BCL is only able to support LB design algorithms. Ultimately, we anticipate that increasing the capabilities of the BCL to perform both LB and SB design tasks will make it a valuable companion to other academic molecular modeling software projects, such as the Rosetta[42] macromolecular modeling and design software suite. To address these issues, we have designed a novel SB protein–ligand binding affinity and pose prediction model based on distance-dependent signed atom property protein–ligand correlations (PLCs). Instead of encoding specific protein and ligand properties, our method encodes the protein–ligand interaction feature space. This is analogous to the formation of statistical pair potentials, except that here we do not formally provide any constraints on the function to be approximated. We demonstrate that fully connected feed-forward neural networks trained with our new descriptors are competitive with the existing state-of-the-art machine learning methods and docking methods at protein–ligand binding affinity prediction, pose prediction, and virtual screening power. Moreover, we explicitly demonstrate that the performance of our models is not dependent on exploiting dataset bias. Finally, we show how our models can be rapidly decomposed into human-interpretable pharmacophore maps. These pharmacophore maps allow users to visualize the atoms/substructures of their molecules that drive the activity prediction, as well as map predicted or known relative binding free energy changes across molecule ensembles to specific substructures. This will be the first SB scoring tool available in the BCL, and the pharmacophore mapping tool is fully compatible with the LB QSAR methods currently implemented. Together, we believe that these tools improve the utility of the BCL for SB hit identification and lead to optimization in drug discovery. The new descriptors, models, and pharmacophore mapping application will be available in the upcoming BCL version 4.1 release, an academic open-source software package for cheminformatics written in the C++ programming language. It is our hope that our new method will be used in conjunction with other advancements in machine learning-based QSAR/QSPR to continue to improve the efficiency of drug discovery.

Results

Development of a Pose-Dependent Protein–Ligand Property Correlation Descriptor

Currently, the top performing deep learning scoring algorithms that predict binding affinities from protein–ligand complexes are CNNs that encode neighboring ligands and receptor atoms spatially and/or chemically (e.g., hydrogen bond donor/acceptor heuristics).[31,32] One critique of these CNNs is that test-set performance can be attributed to learning ligand-specific features and not the protein–ligand interface features.[41] In other words, the neural network can perform well on the tests simply by learning the biases in the ligand datasets. To avoid any such potential limitations here, we developed a pose-dependent protein–ligand interaction descriptor based on sign-aware 3D autocorrelations (3DAs). This descriptor can be likened to a potential of mean force profile in which the collective variables are the pairwise interatomic distances between the protein and ligand atoms for specific chemical properties/heuristics.

Small Molecule Chemical Property Autocorrelations

Consider a property-weighted 3D autocorrelation (3DA) function for a single small molecule. An atom-based property allows the 3DA to represent the spatial distribution of properties of interestwhere ra and rb are the boundaries of the current distance interval, N is the total number of atoms in the molecule, r is the distance between the two atoms being considered, δ is the Kronecker delta, β is a smoothing parameter referred to as “temperature”,[15,43] and P is the property computed for each atom. 3DAs computed for signed properties (e.g., partial charge) contain, for each distance interval, three values corresponding to product sums of each of the three possible sign pairings (−/–, +/+, −/+).[14]

Recasting Property Space into Protein–Ligand Interaction Distance Bins

Instead of corresponding to intramolecular atomic distances, the distance bins now correspond to intermolecular protein–ligand interatomic distances. The property correlation is between each atom in the ligand and all atoms in the receptor within a specified radius (Figure )where ra and rb are the boundaries of the current protein–ligand interatomic distance interval, Nlig and Nprot are the total number of atoms in the ligand and receptor, respectively, r is the distance between the current protein–ligand atom pair, δ is the Kronecker delta, β is the temperature, and P and P are the properties computed for ligand and receptor atoms l and p, respectively. As with 3DA in eq , protein–ligand correlation (PLC) descriptors distinguish signed pairs but can also optionally include an additional bin (−–/++/–+/+−) to account for opposite sign pairings between the protein and the ligand (Figure A). This can be useful if the properties between which the correlations are being taken are not identical or if the model being built is leveraging pre-existing knowledge about the chemical makeup of the system in the study.

Figure 1

Schematic of a pose-dependent protein–ligand descriptor. (A) Schematic representation of pose-dependent protein–ligand interaction feature space. (B) Surface representation of discoidin domain receptor 1 (DDR1) kinase binding pocket heavy atoms within 7.0 Å of select atoms within dasatinib. The surface representation is colored by the distance to the selected atom. Dasatinib shown in stick configuration colored by element type with the selected atom is indicated by a dot sphere. For example, consider the descriptor “HBondDonorTernary”. This descriptor returns a 1 if an atom is a hydrogen bond donor, −1 if it is a hydrogen bond acceptor, and 0 otherwise (Table S1). One could choose to differentiate hydrogen bond donor/acceptor pairs between the protein and the ligand (e.g., asymmetric: −+/+−) or to group all opposite sign pairs together (symmetric −/+). Sign pair discrimination is illustrated in Figure A for a property that tracks the protein–ligand directionality of opposite sign pairings. We empirically chose a total distance of 7.0 Å discretized at 0.50 Å intervals, resulting in either 42 (symmetric) or 56 (asymmetric) values per property (see the subsection on feature parameterization in Methods and the Supporting Information).

Representing Protein–Ligand Interactions with Property Correlation Descriptors

PLC descriptors (eq ) encode interactions between protein and ligand atomic atoms as represented by a variety of atomic properties: partial charge, electronegativity, polarizability, hydrophobicity, hydrogen bond donors and acceptors, aromatic and generic ring membership, heavy and light atoms (Table S1). These atomic features are a superset of those we used previously for QSAR,[5,14] and are identical to those we used previously for the superimposition of similar molecules.[44] To mitigate feature redundancy, we summed feature interactions that were nominally equivalent. For example, consider the PLC descriptor that represents the signed correlation between atomic partial charges in receptor and ligand atoms: 3DAPairRS050(Atom_SigmaCharge) (Tables S2 and S3). In this descriptor, we summed −/+ (ligand negative charge, protein positive charge) with +/– (ligand positive charge, protein negative charge) interactions under the notion that these are equivalently favorable pairings. We took a similar approach for hydrogen bond donation, hydrophobic interactions, and heavy atom/hydrogen atom discrimination. Some descriptors, such as polarizability and electronegativity, are strictly positive valued, and therefore do not require binning by sign pairs (Tables S2 and S3). While each of the previously mentioned descriptors can be considered symmetric in that we are correlating the same property for both the receptor and the ligand (e.g., partial charge), interactions can also be described by complementary interactions between dissimilar chemical properties. For example, interactions between aromatic ring systems and polar vs hydrophobic atoms. To create a property that can describe this interaction, we need to utilize Atom_HydrophobicTernary, which is an atom property that encodes hydrophobic atoms as +1, and polar atoms as −1. To better distinguish highly polar from less polar atoms, we multiply Atom_HydrophobicTernary by polarizability. We then encode aromatic–polar, aromatic–hydrophobic interactions with the PLC descriptor, “3DAPairRSAsym050(Multiply(Atom_HydrophobicTernary, Atom_Polarizability), Atom_IsInAromaticRingTernary)”. In this descriptor, each distance bin is further discretized into −/– (ligand polar atom polarizability with a nonaromatic receptor atom), +/+ (ligand hydrophobic atom polarizability with an aromatic receptor atom), −/+ (ligand polar atom polarizability with an aromatic receptor atom), and +/– (ligand hydrophobic atom polarizability with a nonaromatic receptor atom) (Tables S2 and S3). An inverted version of this descriptor, in which hydrophobicity is with respect to the receptor and aromaticity to the ligand, is also employed here. With these features, we trained two neural networks. BCL-AffinityNet is a “deep” single-task neural network (2 hidden layers, 512 neurons in the first hidden layer, and 32 neurons in the second layer) to directly predict log-scaled protein–ligand binding affinity values. BCL-DockANNScore is a multitasking shallow neural network (1 hidden layer with 32 neurons) that classifies binding poses as less-or-equal to 1.0, 2.0, 3.0, 5.0, or 8.0 Å from the native (cocrystallized) binding mode. Both of these models utilize only PLC descriptors (eq ), with BCL-DockANNScore, including an additional PLC descriptor that discretizes hydrogen bond donor/receiver pair angles (Table S3; see the Supporting Information for details). Finally, we note that we did not perform a deep exploration of possible base chemical descriptors and there are likely many additional features that could be effective (e.g., explicit consideration of π-interactions, σ-hole interactions, transition metal properties, solvation energies, etc.). Additionally, we did not perform feature selection to optimize the performance of our model on the benchmark training sets to avoid potentially over-optimizing the models for the training data. For a detailed evaluation of the importance of each feature in BCL-AffinityNet and BCL-DockANNScore, please see the top 20 features by model input sensitivity (Tables S4 and S5) and decomposition of each descriptor into the average input sensitivity per sign pair (Figures S1–S8) in the Supporting Information.

Scoring Power Evaluation of BCL-AffinityNet

We trained BCL-AffinityNet on protein–ligand complexes from the PDBbind v.2016 refined set and all general set protein–ligand (small molecule) complexes for which binding constants were available. Protein–ligand pairs comprising the coreset (285 unique test-set complexes) were entirely excluded from training. BCL-AffinityNet was trained with descriptors of the form eq . See the Supporting Information for a sample feature code object file and command lines to generate the model. We first tested the performance of BCL-AffinityNet on the scoring power task described in the comparative assessment of score functions 2016 update (CASF2016). This task evaluates affinity prediction across the PDBbind v.2016 coreset comprised of 285 protein–ligand pairs on 57 targets (5 small molecules per target) by measuring the Pearson correlation coefficient (R) between the predicted and experimental values. It has previously been noted that binding affinities in this task correlate strongly with both the fraction of buried solvent accessible surface area (ΔSAS, R = 0.63) (Figure A)[22] and several scalar ligand descriptors, including molecular weight (MW, R = 0.50), topological polar surface area (TPSA, R = 0.20), log P (R = 0.32), and polarizability (R = 0.52) (Table S1). An important measure of success is whether or not the affinity prediction method is capable of performing better than these simple metrics that are unaware of specific protein–ligand interactions.

Figure 2

Scoring power evaluation of BCL-AffinityNet. (A) Comparison of BCL-AffinityNet scoring power to other methods from the CASF2016 benchmark by Su et al.[22] Error bars indicate the 90% confidence interval. (B) Linear regression of experimental vs predicted pKd values in the CASF2016 coreset. BCL-AffinityNet is among the best algorithms on the scoring power task (R = 0.84) (Figure A,B). ΔVinaRF20, which is a protein–ligand interaction score function that uses a random forest (RF) algorithm to predict an error correction term on the AutoDock Vina score, performed similarly on the original CASF2016 report (Figure A).[22] However, as reported previously,[22] the training set for ΔVinaRF20 includes 140 of the coreset test complexes. Lu and colleagues re-evaluated the scoring power of ΔVinaRF20 after retraining it without any of the coreset complexes and found that it is still performed better than ΔSAS but with worse scoring power than originally reported (R = 0.73).[45] BCL-AffinityNet performs competitively with other machine learning models, such as the grid-based CNN KDEEP (R = 0.82) and RF-Score (R = 0.80) (Table S6). We note that KDEEP was evaluated on the 290 molecule version of the PDBbind coreset, not the canonical 285 molecule set. Moreover, in the absence of the underlying distributions, it is unclear if these results are statistically different; however, the effect sizes are similar.

Explicit Assessment of Dataset Bias on BCL-AffinityNet Scoring Power Performance

It is increasingly well documented that strong machine learning model performance on QSAR tasks can be the result of dataset bias.[41,46,47] Indeed, Yang et al. found that atomic CNNs (ACNNs) trained solely on ligand or receptor pocket features performed just as well as ACNNs trained on protein–ligand complexes,[46] suggesting that the model was unable to leverage features relating to the protein–ligand interactions in a meaningful way. Therefore, we sought to determine the extent to which dataset biases may be inflating BCL-AffinityNet performance. First, we trained a BCL-AffinityNet Y-scramble model, in which the result labels were shuffled between training examples. The Y-scramble model is a negative control, and as expected, we find virtually no correlation between predicted and experimental results on the coreset with this model (Figure S9). Next, we generated LB and pocket-based QSAR models with the same architecture as BCL-AffinityNet. These models were trained with the 3DA descriptor equivalent of the PLC features. In an ideal dataset, ligand and protein pocket controls would have near-zero correlation to experimental results; however, consistent with the findings of Yang et al.,[46] the LB and pocket-based QSAR models each had correlation coefficients greater than 0.50 at 0.72 and 0.61, respectively (Figure S9). To assess the impact of dataset bias on our PLC model performance for out-of-class predictions, we generated three new leave-class-out test-set splits based on the ligand, protein pocket, or combined ligand and protein pocket similarity to the PDBbind v.2016 coreset. Specifically, we generated a k-means (k = 75) applicability domain (AD) model from the 3DAs of the ligands, protein pockets, or combination of ligands and protein pockets of the PDBbind v.2016 coreset. Using each of these AD models, we removed training samples that were further from their nearest Kohonen map node than the furthest point of the PDBbind v.2016 coreset was from the AD model. Intuitively, the new test sets thus include only points that are outside the nominal descriptor space given by the PDBbind v.2016 coreset for ligands, protein pockets, or combination ligand–protein pockets. This has the effect of making the training set feature space more representative of the PDBbind v.2016 coreset feature space while simultaneously creating new test sets that are outside PDBbind v.2016 coreset feature space. This resulted in the creation of a LB AD test set (n = 995), pocket AD test set (n = 379), and combined AD test set (n = 1377) (see Methods for additional details). We hypothesized that the LB QSAR model would perform poorly on the LB AD test set, that the pocket-based QSAR model would perform poorly on the pocket AD test, and that both models would perform poorly on the combined AD test set. We further hypothesized that if models trained on PLC descriptors are truly generalizable SB score functions, then their performance on all three test splits ought not to be significantly worse than their training random-split cross-validation metrics. We found that the LB QSAR models performed worse on the LB AD test set (R = 0.28) than on the random-split training cross-validation sets (R = 0.67) (Figure S10). Similarly, the pocket-based QSAR model performed worse on the pocket AD test set (R = 0.33) than on the training splits (R = 0.63) (Figure S11). We also note a reduction in the performance of the pocket-based QSAR model on the LB AD test set relative to training (R = 0.51 vs 0.64, respectively), as well as a reduction in the performance of the LB QSAR model on the pocket AD test set relative to training (R = 0.54 vs 0.65, respectively) (Figures S10 and S11). On the combined AD test set, we observe the worst performance of the LB (R = 0.28) and pocket-based (R = 0.15) QSAR models (Figure ).

Figure 3

Performance evaluation on the combined AD test set. A total of 1377 training samples were excluded from the initial training set of 7568 samples (see Methods for details). The remaining 6191 training samples were used to train BCL-AffinityNet (i.e., a single-task regression DNN with PLC features), a signed 3DA LB QSAR model, or a signed 3DA pocket-based QSAR model. The training was completed with fivefold random-split cross-validation. Columns and error bars represent the mean and standard deviation of normalized mean absolute error (NMAE) (blue) or Pearson correlation coefficient (red) across either the fivefold random-split cross-validations (training) or fivefold random splits of the combined AD test set (testing). In contrast, we observed that BCL-AffinityNet, when retrained to exclude each AD test set, consistently performs well (R = 0.72, 0.75, and 0.72 for the LB, pocket-based, and combined AD test sets, respectively) despite the reduced training set size and coverage (Figures , S10, and S11). To evaluate whether PLC descriptors are effective with other machine learning model types, we have utilized WEKA[48] to train a random forest version of BCL-AffinityNet (termed AffinityRF for ease) for evaluation on the PDBbind v.2016 coreset and the combined AD test split. AffinityRF achieves a good correlation (R = 0.79 and 0.70, respectively) on both tasks, suggesting that PLC descriptors may be suitable in multiple machine learning paradigms (Figure S12). Altogether, these results suggest that the PLC descriptors encode generalized representations of protein–ligand interactions.

Performance Evaluation on Subsets of the CSAR-NRC HiQ Test Sets

As additional independent tests, we evaluated the performance of our models on the CSAR-NRC HiQ test sets. For the purposes of a head-to-head comparison with two of the leading machine learning methods in the field, KDEEP and RF-Score, we first compared our model to the 55 and 49 compounds of the CSAR-NRC HiQ test sets 1 and 2, respectively, which were previously evaluated for KDEEP and RF-Score in Jiménez et al.[32] For this evaluation, we retrained our models with the PDBbind set as described previously, but we also excluded any of the 55 or 49 compounds found in the CSAR test set from training. RF-Score performed the best on set 1 (R = 0.78, root mean square error (RMSE) = 1.99) with KDEEP (R = 0.72, RMSE = 2.09) and BCL-AffinityNet (R = 0.72, ρ = 0.77, RMSE = 2.02) performing similarly to one another (Table ). In contrast, BCL-AffinityNet is the top performing model (R = 0.85, ρ = 0.82, RMSE = 1.37) on set 2, followed by the RF-Score (R = 0.78, RMSE = 1.66) and KDEEP (R = 0.65, RMSE = 1.92) (Table ).

Table 1

Performance Evaluation of Models Trained on PDBbind Refined Version 2016 Dataset on Unique Complexes in the CSAR-NRC HiQ Test Sets. Gray shading indicates features/methods developed in this manuscript. Green shading indicates scalar properties employed a controls. Blue shading indicates comparisons to previously reported results with other software.c

As reported in Jiménez et al.[32]

Not reported.

As reported in Jiménez et al.[32] Not reported. Results reported as Pearson correlation coefficient (R), Spearman rank correlation coefficient (ρ), and root mean square error (RMSE). Note that the Spearman rank correlation here is across all targets in the coreset, while the “ranking power” metric is based on within-target ranking of molecule affinities. Next, in the interest of obtaining a more complete benchmark and facilitating future comparisons, we extended our evaluation of the CSAR-NRC HiQ test sets to the full molecule lists, which included 176 and 167 molecules in sets 1 and 2, respectively. Again, we retrained our models on the PDBbind set, excluding now either the 176 or 167 compounds in test set 1 or 2 in addition to the remaining molecules in the 285 compounds from the coreset. Performance of BCL-AffinityNet on set 1 (R = 0.75, ρ = 0.75, RMSE = 1.32) is very similar to performance on set 2 (R = 0.74, ρ = 0.73, RMSE = 1.36) (Table ).

Table 2

Performance Evaluation of Models Trained on PDBbind Refined Version 2016 Dataset Sans CSAR-NRC HiQ Complexes on All Complexes in the CSAR-NRC HiQ Test Sets. Gray shading indicates features/methods developed in this manuscript. Green shading indicates scalar properties employed a controls.a

Results reported as Pearson correlation coefficient (R), Spearman rank correlation coefficient (ρ), and root mean square error (RMSE). Note that the Spearman rank correlation here is across all targets in the coreset, while the “ranking power” metric is based on within-target ranking of molecule affinities. A summary of BCL-AffinityNet’s scoring power performance on the PDBbind v.2016 coreset, the CSAR-NRC HiQ I and II subsets from Jiménez et al.,[32] and the full CSAR-NRC HiQ I and II sets are shown in Figure S13.

Ranking Power Performance Evaluation

The CASF2016 ranking power evaluation analyzes the ability of score functions to rank ligands targeting the same receptor. Among the methods originally compared in Su et al.,[22] BCL-AffinityNet (ρ = 0.69) places just after ΔVinaRF20 (ρ = 0.75) (Figure ). Again taking into consideration Lu et al. retraining ΔVinaRF20 to exclude the 140 overlapped test-set compounds, ΔVinaRF20 achieves a ranking power ρ = 0.63 compared to ΔVinaXGB, which achieves a ranking power of ρ = 0.65.[45]

Figure 4

Ranking power evaluation of BCL-AffinityNet. Comparison of BCL-AffinityNet ranking power to other methods from the CASF2016 benchmark by Su et al.[22] with (A) Spearman rank correlation coefficient, (B) Kendall rank correlation coefficient, and (C) predictive index. Error bars indicate the 90% confidence interval. Green bars indicate BCL-AffinityNet. Altogether results on the scoring power and ranking power tests suggest that BCL-AffinityNet is competitive with state-of-the-art SB virtual screening methods for binding affinity prediction and affinity ranking.

Docking Power Performance Evaluation

Despite its success in the scoring and ranking power evaluations, BCL-AffinityNet is not ideally suited for decoy discrimination. This is because the training set for BCL-AffinityNet is composed entirely of native protein–ligand complexes. Thus, while BCL-AffinityNet could likely be used with an AD model generated in the same feature space to exclude clashed structures (by virtue of the lack of occupancy in the shortest distance bins, Figure A), it is unlikely to be able to discriminate plausible docking poses. To address this limitation, we built a shallow multitasking ANN trained with the same PLC descriptors (eq ) as BCL-AffinityNet, with the addition of the hydrogen bond angle descriptor described above (see the Supporting Information for a sample code object file). We reasoned that in differentiating properly docked poses, it would be insufficient to consider only hydrogen bond donor/acceptor distances. In our experience, we have also found that separating a categorical prediction task (i.e., is this pose most likely to be 1.0 Å from the native pose, 2.0, or 5.0 Å) into separate classification tasks for each category generally does not worsen model performance but may improve it. Thus, we nominally organized the output layer as five correlated classification tasks: determining whether a pose was less than 1.0, 2.0, 3.0, 5.0, and 8.0 Å from the native pose. We trained this ANN on the PDBbind v.2016 refined set, excluding all coreset protein–ligand complexes. For each complex in the training set, 250 additional decoys were generated with RosettaLigand (see Methods for details). The final model score, which we refer to as BCL-DockANNScore, is the product of the classification probability of a pose being less than 2.0 Å from the native pose (referred to elsewhere as a probability calibration curve[49,50]) and the BCL-AffinityNet affinity prediction score for that pose. BCL-DockANNScore performs reasonably well on the docking power benchmark with success rates of 0.81, 0.91, and 0.95 for native pose recovery at a 2.0 Å threshold for poses within the best scoring 1, 2, and 3 poses, respectively (Figure ). When native poses are excluded, BCL-DockANNScore success rates reduce by ∼5%, consistent with performance reductions in multiple other methods (Figure S15). Binding funnel analysis of BCL-DockANNScore demonstrates good Spearman rank correlation coefficients at wide RMSD ranges but performs less well in the 0–2.0 Å range (Figure S16). This suggests that one possible route to improve BCL-DockANNScore further is to provide additional training decoys within the 0–2.0 Å range or additional high-resolution descriptors.

Figure 5

Docking power evaluation of BCL-DockANNScore. Comparison of BCL-DockANNScore docking power to other methods from the CASF2016 benchmark by Su et al.[22] when recovering the native pose under 2.0 Å RMSD (A) within the top 3 poses, (B) within the top 2 poses, and (C) within the top 1 poses. Error bars indicate the 90% confidence interval. Green indicates BCL-DockANNScore. Overall, these results are especially encouraging because the decoy poses for the docking power benchmark are generated with a combination of GOLD[51] version 5.2, SYBYL’s Surflex,[52] and Chemical Computing Group’s Molecular Operating Environment (MOE) docking algorithm,[53] while our training decoys were generated with RosettaLigand. It suggests that the BCL-DockANNScore feature space (i.e., PLC descriptor space) is not overly dependent on RosettaLigand poses and can be used to build modular score functions.

Screening Power Performance Evaluation

We evaluated BCL-DockANNScore on the forward and reverse screening tests. The forward screening power task evaluates the ability of a score function to identify small molecule ligands that bind to a target protein. The reverse screening power task evaluates the ability of a score function to identify the protein that most effectively binds a small molecule ligand (i.e., cross-docking).[22] Similar to the docking power evaluation, we find that BCL-DockANNScore performs reasonably well, but not always among the very best docking scores. On the forward screening task, BCL-DockANNScore has a success rate of 0.18, 0.33, and 0.58 when identifying the ligand amongst the top 1, 5, and 10% of candidates, respectfully (Figure A). This is competitive with the best score functions at the 10% level; however, performance at the 1% level is more mid-tier (ranking alongside several of the MOE score functions, while the top performers are from GOLD, Glide, and the AutoDock Vina and derived methods). The overall enhancement factor at the 1% level is 8.5 (Figure C). In contrast, we find that the performance on the reverse screening task is competitive even with the top performers when identifying the top 1, 5, and 10% of candidates, with success rates of 0.15, 0.24, and 0.39, respectively (Figure B).

Figure 6

Screening power evaluation of BCL-DockANNScore. Comparison of BCL-DockANNScore screening power to other methods from the CASF2016 benchmark by Su et al.[22] (A) Forward screening power evaluation success rates, (B) reverse screening power evaluation success rates, (C) forward screening power evaluation enhancement factor (top 1%). Error bars indicate the 90% confidence interval. Green indicates BCL-DockANNScore.

Generating Absolute Pharmacophore Maps

Finally, one important consideration in the development of a SB score function for the BCL was model interpretability. One of the strengths of SB CADD is that the predicted changes in activity can be attributed to specific interactions with the target. Neural networks are, however, often negatively characterized as “black boxes” because usually the function learned in the model cannot be decomposed into human-interpretable parts. Traditional docking scoring functions, such as RosettaLigand, have the advantage that they can be decomposed into target per-residue contributions to the overall predicted affinity. This is important in drug discovery, where predictions need to be actionable. Here, we demonstrate that BCL-AffinityNet predictions can be decomposed into a map of atom contributions to the predicted bioactivity. We take two general approaches for constructing a pharmacophore map: (1) absolute feature contributions (Figure ) and (2) relative feature contributions (Figure ). The first case generates a map on any individual molecule by evaluating the contributions of specific atoms to the overall predicted activity. This can be likened to evaluating model input sensitivity, except in this case, the molecule of interest is being perturbed instead of the weights connecting individual neurons in the model.

Figure 7

Figure 8

Construction of relative pharmacophore maps. Relative pharmacophore maps are generated from a target molecule and a reference molecule. (A) Determine the MCS between the reference and target structure. (B) Identify the MCS atoms that connect to the corresponding non-MCS substructures in both the reference and target molecules. Non-MCS atoms are circled in gray, and the corresponding substructures between the reference and target share numerical labels (e.g., the reference molecule methyl circled in gray and the target molecule trifluoromethyl circled in gray correspond structurally and are labeled “1”). For both the reference and target molecules, non-MCS substructures are independently removed. The binding affinities of the reference, target, and perturbed molecules are estimated with BCL-AffinityNet. The ΔΔGbind between starting and perturbed molecules is determined for both the reference and target. (C) For each corresponding non-MCS substructure, compute ΔΔΔGbind as ΔΔΔGbind = ΔΔGbind – ΔΔGbind, where X indicates the perturbed target or the reference molecule. (D) Map the ΔΔΔGbind values back to the target molecule non-MCS substructures. The surface representation of atoms that contribute beneficially to BCL-AffinityNet’s binding affinity prediction is blue, while atoms that worsen the prediction are in red. Atoms that contribute neutrally/negligibly are white.

Construction of absolute pharmacophore maps. (A) The target molecule, in this case, compound 7c from Zhu et al.,[54] is first modeled in complex with its target receptor using PLC descriptors and scored with BCL-AffinityNet. (B) Then, we iterate over each atom in the target molecule and sequentially remove it from the molecule to create a perturbed molecule, X. (C) Perturbed molecules are saturated with hydrogen atoms to close any open valences resulting from the perturbation, and then they are scored with BCL-AffinityNet. (D) Differences in predicted binding affinity between the starting molecule and each perturbed molecule are mapped to the corresponding atoms of the starting structure. Here, predictions are in units of kcal/mol at 300 K. The surface representation of atoms that contribute beneficially to BCL-AffinityNet’s binding affinity prediction is blue, while atoms that worsen the prediction are in red. Atoms that contribute neutrally/negligibly are white. Construction of relative pharmacophore maps. Relative pharmacophore maps are generated from a target molecule and a reference molecule. (A) Determine the MCS between the reference and target structure. (B) Identify the MCS atoms that connect to the corresponding non-MCS substructures in both the reference and target molecules. Non-MCS atoms are circled in gray, and the corresponding substructures between the reference and target share numerical labels (e.g., the reference molecule methyl circled in gray and the target molecule trifluoromethyl circled in gray correspond structurally and are labeled “1”). For both the reference and target molecules, non-MCS substructures are independently removed. The binding affinities of the reference, target, and perturbed molecules are estimated with BCL-AffinityNet. The ΔΔGbind between starting and perturbed molecules is determined for both the reference and target. (C) For each corresponding non-MCS substructure, compute ΔΔΔGbind as ΔΔΔGbind = ΔΔGbind – ΔΔGbind, where X indicates the perturbed target or the reference molecule. (D) Map the ΔΔΔGbind values back to the target molecule non-MCS substructures. The surface representation of atoms that contribute beneficially to BCL-AffinityNet’s binding affinity prediction is blue, while atoms that worsen the prediction are in red. Atoms that contribute neutrally/negligibly are white. To generate an absolute pharmacophore map of a given molecule, we perturb the chemical structure by sequentially removing individual atoms and closing the newly opened valence(s) with hydrogen atoms. Afterward, we compute the predicted affinity for each perturbed molecule with BCL-AffinityNet. The predicted binding affinity of the perturbed molecules is compared to that of the original molecule. The differences in predicted activity between the perturbed and original molecules are assigned to the corresponding atoms (Figure ).

Generating Relative Pharmacophore Maps

Relative pharmacophore maps leverage structural similarity within a congeneric ligand series to attribute predicted affinity differences to specific substructures. It has been shown that highly accurate binding affinity estimates can be obtained with alchemical free energy methods when reference structures with experimentally determined binding affinities within a congeneric ligand series are available.[19−21] To generate a relative pharmacophore map between two molecules, we first identify a common substructure (MCS) via one of two methods: (1) identify the largest subgraph isomorphism between the two molecules or (2) assign spatially mutually matched atoms to be common to one another (the first approach is more accurate and is the default approach). Component substructures that graphically correspond to the same common atoms are then iteratively removed, newly opened valences are closed with hydrogen atoms, and the perturbed molecules are scored with BCL-AffinityNet (Figure ). Thus, for each non-MCS substructure in the reference and target molecules, there is a ΔΔGbind between the nonperturbed and perturbed molecules. A final ΔΔΔGbind is computed for each non-MCS substructure as the difference between the reference and target perturbation ΔΔGbind values (Figure ). The ΔΔΔGbind values are mapped to the target molecule for visualization. Consider a series of type II tyrosine kinase inhibitors (TKIs) of DDR1 kinase developed recently by Zhu et al.[54] We generated relative pharmacophore maps of compounds 7c, 7f, and 7j to compound 7i[54] (Figure A–D). We also modified the compound 7 scaffold to include N → C mutations in the hinge-binding region analogous to prior substitutions done by Wang et al.[55] in a previous DDR1 TKI series (Figure A,E–G).

Figure 9

Relative pharmacophore maps of a congeneric DDR1 inhibitor series. (A) Compound 7i is the reference molecule for the creation of the pharmacophore maps. Compounds (B) 7j, (C) 7f, and (D) 7c from Zhu et al.[54] Compounds with the N → C alteration at (F) the hinge-binding nitrogen atom, (E) the symmetrically placed hinge-binding nitrogen rotated away from the hydrogen bond donor partner, and (G) both nitrogen atom positions at the hinge-binding ring. Binding affinities in black text are predicted by BCL-AffinityNet, while green values are from Zhu et al.[54] The surface representation of atoms that contribute beneficially to BCL-AffinityNet’s binding affinity prediction is blue, while atoms that worsen the prediction are in red. Atoms that contribute neutrally/negligibly are white. From the pharmacophore maps, we also compute relative binding affinities of each molecule to compound 7i by summing the ΔΔΔGbind values for each non-MCS component in the target molecule: ΔΔGbind = ∑ΔΔΔGbind. In all comparisons, the trifluoromethyl group is preferable to the methyl. Relative binding affinity estimates of compounds 7c and 7f from 7i are within 0.50 kcal/mol of experimental values (−2.62 vs −2.82 kcal/mol and −2.32 vs −2.25 kcal/mol, respectively) (Figure A,C,D). The ethyl in 7j is also correctly estimated to improve binding affinity relative to methyl in 7i; however, BCL-AffinityNet underestimates the extent of the affinity improvement (−0.69 vs −2.25 kcal/mol) (Figure A,B). Conversion of both hinge-binding nitrogen atoms to carbon atoms is strongly unfavorable even in the presence of the trifluoromethyl group, consistent with prior SAR[55] (Figure A,G). Thus, the relative pharmacophore maps provide meaningful QSAR insights that can be readily visualized. Relative pharmacophore maps can be generated with respect to one or more reference input molecules (e.g., hit compounds or scaffolds) or in a pairwise manner across a series of input molecules. If more than one molecule is used as a reference, the final map for each target molecule indicates the favorability of each molecule’s substitutions in comparison to the whole ensemble. For example, command-line to generate a relative pharmacophore map, see the Supporting Information.

Case Study on Guiding Chemical Modifications with Pharmacophore Maps

To further illustrate this approach, consider three congeneric dysiherbaine analogues in complex with ionotropic glutamate receptor 5 (iGluR5). These molecules differ from one another by small substitutions at carbon atoms (1) and (2) (Figure A–C, first row). Each of the analogues was scored with BCL-AffinityNet and ranked correctly. For each of these three compounds, we generated absolute and relative pharmacophore maps (see Methods for command-line details).

Figure 10

Pharmacophore maps of dysiherbaine analogues in complex with iGluR5 generated from BCL-AffinityNet. Pharmacophore maps were generated for iGluR5 complexed with (A) 8,9-dideoxyneodysiherbaine (PDB ID 3GBB; pKd = 6.9, ΔG = −9.79 kcal/mol at 310 K), (B) neodysiherbaine (PDB ID 3FV2; pKd = 8.1, ΔG = −11.49 kcal/mol at 310 K), and (C) dysiherbaine (PDB ID 3FV1; pKd = 9.3, ΔG = −13.19 kcal/mol at 310 K) and mapped onto the native bound pose. Labeled yellow transparent circles in the top panel are used to reference the substituted carbon atoms of interest. Per atom pharmacophore map scores are output to a PyMol script for visualization as a molecular surface colored on a per atom basis by spectrum from blue (negative) to white (zero) to red (positive). In this example, negative values indicate atoms whose removal results in a loss in predicted binding affinity. The second row illustrates each ligand in complex with iGluR5. The third row illustrates the common substructure pharmacophore map (i.e., pairwise per-substructure relative binding free energy changes). The fourth row illustrates the raw pharmacophore map for each ligand upon sequentially removing individual atoms and saturating open valences. First, we generated relative pharmacophore maps of the dysiherbaine analogues in the pairwise manner described above (Figure ). The pharmacophore maps of dysiherbaine and neodysiherbaine suggest that the methylamine and hydroxyl substitutions, respectively, at position (2) provide a net increase in affinity relative to the proton in 8,9-dideoxyneodysiherbaine (Figure A–C, third row). Furthermore, the pharmacophore maps predict that the methylamine modification increases binding affinity more than the hydroxyl substitution, in agreement with experimental observation (Figure B,C, third row). Interestingly, the relative pharmacophore map of neodysiherbaine also predicts that the hydroxyl substitution at position (2) is more important for binding affinity than the hydroxyl substitution at position (1) (Figure B, third row). Similarly, the methylamine at position (2) of dysiherbaine is predicted to contribute more to the binding affinity than the hydroxyl at position (1) (Figure C, third row). Finally, we see from the absolute pharmacophore maps of all three analogues that the two carboxylic acid groups contribute favorably to the binding. Indeed, we see that their contributions are predicted to be more important than the substitutions at (1) and (2), supporting the notion that these substituents are an important component of the conserved scaffold (Figure A–C, fourth row). Together with the DDR1 TKI congeneric series, these comparisons illustrate how BCL-AffinityNet can yield structure–activity insight. To our knowledge, this is the first modern machine learning-based SB score function that is readily accompanied by an interpretable decomposition scheme. In principle, our pharmacophore mapping procedure is compatible with any LB or SB machine learning score function in the BCL. Thus, these results demonstrate a fast and simple approach to generate interpretable pharmacophore maps from BCL machine learning model predictions.

Discussion

Here, we develop a novel machine learning-based score function for vHTS SB scoring. Our approach centers around the development of novel protein–ligand signed property correlation descriptors. In addition to the new descriptors, our models avoid the use of ligand-specific features to reduce the risk of training dataset bias. The new models, BCL-AffinityNet and BCL-DockANNScore, have been evaluated on current best practice benchmarks and compared to other standard and leading methods. BCL-AffinityNet generally performs on par with or better than currently available SB virtual screening scores in affinity prediction and affinity ranking. BCL-DockANNScore, while generally not as good as GOLD, Glide, or the AutoDock Vina and derived methods at pose recovery or screening, performs competitively with respect to all of the evaluated methods. We therefore suggest that it may be a generally useful SB scoring algorithm with especially strong affinity prediction. Indeed, some of the best methods for docking and screening failed to provide estimates for power scoring (e.g., statistics for GlideScore-XP are based on 258/285 protein–ligand pairs, GlideScore-SP 252/285, GoldScore@GOLD 244/285).[22] Thus, when considering all of the tasks together (scoring power, ranking power, docking power, and screening power), the new SB scoring models in the BCL demonstrate the utility of our novel signed property protein–ligand correlation descriptors for SB CADD. Moreover, BCL-AffinityNet and BCL-DockANNScore represent the first instantiation of SB scoring in the BCL. While a number of algorithms consider multiple ligand-specific descriptors in their feature space alongside the protein–ligand interaction features (e.g., AutoDock Vina incorporates, e.g., the ligand length, number of hydrophobic atoms, etc.; both ΔVinaRF20 and ΔVinaXGB include ligand-specific pharmacophore features; ΔVinaXGB includes an estimate of ligand conformational stability; KDEEP contains ligand-specific voxels colored by pharmacophore features),[22,29,32,45] we made a conscious decision to avoid inclusion of such features in BCL-AffinityNet and BCL-DockANNScore. This was done to reduce the ligand bias of the models and hopefully yield a more generalizable score function. Nevertheless, efforts are underway to incorporate other aspects of protein–ligand binding affinity other than just interaction score terms into the BCL-AffinityNet and BCL-DockANNScore in an unbiased manner. These include improvements to both the neural network architectures employed here as well as the incorporation of efficient metrics for solvation energy, ligand conformational preference, and entropy changes. An important limitation of our work is that all models were trained in the absence of explicit water molecules, metal ions, and/or other cofactors. Others have recently demonstrated that the incorporation of explicit water molecules can improve model performance,[45] and future improvements to our model will incorporate these elements. As these updates are introduced, we will also continue to retrain the models leveraging the increasing availability of high-quality protein–ligand co-crystal structures with Ki/Kd data. Another limitation is the under-optimized protein–ligand interaction feature space of the current models. The generalizability of the PLC descriptors used to build BCL-AffinityNet and BCL-DockANNScore should not be conflated with completeness of the score function. By analogy, RosettaLigand with the Rosetta Talaris2014 score function[56] does not model halogen σ-hole interactions with aromatic ring systems and is thus unlikely to accurately determine the protein–ligand binding affinities of systems with these interactions. In the same way, BCL-AffinityNet and BCL-DockANNScore are incomplete representations of protein–ligand interactions. Further score function development will focus on expanding the availability of training data as well as describing additional salient chemical features. Ongoing work in the Meiler Lab is focused on the development of both LB and SB small molecule de novo design and focused library design algorithms. A critical motivator for the present work was the need for the BCL to have a rapid and flexible SB score function that can be deployed for design tasks where there is insufficient data to build a reliable LB QSAR model. BCL-AffinityNet and BCL-DockANNScore are fully integrated into the BCL descriptor framework, allowing them to be called and mathematically combined with a multitude of other features, including AD scores, ligand descriptors, and more. Another fundamental hurdle that we wanted to overcome was the so-called “black box” problem. This problem arises whenever the underlying feature space of the score function cannot be decomposed into human-interpretable parts, and it presents a major challenge when relying on complex score functions for rational drug design. In this manuscript, we have demonstrated a simple approach that can be employed with any score function in the BCL (machine learning or not) to convert predictions into all-atom pharmacophore maps. These pharmacophore maps can be generated with respect to underlying substructures or spatially matched atoms between different molecules, or they can be generated for individual molecules without a reference structure. We demonstrate how this can be accomplished with the BCL-AffinityNet score function for a series of congeneric DDR1 TKIs and dysiherbaine analogues. The relative pharmacophore maps provide an interpretable decomposition of affinity with respect to scaffold modifications that can be used to guide further molecule optimization. The absolute pharmacophore map procedure can tell the user which atoms are most salient to BCL-AffinityNet’s predictions. In addition to being a useful tool for interpreting machine learning score functions in the BCL, we anticipate that such pharmacophore maps will be valuable in automated drug design tasks. All of our models and applications for generating new models are freely available with an academic license for the BCL at http://meilerlab.org/. We hope that our descriptors and models may be integrated with future machine learning architecture development and descriptor optimization for the continued advancement of drug discovery.

Methods

Training Dataset Preparation

BCL-AffinityNet was trained using the refined set plus protein–ligand complexes from the general set of the PDBbind v.2016 dataset that satisfied the following criteria: (1) the ligand was a small molecule; (2) the binding affinity was measured as either Ki or Kd; and (3) all-atom types had defined Gasteiger atom types. The PDBbind v.2016 coreset was not included in the training set for any of the models for any of the performance evaluations. BCL-DockANNScore was trained using the refined set protein–ligand complexes from the PDBbind v.2016 dataset, excluding the 285 coreset compounds. For each protein–ligand complex in the PDBbind v.2016 refined set, 250 additional pose decoys were generated with RosettaLigand flexible docking with the Talaris2014 score function.[57−59]

Model Validation

Metrics for scoring power, ranking power, docking power, screening power, and confidence interval bootstrapping were performed with the scripts made available with download of PDBbind v.2016.[22] All models were trained with fivefold random-split cross-validation. The final model prediction value is the average prediction value obtained across all five splits (i.e., as opposed to selecting a single best model from the five splits). PDBbind v.2016 coreset complexes were always excluded from training. For other external test-set evaluations, the models were always retrained, excluding all test-set complexes explicitly. Thus, the final training set sizes for testing on the PDBbind 2016 coreset (n = 285), CSAR-NRC HiQ 1 Jiménez et al. subset (n = 55), CSAR-NRC HiQ 2 Jiménez et al. subset (n = 49), CSAR-NRC HiQ 1 full set (n = 176), and CSAR-NRC HiQ 2 full set (n = 167) were 7568, 7551, 7537, 7442, and 7440 (not every complex in the CSAR sets is in the PDBbind v.2016 set, hence the differences are not equivalent to 7568 – n). For comparisons to the CSAR-NRC HiQ benchmarks in Jiménez et al.,[32] complexes present in both the CSAR test sets and the PDBbind v.2016 refined subset were removed from the CSAR test sets. This resulted in two CSAR test sets of sizes 55 and 49, respectively, with the exact same PDB IDs as reported in the Supporting Information of Jiménez et al.[32] For our baseline assessment of ligand and receptor pocket bias on the PDBbind v.2016 coreset, we trained two DNNs identical in architecture to BCL-AffinityNet. For descriptors, we utilized the same chemical features, distance bins, and sign pairings as in the PLC descriptors, except we instead generated signed 3D autocorrelations of the ligand and/or receptor itself.[14] As inputs, we used the structures provided in the PDBbind v.2016 dataset such that the ligand-based DNNs were trained on the native poses of the ligands and the pocket-based DNNs were trained on the receptor binding pockets as extracted for inclusion in PDBbind v.2016.[22,60] For validation splits that explicitly address ligand and pocket bias of the training datasets, we generated k-means (k = 75) AD models of the PDBbind v.2016 coreset (n = 285) based on ligand 3DAs, pocket 3DAs, or column-combined ligand and pocket 3DAs (using the same descriptors that were used to create ligand- and pocket-based QSAR models; see the Supporting Information). We then scored all 7568 training set samples with each of these AD models. Previous studies on appropriate cutoffs for distance-based AD models have suggested that test-set samples further away from their closest node than 95–100% of the training samples can reliably be considered outside of the domain of applicability.[34,61] We therefore made three test-set splits (one for each AD model) containing all training samples that had AD scores greater than 1.0. The resulting test sets are those samples whose ligands, proteins, or ligands and proteins can be considered within the same AD as the PDBbind v.2016 coreset. Put another way, this creates larger PDBbind v.2016 coreset-like leave-class-out test-set splits based on the properties of the ligands, protein pockets, or combined ligands and protein pockets. We refer to these test sets, respectively, as LB AD test (n = 995), pocket AD test (n = 379), and combined AD test (n = 1377). For these evaluations, the total model training sample size is 7568 – n. For details on command-line syntax, see the Supporting Information.

Training Neural Networks for Affinity Prediction and Pose Discrimination

All neural networks were trained with the BCL. Our binding affinity prediction model, which we call BCL-AffinityNet, is a single-task, feed-forward regression neural network trained to predict pKi/d. While technically a “deep” neural network in that we utilize two hidden layers (512 and 32 neurons, respectively) instead of just one, BCL-AffinityNet is quite small compared to neural networks recently published for similar tasks.[31−33] Our pose prediction model, which we call BCL-DockANNScore, is a shallow (single hidden layer, 32 neurons) multitasking feed-forward classification neural network that predicts whether a protein–ligand pose is less than 1.0, 2.0, 3.0, 5.0, and 8.0 Å from the correct pose. Both networks can thus be formalized as follows. For a network with L hidden layers indexed l ∈ (1, ..., L), forward propagation for l ∈ (0, ..., L – 1) can be described aswhere is the output vector at layer l connected to the input vector z( at layer l + 1 by weights and biases and f is the transfer function applied to each set of inputs into the l + 1 layer. Correspondingly, the activation of a single neuron i in hidden layer l + 1 can be represented asto yield the output y( from layer l + 1. A mean-squared error (MSE) cost function was employed in all studies. Overtraining is prevented through the use of dropout in the input and hidden layers. During forward propagation, each output value y of each i neuron in the layer l of the ANN is randomly multiplied either by a value of 0 (corresponding to a “dropped” neuron) or 1.Here, is a vector with the same dimensions as whose values are either 0 (at fraction p) or 1 (at fraction 1 – p). At the end of every training batch, is shuffled. At test time, the corresponding weights are scaled down by the factor 1 – p. The BCL-AffinityNet DNN contains two hidden layers with 512 and 32 neurons, respectively. It was trained with a 5% dropout in the input layer, a 25% dropout in the first hidden layer, and a 5% dropout in the second hidden layer.[5] All neurons utilized a leaky rectifier transfer functionwhere x is the total input to a neuron. We utilized normalized mean absolute error (NMAE; defined as the quotient of mean absolute error and mean absolute deviation) as our objective function during training. The BCL-DockANNScore ANN contained a single hidden layer with 32 neurons. It was trained with 5% dropout in the input layer and 25% dropout in the hidden layer.[5] All neurons utilized a sigmoid transfer functionwhere x is the total input to a neuron. We utilized the area under the curve (AUC) as our objective function during training. The AffinityRF random forest model was trained with WEKA v.3.8.4 utilizing default settings.[48]

Feature Parameter and Neural Network Hyperparameter Tuning

Our adoption of 5% input layer dropout and 25% dropout in the first hidden layer (for both models) as well as the selection of a 32 neuron hidden layer prior to the output layer is based on extensive prior evaluation in Mendenhall et al.[5] For classification models, it has been shown that shallow networks often perform equivalently and sometimes better than deep networks at a substantially reduced training cost.[4] This, coupled with our own experience with QSAR classification tasks,[5,17,18] led us to use our previously utilized single hidden layer architecture for BCL-DockANNScore.[5] With respect to BCL-AffinityNet, we nominally selected the nearest power of 2 (29 = 512) to our input feature size as an upper limit for our first hidden layer size. We investigated two PLC descriptor feature parameters using fivefold random-split cross-validation with the DNN of this size: (1) the interaction bin distance and (2) the smoothing parameter β (eq ). We selected an initial smoothing parameter value of 5.0 based on prior 3DA QSAR investigations in which values greater than one were effective.[16−18] Subsequently, we varied the interaction bin distances at 1.0 Å intervals between 4.0 and 9.0 Å and compared NMAE and Pearson correlation across the cross-validation splits. Our results suggested that distances greater than 5.0 Å were best (Figure S17). In the interest of keeping our feature set relatively small, we selected 7.0 Å for our final models. Similarly, we varied the smoothing parameter between 0.1 and 10.0 at a fixed bin distance of 7.0 Å. We found that β values between 3.0 and 10.0 produced similar results (Figure S18); therefore, we retained a value of 5.0 for all additional studies. With the PLC parameters selected, we then performed additional fivefold random-split cross-validation studies to determine an appropriate first hidden layer size. We decreased the number of neurons from 512 by powers of 2 down to the size of the second hidden layer (32 neurons). For completeness, we also evaluated a shallow ANN ranging in size from 32 to 128 neurons using either a leaky rectifier (eq ) or sigmoid (eq ) transfer function. Generally, we observed that shallow and deep networks with smaller (32–64 neurons) first hidden layers performed the worst independent of transfer function. We also noted that two hidden layers seemed better than one, with little improvement in cross-validation performance between 256 and 512 neurons (Figure S19). We note that all cross-validation studies for PLC feature parameter and model hyperparameter tuning were done with the BCL-AffinityNet training set of size 7568 protein–ligand complexes (PDBbind v.2016 refined set, excluding the coreset and including select general set complexes; see the Methods subsection Model Validation for details). Model performance on the external test sets was not evaluated during feature parameter or model hyperparameter tuning.

Resolving Hydrogen Bond Angles in Feature Space

BCL-DockANNScore contains an additional feature type not present in BCL-AffinityNet. Specifically, we binned hydrogen bonding pairs by both distance and angle. We considered that the strength of hydrogen bonding interactions is often approximated not only with distances between donor and acceptor atoms but also with the orientation angle. Therefore, we also developed a complementary feature to (eq ) to assist with the description of well-formed hydrogen bonds. While (eq ) is generalizable to any atom-based descriptor (or pair of descriptors if performing an asymmetric correlation) returning a scalar value, this descriptor is exclusively for hydrogen bond donor/acceptor pairs. Essentially, each distance interval specified by the boundaries ra and rb in eq is equally partitioned into a user-specified number of bins (for this manuscript, nominally 45 bins of 8° each). Thus, for each distance bin there is also an angular component. See the Supporting Information for sample BCL code object files containing all properties employed in this study.

Input Sensitivity Analysis

The predictions for BCL-AffinityNet (and separately, BCL-DockANNScore) are the average predictions of the five cross-validated models. We can readily calculate feature importance for a single ANN by computing the magnitude of the input sensitivity across a dataset with respect to a given feature, after appropriate rescaling of the inputs. For model ensembles, the magnitude cannot be used or meaningfully averaged because feature input sensitivity may differ in the sign for various feature-instance pairings. While we could look at the raw average of input sensitivity of models across a given instance-feature pairing and then average the absolute value of that over the dataset, we suffer from an issue with relative scaling of the input sensitivities due to the nonlinearity of the ANN’s transfer function. Rather than deriving an optimal weighted feature importance metric for ANN ensembles by some criteria, we chose to simply evaluate how often the models in the ensemble agreed on the sign of the derivative for each feature, averaged across the dataset. This is a form of input sensitivity analysis we refer to as “consistency”. Here, we specifically evaluate the consistency of feature column perturbations on result labels across cross-validation models. Features for which models in the ensemble agree on the derivative sign most routinely are interpreted as those that are of most importance to the ensemble’s performance. Consistency is thus insensitive to the magnitude of feature’s influence. To calculate consistency, we iterate across all input feature columns of a training sample, perturb the feature value by a small amount (e.g., 0.01), propagate the perturbed inputs, and measure the result. For efficiency, we perform a forward propagation pass, followed by a backpropagation pass with a slightly modified result, which is readily transformed into the forward input sensitivities. This is done for each cross-validation model (in this manuscript, we performed fivefold cross-validation for all models). For each feature column, we count the number of models that predict that the perturbation will improve the score vs the number of models that predict that the perturbation will worsen the score. This number is normalized such that when half of the models predict a negative change to the result and the other half predicts a positive change to the result, the net consistency is zero. The consistency result is averaged across all examples in the training set for each individual feature.

49 in total

1. SAMFA: simplifying molecular description for 3D-QSAR.

Authors: John Manchester; Ryszard Czermiński
Journal: J Chem Inf Model Date: 2008-05-27 Impact factor: 4.956

2. Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field.

Authors: Lingle Wang; Yujie Wu; Yuqing Deng; Byungchan Kim; Levi Pierce; Goran Krilov; Dmitry Lupyan; Shaughnessy Robinson; Markus K Dahlgren; Jeremy Greenwood; Donna L Romero; Craig Masse; Jennifer L Knight; Thomas Steinbrecher; Thijs Beuming; Wolfgang Damm; Ed Harder; Woody Sherman; Mark Brewer; Ron Wester; Mark Murcko; Leah Frye; Ramy Farid; Teng Lin; David L Mobley; William L Jorgensen; Bruce J Berne; Richard A Friesner; Robert Abel
Journal: J Am Chem Soc Date: 2015-02-12 Impact factor: 15.419

3. End-Point Binding Free Energy Calculation with MM/PBSA and MM/GBSA: Strategies and Applications in Drug Design.

Authors: Ercheng Wang; Huiyong Sun; Junmei Wang; Zhe Wang; Hui Liu; John Z H Zhang; Tingjun Hou
Journal: Chem Rev Date: 2019-06-24 Impact factor: 60.622

4. Boosting Docking-Based Virtual Screening with Deep Learning.

Authors: Janaina Cruz Pereira; Ernesto Raúl Caffarena; Cicero Nogueira Dos Santos
Journal: J Chem Inf Model Date: 2016-11-29 Impact factor: 4.956

5. Assessing the performance of MM/PBSA and MM/GBSA methods. 7. Entropy effects on the performance of end-point binding free energy calculation approaches.

Authors: Huiyong Sun; Lili Duan; Fu Chen; Hui Liu; Zhe Wang; Peichen Pan; Feng Zhu; John Z H Zhang; Tingjun Hou
Journal: Phys Chem Chem Phys Date: 2018-05-30 Impact factor: 3.676

6. Autocorrelation descriptor improvements for QSAR: 2DA_Sign and 3DA_Sign.

Authors: Gregory Sliwoski; Jeffrey Mendenhall; Jens Meiler
Journal: J Comput Aided Mol Des Date: 2015-12-31 Impact factor: 3.686

7. BCL::Conf: Improved Open-Source Knowledge-Based Conformation Sampling Using the Crystallography Open Database.

Authors: Jeffrey Mendenhall; Benjamin P Brown; Sandeepkumar Kothiwale; Jens Meiler
Journal: J Chem Inf Model Date: 2020-12-22 Impact factor: 4.956

Review 8. Better together: Elements of successful scientific software development in a distributed collaborative community.

Authors: Julia Koehler Leman; Brian D Weitzner; P Douglas Renfrew; Steven M Lewis; Rocco Moretti; Andrew M Watkins; Vikram Khipple Mulligan; Sergey Lyskov; Jared Adolf-Bryfogle; Jason W Labonte; Justyna Krys; Christopher Bystroff; William Schief; Dominik Gront; Ora Schueler-Furman; David Baker; Philip Bradley; Roland Dunbrack; Tanja Kortemme; Andrew Leaver-Fay; Charlie E M Strauss; Jens Meiler; Brian Kuhlman; Jeffrey J Gray; Richard Bonneau
Journal: PLoS Comput Biol Date: 2020-05-04 Impact factor: 4.475

9. BCL::EMAS--enantioselective molecular asymmetry descriptor for 3D-QSAR.

Authors: Gregory Sliwoski; Edward W Lowe; Mariusz Butkiewicz; Jens Meiler
Journal: Molecules Date: 2012-08-20 Impact factor: 4.411

10. Study of the Applicability Domain of the QSAR Classification Models by Means of the Rivality and Modelability Indexes.

Authors: Irene Luque Ruiz; Miguel Ángel Gómez-Nieto
Journal: Molecules Date: 2018-10-24 Impact factor: 4.411

7 in total

1. Introduction to the BioChemical Library (BCL): An Application-Based Open-Source Toolkit for Integrated Cheminformatics and Machine Learning in Computer-Aided Drug Discovery.

Authors: Benjamin P Brown; Oanh Vu; Alexander R Geanes; Sandeepkumar Kothiwale; Mariusz Butkiewicz; Edward W Lowe; Ralf Mueller; Richard Pape; Jeffrey Mendenhall; Jens Meiler
Journal: Front Pharmacol Date: 2022-02-21 Impact factor: 5.810

2. A high quality, industrial data set for binding affinity prediction: performance comparison in different early drug discovery scenarios.

Authors: Andreas Tosstorff; Markus G Rudolph; Jason C Cole; Michael Reutlinger; Christian Kramer; Hervé Schaffhauser; Agnès Nilly; Alexander Flohr; Bernd Kuhn
Journal: J Comput Aided Mol Des Date: 2022-09-25 Impact factor: 4.179

3. Prediction of Binding Free Energy of Protein-Ligand Complexes with a Hybrid Molecular Mechanics/Generalized Born Surface Area and Machine Learning Method.

Authors: Lina Dong; Xiaoyang Qu; Yuan Zhao; Binju Wang
Journal: ACS Omega Date: 2021-11-21

Review 4. Artificial intelligence and machine learning approaches for drug design: challenges and opportunities for the pharmaceutical industries.

Authors: Chandrabose Selvaraj; Ishwar Chandra; Sanjeev Kumar Singh
Journal: Mol Divers Date: 2021-10-23 Impact factor: 2.943