Incorporating Protein Dynamics Through Ensemble Docking in Machine Learning Models to Predict Drug Binding.

Fatemah Alghamedy¹, Jeevith Bopaiah¹, Derek Jones¹, Xiaofei Zhang¹, Heidi L Weiss¹, Sally R Ellingson¹.

Abstract

Drug discovery is an expensive, lengthy, and sometimes dangerous process. The ability to make accurate computational predictions of drug binding would greatly improve the cost-effectiveness and safety of drug discovery and development. This study incorporates ensemble docking, the use of multiple protein conformations extracted from a molecular dynamics trajectory to perform docking calculations, with additional biomedical data sources and machine learning algorithms to improve the prediction of drug binding. We found that we can greatly increase the classification accuracy of an active vs a decoy compound using these methods over docking scores alone. The best results seen here come from having an individual protein conformation that produces binding features that correlate well with the active vs. decoy classification, in which case we achieve over 99% accuracy. The ability to confidently make accurate predictions on drug binding would allow for computational polypharmacological networks with insights into side-effect prediction, drug-repurposing, and drug efficacy.

Year: 2018 PMID: 29888034 PMCID: PMC5961778

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction

Machine learning is currently being used to advance many scientific disciplines, including drug binding predictions, and shows promise in increasing accuracy enough to make reliable polypharmacological predictions. Chemical drug features have been combined with molecular docking scores in machine learning models to rescore the interactions of one candidate drug to multiple proteins [1]. Components of docking scoring functions can be used as features in a machine learning model to greatly improve the accuracy of identifying active compounds in models specific for one protein [2]. A study on protein druggability used protein features in machine learning models to predict whether or not a protein is druggable [3]. Molecular flexibility can contribute to a favorable change in free energy of binding [4, 5, 6]. Protein-ligand complexes undergo a wide range of motions ranging from small changes in binding site residues to large-scale motions of entire protein domains. Molecular docking is an efficient (but not highly accurate) computational method that predicts how and how well a drug will bind to a protein. To keep these calculations efficient in order to investigate large libraries of chemical compounds, proteins are generally kept static or mostly static in which only a few selected side chains can rotate. Commonly, a crystal structure of the protein (experimentally resolved 3-dimensional protein structure) is used in molecular docking in which a ligand was bound for crystallization but removed from the Protein Data Bank (PDB)[7] file for docking. However, proteins may exist in multiple druggable states, and potentially none of these states are those favored for crystallization and may differ from those favored by a different bound compound. Ensemble docking is used here to describe the process of using multiple possible protein conformations in molecular docking to represent many potentially druggable states. 
In this case we use molecular dynamics to generate a trajectory of protein movement and select distinct conformations from the trajectory. Previous studies have shown the usefulness of this technique to improve molecular docking in general [8]. However, ensemble docking is still not perfect, and there is no way to pick the best conformation for ensemble docking without prior knowledge of sets of binding and non-binding compounds. There also remain deficiencies in the molecular docking scoring functions, which must assume a fixed functional form. This paper explores the combination of ensemble docking, biomedical data sources, and machine learning algorithms to greatly increase the accuracy of binding predictions. In this study, we investigate Tyrosine-protein kinase Lck (LCK), which is implicated as a drug target in many cancers and is also known to have toxic effects when unintentionally targeted. LCK is also the subject of a previous study in our lab that investigates theoretically more accurate binding calculations than molecular docking [9].

Methods

In this paper we perform ensemble docking by first collecting protein conformations from a molecular dynamics trajectory and then performing molecular docking with these structures. We collect features on our drug set and also generate features on the protein conformations. We use a random forest regressor to rate the importance of features and then investigate several machine learning models that use the K Nearest Neighbors (Knn) algorithm to classify compounds as active (binding) or decoy (non-binding).

Data Collection

The various types of features collected are described below and the total number of features collected for each category is given in Table 1.

Table 1:

The total number of features in each category for the base dataset, the best 100 features for the whole dataset, and the best features for each protein conformation (Conf 1-7).

Category	Base	Best	Conf 1	Conf 2	Conf 3	Conf 4	Conf 5	Conf 6	Conf 7
Drugs	2,792	99	10	54	20	7	5	10	7
Protein	103	0	0	0	0	0	0	0	0
Binding	109	1	0	1	0	8	0	0	8
Total	3,004	100	10	55	20	15	5	10	15

Binding Descriptors

The conformations of LCK used in this study are those extracted from a molecular dynamics simulation used for molecular mechanics and generalized Born and surface area continuum solvation (MM/GBSA) calculations in a previous study. Details on the ensemble docking method used can be found in the published work [9]. The conformations are extracted from the molecular dynamics trajectory using a root mean squared deviation clustering method applied to all the atoms near the binding site. The result of the clustering is a set of distinct conformations in which the first selected conformation has the most similar frames in the trajectory. Therefore, the first conformation is considered to be the one the protein adopts most often, with the last conformation being a rare, but potentially important, state. The previous study used homology modeling to predict more complete structures of different states of LCK, and the one used here corresponds to a homology model made using PDB structure 1QCF [10]. This structure is of HCK and was chosen based on sequence identity. The model is of the inactive state.

Active and decoy compounds for LCK come from the Directory of Useful Decoys (DUD-e) [11] and are prepared for docking using modified ADT scripts and a wrapper script for automation. Docking was performed using VinaMPI [12], which allows the distribution of a large number of Autodock Vina [13] docking jobs on MPI-enabled high-performance computers. The default number of models per docking job (predicted bound drug conformations with binding scores) was changed to collect up to 20 models per docking job, which was done for the MM/GBSA protocol. The results of the docking jobs were submitted to Autodock Vina using the “--score_only” option to collect the individual terms calculated in the scoring function. These are the gauss1, gauss2, repulsion, hydrophobic, and hydrogen interaction terms. The values for all the models and the averages of each term over all models are kept. There are also features for the final docking score and a normalized ranking.

Protein Descriptors

Since we are only considering one protein here, but seven different conformations, we collected features using two different webservers that calculate features from the 3-dimensional structure provided in a PDB file. These servers are Coach [14, 15] and 2struc [16]. Coach calculates features used for ligand binding site predictions, and 2struc gives information on the actual secondary structure of a 3-dimensional protein structure (i.e. not secondary structure predicted from the amino acid sequence).

Drug Descriptors

There are 28,536 different molecules (drugs) for which we collect features. Drug features are calculated using the Dragon software [17]. Dragon can calculate over 5,000 molecular descriptors, including the simplest atom types, functional groups and fragment counts, topological and geometrical descriptors, and three-dimensional descriptors. It also includes several property estimations such as logP and drug-like alerts such as Lipinski’s alert. A total of 1,058 descriptors are excluded because their values are constant or near-constant, or have a standard deviation below 0.0001. In this study three-dimensional descriptors are also left out because the input structures for Dragon are the predocking structures and not those predicted by molecular docking.

Labels

A drug is labeled as the positive class if it is a known active for LCK in DUD-e [11]. The drug is labeled as the negative class if it comes from the DUD-e decoy set for LCK. Since we do not know the conformation in which each drug binds, and it could potentially bind to all conformations, the drug is labeled the same for each conformation.
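The descriptor filtering step can be illustrated with a minimal sketch. It assumes the descriptors are already loaded into a pandas DataFrame with one column per descriptor; the function name, the toy column names, and the use of a single standard deviation threshold to cover both the constant and near-constant cases are assumptions for illustration, not the Dragon software's own filtering logic.

```python
import numpy as np
import pandas as pd


def drop_uninformative(descriptors: pd.DataFrame, min_std: float = 1e-4) -> pd.DataFrame:
    """Keep only numeric descriptor columns whose standard deviation is at least min_std."""
    numeric = descriptors.select_dtypes(include=[np.number])
    stds = numeric.std(axis=0, ddof=0)
    keep = stds[stds >= min_std].index
    return numeric[keep]


# Toy descriptor table (hypothetical column names, not real Dragon output).
toy = pd.DataFrame({
    "MW": [180.2, 250.3, 310.1, 95.0],
    "nC": [9, 14, 17, 5],
    "const": [1.0, 1.0, 1.0, 1.0],
    "near_const": [0.50000, 0.50001, 0.50000, 0.50001],
})
filtered = drop_uninformative(toy)
print(list(filtered.columns))
```

Here the constant and near-constant columns fall below the 0.0001 threshold and are removed, while the two varying columns are kept.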

Feature Selection

A random forest regressor method implemented in scikit-learn [18] is used to rate all of the features by importance when classifying compounds as active vs. decoy. This was done using all conformations at once and also one conformation at a time. The dataset was first balanced by randomly selecting the same number of decoy compounds as actives. We also selected some protein features based on their ability to identify the protein conformation.
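A minimal sketch of this ranking step is given here, assuming a pandas DataFrame with a binary `label` column (1 = active, 0 = decoy). The helper names, the number of trees, the random seeds, and the synthetic data are all assumptions made for illustration; the study's actual hyperparameters are not stated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def balance(df, label_col="label", seed=0):
    """Keep all actives and randomly draw the same number of decoys."""
    actives = df[df[label_col] == 1]
    decoys = df[df[label_col] == 0].sample(n=len(actives), random_state=seed)
    return pd.concat([actives, decoys])


def rank_features(df, label_col="label", n_top=100, seed=0):
    """Rank features by random forest importance on the balanced data."""
    balanced = balance(df, label_col, seed)
    X = balanced.drop(columns=[label_col])
    y = balanced[label_col]
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(n_top)


# Synthetic data: one feature tracks the label, two are pure noise.
rng = np.random.default_rng(0)
n = 600
label = (rng.random(n) < 0.1).astype(int)
data = pd.DataFrame({
    "informative": label * 2.0 + rng.normal(0, 0.3, n),
    "noise_a": rng.normal(0, 1, n),
    "noise_b": rng.normal(0, 1, n),
    "label": label,
})
ranking = rank_features(data, n_top=3)
print(ranking)
```

Running the same function on the rows of a single conformation would give the per-conformation ranking described above.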

Classification

The classification in this study is done using the scikit-learn [18] Knn implementation. Knn is a fairly simple machine learning algorithm in which a test object is assigned the class held by the majority of its K nearest neighbors. There are many different measures that quantify the closeness of two objects. Several distance measures, K = 1, 2, 3, 4, and 5 nearest neighbors, and n = 10, 5, and 2 for n-fold cross validation were evaluated.

Since the data is unbalanced (with approximately 50 decoys to every active compound), the dataset is first balanced by randomly selecting the same number of decoys as actives. We first select all active drugs in the dataset (label = 1) and take their count as sampleSize. For each of ten experiments, we randomly select sampleSize decoy drugs using the sample() method of pandas.DataFrame. We then merge the negative and positive data into one dataset, on which n-fold cross validation is performed. To prepare the data for machine learning, we split it into X and y such that X holds all the features except the label and y holds the label column. For each n-fold run, the StratifiedKFold() method provides the training and testing indices (sKf), ensuring a balance of positive and negative classes, and these are used with the knn method (where neighbors = k). Finally, the cross_val_predict() method is used with knn, X, y, and sKf to predict whether each drug is an active or a decoy. The metrics presented in the results are averages over the ten experiments. The process is given in Algorithm 1.

In some cases we want to reduce our feature set to the smallest size that gives the best metrics. In this case, classification is first performed with the top 5 features ranked by predicted contribution to the classification, and the next 5 most important features are then added incrementally. The optimal small feature set is selected based on the AUC as defined in Table 2.
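The balanced sampling and cross-validation procedure can be sketched as follows. This is an illustrative reconstruction from the prose, not the authors' Algorithm 1: the function name, the `shuffle=True` setting, the use of the experiment number as the random seed, the choice of reported metrics, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


def run_balanced_knn(df, k=1, n_folds=10, n_experiments=10, label_col="label"):
    """Average Knn cross-validation metrics over repeated balanced decoy samples."""
    actives = df[df[label_col] == 1]
    decoys = df[df[label_col] == 0]
    sample_size = len(actives)
    rows = []
    for seed in range(n_experiments):
        # Draw as many decoys as there are actives, then merge both classes.
        sampled = decoys.sample(n=sample_size, random_state=seed)
        balanced = pd.concat([actives, sampled])
        X = balanced.drop(columns=[label_col])
        y = balanced[label_col]
        # Stratified folds keep the class balance in every train/test split.
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        knn = KNeighborsClassifier(n_neighbors=k)
        predicted = cross_val_predict(knn, X, y, cv=skf)
        rows.append({"accuracy": accuracy_score(y, predicted),
                     "f1": f1_score(y, predicted)})
    return pd.DataFrame(rows).mean()


# Synthetic, well-separated toy data (not the LCK dataset).
rng = np.random.default_rng(0)
n = 1000
label = (rng.random(n) < 0.05).astype(int)
toy = pd.DataFrame({
    "feat_a": label * 4.0 + rng.normal(0, 1, n),
    "feat_b": label * 4.0 + rng.normal(0, 1, n),
    "label": label,
})
averaged = run_balanced_knn(toy)
print(averaged)
```

Because only the decoy sample changes between experiments, the averaged metrics reflect sensitivity to which decoys are drawn rather than to which actives are used.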

Table 2:

Metrics used in this study

Name	Definition	Formula
Youden’s index	Performance of a dichotomous test. A value of 1 indicates a perfect test, 0 indicates a test no better than chance, and -1 indicates a test that is always wrong.	$J = \frac{TP}{TP + FN} + \frac{TN}{TN + FP} - 1$
AUC	Area under the curve for the ROC curve. The probability P that a randomly chosen positive instance ranks higher than a randomly chosen negative one (X₀ and X₁ are ranks for negative and positive instances, respectively).	$P(X_1 > X_0)$
Accuracy	Percent of correct predictions	$\frac{TP + TN}{TP + FP + TN + FN}$
F1	Harmonic mean of precision and recall	$\frac{2TP}{2TP + FP + FN}$
Precision	Positive predictive value	$\frac{TP}{TP + FP}$
Recall	True positive rate	$\frac{TP}{TP + FN}$
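
As a worked illustration of the count-based metrics, the sketch below computes them from the four confusion-matrix counts using their standard definitions (Youden's index as sensitivity plus specificity minus one, recall as TP over TP plus FN). The function name and the example counts are hypothetical, and no guard against empty classes is included.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the count-based metrics from a binary confusion matrix."""
    recall = tp / (tp + fn)          # true positive rate (sensitivity)
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)       # positive predictive value
    return {
        "youden_j": recall + specificity - 1,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical counts: 50 actives (40 found), 100 decoys (90 rejected).
m = binary_metrics(tp=40, fp=10, tn=90, fn=10)
print(m)
```

With these counts, recall and precision are both 0.8, specificity is 0.9, and Youden's index is 0.7.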

Some of our analysis did not work with the cross validation method described. In this case we still balanced the data as described, tested for the same K values, evaluated testing set sizes of 10%, 20%, and 50% of the data, and averaged results over ten experiments.

Evaluation

In this study we compare three different ways of looking at the data using the machine learning algorithm and also compare the performance to the computed docking score. Docking scores are typically used for ranking compounds from most likely to least likely to bind, and there is no standard that defines an exact docking score that determines a binding prediction. In order to compare binding predictions from the docking score alone to the machine learning models, the maximum Youden’s index (or J value) is calculated for each model. The best J value is calculated from the docking score receiver operating characteristic (ROC) curve and used as a cut-off to define true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values for the docking results. All the metrics presented in the Results are defined in Table 2. The three different data models are described below.

Model 1

In this model, we split the dataset based on the protein conformation number, giving us seven different datasets. The assumption here is that one of the conformations best represents the druggable state.

Model 2

In this model, we keep all entries for every drug combined with each conformation. We ensure that all entries for one drug (seven total, one for each conformation) are entirely in the testing set or entirely in the training set. The assumption here is that the drug may bind to all conformations of the protein at varying rates.

Model 3

In this model, a consolidated dataset is made by keeping one entry per drug. The entry that is kept is the one with the best overall docking score. The assumption here is that the conformation with the best docking score is the preferred conformation for actual binding. This method has been shown to improve docking results with ensemble docking [8].
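
The three data models can be sketched with pandas as follows. The column names (`drug_id`, `conformation`, `docking_score`, `label`), the toy scores, and the use of scikit-learn's `GroupShuffleSplit` to keep each drug's rows on one side of the split are assumptions for illustration; the text does not say which tool was used for the grouped split, and the class-balancing step is omitted here.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy table: 20 drugs x 7 conformations, one row per drug-conformation pair.
rows = []
for drug in range(20):
    for conf in range(1, 8):
        rows.append({
            "drug_id": f"drug{drug}",
            "conformation": conf,
            "docking_score": -5.0 - 0.1 * ((drug * 7 + conf * 3) % 11),
            "label": int(drug < 10),
        })
table = pd.DataFrame(rows)

# Model 1: one separate dataset per protein conformation.
model1 = {conf: part for conf, part in table.groupby("conformation")}

# Model 2: keep every row, but split so that all rows of a drug stay together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(
    splitter.split(table, table["label"], groups=table["drug_id"])
)
train, test = table.iloc[train_idx], table.iloc[test_idx]

# Model 3: keep one row per drug, the one with the best (most negative) score.
best_rows = table.groupby("drug_id")["docking_score"].idxmin()
model3 = table.loc[best_rows].reset_index(drop=True)

print(len(model1), len(train), len(test), len(model3))
```

Splitting by group rather than by row is what prevents the same drug from appearing in both training and testing data in Model 2.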

Results

The top 100 features are kept when using the entire dataset for feature selection and also the top 100 features per protein conformation. The types of features important for classifying the entire dataset are given in Table 1. The total number of features that gave the best machine learning model AUC per individual conformation and their categories are given in Table 1 as well. Only conformations 2, 4, and 7 had important binding features. Just over half of the important features for conformations 4 and 7 are binding features. There are 28 protein features that have a weight greater than zero when selected for classification of the protein conformation.

Youden’s index for docking ROC curves

The maximum Youden’s index (or J value) is calculated for each model. The best J value is used to define TP, FP, TN, and FN values for each model using docking scores. The best J values and the docking score cut-off for each model are given in Table 3.
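
One way to obtain such a cut-off is sketched below with scikit-learn's `roc_curve`. It assumes that more negative docking scores indicate better predicted binding, so scores are negated before building the ROC curve; the function name and the toy scores and labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve


def best_youden_cutoff(docking_scores, labels):
    """Return (best J value, docking score cut-off) from the ROC curve."""
    scores = np.asarray(docking_scores, dtype=float)
    # Negate so that a higher value means "more likely active".
    fpr, tpr, thresholds = roc_curve(labels, -scores, drop_intermediate=False)
    j = tpr - fpr                      # J = sensitivity + specificity - 1
    best = int(np.argmax(j))
    return float(j[best]), float(-thresholds[best])


# Toy data: three actives among eight compounds.
scores = [-10.1, -9.5, -9.0, -8.7, -8.1, -7.9, -7.2, -6.5]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
j_value, cutoff = best_youden_cutoff(scores, labels)
print(j_value, cutoff)
```

Compounds with a docking score at or below the returned cut-off are then counted as predicted actives when tallying TP, FP, TN, and FN.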

Table 3:

Youden’s index for each model.

Value	Model 1, Conf 1	Model 1, Conf 2	Model 1, Conf 3	Model 1, Conf 4	Model 1, Conf 5	Model 1, Conf 6	Model 1, Conf 7	Model 2	Model 3
Docking Score cut-off	-9.2	-8.2	-8.5	-2.8	-8.3	-8	37.4	-8.1	-8.9
Best J Value	0.1600	0.0829	0.0832	0.1052	0.1142	0.2103	0.0125	0.0874	0.1283

Many metrics are reported here, but our focus is on maximizing the F1 score, so this metric is shown in bold in all the tables and the best models are highlighted in gray. The results for Model 1 are given in Table 4. The results presented here are for K = 1 nearest neighbors, n = 10 for n-fold cross validation, and running in the ‘auto’ mode (in which scikit-learn attempts to choose the most appropriate algorithm for finding the nearest neighbors). These values usually give the best results, with little variation. We evaluate Model 1 using all the features, the best 100 features from feature selection on the entire dataset, the best 100 features from feature selection on each individual conformation, and a subset of the last feature set that gives the best result. Interestingly, conformation 7 always performs the best by machine learning F1 even though it is the worst in many docking metrics. Since we saw a decrease in metric values when going from features selected on the entire dataset to features selected per conformation, we did a further analysis on the best subset of features and picked the subset that gave the best machine learning AUC. All conformations perform better with their smaller subset than any conformation using the best 100 features obtained using the entire dataset (the best again is conformation 7, highlighted in yellow).
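
The subset search described here (start from the 5 top-ranked features, add 5 at a time, keep the size with the best AUC) might look like the sketch below. The function name, the tie-breaking toward the smallest subset, computing AUC from hard Knn predictions, and the synthetic data are assumptions; the ranked feature list is taken as given.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


def best_feature_subset(X, y, ranked_features, step=5, k=1, n_folds=10, seed=0):
    """Grow the feature set in ranked order and keep the size with the best AUC."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    auc_by_size = {}
    for size in range(step, len(ranked_features) + 1, step):
        subset = list(ranked_features[:size])
        knn = KNeighborsClassifier(n_neighbors=k)
        predicted = cross_val_predict(knn, X[subset], y, cv=skf)
        auc_by_size[size] = roc_auc_score(y, predicted)
    # max() returns the first maximal key, so ties go to the smallest subset.
    best_size = max(auc_by_size, key=auc_by_size.get)
    return list(ranked_features[:best_size]), auc_by_size


# Synthetic balanced data: the first 5 features are informative, the rest noise.
rng = np.random.default_rng(0)
n = 200
y = pd.Series(np.repeat([0, 1], n // 2), name="label")
columns = {}
for i in range(15):
    shift = 3.0 if i < 5 else 0.0
    columns[f"f{i}"] = y.to_numpy() * shift + rng.normal(0, 1, n)
X = pd.DataFrame(columns)
ranked = list(X.columns)  # assumed to be already sorted by importance

subset, auc_by_size = best_feature_subset(X, y, ranked)
print(len(subset), auc_by_size)
```

Note that selecting the subset size on the same data used for evaluation can inflate the reported AUC, which is the overfitting risk the Limitations section raises.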

Table 4:

Model 1. Conf = conformation; MD = molecular docking; ML = machine learning model; Acc. = Accuracy; Prec. = Precision

The results for Model 2 are given in Table 5. Since we had to ensure that all entries for a drug were entirely in either the testing set or the training set in this model, we could not run Model 2 with the same cross validation method used for Model 1. The results here are for a 90:10 training/testing split, and the metrics are averages over 10 runs of the analysis, each with a different random set of decoys. We wanted to test the inclusion of protein conformation features. We tested this by using a base model of the best 100 features selected using the whole dataset, then adding the 2 protein features most important when classifying the conformation, and then adding all 28 protein features with a non-zero importance. We see a decline in performance at each step.

Table 5:

Model 2. MD = molecular docking; ML = machine learning model; Acc. = Accuracy; Prec. = Precision

We wanted to run Model 3 with cross validation but also wanted to directly compare Model 2 and Model 3. Therefore, Table 6 gives a comparison of both models using the same protocol as used for Model 2. They have very similar metrics, with the average of Model 2 slightly higher, but with some individual runs of each model outperforming the other. Table 7 gives the results of Model 3 using cross validation. Here we compare the models using the entire feature set, the best 100 features, the best 100 features plus the base docking features (for the best predicted docking model, averages of all, the final docking score, and a ranking), the best 100 features plus all docking features (including up to 20 docked models), and the docking features alone. The best 100 features perform the best, but all feature sets still show an improvement over docking alone. Using only docking features in the machine learning model actually gives a slight decrease in recall compared to the docking calculation (highlighted in yellow), but the better precision of the machine learning model gives it a better F1.

Table 6:

Models 2 and 3 using the best 100 features

Table 7:

Model 3. MD = molecular docking; ML = machine learning model; Acc. = Accuracy; Prec. = Precision

Discussion

The best Model 1 is obtained using conformation 7 and the small subset of features obtained by doing feature selection on just that conformation’s subset of data and taking only the number of features that gives the best AUC. Conformation 7 performs the worst according to docking AUC; however, the docking recall is nearly as high as the machine learning recall. This is because the best J value is actually quite low and the docking cut-off is one that would not be expected to signify an active compound (+37.4, when the most negative scores are best). Because of this cut-off there are a large number of false positives driving down the docking precision. We believe the success of conformation 7 for machine learning is due to the fact that it has so many binding features that correlate with the active vs decoy classification (8 of the 15 most important features). We see the same trend in the next best Model 1, which is obtained using conformation 4 and its subset of important features, which also has 8 out of 15 as binding features. The docking cut-off values for classification show that conformations 4 and 7 actually have the worst docking values. However, when the individual components are used in the machine learning algorithm, this information can classify active vs. decoy with great precision and recall. This shows the importance of the terms calculated for docking scoring functions but also the deficiencies in the scoring functions themselves. Since the length of the trajectory used to capture different conformations is only 200 ns, there is not a huge structural variation in the conformations. However, upon visualization it appears that conformation 7 may have a slightly smaller binding pocket, which may allow more favorable interactions with the active compounds.

Model 2 looks at keeping all entries in the dataset. Model 2 performs better with the best 100 features than any single conformation by F1 metrics. We did not get any improvement in any metrics by adding protein features. This could be because there is not enough variation in these values to help with predictions. Including such features when there are larger differences may provide further insight. Information on which conformations drugs bind to and which they do not (so drugs could be marked as active to one conformation and decoy to another) could make these features more useful as well. However, creating that dataset would be difficult if not impossible and is discussed further in the limitations. Even though we were not successful using protein features in this study, we include them here for thoroughness. We are also interested in models that incorporate multiple proteins and conformations obtained from longer trajectories; therefore, we remain interested in the usefulness of these features.

We can see from the Model 3 results that although the binding features play a significant role in good classification when they correlate with the classification, forcing all the binding features to be in the model does not help. We found in this study that we can successfully use machine learning to improve the prediction of drug binding, and that using drug features calculated from protein conformations selected from molecular dynamics trajectories increases the predictive ability of the models. Lessons learned in this study will be used to build models of different proteins to be used in drug discovery applications. In fact, we are already developing an “all kinase” model to predict drug repurposing and potential side-effects.

Limitations

There are several limitations of this study that warrant further investigation. We are likely overfitting the data, as feature selection was performed on the same data used in the validation. We also omitted a large number of the decoy compounds from feature selection to balance the data. However, in this initial study, we wanted to have a large number of active compounds in the selection to better understand the features that drive the classification. We plan to increase the size of our dataset so that a large number of actives can be held out for validation in the future. This study utilizes DUD-e, in which the decoys are only assumed to be non-binders. However, we are using the enhanced version of the dataset, which has a reduced number of false decoys owing to a more stringent filtering process and experimental validation when available. Another limitation is that active drugs from the DUD-e dataset are labeled as active for all the possible conformations assessed in this study. However, if we only used protein-drug complexes with experimentally resolved conformations, we would face further limitations, as those conformations may not reflect how the drug binds to the protein outside the experimental conditions used to resolve the structures.

Future Work

Some potential future directions of this research include (1) testing other machine learning algorithms, (2) incorporating data on multiple possible binding sites, (3) having a strict validation set not used in feature selection, and (4) applying models to novel drug discovery applications.

Conclusion

We successfully incorporated ensemble docking, biomedical and biological data sources, and machine learning to improve binding predictions. The addition of protein features did not help this model but it may in cases where there is more variation (such as if multiple proteins are in the dataset). The best results seen here come from having an individual conformation that produces binding features that correlate well with the active vs decoy classification, giving models with over 99% accuracy. We also see that every way we examine the data using machine learning gives improvement over molecular docking alone.

14 in total

1. Direct determination of vibrational density of states change on ligand binding to a protein.

Authors: Erika Balog; Torsten Becker; Martin Oettl; Ruep Lechner; Roy Daniel; John Finney; Jeremy C Smith
Journal: Phys Rev Lett Date: 2004-07-09 Impact factor: 9.161

2. DrugMiner: comparative analysis of machine learning algorithms for prediction of potential druggable proteins.

Authors: Ali Akbar Jamali; Reza Ferdousi; Saeed Razzaghi; Jiuyong Li; Reza Safdari; Esmaeil Ebrahimie
Journal: Drug Discov Today Date: 2016-01-25 Impact factor: 7.851

3. Vibrational softening of a protein on ligand binding.

Authors: Erika Balog; David Perahia; Jeremy C Smith; Franci Merzel
Journal: J Phys Chem B Date: 2011-05-10 Impact factor: 2.991

4. A machine learning-based method to improve docking scoring functions and its application to drug repurposing.

Authors: Sarah L Kinnings; Nina Liu; Peter J Tonge; Richard M Jackson; Lei Xie; Philip E Bourne
Journal: J Chem Inf Model Date: 2011-02-03 Impact factor: 4.956

5. Theory and normal-mode analysis of change in protein vibrational dynamics on ligand binding.

Authors: Kei Moritsugu; Brigitte M Njunda; Jeremy C Smith
Journal: J Phys Chem B Date: 2010-01-28 Impact factor: 2.991

6. Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment.

Authors: Jianyi Yang; Ambrish Roy; Yang Zhang
Journal: Bioinformatics Date: 2013-08-23 Impact factor: 6.937

7. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking.

Authors: Michael M Mysinger; Michael Carchia; John J Irwin; Brian K Shoichet
Journal: J Med Chem Date: 2012-07-05 Impact factor: 7.446

8. 2Struc: the secondary structure server.

Authors: D P Klose; B A Wallace; Robert W Janes
Journal: Bioinformatics Date: 2010-08-24 Impact factor: 6.937

9. Combining machine learning systems and multiple docking simulation packages to improve docking prediction reliability for network pharmacology.

Authors: Kun-Yi Hsin; Samik Ghosh; Hiroaki Kitano
Journal: PLoS One Date: 2013-12-31 Impact factor: 3.240