
Collaborative Approach between Explainable Artificial Intelligence and Simplified Chemical Interactions to Explore Active Ligands for Cyclin-Dependent Kinase 2.

Tomomi Shimazaki1, Masanori Tachikawa2.   

Abstract

To improve virtual screening for drug discovery, we present a collaborative approach between explainable artificial intelligence (AI) and simplified chemical interaction scores to efficiently search for active ligands bound to the target receptor. In particular, we focus on cyclin-dependent kinase 2 (CDK2), which is well known as a cancer target protein. Docking simulation alone is insufficient to distinguish active ligands from decoy molecules. To identify active ligands, in this paper, machine learning is employed together with scoring functions that simplify the screened Coulomb and Lennard-Jones interactions between the ligands and residues of the target receptor. We demonstrate that these simplified interaction scores can significantly improve the classification ability of machine learning models. We also demonstrate that explainable AI together with the simplified scoring method can highlight the important residues of CDK2 for recognizing active ligands.
© 2022 The Authors. Published by American Chemical Society.

Year:  2022        PMID: 35382271      PMCID: PMC8973106          DOI: 10.1021/acsomega.1c06976

Source DB:  PubMed          Journal:  ACS Omega        ISSN: 2470-1343


Introduction

In recent years, the use of machine learning with molecular databases[1−10] has rapidly increased in the field of physical chemistry. Many machine learning algorithms focus on finding correlations (patterns) in the given data; however, the patterns found are often difficult for humans to understand due to the complex nonlinear behavior of machine learning models. The incomprehensibility of machine learning results has become a major problem in both scientific and nonscientific fields. To alleviate this problem, explainable artificial intelligence (AI) techniques, which attempt to understand and explain the behavior of machine learning models, have gained much attention.[11−13] In this paper, we examine several explainable AI algorithms, focusing on docking simulation (virtual screening) to search for drug candidate molecules, as a large amount of experimental data on both positive and negative cases is available. In the early stages of drug development, virtual screening (docking simulation) is routinely employed to narrow down candidate molecules, where the binding ability of ligand molecules against the target protein is virtually tested on a computer.[14−24] However, the use of docking simulation alone is not sufficient to search for drug candidates. Thus, more precise physicochemical simulations, such as (subsystem-based) quantum chemistry and (classical) molecular mechanics/dynamics, have been performed to improve the screening process; however, they consume a large amount of computational resources.[25−36] In contrast, machine learning is employed to achieve efficient and accurate virtual screening.[37−52] For example, Pereira et al. proposed a deep learning approach to improve docking simulations and extracted relevant features from protein–ligand complex data.[49] Yan et al. developed a descriptor based on protein–ligand interactions for machine learning methods.[53] Ragoza et al. 
reported a convolutional neural network scoring function approach and highlighted the key features of protein–ligand interactions.[41] Molecular descriptors suitable for machine learning have also been developed.[53−57] Several scoring functions based on machine learning techniques have been reported.[50,58,59] In this paper, we discuss a virtual screening technique based on machine learning using simplified interaction scoring functions. To verify virtual screening techniques, it is necessary to use decoy molecules that have chemical structures and properties similar to those of active ligands. The use of decoy molecules allows for stricter verification than using active ligands alone. In other words, virtual screening techniques must classify active ligands (positive cases) against a large number of decoy molecules (negative cases). For this purpose, Mysinger et al. developed the Directory of Useful Decoys: Enhanced (DUD-E) data set, which provides not only experimentally verified active ligands but also a large number of property-matched decoy molecules.[60] The number of decoys registered in the DUD-E data set is approximately 35 times the number of active ligands for each target protein (receptor). In this paper, we use the decoy molecules provided in the DUD-E data set to verify the machine learning models. We focus on cyclin-dependent kinase 2 (CDK2) as the target receptor. CDK2 is a catalytic subunit of the cyclin-dependent kinase complex and plays a critical role in the abnormal growth process of cancer cells.[61,62] Thus, CDK2 inhibitors have been widely researched as drug candidates.[63] For CDK2, 798 active ligands and 28,328 decoy molecules are provided in the DUD-E data set.[60] Thus, a small number of active ligands must be distinguished from a large number of decoys. However, the affinity (binding free energy) determined by conventional docking simulation is not sufficient to classify active ligands. 
Therefore, we introduce simplified interaction scores between ligands and the CDK2 receptor. These simplified interaction scores can significantly improve the classification ability of machine learning models. In addition, residues important for ligand recognition by CDK2 can be highlighted through a collaborative approach between the interaction scores and explainable AI. Knowledge of the important residues may provide a useful guideline for understanding the binding phenomena between ligands and the target protein. In the Methods section, we describe the docking simulation, the simplified interaction scores containing chemical features, the machine learning techniques, and the collaborative approach between the interaction scores and explainable AI. The calculation results and discussion are provided in the Results and Discussion section, and concluding remarks are presented in the final section.

Methods

In this section, we describe the calculation process used to classify a small number of active ligands bound to the target protein (receptor) among a much larger number of decoy molecules. Figure 1 summarizes the classification process. The first step is a docking simulation between the ligands and the target receptor. We employed the AutoDock Vina program package (referred to as Vina in this paper).[21] Vina provides simulation results with a higher average accuracy than AutoDock4 and efficiently predicts the ligand–receptor structure using a simple scoring function.[16] However, there are many common features between Vina and AutoDock4. For example, Vina employs the PDBQT molecular structure file format for input and output, which is also used in AutoDock4. In addition, the subroutines of MGLTools, which were originally developed for AutoDock4, can be used to prepare input files for the Vina docking simulation.[64,65] Vina can also evaluate the binding free energy (affinity) between a ligand and the target protein. The receptor structure registered in DUD-E (PDB-ID 1h00)[66] was employed in this study. We also used the molecular data provided in DUD-E without any modifications.
Figure 1

Calculation and analysis flow used in this paper. In the first step, docking simulation using Vina is employed to obtain the complex structure and binding energy (affinity) between CDK2 and ligands. In the second step, the simplified screened Coulomb (SC) and Lennard-Jones (LJ) interaction scores between ligands and residues around CDK2 are calculated using the ligand–receptor complex structures. In the third step, machine learning models to classify active ligands are created using feature vectors with the simplified interaction scores and affinities. In the final step, explainable AI analysis is performed on the trained machine learning models.

Figure 2a presents the histograms of the affinities of active and decoy ligands for the target CDK2 protein. The red and blue transparent histograms represent active and decoy molecules, respectively. The number of active ligands was much smaller than the number of decoys; thus, the histogram of the active ligands is much smaller than that of the decoys and is difficult to distinguish. To more easily compare the two histograms, normalized histograms are presented in Figure 2b. There is a large overlap between the histograms, although the active ligands tend to have a slightly lower affinity than the decoys on average. This large overlap in affinity makes the classification of active ligands difficult. In fact, we were unable to create a satisfactory model to classify active ligands even using machine learning, as shown later.
Figure 2

(a) Histograms of the binding energy (affinity) of active and decoy ligand molecules for CDK2 evaluated by Vina docking simulation. Here, red and blue transparent histograms represent active and decoy molecules, respectively. The number of active ligands is much smaller than that of decoys. (b) Normalized histograms for active and decoy molecules for a clearer comparison between the two. There is a large overlap between the active and decoy histograms, indicating that docking simulation cannot fully classify active ligands from a large number of decoys.

In the second step, simplified interaction scores between ligands and residues around CDK2 were calculated; these are used in the feature vectors to create improved machine learning models. To evaluate the interaction scores, we focused on the amino acid residues around the pocket region of the target receptor where ligands were bound. We display a CDK2–ligand complex structure obtained by Vina in Figure 3a and the 17 residues around the ligand-binding pocket of CDK2 in Figure 3b. In this study, we calculated the interaction scores between a ligand molecule and each residue based on the simplified SC interaction and the LJ potential, given in eqs 1 and 2, where r_ij is the distance between atom i of each CDK2 residue and atom j of the ligand molecule, and α is a constant parameter set to 0.1 Å−2. To evaluate these scores, we used the ligand–receptor complex structure obtained from the Vina docking simulation. Equation 1 represents the simplified SC interaction. Here, only the atomic species O, N, S, and F are taken into account to evaluate the score because they play an important role in hydrogen bonding. The simplified SC scoring function does not take into account the partial charges on atoms, unlike classical force field models. Equation 2 is based on the LJ potential; however, it does not consider the differences between atom species. 
In the simplified LJ interaction score, the same parameter is employed for all atom species, although ordinary classical force fields have many parameters. Thus, we employed simplified SC and LJ scores instead of considering more complex interactions because the former can enable easy, robust, and fast evaluation. In addition, this simplification is useful to highlight important residues for binding active ligands to the target protein.
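The two residue-level scores can be sketched in a few lines of Python. Because the exact equations are not reproduced in this excerpt, the functional forms below are assumptions consistent with the description: a Gaussian-screened Coulomb term exp(-αr²)/r restricted to O, N, S, and F atom pairs (α = 0.1 Å⁻²), and a Lennard-Jones-like term (1/r)¹² − (1/r)⁶ with one shared parameter set for all atom species.

```python
import numpy as np

ALPHA = 0.1  # screening parameter in 1/Angstrom^2, as stated in the text
HBOND_SPECIES = {"O", "N", "S", "F"}  # species considered for the SC score

def simplified_scores(residue_atoms, ligand_atoms):
    """Return (SC, LJ) scores for one residue-ligand pair.

    Each atom is an (element, xyz) tuple with xyz in Angstroms.
    The functional forms are assumptions: a Gaussian-screened Coulomb
    term and an LJ-like term with the same parameters for every species.
    """
    sc = 0.0
    lj = 0.0
    for elem_r, xyz_r in residue_atoms:
        for elem_l, xyz_l in ligand_atoms:
            r = np.linalg.norm(np.asarray(xyz_r) - np.asarray(xyz_l))
            # LJ-like term: identical parameters for all atom species
            lj += (1.0 / r) ** 12 - (1.0 / r) ** 6
            # SC term: only hydrogen-bond-relevant species contribute
            if elem_r in HBOND_SPECIES and elem_l in HBOND_SPECIES:
                sc += np.exp(-ALPHA * r * r) / r
    return sc, lj

# Toy example: one residue oxygen 3 Angstroms from one ligand nitrogen
sc, lj = simplified_scores([("O", [0.0, 0.0, 0.0])], [("N", [3.0, 0.0, 0.0])])
```

Computing one (SC, LJ) pair per pocket residue over the Vina complex structure would yield the 17 × 2 score features described below.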
Figure 3

(a) Structure of CDK2 described by the ball-and-stick representation. Here, a ligand molecule is bound to the pocket of CDK2. The ligand–receptor complex structure is obtained by the Vina docking simulation. (b) Residues around the pocket of CDK2, where ligand molecules are bound.

In the third step, we employed machine learning to classify active ligands from a large number of decoys. Here, the simplified SC and LJ interactions as well as the affinity calculated by Vina were used in feature vectors to construct machine learning models. We examined several machine learning algorithms using the scikit-learn and LightGBM (light gradient boosting machine) libraries.[67,68] The hyperparameters were tuned using grid search and Bayesian optimization with the scikit-optimize library.[69] Cross-validation was used to evaluate the machine learning models. Finally, in the fourth step, we analyzed the trained machine learning models using explainable AI methods, such as the permutation importance algorithm[70] and the Shapley additive explanation (SHAP) method, using the scikit-learn[67] and shap libraries.[71,72] The explainable AI analysis allowed us to identify the CDK2 residues important for classifying active ligands. In the Results and Discussion section, we present the details of the calculation results based on the explainable AI approach together with the simplified SC and LJ interaction scores.

Results and Discussion

This section discusses the calculation results obtained from machine learning models with simplified interaction scores. The simplified SC and LJ interaction scores were used in feature vectors to construct the machine learning models. To calculate the simplified SC and LJ scores, 17 residues of CDK2 near the pocket were considered. In addition, the affinity obtained from Vina was also employed to describe the features of the ligand molecules. Thus, the size (dimension) of the feature vector for each ligand was 35 (= 17 × 2 + 1). Standardization (normalization) of the feature vectors was employed to construct the machine learning models (classifiers). In this study, we examined several classifiers based on the logistic regression, random forest, and support vector machine (SVM) algorithms implemented in scikit-learn and the gradient boosting algorithm implemented in LightGBM. In the SVM method, the radial basis function (RBF) and third-order polynomial (poly3) kernels were employed to capture nonlinear relationships in the data.[73,74] Table 1 summarizes the performance of the machine learning models in classifying active ligands. Five-fold cross-validation was used to obtain these metrics. The number of active ligands was much smaller than the number of decoys; thus, the accuracy (= (TP + TN)/(TP + TN + FP + FN)) was not a suitable metric to evaluate the machine learning models. Here, TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. Thus, we also used other metrics, such as precision (= TP/(TP + FP)) and recall (= TP/(TP + FN)), which have a tradeoff relationship. In addition, the F1 score and the Matthews correlation coefficient (MCC) can be used to comprehensively compare the model performance.[75] Table 1 also presents the area under the receiver operating characteristic curve (AUC) scores.
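All of these metrics are available directly in scikit-learn. The toy labels and scores below are illustrative only; they are chosen to show how accuracy stays high under class imbalance while recall exposes a missed active.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

# Tiny illustrative example with imbalance: 2 actives among 10 molecules.
y_true  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred  = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # one active is missed
y_score = [0.9, 0.4, 0.5, 0.2, 0.1, 0.3, 0.2, 0.1, 0.1, 0.2]

acc  = accuracy_score(y_true, y_pred)    # high despite the missed active
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1   = f1_score(y_true, y_pred)          # harmonic mean of prec and rec
mcc  = matthews_corrcoef(y_true, y_pred) # balanced even under imbalance
auc  = roc_auc_score(y_true, y_score)    # ranking quality of the scores
```

Here accuracy is 0.9 even though half of the actives are lost, while recall (0.5) and MCC (about 0.67) reflect the miss; this is why accuracy alone is avoided in Table 1.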
Table 1

Comparison of Machine Learning Models in Classifying Active Ligands (Positive Cases) Bound to CDK2

models                        accuracy  precision  recall  F1 score  MCC   AUC score
gradient boosting (LightGBM)  0.98      0.98       0.48    0.60      0.65  0.93
SVM–RBF                       0.98      0.79       0.60    0.68      0.68  0.92
SVM–poly3                     0.98      0.65       0.48    0.55      0.55  0.87
random forest                 0.98      0.99       0.28    0.44      0.52  0.89
logistic regression           0.77      0.08       0.71    0.14      0.18  0.81
The results in Table 1 indicate that the LightGBM model, which was based on the gradient boosting algorithm, demonstrated superior classification ability for active ligands. The LightGBM algorithm was also much faster than the other methods; thus, we focus on the LightGBM model in this paper. Table 2 presents the mixed matrix for the LightGBM model using 35 features. For comparison, Table 2 also presents the mixed matrix when only the affinity was used. The decoys demonstrated good results for all metrics in both cases due to the much larger number of decoy molecules than active ligands. In contrast, different results were observed for the active ligands, where the use of the affinity alone was insufficient to construct an effective classifier. For example, an F1 score of 0.02 and a precision of 0.30 were obtained when only the affinity was used in the feature vectors. However, when the simplified SC and LJ interaction scores were used together with the affinity, a machine learning model could be constructed with an F1 score of 0.60 and a precision of 0.98. The use of the interaction scores thus greatly improved the machine learning model. Therefore, the LightGBM model with the simplified SC and LJ interaction scores was effective in classifying active ligands.
Table 2

Mixed Matrix for LightGBM Model

                               class            precision  recall  F1 score
model using 35 features        active ligands   0.98       0.43    0.60
                               decoy molecules  0.98       1.00    0.99
model using only the affinity  active ligands   0.30       0.01    0.02
                               decoy molecules  0.97       1.00    0.99
To analyze the trained machine learning models, we employed the permutation importance algorithm, which is an explainable AI method. This algorithm makes it possible to evaluate the importance of features even in nonlinear models. In this algorithm, the values of a feature are randomly shuffled, and then the target data are predicted. The feature with shuffled values becomes useless and usually reduces the predictive performance of the trained model. The importance of the feature can then be evaluated from this performance deterioration. This operation is repeated for all features, and the importance of each feature is evaluated. Figure 4a presents the top 15 most important features obtained by the permutation importance method for the LightGBM model, where the maximum value was scaled to 1.0. More detailed results are provided in the Supporting Information. To obtain these results, only active ligands (positive cases) were employed to avoid the strong influence of the larger number of decoys. The explainable AI analysis provided insight into the behavior of the complex nonlinear machine learning model. For example, we determined that the affinity calculated by Vina was the most important feature in the LightGBM model. The docking simulation results were essential even when the simplified SC and LJ interaction scores were used. It should be noted that the affinity was not the most important feature in the SVM model (see also the Supporting Information). The second-most important feature in the LightGBM model was the simplified LJ interaction between ligands and the Val18 residue, while the third- and fourth-most important features were the simplified SC scores on Val18 and Leu134, respectively. Thus, we obtained information on which interactions between ligands and the target protein were important for classifying active ligands.
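A minimal sketch of the permutation importance procedure, using scikit-learn's implementation on synthetic data in which only the first feature carries signal; the random forest here is a stand-in for the trained models of the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
# Only feature 0 carries the label signal; the rest are pure noise.
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(random_state=0).fit(X, y)
# Shuffle each feature in turn and measure the score deterioration.
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
# Scale so that the maximum importance is 1.0, as in Figure 4a.
scaled = result.importances_mean / result.importances_mean.max()
```

Shuffling feature 0 destroys the model's accuracy while shuffling the noise features barely changes it, so feature 0 dominates the scaled importances.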
Figure 4

Scaled feature importance obtained by the following explainable AI algorithms for the LightGBM model in classifying active ligands: (a) the permutation importance algorithm and (b) the SHAP algorithm. These algorithms provide different interpretations of feature importance, but common important residues are observed for recognizing active ligands. (c) Five of the top six residues of CDK2 are observed in both algorithms.

Figure 4b illustrates the importance of features obtained by the SHAP algorithm, which is based on the Shapley value in game theory.[76] In this paper, these values are referred to as the SHAP importance. Here, the SHAP importance values were also scaled so that the maximum value was 1.0 for comparison purposes. The SHAP importance values differed from those obtained by the permutation importance algorithm. This suggests that the importance of features is not uniquely determined but depends on the selected algorithm. However, we were able to recognize common important residues from the different results. To clarify this point, we focus only on the residue information. Table 3 presents the top six most important residues provided by the permutation importance and SHAP algorithms. To obtain these results, the residues were simply aggregated based on the importance values without distinguishing between the simplified SC and LJ scores. The results demonstrate that the permutation importance and SHAP algorithms produced similar important residues, although their order was slightly different. That is, the five residues Ala31, Lys33, Leu83, His84, and Leu134 were common to the top six residues, as illustrated in Figure 4c. Thus, the explainable AI techniques suggest that these residues play essential roles in the molecular recognition of CDK2 for classifying active ligands in the LightGBM model. Figure 5 displays the change in MCC values according to the number of features. 
Here, the features were first sorted according to their importance values, and the ones with higher importance were used. Both graphs display a similar behavior, where the performance approached saturation at approximately 15 features. Thus, it was demonstrated that the top residues played more essential roles in classifying active ligands.
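The feature-truncation experiment behind the MCC curves can be sketched as follows, with synthetic data and an assumed importance ordering standing in for the permutation-importance ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 10))
# Features 0-2 carry signal with decreasing strength; the rest are noise.
y = ((1.5 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2]) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Assumed ranking: features pre-sorted by importance (most important first).
order = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
mcc_curve = []
for k in range(1, len(order) + 1):
    cols = order[:k]  # keep only the top-k features
    clf = RandomForestClassifier(random_state=0).fit(Xtr[:, cols], ytr)
    mcc_curve.append(matthews_corrcoef(yte, clf.predict(Xte[:, cols])))
```

Plotting mcc_curve against k reproduces the qualitative behavior of Figure 5: performance climbs while informative features are added and then saturates once only noise features remain.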
Table 3

Most Important Residues Provided by the Permutation Importance and SHAP Algorithms in the LightGBM Model

     permutation importance  SHAP
1    Val18                   Leu83
2    Leu134                  His84
3    Lys33                   Ala31
4    Leu83                   Leu134
5    His84                   Asp86
6    Ala31                   Lys33
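The aggregation that produces a residue ranking like Table 3 can be sketched as follows; the importance values are hypothetical placeholders, and the SC and LJ contributions of each residue are simply summed before ranking, as described in the text.

```python
# Hypothetical per-feature importance values, keyed by (residue, score type);
# real values would come from permutation importance or SHAP.
importances = {
    ("Val18", "SC"): 0.30, ("Val18", "LJ"): 0.55,
    ("Leu83", "SC"): 0.40, ("Leu83", "LJ"): 0.25,
    ("Lys33", "SC"): 0.35, ("Lys33", "LJ"): 0.20,
    ("His84", "SC"): 0.28, ("His84", "LJ"): 0.22,
}

# Aggregate per residue without distinguishing SC from LJ, then rank.
totals = {}
for (residue, _score_type), value in importances.items():
    totals[residue] = totals.get(residue, 0.0) + value
ranked = sorted(totals, key=totals.get, reverse=True)
```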
Figure 5

Change in MCC according to the number of features used in the (a) LightGBM model and (b) SVM with the radial basis function kernel (SVM–RBF) model. Here, features sorted by permutation importance are used to obtain the results. The classification ability of both learning models saturates at approximately 15 features.

Next, we discuss the SVM algorithm with the RBF kernel (SVM–RBF) to investigate the behavior of the permutation importance algorithm. Table 4 presents the top five most important residues for SVM–RBF, which were obtained in the same way as in Table 3. In addition, we present the results calculated by the LightGBM model for comparison. The results differed slightly between the two models because the models behaved differently. For example, the precision of the SVM–RBF and LightGBM models was 0.79 and 0.98, respectively, as displayed in Table 1. In addition, the recall of these models was 0.60 and 0.48, respectively. The LightGBM model exhibited superior precision, whereas the SVM–RBF model exhibited superior recall. Thus, each machine learning model focused on different aspects of the data; therefore, the importance of the residues differed between the machine learning models. However, the three residues Lys33, Leu83, and His84 were common to the top five residues of both SVM–RBF and LightGBM.
Table 4

Most Important Residues Provided by the Permutation Importance Algorithm in the SVM–RBF and LightGBM Models

     SVM–RBF  LightGBM
1    Leu83    Val18
2    Glu81    Leu134
3    Lys33    Lys33
4    Val18    Leu83
5    His84    His84
In both the LightGBM and SVM models, the recall values are lower than the precision values. One contributing factor is failure in the docking simulation process. The simplified SC and LJ interaction scores are evaluated from the protein–ligand complex structure obtained from the docking simulation. If the docking simulation does not provide the correct complex structure, active ligands will be mistakenly classified as decoy molecules. Thus, the recall metric remains low even if the simplified SC and LJ interaction scores are employed together with machine learning. In this paper, we executed Vina docking simulations against CDK2 with a fixed receptor structure. More sophisticated docking simulation techniques, such as using a flexible receptor model, may improve the recall metric. The simplified SC and LJ interaction approach can be easily incorporated into other docking simulation techniques and therefore will be useful for enhancing the performance of various docking simulations. To analyze the behavior of the trained LightGBM model, we employed partial dependence (PD) analysis.[77] In PD analysis, a feature is counterfactually changed, and (virtual) prediction results are accumulated while averaging out the effects of the other features. Thus, we can investigate the average behavior of the machine learning model when a feature is changed. Here, the probability of classifying a molecule as an active ligand is employed for the counterfactual PD calculations. In Figure 6, we summarize the PD analysis results for Lys33, Leu83, and His84, where the left and right columns correspond to the simplified SC and LJ interactions, respectively. The horizontal axis represents the standardized SC and LJ interaction score values. The vertical axis describes the counterfactual properties (probabilities) obtained from the PD analysis. From these calculations, we can investigate how the machine learning model distinguishes between actives and decoys. 
For example, if a stronger SC interaction for Lys33 is given, the LightGBM model tends to classify the molecule as an active ligand. Conversely, in the LJ interaction case for Lys33, the counterfactual probability takes higher values only within a certain range. We observe similar behaviors for the SC and LJ interaction scores of Leu83. These PD calculations indicate that a molecule needs appropriate interaction strengths with the Lys33 and Leu83 residues to be an active ligand. Thus, PD analysis with the simplified interaction scores can provide a rough understanding of how the model recognizes ligand molecules.
Figure 6

PD analysis of the simplified SC and LJ interaction scores for the LightGBM model. The horizontal and vertical axes represent the standardized interaction score and the counterfactual property (probability), respectively. The higher the counterfactual values, the more strongly the machine learning model classifies molecules as active ligands.

The residues Lys33, Leu83, and His84 appear for both models in Table 4; therefore, they may be particularly important for binding between ligands and CDK2. Figure 7 illustrates these residues, where the ligand molecule is sandwiched between them. The simplified SC and LJ interaction scores are based on (simplified) physicochemical concepts; thus, the explainable AI analysis reflects the actual chemical and biological behavior of CDK2. In fact, the Leu83 residue is known to be essential for ligand recognition by the CDK2 receptor.[78−83] Many machine learning techniques focus only on correlations (patterns) in data rather than on causal relationships (physicochemical laws). Thus, the patterns found by machine learning models do not always have actual physicochemical meaning. However, the simplified SC and LJ interaction scores possess physicochemical meaning; thus, explainable AI analysis can capture actual physicochemical phenomena. This collaborative approach between explainable AI techniques and simplified interaction scores may help to extract the physical and chemical meaning contained in data. In particular, causal relationships in biological data are sometimes weak, and the collaborative approach may be especially useful in such situations.
Figure 7

Protein structure around the ligand-binding pocket of CDK2. The residues of Lys33, Leu83, and His84 are highlighted, which play an essential role in classifying active ligands. The ligand molecule is sandwiched between these residues.

In this paper, we have discussed how the simplified SC and LJ interaction scores can work well to classify active ligands. In some physical and chemical approaches, more precise descriptions, such as first-principles quantum chemistry and molecular dynamics, have been pursued for protein–ligand interactions. However, such precise descriptions can face problems that are difficult to handle in actual assays. For example, the X-ray resolution used to obtain the receptor structure may be low, and the protonation states of ligands and protein residues may be unclear. In such situations, the simplified score approach together with machine learning may be more suitable than precise interaction descriptions because of its robustness. Other scores may also be available for this purpose.[53−57] However, it is not always clear which simplifications will work well for each target application. Systematic research on simplified interactions may be an important topic for the future. Finally, we discuss calculation results based on a different data set of 824 molecules extracted from the ChEMBL database,[5] in which molecules with an IC50 value of 150 nM or less for CDK2 were classified as active ligands. In addition, we employed a CDK2 structure (PDB-ID 6Q4G) with a resolution of 0.98 Å.[84] Thus, we examined classification conditions different from those used in the DUD-E database. When only the affinity scores calculated from Vina were employed, we could not effectively classify active ligands: the LightGBM model achieved 0.24, 0.01, and 0.03 for the precision, recall, and F1-score metrics, respectively. 
Conversely, the performance of the same machine learning algorithm was greatly improved by using the simplified SC and LJ scores, achieving 0.70, 0.48, and 0.57 for the precision, recall, and F1-score metrics, respectively. Therefore, the simplified interaction approach is useful for improving the classification performance of machine learning. In our approach, the simplified SC and LJ scores were employed together with docking simulation results to construct machine learning models; therefore, better results can usually be expected compared with the use of docking simulation alone, although machine learning is a probability-based algorithm. We can systematically handle both docking simulation results and simplified interaction scores in the machine learning approach. In addition, the behavior of machine learning models can be investigated by using explainable AI techniques to highlight important residues of the target protein. The collaborative approach between explainable AI and simplified chemical interactions will become a useful tool.
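The ChEMBL-based labeling rule described above can be expressed in one line; the records below are hypothetical placeholders, and only the 150 nM IC50 cutoff comes from the text.

```python
# Hypothetical ChEMBL-style records: (molecule ID, IC50 against CDK2 in nM).
records = [("mol_a", 12.0), ("mol_b", 150.0), ("mol_c", 480.0), ("mol_d", 95.5)]

THRESHOLD_NM = 150.0  # activity cutoff stated in the text

# Label molecules: an IC50 of 150 nM or less counts as an active ligand (1).
labels = {mol: int(ic50 <= THRESHOLD_NM) for mol, ic50 in records}
```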

Concluding Remarks

Docking simulation is routinely employed to explore ligand candidates in the early stages of drug research and development. However, the affinity (binding free energy) estimated from docking simulation alone is not sufficient for classifying active ligands. Thus, in this study, we introduced simplified SC and LJ interaction scores to improve the classification ability of machine learning models. The gradient boosting algorithm implemented in LightGBM demonstrated superior prediction ability for CDK2, achieving a precision of 0.98 in classifying active ligands. In addition, we examined the explainable AI approach, using the permutation importance and SHAP algorithms to evaluate the trained machine learning models. The explainable AI analysis with the simplified interaction scores identified residues important for ligand recognition by CDK2. The amount of experimental data grows as drug research progresses, and the accumulated data, together with the proposed simplified interaction scores, can improve the search for active ligands. More broadly, data analysis based on machine learning has become indispensable in various chemical fields, and explainable AI can be a useful tool for analyzing the complex behavior of (nonlinear) machine learning models and for obtaining insights into various research topics. Although this study employed only simplified chemical interactions to describe the ligands, these interactions were effective in improving the machine learning models. The results thus suggest that incorporating chemical concepts into feature vectors can improve machine learning models even when the target systems are not fully understood. In this paper, we discussed CDK2–ligand complexes based on a collaborative approach between the simplified interaction scores and explainable AI, examining ligand molecules registered in DUD-E. Although that data set was carefully created, it may contain some bias.
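The permutation-importance analysis mentioned above can be sketched in a model-agnostic way: shuffle one feature column at a time and record the resulting drop in accuracy. This is a minimal stdlib sketch assuming only that the trained model exposes a `predict` method; it is not LightGBM's or SHAP's own implementation.

```python
import random

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    # Mean accuracy drop when each feature column is shuffled; larger
    # drops flag features (e.g. per-residue SC/LJ scores) that the
    # model relies on for classifying active ligands.
    rng = random.Random(seed)

    def accuracy(rows):
        preds = model.predict(rows)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            shuffled = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(X, col)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances
```

Mapping each feature index back to its residue (and to SC versus LJ) turns the importance vector into a ranked list of residues such as Lys33, Leu83, and His84.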
In addition, benchmark data sets such as CASF-2007,[8] CASF-2013,[9] and CASF-2016[10] have been reported. In future work, we will examine these data sets with the collaborative approach between the simplified interaction scores and machine learning techniques.
References (first 10 of 69 shown)

1. Niu Huang; Brian K Shoichet; John J Irwin. Benchmarking sets for molecular docking. J Med Chem, 2006-11-16.

2. Scott M Lundberg; Gabriel Erion; Hugh Chen; Alex DeGrave; Jordan M Prutkin; Bala Nair; Ronit Katz; Jonathan Himmelfarb; Nisha Bansal; Su-In Lee. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat Mach Intell, 2020-01-17.

3. Tomohiro Sato; Teruki Honma; Shigeyuki Yokoyama. Combining machine learning and pharmacophore-based interaction fingerprint for in silico screening. J Chem Inf Model, 2010-01.

4. Isao Nakanishi; Katsumi Murata; Naoya Nagata; Masakuni Kurono; Takayoshi Kinoshita; Misato Yasue; Takako Miyazaki; Yoshinori Takei; Shinya Nakamura; Atsushi Sakurai; Nobuko Iwamoto; Keiji Nishiwaki; Tetsuko Nakaniwa; Yusuke Sekiguchi; Akira Hirasawa; Gozoh Tsujimoto; Kazuo Kitaura. Identification of protein kinase CK2 inhibitors using solvent dipole ordering virtual screening. Eur J Med Chem, 2015-04-15.

5. Tomomi Shimazaki; Kazuo Kitaura; Dmitri G Fedorov; Takahito Nakajima. Group molecular orbital approach to solve the Huzinaga subsystem self-consistent-field equations. J Chem Phys, 2017-02-28.

6. Jonas Cicenas; Mindaugas Valius. The CDK inhibitors in cancer research and therapy (review). J Cancer Res Clin Oncol, 2011-08-30.

7. Duc Duy Nguyen; Guo-Wei Wei. AGL-Score: Algebraic Graph Learning Score for Protein-Ligand Binding Scoring, Ranking, Docking, and Screening. J Chem Inf Model, 2019-07-01.

8. Michael M Mysinger; Michael Carchia; John J Irwin; Brian K Shoichet. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem, 2012-07-05.

9. Zixuan Cang; Lin Mu; Guo-Wei Wei. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput Biol, 2018-01-08.

10. Sunghwan Kim; Paul A Thiessen; Evan E Bolton; Jie Chen; Gang Fu; Asta Gindulyte; Lianyi Han; Jane He; Siqian He; Benjamin A Shoemaker; Jiyao Wang; Bo Yu; Jian Zhang; Stephen H Bryant. PubChem Substance and Compound databases. Nucleic Acids Res, 2015-09-22.
