Molecular insights on ABL kinase activation using tree-based machine learning models and molecular docking.

Philipe Oliveira Fernandes¹, Diego Magno Martins², Aline de Souza Bozzi², João Paulo A Martins², Adolfo Henrique de Moraes², Vinícius Gonçalves Maltarollo³.

Abstract

Abelson kinase (c-Abl) is a non-receptor tyrosine kinase involved in several biological processes essential for cell differentiation, migration, proliferation, and survival. Activating this enzyme might be an alternative strategy for treating conditions such as chemotherapy-induced neutropenia and prostate and breast cancer. Recently, a series of compounds that promote the activation of c-Abl has been identified, opening promising ground for c-Abl drug development. Structure-based drug design (SBDD) and ligand-based drug design (LBDD) methodologies have significantly impacted recent drug development initiatives. Here, we combined SBDD and LBDD approaches to characterize the critical chemical properties and interactions of the identified c-Abl activators. We used molecular docking simulations combined with tree-based machine learning models (decision tree, AdaBoost, and random forest) to understand the structural features that c-Abl activators require to bind to the myristoyl pocket and, consequently, to promote enzyme and cellular activation. We obtained predictive and robust models with Matthews correlation coefficient values higher than 0.4 for all endpoints and identified characteristics that led to the construction of a structure–activity relationship (SAR) model.

Keywords: ABL kinase activators; LBDD; Machine learning; Molecular docking; QSAR; SAR; SBDD

Year: 2021 PMID: 34191245 PMCID: PMC8241884 DOI: 10.1007/s11030-021-10261-z

Source DB: PubMed Journal: Mol Divers ISSN: 1381-1991 Impact factor: 3.364

Introduction

Abelson kinase (c-Abl) is a non-receptor tyrosine kinase located in many subcellular compartments, including the endoplasmic reticulum, cytoplasm, nucleus, cell cortex, and mitochondria [1]. This enzyme modulates several biological processes, including actin polymerization [2, 3], structural changes in chromatin [4], responses to DNA damage [5, 6], and others essential for cell proliferation, differentiation, migration, survival, and death [7]. The inhibition of c-Abl has been studied because its over-activation is associated with various diseases, such as chronic myeloid leukemia (CML), cancer, immunological diseases, and neurological disorders such as Parkinson's disease, among others [8-11]. However, the activation of c-Abl has become a subject of investigation in recent years, since transient activation of this enzyme may have therapeutic applications, such as the treatment of chemotherapy-induced neutropenia [12-14]. Moreover, c-Abl activators can block TGFβ-responsive mammary tumor growth in mice, suggesting a potential strategy against breast cancer [15]. Abl kinases can act as oncogenes in some cases, as in CML, and as tumor suppressors in others, such as prostate cancer with α3 integrin deficiency [16]. For this reason, in specific circumstances, small-molecule activators of c-Abl kinases could help in prostate cancer treatment [14]. Other potential applications of activators include treating ischemic injury [17] and producing synergistic effects when combined with BCR-Abl inhibitors, which would improve treatment efficiency [18]. c-Abl autoinhibition is mainly regulated by a series of intramolecular interactions involving the SH3 and SH2 domains that stabilize the inactive conformation of the kinase domain [1, 19]. Another regulation mechanism is the binding of the N-terminal myristoyl group to the myristoyl binding site in the C-lobe of the kinase domain (Fig. 1a, b). This binding helps form a compact conformation in which the enzyme is in an autoinhibited state [20, 21].

Fig. 1

a Representation of the conformational changes in the c-Abl kinase during its autoinhibition process; b cartoon representation of c-Abl kinase colored by domain (PDB ID 2FO0); and c representation of DPH bound to the myristoyl binding site (PDB ID 3PYY)

In 2011, Yang et al. identified DPH, the first cell-permeable molecule capable of activating c-Abl [21] (Fig. 1c). Subsequent studies carried out by the same research group [22] identified a series of compounds capable of activating c-Abl whose mechanism of action involves binding to the myristoyl binding site, inducing conformational changes [23, 24]. Compared to inhibitors, only a few small-molecule enzyme activators have a well-characterized mechanism of action. Moreover, beyond the possible therapeutic applications, the study of enzymatic activators can help to understand how the enzyme’s activation affects a particular metabolic pathway and to explain the conformational changes that govern protein function [25]. However, even though recent studies have advanced the understanding of c-Abl activation mechanisms, few investigations have explored the rational design of molecules capable of performing this function [22].

Traditional drug development had an estimated cost of 58.8 billion USD in 2015, 10% higher than in 2014 [26]. Besides being expensive, this process is also time-consuming, requiring about 10–15 years. The high cost and long timelines, together with the low success rate of the traditional drug discovery process, highlight the need for computer-aided drug discovery (CADD) in the drug development pipeline [27]. Computational strategies can be classified as structure-based drug design (SBDD) or ligand-based drug design (LBDD). SBDD analyzes the interactions between the molecular target and the ligand to rationalize the design of novel bioactive compounds [28]. A classic example of the SBDD approach is molecular docking. This method aims to estimate the binding mode using search algorithms and the interaction energies using a scoring function [29]. LBDD is an indirect approach based on the analysis of the physicochemical properties or molecular features of known active compounds [30]. This approach is advantageous compared to SBDD since it does not require preliminary knowledge of the biological target. Many LBDD strategies use quantitative structure–activity relationship (QSAR) techniques such as comparative molecular field analysis (CoMFA) [31, 32], comparative similarity indices analysis (CoMSIA) [33, 34], hologram quantitative structure–activity relationship (HQSAR) [35-37], QuBiLS-MIDAS [38, 39], radial distribution function (RDF) indices [40], and GETAWAY descriptors [41]. In addition, machine learning approaches have increasingly been applied in drug design and adopted by many pharmaceutical companies [42-50].

The tree-based models proposed by Breiman et al. in 1984 [51] are good examples of machine learning algorithms for drug design purposes [52, 53]. In these models, the goal is to split the dataset into binary groups with the highest possible homogeneity. At the first split, the feature chosen is the one that gives the highest gain in homogeneity. Then, as the tree grows, other features are added to the splitting process to increase the homogeneity within these groups [54]. Well-known examples of these models are the decision tree (DT) [55], random forest (RF) [56], and adaptive boosting (AdaBoost) [57]. Tree-based models are widely used in several research areas, such as metabolomics [58], disease detection [59], toxicological predictions [60-62], and stock prediction [63], due to their simplicity, efficiency, and interpretability.

In this context, this work aimed to study the structure–activity relationship (SAR) of a series of c-Abl kinase activators by combining molecular docking simulations with tree-based classification machine learning models (AdaBoost, decision tree, and random forest) for predicting myristoyl binding (MB), enzyme activation (EA), and cellular activation (CA).

Materials and methods

Dataset description and preparation

We selected compounds classified as c-Abl kinase activators identified by Simpson et al. [22]. We classified the c-Abl kinase activators according to enzyme activation capability and myristoyl binding affinity, measured as the negative logarithm of the half-maximal inhibitory concentration (pIC50), and cellular activation, measured as the negative logarithm of the half-maximal effective concentration (pEC50). The structures of all compounds are shown in Fig. 2, and their biological activities are listed in Supporting Information Table S1. Of the 52 reported compounds, 49 were used; compounds 33, 45, and 50, which are racemates or chiral compounds lacking defined stereochemistry, were excluded. The 3D structures of the compounds were generated using Avogadro 1.2.0 [64]. Geometry optimization was performed using the Open Babel package [65] with the MMFF94 force field [66-70], the steepest descent optimization algorithm, and Newton’s method line search.

Fig. 2

Structure of the 52 c-Abl activators identified by Simpson et al. [22]

Molecular docking

Simpson et al. [22] solved three c-Abl structures with enzyme activators using X-ray crystallography (PDB IDs: 6NPV, 6NPU, and 6NPE). The protein structure selected for the docking simulations was chain A of the structure co-crystallized with compound 51 (PDB ID: 6NPV), chosen because it has the best resolution (1.86 Å). The protein structure was prepared for docking by removing all crystallographic water and phosphate molecules, and the ionization states of the amino acid residues were adjusted using Discovery Studio Visualizer v19.1.0.18287 [71]. Molecular docking simulations were carried out using GOLD 5.8.1 [72], with a search region of 10 Å radius centered at the CB atom of ALA452, which contained the myristoyl binding site. All remaining parameters regarding ligand flexibility were kept at their defaults. The genetic algorithm used in GOLD was set to maximum search efficiency with 50 GA runs per ligand. All conformations were ranked according to the ChemPLP [73] scoring function.

Redocking of ligand 51 and cross-docking of ligands 6 and 29 were carried out to evaluate how well the experimental binding mode could be predicted and to validate the docking protocol. In this approach, compounds 6 and 29 were docked into the selected structure and compared with their co-crystallized poses deposited in the PDB under the IDs 6NPE (resolution of 2.15 Å) and 6NPU (resolution of 2.33 Å), respectively. The root mean square deviation (RMSD) between the docked poses and the crystallized ligands was the criterion used to assess whether the simulation conditions were adequate. In addition, the results for each ligand were grouped into clusters of poses that differ by a maximum of 1 Å from one another. The best-scoring poses were then selected from the most representative clusters that reproduced the experimental data and were analyzed using PyMOL v1.8 [74], Discovery Studio v19.1.0.18287, and the PLIP algorithm [75] for pose visualization, interaction evaluation, and figure creation.
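As an illustration of the pose comparison described above, the following minimal Python sketch computes an in-place RMSD between two poses and groups poses with a 1 Å cutoff. It is not the GOLD implementation: the function names (rmsd, cluster_poses), the greedy clustering rule, and the toy coordinates are assumptions made for illustration only. The sketch also assumes identical atom ordering in both poses and ignores ligand symmetry.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (in the units of the input) between two poses, without superposition.

    Assumes both poses list the same atoms in the same order.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must have the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def cluster_poses(poses, cutoff=1.0):
    """Greedy clustering (illustrative): a pose joins the first cluster whose
    representative (first member) lies within `cutoff`; otherwise it starts
    a new cluster."""
    clusters = []
    for index, pose in enumerate(poses):
        for cluster in clusters:
            if rmsd(poses[cluster[0]], pose) <= cutoff:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Toy coordinates (hypothetical, in angstroms): a reference pose, a pose
# shifted by 0.5 along x, and a pose shifted by 5.0 along x.
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
docked_pose = [(x + 0.5, y, z) for x, y, z in crystal_pose]
distant_pose = [(x + 5.0, y, z) for x, y, z in crystal_pose]

redocking_rmsd = rmsd(crystal_pose, docked_pose)
clusters = cluster_poses([crystal_pose, docked_pose, distant_pose])
print(redocking_rmsd, clusters)
```

A greedy rule is used here only for brevity; a stricter reading of "differ by a maximum of 1 Å from one another" would require every pair within a cluster to satisfy the cutoff.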

Machine learning models

Molecular fingerprints were calculated with the PaDEL-Descriptor software [76]. The fingerprints selected for constructing the machine learning models were AtomPairs2D, Klekota-Roth, MACCS, PubChem, and Substructure, chosen for their interpretability. Compounds were divided into active and inactive according to their biological activity parameters: myristoyl binding (MB), enzymatic activation (EA), and cellular activation (CA). Entries whose pEC50 or pIC50 value was not precisely determined for an endpoint were removed from the mean activity value calculation. Thus, the MB and EA models were built using 48 compounds with mean activity values of 5.733 and 5.925, respectively, while the CA model was built from 44 compounds with a mean activity value of 5.270. All models were constructed using strategies similar to those previously applied [53, 77]. The Python library Scikit-learn was used for the data analysis [78] (Fig. 3). The random training/test splits were performed using the train_test_split function from Scikit-learn in an 80:20 ratio, based on the capability of random selection to generate predictive models [79].

The AdaBoost (AB) models were constructed by varying the maximum number of estimators at which boosting is terminated (n_estimators) from 1 to 106 in steps of 5, plus 200, 500, and 600, and the learning rate (learning_rate) from 0.1 to 20 in steps of 0.1. For the decision tree (DT), the maximum depth of the tree (max_depth) was varied from 10 to 100 in steps of 10, plus no depth limit. The minimum number of samples required to split an internal node (min_samples_split) was varied from 2 to 100 in steps of 2. The random forest (RF) models were built by varying max_depth and min_samples_split as in the DT models and n_estimators as in the AB models. Accuracy, precision, recall, F1 score, and Matthews correlation coefficient (MCC) were calculated for each model using internal and external validations. Internal validation was also carried out using fivefold cross-validation (cross_validate). Model evaluation was based on the MCC values from the validation processes, and models that predicted no positive samples were excluded from further analysis. The best model for each endpoint was subjected to X-scrambling validation following the SCRAMBLE’N’GAMBLE methodology [80].
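The workflow described above can be sketched in a few lines of Scikit-learn code. This is a minimal illustration and not the authors' script: the fingerprint matrix and activities are synthetic, the hyperparameter grid is heavily reduced, and labeling compounds as active when their activity is at or above the mean is an assumption about how the mean activity value was used.

```python
import numpy as np
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a binary fingerprint matrix: 48 compounds x 20 bits.
X = rng.integers(0, 2, size=(48, 20))
# Synthetic pIC50-like activities in which bit 0 carries the signal.
activity = 5.5 + 0.8 * X[:, 0] + rng.normal(0.0, 0.2, size=48)
# Assumption: compounds at or above the mean activity are labeled active (1).
y = (activity >= activity.mean()).astype(int)

# Random 80:20 training/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

mcc_scorer = make_scorer(matthews_corrcoef)
results = []
# Reduced grid for illustration; the text describes a much larger one.
for max_depth in (10, 50, None):
    for min_samples_split in (2, 10, 20):
        model = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=0,
        )
        # Internal validation: fivefold cross-validation scored by MCC.
        cv = cross_validate(model, X_train, y_train, cv=5, scoring=mcc_scorer)
        # External validation: MCC on the held-out test set.
        model.fit(X_train, y_train)
        ext_mcc = matthews_corrcoef(y_test, model.predict(X_test))
        results.append(
            {
                "max_depth": max_depth,
                "min_samples_split": min_samples_split,
                "cv_mcc": float(cv["test_score"].mean()),
                "ext_mcc": float(ext_mcc),
            }
        )

best = max(results, key=lambda r: r["cv_mcc"])
print(best)
```

The same loop structure would apply to AdaBoostClassifier and RandomForestClassifier with their respective n_estimators and learning_rate ranges.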

Fig. 3

Flowchart of the machine learning process applied in this work

The applicability domain of the best model for each endpoint was assessed by the bounding box approach using principal component analysis (PCA) [81, 82], as implemented in the Scikit-learn library. Only the independent variables selected for each endpoint from the training set were employed to construct the PCA model. The same model was used to transform the test set data. The PCA data were also used to assess the applicability domain based on the range and on the distance of each test sample from the training dataset, using the Euclidean, Manhattan, cosine, and Wasserstein (probability distribution) distances implemented in the SciPy library [83], giving a consensus analysis of the applicability domain [84] with a threshold of 95%. The best result for each endpoint was interpreted using permutation importance from the Scikit-learn library (permutation_importance), with the number of times each feature is permuted set to 10 (n_repeats) and MCC as the metric. Features were interpreted from their SMARTS patterns with the SMARTS.plus web service [85]. Plotting was performed using the Matplotlib and Seaborn libraries.
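The PCA bounding box check can be illustrated with the short sketch below. It is a simplified reading of the approach, not the authors' code: the helper name in_bounding_box, the use of three components, and the synthetic data are assumptions, and the distance-based consensus analysis is not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA


def in_bounding_box(X_train, X_test, n_components=3):
    """Return a boolean array marking test samples whose PCA scores fall
    inside the min/max range of the training scores on every component,
    together with the fraction of variance retained by the components."""
    pca = PCA(n_components=n_components).fit(X_train)  # fit on training data only
    train_scores = pca.transform(X_train)
    test_scores = pca.transform(X_test)                # same model for test data
    lower = train_scores.min(axis=0)
    upper = train_scores.max(axis=0)
    inside = np.all((test_scores >= lower) & (test_scores <= upper), axis=1)
    return inside, float(pca.explained_variance_ratio_.sum())


rng = np.random.default_rng(0)
# Synthetic binary "fingerprints" for 40 training compounds and 10 bits.
X_train = rng.integers(0, 2, size=(40, 10)).astype(float)
# Two hypothetical test samples: the training centroid (well inside the
# domain) and an extreme point far from any training compound.
X_test = np.vstack([X_train.mean(axis=0), np.full(10, 50.0)])

inside, retained_variance = in_bounding_box(X_train, X_test)
print(inside, retained_variance)
```

Fitting the PCA on the training set alone and only transforming the test set keeps the test data from influencing the definition of the domain.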

Results and discussion

The RMSD values obtained in the redocking and cross-docking processes were 0.878 Å (Fig. 4a), 0.402 Å (Fig. 4b), and 0.698 Å (Fig. 4c) for compounds 51, 29, and 6, respectively. These results indicate that the employed docking protocol is suitable for pose prediction, since RMSD values lower than 1.5 or 2 Å, depending on ligand size, are taken to indicate that a docking protocol has successfully predicted the experimental binding mode [86]. The same protocol was then used to perform the docking simulations for all other compounds.

Fig. 4

Molecular docking validation results. a Redocking of compound 51 in the myristoyl binding site of the Abl kinase crystal structure (PDB ID 6NPV); b cross-docking of compound 29 in the Abl crystal structure (PDB ID 6NPU); and c cross-docking of compound 6 in the Abl crystal structure (PDB ID 6NPE)

The myristoyl binding site is composed of several hydrophobic residues [20]. The aromatic ring of the compounds fits in a hydrophobic pocket, a region deeper in the myristoyl binding site where a series of interactions occur, including π-stacking with PHE512, van der Waals interactions with ALA363, LEU359, LEU448, ILE451, and VAL487 [87], and halogen bonding with LEU448. In the region most exposed to the solvent, the compounds access various interaction sites, such as hydrogen bond acceptors and donors and van der Waals contacts, depending on their substituents.

Starting from five fingerprints, 479 AdaBoost (AB) models, 549 decision tree (DT) models, and 2,903 random forest (RF) models were built and validated for each fingerprint and endpoint, totaling 59,640 machine learning models (a boxplot analysis for each model and fingerprint is displayed in Supplementary Figure S1). The cross-validation MCC value was selected to evaluate model performance because it summarizes performance in a single value that takes into account all entries of the confusion matrix [88, 89]. If a model had the highest CVMCC and the highest EXTMCC, it was selected. Otherwise, the distance from perfection for these two metrics was considered, in order to select a model balanced in both validations. Figure 5a illustrates the first situation, in which a model built from the MACCS fingerprint has the highest CVMCC and EXTMCC, while Fig. 5b, c shows the second.
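The two-step selection rule described above can be written as a short sketch. The text does not specify how the distance from perfection was computed; the Euclidean distance to the point (CVMCC, EXTMCC) = (1, 1) used here is an assumption, as are the function names. The first candidate set reuses the myristoyl binding values reported in Table 1 purely as an illustration, and the second set is hypothetical.

```python
import math


def distance_from_perfection(cv_mcc, ext_mcc):
    """Euclidean distance to the ideal point (1, 1); an assumed definition."""
    return math.hypot(1.0 - cv_mcc, 1.0 - ext_mcc)


def select_model(candidates):
    """candidates maps a model name to a (CVMCC, EXTMCC) pair.

    Step 1: if one model has both the highest CVMCC and the highest EXTMCC,
    return it. Step 2: otherwise return the model closest to (1, 1).
    """
    best_cv = max(cv for cv, _ in candidates.values())
    best_ext = max(ext for _, ext in candidates.values())
    for name, (cv, ext) in candidates.items():
        if cv == best_cv and ext == best_ext:
            return name
    return min(candidates, key=lambda n: distance_from_perfection(*candidates[n]))


# Illustration with the myristoyl binding values from Table 1.
table1_mb = {
    "AdaBoost/PubChem": (0.445, 0.612),
    "Decision tree/AtomPairs2D": (0.188, 0.500),
    "Random forest/PubChem": (0.422, 0.500),
}
# Hypothetical candidates in which no model leads on both metrics.
no_clear_winner = {"model A": (0.60, 0.50), "model B": (0.50, 0.70)}

chosen_mb = select_model(table1_mb)
chosen_balanced = select_model(no_clear_winner)
print(chosen_mb, chosen_balanced)
```

In the second case, model B is chosen because its distance to (1, 1) is about 0.58 against about 0.64 for model A.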

Fig. 5

Scatter plot of cross-validation and external validation MCC values from different methods and endpoints: a random forest for cellular activation; b decision tree for myristoyl binding; and c AdaBoost for myristoyl binding

For the MB models, AdaBoost with the PubChem fingerprint performed better than the other methods on both metrics (CVMCC of 0.445 and EXTMCC of 0.612). The most predictive random forest model used the same fingerprint. For this endpoint, the best decision tree result was obtained using AtomPairs2D, but this model displayed the worst generalization capability of the three. The selected models for EA were perfect for all three methods according to the external validation MCC and achieved high generalization capability in cross-validation. RF was the best method in terms of CVMCC, with a value of 0.616, followed by AB and DT with 0.603 and 0.496, respectively. Lastly, RF also performed best for CA, achieving a perfect score in external validation and 0.622 in cross-validation using the MACCS fingerprint. AdaBoost with AtomPairs2D achieved internal and external MCC values of 0.536 and 0.790, respectively, and the decision tree with PubChem achieved a CVMCC of 0.441 and an EXTMCC of 0.500. Table 1 summarizes the results of the selected models for each endpoint. Only the AB model with PubChem for myristoyl binding, the RF model with PubChem for enzymatic activation, and the RF model with MACCS for cellular activation were considered for further analysis, owing to their higher scores during validation; the remaining models were disregarded.

Table 1

Endpoint	Method	Fingerprint	CVMCC	EXTMCC	CVAUC	EXTAUC	CVF1	EXTF1
Myristoyl binding	AdaBoost	PubChem	0.445	0.612	0.692	0.875	0.659	0.857
	Decision tree	AtomPairs2D	0.188	0.500	0.579	0.813	0.605	0.769
	Random forest	PubChem	0.422	0.500	0.692	0.813	0.718	0.769
Enzymatic activation	AdaBoost	AtomPairs2D	0.603	1.000	0.777	1.000	0.876	1.000
	Decision tree	AtomPairs2D	0.496	1.000	0.707	1.000	0.830	1.000
	Random forest	PubChem	0.616	1.000	0.760	1.000	0.867	1.000
Cellular activation	AdaBoost	AtomPairs2D	0.536	0.790	0.738	0.916	0.537	0.909
	Decision tree	PubChem	0.441	0.500	0.700	0.833	0.506	0.923
	Random forest	MACCS	0.622	1.000	0.788	1.000	0.693	1.000

Method, fingerprint, cross-validation MCC (CVMCC), external validation MCC (EXTMCC), cross-validation AUC (CVAUC), external validation AUC (EXTAUC), cross-validation F1-score (CVF1), and external validation F1-score (EXTF1) for the selected model for each endpoint

It is important to mention that the models were generated from a small amount of chemical and biological data and, as expected, could present some bias and low variance, which could lead to underfitting [90]. Regarding recall, a relevant metric for finding new active compounds, the RF models for enzymatic and cellular activation performed perfectly, achieving a value of 1, while the model for myristoyl binding achieved a value of 0.625. Despite the lower results of the myristoyl binding model, the models were properly validated and could be used together (consensus predictions) to achieve a better success rate in finding new activators, a strategy already described in the literature [91-94].

Additional validation was then carried out to verify the robustness of the three selected models and to ensure their predictability. For regression models, Y-scrambling is a widely used validation strategy [95]; by definition, it measures the prediction errors and/or validation coefficients of artificial models generated with scrambled target values (y) [96]. However, in classification models, the absolute error value is always one due to the categorical nature of the target value, and for that reason X-scrambling can provide more diversity in the scrambled input data for model generation. X-scrambling validation indicated that the selected models were not generated by chance (Fig. 6), because artificial models generated with X-scrambled data failed in internal and/or external validation in comparison with the original models.
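A minimal sketch of an X-scrambling check is given below. It is not the SCRAMBLE’N’GAMBLE implementation: here X-scrambling is taken to mean permuting each descriptor column independently across compounds while keeping the labels fixed, which is an assumption, and the data, model settings, and helper names (cv_mcc, x_scramble) are illustrative. Twenty scrambled models are generated, matching the number shown in Fig. 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_compounds, n_bits = 40, 10

# Synthetic fingerprints; bit 0 fully determines the class label.
X = rng.integers(0, 2, size=(n_compounds, n_bits))
X[:, 0] = np.tile([0, 1], n_compounds // 2)
y = X[:, 0].copy()

mcc_scorer = make_scorer(matthews_corrcoef)


def cv_mcc(features, labels):
    """Mean fivefold cross-validation MCC of a small random forest."""
    model = RandomForestClassifier(n_estimators=25, random_state=0)
    return float(
        cross_val_score(model, features, labels, cv=5, scoring=mcc_scorer).mean()
    )


def x_scramble(features, generator):
    """Permute every column independently, breaking the link between
    descriptors and labels while preserving each column's distribution."""
    scrambled = features.copy()
    for column in range(scrambled.shape[1]):
        scrambled[:, column] = generator.permutation(scrambled[:, column])
    return scrambled


original_mcc = cv_mcc(X, y)
scrambled_mccs = [cv_mcc(x_scramble(X, rng), y) for _ in range(20)]
print(original_mcc, float(np.mean(scrambled_mccs)))
```

A model that has learned a real relationship is expected to score clearly above the scrambled models, whereas similar scores would suggest a chance correlation.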

Fig. 6

Comparison between cross-validation MCC (CVMCC) and external validation MCC (EXTMCC) of the selected original model for each endpoint (pink) and 20 X-scrambled models (green): a myristoyl binding model; b enzymatic activation model; and c cellular activation model

After the validation process, the applicability domain of the selected model for each endpoint was assessed by fitting a PCA to the training data and transforming the test data. The PCA bounding box approach for each endpoint is shown in Fig. 7, and the test data were found to lie within the training set applicability domain in all three analyses. For the myristoyl binding and enzymatic activation data, in the PCA constructed using PubChem (Fig. 7a, b), three PCs represent 60.5% of the total variance. Similarly, the PCA built using MACCS and the cellular activation data retains 62.1% of the total variance (Fig. 7c). Using the range and the Euclidean, Manhattan, cosine, and Wasserstein distance approaches for the applicability domain assessment, no test set compound was considered out of the domain for any of the three models. Therefore, all the training and test set splits were suitable for model validation.

Fig. 7

The PCA bounding box approach for the applicability domain assessment for each endpoint: a PubChem fingerprints and myristoyl binding data; b PubChem fingerprints and enzymatic activation data; and c MACCS fingerprints and cellular activation data. Each PC axis displays the contained variance

Leonard and Roy [97] have discussed methods for the rational selection of training and test sets to obtain more predictive models. Nevertheless, in our work, random selection yielded predictive models and a suitable distribution of the chemical space between the training and test sets, as shown in the hierarchical clustering analysis (HCA) dendrograms presented in Supplementary Figure S2. These analyses may be helpful for characterizing a molecular dataset, especially in cases with multiple endpoints, since HCA indirectly measures the applicability domain. It can also be used to split data into training and test sets.

Finally, the essential features of the selected models were interpreted using permutation importance on the training data (Fig. 8). Permutation importance is defined as the decrease in a model score, in this case MCC, on the training set when a single feature is randomly shuffled. The AdaBoost model for MB using the PubChem fingerprint returned 10 features with non-zero importance, and the four most important were positions 645, 333, 364, and 579, in that order. For the random forest model for EA, also using PubChem, the only feature with non-zero importance in the permutation process was position 780. For the random forest model for CA using MACCS, 14 features had non-zero importance, and the four most important were positions 95, 160, 42, and 106, in that order. These features were selected to carry out the structure–activity relationship interpretation.
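The permutation importance calculation can be sketched with the Scikit-learn function named in the methods, using MCC as the score and 10 repeats per feature. The data and model below are synthetic and illustrative; only the call to permutation_importance with n_repeats=10 and an MCC scorer follows the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import make_scorer, matthews_corrcoef

rng = np.random.default_rng(0)

# Synthetic training fingerprints: 40 compounds x 8 bits, where bit 0
# fully determines the class label and the other bits are noise.
X_train = rng.integers(0, 2, size=(40, 8))
X_train[:, 0] = np.tile([0, 1], 20)
y_train = X_train[:, 0].copy()

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Each feature is shuffled 10 times; importance is the mean drop in MCC
# on the training data relative to the unshuffled score.
result = permutation_importance(
    model,
    X_train,
    y_train,
    scoring=make_scorer(matthews_corrcoef),
    n_repeats=10,
    random_state=0,
)

# Feature positions ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
nonzero = [int(i) for i in ranking if result.importances_mean[i] != 0]
print(ranking[:4], nonzero)
```

Features whose shuffling leaves the MCC unchanged receive an importance of zero, which is how the lists of non-zero positions reported above would arise.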

Fig. 8

Feature importance from permutation process: a AdaBoost for myristoyl binding model using PubChem fingerprint and b random forest model for cellular activation using MACCS fingerprint

Structure–activity relationship

The representation of each fingerprint key pattern and the frequency in the active/inactive compounds are shown in Table 2. With this data, it is possible to see that some fingerprint keys had substantial presence and distinct frequencies in the active and inactive compounds. Using this information, it is possible to infer structure–activity relationships for this dataset. For this analysis, features with an accumulated frequency higher than 30% were selected.
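The frequency analysis can be illustrated with a small sketch. It assumes that the percentages are expressed relative to all compounds of the endpoint dataset, which is consistent with values such as 47.916% (23 of 48 compounds) in Table 2, and that the accumulated frequency is the sum of the active and inactive percentages. The function name and the example bit vector are hypothetical.

```python
def key_frequencies(bits, labels):
    """Percentage of all compounds that carry a fingerprint key and are
    active, and the percentage that carry it and are inactive.

    bits:   0/1 presence of one fingerprint key per compound
    labels: 1 for active, 0 for inactive, in the same order
    """
    if len(bits) != len(labels):
        raise ValueError("bits and labels must have the same length")
    total = len(labels)
    in_active = sum(1 for b, lab in zip(bits, labels) if b and lab == 1)
    in_inactive = sum(1 for b, lab in zip(bits, labels) if b and lab == 0)
    return 100.0 * in_active / total, 100.0 * in_inactive / total


# Hypothetical dataset of 48 compounds (30 active, 18 inactive) in which
# the key is present in 23 active and 11 inactive compounds.
labels = [1] * 30 + [0] * 18
bits = [1] * 23 + [0] * 7 + [1] * 11 + [0] * 7

active_pct, inactive_pct = key_frequencies(bits, labels)
accumulated = active_pct + inactive_pct
selected = accumulated > 30.0  # 30% threshold on the accumulated frequency
print(active_pct, inactive_pct, accumulated, selected)
```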

Table 2

Visual interpretation for the fingerprints with permutation feature importance different from zero used in each endpoint

Endpoint	Fingerprint key	Presence in active compounds (%)	Presence in inactive compounds (%)	Visual interpretation
AdaBoost model for Myristoyl binding	PubChemFP333	47.916	22.916
	PubChemFP364	8.333	6.250
	PubChemFP579	29.166	4.166
	PubChemFP645	37.500	16.666
Random forest model for enzymatic activation	PubChemFP780	29.545	11.363
Random forest model for cellular activation	MACCSFP42	10.416	4.166	F
	MACCSFP95	2.083	12.500
	MACCSFP160	8.333	35.416	CH₃

Dashed bonds represent any bond, and A means any atom

From the MB model, the fingerprint PubChemFP333 has a frequency two times higher in the active compounds. The most common position of this pattern is the methylated carbon in the pyrazoline ring. The crystallographic structure of compound 51 complexed with c-Abl also shows the van der Waals interactions of this pattern (in this case, a carbon near a methylene) with LEU359 (Fig. 9a), and compound 47 has a methyl group on the pyrazoline ring interacting with the same residue (Fig. 9b). PubChemFP579 was over six times more frequent in the active compounds, showing the importance of the bulky aliphatic group for myristoyl binding affinity. This importance is exemplified by the docking results, where compounds 4 and 32 occupy a hydrophobic site between the carbon chains of the GLU481 and TYR454 residues at the entrance of the myristoyl binding site (Fig. 9c, d). For PubChemFP645, the frequency in the active group is over two times higher than in the inactive group. This pattern is important in the model because the presence of a nitrogen between a carbonyl group and two carbons places the nitrogen in the proper position to interact with the ALA452 residue, acting as a hydrogen bond donor (Fig. 9e), an advance proposed by Huong et al. in 2014 [87]. This pattern displayed the same behavior even with different rings in both positions, as shown for compound 17 (Fig. 9f) and compound 19 (Fig. 9g), highlighting the importance of this interaction for compound recognition.

For the enzymatic activation model, the pattern in PubChemFP780 has a frequency two times higher in the active compounds than in the inactive compounds. The importance of this moiety is due to its optimal fit to the hydrophobic pocket, forming van der Waals interactions with residues LEU359, ALA452, TYR454, and PRO484 (Fig. 9h). Furthermore, the chlorine atom in the para position may be used as a discriminant feature among all rings, since its substitution by a polar group would disfavor these interactions [25]. The hydrophobic cavity surrounding this group can be seen in the compound 51 crystal structure and in the docking of compound 47 (Fig. 9a, b, respectively). Also, although not captured by the machine learning models, a chlorine atom in the meta position can interact with LEU442, forming a halogen bond.

Fig. 9

Visual interpretation of the fingerprints using visual analysis: a crystal structure of compound 51 (PDB ID 6NPV), where van der Waals interactions are represented as dashed lines; b docking result of compound 47, where van der Waals interactions are represented as dashed lines; c docking result of compound 4, where the surface is colored by hydrophobicity; d docking result of compound 32, where the surface is colored by hydrophobicity; e docking result of compound 34, where hydrogen bond interaction is represented as a blue line; f docking result of compound 17, where hydrogen bond interaction is represented as a blue line; g docking result of compound 19, where hydrogen bond interaction is represented as a blue line; and h surface of myristoyl binding pocket colored by hydrophobicity

For the CA model, the most frequent pattern was the generic methyl group recurring in the inactive compounds. This result highlights the information from PubChemFP333, showing that only substitution in the pyrazoline ring is favorable. A common position for this generic methyl group is also alongside the same carbonyl group as in PubChemFP579, showing the importance of aliphatic bulk in this position. Finally, combining all the information from the ML models and docking simulations, a SAR model was derived and is described in Fig. 10.

Fig. 10

SAR model for the activation of c-Abl kinase based on the molecular docking and machine learning models identified in three compounds. The patterns from the machine learning models are bold and colored. Interactions from the crystal structure and molecular docking are represented as dashed lines and display hydrophobic contacts (Hyd), hydrogen bond donor (HBD), and halogen bond (HB)

Finally, the generated models corroborated experimental binding modes and docking studies, suggesting that the combination of LBDD and SBDD strategies could be employed in further drug design studies. It is well known in the literature that consensus predictions improve the predictability of QSAR models [91-94] and virtual screening protocols [98-100]. Therefore, despite the limitations of the generated models (moderate predictability of the MB model and the possibility of bias due to the small dataset), predicting novel compounds, followed by docking studies and guided by the SAR proposed in our work and in previous studies [22, 87], could be useful for the design of c-Abl activators.

Conclusion

Using classification machine learning models allowed the construction of robust and predictive models for c-Abl activation, covering myristoyl binding, enzyme activation, and cellular activation. For the prediction of myristoyl binding affinity, the AdaBoost algorithm using the PubChem fingerprint achieved the best MCC results in external and cross-validation. For enzyme activation, the random forest algorithm with the PubChem fingerprint had the best performance. Finally, for cellular activation, random forest obtained the highest MCC value using the MACCS fingerprint. It is important to mention that using a small dataset to train and validate the models could introduce bias, limiting the models’ application and extrapolation in the SAR study. However, the combination of molecular docking with the interpretation of molecular fingerprints from the machine learning models corroborated the SARs described by Simpson et al. [25] and provided new insights into this structure–activity relationship. This work may assist the next steps in identifying and designing more potent novel kinase activators.

Electronic supplementary material: Supplementary file 1 (DOCX, 532 KB)

69 in total

Review 1. Chronic myeloid leukemia.

Authors: C L Sawyers
Journal: N Engl J Med Date: 1999-04-29 Impact factor: 91.245

Review 2. Oncogenic kinase signalling.

Authors: P Blume-Jensen; T Hunter
Journal: Nature Date: 2001-05-17 Impact factor: 49.962

3. Nuclear c-Abl-mediated tyrosine phosphorylation induces chromatin structural changes through histone modifications that include H4K16 hypoacetylation.

Authors: Kazumasa Aoyama; Yasunori Fukumoto; Kenichi Ishibashi; Sho Kubota; Takao Morinaga; Yasuyoshi Horiike; Ryuzaburo Yuki; Akinori Takahashi; Yuji Nakayama; Naoto Yamaguchi
Journal: Exp Cell Res Date: 2011-10-02 Impact factor: 3.905

4. The c-Abl tyrosine kinase regulates actin remodeling at the immune synapse.

Review 5. Cycling, stressed-out and nervous: cellular functions of c-Abl.

Review 6. Regulation of F-actin-dependent processes by the Abl family of tyrosine kinases.

Review 7. The capable ABL: what is its biological function?

Review 8. Regulation of the c-Abl and Bcr-Abl tyrosine kinases.

Review 9. Role of ABL family kinases in cancer: from leukaemia to solid tumours.

Review 10. Determination of cell fate by c-Abl activation in the response to DNA damage.