The interpretation of high-dimensional structure-activity data sets in drug discovery to predict ligand-protein interaction landscapes is a challenging task. Here we present Drug Discovery Maps (DDM), a machine learning model that maps the activity profile of compounds across an entire protein family, as illustrated here for the kinase family. DDM is based on the t-distributed stochastic neighbor embedding (t-SNE) algorithm to generate a visualization of molecular and biological similarity. DDM maps chemical and target space and predicts the activities of novel kinase inhibitors across the kinome. The model was validated using independent data sets and in a prospective experimental setting, where DDM predicted new inhibitors for FMS-like tyrosine kinase 3 (FLT3), a therapeutic target for the treatment of acute myeloid leukemia. Compounds were resynthesized, yielding highly potent, cellularly active FLT3 inhibitors. Biochemical assays confirmed most of the predicted off-targets. DDM is further unique in that it is completely open-source and available as a ready-to-use executable to facilitate broad and easy adoption.
Chemical space is vast
and can only be explored to a small extent
with experimental methods to find suitable hits for drug discovery
programs.[1,2] The search for new chemical starting points
to modulate therapeutic targets is essential for the development of
novel drugs. It has been postulated that the best way to find a new
drug is to start with an old drug.[3] This
is in line with the central paradigm in medicinal chemistry that similar
structures exert similar biological activities.[4] Protein kinases are an important class of drug targets
because of their key role in intracellular signal transduction processes
involved in cancer, autoimmune diseases, and (neuro)inflammation.[5,6] The therapeutic value of the protein kinase family is demonstrated
by the 38 kinase inhibitors (KIs) currently approved by the FDA and
a plethora of molecules being tested in clinical trials for this enzyme
family.[7] It is anticipated that these clinically
approved KIs may serve as starting points to identify novel drug candidates
for other kinases.

Most KIs interact with a structurally and functionally conserved
ATP-binding site that is present in all 518 human protein kinases.
It is well established that KIs bind multiple members of the kinase
family and that this may affect their efficacy and toxicity.[8] Detailed investigation of the target interaction
landscape of KIs is therefore important to understand their molecular
mode of action and offers the opportunity to identify new starting
points for other therapeutically interesting kinases. Many complex,
high-dimensional data sets with structure–activity relationships
(SARs) of KIs over a broad selection of kinases have become available
(Table S1).[9−14] These empirical data sets may serve as guides to explore chemical
space around this drug target family and predict (off-)target activity
using advanced computational chemistry methods, such as quantitative
SAR (QSAR) models, the similarity ensemble approach (SEA), support
vector machines, k-nearest neighbor, random forest,
naïve Bayes, (deep learning) neural networks (NNs), and principal
component analysis (PCA).[15−18]

Advanced machine learning models promise to revolutionize the field
of drug discovery. Employing high-dimensional data sets, these models
are used to predict a wider range of biological activities for a compound
compared with traditional drug design methods (e.g., molecular modeling,
docking, and early QSAR models such as Hansch and Free–Wilson
analyses[19]). However, advanced machine
learning models are hampered in their applicability by a lack of clear
interpretation and a tendency to overfit high-dimensional data. Many
of the best-performing machine learning models are black boxes in
which it is unclear how the data are used to generate novel hypotheses.
They also require in-depth knowledge of advanced cheminformatics and
highly specialized or purpose-built software. These technical requirements
slow the implementation of the tools in the daily practice of drug
discovery and consequently prevent the research community from taking
full advantage of the wealth of data becoming available. Therefore,
there is a clear need for better tools to interpret and visualize
complex, high-dimensional SAR data sets in an easy and intuitive manner
and to predict the biological activity profiles of novel hits for
drug discovery programs. Here we present Drug Discovery Maps (DDM),
a machine learning tool that allows the visualization and prediction
of target–ligand interaction landscapes.
Results
t-SNE Maps the Molecular Similarity of Experimental Drugs in Chemical Space
On the basis of the principle that the chemical
structure of a compound determines its biological and chemical properties,
a machine learning algorithm that predicts target–ligand interaction
landscapes should be able to recognize molecular similarity between
different molecules. Traditionally, chemical similarity is measured
by the Tanimoto coefficient (Tc).[20] A molecular
fingerprint, which is a high-dimensional bit vector that captures
the presence or absence of chemical groups in a molecule, is used
by the Tc to calculate the similarity between compounds. As a similarity
metric the Tc has its limitations, predominantly because it averages
differences over all bits, thereby losing information.[21] Thus, we envisioned that the data contained
in the molecular fingerprint could be used more efficiently by a machine
learning algorithm to determine molecular similarity.

In recent
years, the t-distributed stochastic neighbor embedding
(t-SNE) algorithm has been shown to be a powerful tool to visualize
complex high-dimensional data sets in diverse experimental settings.[22−26] This state-of-the-art unsupervised machine learning technique is
especially powerful in preserving local data structures in high-dimensional
data. It can be readily applied to bit strings of any length and as
such is easily applicable to chemical structures represented by molecular
fingerprints. We aimed to use t-SNE at the core of our prediction
model, where the algorithm is used to find and cluster the most similar
molecules in a large data set and visualize that similarity clustering
in two-dimensional space.

We decided to apply the t-SNE algorithm to visualize the molecular
similarity of molecules from the Drug Repurposing Hub, an online repository
containing compounds that have been clinically tested in humans.[27] We selected only the launched drugs (2274) and
manually classified them into 27 chemotypes. Morgan fingerprints (RDKit,
4096 bits, radius = 2) were generated for each of these 2274 clinical
compounds using KNIME, an open-source software package.[28,29] The fingerprints were fed into the Python implementation of the
Barnes–Hut t-SNE algorithm to generate a map of the drug-like
chemical space.[30] The resulting map (Figure 1A) shows remarkable
colocalization of most of the chemotypes. As an example, the family
of penicillin-like structures at the far right of the plot (cyan)
is completely separated from all other chemical matter. Some unannotated
molecules (in gray) are visible in the cluster, but upon detailed
inspection they all constitute β-lactams in which the sulfur
is either substituted or omitted. In addition, many other highly dense
clusters are visible at the boundaries of the map, corresponding to
highly defined chemotypes such as the rapamycin, conazole, and oxytocin
analogues. It is noteworthy that even in the apparently less defined
center of the map, clear colocalization of similar molecules can be
observed, for example, a cluster of aspirin-like molecules (orange,
near the origin). Thus, t-SNE is able to map the chemical space of
approved drugs following a chemist’s intuition and recognizes
molecular similarity in a broad set of diverse drug-like molecules.
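As a point of reference for the similarity measure discussed in this section, the Tanimoto coefficient of two bit-vector fingerprints can be sketched in a few lines of Python. The bit positions below are hypothetical and do not correspond to real molecules, and the function name `tanimoto` is chosen for illustration only. The sketch is not part of the DDM code; it merely shows how the coefficient reduces two high-dimensional fingerprints to a single number.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical on-bit positions in a 4096-bit fingerprint (not real molecules).
fp_query = {12, 87, 301, 1024, 2048}
fp_close = {12, 87, 301, 1024, 3000}
fp_far = {5, 99, 2048}

print(tanimoto(fp_query, fp_close))  # 4 shared bits out of 6 distinct on-bits
print(tanimoto(fp_query, fp_far))    # 1 shared bit out of 7 distinct on-bits
```

Because the result is a single ratio, two pairs of molecules that differ in entirely different bits can receive the same score, which is the loss of information referred to above.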
Figure 1
t-SNE visualization of chemical space. (a) t-SNE embedding of the
“launched” drugs in the Drug Repurposing Hub. Embedding
is based on the 4096-bit Morgan fingerprint. t-SNE settings: perplexity
= 25, learning rate = 50, iterations = 10 000. Markers are
colored according to 27 manually attributed chemotypes. An animation
of the process of embedding is included in the supporting video. (b) t-SNE embedding of the Published Kinase
Inhibitor Set. Embedding is based on the 4096-bit Morgan fingerprint.
t-SNE settings: perplexity = 50, learning rate = 50, iterations =
10 000. Markers are colored according to 31 manually attributed
chemotypes.
Next, we wanted to test
whether t-SNE is still able to recognize
molecular similarity within a smaller set of drug-like molecules that
is more homogeneous and has higher molecular similarity. To this end,
we performed t-SNE-mediated clustering of the molecules from the Published
Kinase Inhibitor Set (PKIS).[31] The PKIS
is a 364-member library of molecules assembled by GSK that are all
classified as inhibitors of protein kinases. The PKIS represents 31
chemotypes, and their activities have been measured on 200 kinases.[13] The resulting map of chemical space representing
the KIs (Figure 1B)
again shows clear colocalization of specific chemotypes. A more in-depth
analysis (see the Supporting Information and Figure S1) confirms the initial visual
inspection and shows high statistical correlation between the autonomously
derived clustering and the human annotation. Of the 31 chemotypes
annotated, 23 were fully collected in one computationally assigned
cluster. For example, the orange and gold clusters on the left of
the map are completely isolated and comprise all of the compounds
of those chemotypes (Figure S1). This illustrates
how t-SNE is capable of recognizing and clustering molecular entities
in a highly specific manner and allows the visual inspection of high-dimensional
chemical structural data, or chemical space, in an easy and intuitive
way.
t-SNE Map of the Target Space of Kinases Recapitulates Phylogenetic Information
On the basis of the observation that binding
sites in closely related proteins bind similar endogenous molecules
and (experimental) drugs, we wanted to determine whether the t-SNE
algorithm is capable of clustering proteins on the basis of the chemical
similarity of their amino acids in the binding pocket. Conceptually,
this approach is analogous to proteochemometric modeling.[32] To this end, we chose the protein kinase family
as the drug target class because this is a large family of over 500
members that all use ATP in their active site and often show cross-reactivity
toward (experimental) drugs. To quantify the similarity of kinases,
we aligned the amino acid sequences of the whole kinase domains containing
the ATP-binding pocket and used a fingerprint based on physicochemical
properties of the amino acids.[33] The fingerprints
were used to create a two-dimensional map of the target space by the
t-SNE algorithm. The resulting map (Figure 2A) is striking, as it almost seamlessly recreates
the phylogenetic tree published by Manning et al. in 2002.[34] To assign the kinases to clusters, the coordinates
of the t-SNE embedding were fed into the unsupervised clustering algorithm
DBSCAN (see Supporting Information for
details).[35] All 10 assigned clusters were
significantly (P < 0.0001, hypergeometric test)
enriched for a specific kinase group as assigned by Manning et al.
(Figure 2A). Closer
inspection of some of the kinases unassigned by DBSCAN reveals that
they belong to distinct branches of the phylogenetic tree, corresponding
to their separation from the main clusters. As an example, the four
TK kinases at the far right of the embedding (burgundy) all belong
to the JAK family (JAK1, -2, and -3 and Tyk2) but only represent their
second kinase domain. The first kinase domain is more closely associated
with the rest of the TK group and lies just outside the DBSCAN-assigned
cluster. The close association of the second kinase domains with the
RGC cluster (colored brown) is especially striking, as these domains,
just like the RGC kinases, are considered to be pseudokinases. The
same holds true for MLKL, IRAK2, and IRAK3. Intriguingly, the IRAK
family of TKL kinases has four members, of which IRAK1 and IRAK4 are
catalytically active whereas IRAK2 and IRAK3 are not.[36] In the t-SNE embedding, the former are located in the major
TKL cluster (orange), whereas the latter are actually assigned to
the RGC-dominated cluster. MLKL has also been shown to lack catalytic
activity in at least one report.[37]
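The enrichment test applied to the DBSCAN clusters can be illustrated with a minimal sketch of a one-sided hypergeometric test. Only the total of 535 kinase domains is taken from this work; the group size (90), cluster size (60), and overlap (55) are hypothetical counts invented for the example, and the function name `hypergeom_enrichment_p` is an assumption rather than part of the DDM code.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are taken without replacement from a
    population of size `population` that contains `successes` group members."""
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(max(observed, 0), upper + 1)
    )
    return tail / comb(population, draws)


# Hypothetical cluster: 60 of the 535 kinase domains, 55 of which belong to a
# kinase group that has 90 members in total (counts invented for illustration).
p_cluster = hypergeom_enrichment_p(population=535, successes=90, draws=60, observed=55)
print(p_cluster < 1e-4)
```

A small p-value indicates that the cluster contains more members of the group than would be expected if the same number of domains were drawn at random.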
Figure 2
t-SNE visualization
of kinase domains reveals phylogenetic information.
(a) t-SNE embedding of physicochemical fingerprints of
535 human kinase domains. t-SNE settings: perplexity =
50, learning rate = 50, iterations = 25 000. Arbitrary t-SNE
coordinates are rotated to match the dendrogram orientation of Manning
et al.[34] Markers are colored according
to the 12 groups defined by Manning et al., and the background is
colored on the basis of the DBSCAN-generated clustering, colored by
the dominant kinase group in that cluster (blanks are unclustered
kinases). (b) Manning et al. manually curated kinome dendrogram overlaid
with circles colored according to the background coloring from the
t-SNE map in (a) based on the unsupervised DBSCAN clustering.[39]
Another interesting feature is the separation of a group
(left
of the plot) of TKL kinases from the major cluster. This subset features
all but one of the STKR family of cell-surface-bound receptor kinases.
Upon closer inspection, even the subfamilies of STKR1 and -2 are discernible.
Strikingly, the MISR2 (AMHR2) kinase receptor is located with kinases
categorized as “Other”. This receptor kinase has an
atypical DFG motif (DLG) and as such can indeed be classified as a
pseudokinase, although phosphorylation activity has experimentally
been shown.[38] The other members of the
STKR family do all share the conserved DFG motif. Finally, on the
lower side of the t-SNE plot, several AGC-colored kinases have been
clustered with the CAMK kinases. These actually represent the second
kinase domains of the RSK family, which were also attributed to the
CAMK group by Manning et al.[34]

In summary, this analysis of the target space of the binding site of
protein kinase domains assured us that this embedding is able to recognize
overall similarity but also detect subtle differences between the
different binding domains of most kinase inhibitors.
DDM Can Predict Target–Ligand Interaction Landscapes
On the basis
of chemical and target space maps of kinases and their
inhibitors, we envisioned that these could provide a workflow to predict
the activity of novel compounds for the entire kinome. We dubbed this
approach Drug Discovery Maps (DDM). The bioactivity data measured
by Elkins et al.[13] for the PKIS were used
as the training set, as the PKIS contains the most unique interactions
of all open data sets (Table S1). The optimization
of the workflow with all of the parameters is described in more detail
in the Supporting Information. The final
architecture of the algorithm is depicted in Figure 3 and illustrated for the EGFR inhibitor erlotinib.
At first, a t-SNE embedding is generated in which erlotinib is mapped
onto the chemical space of the PKIS (top left). This information is
used to find the nine most similar molecules (top right). Of these,
the inhibition data measured by Elkins et al. are averaged, and all
of the kinases above a threshold value C are considered
targets (bottom right). A view of the inhibition profiles for this process
is included in Figure S5. These kinases
are then looked up in the target space map (Figure 2), and the most similar kinases are appended
(bottom left) to yield the final prediction (center). As the molecular
t-SNE embedding is slightly stochastic, the described process is repeated
several times (R), and the number of times a kinase
is predicted is tracked. Our DDM model was validated using an independent
data set generated by Karaman et al.[9] The
resulting prediction statistics for each of the 38 compounds in this
test set are summarized in Table S2. The
average positive prediction value (PPV) was 40% with a Matthews correlation
coefficient (MCC) of 0.21. We compared these statistics with previously
published methods and found that DDM was better than QSAR models and
equal in performance to random-forest-based proteochemometric models
(Figure S2). A receiver operating characteristic
(ROC) analysis of the performance of DDM on this test set showed an
area under the curve (AUC) of 0.76 (Figure S3). Taken together, these results show that we have developed and
validated a novel machine learning model to predict kinome inhibitor
landscapes.
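The workflow described above can be summarized in a minimal Python sketch. It is not the published DDM implementation: the function names `predict_once` and `predict`, the data layout, the toy inhibition values, and the toy 2D coordinates are all assumptions made for illustration. In particular, the t-SNE embeddings are taken as given inputs (one per repeat) rather than computed, and every library compound is assumed to have a measurement for every kinase.

```python
from collections import Counter
from math import dist


def predict_once(query_xy, library_xy, inhibition, similar_kinases, k=9, cutoff=40.0):
    """One pass: nearest neighbours, mean inhibition, cutoff C, add similar kinases."""
    # The k library compounds closest to the query in the 2D chemical-space map.
    nearest = sorted(library_xy, key=lambda name: dist(query_xy, library_xy[name]))[:k]
    # Average the measured inhibition (%) per kinase over those neighbours.
    mean_inhibition = {
        kin: sum(inhibition[name][kin] for name in nearest) / len(nearest)
        for kin in inhibition[nearest[0]]
    }
    # Kinases above the cutoff C form the initial prediction.
    targets = {kin for kin, value in mean_inhibition.items() if value > cutoff}
    # Append the kinases that lie closest to each target in the target-space map.
    for kin in list(targets):
        targets.update(similar_kinases.get(kin, ()))
    return targets


def predict(runs, inhibition, similar_kinases, k=9, cutoff=40.0):
    """Repeat over R embeddings; return the fraction of runs predicting each kinase."""
    counts = Counter()
    for query_xy, library_xy in runs:
        counts.update(predict_once(query_xy, library_xy, inhibition, similar_kinases, k, cutoff))
    return {kin: n / len(runs) for kin, n in counts.items()}


# Toy data: four hypothetical library compounds with invented inhibition values (%).
inhibition = {
    "A": {"EGFR": 90, "FLT3": 10, "KIT": 30},
    "B": {"EGFR": 70, "FLT3": 20, "KIT": 60},
    "C": {"EGFR": 5, "FLT3": 80, "KIT": 70},
    "D": {"EGFR": 0, "FLT3": 95, "KIT": 50},
}
# Hypothetical neighbours in the target-space map.
similar_kinases = {"EGFR": ["ERBB2"], "FLT3": ["KIT"]}
# Two hypothetical embeddings (R = 2): query coordinates and library coordinates.
runs = [
    ((0.0, 0.0), {"A": (0.1, 0.0), "B": (0.0, 0.2), "C": (5.0, 5.0), "D": (6.0, 5.0)}),
    ((1.0, 1.0), {"A": (1.0, 1.1), "B": (4.0, 4.0), "C": (1.2, 1.0), "D": (5.0, 5.0)}),
]

result = predict(runs, inhibition, similar_kinases, k=2, cutoff=40.0)
print(sorted(result.items()))
```

In this toy case FLT3 is predicted in only one of the two runs, so filtering on the fraction of runs (for example, requiring more than half) would decide whether it is reported as a target.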
Figure 3
Schematic overview of the DDM workflow. In this example, the targets
of erlotinib are predicted. On the basis of a t-SNE embedding (top
left), the PKIS inhibitors nearest to erlotinib are found (top right).
For these, the inhibition data as measured by Elkins et al.[13] are averaged and used as an initial prediction
(bottom right). These targeted kinases are then looked up in the t-SNE embedding (bottom left), where the most similar kinases
are added to yield the final prediction (center).
Discovery of Novel FLT3 Inhibitors Using DDM
To investigate
the utility of the model in early drug development, it was applied
for the identification of new inhibitors for FMS-like tyrosine kinase
3 (FLT3). FLT3 is implicated in acute myeloid leukemia, where approximately
30% of patients carry an internal tandem duplication (ITD) in their
FLT3 gene that activates the kinase and acts as a driver mutation.[40] Recently, midostaurin has been approved by the
FDA for the treatment of acute myeloid leukemia (AML) patients, and
several other inhibitors are currently being tested in clinical trials.
However, fast adaptive mutations in the FLT3 gene quickly result in
drug-induced resistance of the AML, warranting the search for novel
chemotypes to inhibit this kinase. To this end, the DDM model was
used to predict the kinome–ligand interaction landscape of
a small kinase-focused library of 1152 molecules. They were analyzed
using various values for the activity cutoff C and
were ultimately filtered with C = 40% and a prediction
count of at least nine out of 10 runs in order to have a balanced
number of molecules to be tested. These stringent cutoffs yielded
a set of 44 compounds predicted to be active at FLT3.

To validate
our virtual DDM screen, we performed a time-resolved fluorescence
resonance energy transfer (FRET)-based biochemical assay with all
1152 compounds against FLT3 at an initial concentration of 10 μM.
This screen yielded 184 actives with >50% loss of activity (16% of
all compounds). For these compounds, the pEC50 values were
measured, resulting in 135 compounds with pEC50 > 5, with
a mean of 6.7 ± 0.9. Eighteen of the 184 actives were also
identified by our DDM screen, which corresponds to a PPV (or hit rate)
of 41% (18 of 44; Figure 4A, P < 0.0001, hypergeometric test), roughly 2.6-fold
higher than the 16% hit rate of the biochemical assay. Interestingly,
15 of the predicted compounds demonstrated EC50 values
of <2 μM (34%, P < 0.0001 (hypergeometric
test)) with an average pEC50 of 7.3 ± 1.1; this group
included the most active compound found in the screen, crenolanib
(pEC50 = 9.0). The hit rate was nearly identical to the
validation statistics for the test set (Figure S2), where an overall PPV of 40% was achieved. The same holds
for the negative predictive value (89%) and the sensitivity (11%).
The successful application of our model for the FLT3 screen may partially
be attributed to the high coverage for the TK family of kinases. It
should be noted that the relatively low sensitivity (11%) is a balanced
choice between minimizing the number of compounds to screen and finding
more actual hits. This can easily be tuned by varying the cutoff parameter.
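The statistics quoted in this section follow from a standard confusion matrix, as the sketch below illustrates. The function name `classification_metrics` and the first set of counts are hypothetical and serve only to show the formulas; the final lines use the counts stated in the text (18 confirmed actives among 44 predicted compounds, and 184 actives among 1152 screened compounds).

```python
from math import sqrt


def classification_metrics(tp, fp, fn, tn):
    """PPV, NPV, sensitivity, and Matthews correlation coefficient from counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical confusion matrix, invented to illustrate the formulas.
toy = classification_metrics(tp=40, fp=60, fn=160, tn=740)
print(toy)

# Counts stated in the text for the FLT3 screen.
screen_ppv = 18 / 44      # confirmed actives among DDM-predicted compounds
base_rate = 184 / 1152    # actives among all screened compounds
print(round(screen_ppv, 2), round(base_rate, 2), round(screen_ppv / base_rate, 1))
# 0.41 0.16 2.6
```

Lowering the cutoff C would be expected to raise the sensitivity at the cost of a lower PPV, which is the trade-off described above.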
Figure 4
Discovery of novel FLT3 inhibitors using DDM. (a) Scatter plot
of all compounds and their inhibitory effects at 10 μM as measured
in the high-throughput screen. DDM-predicted molecules are marked
red. (b) Structures and syntheses of the two compounds resynthesized
and tested in situ against MV4:11 cells. Reagents and conditions:
(i) cyanamide, nitric acid, ethanol, 78 °C, 76%; (ii) dimethylformamide
diethyl acetal, toluene, 80 °C, 80%; (iii) K2CO3, ethanol, 78 °C, 31%; (iv) 4-aminophenol, NaOH, DMSO,
100 °C, 65%; (v) triphosgene, DCM, 40 °C; (vi) 1,4-dioxane,
110 °C, 44% over two steps. (c) Dose–response curves for
compounds 1 and 2 against recombinant FLT3
in a FRET-based activity assay. Markers denote mean ± SD (N = 4). Dotted lines denote the 95% confidence intervals
of the EC50 fits. (d) Dose–response curves of compounds 1 and 2 against MV4:11 leukemia cells. Markers
denote mean ± SD (N = 3). Dotted lines denote
the 95% confidence intervals of the EC50 fits. (e) Docking
poses of 1 and 2 in the 3D models of FLT3
and the corresponding 2D interaction plots.
Two of the predicted compounds, 1 and 2 (Figure 4B), were
selected on the basis of their chemical properties, novelty regarding
FLT3 inhibition, and predicted interaction profiles (vide infra).
These compounds were resynthesized using established methods (see Figure 4B and the Supporting Information). The activity of the
compounds was confirmed in a FRET assay using recombinant human FLT3
(Figure 4C). Compounds 1 and 2 showed concentration-dependent activity
with pEC50 values of 7.3 ± 0.1 and 8.8 ± 0.1,
respectively. To determine the cellular activities of these two compounds,
a cell proliferation assay using the FLT3-dependent AML cell line
MV4:11 was performed. Both 1 and 2 showed
clear cellular activity with pEC50 values of 6.3 ±
0.1 and 8.5 ± 0.1, respectively (Figure 4D). In summary, the experimental validation
of the hits illustrates the power of our DDM workflow for compound
selection in the lab.

Finally, to explain the potential binding
mode of compounds 1 and 2, these compounds were docked using a
DFG-in model for 1 and a DFG-out structure (PDB entry 4RT7) for 2 (Figure 4E). Compound 1 binds to the hinge region with the aminopyrimidine moiety
in a fashion typical for type 1 kinase inhibitors. Compound 2 binds in the DFG-out conformation much like RIPK2 (PDB entry 5AR7) by forming hydrogen
bonds to the DFG motif using the urea functionality and to the hinge
region using the pyridine nitrogen.[41]
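For readers less familiar with the pEC50 scale used in this section, the following sketch converts between pEC50 and molar EC50 using the standard definition pEC50 = −log10(EC50 in mol/L). The helper names are chosen for illustration; the input values are the biochemical pEC50 values of compounds 1 and 2 and the 2 μM threshold quoted above.

```python
from math import log10


def pec50_to_molar(pec50):
    """EC50 in mol/L for a given pEC50 (pEC50 = -log10 of EC50 in mol/L)."""
    return 10.0 ** (-pec50)


def molar_to_pec50(ec50_molar):
    """pEC50 for a given EC50 in mol/L."""
    return -log10(ec50_molar)


# Biochemical pEC50 values reported for compounds 1 and 2, converted to nM.
for name, pec50 in (("1", 7.3), ("2", 8.8)):
    print(name, round(pec50_to_molar(pec50) * 1e9, 1), "nM")  # about 50 nM and about 1.6 nM

# The 2 uM threshold expressed on the pEC50 scale.
print(round(molar_to_pec50(2e-6), 2))  # about 5.7
```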
Kinome Activity Spectrum Prediction Using DDM
To reduce
potential toxic side effects, kinase cross-reactivity is ideally minimized.
DDM enables rapid assessment of the predicted cross-reactivity because
by default DDM predicts the interactions with the entire kinome. Thus
far, however, only the FLT3 prediction has been taken into account.
As final validation, we tested the activities of the two inhibitors
on the predicted off-targets in biochemical assays. In addition to
FLT3, compounds 1 and 2 were predicted to
be active against 35 and 33 kinases, respectively (C = 40%, R > 0.5). The off-targets were validated
using KinaseProfiler by Eurofins at 10 μM. The inhibition data
per compound are shown in Table S3. For
compound 1 the predictions were 69% accurate (24 of the
35 off-targets confirmed (<50% remaining activity) with two additional
off-targets in the low 50% residual activity range). For compound 2 the prediction was exceedingly accurate, as 26 of the 33
targets (79%) were indeed inhibited >50%. To conclude, DDM was able
to predict the kinome–inhibitor interaction landscape with
a relatively high accuracy.
Discussion
Drug discovery is still largely an empirical process that is challenging
and time-consuming.[42] Multiparameter
optimization of chemical structures, which is needed to balance the
activity and selectivity of a drug candidate, requires the understanding
of high-dimensional data sets. Machine learning algorithms have been
employed to analyze and predict compound activity using large data
sets with varying success.[15−17] Some of the major drawbacks of
most computational models are the complexity of the algorithm and
the “black box” nature of the systems. Implementation
and interpretation of such systems is not trivial, and consequently,
they have not been widely adopted by the drug discovery community.

Here we present DDM, which is an intuitive, data-driven (bio)molecule
similarity clustering procedure using state-of-the-art machine learning
techniques. The model is based on the t-distributed stochastic neighbor
embedding (t-SNE) algorithm to generate a visualization of molecular
similarity in two dimensions.[43,44] Color is used as a
third dimension to interactively visualize the biological activity
or compound class (chemotype). DDM combines two different maps. The
first map depicts the chemical space, in which compounds are clustered
on the basis of their molecular similarity, whereas in the second
map protein targets are clustered on the basis of the chemical similarity
of the amino acids making up the kinase domain. By combining the two
maps, DDM is able to predict bioactivities of small molecules across
a protein family. We applied DDM to visualize the chemical space of
currently available drugs, the published kinase inhibitor set (PKIS)
and the target space of the protein kinase family (kinome). DDM was
able to predict the kinome activity profile of another independent
set of kinase inhibitors with comparable or better scores than the
currently available machine learning techniques. We applied DDM to
identify new hits for the oncogene FMS-like tyrosine kinase 3 (FLT3),
a validated therapeutic target for the treatment of acute myeloid
leukemia.[45] The hits were resynthesized,
and their biological activities were validated in biochemical and
cellular assays. Finally, the off-target profiles of the hits as predicted
by DDM were validated in a panel of kinase assays.

Although our model performs as well as or better than the current
computational drug discovery tools, it is envisioned that our model
can be further improved when more comprehensive data sets become available
in the public domain. In the PKIS training set, 364 inhibitors were
tested at only two concentrations on approximately 200 unique wild-type
kinases. A more expansive data set of a broader set of more diverse
compounds tested on a larger number of kinases in a concentration–response
fashion would inherently improve the predictions generated over the
entire kinome.

Direct knowledge of the off-targets of these
compounds enables prioritization in medicinal chemistry efforts, as
demonstrated by the KinaseProfiler screen of predicted off-targets.
This allows medicinal chemists to rank scaffolds on the basis of acceptable
off-targets, which in turn depends on biological questions or medical
indications. The information obtained from the docking poses of these
molecules can also be used for structure-based design, directly incorporating
the knowledge derived from the clinically relevant mutations into
the hit-optimization project.

The DDM concept presented here can easily be adapted to work with
any data set available. Because all data, algorithms, and data processing
tools used are in the public domain or open-source, it is highly adaptable
and extensible. Possible extensions include other druggable protein
classes, such as G-protein-coupled receptors, ion channels, or nuclear
hormone receptors, or training on a different type of molecular data
altogether, e.g., solubility, membrane permeability, metabolic stability,
pharmacokinetics, or toxicological data.

To aid in the implementation of our tool as it is presented here,
a Python-based executable including a graphical user interface (Figure 5) has been made available
online via GitHub.[46] The unpackaged Python
script with a list of dependencies is also available. Also included
is a fully annotated KNIME workflow to allow step-by-step execution
and analysis. This set of tools should enable the integration of this
data-driven approach into any project without the need for up-front
investment.
Figure 5
Graphical user interface (left) and generated output (right) of
the Python implementation of the DDM algorithm presented here. Only
a SMILES string is required as input, and the output is provided as
depicted on the right. The packaged executable as well as the original
Python script have been made available online.[46]
To conclude, the machine learning
algorithm Barnes–Hut t-SNE
was successfully implemented in a drug discovery setting to predict
ligand–protein interaction landscapes. The concept of DDM is
applicable to a multitude of drug discovery challenges, which, given
the proper data set, can be used to design a small molecule with a
balanced set of physicochemical and biological properties as required
for drug candidates. It is envisioned that DDM may make the drug discovery
process more efficient.