Srilok Srinivasan1, Rohit Batra1, Henry Chan1,2, Ganesh Kamath3, Mathew J Cherukara1,4, Subramanian K R S Sankaranarayanan1,2. 1. Center for Nanoscale Materials, Argonne National Laboratory, Lemont, Illinois 60439, United States. 2. Department of Mechanical and Industrial Engineering, University of Illinois, Chicago, Illinois 60607, United States. 3. Dalzielfiver LLC, 3500 Carlfield Street, El Sobrante, California 94803, United States. 4. Advanced Photon Source, Argonne National Laboratory, Lemont, Illinois 60439, United States.
Abstract
An extensive search for active therapeutic agents against the SARS-CoV-2 is being conducted across the globe. While computational docking simulations remain a popular method of choice for the in silico ligand design and high-throughput screening of therapeutic agents, it is severely limited in the discovery of new candidate ligands owing to the high computational cost and vast chemical space. Here, we present a de novo molecular design strategy that leverages artificial intelligence (AI) to discover new therapeutic agents against SARS-CoV-2. A Monte Carlo tree search algorithm combined with a multitask neural network surrogate model for expensive docking simulations, and recurrent neural networks for rollouts, is used in an iterative search and retrain strategy. Using Vina scores as the target objective to measure binding to either the isolated spike protein (S-protein) at its host receptor region or to the S-protein/angiotensin converting enzyme 2 receptor interface, we generate several (∼100's) new therapeutic agents that outperform Food and Drug Administration (FDA) (∼1000's) and non-FDA molecules (∼million). Our AI strategy is broadly applicable for accelerated design and discovery of chemical molecules with any user-desired functionality.
An extensive search for active therapeutic agents against the SARS-CoV-2 is being conducted across the globe. While computational docking simulations remain a popular method of choice for the in silico ligand design and high-throughput screening of therapeutic agents, it is severely limited in the discovery of new candidate ligands owing to the high computational cost and vast chemical space. Here, we present a de novo molecular design strategy that leverages artificial intelligence (AI) to discover new therapeutic agents against SARS-CoV-2. A Monte Carlo tree search algorithm combined with a multitask neural network surrogate model for expensive docking simulations, and recurrent neural networks for rollouts, is used in an iterative search and retrain strategy. Using Vina scores as the target objective to measure binding to either the isolated spike protein (S-protein) at its host receptor region or to the S-protein/angiotensin converting enzyme 2 receptor interface, we generate several (∼100's) new therapeutic agents that outperform Food and Drug Administration (FDA) (∼1000's) and non-FDA molecules (∼million). Our AI strategy is broadly applicable for accelerated design and discovery of chemical molecules with any user-desired functionality.
In
view of the unprecedented human and economic loss resulting
from the COVID-19 pandemic, there is an extensive amount of research
being conducted across the globe to design effective therapeutic agents
and vaccines against SARS-CoV-2, the virus causing the disease. Computational
docking simulations have traditionally been used for in silico ligand design and remain a popular method of choice for high-throughput
screening of therapeutic agents in the fight against COVID-19.[1] However, given the relatively high computing
cost of docking simulations, one can typically only screen a few 1000
molecules even with the use of leadership computing resources (for
instance, summit at ORNL was used by Smith and Smith[1] for their ensemble docking calculations). Despite the vast
chemical space[2] (millions to billions of
molecules) that can be potentially explored, we remain severely limited
in the search of new candidate compounds owing to the high computational
cost of the ensemble docking simulations employed in traditional in silico approaches.[1,3]A crucial first
step toward the development of new therapeutic
ligands is the identification and characterization of new candidate
compounds that are synthetically feasible and effective for the proposed
application. Despite advances in high-throughput screening, only a
small fraction of the enormous possibilities of organic compounds
have been explored. More recently, advances in artificial intelligence
(AI) have accelerated this exploration by building cheaper surrogate
models, that is, machine learning (ML) models (instead of expensive
docking simulations) that are trained against a reasonably elaborate
set of training data set derived from ensemble docking simulations.[4] A variety of AI methods have been applied to
the problem of generating and evaluating new molecules including reinforcement
learning,[5,6] Bayesian molecular design,[7] and generative models, such as variational autoencoders[8,9] and recurrent networks.[5,10,11] We have shown in a recent work[4] that
a machine learnt model can be used to quickly and effectively screen
active agents, from an existing database, for their potential effectiveness
against SARS-CoV-2 based on binding affinity to either the isolated
S-protein at its host receptor region or to the S-protein/angiotensin
converting enzyme 2 (ACE2) interface complex, thereby potentially
limiting and/or disrupting the host–virus interactions. Other
strategies for repurposing existing therapeutics based on the principle
of “neighborhood behavior” in the chemical space,[12−14] docking simulations,[15,16] quantum mechanical scoring,[17] and sequence comparison statistics[18] were also recently reported. While these examples
are effective in screening and repurposing the molecules within an existing database, we are still limited
within the known chemical space.Here, we demonstrate that one
can take full advantage of the recent
advances in AI and, in particular, deep learning methods [multi-task
neural network (MTNN) and recurrent neural network (RNN)] to efficiently
explore the elaborate chemical phase space. We draw inspiration from
the success of AI algorithms such as Monte Carlo tree search (MCTS),
which have proven to be extremely effective in computer games (Go
Game and several others)[19] against human
competitors. Our AI-based de novo design strategy
allows us to rapidly explore the chemical space and identify dozens
of new candidates that are both physically viable and outperform existing
Food and Drug Administration (FDA) and non-FDA molecules in terms
of their Vina scores obtained from the docking simulations. Although
our workflow here is demonstrated in the context of accelerated discovery
of therapeutic agents against SARS-CoV-2, the approach is much broader
and can be applied generally for the inverse design of chemical molecules
with user-desired properties.
Computational Methods
A schematic of our workflow is shown in Figure . Our de novo molecular
design is made possible by deploying a cheap surrogate MTNN model
in place of docking simulations and exploring the exhaustive SMILES
chemical space of millions of prospective molecules using AI algorithms
such as MCTS with rollout(s) using RNN. We represent the SMILES strings
of the explored molecules as a tree with each character representing
a node, so that, any path from the head node to the terminal nodes
represents a unique SMILES string. The goal of MCTS is to discover
new ligands that strongly bind to the S-protein of the SARS-CoV-2
virus and thereby shield it from the human receptors. The strength
of S-protein binding is quantified by the binding Vina scores as defined
by Smith and Smith.[1] A MTNN model, trained
to predict the binding Vina scores for a given SMILES string, is used
to evaluate the merit of the nodes. The accuracy of the MTNN model,
particularly for molecules with strong binding, is improved using
transfer learning strategies. The different components of our workflow
are discussed in more detail below.
Figure 1
(a) Schematic illustration of the MCTS
algorithm. The head node
contains a dummy character &. A RNN model is used to complete
the partial SMILES string, obtained by traversing the tree from the
head node to the leaf node selected. S-protein binding Vina scores
for the valid SMILES strings are predicted using a MTNN model described
in Figure , (b) Venn
diagram representation of phase space of all the possible SMILES,
valid SMILES, and the ligands with high binding energy.
(a) Schematic illustration of the MCTS
algorithm. The head node
contains a dummy character &. A RNN model is used to complete
the partial SMILES string, obtained by traversing the tree from the
head node to the leaf node selected. S-protein binding Vina scores
for the valid SMILES strings are predicted using a MTNN model described
in Figure , (b) Venn
diagram representation of phase space of all the possible SMILES,
valid SMILES, and the ligands with high binding energy.
Figure 2
(a) MTNN architecture to predict Vina
scores of different molecules.
(b) Parity plot of the S-protein and interface ML models for the training
and the test sets, demonstrating the good prediction accuracy achieved
by both the ML models. RMSE: root mean square error, MAE: mean absolute
error, and R2: correlation coefficient.
Monte Carlo Tree Search
We define
our search domain as the space of all the possible combinations of
“building blocks” of SMILES corresponding to the atoms,
number of bonds, rings, and branching. The “building blocks”
are identified as a list of unique and indivisible chemical units
gathered from SMILES strings within binding DB[20] and ZINC[21] database. Our search
space of SMILES strings contains 174 unique building blocks (see Supporting Information S4). We use the MCTS algorithm
within the ChemTS[22] python library for
molecular generation to sample from this search space. Each node of
the MCTS tree contains a “building block” of the SMILES
string and a upper confidence bound (UCB) score defined aswhere v is the number of visits to the node i. The
constant C determines the balance between the exploration
and exploitation of the tree. The merit of a node, w, is calculated asThe head
node and the terminal node
contain dummy characters “&” and “$”,
respectively. Each branch of the tree corresponds to a unique SMILES
string representation. The complete SMILES string is built by traversing
the tree from the head node to the terminal nodes. Figure a illustrates the selection,
expansion, playout, and update scores steps of the MCTS. ChemTS[22] uses a RNN model during the playouts to sequentially
determine the subsequent nodes until the terminal node is reached.
For example, if the playout is performed at a node depth of d, the RNN uses a partial SMILES string Spartial = s1, ..., s, obtained by traversing the
path from the head node, to compute a probability distribution for
the subsequent building block s. The partial SMILES string is then extended by sampling
from the probability distribution and a new probability distribution
for s is computed
using the extended partial SMILES string Spartial = s1, ..., s, s. The complete SMILES string is constructed by iteratively extending
until the terminal symbol is obtained. At the end of the playout,
we obtain a list of the complete SMILES string. We eliminate the ones
which do not satisfy the SMILES grammar before computing the binding
energies using the MTNN model. The node scores are computed aswhere S is the complete SMILES string after
playout, BVS(S) is the
S-protein binding Vina score
predicted by the MTNN, LP(S) is a penalty term for large log P values
computed using eq ,
and RP(S) is a penalty
term for a large number of rings computed using eq .Here, ⟨log P⟩
and NRavg are the mean log P and the number
of rings of the molecules in the BindingDB[20] and ZINC database.[21] NRstd is the standard deviation of the number of rings.The final
step of the MCTS cycle involves updating the UCB scores
of all the nodes in the tree. The MCTS cycle of selection, expansion,
playouts, and updating the scores is continued until a desired stopping
criteria is met.
RNN for SMILES
The RNN maps an input
string s1, ..., s to a probability distribution P(y = j) = g(h), where h = f(h, x), g is a softmax activation
function and x is a
one-hot coded vector of length 175 representing the input s. The architecture of the
RNN is the same as the one used in the original ChemTS[22] framework where two gated recurrent units with
a 256 dimensional hidden state are stacked on top of each other and
represents the function f.The training data
set consist of the 250k molecules from ZINC database[21] used in original ChemTS and an additional 800k molecules
from BindingDB database.[20] The training
of the RNN is performed by minimizing the following loss functionwhere D is the relative entropy, x is the ith SMILES string in the training
data, and θ represents the
network parameters. The training is performed using a ADAM[23] optimizer with a batch size of 512.
MTNN Model for Binding Energy
A MTNN
model was used as a surrogate to quickly predict Vina scores of the
sampled candidate SMILES strings during the MCTS search. The architecture
of the MTNN model is shown in Figure a. The input layer
consists of 358-dimensional molecular descriptor, followed by a series
of dense layers (with/without dropout) diverging into two branches,
one for the isolated S-protein and the other for the interface Vina
score predictions. The molecular descriptor was obtained based on
our past experience on fingerprinting organic materials, including
polymers.[24] Our fingerprinting scheme uses
the SMILES representation of the molecule to capture its chemical
information at multiple length scales, that is, atomic, block-level,
and at a slightly higher morphological level. More details on the
descriptor definition can be obtained elsewhere.[24] Superior performance of multitask networks over their single-task
counterparts for several different classes of supervised learning
problems motivates the use of such a network architecture here.[25] The MTNN model, as implemented in TensorFlow,[26] was trained using the data set from Smith and
Smith,[1] which consists of Vina scores for
the S-protein and the S-protein/ACE2 interface systems for nearly
8000 molecules computed using docking simulations. The loss function
consisted of the sum of the mean absolute errors (MAEs) of the S-protein
and the interface scores, while the root mean square error and correlation
coefficient (R2) were computed to measure
model performance. To avoid overfitting, the MTNN model used dropout
layers. Further, the number of training epochs was determined by tracking
the loss function on the validation set (20% of the Smith dataset).
The ADAM optimizer[23] was used to train
the model. We note that the Vina scores for many molecules within
the Smith data set were reported to be extremely high (as much as
1,000,000 kcal/mol), while those with favorable binding energetics
have Vina scores roughly between −7 and 0 kcal/mol. In order
to remove such skewness in the data and train the MTNN model geared
toward identifying favorable molecules with lower Vina scores, data
points with scores greater than 5 kcal/mol were excluded during the
training (by masking their contribution to the cost function). Further
for a few cases, the SMILES representation could not be resolved correctly
and were filtered out. Overall, the MTNN model when trained on 80%
of the Smith data set, resulted in MAEs of 0.21 and 0.81 kcal/mol
in Vina scores of the S-protein and the interface systems. Parity
plots demonstrating the learning trends for both the training and
the validation sets are included in the Supporting Information (Figure S1).(a) MTNN architecture to predict Vina
scores of different molecules.
(b) Parity plot of the S-protein and interface ML models for the training
and the test sets, demonstrating the good prediction accuracy achieved
by both the ML models. RMSE: root mean square error, MAE: mean absolute
error, and R2: correlation coefficient.
Transfer Learning to Improve
Network Performance
While good overall accuracy was achieved
using the MTNN model when
trained on the Smith data set, the model performance for extreme cases
that are more relevant to this work, that is, ligands with very low
S-protein or interface Vina scores, was low. This is a known issue
with surrogate models, whose loss function is based on the overall
data distribution, rather than a specific/extreme part of it. In order
to improve the MTNN performance to reliably identify cases with extremely
low Vina scores, we retrained our model by including the newly discovered
molecules in the training data using a transfer learning approach.
The Vina scores from the docking simulations of 177 screened candidates
from MCTS search were combined with 510 randomly selected cases from
the Smith data set to obtain the “retraining” data set.
With the MTNN model weights initialized from the previous training
(using Smith dataset), the model was retrained to reduce error on
the new retraining data set. Again, error on the validation set (80%
of the retraining data set) was used to track the number of epochs,
with the results for the retrained MTNN model on the training and
the validation sets, as shown in Figure b. Overall MAE of 0.70 and 0.27 kcal/mol,
respectively, for the interface and the S-protein Vina scores on the
validation set denotes the good accuracy achieved by the MTNN model,
which is comparable to the 1 kcal/mol accuracy expected from the reference
docking simulations. The MTNN architecture was kept fixed during the
retraining step. Extending the training data beyond the distribution
of initial data set allows the MTNN model to learn the correlation
between the SMILES strings and Vina scores over a larger domain of
SMILES space.
Results and Discussion
Using the MCTS–MTNN-based workflow described above, we search
for molecules with low S-protein binding energies. Our workflow discovered
many new molecules, not present in the existing databases and have
a high affinity toward the S-protein of the virus. In total, we identified
97,973 unique molecules and their performance, as predicted by the
surrogate MTNN model, are shown in the Supporting Information (Figure S4). A large fraction of these newly discovered
molecules perform better, that is, have better Vina scores compared
to top FDA molecules sampled in ref (27).In order to validate the MTNN predictions
on the sampled space
of SMILES strings, we perform docking simulations on the molecules
whose S-protein and interface Vina scores are both less than −5.8
kcal/mol. We chose this criteria so that the number of more expensive
docking simulations are kept in the order of few hundred while the
molecules with high binding affinities, as predicted by MTNN, are
also included in the docking simulation set. We perform docking calculations,
using the AutoDock Vina software.[28] Similar
to the work of Smith and Smith,[1] we estimate
the binding affinities of molecular ligands in the binding pocket
region of the SARS-CoV-2S-protein/ACE2 complex based on the Vina
score, a hybrid empirical and knowledge-based scoring function that
ranks molecular conformations and estimates the free energy of binding
based on intermolecular contributions, for example, steric, hydrophobic,
hydrogen bonding, and so forth.[28] As such,
the Vina score can be used as a proxy that correlates to the binding
affinity between molecular ligands and the SARS-CoV-2S-protein/ACE2
receptor. The structure of the ACE2 receptor has a protein data bank
ID PDB: 2AJF whereas the structure of the SARS-CoV-2S-protein has a NCBI Reference
Sequence YP_009724390.1, which has the necessary mutations from
its predecessor SARS variety SARS-CoV, namely, at L(455), F(486),
Q(493), S(494), and N(501), respectively. A total of 12 docking receptors
that include six conformations of the SARS-CoV-2S-protein/ACE2 interface
and the corresponding isolated SARS-CoV-2S-protein receptors, that
is, with the ACE2 receptor removed, are used in our docking calculations.
These docking receptor conformations are obtained from Smith and Smith,[1] which were originally sampled using root mean
squared displacement based clustering from 1.3 microsecond long all-atom
temperature replica exchange GROMACS simulations of the SARS-CoV-2S-protein/ACE2 complex in water. Details regarding the construction
and modeling of the complex are described here.[1] The structure of the docking ligands is converted from
SMILES string using Open Babel software.[29] As in the work of Smith and Smith,[1] we
define a 1.2 nm × 1.2 nm × 1.2 nm search space in our docking
calculation setup, which encompasses the binding pocket located at
the S-protein/ACE2 interface. The same search space is explored for
the isolated SARS-CoV-2S-protein receptor cases. In our docking procedure,
we find the top 10 optimized docking configurations for each molecule
candidate and pick the best-scoring configuration as the predicted
Vina score.The Vina scores computed using docking simulations
for the 177
molecules that satisfies the above criteria are compared against the
FDA approved drugs and BindingDB data set, as shown in Figure . Our strategy of using a surrogate
MTNN model to quickly screen and steer the MCTS search algorithm toward
a promising region in the SMILES string space has successfully identified
molecules with a better binding affinity than the existing ones (Figure ). However, we note
that the MAE between the MTNN model predictions and the docking simulations
are 1.19 and 2.38 kcal/mol for the S-protein and interface Vina scores,
respectively (see Supporting Information, Figure S2). Such high errors in the MTNN predictions are expected
because the sampled molecules are not representative of the data distribution
that was used to train the model but instead lie in a region with
extreme Vina scores. In other words, the MTNN model is being perhaps
utilized in the extrapolative regime for these new molecules, and
its accuracy can be improved using active learning (or retraining).
Figure 3
Vina scores
for the isolated S-protein and S-protein/ACE2 interface
using docking calculations. The 170 candidates selected after the
first round of RNN + NN models are shown in green, with a few superior
candidates displaying low Vina scores. For comparison, previously
considered candidates from past works (ref a,[1] ref b[4]) are also included.
Vina scores
for the isolated S-protein and S-protein/ACE2 interface
using docking calculations. The 170 candidates selected after the
first round of RNN + NN models are shown in green, with a few superior
candidates displaying low Vina scores. For comparison, previously
considered candidates from past works (ref a,[1] ref b[4]) are also included.To improve the MTNN performance and thereby the efficiency
of MCTS
search, we retrain the surrogate MTNN model by including the newly
discovered molecules in the training data. We use a transfer learning
approach as detailed in Section . Extending the training data beyond the distribution
of initial data set allows the MTNN model to learn correlation between
the SMILES strings and Vina scores over a larger domain of SMILES
space. We next repeat the MCTS search with the improved MTNN model
and now identify a significantly larger number of unique molecules
(∼300,000) compared to the previous search. The distribution
of the MTNN Vina scores for these molecules are shown in Supporting Information (Figure S5). The best
molecules were selected by sorting the points by its distance from
the line S-proteinVina + interfaceVina ≤
−7.5 kcal/mol, which is the same screening criteria used in
ref (4), and selecting
the 200 farthest molecules. The Vina scores computed from the docking
simulation for these 200 best candidates, sampled using the retrained
MTNN model, are compared with the molecules from the first round of
MCTS–MTNN as well as FDA approved drugs from the CureFFI database[30] and a dataset of common active ingredients from
DrugCentral[31] (see Figure ). The MAE in the MTNN predictions are 0.38
and 1.28 kcal/mol for the S-protein and interface Vina scores, respectively,
which is less than the original MTNN model. Sampling with an improved
MTNN model which can correctly bias the MCTS algorithm to search the
promising regions of the SMILES string space thus enables the design
of better performing ligands (purple markers in Figure ).
Figure 4
Vina scores for the isolated S-protein and S-protein/ACE2
interface
using docking calculations. The 200 candidates selected after the
second round of RNN + MTNN models are shown in green, with a few superior
candidates displaying low Vina scores. For comparison, previously
considered candidates from past works (ref a,[1] ref b[4]) are also included.
Vina scores for the isolated S-protein and S-protein/ACE2
interface
using docking calculations. The 200 candidates selected after the
second round of RNN + MTNN models are shown in green, with a few superior
candidates displaying low Vina scores. For comparison, previously
considered candidates from past works (ref a,[1] ref b[4]) are also included.We note that by extending the data set to include new on-the-fly
sampled data and performing the transfer learning, the MTNN model
has learnt better the correlation between SMILES string and binding
energies. In principle, we can iteratively improve the MTNN model
by including diverse molecules sampled by MCTS during transfer learning.
The advantage of iterative sampling during any given search cycle
is twofold: (i) MCTS designed molecules to perform better with each
iteration and (ii) surrogate MTNN model to predict the binding energies
get progressively better with each iteration because it learns the
correlation over a larger region in the search space. Furthermore,
using our MTNN model, we evaluate the performance of some of the commonly
used drugs for the treatment of COVID-19 (see Supporting Information, Table S2). Our model successfully
predicts the strong binding nature (low Vina scores) of these candidate
molecules as revealed by AutoDocking simulations.Although the
goal in this work is to discover molecules with high
binding affinities toward the S-protein of the SARS-CoV-2 virus, one
can also utilize other metrics related to the druglikeness of the
compounds to further determine the viability of the drug. Such analysis
are particularly valuable with regard to recent reports[32,33] on the performance plateau of classical scoring functions during
virtual screening. Here, we evaluate the druglikeness of the identified
molecules using the Lipinski’s rule of 5, which requires the
following criteria to be met:Number of H-bonds < 5Number of H-acceptors
< 10Molecular weight < 500 DaOctanol–water partition coefficient
(log P) < 5Using
these above metrics, we further screen the candidates identified
by our MCTS–MTNN search. The distribution of the Lipinski attributes
for the top 200 molecules are shown in Figure . The S-protein Vina scores of the 200 molecules
are provided in Table S1 in Supporting Information. Furthermore, the table of the top candidates and their complete
Lipinski metrics are provided in the Supporting Information files. Out of the 200 best molecules, 134 molecules
satisfy the rule of 5 criteria. Some of the top ranking candidate
molecules from this list are depicted in Figure .
Figure 5
Histogram of the Lipinski attributes for the
top 200 molecules
discovered by the MCTS algorithm. 134 out of the 200 molecules satisfy
all the 4 criteria.
Figure 6
Top candidates identified
from this work along with their Vina
scores for the S-protein/ACE2 interface (labeled, interface) and the
S-protein systems using the ensemble docking simulations. The bottom
panel shows the best binding pose for the corresponding molecules
at the interface of nCov and ACE2 receptor.
Histogram of the Lipinski attributes for the
top 200 molecules
discovered by the MCTS algorithm. 134 out of the 200 molecules satisfy
all the 4 criteria.Top candidates identified
from this work along with their Vina
scores for the S-protein/ACE2 interface (labeled, interface) and the
S-protein systems using the ensemble docking simulations. The bottom
panel shows the best binding pose for the corresponding molecules
at the interface of nCov and ACE2 receptor.In this work, the therapeutic prowess of the molecules are determined
by their Vina scores, which is a measure of the binding energies.
The binding energies of the molecules are dictated by the strength
of the chemical interaction at the S-protein of the virus as well
as the interface formed by the S-protein and human receptors. We identify
the common chemical fragments and patterns within the set of strongly
binding molecules to gain insights into the chemical features which
favors binding. Knowledge of such chemical fragments which make-up
the top candidates will allow accelerated design of therapeutic agents
against SARS-CoV-2.First, we analyze the frequency of occurrence
of chemical features
by splitting each molecule into smaller fragments and counting the
number of unique fragments. We use the molecular fragmenting algorithm
within RDKit[34] to break the molecules across
the single bonds and building a catalogue of unique fragments. A fragment
can be a part of a bigger fragment which can contribute to further
bigger ones. We represent this hierarchy as a directed graph (see Supporting Information S6) and pick only the
terminal nodes for our analysis. The frequency of the top 15% fragments
within the best molecules sampled by MCTS and the best molecules identified
by ref (4) are shown
in Figure . The strings
within the angular brackets describe the functional group attached
to the fragment. The fragments found among the MCTS sampled molecules
are predominantly different from the molecules identified from the
FDA and DrugCentral database in ref (4). Hence, the superior performance of the molecules
identified in this work can be partly attributed to the presence of
unique fragments favoring strong binding.
Figure 7
Counts for a given fragment
within the 200 best performing molecules
sampled by MCTS and the 200 best performing molecules identified from
the FDA approved drugs and the DrugCentral database. Only the fragments
which are present in more than 15% of the molecules are shown here.
The functional group contained within the angular brackets are attached
to the fragments.
Counts for a given fragment
within the 200 best performing molecules
sampled by MCTS and the 200 best performing molecules identified from
the FDA approved drugs and the DrugCentral database. Only the fragments
which are present in more than 15% of the molecules are shown here.
The functional group contained within the angular brackets are attached
to the fragments.Based on the fragment
distribution, we find that the top candidates
from our search tend to include fluoro-alkyl or fluoro-aromatic groups
in their molecular makeup. Fluorinated derivatives and substitution
are known to play a unique role in medicinal chemistry. The tailored
incorporation of a fluoro-alkyl or fluoro-aromatic group (enamine
also having cyclic amines) into bioactive compounds influences the
binding affinity, stability, lipophilicity, membrane permeability,
and bioactivity.[35] Examples including some
of our top compounds such as ligand 2 and ligand 6 consists of the
fluoromethyl group (CF3), which are known to improve adsorption
digestion metabolism and extraction and pharmacokinetic properties,[35,36] see Figure a. It
is worth noting that heterocycles and fused rings constitute of about
70% of the drug market with one of the reasons being the limited entropic
effects showcased by their molecular rigidity. Nitrogen forms a key
atom in such fused rings and heterocycles with a large number of ligands
consisting of the azole moiety. The analysis of some of the top fragments
for the 200 best performing molecules indicate the presence of unsaturated
amines (e.g., ccccn ccccnc), as shown in Figure . This suggests the ubiquity of enamine and
azole chemistry in drug design—such moieties have had considerable
success as antifungal agents. To gain more insights into the performance
of the top ranked candidates, we closely analyze our autodocking simulations
and evaluate the binding pose of these candidates on the interface
of the ACE2 receptor and the virus. We observe that the exocylic nitrogen
(NH2) amino group of the cyclic lactum forms hydrogen bond
interactions with the serine (S494) amino acid of the n-CoV. In the case of another top-ranked molecule, the benzene moiety
interacts with the oxygen of the serine amounting to about −4
to −5 kcal/mol energetic interaction. Based on these observations,
we postulate the role of the amino-azole group with the serine residue
(S494 one of the five mutating sites from the SARS-CoV 2002 virus),
as shown in Figure b, to be an influential factor in determining the efficacy of the
ligands.
Conclusions
In conclusion, we present
an AI search strategy (decision trees
such as MCTS) to discover therapeutic agents against the SARS-CoV-2
virus. We hypothesize that the therapeutic prowess against the SARS-CoV-2
virus is measured by the binding affinity (Vina scores) of the molecules
to either the isolated S-protein of SARS-CoV-2 at its host receptor
region or to the S-protein/ACE2 receptor interface. We replace the
expensive docking simulations with a surrogate MTNN model to accelerate
the exploration of the vast chemical space represented by SMILES strings.
The Vina scores predicted by the MTNN surrogate model are used as
target objectives in the MCTS–RNN search. Our search and retrain
strategy to iteratively explore the design space and concurrently
improve the accuracy of the surrogate model is shown to significantly
accelerate the discovery; we predict over a 100 new molecules that
outperform existing FDA (∼100’s) and non-FDA (∼million)
molecules in their therapeutic prowess measured via Vina scores. We
perform detailed analysis using other metrics related to the druglikeness
of the compounds to further determine the viability of these agents.
Finally, we also analyze structural features and identify structural
similarities between top performing candidates. We note that our molecular
discovery approach using MCTS and RNN is a more general technique
which can be applied to design chemical molecules with the desired
property as long we have a reasonably fast and accurate surrogate
model to screen the SMILES string space and guide the search algorithm.