György Abrusán1, Joseph A Marsh1. 1. MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine , University of Edinburgh , Crewe Road , Edinburgh EH4 2XU , U.K.
Abstract
The spread of antibiotic resistance is one of the most serious global public-health problems. Here we show that a particular class of homomers with binding sites spanning multiple protein chains is particularly suitable for targeting by broad-spectrum antibacterial agents because due to the slow evolutionary change of such binding pockets, ligands of such homomers are much more likely to bind their homologs than ligands of monomers, or homomers with a single-chain binding site. Additionally, using de novo ligand design and deep learning, we show that the chemical compounds that can bind several different receptors have common structural characteristics and that halogens and fragments similar to the building blocks existing antimicrobials are overrepresented in them. Finally, we show that binding multiple receptors selects for flexible compounds, which are less likely to accumulate in Gram-negative bacteria; thus there is trade-off between reducing the emergence of resistance by multitargeting and broad-spectrum antibacterial activity.
The spread of antibiotic resistance is one of the most serious global public-health problems. Here we show that a particular class of homomers with binding sites spanning multiple protein chains is particularly suitable for targeting by broad-spectrum antibacterial agents because due to the slow evolutionary change of such binding pockets, ligands of such homomers are much more likely to bind their homologs than ligands of monomers, or homomers with a single-chain binding site. Additionally, using de novo ligand design and deep learning, we show that the chemical compounds that can bind several different receptors have common structural characteristics and that halogens and fragments similar to the building blocks existing antimicrobials are overrepresented in them. Finally, we show that binding multiple receptors selects for flexible compounds, which are less likely to accumulate in Gram-negative bacteria; thus there is trade-off between reducing the emergence of resistance by multitargeting and broad-spectrum antibacterial activity.
Currently
the world faces an emerging antibiotic resistance crisis
because the rate of developing new antibiotics has not caught up with
the pace of the spread of antibiotic resistance.[1,2] This
is caused by several factors: antibiotic resistance is an ancient
evolutionary phenomenon and is unavoidable,[1,2] while
the small number of novel antibiotics entering the market might be
partly caused by the limitations of existing compound libraries used
in the pharmaceutical industry and the possible lack of unexplored,
“low-hanging-fruit” drug classes[3,4] (but
see also ref (5)).
Socioeconomic factors also contribute, including the irresponsible
use of antibiotics promoting resistance and the relatively low profitability
of novel antimicrobials, which, exactly to prevent the emergence of
resistance, are likely to be used as last-resort drugs rather than
first-line medications. Since antibiotic resistance can emerge quickly,
even in laboratory settings,[6] developing
drugs that reduce the likelihood of resistance is a central goal of
the field.[7−9] Resistance can emerge due to several factors, like
changes in the proteins targeted by the antibiotic, changes in the
rate of removal or uptake of the antibiotic, or changes in the degradation
rate of the antibiotic. However, the analysis of currently available
antibiotics indicates that most successful antibiotics or antibiotic
classes bind several protein targets, e.g., β-lactam antibiotics,
fluoroquinolones (or target substrates rather than enzymes, e.g.,
vancomycin[10]), while resistance emerges
much more quickly for antibiotics that target only a single protein
(e.g., sulfonamides, trimetophrim), and such drugs are used mostly
in combination with other drugs.[7,11] The most likely cause
of this phenomenon is that in the case of single target drugs, a few
mutations at a single binding site can be sufficient to make the drug
ineffective, whereas for multitarget drugs, several binding sites
have to be mutated to achieve resistance.As a consequence,
the strategies that have been employed to slow
down the emergence of resistance typically rely on targeting several
proteins simultaneously, by either a single drug or “cocktails”
of drugs. The central goal is obviously to find novel drug classes,
but an alternative and very promising strategy is to create hybrid
molecules that contain the core pharmacophores of several existing
drugs, connected by a linker.[7,12−14] For several difficult to treat infections like Helicobacter
pylori or Mycobacterium tuberculosis (but
also for pathogens like HIV or Plasmodium), combination
therapy is already the only effective treatment, and multitarget drugs
are currently being developed.[15] Additionally,
it has been shown that the emergence of resistance for one antibiotic
frequently influences (and sometimes increases) the sensitivity to
other antibiotics,[16−20] suggesting that, aside from combination therapy, developing drugs
that target the “right” protein combinations may significantly
reduce the speed at which resistance develops.A theoretical
problem in designing multitarget antibiotics is that
the proteins targeted by them may not be similarly druggable in different
species. First, species that are distantly related (i.e., Gram positive
and Gram negative bacteria) may not share all the proteins targeted
by the drug, even if they are essential. Second, and more importantly,
the binding sites of homologous proteins in distantly related species
may not be structurally identical, and thus the drug might not be
able to bind efficiently all of them, even if they are present. These
problems are expected to scale exponentially with the number of targeted
proteins, and as a consequence, drugs that are multitarget in one
species might effectively be single target in another one and therefore
not be “resistance resistant”. Therefore, selecting
targets that have similar binding sites and functions across the broadest
possible spectrum of bacterial species is of great importance to the
selection of antibiotic targets.Recently, we have found that
the structure of ligand binding sites
has profound consequences for the evolution of protein function and
structural divergence of binding pockets, particularly in proteins
that form homomers–protein complexes made up of multiple copies
of the same polypeptide subunit.[21] In prokaryotes,
homomers are by far the most common type of protein complexes, with
>50% of bacterial proteins of known structure falling into this
category.[21,22] Binding sites that are formed by multiple
protein chains show much
higher structural conservation than sites that are formed by the residues
of individual protein chains, and have more similar ligands than binding
sites of monomers, or homomers with a single chain binding site.[21] Since similar binding pockets generally bind
similar ligands,[23] proteins where distant
homologs have similar binding pockets are potentially good candidates
for broad-spectrum antibiotic targets. In this paper, using in silico
experiments, we show that ligands and de novo designed ligands of
homomers with multichain binding sites (MBS, see Figure A) are more likely to bind
their homologs than ligands of monomers or homomers with a single-chain
binding site (SBS, see Figure B). This shows that considering the quaternary structure of
proteins and the structures of their ligand-binding sites can aid
the selection of protein targets for new broad-spectrum antibiotics.
In addition, using de novo ligand design and methods based on deep
learning, we also test whether the chemical compounds that can bind
several different receptors have common structural characteristics.
We show that fragments similar to the building blocks of existing
antibiotics are overrepresented in them and that there is likely to
be a trade-off between preventing the emergence of drug resistance
and broad spectrum activity. We expect these findings to be useful
in selecting targets of novel leads and in generating targeted fragment
libraries for antibiotic design.
Figure 1
Examples of homomers with multichain binding
sites (MBS), single-chain
binding sites (SBS), and an overview of binding site identification
in homologs. (A) Structure of nitroreductase ydjA from E.
coli (PDB code 3bm1). The dimer structure has two multichain binding sites,
both “sandwiched” between the two chains of the complex.
The ligand (flavin-mononucleotide) is displayed in red, and ligand
binding residues are in yellow. (B) Structure FabH protein from E. coli (PDB code 1ebl). The dimer has two binding sites, both restricted
to a single chain. The ligand (coenzyme A) is displayed in red, and
ligand binding residues are in yellow. (C) Structure of SiaP protein
from Haemophilus influenzae (PDB code 2wyk). The protein is
a monomer, and has a single binding site, with its ligand (N-glyconeuraminic acid) displayed in red. (D) We searched
for similar binding sites in homologous proteins with ProBiS and defined
the region of binding sites for grid building and docking through
local structural alignments. The structural alignment shows the H. influenzae SiaP protein binding site superposed with
the binding site (identified by ProBis) of the homologous c4-dicarboxylate-binding
protein of Pseudomonas aeruginosa (PDB code 4nf0); the red box indicates
the region of the binding sites. Note that the alignment optimized
the superposition of the binding sites and not the global protein
structures. (E) Once the binding site has been identified in c4-dicarboxylate-binding
protein, the ligand of H. influenzae SiaP protein
was docked into it, and the binding energies (i.e., grid score) in
the two structures were compared.
Examples of homomers with multichain binding
sites (MBS), single-chain
binding sites (SBS), and an overview of binding site identification
in homologs. (A) Structure of nitroreductase ydjA from E.
coli (PDB code 3bm1). The dimer structure has two multichain binding sites,
both “sandwiched” between the two chains of the complex.
The ligand (flavin-mononucleotide) is displayed in red, and ligand
binding residues are in yellow. (B) Structure FabH protein from E. coli (PDB code 1ebl). The dimer has two binding sites, both restricted
to a single chain. The ligand (coenzyme A) is displayed in red, and
ligand binding residues are in yellow. (C) Structure of SiaP protein
from Haemophilus influenzae (PDB code 2wyk). The protein is
a monomer, and has a single binding site, with its ligand (N-glyconeuraminic acid) displayed in red. (D) We searched
for similar binding sites in homologous proteins with ProBiS and defined
the region of binding sites for grid building and docking through
local structural alignments. The structural alignment shows the H. influenzae SiaP protein binding site superposed with
the binding site (identified by ProBis) of the homologous c4-dicarboxylate-binding
protein of Pseudomonas aeruginosa (PDB code 4nf0); the red box indicates
the region of the binding sites. Note that the alignment optimized
the superposition of the binding sites and not the global protein
structures. (E) Once the binding site has been identified in c4-dicarboxylate-binding
protein, the ligand of H. influenzae SiaP protein
was docked into it, and the binding energies (i.e., grid score) in
the two structures were compared.
Results
Ligands of MBS Homomers Bind Their Homologs Significantly Better
than Ligands of SBS Homomers or Monomers
In the first step
of the analysis we identified bacterial proteins that are likely to
be suitable targets for antibiotics. Using BLAST and the prokaryotic
proteins present in the Protein Data Bank (PDB), we compiled a data
set of 687 pairs of homologous bacterial proteins. We only used proteins
that form homomers or monomers. The proteins of the homologous pairs
were selected according to the following criteria: (1) none of the
proteins have a homolog in the human genome with BLAST e-value <10–3; (2) the pairs have homologous
regions, with BLAST e-value <10–5; (3) at least one protein of the pair is essential; (4) the sequence
similarity between them is less than 40% because above 40% protein
structures are very similar (see ref (24) for examples); (5) both proteins have a biologically
relevant ligand, i.e., present in the BioLiP database;[25] and (6) their quaternary structure and binding
site type (MBS vs SBS) are similar. Next, in each PDB structure of
the protein pairs, the ligand-binding site of every small molecule
ligand was identified (Figure C), and we identified the best binding site candidate in the
structures of their homologs with ProBiS[26,27] (Figure D). Once
the location of the putative binding site was determined in the homolog,
we docked the ligand with DOCK[28] to the
original binding site and the binding site in the homolog and compared
their estimated binding energies (i.e., grid score, Figure E). Only those pockets (and
proteins) were used where the performance of DOCK was acceptable;
i.e., it could reproduce the position of the original ligand in the
top 10 scoring poses (see Methods for details).
In addition, the binding sites of metals and cofactors were excluded
from the analysis because metals typically have very small binding
sites, while the binding pockets of cofactors are structurally so
conserved that they are likely to be similar in bacterial and mammalian
proteomes.[21] This procedure resulted in
a set of 1007 binding pocket pairs (78 in MBS homomers, 212 in SBS
homomers, and 717 in monomers) between 1464 different binding pockets
of 262 different proteins (see Table S1 for a list of pocket pairs). Note that the pocket pairs contain
redundancies; the number of nonredundant pockets (excluding pockets
with metals and cofactors) is 524 in the original PDB receptors and
940 in their homologs.The binding pocket searches with ProBiS
indicate that this data set, despite its much smaller size, shows
a qualitatively similar pattern of binding pocket similarity as the
entire PDB in our previous study:[21] the
frequency of MBS homomer pairs that have no highly similar (ProBiS Z score >2) binding pockets is much smaller than in SBS
homomers or monomers (Figure A, p = 0.029 and p = 0.0001,
respectively; tests of proportions). Docking of the ligands into their
original binding site and the best binding site of their homologs
shows a similar trend (Figure B). The normalized grid score [(scoreoriginal –
scorehomolog)/scoreoriginal] indicates that
for MBS homomers, the binding of the ligand in their homolog is nearly
as strong as in the original binding site, while in SBS homomers and
especially monomers, ligands bind much more weakly at the homologous
biding sites (Figure B, p = 0.013 between MBs and SBS homomers, and p ≪ 0.001 for other comparisons, Dunn’s test
[Kruskal–Wallis rank sum test]). The performance of DOCK was
highly dependent on the size of the ligand (and number of rotatable
bonds), as reported previously[28,29] (Figure C), and performed best on ligands with 10–30
heavy atoms. The estimated binding energy of ligands (i.e., DOCK grid
score, which estimates interaction strength based on a simplified
force field) generally follows a linear trend in all three quaternary
structure types, with the ligands of monomers having many more clashes
in their homologs than homomers (Figure D–F). Statistical analyses with ANCOVA
indicate a similar trend as the normalized grid scores: there is a
marginally significant difference between MBS homomers and their homologs
(Figure E), and the
difference is largest between monomers and their homologs (Figure F). Taken together,
these findings indicate that ligands of MBS homomers bind much more
strongly in their homologs than ligands of SBS homomers or monomers.
Figure 2
Ligands
of homomers are more likely to bind their homologs than
the ligands of monomers. (A) Frequency of homologous protein pairs
without similar ligand binding sites, (ProBiS Z score
of >2). See also ref (21) for general trends in ligand binding site evolution. MBS
homomers
are much more likely to have an identical binding site in their homologs
than SBS homomers (p = 0.029, test of proportions)
or monomers (p = 0.00011, test of proportions). (B)
The difference between the estimated ligand binding energies (grid
score) in the original binding sites and homologous binding sites
shows that homomers, and particularly MBS homomers, are much more
likely to bind the ligands of their homologs than monomers. Normalized
grid score difference was calculated as (scoreoriginal –
scorehomolog)/scoreoriginal. All three possible
comparisons are significant (p = 0.013 for MBS vs
SBS homomer; p ≪ 0.005 for MBS/SBS homomer
vs monomers; Kruskal–Wallis rank sum tests). (C) Pose reproduction
success rate of DOCK for ligands of different sizes. The high flexibility
of ligands above 40 heavy atoms in this particular data set results
in a relatively low (∼50%) success rate for large ligands.
(D–F) The relationships between ligand size and binding energy
(grid score) follow a linear trend in the original ligand binding
sites and the binding sites of homologs. The difference between the
original binding sites and the binding sites of homologs is lowest
in MBS homomers and largest in monomers (ANCOVA).
Ligands
of homomers are more likely to bind their homologs than
the ligands of monomers. (A) Frequency of homologous protein pairs
without similar ligand binding sites, (ProBiS Z score
of >2). See also ref (21) for general trends in ligand binding site evolution. MBS
homomers
are much more likely to have an identical binding site in their homologs
than SBS homomers (p = 0.029, test of proportions)
or monomers (p = 0.00011, test of proportions). (B)
The difference between the estimated ligand binding energies (grid
score) in the original binding sites and homologous binding sites
shows that homomers, and particularly MBS homomers, are much more
likely to bind the ligands of their homologs than monomers. Normalized
grid score difference was calculated as (scoreoriginal –
scorehomolog)/scoreoriginal. All three possible
comparisons are significant (p = 0.013 for MBS vs
SBS homomer; p ≪ 0.005 for MBS/SBS homomer
vs monomers; Kruskal–Wallis rank sum tests). (C) Pose reproduction
success rate of DOCK for ligands of different sizes. The high flexibility
of ligands above 40 heavy atoms in this particular data set results
in a relatively low (∼50%) success rate for large ligands.
(D–F) The relationships between ligand size and binding energy
(grid score) follow a linear trend in the original ligand binding
sites and the binding sites of homologs. The difference between the
original binding sites and the binding sites of homologs is lowest
in MBS homomers and largest in monomers (ANCOVA).
Characteristics of Antibiotics and de Novo Ligands
We next
determined whether de novo designed ligands show similar
trends as the previous analysis of the original ligands of the PDB
structures. We used de novo DOCK[30] to design
novel ligands in each binding site where the pose reproduction of
the original ligand was successful, excluding cofactor and metal binding
sites and also sites where the original ligand had only one or no
rotatable bonds (see Methods for details),
with a maximum molecular weight cutoff of 550 Da. This resulted in
453 binding sites where de novo ligands were built (see Table S3 for the full list). From all de novo
ligands that were generated in each receptor we selected the 50 with
the lowest grid scores (i.e., the 50 best binders) and used only these
in the further analyses, altogether 22 650. Note that de novo
ligands were not designed in receptors identified through homology.
We also downloaded the SMILES strings of FDA approved antibiotics
present in DrugBank[31] from the ZINC15 database[32] to compare the properties of de novo ligands
with real antimicrobial agents. Antibiotics with molecular weight
of >600 Da were not included for two reasons. First, only a relatively
small number of compounds fall into this size category (glycopeptides
[e.g., vancomycin], lipopeptides [e.g., daptomycin], or macrolides),
and these are typically active only against Gram positive bacteria.[33] Additionally, these compounds generally have
complex, cyclic structures, and designing functional ligands of this
size and complexity is beyond the capabilities of the current de novo
design tools. The list of antibiotics included is available in Table S2.The comparison of molecular weight
distributions of antibiotics and de novo ligands shows that the de
novo ligands of MBS homomers are somewhat biased toward larger molecular
weights than antibiotics, which have a peak between 400 and 500 Da
(Figure A,B), while
monomers (Figure B)
and SBS homomers (Figure S1A) are characterized
by more uniform distributions. Next, the 50 best scoring de novo ligands
of each receptor were re-docked into their original binding site and
to the corresponding binding site in their homologous proteins, and
we determined the fraction of ligands that have an approximately similar
molecular weight to the original ligand (min. 85%), and bind in the
homologous binding sites similarly as in the original site (min 85%
of grid score). The comparison of de novo ligands in the different
quaternary structures indicates that in homomers, and particularly
MBS homomers, the fraction of ligands that bind similarly in homologs
and in the original binding site is much higher than in monomers,
particularly above 300 Da (Figure C, Figure S1B; ∗∗, p ≪ 0.005; tests of proportions). The difference
is primarily caused by the higher similarity of binding sites and
not by the qualitatively different performance of de novo DOCK in
proteins with different binding sites (although it may somewhat contribute
to the pattern in SBS homomers) because the fraction of strongly binding
ligands (grid score of de novo ligand is min 85% of the original ligand)
is comparable in all quaternary structure types (Figure S2). Surprisingly, SBS homomers perform somewhat better
than MBS homomers or monomers (Figure S2), although the difference is less than 20% for most molecular weight
bins.
Figure 3
General properties of antibiotics and de novo ligands. For each
receptor, the 50 best scoring ligands were included. (A) Molecular
weight distribution of Drugbank antibiotics, below 600 Da. (B) Molecular
weight distributions of de novo ligands in MBS homomers and monomers.
The de novo ligands of MBS homomers are significantly larger than
of monomers, while the ligands of SBS homomers (Figure S1) do not show the same trend. (C) The frequency of
ligands that bind similarly in homologs and their original binding
site is significantly higher among MBS homomers than monomers above
300 Da (∗∗, p ≪ 0.005, tests
of proportions). (D) The log P values of antibiotics
and de novo ligands. For most antibiotics log P is below 3, but for a significant number of de novo ligands it is
above 3, particularly above 300 Da. (E) The druglikeness of antibiotics
and de novo ligands follows the same trend and declines above molecular
weight of 300. (F) Below 300 Da the synthetic accessibility of de
novo ligands is considerably worse than of real antibiotics.
General properties of antibiotics and de novo ligands. For each
receptor, the 50 best scoring ligands were included. (A) Molecular
weight distribution of Drugbank antibiotics, below 600 Da. (B) Molecular
weight distributions of de novo ligands in MBS homomers and monomers.
The de novo ligands of MBS homomers are significantly larger than
of monomers, while the ligands of SBS homomers (Figure S1) do not show the same trend. (C) The frequency of
ligands that bind similarly in homologs and their original binding
site is significantly higher among MBS homomers than monomers above
300 Da (∗∗, p ≪ 0.005, tests
of proportions). (D) The log P values of antibiotics
and de novo ligands. For most antibiotics log P is below 3, but for a significant number of de novo ligands it is
above 3, particularly above 300 Da. (E) The druglikeness of antibiotics
and de novo ligands follows the same trend and declines above molecular
weight of 300. (F) Below 300 Da the synthetic accessibility of de
novo ligands is considerably worse than of real antibiotics.Next we examined whether there
are consistent qualitative differences
between antibiotics and de novo ligands for the three different quaternary
structure types. We used a chemical variational autoencoder (VAE),
a recently developed method based on deep learning,[34] to estimate the synthetic accessibility[35] (SAS), quantitative estimate of druglikeness[36] (QED), and octanol–water partition coefficient
(log P) values for antibiotics and each de novo
ligand using their SMILES (Figure D–F). The solubility in organic solvents (log P values) of de novo ligands gradually increase with molecular
weight and, for ligands with molecular weight above 300 Da, can substantially
exceed the values observed in antibiotics, which are generally below
3 (Figure D). The
quantitative estimate of druglikeness (QED, higher is better) is a
more modern estimate of druglikeness than Lipinski’s rule of
five and is a combination of eight different molecular descriptors.[36] Antibiotics and de novo ligands show a qualitatively
similar, nonlinear pattern, with the most druglike compounds having
molecular weight of 200–400 Da (Figure E). However, even the compounds with molecular
weight above 400 Da have reasonably good (>0.5) QED estimates,
and
since antibiotics and natural compounds are known to not follow the
same trend in druglikeness as synthetic drugs,[33] the druglikeness of de novo ligands should be generally
seen as satisfactory. The largest difference between antibiotics and
de novo ligands is in their synthetic accessibility (SAS, lower is
better), which estimates the ease of synthesis of chemical compounds
(Figure F). Antibiotics
show a clearly increasing trend with molecular weight, with SAS being
below 3 for small compounds (easy synthesis) and gradually increasing
to ∼6 (more difficult synthesis) above 300 Da. In contrast,
de novo ligands show almost no change with molecular weight, with
high variability in every weight class (Figure F). Moreover, their SAS below 300 Da is much
higher than of real antibiotics. This is in agreement with previous
reports that the synthetic accessibility of de novo designed ligands
can be low. However, it should be noted that most natural products
also fall in the range of SAS = 3–6, and thus the values we
observe are not prohibitively high.Finally, using the VAE,
we tested whether there are consistent
differences in the chemical composition of de novo ligands of MBS
homomers, SBS homomers, and monomers. The VAE is fundamentally a pair
of neural networks that “encode” (convert to) and “decode”
(convert back) discrete chemical compounds into a continuous, high
dimensional space called “latent space”, where they
are represented as a numerical vector. The vector representation offers
several powerful operations on chemical compounds, like generating
entirely novel molecules, interpolating between molecules, or optimizing
existing molecules[34] (see also further
analyses below). Additionally, since the latent space is structured,
where similar compounds are located close to each other, it can also
be used to visualize whether there are any structural clusters in
the de novo ligands. We transformed each de novo ligand into a vector
in the latent space using the encoder module of the VAE. The latent
space has 196 dimensions, and to reduce its dimensionality, we used
Barnes–Hut t-SNE[37] for clustering.
This transforms the position of chemical compounds in a high dimensional
space into 2D, which is suitable for visualization. The 2D maps of
de novo ligands show that monomers and MBS homomers have different
distributions in the chemical space, and SBS homomers show an intermediate
pattern (Figure S3). However, much of the
variation is simply due to differences in size (see Figure ), and aside from these coarse
grained differences, no distinct, quaternary structure specific clusters
could be identified: in the areas of the plot where all quaternary
structure types are present, their distribution is homogeneous (e.g.,
red circle, Figure S3).
The Best Ranking
de Novo Ligands Show Similar Binding Patterns
as the Original Ligands of the Receptors
Next we tested whether
the de novo ligands show similar patterns in binding the homologous
binding sites as the original ligands of the receptors (Figure ). We found that this is the
case for the best ranking ligands, although the pattern is less pronounced
than for the original ligands (Figure ) most likely due to the higher flexibility of de novo
ligands. For the 10 best scoring de novo ligands of every receptor,
the normalized grid scores are significantly different in all three
possible comparisons (p ≪ 0.001 for all three
comparisons, Dunn’s test, [Kruskal–Wallis rank sums
test], Figure A).
The frequency of clashes (Figure B) is significantly higher in monomers than in MBS
and SBS homomers (p < 0.001 in both cases, tests
of proportions), but there is no difference between MBS and SBS homomers
(p = 0.369, test of proportions). The grid scores
of de novo ligands show a similar linear scaling with their size,
in their original and homologous binding sites, with somewhat weaker
binding in homologs (Figure C–E). For the lower ranking ligands, we found that
the difference between ligands of MBS and SBS homomers is not consistent;
however, de novo ligands of monomers have consistently worse (higher)
scores in the binding sites of homologs than MBS or SBS homomers (not
shown).
Figure 4
The best scoring de novo ligands show qualitatively similar, although
weaker trends than the original ligands of the receptors. (A) Difference
between the normalized grid score of the 10 best ligands of each receptor.
All three possible comparisons are significant (p < 0.0001 Kruskal–Wallis rank sum tests). (B) The frequency
of clashes is significantly higher in monomers than in homomers (p < 0.001 for both MBS and SBS homomers, tests of proportions),
but there is no difference between MBS and SBS homomers homomers (p = 0.43). (C–E) Relationships between ligand size
and estimated binding energy (grid score) of the best scoring ligand
of each receptor in the original ligand binding site and the binding
sites of homologs. Similar to the original ligands of the receptor,
the relationship between size and grid score is linear, and the difference
between the original binding sites and the binding sites of homologs
is smallest in MBS homomers and largest in monomers.
The best scoring de novo ligands show qualitatively similar, although
weaker trends than the original ligands of the receptors. (A) Difference
between the normalized grid score of the 10 best ligands of each receptor.
All three possible comparisons are significant (p < 0.0001 Kruskal–Wallis rank sum tests). (B) The frequency
of clashes is significantly higher in monomers than in homomers (p < 0.001 for both MBS and SBS homomers, tests of proportions),
but there is no difference between MBS and SBS homomers homomers (p = 0.43). (C–E) Relationships between ligand size
and estimated binding energy (grid score) of the best scoring ligand
of each receptor in the original ligand binding site and the binding
sites of homologs. Similar to the original ligands of the receptor,
the relationship between size and grid score is linear, and the difference
between the original binding sites and the binding sites of homologs
is smallest in MBS homomers and largest in monomers.
Characteristics of de Novo Ligands That Can
Target Several Proteins
Next, we examined whether de novo
ligands that are likely to bind
several proteins have common structural characteristics. We performed
the analysis in two steps. First, since de novo DOCK uses a fixed
fragment library to build the ligands, we performed a fragment enrichment
analysis of the de novo ligands that have structurally similar ligands
in multiple receptors. Second, we optimized/evolved these ligands
using the VAE to test whether structural motifs that were absent in
the fragment library of de novo DOCK emerge during the optimization.From the de novo ligands (using the best 50 in each receptor),
we selected ligands predicted to target several proteins with the
following procedure. (1) First we converted each de novo ligand into
its vector representation in the latent space of the VAE and calculated
all possible (256 million) distances between the 22 650 de
novo ligands in the latent space, using their vector representation (see Figure A). (2) Next we selected the pairs of de novo ligands where their
distance in the latent space is less than 13, and both de novo ligands
have lower (better) grid score than the original ligand of the binding
site. The distance in the latent space correlates with the structural
similarity of compounds; however, it also depends on the size of the
ligand: in the case of ligands of 300–400 Da, the distance
13 roughly corresponds to Tanimoto similarity 0.7 (Figure B,C), while for larger ligands,
larger distances have to be used for the same structural similarity
(see Figure S4). We chose the distance
cutoff 13, because the druglikeness of de novo ligands in our data
set is highest below 400 Da; see Figure E. (3) From the ligand pairs, we selected
the ligands that have a low (<13) scoring pair in at least two
more bacterial species than the query ligand. (4) We discarded ligands
with properties that are substantially different from real antibiotics:
molecular weight of <300 Da, log P > 3,
and
where SAS > (molecular weight/100) + 1. These steps reduced the
total
pool of ligands to only 56 (1 duplicate ligand), which are expected
to be druglike, have structurally similar ligands in the receptors
of minimum three different species, and are likely to be easy to synthesize
(see Supplmentary Data for the list of
ligands and their structures).
Figure 5
Outline of the structural comparisons
using the VAE vector representation.
(A) De novo ligands were converted to vector, which represents the
parameters of a statistical distribution and defines the location
of the compound in the latent space. The Euclidean distance between
the vectors was used to measure the structural similarity between
all ∼256 million possible pairs of compounds. (B) The frequency
of pairs with distance below 13 is negatively correlated with their
molecular weight. (C) For molecules above 300 Da the distance cutoff
13 results in an approximate Tanimoto similarity of 0.7 (most of them
are in the range of 300–400 Da).
Outline of the structural comparisons
using the VAE vector representation.
(A) De novo ligands were converted to vector, which represents the
parameters of a statistical distribution and defines the location
of the compound in the latent space. The Euclidean distance between
the vectors was used to measure the structural similarity between
all ∼256 million possible pairs of compounds. (B) The frequency
of pairs with distance below 13 is negatively correlated with their
molecular weight. (C) For molecules above 300 Da the distance cutoff
13 results in an approximate Tanimoto similarity of 0.7 (most of them
are in the range of 300–400 Da).The statistical variability of such a small set of ligands
is expected
to be much larger than for the whole data set of 22 650 molecules.
Therefore, to estimate the uncertainty associated with our analysis,
we repeated the entire de novo design process and generated a second,
fully independent set of de novo ligands using the same methods. In
this second set, de novo ligands were built in 452 binding sites (see Table S3). This resulted in an additional, independent
set (set 2) of 73 putatively antibiotic-like ligands (with three duplicate
ligands; see Supplementary Data).Several of the selected de novo ligands are reminiscent of compounds
and drugs with known antimicrobial activity, particularly of quinolones,
quinazolinones, oxadiazoles, and morpholine antifungals (Figure A; Figure S5A; note the top left compound of Figure A, which resembles a hybrid
between a quinolone and an oxazolidinone antibiotic, and the bottom
left compound, which is similar to morpholine antifungal). Quinazolinones and oxadiazoles represent two new groups
of antibacterials that were shown to be effective against methicillin
and vancomycin resistant Staphylococcus aureus.[38−44] From the 380 fragments of the fragment library, 11 (set 1) and 12
(set 2) are significantly enriched in the compounds (p < 0.05, tests of proportions), and there are overlaps between
the two sets of fragments (Figure B–D; Figure S5B–D). The most common fragment is the carboxyl group (side chain 23),
and in both sets halogen-containing fragments are enriched. Additionally,
in set 1 the oxadiazole linker (lnk.36), and the 4-morpholinyl side
chain (sid.68) that is the core pharmacophore of several ergosterol
synthesis inhibiting antifungals (amorolfine, fenpropimorph, tridemorph),[45−47] is enriched, while in set 2, a quinolone-like linker (quinazolinone,
lnk.308, Figure S5C) is enriched. However,
although being significant in only one of the sets, quinolone-like
(quinazolinone) and morpholine groups are common in both sets (see Supplementary Data; note that only quinazolinone
and not quinolone fragments was present in the fragment library).
Figure 6
Properties
of the selected de novo ligands in set 1. (A) Examples
of interesting de novo ligands. See Supplementary Data for the full list. Several of them are reminiscent of
quinolones, quinazolinones, oxadiazoles, and morpholine antifungals.
Oxazolidinone, quinazolinone, oxadiazole, and morholine groups are
highlighted with red, and the DOCK anchor fragments are highlighted
with blue. A particularly interesting case (top left) contains a quinazolinone
linker and an oxazolidinone side chain (oxazolidinones are among the
newest antibiotics, e.g., linezolid, tedizolid,[9] used to treat vancomycin resistant Gram-positive bacteria).
The compound at the top center is reminiscent of an oxadiazole antibiotic,
while the bottom left compound resembles morpholine antifungals. (B)
Word cloud of the significantly enriched fragments (p < 0.05, tests of proportions; sid. = side chain, a fragment with
only one connection; lnk. = linker, a fragment with two connections;
scf. = scaffold, a fragment with three connections). The size of the
symbols corresponds to the abundance of the fragments. (C) Enrichment
of the fragments. Enrichment was calculated as frequency in set 1/frequency
in all de novo ligands with similar size and properties to set 1.
(D) 2D structures of the significantly enriched fragments. Several
fragments with halogen groups (Cl/Br) are present among them and also
a morpholine side chain (sid.68).
Properties
of the selected de novo ligands in set 1. (A) Examples
of interesting de novo ligands. See Supplementary Data for the full list. Several of them are reminiscent of
quinolones, quinazolinones, oxadiazoles, and morpholine antifungals.
Oxazolidinone, quinazolinone, oxadiazole, and morholine groups are
highlighted with red, and the DOCK anchor fragments are highlighted
with blue. A particularly interesting case (top left) contains a quinazolinone
linker and an oxazolidinone side chain (oxazolidinones are among the
newest antibiotics, e.g., linezolid, tedizolid,[9] used to treat vancomycin resistant Gram-positive bacteria).
The compound at the top center is reminiscent of an oxadiazole antibiotic,
while the bottom left compound resembles morpholine antifungals. (B)
Word cloud of the significantly enriched fragments (p < 0.05, tests of proportions; sid. = side chain, a fragment with
only one connection; lnk. = linker, a fragment with two connections;
scf. = scaffold, a fragment with three connections). The size of the
symbols corresponds to the abundance of the fragments. (C) Enrichment
of the fragments. Enrichment was calculated as frequency in set 1/frequency
in all de novo ligands with similar size and properties to set 1.
(D) 2D structures of the significantly enriched fragments. Several
fragments with halogen groups (Cl/Br) are present among them and also
a morpholine side chain (sid.68).Next we tested whether the selected ligands originate from
binding
sites of proteins with similar functions to the proteins that are
targeted by the antibiotics they are similar to. Quinolones target
the DNA binding gyrase and topoisomerase IV,[48] quinazolinones and oxadiazoles target penicillin binding protein
2a,[38,39,44] and morpholine
antifungals inhibit ergosterol synthesis.[46] The de novo ligands of the two sets originate from the binding sites
of 23 and 25 proteins (set 1 and set 2, respectively, see Table S4 and S5). Neither their quaternary structure
(Figure A, Figure S6A) nor their gene ontology (GO) terms
show significant enrichment compared to the full set of proteins we used. The frequency of cellular component
GO terms indicates that they are present in several different cellular
components, including the cytoplasm, cell wall, or periplasmic space
(Figure B, Figure S6B). The frequency of molecular function
GO terms show that most of them have enzymatic functions that are
not related to the functions of quinolone, quinazolinone, oxadiazole,
or morpholine antimicrobials, although many of them utilize NADP cofactors
and ATP in catalysis (Figure C,D; Figure S6C, Tables S4 and S5). While we excluded cofactor-binding sites
from our analysis, the binding sites of substrates and cofactors frequently
form a continuum, and the structural constraints imposed by nucleic
acid containing cofactors or ATP could contribute to the similarity
of several de novo ligands to quinolones.
Figure 7
Gene ontology analysis
of the receptors of selected ligands. (A)
The quaternary structure composition is not significantly different
from the total set of proteins (p = 0.42), although
the frequency of MBS homomers is lower, as a result of the requirement
to have similar structures in at least three species. (B) Graph of
cellular component terms associated with the proteins. Note that the
color-coding of terms (intensity of red) corresponds to the frequency
of terms and not statistical significance. The proteins are present
in most cellular components, including cell wall, periplasmic space,
plasma membrane, and cytoplasm. (C, D) Graphs of molecular function
terms associated with the proteins. The two highest-level terms are
binding (C) and catalytic activity (D), with nucleoside/nucleotide
binding and oxidoreductase activity being the most common terms.
Gene ontology analysis
of the receptors of selected ligands. (A)
The quaternary structure composition is not significantly different
from the total set of proteins (p = 0.42), although
the frequency of MBS homomers is lower, as a result of the requirement
to have similar structures in at least three species. (B) Graph of
cellular component terms associated with the proteins. Note that the
color-coding of terms (intensity of red) corresponds to the frequency
of terms and not statistical significance. The proteins are present
in most cellular components, including cell wall, periplasmic space,
plasma membrane, and cytoplasm. (C, D) Graphs of molecular function
terms associated with the proteins. The two highest-level terms are
binding (C) and catalytic activity (D), with nucleoside/nucleotide
binding and oxidoreductase activity being the most common terms.
Optimization and Evolution
of de Novo Ligands Using AI
The main limitation of using
fragment libraries by de novo design
tools is that the extent of chemical space that can be explored by
the de novo ligands is limited by the fragment library. To overcome
this, we further optimized the selected sets of de novo ligands (56
and 73 compounds) using the VAE (Figure A). We applied the following, so-called “hill
climbing” protocol for ligand optimization. (1) First we docked
the selected ligands to all the receptors where a de novo ligand with
distance in the latent space less than 13 was found and selected the
three receptors, each from a different species, where the grid score
is the lowest. (2) Next we encoded the ligand and sampled its neighborhood
in the latent space for structurally related molecules (see Figure A and Figure S7). We used several different Z-distance cutoffs (see ref (34) and Methods), from 0.5
to 50, which correspond to increasingly larger added noise levels
to the encoded molecule. (3) The neighbors were converted to 3D structures
and were docked to the three previously selected receptors. If at
least two neighbors had lower (better) grid scores than the encoded
molecule, the best performing neighbor was selected, and the cycle
was repeated until no improvement was detected (Figure A). This process generally resulted in lower
grid scores (see Figure S8 for an example),
and it also takes into account the similarity of properties like log P, QED, or SAS during the optimization of the ligand. (4)
In the final step, using the ligands of the entire optimization process
(including the neighbors that were not selected for further optimization),
we selected the ligand with the best average grid score that also
satisfied the same criteria as the original de novo ligand: log P < 3, SAS < (molecular weight/100) + 1, and molecular
weight of >300.
Figure 8
Optimization and evolution of ligands with AI. (A) Outline
of the
method. First, the selected de novo ligands were encoded with the
VAE, and the latent space was sampled in their vicinity to search
for related structures. Next the, samples were “decoded”
into chemical structures and were docked into three different receptors,
and the best performing (lowest scoring) molecule was used to repeat
the cycle until there was no more improvement in the estimated binding
energies (grid score). See also Figure S8. (B) The optimization resulted in a comparable improvement in binding
energies (grid score), irrespective of the Z distance
cutoff used. (C) The median Tanimoto similarity of optimized ligands
and de novo ligands is 0.7. (D, E) The frequency of four- and three-membered
rings in the optimized ligands is significantly higher than in the
original de novo ligands and the random expectation (∗, p < 0.05; ∗∗, p < 0.005,
randomization tests). Distance 0 indicates the original de novo ligands,
red horizontal bars indicate the observed frequency of four- or three-membered
rings, while black horizontal bars indicate their expected frequency.
Violin plots show the frequency distribution of 10 000 random
replicates. Outliers above 20% were not plotted for clarity. Note
that in the case of four-member rings, the expected frequency is higher
than their frequency among the de novo ligands (Z = 0) because they are more common in the VAE samples. (F) Example
of the emergence of a β-lactam-like ring in a de novo ligand.
Optimization and evolution of ligands with AI. (A) Outline
of the
method. First, the selected de novo ligands were encoded with the
VAE, and the latent space was sampled in their vicinity to search
for related structures. Next the, samples were “decoded”
into chemical structures and were docked into three different receptors,
and the best performing (lowest scoring) molecule was used to repeat
the cycle until there was no more improvement in the estimated binding
energies (grid score). See also Figure S8. (B) The optimization resulted in a comparable improvement in binding
energies (grid score), irrespective of the Z distance
cutoff used. (C) The median Tanimoto similarity of optimized ligands
and de novo ligands is 0.7. (D, E) The frequency of four- and three-membered
rings in the optimized ligands is significantly higher than in the
original de novo ligands and the random expectation (∗, p < 0.05; ∗∗, p < 0.005,
randomization tests). Distance 0 indicates the original de novo ligands,
red horizontal bars indicate the observed frequency of four- or three-membered
rings, while black horizontal bars indicate their expected frequency.
Violin plots show the frequency distribution of 10 000 random
replicates. Outliers above 20% were not plotted for clarity. Note
that in the case of four-member rings, the expected frequency is higher
than their frequency among the de novo ligands (Z = 0) because they are more common in the VAE samples. (F) Example
of the emergence of a β-lactam-like ring in a de novo ligand.The majority
of unoptimized de novo ligands
could bind all three receptors without clashes (78% and 75%, in set
1 and set 2, respectively), and 98 and 95% of them could bind at least
two. Nevertheless, the optimization resulted in a clear improvement
of grid scores in both sets of molecules (Figure B, Figure S10A). While the final optimized molecules frequently differed, there
was little difference in the magnitude of the grid score improvement
when different noise levels (Z-distances) were used.
This is due to several processes: first the characteristics of the
VAE sampling play a role, as the structural diversity of the sampled
SMILES changes little with the magnitude of the selected noise (Z-distance) level (Figure S9).
With larger noise levels, the probability of a obtaining a valid SMILES
string decreases, and in consequence most decoded ligands still originate
from the neighborhood of the input even when the added noise (Z-distance) is large. (Note that the VAE is inherently stochastic,
and decoding the same vector several times, without any added noise,
results in variability in the returned SMILES.) Second, the ligands
can probably be improved only to a limited degree, and the different
runs reached different (and comparable) local optima. Similarly, the
Tanimoto similarity score of the final optimized molecules with the
original de novo ligand depends little on the Z-distance
used and is approximately ∼0.75 for all Z cutoffs
(Figure C, Figure S10B).The optimized
ligands show structural differences compared to the
original de novo ligands. Quinolone-like and aromatic rings were generally
not modified substantially by the optimization, while morpholine groups
(and nonaromatic rings) were more variable and were frequently modified
or substituted (see Supplementary Data Sets). Currently it is unclear to what degree these differences are caused
by the nonhomogeneous nature or other characteristics of the latent
space, as chemical generative methods are still under rapid development[49,50] (see also Segler et al.[51] for a different,
recurrent neural network based approach). We tested the optimized
(and de novo) ligands for toxicity and the presence of pan-assay interference
compounds (PAINS)[52,53] with FAF-Drugs4[54] and the rd_filters tool, using the Glaxo filter.[55] We found that none of the de novo ligands and
only a small fraction of optimized ligands (∼2%; 4 in set 1
and 7 in set 2, using all Z-distances) contain PAINS
fragments. The fraction of putatively toxic (“rejected”)
compounds is, 6% and 9%, while the fraction of compounds with low-risk
structural alerts is 50–60% both in de novo and in optimized
ligands in the two sets using FAF-Drugs4 (note that low-risk structural
alerts are frequent in existing drugs). The Glaxo filters flag ∼16%
of the de novo and ∼20–21% of the optimized compounds
as problematic in both sets (see Table S6 and Supplementary Data). However, most
likely both tools underestimate the frequency of toxic or unstable
fragments.Generally the optimized ligands are characterized
by a significantly
higher frequency of three- and four-membered rings than the original
de novo ligands (red bars, Figure D,E, Figure S10C,D). This
pattern can be the result of two separate processes. First, in the
molecules returned by the VAE sampling, the frequency of small rings
can be higher than in the original de novo ligands due to the characteristics
of the training set of the VAE. Second, such fragments can be enriched
due to the structural constraints imposed by binding in several receptors.
We separated these two processes by Monte Carlo simulations. We determined
the presence of three- and four- membered rings in the molecules returned
by the VAE sampling for each de novo ligand with RDKit. Next, we resampled
the returned molecules 10 000 times to determine the random
expectation for the frequency of three- and four-membered rings in
the optimized ligands. The results show that the frequency of three-
and four-membered rings in the final, optimized ligands is significantly
higher (randomization test, see Methods) than
the random expectation for most Z-distances, despite
the fact that in the case of four-membered rings the expected frequency
is considerably higher than their frequency in de novo ligands (Figure D,E, Figure S10C,D). This indicates that the necessity
to bind multiple receptors selects for small rings, and their enrichment
is not a simple byproduct of the sampling characteristics of the VAE.Neither the fragment library of DOCK nor the compounds of the VAE
were specifically tailored for antibiotic design. The fragment library
of DOCK was extracted from 13 million ZINC compounds by selecting
the most common ones,[30] while the VAE was
built from 250K randomly selected ZINC molecules.[34] Using the optimized compounds to update the de novo fragment
library and repeating the entire de novo design and VAE optimization
process several times may be useful in evolving fragment libraries
tailored for specific tasks from a generic one (both for computational
or experimental screening[56]). Using DOCK,
we extracted the fragments from the optimized ligands and filtered
out those already present in the input de novo ligands with Open Babel
and also those lacking a carbon atom. This resulted in a very high
diversity of fragments: altogether we identified 352 new fragments
in the two sets (Figure S11), which do,
however, contain several fragments that are toxic, reactive, or unstable,
like epoxide, aziridine, or cyclobutadiene-like rings (and others).
Note that different fragments can contain the same substructure if
the number and location of their attachment points (−Xx) are
different. However, the majority of them are present in only in a
single structure, and selecting the ones that are present in at least
two structures in the same VAE batches (and excluding putatively toxic
ones) results in a much smaller set of 28 fragments (Figure ), of which four contain halogens
and seven contain four-membered rings, that are almost completely
absent in the fragment library of DOCK. Several of them contain reactive
double bonds; thus in practice their saturated versions should be
used. The effect of such (e.g., azetidine-like) four-membered rings
on toxicity is less clear; azetidine-based inhibitors of herpesvirus
proteases are known and being developed,[57] and azetidines have been suggested for use as peptidomimetics in
medicinal chemistry.[58] However, exactly
due to its incorporation into proteins (as a substitute of proline),
azetidine-2-carboxylic acid is known to be toxic.
Figure 9
Fragments of the optimized
ligands, which are present in at least
two compounds in any of the VAE batches (excluding fragments that
are likely to be toxic). Several halogenated compounds and four-membered
rings are present among them. Since the four-membered rings frequently
contain reactive double bonds, in practice their saturated versions
should be used.
Trade-Off between
Preventing Resistance and Accumulation in
Gram-Negative Bacteria
One of the key difficulties in developing
broad spectrum antibiotics that are active against both Gram-positive
and Gram-negative bacteria is due to the outer membrane and efficient
efflux pumps in the latter, which prevent the accumulation of antibiotics.
In general, accumulating compounds are expected to be small (<600
Da), but currently there are no established rules that would enable
the design of compounds with a high likelihood off accumulation in
Gram-negative bacteria. Recently it has been suggested that a few
key parameters, like the number of rotatable bonds, the globularity
of the molecule, and the presence of ionizable nitrogen (particularly
primary amines) might be important for accumulation and passing through
the porins of the outer membrane.[59−61] It is unclear how general
these rules are, as they were derived from a relatively small set
of accumulating molecules; broad-spectrum β-lactam antibiotics
(but also polymyxins, macrolides) typically have many more rotatable
bonds than recommended, and neither fluoroquinolones nor broad-spectrum
β-lactams are particularly enriched in ionizable primary amines
(in fact, frequently lack them). However, some compounds active only
against Gram-positive bacteria could be successfully converted to
broad-spectrum antibiotics using these rules,[59] and recent measurements of accumulation in Gram-negative bacteria[62] also provide some support for them, which indicates
that they are nevertheless a step in the right direction.To
test whether optimization to bind multiple receptors might influence
the properties necessary for accumulation in Gram-negative bacteria,
we examined whether it influences the number of rotatable bonds and
globularity of the designed compounds (rotatable bonds were calculated
ignoring hydrogens; thus methyl or amino groups were not assigned
as rotatable; see Methods). We separated the
possible biases of the VAE from the effect of optimization by calculating
the number of rotatable bonds (and globularity) in all the compounds
returned by the VAE when de novo ligands were used as the input (expectation)
and compared it with the number of rotatable bonds (and globularity)
in the final optimized ligands. We find that in both sets 1 and 2,
optimization resulted in a highly significant increase in the number
of rotatable bonds in all VAE optimized batches (Figure ) and that there is a weak
but significant correlation between the change in rotatable bonds
and improvement in grid score (note that in the case of set 1 it is
significant [p = 7.88 × 10–4] only after removing the outliers with a rotatable bond change of
−2). The globularity of compounds (measured with the plane
of best fit [PBF] algorithm;[63] see Methods) does not show a similar consistent change
(Figure S12), although in some VAE batches
of set 2, we do observe a slightly increased globularity compared
to de novo ligands, mostly due to biases of the VAE (Figure S12C). The frequency of compounds with primary amines
is relatively high in the de novo ligands (23% and 13% in set 1 and
set 2, respectively), and somewhat lower in the optimized ligands
(not shown). However, this is most likely due to the low frequency
of primary amines in the compounds used to train the VAE (2.7%) rather
than the result of binding multiple targets.
Figure 10
Ligands optimized for
binding multiple receptors have more rotatable
bonds than the original de novo ligands. (A) Number of rotatable bonds
in de novo and VAE optimized ligands of set 1. Dark red represents
the original de novo (Z = 0) or final optimized ligands,
and blue indicates the expected number of rotatable bonds in the compounds
returned by the VAE, without selection for multitarget binding. This
was calculated as the averages of all ligands returned by the VAE
when the de novo ligands were used as the input. In all VAE batches
the number of rotatable bonds is significantly higher than in the
expectation or de novo ligands (∗∗, p < 0.005, Wilcoxon tests on the differences from de novo ligands).
(B) Relationship between the change in rotatable bonds and grid score
in set 1, using the pooled data of all VAE batches. The correlation
is not significant due to a few outliers with rotatable bond change
of −2; excluding them results in significant correlation (p = 7.88 × 10–4). (C) Number of rotatable
bonds in de novo and VAE optimized ligands of set 2. The color coding
is the same as on panel A. Similar to set 1, in all VAE batches the
number of rotatable bonds is significantly higher than in the expectation
or de novo ligands (∗∗, p < 0.005,
Wilcoxon tests on the differences from de novo ligands). (D) In set
2, the correlation between the change in rotatable bonds and grid
score is highly significant (p = 7.57 × 10–10), indicating that the more flexible is the molecule,
the better it can bind multiple receptors.
Fragments of the optimized
ligands, which are present in at least
two compounds in any of the VAE batches (excluding fragments that
are likely to be toxic). Several halogenated compounds and four-membered
rings are present among them. Since the four-membered rings frequently
contain reactive double bonds, in practice their saturated versions
should be used.Ligands optimized for
binding multiple receptors have more rotatable
bonds than the original de novo ligands. (A) Number of rotatable bonds
in de novo and VAE optimized ligands of set 1. Dark red represents
the original de novo (Z = 0) or final optimized ligands,
and blue indicates the expected number of rotatable bonds in the compounds
returned by the VAE, without selection for multitarget binding. This
was calculated as the averages of all ligands returned by the VAE
when the de novo ligands were used as the input. In all VAE batches
the number of rotatable bonds is significantly higher than in the
expectation or de novo ligands (∗∗, p < 0.005, Wilcoxon tests on the differences from de novo ligands).
(B) Relationship between the change in rotatable bonds and grid score
in set 1, using the pooled data of all VAE batches. The correlation
is not significant due to a few outliers with rotatable bond change
of −2; excluding them results in significant correlation (p = 7.88 × 10–4). (C) Number of rotatable
bonds in de novo and VAE optimized ligands of set 2. The color coding
is the same as on panel A. Similar to set 1, in all VAE batches the
number of rotatable bonds is significantly higher than in the expectation
or de novo ligands (∗∗, p < 0.005,
Wilcoxon tests on the differences from de novo ligands). (D) In set
2, the correlation between the change in rotatable bonds and grid
score is highly significant (p = 7.57 × 10–10), indicating that the more flexible is the molecule,
the better it can bind multiple receptors.Taken together these results indicate that binding multiple
receptors
selects for flexible compounds with more rotatable bonds. Thus, preventing
the evolution of resistance by multitargeting is likely to have the
side effect that compounds capable of binding several receptors efficiently
are at the same time less likely to accumulate in Gram-negative bacteria.
Since high flexibility is the result of the necessity to adapt to
binding sites with somewhat different topologies, one possible way
to overcome this trade-off is to target binding sites with high structural
similarity; thus MBS homomers may be preferable targets over other
SBS homomers or monomers also for this reason.
Discussion
Drug discovery is an extremely lengthy and costly process: on average,
developing a new drug takes ∼14 years and costs an estimated
1–2 billion USD,[64] more than the
cost of sending a spacecraft to an asteroid.[65] One of the earliest, and most critical steps in the drug discovery
process is the identification of a druggable target protein, for which
a ligand that modifies its function and has a therapeutic effect can
be designed. Selecting the right drug target is critical, as poor
target protein selection is one of the main causes of the low (∼10%)
success rate of drug candidates entering the clinical phase.[64] Our results indicate that considering the quaternary
structure of proteins can help in the selection of drug targets where
the goal is targeting several different pathogen species. Ligands
and de novo ligands of homomers, particularly MBS homomers, are much
more likely to bind the binding sites of their equally diverged homologs
than monomers (Figures , 3, 4). Since high
conservation is always the consequence of strong constraints on function,
the higher structural conservation of multichain binding sites in
homomers[21] also indicates that such sites
are less likely to accumulate mutations than the binding sites of
SBS homomers or monomers. Finally, their more conserved binding sites
reduce the need for flexibility and might improve the chances of accumulation
in Gram-negative bacteria.Successful antiretroviral drugs offer
insights into this hypothesis,
as the very high evolutionary rates of drug targets is a particularly
severe problem in the treatment of retroviral infections like HIV.
Interestingly, several antiretroviral drugs bind residues from more
than one protein chain in their targets. HIV protease is an MBS homomer,
and drugs targeting it bind both chains of its binding site.[66] Besides binding the active site, some HIV protease
inhibitors (darunavir and tipranavir) also inhibit the dimerization
of the protease, which has been suggested as a factor in the slow
evolution resistance for these compounds.[66] Most HIV reverse transcriptase (RT) inhibitors are nucleoside analogs,
although some of the non-nucleoside analog inhibitors like rilpivirine
and etravirine bind residues from both chains of the RT heterodimer.[67,68] Drugs targeting HIV integrase typically bind the catalytic site,
which interacts with DNA; however compounds not targeting the catalytic
site bind allosteric pockets with residues from multiple chains.[69] Interestingly, one HIV integrase inhibitor,
elvitegravir, is also characterized by a fluoroquinolone-like structure.[70]Taken together, our results indicate that
the high binding site
similarity of MBS homomers makes them promising targets for broad
spectrum antibacterial agents. Moreover, their slow structural change
during evolution[21] suggests that targeting
MBS homomers might also slow down the evolution of resistance, due
to the high functional constraints on such sites. Targeting multichain
binding sites might also result in the inhibition of protein complex
formation itself, which is likely to have a significant effect on
the evolution of resistance. Finally, our recent results show that
allostery is much more frequent among MBS complexes than among SBS
complexes, particularly in the case of homomers.[71] This indicates that MBS homomers also offer more targetable
pockets than SBS homomers for drug development, and more diverse mechanisms
can be exploited for developing novel inhibitors. Although our work
focused primarily on homomers, we also note that a “trivial”
way of achieving multitargeting is to target multichain binding sites
of heteromeric protein complexes that are formed by multiple distinct
polypeptide subunits. In humans, such sites are characterized by the
highest frequency of pathogenic mutations in all quaternary structure
types, indicating strong purifying selection.[21] However, MBS heteromers are generally not characterized by more
conserved binding sites[21] or much higher
frequency of allostery than monomers.[71]The analysis of the structural properties of de novo ligands
with
similar de novo structures in several receptors indicates that such
ligands do have common structural characteristics (Figure , Figure S5): they are enriched in halogen-containing, quinolone-like
(quinazolinone), oxadiazole, and morpholine-like groups. Such fragments
are characteristic in current broad-spectrum fluoroquinolone antibiotics
(also the HIV integrase inhibitor elvitegravir), quinazolinone and
oxadiazole antibacterials, and morpholine antifungals, suggesting
that morpholine-based inhibitors could also be developed for bacteria.
Additionally, the high frequency of quinazolinone fragments in our
data sets suggests that quinazolinones are promising compounds for
further development.Halogens, particularly fluorine, are routinely
used in medicinal
chemistry for optimization and to improve ligand–target interactions.[72,73] However, the fact that halogen groups primarily interact with the
carbonyl oxygens of the protein backbone and not with the side chains[72−74] makes them particularly suitable for use in “resistance-resistant”
drugs,[75] as point mutations affect primarily
the side chains of amino acids and change the protein backbone much
less. Ligand–backbone interactions are already used in HIV
reverse trascriptase inhibitors[75] and the
protease inhibitor darunavir[66] to slow
down the emergence of resistance (although in the latter it is not
a halogen that interacts with the backbone). Additionally, our results
suggest that the high overall frequency of halogens in drugs (∼40% [72]) might be partly the result of interacting with
several targets, even in the case of drugs that were originally designed
to target only a single protein.Most of the selected de novo
compounds could bind several receptors
without modifications, and they could be further improved by optimization
with the VAE (Figure , Figure S10). In the majority of the
cases the optimization resulted in compounds that are not dramatically
different from the original de novo ligands, with Tanimoto similarity
of ∼0.7–0.75. However, in a fraction of the cases, it
resulted in substantially different compounds (see Supplementary Data Sets). A characteristic structural difference
in the optimized ligands is the appearance of small, three- and four-membered
(sometimes β-lactam-like; see Figure F) rings, of which the latter are effectively
absent in the original de novo ligands.While the selected compounds
contain many novel topologies, most
of the interesting enriched fragments (quinolones/quinazolinones,
oxadiazoles, morpholines, four-member rings [β-lactams/monobactams],
organohalogens) are already used in the core pharmacophores of existing
antimicrobials. This indicates that our approach has the potential
to identify fragments relevant for antibiotic design but also that
the chemical space of broad-spectrum antibiotics is not unlimited
and, unfortunately, that the perception of the industry that many
of the “low hanging fruit” antibiotic classes might
have already been discovered[3,4] is not unfounded. From
the strategies that help to preserve our current antibiotics, combination
therapy is likely to be a successful strategy[17] as it effectively implements multitargeting, while recent work suggests
that antibiotic cycling and mixing may not be very effective in practice.[76,77] We did not customize the fragment library used by de novo DOCK toward
antibiotics, and also the VAE was based on a random selection of compounds[34] without any specific tailoring for antibiotic
design. Thus, the emergence of compounds enriched in antibiotic-like
fragments is not the result of such biases and indicates that the
combination of fragment library based tools like DOCK, and deep-learning
based tools like the VAE, that can explore the chemical space in ways
fragment library based tools cannot is a powerful combination and
can also be used to customize fragment libraries for a specific task.
Its main current limitation is that some of the resulting compounds
can be toxic, difficult to synthesize, or unstable (see Table S6 and Supplementary Data), and that occasionally the combination of certain fragments
can result in compounds with protonation states that are not relevant
at physiologically relevant pH.Another limitation of our analysis
is that the protein targets
that resulted in the ligand sets used for the enrichment analysis
and evolution using AI represent a relatively small set of molecular
functions (Figure , Figure S6, Tables S4 and S5). Currently only a small fraction of the essential
proteins of clinically relevant bacteria have a 3D structure deposited
in the PDB, and even the available structures frequently cover only
fragments or individual domains of them. Thus, our analysis was by
necessity limited by the available structures, and a large, possibly
global effort to determine the structures of the “essentialome”
in the clinically most problematic microbes (e.g., ESKAPE) could have
a major impact on designing new antibiotics with fundamentally novel
structures.
Methods
Selection of Bacterial
Proteins
We selected the bacterial
proteins for the analysis as follows. First, using the OGEE,[78] CEG,[79] and DEG[80] essential gene databases, we identified the
known essential prokaryotic genes in the PDB. Next, using BLAST with
an e-value cutoff of 10–3, we removed
those proteins from the data set that have a homolog in the human
genome, the ones that form heteromeric protein complexes, and ones
that have no structure with a small molecule ligand in the BioLiP
database.[25] Finally, using BLAST with an e-value of 10–5 cutoff, we identified
pairs of homologous bacterial proteins in the PDB where (1) the sequence
similarity is below 40%, (2) at least one member of the pair is essential,
(3) none of them have a homolog in the human proteome (e-value of 10–3), (4) their structures overlap in
the alignment, thus their structures also contain homologous regions,
(5) they have similar quaternary structure, and (6) the type of ligand
binding (MBS or SBS) is similar. The list of protein pairs and the
PDB codes used in the analysis is available in Supporting Information Table S1.
Docking, Preparation of
Receptors and Ligands, and de Novo Design
of Ligands
We used DOCK 6.8[28] and
its utility tools for binding site and ligand preparation, and docking.
The first biological assembly was used for all PDB entries. Since
some DOCK utility tools are unable to process large proteins and complexes,
before docking we identified the residues of the receptor within 10
Å of the ligand with ProBiS, discarded all other residues of
the structure, and built the grids using this substructure of the
receptor. This did not have any measurable effect on the performance
or accuracy of DOCK (the grids were built within 5 Å of the ligand)
and enabled us to process large complexes. Receptor proteins and ligands
were prepared with a standard procedure: in the receptor, incomplete
side chains were completed, hydrogens were added, residues with low
occupancy were removed, and Amber charges were added with the dock-prep
tool of Chimera.[81] Ligands were converted
to mol2 format, and hydrogens and Amber charges were added with Chimera.
In a small number of cases where Chimera was unable to process the
ligand, we added hydrogens and (Gasteiger) charges with OpenBabel.[82] To estimate the pose reproduction success rates,
we first minimized the ligand through rigid docking, next docked it
to its receptor with the FLX (flexible) algorithm.[28,83] The parameters were similar as in ref (83), the main parameters being max_orients = 1000,
pruning_max_orients = 1000, pruning_clustering_cutoff = 100 (see http://ringo.ams.sunysb.edu/downloads/SB2010/FLX.in for the original parameters). We used only those receptors in further
analyses where at least one of the first 10 clusterheads had RMSD
< 2 Å with the original position of the ligand. We used this
relaxed strategy because in the case of large ligands with several
rotatable bonds, it is common that the core of the ligand is in the
correct position, but certain side chains are not, pushing their RMSD
above 2 Å. Furthermore, ligands in the PDB are not necessarily
in a location that is the energetically most favorable, and in such
cases a ”successful” pose reproduction is actually an
error. Further docking was performed with the FLX protocol, both for
the original binding site of the ligand and for the homologous binding
site.De novo ligands were built with the de novo algorithm
of DOCK (a prerelease version of 6.9),[30] and we used the fragment library distributed with it, which was
built using 13 million compounds of the ZINC database.[32] Only those binding sites were used where pose
reproduction was successful, and binding sites of cofactors or metals
were not used. We followed the following procedure for ligand generation:
(1) First, we identified the number of rotatable bonds of the original
ligand of the receptor with DOCK. (2) Next, we decomposed the original
ligand of the receptor to fragments and identified the largest one.
(3) From the fragment library we randomly selected a fragment with
±1 heavy atoms as the largest fragment of the original ligand
for anchor. (For ligands where the largest fragment had 9 or more
heavy atoms, fragments with 9–12 heavy atoms were used). (4)
Next, the number of rotatable bonds of the ligand of the original
binding site was determined with DOCK, and we set the number of growth
layers as the number of rotatable bonds/2. Only binding sites of ligands
with two or more rotatable bonds were processed. We used the graph
sampling method, and the maximum allowed molecular weight of the de
novo ligands was set to 550. (5) We ran 30 independent replicates
for each receptor (i.e., we used 30 different anchors), and from the
output of the 30 independent runs we selected a total of 50 de novo
ligands with the best grid score. Overall this procedure resulted
in de novo ligands of comparable size and complexity as the original
ligand.
Ligand Characterization
The characteristics of the
50 best de novo ligands of each receptor were calculated with the
chemical variational autoencoder (VAE) developed by the Aspuru-Guzik
lab,[34] using the autoencoder distributed
with the tool itself (“zinc_properties”). We used the
VAE to calculate log P, synthetic accessibility
score (SAS),[35] and quantitative druglikeness
(QED)[36] for each de novo ligand, and also
a vector corresponding to their representation in the latent space
(with 196 dimensions). Molecular weight was calculated with the obprop
tool of the Open Babel suite. Clustering of the latent space representation
of de novo ligands (for visualization) was performed with Barnes–Hut
t-SNE,[37] with two dimensions and perplexity
50.
Optimization and Evolution of de Novo Ligands with Docking and
AI
The selected de novo ligands were further optimized and
evolved with a strategy that used DOCK 6.8 and the VAE (Figure ). First, from the list of
binding sites that had a de novo ligand with a distance less than
13, we selected the binding site from the three different species
where the selected de novo ligand could be docked with the best grid
score. Next, each selected de novo ligand was converted to a SMILES
string with RDKit, and we used the VAE to sample the latent space
for structurally related chemicals, using Z cutoffs
of 0.5, 1, 3.125, 6.25, 12.5, 25 and 50, taking 30 000 samples.
This usually returned 10–100 unique chemical structures. If
the number of returned chemicals was higher than 50, we randomly selected
50 of them. The structures returned by the VAE were docked to the
three receptors, and the one with the best average score (if its score
was an improvement in at least two of the receptors) was chosen and
used in the next round of sampling/docking. This cycle was repeated
as long as there was an improvement in the grid score of the new ligands.
Monte Carlo Simulations (Randomization Test)
We performed
Monte Carlo simulations to test whether small rings are enriched in
the optimized molecules due to the constraints of binding in multiple
receptors. First, in the SMILES returned in the first step of the
optimization (see Figure S8 and FigureS8_FullResult.txt for an example) we determined
for every atom whether it is part of a three-membered and four-membered
ring using RDKit, and the number of such rings. Next we took 10 000
samples using all de novo ligands by selecting one SMILES randomly
from the returned molecules of every de novo ligand and calculated
the frequency of three- and four-membered ring in every random sample.
Finally, we estimated whether the observed frequency of three- and
four-membered rings is significantly higher than the expectation using
the formula p = (n + 1)/(N + 1), where N is the total number of
samples (10 000) and n is the number of samples
with equal or higher frequency of small rings than their observed
frequency. The expected frequency of small rings was determined as
the median of random samples.
Globularity and Rotatable
Bond Estimation
Globularity
was measured with the plane of best fit (PBF) method.[63] SMILES of de novo and VAE ligands were first converted
to 3D structures with Open Babel using the --gen3d flag, which uses
the MMFF94 force field. PBF scores were calculated on hydrogen-free
structures,[63] with the pbf function of
RDKit. A fraction of compounds (∼3% in the optimized ligands,
5–10% in the raw ligands returned by the VAE) were excluded,
due to the inability of RDKit to process their 3D structure. The number
of rotatable bonds in each ligand was calculated with the obprop tool
of the Open Babel suite, ignoring hydrogens; thus methyl or amino
groups are not counted as rotatable bonds.
Visualization and Statistics
All statistical tests
were performed with in-house Perl scripts and R and were corrected
for multiple testing with the Benjamini–Hochberg method.[84] Protein structures were visualized with Chimera
(version 1.11.2), chemical structures with Open Babel 2.3.1.
Authors: Lee D Fader; Eric Malenfant; Mathieu Parisien; Rebekah Carson; François Bilodeau; Serge Landry; Marc Pesant; Christian Brochu; Sébastien Morin; Catherine Chabot; Ted Halmos; Yves Bousquet; Murray D Bailey; Stephen H Kawai; René Coulombe; Steven LaPlante; Araz Jakalian; Punit K Bhardwaj; Dominik Wernic; Patricia Schroeder; Ma'an Amad; Paul Edwards; Michel Garneau; Jianmin Duan; Michael Cordingley; Richard Bethell; Stephen W Mason; Michael Bös; Pierre Bonneau; Marc-André Poupart; Anne-Marie Faucher; Bruno Simoneau; Craig Fenwick; Christiane Yoakim; Youla Tsantrizos Journal: ACS Med Chem Lett Date: 2014-01-22 Impact factor: 4.345
Authors: Andrew Spaulding; Khuloud Takrouri; Pornachandran Mahalingam; Dillon C Cleary; Harold D Cooper; Paola Zucchi; Westley Tear; Bilyana Koleva; Penny J Beuning; Elizabeth B Hirsch; James B Aggen Journal: Bioorg Med Chem Lett Date: 2017-10-16 Impact factor: 2.823