Iuri Casciuc1,2, Artem Osypenko2, Bohdan Kozibroda2,3, Dragos Horvath1, Gilles Marcou1, Fanny Bonachera1, Alexandre Varnek1, Jean-Marie Lehn2. 1. Laboratoire de Chémoinformatique UMR 7140 CNRS, Institut Le Bel 4, rue B. Pascal 67081 Strasbourg, France. 2. Laboratoire de Chimie Supramoléculaire, Institut de Science et d'Ingénierie Supramoléculaires (ISIS), Université de Strasbourg, 8 allée Gaspard Monge, 67000 Strasbourg, France. 3. Institute of High Technologies, Taras Shevchenko National University of Kyiv, 4g Hlushkova Avenue, 03022 Kyiv, Ukraine.
Abstract
Dynamic combinatorial libraries (DCLs) display adaptive behavior, enabled by the reversible generation of their molecular constituents from building blocks, in response to external effectors, e.g., protein receptors. So far, chemoinformatics has not been used for the design of DCLs, which poses a radically different set of challenges compared to classical library design. Here, we propose a chemoinformatic model for theoretically assessing the composition of DCLs in the presence and the absence of an effector. An imine-based DCL in interaction with the effector human carbonic anhydrase II (CA II) served as a case study. Support vector regression models for the imine formation constants and imine-CA II binding were derived from, respectively, a set of 276 imines synthesized and experimentally studied in this work and 4350 inhibitors of CA II from ChEMBL. These models predict constants for all DCL constituents, to feed software assessing equilibrium concentrations. They are publicly available on the dedicated website. The models were used to rationally select two amines and two aldehydes predicted to yield stable imines with high affinity for CA II and provided a virtual illustration of how effector affinity regulates DCL members.
Dynamic
combinatorial chemistry (DCC) implements the generation
of sets of dynamic molecular (or supramolecular) entities by the recombination
of building blocks linked by covalent (or non-covalent) bonds formed
in a variety of reversible chemical reactions.[1−8] The central feature of such dynamic combinatorial libraries (DCLs)
is their operation under thermodynamic control, in comparison with
“classical” combinatorial libraries, which may be considered
as static in view of the high kinetic stability of the covalent bonds
that build up their members. The members of the DCL (called constituents) are in equilibrium with one another through
constant exchange of building blocks (called components) via reversible covalent (or noncovalent) reactions. As a consequence,
such a DCL can adapt to the action of physical stimuli
or chemical entities (called effectors), resulting
in amplification of the fittest constituent(s),[7,9−16] for that specific physical or chemical agent, through selection and exchange of components. The agent can be of
variable nature—a physical stimulus, like a change of temperature,[17] or a chemical effector like a metal ion,[18] a protein/enzyme,[19,20] or properties
of the medium (solvent, pH, viscosity).[21,22]

Along
with their numerous applications,[1−8] DCLs are of particular interest for drug discovery,[23−30] where they have been used to identify binders/inhibitors to proteins/enzymes,[23−25] nucleic acids,[26−29] and even living cells.[30] Addition of
a biological target (e.g., an enzyme) to a DCL of potential inhibitors
has been shown to drive the selection of the most
potent binder/inhibitor in the DCL, causing its amplification with respect to the distribution in the absence of the target protein.[19,20,23−25] Hence, the
DCLs may be implemented for lead generation in drug discovery. Enabling
the protein to actively enhance the formation of its preferred ligand(s),
from the pool of virtual binders, in a sort of “The Lock generates its Key” process, provides an
approach that can be advantageous over the high-throughput screening
(HTS)[31] of individual compounds of classical
“static” combinatorial libraries[23−30] obtained by mixing sets of reagents of the same category, typically n nucleophilic species N1, N2, ..., Nn and m electrophilic reagents
E1, E2, ..., Em. The
key benefit expected from DCLs is maintenance of the simple “mixture”
strategy but for a set of equilibrating constituents while improving
the chances that strong affinity products will emerge—because
they are dynamically selected and amplified. The final DCL consists
of the equilibrium population of the constituents representing all
of the possible combinations generated by the reversible connection
of the components. The addition of an effector will modify the distribution
of constituents depending on their affinity for the target entity,
amounting to an adaptation of the DCL to the effector.[7,32−34] Note that a procedure of dynamic
deconvolution may be applied to complex DCLs.[23,35,36]

Chemoinformatics is a key player in
HTS library design. On one
hand, it may help to “focus” on compounds most likely
to bind the screened target (thereby eliminating testing of species
predicted to be inactive).[37] On the other
hand, it is also widely used to design generic “diverse”
libraries,[38] to be used in HTS against
targets with no ligand structure–affinity information on which
to base a focusing strategy. Diversifying a library means maximizing
chemical space coverage and ensuring that included compounds are not
redundant. So far, however, chemoinformatics has, to our knowledge,
not yet been invoked for the design of DCLs, given that such design
comprises a radically different set of challenges. Note that unsupervised
machine learning (including PCA, LDA, and cluster analysis) has been
used for statistical DCL data analysis.[39−41]

First, as in classical
drug design, chemoinformatics may help to
select appropriate building blocks that are highly likely to lead
to products that fulfill the (steric, electronic, pharmacophoric)
constraints required for activity. In this context, there is no need
to precisely identify which of the possible products will be most
active because the DCL strategy per se provides a
powerful search mechanism for the latter. This is fortunate because
typically chemoinformatics approaches are not accurate enough (except
for, perhaps, costly free energy perturbation simulations)[42] to explain subtle activity differences between
strongly related members of a combinatorial library. They are, however,
well suited to quickly discard building block combinations that are
very unlikely to lead to active products. In principle,
a DCL should not be prebiased but should be based on sets of components of the highest molecular
diversity, without any preconceived ideas. However, the DCL
may be simplified by not including building blocks that, with
high probability, are either not expected to engender actives or predicted
to be highly active when combined with at least some of the other
partners. (Bio)activity prediction models, whether based on machine-learned
quantitative structure–activity relationships (QSAR) or on
ligand-site interaction models (pharmacophores, docking), are
thus important for DCL design, as they are for classical library design.

Second, for the application of chemoinformatics to DCLs, the partner
building blocks should be selected so as to present comparable reactivity
and to form products of comparable thermodynamic stability. The effector
may displace the equilibrium concentration in favor of its preferred
binders unless other concurrent reactions lead to some extremely stable
adducts. Thermodynamic stability of DCL constituents and target/library
constituents association are properties that can
be machine learned on the basis of experimentally studied
cases.[43] From such data, quantitative structure–property
relationships (QSPR) models can be used to predict, on one hand, product
stability as a function of its structure, and, on the other hand,
the affinity of DCL constituents for the effector. The stability problem
is, however, complicated by solvent effects, as the solvent used
for DCL experiments (usually water for biological targets) may differ
from that for which the QSPR model has been calibrated. Equilibrium constants
measured in one solvent can be extrapolated to another (chloroform to water, for example)
on the basis of partition coefficients (log P) between the two immiscible solvents. Once all
the equilibrium constants are presumed known, the equilibrium concentrations—and
their effector-induced shifts—can be calculated by a speciation
algorithm[44,45] so that the DCL behavior can be simulated
in silico.

Chemical diversity considerations are particularly
important. Selected
building blocks should be chemically as diverse as permitted by the
above constraints (matching activity requirements and ensuring a balanced
distribution of relative product stability). If there are building
blocks based on distinct chemotypes predicted to be compatible with
the constraints above, then they should be selected—instead
of limiting the DCL to a redundant collection of building block homologues.

This work tentatively explores all three key points above,
in order to (i) provide a concrete and technically detailed illustration
for an in silico DCL design strategy, (ii) prepare required data—both
from public databases and in-house experimental measures—and
(iii) finally build the models in view of a future DCL design campaign,
followed by an experimental assessment. Building on seminal work in
this area,[19] human carbonic anhydrase II
(CA II)[46] was chosen as an effector to
model the adaptive behavior of the imine-based DCL.
Rationale and Workflow of Speciation Modeling of DCLs
The three steps of the modeling workflow are
shown in Figure 1 and
in Figure S1 (see Supporting Information). They comprised the following operations.
Figure 1
Main steps and outputs of speciation modeling workflow.
Part 1. Experimental and Theoretical Assessment of Equilibrium Constants for Imine Formation
(i) Preparation of the training data set: (a) preselection of the aldehydes and amines based on their “popularity”, estimated by the number of references in a scientific database (primary data set: 400 aldehydes and 300 amines); (b) selection of small diverse (nonredundant) pools of amines and aldehydes.
(ii) Experimental determination of the formation constants of 276 imines in deuterated chloroform (CDCl3) from the selected training data set.
(iii) Building of a predictive machine-learning model for the logarithm of the imine formation constant (log KC) in chloroform as a function of the structure.
(iv) Since the DCL–effector protein interaction occurs in water, imine stability in water needs to be assessed from the predicted stability in CDCl3. This was achieved with the help of the predictive model for the chloroform–water partition coefficient (log PC/w) prepared in this work.
Part 2. Preparation of the Model for the Affinity of Organic Molecules to the Effector
The model for the negative logarithm of
the dissociation constant (pKi) of organic
molecules from human CA II was prepared using experimental data extracted
from the ChEMBL database.[47−49]
Part 3. Speciation Modeling of DCL in the Presence of the Enzyme As Effector
The dynamic behavior of a simple DCL (2 amines
× 2 aldehydes) was simulated by applying speciation software
to compute equilibrium concentrations of free and protein-bound imines,
given their estimated stability in water and affinities to the protein
effector.

Pending experimental validation, this work focused
on the key steps of the envisaged strategy, with evaluation of the
strengths and potential pitfalls of the models.
Results and Discussion
Part 1. Experimental and Theoretical Assessment of Equilibrium Constant of Imine Formation
In the field of dynamic combinatorial
chemistry, imines[50−55] (Scheme 1), formed
by reversible amine-aldehyde condensation, represent a class of compounds
of particular significance for several reasons:
Scheme 1
Generalized Reaction Scheme of Imine Formation from an Aldehyde and an Amine, and the Corresponding Expression for Its Equilibrium Constant (Equation 1)
(i) they display high diversity in terms of structures and physicochemical properties;
(ii) their building blocks, aldehydes and amines, are readily available, either synthetically or commercially;
(iii) in most cases, they offer a convenient range of exchange kinetics at room temperature in various media, including neat conditions, organic solvents (such as chloroform, toluene, or DMSO), and water.

Although imine formation and component exchange in aqueous medium are of special interest, the presence of several possible intermediates and various side reactions, such as amine protonation and the formation of aldehyde hydrates and hemiaminals, poses serious challenges for experimental investigation, as we have shown elsewhere.[55]

Thus, it was decided to use deuterated chloroform as the medium for the reaction. Chloroform is the most widely used NMR solvent, and imine formation in chloroform usually leads to negligible amounts of side products (such as aldehyde hydrates, hemiaminals, and aminals).
Selection of the Experimental Data Set
First, amine
and aldehyde building blocks were taken as the top 400 most cited
aromatic aldehydes and top 300 primary amines according to SciFinder,
using the following protocol: (a) the compounds were sorted by the
frequency of their use; (b) only molecules with a molecular weight
≤ 400 Da were selected; (c) compounds with only one aldehyde/primary
amine group were chosen; multifunctional compounds would produce much
more complex dynamic sets and represent a further step of investigation;
(d) preselected sets were manually checked, for compatibility reasons
(duplicates, functional group incompatibility, aggregate state incompatibility,
solubility, availability, etc.). Note that aliphatic aldehydes were
excluded from the study because experimental tests revealed various
side reactions, making the analysis challenging.[51,54]

This procedure resulted in a set of 120 000 possible imines
(400 × 300) serving as a reference pool out of which a small
combinatorial sublibrary of 360 imines was selected for experimental
assessment of their thermodynamic stability. This core was defined
as the combination of maximum diversity reagents: the MaxMin[56] algorithm was applied separately to the amine
and aldehyde sets in ISIDA fragment descriptor space (see Table S1 in Supporting Information for details),
picking subsets of 24 aromatic aldehydes and 15 primary amines, respectively
(Figure 2).
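The MaxMin selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the reagents are represented here by hypothetical sets of fragment labels instead of ISIDA fragment counts, the Jaccard (Tanimoto) distance is an assumed dissimilarity measure, and the seed is simply the first item.

```python
def jaccard_distance(a, b):
    """Dissimilarity between two fragment sets (assumed metric, 0 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def maxmin_pick(items, n_pick, seed_index=0):
    """Greedy MaxMin: repeatedly add the item farthest from the picked set.

    items: list of fragment sets (toy stand-ins for descriptor vectors).
    Returns the indices of the selected items, in order of selection.
    """
    picked = [seed_index]
    # Distance of every item to its nearest already-picked item.
    min_dist = [jaccard_distance(items[seed_index], x) for x in items]
    while len(picked) < min(n_pick, len(items)):
        candidates = (i for i in range(len(items)) if i not in picked)
        best = max(candidates, key=lambda i: min_dist[i])
        picked.append(best)
        for i, x in enumerate(items):
            min_dist[i] = min(min_dist[i], jaccard_distance(items[best], x))
    return picked


# Toy example with hypothetical fragment labels.
toy_reagents = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"a", "x"}]
selection = maxmin_pick(toy_reagents, 3)
```

In this toy case the near-duplicate of the seed (index 1) is the last item that would be chosen, which is the behavior expected of a diversity picker.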
Figure 2
Chemical structures
of selected aldehydes and amines for experimental
determination of imine formation reaction constants.
These reagent subsets should in principle span as broad as
possible reactivity ranges, in order to yield an informative pool of imines of significantly different stability, from which machine learning would easily identify structural features that enhance or decrease stability. Intuitively, it is therefore legitimate to ask whether this diversity selection should not rather have been conducted in a quantum-chemical descriptor space, as the latter is perceived as most directly related to reactivity issues. However, there are several strong arguments in favor of the strategy adopted here:
(i) ISIDA fragment descriptors are excellent descriptors of reactivity, as will be shown further on when discussing their propensity to fit experimentally measured equilibrium constants. This simply means that key quantum-chemical descriptors (HOMO or LUMO energies, for example) are effectively covariant with the presence of specific (electron-withdrawing or -donating) fragments captured by the ISIDA fingerprint.
(ii) Quantum-chemical descriptors alone fail to account for steric hindrance, which is better rendered by fragment counts, albeit in an implicit way. Also, they are geometry-dependent: HOMO/LUMO energy differences in response to a conformational change may actually be larger than differences between analogous molecules in comparable geometries.
(iii) In view of the above, it is no surprise that machine-learning models of imine stability based on 30 quantum-chemical descriptors obtained from DFT calculations (see their list in Section 3.3 in Supporting Information) are not better than their fragment descriptor counterparts, which are much faster and much easier to use (see Table S2 in Supporting Information). Moreover, ISIDA-descriptor-driven diversity is perfectly suited to select amines and aldehydes that are also “diverse” in quantum-chemical terms, as shown by the example of the HOMO/LUMO energy distributions in Table S3 in Supporting Information.
(iv) Finally, yet importantly, time-consuming DFT calculations can hardly be recommended for calculating descriptors for large combinatorial libraries of 120 000 virtual imines. On the other hand, the generation of ISIDA descriptors is very fast, which makes them particularly attractive when working with big chemical data.

Out of the above-mentioned 360
pairwise combinations, 276 imines
were synthesized, and the equilibrium constants for their formation
were measured using 1H NMR spectroscopy.

From the structural point of view, the set of aldehydes is quite diverse (Figure 2 (top)): half of the molecules are heterocyclic aldehydes, e.g., containing furan (A2 and A6), thiophene (A5, A16), and thiazole (A22 and A23) cores. The set incorporates aldehydes presenting either electron-donating groups (e.g., A4, A8, A9, A13, A19) or electron-withdrawing groups (e.g., A3, A6, A7, A11, A15).

The set of amines, on the other hand, predominantly consists of various aliphatic amines (11 out of 15 molecules), three anilines (B1, B7, and B11), and one heterocyclic amine (B14); see Figure 2 (bottom).
Measurement of Stability Constants (log KC)
Stock solutions
of all the aldehydes and amines
were prepared in deuterated chloroform. Prior to use, CDCl3 was filtered through basic alumina to remove the possible traces
of acid; then, it was saturated with water to ensure a constant water
content of 73.8 mM,[57] and hexamethyldisiloxane
(HMDSO) was added as an internal standard. Imines were prepared directly
in NMR tubes by mixing the stock solutions of aldehydes and amines
to reach a concentration of 20 mM. To speed up the reaction, 2 mol
% of trifluoroacetic acid (TFA) was added to each tube, and the reactions
were equilibrated for 24 h at room temperature. Note that the equilibration time for several checked samples was well below 1 h.

Thus, from a virtual pool of 120 000 imines, 276 were synthesized.
The reaction constant for each was calculated from direct measurement
of the concentrations of the imine and of the residual aldehyde and
amine by integrating their corresponding NMR signals relative to an
internal standard (see the “Experimental measurements”
section in Supporting Information). In
most cases, the integrals could be measured so as to provide stability
constant (KC) values with a reasonable
precision of 0.15 log K units (Figure 3, green bars), but where the reaction was limited
or strongly favored, or where signal overlap occurred, errors were
large and the KC values can only be described
as “estimated” (Figure 3, orange bars).
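As an illustration of how a formation constant follows from the integrated NMR concentrations, the sketch below assumes that the thermodynamic constant of Scheme 1 has the form KC = [imine][H2O] / ([aldehyde][amine]), with the water concentration fixed at the saturation value of 73.8 mM quoted above. The exact expression of Equation 1 is not reproduced in this text, so this form, the function name, and the example concentrations are assumptions.

```python
import math

# Water content of water-saturated CDCl3 quoted in the text (mol/L).
WATER_IN_CDCL3_M = 0.0738


def log_kc(imine, aldehyde, amine, water=WATER_IN_CDCL3_M):
    """log10 of the assumed constant [imine][H2O] / ([aldehyde][amine]).

    All concentrations are equilibrium values in mol/L, e.g. obtained from
    NMR integrals relative to the internal standard.
    """
    if min(imine, aldehyde, amine, water) <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(imine * water / (aldehyde * amine))


# Hypothetical sample: 20 mM of each reagent, 75% converted to imine.
example = log_kc(imine=0.015, aldehyde=0.005, amine=0.005)
```

When one of the signals is close to the detection limit, the ratio becomes very sensitive to integration errors, which is in line with the "estimated" label used for such cases.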
Figure 3
log KC values
distribution. Data are
annotated as “exact” and “estimated”,
respectively. “Estimated” labels were assigned in cases
featuring (i) too low concentrations of reactants/products or (ii)
overlapping signals, leading to difficulties in quantitative identification
of compounds.
As expected, the imines having
high log KC, A6B6 (5.40), A7B6 (5.48), A21B9 (5.54), etc., are formed by
highly nucleophilic amines
and highly electrophilic aldehydes. Note that most of the imines with
log KC > 5 contain the cyclopropylamine
fragment (B6). The imines with very low log KC are formed from electron-poor amines and electron-rich
aldehydes. For instance, electron-deficient amines such as B7 and B14 are poorly reactive in reactions with most
aldehydes (log KC < −3). Some
amines (e.g., aniline B11) lead to imines with a broad
range of stability from −3.25 (A19B11) to 1.91
(A15B11). In this case, steric effects apparently play
a significant role in modulating the stability of constituents.
Predictive Model of Imine Stability in Chloroform
The
data obtained were used to calibrate and validate the model. Seven
support vector regression (SVR)[58] individual
models, each built on a particular type of ISIDA descriptor[59,60] (see Supporting Information), contributed
to consensus calculations. Their predictive performance was assessed
in five-fold cross-validation. Finally, experimental versus predicted
(cross-validated) log KC values were compared
(Figure 4a). For most
molecules, the predicted log KC values
were close to those determined experimentally (root-mean-squared error
(RMSE) is 0.62 log K units, see Figure 4a), whereas the most erroneous predictions
were found for the compounds labeled as “estimated”.
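The consensus and validation statistics mentioned above can be sketched as follows. The SVR models themselves are not reproduced; the sketch assumes that each individual model has already produced cross-validated predictions, that the consensus is their plain average, and that Q2 is the cross-validated determination coefficient. The numbers are toy values.

```python
import math


def consensus(per_model_predictions):
    """Average the predictions of several models, compound by compound."""
    return [sum(col) / len(col) for col in zip(*per_model_predictions)]


def rmse(y_true, y_pred):
    """Root-mean-squared error between experimental and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def q2(y_true, y_pred):
    """Determination coefficient, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: experimental log K values and two hypothetical models.
y_exp = [1.0, 2.0, 3.0, 4.0]
model_a = [2.5, 2.0, 3.5, 3.0]
model_b = [1.5, 2.0, 2.5, 3.0]
y_cons = consensus([model_a, model_b])
```

Applied to this toy data, the consensus prediction is [2.0, 2.0, 3.0, 3.0], which gives Q2 = 0.6.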
Figure 4
(a) Experimental
vs predicted (cross-validated) log KC values
plot of the consensus SVR model with Q2 = 0.93 and RMSE = 0.62 log K units (see details
in Supporting Information). The dotted
line corresponds to ideal predictions. (b) Distribution
of predicted values of log KC of imine
formation in chloroform. (c) (Top) Examples of “inert”
aldehydes (left) and “inert” amines (right). Their interactions
with any other aldehyde and amine, respectively, lead in approximately
60% of cases to negative predicted log K. (Bottom)
Examples of “reactive” aldehydes (left) and “reactive”
amines (right). Their interactions with other aldehydes and amines,
respectively, lead in more than 60% of the cases to log KC > 1.
Aside from its predictive
utility, another important criterion
characterizing the obtained model is chemical space coverage, identified
as the applicability domain (AD) of the model. The role of the AD
is to define the boundaries in the chemical space within which a model
can be used and provide reliable and accurate predictions. According
to Vapnik,[61] statistical models are directly
applicable to any test instance drawn from the statistical distribution
describing a training set; i.e., loosely speaking, the training and
test molecules should not be too different. Here, we used the fragment control approach[59] to
identify the AD. If a test compound contains an ISIDA fragment absent
in the training set structures, it is considered to be out of the
AD and, therefore, should be discarded. In this context, the SVR consensus
model trained on log KC of 276 imines
should provide reliable predictions for almost 50% of the considered
imines (59 935 out of 120 000). For approximately half of the imines
within the AD (Figure 4b), the log KC has been predicted as
≤0; for around 30 000 imines, the predicted log KC values were in the range between 0 and 3; and for 768
imines, the predicted values of log KC were
>3 (Figure 4b).
Thus,
the latter group can be considered as suitable candidates for a DCL.

For these 59 935 imines, the chemotypes of their source reactants
were analyzed. Some aldehydes and amines have been identified as “inert”:
with >60% of the coupling partners, their products have negative
log KC values. By contrast, “reactive”
compounds have been identified as those with log KC values >1 in approximately 60% of the reactions involving
them. As expected, inert/reactive amines have, respectively, electron-acceptor/electron-donor
substituents, which reduce/increase the reagent’s basicity
(Figure 4c). Conversely,
inert/reactive aldehydes carry, respectively, electron-donor/electron-acceptor
substituents.
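The fragment control rule and the inert/reactive labeling described above can be sketched in a few lines. The fragment labels are hypothetical placeholders for ISIDA fragments, and the 60% threshold is applied here as a strict "more than 60%" criterion, which is one reading of the text.

```python
def in_applicability_domain(compound_fragments, training_fragments):
    """Fragment control: in the AD only if no fragment is new to the training set."""
    return set(compound_fragments) <= set(training_fragments)


def classify_reagent(partner_log_k, fraction=0.6):
    """Label a reagent from the predicted log K of its imines with all partners.

    'inert'    : more than `fraction` of the products have log K < 0
    'reactive' : more than `fraction` of the products have log K > 1
    """
    n = len(partner_log_k)
    share_negative = sum(1 for v in partner_log_k if v < 0) / n
    share_high = sum(1 for v in partner_log_k if v > 1) / n
    if share_negative > fraction:
        return "inert"
    if share_high > fraction:
        return "reactive"
    return "intermediate"


# Hypothetical fragment vocabulary seen in the training imines.
training_vocab = {"C=N", "c:c", "C-C", "C-O"}
ok = in_applicability_domain({"C=N", "c:c"}, training_vocab)
outside = in_applicability_domain({"C=N", "C-F"}, training_vocab)
```

A compound carrying even one unseen fragment (here the hypothetical "C-F") is discarded, which explains why only about half of the virtual imines remain within the AD.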
Estimation of Imine Formation Constants in Water
Predicting
the speciation of dynamic imine networks in the presence of biological
molecules as effectors requires the prediction of the equilibrium
constant of imine formation in water (log KW) instead of that in chloroform (log KC),
considered so far. The conversion of the constant in chloroform to
that in water can be related to differences in solvation of the involved
species, which are nothing but expressions of water–chloroform
partition coefficients (eq 2). The detailed
derivation of eq 2 is
given in Supporting Information, eqs S1–S3.

However, the required log PC/w values have not been experimentally assessed for
all the
DCL reagents and even less so for the large pool of possible products.
Therefore, a computational predictive log PC/w model was successfully developed (see Supporting Information) on the basis of a training
set containing 50 compounds from the ChEMBL database[48,49] with experimentally measured chloroform/water partition coefficients
log PC/w. However, because
of the relatively small size of the training set (50 fragment-like
molecules), the applicability domain of the model is very restricted.
Thus, reliable predictions have been obtained only for 64 imines constituted
from 14 amines and 22 aldehydes, with structures given in the DCL_data.zip
file in Supporting Information. Application
of eq 2 to the set of
these 64 imines shows that formation constants in water are always
larger than those in chloroform, i.e., log KW > log KC. Nevertheless,
the corresponding imine concentrations will be lower in water, because the large excess of water
shifts the equilibrium toward the reagents (amine and aldehyde).
As water concentrations are constant both in aqueous (55.56 M) and
chloroform environments (saturation concentration of 73.8 mM) and
hence do not need to be monitored in the subsequent speciation calculations,
it makes sense to introduce “effective” stability constants
instead of the thermodynamic values employed so far:

$$K_{\mathrm{C}}^{\mathrm{eff}} = \frac{K_{\mathrm{C}}}{[\mathrm{H_2O}]_{\mathrm{C}}}, \qquad K_{\mathrm{W}}^{\mathrm{eff}} = \frac{K_{\mathrm{W}}}{[\mathrm{H_2O}]_{\mathrm{W}}}$$

where [H2O] stands
for the above-mentioned water concentrations in the respective phases.

For 64 imines within a chloroform/water partition coefficient AD,
a simple relation between the effective constants of imine formation in water and in chloroform has been observed (Figure S4): log KWeff depends linearly on log KCeff, with unit slope and a constant offset.

Thus, effective stability constants reflect the intuitive expectation of a net decrease of effective stability paralleling the net decrease of imine concentrations in water. This relation is also a useful shortcut for the speciation simulations. A linear dependence of unit slope with a simple constant offset is also expected as long as the species involved display “ideal” solvation behavior in both chloroform and water, so that their respective log P values may be considered as additive in terms of functional group contributions. If so, the only net difference is expected to stem from the replacement of the oxygen of the aldehyde carbonyl by the nitrogen of the amine, which loses most of its basicity when converted to the =N– of the imine. Contributions of conserved functional groups on the aldehyde or the amine to the chemical potential of solvation will be roughly the same, irrespective of whether they are carried by the reagents or the product, and hence cancel out according to eq 2, which explains the constant offset practically observed in Figure S4. Of course, this simple assumption is no longer valid if functional groups mutually interact in the product and/or reagents. A state-of-the-art chemoinformatics model of log P might indeed be trained to capture such effects but, unfortunately, not in this case, given the sparseness of measured chloroform/water partition coefficient values log PC/w. Thus, we assume in the following that this linear relation offers so far the best available estimation of imine stability in water and can be applied to all 59 935 virtual imines found within AD for the log KC model.
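The reasoning of this section can be made explicit with a short derivation. It is a sketch under stated assumptions, not a reproduction of the published equations: the thermodynamic constant is assumed to include the water concentration, partition coefficients are assumed to be defined as chloroform over water, $P_X = [X]_{\mathrm{C}}/[X]_{\mathrm{W}}$, and water is assumed to partition between the two phases with $P_{\mathrm{H_2O}} = [\mathrm{H_2O}]_{\mathrm{C}}/[\mathrm{H_2O}]_{\mathrm{W}}$.

```latex
\begin{align*}
K_S &= \frac{[\mathrm{imine}]_S\,[\mathrm{H_2O}]_S}
            {[\mathrm{aldehyde}]_S\,[\mathrm{amine}]_S},
      \qquad S \in \{\mathrm{C}, \mathrm{W}\} \\[4pt]
% substitute [X]_C = P_X [X]_W for every species
K_{\mathrm{C}} &= K_{\mathrm{W}}\,
      \frac{P_{\mathrm{imine}}\,P_{\mathrm{H_2O}}}
           {P_{\mathrm{aldehyde}}\,P_{\mathrm{amine}}} \\[4pt]
\log K_{\mathrm{W}} &= \log K_{\mathrm{C}}
      + \log P_{\mathrm{aldehyde}} + \log P_{\mathrm{amine}}
      - \log P_{\mathrm{imine}} - \log P_{\mathrm{H_2O}} \\[4pt]
% effective constants absorb the constant water concentration
\log K_S^{\mathrm{eff}} &= \log K_S - \log [\mathrm{H_2O}]_S,
      \qquad \log 0.0738 \approx -1.13, \quad \log 55.56 \approx 1.74 \\[4pt]
\log K_{\mathrm{W}}^{\mathrm{eff}} - \log K_{\mathrm{C}}^{\mathrm{eff}}
    &= \log P_{\mathrm{aldehyde}} + \log P_{\mathrm{amine}}
      - \log P_{\mathrm{imine}}
\end{align*}
```

Under these assumptions the water terms cancel in the last line, so the difference between the effective constants depends only on the partition coefficients of the organic species. If log P is additive over functional groups, this difference reduces to a constant, which is consistent with the unit slope and constant offset discussed above.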
Part 2. Modeling of Binding Affinity to Human CA II
The
ChEMBL database was used as a source for experimental ligand
binding affinity data (cited as the negative logarithm of the dissociation
or “instability” constant, pKi). The training data for the modeling contained 4350 unique inhibitors
of human CA II with experimentally measured pKi varying from 0 to 11 (Figure S5 in Supporting Information). This set included 41 imines, most of which had
a pKi in the range between 6 and 9. The
developed consensus SVR model (refer to Supporting Information for details), with R2 =
0.96 and RMSE = 0.27 log Ki units, was used to predict pKi for the set of 59 935 imines within the applicability
domain of the model for imine equilibrium constants. For these molecules,
the predicted pKi values vary from 4 to
8, and the distribution function has a maximum at pKi = 5–6 (Figure S6 in Supporting Information).
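To relate the predicted pKi values to quantities usable in an equilibrium calculation, the following minimal sketch converts pKi to a dissociation constant and to an association constant, and evaluates the bound fraction of the protein for a simple 1:1 binding model. The function names and the 1:1 model are illustrative assumptions.

```python
def ki_from_pki(pki):
    """Dissociation constant in mol/L from pKi = -log10(Ki)."""
    return 10.0 ** (-pki)


def association_constant(pki):
    """Association constant in L/mol, the reciprocal of Ki."""
    return 10.0 ** pki


def fraction_bound(free_ligand, pki):
    """Fraction of protein occupied in a 1:1 model at a given free ligand
    concentration (mol/L): [L] / ([L] + Ki)."""
    ki = ki_from_pki(pki)
    return free_ligand / (free_ligand + ki)


# A ligand with pKi = 6 half-saturates the protein at 1 micromolar.
half = fraction_bound(1e-6, 6.0)
```

On this scale, the predicted range of pKi = 4 to 8 corresponds to dissociation constants between 100 µM and 10 nM.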
Part 3. Speciation Modeling of the DCL
To illustrate
the operation of the speciation workflow, we decided to select the
simplest DCL consisting of two aldehydes, two amines, and the related
four imines. Ideal imines selected for the DCL should fulfill the
following requirements: (i) their formation constants should be similar
and high enough in order to provide comparable and rather high concentrations,
and (ii) one of the imines should have a much larger affinity for
the effector than the other DCL members and their building blocks,
in order to reveal the effector-induced dynamic enhancement.

After obtaining the 59 935 effector affinity constants for imines
within the model AD, this set was filtered by the requirement of a
predicted log KWeff > 3 (stable in water). This was achieved by applying the predictive model of thermodynamic stability in chloroform, converting the result to the “effective” constant in chloroform, and finally converting that to the effective constant in water, using the relations introduced above. A total of 3615 imines passed
this test. This pool is a collection of individual products, not a
combinatorial library. “Singletons” were removed from
this collection, in the sense that an aldehyde (or amine) A was kept if and only if there was at least another reagent A′ of the same class, as well as two partners B and B′ such that all combinations (AB, AB′, A′B, A′B′) were among the 3615 selected. This led
to a restrained subset of 3091 of the above 3615 imines, forming a
sparse matrix of 278 aldehydes × 89 amines as a mosaic of several
complete combinatorial sublibraries (Figure S8).

One of these 2 × 2 sublibraries (see Scheme 2) was chosen to illustrate
the speciation
analysis, the final step of the present workflow. First, effector
affinities for CA II (pKi) were estimated
for the amine and aldehyde reagents as well, as these might also interact
with the protein (see Figure a and Table S5 in Supporting Information). These values were used as an input to the ChemEqui speciation
software[62] in order to calculate the species
concentrations in the absence and in the presence of the CA II protein
receptor (Figure b
and Table S6 in Supporting Information).
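The singleton-removal rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `remove_singletons`, the representation of imines as `(aldehyde, amine)` pairs, and the choice to keep every imine whose two reagents both survive the filter are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def remove_singletons(imines):
    """Keep reagents that belong to at least one complete 2 x 2 sublibrary.

    imines: iterable of (aldehyde, amine) pairs that passed the stability filter.
    Returns the subset of pairs whose aldehyde and amine were both kept
    (assumed interpretation of the rule stated in the text).
    """
    imines = set(imines)
    partners = defaultdict(set)          # aldehyde -> amines it forms a selected imine with
    for aldehyde, amine in imines:
        partners[aldehyde].add(amine)

    kept_aldehydes, kept_amines = set(), set()
    for a1, a2 in combinations(sorted(partners), 2):
        shared = partners[a1] & partners[a2]
        # Two aldehydes sharing at least two amines span a complete 2 x 2 block,
        # so both aldehydes and every shared amine satisfy the rule.
        if len(shared) >= 2:
            kept_aldehydes.update((a1, a2))
            kept_amines.update(shared)

    return {(a, b) for a, b in imines if a in kept_aldehydes and b in kept_amines}


# Hypothetical toy pool: one complete 2 x 2 block plus two singletons.
pool = {("A", "B"), ("A", "B'"), ("A'", "B"), ("A'", "B'"),
        ("A''", "B"), ("A", "B''")}
print(sorted(remove_singletons(pool)))
```

Pairwise comparison of aldehydes scales quadratically with their number, which remains inexpensive for a few hundred reagents.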
Scheme 2
Aldehydes (A, A′),
Amines (B, B′), and Corresponding
Imines (AB, AB′, A′B, A′B′) Selected for the Speciation Experiments
Figure 5
Calculated thermodynamic and speciation parameters for
aldehydes
(A, A′), amines (B, B′), and corresponding imines (AB, AB′, A′B, A′B′). (a) Predicted log KWeff (in orange) and pKi values (in green). (b) The concentrations of the species
in the absence (blue) and in the presence of human CA II protein (gray
for uncomplexed, and red for complexed species). (c) Effect of dynamic
amplification (up-regulation) and down-regulation (%).
As shown in Figure a, A′B′ has the largest predicted
pKi value, although it does not stand
out in comparison
with the others in this respect. As expected, in the absence of the
effector, the concentrations of all imines are larger than those of their
building blocks, and A′B′ is the dominant
product. In the presence of the CA II enzyme, the interplay between
the different ligand-enzyme stabilities results in significant changes
in the constituent distribution of the DCL. The imine A′B′, which has the highest binding affinity for the effector CA II (pKi = 6.70), drives a shift of the
global equilibrium toward its ligand–enzyme complex. Consequently,
the concentrations of its free building blocks in solution decrease.
The increase in concentration (“amplification”) of
the dynamically selected A′B′ thus leads to
a decrease (“downregulation”) of the poorly bound AB′ and A′B, which share building blocks with it (Figure c). To sum up, the addition of human
CA II to the solution increased the overall concentrations of AB and A′B′ by 12% and 27%, respectively, relative to their
concentrations in the absence of the effector, while the concentrations
of AB′ and A′B decreased by 26% and 28%, respectively.
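The speciation step can be illustrated with a small mass-balance solver. The sketch below is not ChemEqui and does not reproduce the values in the figure: it assumes 1:1 competitive binding at a single protein site, treats 10^pKi as an association constant, and solves the mass balances by sequential fixed-point updates. The function names `speciate` and `amplification`, all concentrations, and all constants except pKi = 6.70 for A′B′ are hypothetical.

```python
def speciate(tot_ald, tot_amine, tot_prot, log_k, pki, tol=1e-12, max_sweeps=100_000):
    """Equilibrium concentrations (M) of an imine DCL with an optional effector.

    tot_ald, tot_amine : dict reagent -> total concentration (M), all > 0
    tot_prot           : total effector concentration (M); 0 means no effector
    log_k              : dict (aldehyde, amine) -> log10 effective formation constant,
                         one entry for every aldehyde/amine combination
    pki                : dict species -> pKi; keys are reagent names or
                         (aldehyde, amine) tuples; missing species do not bind
    """
    K = {pair: 10.0 ** v for pair, v in log_k.items()}
    Ka = {sp: 10.0 ** v for sp, v in pki.items()}      # association constant = 1/Ki
    a, b, p = dict(tot_ald), dict(tot_amine), tot_prot  # free concentrations
    for _ in range(max_sweeps):
        shift = 0.0
        for i in a:   # aldehyde mass balance, solved exactly for the free aldehyde
            denom = 1.0 + Ka.get(i, 0.0) * p + sum(
                K[i, j] * b[j] * (1.0 + Ka.get((i, j), 0.0) * p) for j in b)
            new = tot_ald[i] / denom
            shift = max(shift, abs(new - a[i]) / tot_ald[i])
            a[i] = new
        for j in b:   # amine mass balance
            denom = 1.0 + Ka.get(j, 0.0) * p + sum(
                K[i, j] * a[i] * (1.0 + Ka.get((i, j), 0.0) * p) for i in a)
            new = tot_amine[j] / denom
            shift = max(shift, abs(new - b[j]) / tot_amine[j])
            b[j] = new
        if tot_prot > 0:   # protein mass balance
            denom = (1.0 + sum(Ka.get(i, 0.0) * a[i] for i in a)
                     + sum(Ka.get(j, 0.0) * b[j] for j in b)
                     + sum(Ka.get(ij, 0.0) * K[ij] * a[ij[0]] * b[ij[1]] for ij in K))
            new = tot_prot / denom
            shift = max(shift, abs(new - p) / tot_prot)
            p = new
        if shift < tol:
            break
    imine_free = {ij: K[ij] * a[ij[0]] * b[ij[1]] for ij in K}
    # Overall imine concentration = uncomplexed + protein-bound.
    imine_total = {ij: c * (1.0 + Ka.get(ij, 0.0) * p) for ij, c in imine_free.items()}
    return {"aldehyde": a, "amine": b, "protein": p,
            "imine_free": imine_free, "imine_total": imine_total}


def amplification(with_effector, without_effector):
    """Percent change of each species on adding the effector (negative = down-regulation)."""
    return {k: 100.0 * (with_effector[k] - without_effector[k]) / without_effector[k]
            for k in without_effector}


# Hypothetical 2 x 2 library: equal totals and formation constants, one stronger binder.
ald = {"A": 1e-3, "A'": 1e-3}
amine = {"B": 1e-3, "B'": 1e-3}
log_k = {(i, j): 3.0 for i in ald for j in amine}
pki = {("A'", "B'"): 6.7, ("A", "B"): 5.0, ("A", "B'"): 5.0, ("A'", "B"): 5.0,
       "A": 4.0, "A'": 4.0, "B": 4.0, "B'": 4.0}
ref = speciate(ald, amine, 0.0, log_k, pki)
eff = speciate(ald, amine, 1e-4, log_k, pki)
for pair, pct in amplification(eff["imine_total"], ref["imine_total"]).items():
    print(pair, f"{pct:+.1f}%")
```

Because every species contains at most one copy of each component, each mass balance is linear in its own free concentration, which is what allows the simple sequential update; a general-purpose speciation program would typically use a Newton-type solver instead.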
Discussion
In
the present study, the stability constants for imine formation,
log K, and the affinity constants toward carbonic
anhydrase CA II, pKi (predicted), of almost
60 000 imines were determined. With help from the speciation tool,
a focused array of n aldehydes × m amines could be picked so as to ensure that (a) there are putative
strong CA binders among the n × m imines, and
(b) these putative binders are not penalized by an intrinsic instability
that might jeopardize their “selection” by the protein
site. Within this pool of imines, the results show that no “minimalistic”
DCL obtained from a pairwise reaction of two aldehydes and two amines
would result in the exclusive complexation of the human CA II
enzyme with only one imine. This is not surprising, as it echoes an
already known feature of combinatorial libraries, namely the high degree
of relatedness of their members: near neighbors sharing a parent may
also share comparable activity levels for the target. The positive
aspect of this result is that the discovery of a series of active
analogues may help the subsequent hit-to-lead optimization efforts.
However, it is clear that the DCL investigated so far is incomplete,
failing to include important structural features, notably in this
case a (phenyl)sulfonamide group. This group was left out because (i) it has low solubility
in chloroform, and (ii) it would overwhelmingly bias the DCL, as it
is expected to interact very strongly with the Zn(II) cation in the
active site of the enzyme, thus overshadowing any other constituents.
Investigation of other building blocks is required. This is nevertheless
not a liability at this stage because any “primary hits”
detected by a DCL would not have direct applications in drug discovery.
DCLs are key tools to probe the protein binding patterns and provide
structure–affinity information for refinement of affinity prediction
models (machine-learned, pharmacophore-based, or docking-based).

The availability of experimental data on imine formation in a given
solvent and on effector-imine affinity is crucial for machine-learning
models. Models trained on small training sets have restricted applicability
domains, which may significantly reduce the number of the considered
DCL candidates.

Clearly, chemoinformatics will not predict a
sole winner, given
the inherent inaccuracies of the underlying models. Even the most
accurate affinity prediction tool—computer-intensive free energy
perturbation calculations—would fall short of this goal. Actually,
even if all stability and affinity constants of individual DCL members
were experimentally measured, the intrinsic experimental error of
measurements (typically on the order of 0.5 log units) would still
introduce significant uncertainty in the output of predicted speciation.
Also note that the predicted protein–ligand interactions concern only
the “envisaged” binding site for
which the affinity model was tuned (as far as training data are binding
site specific, as is the case of the classical Ki determinations from dose–response reference ligand
displacement curves). Should some ligands bind at different protein
sites—possibly modulating the protein activity—they
would be selected by the DCL but not recognized as privileged ligands
according to the predictive models. Such binding may give rise to
secondary site bioactivity, for instance, by operation of an allosteric
effect. This eventuality is an especially attractive feature of the
DCL approach, more than direct binding to the “main”
receptor site as highlighted by crystallographic data. It amounts
to exploration of potential (virtual) sites versus design for a known
site—but cannot benefit from chemoinformatics support, which
is conditioned by prior knowledge. From the drug discovery point of
view, it would suggest new regions for exploration of structure/activity
relationships. In practice, the application of a DCL is a task of
identification of the best/optimal binder(s), and it implicitly is
much facilitated by an a priori knowledge of the
protein structure and hence the knowledge about the binding site(s).[63]
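The effect of a 0.5 log unit uncertainty on predicted speciation, mentioned above, can be made concrete for the simplest case of a single imine equilibrium A + B ⇌ AB. The sketch below uses the closed-form solution of the corresponding quadratic; the function name `imine_conc`, the nominal log K = 3, and the 1 mM totals are illustrative assumptions, not values from this study.

```python
import math


def imine_conc(log_k, tot_ald, tot_amine):
    """Equilibrium [AB] (M) for A + B <=> AB with effective constant K = 10**log_k (M^-1)."""
    k = 10.0 ** log_k
    s = tot_ald + tot_amine + 1.0 / k
    # Smaller root of [AB]^2 - s*[AB] + tot_ald*tot_amine = 0 (the physically valid one).
    return (s - math.sqrt(s * s - 4.0 * tot_ald * tot_amine)) / 2.0


# Nominal constant and the two bounds of a +/- 0.5 log unit error (hypothetical values).
for log_k in (2.5, 3.0, 3.5):
    c = imine_conc(log_k, 1e-3, 1e-3)
    print(f"log K = {log_k:.1f}  ->  [AB] = {c * 1e3:.3f} mM")
```

Even in this one-equilibrium case, the predicted imine concentration changes severalfold across the error interval; in a DCL with coupled equilibria, such uncertainties act on every constituent at once.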
Conclusions
The present study shows
that detailed in silico prediction of the behavior
of DCLs is technically feasible, pending
experimental validation to prove that such insights gained from simulations
may indeed help to rationally design DCLs maximizing the expectation
to discover useful new protein inhibitors, metal ion chelators, and
synthetic receptors. So far, training data quantity and quality are
not sufficient to build ideally predictive models, with extrapolation
capacities such as to render predicted equilibrium constant values
accurate enough to support a prediction of equilibrium concentrations
in such a complex system as a DCL. However, this ultimate goal is
not the actual objective of chemoinformatics, which has proven of
great utility in spite of the inaccuracy of its predictions. Total
“computational deconvolution” of a
DCL is hardly an achievable goal. Fortunately, this is not needed
because the DCL is per se an outstanding search tool
for the optimal binder, allowing for simultaneous “testing”
of large numbers of competing structures.[23,35,36] The approach may, however, be sufficiently
accurate to ensure that a computer-designed DCL stands a better chance
of success than a random mixture of reagent pools. Discovery
is not expected to come from one initial “perfect” prediction
but from cycles of prediction—experimentation—model
reassessment and refinement, taking into account the latest experimental
results. The present work outlines the technical feasibility of the
computational part, leaving the experimental validation challenge
open for future work.