Salvatore Galati1,2, Dimitar Yonchev1, Raquel Rodríguez-Pérez1, Martin Vogt1, Tiziano Tuccinardi2, Jürgen Bajorath1. 1. Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany. 2. Department of Pharmacy, University of Pisa, 56126 Pisa, Italy.
Abstract
Carbonic anhydrases (CAs) catalyze the physiological hydration of carbon dioxide and are among the most intensely studied pharmaceutical target enzymes. A hallmark of CA inhibition is the complexation of the catalytic zinc cation in the active site. Human (h) CA isoforms belonging to different families are implicated in a wide range of diseases and of very high interest for therapeutic intervention. Given the conserved catalytic mechanisms and high similarity of many hCA isoforms, a major challenge for CA-based therapy is achieving inhibitor selectivity for hCA isoforms that are associated with specific pathologies over other widely distributed isoforms such as hCA I or hCA II that are of critical relevance for the integrity of many physiological processes. To address this challenge, we have attempted to predict compounds that are selective for isoform hCA IX, which is a tumor-associated protein and implicated in metastasis, over hCA II on the basis of a carefully curated data set of selective and nonselective inhibitors. Machine learning achieved surprisingly high accuracy in predicting hCA IX-selective inhibitors. The results were further investigated, and compound features determining successful predictions were identified. These features were then studied on the basis of X-ray structures of hCA isoform-inhibitor complexes and found to include substructures that explain compound selectivity. Our findings lend credence to selectivity predictions and indicate that the machine learning models derived herein have considerable potential to aid in the identification of new hCA IX-selective compounds.
Human carbonic anhydrases
(hCAs) are metalloenzymes
that catalyze a reversible hydration of carbon dioxide producing bicarbonate
with the release of a proton.[1] Among the
eight genetically distinct CA families (α, β, γ,
δ, ζ, η, θ, and ι), 15 α-CA isoforms
are known in humans, i.e., hCA I–hCA XIV, which include two V-type isoforms (hCA VA
and hCA VB) that differ in cellular distribution
and functions. These metalloenzymes are involved in numerous physiological
processes such as pH regulation, CO2 homeostasis, bone
resorption, and gluconeogenesis.[2] Due to
the wide spectrum of physiological roles played by CAs, they have
been shown to be involved in different diseases such as glaucoma,
obesity, osteoporosis, various types of tumors, epilepsy, and neuropathic
pain. Therefore, hCAs are regarded as important therapeutic
targets, and hCA modulators are recognized as promising
agents for clinical applications.[3] Among
the different hCA isoforms, hCA
IX and XII are predominantly found in tumor cells and show a rather
limited diffusion in normal cells. Both isoforms are multidomain trans-membrane
proteins with an extracellular CA domain and were demonstrated to
participate in the rather complex machinery of pH regulation.[4] In particular, the membrane-associated hCA IX is considered a tumor-associated protein due to its
low level of expression in normal tissues and high overexpression
in almost all hypoxic tumors, where it contributes to survival, proliferation,
invasion, and metastasis of cancer cells.[5] For these reasons, the hCA IX isoform has attracted
the attention of many researchers focusing their efforts on the development
of potent hCA IX inhibitors. As a result, a plethora
of inhibitors has been reported in the literature, with compounds mainly
belonging to the sulfonamide, dithiocarbamate, coumarin, sulfocoumarin,
sulfamate, and carboxylate classes. Furthermore, an ongoing clinical
trial (NCT03450018) is evaluating the sulfonamide inhibitor SLC-0111
in hCA IX-positive patients diagnosed with metastatic
pancreatic ductal adenocarcinoma.[6]
At present, many compounds that act as low nanomolar or subnanomolar hCA IX inhibitors are known; however, beyond inhibitory
potency, a key feature that must be considered for a potential
therapeutic application of these compounds is their selectivity against
the other hCA isoforms and especially against hCA I/II, which are ubiquitously distributed and involved
in key physiological processes.[7] This is
particularly true for hCA II since it has the widest
tissue distribution and is highly expressed in red blood cells.[8] Because most of the drugs are administered systemically
and are membrane-permeable, hCA II is likely to sequester
nonselective hCA inhibitors reducing their circulating
concentrations, decreasing their bioavailability for hCA IX, and thus limiting their exposure within tumors.[9]
Overcoming the lack of selectivity for
a specific hCA isoform represents the major challenge
in the development of hCA inhibitors for therapy.
The difficulty in finding a
compound selective for a specific hCA isoform is
due to the high sequence and structural homology shared by all hCA isoforms.[10] The large number
of compounds reported in the literature that were tested for inhibitory activity
against hCA IX and hCA II prompted
us to generate a database of compounds that are either selective for hCA IX over hCA II or nonselective. Machine
learning (ML) was then applied to predict isoform-selective inhibitors,
and compound features determining successful predictions were identified.
The resulting feature patterns were further analyzed on the basis
of X-ray structures of hCA-inhibitor complexes, revealing
individual features that were directly implicated in isoform selectivity.
Materials and Methods
Compound Data Sets
Our compound collection
was assembled from publicly available data extracted from the PubChem
BioAssay database (accessed September 2019).[11] Compounds with measured potencies against hCA II
and hCA IX (corresponding to UniProt IDs “P00918”
and “Q16790”, respectively) were collected. In order to ensure homogeneous
experimental conditions and inter-assay data consistency,[12] only assays originating from the laboratory
of C. T. Supuran were considered, which amounted to a total of 1138
assays (PubChem AIDs) and 7121 compounds (CIDs). We intended to generate
a comprehensive and intrinsically heterogeneous set of inhibitors
covering different variants of inhibitory mechanisms, all of which
were directed against the active site of hCA. Since
we aimed at predicting isoform selectivity of hCA
inhibitors and rationalizing these predictions, we considered it important
to comprehensively analyze different types of inhibitors, which further
challenged machine learning. Training and test instances were available
for all types of inhibitors and considered in combination. Accordingly,
the results were generalizable (and not confined to subsets of inhibitors).
Only enzyme-inhibitor interactions for which numerically defined inhibition
constants (Ki values) were available were
considered. No Ki threshold was applied
for inhibitors. If two Ki values were
available for a compound, then preference was given to the one reported
in the source publication. For compounds with three or more measurements, Ki values deviating by more than 25% from the
calculated mean Ki were discarded, and
the mean Ki value was recalculated and
assigned as the final potency annotation. Applying these criteria
resulted in a total of 2506 inhibitors tested against both hCA isoforms for which a subsequent selectivity analysis
was carried out. For each ligand, a selectivity index (SI) was calculated
as the difference between the negative logarithmic inhibition constants (pKi values) measured for hCA IX and hCA II, i.e., SI = pKi(hCA IX) − pKi(hCA II). Accordingly, compounds with SI > 0.7, corresponding
to at least a five-fold higher potency for hCA IX over hCA II, were categorized as selective hCA IX inhibitors. Conversely, compounds with SI ≤ 0.7 were
classified as nonselective hCA inhibitors. This classification
scheme yielded a data set of 870 hCA IX-selective
and 1636 nonselective inhibitors.
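The potency curation rule for three or more measurements and the SI-based class assignment can be illustrated with a minimal sketch. The function names and the example Ki values are hypothetical and not taken from the data set, and Ki values are assumed to be given in nM (the SI itself does not depend on the unit as long as both isoforms use the same one).

```python
import math

def mean_ki_after_outlier_removal(ki_values, max_rel_dev=0.25):
    """Mean Ki for >= 3 measurements after discarding values that
    deviate by more than 25% from the initial mean (rule from the text)."""
    if len(ki_values) < 3:
        raise ValueError("rule applies to three or more measurements")
    mean = sum(ki_values) / len(ki_values)
    kept = [k for k in ki_values if abs(k - mean) <= max_rel_dev * mean]
    if not kept:
        # Assumption: the text does not say what happens if nothing remains.
        return None
    return sum(kept) / len(kept)

def pki(ki_nm):
    """Negative decadic logarithm of Ki, with Ki assumed to be in nM."""
    return 9.0 - math.log10(ki_nm)

def selectivity_index(ki_ix_nm, ki_ii_nm):
    """SI = pKi(hCA IX) - pKi(hCA II)."""
    return pki(ki_ix_nm) - pki(ki_ii_nm)

def is_selective(ki_ix_nm, ki_ii_nm, threshold=0.7):
    """Class label: selective if SI > 0.7, otherwise nonselective."""
    return selectivity_index(ki_ix_nm, ki_ii_nm) > threshold

# Hypothetical measurements (nM): 14.0 deviates by more than 25% from the mean of 11.0.
final_ki = mean_ki_after_outlier_removal([9.0, 10.0, 11.0, 14.0])
label = is_selective(ki_ix_nm=final_ki, ki_ii_nm=100.0)
```

In this made-up case, the curated hCA IX value is ten-fold lower than the hCA II value, so the compound would be labeled selective.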
Molecular Representation
Building
ML models for distinguishing between selective and nonselective hCA inhibitors requires the use of molecular representations
such as numerical descriptors or fingerprints. Therefore, for each
compound, a modified version of the molecular graph-based (i.e., the
stereochemically insensitive) extended connectivity fingerprint with
bond diameter 4 (ECFP4)[13] was calculated
using the Morgan fingerprint implementation of RDKit.[14] ECFPs account for specific atom environments (for ECFP4,
those within a radius of two bonds around an atom), which are represented
as hash values. In cheminformatic ML applications, ECFP4 has become
a widely accepted standard representation for compounds with comparable
or superior performance relative to other (fingerprint) descriptors.[15] The ECFP4 hash values for all unique atom environments
in each data set compound were computed, resulting in a total of 6061
unique structural features. The hash value positions in the molecular
feature vectors were then organized according to their frequency of
occurrence in a descending manner. Hence, for each compound, the presence
or absence of a specific structural feature determined whether its
corresponding bit position in the 6061-dimensional molecular feature
vector was set to 1 or 0. This procedure did not include any additional
dimensionality reduction such as standard fingerprint “folding”
into a predefined fixed-length vector and hence avoided potential
bit collisions that may be caused by ambiguous feature-bit mappings.
As a result, unambiguous reverse mapping of fingerprints to their
corresponding structural features allowed for the visualization and
assessment of the importance of individual features during ML classification.
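The keyed (unfolded) fingerprint design can be sketched as follows. This is an illustration only: the integer feature IDs stand in for ECFP4 hash values, the helper names are hypothetical, and breaking frequency ties by feature ID is an assumption made here for determinism.

```python
from collections import Counter

def build_feature_index(compound_features):
    """Map each unique feature ID to a bit position, ordered by
    descending frequency of occurrence across all compounds."""
    counts = Counter(f for feats in compound_features for f in feats)
    ordered = sorted(counts, key=lambda f: (-counts[f], f))  # ties: by ID (assumption)
    return {feature: position for position, feature in enumerate(ordered)}

def to_bit_vector(features, index):
    """Binary vector with a 1:1 correspondence between bits and features."""
    vector = [0] * len(index)
    for f in features:
        vector[index[f]] = 1
    return vector

# Hypothetical compounds, each given as a set of hashed atom environments.
compounds = [{101, 202}, {101, 303}, {101, 202, 404}]
index = build_feature_index(compounds)
vectors = [to_bit_vector(c, index) for c in compounds]
```

Because no folding is applied, each bit position maps back to exactly one feature, which is what makes the later feature weighting and mapping unambiguous.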
Structural Organization
To identify
analog series (ASs) formed by hCA inhibitors, data
set compounds were subjected to bond fragmentation according to a
set of retrosynthetic rules and organized into ASs
using the compound-core relationship algorithm.[16] Accordingly, compounds containing the same structural core
and different substituents were combined into an AS.
Machine Learning Methods
Random Forest
A random forest (RF)
is a supervised ML algorithm that consists of a large number of individual
decision trees forming an ensemble classifier. Each individual tree
produces a class label prediction for a given data instance, and the
final prediction outcome is determined by the majority class vote.[17] Class weights were set automatically to be
inversely proportional to the class label frequencies
in the input data. All remaining hyperparameters were set to their
default values in scikit-learn version 0.23.1.[18]
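Class weighting inversely proportional to class frequencies can be made concrete with a short sketch. It assumes the weighting follows the formula that scikit-learn documents for its "balanced" class weight mode, n_samples / (n_classes × class count); whether exactly this mode was used is an assumption, and the function name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequencies:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class sizes of the full data set: 870 selective (1), 1636 nonselective (0).
labels = [1] * 870 + [0] * 1636
weights = balanced_class_weights(labels)
```

With these class sizes, the minority (selective) class receives roughly twice the weight of the majority class, so both classes contribute equally in total.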
Support Vector Machine
A support
vector machine (SVM) is a supervised ML algorithm that constructs
a hyperplane or set of hyperplanes in a multidimensional feature space,
which are used for classification or regression. In the case of classification,
acceptable separation is achieved by a hyperplane having the largest
distance to the nearest training data points of any class. Thus, maximizing
the margin lowers the generalization error of the classifier.[19]
Furthermore, kernel functions enable the
algorithm to operate in a high-dimensional implicit feature space.
Instead of explicitly computing the data coordinates in that space,
the inner products of their pairwise projections are calculated. This
approach is commonly referred to as the “kernel trick”
and presents a computationally efficient alternative to explicit dimensionality
expansion.[20] Accordingly, if linear separation
via a hyperplane is not feasible in a given feature space, then the
kernel trick facilitates implicit mapping of training compounds into
a higher-dimensional feature space where linear separation might become
feasible. Herein, the linear and Tanimoto kernels[21] were used, and the better performing kernel was selected
for each model during internal cross validation (see below). The regularization
parameter C determines the magnitude of error penalization
and thus balances the fit to the training set against the risk of overfitting.
During parameter optimization, C values 0.01, 0.1,
1, 10, and 100 were evaluated. SVM training was performed using scikit-learn
version 0.23.1.
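For binary fingerprints, the Tanimoto kernel is the ratio of shared features to the total number of distinct features in two compounds. The following sketch computes it on feature sets and builds a kernel (Gram) matrix of the kind that could be supplied to an SVM implementation accepting precomputed kernels. Function names and feature IDs are hypothetical, and returning 0 for two empty feature sets is a convention chosen here.

```python
def tanimoto_kernel(a, b):
    """Tanimoto similarity of two feature sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0  # convention for two empty sets (assumption)
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def gram_matrix(rows, cols):
    """Pairwise kernel values between two lists of feature sets."""
    return [[tanimoto_kernel(x, y) for y in cols] for x in rows]

# Hypothetical compounds given as sets of feature IDs.
train = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
K = gram_matrix(train, train)
```

The diagonal of K is 1 for non-empty feature sets, and compounds without shared features yield 0.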
Cross Validation and Performance Measures
All ML calculations were carried out by applying
a standard double cross validation procedure. First, the ECFP4 representations of selective
and nonselective inhibitors were assigned classification labels of
“1” and “0”, respectively. Then, the data
set was recurrently divided 10 times by random sampling into 80% training
and 20% test compounds. Calculation parameters specified above were
optimized via internal five-fold cross validation on the training
set, and the best performing parameter settings were used for test
set predictions. Based on the predictions from the 10 independent
external cross validation trials, the following measures were computed
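in order to evaluate model performance: balanced accuracy (BA),[22] F1 score,[23] and Matthews
correlation coefficient (MCC),[24] defined
as follows:
$$\mathrm{BA} = \frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right)$$
$$\mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$$
$$\mathrm{MCC} = \frac{\mathrm{TP}\cdot\mathrm{TN}-\mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$$
TP, TN, FP, and FN abbreviate true
positives, true negatives, false positives, and false negatives, respectively.
Here, TP + FN corresponds to the total number of selective compounds;
conversely, FP + TN corresponds to the total number of nonselective
compounds.
As indicated by its formula, MCC takes into account
all values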
of the confusion matrix derived from binary classification. It has
the range [−1,1] where MCC = 1 represents a perfect classification
(with no FP and FN), MCC = 0 is equivalent to random classification,
and MCC = −1 indicates complete disagreement between predicted
and actual class labels. BA accounts for the fraction of correct predictions
while taking data imbalance into account through equivalent weighting.
This was another appropriate measure for our analysis because the
compound data set contained approximately twice as many nonselective
(negative class) as selective (positive class) compounds. BA has the
range [0,1] with BA = 1 describing perfect, BA = 0.5 random, and BA
= 0 completely inaccurate classifications, respectively. The F1 score
is a composite measure representing the harmonic mean of precision
and recall. It strongly emphasizes TP values without taking TN values
into consideration. High F1 values indicate good model performance.
In addition, receiver operating characteristic (ROC) curves[25] were computed to compare the TP rate ([0, 1], y-axis) to the FP rate ([0, 1], x-axis)
at different classification thresholds, and the area under the ROC
curve (AUROC) was determined.[25] In a ROC
curve, the diagonal line is equivalent to a random class prediction
and yields an AUROC value of 0.5. Increasing AUROC values between
0.5 and 1 are indicative of increasing model performance, with the
value of 1.0 representing a perfect prediction.
Furthermore,
to ensure statistically sound comparisons of individual
inhibitors, we also required that each inhibitor was predicted as
a test set compound in at least five different external cross validation
trials. This criterion was met after 26 trials, and for each selective
and nonselective inhibitor, “model prediction consistency”
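(MPC) was calculated as follows:
$$\mathrm{MPC} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \times 100\%$$
Accordingly, an MPC value
of 100% indicated that a compound was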
consistently correctly classified. Conversely, an MPC value of 0%
resulted from consistently incorrect classification in each trial.
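The performance measures and the MPC value described above can be computed from confusion matrix counts and per-compound predictions, as in this minimal sketch. The standard textbook definitions of BA, F1, and MCC are used; the counts and predictions are made up for illustration, the function names are hypothetical, and returning 0 for an undefined MCC is a common convention rather than something stated in the text.

```python
import math

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the true positive rate and the true negative rate."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; TN is not used."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0 if undefined (convention)

def mpc(true_label, predicted_labels):
    """Percentage of test set predictions of one compound that were correct."""
    correct = sum(1 for p in predicted_labels if p == true_label)
    return 100.0 * correct / len(predicted_labels)

# Made-up confusion matrix with twice as many negatives as positives.
tp, fn, tn, fp = 40, 10, 80, 20
ba = balanced_accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
m = mcc(tp, tn, fp, fn)

# Made-up selective compound predicted in five external trials.
consistency = mpc(true_label=1, predicted_labels=[1, 1, 0, 1, 1])
```

In this example both class-wise rates are 0.8, so BA is 0.8, whereas F1 is lower because it penalizes the 20 false positives without crediting the 80 true negatives.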
Feature Weighting and Frequency Analysis
To identify individual structural features determining the classification,
corresponding feature weights (FWs) were extracted from SVM models.
For FW extraction, two previously introduced methods using the Tanimoto
kernel were applicable.[26,27] In addition to their
numerical values, FWs were assigned positive or negative signs depending
on their relative importance for predicting a specific class label.
According to this definition, the SVM model assigned a positive sign
to a feature if its presence predominantly determined the prediction
of selective inhibitors. In contrast, a feature with a negative sign
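predominantly determined the identification of nonselective inhibitors.
The corresponding frequency distributions for selective and nonselective
compounds, respectively, were defined as
$$F_{\mathrm{sel}}(i) = \frac{n_{\mathrm{sel}}(i)}{N_{\mathrm{sel}}}, \qquad F_{\mathrm{nonsel}}(i) = \frac{n_{\mathrm{nonsel}}(i)}{N_{\mathrm{nonsel}}}$$
where n(i) is the number of compounds of the given class containing feature i and N is the total number of compounds of that class.
In order to determine whether the presence of a given structural
feature was more important for the identification of selective or
nonselective inhibitors, a frequency difference value ΔF was calculated for each feature i:
$$\Delta F(i) = F_{\mathrm{sel}}(i) - F_{\mathrm{nonsel}}(i)$$
Thus, features with positive ΔF values were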
preferentially found in selective hCA IX inhibitors,
whereas negative ΔF values indicated features
that preferentially occurred in nonselective inhibitors. Furthermore,
features that were exclusively found in selective or nonselective
compounds were identified and prioritized.
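The frequency analysis can be sketched as follows, assuming that the frequency of a feature is the fraction of compounds of a class that contain it and that ΔF is the selective minus the nonselective frequency, which matches the sign convention described above. Feature IDs, compounds, and function names are hypothetical.

```python
from collections import Counter

def feature_frequencies(compound_features):
    """Fraction of compounds in one class that contain each feature."""
    n = len(compound_features)
    counts = Counter(f for feats in compound_features for f in feats)
    return {f: c / n for f, c in counts.items()}

def delta_f(selective, nonselective):
    """Delta F(i) = F_selective(i) - F_nonselective(i) for every feature i."""
    f_sel = feature_frequencies(selective)
    f_non = feature_frequencies(nonselective)
    return {f: f_sel.get(f, 0.0) - f_non.get(f, 0.0)
            for f in set(f_sel) | set(f_non)}

def exclusive_features(selective, nonselective):
    """Features found only in selective or only in nonselective compounds."""
    sel = set().union(*selective)
    non = set().union(*nonselective)
    return sel - non, non - sel

# Hypothetical feature sets for two selective and four nonselective compounds.
selective = [{1, 2}, {1, 3}]
nonselective = [{2, 4}, {4}, {2, 4}, {5}]
dF = delta_f(selective, nonselective)
only_sel, only_non = exclusive_features(selective, nonselective)
```

Here feature 1 occurs in every selective and no nonselective compound, giving the largest positive ΔF, while feature 2 is equally frequent in both classes and has ΔF = 0.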
Analysis of X-ray Structures
Key
features determining predictions were further analyzed on the basis
of publicly available X-ray structures of hCA II
and hCA IX in complex with inhibitors. A total of
488 hCA II, 11 hCA IX, and 59 hCA IX-mimicking (mutated) proteins in complex with unique
inhibitors, many of which were contained in our data set (Table 1), were obtained from
the RCSB Protein Data Bank (accessed September 2020).[28] An hCA IX-mimicking protein contains the
original hCA II isoform active site engineered by
site-directed mutagenesis to represent the wild-type hCA IX isoform by introducing relevant residue replacements. These
replacements included A65S, N67Q, E69T, I91L, F131V, K170E, and L204A
(hCA II sequence numbering).[29] Further analysis revealed that compounds in four of the hCA IX and 30 of the hCA IX-mimicking structures
were also cocrystallized with the hCA II isoform.
Thus, these compounds provided a meaningful basis for studying different
binding modes and specific interactions associated with hCA IX/hCA II selectivity taking into account structural
features that determine ML predictions. Superpositions of X-ray structures
were obtained using UCSF Chimera.[30]
Table 1
X-ray Structures^a
| target | PDB entries | unique inhibitors | contained in the ML data set | shared by isoforms | shared in the ML data set |
|---|---|---|---|---|---|
| hCA II | 811 | 488 | 93 | 34 | 12 |
| hCA IX | 20 | 11 | 4 | 4 | 2 |
| hCA IX-mimic | 94 | 59 | 15 | 30 | 10 |
^a Reported are X-ray structures of hCA-inhibitor complexes evaluated in our analysis. For example,
from the PDB, 811 structures of hCA II-inhibitor
complexes were retrieved, which contained 488 unique inhibitors, 93
of which were contained in our data set for ML. Thirty-four of the unique
inhibitors were also found in complex structures of hCA IX or the hCA IX-mimicking protein, and 12 of these shared inhibitors were contained in
our data set.
Results and Discussion
Inhibitors and Analog Series
Initially,
the data set of selective and nonselective inhibitors for ML was structurally
organized. It was found to contain 328 ASs with two or more compounds,
which together comprised 1748 inhibitors (∼70% of the data set). ASs contained only hCA IX-selective inhibitors (48 series), only nonselective
(163), or both selective and nonselective inhibitors (117 “mixed”
series). Figure 1 shows
that these different AS categories displayed similar size distributions,
with a clear dominance of small series with fewer than five compounds.
Only small numbers of larger ASs comprising up to 25 compounds were
detected. Hence, there was no global or category-centered bias in
AS composition toward small numbers of large series, which might limit
predictive modeling or conclusions drawn from such investigations.
However, as revealed by the presence of 117 mixed ASs, many selective
and nonselective inhibitors displayed close structural relationships,
which principally challenged the prediction of selective inhibitors.
Furthermore, there were nearly twice as many nonselective as
selective inhibitors available (applying a moderate SI > 0.7 criterion),
which reflected the inherent difficulties in obtaining isoform-selective hCA inhibitors, as described above. Rather than balancing
the number of compounds with different class labels (positive/selective
or negative/nonselective) for training, which generally favors ML
predictions, we preferred retaining this intrinsic imbalance, thus
attempting predictions under realistic data conditions. Taken together,
in light of the statistical and structural characteristics of the
inhibitor data set, the selectivity prediction task was considered
challenging.
Figure 1
Distribution of analog series. Histograms report the size
distributions
of ASs exclusively consisting of nonselective (red) or selective (green)
inhibitors or combining both nonselective and selective compounds
(mixed series, orange). The three histograms are shown on the same
scale.
Prediction of Selective Inhibitors
We then attempted to systematically
predict hCA IX-selective inhibitors in cross validation trials. Contrary to our
expectations, generally high prediction accuracy was achieved, for
both RF and SVM models and on the basis of all performance measures,
as summarized in Figure 2. The performance of RF and SVM classification was very similar with
only little variation over different trials. With median F1 values
of ∼0.75, median BA of >0.8, and AUROC values close to 0.9,
the predictions consistently yielded reasonable to high accuracy,
as further indicated by median MCC values of ∼0.6. We also
assessed the predictions at the level of ASs, which mirrored the structural
organization of test data. As reported in Table 2, 31 of 48 ASs exclusively comprising selective
inhibitors were consistently correctly predicted (MPC = 100%), corresponding
to 173 of 219 compounds contained in selective ASs. Moreover, 145
of 163 ASs exclusively consisting of nonselective inhibitors were
consistently correctly predicted, including 508 of 549 nonselective
inhibitors. Overall, 83% of ASs consisting of either only selective
or nonselective inhibitors were always correctly predicted. Hence,
assessing the predictions at the level of ASs further confirmed their
global accuracy.
Figure 2
Prediction accuracies. Boxplots report prediction accuracy
over
10 independent RF (blue) and SVM (orange) trials using different training
and test sets. From the left to the right, results are shown for the
F1, BA, AUROC, and MCC measures. Boxplots show the smallest value
(lower whisker), lower quartile (lower boundary of the box), median
(vertical line in the box), upper quartile (upper boundary of the
box), and the maximum value (upper whisker). Values classified as
statistical outliers are represented as diamonds.
Table 2
Prediction of Analog Series^a
| class | subset | compounds | analog series |
|---|---|---|---|
| selective | total data set | 219 | 48 |
| selective | MPC(selective) = 100% | 173 | 31 |
| nonselective | total data set | 559 | 163 |
| nonselective | MPC(nonselective) = 100% | 508 | 145 |
^a Reported are ASs exclusively consisting
of selective and nonselective inhibitors and their subsets that were
consistently correctly predicted (MPC = 100%).
Feature Relevance Analysis
In light
of the observed accuracy, we further assessed the predictions by exploring
structural features that were responsible for the predictions. In
ML, diagnostic approaches are still rare but essential for rationalizing
successful predictions or failures. Given the equivalence of the results
obtained for RF and SVM classification and the consistently better
predictive performance of the Tanimoto over the linear kernel, we
focused the analysis on SVM calculations, for which feature weighting
approaches were applicable (see Materials and Methods). Accordingly, we determined ECFP4 features with positive and negative
SVM weights contributing to the correct prediction of selective and
nonselective inhibitors, respectively, and searched for contributing
features that exclusively occurred in selective or nonselective compounds. Figure 3 shows that large
numbers of features were identified that contributed with varying
weights to positive or negative predictions and exclusively occurred
in selective and nonselective inhibitors, respectively. As indicated
by generally low ΔF values, exclusive features
typically only occurred in small subsets of compounds. Hence, there
were no distinguishing features that could be generalized, consistent
with the structural heterogeneity of selective and nonselective compounds,
as revealed by their partitioning into many different ASs of mostly
small size. Furthermore, most of the exclusive features had absolute
weights <0.10, and comparatively few features with absolute weights
>0.15 were detected. While many features contributed to meaningful
SVM predictions, the latter features largely determined correct predictions
of selective or nonselective inhibitors. Figure 4 shows the top 10 features with the largest
weights that exclusively occurred in selective inhibitors and thus
made the most important contribution to the prediction of selectivity.
These ECFP4 features defined different structural fragments that occurred
in test compounds as substructures. Notably, these features included
two distinct sulfonamide-containing substructures (numbers 1590 and
5500). As discussed above, the sulfonamide group complexing the catalytic
zinc cation in the active site is a hallmark of many potent hCA inhibitors, which is contained in both selective and
nonselective inhibitors. Thus, the presence or absence of a sulfonamide
group alone is insufficient to distinguish between selective and nonselective
inhibitors. Rather, the way in which sulfonamide is embedded in substructures/compounds
or specific feature combinations in which it occurs might contribute
to the prediction of selective inhibitors. Furthermore, the two top
ranked features in Figure 4 were features 209 and 212, which delineated overlapping substructures
and had the largest weights and by far the highest ΔF values among positive features, as shown in Figure 3 (where features 209 and 212
are highlighted). Thus, these two features accounting for similar
structural fragments made overall the most important contributions
to the predictions of selective inhibitors.
Figure 3
Distribution of exclusive
features. The scatterplot shows the distribution
of ECFP4 features that are exclusively found in nonselective (orange)
or selective (green) inhibitors. Each dot represents a unique feature.
The relative frequency of occurrence of a feature in nonselective
or selective compounds (ΔFrequency; nonselective < 0, selective
> 0) is plotted against the mean feature weight from SVM classification.
Negative and positive weights represent contributions to the prediction
of nonselective and selective compounds, respectively. Two features
with the largest Δfrequency values and the highest weights are
highlighted (red box, upper right corner).
Figure 4
Exclusive
features. Shown are the top 10 features with the largest
SVM weights that exclusively occurred in selective inhibitors (ordered
from upper left, top 1, to lower right, top 10). Features 209 and
212 are highlighted in Figure 3.
Feature Mapping
The keyed design
of the feature fingerprint with 1:1 bit-to-feature correspondence
made it possible to map key features to structures of test compounds.
Therefore, we searched for ASs exclusively comprising selective inhibitors
that were consistently correctly predicted and contained features
209 and/or 212. Several ASs were identified. Figure 5A shows an exemplary series of sulfocoumarin
derivatives in which both features were present and formed a substructure
covering most of the sulfocoumarin core. These compounds are potent
and selective inhibitors of the tumor-associated hCA IX and hCA XII isoforms.[31] Of note, coumarin and sulfocoumarin derivatives can act by complex
mechanisms. These compounds are known to undergo hydrolysis upon binding
to the catalytic site of hCAs. However, prior to
hydrolysis, they bind within the hCA active site
similarly to phenols, i.e., by anchoring to the zinc-bound water molecule/hydroxide
ion,[32] as confirmed by an X-ray structure
of 2-thioxocoumarine in complex with hCA II.[33] This recognition mechanism was intentionally
included in our ML analysis, yielding promising results.
Figure 5
(A,B) Feature
mapping. In (A), feature 209 from Figure is mapped (red) on exemplary
analogs from a selective AS with MPCselective = 100%. In
(B), members of another selective AS with MPCselective =
100% are shown. Features with the highest SVM weights are mapped on
the analogs. For a compound from this AS, X-ray structures of complexes
with hCA II and hCA IX were available.
In Figure B, another
selective AS is shown in which features making the largest contributions
to consistently correct predictions were mapped on individual analogs
containing them. All of these features delineated extended terminal
pyridyl or substituted phenyl ring systems distant from the sulfonamide
moiety. Thus, in both cases, key features for correct predictions
defined coherent substructures in corresponding regions of analogs,
which provided a basis for interpreting predictions.
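The mapping step described above relies on the 1:1 correspondence between fingerprint bits and features. A minimal sketch of this idea is given below, assuming that a keyed fingerprint is stored per compound as a dictionary from feature identifier to the atom indices covered by that feature. The data layout, function names, and atom indices are hypothetical and serve only to illustrate the principle. They do not reproduce the implementation used in this work.

```python
def map_key_features(series, key_features):
    """Return, per compound, the atoms covered by any of the key features.

    series: dict compound name -> {feature_id: set of atom indices};
            a keyed fingerprint preserves this feature-to-substructure link.
    key_features: iterable of feature IDs of interest.
    """
    mapping = {}
    for name, fingerprint in series.items():
        atoms = set()
        for feat in key_features:
            atoms |= fingerprint.get(feat, set())
        mapping[name] = atoms
    return mapping

def series_contains_features(series, key_features):
    """True if every compound in the series carries at least one key feature."""
    return all(any(f in fp for f in key_features) for fp in series.values())

# Hypothetical analog series; atom indices are invented for illustration
toy_series = {
    "analog_1": {209: {0, 1, 2, 3}, 212: {3, 4, 5}, 17: {8, 9}},
    "analog_2": {209: {0, 1, 2, 3}, 40: {7}},
}
highlighted = map_key_features(toy_series, (209, 212))
```

The union of atoms returned for each analog corresponds to the substructure that would be highlighted when both features are present, as in the sulfocoumarin series.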
Relating Important Features to Selectivity
Feature
weighting and mapping identified a number of features that
determined accurate SVM predictions of hCA IX-selective
inhibitors. However, although these features made major contributions
to ML predictions, it could not be concluded that they were implicated
in or responsible for selectivity. Structural features determining
predictions may or may not be of biological relevance, the assessment
of which goes beyond ML analysis. Hence, the question of whether substructures
defined by the most important features we identified were indeed implicated
in inhibitor selectivity required additional analysis.
Structure-Based Analysis
To address
this question, we searched for selective inhibitors for which X-ray
structures of complexes with hCA II and hCA IX or hCA IX-mimics were available. Such structures
provided a basis for viewing mapped key features in light of enzyme-inhibitor
interactions and exploring potential differences implicated in selectivity.
Among the large number of publicly available hCA
isoform X-ray structures (Table ), a limited number of suitable hCA
II/IX structures with selective inhibitors we predicted were identified
and compared. A particularly instructive example was obtained by comparing
X-ray structures of hCA II and hCA IX-mimicking protein in complex with the hCA
IX-selective inhibitor SLC-0111 that belongs to the series in Figure B (PDB entries 3N4B
and 5JN3, respectively). The binding mode of SLC-0111 in the hCA II and hCA IX-mimic structures is shown
in Figure A and B,
respectively. As observed for all members of the corresponding AS,
the feature with the highest positive SVM weight mapped to the terminal
ring (in this case, a 4-fluorophenyl moiety) distant from the sulfonamide
group complexing the catalytic zinc ion. In both complexes, the benzenesulfonamide
fragment occupied a superimposable position and interacted with the catalytic
zinc ion. In the hCA II structure, however, the N,N′-ureic portion of the ligand adopted a
less stable cis/trans conformation in which the 4-fluorophenyl moiety
interacted with residue P201. This orientation was imposed
by steric hindrance between the 4-fluorophenyl moiety and the
phenyl ring of the F130 side chain. This phenylalanine residue was not conserved in hCA IX, where it was replaced by a smaller valine residue
(V262). Figure shows
the corresponding sequence alignment. This substitution led to the
absence of steric hindrance between the protein and the 4-fluorophenyl
moiety of the ligand. As a consequence, the inhibitor was able to
maintain a more stable trans/trans N,N′-ureic conformation with a strong lipophilic interaction
between 4-fluorophenyl and V262. By contrast, suboptimal interactions
in this region of hCA II resulted in a loss of potency
of the inhibitor compared to hCA IX and hence in
selectivity of the compound for hCA IX over hCA II. The importance of inhibitor interactions with residue
131 in hCA isoforms has also been pointed out in
the literature,[34] providing corroborating
evidence. These considerations were equally applicable to most of
the other analogs comprising the hCA IX-selective
series in Figure B,
which was consistently correctly predicted. In all instances, features
with the highest positive weights determining the predictions were
mapped to the corresponding ring structures, which were implicated
in selectivity-determining interactions with hCA
isoforms. Therefore, in this case, features that determined ML predictions
were directly implicated in critical enzyme-inhibitor interactions
determining compound selectivity and thus biologically relevant.
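Residue replacements such as the one discussed above can be read off a pairwise alignment by tracking the residue numbering of each sequence separately. The following sketch illustrates this under stated assumptions. The function name is hypothetical, and the two aligned fragments are invented toy strings rather than the actual hCA II and hCA IX sequences. The start positions were chosen only so that the toy output mirrors the residue labels used in the text.

```python
def substitutions(aligned_a, aligned_b, start_a=1, start_b=1):
    """List aligned positions at which two gapped sequences differ.

    Returns tuples such as ("F130", "V262"); columns containing a gap
    in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pos_a, pos_b = start_a - 1, start_b - 1
    diffs = []
    for res_a, res_b in zip(aligned_a, aligned_b):
        # Advance the per-sequence counters only for real residues
        if res_a != "-":
            pos_a += 1
        if res_b != "-":
            pos_b += 1
        if "-" in (res_a, res_b) or res_a == res_b:
            continue
        diffs.append((f"{res_a}{pos_a}", f"{res_b}{pos_b}"))
    return diffs

# Invented toy fragments, not the real hCA II / hCA IX sequences
toy_a = "GSFDK-LW"
toy_b = "GSVDKALW"
diffs = substitutions(toy_a, toy_b, start_a=128, start_b=260)
```

Restricting the comparison to alignment columns that belong to the binding site would yield the nonconserved binding site residues of the kind highlighted in the sequence alignment figure.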
Figure 6
Structure-based
analysis. X-ray structures of SLC-0111 in complex
with (A) hCA II and (B) hCA IX-mimic
forms. (upper section) The catalytic zinc cation interacting with
the sulfonamide moiety of the inhibitor is depicted as a sphere; (lower
section) an interaction map is shown for the inhibitor and amino acid
residues lining the active site (green, lipophilic residues; sky blue,
polar residues; red, charged residues).
Figure 7
Sequence
alignment of hCA II and hCA IX.
Shown is the alignment of the hCA II (CAH2_HUMAN)
and hCA IX (CAH9_HUMAN) amino acid sequences taken
from UniProt.[35] Binding site residues are
highlighted in yellow, and nonconserved residues participating in
the formation of the binding site are shown in bold. Identical residues
are indicated with “*”, while conservative residue replacements
are marked with “:” and “.”.
Conclusions
Predicting target-selective
compounds typically represents a challenging task. In this work, we
have attempted to predict inhibitors with selectivity for the tumor-associated hCA IX isoform over the ubiquitous hCA
II isoform via ML. Surprisingly accurate and robust predictions were
obtained using RF and SVM models, including many selective or nonselective
ASs that were consistently correctly predicted, lending credence to
the computational approach. These rather encouraging findings prompted
us to further analyze the predictions. SVM feature weight analysis
revealed numerous features that exclusively occurred in selective
or nonselective compounds and contributed to positive and negative
predictions. Highly weighted features were found to map to corresponding
regions in ASs, hence rationalizing origins of successful predictions.
For selectivity analysis and compound design, signature features of
compound selectivity are of prime interest. However, there is no guarantee
that features that make large contributions to or determine positive
ML predictions are indeed biologically relevant. Therefore, we have
gone a step further and evaluated important features on the basis
of X-ray structures of complexes formed by the hCA
IX and hCA II isoforms and selective inhibitors.
For an exemplary selective AS, comparisons of corresponding X-ray
structures revealed that features determining correct predictions
defined substructures of inhibitors that were involved in selectivity-conferring
interactions, thus establishing proof-of-principle. Demonstrating
biological relevance of distinguishing features identified by ML is
far from being routine, and to our knowledge, this may be one of the
first studies doing so. Our findings also indicate that the ML models
reported herein should have potential for practical applications in
the search for new hCA IX-selective inhibitors. Therefore,
as a part of our study, trained RF and SVM models are made available
upon request.
Authors: Eric F Pettersen; Thomas D Goddard; Conrad C Huang; Gregory S Couch; Daniel M Greenblatt; Elaine C Meng; Thomas E Ferrin Journal: J Comput Chem Date: 2004-10 Impact factor: 3.376
Authors: Alfonso Maresca; Claudia Temperini; Lionel Pochet; Bernard Masereel; Andrea Scozzafava; Claudiu T Supuran Journal: J Med Chem Date: 2010-01-14 Impact factor: 7.446