Predicting Isoform-Selective Carbonic Anhydrase Inhibitors via Machine Learning and Rationalizing Structural Features Important for Selectivity.

Salvatore Galati¹,², Dimitar Yonchev¹, Raquel Rodríguez-Pérez¹, Martin Vogt¹, Tiziano Tuccinardi², Jürgen Bajorath¹.

Abstract

Carbonic anhydrases (CAs) catalyze the physiological hydration of carbon dioxide and are among the most intensely studied pharmaceutical target enzymes. A hallmark of CA inhibition is the complexation of the catalytic zinc cation in the active site. Human (h) CA isoforms belonging to different families are implicated in a wide range of diseases and of very high interest for therapeutic intervention. Given the conserved catalytic mechanisms and high similarity of many hCA isoforms, a major challenge for CA-based therapy is achieving inhibitor selectivity for hCA isoforms that are associated with specific pathologies over other widely distributed isoforms such as hCA I or hCA II that are of critical relevance for the integrity of many physiological processes. To address this challenge, we have attempted to predict compounds that are selective for isoform hCA IX, which is a tumor-associated protein and implicated in metastasis, over hCA II on the basis of a carefully curated data set of selective and nonselective inhibitors. Machine learning achieved surprisingly high accuracy in predicting hCA IX-selective inhibitors. The results were further investigated, and compound features determining successful predictions were identified. These features were then studied on the basis of X-ray structures of hCA isoform-inhibitor complexes and found to include substructures that explain compound selectivity. Our findings lend credence to selectivity predictions and indicate that the machine learning models derived herein have considerable potential to aid in the identification of new hCA IX-selective compounds.

Year: 2021 PMID: 33585783 PMCID: PMC7876851 DOI: 10.1021/acsomega.0c06153

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Human carbonic anhydrases (hCAs) are metalloenzymes that catalyze the reversible hydration of carbon dioxide, producing bicarbonate with the release of a proton.[1] Among the eight genetically distinct CA families (α, β, γ, δ, ζ, η, θ, and ι), 15 α-CA isoforms are known in humans, i.e., hCA I–hCA XIV, which include two V-type isoforms (hCA VA and hCA VB) that differ in cellular distribution and functions. These metalloenzymes are involved in numerous physiological processes such as pH regulation, CO2 homeostasis, bone resorption, and gluconeogenesis.[2] Due to the wide spectrum of physiological roles played by CAs, they have been shown to be involved in different diseases such as glaucoma, obesity, osteoporosis, various types of tumors, epilepsy, and neuropathic pain. Therefore, hCAs are regarded as important therapeutic targets, and hCA modulators are recognized as promising agents for clinical applications.[3] Among the different hCA isoforms, hCA IX and XII are predominantly found in tumor cells and show a rather limited distribution in normal cells. Both isoforms are multidomain trans-membrane proteins with an extracellular CA domain and were demonstrated to participate in the rather complex machinery of pH regulation.[4] In particular, the membrane-associated hCA IX is considered a tumor-associated protein due to its low level of expression in normal tissues and high overexpression in almost all hypoxic tumors, where it contributes to survival, proliferation, invasion, and metastasis of cancer cells.[5] For these reasons, the hCA IX isoform has attracted the attention of many researchers focusing their efforts on the development of potent hCA IX inhibitors. As a result, a plethora of inhibitors has been reported in the literature, with compounds mainly belonging to the sulfonamide, dithiocarbamate, coumarin, sulfocoumarin, sulfamate, and carboxylate classes.
Furthermore, an ongoing clinical trial (NCT03450018) is evaluating the sulfonamide inhibitor SLC-0111 in hCA IX-positive patients diagnosed with metastatic pancreatic ductal adenocarcinoma.[6] At present, many compounds that act as low nanomolar or subnanomolar hCA IX inhibitors are known; however, beyond inhibitory potency, a key feature that must be considered for a potential therapeutic application of these compounds is their selectivity against the other hCA isoforms and especially against hCA I/II, which are ubiquitously distributed and involved in key physiological processes.[7] This is particularly true for hCA II since it has the widest tissue distribution and is highly expressed in red blood cells.[8] Because most drugs are administered systemically and are membrane-permeable, hCA II is likely to sequester nonselective hCA inhibitors, reducing their circulating concentrations, decreasing their bioavailability for hCA IX, and thus limiting their exposure within tumors.[9] Overcoming the lack of selectivity for a specific hCA isoform represents the major challenge in the development of hCA inhibitors for therapy. The difficulty in finding a compound selective for a specific hCA isoform is due to the high sequence and structural homology shared by all hCA isoforms.[10] The large number of compounds reported in the literature and tested for their inhibitory activity against hCA IX and hCA II has prompted us to generate a database of compounds that are either selective for hCA IX over hCA II or nonselective. Machine learning (ML) was then applied to predict isoform-selective inhibitors, and compound features determining successful predictions were identified. The resulting feature patterns were further analyzed on the basis of X-ray structures of hCA-inhibitor complexes, revealing individual features that were directly implicated in isoform selectivity.

Materials and Methods

Compound Data Sets

Our compound collection was assembled from publicly available data extracted from the PubChem BioAssay database (accessed September 2019).[11] Compounds with measured potencies against hCA II and hCA IX (corresponding to UniProt IDs “P00918” and “Q16790”, respectively) were collected. In order to ensure homogeneous experimental conditions and inter-assay data consistency,[12] only assays originating from the laboratory of C. T. Supuran were considered, which amounted to a total of 1138 assays (PubChem AIDs) and 7121 compounds (CIDs). We intended to generate a comprehensive and intrinsically heterogeneous set of inhibitors covering different variants of inhibitory mechanisms, all of which were directed against the active site of hCA. Since we aimed at predicting isoform selectivity of hCA inhibitors and rationalizing these predictions, we considered it important to comprehensively analyze different types of inhibitors, which further challenged machine learning. Training and test instances were available for all types of inhibitors and considered in combination. Accordingly, the results were generalizable (and not confined to subsets of inhibitors). Only enzyme-inhibitor interactions for which numerically defined inhibition constants (Ki values) were available were considered. No Ki threshold was applied for inhibitors. If two Ki values were available for a compound, then preference was given to the one reported in the source publication. For compounds with three or more measurements, Ki values deviating by more than 25% from the calculated mean Ki were discarded, and the mean Ki value was recalculated and assigned as the final potency annotation. Applying these criteria resulted in a total of 2506 inhibitors tested against both hCA isoforms for which a subsequent selectivity analysis was carried out. For each ligand, a selectivity index (SI) was calculated as the difference between the measured negative logarithmic (pKi) values for hCA IX and hCA II. 
Hence, compounds with SI > 0.7, corresponding to at least a five-fold higher potency for hCA IX over hCA II, were categorized as selective hCA IX inhibitors. Conversely, compounds with SI ≤ 0.7 were classified as nonselective hCA inhibitors. This classification scheme yielded a data set of 870 hCA IX-selective and 1636 nonselective inhibitors.
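The curation and labeling rules described above can be illustrated with a minimal sketch. It assumes Ki values are given in nanomolar units (the text does not state the unit), and all function names are hypothetical helpers introduced here for illustration, not the authors' code. The sketch covers only the rule for three or more measurements and the SI threshold; the preference rule for two measurements depends on publication metadata and is not modeled.

```python
import math

def filtered_mean_ki(values, max_rel_dev=0.25):
    """Mean Ki after discarding values deviating by more than 25% from the mean.

    Mirrors the stated rule for compounds with three or more measurements.
    """
    if len(values) < 3:
        raise ValueError("rule applies to three or more measurements")
    mean = sum(values) / len(values)
    kept = [v for v in values if abs(v - mean) <= max_rel_dev * mean]
    if not kept:
        raise ValueError("all measurements deviate by more than the cutoff")
    return sum(kept) / len(kept)

def pki(ki_nanomolar):
    """Negative decadic logarithm of Ki (assumed unit: nM, converted to M)."""
    return -math.log10(ki_nanomolar * 1e-9)

def selectivity_index(ki_ca9_nm, ki_ca2_nm):
    """SI = pKi(hCA IX) - pKi(hCA II)."""
    return pki(ki_ca9_nm) - pki(ki_ca2_nm)

def classify(si, threshold=0.7):
    """Label a compound by the SI > 0.7 criterion."""
    return "selective" if si > threshold else "nonselective"

# Hypothetical compound: Ki = 5 nM against hCA IX, 50 nM against hCA II
si = selectivity_index(5.0, 50.0)
label = classify(si)
```

Because SI is a difference of logarithms, the threshold of 0.7 corresponds to a potency ratio of 10^0.7, which is approximately 5, matching the "at least five-fold" wording.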

Molecular Representation

Building ML models for distinguishing between selective and nonselective hCA inhibitors requires the use of molecular representations such as numerical descriptors or fingerprints. Therefore, for each compound, a modified version of the molecular graph-based (i.e., the stereochemically insensitive) extended connectivity fingerprint with bond diameter 4 (ECFP4)[13] was calculated using the Morgan fingerprint implementation of RDKit.[14] ECFPs account for specific atom environments (for ECFP4, those within a radius of two bonds around an atom), which are represented as hash values. In cheminformatic ML applications, ECFP4 has become a widely accepted standard representation for compounds with comparable or superior performance relative to other (fingerprint) descriptors.[15] The ECFP4 hash values for all unique atom environments in each data set compound were computed, resulting in a total of 6061 unique structural features. The hash value positions in the molecular feature vectors were then organized according to their frequency of occurrence in a descending manner. Hence, for each compound, the presence or absence of a specific structural feature determined whether its corresponding bit position in the 6061-dimensional molecular feature vector was set to 1 or 0. This procedure did not include any additional dimensionality reduction such as standard fingerprint “folding” into a predefined fixed-length vector and hence avoided potential bit collisions that may be caused by ambiguous feature-bit mappings. As a result, unambiguous reverse mapping of fingerprints to their corresponding structural features allowed for the visualization and assessment of the importance of individual features during ML classification.
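The construction of the unfolded, frequency-ordered feature vector can be sketched as follows. The sketch assumes that the atom-environment hash values of each compound have already been computed (for example, with a Morgan fingerprint of radius 2) and are supplied as sets of integers. The toy hash values and the function name are hypothetical, and "frequency" is interpreted here as the number of compounds containing a feature.

```python
from collections import Counter

def build_feature_matrix(compound_features):
    """Map per-compound sets of feature hashes to unfolded binary vectors.

    Returns the ordered feature list (most frequent first) and one binary
    row per compound. Each feature owns exactly one bit position, so no
    bit collisions can occur and every bit maps back to one feature.
    """
    counts = Counter(h for feats in compound_features for h in set(feats))
    # Descending frequency; ties are broken by hash value for determinism
    ordered = sorted(counts, key=lambda h: (-counts[h], h))
    position = {h: i for i, h in enumerate(ordered)}
    matrix = []
    for feats in compound_features:
        row = [0] * len(ordered)
        for h in feats:
            row[position[h]] = 1
        matrix.append(row)
    return ordered, matrix

# Toy example with made-up hash values for three compounds
toy = [{101, 202}, {101, 303}, {101, 202, 404}]
ordered, matrix = build_feature_matrix(toy)
```

In contrast to a folded fixed-length fingerprint, the vector length here equals the number of distinct features in the data set, which is why the data set described above yields a 6061-dimensional representation.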

Structural Organization

To identify analog series (ASs) formed by hCA inhibitors, data set compounds were subjected to bond fragmentation according to a set of retrosynthetic rules and organized into ASs using the compound-core relationship algorithm.[16] Accordingly, compounds containing the same structural core and different substituents were combined into an AS.

Machine Learning Methods

Random Forest

A random forest (RF) is a supervised ML algorithm that consists of a large number of individual decision trees forming an ensemble classifier. Each individual tree produces a class label prediction for a given data instance, and the final prediction outcome is determined by the majority class vote.[17] Class weight balancing was automatically inferred by the model as inversely proportional to the class label frequencies in the input data. All remaining hyperparameters were set to their default values in scikit-learn version 0.23.1.[18]
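Class weights that are inversely proportional to class frequencies can be computed as sketched below. The formula n_samples / (n_classes × class count) is the one scikit-learn documents for its "balanced" class weight mode; that the authors used exactly this option is an assumption based on the wording above. The function name is hypothetical, and the full data set counts are used only for illustration (in practice, weights would be derived from each training set).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class labels for the full data set: 870 selective (1), 1636 nonselective (0)
labels = [1] * 870 + [0] * 1636
weights = balanced_class_weights(labels)
```

With these weights, the minority (selective) class receives the larger weight, and the weighted class totals become equal, so neither class dominates tree construction by size alone.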

Support Vector Machine

A support vector machine (SVM) is a supervised ML algorithm that constructs a hyperplane or set of hyperplanes in a multidimensional feature space, which are used for classification or regression. In the case of classification, acceptable separation is achieved by a hyperplane having the largest distance to the nearest training data points of any class. Thus, maximizing the margin lowers the generalization error of the classifier.[19] Furthermore, kernel functions enable the algorithm to operate in a high-dimensional implicit feature space. Instead of explicitly computing the data coordinates in that space, the inner products of their pairwise projections are calculated. This approach is commonly referred to as the “kernel trick” and presents a computationally efficient alternative to explicit dimensionality expansion.[20] Accordingly, if linear separation via a hyperplane is not feasible in a given feature space, then the kernel trick facilitates implicit mapping of training compounds into a higher-dimensional feature space where linear separation might become feasible. Herein, the linear and Tanimoto kernels[21] were used, and the better performing kernel was selected for each model during internal cross validation (see below). The regularization parameter C determines the magnitude of error penalization and balances model performance in the training set and overfitting. During parameter optimization, C values 0.01, 0.1, 1, 10, and 100 were evaluated. SVM training was performed using scikit-learn version 0.23.1.
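As an illustration of the Tanimoto kernel, the following sketch computes pairwise kernel values for binary fingerprints represented as sets of feature identifiers. The function names and toy features are hypothetical, and returning 0 for two empty fingerprints is a convention chosen here. A matrix of this kind could be supplied to an SVM implementation that accepts precomputed kernels.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two feature sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def gram_matrix(rows, cols):
    """Pairwise Tanimoto kernel values between two lists of feature sets."""
    return [[tanimoto(r, c) for c in cols] for r in rows]

# Toy fingerprints with made-up feature identifiers
train = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
k_train = gram_matrix(train, train)
```

For training, the kernel matrix is square (training versus training compounds); for prediction, rows would correspond to test compounds and columns to training compounds.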

Cross Validation and Performance Measures

All ML calculations were carried out by applying a standard double cross validation procedure. First, the ECFP4 representations of selective and nonselective inhibitors were assigned classification labels of “1” and “0”, respectively. Then, the data set was recurrently divided 10 times by random sampling into 80% training and 20% test compounds. Calculation parameters specified above were optimized via internal five-fold cross validation on the training set, and the best performing parameter settings were used for test set predictions. Based on the predictions from the 10 independent external cross validation trials, the following measures were computed in order to evaluate model performance: balanced accuracy (BA),[22] F1 score,[23] and Matthews correlation coefficient (MCC),[24] defined as follows:

$$\mathrm{BA} = \frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right)$$

$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$$

$$\mathrm{MCC} = \frac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$$

TP, TN, FP, and FN abbreviate true positives, true negatives, false positives, and false negatives, respectively. Here, TP + FN corresponds to the total number of selective compounds; conversely, FP + TN corresponds to the total number of nonselective compounds. As indicated by its formula, MCC takes into account all values of the confusion matrix derived from binary classification. It has the range [−1,1] where MCC = 1 represents a perfect classification (with no FP and FN), MCC = 0 is equivalent to random classification, and MCC = −1 indicates complete disagreement between predicted and actual class labels. BA accounts for the fraction of correct predictions while taking data imbalance into account through equal weighting of both classes. This was another appropriate measure for our analysis because the compound data set contained approximately twice as many nonselective (negative class) as selective (positive class) compounds. BA has the range [0,1] with BA = 1 describing perfect, BA = 0.5 random, and BA = 0 completely inaccurate classifications, respectively. The F1 score is a composite measure representing the harmonic mean of precision and recall.
It strongly emphasizes TP values without taking TN values into consideration. High F1 values indicate good model performance. In addition, receiver operating characteristic (ROC) curves[25] were computed to compare the TP rate ([0, 1], y-axis) to the FP rate ([0, 1], x-axis) at different classification thresholds, and the area under the ROC curve (AUROC) was determined.[25] In a ROC curve, the diagonal line is equivalent to a random class prediction and yields an AUROC value of 0.5. Increasing AUROC values between 0.5 and 1 are indicative of increasing model performance, with the value of 1.0 representing a perfect prediction. Furthermore, to ensure statistically sound comparisons of individual inhibitors, we also required that each inhibitor was predicted as a test set compound in at least five different external cross validation trials. This criterion was met after 26 trials, and for each selective and nonselective inhibitor, “model prediction consistency” (MPC) was calculated as follows:

$$\mathrm{MPC} = \frac{n_{\mathrm{correct}}}{n_{\mathrm{test}}} \times 100\%$$

where $n_{\mathrm{test}}$ is the number of trials in which the compound was part of the test set and $n_{\mathrm{correct}}$ is the number of those trials in which it was correctly classified. Accordingly, an MPC value of 100% indicated that a compound was consistently correctly classified. Conversely, an MPC value of 0% resulted from consistently incorrect classification in each trial.
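The performance measures discussed in this section can be computed from confusion matrix counts as in the following sketch. BA, F1, and MCC follow their standard definitions. The MPC function assumes that MPC is the percentage of test set appearances in which a compound was correctly classified, which is how the text describes its limiting values. Function names and the example flags are hypothetical.

```python
import math

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the TP rate (selective) and the TN rate (nonselective)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; TN does not enter."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mpc(correct_flags):
    """Percentage of test set appearances with a correct prediction."""
    return 100.0 * sum(correct_flags) / len(correct_flags)

# Hypothetical compound that appeared in five test sets, misclassified once
consistency = mpc([True, True, False, True, True])
```

A confusion matrix with equal counts in all four cells gives BA = 0.5 and MCC = 0, the values quoted above for random classification.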

Feature Weighting and Frequency Analysis

To identify individual structural features determining the classification, corresponding feature weights (FWs) were extracted from SVM models. For FW extraction, two previously introduced methods using the Tanimoto kernel were applicable.[26,27] In addition to their numerical values, FWs were assigned positive or negative signs depending on their relative importance for predicting a specific class label. According to this definition, the SVM model assigned a positive sign to a feature if its presence predominantly determined the prediction of selective inhibitors. In contrast, a feature with a negative sign predominantly determined the identification of nonselective inhibitors. The corresponding frequency distributions for selective and nonselective compounds, respectively, were defined as

$$F_i^{\mathrm{sel}} = \frac{n_i^{\mathrm{sel}}}{N^{\mathrm{sel}}}, \qquad F_i^{\mathrm{nonsel}} = \frac{n_i^{\mathrm{nonsel}}}{N^{\mathrm{nonsel}}}$$

where $n_i^{\mathrm{sel}}$ and $n_i^{\mathrm{nonsel}}$ are the numbers of selective and nonselective compounds containing feature $i$, and $N^{\mathrm{sel}}$ and $N^{\mathrm{nonsel}}$ are the total numbers of selective and nonselective compounds. In order to determine whether the presence of a given structural feature was more important for the identification of selective or nonselective inhibitors, a frequency difference value ΔF was calculated for each feature $i$:

$$\Delta F_i = F_i^{\mathrm{sel}} - F_i^{\mathrm{nonsel}}$$

Thus, features with positive ΔF values were preferentially found in selective hCA IX inhibitors, whereas negative ΔF values indicated features that preferentially occurred in nonselective inhibitors. Furthermore, features that were exclusively found in selective or nonselective compounds were identified and prioritized.
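The frequency analysis can be sketched as follows, assuming that the frequency of a feature in a class is the fraction of that class's compounds containing it and that ΔF is the selective minus the nonselective frequency (consistent with the sign convention stated above). Compounds are represented as sets of feature identifiers; all names and toy values are hypothetical.

```python
def delta_f(selective, nonselective):
    """Frequency difference per feature: F_selective - F_nonselective."""
    features = set().union(*selective, *nonselective)
    result = {}
    for f in features:
        f_sel = sum(f in c for c in selective) / len(selective)
        f_non = sum(f in c for c in nonselective) / len(nonselective)
        result[f] = f_sel - f_non
    return result

def exclusive_features(selective, nonselective):
    """Features found only in selective or only in nonselective compounds."""
    in_sel = set().union(*selective)
    in_non = set().union(*nonselective)
    return in_sel - in_non, in_non - in_sel

# Toy data: two selective and four nonselective compounds
sel = [{209, 7}, {209, 8}]
non = [{7, 9}, {9}, {7, 9}, {9}]
diffs = delta_f(sel, non)
only_sel, only_non = exclusive_features(sel, non)
```

In this toy case, feature 209 occurs in every selective and no nonselective compound (ΔF = 1), whereas feature 7 is equally frequent in both classes (ΔF = 0) and therefore carries no class preference.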

Analysis of X-ray Structures

Key features determining predictions were further analyzed on the basis of publicly available X-ray structures of hCA II and hCA IX in complex with inhibitors. A total of 488 hCA II, 11 hCA IX, and 59 hCA IX-mimicking (mutated) proteins in complex with unique inhibitors, many of which were contained in our data set (Table 1), were obtained from the RCSB Protein Data Bank (accessed September 2020).[28] An hCA IX-mimicking protein contains the original hCA II isoform active site engineered by site-directed mutagenesis to represent the wild-type hCA IX isoform by introducing relevant residue replacements. These replacements included A65S, N67Q, E69T, I91L, F131V, K170E, and L204A (hCA II sequence numbering).[29] Further analysis revealed that compounds in four of the hCA IX and 30 of the hCA IX-mimicking structures were also cocrystallized with the hCA II isoform. Thus, these compounds provided a meaningful basis for studying different binding modes and specific interactions associated with hCA IX/hCA II selectivity, taking into account structural features that determine ML predictions. Superpositions of X-ray structures were obtained using UCSF Chimera.[30]

Table 1

X-ray Structures^a

target	PDB entries	unique inhibitors	contained in the ML data set	shared by isoforms	shared in the ML data set
hCA II	811	488	93	34	12
hCA IX	20	11	4	4	2
hCA IX-mimic	94	59	15	30	10

^a Reported are X-ray structures of hCA-inhibitor complexes evaluated in our analysis. For example, from the PDB, 811 structures of hCA II-inhibitor complexes were retrieved, which contained 488 unique inhibitors, 93 of which were contained in our data set for ML. Thirty-four of these inhibitors were found in complex structures of all three hCA isoforms, and 12 of these shared inhibitors were contained in our data set.

Results and Discussion

Inhibitors and Analog Series

Initially, the data set of selective and nonselective inhibitors for ML was structurally organized. It was found to contain 328 ASs with two or more compounds, comprising 1748 inhibitors, or ∼70% of the 2506 data set compounds. ASs comprised only hCA IX-selective inhibitors (48 series), only nonselective inhibitors (163 series), or both selective and nonselective inhibitors (117 “mixed” series). Figure 1 shows that these different AS categories displayed similar size distributions, with a clear dominance of small series with fewer than five compounds. Only small numbers of larger ASs comprising up to 25 compounds were detected. Hence, there was no global or category-centered bias in AS composition toward small numbers of large series, which might limit predictive modeling or conclusions drawn from such investigations. However, as revealed by the presence of 117 mixed ASs, many selective and nonselective inhibitors displayed close structural relationships, which principally challenged the prediction of selective inhibitors. Furthermore, there were essentially twice as many nonselective as selective inhibitors available (applying a moderate SI > 0.7 criterion), which reflected the inherent difficulties in obtaining isoform-selective hCA inhibitors, as described above. Rather than balancing the number of compounds with different class labels (positive/selective or negative/nonselective) for training, which generally favors ML predictions, we preferred retaining this intrinsic imbalance, thus attempting predictions under realistic data conditions. Taken together, in light of the statistical and structural characteristics of the inhibitor data set, the selectivity prediction task was considered challenging.

Figure 1

Distribution of analog series. Histograms report the size distributions of ASs exclusively consisting of nonselective (red) or selective (green) inhibitors or combining both nonselective and selective compounds (mixed series, orange). The three histograms are shown on the same scale.

Prediction of Selective Inhibitors

We then attempted to systematically predict hCA IX-selective inhibitors in cross validation trials. Contrary to our expectations, generally high prediction accuracy was achieved for both RF and SVM models and on the basis of all performance measures, as summarized in Figure 2. The performance of RF and SVM classification was very similar, with only little variation over different trials. With median F1 values of ∼0.75, median BA of >0.8, and AUROC values close to 0.9, the predictions consistently yielded reasonable to high accuracy, as further indicated by median MCC values of ∼0.6. We also assessed the predictions at the level of ASs, which mirrored the structural organization of test data. As reported in Table 2, 31 of 48 ASs exclusively comprising selective inhibitors were consistently correctly predicted (MPC = 100%), corresponding to 173 of 219 compounds contained in selective ASs. Moreover, 145 of 163 ASs exclusively consisting of nonselective inhibitors were consistently correctly predicted, including 508 of 549 nonselective inhibitors. Overall, 83% of ASs consisting of either only selective or nonselective inhibitors were always correctly predicted. Hence, assessing the predictions at the level of ASs further confirmed their global accuracy.

Figure 2

Prediction accuracies. Boxplots report prediction accuracy over 10 independent RF (blue) and SVM (orange) trials using different training and test sets. From the left to the right, results are shown for the F1, BA, AUROC, and MCC measures. Boxplots show the smallest value (lower whisker), lower quartile (lower boundary of the box), median (vertical line in the box), upper quartile (upper boundary of the box), and the maximum value (upper whisker). Values classified as statistical outliers are represented as diamonds.

Table 2

Prediction of Analog Series^a

class	subset	compounds	analog series
selective	total data set	219	48
selective	MPC_selective = 100%	173	31
nonselective	total data set	559	163
nonselective	MPC_nonselective = 100%	508	145

^a Reported are ASs exclusively consisting of selective and nonselective inhibitors and their subsets that were consistently correctly predicted (MPC = 100%).

Feature Relevance Analysis

In light of the observed accuracy, we further assessed the predictions by exploring structural features that were responsible for the predictions. In ML, diagnostic approaches are still rare but essential for rationalizing successful predictions or failures. Given the equivalence of the results obtained for RF and SVM classification and the consistently better predictive performance of the Tanimoto over the linear kernel, we focused the analysis on SVM calculations, for which feature weighting approaches were applicable (see Materials and Methods). Accordingly, we determined ECFP4 features with positive and negative SVM weights contributing to the correct prediction of selective and nonselective inhibitors, respectively, and searched for contributing features that exclusively occurred in selective or nonselective compounds. Figure 3 shows that large numbers of features were identified that contributed with varying weights to positive or negative predictions and exclusively occurred in selective and nonselective inhibitors, respectively. As indicated by generally low ΔF values, exclusive features typically only occurred in small subsets of compounds. Hence, there were no distinguishing features that could be generalized, consistent with the structural heterogeneity of selective and nonselective compounds, as revealed by their partitioning into many different ASs of mostly small size. Furthermore, most of the exclusive features had absolute weights <0.10, and comparably few features with absolute weights >0.15 were detected. While many features contributed to meaningful SVM predictions, the latter features largely determined correct predictions of selective or nonselective inhibitors. Figure 4 shows the top 10 features with the largest weights that exclusively occurred in selective inhibitors and thus made the most important contribution to the prediction of selectivity.
These ECFP4 features defined different structural fragments that occurred in test compounds as substructures. Notably, these features included two distinct sulfonamide-containing substructures (numbers 1590 and 5500). As discussed above, the sulfonamide group complexing the catalytic zinc cation in the active site is a hallmark of many potent hCA inhibitors and is contained in both selective and nonselective inhibitors. Thus, the presence or absence of a sulfonamide group alone is insufficient to distinguish between selective and nonselective inhibitors. Rather, the way in which sulfonamide is embedded in substructures/compounds or specific feature combinations in which it occurs might contribute to the prediction of selective inhibitors. Furthermore, the two top ranked features in Figure 4 were features 209 and 212, which delineated overlapping substructures and had the largest weights and by far the highest ΔF value among positive features, as shown in Figure 3 (where features 209 and 212 are highlighted). Thus, these two features accounting for similar structural fragments made overall the most important contributions to the predictions of selective inhibitors.

Figure 3

Distribution of exclusive features. The scatterplot shows the distribution of ECFP4 features that are exclusively found in nonselective (orange) or selective (green) inhibitors. Each dot represents a unique feature. The relative frequency of occurrence of a feature in nonselective or selective compounds (ΔFrequency; nonselective < 0, selective > 0) is plotted against the mean feature weight from SVM classification. Negative and positive weights represent contributions to the prediction of nonselective and selective compounds, respectively. Two features with the largest ΔFrequency values and the highest weights are highlighted (red box, upper right corner).

Figure 4

Exclusive features. Shown are the top 10 features with the largest SVM weights that exclusively occurred in selective inhibitors (ordered from upper left, top 1, to lower right, top 10). Features 209 and 212 are highlighted in Figure 3.

Feature Mapping

The keyed design of the feature fingerprint with 1:1 bit-to-feature correspondence made it possible to map key features to structures of test compounds. Therefore, we searched for ASs exclusively comprising selective inhibitors that were consistently correctly predicted and contained features 209 and/or 212. Several ASs were identified. Figure 5A shows an exemplary series of sulfocoumarin derivatives in which both features were present and formed a substructure covering most of the sulfocoumarin core. These compounds are potent and selective inhibitors of the tumor-associated hCA IX and hCA XII isoforms.[31] Of note, coumarin and sulfocoumarin derivatives can act by complex mechanisms. These compounds are known to undergo hydrolysis upon binding to the catalytic site of hCAs. However, prior to hydrolysis, they bind within the hCA active site similarly to phenols, i.e., by anchoring to the zinc-bound water molecule/hydroxide ion,[32] as confirmed by an X-ray structure of 2-thioxocoumarine in complex with hCA II.[33] This recognition mechanism was intentionally included in our ML analysis, yielding promising results.

Figure 5

(A,B) Feature mapping. In (A), feature 209 from Figure 4 is mapped (red) on exemplary analogs from a selective AS with MPC_selective = 100%. In (B), members of another selective AS with MPC_selective = 100% are shown. Features with the highest SVM weights are mapped on the analogs. For a compound from this AS, X-ray structures of complexes with hCA II and hCA IX were available.

In Figure 5B, another selective AS is shown in which features making the largest contributions to consistently correct predictions were mapped on individual analogs containing them. All of these features delineated extended terminal pyridyl or substituted phenyl ring systems distant from the sulfonamide moiety. Thus, in both cases, key features for correct predictions defined coherent substructures of corresponding regions of analogs, which provided a basis for interpreting predictions.

Relating Important Features to Selectivity

Feature weighting and mapping identified a number of features that determined accurate SVM predictions of hCA IX-selective inhibitors. However, although these features made major contributions to ML predictions, it could not be concluded that they were implicated in or responsible for selectivity. Structural features determining predictions may or may not be of biological relevance, the assessment of which goes beyond ML analysis. Hence, the question whether substructures defined by the most important features we identified were indeed implicated in inhibitor selectivity required additional analysis.

Structure-Based Analysis

To address this question, we searched for selective inhibitors for which X-ray structures of complexes with hCA II and hCA IX or hCA IX-mimics were available. Such structures provided a basis for viewing mapped key features in light of enzyme-inhibitor interactions and exploring potential differences implicated in selectivity. Among the large number of publicly available hCA isoform X-ray structures (Table 1), a limited number of suitable hCA II/IX structures with selective inhibitors we predicted were identified and compared. A particularly instructive example was obtained by comparing X-ray structures of hCA II and hCA IX-mimicking protein in complex with the hCA IX-selective inhibitor SLC-0111 that belongs to the series in Figure 5B (PDB entries 3N4B and 5JN3, respectively). The binding mode of SLC-0111 in the hCA II and hCA IX-mimic structures is shown in Figure 6A and B, respectively. As observed for all members of the corresponding AS, the feature with the highest positive SVM weight mapped to the terminal ring (in this case, a 4-fluorophenyl moiety) distant from the sulfonamide group complexing the catalytic zinc ion. In both complexes, the benzenesulfonamide fragment position was superimposable and interacted with the catalytic zinc ion; in the hCA II structure, the N,N′-ureic portion of the ligand adopted a less stable cis/trans conformation with the 4-fluorophenyl moiety that interacted with residue P201. This orientation resulted from steric hindrance between the 4-fluorophenyl moiety and the phenyl ring of the F131 side chain. The phenylalanine residue was not conserved in hCA IX, where it was replaced by a smaller valine residue (V262). Figure 7 shows the corresponding sequence alignment. This substitution led to the absence of steric hindrance between the protein and the 4-fluorophenyl moiety of the ligand.
As a consequence, the inhibitor was able to maintain a more stable trans/trans N,N′-ureic conformation with a strong lipophilic interaction between 4-fluorophenyl and V262. By contrast, suboptimal interactions in this region of hCA II resulted in a loss of potency of the inhibitor compared to hCA IX and hence in selectivity of the compound for hCA IX over hCA II. The importance of inhibitor interactions with residue 131 in hCA isoforms has also been pointed out in the literature,[34] providing corroborating evidence. These considerations were equally applicable to most of the other analogs comprising the hCA IX-selective series in Figure 5B, which was consistently correctly predicted. In all instances, features with the highest positive weights determining the predictions were mapped to the corresponding ring structures, which were implicated in selectivity-determining interactions with hCA isoforms. Therefore, in this case, features that determined ML predictions were directly implicated in critical enzyme-inhibitor interactions determining compound selectivity and thus biologically relevant.

Figure 6

Structure-based analysis. X-ray structures of SLC-0111 in complex with (A) hCA II and (B) hCA IX-mimic forms. (Upper section) The catalytic zinc cation interacting with the sulfonamide moiety of the inhibitor is depicted as a sphere; (lower section) an interaction map is shown for the inhibitor and amino acid residues lining the active site (green, lipophilic residues; sky blue, polar residues; red, charged residues).

Figure 7

Sequence alignment of hCA II and hCA IX. Shown is the alignment of the hCA II (CAH2_HUMAN) and hCA IX (CAH9_HUMAN) amino acid sequences taken from UniProt.[35] Binding site residues are highlighted in yellow, and nonconserved residues participating in the formation of the binding site are shown in bold. Identical residues are indicated with “*”, while conservative residue replacements are marked with “:” and “.”.
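The comparison underlying the sequence alignment can be sketched in a few lines: given two aligned sequences and a set of binding-site columns, report the columns at which the residues differ. The short fragments and column indices below are hypothetical toy data, not the real CAH2_HUMAN or CAH9_HUMAN sequences, and the sketch marks only identical positions ("*"); the ":" and "." symbols for conservative replacements would additionally require residue similarity groups.

```python
from typing import List, Set, Tuple


def identity_line(aligned_a: str, aligned_b: str) -> str:
    """Return a marker line with '*' under identical, non-gap columns."""
    return "".join(
        "*" if a == b and a != "-" else " "
        for a, b in zip(aligned_a, aligned_b)
    )


def nonconserved_sites(
    aligned_a: str, aligned_b: str, site_columns: Set[int]
) -> List[Tuple[int, str, str]]:
    """Return (column, residue_a, residue_b) for binding-site columns that differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return [
        (col, aligned_a[col], aligned_b[col])
        for col in sorted(site_columns)
        if aligned_a[col] != aligned_b[col]
    ]


# Hypothetical aligned fragments ('-' is a gap); 0-based alignment columns.
FRAGMENT_A = "GLF-HW"
FRAGMENT_B = "GLVAHW"
SITE_COLUMNS = {1, 2, 5}  # assumed binding-site columns for this toy example

markers = identity_line(FRAGMENT_A, FRAGMENT_B)
differences = nonconserved_sites(FRAGMENT_A, FRAGMENT_B, SITE_COLUMNS)
```

Here the only nonconserved binding-site column is the toy phenylalanine-to-valine replacement, which mirrors the kind of F130/V262 difference discussed above.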

Conclusions

Predicting target-selective compounds typically represents a challenging task. In this work, we have attempted to predict inhibitors with selectivity for the tumor-associated hCA IX isoform over the ubiquitous hCA II isoform via ML. Surprisingly accurate and robust predictions were obtained using RF and SVM models, including many selective or nonselective ASs that were consistently correctly predicted, lending credence to the computational approach. These rather encouraging findings prompted us to further analyze the predictions. SVM feature weight analysis revealed numerous features that exclusively occurred in selective or nonselective compounds and contributed to positive and negative predictions. Highly weighted features were found to map to corresponding regions in ASs, hence rationalizing origins of successful predictions. For selectivity analysis and compound design, signature features of compound selectivity are of prime interest. However, there is no guarantee that features that make large contributions to or determine positive ML predictions are indeed biologically relevant. Therefore, we have gone a step further and evaluated important features on the basis of X-ray structures of complexes formed by the hCA IX and hCA II isoforms and selective inhibitors. For an exemplary selective AS, comparisons of corresponding X-ray structures revealed that features determining correct predictions defined substructures of inhibitors that were involved in selectivity-conferring interactions, thus establishing proof-of-principle. Demonstrating biological relevance of distinguishing features identified by ML is far from being routine, and to our knowledge, this may be one of the first studies doing so. Our findings also indicate that the ML models reported herein should have potential for practical applications in the search for new hCA IX-selective inhibitors. Therefore, as a part of our study, trained RF and SVM models are made available upon request.

References (26 in total)

1. S Lindskog. Purification and properties of bovine erythrocyte carbonic anhydrase. Biochim Biophys Acta, 1960-04-08.

2. Eric F Pettersen; Thomas D Goddard; Conrad C Huang; Gregory S Couch; Daniel M Greenblatt; Elaine C Meng; Thomas E Ferrin. UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem, 2004-10.

3. Marta Ferraroni; Fabrizio Carta; Andrea Scozzafava; Claudiu T Supuran. Thioxocoumarins Show an Alternative Carbonic Anhydrase Inhibition Mechanism Compared to Coumarins. J Med Chem, 2015-12-29.

4. Claudiu T Supuran. Carbonic anhydrases: novel therapeutic applications for inhibitors and activators (review). Nat Rev Drug Discov, 2008-02.

5. Alfonso Maresca; Claudia Temperini; Lionel Pochet; Bernard Masereel; Andrea Scozzafava; Claudiu T Supuran. Deciphering the mechanism of carbonic anhydrase inhibition with coumarins and thiocoumarins. J Med Chem, 2010-01-14.

6. UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res, 2011-11-18.

7. J Jesús Naveja; Martin Vogt; Dagmar Stumpfe; José L Medina-Franco; Jürgen Bajorath. Systematic Extraction of Analogue Series from Large Compound Collections Using a New Computational Compound-Core Relationship Method. ACS Omega, 2019-01-14.

8. Giulio Poli; Salvatore Galati; Adriano Martinelli; Claudiu T Supuran; Tiziano Tuccinardi. Development of a cheminformatics platform for selectivity analyses of carbonic anhydrase inhibitors. J Enzyme Inhib Med Chem, 2020-12.

9. Sereina Riniker; Gregory A Landrum. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminform, 2013-05-30.
