Literature DB >> 30023828

Computational Analysis of LOX1 Inhibition Identifies Descriptors Responsible for Binding Selectivity.

Chrysoula Gousiadou¹, Irene Kouskoumvekaki¹.

Abstract

Lipoxygenases are a family of cytosolic, peripheral membrane enzymes, which catalyze the hydroperoxidation of polyunsaturated fatty acids and are implicated in the pathogenesis of major human diseases. Over the years, a substantial number of scientific reports have introduced inhibitors active against one or another subtype of the enzyme, but the selectivity issue has proved to be a major challenge for drug design. In the present work, we assembled a dataset of 317 structurally diverse molecules hitherto reported as active against 15S-LOX1, 12S-LOX1, and 15S-LOX2 and identified, using supervised machine learning, a set of structural descriptors responsible for the binding selectivity toward the enzyme 15S-LOX1. We subsequently incorporated these descriptors in the training of QSAR models for LOX1 activity and selectivity. The best performing classifiers are two stacked models that include an ensemble of support vector machine, random forest, and k-nearest neighbor algorithms. These models not only can predict LOX1 activity/inactivity but also can discriminate with high accuracy between molecules that exhibit selective activity toward either one of the isozymes 15S-LOX1 and 12S-LOX1.

Entities: CellLine Chemical Disease Gene Species

Year: 2018 PMID： 30023828 PMCID： PMC6044675 DOI： 10.1021/acsomega.7b01622

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Human lipoxygenases are a structurally related family of cytosolic, peripheral membrane enzymes, which catalyze the hydroperoxidation of polyunsaturated fatty acids producing leukotrienes, lipoxins, and/or hydroxy fatty acids (arachidonic acid cascade).[1−4] These products play important roles in the development of inflammation, and over the years, an accumulating number of scientific reports emphatically involves LOXs in the pathogenesis of almost all the diseases with major health relevance (bronchial asthma, atherosclerosis, cancer, obesity, osteoporosis, and neurodegenerative disorders).[5−13] As a result, lipoxygenase (LOX) research is a vital scientific area today with more than 500 new articles published annually.[2] Corresponding to the genes of the human ortholog, LOXs are named ALOX15, ALOX15B, ALOX12, ALOX12B, and ALOX5.[1] ALOX12B and ALOX15B are mainly expressed in the skin and other epithelial cells, whereas ALOX15, ALOX12, and ALOX5 are expressed in hematopoietic/immune cells.[13] LOX enzymes have considerable molecular mass (75–81 kDa) and share highly conserved structural features, as well as the unique topology of the catalytic (C-terminal) domain. The C-terminal domain contains both the catalytically active nonheme iron and the substrate-binding cavity.[14] Studies of various complexes with different inhibitors have found the latter in this location.[15−21] The natural substrate for human LOXs is arachidonic acid.[14,22] With respect to their stereo and positional specificity of arachidonic acid oxygenation, the conventional nomenclature classifies human LOXs as 5S-LOX1 (ALOX5), 12R-LOX1 (ALOX12B), 12S-LOX1 (ALOX12), 15S-LOX1 (ALOX15), and 15S-LOX2 (ALOX15B).[14,22] Currently, research is focused on the biological relevance of the different LOX-isoforms.[23] Functional and isoform multiplicity as well as heterogeneity of the different isoenzymes are confusing issues.[2,23] As a consequence, the arising “selectivity” theme has made the discovery of LOX inhibitors increasingly challenging for drug design. Structural studies on these proteins reveal that although their three-dimensional (3D) structure is highly conserved, their sequences share low identity and show both similarities and striking differences at the active site.[19,24] These differences are held responsible for substrate selectivity and are believed to be the key for the design of isoform-selective inhibitors.[23,24] To date, there is only one FDA-approved drug on the market (Zileuton, 5-LOX1).[25] In the present investigation, our main focus has been the inhibition of 15S-LOX1. This enzyme is reported to play a crucial role in obesity, since 15S-LOX1 expression is directly related with the proliferation and hypertrophy of adipose cells.[50] Furthermore, 15S-LOX1 promotes cancer by amplifying PPARγ transcription activity, and therefore, 15-LOX1 inhibitors may be suitable chemotherapy agents in the near future.[50] Over the last years, both industrial and academic laboratories have been working intensely toward the discovery of potent and selective 15S-LOX1 inhibitors,[5,29−34] but it is noteworthy that to date, not one of these compounds has been approved for therapeutic usage.[50] Computational methods[26−28] have been employed for the discovery and design of suitable inhibitors, natural[5,29,30] or synthetic[31−34] molecules interacting strongly and specifically with the protein. A relatively small number of data-driven models have been generated, concerning mostly the inhibition of 5S-LOX1, using small datasets of compounds with structural similarity.[35−38] These models, albeit statistically valid, have a limited scope and are unlikely to accurately predict chemical structures different from those with which they have been trained. In this work, we introduce the training of stacked classification models built on a dataset of structurally diverse molecules hitherto reported as active against three LOX subtypes (15S-LOX1, 12S-LOX1, and 15S-LOX2). The chemical structures were retrieved from review papers[50] and reports[51−79] as well as publicly available data sets[80−91] (hundreds to thousands of compounds tested for inhibition on LOX isozymes). The resulting dataset contains a large number of compounds with a wide range of molecular weight (165–597) belonging to different chemical families. Thus, the algorithms were provided with (a) a sufficient size of samples, which is essential for any machine learning analysis, and (b) a large amount of biological activity end points for a wide range of chemical compounds. Most importantly, the inhibitory activity toward two other human LOX subtypes besides 15-LOX1 (12S-LOX1 and 15S-LOX2) has been taken into consideration.[84−87] The resulting stacked quantitative structure–activity relationship (QSAR) models have high discriminatory ability and yield statistically valid predictions on the selective interaction of various chemicals with 15S-LOX1.

Results and Discussion

Data Preprocessing and Variable Selection

Initially, noninformative descriptors were removed, that is, all variables with missing values and zero variance (zero values for all ligands). This process reduced the number of descriptors to 236. The dataset was split randomly into an explicit training set (75%, 239 molecules) and a validation set (25%, 78 molecules) used to evaluate the performance of the models on unseen data. Each one of the subsets was a balanced representation of both the chemical structures and the classes contained in the initial dataset. An exploratory analysis using unsupervised methods was carried out on the training set to acquire information necessary for deciding the best modeling strategy. As many of the descriptors (134) were highly correlated (>0.75), attempts to include them all in the analysis led to models with poor performance. Ideally, it is desirable to have a reduced set of uncorrelated, nonredundant, and informative descriptors that would allow us to build interpretable prediction models. Therefore, we undertook to reduce the initial number of variables and evaluate the effects on the accuracy of the classifiers. First, we employed principal component analysis.[96] A preprocessing of the data was performed, and the descriptor variables were mean centered and scaled to unit variance. The scree plot (Figure a), where the percentage of the cumulative variance is explained, shows that the first five principal components explained 54% of the total variation in the data. The scatterplot matrix for components 1–5, where the points are colored by class (Figure ), reveals a poor separation between the classes and reflects the low percentage of variance explained by these components. On the whole, the exploratory analysis revealed that the separation between the two classes would be challenging. This challenge we decided to address with powerful, highly nonlinear models as will be described in the following section (QSAR modeling). By applying a criterion of a minimum value (80%) of cumulative percent of variance accounted for in our data, we selected the first 20 principal components to be used as variables for building a series of machine learning models for classifying the dataset into actives and inactives. These classifiers had moderate performance, and it became clear that alternative methods for variable reduction were needed to increase the accuracy and to reduce the complexity of the models. To this end, we employed two supervised methods, which evaluated the importance of the descriptors and subsequently selected a suitable subset of variables to be used for the modeling.

Figure 1

Figure 2

Scatterplot matrix by class for principal components 1–5: the plot reveals a poor separation between the classes and reflects the low percentage of variance (54%) explained by these components.

(a) Scree plot explaining the % percentage of the cumulative variance in the data. The first five principal components explain 54% of the total variation. (b) Breaks in the data corresponding to components 1–5. Scatterplot matrix by class for principal components 1–5: the plot reveals a poor separation between the classes and reflects the low percentage of variance (54%) explained by these components. Our first approach was to use the area under the receiver operating characteristic (ROC) curve to quantify the relevance of the descriptors (caret package, varImp algorithm).[97] The descriptor variables were used as inputs into the ROC curve. If a descriptor could perfectly separate the classes, there would be a cutoff for that descriptor that would achieve sensitivity and specificity of 1, and the area under the curve would be one. The query resulted in a set of 20 uncorrelated descriptors ranked according to their importance (Figure ).

Figure 3

Variable selection using the area under the ROC curve: a set of 20 uncorrelated descriptors are ranked according to their importance.

Variable selection using the area under the ROC curve: a set of 20 uncorrelated descriptors are ranked according to their importance. Our second approach was a simple backward selection of descriptors, that is,—recursive feature elimination with random forest (RF)[98] (caret package, RFE algorithm). RF applied a resampling method of 10-fold cross-validation for the selection of the descriptors and produced a set of 84 variables ranked according to accuracy. The top 5 variables were HybRatio, XLogP, nHBAcc, BCUTc.1h, and ALogP. Both approaches took into consideration the response variable (active = 1 and inactive = −1) when evaluating the importance of the selected descriptors (supervised method). There was a high consensus between the two methods with regard to the relevance of the descriptors, as 18 out of 20 descriptors that resulted from the ROC curve were included in the set of 84 proposed by RF, albeit in different order. Using as a starting point the descriptors common to both methods, we created a series of subsets by gradually including more variables from the list of 84. With these subsets we built a number of classifiers and compared their accuracy. A careful evaluation of the results highlighted a subset of 37 most relevant descriptors, which were subsequently used for the analysis. The selected variables optimized the accuracy of the models. Any attempt to increase the number of descriptors beyond that point had a negative effect on the model performance.

QSAR Modeling

Initially, a number of machine learning models were built on the training set using different algorithms with default parameters, to choose those that would fit our data best (Table ). For this process, we used the caret package in R. We chose both linear and nonlinear algorithms on the basis of their diversity of learning style, which included classification and regression trees (CARTs),[99] linear discriminant analysis (LDA),[100] support vector machines (SVMs) with radial basis function,[101] k-nearest neighbors (KNNs),[102] RFs,[103] and gradient boosting machines (GBMs).[104] The evaluation metrics used were “accuracy” and “kappa”. The generated models had different performance characteristics. A 10-fold cross-validation resampling method with 20 repeats was employed to get an estimate of the accuracy with which each model could predict unseen data. A summary table was created containing the evaluation metrics for each model (Table ). As can be seen, the mean accuracy across the board was rather low, which implied that the classes in the dataset could not be easily predicted. SVMs and RFs showed comparable performance and had the highest accuracy on this classification problem (68%), whereas KNNs were the weakest classifiers (56%). Both SVMs and RFs are powerful modeling methods and highly nonlinear functions of the descriptor variables.

Table 1

Summary Table Containing the Evaluation Metrics of Diverse Linear and Nonlinear Algorithms Used for the Data Analysis

	min.	mean	max.
Accuracy
CART	0.375	0.600	0.870
LDA	0.468	0.671	0.848
SVM	0.468	0.679	0.906
KNN	0.322	0.557	0.741
RF	0.515	0.680	0.838
GBM	0.451	0.668	0.870
Kappa
CART	–0.250	0.198	0.742
LDA	–0.062	0.342	0.695
SVM	–0.062	0.356	0.812
KNN	–0.364	0.113	0.483
RF	0.029	0.359	0.677
GBM	–0.100	0.335	0.741

The density plot of the distribution of the estimated accuracy (Figure ) showed an overlap in the behavior of the algorithms. Differences in the spread and peaks of the distributions provided further information on the model performance. The distributions of the classifiers KNN, SVM, and RF were depicted as normal distributions (bell-shaped). In the cases of SVM (light green) and RF (orange), the bells were steep (small standard deviation), indicating that the data were tightly clustered around the mean accuracy value (∼68%, Table ). This practically meant that the largest part of the data would be predicted with accuracy close to the mean value. On the other hand, the normal distribution of KNN (dark green) was depicted as a flattened bell curve (large standard deviation), indicating that the data were spread out, and therefore, a large part would be predicted with accuracy away from the mean value (55%, Table ).

Figure 4

Density plot showing the distribution of the estimated accuracy of various algorithms as depicted in Table . KNN (dark green) is the weakest classifier with the lowest mean accuracy (55%), whereas RF (orange) and SVM (light green) are the strongest with mean accuracy 68 & 67%, respectively. Based on the acquired information, we selected the more promising SVM and RF methods to build a new series of models with improved accuracy. To this end, parameters were properly tuned for the two algorithms. SVMs are particularly sensitive to the values of “gamma” (the width of the kernel function used for mapping the data into the high-dimensional space) and “C” (the error penalty parameter).[105,106] We optimized the parameters by performing a grid search. RF is a less sensitive machine learning technique, protected against overfitting,[103] but good parameter choices give robust models. The parameters with the biggest effect on the final accuracy of RF are “ntree” (the number of trees to grow) and “mtry” (the number of variables randomly sampled as candidates at each split). For calculating these parameters, we crafted a parameter search by creating a series of RF models and comparing their accuracy. The whole process resulted in a shortlist of SVM and RF classifiers with an improved performance (Table ). For estimating the accuracy of the models, we applied a 10-fold cross-validation with 20 repeats, and the observed total error was within a range of 11–23%. Subsequently, the ability to predict unseen data was evaluated for each model using the external validation dataset of 78 molecules. The total error of the models increased, ranging from 22 to 27% with a corresponding accuracy range of 78 to 73% (Table ).

Table 2

Evaluation of the Model Performancea

A: evaluation of the model performance with k-fold cross-validation [training set, class: 122 (“–1”), 117 (“1”)]
models	false positive	false negative	total error %	model summary
SVM1	16	25	17.2	k-fold = 3, nrepeat = 20, cost = 4, gamma = 0.01, support vectors (sv) = 174 (86, 88)
SVM3	1	54	23.0	k = 10, nrepeat = 20, cost = 16, gamma = 0.01, sv = 174 (103,71), class.weights (“1” = 0.3, “–1” = 0 0.7)
SVM3(b)	33	12	18.8	k = 10, nrepeat = 20, cost = 4, gamma = 0.01, sv = 181 (103,78), class.weights (“1” = 0.3, “–1” = 0.7)
SVM3(c)	20	7	11.3	k = 10, nrepeat = 20, cost = 16, gamma = 0.01, sv = 168 (75, 93), class.weights (“1” = 0.3, “–1” = 0.7)
ΚΝΝ1	37	48	35.5	k-fold = 10, nrepeat = 20, k-neighbors = 9
RF1	0	0	OOB: 35.2	importance = 148, mtry = 4, ntree = 1000
RF2	0	0	OOB: 34.7	importance = 148, mtry = 3, ntree = 2000

RF4: SVM1 + RF1 RF5: SVM1 + RF2 RF7: SVM1 + SVM3 + RF1 + RF2 RF8*: SVM1 + SVM3 + RF1 + RF2 + KNN1 RF9*: SVM1 + SVM3(b) + RF1 + RF2 + KNN1 RF10*: RF2 + SVM1 + SVM3(c) + SVM3(b) + KNN1.

RF4: SVM1 + RF1 RF5: SVM1 + RF2 RF7: SVM1 + SVM3 + RF1 + RF2 RF8*: SVM1 + SVM3 + RF1 + RF2 + KNN1 RF9*: SVM1 + SVM3(b) + RF1 + RF2 + KNN1 RF10*: RF2 + SVM1 + SVM3(c) + SVM3(b) + KNN1. In an attempt to boost model accuracy, classifiers from the list were combined in ensemble predictions, resulting in high-order models that best combined the predictions of the base classifiers. To this end, we included in the shortlist the weak classifier KNN1, and we employed the method of stacking algorithms.[107] We used RF algorithm to combine the predictions. Combining KNN1 with the SVM and RF submodels resulted in an impressive improvement of the evaluation metrics of the stacked RFs (Table ). As can be seen (Table , Figure ), the base classifiers were not highly correlated (<0.75), an indication of them being informative in different ways, allowing the high-order learner to get the best of each model. Two powerful new learners, RF9 and RF10, were thus created (Table ), (Supporting Information, Table S2). The ROC curve (Figure ) illustrating the diagnostic ability of the stacked model RF10 clearly shows optimized performance compared to the submodels combined to build it.

Figure 5

Figure 6

ROC curves illustrating the diagnostic abilities of the high-order learner RF10 (red) and the basic classifiers RF2 (black), SVM1 (pink), SVM3(c) (orange), SVM3(b) (yellow), and KNN1 (gray). RF10 is clearly shown to have an optimized performance compared to the submodels combined to build it.

Scatterplot matrix correlating the predictions from the algorithms KNN, SVM, and RF used in the analysis (Table , model correlation). The classifiers are not strongly correlated (<0.75); an indication of them being informative in different ways and, therefore, suited to be combined in ensemble models. ROC curves illustrating the diagnostic abilities of the high-order learner RF10 (red) and the basic classifiers RF2 (black), SVM1 (pink), SVM3(c) (orange), SVM3(b) (yellow), and KNN1 (gray). RF10 is clearly shown to have an optimized performance compared to the submodels combined to build it. The two learners RF9 and RF10 can be used in combination for the 15S-LOX1 classification problem (Table ), (Supporting Information, Table S2). RF9 is a strong classifier with very high sensitivity (0.9250) reflected in the small number of molecules incorrectly classified as active (3 false positive). This would imply that should this classifier be used for any future prediction on the biological activity of molecules toward 15S-LOX1, the chances for inactive molecules to pass the filter would be very low. On the other hand, RF9 has a somewhat lower specificity (0.7368) and falsely classifies as inactive more molecules compared to RF10. Model RF10 is a stronger classifier and highly successful in identifying true 15S-LOX1 inhibitors. Most importantly, both learners are able to discriminate accurately between molecules showing selectivity toward either one of the enzymes 15S-LOX1 and 12S-LOX1 (Supporting Information, Table S2). This discrimination ability of the stacked models was decisively helpful in our attempt to explore the selectivity issue later on.

Interpretability of the Models

In this investigation, apart from the attempt to model chemical structures that are active against the LOX enzymes in general, the focus has been to explore the underlying properties (and their combinations) responsible for selective activity toward 15S-LOX1. The topology of the catalytic domain is shared by all LOX isozymes. The binding pocket is lined by hydrophobic residues except for second shell residues near the catalytic iron.[19] In a previous study, where in an attempt to explore LOX1 inhibition, we generated a pharmacophore model, we found that the assembly of the pharmacophore features strongly pointed toward a hydrophobic active site, which is indeed the case for all LOX subtypes.[2,3,108] However, although the 3D structure of the enzyme’s subtypes is highly conserved, LOX isozymes share low sequence identity.[19,24] A sequence alignment of the five human LOX isozymes generated with UniProt[109] revealed that the highest identity between sequences is observed for 15S-LOX1 and 12S-LOX1 (65.5%), whereas 15S-LOX2, 12R-LOX1, and 5S-LOX1 share an identity of 39% or lower with 15S-LOX1 (Supporting Information). As a result of the low sequence identity, apart from the apparent similarities, these enzymes exhibit striking differences at the active site. The size and nature of the residues present at the active site are crucial for determining both the volume of the binding pocket and the type of interactions that would ensure the binding of any chemical with the protein.[19] In positions 410, 415, 593, 597, and 716 of the sequence alignment, we can see the five residues that coordinate the iron with octahedral geometry (iron ligands) (Supporting Information). Four of them are conserved for all the enzymes (H410, H415, H593, and I716). Different amino acids occupy the position 597 in the sequences of 15S-LOX1 and 12S-LOX1, that is, the electrically charged histidine (H) and the polar uncharged asparagine (N), respectively. Mutation experiments have indicated the vital importance of these five iron ligands for the enzymes’ activity [UniProtKB (P16050, O15296, P18054, O75342, and P09917)].[109] Numerous scientific reports have also considered the significance of crucial amino acids in other positions, as is the case with the nonconserved residues in positions 467 and 468 of 15S-LOX1 and 12S-LOX1 sequences, respectively.[19] The residues A467/V468 that occupy these positions in 12S-LOX1 are considerably smaller in size compared to I467/M468 in 15S-LOX1. This difference is responsible for both the increase of volume (∼6%) in the binding pocket of 12S-LOX1 and for the change in shape at the bottom of the active site.[19] In conclusion, based on the studies concerning the structure and the positional specificity of the LOX enzymes, we should expect that differences in the nature and geometry of crucial residues would affect the selective binding of ligands with the protein. In this work, we undertook to explain the biological activity of the molecules with the help of their structural descriptors. This might seem like a challenge difficult to address without including in the analysis chemical information on the protein structure, as would be the case with a proteochemometric approach.[110] Nevertheless, previous knowledge of the nature of the enzyme proved to be a key ingredient for the decisions about model development because it allowed for an informed use of the data. It laid the foundation for predictive models with good accuracy that helped clarify certain aspects of this problem. The high-order learner RF10 was able to classify correctly the molecules as active (1) and inactive (−1) toward 15S-LOX1 to a high degree (Supporting Information, Table S2). Furthermore, it has been possible to address the selectivity issue, since RF10 could successfully discriminate those molecules which, although being active toward other LOX isozymes (12S-LOX1), were inactive toward 15S-LOX1. We found these results intriguing, and therefore, we attempted to link the descriptors of the highest discriminatory affinity with the molecules’ ability to bind in a certain pocket and not in another. As RF10 was a stacked model combining the predictions of submodels, primarily SVMs and RFs, further knowledge of the influence of the variables on the submodels was highly desirable. However, a straightforward interpretation of SVM and RF models is not possible. Both are “black boxes” that do not reveal how they relate the descriptors to the response variable. Nevertheless, RF has an ensemble nature (bagging trees) and combines the predictions of a large number of decision trees that are used as base learners. This allows for a degree of transparency that enabled us to have a measure of the impact of the descriptors in the base classifiers RF1 and RF2. Therefore, for these classifiers only, it was possible to gain some insights into the importance of every variable in classifying the data, the decrease in accuracy of the RF models after the exclusion of a single variable (mean decrease accuracy), and the average gain of purity by splits of every given variable (mean decrease Gini) (Supporting Information, Table S3). Across the two supervised methods employed for evaluating the relevance of descriptors, the following were ranked very high: the number of hydrogen bond acceptors (nHBAcc), descriptors related to hydrophobicity (the octanol/water partition coefficients Xlog P and Alog P), two topological descriptors related to carbon hybridization (HybRatio and C2SP2), and a molecular descriptor related to partial charges (BCUTc.1h). Hydrophobicity and hydrogen bond acceptors were prominent pharmacophore features, and we had found this to be true in our previous investigation.[108] A molecule must be hydrophobic and possess lone electron pairs to bind with the protein. These results were generally in accord with the evaluation of descriptors performed later by the RF algorithms (RF1 & RF2) as part of the learning process, while the actual models were being generated (Supporting Information, Table S3), (Figure ).

Figure 7

Scatter plot matrix by class of the descriptors with highest relevance for the LOX classification analysis. Descriptor ranking was performed by the RF classifier RF2.

Scatter plot matrix by class of the descriptors with highest relevance for the LOX classification analysis. Descriptor ranking was performed by the RF classifier RF2. The classifiers identified the aforementioned variables as the most important, although in different order. In addition, three descriptors were ranked very highly by RF1 and RF2: (a) the topological weighted path descriptor WTPT.3, which is based on the molecular identification number and characterizes molecular branching. Descriptor WTPT.3 gives the sum of weighted paths starting from heteroatoms and[111] (b) the two valence chain descriptors VCH.6 and VCH.7.[95] Shown in Table are the values of the 10 top descriptors for a number of compounds from the validation set, which were correctly classified by RF10 (chemical structures of selective 15S-LOX1 (A) and 12S-LOX1 (B) inhibitors from Table are shown in Chart ). The ranking of the descriptors was performed by RF2, a submodel of RF10. The compounds are divided in four groups according to their inhibitory activity toward LOX, that is, selective 15S-LOX1 inhibitors (A), selective 12S-LOX1 inhibitors (B), nonselective inhibitors with strong preference for the 15S-LOX1 pocket (C), and inactive molecules (D) (Supporting Information, Table S1). At a first glance, an interesting observation could be made in regard to differences between the groups A and B, concerning the values of descriptors HybRatio and WTPT.3. Low values of HybRatio in combination with high values of WTPT.3 seemed to favor selectivity toward 12S-LOX1. On the other hand, ALogP seemed to be significant for the discrimination between the inhibitors in groups A, B, and C, and the inactive molecules in the group D. Large negative ALogP values were in most cases associated with inactivity toward the protein.

Table 3

Values of the 10 Top Descriptors (as Evaluated by RF2) for Selected Compounds from the Validation Set Correctly Classified by RF10b

indexa	nHBAcc	XLogP	ALogP	AMR	HybRatio	WTPT.3	BCUTc.1h	VCH.6	VCH.7	nAtomP	class
6^A	2	8.531	–1.655	76.42	0.850	5.316	0.261	0.068	0.048	3	1
7^A	7	2.885	–0.892	108.5	0.608	19.34	0.296	0.315	0.413	6	1
29^A	3	1.030	0.948	59.93	0.400	14.35	0.229	0.277	0.514	10	1
50^A	3	3.679	1.816	82.65	0.133	11.69	0.338	0.086	0.112	18	1
72^A	5	4.780	2.936	120.3	0.200	24.47	0.317	0.150	0.209	12	1
75^A	4	4.042	1.639	121.4	0.200	17.23	0.214	0.074	0.120	9	1
99^A	1	4.290	3.701	73.67	0.600	2.404	0.080	0	0	2	1
112^A	1	3.811	0.952	122.0	0.375	17.47	0.175	0.082	0.175	10	1
113^A	2	5.916	1.736	74.86	0.941	5.323	0.302	0.558	1.267	3	1
232^A	2	3.574	0.445	78.62	0	10.85	0.223	0.068	0.079	19	1

103^B	6	4.081	0.899	119.2	0.136	24.37	0.261	0.141	0.179	14	–1
127^B	5	2.989	0.405	123.9	0.095	26.26	0.248	0.117	0.135	20	–1
225^B	1	3.230	–0.173	82.13	0	21.20	0.276	0.060	0.107	23	–1

69^C	5	2.975	–0.262	100.3	0.235	21.09	0.249	0.265	0.361	19	1
106^C	0	3.193	0.631	76.21	0	10.07	0.126	0.048	0.070	18	1
168^C	3	7.323	0.216	146.0	0.285	17.45	0.240	0.157	0.209	16	1
172^C	3	5.293	1.527	135.9	0.120	17.83	0.240	0.205	0.278	18	1
228^C	0	2.785	0.216	82.54	0.250	10.98	0.139	0.078	0.118	9	1
234^C	0	5.773	1.375	103.1	0.636	8.498	0.129	0.122	0.453	11	1
236^C	0	4.817	1.991	87.99	0	15.76	0.135	0.044	0.167	15	1

257^D	5	4.322	–2.362	92.33	0.555	21.42	0.210	0.181	0.147	11	–1
259^D	5	5.337	1.957	125.6	0.272	23.87	0.211	0.118	0.146	9	–1
260^D	5	2.712	–0.550	80.64	0.333	19.52	0.163	0.139	0.138	10	–1
261^D	2	2.535	1.013	63.79	0	8.046	0.194	0.142	0.091	14	–1
262^D	5	0.973	–2.504	53.37	0.600	22.90	0.318	0.101	0.053	9	–1
263^D	2	1.885	–0.117	78.01	0.153	14.19	0.241	0.082	0.085	16	–1
264^D	6	3.087	–0.344	105.9	0.470	30.51	0.216	0.275	0.265	13	–1
268^D	6	2.238	–3.768	81.98	0.625	22.92	0.242	0.172	0.122	14	–1
277^D	8	3.873	–1.060	131.4	0.592	27.47	0.264	0.228	0.250	11	–1
281^D	7	2.635	–2.279	131.1	0.333	23.31	0.255	0.140	0.127	9	–1
282^D	7	2.527	–3.503	120.0	0.583	31.18	0.232	0.262	0.203	13	–1
286^D	5	2.224	–0.838	97.04	0.187	24.75	0.269	0.048	0.062	21	–1
287^D	7	3.143	–2.273	108.6	0.666	27.23	0.223	0.147	0.092	10	–1
293^D	8	1.711	–3.031	125.8	0.461	22.98	0.222	0.158	0.081	9	–1
297^D	4	5.833	2.786	119.9	0.318	18.96	0.217	0.217	0.360	11	–1
303^D	7	0.706	1.014	91.94	0.500	23.03	0.277	0.240	0.332	10	–1
312^D	5	2.059	–2.657	80.60	0.625	19.27	0.153	0.177	0.173	6	–1
313^D	5	3.122	–1.484	100.5	0.333	24.33	0.218	0.079	0.064	12	–1
314^D	8	0.977	–2.283	111.8	0.333	32.95	0.243	0.131	0.128	15	–1

Index numbers as they appear in the Supporting Information (Table S1)

Chemical structures of selective 15S-LOX1 (A) and 12S-LOX1 (B) inhibitors from this table are shown in Chart .

Chart 1

Chemical Structures of Selective 15S-LOX1 (A) and 12S-LOX1 (B) Inhibitors from Table

Index numbers as they appear in the Supporting Information (Table S1) Chemical structures of selective 15S-LOX1 (A) and 12S-LOX1 (B) inhibitors from this table are shown in Chart . However, as already mentioned, it is impossible to elucidate how RFs actually combined the descriptors and linked them to the response variable. This is even more the case for the support vector machine models used as base learners for the ensemble model RF10. Nevertheless, since our aim has been to acquire some elemental insights into how decisions were made, without necessarily knowing every detail of the full model, a simpler explanation could be sufficient. To this end, we used the rpart algorithm to create a single decision tree on the validation set using the 10 top descriptors as ranked by RF2. The decision path clarified which features were associated with every decision (Figure ). We see that 3 of the top 10 descriptors are the same, although with different ordering. ALogP is the top descriptor for the decision tree but is ranked 6th by RF2. There is a consensus regarding XLogP (second), whereas HybRatio is ranked third by the decision tree and fourth by RF. Interestingly, RF2 identifies nHBAcc as the top descriptor, whereas the decision tree ignores it. The differences observed in the ranking of descriptors between the RF and the single decision tree are not atypical as they are attributed to the greediness of the single tree.[112]

Figure 8

A single decision tree created on the validation set using the 10 top descriptors as ranked by the classifier RF2. The decision path clarifies which features are associated with every decision as well as the threshold values of the top descriptors that are responsible for a molecule being classified as active/inactive against 15-LOX1.

Conclusions

In the present study, we have created a dataset of structurally diverse molecules hitherto known as active against three LOX isoenzymes (15S-LOX1, 12S-LOX1, and 15S-LOX2). Subsequently, we considered a binary classification problem with the aim of exploring the selective activity of chemicals toward the enzyme 15S-LOX1. We identified a set of 37 structural descriptors of high discriminatory affinity, which are responsible for binding selectivity toward 15S-LOX1. Based on these descriptors, we created two highly accurate stacked models (RF9 and RF10) as ensembles of SVM, RF, and KNN algorithms that predict the selective interactions of various chemicals with 15S-LOX1. The models successfully classify the compounds as active/inactive against LOX1 and most importantly, they discriminate the molecules with selective activity toward either one of the isozymes 15S-LOX1 and 12S-LOX1. These QSAR classifiers can be used in drug discovery as computational filters in the virtual screening of novel LOX1 inhibitors, whereas the selected descriptors can be used as a guide for the design and synthesis of new 15S-LOX1 inhibitors.

Methods

Binary Classification and Model Generation

The present work is concerned with a binary classification problem studied in supervised machine learning.[39] Therefore, the response (output) variable is categorical, and the two categories are predefined according to the biological activity in vitro of various chemicals toward the enzyme 15S-LOX1, that is, active (1) and inactive (−1). QSAR modeling was carried out with R.[40] R is both a language and an environment for statistical computing and graphics. Extended functionalities were added to R by installing a number of packages, which are machine learning algorithms implemented as third party libraries.The following R packages were used for the analysis: rcdk,[41]SparseM,[42]randomForest,[43]caret,[44]e1071,[45]pROC,[46]rpart,[47]AppliedPredictiveModeling,[48] and caretEnsemble.[49]

Calculation of Molecular Descriptors

A total of 317 structurally diverse molecules with recorded biological activity in vitro toward three LOX subtypes (15S-LOX1, 12S-LOX1, and 15S-LOX2) were used for the classification analysis (Supporting Information, Table S1). The molecules have been collected from available public sources: (a) reviews and reports[50−79] and (b) the PubChem Bioassay database, where we found several assays concerning the biological activity of series of molecules toward LOX.[80−91] Special consideration was taken to avoid structural redundancy. The ligands were drawn in ACD/ChemSketch[92] (Advanced Chemical Development) and a single 3D conformation was created for each structure with the Bioclipse software.[93,94] An SDF file containing the 3D coordinates of the molecules was imported in R, and the rcdk package was used to calculate automatically a number of descriptor variables. These descriptors are divided broadly into three main groups,[95] that is, atomic, bond, and molecular and belong to the specific categories “topological”, “geometrical”, “hybrid”, “constitutional”, and “electronic”. The calculation resulted in 282 descriptors for each molecule.

IC50 Experimental Data

It is a common practice in in vitro experiments to use three different LOX subtypes[50−79] to draw conclusions regarding biological activity toward human 15S-LOX1: The human 15S-LOX1, the soybean 13LOX1, and the mammalian (rabbit reticulocyte) 12-/15-LOX1. Such extrapolations are valid on the grounds that the enzymes have high structural similarity, most importantly in the area of the binding site, and the mode of action of the small molecule inhibitors is similar.[4,13,14,19] Therefore, for the collection of data for the training of the classification models, we adopted the same rationale. Rabbit reticulocyte 12-/15-LOX1 has been characterized[19] and, as mentioned above, is often used as a standard for biochemical assays. Soybean 13LOX1 has been for many years a model system for understanding catalysis and structure of all lipoxygenases.[4] It is sufficiently stable, obtainable in large quantities, and relatively easy to purify.[4,19] In addition, the overall architecture of the soybean structure is similar to the mammalian, although the latter is much more compact.[19] Therefore, soybean 13LOX1 is used in bioassays for recording bioactivity toward 15S-LOX1, although the IC50 values of a molecule may be different for the two enzymes. In the present analysis, we have considered the molecules with IC50 ≥ 100 μM as being inactive. Also, as already mentioned, we have included a number of molecules, which, although inactive against 15S-LOX1, exhibit bioactivity toward other LOX isozymes. These molecules were considered to be inactive. Our aim was to train our models to distinguish subtle differences among chemical structures and to be able to predict those that showed selectivity for 15S-LOX1. In accordance to their inhibitory activity in vitro toward 15S-LOX1, the molecules were classified as follows: (a) active (1): selective 15S-LOX1 inhibitors and nonselective inhibitors with preference for the 15S-LOX1 pocket and (b) inactive (−1): molecules exhibiting selectivity toward two other LOX isozymes (12S-LOX1 and 15S-LOX2) but inactive against 15S-LOX1 and molecules inactive toward LOX in general. For all the molecules in the dataset, the IC50 values were recorded (Supporting Information, Table S1).

65 in total

Review 1. Zileuton.

Authors: K A McGill; W W Busse
Journal: Lancet Date: 1996-08-24 Impact factor: 79.321

Review 2. Arachidonate 12-lipoxygenases.

Authors: S Yamamoto; H Suzuki; N Ueda
Journal: Prog Lipid Res Date: 1997-03 Impact factor: 16.195

3. Role of 12/15-lipoxygenase in nitrosative stress and peripheral prediabetic and diabetic neuropathies.

Authors: Roman Stavniichuk; Viktor R Drel; Hanna Shevalye; Igor Vareniuk; Martin J Stevens; Jerry L Nadler; Irina G Obrosova
Journal: Free Radic Biol Med Date: 2010-06-22 Impact factor: 7.376

4. Minor iridoids from Scutellaria albida ssp. albida. Inhibitory potencies on lipoxygenase, linoleic acid lipid peroxidation and antioxidant activity of iridoids from Scutellaria sp.

Authors: Chrysoula Gousiadou; Charlotte H Gotfredsen; Marina Matsa; Dimitra Hadjipavlou-Litina; Helen Skaltsa
Journal: J Enzyme Inhib Med Chem Date: 2012-05-28 Impact factor: 5.051

5. Pyrazole-based sulfonamide and sulfamides as potent inhibitors of mammalian 15-lipoxygenase.

Authors: Khehyong Ngu; David S Weinstein; Wen Liu; Charles Langevine; Donald W Combs; Shaobin Zhuang; Xing Chen; Cort S Madsen; Timothy W Harper; Saleem Ahmad; Jeffrey A Robl
Journal: Bioorg Med Chem Lett Date: 2011-06-06 Impact factor: 2.823

6. Inhibitors of 15-lipoxygenase from orange peel.

Authors: K E Malterud; K M Rydland
Journal: J Agric Food Chem Date: 2000-11 Impact factor: 5.279

Review 7. 15-Lipoxygenases in cancer: a double-edged sword?

Authors: Adi J Klil-Drori; Amiram Ariel
Journal: Prostaglandins Other Lipid Mediat Date: 2013-08-08 Impact factor: 3.072

8. Discovery of a novel dual fungal CYP51/human 5-lipoxygenase inhibitor: implications for anti-fungal therapy.

Authors: Eric K Hoobler; Ganesha Rai; Andrew G S Warrilow; Steven C Perry; Christopher J Smyrniotis; Ajit Jadhav; Anton Simeonov; Josie E Parker; Diane E Kelly; David J Maloney; S L Kelly; Theodore R Holman
Journal: PLoS One Date: 2013-06-24 Impact factor: 3.240

9. Potent and selective inhibitors of human reticulocyte 12/15-lipoxygenase as anti-stroke therapies.

Authors: Ganesha Rai; Netra Joshi; Joo Eun Jung; Yu Liu; Lena Schultz; Adam Yasgar; Steve Perry; Giovanni Diaz; Qiangli Zhang; Victor Kenyon; Ajit Jadhav; Anton Simeonov; Eng H Lo; Klaus van Leyen; David J Maloney; Theodore R Holman
Journal: J Med Chem Date: 2014-05-13 Impact factor: 7.446

10. PubChem BioAssay: 2017 update.

Authors: Yanli Wang; Stephen H Bryant; Tiejun Cheng; Jiyao Wang; Asta Gindulyte; Benjamin A Shoemaker; Paul A Thiessen; Siqian He; Jian Zhang
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971