Chrysoula Gousiadou1, Irene Kouskoumvekaki1. 1. Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Lyngby, Denmark.
Abstract
Lipoxygenases are a family of cytosolic, peripheral membrane enzymes, which catalyze the hydroperoxidation of polyunsaturated fatty acids and are implicated in the pathogenesis of major human diseases. Over the years, a substantial number of scientific reports have introduced inhibitors active against one or another subtype of the enzyme, but the selectivity issue has proved to be a major challenge for drug design. In the present work, we assembled a dataset of 317 structurally diverse molecules hitherto reported as active against 15S-LOX1, 12S-LOX1, and 15S-LOX2 and identified, using supervised machine learning, a set of structural descriptors responsible for the binding selectivity toward the enzyme 15S-LOX1. We subsequently incorporated these descriptors in the training of QSAR models for LOX1 activity and selectivity. The best performing classifiers are two stacked models that include an ensemble of support vector machine, random forest, and k-nearest neighbor algorithms. These models not only can predict LOX1 activity/inactivity but also can discriminate with high accuracy between molecules that exhibit selective activity toward either one of the isozymes 15S-LOX1 and 12S-LOX1.
Lipoxygenases are a family of cytosolic, peripheral membrane enzymes, which catalyze the hydroperoxidation of polyunsaturated fatty acids and are implicated in the pathogenesis of major human diseases. Over the years, a substantial number of scientific reports have introduced inhibitors active against one or another subtype of the enzyme, but the selectivity issue has proved to be a major challenge for drug design. In the present work, we assembled a dataset of 317 structurally diverse molecules hitherto reported as active against 15S-LOX1, 12S-LOX1, and 15S-LOX2 and identified, using supervised machine learning, a set of structural descriptors responsible for the binding selectivity toward the enzyme 15S-LOX1. We subsequently incorporated these descriptors in the training of QSAR models for LOX1 activity and selectivity. The best performing classifiers are two stacked models that include an ensemble of support vector machine, random forest, and k-nearest neighbor algorithms. These models not only can predict LOX1 activity/inactivity but also can discriminate with high accuracy between molecules that exhibit selective activity toward either one of the isozymes 15S-LOX1 and 12S-LOX1.
Human
lipoxygenases are a structurally related family of cytosolic,
peripheral membrane enzymes, which catalyze the hydroperoxidation
of polyunsaturated fatty acids producing leukotrienes, lipoxins, and/or
hydroxy fatty acids (arachidonic acid cascade).[1−4] These products play important
roles in the development of inflammation, and over the years, an accumulating
number of scientific reports emphatically involves LOXs in the pathogenesis
of almost all the diseases with major health relevance (bronchial
asthma, atherosclerosis, cancer, obesity, osteoporosis, and neurodegenerative
disorders).[5−13] As a result, lipoxygenase (LOX) research is a vital scientific area
today with more than 500 new articles published annually.[2] Corresponding to the genes of the human ortholog,
LOXs are named ALOX15, ALOX15B, ALOX12, ALOX12B, and ALOX5.[1] ALOX12B and ALOX15B are mainly expressed in the
skin and other epithelial cells, whereas ALOX15, ALOX12, and ALOX5
are expressed in hematopoietic/immune cells.[13] LOX enzymes have considerable molecular mass (75–81 kDa)
and share highly conserved structural features, as well as the unique
topology of the catalytic (C-terminal) domain. The C-terminal domain
contains both the catalytically active nonheme iron and the substrate-binding
cavity.[14] Studies of various complexes
with different inhibitors have found the latter in this location.[15−21] The natural substrate for human LOXs is arachidonic acid.[14,22] With respect to their stereo and positional specificity of arachidonic
acid oxygenation, the conventional nomenclature classifies human LOXs
as 5S-LOX1 (ALOX5), 12R-LOX1 (ALOX12B),
12S-LOX1 (ALOX12), 15S-LOX1 (ALOX15),
and 15S-LOX2 (ALOX15B).[14,22] Currently, research is focused on the biological relevance of the
different LOX-isoforms.[23] Functional and
isoform multiplicity as well as heterogeneity of the different isoenzymes
are confusing issues.[2,23] As a consequence, the arising
“selectivity” theme has made the discovery of LOX inhibitors
increasingly challenging for drug design. Structural studies on these
proteins reveal that although their three-dimensional (3D) structure
is highly conserved, their sequences share low identity and show both
similarities and striking differences at the active site.[19,24] These differences are held responsible for substrate selectivity
and are believed to be the key for the design of isoform-selective
inhibitors.[23,24] To date, there is only one FDA-approved
drug on the market (Zileuton, 5-LOX1).[25]In the present investigation, our main focus has been the
inhibition
of 15S-LOX1. This enzyme is reported to play a crucial
role in obesity, since 15S-LOX1 expression is directly
related with the proliferation and hypertrophy of adipose cells.[50] Furthermore, 15S-LOX1 promotes
cancer by amplifying PPARγ transcription activity, and therefore,
15-LOX1 inhibitors may be suitable chemotherapy agents in the near
future.[50] Over the last years, both industrial
and academic laboratories have been working intensely toward the discovery
of potent and selective 15S-LOX1 inhibitors,[5,29−34] but it is noteworthy that to date, not one of these compounds has
been approved for therapeutic usage.[50] Computational
methods[26−28] have been employed for the discovery and design of
suitable inhibitors, natural[5,29,30] or synthetic[31−34] molecules interacting strongly and specifically with the protein.
A relatively small number of data-driven models have been generated,
concerning mostly the inhibition of 5S-LOX1, using
small datasets of compounds with structural similarity.[35−38] These models, albeit statistically valid, have a limited scope and
are unlikely to accurately predict chemical structures different from
those with which they have been trained.In this work, we introduce
the training of stacked classification
models built on a dataset of structurally diverse molecules hitherto
reported as active against three LOX subtypes (15S-LOX1, 12S-LOX1, and 15S-LOX2).
The chemical structures were retrieved from review papers[50] and reports[51−79] as well as publicly available data sets[80−91] (hundreds to thousands of compounds tested for inhibition on LOX
isozymes). The resulting dataset contains a large number of compounds
with a wide range of molecular weight (165–597) belonging to
different chemical families. Thus, the algorithms were provided with
(a) a sufficient size of samples, which is essential for any machine
learning analysis, and (b) a large amount of biological activity end
points for a wide range of chemical compounds.Most importantly,
the inhibitory activity toward two other human
LOX subtypes besides 15-LOX1 (12S-LOX1 and 15S-LOX2) has been taken into consideration.[84−87] The resulting stacked quantitative structure–activity relationship
(QSAR) models have high discriminatory ability and yield statistically
valid predictions on the selective interaction of various chemicals
with 15S-LOX1.
Results
and Discussion
Data Preprocessing and
Variable Selection
Initially, noninformative descriptors
were removed, that is, all
variables with missing values and zero variance (zero values for all
ligands). This process reduced the number of descriptors to 236. The
dataset was split randomly into an explicit training set (75%, 239
molecules) and a validation set (25%, 78 molecules) used to evaluate
the performance of the models on unseen data. Each one of the subsets
was a balanced representation of both the chemical structures and
the classes contained in the initial dataset. An exploratory analysis
using unsupervised methods was carried out on the training set to
acquire information necessary for deciding the best modeling strategy.
As many of the descriptors (134) were highly correlated (>0.75),
attempts
to include them all in the analysis led to models with poor performance.
Ideally, it is desirable to have a reduced set of uncorrelated, nonredundant,
and informative descriptors that would allow us to build interpretable
prediction models. Therefore, we undertook to reduce the initial number
of variables and evaluate the effects on the accuracy of the classifiers.First, we employed principal component analysis.[96] A preprocessing of the data was performed, and the descriptor
variables were mean centered and scaled to unit variance. The scree
plot (Figure a), where
the percentage of the cumulative variance is explained, shows that
the first five principal components explained 54% of the total variation
in the data. The scatterplot matrix for components 1–5, where
the points are colored by class (Figure ), reveals a poor separation between the
classes and reflects the low percentage of variance explained by these
components. On the whole, the exploratory analysis revealed that the
separation between the two classes would be challenging. This challenge
we decided to address with powerful, highly nonlinear models as will
be described in the following section (QSAR modeling). By applying
a criterion of a minimum value (80%) of cumulative percent of variance
accounted for in our data, we selected the first 20 principal components
to be used as variables for building a series of machine learning
models for classifying the dataset into actives and inactives. These
classifiers had moderate performance, and it became clear that alternative
methods for variable reduction were needed to increase the accuracy
and to reduce the complexity of the models. To this end, we employed
two supervised methods, which evaluated the importance of the descriptors
and subsequently selected a suitable subset of variables to be used
for the modeling.
Figure 1
(a) Scree plot explaining the % percentage of the cumulative
variance
in the data. The first five principal components explain 54% of the
total variation. (b) Breaks in the data corresponding to components
1–5.
Figure 2
Scatterplot matrix by
class for principal components 1–5:
the plot reveals a poor separation between the classes and reflects
the low percentage of variance (54%) explained by these components.
(a) Scree plot explaining the % percentage of the cumulative
variance
in the data. The first five principal components explain 54% of the
total variation. (b) Breaks in the data corresponding to components
1–5.Scatterplot matrix by
class for principal components 1–5:
the plot reveals a poor separation between the classes and reflects
the low percentage of variance (54%) explained by these components.Our first approach was to use
the area under the receiver operating
characteristic (ROC) curve to quantify the relevance of the descriptors
(caret package, varImp algorithm).[97] The descriptor variables were used as inputs
into the ROC curve. If a descriptor could perfectly separate the classes,
there would be a cutoff for that descriptor that would achieve sensitivity
and specificity of 1, and the area under the curve would be one. The
query resulted in a set of 20 uncorrelated descriptors ranked according
to their importance (Figure ).
Figure 3
Variable selection using the area under the ROC curve: a set of
20 uncorrelated descriptors are ranked according to their importance.
Variable selection using the area under the ROC curve: a set of
20 uncorrelated descriptors are ranked according to their importance.Our second approach was a simple
backward selection of descriptors,
that is,—recursive feature elimination with random forest (RF)[98] (caret package, RFE algorithm).
RF applied a resampling method of 10-fold cross-validation for the
selection of the descriptors and produced a set of 84 variables ranked
according to accuracy. The top 5 variables were HybRatio, XLogP, nHBAcc, BCUTc.1h, and ALogP. Both approaches
took into consideration the response variable (active = 1 and inactive
= −1) when evaluating the importance of the selected descriptors
(supervised method).There was a high consensus between the
two methods with regard
to the relevance of the descriptors, as 18 out of 20 descriptors that
resulted from the ROC curve were included in the set of 84 proposed
by RF, albeit in different order. Using as a starting point the descriptors
common to both methods, we created a series of subsets by gradually
including more variables from the list of 84. With these subsets we
built a number of classifiers and compared their accuracy. A careful
evaluation of the results highlighted a subset of 37 most relevant
descriptors, which were subsequently used for the analysis. The selected
variables optimized the accuracy of the models. Any attempt to increase
the number of descriptors beyond that point had a negative effect
on the model performance.
QSAR Modeling
Initially, a number
of machine learning models were built on the training set using different
algorithms with default parameters, to choose those that would fit
our data best (Table ). For this process, we used the caret package in
R. We chose both linear and nonlinear algorithms on the basis of their
diversity of learning style, which included classification and regression
trees (CARTs),[99] linear discriminant analysis
(LDA),[100] support vector machines (SVMs)
with radial basis function,[101] k-nearest
neighbors (KNNs),[102] RFs,[103] and gradient boosting machines (GBMs).[104] The evaluation metrics used were “accuracy”
and “kappa”. The generated models had different performance
characteristics. A 10-fold cross-validation resampling method with
20 repeats was employed to get an estimate of the accuracy with which
each model could predict unseen data. A summary table was created
containing the evaluation metrics for each model (Table ). As can be seen, the mean
accuracy across the board was rather low, which implied that the classes
in the dataset could not be easily predicted. SVMs and RFs showed
comparable performance and had the highest accuracy on this classification
problem (68%), whereas KNNs were the weakest classifiers (56%). Both
SVMs and RFs are powerful modeling methods and highly nonlinear functions
of the descriptor variables.
Table 1
Summary Table Containing
the Evaluation
Metrics of Diverse Linear and Nonlinear Algorithms Used for the Data
Analysis
min.
mean
max.
Accuracy
CART
0.375
0.600
0.870
LDA
0.468
0.671
0.848
SVM
0.468
0.679
0.906
KNN
0.322
0.557
0.741
RF
0.515
0.680
0.838
GBM
0.451
0.668
0.870
Kappa
CART
–0.250
0.198
0.742
LDA
–0.062
0.342
0.695
SVM
–0.062
0.356
0.812
KNN
–0.364
0.113
0.483
RF
0.029
0.359
0.677
GBM
–0.100
0.335
0.741
The density plot of the distribution
of the estimated accuracy
(Figure ) showed an
overlap in the behavior of the algorithms. Differences in the spread
and peaks of the distributions provided further information on the
model performance. The distributions of the classifiers KNN, SVM,
and RF were depicted as normal distributions (bell-shaped). In the
cases of SVM (light green) and RF (orange), the bells were steep (small
standard deviation), indicating that the data were tightly clustered
around the mean accuracy value (∼68%, Table ). This practically meant that the largest
part of the data would be predicted with accuracy close to the mean
value. On the other hand, the normal distribution of KNN (dark green)
was depicted as a flattened bell curve (large standard deviation),
indicating that the data were spread out, and therefore, a large part
would be predicted with accuracy away from the mean value (55%, Table ).
Figure 4
Density plot showing
the distribution of the estimated accuracy
of various algorithms as depicted in Table . KNN (dark green) is the weakest classifier
with the lowest mean accuracy (55%), whereas RF (orange) and SVM (light
green) are the strongest with mean accuracy 68 & 67%, respectively.
Density plot showing
the distribution of the estimated accuracy
of various algorithms as depicted in Table . KNN (dark green) is the weakest classifier
with the lowest mean accuracy (55%), whereas RF (orange) and SVM (light
green) are the strongest with mean accuracy 68 & 67%, respectively.Based on the acquired information,
we selected the more promising
SVM and RF methods to build a new series of models with improved accuracy.
To this end, parameters were properly tuned for the two algorithms.
SVMs are particularly sensitive to the values of “gamma”
(the width of the kernel function used for mapping the data into the
high-dimensional space) and “C” (the error penalty parameter).[105,106] We optimized the parameters by performing a grid search. RF is a
less sensitive machine learning technique, protected against overfitting,[103] but good parameter choices give robust models.
The parameters with the biggest effect on the final accuracy of RF
are “ntree” (the number of trees to grow) and “mtry”
(the number of variables randomly sampled as candidates at each split).
For calculating these parameters, we crafted a parameter search by
creating a series of RF models and comparing their accuracy.The whole process resulted in a shortlist of SVM and RF classifiers
with an improved performance (Table ). For estimating the accuracy of the models, we applied
a 10-fold cross-validation with 20 repeats, and the observed total
error was within a range of 11–23%. Subsequently, the ability
to predict unseen data was evaluated for each model using the external
validation dataset of 78 molecules. The total error of the models
increased, ranging from 22 to 27% with a corresponding accuracy range
of 78 to 73% (Table ).
Table 2
Evaluation of the Model Performancea
A: evaluation of
the model performance with k-fold cross-validation [training set, class: 122 (“–1”),
117 (“1”)]
RF4: SVM1 + RF1 RF5: SVM1
+ RF2 RF7: SVM1 + SVM3 + RF1
+ RF2 RF8*: SVM1 + SVM3 + RF1 + RF2 + KNN1 RF9*: SVM1 + SVM3(b) + RF1 + RF2 + KNN1 RF10*: RF2 + SVM1
+ SVM3(c) + SVM3(b) + KNN1.In an attempt to boost model accuracy, classifiers from the list
were combined in ensemble predictions, resulting in high-order models
that best combined the predictions of the base classifiers. To this
end, we included in the shortlist the weak classifier KNN1, and we
employed the method of stacking algorithms.[107] We used RF algorithm to combine the predictions. Combining KNN1
with the SVM and RF submodels resulted in an impressive improvement
of the evaluation metrics of the stacked RFs (Table ). As can be seen (Table , Figure ), the base classifiers were not highly correlated
(<0.75), an indication of them being informative in different ways,
allowing the high-order learner to get the best of each model. Two
powerful new learners, RF9 and RF10, were thus created (Table ), (Supporting Information, Table S2). The ROC curve (Figure ) illustrating the diagnostic ability of
the stacked model RF10 clearly shows optimized performance compared
to the submodels combined to build it.
Figure 5
Scatterplot matrix correlating
the predictions from the algorithms
KNN, SVM, and RF used in the analysis (Table , model correlation). The classifiers are
not strongly correlated (<0.75); an indication of them being informative
in different ways and, therefore, suited to be combined in ensemble
models.
Figure 6
ROC curves illustrating the diagnostic abilities
of the high-order
learner RF10 (red) and the basic classifiers RF2 (black), SVM1 (pink),
SVM3(c) (orange), SVM3(b) (yellow), and KNN1 (gray). RF10 is clearly
shown to have an optimized performance compared to the submodels combined
to build it.
Scatterplot matrix correlating
the predictions from the algorithms
KNN, SVM, and RF used in the analysis (Table , model correlation). The classifiers are
not strongly correlated (<0.75); an indication of them being informative
in different ways and, therefore, suited to be combined in ensemble
models.ROC curves illustrating the diagnostic abilities
of the high-order
learner RF10 (red) and the basic classifiers RF2 (black), SVM1 (pink),
SVM3(c) (orange), SVM3(b) (yellow), and KNN1 (gray). RF10 is clearly
shown to have an optimized performance compared to the submodels combined
to build it.The two learners RF9
and RF10 can be used in combination for the
15S-LOX1 classification problem (Table ), (Supporting Information, Table S2). RF9 is a strong classifier with very
high sensitivity (0.9250) reflected in the small number of molecules
incorrectly classified as active (3 false positive). This would imply
that should this classifier be used for any future prediction on the
biological activity of molecules toward 15S-LOX1,
the chances for inactive molecules to pass the filter would be very
low. On the other hand, RF9 has a somewhat lower specificity (0.7368)
and falsely classifies as inactive more molecules compared to RF10.
Model RF10 is a stronger classifier and highly successful in identifying
true 15S-LOX1 inhibitors. Most importantly, both
learners are able to discriminate accurately between molecules showing
selectivity toward either one of the enzymes 15S-LOX1
and 12S-LOX1 (Supporting Information, Table S2). This discrimination ability of the stacked models was
decisively helpful in our attempt to explore the selectivity issue
later on.
Interpretability of the Models
In
this investigation, apart from the attempt to model chemical structures
that are active against the LOX enzymes in general, the focus has
been to explore the underlying properties (and their combinations)
responsible for selective activity toward 15S-LOX1.
The topology of the catalytic domain is shared by all LOX isozymes.
The binding pocket is lined by hydrophobic residues except for second
shell residues near the catalytic iron.[19] In a previous study, where in an attempt to explore LOX1 inhibition,
we generated a pharmacophore model, we found that the assembly of
the pharmacophore features strongly pointed toward a hydrophobic active
site, which is indeed the case for all LOX subtypes.[2,3,108]However, although the
3D structure of the enzyme’s subtypes is highly conserved,
LOX isozymes share low sequence identity.[19,24] A sequence alignment of the five human LOX isozymes generated with
UniProt[109] revealed that the highest identity
between sequences is observed for 15S-LOX1 and 12S-LOX1 (65.5%), whereas 15S-LOX2, 12R-LOX1, and 5S-LOX1 share an identity of
39% or lower with 15S-LOX1 (Supporting Information). As a result of the low sequence identity, apart
from the apparent similarities, these enzymes exhibit striking differences
at the active site. The size and nature of the residues present at
the active site are crucial for determining both the volume of the
binding pocket and the type of interactions that would ensure the
binding of any chemical with the protein.[19] In positions 410, 415, 593, 597, and 716 of the sequence alignment,
we can see the five residues that coordinate the iron with octahedral
geometry (iron ligands) (Supporting Information). Four of them are conserved for all the enzymes (H410, H415, H593,
and I716). Different amino acids occupy the position 597 in the sequences
of 15S-LOX1 and 12S-LOX1, that is,
the electrically charged histidine (H) and the polar uncharged asparagine
(N), respectively. Mutation experiments have indicated the vital importance
of these five iron ligands for the enzymes’ activity [UniProtKB
(P16050, O15296, P18054, O75342, and P09917)].[109]Numerous scientific reports have also considered
the significance
of crucial amino acids in other positions, as is the case with the
nonconserved residues in positions 467 and 468 of 15S-LOX1 and 12S-LOX1 sequences, respectively.[19] The residues A467/V468 that occupy these positions
in 12S-LOX1 are considerably smaller in size compared
to I467/M468 in 15S-LOX1. This difference is responsible
for both the increase of volume (∼6%) in the binding pocket
of 12S-LOX1 and for the change in shape at the bottom
of the active site.[19] In conclusion, based
on the studies concerning the structure and the positional specificity
of the LOX enzymes, we should expect that differences in the nature
and geometry of crucial residues would affect the selective binding
of ligands with the protein.In this work, we undertook to explain
the biological activity of
the molecules with the help of their structural descriptors. This
might seem like a challenge difficult to address without including
in the analysis chemical information on the protein structure, as
would be the case with a proteochemometric approach.[110] Nevertheless, previous knowledge of the nature of the enzyme
proved to be a key ingredient for the decisions about model development
because it allowed for an informed use of the data. It laid the foundation
for predictive models with good accuracy that helped clarify certain
aspects of this problem. The high-order learner RF10 was able to classify
correctly the molecules as active (1) and inactive (−1) toward
15S-LOX1 to a high degree (Supporting Information, Table S2). Furthermore, it has been possible to address the selectivity
issue, since RF10 could successfully discriminate those molecules
which, although being active toward other LOX isozymes (12S-LOX1), were inactive toward 15S-LOX1.
We found these results intriguing, and therefore, we attempted to
link the descriptors of the highest discriminatory affinity with the
molecules’ ability to bind in a certain pocket and not in another.As RF10 was a stacked model combining the predictions of submodels,
primarily SVMs and RFs, further knowledge of the influence of the
variables on the submodels was highly desirable. However, a straightforward
interpretation of SVM and RF models is not possible. Both are “black
boxes” that do not reveal how they relate the descriptors to
the response variable. Nevertheless, RF has an ensemble nature (bagging
trees) and combines the predictions of a large number of decision
trees that are used as base learners. This allows for a degree of
transparency that enabled us to have a measure of the impact of the
descriptors in the base classifiers RF1 and RF2. Therefore, for these
classifiers only, it was possible to gain some insights into the importance
of every variable in classifying the data, the decrease in accuracy
of the RF models after the exclusion of a single variable (mean decrease
accuracy), and the average gain of purity by splits of every given
variable (mean decrease Gini) (Supporting Information, Table S3).Across the two supervised methods employed for
evaluating the relevance
of descriptors, the following were ranked very high: the number of
hydrogen bond acceptors (nHBAcc), descriptors related to hydrophobicity
(the octanol/water partition coefficients Xlog P and
Alog P), two topological descriptors related to carbon
hybridization (HybRatio and C2SP2), and a molecular descriptor related
to partial charges (BCUTc.1h). Hydrophobicity and hydrogen bond acceptors
were prominent pharmacophore features, and we had found this to be
true in our previous investigation.[108] A
molecule must be hydrophobic and possess lone electron pairs to bind
with the protein. These results were generally in accord with the
evaluation of descriptors performed later by the RF algorithms (RF1
& RF2) as part of the learning process, while the actual models
were being generated (Supporting Information, Table S3), (Figure ).
Figure 7
Scatter plot matrix by class of the descriptors with highest relevance
for the LOX classification analysis. Descriptor ranking was performed
by the RF classifier RF2.
Scatter plot matrix by class of the descriptors with highest relevance
for the LOX classification analysis. Descriptor ranking was performed
by the RF classifier RF2.The classifiers identified the aforementioned variables as
the
most important, although in different order. In addition, three descriptors
were ranked very highly by RF1 and RF2: (a) the topological weighted
path descriptor WTPT.3, which is based on the molecular identification
number and characterizes molecular branching. Descriptor WTPT.3 gives
the sum of weighted paths starting from heteroatoms and[111] (b) the two valence chain descriptors VCH.6
and VCH.7.[95]Shown in Table are the values of the 10 top
descriptors for a number of compounds
from the validation set, which were correctly classified by RF10 (chemical
structures of selective 15S-LOX1 (A) and
12S-LOX1 (B) inhibitors from Table are shown in Chart ). The ranking of
the descriptors was performed by RF2, a submodel of RF10. The compounds
are divided in four groups according to their inhibitory activity
toward LOX, that is, selective 15S-LOX1 inhibitors
(A), selective 12S-LOX1 inhibitors (B), nonselective
inhibitors with strong preference for the 15S-LOX1
pocket (C), and inactive molecules (D) (Supporting Information, Table S1). At a first glance, an interesting observation
could be made in regard to differences between the groups A and B,
concerning the values of descriptors HybRatio and WTPT.3. Low values
of HybRatio in combination with high values of WTPT.3 seemed to favor
selectivity toward 12S-LOX1. On the other hand, ALogP seemed to be significant for the discrimination between
the inhibitors in groups A, B, and C, and the inactive molecules in
the group D. Large negative ALogP values were in
most cases associated with inactivity toward the protein.
Table 3
Values of the 10 Top Descriptors (as
Evaluated by RF2) for Selected Compounds from the Validation Set Correctly
Classified by RF10b
indexa
nHBAcc
XLogP
ALogP
AMR
HybRatio
WTPT.3
BCUTc.1h
VCH.6
VCH.7
nAtomP
class
6A
2
8.531
–1.655
76.42
0.850
5.316
0.261
0.068
0.048
3
1
7A
7
2.885
–0.892
108.5
0.608
19.34
0.296
0.315
0.413
6
1
29A
3
1.030
0.948
59.93
0.400
14.35
0.229
0.277
0.514
10
1
50A
3
3.679
1.816
82.65
0.133
11.69
0.338
0.086
0.112
18
1
72A
5
4.780
2.936
120.3
0.200
24.47
0.317
0.150
0.209
12
1
75A
4
4.042
1.639
121.4
0.200
17.23
0.214
0.074
0.120
9
1
99A
1
4.290
3.701
73.67
0.600
2.404
0.080
0
0
2
1
112A
1
3.811
0.952
122.0
0.375
17.47
0.175
0.082
0.175
10
1
113A
2
5.916
1.736
74.86
0.941
5.323
0.302
0.558
1.267
3
1
232A
2
3.574
0.445
78.62
0
10.85
0.223
0.068
0.079
19
1
103B
6
4.081
0.899
119.2
0.136
24.37
0.261
0.141
0.179
14
–1
127B
5
2.989
0.405
123.9
0.095
26.26
0.248
0.117
0.135
20
–1
225B
1
3.230
–0.173
82.13
0
21.20
0.276
0.060
0.107
23
–1
69C
5
2.975
–0.262
100.3
0.235
21.09
0.249
0.265
0.361
19
1
106C
0
3.193
0.631
76.21
0
10.07
0.126
0.048
0.070
18
1
168C
3
7.323
0.216
146.0
0.285
17.45
0.240
0.157
0.209
16
1
172C
3
5.293
1.527
135.9
0.120
17.83
0.240
0.205
0.278
18
1
228C
0
2.785
0.216
82.54
0.250
10.98
0.139
0.078
0.118
9
1
234C
0
5.773
1.375
103.1
0.636
8.498
0.129
0.122
0.453
11
1
236C
0
4.817
1.991
87.99
0
15.76
0.135
0.044
0.167
15
1
257D
5
4.322
–2.362
92.33
0.555
21.42
0.210
0.181
0.147
11
–1
259D
5
5.337
1.957
125.6
0.272
23.87
0.211
0.118
0.146
9
–1
260D
5
2.712
–0.550
80.64
0.333
19.52
0.163
0.139
0.138
10
–1
261D
2
2.535
1.013
63.79
0
8.046
0.194
0.142
0.091
14
–1
262D
5
0.973
–2.504
53.37
0.600
22.90
0.318
0.101
0.053
9
–1
263D
2
1.885
–0.117
78.01
0.153
14.19
0.241
0.082
0.085
16
–1
264D
6
3.087
–0.344
105.9
0.470
30.51
0.216
0.275
0.265
13
–1
268D
6
2.238
–3.768
81.98
0.625
22.92
0.242
0.172
0.122
14
–1
277D
8
3.873
–1.060
131.4
0.592
27.47
0.264
0.228
0.250
11
–1
281D
7
2.635
–2.279
131.1
0.333
23.31
0.255
0.140
0.127
9
–1
282D
7
2.527
–3.503
120.0
0.583
31.18
0.232
0.262
0.203
13
–1
286D
5
2.224
–0.838
97.04
0.187
24.75
0.269
0.048
0.062
21
–1
287D
7
3.143
–2.273
108.6
0.666
27.23
0.223
0.147
0.092
10
–1
293D
8
1.711
–3.031
125.8
0.461
22.98
0.222
0.158
0.081
9
–1
297D
4
5.833
2.786
119.9
0.318
18.96
0.217
0.217
0.360
11
–1
303D
7
0.706
1.014
91.94
0.500
23.03
0.277
0.240
0.332
10
–1
312D
5
2.059
–2.657
80.60
0.625
19.27
0.153
0.177
0.173
6
–1
313D
5
3.122
–1.484
100.5
0.333
24.33
0.218
0.079
0.064
12
–1
314D
8
0.977
–2.283
111.8
0.333
32.95
0.243
0.131
0.128
15
–1
Index numbers as they appear in
the Supporting Information (Table S1)
Chemical structures of selective
15S-LOX1 (A) and 12S-LOX1 (B) inhibitors from this table are shown in Chart .
Chart 1
Chemical Structures of Selective 15S-LOX1 (A) and 12S-LOX1 (B) Inhibitors
from Table
Index numbers as they appear in
the Supporting Information (Table S1)Chemical structures of selective
15S-LOX1 (A) and 12S-LOX1 (B) inhibitors from this table are shown in Chart .However, as already mentioned,
it is impossible to elucidate how
RFs actually combined the descriptors and linked them to the response
variable. This is even more the case for the support vector machine
models used as base learners for the ensemble model RF10. Nevertheless,
since our aim has been to acquire some elemental insights into how
decisions were made, without necessarily knowing every detail of the
full model, a simpler explanation could be sufficient. To this end,
we used the rpart algorithm to create a single decision
tree on the validation set using the 10 top descriptors as ranked
by RF2. The decision path clarified which features were associated
with every decision (Figure ). We see that 3 of the top 10 descriptors are the same, although
with different ordering. ALogP is the top descriptor
for the decision tree but is ranked 6th by RF2. There is a consensus
regarding XLogP (second), whereas HybRatio is ranked
third by the decision tree and fourth by RF. Interestingly, RF2 identifies
nHBAcc as the top descriptor, whereas the decision tree ignores it.
The differences observed in the ranking of descriptors between the
RF and the single decision tree are not atypical as they are attributed
to the greediness of the single tree.[112]
Figure 8
A
single decision tree created on the validation set using the
10 top descriptors as ranked by the classifier RF2. The decision path
clarifies which features are associated with every decision as well
as the threshold values of the top descriptors that are responsible
for a molecule being classified as active/inactive against 15-LOX1.
A
single decision tree created on the validation set using the
10 top descriptors as ranked by the classifier RF2. The decision path
clarifies which features are associated with every decision as well
as the threshold values of the top descriptors that are responsible
for a molecule being classified as active/inactive against 15-LOX1.
Conclusions
In the present study, we have created a dataset of structurally
diverse molecules hitherto known as active against three LOX isoenzymes
(15S-LOX1, 12S-LOX1, and 15S-LOX2). Subsequently, we considered a binary classification
problem with the aim of exploring the selective activity of chemicals
toward the enzyme 15S-LOX1. We identified a set of
37 structural descriptors of high discriminatory affinity, which are
responsible for binding selectivity toward 15S-LOX1.
Based on these descriptors, we created two highly accurate stacked
models (RF9 and RF10) as ensembles of SVM, RF, and KNN algorithms
that predict the selective interactions of various chemicals with
15S-LOX1. The models successfully classify the compounds
as active/inactive against LOX1 and most importantly, they discriminate
the molecules with selective activity toward either one of the isozymes
15S-LOX1 and 12S-LOX1. These QSAR
classifiers can be used in drug discovery as computational filters
in the virtual screening of novel LOX1 inhibitors, whereas the selected
descriptors can be used as a guide for the design and synthesis of
new 15S-LOX1 inhibitors.
Methods
Binary Classification and Model Generation
The present
work is concerned with a binary classification problem
studied in supervised machine learning.[39] Therefore, the response (output) variable is categorical, and the
two categories are predefined according to the biological activity
in vitro of various chemicals toward the enzyme 15S-LOX1, that is, active (1) and inactive (−1).QSAR modeling
was carried out with R.[40] R is both a language
and an environment for statistical computing and graphics. Extended
functionalities were added to R by installing a number of packages,
which are machine learning algorithms implemented as third party libraries.The
following R packages were used for the analysis: rcdk,[41]SparseM,[42]randomForest,[43]caret,[44]e1071,[45]pROC,[46]rpart,[47]AppliedPredictiveModeling,[48] and caretEnsemble.[49]
Calculation of Molecular
Descriptors
A total of 317 structurally diverse molecules
with recorded biological
activity in vitro toward three LOX subtypes (15S-LOX1,
12S-LOX1, and 15S-LOX2) were used
for the classification analysis (Supporting Information, Table S1). The molecules have been collected from available public
sources: (a) reviews and reports[50−79] and (b) the PubChem Bioassay database, where we found several assays
concerning the biological activity of series of molecules toward LOX.[80−91] Special consideration was taken to avoid structural redundancy.
The ligands were drawn in ACD/ChemSketch[92] (Advanced Chemical Development) and a single 3D conformation was
created for each structure with the Bioclipse software.[93,94] An SDF file containing the 3D coordinates of the molecules was imported
in R, and the rcdk package was used to calculate
automatically a number of descriptor variables. These descriptors
are divided broadly into three main groups,[95] that is, atomic, bond, and molecular and belong to the specific
categories “topological”, “geometrical”,
“hybrid”, “constitutional”, and “electronic”.
The calculation resulted in 282 descriptors for each molecule.
IC50 Experimental Data
It is a common practice
in in vitro experiments to use three different
LOX subtypes[50−79] to draw conclusions regarding biological activity toward human 15S-LOX1: The human 15S-LOX1, the soybean
13LOX1, and the mammalian (rabbit reticulocyte) 12-/15-LOX1. Such
extrapolations are valid on the grounds that the enzymes have high
structural similarity, most importantly in the area of the binding
site, and the mode of action of the small molecule inhibitors is similar.[4,13,14,19] Therefore, for the collection of data for the training of the classification
models, we adopted the same rationale.Rabbit reticulocyte 12-/15-LOX1
has been characterized[19] and, as mentioned
above, is often used as a standard for biochemical assays. Soybean
13LOX1 has been for many years a model system for understanding catalysis
and structure of all lipoxygenases.[4] It
is sufficiently stable, obtainable in large quantities, and relatively
easy to purify.[4,19] In addition, the overall architecture
of the soybean structure is similar to the mammalian, although the
latter is much more compact.[19] Therefore,
soybean 13LOX1 is used in bioassays for recording bioactivity toward
15S-LOX1, although the IC50 values of
a molecule may be different for the two enzymes.In the present
analysis, we have considered the molecules with
IC50 ≥ 100 μM as being inactive. Also, as
already mentioned, we have included a number of molecules, which,
although inactive against 15S-LOX1, exhibit bioactivity
toward other LOX isozymes. These molecules were considered to be inactive.
Our aim was to train our models to distinguish subtle differences
among chemical structures and to be able to predict those that showed
selectivity for 15S-LOX1.In accordance to
their inhibitory activity in vitro toward 15S-LOX1,
the molecules were classified as follows: (a) active
(1): selective 15S-LOX1 inhibitors and nonselective
inhibitors with preference for the 15S-LOX1 pocket
and (b) inactive (−1): molecules exhibiting selectivity toward
two other LOX isozymes (12S-LOX1 and 15S-LOX2) but inactive against 15S-LOX1 and molecules
inactive toward LOX in general. For all the molecules in the dataset,
the IC50 values were recorded (Supporting Information, Table S1).
Authors: Roman Stavniichuk; Viktor R Drel; Hanna Shevalye; Igor Vareniuk; Martin J Stevens; Jerry L Nadler; Irina G Obrosova Journal: Free Radic Biol Med Date: 2010-06-22 Impact factor: 7.376
Authors: Khehyong Ngu; David S Weinstein; Wen Liu; Charles Langevine; Donald W Combs; Shaobin Zhuang; Xing Chen; Cort S Madsen; Timothy W Harper; Saleem Ahmad; Jeffrey A Robl Journal: Bioorg Med Chem Lett Date: 2011-06-06 Impact factor: 2.823
Authors: Eric K Hoobler; Ganesha Rai; Andrew G S Warrilow; Steven C Perry; Christopher J Smyrniotis; Ajit Jadhav; Anton Simeonov; Josie E Parker; Diane E Kelly; David J Maloney; S L Kelly; Theodore R Holman Journal: PLoS One Date: 2013-06-24 Impact factor: 3.240
Authors: Ganesha Rai; Netra Joshi; Joo Eun Jung; Yu Liu; Lena Schultz; Adam Yasgar; Steve Perry; Giovanni Diaz; Qiangli Zhang; Victor Kenyon; Ajit Jadhav; Anton Simeonov; Eng H Lo; Klaus van Leyen; David J Maloney; Theodore R Holman Journal: J Med Chem Date: 2014-05-13 Impact factor: 7.446
Authors: Yanli Wang; Stephen H Bryant; Tiejun Cheng; Jiyao Wang; Asta Gindulyte; Benjamin A Shoemaker; Paul A Thiessen; Siqian He; Jian Zhang Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971