Selvaraman Nagamani1, G Narahari Sastry1. 1. Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, Assam 785 006, India.
Abstract
The drug-resistant strains of Mycobacterium tuberculosis (M.tb) are evolving at an alarming rate, which indicates the urgent need for the development of novel antitubercular drugs. However, genetic mutations, the complex cell wall system of M.tb, and influx-efflux transporter systems are the major permeability barriers that significantly affect the activity of M.tb drugs. Thus, most small molecules fail to arrest M.tb cell growth, even though they are effective at the cellular level. To address the permeability issue, different machine learning models that effectively distinguish permeable and impermeable compounds were developed. The enzyme-based (IC50) and cell-based (minimal inhibitory concentration) data were considered for the classification of M.tb permeable and impermeable compounds. It was assumed that the compounds that have high activity in both enzyme-based and cell-based assays possess the required M.tb cell wall permeability. The XGBoost model outperformed the other models generated from different algorithms such as random forest, support vector machine, and naïve Bayes. The XGBoost model was further validated using the validation data set (21 permeable and 19 impermeable compounds). The obtained machine learning models suggested that various descriptors such as molecular weight, atom type, electrotopological state, hydrogen bond donor/acceptor counts, and extended topochemical atoms of molecules are the major determining factors for both M.tb cell permeability and inhibitory activity. Furthermore, potential antimycobacterial drugs were identified using computational drug repurposing. All the approved drugs from DrugBank were collected and screened using the developed permeability model. The screened compounds were given as input to the PASS server for the identification of possible antimycobacterial compounds. The drugs that were retained after the two filters were docked to the active sites of 10 different potential antimycobacterial drug targets. The results obtained from this study may improve the understanding of M.tb permeability and activity, which may aid in the development of novel antimycobacterial drugs.
Drug resistance is
the major ongoing threat in developing first-line Mycobacterium
tuberculosis (M.tb) drugs.[1−3] Complex M.tb cell wall systems,[4] rapid mutations in enzymes,[5] and modification of drug targets are some of the major
causes of M.tb drug resistance. The development
of potential, novel chemotherapeutic agents depends upon a deeper
understanding of these M.tb defense mechanisms.[1] M.tb has a thick and waxy cell
wall, which acts as a powerful barrier to major antibiotics as
well as potential M.tb drugs. The outer layer of
the M.tb cell wall is made up of peptidoglycan–arabinogalactan–mycolic
acid. In vitro experiments have demonstrated that this outer layer
has unusually low fluidity, and thus, both hydrophilic and lipophilic agents
have severe problems passing through this thick cell wall.[6−10] Hence, the complexity of the M.tb cell wall is
a natural defense mechanism for its survival, and it therefore deserves
attention from researchers developing potential antimycobacterial
drugs.[11]
The advancement of structure-based
and ligand-based drug design
approaches has made drug discovery processes more feasible by identifying
potential lead-like molecules before synthesis and biological evaluation.[12−21] However, M.tb permeability is the major concern
in developing potential antimycobacterial drugs. Even the most potent M.tb inhibitors validated against InhA[22] lack the expected efficacy because of their
inability to penetrate through the M.tb cell membrane.
Different physicochemical, biological, and chemical properties such
as stereochemistry, lipophilicity, saturation and unsaturation, flexibility,
viscosity, fluidity, pressure, temperature, and physiological and pathological
conditions are responsible for the M.tb cell wall
permeability.[23] Moreover, 14 out of 36
anti-TB drugs do not obey Lipinski’s rule of five and other
drug-likeness properties.[24] Therefore,
various other parameters have to be considered to design potential
antitubercular compounds. However, the data on permeability properties
of the small molecules are not available, and thus, it obstructs the
development of knowledge-based methods for permeability estimation.
Two different models are available for the assessment of M.tb permeability of small molecules. Merget et al.[25] used logistic regression to assess the permeability
of small molecules. They considered 3815 chemical structures as “active”
data set and randomly drawn drug-like molecules as the negative data
set. The model was generated based on descriptors calculated from
PaDEL[26] and QikProp[27] tools. Janardhan et al.[11] developed
a 2D-QSAR model for M.tb cell wall permeability based
on the minimal inhibitory concentration (MIC) value. The compounds
with MIC ≤ 200 nM and the compounds with MIC > 200 nM were
considered as “active” and “inactive”
compounds, respectively. Compounds with good IC50 values but poor
MIC values were assumed to have failed to penetrate through the M.tb cell wall. This classification
may be used as a knowledge-based source for analyzing the features
of permeable compounds.
Machine learning and data science play
an enormous role in different
fields such as bioinformatics,[28] chemoinformatics,[29] computational drug discovery,[30] genomics,[31] and computational
chemistry.[32] In this work, M.tb permeability models were created using different machine learning
approaches, and they were subsequently validated with various metrics.
Interestingly, the developed XGBoost machine learning model has outperformed
the existing QSAR model. The same strategy was applied from the previous
QSAR model for the selection of “active” and “inactive”
data sets.[11] In the next step, the available
drug molecules collected from DrugBank[33] were screened with the permeability model. The molecules screened
with the machine learning model subsequently underwent the PASS analysis
for the selection of compounds that have antimycobacterial activity.
In the last step, virtual screening was performed against 10 different
potential M.tb targets with the compounds retained
after the two filters. The identified drug molecules can be potential
lead molecules against M.tb infection since they have
been predicted as potential candidates through a series of filtering
mechanisms (i.e., permeability model, PASS analysis, and virtual screening)
for antitubercular drug discovery.
Results and Discussion
Compositional Analysis of Permeable and Impermeable Compounds
The compositional
analysis of both data sets revealed that the
permeable and impermeable compounds are rich in carbon, nitrogen,
and oxygen (Figure S1). However, there
was no significant difference in the atom frequency of compositional
elements between permeable and impermeable compounds. Thus, the composition
and related properties may not be important determinants for the permeability
of a compound.
Principal Component Analysis
Principal component analysis
(PCA) was performed on the training data to compute the variance between
permeable and impermeable compounds using descriptors as input features.[34] Two principal components (PC1 and PC2) are represented
by the two axes of the principal component plot. The first principal component,
a linear combination of the original predictor variables, captures the
maximum variance in the data set, whereas
the second principal component captures the largest share of the
remaining variance. The variance explained by PC1 and PC2 for the whole
data set was 42.1% and 5.4%, respectively (Figure S2a). It decreases significantly after PC2, and after PC5,
the variance was almost constant (Figure S2b). However, from the results, it is evident that the PCA-based model
is not expected to give the best performance. Hence, the performance
of various machine learning algorithms was compared for developing
the M.tb permeability model.
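The share of variance attributed to each principal component can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the eigenvalues are obtained by power iteration with deflation rather than by a statistics package, and any scaling of the descriptors before PCA is left out because the text does not describe it.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_eigenvalues(matrix, k, iters=500):
    """Largest k eigenvalues of a symmetric PSD matrix (power iteration + deflation)."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    rng = random.Random(0)
    values = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(m[a][b] * v[b] for b in range(d)) for a in range(d))
        values.append(lam)
        # Deflate: subtract the component that was just found.
        for a in range(d):
            for b in range(d):
                m[a][b] -= lam * v[a] * v[b]
    return values


def explained_variance_ratio(rows, k=2):
    """Fraction of total variance captured by each of the first k components."""
    cov = covariance_matrix(rows)
    total = sum(cov[j][j] for j in range(len(cov)))
    return [lam / total for lam in top_eigenvalues(cov, k)]
```

Applied to a descriptor matrix, the first two returned values correspond to the kind of percentages quoted above for PC1 and PC2.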
Selection of Appropriate Machine Learning Model for Classification
Various machine
learning methods were evaluated for classification performance using the
“mlbench” package[35] in R.
The accuracy values obtained for
random forest (RF), gradient boosting model (GBM), classification
and regression trees (CART), Glmnet, support vector machine (SVM), k-nearest neighbors (KNN), naïve Bayes (NB), and
logistic regression were 0.946, 0.939, 0.851, 0.927, 0.925, 0.925,
0.864, and 0.490, respectively (Figure 1). Thus, it is apparent that RF- and GBM-based models
were comparatively better than the other machine learning models.
Figure 1
Performance
of different machine learning models for the classification
of M.tb permeable and impermeable compounds (RF,
random forest; GBM, gradient boosting model; CART, classification
and regression trees; Glmnet, Lasso and elastic-net regularized generalized
linear model; SVM, support vector machine; KNN, k-nearest neighbors; NB, naïve Bayes; and logistic, logistic
regression).
RF Parameter Optimization and Development of Classification Models
Initially, all the descriptors and the permeability
information of 1114 compounds were given as input features at the
best mtry (optimized by the tuneRF function) and ntree = 1000, and mean
decrease values were calculated. The top 100 mean decrease accuracy
and mean decrease gini values are shown in Figure S3. The performance of the descriptor-based models was
examined at different mtry values and ntrees (200, 400, 600,
800, and 1000) using different combinations of descriptors. Based
on the results depicted in Figure 2 and Table 1, it is apparent that the top 40 descriptors at mtry = 4 and
ntree = 800 performed better than the other models, with an accuracy of
0.9706 and an MCC of 0.8918, with an out of bag (OOB) error of 5.6%.
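As a quick check of this reading of Table 1, the sketch below ranks the tabulated configurations by MCC. The (descriptor count, mtry, MCC) triples are copied from Table 1; ranking by MCC is only one way to read the table, since the study states that the final selection was based on the minimum OOB error. The function name is hypothetical.

```python
# (number of top descriptors, mtry, MCC) copied from Table 1.
TABLE1_MCC = [
    (20, 3, 0.8455), (20, 4, 0.8645), (20, 5, 0.8395), (20, 6, 0.8273), (20, 7, 0.8455),
    (40, 2, 0.8270), (40, 3, 0.8641), (40, 4, 0.8918), (40, 5, 0.8548), (40, 6, 0.8641),
    (60, 3, 0.8832), (60, 5, 0.8455), (60, 7, 0.8362), (60, 9, 0.8583), (60, 11, 0.8455),
    (80, 36, 0.8368), (80, 37, 0.8368), (80, 38, 0.8832), (80, 39, 0.8115), (80, 40, 0.8548),
    (100, 2, 0.8455), (100, 4, 0.8736), (100, 6, 0.8457), (100, 8, 0.8368), (100, 10, 0.8548),
]


def best_by_mcc(rows):
    """Return the (descriptors, mtry, MCC) triple with the highest MCC."""
    return max(rows, key=lambda row: row[2])


if __name__ == "__main__":
    print(best_by_mcc(TABLE1_MCC))
```

MCC is a reasonable ranking criterion here because the two classes are unbalanced (846 permeable vs 268 impermeable compounds), which can make accuracy alone look optimistic.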
Figure 2
Optimization
of RF models at different mtry and ntree values using
top (A) 20 descriptors, (B) 40 descriptors, (C) 60 descriptors, (D)
80 descriptors, and (E) 100 descriptors as input features. Mtry is
the number of variables randomly sampled as candidates at each split,
and ntree is the number of trees to grow. The performance has been
calculated by the percentage of OOB error.
Table 1
Performance of RF Models at Different mtry Values Using Variable Number of Important Descriptors (a)

descriptors  mtry  sensitivity  specificity  precision  accuracy  MCC
Top 20       3     0.9814       0.8289       0.9420     0.9403    0.8455
Top 20       4     0.9907       0.8289       0.9425     0.9692    0.8645
Top 20       5     0.9814       0.8205       0.9378     0.9412    0.8395
Top 20       6     0.9721       0.8289       0.9414     0.9130    0.8273
Top 20       7     0.9814       0.8289       0.9420     0.9403    0.8455
Top 40       2     0.9767       0.8158       0.9375     0.9254    0.8270
Top 40       3     0.9814       0.8553       0.9505     0.9420    0.8641
Top 40       4     0.9907       0.8684       0.9552     0.9706    0.8918
Top 40       5     0.9814       0.8421       0.9462     0.9412    0.8548
Top 40       6     0.9860       0.8421       0.9464     0.9552    0.8641
Top 60       3     0.9953       0.8421       0.9469     0.9846    0.8832
Top 60       5     0.9814       0.8289       0.9420     0.9403    0.8455
Top 60       7     0.9814       0.8158       0.9378     0.9394    0.8362
Top 60       9     0.9907       0.8205       0.9383     0.9697    0.8583
Top 60       11    0.9814       0.8289       0.9420     0.9403    0.8455
Top 80       36    0.9721       0.8421       0.9457     0.9143    0.8368
Top 80       37    0.9721       0.8421       0.9457     0.9143    0.8368
Top 80       38    0.9953       0.8421       0.9469     0.9846    0.8832
Top 80       39    0.9587       0.8421       0.9457     0.8767    0.8115
Top 80       40    0.9814       0.8421       0.9462     0.9412    0.8548
Top 100      2     0.9814       0.8289       0.9420     0.9403    0.8455
Top 100      4     0.9907       0.8421       0.9467     0.9697    0.8736
Top 100      6     0.9767       0.8421       0.9459     0.9275    0.8457
Top 100      8     0.9721       0.8421       0.9457     0.9143    0.8368
Top 100      10    0.9814       0.8421       0.9462     0.9412    0.8548

(a) The top 40 descriptors at mtry 4 were selected as the best performing model. These descriptors were further optimized using the boosting method in XGBoost.
XGBoost Model Generation
The top
40 descriptors along
with the dependent variable were given as input for the XGBoost model
generation. Different XGBoost parameters were manually optimized,
and the best performance was obtained by applying the values of 0.1,
0.8, 3, and 10 for eta, gamma, minimum child weight, and maximum depth,
respectively. Eta is the step size shrinkage that controls the learning
rate and over-fitting by scaling each tree's contribution,
gamma is the minimum loss reduction required to make a split,
and the minimum child weight and maximum depth are the minimum
sum of instance weight needed in a child and the maximum depth of a tree,
respectively. Other tree parameters, such as maximum delta step, subsample,
column sample, and the number of trees to grow per round, were left
at their default values.
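The tuned values can be written down as a parameter dictionary using the XGBoost parameter names. This is a sketch only: the four tuned values come from the text above, whereas the `objective` entry is an assumption (a logistic objective is a common choice for a two-class problem, but the paper does not state which one was used), and `describe` is a hypothetical helper.

```python
# Tuned values reported in the text; all other parameters stay at defaults.
xgb_params = {
    "eta": 0.1,               # step size shrinkage (learning rate)
    "gamma": 0.8,             # minimum loss reduction required to make a split
    "min_child_weight": 3,    # minimum sum of instance weight needed in a child
    "max_depth": 10,          # maximum depth of a tree
    "objective": "binary:logistic",  # ASSUMPTION: not stated in the paper
}


def describe(params):
    """Render the parameters as a single, sorted, human-readable line."""
    return ", ".join(f"{key}={value}" for key, value in sorted(params.items()))


if __name__ == "__main__":
    print(describe(xgb_params))
```

In the Python interface, a dictionary of this form is what `xgboost.train` accepts as its `params` argument; the R interface uses the same parameter names.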
Performance Evaluation of XGBoost Model on the Validation Set
The performance of the XGBoost model was
evaluated with the validation
set containing 40 compounds (21 impermeable compounds and 19 InhA
inhibitors). The same data set was also screened against the
QSAR model.[11] The results are compared
and depicted in Figure 3 and Table S1. The GBM has high true positive
rates as compared to false positive rates. The ROC values (Figure 3) show that the XGBoost
model (Figure 3a) is
better (0.81) than the QSAR model (Figure 3b) (0.60). The steps involved in the construction
of the permeability model are represented in Figure 4.
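For readers unfamiliar with the ROC summary value, the sketch below computes the area under the ROC curve from labels and scores using its rank interpretation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative). The function name and the example inputs are illustrative and are not data from this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 1 (positive) / 0 (negative)
    scores: iterable of model scores, higher meaning "more positive"
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Illustrative values only.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for comparing the 0.81 and 0.60 reported above.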
Figure 3
ROC performance of the (A) gradient boosting
model and (B) QSAR
model on external validation data set (40 compounds).
Figure 4
Schematic workflow for the development of M.tb permeability
machine learning model.
Identification of Statistically Significant Descriptors
The Wilcoxon rank-sum test was performed
on the top 100 descriptors of the training data set
to find the statistically discriminating
(P < 0.05) descriptors between the permeable and
impermeable compounds. Figure 5 depicts the statistically significant descriptors between permeable
and impermeable compounds. This analysis will help researchers
characterize in detail the discriminating features of permeable
compounds, which eventually helps in the development of potential
permeable compounds.
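The test applied to each descriptor can be sketched as follows. This is an illustrative implementation of the two-sided rank-sum test with the normal approximation and tie correction, without continuity correction; the study does not state which software routine or options were used, so small differences from their P values are to be expected. The function name is hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic for sample x, two-sided p-value).
    Tied values receive average ranks, and the variance is tie-corrected.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))    # equals 2 * (1 - Phi(|z|))
    return u, p
```

Calling the function once per descriptor, with the descriptor values of the permeable compounds as `x` and those of the impermeable compounds as `y`, and keeping descriptors with p < 0.05 reproduces the kind of filter described above.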
Figure 5
Significantly discriminating descriptors: (A) ETA_Epsilon_5,
(B)
nHBint5, (C) A log P, (D) MaxsOm, (E) Mpe, (F) GATS1e, (G)
GATS3v, (H) MeanI, and (I) BUCTp.1l on the basis of Wilcoxon test
(P < 0.05) among permeable and impermeable compounds.
Relative Influence of Various Descriptors toward M.tb Permeability
In this section,
different descriptors were
analyzed based on their contribution toward M.tb cell
permeability and inhibition. BCUT descriptors, atom type electrotopological
descriptors, molecular path count, autocorrelation descriptors, and
lipophilicity (log P) are some of the major
contributors that have the potential to alter the M.tb permeability.
Mean decrease accuracy and mean decrease gini
were calculated, and their contribution toward the permeability or
impermeability of the compounds is discussed in detail. The obtained
values suggested that various descriptors of a molecule, i.e., lipophilicity (log P, size, and shape) and electrotopology (atomic charges,
ionization potential, hydrogen bond acceptors/donors, and number of
fused rings), determine its activity.
In feature importance, autocorrelation descriptors[36] showed a significant difference between permeable and impermeable
compounds. The descriptors GATS1s, GATS6e, and AATSC3i showed significant
contributions toward permeable compounds when compared to impermeable
compounds. Thus, these autocorrelation descriptors could be useful
to improve the M.tb permeability of the molecules.
Similarly, the ATSC5i, MATS2e, GATS6m, and GATS8c descriptors have higher
values for impermeable compounds than for permeable compounds. Thus, it
can be concluded that the negative contributions of these descriptors
may be unfavorable toward the permeability of the molecule.
Each skeletal atom or group is assigned an intrinsic state
value when the atom-type electrotopological descriptors are calculated.[37] This value represents the electrotopological intrinsic
state (E-state numbers) and E-state indices (E-state contribution).
Different parameters such as mindsN, maxHBint5, and minHBint2 have
high values for permeable compounds and represent maximum E-state
descriptors, which are directly proportional to the inhibitory potency
of the molecule.
Lipophilicity (log P)[38] describes how a molecule partitions between
two immiscible phases, i.e.,
octanol and water. The permeability
of a compound is directly proportional to the log P value. Different types of log P values,
such as AlogP2, MlogP, and XlogP, are favorable for the permeable compounds and unfavorable
for the impermeable compounds. Thus, to improve the permeability of
a molecule, substitution of optimal lipophilic moieties within the
applicability domain may be beneficial.
Another important descriptor
is nHBAcc_Lipinski (number of hydrogen
bond acceptors), which contributes favorably toward the permeable
compounds. Thus, it can be concluded that substitution of the optimal
number of hydrogen bond acceptors may improve the permeability of the M.tb inhibitors. The total count of fused rings (nFRing)
favors the impermeable compounds. Thus, it indicates that
the number of fused rings should be minimal to design M.tb permeable compounds.
It was also noted that the nHBint3
(40), nHsOH,
minHssNH, and maxsssCH descriptors are unfavorable for the permeability
of the compounds. Hence, it is inferred that minimum secondary amines,
maximum sp3 hybridized carbon atoms, and
minimum hydroxyl groups are critical parameters for gaining inhibitory
potency. Overall, apart from the identification of potential M.tb inhibitors, the developed machine learning model may
help in understanding favorable and unfavorable molecular properties
that govern M.tb permeability.
Drug Repurposing Studies
Three-Tier (Permeability Screening, Antimycobacterial Activity Prediction, and Virtual Screening) Screening
The M.tb drug targets were considered for computational drug
discovery and drug repurposing studies. Ten genes were
chosen in M.tb for computational drug repurposing.
A total of 6086 drug molecules were collected from the DrugBank database,
including clinical trial and experimental drugs. A three-tier screening
was applied for the identification of potential antimycobacterial
compounds. All the molecules were screened with the generated XGBoost
model for the permeability criteria. The output values of the M.tb permeability
model were between 0 and 1. Molecules that possess
values ≤0.5 were considered as impermeable compounds. This
process screened out 656 compounds that were not able to permeate
through the M.tb cell wall according to the machine
learning model prediction.
Antimycobacterial Activity Prediction
The biological
potential of the screened permeable compounds was predicted with PASS.
Fifty-one molecules were predicted as potent antimycobacterial
compounds in the PASS analysis and may be used for the treatment of tuberculosis
infection. Bedaquiline, a known antimycobacterial drug used to treat
multidrug-resistant tuberculosis, was identified
by PASS among the 51 molecules with a high Pa (probability “to be active”). Some other antitubercular drugs
(i.e., isoniazid, ethambutol, and rifampicin) also showed significant
antitubercular activity in the PASS analysis; however, these molecules
were not considered because they are already in
use as antitubercular drugs. Forty-one molecules (Table S2) were considered for the virtual screening
process.
Virtual Screening of Screened Drugs
The 41 molecules
were further docked in the active site of 10 potential antitubercular
drug targets, namely, MtcA2, folA, inhA, Cyp51, folP1, tmk, ligA,
pknB, kasA, and dprE1. AutoDock Vina was used for the docking
calculations. The top 10 molecules were considered based on their
docking score (>−5 kcal/mol) against the 10 targets. The
grid
parameters and active sites of the 10 targets are shown in Table S3. The docking scores along with the PASS
analysis results for the top 10 drug molecules are shown in Table 2. In the case of enoyl-ACP
reductase (InhA), the major interaction was found with Tyr 158, and
this interaction is essential for fatty acyl substrate binding.[39] The ligands are buried inside the hydrophobic
residues. In dihydrofolate reductase, a conserved interaction with Asp27
was noticed, and this amino acid is critical for catalysis.[40] Other than this, two additional common hydrophobic
interactions (Ile5 and Ile94) have also been noticed. In protein kinase
PknB, common interactions were observed with Leu17, Gly18, Glu93,
and Tyr94. These amino acids are essential for blocking the catalytic
activity of the protein.[41] The protein
KasA in M.tb is responsible for fatty acid biosynthesis
pathway. The Phe404 acts as a gatekeeper residue, which is responsible
for the opened and closed conformations.[42] Interestingly, most of the identified molecules have interaction
with this amino acid. In DprE1, Cys387 residue is an important residue
for the bonding of small molecule inhibitors.[43] Most of the identified molecules interact with this amino acid.
In M.tb beta carbonic anhydrase, the molecules
fit well in the active site. Asp 53 and Arg 55 are important
residues, which can adopt different conformations for the accessible
cavity.[44] Thus, interaction with these
amino acids may help to block the carbonic anhydrase. In the
case of CYP51, major interaction was observed with Phe78 and Phe83.
These amino acids are essential and “hotspot” residues
of this protein.[45] In thymidylate kinase,
the major interactions were identified with Asp163, Tyr166,
and Glu166. These interactions are essential for open and closed conformations.[46] The Arg253 amino acid in dihydropteroate synthase[47] is one of the highly conserved key amino acids
to avoid M.tb drug resistance. Interestingly, most
of the drugs interact with this amino acid. In the case of DNA
ligase, most of the molecules interact with Leu122 and Ile124.
These interactions are essential for inhibitory activity.[48] Overall, the 10 molecules formed interactions
with the significantly important residues of 10 targets. Figure S4 displays the identified drug molecules
in the active sites of 10 proteins.
Table 2
The Top 10 Drug Molecules with Probable Tuberculosis Activity in PASS Prediction along with Docking Scores against 10 M.tb Targets
As a first
step, 86 833 M.tb inhibitors with a
wide range of M.tb inhibitory potency were collected
from the ChEMBL database.[49] These compounds
were then classified according
to their biological assay protocols (MIC, activity, CC50, GI, IC50, IC90, inhibition, Ki, MIC50, MIC90, MIC99, ratio, selective index, AvgIC90, cc25, EC50, EC90, EC99, CFU, growth index, IZ, Kd, Km, log 1/MMIC,
log 10CFU, log 10CFU/mL, MBC, MIC95, ratioIC50, and selective ratio). Next, the redundant compounds (34 411
compounds) in each category were removed, and the nonredundant compounds
(52 422 compounds) were retained for further analysis (Table S4). From the nonredundant
data set, 1114 compounds that have MIC values (846
compounds) and IC50 values (268 compounds) were considered
in this study (Tables S5 and S6). Compounds
that have MIC values ≤200 nM (846 compounds) were considered
as permeable compounds, and compounds that have MIC > 200 nM and
IC50 ≤ 1000 nM (268 compounds) were considered as
impermeable
compounds. Furthermore, 40 compounds (19 InhA inhibitors and 21 impermeable
compounds from the ChEMBL database) were collected and stored as an
external validation set to validate the machine learning model.
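The labeling rule described above can be expressed as a small function. This is a sketch of the stated thresholds only; the function name is hypothetical, both inputs are assumed to be in nM, and compounds that match neither rule are returned as `None` rather than assigned to a class.

```python
def permeability_label(mic_nm, ic50_nm=None):
    """Assign a class from the thresholds given in the text.

    permeable:   MIC <= 200 nM
    impermeable: MIC > 200 nM and IC50 <= 1000 nM
    Returns None when neither rule applies or the MIC is missing.
    """
    if mic_nm is None:
        return None
    if mic_nm <= 200:
        return "permeable"
    if ic50_nm is not None and ic50_nm <= 1000:
        return "impermeable"
    return None


if __name__ == "__main__":
    # Illustrative values only, not compounds from the data set.
    print(permeability_label(150))          # permeable
    print(permeability_label(5000, 80))     # impermeable
    print(permeability_label(5000, 20000))  # None
```

The second rule encodes the working assumption of the study: a compound that inhibits the enzyme well but the whole cell poorly is taken to be failing at the cell wall.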
Compositional Analysis
The distribution of atom types
in the data set was calculated using the “atomcount” function in the “ChemmineR” library of the “R”
software. The physicochemical descriptors (i.e., elemental composition)
were calculated for the permeable and impermeable compounds.[50]
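The idea behind the elemental composition count can be shown with a short sketch. It does not reproduce the ChemmineR routine, which works on chemical structures; here the hypothetical function `atom_counts` parses a plain molecular formula without brackets, charges, or isotopes.

```python
import re
from collections import Counter

# An element symbol (one uppercase letter, optional lowercase letter)
# followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms in a simple molecular formula such as 'C6H7N3O'."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


if __name__ == "__main__":
    # Isoniazid has the molecular formula C6H7N3O.
    print(atom_counts("C6H7N3O"))
```

Summing such counts over all compounds in each class gives the atom frequency profiles that are compared in the compositional analysis.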
Input Features
The 1D and 2D descriptors
of all the
molecules were calculated using the PaDEL descriptor calculation tool.
The properties of the molecules have been depicted as numerical values
in the PaDEL. Different 1D (molecular weight, molar refractivity,
solubility, and permeability) and 2D (log P, log D, and topological polar surface area)
descriptors were considered to generate a machine learning model.[26]
Development of Classification Models for the Prediction of Permeability
A total of 1444 different
1D and 2D descriptors were calculated,
and among these, the 1186 descriptors that have a nonzero
value for most of the compounds were retained. The data set was split into
a training set (75%) and a test set (25%).
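The two preparation steps can be sketched as below. Both functions are hypothetical, and two details are assumptions because the text does not specify them: "nonzero for most of the compounds" is interpreted as nonzero in more than half of the rows, and the 75/25 split is stratified by class.

```python
import random


def keep_informative_columns(matrix, min_nonzero_fraction=0.5):
    """Indices of descriptor columns that are nonzero in more than the
    given fraction of compounds (ASSUMED interpretation of 'most')."""
    n = len(matrix)
    keep = []
    for j in range(len(matrix[0])):
        nonzero = sum(1 for row in matrix if row[j] != 0)
        if nonzero / n > min_nonzero_fraction:
            keep.append(j)
    return keep


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx), splitting each class separately so the
    class proportions stay roughly equal (stratification is an ASSUMPTION)."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)
```

With 846 permeable and 268 impermeable labels, `stratified_split` returns index lists of roughly 75% and 25% of the 1114 compounds.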
Principal Component Analysis
The high dimensionality
of the data (1114 compounds × 1186 descriptors) was
compressed using PCA to identify the major components that can distinctly
differentiate the two data sets.[34]
Appropriate Machine Learning Model Selection for Classification
The 1187
descriptors along with the response variable were
given as input features for the generation of models with different machine learning
algorithms, i.e., RF, GBM, CART, generalized linear model, SVM, KNN,
NB classifier, and logistic regression. A 10-fold cross-validation was
performed on the training data to compare the performance of the machine learning
models[51] using the “mlbench”
package in R software.[35]
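The cross-validation scheme used for this comparison can be sketched in a few lines. The helper names are hypothetical, the folds are formed by a seeded shuffle (the study does not describe how its folds were drawn), and `fit_predict` stands for any model wrapped as a function.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


def cross_validated_accuracy(fit_predict, X, y, k=10):
    """Mean accuracy over k folds.

    fit_predict(train_X, train_y, valid_X) must return predicted labels
    for valid_X; it is a placeholder for any classifier.
    """
    scores = []
    for train, valid in k_fold_indices(len(y), k):
        preds = fit_predict([X[i] for i in train],
                            [y[i] for i in train],
                            [X[i] for i in valid])
        correct = sum(1 for p, i in zip(preds, valid) if p == y[i])
        scores.append(correct / len(valid))
    return sum(scores) / len(scores)
```

Running `cross_validated_accuracy` once per algorithm on the same folds yields directly comparable accuracy values of the kind reported for the eight models.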
Optimization of Parameters for the Generation of Classification Models
The “RF” package in R (version 3.5.2)
was used to optimize the input feature parameters, viz., mtry (number
of randomly selected variables) and ntree (number of trees generated
by the RF algorithm).[51] The importance of each
descriptor was analyzed by calculating the mean decrease accuracy
value in the RF algorithm at the best mtry value, which was tuned by the tuneRF
function. The optimized values for each mtry in RF were estimated
by OOB error estimation. The descriptor-based RF model was evaluated
using the top 20, 40, 60, 80, and 100 descriptors and various mtry
values at ntree values of 200, 400, 600, 800, and 1000. The best
model was selected based on the minimum OOB error value. A tenfold cross-validation
method was applied for further optimization of the classification
models.
Optimization of GBM for the Regression Model Generation
The selected features from the best RF model were given as input
for the development of the GBM. Extreme Gradient Boosting (XGBoost)
uses the Gradient Boosting Decision Tree (GBDT) for classification
and regression problems.[52−54] The internal nodes of a regression
tree represent tests on attribute values, whereas the leaf
nodes carry scores that represent a decision. The prediction
is the sum of the scores predicted
by K trees,

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where x_i is the ith training sample, f_k(x_i) is the score for the kth
tree, and F is the space of functions containing
all regression trees. The objective function to be optimized is
given by the following formula:

$$\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

The former term is a differentiable loss
function that
measures how well the model fits the training data set. The
latter is
a regularization term that penalizes the complexity
of the model: when the complexity of the model increases, the corresponding
score is reduced.

In this study, the 2D descriptors are the
input to the XGBoost
classifier, and the predicted class (i.e., permeable or impermeable)
on a scale of 0–1 is the output variable. A probability value
greater than or equal to 0.5 indicates impermeable compounds, whereas
a probability value less than 0.5 indicates permeable compounds.
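How the summed tree scores turn into a class can be sketched as follows. The logistic link that maps the sum to a probability is an assumption (it corresponds to a logistic objective, which the paper does not state), the function names are hypothetical, and the class assignment follows the 0.5 convention given in this section.

```python
import math


def predict_probability(leaf_scores):
    """Sum the K per-tree leaf scores and map the total to (0, 1).

    The logistic function used here is an ASSUMPTION about the objective.
    """
    margin = sum(leaf_scores)
    return 1.0 / (1.0 + math.exp(-margin))


def classify(probability, threshold=0.5,
             at_or_above="impermeable", below="permeable"):
    """Apply the cutoff stated in this section: >= 0.5 is impermeable."""
    return at_or_above if probability >= threshold else below


if __name__ == "__main__":
    # Illustrative leaf scores from three hypothetical trees.
    p = predict_probability([-0.4, -0.7, 0.1])
    print(round(p, 3), classify(p))
```

The class labels are passed as arguments so that the same sketch can be reused if the positive class is coded the other way round.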
Validation of Final Classification Models
The performances of the RF and GBM models were evaluated on the 25% test
data set and also on a separate validation data set using performance
measures computed from the confusion matrix, including the Matthews
correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the true positive, TN is the true negative, FP is the false
positive, and FN is the false negative.
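As a worked illustration of how such measures follow from the four confusion-matrix counts, the sketch below computes MCC together with accuracy, sensitivity, and specificity. Only MCC is named in the text; the other three are an assumption about which measures were used, and the counts in the example are invented, not results from this study.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute common binary classification measures from confusion counts."""
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Invented counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four counts, so it stays informative when the two classes are of unequal size, which is one reason it is often reported alongside accuracy.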
Drug Repurposing Studies
A series of filtering techniques
was adopted to identify potent antimycobacterial compounds. As
a first step, all the drug molecules from DrugBank were collected
and screened with the generated permeability model. The screened compounds
were further evaluated with PASS biological activity prediction to
identify potential antimycobacterial compounds. Furthermore, virtual
screening was performed against 10 different M.tb drug targets (MtcA2, folA, inhA, Cyp51, folP1, tmk, ligA, pknB,
kasA, and dprE1) to identify potential drug molecules. These targets
were selected based on their role in M.tb virulence,
cell wall synthesis, biochemical pathways, etc. Table S7 lists the functions of the 10 selected M.tb targets.
Biological Activity Prediction
The PASS online web
server is widely used for the prediction of the biological activity
of compounds, including different pharmacological effects, toxic and
adverse effects, mechanisms of action, and the influence on gene
expression.[55−58] In the PASS server, the biological activity of a molecule is calculated from
structure–activity relationships using a training set of >300 000
biologically active substances (i.e., drugs, drug-like molecules,
lead compounds, and toxins). The activity in each
category (i.e., antiviral, antibacterial, and antimycobacterial) is
evaluated by the probability of being active (Pa). The compounds selected by the permeability model were
further screened in PASS for their antimycobacterial activity.
Virtual Screening Protocol
Screening of large libraries of molecules
against a molecular target is known as virtual screening.[12,14] AutoDock Vina was used for estimating the binding affinity of the
drug molecules in the active sites of the M.tb drug targets.
Proteins (MtcA2, folA, inhA, Cyp51, folP1, tmk, ligA, pknB, kasA,
and dprE1) and all the predicted permeable compounds were converted
into the corresponding .pdbqt files using AutoDock Tools.[59] The grids were generated in the active sites
of all the drug targets. The compounds that passed through the above
two filters (M.tb machine learning model and PASS
antimycobacterial activity prediction) were further docked in the M.tb drug targets. The top-scored compounds are reported
as the potential hits for antimycobacterial drug discovery.
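The three-stage cascade described above (permeability model, PASS prediction, docking) can be sketched as a filter followed by a ranking. Everything concrete here is hypothetical: the candidate names, all numeric values, the `pa_cutoff` parameter (the text does not state a Pa threshold), and the choice to rank each compound by its best docking score across targets. The only conventions taken as given are the 0.5 permeability cutoff from the text and the fact that more negative docking scores (kcal/mol) indicate stronger predicted binding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    name: str
    p_impermeable: float               # permeability model output (0-1)
    pa_antimycobacterial: float        # PASS probability of being active
    docking_scores: Dict[str, float]   # target -> score in kcal/mol


def shortlist(
    candidates: List[Candidate], pa_cutoff: float, top_n: int
) -> List[Candidate]:
    """Apply both filters, then rank survivors by their best docking score."""
    passed = [
        c
        for c in candidates
        if c.p_impermeable < 0.5               # predicted permeable
        and c.pa_antimycobacterial >= pa_cutoff  # hypothetical Pa cutoff
    ]
    # More negative score means stronger predicted binding.
    ranked = sorted(passed, key=lambda c: min(c.docking_scores.values()))
    return ranked[:top_n]


# Invented candidates for illustration only.
library = [
    Candidate("drug_A", 0.2, 0.6, {"inhA": -9.1, "dprE1": -8.0}),
    Candidate("drug_B", 0.7, 0.8, {"inhA": -10.5, "dprE1": -9.9}),
    Candidate("drug_C", 0.3, 0.1, {"inhA": -8.8, "dprE1": -9.4}),
    Candidate("drug_D", 0.1, 0.4, {"inhA": -7.5, "dprE1": -10.2}),
]

hits = shortlist(library, pa_cutoff=0.3, top_n=2)
print([c.name for c in hits])
```

Applying the inexpensive filters first keeps the number of docking runs small, since only compounds that pass both earlier stages need to be docked against all targets.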
Conclusions
Small-molecule permeability through biological
systems depends mainly
on physicochemical factors. M.tb permeability
is one of the major challenges in developing potential antimycobacterial
compounds. A systematic state-of-the-art machine learning calculation
based on cell-based and enzyme-based inhibitory activity was performed
to delineate the causative factors affecting permeability. The generated
XGBoost M.tb permeability model was validated with
known M.tb inhibitors and impermeable compounds.
The XGBoost model was further used to screen the DrugBank database.
The screened molecules were then screened in the PASS server to identify
potential compounds that have significant antitubercular activity.
The screened candidates are expected to display high permeability
through the M.tb cell wall and good antitubercular
activity. The compounds that passed through both filters were docked
in the active sites of 10 different potential M.tb targets. Ten drugs have been reported as potential repurposable
antitubercular candidates. Overall, the obtained results can be used
for the identification of novel potential drug molecules against drug-resistant M.tb strains with improved cell permeability.