Mycobacterium tuberculosis Cell Wall Permeability Model Generation Using Chemoinformatics and Machine Learning Approaches.

Selvaraman Nagamani¹, G Narahari Sastry¹.

Abstract

The drug-resistant strains of Mycobacterium tuberculosis (M.tb) are evolving at an alarming rate, which indicates an urgent need for the development of novel antitubercular drugs. However, genetic mutations, the complex cell wall system of M.tb, and influx-efflux transporter systems are the major permeability barriers that significantly affect the activity of M.tb drugs. Thus, most small molecules are ineffective at arresting M.tb cell growth, even though they are effective at the cellular level. To address the permeability issue, different machine learning models that effectively distinguish permeable and impermeable compounds were developed. The enzyme-based (IC50) and cell-based (minimal inhibitory concentration) data were considered for the classification of M.tb permeable and impermeable compounds. It was assumed that the compounds that have high activity in both enzyme-based and cell-based assays possess the required M.tb cell wall permeability. The XGBoost model outperformed the other models generated from different algorithms such as random forest, support vector machine, and naïve Bayes. The XGBoost model was further validated using the validation data set (21 permeable and 19 impermeable compounds). The obtained machine learning models suggested that various descriptors such as molecular weight, atom type, electrotopological state, hydrogen bond donor/acceptor counts, and extended topochemical atoms of molecules are the major determining factors for both M.tb cell permeability and inhibitory activity. Furthermore, potential antimycobacterial drugs were identified using computational drug repurposing. All the approved drugs from DrugBank were collected and screened using the developed permeability model. The screened compounds were given as input to the PASS server for the identification of possible antimycobacterial compounds. The drugs that were retained after these two filters were docked to the active sites of 10 different potential antimycobacterial drug targets. The results obtained from this study may improve the understanding of M.tb permeability and activity, which may aid in the development of novel antimycobacterial drugs.

Year: 2021 PMID: 34278133 PMCID: PMC8280707 DOI: 10.1021/acsomega.1c01865

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Drug resistance is the major ongoing threat in developing first-line Mycobacterium tuberculosis (M.tb) drugs.[1−3] The complex M.tb cell wall system,[4] rapid mutations in enzymes,[5] and modification of drug targets are some of the major causes of M.tb drug resistance. The development of potent, novel chemotherapeutic agents depends upon a deeper understanding of these M.tb defense mechanisms.[1] M.tb has a thick and waxy cell wall, which acts as a powerful barrier to major antibiotics as well as potential M.tb drugs. The outer layer of the M.tb cell wall is made up of peptidoglycan–arabinogalactan–mycolic acid. In vitro experiments have demonstrated that this outer layer has unusually low fluidity, and thus, both hydrophilic and lipophilic agents have severe difficulty passing through this thick cell membrane.[6−10] Hence, the complexity of the M.tb cell wall is a natural defense mechanism for its survival, and it therefore deserves attention from researchers developing potential antimycobacterial drugs.[11] The advancement of structure-based and ligand-based drug design approaches has made drug discovery more feasible by identifying potential lead-like molecules before synthesis and biological evaluation.[12−21] However, M.tb permeability is the major concern in developing potential antimycobacterial drugs. Even most of the potent M.tb inhibitors that are validated against InhA[22] lack efficacy because they are unable to penetrate the M.tb cell membrane.
Different physicochemical, biological, and chemical properties such as stereochemistry, lipophilicity, saturation and unsaturation, flexibility, viscosity, fluidity, pressure, temperature, and physiological and pathological conditions influence M.tb cell wall permeability.[23] Moreover, 14 out of 36 anti-TB drugs do not obey Lipinski’s rule of five and other drug-likeness properties.[24] Therefore, various other parameters have to be considered to design potential antitubercular compounds. However, data on the permeability properties of small molecules are not available, which obstructs the development of knowledge-based methods for permeability estimation. Two different models are available for the assessment of M.tb permeability of small molecules. Merget et al.[25] used logistic regression to assess the permeability of small molecules. They considered 3815 chemical structures as the “active” data set and randomly drawn drug-like molecules as the negative data set. The model was generated based on descriptors calculated with the PaDEL[26] and QikProp[27] tools. Janardhan et al.[11] developed a 2D-QSAR M.tb cell wall permeability model based on the minimal inhibitory concentration (MIC) value. The compounds with MIC ≤ 200 nM and the compounds with MIC > 200 nM were considered as “active” and “inactive” compounds, respectively. It was assumed that compounds with good IC50 values but poor MIC values fail to penetrate the M.tb cell wall. This classification may be used as a knowledge-based source for analyzing the features of permeable compounds.
Machine learning and data science play an enormous role in different fields such as bioinformatics,[28] chemoinformatics,[29] computational drug discovery,[30] genomics,[31] and computational chemistry.[32] In this work, M.tb permeability models were created using different machine learning approaches, and they were subsequently validated with various metrics. Interestingly, the developed XGBoost machine learning model outperformed the existing QSAR model. The same strategy as in the previous QSAR model was applied for the selection of “active” and “inactive” data sets.[11] In the next step, the available drug molecules collected from DrugBank[33] were screened with the permeability model. The molecules that passed the machine learning model subsequently underwent PASS analysis for the selection of compounds that have antimycobacterial activity. In the last step, virtual screening was performed against 10 different potential M.tb targets with the compounds retained after the two filters. The identified drug molecules can be potential lead molecules against M.tb infection since they have been predicted as potential candidates through a series of filtering mechanisms (i.e., permeability model, PASS analysis, and virtual screening) for antitubercular drug discovery.

Results and Discussion

Compositional Analysis of Permeable and Impermeable Compounds

The compositional analysis of both data sets revealed that the permeable and impermeable compounds are rich in carbon, nitrogen, and oxygen (Figure S1). However, there was no significant difference in the atom frequency of these elements between permeable and impermeable compounds. Thus, elemental composition and related properties may not be important determinants of the permeability of a compound.

Principal Component Analysis

Principal component analysis (PCA) was performed on the training data to compute the variance between permeable and impermeable compounds using descriptors as input features.[34] Two principal components (PC1 and PC2) are represented by the two axes of the principal component plot. The first principal component is the linear combination of the original predictor variables that captures the maximum variance in the data set, whereas the second principal component captures the largest share of the remaining variance. The variance explained by PC1 and PC2 for the whole data set was 42.1% and 5.4%, respectively (Figure S2a). The explained variance decreases significantly after PC2, and after PC5 it is almost constant (Figure S2b). From these results, it is evident that a PCA-based model is not expected to give the best performance. Hence, the performance of various machine learning algorithms was compared for developing the M.tb permeability model.
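As a rough illustration of how per-component variance percentages such as those quoted above are obtained, the sketch below converts the eigenvalues of a covariance matrix into explained-variance percentages. The eigenvalues are hypothetical placeholders chosen for readability; they are not values from this study.

```python
def explained_variance_percent(eigenvalues):
    """Return the percentage of total variance captured by each component."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [100.0 * value / total for value in eigenvalues]


def cumulative(values):
    """Running sum, useful for deciding how many components to keep."""
    running, out = 0.0, []
    for value in values:
        running += value
        out.append(running)
    return out


# Hypothetical eigenvalues (sorted, largest first), for illustration only.
eigenvalues = [8.0, 1.0, 0.5, 0.3, 0.2]
percent = explained_variance_percent(eigenvalues)
cumulative_percent = cumulative(percent)
```

A sharp drop after the first component followed by a flat tail, as in this toy input, is the pattern described for Figure S2b.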

Selection of Appropriate Machine Learning Model for Classification

Various machine learning methods were evaluated for classification performance using the “mlbench” package[35] in R. The accuracy values obtained for random forest (RF), gradient boosting model (GBM), classification and regression trees (CART), Glmnet, support vector machine (SVM), k-nearest neighbors (KNN), naïve Bayes (NB), and logistic regression were 0.946, 0.939, 0.851, 0.927, 0.925, 0.925, 0.864, and 0.490, respectively (Figure 1). Thus, the RF- and GBM-based models performed better than the other machine learning models.
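The comparison above can be reproduced as a simple ranking. The sketch below only sorts the accuracy values quoted in the text; it does not retrain any model.

```python
# Accuracy values as reported in the text above.
accuracy = {
    "RF": 0.946, "GBM": 0.939, "CART": 0.851, "Glmnet": 0.927,
    "SVM": 0.925, "KNN": 0.925, "NB": 0.864, "Logistic": 0.490,
}

# Sort from most to least accurate.
ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
top_two = [name for name, _ in ranked[:2]]

for name, value in ranked:
    print(f"{name:<8} {value:.3f}")
```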

Figure 1

Performance of different machine learning models for the classification of M.tb permeable and impermeable compounds (RF, random forest; GBM, gradient boosting model; CART, classification and regression trees; Glmnet, Lasso and elastic-net regularized generalized linear model; SVM, support vector machine; KNN, k-nearest neighbors; NB, naïve Bayes; and logistic, logistic regression).

RF Parameter Optimization and Development of Classification Models

Initially, all the descriptors and the permeability information of 1114 compounds were given as input features at the best mtry (optimized by the tuneRF function) and ntree = 1000, and mean decrease values were calculated. The top 100 mean decrease accuracy and mean decrease gini values are shown in Figure S3. The performance of the descriptor-based models was examined at different mtry values and ntree values (200, 400, 600, 800, and 1000) using different combinations of descriptors. Based on the results depicted in Figure 2 and Table 1, the top 40 descriptors at mtry = 4 and ntree = 800 were judged to perform better than the other models, with an accuracy of 0.9706, an MCC of 0.8918, and an out-of-bag (OOB) error of 5.6%.

Figure 2

Optimization of RF models at different mtry and ntree values using top (A) 20 descriptors, (B) 40 descriptors, (C) 60 descriptors, (D) 80 descriptors, and (E) 100 descriptors as input features. Mtry is the number of variables randomly sampled as candidates at each split, and ntree is the number of trees to grow. The performance has been calculated by the percentage of OOB error.

Table 1

Performance of RF Models at Different mtry Values Using a Variable Number of Important Descriptors

descriptors	mtry	sensitivity	specificity	precision	accuracy	MCC
Top 20	3	0.9814	0.8289	0.9420	0.9403	0.8455
	4	0.9907	0.8289	0.9425	0.9692	0.8645
	5	0.9814	0.8205	0.9378	0.9412	0.8395
	6	0.9721	0.8289	0.9414	0.9130	0.8273
	7	0.9814	0.8289	0.9420	0.9403	0.8455
Top 40	2	0.9767	0.8158	0.9375	0.9254	0.8270
	3	0.9814	0.8553	0.9505	0.9420	0.8641
	4	0.9907	0.8684	0.9552	0.9706	0.8918
	5	0.9814	0.8421	0.9462	0.9412	0.8548
	6	0.9860	0.8421	0.9464	0.9552	0.8641
Top 60	3	0.9953	0.8421	0.9469	0.9846	0.8832
	5	0.9814	0.8289	0.9420	0.9403	0.8455
	7	0.9814	0.8158	0.9378	0.9394	0.8362
	9	0.9907	0.8205	0.9383	0.9697	0.8583
	11	0.9814	0.8289	0.9420	0.9403	0.8455
Top 80	36	0.9721	0.8421	0.9457	0.9143	0.8368
	37	0.9721	0.8421	0.9457	0.9143	0.8368
	38	0.9953	0.8421	0.9469	0.9846	0.8832
	39	0.9587	0.8421	0.9457	0.8767	0.8115
	40	0.9814	0.8421	0.9462	0.9412	0.8548
Top 100	2	0.9814	0.8289	0.9420	0.9403	0.8455
	4	0.9907	0.8421	0.9467	0.9697	0.8736
	6	0.9767	0.8421	0.9459	0.9275	0.8457
	8	0.9721	0.8421	0.9457	0.9143	0.8368
	10	0.9814	0.8421	0.9462	0.9412	0.8548

The top 40 descriptors at mtry 4 were selected as the best performing model. These descriptors were further optimized using the boosting method in XGBoost.

XGBoost Model Generation

The top 40 descriptors along with the dependent variable were given as input for the XGBoost model generation. Different XGBoost parameters were manually optimized, and the best performance was obtained with values of 0.1, 0.8, 3, and 10 for eta, gamma, minimum child weight, and maximum depth, respectively. Eta is the step size shrinkage that controls the learning rate and limits over-fitting by scaling the contribution of each tree; gamma is the minimum loss reduction required to make a split; and the minimum child weight and maximum depth are the minimum sum of instance weights needed in a child node and the maximum depth of a tree, respectively. Other tree parameters, such as maximum delta step, subsample, column sample, and the number of trees to grow per round, were left at their default values.
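For readers who want to reuse these settings, the reported values map onto XGBoost's parameter names roughly as follows. This is a sketch, not the authors' script: the objective is an assumption (the text does not state it), and the training call is shown only as a comment because it needs the xgboost package and prepared data.

```python
# Parameter names follow the XGBoost library's conventions; the four numeric
# values are the ones reported in the text.
xgb_params = {
    "eta": 0.1,               # step size shrinkage (learning rate)
    "gamma": 0.8,             # minimum loss reduction required to split
    "min_child_weight": 3,    # minimum sum of instance weights in a child
    "max_depth": 10,          # maximum depth of each tree
    "objective": "binary:logistic",  # ASSUMPTION: binary classifier with 0-1 output
}

# Typical use (requires the xgboost package and a prepared DMatrix `dtrain`):
#   booster = xgboost.train(xgb_params, dtrain, num_boost_round=...)
```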

Performance Evaluation of XGBoost Model on the Validation Set

The performance of the XGBoost model was evaluated with the validation set containing 40 compounds (21 impermeable compounds and 19 inhA inhibitors). The same data set was also screened against the QSAR model.[11] The results are compared in Figure 3 and Table S1. The GBM has high true positive rates as compared to false positive rates. The ROC values (Figure 3) show that the XGBoost model (Figure 3a) performs better (0.81) than the QSAR model (Figure 3b) (0.60). The steps involved in the construction of the permeability model are represented in Figure 4.
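A ROC value of this kind is usually summarized as the area under the ROC curve (AUC), which equals the probability that a randomly chosen positive compound receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are hypothetical and unrelated to the 40-compound validation set.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = positive class) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of compounds, which is fine for a validation set of a few dozen molecules.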

Figure 3

ROC performance of the (A) gradient boosting model and (B) QSAR model on external validation data set (40 compounds).

Figure 4

Schematic workflow for the development of M.tb permeability machine learning model.

Identification of Statistically Significant Descriptors

The Wilcoxon rank-sum test was performed on the top 100 descriptors of the training data set to find the descriptors that statistically discriminate (P < 0.05) between permeable and impermeable compounds. Figure 5 depicts the statistically significant descriptors. This analysis may help researchers characterize the discriminating features of permeable compounds in detail, which eventually helps in the development of potential permeable compounds.
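The test can be sketched in a few lines. The version below is a simplified two-sided Wilcoxon rank-sum test using the normal approximation without tie or continuity correction, so its P values are approximate for small samples; statistical packages apply further refinements. The descriptor values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (z, two-sided p) from the normal approximation to the rank-sum statistic."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2.0
    std = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (rank_sum_a - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical values of one descriptor in the two classes.
permeable = [3.1, 3.4, 3.8, 4.0, 4.2, 4.5, 4.7, 5.0]
impermeable = [1.0, 1.2, 1.5, 1.9, 2.2, 2.4, 2.8, 3.0]
z, p = rank_sum_test(permeable, impermeable)
is_discriminating = p < 0.05
```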

Figure 5

Significantly discriminating descriptors: (A) ETA_Epsilon_5, (B) nHBint5, (C) ALogP, (D) MaxsOm, (E) Mpe, (F) GATS1e, (G) GATS3v, (H) MeanI, and (I) BCUTp.1l on the basis of the Wilcoxon test (P < 0.05) among permeable and impermeable compounds.

Relative Influence of Various Descriptors toward M.tb Permeability

In this section, different descriptors were analyzed based on their contribution toward M.tb cell permeability and inhibition. BCUT descriptors, atom-type electrotopological descriptors, molecular path count, autocorrelation descriptors, and lipophilicity (log P) are some of the major contributors that have the potential to alter M.tb permeability. Mean decrease accuracy and mean decrease gini were calculated, and the contribution of the descriptors toward the permeability or impermeability of the compounds is discussed in detail. The obtained values suggested that various properties of a molecule, i.e., lipophilicity (log P, size, and shape) and electrotopology (atomic charges, ionization potential, hydrogen bond acceptors/donors, and number of fused rings), determine its activity. In feature importance, autocorrelation descriptors[36] showed a significant difference between permeable and impermeable compounds. The descriptors GATS1s, GATS6e, and AATSC3i showed significant contributions toward permeable compounds when compared to impermeable compounds. Thus, these autocorrelation descriptors could be useful for improving the M.tb permeability of molecules. Similarly, the ATSC5i, MATS2e, GATS6m, and GATS8c descriptors have higher values for impermeable compounds than for permeable compounds. Thus, it can be concluded that the negative contributions of these descriptors may be unfavorable toward the permeability of the molecule. Atom-type electrotopological descriptors assign an intrinsic state value to each skeletal atom or group.[37] This value represents the electrotopological intrinsic state (E-state numbers) and E-state indices (E-state contribution).
Descriptors such as mindsN, maxHBint5, and minHBint2 have high values for permeable compounds; these E-state descriptors are directly proportional to the inhibitory potency of the molecule. Lipophilicity (log P)[38] is the tendency of a molecule to partition between two immiscible phases, i.e., octanol and the polar aqueous (water) phase. The permeability of a compound is directly proportional to the log P value. Different types of log P values, such as AlogP2, MlogP, and XlogP, are favorable for the permeable compounds and unfavorable for the impermeable compounds. Thus, to improve the permeability of a molecule, substitution of optimal lipophilic moieties within the applicability domain may be beneficial. Another important descriptor is nHBAcc_Lipinski (number of hydrogen bond acceptors), which contributes favorably toward the permeable compounds. Thus, it can be concluded that substitution of the optimal number of hydrogen bond acceptors may improve the permeability of M.tb inhibitors. The total count of fused rings (nFRing) favors the impermeable compounds, which indicates that the number of fused rings should be minimal when designing M.tb permeable compounds. It was also noted that the nHBint3 descriptor (40), nHsOH descriptor, minHssNH, and maxsssCH descriptor are unfavorable for the permeability of the compounds. Hence, it is inferred that minimum secondary amines, maximum sp3 hybridized carbon atoms, and minimum hydroxyl groups are critical parameters for gaining inhibitory potency. Overall, apart from the identification of potential M.tb inhibitors, the developed machine learning model may help in understanding the favorable and unfavorable molecular properties that govern M.tb permeability.

Drug Repurposing Studies

Three-Tier (Permeability Screening, Antimycobacterial Activity Prediction, and Virtual Screening) Screening

The M.tb drug targets were considered for computational drug discovery and drug repurposing studies. Ten M.tb genes were chosen for computational drug repurposing. A total of 6086 drug molecules were collected from the DrugBank database, including clinical trial and experimental drugs. A three-tier screening was applied for the identification of potential antimycobacterial compounds. All the molecules were screened with the generated XGBoost model for the permeability criterion. The output values of the M.tb permeability model lie between 0 and 1. Molecules with values ≤0.5 were considered as impermeable compounds. This process screened out 656 compounds that were not able to permeate through the M.tb cell wall according to the machine learning model prediction.

Antimycobacterial Activity Prediction

The biological potential of the screened permeable compounds was then predicted. Fifty-one molecules were predicted as potent antimycobacterial compounds in the PASS analysis and may be used for the treatment of tuberculosis infection. Bedaquiline, a known antimycobacterial drug used to treat multidrug-resistant tuberculosis, was identified by PASS among the 51 molecules with a high Pa (probability “to be active”). Some other antitubercular drugs (i.e., isoniazid, ethambutol, and rifampicin) also showed significant antitubercular activity in the PASS analysis; however, these molecules were not considered further because they are already in use as antitubercular drugs. Forty-one molecules (Table S2) were considered for the virtual screening process.

Virtual Screening of Screened Drugs

The 41 molecules were further docked in the active sites of 10 potential antitubercular drug targets, namely, MtcA2, folA, inhA, Cyp51, folP1, tmk, ligA, pknB, kasA, and dprE1. AutoDock Vina was used for the docking calculations. The top 10 molecules were selected based on their docking scores (>−5 kcal/mol) against the 10 targets. The grid parameters and active sites of the 10 targets are shown in Table S3. The docking scores along with the PASS analysis results for the top 10 drug molecules are shown in Table 2. In the case of enoyl-ACP reductase (InhA), the major interaction was found with Tyr158, and this interaction is essential for fatty acyl substrate binding.[39] The ligands are buried among the hydrophobic residues. In dihydrofolate reductase, a conserved interaction with Asp27 was noticed, and this amino acid is critical for catalysis.[40] In addition, two common hydrophobic interactions (Ile5 and Ile94) were noticed. In protein kinase PknB, common interactions were observed with Leu17, Gly18, Glu93, and Tyr94. These amino acids are essential for blocking the catalytic activity of the protein.[41] The protein KasA in M.tb is involved in the fatty acid biosynthesis pathway. Phe404 acts as a gatekeeper residue, which is responsible for the open and closed conformations.[42] Interestingly, most of the identified molecules interact with this amino acid. In DprE1, Cys387 is an important residue for the binding of small molecule inhibitors.[43] Most of the identified molecules interact with this amino acid. In M.tb beta carbonic anhydrase, the molecules fit well in the active site. Asp53 and Arg55 are important residues, which can adopt different conformations for the accessible cavity.[44] Thus, interaction with these amino acids may help to block the carbonic anhydrase. In the case of CYP51, the major interactions were observed with Phe78 and Phe83. These amino acids are essential “hotspot” residues of this protein.[45] In thymidylate kinase, the major interactions were identified with Asp163, Tyr166, and Glu166. These interactions are essential for the open and closed conformations.[46] The Arg253 amino acid in dihydropteroate synthase[47] is one of the highly conserved key amino acids for avoiding M.tb drug resistance. Interestingly, most of the drugs interact with this amino acid. In the case of DNA ligase, most of the molecules interact with Leu122 and Ile124. These interactions are essential for inhibitory activity.[48] Overall, the 10 molecules formed interactions with the functionally important residues of the 10 targets. Figure S4 displays the identified drug molecules in the active sites of the 10 proteins.

Table 2

The Top 10 Drug Molecules with Probable Tuberculosis Activity in PASS Prediction along with Docking Scores against 10 M.tb Targets

					docking score (kcal/mol)
S. no.	drug name	phase	MOA	Pa	mtcA2	folA	inhA	Cyp51	folP1	tmk	ligA	pknB	kasA	dprE1
1	Nomegestrol	approved	progesterone receptor agonist	0.9793	–7.5	–10.9	–11.9	–11.7	–9.1	–8.4	–9.4	–10.2	–8.6	–10.6
2	NGX267	investigational	muscarinic acetylcholine receptor M1	0.956	–5.0	–6.6	–7.0	–7.2	–6.0	–5.7	–6.1	–6.4	–6.7	–7.6
3	Gamolenic acid	approved and investigational	NA	0.8674	–5.7	–7.6	–7.8	–8.0	–6.1	–7.5	–6.4	–6.9	–7.6	–7.8
4	Tetrazepam	experimental	NA	0.8559	–6.3	–9.2	–10.7	–9.6	–8.9	–7.2	–9.0	–9.5	–8.3	–9.1
5	Nitrofural	approved and investigational	anti-infective agent	0.7415	–4.9	–6.4	–6.2	–6.5	–5.6	–7.3	–6.1	–5.7	–6.4	–7.3
6	Quinine	approved	hemozoin biocrystallization inhibitor	0.6352	–6.7	–9.6	–9.7	–10.4	–8.4	–8.1	–8.1	–8.2	–9.9	–10.6
7	Quinidine	approved and investigational	sodium channel blocker	0.6352	–6.8	–9.6	–9.7	–10.4	–8.4	–8.6	–8.1	–8.2	–8.8	–9.7
8	But-3-enyl-[5-(4-chloro-phenyl)-3,6-dihydro-[1,3,4]thiadiazin-2-ylidene]-amine	experimental	NA	0.6175	–5.6	–7.7	–7.5	–8.2	–6.6	–7.6	–7.0	–6.6	–6.8	–8.0
9	Lefamulin	approved and investigational	50S ribosomal protein L22	0.5986	–7.5	–11.9	–13.3	–12.0	–10.2	–9.9	–10.6	–11.3	–9.1	–12.2
10	Stavudine	approved and investigational	nucleoside reverse transcriptase inhibitor	0.5958	–6.1	–7.2	–7.3	–7.7	–6.3	–8.6	–6.6	–6.3	–7.3	–8.0

Pa is the probability of being active.
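One simple way to read Table 2 is to ask which target each drug docks to most favorably, i.e., where its score is most negative. The sketch below does this for three rows copied from the table; the helper name `best_target` is illustrative.

```python
# Column order of the docking scores in Table 2.
TARGETS = ["mtcA2", "folA", "inhA", "Cyp51", "folP1",
           "tmk", "ligA", "pknB", "kasA", "dprE1"]

# Docking scores (kcal/mol) for three drugs, copied from Table 2.
DOCKING = {
    "Nomegestrol": [-7.5, -10.9, -11.9, -11.7, -9.1, -8.4, -9.4, -10.2, -8.6, -10.6],
    "Lefamulin": [-7.5, -11.9, -13.3, -12.0, -10.2, -9.9, -10.6, -11.3, -9.1, -12.2],
    "Stavudine": [-6.1, -7.2, -7.3, -7.7, -6.3, -8.6, -6.6, -6.3, -7.3, -8.0],
}


def best_target(drug):
    """Return (target, score) for the most negative docking score of a drug."""
    scores = DOCKING[drug]
    index = min(range(len(scores)), key=lambda i: scores[i])
    return TARGETS[index], scores[index]


for drug in DOCKING:
    target, score = best_target(drug)
    print(f"{drug}: {target} ({score} kcal/mol)")
```

Docking scores are only a ranking heuristic, so a favorable score here is a prompt for follow-up rather than evidence of inhibition.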

Materials and Methods

Data Set Preparation

As a first step, 86 833 M.tb inhibitors covering a wide range of M.tb inhibitory potency were collected from the ChEMBL database.[49] These compounds were then classified according to their biological assay protocols (MIC, activity, CC50, GI, IC50, IC90, inhibition, Ki, MIC50, MIC90, MIC99, ratio, selective index, AvgIC90, cc25, EC50, EC90, EC99, CFU, growth index, IZ, Kd, Km, log 1/MMIC, log 10CFU, log 10CFU/mL, MBC, MIC95, ratioIC50, and selective ratio). Next, the redundant compounds (34 411 compounds) in each category were removed, and the nonredundant compounds (52 422 compounds) were retained for further analysis (Table S4). A total of 1114 compounds from the nonredundant data set, with MIC values (846 compounds) or IC50 values (268 compounds), were considered in this study (Tables S5 and S6). Compounds with MIC values ≤200 nM (846 compounds) were considered as permeable compounds, and compounds with MIC > 200 nM and IC50 ≤ 1000 nM (268 compounds) were considered as impermeable compounds. Furthermore, 40 compounds (19 InhA inhibitors and 21 impermeable compounds from the ChEMBL database) were collected and stored as an external validation set to validate the machine learning model.
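The labelling rule stated above can be written as a small function. This is a sketch of the rule as described in the text; the function name and the handling of compounds with missing values (returning None) are assumptions.

```python
def label_compound(mic_nm=None, ic50_nm=None):
    """Label a compound from its MIC and IC50 values (both in nM).

    Returns 'permeable', 'impermeable', or None when the rule does not apply.
    """
    if mic_nm is not None and mic_nm <= 200:
        return "permeable"
    if (mic_nm is not None and ic50_nm is not None
            and mic_nm > 200 and ic50_nm <= 1000):
        # Potent on the enzyme but weak on whole cells: assumed not to get in.
        return "impermeable"
    return None  # ASSUMPTION: such compounds are left out of the data set


examples = [
    label_compound(mic_nm=150),                   # potent in the cell-based assay
    label_compound(mic_nm=5000, ic50_nm=300),     # enzyme-active, cell-inactive
    label_compound(mic_nm=5000, ic50_nm=20000),   # inactive in both assays
]
```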

Compositional Analysis

The distribution of atom types in the data set was calculated using the “atomcount” function in the “ChemmineR” library of the “R” software. The physicochemical descriptors (i.e., elemental composition) were calculated for the permeable and impermeable compounds.[50]
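To illustrate what an atom-count profile looks like, the sketch below tallies elements from a molecular formula string. This is a stand-in for the structure-based ChemmineR function named above, not a reimplementation of it, and it handles only flat formulas without parentheses or charges.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms of each element in a flat molecular formula such as 'C6H7N3O'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


def atom_fractions(formula):
    """Fraction of all atoms contributed by each element."""
    counts = atom_counts(formula)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


# Isoniazid, one of the antitubercular drugs mentioned in the text.
isoniazid = atom_counts("C6H7N3O")
```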

Input Features

The 1D and 2D descriptors of all the molecules were calculated using the PaDEL descriptor calculation tool, which represents the properties of the molecules as numerical values. Different 1D (molecular weight, molar refractivity, solubility, and permeability) and 2D (log P, log D, and topological polar surface area) descriptors were considered to generate a machine learning model.[26]

Development of Classification Models for the Prediction of Permeability

A total of 1444 different 1D and 2D descriptors were calculated, and among these, the 1186 descriptors that have a nonzero value for most of the compounds were retained. The data set was split into a training set (75%) and a test set (25%).
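The two preparation steps above can be sketched as follows. The 0.5 cutoff is an assumed reading of "most of the compounds", the split is a plain random split (the text does not say whether it was stratified), and the small descriptor matrix is hypothetical.

```python
import random


def keep_informative_columns(rows, min_nonzero_fraction=0.5):
    """Indices of descriptor columns that are nonzero in more than the given
    fraction of compounds. The 0.5 default is an ASSUMPTION."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    kept = []
    for col in range(n_cols):
        nonzero = sum(1 for row in rows if row[col] != 0)
        if nonzero / n_rows > min_nonzero_fraction:
            kept.append(col)
    return kept


def split_indices(n_items, train_fraction=0.75, seed=0):
    """Shuffle item indices and cut them into training and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


# Hypothetical descriptor matrix: 4 compounds x 3 descriptors.
rows = [[1.2, 0.0, 3.0], [0.7, 0.0, 0.0], [2.1, 0.5, 4.4], [0.9, 0.0, 1.1]]
kept = keep_informative_columns(rows)
train_idx, test_idx = split_indices(1114)
```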

Principal Component Analysis

The high dimensionality of the data (1186 descriptors × 1114 compounds) was compressed using PCA to identify the major components that can distinctly differentiate the two data sets.[34]

Appropriate Machine Learning Model Selection for Classification

The 1187 descriptors along with the response variables were given as input features to train different machine learning models, i.e., RF, GBM, CART, generalized linear model, SVM, KNN, NB classifier, and logistic regression. A 10-fold cross-validation was performed on the training data to compare the performance of the machine learning models[51] using the “mlbench” package in R software.[35]
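In 10-fold cross-validation, the training data are cut into ten parts, and each part serves once as the held-out set while the model is fitted on the other nine. A minimal sketch of the index bookkeeping, with a hypothetical data set size of 100:

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical size; each model would be fitted on `train` and scored on `validation`.
folds = list(k_fold_indices(100, k=10))
```

Averaging the ten validation scores gives the per-model accuracy that is then compared across algorithms.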

Optimization of Parameters for the Generation of Classification Models

The “RF” package in R (version 3.5.2) was used to optimize the input feature parameters, viz., mtry (number of randomly selected variables) and ntree (number of trees generated by the RF algorithm).[51] The importance of each descriptor was analyzed by calculating the mean decrease accuracy value in the RF algorithm at the best mtry value, which was tuned by the tuneRF function. The optimized values for each mtry in RF were estimated by OOB error estimation. The descriptor-based RF model was evaluated using the top 20, 40, 60, 80, and 100 descriptors and various mtry values at ntree = 200, 400, 600, 800, and 1000. The best model was selected based on the minimum OOB error value. A 10-fold cross-validation method was applied for further optimization of the classification models.
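The selection procedure amounts to a grid search over (mtry, ntree) pairs that keeps the pair with the lowest OOB error. In the sketch below, `oob_error` is a caller-supplied function that would fit a random forest and return its OOB error; the toy function used here is shaped arbitrarily so that its minimum falls at mtry = 4 and ntree = 800, and its outputs are not real OOB errors.

```python
from itertools import product


def select_best(mtry_values, ntree_values, oob_error):
    """Evaluate every (mtry, ntree) pair and return the one with the lowest error."""
    results = {
        (mtry, ntree): oob_error(mtry, ntree)
        for mtry, ntree in product(mtry_values, ntree_values)
    }
    best = min(results, key=results.get)
    return best, results[best]


def toy_oob_error(mtry, ntree):
    """Toy stand-in for a real model fit, so the sketch runs without an ML library."""
    return 0.05 + 0.01 * abs(mtry - 4) + 0.00001 * abs(ntree - 800)


best_setting, best_error = select_best(
    [2, 3, 4, 5, 6], [200, 400, 600, 800, 1000], toy_oob_error
)
```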

Optimization of GBM for the Regression Model Generation

The selected features from the best RF model were given as input for the development of the GBM. Extreme Gradient Boosting (XGBoost) uses the Gradient Boosting Decision Tree (GBDT) for classification and regression problems.[52−54] The inside nodes of the regression tree represent the values for an attribute test, whereas the leaf nodes with scores represent a decision. The prediction is the sum of the scores predicted by K trees,

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F$$

where $x_i$ is the ith training sample, $f_k(x_i)$ is the score for the kth tree, and $F$ is the space of functions containing all regression trees. The objective function to be optimized is given by the following formula:

$$\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

The former term is a differentiable loss function that measures how well the model fits the training data set. The latter is a term that penalizes the complexity of the model: when the complexity of the model increases, the corresponding score is deducted. In this study, the 2D descriptors are the input to the XGBoost classifier, and the predicted class (i.e., permeable or impermeable) on a scale of 0–1 is the output variable. A probability value greater than or equal to 0.5 indicates impermeable compounds, whereas a probability value less than 0.5 indicates permeable compounds.

Validation of Final Classification Models

The performance of the RF and GBM models was evaluated on the 25% test data set and on a separate validation data set using performance measures derived from the confusion matrix, including the Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the true positive, TN is the true negative, FP is the false positive, and FN is the false negative.
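The measures can be computed directly from the four confusion-matrix counts, as in the Python sketch below. Only MCC is named explicitly in the text; accuracy, sensitivity, and specificity are included here as an assumption because they are the usual companions, and the function name is hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute common binary-classification measures from confusion-matrix counts.

    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }
```

MCC ranges from −1 to 1 and uses all four counts, which makes it more informative than accuracy when the permeable and impermeable classes are unevenly sized.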

Drug Repurposing Studies

A series of filtering techniques was adopted to identify potent antimycobacterial compounds. As a first step, all the drug molecules from DrugBank were collected and screened with the generated permeability model. The screened compounds were further evaluated with PASS biological activity prediction to identify potential antimycobacterial compounds. Furthermore, virtual screening was performed against 10 different M.tb drug targets (MtcA2, folA, inhA, Cyp51, folP1, tmk, ligA, pknB, kasA, and dprE1) to identify potential drug molecules. These targets were selected based on their role in M.tb virulence, cell wall synthesis, biochemical pathways, etc. Table S7 displays the function of all the selected 10 M.tb targets.

Biological Activity Prediction

The PASS online web server is widely used for the prediction of the biological activity of compounds, including different pharmacological effects, toxic and adverse effects, mechanisms of action, and the influence on gene expression.[55−58] The biological activity of a molecule is calculated based on structure-activity relationships using a training set of >300 000 biologically active substances (i.e., drugs, drug-like molecules, lead compounds, and toxins) in the PASS server. The activity in each category (e.g., antiviral, antibacterial, and antimycobacterial) is evaluated by the probability of active (Pa) value. The compounds that passed the permeability model were further screened in PASS for their antimycobacterial activity.

Virtual Screening Protocol

Screening of large molecule libraries against a molecular target is known as virtual screening.[12,14] AutoDock Vina was used for estimating the binding affinity of the drug molecules in the active site of M.tb drug targets. Proteins (MtcA2, folA, inhA, Cyp51, folP1, tmk, ligA, pknB, kasA, and dprE1) and all the predicted permeable compounds were converted into the corresponding .pdbqt files using AutoDock tools.[59] The grids were generated in the active sites of all the drug targets. The compounds that passed through the above two filters (M.tb machine learning model and PASS antimycobacterial activity prediction) were further docked in the M.tb drug targets. The top-scored compounds are reported as the potential hits for antimycobacterial drug discovery.
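The three-stage funnel (permeability model, PASS prediction, docking) can be summarized in a short Python sketch. The `Candidate` class, the `pa_cutoff` and `top_n` parameters, and the choice to rank by the best score across targets are all hypothetical; the source gives neither a Pa cutoff nor a ranking rule. The sketch relies only on the stated 0.5 permeability threshold and on the AutoDock Vina convention that more negative scores indicate stronger predicted binding.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    p_impermeable: float         # permeability model output; >= 0.5 means impermeable
    pa_antimycobacterial: float  # PASS probability of activity (Pa)
    docking_scores: dict = field(default_factory=dict)  # target -> score in kcal/mol


def screen(candidates, pa_cutoff=0.5, top_n=10):
    """Apply the two filters, then rank survivors by their best docking score.

    pa_cutoff and top_n are hypothetical parameters chosen for illustration.
    """
    # Filter 1: keep compounds predicted to be permeable.
    permeable = [c for c in candidates if c.p_impermeable < 0.5]
    # Filter 2: keep compounds PASS predicts as antimycobacterial.
    active = [c for c in permeable if c.pa_antimycobacterial >= pa_cutoff]
    # Docking: more negative scores rank first; skip compounds without scores.
    docked = [c for c in active if c.docking_scores]
    ranked = sorted(docked, key=lambda c: min(c.docking_scores.values()))
    return ranked[:top_n]
```

Applying the cheap filters before docking mirrors the order in the text: only compounds that pass both the permeability model and the PASS prediction incur the cost of docking against the 10 targets.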

Conclusions

Small-molecule permeability through biological systems depends mainly on physicochemical factors. M.tb permeability is one of the major challenges in developing potential antimycobacterial compounds. A systematic machine learning study based on cell-based and enzyme-based inhibitory activity was performed to delineate the causative factors affecting permeability. The generated XGBoost M.tb permeability model was validated with known M.tb inhibitors and impermeable compounds. The XGBoost model was then used to screen the DrugBank database. The resulting molecules were further screened in the PASS server to identify potential compounds with significant antitubercular activity. The screened candidates are expected to display high permeability through the M.tb cell wall and good antitubercular activity. The compounds that passed through the two filters were docked in the active sites of 10 different potential M.tb targets. Ten drugs are reported as potential repurposable antitubercular candidates. Overall, the obtained results can be used for the identification of novel potential drug molecules against drug-resistant M.tb strains with improved cell permeability.
