
Machine Learning Model Analysis and Data Visualization with Small Molecules Tested in a Mouse Model of Mycobacterium tuberculosis Infection (2014-2015).

Sean Ekins1,2, Alexander L Perryman3, Alex M Clark4, Robert C Reynolds5, Joel S Freundlich3,6.   

Abstract

The renewed urgency to develop new treatments for Mycobacterium tuberculosis (Mtb) infection has resulted in large-scale phenotypic screening and thousands of new active compounds in vitro. The next challenge is to identify candidates to pursue in a mouse in vivo efficacy model as a step to predicting clinical efficacy. We previously analyzed over 70 years of this mouse in vivo efficacy data, which we used to generate and validate machine learning models. Curation of 60 additional small molecules with in vivo data published in 2014 and 2015 was undertaken to further test these models. This represents a much larger test set than for the previous models. Several computational approaches have now been applied to analyze these molecules and compare their molecular properties beyond those attempted previously. Our previous machine learning models have been updated, and a novel aspect has been added in the form of mouse liver microsomal half-life (MLM t1/2) and in vitro-based Mtb models incorporating cytotoxicity data that were used to predict in vivo activity for comparison. Our best Mtb in vivo models possess fivefold ROC values > 0.7, sensitivity > 80%, and concordance > 60%, while the best specificity value is >40%. Use of an MLM t1/2 Bayesian model affords comparable results for scoring the 60 compounds tested. Combining MLM stability and in vitro Mtb models in a novel consensus workflow in the best cases has a positive predicted value (hit rate) > 77%. Our results indicate that Bayesian models constructed with literature in vivo Mtb data generated by different laboratories in various mouse models can have predictive value and may be used alongside MLM t1/2 and in vitro-based Mtb models to assist in selecting antitubercular compounds with desirable in vivo efficacy. We demonstrate for the first time that consensus models of any kind can be used to predict in vivo activity for Mtb. 
In addition, we describe a new clustering method for data visualization and apply this to the in vivo training and test data, ultimately making the method accessible in a mobile app.


Year:  2016        PMID: 27335215      PMCID: PMC4962118          DOI: 10.1021/acs.jcim.6b00004

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


Introduction

Tuberculosis (TB) is a major infectious disease that unfortunately knows no geographic boundaries and accounts for approximately 9 million new cases and 1.5 million deaths each year.[1] TB and its etiological agent, Mycobacterium tuberculosis (Mtb), continue to be the focus of intense international efforts to develop new tools for the control and ultimate elimination[2] of this devastating disease that is increasingly associated with resistance to first- and second-line drugs.[3] The discovery of new TB drug candidates with novel mechanisms of action is of fundamental importance in this regard. The majority of funding for TB research still comes from the NIH NIAID and the Bill and Melinda Gates Foundation. In the past, the European Commission has also funded TB research in the FP7 Program (although nowhere near the levels of the aforementioned organizations). However, no funding for TB small-molecule drug discovery is foreseen in the EC’s Horizon 2020 Program over the next few years. These cuts in funding highlight the need to increase the efficiency of tuberculosis small-molecule drug discovery. Analysis of the recent pipeline at TB Alliance[4] and elsewhere[5] reveals that while there are ∼27 projects in preclinical stages and 13 in clinical trials in phases 1–3, only one project is in phase 4 (Figure 1). This indicates a suboptimal pipeline. Of the latter clinical-stage compounds, several do not seem to have progressed since earlier analyses. It is incredibly concerning that we do not have more new molecules in clinical stages, especially with the prescribing limitations surrounding bedaquiline and delamanid because of their cardiovascular side effects from hERG inhibition.[6,7]
Figure 1

Global TB pipeline using data from TB Alliance and the Working Group on New TB Drugs Drug Pipeline.

A major hurdle to progressing molecules into the clinic is identifying compounds that have activity in mouse models of Mtb infection.[8] Mice have been used since the 1940s to test drug efficacy.[9] Of course, the mouse cannot completely model the complex pathology observed in humans, and its drug metabolism and pharmacokinetics may also differ. A recent questionnaire polled different TB laboratories and found that most use BALB/c, C57BL/6, or Swiss mice.[9] It was concluded that the mouse model may be most useful for rank-ordering compounds to select a drug regimen. However, some laboratories also supplement the in vivo mouse data with in vivo rabbit or marmoset studies of Mtb infection. We previously published a comprehensive assessment of over 70 years of literature, resulting in modeling of 773 compounds reported to modulate Mtb infection in mice.[8] Our detailed analyses of the physicochemical and structural properties of both active and inactive molecules, as well as their chemical property space coverage, revealed new insights. Furthermore, we used machine learning models to correctly predict in vivo efficacy in the Mtb-infected mouse model for eight of 11 compounds. We identified gaps corresponding to the discovery and approval of new compounds,[10] as highlighted by the 40 years between the approvals of rifampicin and bedaquiline, suggesting that we can learn from earlier drug discovery. Furthermore, there were clear peaks of in vivo testing in the 1940s and 1950s, and recent in vivo testing in mice also appears to have peaked,[10] raising further concerns about the future health of the TB drug pipeline. Since that report, we have collated an additional 60 molecules from data published in 2014 and 2015 and have further evaluated and validated the earlier models with this much larger test set, which was not previously available.
In addition, we have used recently published models based on in vitro data to create consensus models to predict in vivo activity. Further we describe a new clustering method for data visualization and apply it to all of the in vivo data gathered to date. Our goal is to continue to develop and validate various computational approaches for predicting and visualizing in vivo activity data in the Mtb mouse model, enabling the prediction of new compounds that are the most promising for advancement.

Experimental Section

Data Collection

The original data used in the initial published model were curated and quality-assessed as described previously.[8] Literature searching in the 2014–2015 time frame was performed using PubMed, and data curation was also performed as described previously.[8] We present only new molecules that were not included in the earlier in vivo paper, as assessed using the similarity of training and test compounds based on the Tanimoto similarity distance metric[11−13] in Discovery Studio (a distance of 0 indicates that the molecule is already in the model). This distance is a generalization to continuous properties of the Tanimoto distance for binary fingerprints: D = 1 − Σxᵢyᵢ/[Σxᵢ² + Σyᵢ² − Σxᵢyᵢ]. Possible values range from 0 to 1.3333 (i.e., 4/3). As also described earlier, molecules were classified as active in the mouse model if they demonstrated at least a 1 log10 reduction in colony-forming units (CFU) (or in some cases a statistically significant reduction in CFU).[8]
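The continuous Tanimoto distance above can be sketched in a few lines; this is our own minimal pure-Python illustration of the formula, not the Discovery Studio implementation:

```python
def tanimoto_distance(x, y):
    """Continuous generalization of the Tanimoto distance:
    D = 1 - sum(x*y) / (sum(x^2) + sum(y^2) - sum(x*y)).

    For binary fingerprints this reduces to 1 - intersection/union
    and lies in [0, 1]; for signed continuous descriptors the value
    can reach 4/3 (= 1.3333), matching the range quoted in the text.
    """
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return 1.0 - xy / (xx + yy - xy)
```

Identical vectors give a distance of 0 (the "molecule already in the model" case), disjoint binary fingerprints give 1, and anti-correlated continuous values approach the 4/3 maximum.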

Test Set Molecular Property Distribution

AlogP, molecular weight, number of rotatable bonds, number of rings, number of aromatic rings, number of hydrogen-bond acceptors, number of hydrogen-bond donors, and molecular fractional polar surface area were calculated from input structure-data (SD) files using Discovery Studio 4.1 (San Diego, CA).[8]

Principal Component Analysis with in Vivo Test Set Compounds and TB Mobile Data

In order to assess the applicability domain of the 60 new in vivo molecules and the 784 compounds in the in vivo Mtb training set, we used the union of these sets to generate a principal component analysis (PCA) plot based on the interpretable descriptors selected previously (AlogP, molecular weight, number of rotatable bonds, number of rings, number of aromatic rings, number of hydrogen-bond acceptors, number of hydrogen-bond donors, and molecular fractional polar surface area) for machine learning. We also compared the 60 new compounds tested in the in vivo mouse Mtb model to the previously described 805 compounds with known Mtb targets collated from the literature[14] and available in TB Mobile (version 2).[15] This PCA model essentially represents the published target–chemistry property space for Mtb.[15]
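The PCA described above can be sketched with scikit-learn as a stand-in for Discovery Studio; the descriptor matrices below are synthetic placeholders, not the actual data, and serve only to show the shape of the calculation (union of sets, three components, explained variance):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# rows = molecules, columns = the eight interpretable descriptors
# (AlogP, MW, rotatable bonds, rings, aromatic rings, HBA, HBD, FPSA)
training = rng.normal(size=(784, 8))   # in vivo Mtb training set
test = rng.normal(size=(60, 8))        # new 2014-2015 compounds

# Fit on the union of the two sets, as in the text, after scaling
X = StandardScaler().fit_transform(np.vstack([training, test]))
pca = PCA(n_components=3).fit(X)
scores = pca.transform(X)
print("variance explained by 3 PCs:", pca.explained_variance_ratio_.sum())
```

Plotting the first two or three columns of `scores`, colored by set membership, reproduces the kind of applicability-domain view shown in Figure 2.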

Building and Validating Machine Learning Models with Mouse Mtb in Vivo Data

We have previously described the generation and validation of the Laplacian-corrected naïve Bayesian classifier models developed from the mouse Mtb infection in vivo model using Discovery Studio 3.5.[16−20] We have now updated the Bayesian, tree, and support vector machine (SVM) models using Discovery Studio 4.1. In addition to the eight molecular descriptors listed in the previous section, the molecular function class fingerprint of maximum diameter 6 (FCFP_6) was added as the ninth descriptor.[21] Computational models were validated using leave-one-out cross-validation, in which each sample was left out one at a time, a model was built using the remaining samples, and that model was used to predict the left-out sample. Each model was internally validated, receiver operating characteristic (ROC) plots were generated, and the areas under the cross-validated ROC curves (XV ROC AUC) were calculated. Fivefold cross-validation (leaving out 20% of the data set, repeated five times) was also performed, as was leave-out-50% × 100-fold cross-validation. We compared the resulting Bayesian model with SVM, recursive partitioning forest (RP Forest), and RP Single Tree models built with the same set of molecular descriptors in Discovery Studio. For the SVM models,[22,23] we calculated the interpretable descriptors in Discovery Studio and then used Pipeline Pilot to generate the FCFP_6 descriptors, followed by integration with R.[24] The RP Forest[25−28] and RP Single Tree models used the standard protocol in Discovery Studio. In the case of the RP Forest models, 10 trees were created with bootstrap aggregation (“bagging”). For each tree, a bootstrap sample of the original data is taken, and this sample is used to grow the tree.
A bootstrap sample is a data set of the same total size as the original one, but a subset of the data records can be included multiple times (i.e., each tree is built with a slightly different subset of the original set, and each tree’s set can contain duplicates). RP Single Tree models had a minimum of 10 samples per node and a maximum tree depth of 20. In all cases, fivefold cross-validation was used to calculate the ROC for the models generated. CDD Models (Collaborative Drug Discovery, Inc., Burlingame, CA) was also utilized to build a Bayesian model using just the open source FCFP_6 descriptors and threefold cross-validation as described previously.[29] This provides an approach for generating models that can be shared between researchers and used in mobile apps, thereby making the models more accessible.[30−33]
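The validation scheme above can be sketched with scikit-learn stand-ins (a naive Bayes classifier and a 10-tree bagged random forest in place of the Discovery Studio models); the fingerprint bits and activity labels below are synthetic placeholders:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(784, 128))  # binary fingerprint-like bits
y = rng.integers(0, 2, size=784)         # active / inactive labels

# Fivefold cross-validation: leave out 20% of the data set, five times
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
bayes_auc = cross_val_score(BernoulliNB(), X, y, cv=cv, scoring="roc_auc")

# 10 trees grown on bootstrap samples ("bagging"), as in the RP Forest
forest = RandomForestClassifier(n_estimators=10, bootstrap=True,
                                random_state=1)
forest_auc = cross_val_score(forest, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold ROC AUC: Bayes {bayes_auc.mean():.2f}, "
      f"Forest {forest_auc.mean():.2f}")
```

With random labels the AUC hovers near 0.5; the point is the cross-validation mechanics, not the scores.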

Mouse Mtb Infection Model Predictions for Compounds Identified after Model Building

From the data curation in this study, 60 compounds were identified from the literature (2014–2015) that were tested in Mtb-infected mice (Supplemental Data 1). These were predicted with the mouse Mtb infection computational machine learning models previously reported as well as the updated models built with 784 compounds. For each molecule, the closest distance to the training set for each model was also calculated using the “calculate molecular properties” protocol in Discovery Studio; a value of zero indicates that the molecule is in the training set, while larger values indicate that it is more dissimilar to the training set.

In Vivo Activity Predictions with Previous in Vitro-Trained Bayesian Models

Previously generated Bayesian models for mouse liver microsomal half-life (MLM t1/2)[34] and dual-event models that combine in vitro Mtb activity and Vero cell cytotoxicity[35,36] (e.g., the Tuberculosis Antimicrobial Acquisition and Coordinating Facility (TAACF-CB2) and Molecular Libraries Small Molecule Repository (MLSMR) data sets) were used either alone or in a novel consensus workflow to predict the in vivo Mtb activity of the 60 compounds identified from the literature (2014–2015) that were tested in mice. The sort-by-two-attributes features in Discovery Studio and Excel were used to organize the data, followed by tabulation of the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) to enable calculation of the external statistics and enrichment factors.
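The two consensus rules used in this workflow (spelled out in the footnotes to Table 4) can be sketched as follows; the boolean predictions and labels are illustrative stand-ins for the actual Bayesian model calls:

```python
def consensus(pred_mlm, pred_dual, actual):
    """Initial consensus: a compound is counted only when the MLM
    stability and dual-event Bayesian predictions agree; compounds on
    which the models disagree receive no prediction.
    Returns (TP, FP, TN, FN) over the covered subset."""
    tp = fp = tn = fn = 0
    for m, d, a in zip(pred_mlm, pred_dual, actual):
        if m and d:              # both say good/active
            tp += a
            fp += not a
        elif not m and not d:    # both say bad/inactive
            tn += not a
            fn += a
        # disagreement -> compound not covered
    return tp, fp, tn, fn


def modified_consensus(pred_mlm, pred_dual, actual):
    """Modified consensus: active only if both models agree; a single
    'inactive' call classifies the compound as negative, so every test
    compound receives a prediction."""
    tp = fp = tn = fn = 0
    for m, d, a in zip(pred_mlm, pred_dual, actual):
        if m and d:
            tp += a
            fp += not a
        else:
            tn += not a
            fn += a
    return tp, fp, tn, fn
```

Concordance for the initial consensus is then (TP + TN) divided by the number of covered compounds, as defined in the Table 4 footnotes.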

Clustering of Mouse Mtb in Vivo Data

Honeycomb clustering (Molecular Materials Informatics, Inc., Montreal, Canada) is a greedy layout method for arranging structures on a plane in a meaningful way. A single reference compound is selected by the user as the focal point, and this is placed on a hexagonal grid pattern. For each compound, the ECFP_6 fingerprints are determined, and for all similarity comparisons, the Tanimoto coefficient is used as the metric. The six compounds most similar to the reference compound are arranged in the six available positions immediately adjacent to the focus to form the initial flower-petal starting point. The compound that is most similar to the reference compound is placed immediately above (“north”), and all possible permutations of the remaining five neighbors are considered and evaluated by summing the pairwise similarities between radially adjacent neighbors. The permutation with the highest score is used, and thus, the placement positions for the first seven compounds are fixed. The remaining compounds are ordered by decreasing similarity to the reference compound, and each is evaluated in turn. At each step the next compound is placed irreversibly: all of the unoccupied hexagons that are adjacent to at least one already-placed compound are considered and evaluated according to a score. The hexagon with the highest score is taken to be the position for this compound. The score is calculated by determining the average similarity of the compound to each of its putative neighbors. An additional “density fudge factor” of 0.01 per neighbor is added to balance out what would otherwise be a tendency to minimize the neighbor count, i.e., to prevent overfavoring long, spindly branches. 
For positions where there is just a single neighbor, there are three positions that may be occupied by neighbors of the neighbors, and each of these that is occupied is compared with the current compound: for the position directly opposite, the score is increased by its similarity to the current compound multiplied by 0.001, whereas for the other two positions the multiplier is 0.002. This additional term encourages the arrangement of compounds to “bend” in the direction that encourages higher similarity. This approach was used with the complete training and test set for compounds tested in the mouse Mtb infection model.
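The greedy placement described above can be sketched on an axial hexagonal grid. This is a deliberately simplified illustration: the similarity function is a plain callable standing in for ECFP_6 Tanimoto similarity, the permutation search over the first ring and the directional "bending" bonus are omitted, and only the average-similarity score with the 0.01-per-neighbor density term is kept:

```python
# The six axial-coordinate neighbors of a hexagonal cell (q, r)
HEX_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def honeycomb_layout(compounds, focus, sim):
    """Greedily place `compounds` on a hex grid around `focus`.
    sim(a, b) -> similarity in [0, 1]. Returns {compound: (q, r)}."""
    placed = {focus: (0, 0)}
    # Remaining compounds ordered by decreasing similarity to the focus
    rest = sorted((c for c in compounds if c != focus),
                  key=lambda c: sim(focus, c), reverse=True)
    # Seed the six most similar compounds into the first ring
    for c, d in zip(rest[:6], HEX_DIRS):
        placed[c] = d
    for c in rest[6:]:
        best, best_score = None, -1.0
        occupied = set(placed.values())
        # Consider every empty hexagon adjacent to a placed compound
        for q, r in occupied:
            for dq, dr in HEX_DIRS:
                cell = (q + dq, r + dr)
                if cell in occupied:
                    continue
                ring = {(cell[0] + a, cell[1] + b) for a, b in HEX_DIRS}
                neigh = [x for x, pos in placed.items() if pos in ring]
                # Average similarity to putative neighbors plus the
                # "density fudge factor" of 0.01 per neighbor
                score = (sum(sim(c, n) for n in neigh) / len(neigh)
                         + 0.01 * len(neigh))
                if score > best_score:
                    best_score, best = score, cell
        placed[c] = best  # irreversible placement
    return placed
```

Because each step commits the highest-scoring hexagon, similar compounds accrete into contiguous patches around the user-chosen focal compound, which is the visual effect the honeycomb clustering aims for.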

Statistical Analysis

Means of descriptor values for active and inactive compounds were compared by two-tailed t test with JMP version 8.0.1 (SAS Institute, Cary, NC). We also evaluated several additional alternative statistics for the test set, including Youden’s J statistic,[37] Matthews’ correlation coefficient (MCC),[38] the F1 score,[39] and κ,[40,41] which are given by the following expressions:

J = sensitivity + specificity − 1 = TP/(TP + FN) + TN/(TN + FP) − 1
MCC = (TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]
F1 = 2TP/(2TP + FP + FN)
κ = (po − pe)/(1 − pe)

where po is the relative observed agreement among raters and pe is the hypothetical probability of chance agreement, obtained using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by pe), then κ ≤ 0.
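These four statistics follow directly from a binary confusion matrix; a minimal sketch (our own helper, not the JMP workflow):

```python
import math

def external_stats(tp, fp, tn, fn):
    """Youden's J, Matthews correlation coefficient, F1 score, and
    Cohen's kappa from a binary confusion matrix."""
    n = tp + fp + tn + fn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    j = sens + spec - 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1 = 2 * tp / (2 * tp + fp + fn)
    p_o = (tp + tn) / n                         # observed agreement
    p_e = ((tp + fp) * (tp + fn)
           + (fn + tn) * (fp + tn)) / n ** 2    # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return j, mcc, f1, kappa
```

As a sanity check, confusion counts consistent with the N = 773 Bayesian row of Table 3 (41 actives at 70.7% sensitivity and 19 inactives at 36.8% specificity give TP = 29, FN = 12, TN = 7, FP = 12) reproduce the κ, MCC, J, and F1 values reported for that model in Table 7.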

Results

Molecular Property Distribution

The 60 compounds identified from the literature (2014–2015) that were tested in Mtb-infected mice were collated in this study (Table S1 and Supplemental Data 1) and analyzed with respect to the eight simple and interpretable molecular descriptors used previously (Table 1). The only statistically significant difference between active and inactive compounds was that the number of hydrogen-bond donors was higher for the in vivo active compounds.
Table 1

Means and Standard Deviations of Molecular Descriptors for the New in Vivo Mtb Dataset (N = 60), Comparing Actives and Inactives(a)

                  | MW              | AlogP       | HBD          | HBA         | Num Rings   | Num Arom Rings | FPSA        | RBN
active (N = 41)   | 493.88 ± 219.81 | 3.65 ± 3.00 | 2.07 ± 2.70(b) | 7.22 ± 2.95 | 3.63 ± 0.83 | 2.15 ± 0.96  | 0.26 ± 0.08 | 7.10 ± 3.18
inactive (N = 19) | 427.83 ± 72.78  | 3.63 ± 1.91 | 1.05 ± 0.97  | 6.21 ± 2.17 | 3.68 ± 1.06 | 2.53 ± 0.90  | 0.24 ± 0.10 | 6.47 ± 1.87

(a) MW = molecular weight; HBD = number of hydrogen-bond donors; HBA = number of hydrogen-bond acceptors; Num Rings = number of rings; Num Arom Rings = number of aromatic rings; FPSA = fractional polar surface area (sum of the areas of the polar regions of the molecular surface divided by the total molecular surface area); RBN = number of rotatable bonds.

(b) p < 0.05 versus the inactive compounds.


Principal Component Analysis with in Vivo Data, in Vitro Hits, and TB Mobile Data

PCA of the complete set of 844 molecules with in vivo data indicated that the majority of the 60 new molecules lie within the area of the training set (Figure 2A), suggesting similar chemical property coverage. PCA with the same descriptors and the compounds for which we have previously collected information on targets in Mtb suggested that these 60 molecules tested in vivo in mice occupy the same regions of physicochemical property space as the prior compounds (Figure 2B).
Figure 2

(A) Principal component analysis (PCA) of the updated training set for the in vivo model (blue) and compounds tested in vivo in 2014 and 2015 (yellow). Three principal components explain 86.9% of the variance. (B) PCA of TB Mobile 2 compounds (N = 805, blue) and compounds tested in vivo in 2014 and 2015 (yellow). Three principal components explain 87.5% of the variance.


Building and Validating Machine Learning Models with Mouse Mtb Data

The original N = 773 data set used to build a Discovery Studio Bayesian model had a leave-one-out ROC AUC of 0.77 and a fivefold cross-validation value of 0.73 (Supplemental Data 2). The updated N = 784 Discovery Studio Bayesian model (combining the initial training set and test set[8]) for in vivo Mtb activity had fivefold cross-validation ROC values > 0.70 (Table 2), which were comparable to those published previously.[8] The fivefold cross-validation model ROC values (Supplemental Data 3) were comparable to the leave-out-50% × 100-fold ROC values, although the concordance, specificity, and sensitivity were lower in the latter case (Table S2).
Table 2

Fivefold Cross-Validation ROC AUC Values for the Updated (N = 784) in Vivo Machine Learning Models

Bayesian(a) | SVM  | Single Tree | Forest
0.733       | 0.77 | 0.72        | 0.74

(a) Bayesian leave-one-out cross-validation = 0.772.


Model Predictions for Additional Compounds Identified after Model Building

The 60 molecules collated in this study (Table S1 and Supplemental Data 1) were used as an external test set for the original N = 773 Bayesian model (Supplemental Data 1), which produced a poor ROC score (0.554). The updated N = 784 Bayesian model performed similarly (Supplemental Data 3), displaying a poor ROC (0.558). The set of 60 molecules included 20 PA-824 analogues (mean closest distance = 0.26 with a standard deviation of 0.07; Table S3); all were predicted by these Bayesian models as actives (including seven in vivo inactives), which may reflect some bias based on similarity to active PA-824 analogues. New models, including the SVM (Supplemental Data 4), Single Tree, and Forest models, had similar ROC values (Table S4). When the sensitivity, specificity, and concordance data for the 60 molecules are compared across the different models, there are subtle differences. For example, the RP Forest model has the highest sensitivity (85.4%) and concordance (66.7%). The CDD Bayesian model, which relies solely on the open source FCFP_6 descriptors, was similar (82.9% and 61.7% for sensitivity and concordance, respectively). The SVM model has the highest specificity (42.1%) (Table 3).
Table 3

External Statistics for the in Vivo TB Machine Learning Models Tested on the New in Vivo Mouse TB Data

machine learning model             | sensitivity (%) | specificity (%) | concordance (%)
TB in vivo N = 773 Bayesian        | 70.7            | 36.8            | 60.0
TB in vivo N = 784 Bayesian        | 78.0            | 10.5            | 56.7
RP Forest TB in vivo N = 784       | 85.4            | 26.3            | 66.7
Best RP Tree TB in vivo N = 784    | 78.0            | 21.1            | 60.0
TB in vivo N = 784 SVM             | 68.3            | 42.1            | 60.0
TB in vivo N = 784 CDD Bayesian(a) | 82.9            | 15.8            | 61.7

(a) External statistics were calculated from the results of the Bayesian modeling tool on Collaborative Drug Discovery using a cutoff score of >0.65, which produced an internal sensitivity of 0.7 and an internal specificity of 0.67.


In Vivo Activity Predictions with Prior in Vitro Bayesian Models

Previously generated Bayesian models for MLM t1/2, dual-event models for Mtb activity and Vero cell cytotoxicity, and a consensus approach were used to predict mouse in vivo activity (Table 4). The dual-event Bayesian models displayed poor sensitivity (<35%) and concordance (≤50%), but two had good specificity (78.9%). The full t1/2 MLM stability model with just FCFP_6 descriptors and the pruned t1/2 MLM model with all nine descriptors produced sensitivity (78.0%), specificity (26.3%), and concordance (61.7%) values for the 60 molecules that were comparable to those for the N = 784 in vivo Bayesian model (Table 3). When the MLM stability Bayesian model and a dual-event in vitro Bayesian model agreed that a compound was either “good” or “bad,” the concordance (overall accuracy) values for the in vivo predictions improved markedly relative to the original dual-event models (from 43.3% to 58.8%, from 48.3% to 62.5%, and from 50.0% to 60.6%; Table 4). The confusion matrices for all of the models are also shown for clarity (Table 5).
Table 4

External Statistics for the Dual-Event Bayesian (in Vitro Mtb Efficacy and Non-cytotoxicity in Vero Cells) and Mouse Liver Microsomal Stability Bayesian Models Tested on the New in Vivo Mouse TB Data

machine learning model | sensitivity (%) | specificity (%) | concordance (%)
full t1/2 MLM stability Bayesian (just FCFP_6) | 78.0 | 26.3 | 61.7
pruned t1/2 MLM stability Bayesian (all nine descriptors) | 78.0 | 26.3 | 61.7
TAACF-CB2 dual-event Bayesian | 26.8 | 78.9 | 43.3
combined TB dual-event Bayesian | 34.1 | 78.9 | 48.3
MLSMR dual-event Bayesian | 34.1 | 57.9 | 50.0
consensus(a): TAACF-CB2 dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 7 true positives when predictions agree | 3 true negatives when predictions agree | 58.8
consensus(a): combined TB dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 11 true positives when predictions agree | 4 true negatives when predictions agree | 62.5
consensus(a): MLSMR dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 17 true positives when predictions agree | 3 true negatives when predictions agree | 60.6
modified consensus(b): TAACF-CB2 dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 7 true positives when predictions agree (17.1%) | 17 true negatives (89.5%) | 40.0
modified consensus(b): combined TB dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 11 true positives when predictions agree (26.8%) | 16 true negatives (84.2%) | 45.0
modified consensus(b): MLSMR dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 17 true positives when predictions agree (41.5%) | 13 true negatives (68.4%) | 50.0

(a) For the initial consensus approaches, both types of Bayesian models had to classify a compound as good/active for it to be considered as a true positive or false positive. Similarly, both models had to classify a compound as bad/inactive for it to be considered as a true negative or false negative. Since the combination of the models agreed on the classification only for a subset of the test set, the overall sensitivity and overall specificity are not applicable. However, the overall concordance is still relevant and was calculated as (number of true positives + number of true negatives)/(number of compounds on which both models agreed on the good or bad classification).

(b) For the modified consensus approaches, both types of Bayesian models had to classify a compound as good/active for it to be considered as a true positive. However, if either model classified a compound as bad/inactive, it was defined as a true negative or false negative (depending on its experimental value). Thus, the modified consensus approaches made predictions for all of the test compounds.

Table 5

Confusion Matrices Produced When the Machine Learning Models Were Tested on the New in Vivo Mouse TB Data

Legend:
true positives  | false positives
false negatives | true negatives

External statistics were calculated using the Bayesian modeling tool on Collaborative Drug Discovery with a cutoff score of >0.65, which produced an internal sensitivity of 0.7 and an internal specificity of 0.67.

For the consensus approaches, both types of Bayesian models had to classify a compound as good/active for it to be considered as a true positive or false positive (depending on the experimental value of the compound). Similarly, both models had to classify a compound as bad/inactive for it to be considered as a true negative or false negative.

For the modified consensus approaches, both types of Bayesian models had to classify a compound as good/active for it to be considered as a true positive. However, if either model classified a compound as bad/inactive, it was defined as a true negative or false negative (depending on its experimental value).

Coverage = 17/60 = 28%.

Coverage = 24/60 = 40%.

Coverage = 33/60 = 55%.

Coverage = 60/60 = 100%.


Enrichment Factors

All of the computational models were used to calculate enrichment factors (EFs) for the test set (Table 6). The best enrichment in the hit rate (positive predicted value, or PPV) for the top-scoring 10% of compounds was 1.22, which was obtained for the in vivo Bayesian models, the combined TB in vitro dual-event Bayesian, and the TAACF-CB2 in vitro dual-event Bayesian. The highest overall PPV (78.6%) was seen for the consensus model using both the combined TB dual-event Bayesian and the full t1/2 MLM stability Bayesian (just FCFP_6). The second-best PPV (77.8%) was also produced by a consensus approach (the TAACF-CB2 dual-event in vitro Bayesian plus the full t1/2 MLM stability Bayesian). These two consensus approaches had better overall hit rates (and thus better overall enrichment factors) than the machine learning models trained with in vivo mouse Mtb data alone. Our analysis of additional statistics illustrates that some models perform better on the basis of some statistics than others (Table 7). For example, the consensus models perform best on the basis of κ and MCC, whereas the combined dual-event Bayesian does best on the basis of Youden’s J statistic and the RP Forest model performs best on the basis of the F1 score (Table 7).
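The hit rate and enrichment factor follow the definitions in the footnotes to Table 6: PPV = TP/(TP + FP), and EF = PPV divided by the fraction of in vivo actives in the test set. A minimal sketch of that arithmetic:

```python
def hit_rate_and_ef(predicted_active, actual_active):
    """Positive predicted value (hit rate) among predicted actives,
    and the enrichment factor relative to the base rate of actives
    in the test set."""
    tp = sum(1 for p, a in zip(predicted_active, actual_active) if p and a)
    fp = sum(1 for p, a in zip(predicted_active, actual_active) if p and not a)
    ppv = tp / (tp + fp)
    base_rate = sum(actual_active) / len(actual_active)  # e.g., 41/60
    return ppv, ppv / base_rate
```

With 41 of 60 test compounds active (base rate 68.3%), a model calling 14 compounds active with 11 true positives gives PPV = 78.6% and EF = 1.15, matching the combined TB dual-event + MLM stability consensus row of Table 6; a perfect model would reach the 1.46 ceiling noted in the footnotes.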
Table 6

External Enrichment Factors in Hit Rates for the Machine Learning Models Tested on the New in Vivo Mouse TB Data

machine learning model | overall hit rate (PPV)(a,c) | EF, top 10%(b,c) | EF, top 10 compounds(b,c) | EF, top 20%(b,c)
TB in vivo N = 773 Bayesian | 29/41 (70.7%) | 1.22 | 1.17 | 0.98
TB in vivo N = 784 Bayesian | 32/49 (65.3%) | 1.22 | 1.17 | 0.98
RP Forest TB in vivo N = 784 | 35/49 (71.4%) | 0.98 | 1.02 | 0.98
TB in vivo N = 784 SVM | 28/39 (71.8%) | N/A | N/A | N/A
TB in vivo N = 784 CDD Bayesian(d) | 34/50 (68.0%) | 1.22 | 1.02 | 1.10
full t1/2 MLM stability Bayesian (just FCFP_6) | 32/46 (69.6%) | 0.98 | 0.88 | 0.98
pruned t1/2 MLM stability Bayesian (all nine descriptors) | 32/46 (69.6%) | 0.73 | 0.88 | 0.98
TAACF-CB2 dual-event Bayesian | 11/15 (73.3%) | 1.22 | 1.17 | 1.10
combined TB dual-event Bayesian | 14/18 (77.8%) | 1.22 | 1.32 | 1.22
MLSMR dual-event Bayesian | 19/27 (70.4%) | 0.73 | 0.88 | 0.98
consensus: TAACF-CB2 dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 7/9 (77.8%); overall EF = 1.14 | N/A | N/A | N/A
consensus: combined TB dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 11/14 (78.6%); overall EF = 1.15 | N/A | N/A | N/A
consensus: MLSMR dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 17/23 (73.9%); overall EF = 1.08 | N/A | N/A | N/A

[a] The hit rate (positive predicted value = PPV) was calculated as (number of true positives)/(number of true positives + number of false positives).

[b] The enrichment factor was calculated as (hit rate in %)/(% of in vivo active compounds in the external test set). Since 41 of the 60 compounds in this external test set (68.3%) were active, the maximum enrichment factor that a perfect model could achieve would be 100%/68.3% = 1.46.

[c] Since each original consensus model and the corresponding "modified consensus" model have the same number of true positives and false positives, their hit rates and enrichment factors are equivalent.

[d] External statistics were calculated from the results of the Bayesian modeling tool on Collaborative Drug Discovery using a cutoff score of >0.65, which produced an internal sensitivity of 0.7 and an internal specificity of 0.67.
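The hit-rate and enrichment-factor arithmetic defined above can be sketched in a few lines. This is a minimal illustration: the `ppv` and `enrichment_factor` helper names are ours, not from the paper.

```python
def ppv(tp, fp):
    """Hit rate (positive predicted value) = TP / (TP + FP)."""
    return tp / (tp + fp)

def enrichment_factor(hit_rate, active_fraction):
    """EF = hit rate / fraction of actives in the external test set."""
    return hit_rate / active_fraction

# 41 of the 60 external test set compounds were in vivo active (68.3%),
# so a perfect model's EF tops out at 1/0.683 = 1.46.
active_fraction = 41 / 60

# Consensus: combined TB dual-event + full t1/2 MLM stability Bayesian (11/14)
rate = ppv(tp=11, fp=3)                              # 0.786
ef = enrichment_factor(rate, active_fraction)        # ~1.15
print(f"PPV = {rate:.1%}, overall EF = {ef:.2f}")
```

The same two helpers reproduce the overall EFs reported for the other consensus rows when fed their true/false positive counts.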

Table 7

Additional External Statistics for the Machine Learning Models Tested on the New in Vivo Mouse TB Data[a]

| machine learning model | κ | MCC | J | F1 |
| --- | --- | --- | --- | --- |
| TB in vivo (N = 773) Bayesian | 0.08 | 0.08 | 0.08 | 0.71 |
| TB in vivo (N = 784) Bayesian | –0.13 | –0.14 | –0.11 | 0.71 |
| RP Forest TB in vivo (N = 784) | 0.13 | 0.14 | 0.12 | 0.78 |
| TB in vivo (N = 784) SVM | 0.10 | 0.10 | 0.10 | 0.70 |
| TB in vivo (N = 784) CDD Bayesian | –0.01 | –0.02 | –0.01 | 0.75 |
| full t1/2 MLM stability Bayesian (just FCFP_6) | 0.05 | 0.05 | 0.04 | 0.74 |
| pruned t1/2 MLM stability Bayesian (all nine descriptors) | 0.05 | 0.05 | 0.04 | 0.74 |
| TAACF-CB2 dual-event Bayesian | 0.04 | 0.06 | 0.06 | 0.39 |
| combined TB dual-event Bayesian | 0.10 | 0.13 | 0.13 | 0.47 |
| MLSMR dual-event Bayesian | 0.04 | 0.04 | 0.04 | 0.56 |
| consensus:[b] TAACF-CB2 dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 0.13 | 0.17 | N/A | 0.67 |
| consensus:[b] combined TB dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 0.18 | 0.20 | N/A | 0.71 |
| consensus:[b] MLSMR dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 0.19 | 0.04 | N/A | 0.72 |
| modified consensus:[c] TAACF-CB2 dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 0.05 | 0.09 | 0.07 | 0.28 |
| modified consensus:[c] combined TB dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 0.08 | 0.12 | 0.11 | 0.40 |
| modified consensus:[c] MLSMR dual-event Bayesian + full t1/2 MLM stability Bayesian (just FCFP_6) | 0.08 | 0.09 | 0.10 | 0.53 |

[a] The top two scores for each particular type of external statistic are shown in bold.

[b] For the initial consensus approaches, both types of Bayesian models had to classify a compound as good/active for it to be considered as a true positive or false positive. Similarly, both models had to classify a compound as bad/inactive for it to be considered as a true negative or false negative. Consequently, these workflows made active/inactive classifications on only a subset of the test set.

[c] For the modified consensus approaches, both types of Bayesian models had to classify a compound as good/active for it to be considered as a true positive. However, if either model classified a compound as bad/inactive, it was defined as a true negative or false negative (depending on its experimental value). Thus, the modified consensus approaches made predictions for all of the test compounds.
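The two consensus schemes described above reduce to simple combination rules over the two base-model calls. A minimal sketch, with hypothetical function names of our own:

```python
def initial_consensus(pred_a, pred_b):
    """Initial consensus: classify only when both models agree; abstain otherwise.

    This is why the initial workflows made calls on only a subset of the test set.
    """
    if pred_a == pred_b:
        return pred_a
    return None  # no active/inactive call for this compound

def modified_consensus(pred_a, pred_b):
    """Modified consensus: 'active' only if both agree; any 'inactive' vote wins.

    Every compound therefore receives a prediction.
    """
    return "active" if pred_a == pred_b == "active" else "inactive"

print(initial_consensus("active", "inactive"))   # None (abstain)
print(modified_consensus("active", "inactive"))  # inactive
```

Because both schemes call a compound active only when both models vote active, they share the same true/false positive counts, which is why their hit rates and enrichment factors are equivalent.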


Clustering of Mouse Mtb in Vivo Data

The new honeycomb clustering approach (Figure 3A; an enlarged version is shown in Figure S1) provides a map of the in vivo active and inactive compounds according to structural similarity. The 60 additional test set molecules cluster near similar molecules, such as the macrolides (Figure 3B). This approach can also be used to infer activity on the basis of similarity:[11−13] if a compound is surrounded by other active molecules, this might suggest that it too is active. For example, the macrolides cyclogriselimycin and ecumicin in the test set are surrounded by other active compounds (Figure 3B).
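The similarity-based inference described above can be sketched as a nearest-neighbor majority vote. This is only an illustration with made-up fingerprint bit sets; it is not the descriptors or clustering code used in the study.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

def infer_activity(query_fp, labeled, k=3):
    """Majority vote over the k most similar labeled neighbors."""
    ranked = sorted(labeled, key=lambda x: tanimoto(query_fp, x[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical fingerprints: a query surrounded by actives is called active.
training = [({1, 2, 3}, "active"), ({1, 2, 4}, "active"), ({7, 8, 9}, "inactive")]
print(infer_activity({1, 2, 5}, training, k=3))  # active
```

In the honeycomb map the "neighbors" are adjacent hexagons rather than an explicit k-nearest list, but the underlying guilt-by-association reasoning is the same.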
Figure 3

Honeycomb clustering of TB in vivo data from 2014 and 2015. Yellow hexagons highlight the compounds from 2014 and 2015, and green outlines signify in vivo active compounds. (A) Complete map of compounds in the training and testing sets. (B) Enlarged view of the section marked with the black circle in (A), highlighting cyclogriselimycin and ecumicin.


Discussion

An abundance of data from large in vitro phenotypic screens against Mtb exists in the public domain[42−46] that can readily be used to assist future drug discovery. Since machine learning methods can learn from past data, we have extensively applied Bayesian and other machine learning algorithms to model Mtb inhibition and Vero cell cytotoxicity. We have pioneered the use of dual-event data sets, which use both dose–response data for whole-cell antitubercular activity[47,48] and Vero cell cytotoxicity.[35,36,42,49−51] We have also used the same approach to model mouse Mtb in vivo data[8] and most recently MLM stability.[34] In addition, we have recently developed a freely available mobile app called TB Mobile[15] that displays over 800 Mtb-active molecule structures and their targets, with links to associated data. This tool was recently enhanced by adding target-specific Bayesian models to rank probable targets.[15] Such Bayesian modeling approaches with FCFP_6 fingerprints have also been integrated into the CDD Vault software,[29] and the algorithms were made open source and applied to large numbers of data sets.[31] Our combined efforts in this area indicate that such machine learning models are a valuable resource and can be used prospectively to suggest molecules to test not only against Mtb but also for other diseases.[52,53] Since our earlier study compiling compounds tested in the mouse Mtb model,[8] we have continued to collect and curate data with the aim of testing the machine learning models developed and improving upon them. We have now added to our database 60 unique molecules published in 2014–2015 (Table S1 and Supplemental Data 1), of which 41 were classified as actives.
The large percentage of active compounds (∼68%) in this new set may represent a publication bias, as noted in our prior analysis for data up to 2014.[8] Simple molecular property analysis suggested that only the number of hydrogen-bond donors was significantly higher for the 41 in vivo active compounds (Table ). These 60 molecules broadly cover a similar physicochemical property space as the training set of 784 molecules (Figure A) and the over 800 molecules collated in TB Mobile[15] (Figure B). These results suggest that we are likely within the applicability domain of the data set. The closest-distance values calculated with the N = 784 Bayesian model (mean closest distance = 0.40 with a standard deviation of 0.15, where a value of zero connotes identity and larger values indicate greater difference; Figure S2) are quite low, suggesting that most test molecules are within the applicability domain of the model. In this study, we have updated the machine learning models and also evaluated whether models for other properties, such as the combined in vitro bioactivity and Vero cell cytotoxicity, MLM t1/2, or a consensus of these models, could also predict compounds likely to have in vivo activity. When comparing different machine learning approaches such as Bayesian, SVM, and recursive partitioning, we observed little difference based on internal fivefold ROC (Table ), although predictions for the external test set produced some slight differences in sensitivity, specificity, and concordance (Table ). These results show that the sensitivity values are generally much higher than the specificity values, a reversal of what we observed for the fivefold ROC for the Bayesian model training sets (Supplemental Data 1 and 2). This suggests that these models can select actives but would have some difficulty classifying inactives.
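The closest-distance check described above can be sketched as follows. This is a minimal illustration using Tanimoto distance on hypothetical fingerprint bit sets (zero connotes identity, as in the text); the actual study used ECFP_6/FCFP_6-style descriptors.

```python
def tanimoto_distance(fp1, fp2):
    """1 - Tanimoto similarity; 0 means identical bit sets."""
    union = len(fp1 | fp2)
    return 1 - (len(fp1 & fp2) / union) if union else 1.0

def closest_distance(test_fp, training_fps):
    """Distance from a test compound to its nearest training-set neighbor."""
    return min(tanimoto_distance(test_fp, fp) for fp in training_fps)

# Hypothetical fingerprints: a low closest distance suggests the test
# compound sits within the model's applicability domain.
training = [{1, 2, 3, 4}, {5, 6, 7}]
print(closest_distance({1, 2, 3}, training))  # 0.25
```

Averaging this quantity over all 60 test compounds yields the mean closest distance (0.40 here) used to argue that the test set falls within the model's domain.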
The updated training set is fairly well balanced, as it was before.[8] Obviously, the use of a computational approach to select compounds presents advantages by likely reducing follow-up costs and lowering the number of mice used. The t1/2 MLM stability models follow the same trend, with the sensitivity being much higher than the specificity (Table ), while the in vitro Mtb bioactivity TAACF-CB2 dual-event, combined TB dual-event, and MLSMR dual-event models have higher specificity than sensitivity for the test set molecules (Table ). Surprisingly, we found that Bayesian models built with in vitro MLM t1/2 data sets could produce external statistics similar to those for the in vivo Mtb models (Table ), as dual-event models that included in vitro Mtb bioactivity and cytotoxicity had positive predicted values of >70%, and the best overall hit rates were obtained with a consensus of an Mtb model and the MLM t1/2 model (Table ). The best individual in vivo and in vitro models had enrichment factors of 1.22 for the top-scoring 10% of compounds, and the combined TB dual-event in vitro Bayesian achieved the best enrichment factor of 1.32 for the top-scoring 10 compounds. Because of the large percentage of active compounds in this new in vivo test set (68.3%), the maximum enrichment factor that a "perfect" model could produce was 1.46. On the basis of these results (Table ), the better PPV hit rates and enrichment factors for two of the consensus approaches appear to be due to (a) superior specificity (filtering out compounds likely to be MLM-unstable, Mtb-inactive, and/or cytotoxic gave ∼2 times the best specificity of an in vivo-trained model) and (b) false positive rates that remained low for the consensus approaches. The use of additional external statistics suggested that no single model performed best across all of them (Table ). Therefore, these statistics may not be as useful as the enrichment factors for test set evaluation (Table ).
Part of the accuracy of the MLM Bayesian in predicting in vivo activity may derive from the fact that at least 190 of the 894 compounds in the full t1/2 MLM Bayesian training set were from Mtb and malaria projects (i.e., 20 of the 42 compounds from Mtb projects and 49 of the 148 compounds from malaria projects were stable). Antimalarial compounds sometimes display activity against Mtb,[36,54] and researchers are unlikely to devote the time and money to MLM stability assays unless a compound displays promising therapeutic activity. Thus, some of the chemical features and properties deemed favorable by the MLM stability Bayesian may implicitly incorporate both stability and efficacy. Greatly simplified, it is reasonable to correlate in vivo efficacy in the mouse with compounds that at minimum are metabolically stable with regard to phase I metabolism in the mouse and inhibit the growth of Mtb under in vitro conditions mimicking aspects of their actual pathologic environs. Support for this notion is provided by the fact that when evaluating an external test set of known antitubercular molecules, the MLM stability Bayesian recognized most TB drugs as metabolically stable.[34] It is interesting that although the TAACF-CB2 and combined TB dual-event Bayesian models had poor overall external sensitivity and concordance values, they displayed enrichment factors for the top-scoring 10% of compounds that were equivalent or superior to those of the in vivo models. When these two in vitro Mtb dual-event models were combined with the t1/2 MLM stability model in a novel consensus workflow, the best overall hit rates (and thus the best overall enrichment factors) were achieved (Table ). This highlights the importance of specificity and of studying more than just the overall external statistics, especially since most laboratories will only be able to perform in vivo Mtb studies in mice for a very small number of candidate compounds.
In summary, this suggests that together the experimental in vitro Mtb and t1/2 MLM stability models are a good predictor of in vivo Mtb efficacy in the mouse model and that a consensus machine learning model may be a useful alternative, at least on the basis of this particular test set. While humans can visualize quite complex data, many data visualization approaches have been developed to produce maps that explain the terrain of biological data, including Kohonen networks, Sammon maps, and others.[55] In this study we have described a new data visualization approach, called honeycomb visualization, that was used to cluster the in vivo training set and the new external test set (Figures 3 and S1). This shows, for example, how macrocyclic compounds cluster together (Figure 3B) and how the 60 new compounds are dispersed around the map with local clusters of similar analogues. Such an approach, while relying on the same descriptors, illustrates an alternative way to assess predictions or structure–activity relationships and has recently been implemented in a free iOS mobile app called PolyPharma[56] to demonstrate how machine learning and this visualization can be used beyond the desktop. In addition, the use of tools such as CDD Models can enable the sharing of Bayesian models such that they can be run in freely available mobile apps, making the models more accessible.[30] This work adds further weight to the use of machine learning approaches to predict in vivo Mtb activity in the mouse efficacy model. We have not seen this kind of approach taken with other disease models and data sets, so this still ranks as a difficult but interesting problem to address. Our previously reported Mtb in vivo model[8] was tested with only a small set of 11 molecules, while the one reported here was tested with 60 molecules.
For the first time we have shown that MLM stability predictions from a Bayesian model could be a useful adjunct to applying a specific Mtb in vivo mouse machine learning model to predict efficacy. This is potentially important because MLM stability models are likely based on more structurally diverse molecules than antitubercular efficacy models alone. Having more modeling options for predicting in vivo activity is valuable because of the relatively small amount of in vivo Mtb data publicly available at this time. The in vitro and cytotoxicity models' training sets are larger, and these, together with the MLM t1/2 models, can additionally be used to predict in vivo activity, on the basis of our test set of 60 recently published molecules. There is no shortage of in vitro screening hits (there are likely many thousands across the various public–private partnership, NIH-funded, and commercial screens), and we propose that computational models such as those described in this study be made available, shared,[29] and utilized alongside other selection criteria (medicinal chemistry heuristics or gut feeling) prior to selecting compounds for testing in the animal model in order to expedite TB research and save both time and money. Along these lines, the 177 GSK open source Mtb leads were scored with the top two consensus dual-event and MLM Bayesian workflows, and the intersection of their top predictions was analyzed in order to suggest compounds that should be prioritized for in vivo assays (Tables S5–S7). In due course we will use our models to make prospective predictions prior to in vivo testing in the mouse Mtb model in our own laboratories, and those results will be reported in the future.
In conclusion, we have built on our previous Mtb in vivo and in vitro modeling studies[8,35,36,47,48,50,51,57,58] to suggest a combination of published data and machine learning models that can be used to harness limited research resources and increasingly contracting funding for TB research. With continual pressure to identify novel in vivo active antituberculars to supplement the currently depleted clinical pipeline, these machine learning models could be considered.