Josh L Espinoza1,2, Chris L Dupont1, Aubrie O'Rourke1, Sinem Beyhan1, Pavel Morales1, Amy Spoering3, Kirsten J Meyer4, Agnes P Chan5, Yongwook Choi5, William C Nierman5, Kim Lewis4, Karen E Nelson1,2,5.
Abstract
To better combat the expansion of antibiotic resistance in pathogens, new compounds, particularly those with novel mechanisms of action (MOA), represent a major research priority in biomedical science. However, rediscovery of known antibiotics demonstrates a need for approaches that accurately identify potential novelty with higher throughput and reduced labor. Here we describe an explainable artificial intelligence classification methodology that emphasizes prediction performance and human interpretability by using a Hierarchical Ensemble of Classifiers model optimized with a novel feature selection algorithm called Clairvoyance; together these are referred to as a CoHEC model. We evaluated our methods using whole-transcriptome responses from Escherichia coli challenged with 41 known antibiotics and 9 crude extracts, depositing 122 transcriptomes unique to this study. Our CoHEC model can correctly predict the primary MOA of previously unobserved compounds in both purified forms and crude extracts at an accuracy above 99%, while also correctly identifying darobactin, a newly discovered antibiotic, as having a novel MOA. In addition, we deploy our methods on a recent E. coli transcriptomics dataset from a different strain and on a Mycobacterium smegmatis metabolomics time-series dataset, achieving exceptionally high performance and improving upon the performance metrics of the original publications. We not only provide insight into the biological interpretation of our model but also show that the concept of MOA is a non-discrete heuristic with diverse effects for different compounds within the same MOA, suggesting substantial antibiotic diversity awaiting discovery within existing MOA.
Year: 2021 PMID: 33780444 PMCID: PMC8031737 DOI: 10.1371/journal.pcbi.1008857
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Training data for pure compounds and producer-strain extracts relative to MOA.
Number of compounds, samples, and pairwise DGE profiles for pure compounds and producer-strain extracts relative to individual MOA.
| Pure Compounds: Compounds | Pure Compounds: Samples | Pure Compounds: Pairwise DGE Profiles | Producer-strain Extracts: Compounds | Producer-strain Extracts: Samples | Producer-strain Extracts: Pairwise DGE Profiles |
|---|---|---|---|---|---|
| 2 | 11 | 33 | 0 | 0 | 0 |
| 12 | 61 | 178 | 4 | 18 | 54 |
| 10 | 52 | 171 | 2 | 7 | 21 |
| 3 | 12 | 36 | 0 | 0 | 0 |
| 9 | 42 | 126 | 2 | 6 | 18 |
| 4 | 20 | 58 | 2 | 6 | 18 |
Model performance using several supervised machine-learning algorithms.
Various machine-learning algorithms were evaluated using the entire feature set (n = 3065 genes) and the Clairvoyance-optimized feature set (GeneSet, n = 399 genes) with the same LCOCV pairs. Performance metrics for each LCOCV set include accuracy, precision, recall, and F1 score. LCOCV refers to Leave Compound Out Cross Validation, in which all instances of a compound are removed from the data used to fit the model (training data) and performance is evaluated on the held-out compound profiles (testing data).
| Accuracy (Clairvoyance, N = 399 genes) | F1 Score (Clairvoyance, N = 399 genes) | Precision (Clairvoyance, N = 399 genes) | Recall (Clairvoyance, N = 399 genes) | Accuracy (no feature selection, N = 3065 genes) | F1 Score (no feature selection, N = 3065 genes) | Precision (no feature selection, N = 3065 genes) | Recall (no feature selection, N = 3065 genes) |
|---|---|---|---|---|---|---|---|
| 0.999 | 0.983 | 0.983 | 0.982 | 0.749 | 0.693 | 0.715 | 0.682 |
| 0.880 | 0.829 | 0.856 | 0.817 | 0.793 | 0.732 | 0.763 | 0.723 |
| 0.792 | 0.719 | 0.768 | 0.708 | 0.742 | 0.659 | 0.703 | 0.645 |
| 0.714 | 0.568 | 0.617 | 0.546 | 0.636 | 0.506 | 0.561 | 0.481 |
| 0.798 | 0.722 | 0.778 | 0.704 | 0.694 | 0.616 | 0.668 | 0.600 |
| 0.698 | 0.582 | 0.623 | 0.561 | 0.429 | 0.302 | 0.389 | 0.274 |
| 0.333 | 0.308 | 0.333 | 0.301 | 0.339 | 0.277 | 0.333 | 0.261 |
| 0.872 | 0.785 | 0.815 | 0.773 | 0.741 | 0.635 | 0.683 | 0.619 |
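The LCOCV scheme described in the caption above can be sketched as a grouped cross-validation in which the compound identity is the group. This is a minimal illustration, not the study's pipeline: the data are synthetic, the class and compound names are hypothetical, plain logistic regression stands in for the evaluated algorithms, and macro averaging of precision, recall, and F1 is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's data): 3 hypothetical MOA classes,
# 3 compounds per class, 6 profiles per compound, 20 features.
n_classes, cpds_per_class, profiles_per_cpd, n_features = 3, 3, 6, 20
X_parts, y, groups = [], [], []
for moa in range(n_classes):
    center = np.zeros(n_features)
    center[moa * 5:(moa + 1) * 5] = 4.0  # class-specific signal in 5 features
    for c in range(cpds_per_class):
        X_parts.append(center + rng.normal(size=(profiles_per_cpd, n_features)))
        y += [moa] * profiles_per_cpd
        groups += [f"moa{moa}_cpd{c}"] * profiles_per_cpd
X, y, groups = np.vstack(X_parts), np.array(y), np.array(groups)


def lcocv_predictions(X, y, groups):
    """Predict every profile with a model that never saw its compound."""
    y_pred = np.empty_like(y)
    # Each split holds out all profiles of exactly one compound (group).
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        y_pred[test_idx] = model.predict(X[test_idx])
    return y_pred


y_pred = lcocv_predictions(X, y, groups)
accuracy = accuracy_score(y, y_pred)
# Macro averaging is an assumption made for this sketch.
precision, recall, f1, _ = precision_recall_fscore_support(
    y, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Grouping by compound rather than by sample is the point of the design: profiles from the same compound are correlated, so an ordinary random split would let the model see the test compound during training.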
Evaluating external datasets using CoHEC models.
MOA prediction accuracy and performance when applying our methods to the data from Zoffmann et al. 2019 and Zampieri et al. 2018, and when applying the methods from Hutter et al. 2004 to our dataset. In all cases, LCOCV was used to evaluate model performance for each individual observation (e.g., pairwise DGE profile), for each cross-validation set (e.g., held-out teixobactin), and using various majority voting schemes. CPD is an abbreviation for compound. *Indicates protein-synthesis sub-MOA classification (30S/50S).
| Dataset | Model | Organism | Feature Set Label | Feature Type | Features | MOA | CPD | Individual Pairwise Profiles Accuracy | LCOCV Test Set Accuracy | Majority Voting (Hard) Accuracy | Majority Voting (Soft) Accuracy | Data Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| This study (All MOA) | CoHEC | Escherichia coli (W01573) | GeneSet_y1-y5 | Gene | 399 | 6 | 41 | 0.9972 | 0.9986 | 1 | 1 | |
| This study (All MOA) | Clairvoyance-optimized multiclass logistic regression | Escherichia coli (W01573) | GeneSet_y1-y5 | Gene | 399 | 6 | 41 | 0.85714286 | 0.88017911 | 0.86440678 | 0.89830508 | |
| This study (30S/50S) | Clairvoyance-optimized binary logistic regression | Escherichia coli (W01573) | GeneSet_30S/50S | Gene | 7 | 2* | 9 | 0.9691358 | 0.96153846 | 1 | 1 | |
| This study (All MOA) | Clairvoyance-optimized multiclass logistic regression | Escherichia coli (W01573) | GeneSet_Multiclass | Gene | 98 | 6 | 41 | 0.95936 | 0.967735 | 0.983051 | 1 | |
| This study (All MOA) | Support vector machine (Hutter et al. 2004 Methods) | Escherichia coli (W01573) | - | Gene | - | 6 | 41 | 0.758 | - | - | - | - |
| Zoffmann et al. 2019 | CoHEC | Escherichia coli (BW25113) | GeneSet_Zoffmann | Gene | 35 | 4 | 16 | 1 | 1 | 1 | 1 | |
| Zampieri et al. 2018 ( | CoHEC | Mycobacterium smegmatis | MetaboliteSet_Zampieri-t0 | Metabolite | 492 | 18 | 62 | 0.949 | 0.949 | 0.977 | 0.991 | |
| Zampieri et al. 2018 ( | CoHEC | Mycobacterium smegmatis | MetaboliteSet_Zampieri-solvent | Metabolite | 494 | 18 | 62 | 0.882 | 0.882 | 0.954 | 0.963 | |
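The hard and soft majority voting columns in the table above aggregate the per-profile predictions for a held-out compound into a single call. The exact voting schemes used in the study are not given in this record, so the sketch below uses the common definitions as an assumption: hard voting counts each profile's top class, and soft voting averages the class probabilities before choosing. The probability values are hypothetical.

```python
import numpy as np


def hard_vote(proba):
    """Each profile votes for its argmax class; the most frequent class wins.

    Ties are resolved in favor of the lowest class index (np.argmax behavior).
    """
    votes = np.argmax(proba, axis=1)
    counts = np.bincount(votes, minlength=proba.shape[1])
    return int(np.argmax(counts))


def soft_vote(proba):
    """Average class probabilities over all profiles, then take the argmax."""
    return int(np.argmax(proba.mean(axis=0)))


# Hypothetical probabilities: 3 profiles of one held-out compound, 3 classes.
proba = np.array([
    [0.55, 0.45, 0.00],
    [0.52, 0.48, 0.00],
    [0.05, 0.95, 0.00],
])

print(hard_vote(proba), soft_vote(proba))  # prints "0 1"
```

The example is built so that the two schemes disagree: two weakly confident votes for class 0 win the hard vote, while one highly confident profile tips the averaged probabilities toward class 1.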
Fig 1. MOA classification performance and model benchmarking.
A) The empirically determined structure of the CoHEC model calibrated to predict the MOA of an unobserved antibacterial compound based on the transcriptional change profiles of E. coli. The colored bar chart below the dendrogram shows the explained variance of the eigenprofile for each MOA. B) The influence of the Clairvoyance optimization algorithm for feature selection on model performance at each of the 5 sub-model decision points. Optimization step t = 0 corresponds to using all available gene features, and each subsequent optimization step removes low-information features. The column chart shows the original baseline accuracy (lower) with all 3065 gene features and the effects of Clairvoyance-optimized feature selection (upper). C) Network visualization of the gene feature sets, determined by Clairvoyance, used by each sub-model decision point of the CoHEC model. The edge width represents the coefficient magnitude in each fitted Logistic Regression sub-model, with the sign reflected by the color (positive = teal, negative = rose). D) Benchmarking of CoHEC model performance (N = 500 permutations without repetition) showing (upper) the number of compounds included during (lower) LCOCV evaluation relative to performance. Error bars represent the standard error of the mean. E) Kernel density of LCOCV accuracy for the CoHEC null model (N = 500 permutations without repetition), with dashed horizontal lines representing actual CoHEC model performance.
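Panel B describes an iterative procedure that starts from all features at t = 0 and removes low-information features at each step. The sketch below shows that general idea only; it is not the Clairvoyance algorithm, whose details are not given in this record. It assumes a binary logistic regression, ranks features by absolute coefficient, and uses synthetic data; `drop_fraction` and `min_features` are hypothetical parameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 80 samples, 40 features, only the first 4 carry signal.
n_samples, n_features, n_informative = 80, 40, 4
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_features))
X[y == 1, :n_informative] += 2.0


def iterative_elimination(X, y, drop_fraction=0.25, min_features=2):
    """Repeatedly drop the features with the smallest |coefficient|.

    Returns a list of (feature_indices, cv_accuracy), one entry per step;
    step 0 uses all features.
    """
    active = np.arange(X.shape[1])
    history = []
    while True:
        model = LogisticRegression(max_iter=1000)
        score = cross_val_score(model, X[:, active], y, cv=5).mean()
        history.append((active.copy(), score))
        if len(active) <= min_features:
            break
        # Rank the currently active features by coefficient magnitude.
        model.fit(X[:, active], y)
        importance = np.abs(model.coef_).ravel()
        n_drop = max(1, int(len(active) * drop_fraction))
        n_keep = max(min_features, len(active) - n_drop)
        keep = np.sort(np.argsort(importance)[-n_keep:])
        active = active[keep]
    return history


history = iterative_elimination(X, y)
best_features, best_score = max(history, key=lambda step: step[1])
print(f"t=0: {len(history[0][0])} features, accuracy={history[0][1]:.3f}")
print(f"best: {len(best_features)} features, accuracy={best_score:.3f}")
```

In this sketch the ranking is fitted on the full dataset, so the reported cross-validated accuracies are optimistic; for an unbiased estimate, the selection step would need to be nested inside the cross-validation loop.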
Fig 2. CoHEC model decision graphs for pure compounds, producer extracts, and darobactin representing MOA predictions.
Prediction paths where each terminal colored node depicts a MOA, each internal gray node represents a sub-model decision point, and the edge width corresponds to the probability according to the model for the respective path. Opaque halos around the edges represent the standard error (SE), with a larger width corresponding to higher variance and vice versa. Rose and teal edges illustrate predictions traversing incorrect and correct paths, respectively, with black edges representing paths within a novel MOA paradigm. (A, B) show teixobactin as a pure compound and the respective producer-strain, (C) depicts kirromycin, and (D) represents darobactin. None of the prediction paths shown involve a compound that was previously observed by the model.
Fig 3. Unsupervised clustering performance and error profiles of transcriptomes and CoHEC model probability vectors.
Unsupervised hierarchical clustering using (A) pairwise DGE profiles prior to feature selection (N = 3065 genes), (B) CoHEC model LCOCV test set prediction probabilities concatenated for all sub-models, and (D) CoHEC model prediction probabilities averaged by LCOCV test set. All hierarchical clustering uses Euclidean distance and Ward linkage. C) Distributions of silhouette scores for the (A) and (B) clustering results, with a Wilcoxon signed-rank test for statistical significance. D) Unsupervised hierarchical clustering and E) standard error profiles for each of the sub-models and the predicted path, with red showing darobactin as a novel MOA and teal showing producer extracts. Producer-strain extracts where the pure compound: (*) has been observed; and (**) has not been observed by the HEC model in the training data. The box plots extend from the Q1 to Q3 quartile values of the data, with a line at the median (Q2) and whiskers at 1.5 * IQR.
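The comparison in panel C (Ward-linkage clustering on Euclidean distance, per-sample silhouette scores, and a Wilcoxon signed-rank test between two representations of the same samples) can be sketched as follows. The data are synthetic, the choice of three clusters is arbitrary, and scoring silhouettes against the Ward cluster assignments (rather than, say, known MOA labels) is an assumption of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import wilcoxon
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)

# Synthetic stand-in: the same 60 samples in two representations,
# one diffuse (like raw profiles) and one compact (like model probabilities).
labels_true = np.repeat([0, 1, 2], 20)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
diffuse = centers[labels_true] + rng.normal(scale=1.5, size=(60, 2))
compact = centers[labels_true] + rng.normal(scale=0.3, size=(60, 2))


def ward_clusters(X, n_clusters):
    """Cut a Ward-linkage tree (Euclidean distance) into n_clusters groups."""
    Z = linkage(X, method="ward", metric="euclidean")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# One silhouette score per sample in each representation.
sil_diffuse = silhouette_samples(diffuse, ward_clusters(diffuse, 3))
sil_compact = silhouette_samples(compact, ward_clusters(compact, 3))

# Paired test: the i-th score in each array belongs to the same sample.
stat, p_value = wilcoxon(sil_compact, sil_diffuse)
print(f"mean silhouette: diffuse={sil_diffuse.mean():.3f}, "
      f"compact={sil_compact.mean():.3f}, p={p_value:.2e}")
```

The signed-rank test is appropriate here because the two score distributions are paired sample by sample; an unpaired test would discard that structure.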