Jing Yang1, Mohammed Eslami2, Yi-Pei Chen2, Mayukh Das1, Dongmei Zhang1, Shaorong Chen1, Alexandria-Jade Roberts1, Mark Weston2, Angelina Volkova2, Kasra Faghihi2, Robbie K Moore1, Robert C Alaniz1, Alice R Wattam3, Allan Dickerman3, Clark Cucinell3, Jarred Kendziorski1, Sean Coburn1, Holly Paterson1, Osahon Obanor1, Jason Maples1, Stephanie Servetas4, Jennifer Dootz4, Qing-Ming Qin1, James E Samuel1, Arum Han5,6, Erin J van Schaik1, Paul de Figueiredo1,7.
Abstract
Bacterial pathogen identification, which is critical for human health, has historically relied on culturing organisms from clinical specimens. More recently, the application of machine learning (ML) to whole-genome sequences (WGSs) has facilitated pathogen identification. However, relying solely on genetic information to identify emerging or new pathogens is fundamentally constrained, especially if novel virulence factors exist. In addition, even WGSs with ML pipelines are unable to discern phenotypes associated with cryptic genetic loci linked to virulence. Here, we set out to determine if ML using phenotypic hallmarks of pathogenesis could assess potential pathogenic threat without using any sequence-based analysis. This approach successfully classified the potential pathogenic threat associated with previously machine-observed and unobserved bacteria with 99% and 85% accuracy, respectively. This work establishes a phenotype-based pipeline for potential pathogenic threat assessment, which we term PathEngine, and offers strategies for the identification of bacterial pathogens.
Keywords: adherence; bacterial pathogen; machine learning; threat assessment; toxicity
Year: 2022 PMID: 35363569 PMCID: PMC9168455 DOI: 10.1073/pnas.2112886119
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 12.779
Fig. 1. Framework for generation of an ML platform that enables bacterial threat assessment. (A) Bacterial strains used in this work are phylogenetically divergent. (B) An overall framework and time frame for threat assessment. (C) Overview of the architecture of the ML workflow, which includes data requirements and processing, model selection, and threat assessment. Unknown and known bacterial pathogens were used in the threat assessment by the indicated ML models. (D) Overview of the computational architecture of the four different ML models used in this work. MOI, multiplicity of infection. Fluor, fluorescence.
Fig. 2. Performance of the bacterial adherence assay in ML-based bacterial threat assessment. (A) Representative images of adherence assays for P. aeruginosa and E. coli as positive and negative controls. The adherent bacteria and their corresponding target host cells were counted and marked with outlines. Host cells (blue) were stained by DAPI, and bacteria (green) were GFP-tagged. (Scale bar: 50 μm.) (B) Average adherent bacterial counts per A549 cell under various MOIs. Data represent the means ± SDs from three independent experiments. At each MOI, n ≥ 15. A significant difference in adherent bacteria was observed at MOIs of 50 and 100 (****: P value < 0.0001). (C and D) Performance of the four ML models in Test 1 (C) or Test 2 (D) for the adherence assay. All models were characterized to determine the percentage of data required to plateau in performance. Each machine learning algorithm was run 20 times, with the error bars showing the 95% confidence interval from the accuracy scores in each run. Accuracy refers to the percentage of strains assigned correctly by the models.
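The captions report error bars as the 95% confidence interval of the accuracy scores from 20 runs, without stating how the interval was computed. The following is a minimal sketch of one common way to obtain such an interval, assuming a normal approximation (mean ± 1.96 × standard error). The function name `mean_ci95` and the example accuracies are hypothetical and are not taken from the study.

```python
import math
import statistics


def mean_ci95(accuracies):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Assumption: normal approximation, i.e. half_width = 1.96 * SEM.
    The source does not state which interval method was used.
    """
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate a spread")
    mean = statistics.fmean(accuracies)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(accuracies) / math.sqrt(n)
    return mean, 1.96 * sem


# Hypothetical accuracies from 20 repeated runs (illustrative only).
runs = [0.90, 0.92] * 10
mean, half_width = mean_ci95(runs)
print(f"accuracy = {mean:.3f} ± {half_width:.3f}")
```

With only 20 runs, a Student t critical value (about 2.09 for 19 degrees of freedom) would give a slightly wider interval than the 1.96 used here.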
Fig. 3. Performance of the bacteria-induced host cell toxicity assay in threat assessment by ML models. (A) Representative images of the toxicity assay for THP1 cells induced by bacteria or Shiga toxin at 18 h post infection/incubation (h.p.i.). Total cells (blue) were counted by Hoechst staining and dead cells (red) by PI staining. Cells were automatically counted and marked with outlines. (Scale bar: 50 μm.) (B) Time course of THP1 cell death during coincubation with B. subtilis or Shiga toxin-producing E. coli (EcpJES 101) at an MOI of 1 for 18 h.p.i. Data represent the mean ± SD from three independent experiments, with n ≥ 9 for each data point. Significant difference in adherent bacteria at MOIs of 50 and 100 was observed (** P value < 0.001, **** P value < 0.0001). (C and D) Performance of the four indicated ML models in Test 1 (C) or Test 2 (D) for the toxicity assay. All models were characterized to determine the percentage of data required to plateau in performance. Each machine learning algorithm was run 20 times, with the error bars showing the 95% confidence interval from the accuracy scores in each run. Accuracy refers to the percentage of strains assigned correctly by the models.
Kirby-Bauer Disk Diffusion susceptibility testing of a collection of pathogenic and nonpathogenic bacterial strains
| Zone diameter (mm) | Kanamycin | Ampicillin | Tetracycline | Chloramphenicol | Polymyxin B | Ceftazidime |
|---|---|---|---|---|---|---|
| | 28.3 ± 0.5 (S) | 6.0 ± 0 (R) | 6.0 ± 0 (R) | 6.0 ± 0 (R) | 18.0 ± 0 (S) | 27.7 ± 0.5 (S) |
| | 56.7 ± 1.7 (S) | 31.0 ± 0.8 (S) | 53.0 ± 0.8 (S) | 39.7 ± 0.8 (S) | 14.0 ± 0.8 (S) | 22.3 ± 0.9 (S) |
| | 34.7 ± 0.5 (S) | 6.0 ± 0 (R) | 23.0 ± 0 (S) | 15.0 ± 0 (I) | 6.0 ± 0 (R) | 26.0 ± 0.8 (S) |
| | 34.0 ± 0.5 (S) | 22.3 ± 0.5 (S) | 25.0 ± 0 (S) | 27.0 ± 0 (S) | 18.3 ± 0.5 (S) | 31.3 ± 0.9 (S) |
| | 34.0 ± 0.5 (S) | 23.0 ± 0 (S) | 27.0 ± 0 (S) | 29.3 ± 0 (S) | 18.0 ± 0.8 (S) | 31.7 ± 0.5 (S) |
| | 31.7 ± 0.5 (S) | 24.7 ± 0.5 (R) | 28.3 ± 0.5 (S) | 23.7 ± 0.5 (S) | 12.0 ± 0 (S) | 15.0 ± 0 (S) |
| | 32.7 ± 0.5 (S) | 26.0 ± 0.8 (R) | 28.3 ± 0.5 (S) | 21.3 ± 0.5 (S) | 11.0 ± 0 (I) | 14.3 ± 0.5 (S) |
| | 30.3 ± 0.5 (S) | 6.0 ± 0 (R) | 10.7 ± 0.5 (R) | 8.7 ± 0.9 (R) | 18.7 ± 0.5 (S) | 32.3 ± 0.5 (S) |
| | 42.3 ± 0.5 (S) | 31.0 ± 0.8 (S) | 20.3 ± 0.5 (S) | 26.7 ± 0.9 (S) | 15.7 ± 0.5 (S) | 20.3 ± 1.2 (S) |
| | 34.0 ± 0.5 (S) | 26.0 ± 0 (S) | 25.7 ± 0.5 (S) | 28.7 ± 0.5 (S) | 17.7 ± 0.5 (S) | 29.0 ± 0.8 (S) |
| | 41.7 ± 0 (S) | 8.0 ± 0 (R) | 29.0 ± 0 (S) | 30.0 ± 0.8 (S) | 19.0 ± 0 (S) | 35.0 ± 0 (S) |
| | 42.7 ± 0.8 (S) | 29.0 ± 0.8 (S) | 6.0 ± 0 (R) | 6.0 ± 0 (R) | 20.3 ± 0 (S) | 39.7 ± 0.5 (S) |
| | 27.0 ± 0 (S) | 6.0 ± 0 (R) | 10.7 ± 0.5 (R) | 6.0 ± 0 (R) | 18.0 ± 0 (S) | 30.7 ± 0.5 (S) |
Notes: S, susceptible; R, resistant; I, intermediate. Interpretations follow the Zone Diameter Interpretive Chart for BD BBL Sensi-Disk Antimicrobial Susceptibility Test Disks. Data represent the mean ± SD from three independent experiments, with n ≥ 3 for each data point.
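The S, I, and R calls in the table come from comparing each zone diameter with breakpoints in an interpretive chart. The sketch below shows that comparison in its simplest form. The function `interpret_zone` and the breakpoint values are hypothetical placeholders: real breakpoints depend on the antibiotic and the organism and must be read from the chart itself.

```python
def interpret_zone(diameter_mm, resistant_max, susceptible_min):
    """Classify a zone diameter (mm) as 'R', 'I', or 'S'.

    resistant_max:   largest diameter still called resistant
    susceptible_min: smallest diameter already called susceptible
    Diameters strictly between the two breakpoints are intermediate.
    """
    if resistant_max >= susceptible_min:
        raise ValueError("resistant_max must be below susceptible_min")
    if diameter_mm <= resistant_max:
        return "R"
    if diameter_mm >= susceptible_min:
        return "S"
    return "I"


# Placeholder breakpoints for illustration only (assumption, not chart values).
EXAMPLE_BREAKPOINTS = {"resistant_max": 12, "susceptible_min": 18}

for diameter in (6.0, 15.0, 27.0):
    print(diameter, interpret_zone(diameter, **EXAMPLE_BREAKPOINTS))
```

A reading of 6.0 mm typically corresponds to the disk diameter itself, meaning no inhibition zone formed, which is why such entries are classified as resistant under any reasonable breakpoint.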
Fig. 4. AR detection and immune activation in the ML models for bacterial threat assessment. (A and B) Performance of the four ML models in Test 1 (A) or Test 2 (B) for bacterial AR assays. (C) Representative flow cytometry plots of GFP reporter activation induced by S. enterica and E. coli and the quantification of activated NF-κB/Jurkat/GFP T lymphocyte reporter cells at various hours post infection (h.p.i.) at an MOI of 1. GFP signal was measured using a BD Fortessa X-20 (FITC: 488-nm laser with bandwidth filter 525/50) at various h.p.i. (D and E) Performance of the four indicated ML models in Test 1 (D) or Test 2 (E) for the immune activation assay. All models were characterized to determine the percentage of data required to plateau in performance. Each machine learning algorithm was run 20 times, with the error bars showing the 95% confidence interval from the accuracy scores in each run. Accuracy refers to the percentage of strains assigned correctly by the models.
Fig. 5. The ensemble ML model of PathEngine improves the accuracy of threat assessment. (A and B) Aggregated performance for all four phenotypic assays (bacterial adherence, host immune activation, AR, and bacterial toxicity) in PathEngine prediction for Test 1 (A) and Test 2 (B). Each machine learning algorithm was run 20 times, with the error bars showing the 95% confidence interval from the accuracy scores in each run. (C and D) The observations from each strain and each phenotypic assay were aggregated to make one prediction per strain in the PathEngine ensemble model. The accuracy was estimated by comparing the actual threat status and the predicted threat status for each strain. Bacterial pathogenic potential was quantified by individual assays and the ensemble assay in Test 1 (C) and Test 2 (D). Pathogenic scores obtained from ML predictions for each assay range between 0 (blue) and 1 (red), where 0 represents a strong nonpathogen and 1 represents a strong pathogen. The ensemble probabilities show the result when the pathogenic scores from all four assays are ensembled together. Ensemble predictions convert the ensemble probabilities to 0 or 1 with a cutoff at 0.5 for comparison with the pathogenicity label in the last column.
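The caption describes combining per-assay pathogenic scores into one ensemble probability per strain and converting it to 0 or 1 with a cutoff at 0.5. The caption does not say how the scores are combined, so the sketch below assumes a plain unweighted mean, and it assumes that a probability exactly at the cutoff counts as a pathogen. The function name, assay keys, and scores are hypothetical.

```python
from statistics import fmean


def ensemble_prediction(assay_scores, cutoff=0.5):
    """Combine per-assay pathogenic scores into one call per strain.

    assay_scores: dict mapping assay name -> score in [0, 1]
    Returns (ensemble_probability, predicted_label), where 1 = pathogen.
    Assumption: unweighted mean; the source does not specify the rule.
    """
    if not assay_scores:
        raise ValueError("at least one assay score is required")
    for name, score in assay_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
    probability = fmean(assay_scores.values())
    return probability, int(probability >= cutoff)


# Hypothetical scores for a single strain (illustrative only).
scores = {
    "adherence": 0.8,
    "toxicity": 0.6,
    "AR": 0.9,
    "immune_activation": 0.3,
}
probability, label = ensemble_prediction(scores)
print(f"ensemble probability = {probability:.2f}, prediction = {label}")
```

In this example one assay scores the strain below 0.5, yet the ensemble still calls it a pathogen because the other three assays outweigh it. This kind of disagreement between assays is what aggregation is meant to absorb.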
Aggregated accuracy, precision, recall, and F1 scores for each strain by aggregating across assays for Test 1 and Test 2
| Assay combination | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| Test 1 | | | | |
| All four assays | 99.0 | 99.0 | 99.0 | 99.0 |
| Adherence + Toxicity + AR | 98.0 | 98.0 | 98.0 | 98.0 |
| Adherence + AR | 97.0 | 96.0 | 100.0 | 98.0 |
| Adherence + Toxicity | 91.0 | 91.0 | 96.0 | 94.0 |
| Toxicity + AR | 98.0 | 98.0 | 100.0 | 99.0 |
| Test 2 | | | | |
| All four assays | 85.0 | 91.0 | 87.0 | 89.0 |
| Adherence + Toxicity + AR | 85.0 | 88.0 | 91.0 | 89.0 |
| Adherence + AR | 80.0 | 84.0 | 89.0 | 86.0 |
| Adherence + Toxicity | 76.0 | 79.0 | 90.0 | 84.0 |
| Toxicity + AR | 83.0 | 90.0 | 86.0 | 88.0 |
Notes: The four assays include bacterial adherence, bacterial toxicity, AR, and host cell immune activation. Accuracy, the ratio of correct predictions to all predictions; precision, the ratio of correctly predicted pathogens to the total predicted pathogens; recall, the ratio of correctly predicted pathogens to the total pathogens present; F1, the harmonic mean of precision and recall.
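The four metrics defined in the notes can be computed directly from per-strain actual and predicted labels. The sketch below follows those definitions, with 1 denoting a pathogen and 0 a nonpathogen. The function name and the example labels are hypothetical and do not reproduce the study's data.

```python
def classification_metrics(actual, predicted):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Labels: 1 = pathogen, 0 = nonpathogen.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight strains (illustrative only).
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
print(classification_metrics(actual, predicted))
```

The same harmonic-mean formula is consistent with the table: a precision of 91.0 and a recall of 87.0 give 2 × 91 × 87 / (91 + 87) ≈ 89.0.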