Sam Benkwitz-Bedford, Martin Palm, Talip Yasir Demirtas, Ville Mustonen, Anne Farewell, Jonas Warringer, Leopold Parts, Danesh Moradigaravand.
Abstract
Escherichia coli is an important cause of bacterial infections worldwide, with multidrug-resistant strains incurring substantial costs on human lives. Besides therapeutic concentrations of antimicrobials in health care settings, the presence of subinhibitory antimicrobial residues in the environment and in clinics selects for antimicrobial resistance (AMR), but the underlying genetic repertoire is less well understood. Here, we used machine learning to predict the population doubling time and cell growth yield of 1,407 genetically diverse E. coli strains expanding under exposure to three subinhibitory concentrations of six classes of antimicrobials from single-nucleotide genetic variants, accessory gene variation, and the presence of known AMR genes. We predicted cell growth yields in the held-out test data with an average correlation (Spearman's ρ) of 0.63 (0.36 to 0.81 across concentrations) and cell doubling times with an average correlation of 0.59 (0.32 to 0.92 across concentrations), with moderate increases in sample size unlikely to improve predictions further. This finding points to the remaining missing heritability of growth under antimicrobial exposure being explained by effects that are too rare or weak to be captured unless sample size is dramatically increased, or by effects other than those conferred by the presence of individual single-nucleotide polymorphisms (SNPs) and genes. Predictions based on whole-genome information were generally superior to those based only on known AMR genes and were accurate for resistance at therapeutic concentrations. We pinpointed genes and SNPs determining the predicted growth and thereby recapitulated many known AMR determinants. Finally, we estimated the effect sizes of resistance genes across the entire collection of strains, disclosing the growth effects for known resistance genes in each individual strain. Our results underscore the potential of predictive modeling of growth patterns from genomic data under subinhibitory concentrations of antimicrobials, although the remaining missing heritability poses a challenge for achieving the accuracy and precision required for clinical use.
IMPORTANCE Predicting bacterial growth from genome sequences is important for a rapid characterization of strains in clinical diagnostics and to disclose candidate novel targets for anti-infective drugs. Previous studies have dissected the relationship between bacterial growth and genotype in mutant libraries for laboratory strains, yet no study so far has examined the predictive power of genome sequence in natural strains. In this study, we used a high-throughput phenotypic assay to measure the growth of a systematic collection of natural Escherichia coli strains and then employed machine learning models to predict bacterial growth from genomic data under nontherapeutic subinhibitory concentrations of antimicrobials that are common in nonclinical settings. We found a moderate to strong correlation between predicted and actual values for the different collected data sets. Moreover, we observed that the known resistance genes are still effective at sublethal concentrations, pointing to clinical implications of these concentrations.
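Agreement between predicted and measured growth values, as reported above, can be scored with Spearman's ρ, which is the Pearson correlation of the ranks. The following is a minimal sketch in plain Python; the function names and the toy numbers are illustrative assumptions and are not taken from the study's code or data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(predicted, observed):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(predicted), average_ranks(observed))


# Hypothetical predicted and measured growth yields for five strains.
predicted = [0.41, 0.35, 0.62, 0.55, 0.80]
observed = [0.44, 0.47, 0.58, 0.61, 0.79]
print(round(spearman_rho(predicted, observed), 3))
```

Because only ranks enter the calculation, the score rewards a model that orders strains correctly even when its predictions are on a different scale from the measurements.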
Keywords: antimicrobial resistance; deep learning; high-throughput assay; machine learning; whole-genome sequencing
Year: 2021 PMID: 34427505 PMCID: PMC8407197 DOI: 10.1128/mSystems.00346-21
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
FIG 1 Performance (Spearman’s ρ, y axis) of the best-performing predictive model of each model type (colors) for 6 antimicrobials (panels) at 3 concentrations (x axis) and under the control condition of no antimicrobial treatment (no AB) for doubling time (top row) and growth yield (bottom row). Performance was assessed as the magnitude of the correlation (Spearman’s ρ) between the predicted and real data in the test data set. Numbers 1, 2, and 3 represent low, medium, and high subinhibitory concentrations of antimicrobials, respectively. Values are corrected for measurement errors. Error bars for the gradient-boosted and lasso regressors correspond to the 95% confidence interval computed from Spearman’s ρ values for four cross-validation data sets. Error bars for the neural network show the 95% confidence interval computed from Spearman’s ρ values for 10 independent runs of the best-performing models on the test data sets.
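The caption does not state how the 95% confidence intervals were derived from the per-fold or per-run ρ values. One common choice, shown here purely as an assumption, is a Student's t interval around the mean. The critical values are hard-coded for the two sample sizes mentioned (4 folds and 10 runs), and the ρ values are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Only the cases named in the caption are covered: 4 folds (df=3), 10 runs (df=9).
T_CRIT_95 = {3: 3.182, 9: 2.262}


def mean_ci95(values):
    """Return (mean, lower, upper) of a t-based 95% confidence interval."""
    df = len(values) - 1
    if df not in T_CRIT_95:
        raise ValueError("no critical value stored for this sample size")
    mean = statistics.mean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = T_CRIT_95[df] * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical Spearman's rho values from four cross-validation folds.
fold_rhos = [0.60, 0.64, 0.58, 0.66]
mean, lower, upper = mean_ci95(fold_rhos)
print(f"rho = {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With only four folds the critical value is large, so intervals built this way are wide even when the fold-to-fold spread is small.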
FIG 2 Performance (Spearman’s ρ, y axis) of the best-performing gradient-boosted regressor model for each predictor feature set (colors) for 6 antimicrobials (panels) at 3 concentrations (x axis) and under the control condition of no antimicrobial treatment (no AB) for doubling time (top row) and growth yield (bottom row). Performance was assessed as the magnitude of the correlation (Spearman’s ρ) between the predicted and real data in the test data set. Values are corrected for measurement errors. Numbers 1, 2, and 3 represent low, medium, and high subinhibitory concentrations of antimicrobials, respectively. Error bars correspond to the 95% confidence interval computed from Spearman’s ρ values for four cross-validation data sets.
FIG 3 Feature importance in gradient-boosted regressor models for (A) growth yield and (B) doubling time in the absence of antimicrobials. Features are predictive gene family features, sorted according to their average ranks across models trained on four cross-validation subsets. Box plots show the Shapley additive explanations (SHAP) value, i.e., the effect of the presence or absence of the genes on the response features of doubling time and growth yield. Bar plots show the frequency of the gene families in the pangenome. Asterisks indicate the significance of P values for independence from population structure, computed by Scoary, at the 0.05 (*) and 0.01 (**) levels. The matrix shows the pairwise association between the presence of hits, where the color density shows the strength of the association and the colors show the direction of the Pearson correlation. The sequences of the gene families are provided in the GitHub directory of the project.
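To illustrate what a Shapley-style attribution measures, the sketch below computes exact Shapley values for a toy model by averaging each gene's marginal contribution over all orderings. It is a simplification under stated assumptions: the model, its coefficients, and the all-absent baseline are hypothetical, and SHAP as applied to tree models uses a background data set and a faster algorithm rather than this brute-force enumeration.

```python
from itertools import permutations


def toy_growth_model(genes):
    """Hypothetical growth yield: gene 0 acts alone, genes 1 and 2 interact."""
    g0, g1, g2 = genes
    return 10.0 + 3.0 * g0 + 2.0 * g1 * g2


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to a fixed `baseline`."""
    n = len(instance)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for feature in order:
            # Switch this feature from its baseline to its observed value
            # and credit it with the resulting change in the prediction.
            current[feature] = instance[feature]
            updated = model(current)
            totals[feature] += updated - previous
            previous = updated
    return [total / len(orders) for total in totals]


strain = [1, 1, 1]    # all three genes present
baseline = [0, 0, 0]  # all three genes absent
contributions = shapley_values(toy_growth_model, strain, baseline)
print(contributions)
```

The contributions sum to the difference between the prediction for the strain and the prediction for the baseline, and the interaction effect of genes 1 and 2 is split equally between them.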
Predictive biomarker gene families for doubling time and growth yield
| Gene | Product function | Linked ARGs | Frequency | AMR | Significant associations |
|---|---|---|---|---|---|
| | Nuclease-like protein | | 75 | | G, G |
| | NAD-dependent epimerase/dehydratase, polysaccharide | | 65 | | G |
| | | | 382 | CTX | G, G, G |
| | Tetracycline repressor protein | | 297 | | G, G, GD, G, G |
| | Tetracycline efflux protein | | 304 | TET | G, GD, GD, G, G, G |
| | Dihydropteroate synthase type-1 | | 377 | | GD |
| | RNA polymerase sigma factor | | 24 | | D, D |
| | Aminoglycoside 3′-phosphotransferase | | 268 | KAN | G, G, GD, G, G |
| | Regulator of macrolide 2′-phosphotransferase I | | 293 | | G |
| | Multidrug efflux transporter | | 295 | | D |
| | Acetaldehyde dehydrogenase | | 998 | | D |
| | Polysaccharide biosynthesis/export protein | | 55 | | G |
| | Dihydropteroate synthase | | 197 | | G, G, G, G |
| | TonB-dependent receptor | | 34 | | G |
| | Ethidium bromide resistance protein | | 375 | | G, G |
| | Dihydrofolate reductase | | 51 | TRIM | G, G, G |
| | Dihydrofolate reductase | | 263 | TRIM | G, G, G |
| | Phosphomannomutase (PMM) | | 28 | | G |
| | Colicin activity protein | | 15 | | G, G |
| | ArsR family transcriptional regulator | | 164 | | G |
| | Chloramphenicol acetyltransferase | | 125 | CAM | G, G, G |
| | Aminoglycoside adenylyltransferase | | 152 | KAN | G, GD, GD |
| | Alcohol dehydrogenase | | 31 | | G |
| | DeoR family transcriptional regulator | | 22 | | G, G, G |
| | TDP-fucosamine acetyltransferase | | 77 | | G |
| | Cobalamin biosynthesis protein | | 157 | | G |
| | Streptomycin phosphotransferase | | 151 | KAN | D |
| | UPF0162 family protein | | 51 | | G |
| UGDH gene | UDP-glucose 6-dehydrogenase | | 65 | | G |

Condition columns of the original layout: None; CAM, CIP, CTX, KAN, TET, and TRIM, each at concentrations 1, 2, and 3. Marks are listed in left-to-right column order; the assignment of individual marks to condition columns is not preserved in this record.
Gene families found to be significantly linked with the phenotype after accounting for population structure under different treatment conditions, i.e., drug type and concentration (P value cutoff, <0.01). The full sequences of the genes are available in the GitHub directory of the project (see Materials and Methods). CAM, chloramphenicol; CIP, ciprofloxacin; CTX, ceftriaxone; KAN, kanamycin; TET, tetracycline; TRIM, trimethoprim.
D, doubling time measurement; G, growth yield measurement.
AMR column: antimicrobials for the known resistance genes. AMR, antimicrobial resistance.
Linked ARGs column: known resistance genes in the Comprehensive Antibiotic Resistance Database (CARD) data set identified within 200 bp upstream or downstream of the genes. ARG, AMR gene.
FIG 4 Contribution of the presence of known resistance genes and the tnpA gene to the prediction of growth yield and doubling time, as measured by SHAP values. The tnpA gene was found to be linked with the extended-spectrum beta-lactamase (ESBL) gene blaCTX-M. Box plots in red and blue show the distribution of the effect of the presence and absence of features, respectively, on growth-related features in each sample. The number above each pair of box plots is the difference between the medians for presence and absence under that condition, divided by the range of values for the condition and expressed as a percentage. These numbers were used as proxies for the fitness effects of the resistance genes and, for tnpA, of the associated resistance gene.
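The percentage described in the caption can be sketched as follows. This is a hedged illustration: the function name and the SHAP values are hypothetical, and the sketch assumes that "the range of values for the condition" means the range over the pooled presence and absence values, which the caption does not spell out.

```python
import statistics


def effect_percentage(shap_present, shap_absent):
    """Median difference (present minus absent) as a percentage of the range."""
    pooled = list(shap_present) + list(shap_absent)
    # Assumption: the range is taken over all values for the condition.
    value_range = max(pooled) - min(pooled)
    if value_range == 0:
        raise ValueError("range is zero; the percentage is undefined")
    median_difference = (
        statistics.median(shap_present) - statistics.median(shap_absent)
    )
    return 100.0 * median_difference / value_range


# Hypothetical per-strain SHAP values for one gene under one condition.
present = [0.25, 0.5, 0.75]   # strains carrying the gene
absent = [-0.25, 0.0, 0.25]   # strains lacking the gene
print(effect_percentage(present, absent))
```

Dividing by the range puts conditions with different growth scales on a common footing, so the percentages can be compared across antimicrobials and concentrations.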