Sabiha Shaik, Anuradha Singh, Arya Suresh, Niyaz Ahmed.
Abstract
Escherichia coli, a ubiquitous commensal/pathogenic member from the Enterobacteriaceae family, accounts for high infection burden, morbidity, and mortality throughout the world. With emerging multidrug resistance (MDR) on a massive scale, E. coli has been listed as one of the Global Antimicrobial Resistance and Use Surveillance System (GLASS) priority pathogens. Understanding the resistance mechanisms and underlying genomic features appears to be of utmost importance to tackle further spread of these multidrug-resistant superbugs. While a few of the globally prevalent sequence types (STs) of E. coli, such as ST131, ST69, ST405, and ST648, have been previously reported to be highly virulent and harboring MDR, there is no clarity if certain ST lineages have a greater propensity to acquire MDR. In this study, large-scale comparative genomics of a total of 5,653 E. coli genomes from 19 ST lineages revealed ST-wide prevalence patterns of genomic features, such as antimicrobial resistance (AMR)-encoding genes/mutations, virulence genes, integrons, and transposons. Interpretation of the importance of these features using a Random Forest Classifier trained with 11,988 genomic features from whole-genome sequence data identified ST-specific or phylogroup-specific signature proteins mostly belonging to different protein superfamilies, including the toxin-antitoxin systems. Our study provides a comprehensive understanding of a myriad of genomic features, ST-specific proteins, and resistance mechanisms entailing different lineages of E. coli at the level of genomes; this could be of significant downstream importance in understanding the mechanisms of AMR, in clinical discovery, in epidemiology, and in devising control strategies.
IMPORTANCE
With the leap in whole-genome data being generated, the application of relevant methods to mine biologically significant information from microbial genomes is of utmost importance to public health genomics. Machine-learning methods have been used not only to mine, curate, or classify the data but also to identify the relevant features that could be linked to a particular class/target. This is perhaps one of the pioneering studies that has attempted to classify a large repertoire of E. coli genome data sets (5,653 genomes) belonging to 19 different STs (including well-studied as well as understudied STs) using machine learning approaches. Important features identified by these approaches have revealed ST-specific signature proteins, which could be further studied to predict possible associations with the phenotypic profiles, thereby providing a better understanding of virulence and the resistance mechanisms among different clonal lineages of E. coli.
Keywords: AMR surveillance; Escherichia coli; bacterial evolution; bioinformatics; genomics; machine learning; molecular epidemiology; sequence types; virulence
Year: 2022 PMID: 35164570 PMCID: PMC8844930 DOI: 10.1128/mbio.03796-21
Source DB: PubMed Journal: mBio Impact factor: 7.867
Serotype information of 19 STs considered in the study
| ST | No. of genomes | Total no. of serotypes | Two major serotypes (%) |
|---|---|---|---|
| ST131 | 1,101 | 13 | O25:H4 (84.5), O16:H5 (8.9) |
| ST10 | 1,036 | 222 | O16:H48 (26.0), O89:H9 (4.1) |
| ST11 | 543 | 3 | O157:H7 (99.1), −:H7 (0.6) |
| ST95 | 418 | 13 | O1:H7 (27.8), O18:H7 (21.8) |
| ST73 | 318 | 10 | O6:H1 (59.7), O2/O50:H1 (11.0) |
| ST69 | 276 | 25 | O77/O17/O44/O106/O73:H18 (32.2), O77/O17/O44/O106:H18 (21.7) |
| ST58 | 222 | 79 | O8:H25 (10.8), −:H21 (6.8) |
| ST410 | 214 | 24 | O8:H9 (42.5), −:H9 (24.3) |
| ST101 | 202 | 68 | −:H31 (8.4), O82:H8 (7.4) |
| ST38 | 164 | 26 | O86:H18 (30.5), O2/O50:H30 (10.4) |
| ST155 | 162 | 61 | −:H21 (11.7), O86:H51 (6.2) |
| ST167 | 150 | 12 | O89:H9 (68.7), O89:H5 (6.7) |
| ST117 | 142 | 34 | O24:H4 (23.2), O33:H4 (9.9) |
| ST405 | 134 | 5 | O102:H6 (92.5), O2/O50:H4 (3.7) |
| ST48 | 126 | 57 | −:H11 (8.7), O10:H5 (7.1) |
| ST648 | 121 | 22 | O1:H6 (50.4), O153:H6 (8.3) |
| ST127 | 112 | 4 | O6:H31 (92.9), −:H31 (4.5) |
| ST156 | 107 | 50 | −:H28 (13.1), −:H25 (9.3) |
| ST12 | 105 | 7 | O4:H5 (63.8), O4:H1 (17.1) |
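
A summary of this shape (genome count, number of distinct serotypes, and the two most frequent serotypes with their percentages) can be derived from one (ST, serotype) pair per genome. The sketch below is only an illustration of that bookkeeping: the function name `summarize_serotypes` and the toy records are hypothetical and are not the study's data or pipeline.

```python
from collections import Counter, defaultdict


def summarize_serotypes(records, top_n=2):
    """Summarize serotypes per ST.

    records: iterable of (st, serotype) pairs, one pair per genome.
    Returns {st: {"genomes": int, "serotypes": int, "top": [(serotype, pct), ...]}}.
    """
    by_st = defaultdict(Counter)
    for st, serotype in records:
        by_st[st][serotype] += 1

    summary = {}
    for st, counts in by_st.items():
        total = sum(counts.values())
        # Percentage of the ST's genomes assigned to each of the top serotypes.
        top = [(sero, round(100 * n / total, 1)) for sero, n in counts.most_common(top_n)]
        summary[st] = {"genomes": total, "serotypes": len(counts), "top": top}
    return summary


# Toy input with made-up counts (hypothetical, not taken from the table above).
records = (
    [("ST_A", "O25:H4")] * 8
    + [("ST_A", "O16:H5")] * 2
    + [("ST_B", "O157:H7")] * 3
)
summary = summarize_serotypes(records)
print(summary["ST_A"])
```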
FIG 1 Heat map depicting the resistome profiles of the 5,653 genomes from 19 different STs. Gene names are represented on the y axis and the ST lineage on the x axis. The % presence of each gene at the ST level was calculated as (no. of genomes in the ST carrying the gene/total no. of genomes in the ST) × 100 and plotted using the matplotlib module. The color key represents % presence.
FIG 2 Heat map depicting the virulome profile of the 5,653 genomes from 19 different STs. Gene names are given on the y axis and the ST lineage on the x axis. The % presence of each gene at the ST level was calculated as (no. of genomes in the ST carrying the gene/total no. of genomes in the ST) × 100. The color key represents % presence.
FIG 3 Heat map depicting the prevalence of genes linked with secretion systems in the 5,653 genomes from 19 different STs. Gene names are given on the y axis and the ST lineage on the x axis. The % presence of each gene at the ST level was calculated as (no. of genomes in the ST carrying the gene/total no. of genomes in the ST) × 100. The color key represents % presence.
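
The % presence formula used for these heat maps can be sketched in a few lines. This is a minimal illustration with made-up gene calls, not the authors' code: the function name `percent_presence` and the input layout (one ST label and one set of detected genes per genome) are assumptions. The resulting gene-by-ST values are the kind of matrix one could hand to a heat map routine such as matplotlib's `imshow`.

```python
from collections import defaultdict


def percent_presence(genome_st, genome_genes):
    """Compute % presence of each gene within each ST.

    genome_st:    {genome_id: ST label}
    genome_genes: {genome_id: set of genes detected in that genome}
    Returns {gene: {ST: percent of that ST's genomes carrying the gene}}.
    """
    st_totals = defaultdict(int)                  # total no. of genomes per ST
    hits = defaultdict(lambda: defaultdict(int))  # genomes per ST carrying each gene
    for genome, st in genome_st.items():
        st_totals[st] += 1
        for gene in genome_genes.get(genome, ()):
            hits[gene][st] += 1

    # (genomes in ST carrying the gene / total genomes in ST) x 100
    return {
        gene: {st: 100.0 * hits[gene][st] / total for st, total in st_totals.items()}
        for gene in hits
    }


# Toy data: the gene-to-genome assignments below are invented for illustration.
genome_st = {"g1": "ST131", "g2": "ST131", "g3": "ST131", "g4": "ST131",
             "g5": "ST10", "g6": "ST10"}
genome_genes = {
    "g1": {"blaCTX-M-15", "tet(A)"},
    "g2": {"blaCTX-M-15"},
    "g3": {"blaCTX-M-15"},
    "g4": set(),
    "g5": {"tet(A)"},
    "g6": set(),
}
pct = percent_presence(genome_st, genome_genes)
print(pct["blaCTX-M-15"])
```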
FIG 4 (A) Bar plot depicting the number of genomes harboring class 1 integrons among the 19 STs considered under this study. (B) Sunburst plot depicting the prevalence of transposons among the 19 STs.
FIG 5 Principal coordinate analysis (PCoA) plot (with the top 2 PCoA axes) showing the clustering of genomes from different STs. A nested scree plot within the image depicts the proportion of variance explained by the top 10 principal components (PCs).
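
For readers unfamiliar with PCoA, the following is a minimal sketch of classical PCoA (double centering of squared distances followed by an eigendecomposition) on a presence/absence matrix. The record above does not state which distance measure or software was used, so the Jaccard distance, the function names, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def jaccard_distances(X):
    """Pairwise Jaccard distances between rows of a binary presence/absence matrix."""
    Xi = np.asarray(X, dtype=int)
    inter = Xi @ Xi.T                               # shared features per pair
    sizes = Xi.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(union > 0, inter / union, 1.0)  # two empty rows count as identical
    return 1.0 - sim


def pcoa(D, n_axes=2):
    """Classical PCoA: returns (coordinates, proportion of variance per axis)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                     # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = np.clip(eigvals, 0.0, None)          # one convention: drop negative eigenvalues
    coords = eigvecs[:, :n_axes] * np.sqrt(positive[:n_axes])
    explained = positive / positive.sum()           # values one would show in a scree plot
    return coords, explained


# Toy matrix: rows are genomes, columns are features; two obvious groups.
X = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
coords, explained = pcoa(jaccard_distances(X), n_axes=2)
print(coords[:, 0], explained[:2])
```

In this toy case the first axis separates the two groups of genomes, which is the behavior a PCoA plot of well-separated lineages would show.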
Performance of five supervised algorithms that support multiclass classification
| Algorithm | Training accuracy | Testing accuracy | F1 score | Recall | Precision |
|---|---|---|---|---|---|
| RandomForestClassifier | 1.000000 | 0.999410 | 0.999408 | 0.999410 | 0.999412 |
| ExtraTreesClassifier | 1.000000 | 0.999410 | 0.999408 | 0.999410 | 0.999412 |
| KNeighborsClassifier | 0.998989 | 0.998231 | 0.998227 | 0.998231 | 0.998250 |
| CategoricalNB | 0.995198 | 0.998231 | 0.998242 | 0.998231 | 0.998288 |
| XGBoost | 0.183220 | 0.183373 | 0.016311 | 0.052632 | 0.009651 |
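
The XGBoost row is worth a short worked check, because its recall of 0.052632 equals 1/19. The sketch below computes macro-averaged precision, recall, and F1 for a hypothetical model that assigns every test genome to one class. The test-set size (1,696) and the size of that class in the test set (311) are not stated in the record; they are assumptions chosen because they reproduce the reported values, and the use of macro averaging is likewise an assumption. This is an observation about the arithmetic, not a statement of what happened in the study.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (0 where undefined)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Assumed (not stated in the record): 1,696 test genomes, 311 of them in class 0.
N_TEST, N_MAJORITY, N_CLASSES = 1696, 311, 19
base, extra = divmod(N_TEST - N_MAJORITY, N_CLASSES - 1)

y_true = [0] * N_MAJORITY
for k in range(1, N_CLASSES):
    # Spread the remaining genomes over the other 18 classes; the exact split
    # does not affect the result as long as every class is present.
    y_true += [k] * (base + (1 if k <= extra else 0))

y_pred = [0] * len(y_true)  # hypothetical model that always predicts class 0
scores = {name: round(value, 6) for name, value in macro_scores(y_true, y_pred).items()}
print(scores)
```

Under these assumptions the four values agree with the XGBoost row to six decimal places. In the same spirit, the reported training accuracy of 0.183220 matches 725/3,957, and 725 + 311 = 1,036, which is the ST10 genome count in the serotype table.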
FIG 6 Cluster map depicting the prevalence of validated important features from the RF model among the 19 STs considered under this study. The % presence of each protein at the ST level was calculated as (no. of genomes in the ST carrying the protein/total no. of genomes in the ST) × 100. Feature names are represented on the y axis and the ST lineages on the x axis. The color key represents % presence.
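
The general idea of ranking features with a Random Forest can be sketched with scikit-learn on synthetic data. Everything in this example is an assumption made for illustration: the toy matrix (3 classes, 3 planted signature features, 20 noise features), the 70/30 split, the hyperparameters, and the use of impurity-based `feature_importances_`. The record does not say which importance measure or settings the authors used on their 11,988 features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, n_classes, n_noise = 60, 3, 20

# Class labels stand in for ST lineages.
y = np.repeat(np.arange(n_classes), n_per_class)

# Planted "signature" features: column k is present only in genomes of class k.
signature = (y[:, None] == np.arange(n_classes)[None, :]).astype(int)
# Noise features: random presence/absence, unrelated to the class.
noise = rng.integers(0, 2, size=(y.size, n_noise))

X = np.hstack([signature, noise])
feature_names = [f"signature_{k}" for k in range(n_classes)] + [
    f"noise_{j}" for j in range(n_noise)
]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Rank features by impurity-based importance, highest first.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_features = [name for name, _ in ranked[:n_classes]]
print(test_accuracy, top_features)
```

On this toy matrix the planted signature columns should rank above the noise columns. On real genomes, highly ranked features would still need independent validation before being treated as lineage-specific signatures.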