D Aytan-Aktug, M Nguyen, P T L C Clausen, R L Stevens, F M Aarestrup, O Lund, J J Davis.
Abstract
Antimicrobial resistance (AMR) is an important global health threat that impacts millions of people worldwide each year. Developing methods that can detect and predict AMR phenotypes can help to mitigate the spread of AMR by informing clinical decision making and appropriate mitigation strategies. Many bioinformatic methods have been developed for predicting AMR phenotypes from whole-genome sequences and AMR genes, but recent studies have indicated that predictions can be made from incomplete genome sequence data. In order to more systematically understand this, we built random forest-based machine learning classifiers for predicting susceptible and resistant phenotypes for Klebsiella pneumoniae (1,640 strains), Mycobacterium tuberculosis (2,497 strains), and Salmonella enterica (1,981 strains). We started by building models from alignments that were based on a reference chromosome for each species. We then subsampled each chromosomal alignment and built models for the resulting subalignments, finding that very small regions, representing approximately 0.1 to 0.2% of the chromosome, are predictive. In K. pneumoniae, M. tuberculosis, and S. enterica, the subalignments are able to predict multiple AMR phenotypes with at least 70% accuracy, even though most do not encode an AMR-related function. We used these models to identify regions of the chromosome with high and low predictive signals. Finally, subalignments that retain high accuracy across larger phylogenetic distances were examined in greater detail, revealing genes and intergenic regions with potential links to AMR, virulence, transport, and survival under stress conditions.
IMPORTANCE Antimicrobial resistance causes thousands of deaths annually worldwide. Understanding the regions of the genome that are involved in antimicrobial resistance is important for developing mitigation strategies and preventing transmission. Machine learning models are capable of predicting antimicrobial resistance phenotypes from bacterial genome sequence data by identifying resistance genes, mutations, and other correlated features. They are also capable of implicating regions of the genome that have not been previously characterized as being involved in resistance. In this study, we generated global chromosomal alignments for Klebsiella pneumoniae, Mycobacterium tuberculosis, and Salmonella enterica and systematically searched them for small conserved regions of the genome that enable the prediction of antimicrobial resistance phenotypes. In addition to known antimicrobial resistance genes, this analysis identified genes involved in virulence and transport functions, as well as many genes with no previous implication in antimicrobial resistance.
Keywords: AMR; ML; antimicrobial resistance; machine learning; random forest
Year: 2021 PMID: 34128695 PMCID: PMC8269213 DOI: 10.1128/mSystems.00185-21
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
Data set sizes and model performances reported as AUC for models built from the whole chromosomal alignment for each species and from randomly selected subalignments
| Species or antibiotic | No. of susceptible genomes | No. of resistant genomes | Chromosomal alignment AUC | Subalignment AUC |
|---|---|---|---|---|
| K. pneumoniae | | | | |
| Amikacin | 1,296 | 100 | 0.897 ± 0.063 | 0.868 ± 0.080 |
| Aztreonam | 208 | 1,388 | 0.797 ± 0.026 | 0.675 ± 0.065 |
| Cefepime | 407 | 950 | 0.733 ± 0.044 | 0.653 ± 0.048 |
| Cefoxitin | 650 | 819 | 0.880 ± 0.043 | 0.749 ± 0.059 |
| Ceftazidime | 128 | 1,470 | 0.930 ± 0.027 | 0.761 ± 0.095 |
| Ciprofloxacin | 189 | 1,413 | 0.963 ± 0.012 | 0.865 ± 0.089 |
| Gentamicin | 909 | 676 | 0.897 ± 0.015 | 0.762 ± 0.065 |
| Imipenem | 1,138 | 473 | 0.915 ± 0.016 | 0.757 ± 0.066 |
| Levofloxacin | 332 | 1,277 | 0.974 ± 0.006 | 0.906 ± 0.091 |
| Meropenem | 1,112 | 478 | 0.915 ± 0.009 | 0.763 ± 0.072 |
| Piperacillin-tazobactam | 417 | 1,040 | 0.850 ± 0.014 | 0.746 ± 0.064 |
| Tetracycline | 725 | 766 | 0.805 ± 0.015 | 0.708 ± 0.049 |
| Tobramycin | 578 | 715 | 0.891 ± 0.015 | 0.821 ± 0.068 |
| Trimethoprim-sulfamethoxazole | 405 | 1,235 | 0.844 ± 0.023 | 0.641 ± 0.058 |
| Avg | | | 0.878 ± 0.023 | 0.763 ± 0.069 |
| M. tuberculosis | | | | |
| Ethambutol | 2,182 | 246 | 0.807 ± 0.030 | 0.712 ± 0.058 |
| Isoniazid | 1,868 | 570 | 0.760 ± 0.038 | 0.656 ± 0.045 |
| Pyrazinamide | 1,803 | 224 | 0.815 ± 0.033 | 0.723 ± 0.063 |
| Rifampin | 2,061 | 416 | 0.833 ± 0.013 | 0.698 ± 0.053 |
| Streptomycin | 364 | 132 | 0.676 ± 0.026 | 0.659 ± 0.077 |
| Avg | | | 0.778 ± 0.028 | 0.690 ± 0.059 |
| S. enterica | | | | |
| Amoxicillin-clavulanate | 1,473 | 398 | 0.792 ± 0.053 | 0.762 ± 0.046 |
| Ampicillin | 1,277 | 702 | 0.779 ± 0.018 | 0.735 ± 0.045 |
| Cefoxitin | 1,584 | 346 | 0.765 ± 0.036 | 0.751 ± 0.050 |
| Ceftiofur | 1,586 | 391 | 0.815 ± 0.015 | 0.762 ± 0.052 |
| Ceftriaxone | 1,585 | 395 | 0.813 ± 0.037 | 0.762 ± 0.055 |
| Chloramphenicol | 1,848 | 88 | 0.820 ± 0.039 | 0.737 ± 0.083 |
| Gentamicin | 1,622 | 328 | 0.740 ± 0.027 | 0.695 ± 0.050 |
| Streptomycin | 378 | 767 | 0.833 ± 0.014 | 0.759 ± 0.064 |
| Sulfisoxazole | 1,093 | 770 | 0.841 ± 0.033 | 0.793 ± 0.052 |
| Tetracycline | 859 | 1,114 | 0.839 ± 0.021 | 0.753 ± 0.051 |
| Avg | | | 0.804 ± 0.029 | 0.751 ± 0.055 |
Chromosomal alignment AUC values are the results of one model per antibiotic, with the standard deviation of a 5-fold cross-validation.
Subalignment AUC values are for 1,066 and 971 5-kb subalignment models for each antibiotic for K. pneumoniae and S. enterica, respectively, and for 441 10-kb subalignment models for M. tuberculosis. They are the averages of all 5-fold cross-validations with standard deviations.
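As an illustration of how an "AUC ± standard deviation" entry of this kind can be produced, the following minimal sketch computes an AUC from held-out labels and classifier scores and then summarizes per-fold AUCs. The function names `roc_auc` and `summarize` are hypothetical, and the use of the sample standard deviation is an assumption; the record does not state which estimator the authors used.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen resistant genome (label 1)
    receives a higher score than a randomly chosen susceptible one (label 0).
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute an AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(fold_aucs):
    """Mean and sample standard deviation of per-fold AUCs (assumed estimator)."""
    return mean(fold_aucs), stdev(fold_aucs)


# Toy example: scores for two susceptible and two resistant genomes in one fold.
fold_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
avg, sd = summarize([0.70, 0.75, 0.80])
print(f"{fold_auc:.3f}, {avg:.3f} ± {sd:.3f}")
```

The pairwise loop is quadratic in the number of genomes, which is acceptable for a sketch; a rank-based formulation would be preferable for large test sets.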
FIG 1 The effect of subalignment length on model performance. The y axis depicts model performance for subalignment-based models as area under the receiver operating characteristic curve (AUC) values, and the x axis depicts subalignment nucleotide length (in kilobases). Error bars represent the standard deviation of multiple random samples for each length. The number of random samples for each subalignment is shown in Table S1P in the supplemental material. A separate set of 5-fold cross-validated models was computed for each subalignment and antibiotic.
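Drawing a random subalignment of a given length can be sketched as follows. This assumes that a subalignment is a contiguous block of alignment columns with a uniformly random start position, which the caption does not state explicitly; the function name `random_subalignment` is hypothetical.

```python
import random


def random_subalignment(alignment, length, rng):
    """Return (start, rows) for a contiguous window of `length` columns.

    `alignment` is a list of equal-length aligned sequences (strings).
    The start column is drawn uniformly; this sampling scheme is an assumption.
    """
    total = len(alignment[0])
    if any(len(seq) != total for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if not 0 < length <= total:
        raise ValueError("window length must be between 1 and the alignment length")
    start = rng.randrange(total - length + 1)
    return start, [seq[start:start + length] for seq in alignment]


# Toy alignment of two genomes; real windows would span thousands of columns.
rng = random.Random(0)  # fixed seed so the draw is reproducible
start, window = random_subalignment(["ACGTACGTAC", "ACGTTCGTAC"], 4, rng)
print(start, window)
```

Repeating the draw several times per length and training a model on each window would give the spread that the error bars describe.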
FIG 2 Subalignment model accuracies by chromosomal location with alignment conservation. (Top) AUCs of every subalignment-based model are plotted based on their position in the whole chromosomal alignment. Peaks corresponding with antimicrobial resistance (AMR) genes and valleys corresponding with low alignment conservation are denoted with asterisks and are described in greater detail in Fig. S5 and Tables S1H and S1I in the supplemental material. (Bottom) The alignment conservation for each whole chromosomal alignment. The y axis depicts the percentage of sequences with a nucleotide in each column, and the x axis depicts chromosomal alignment position.
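The conservation measure described for the bottom panel, the percentage of sequences with a nucleotide in each column, can be sketched as below. Treating only A, C, G, and T (case-insensitive) as nucleotides, so that gaps and ambiguity codes count as absent, is an assumption; `column_conservation` is a hypothetical name.

```python
def column_conservation(alignment, nucleotides="ACGT"):
    """Percentage of sequences that carry a nucleotide in each alignment column.

    Characters outside `nucleotides` (for example '-' gaps or 'N') are counted
    as missing; that choice is an assumption made for this sketch.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    allowed = set(nucleotides)
    n_seqs = len(alignment)
    return [
        100.0 * sum(seq[col].upper() in allowed for seq in alignment) / n_seqs
        for col in range(length)
    ]


# Four toy genomes, four columns.
print(column_conservation(["ACGT", "A-GT", "AC-T", "--GT"]))
# -> [75.0, 50.0, 75.0, 100.0]
```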
FIG 3 Effect of clustering similar genomes on subalignment model performance. All samples were clustered based on their k-mer similarity in the whole chromosomal alignment at various k-mer identity thresholds. Samples belonging to the same cluster were restricted to either the test or the training sets of each 5-fold cross-validation. Data are the average accuracies with standard deviations of 5-fold cross-validations for 1,066 Klebsiella pneumoniae (5 kb), 971 Salmonella enterica (5 kb), and 441 Mycobacterium tuberculosis (10 kb) subalignments.
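Keeping every cluster entirely on one side of a train/test split can be sketched as follows. Cluster labels are assumed to be precomputed (the k-mer clustering itself is not shown), the names `grouped_folds` and `train_test_splits` are hypothetical, and the greedy size balancing is an illustrative choice, not the authors' documented procedure. scikit-learn's `GroupKFold` provides a comparable group-aware split.

```python
from collections import defaultdict


def grouped_folds(cluster_ids, n_folds=5):
    """Assign sample indices to folds so that no cluster is split across folds."""
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    if len(members) < n_folds:
        raise ValueError("need at least as many clusters as folds")
    folds = [[] for _ in range(n_folds)]
    # Place the largest clusters first, each into the currently smallest fold.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cid])
    return folds


def train_test_splits(cluster_ids, n_folds=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = grouped_folds(cluster_ids, n_folds)
    for k, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != k for i in fold]
        yield sorted(train), sorted(test)


# Nine toy genomes in six clusters, split into three folds.
ids = ["a", "a", "b", "c", "c", "c", "d", "e", "f"]
for train, test in train_test_splits(ids, n_folds=3):
    print(train, test)
```

Because near-identical genomes never appear on both sides of a split, a model cannot score well simply by recognizing close relatives of its training strains.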
FIG 4 The number of subalignments corresponding to a given model AUC for each antibiotic. Genomes were clustered based on similarity, and samples belonging to the same cluster were restricted to either the testing or the training sets in each fold of the 5-fold cross-validation. A clustering threshold of 75% k-mer similarity is shown for K. pneumoniae, and a clustering threshold of 95% is shown for M. tuberculosis and S. enterica. Results for additional thresholds are shown in Fig. S7 in the supplemental material.
FIG 5 Protein-encoding gene functions for subalignments with high AMR prediction accuracies. Each panel includes predictive subalignments with AUCs of >0.8 after clustering the strains at a given k-mer similarity threshold. Points represent each subalignment, and they are plotted based on the corresponding position on the reference chromosome. Subalignments are colored according to hits for AMR (green), virulence (blue), and transporter genes (orange), respectively. If a subalignment contains multiple gene categories, the color will appear as a mixture. Subalignments that do not produce hits in any of the known AMR-related genes are colored in pink. Results for high-scoring subalignments at other clustering thresholds are shown in Fig. S8 in the supplemental material.