Nenad Macesic1,2, Oliver J Bear Don't Walk3, Itsik Pe'er4, Nicholas P Tatonetti3, Anton Y Peleg2,5, Anne-Catrin Uhlemann1,6.
Abstract
Polymyxins are used as treatments of last resort for Gram-negative bacterial infections. Their increased use has led to concerns about emerging polymyxin resistance (PR). Phenotypic polymyxin susceptibility testing is resource intensive and difficult to perform accurately. The complex polygenic nature of PR and our incomplete understanding of its genetic basis make it difficult to predict PR by detecting resistance determinants. We therefore applied machine learning (ML) to whole-genome sequencing data from >600 Klebsiella pneumoniae clonal group 258 (CG258) genomes to predict phenotypic PR. A reference-based representation of genomic data combined with ML outperformed a rule-based approach that detected variants in known PR genes (area under receiver-operator curve [AUROC], 0.894 versus 0.791; P = 0.006). We noted modest increases in performance from using a bacterial genome-wide association study to filter relevant genomic features and from integrating clinical data in the form of prior polymyxin exposure. Conversely, reference-free representation of genomic data as k-mers was associated with decreased performance (AUROC, 0.692 versus 0.894; P = 0.015). When the ML models were interpreted to extract genomic features, six of seven known PR genes were correctly identified by models without prior programming, and several genes involved in stress responses and maintenance of the cell membrane were identified as potential novel determinants of PR. These findings are a proof of concept that whole-genome sequencing data can accurately predict PR in K. pneumoniae CG258 and may be applicable to other forms of complex antimicrobial resistance.

IMPORTANCE
Polymyxins are last-resort antibiotics used to treat highly resistant Gram-negative bacteria. There are increasing reports of polymyxin resistance emerging, raising concerns of a postantibiotic era.
Polymyxin resistance is therefore a significant public health threat, but current phenotypic methods for detection are difficult and time-consuming to perform. There have been increasing efforts to use whole-genome sequencing for detection of antibiotic resistance, but this has been difficult to apply to polymyxin resistance because of its complex polygenic nature. The significance of our research is that we successfully applied machine learning methods to predict polymyxin resistance in Klebsiella pneumoniae clonal group 258, a common health care-associated and multidrug-resistant pathogen. Our findings highlight that machine learning can be successfully applied even in complex forms of antibiotic resistance and represent a significant contribution to the literature that could be used to predict resistance in other bacteria and to other antibiotics.
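The study contrasts a reference-based gene-variant matrix with a reference-free k-mer representation of each genome. As an illustration only, here is a minimal sketch of building a binary k-mer presence/absence matrix; the sequences and the `kmer_matrix` helper are toy assumptions, and real workflows use dedicated k-mer counters on whole assemblies:

```python
def kmer_set(seq, k=5):
    """All k-mers present in a sequence (toy stand-in for a real k-mer counter)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_matrix(genomes, k=5):
    """Binary presence/absence matrix: one row per genome, one column per k-mer."""
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    vocab = sorted(set().union(*sets.values()))  # union of k-mers across genomes
    matrix = [[1 if km in sets[name] else 0 for km in vocab] for name in genomes]
    return vocab, matrix

# Toy "genomes" (hypothetical fragments, not real CG258 sequences)
genomes = {"isolate_1": "ACGTACGTACGTAAA", "isolate_2": "ACGTACGTACGTTTT"}
vocab, M = kmer_matrix(genomes, k=5)
```

Each row of `M` can then serve directly as a feature vector for a classifier, which is what makes the representation reference-free: no alignment to a reference genome is needed.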
Keywords: antimicrobial resistance; genotype; machine learning; phenotype; prediction
Year: 2020 PMID: 32457240 PMCID: PMC7253370 DOI: 10.1128/mSystems.00656-19
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
FIG 1 Summary of input data sets according to polymyxin resistance. Histograms show the relative distribution of polymyxin resistance across all genomes, Columbia University Irving Medical Center (CUIMC) genomes only, non-CUIMC genomes only, and then individual publicly available data sets that formed non-CUIMC genomes. For further information regarding individual genomes, see Data Set S1, sheet 1, in the supplemental material.
FIG 2 Schematic of polymyxin susceptibility genotype-phenotype prediction using machine learning. Publicly available genomes and genomes from Columbia University Irving Medical Center were processed using either a reference-based or a reference-free approach in order to generate a binary matrix. The binary matrix represented either a gene variant matrix of coding regions in the genome (reference-based approach) or individual k-mers (reference-free approach). For machine learning analyses, two approaches were used. In the non-genome-wide association study (non-GWAS) approach, a scikit-learn pipeline was implemented, which included a further feature selection step using a support vector classifier and then comparison between four algorithms with 10-fold cross-validation (CV) for hyperparameter tuning. For the GWAS filtering approach, further feature selection was performed by integrating the results of a bacterial GWAS to prioritize genes. Model training was performed with hyperparameter tuning using the 75% training set, with the same four algorithms evaluated. The best-performing model was chosen, and the data were bootstrapped 10 times to assess model performance, with GWAS performed on the 75% training split of each data set. Specific tools used during the workflow are noted in the figure.
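The scikit-learn portion of this workflow (feature selection with a support vector classifier, then comparison of four candidate algorithms with 10-fold CV on a 75% training split) can be sketched as follows. The synthetic data, parameter grid, and specific estimator settings here are illustrative assumptions, not the study's exact configuration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy binary gene-variant matrix: 200 genomes x 50 features; the phenotype
# depends on the first two "genes" (illustrative data, not the study's)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 75% training split, as in the workflow above
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.75, random_state=0, stratify=y)

# Feature selection with an L1-penalized support vector classifier,
# then a swappable final classifier
pipe = Pipeline([
    ("select", SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=5000))),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Four candidate algorithms compared with 10-fold CV (grid is illustrative)
param_grid = [
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [100]},
    {"clf": [GradientBoostingClassifier(random_state=0)]},
    {"clf": [SVC(probability=True, random_state=0)]},
    {"clf": [LogisticRegression(max_iter=1000)]},
]
search = GridSearchCV(pipe, param_grid, cv=10, scoring="roc_auc")
search.fit(X_tr, y_tr)
test_auroc = search.score(X_te, y_te)  # AUROC on the held-out 25%
```

Passing whole estimator instances in `param_grid` is the idiomatic scikit-learn way to compare algorithm families within a single pipeline; the GWAS-filtering approach described above would instead replace the `select` step with a gene list prioritized by the bacterial GWAS.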
Comparison of rule-based and reference-based approaches for prediction of PR
| Genomes used | Metric | Rule-based value | Ref-based value, non-GWAS (mean [95% CI]) | Ref-based value, GWAS (mean [95% CI]) | P value (rule-based vs non-GWAS) | P value (rule-based vs GWAS) |
|---|---|---|---|---|---|---|
| CUIMC | Best algorithm | | Random forest | Random forest | | |
| | AUROC | 0.832 | 0.885 (0.849, 0.92) | 0.893 (0.864, 0.922) | 0.014* | 0.004* |
| | bACC | 0.832 | 0.789 (0.751, 0.827) | 0.841 (0.82, 0.862) | 0.049+ | 0.262 |
| | Accuracy | 0.821 | 0.796 (0.762, 0.83) | 0.854 (0.83, 0.878) | 0.185 | 0.052 |
| | F1 | 0.819 | 0.755 (0.701, 0.809) | 0.816 (0.793, 0.84) | 0.027+ | 0.919 |
| | Precision | 0.738 | 0.799 (0.739, 0.859) | 0.881 (0.848, 0.914) | 0.037* | 0.006* |
| | Recall | 0.92 | 0.733 (0.635, 0.831) | 0.763 (0.725, 0.801) | 0.008+ | 0.006+ |
| Non-CUIMC | Best algorithm | | GBTC | SVC | | |
| | AUROC | 0.717 | 0.933 (0.884, 0.982) | 0.933 (0.888, 0.979) | 0.006* | 0.002* |
| | bACC | 0.717 | 0.753 (0.654, 0.853) | 0.82 (0.76, 0.881) | 0.415 | 0.006* |
| | Accuracy | 0.699 | 0.873 (0.822, 0.925) | 0.917 (0.894, 0.94) | 0.006* | 0.006* |
| | F1 | 0.471 | 0.59 (0.395, 0.785) | 0.729 (0.648, 0.81) | 0.185 | 0.002* |
| | Precision | 0.345 | 0.711 (0.473, 0.949) | 0.832 (0.759, 0.905) | 0.018* | 0.006* |
| | Recall | 0.745 | 0.57 (0.362, 0.778) | 0.669 (0.547, 0.791) | 0.184 | 0.262 |
| All | Best algorithm | | GBTC | SVC | | |
| | AUROC | 0.791 | 0.894 (0.838, 0.95) | 0.931 (0.915, 0.947) | 0.006* | 0.002* |
| | bACC | 0.791 | 0.784 (0.73, 0.838) | 0.801 (0.776, 0.827) | 0.61 | 0.375 |
| | Accuracy | 0.761 | 0.827 (0.78, 0.874) | 0.864 (0.84, 0.888) | 0.019* | 0.006* |
| | F1 | 0.694 | 0.702 (0.623, 0.781) | 0.741 (0.702, 0.779) | 0.76 | 0.02* |
| | Precision | 0.577 | 0.8 (0.675, 0.926) | 0.889 (0.846, 0.932) | 0.011* | 0.006* |
| | Recall | 0.87 | 0.668 (0.549, 0.788) | 0.638 (0.591, 0.686) | 0.006+ | 0.002+ |
*, statistical significance with a P value of <0.05 in favor of machine-learning approaches; +, statistical significance with a P value of <0.05 in favor of rule-based approaches. Abbreviations: PR, polymyxin resistance; AUROC, area under receiver-operator curve; bACC, balanced accuracy; CI, confidence interval; CUIMC, Columbia University Irving Medical Center; GBTC, gradient boosted trees classifier; GWAS, genome-wide association study; ref, reference; SVC, support vector machine classifier.
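The per-metric means and 95% confidence intervals above come from repeated bootstrapping. The following is a simplified sketch of computing one metric with a bootstrap CI; it resamples a fixed held-out test set, whereas the study re-split and retrained on each of its 10 bootstraps, and the toy labels are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_metric(y_true, y_pred, metric, n_boot=10, seed=0):
    """Mean and 95% CI of a metric over bootstrap resamples of a test set."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    vals = []
    while len(vals) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # AUROC needs both classes present
            continue
        vals.append(metric(y_true[idx], y_pred[idx]))
    vals = np.asarray(vals)
    return vals.mean(), np.percentile(vals, 2.5), np.percentile(vals, 97.5)

# Toy example: a perfect classifier yields AUROC 1.0 with a degenerate CI
y_true = [0] * 20 + [1] * 20
mean, lo, hi = bootstrap_metric(y_true, y_true, roc_auc_score)
```

The same helper works unchanged with `balanced_accuracy_score`, `f1_score`, `precision_score`, or `recall_score` in place of `roc_auc_score`, covering every metric reported in the tables.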
FIG 3 Impact of feature engineering approach and machine learning algorithm on performance of machine learning models for polymyxin resistance prediction. Mean performance with 95% confidence intervals is shown across different performance metrics. The algorithms used were those that achieved the highest area under receiver-operator curve and can be found in Table 3. Histograms show how performance is impacted by the feature engineering approach (A) and the choice of machine learning algorithm (B). Abbreviations: AUROC, area under receiver-operator curve; bACC, balanced accuracy; CUIMC, Columbia University Irving Medical Center; GBTC, gradient boosted trees classifier; GWAS, genome-wide association study; SVC, support vector machine classifier.
Genes ranked by relative feature importance in ML model for PR prediction that incorporated all genomes and used GBTC
| Gene | Relative feature importance | Full name | Function and comments | Reference(s) |
|---|---|---|---|---|
| | 0.317 | | Known determinant | |
| | 0.0454 | | Known determinant | |
| | 0.0343 | | Known determinant | |
| | 0.0325 | | Known determinant | |
| | 0.0284 | Dihydrolipoyl dehydrogenase | Respiratory chain enzyme; implicated | |
| | 0.0250 | Alkyl hydroperoxide | Outer membrane protein conferring | |
| | 0.0195 | Grx4 family monothiol | Iron | |
| | 0.0185 | | Sensing of osmotic signals, | |
| | 0.0183 | Phosphate import ATP- | Capture and transport of periplasmic phosphate | |
| | 0.0160 | | Known determinant | |
| | 0.0147 | | Unknown | |
| | 0.0143 | | Known determinant | |
| | 0.0136 | UDP 4-deoxy-4- | Part of | |
| | 0.0136 | Aminopeptidase N | Cell wall protein, possible target of neutrophil | |
| | 0.0133 | Paraquat-inducible | Involved in transport pathways that contribute | |
| KP0228_00228 | 0.0113 | H239_3063 | Encodes putative RND-type efflux pump | |
| | 0.0104 | DNA topoisomerase (ATP- | Implicated in fluoroquinolone resistance | |
| KP0228_00219 | 0.00954 | | Unknown | |
| | 0.00902 | | Unknown | |
| | 0.00897 | Phosphatidylglycero- | Involved in generating phospholipid for | |
ML, machine learning; PR, polymyxin resistance; GBTC, gradient-boosted trees classifier.
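The ranking above comes from interpreting the trained model's relative feature importances. A minimal sketch with a gradient boosted trees classifier on toy data follows; the `gene_*` labels are hypothetical placeholders for real locus tags, and the phenotype is wired to a single "gene" for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy gene-variant matrix in which only "gene_03" drives the phenotype
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = X[:, 3].astype(int)
genes = [f"gene_{i:02d}" for i in range(20)]  # hypothetical locus-tag labels

gbtc = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank genes by the model's relative feature importance, as in the table above
ranked = sorted(zip(genes, gbtc.feature_importances_), key=lambda t: -t[1])
```

Because scikit-learn normalizes `feature_importances_` to sum to 1, the values are directly comparable to the "relative feature importance" column; in practice the importances would be mapped back through any feature-selection step to the original gene coordinates.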
Comparison of the effects of different feature engineering approaches on the performance of ML prediction of PR
| Genomes | Metric | Ref-based, non-GWAS (mean [95% CI]) | Ref-based, GWAS (mean [95% CI]) | Ref-free, k-mer (mean [95% CI]) | Ref-based + clinical data (mean [95% CI]) | P value (GWAS vs non-GWAS) | P value (ref-free vs non-GWAS) | P value (clinical vs non-GWAS) |
|---|---|---|---|---|---|---|---|---|
| CUIMC | Best algorithm | Random forest | Random forest | SVC | GBTC | | | |
| | AUROC | 0.885 (0.849, 0.92) | 0.893 (0.864, 0.922) | 0.696 (0.564, 0.828) | 0.923 (0.88, 0.965) | 0.571 | 0.241 | 0.104 |
| | bACC | 0.789 (0.751, 0.827) | 0.841 (0.82, 0.862) | 0.64 (0.536, 0.743) | 0.796 (0.714, 0.879) | 0.026* | 0.226 | 0.544 |
| | Accuracy | 0.796 (0.762, 0.83) | 0.854 (0.83, 0.878) | 0.649 (0.541, 0.758) | 0.804 (0.716, 0.892) | 0.009* | 0.91 | 0.342 |
| | F1 | 0.755 (0.701, 0.809) | 0.816 (0.793, 0.84) | 0.579 (0.454, 0.704) | 0.768 (0.685, 0.85) | 0.045* | 0.734 | 0.733 |
| | Precision | 0.799 (0.739, 0.859) | 0.881 (0.848, 0.914) | 0.67 (0.506, 0.833) | 0.866 (0.737, 0.996) | 0.006* | 0.023* | 0.085 |
| | Recall | 0.733 (0.635, 0.831) | 0.763 (0.725, 0.801) | 0.56 (0.407, 0.714) | 0.732 (0.607, 0.857) | 1 | 0.011* | 0.879 |
| Non-CUIMC | Best algorithm | GBTC | SVC | SVC | | | | |
| | AUROC | 0.933 (0.884, 0.982) | 0.933 (0.888, 0.979) | 0.803 (0.692, 0.913) | | 0.85 | 0.677 | |
| | bACC | 0.753 (0.654, 0.853) | 0.82 (0.76, 0.881) | 0.5 (0.5, 0.5) | | 0.427 | 0.089 | |
| | Accuracy | 0.873 (0.822, 0.925) | 0.917 (0.894, 0.94) | 0.82 (0.811, 0.83) | | 0.185 | 0.005* | |
| | F1 | 0.59 (0.395, 0.785) | 0.729 (0.648, 0.81) | 0 (0, 0) | | 0.345 | 0.623 | |
| | Precision | 0.711 (0.473, 0.949) | 0.832 (0.759, 0.905) | 0 (0, 0) | | 0.703 | 0.569 | |
| | Recall | 0.57 (0.362, 0.778) | 0.669 (0.547, 0.791) | 0 (0, 0) | | 0.88 | 0* | |
| All | Best algorithm | GBTC | SVC | GBTC | | | | |
| | AUROC | 0.894 (0.838, 0.95) | 0.931 (0.915, 0.947) | 0.692 (0.546, 0.838) | | 0.19 | 0.015* | |
| | bACC | 0.784 (0.73, 0.838) | 0.801 (0.776, 0.827) | 0.5 (0.5, 0.5) | | 0.473 | 0.006* | |
| | Accuracy | 0.827 (0.78, 0.874) | 0.864 (0.84, 0.888) | 0.688 (0.685, 0.691) | | 0.162 | 0.045* | |
| | F1 | 0.702 (0.623, 0.781) | 0.741 (0.702, 0.779) | 0 (0, 0) | | 0.384 | 0.003* | |
| | Precision | 0.8 (0.675, 0.926) | 0.889 (0.846, 0.932) | 0 (0, 0) | | 0.363 | 0.472 | |
| | Recall | 0.668 (0.549, 0.788) | 0.638 (0.591, 0.686) | 0 (0, 0) | | 0.344 | 0.064 | |
*, statistical significance with a P value of <0.05. Abbreviations: ML, machine learning; PR, polymyxin resistance; AUROC, area under receiver-operator curve; bACC, balanced accuracy; CI, confidence interval; CUIMC, Columbia University Irving Medical Center; GBTC, gradient boosted trees classifier; GWAS, genome-wide association study; ref, reference; SVC, support vector machine classifier.
Values in each column were obtained using the best-performing algorithm indicated.