Yunxiao Ren1, Trinad Chakraborty2,3, Swapnil Doijad2,3, Linda Falgenhauer3,4,5, Jane Falgenhauer2,3, Alexander Goesmann3,6, Anne-Christin Hauschild1, Oliver Schwengers3,6, Dominik Heider1.
Abstract
MOTIVATION: Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low-throughput, and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, a comparison of different machine learning methods for predicting AMR from whole-genome sequencing data with different encodings, and without prior knowledge, remains to be done.
Year: 2021 PMID: 34613360 PMCID: PMC8722762 DOI: 10.1093/bioinformatics/btab681
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Workflow of the study. WGS data from Giessen and the public data from Moradigaravand were processed, and single nucleotide polymorphisms (SNPs) were called. The SNP data were encoded by label encoding, one-hot encoding and FCGR encoding for subsequent machine learning. The Giessen dataset was used to train and validate the four machine learning algorithms using cross-validation. The public data were used for the final evaluation of the models. Finally, we analyzed the association of SNPs and SNP-adjacent genes with AMR using EFS. Created with BioRender.com.
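The three encodings named in the workflow can be illustrated with a minimal sketch. The alphabet order, the integer labels, the corner assignment of the chaos game grid and the k-mer length are all assumptions made for illustration; the record above does not state which choices the study used, and FCGR is taken here to mean frequency chaos game representation.

```python
# Assumed alphabet and integer labels (hypothetical; not stated in the source).
ALPHABET = ["N", "A", "C", "G", "T"]
LABELS = {base: idx for idx, base in enumerate(ALPHABET)}

# Assumed corner assignment for the chaos game grid (one common convention).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def label_encode(snps):
    """Label encoding: map each base to a single integer."""
    return [LABELS[base] for base in snps]


def one_hot_encode(snps):
    """One-hot encoding: one binary indicator vector per base, concatenated."""
    encoded = []
    for base in snps:
        vec = [0] * len(ALPHABET)
        vec[LABELS[base]] = 1
        encoded.extend(vec)
    return encoded


def fcgr(seq, k=2):
    """FCGR: count k-mers into a 2^k x 2^k grid; k-mers containing N are skipped."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue
        row = col = 0
        # The last base of the k-mer contributes the most significant bit.
        for base in reversed(kmer):
            x, y = CORNERS[base]
            row = row * 2 + y
            col = col * 2 + x
        grid[row][col] += 1
    return grid
```

Label encoding keeps one feature per SNP position, one-hot encoding multiplies the feature count by the alphabet size, and FCGR yields a fixed-size image-like matrix regardless of sequence length.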
Overview of the datasets
| Drug | CIP | CIP | CTX | CTX | CTZ | CTZ | GEN | GEN |
|---|---|---|---|---|---|---|---|---|
| Source | Giessen | Public | Giessen | Public | Giessen | Public | Giessen | Public |
| Resistant | 418 | 267 | 455 | 115 | 291 | 73 | 216 | 101 |
| Susceptible | 482 | 1229 | 475 | 1313 | 550 | 1398 | 710 | 1398 |
| Total | 900 | 1496 | 930 | 1428 | 841 | 1471 | 926 | 1489 |
Fig. 2. ROC curves for the models with label encoding, one-hot encoding and FCGR encoding on the Giessen data. First row: ROC curves for CIP with label encoding (A), one-hot encoding (B) and FCGR encoding (C), respectively. Second row: ROC curves for CTX with label encoding (D), one-hot encoding (E) and FCGR encoding (F), respectively. Third row: ROC curves for CTZ with label encoding (G), one-hot encoding (H) and FCGR encoding (I), respectively. Fourth row: ROC curves for GEN with label encoding (J), one-hot encoding (K) and FCGR encoding (L), respectively.
Results of the four machine learning models with label encoding on the Giessen data
| Classifier | Precision (CIP) | Precision (CTX) | Precision (CTZ) | Precision (GEN) | Recall (CIP) | Recall (CTX) | Recall (CTZ) | Recall (GEN) |
|---|---|---|---|---|---|---|---|---|
| CNN | 0.88 ± 0.04 | 0.75 ± 0.04 | 0.81 ± 0.02 | 0.76 ± 0.03 | 0.87 ± 0.01 | 0.65 ± 0.10 | 0.89 ± 0.03 | 0.91 ± 0.02 |
| LR | 0.88 ± 0.05 | 0.71 ± 0.04 | 0.81 ± 0.03 | 0.77 ± 0.02 | 0.90 ± 0.03 | 0.69 ± 0.08 | 0.92 ± 0.05 | 0.96 ± 0.03 |
| RF | 0.92 ± 0.04 | 0.75 ± 0.03 | 0.84 ± 0.03 | 0.79 ± 0.02 | 0.89 ± 0.03 | 0.73 ± 0.07 | 0.90 ± 0.06 | 0.97 ± 0.03 |
| SVM | 0.85 ± 0.03 | 0.69 ± 0.02 | 0.78 ± 0.03 | 0.75 ± 0.02 | 0.89 ± 0.04 | 0.73 ± 0.03 | 0.89 ± 0.03 | 0.96 ± 0.03 |
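Values of the form "0.88 ± 0.04" summarize a metric across cross-validation folds. The sketch below shows one way such a summary could be produced, assuming the spread is the sample standard deviation over folds; the record does not state the number of folds or the exact spread measure, and the fold scores used here are hypothetical.

```python
from statistics import mean, stdev


def summarize(fold_scores):
    """Format per-fold scores as 'mean ± sample standard deviation' (two decimals)."""
    return f"{mean(fold_scores):.2f} ± {stdev(fold_scores):.2f}"


# Hypothetical per-fold precision values, not taken from the study.
example_folds = [0.85, 0.90, 0.88, 0.92, 0.85]
print(summarize(example_folds))
```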
Results of the four machine learning models with one-hot encoding on the Giessen data
| Classifier | Precision (CIP) | Precision (CTX) | Precision (CTZ) | Precision (GEN) | Recall (CIP) | Recall (CTX) | Recall (CTZ) | Recall (GEN) |
|---|---|---|---|---|---|---|---|---|
| CNN | 0.87 ± 0.05 | 0.75 ± 0.00 | 0.84 ± 0.01 | 0.80 ± 0.00 | 0.90 ± 0.01 | 0.71 ± 0.03 | 0.84 ± 0.03 | 0.87 ± 0.05 |
| LR | 0.89 ± 0.05 | 0.71 ± 0.04 | 0.80 ± 0.03 | 0.78 ± 0.02 | 0.89 ± 0.03 | 0.73 ± 0.08 | 0.89 ± 0.05 | 0.95 ± 0.02 |
| RF | 0.92 ± 0.05 | 0.75 ± 0.01 | 0.82 ± 0.02 | 0.80 ± 0.03 | 0.90 ± 0.02 | 0.73 ± 0.07 | 0.90 ± 0.07 | 0.97 ± 0.03 |
| SVM | 0.86 ± 0.05 | 0.68 ± 0.03 | 0.77 ± 0.03 | 0.76 ± 0.03 | 0.89 ± 0.03 | 0.69 ± 0.06 | 0.89 ± 0.06 | 0.95 ± 0.04 |
Results of the four machine learning models with FCGR encoding on the Giessen data
| Classifier | Precision (CIP) | Precision (CTX) | Precision (CTZ) | Precision (GEN) | Recall (CIP) | Recall (CTX) | Recall (CTZ) | Recall (GEN) |
|---|---|---|---|---|---|---|---|---|
| CNN | 0.87 ± 0.04 | 0.74 ± 0.04 | 0.81 ± 0.03 | 0.75 ± 0.02 | 0.91 ± 0.03 | 0.84 ± 0.04 | 0.87 ± 0.06 | 0.96 ± 0.01 |
| LR | 0.79 ± 0.08 | 0.70 ± 0.04 | 0.73 ± 0.05 | 0.69 ± 0.04 | 0.85 ± 0.04 | 0.79 ± 0.05 | 0.85 ± 0.04 | 0.86 ± 0.02 |
| RF | 0.91 ± 0.03 | 0.74 ± 0.01 | 0.82 ± 0.02 | 0.80 ± 0.02 | 0.87 ± 0.03 | 0.72 ± 0.07 | 0.90 ± 0.07 | 0.98 ± 0.01 |
| SVM | 0.81 ± 0.03 | 0.72 ± 0.03 | 0.73 ± 0.01 | 0.69 ± 0.02 | 0.88 ± 0.03 | 0.81 ± 0.05 | 0.87 ± 0.03 | 0.92 ± 0.03 |
Fig. 3. ROC curves for the models with label, one-hot and FCGR encoding on the public data. First row: ROC curves for CIP with label encoding (A), one-hot encoding (B) and FCGR encoding (C), respectively. Second row: ROC curves for CTX with label encoding (D), one-hot encoding (E) and FCGR encoding (F), respectively. Third row: ROC curves for CTZ with label encoding (G), one-hot encoding (H) and FCGR encoding (I), respectively. Fourth row: ROC curves for GEN with label encoding (J), one-hot encoding (K) and FCGR encoding (L), respectively.
Evaluation of the machine learning models with label encoding on the public data
| Classifier | Precision (CIP) | Precision (CTX) | Precision (CTZ) | Precision (GEN) | Recall (CIP) | Recall (CTX) | Recall (CTZ) | Recall (GEN) |
|---|---|---|---|---|---|---|---|---|
| CNN | 0.94 | 0.71 | 0.79 | 0.84 | 0.88 | 0.88 | 0.81 | 0.70 |
| LR | 0.93 | 0.76 | 0.80 | 0.82 | 0.90 | 0.84 | 0.75 | 0.62 |
| RF | 0.95 | 0.75 | 0.81 | 0.83 | 0.90 | 0.85 | 0.77 | 0.61 |
| SVM | 0.94 | 0.71 | 0.75 | 0.77 | 0.87 | 0.84 | 0.74 | 0.60 |
Note: Precision and recall are calculated based on balanced data using down-sampling.
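The public dataset is strongly imbalanced (far more susceptible than resistant isolates), which is why the metrics are computed on down-sampled, balanced data. A minimal sketch of that idea follows, assuming random down-sampling of the majority class to the size of the minority class with resistant coded as 1; the function names, the label coding and the seed are hypothetical, and the study's exact procedure is not given in this record.

```python
import random


def downsample(labels, seed=0):
    """Return sorted indices of a class-balanced subset (majority class down-sampled)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (resistant) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Balancing matters mainly for precision: with many more susceptible isolates, even a small false-positive rate would produce many false positives relative to true positives.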
Evaluation of the machine learning models with FCGR encoding on the public data
| Classifier | Precision (CIP) | Precision (CTX) | Precision (CTZ) | Precision (GEN) | Recall (CIP) | Recall (CTX) | Recall (CTZ) | Recall (GEN) |
|---|---|---|---|---|---|---|---|---|
| CNN | 0.84 | 0.71 | 0.72 | 0.74 | 0.93 | 0.89 | 0.86 | 0.71 |
| LR | 0.85 | 0.77 | 0.79 | 0.80 | 0.89 | 0.87 | 0.86 | 0.74 |
| RF | 0.92 | 0.77 | 0.83 | 0.83 | 0.88 | 0.89 | 0.78 | 0.59 |
| SVM | 0.88 | 0.78 | 0.77 | 0.75 | 0.90 | 0.86 | 0.86 | 0.74 |
Note: Precision and recall are calculated based on balanced data using down-sampling.
Fig. 4. EFS analysis for each antibiotic for both datasets. The four left panels show the ten most important SNPs identified for CIP (A), CTX (C), CTZ (E) and GEN (G) in the Giessen dataset. The right panels show the corresponding SNPs from the public dataset.
SNPs and corresponding genes associated with AMR
| SNP position | Gene location | SNP annotation | Gene | Gene biotype | Drug |
|---|---|---|---|---|---|
| 18169 | 17489 → 18655 | Synonymous | | CDS | CTX, CTZ, GEN |
| 898919 | 898518 → 899645 | Synonymous | | CDS | CIP, CTX |
| 2008324 | 2008277 → 2009482 | Synonymous | | CDS | CTX, CTZ, GEN |
| 2017588 | 2016554 → 2017927 | Synonymous | | CDS | CIP, CTX, CTZ, GEN |
| 2148909 | 2147674 → 2149026 | Synonymous | | CDS | GEN |
| 2655873 | 2655075 → 2656358 | Synonymous | | CDS | CIP, CTX, GEN |
| 3099618 | 3098558 → 3099565 | Upstream gene | | CDS | CIP, CTX, CTZ, GEN |
| 3644715 | 3643140 → 3645182 | Synonymous | | CDS | CTX, CTZ, GEN |
| 4101302 | 4100810 → 4101430 | Missense | | CDS | CIP, CTZ |
| 4127700 | 4127286 → 4127894 | Synonymous | | CDS | CTX, CTZ, GEN |
| 4172893 | 4172057 → 4173085 | Missense | | CDS | CIP, CTX, GEN |
| 4230581 | 4230354 → 4231226 | Synonymous | | CDS | CTZ, GEN |
| 4441487 | 4439872 → 4441215 | Upstream gene | | CDS | CIP, CTX, CTZ, GEN |
| 4453756 | 4453583 → 4454578 | Synonymous | | CDS | CIP, CTX, CTZ, GEN |
| 4466572 | 4466299 → 4467246 | Synonymous | | CDS | CIP, CTX, CTZ, GEN |
| 4477553 | 4477307 → 4478311 | Missense | | CDS | CIP, CTX, CTZ |
| 4483166 | 4480982 → 4483837 | Synonymous | | CDS | GEN |
| 4605418 | 4604875 → 4605663 | Synonymous | | CDS | CIP, CTX, CTZ, GEN |
| 4627668 | 4627315 → 4628547 | Synonymous | | CDS | CIP |
Note: The first column shows the positions of the identified SNPs for the four antibiotics. The second and third columns show the gene location and the SNP annotation. The fourth and fifth columns show the gene annotated from each SNP and the gene biotype. The final column lists the antibiotics associated with each SNP.
Evaluation of the machine learning models with one-hot encoding on the public data
| Classifier | Precision (CIP) | Precision (CTX) | Precision (CTZ) | Precision (GEN) | Recall (CIP) | Recall (CTX) | Recall (CTZ) | Recall (GEN) |
|---|---|---|---|---|---|---|---|---|
| CNN | 0.95 | 0.83 | 0.84 | 0.80 | 0.90 | 0.83 | 0.78 | 0.62 |
| LR | 0.90 | 0.80 | 0.76 | 0.81 | 0.90 | 0.85 | 0.78 | 0.63 |
| RF | 0.90 | 0.78 | 0.73 | 0.81 | 0.90 | 0.86 | 0.78 | 0.63 |
| SVM | 0.89 | 0.78 | 0.75 | 0.73 | 0.88 | 0.83 | 0.77 | 0.55 |
Note: Precision and recall are calculated based on balanced data using down-sampling.