Abu Sayed Chowdhury, Douglas R Call, Shira L Broschat.
Abstract
The increasing prevalence of antimicrobial-resistant bacteria drives the need for advanced methods to identify antimicrobial-resistance (AMR) genes in bacterial pathogens. With the availability of whole genome sequences, best-hit methods can be used to identify AMR genes by comparing unknown sequences with known AMR sequences in existing online repositories. Nevertheless, these methods may not perform well when identifying resistance genes whose sequences have low identity with known sequences. We present a machine learning approach that uses protein sequences, with sequence identity ranging between 10% and 90%, as an alternative to conventional DNA sequence alignment-based approaches to identify putative AMR genes in Gram-negative bacteria. By using game theory to choose which protein characteristics to use in our machine learning model, we can predict AMR protein sequences for Gram-negative bacteria with an accuracy ranging from 93% to 99%. To obtain similar classification results with BLASTp, identity thresholds as low as 53% were required.
Year: 2019 PMID: 31597945 PMCID: PMC6785542 DOI: 10.1038/s41598-019-50686-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Classification performance for different δ values (corresponding classification accuracy in parentheses).
| AMR | Oversampling | | | Undersampling | | |
|---|---|---|---|---|---|---|
| acetyltransferase ( | 6 (0.97) | 6 (0.97) | 6 (0.97) | 5 (0.97) | 5 (0.97) | 5 (0.97) |
| | 15 (1) | 19 (1) | 18 (1) | 9 (0.97) | 9 (0.97) | 11 (0.97) |
| dihydrofolate reductase ( | 5 (1) | 5 (1) | 5 (1) | 18 (0.96) | 28 (1) | 25 (1) |
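The table compares oversampling and undersampling of an imbalanced training set. As a rough illustration of the two ideas, the sketch below balances classes by random duplication (oversampling) or random removal (undersampling). This excerpt does not say which resampling procedure the authors used, so the function names and the random strategy here are assumptions chosen for simplicity.

```python
import random
from collections import Counter

def _group_by_label(samples, labels):
    """Collect samples into a dict keyed by class label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items so every class matches the largest class.

    Hypothetical helper: plain random duplication, not necessarily the
    oversampling method used in the paper.
    """
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(items) for items in groups.values())
    balanced = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    return balanced

def random_undersample(samples, labels, seed=0):
    """Randomly drop items so every class matches the smallest class.

    Hypothetical helper: plain random removal, not necessarily the
    undersampling method used in the paper.
    """
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(items) for items in groups.values())
    balanced = []
    for label, items in groups.items():
        balanced.extend((sample, label) for sample in rng.sample(items, target))
    return balanced

if __name__ == "__main__":
    # Toy identifiers only; 1 = AMR, 0 = non-AMR.
    samples = ["p1", "p2", "p3", "p4", "n1"]
    labels = [1, 1, 1, 1, 0]
    print(Counter(label for _, label in random_oversample(samples, labels)))
    print(Counter(label for _, label in random_undersample(samples, labels)))
```

Oversampling keeps every original example at the cost of repeated items, while undersampling discards data from the larger class; either way, resampling would normally be applied to the training split only.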
Figure 1. Comparison between GTDWFE and RReliefF accuracies for oversampling and undersampling. Accuracies are given as a function of the number of features used.
Figure 2. Confusion matrices for oversampling and undersampling.
Figure 3. Identification of AMR sequences in Pseudomonas, Vibrio, or Enterobacter using BLASTp as a function of percent identity using AMR sequences from Acinetobacter, Klebsiella, Campylobacter, Salmonella, and Escherichia.
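A best-hit analysis of this kind can be sketched as follows, assuming BLASTp was run with the default tabular output (`-outfmt 6`), whose columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The function names, sequence identifiers, and hit values below are hypothetical and serve only to show how the set of identified queries changes with the percent-identity threshold; they are not results from the paper.

```python
def best_hit_identity(tabular_lines):
    """Return {query_id: percent identity of its highest-bitscore hit}.

    Expects lines in default BLAST tabular format (-outfmt 6).
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        pident = float(fields[2])     # column 3: percent identity
        bitscore = float(fields[11])  # column 12: bit score
        if query_id not in best or bitscore > best[query_id][1]:
            best[query_id] = (pident, bitscore)
    return {query_id: pident for query_id, (pident, _) in best.items()}

def queries_at_or_above(best, threshold):
    """List queries whose best hit has percent identity >= threshold."""
    return sorted(q for q, pident in best.items() if pident >= threshold)

if __name__ == "__main__":
    # Hypothetical hits: q1 has two hits, q2 has one low-identity hit.
    hits = [
        "q1\tamrA\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-100\t350",
        "q1\tamrB\t60.0\t180\t72\t2\t1\t180\t5\t184\t1e-40\t150",
        "q2\tamrA\t48.0\t190\t99\t3\t1\t190\t2\t191\t1e-20\t80",
    ]
    best = best_hit_identity(hits)
    for threshold in (90, 70, 50, 40):
        print(threshold, queries_at_or_above(best, threshold))
```

Lowering the threshold admits more distant hits, which is why a low-identity AMR sequence such as the toy `q2` is only picked up once the cutoff drops below its best-hit identity.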
Confusion matrix for classification performance.
| Predicted (rows) / Actual (columns) | Negative | Positive |
|---|---|---|
| Negative | TN | FN |
| Positive | FP | TP |
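The four cells of the confusion matrix are enough to derive the usual summary metrics. The sketch below uses the standard definitions of accuracy, sensitivity, and specificity; the function name and the example counts are illustrative assumptions, and this excerpt does not state which metrics beyond accuracy the authors report.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts.

    accuracy    = (TP + TN) / (TP + TN + FP + FN)
    sensitivity = TP / (TP + FN)   (true positive rate)
    specificity = TN / (TN + FP)   (true negative rate)
    A metric whose denominator is zero is returned as NaN.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    for name, value in confusion_metrics(tp=45, tn=50, fp=2, fn=3).items():
        print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity alongside accuracy is useful when the positive and negative classes differ in size, since accuracy alone can hide poor performance on the smaller class.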