Peter Májek, Lukas Lüftinger, Stephan Beisken, Thomas Rattei, Arne Materna.
Abstract
The prediction of antimicrobial resistance (AMR) based on genomic information can improve patient outcomes. Genetic mechanisms have been shown to explain AMR with accuracies in line with standard microbiology laboratory testing. To translate genetic mechanisms into phenotypic AMR, machine learning has been successfully applied. AMR machine learning models typically use nucleotide k-mer counts to represent genomic sequences. While k-mer representation efficiently captures sequence variation, it also results in high-dimensional and sparse data. With limited training data available, achieving acceptable model performance or model interpretability is challenging. In this study, we explore the utility of feature engineering with several biologically relevant signals. We propose to predict the functional impact of observed mutations with PROVEAN to use the predicted impact as a new feature for each protein in an organism's proteome. The addition of the new features was tested on a total of 19,521 isolates across nine clinically relevant pathogens and 30 different antibiotics. The new features significantly improved the predictive performance of trained AMR models for Pseudomonas aeruginosa, Citrobacter freundii, and Escherichia coli. The balanced accuracy of the respective models of those three pathogens improved by 6.0% on average.
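To illustrate why k-mer counts give high-dimensional, sparse data, the following is a minimal sketch of overlapping nucleotide k-mer counting. It is not the authors' pipeline; the function name `kmer_counts` and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping nucleotide k-mers in a sequence (hypothetical helper)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequence, not real genomic data.
counts = kmer_counts("ACGTACGT", k=3)

# Only a few of the 4**3 = 64 possible 3-mers are observed.
observed = len(counts)
possible = 4 ** 3
```

Here only 4 of the 64 possible 3-mers occur, so most entries of the feature vector are zero. The number of possible k-mers grows as 4^k, which is why longer k-mers yield feature spaces far larger than the number of available isolates.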
Keywords: WGS; antibiotics; antimicrobial resistance; genome-wide mutation scoring; genomics; machine learning
Year: 2021 PMID: 34884852 PMCID: PMC8657983 DOI: 10.3390/ijms222313049
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Balanced accuracy, very major error (1 - sensitivity), and major error (1 - specificity) of trained models, averaged across the compounds of the given pathogens. Error bars show the standard error of the mean.
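The three metrics named in this caption can be computed from a confusion matrix, as in the minimal sketch below. It assumes that "resistant" is the positive class (label 1), so that a very major error is a resistant isolate predicted as susceptible. The function name `amr_error_rates` and the toy labels are hypothetical.

```python
def amr_error_rates(y_true, y_pred):
    """Return balanced accuracy, very major error, and major error.

    Assumption: 1 = resistant (positive class), 0 = susceptible.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    sensitivity = tp / (tp + fn)  # fraction of resistant isolates detected
    specificity = tn / (tn + fp)  # fraction of susceptible isolates detected
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "very_major_error": 1 - sensitivity,
        "major_error": 1 - specificity,
    }


# Toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
rates = amr_error_rates(y_true, y_pred)
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated when resistant and susceptible isolates are unevenly represented in a dataset.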
Figure 2. Balanced accuracy of 110 models evaluated on the 20% test set splits. The training set size is shown on the x-axis. For each dataset, the balanced accuracy of the best-performing of the three models considered, each trained on a different feature space (DNA k-mers, extended dataset, extended + PROVEAN features), is shown. Vertical lines indicate the increase in balanced accuracy on individual datasets relative to the original feature space. Section 4.2 lists all compound names and their abbreviations as shown here. A small jitter is added along the x-axis to avoid overlapping vertical lines.
Figure 3. Relative importance of different feature types for individual pathogens, as measured by SHAP contributions of features to model output. Values were summed across all tested compounds.
Figure 4. Mutational profiles across the three most important genes of the models for the following combinations: (A) P. aeruginosa, ciprofloxacin; (B) C. freundii, cefotaxime; and (C) E. cloacae, ceftazidime. Each plot shows the positional distribution of mutations of the most important gene for the model prediction. The bars show the number of assemblies having a given mutation at a given position, considering only the most damaging mutation per assembly. Damaging mutations (PROVEAN scores lower than a gene-specific threshold used by XGB models) are shown on the negative y-axis section; neutral mutations (scores larger than or equal to the threshold) are shown on the positive y-axis section. The y-axis is shown on a square-root scale. Red and blue colors correspond to assemblies resistant and susceptible to the given compound, respectively. The number of isolates identified as damaging or neutral and the percentage of resistant isolates are shown for each group. Isolates with wild-type sequences are placed at position -10 on the x-axis and labelled WT.
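The per-assembly selection described in this caption can be sketched as follows: keep the mutation with the lowest PROVEAN score for each assembly, then label it damaging if the score is below a threshold and neutral otherwise. This is only an illustration under stated assumptions. The function names, the toy records, and the threshold value of -2.5 are hypothetical; the caption states that the actual thresholds are gene-specific and come from the XGB models.

```python
def most_damaging_per_assembly(mutations):
    """Keep the lowest-scoring (most damaging) mutation per assembly.

    mutations: iterable of (assembly_id, position, provean_score) tuples.
    Returns {assembly_id: (position, provean_score)}.
    """
    best = {}
    for assembly, position, score in mutations:
        if assembly not in best or score < best[assembly][1]:
            best[assembly] = (position, score)
    return best


def label_by_threshold(best, threshold):
    """Label each assembly 'damaging' (score < threshold) or 'neutral'."""
    return {
        assembly: "damaging" if score < threshold else "neutral"
        for assembly, (_position, score) in best.items()
    }


# Toy records and a hypothetical threshold, not values from the study.
mutations = [
    ("asm1", 83, -6.2),
    ("asm1", 120, -0.4),
    ("asm2", 87, -1.1),
]
best = most_damaging_per_assembly(mutations)
labels = label_by_threshold(best, threshold=-2.5)
```

With these toy values, `asm1` is represented by its mutation at position 83 and labelled damaging, while `asm2` is labelled neutral because its only score is not below the threshold.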