
Understanding and predicting ciprofloxacin minimum inhibitory concentration in Escherichia coli with machine learning.

Bálint Ármin Pataki, Sébastien Matamoros, Boas C L van der Putten, Daniel Remondini, Enrico Giampieri, Derya Aytan-Aktug, Rene S Hendriksen, Ole Lund, István Csabai, Constance Schultsz.

Abstract

It is important that antibiotic prescriptions are based on antimicrobial susceptibility data to ensure effective treatment outcomes. With the increasing availability of next-generation sequencing, bacterial whole genome sequencing (WGS) can provide a more reliable and faster alternative to traditional phenotyping for the detection and surveillance of antimicrobial resistance (AMR). This work proposes a machine learning approach that can predict the minimum inhibitory concentration (MIC) for a given antibiotic, here ciprofloxacin, on the basis of both genome-wide mutation profiles and profiles of acquired antimicrobial resistance genes. We analysed 704 Escherichia coli genomes, originating from different countries, combined with their respective MIC measurements for ciprofloxacin. The four most important predictors found by the model, mutations in gyrA residues Ser83 and Asp87, a mutation in parC residue Ser80 and presence of the qnrS1 gene, have been experimentally validated before. Using only these four predictors in a linear regression model, 65% and 93% of the test samples' MICs were correctly predicted within a two- and a four-fold dilution range, respectively. The presented work does not treat machine learning as a black box; it also identifies the genomic features that determine susceptibility. The recent progress in WGS technology in combination with machine learning analysis approaches indicates that in the near future WGS of bacteria might become cheaper and faster than a MIC measurement.


Year:  2020        PMID: 32929164      PMCID: PMC7490380          DOI: 10.1038/s41598-020-71693-5

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

Antibiotics are an essential resource in the control of infectious diseases; they have been a major contributor to the decline of infection-associated mortality and morbidity in the twentieth century. However, the recent rise of antimicrobial resistance (AMR) threatens this situation[1]. Bacterial AMR is associated with a higher likelihood of therapeutic failure in case of infections. Accurate and fast prediction of AMR in bacteria is needed to select the optimal therapy. With the increasing availability of next-generation sequencing, bacterial whole genome sequencing (WGS) is becoming a feasible alternative to traditional phenotyping for the detection and surveillance of AMR[2-4]. However, data analysis remains the weak point in this approach; fast and scalable methods are required to transform the ever-growing amount of genomic data into actionable clinical or epidemiological information[5]. Machine learning is a promising approach for this kind of data analysis. AMR can be predicted in numerous ways. In addition to classic and highly standardized phenotypic testing of resistance, several methods of resistance prediction have been developed. Most novel methods use a genetic or genomic approach, although transcriptomic approaches have been investigated as well[6-8]. An important factor in the choice of the resistance prediction method is the microorganism under study. For example, the CRyPTIC consortium managed to predict resistance to four first-line drugs in Mycobacterium tuberculosis, using only known mutations extracted from WGS[9]. However, M. tuberculosis displays little-to-no horizontal gene transfer and low genomic evolution rate[10], which makes it feasible to predict resistance only on the basis of known mutations[11]. For other bacteria, more advanced analysis methods such as machine learning need to be applied to allow for accurate prediction. Machine learning has been applied to predict resistance from WGS data in several settings. 
To date, these methods have mostly been restricted to assigning bacteria to binary categories, i.e. susceptible or non-susceptible[8,12-18]. Clinical breakpoints used to define susceptible and non-susceptible categories can change over time based on various protocols. Such binary categories do not allow following subtle trends in susceptibility. Minimum inhibitory concentration (MIC) measurements offer adequate resolution to follow whether susceptibility is changing in a population, which is useful for epidemiological purposes. Therefore, a resistance prediction method would preferably output a continuous estimate of resistance similar to MIC, instead of a binary classification (S/R), as a number of studies have already proposed[19-22]. Additional issues should be considered when developing a reliable and useful prediction model. Genotypes are often geographically clustered[23]. This implies that a prediction model trained on data from one country might not generalize to data from another country. Data from multiple countries are thus needed. Because a combination of chromosomal mutations and acquired resistance genes may influence antimicrobial resistance together, it might not be enough to focus only on point mutations or only on acquired genes. Therefore, different data types need to be combined to obtain a biologically relevant set of input data. Lastly, while machine learning is able to analyse highly complex patterns of features, the model would preferably output generally understandable data. Among others, k-mer profiles have been used to predict resistance[20,21]. In this study, we focus on predicting a quantitative measure of ciprofloxacin resistance (MIC) for a geographically diverse population of E. coli using machine learning. We chose to study ciprofloxacin resistance in E. coli for three reasons: (1) ciprofloxacin resistance in E. coli has been studied intensively; (2) ciprofloxacin resistance in E. coli can be caused by a range of different chromosomal and plasmid-mediated mechanisms[24]; and (3) ciprofloxacin is commonly used in the treatment of E. coli infections across the globe.
In our selection of machine learning models, an important criterion was that high-scoring features could be extracted from the model. This would allow us to explore the reasoning behind each prediction and thus to interpret and understand the model. Also, if the trained model relies on only a few genomic features, it is enough to determine those few genomic attributes when generating predictions for new samples; WGS is not necessary.

Results

Our dataset consists of a phylogenetically diverse collection of E. coli strains, see Fig. 1. Strains in the test and training groups are present throughout the whole phylogeny, although the groups are present predominantly in different parts of the phylogeny.
Figure 1

Midpoint-rooted phylogenetic tree of the 704 E. coli samples that had ciprofloxacin MIC measurement. It is clearly visible that the test data is clustered separately from the training data suggesting the generalization power of our model. Nodes with lower than 80% bootstrap support are collapsed.

We trained a random forest model using genome-wide mutation profiles alongside the ResFinder-based profiles of acquired resistance genes. We ranked the predictors proposed by the model itself, see Supplementary Table S1. The model performed with high accuracy in leave-one-country-out cross-validation on the training set using four predictors, see Supplementary Fig. S1. The addition of more features did not significantly improve the cross-validation results, therefore we kept only the first four, allowing for a simple and understandable model. We also trained a linear regression model on this restricted dataset. Using these four predictors with the linear regression model, 264 out of the 266 test data samples were correctly classified at the susceptible/non-susceptible level, and 65% and 93% of the corresponding MIC values were correctly predicted within a two- and a four-fold dilution, respectively, see Table 1. All four genetic features are experimentally proven to play an important role in ciprofloxacin resistance[24].
Table 1

Number of features, R² score, Pearson correlation (R), Major Error (ME), Very Major Error (VME), area under the receiver operating curve (AUC), Accuracy within a two/four-fold dilution (ACC-2, ACC-4) and Mean Absolute Fold Error (MAFE) on the unseen test data. For the AUC, ME and VME the data was binarized using a 1 mg/L threshold.

Model              N_feat  R² (b)  R (b)  ME (a,c)  VME (a,c)  AUC    ACC-2  ACC-4  MAFE (c)
Random forest      4       0.932   0.966  1         0          1.000  0.658  0.944  0.883
Random forest      15      0.902   0.951  5         0          0.996  0.680  0.914  0.915
Linear regression  4       0.918   0.959  0         2          1.000  0.650  0.929  0.954

The number of features was selected according to the performance using leave-one-country-out validation on the training data, see Supplementary Fig. S1.

(a) Number of samples.

(b) Calculated on the log2 values.

(c) The lower the better.

These four predictors are the following: a gyrA mutation at amino acid #87, a gyrA mutation at amino acid #83, a parC mutation at amino acid #80, and the qnrS1 gene. All of the predictors above are binary (presence/absence), therefore there are 2^4 = 16 different possible predictions for any sample based on these features, see Supplementary Table S2. A linear regression model fitted on the log2 values of the MIC measurements could achieve similar performance to a more complex random forest model, see Fig. 2. Linear regression is preferred due to its simplicity. With a random forest regressor that has hundreds of decision trees and thousands of genomic features as predictors, it is difficult to understand why the model made a particular prediction, leaving doubts about its clinical usefulness.
Figure 2

Prediction on the unseen test set was generated via random forest and linear regression model using the best four predictors. It can be clearly seen that the two models do not differ much in terms of predicted values.

There are several previous works on predicting ciprofloxacin resistance for E. coli at the susceptible/non-susceptible level. This makes a limited comparison possible between the presented method and the methods published in the literature. The limitation comes from the fact that the susceptible/non-susceptible outcomes of the measurements are based on breakpoints, which are not always reported. Also, some papers use a disk diffusion test. Keeping these limitations in mind, it is still worthwhile to compare our method to others. Pesesky et al.[13] reported an AUC of 0.9652–0.9786, while Hyun et al.[18] reported an AUC of 0.98. Moradigaravand et al.[12] reported 0.97 precision and 0.81 recall, while our linear model achieved an AUC of 1.0, see Table 1, and 1.0 precision with 0.987 recall, see Supplementary Fig. S2.
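Because each of the four predictors is binary, a linear model on log2(MIC) can only produce 2^4 = 16 distinct predictions. The sketch below enumerates them. The intercept and coefficients are hypothetical placeholders chosen purely for illustration; they are not the fitted values from this study.

```python
from itertools import product

# The four binary predictors named in the text.
FEATURES = ("gyrA#83", "gyrA#87", "parC#80", "qnrS1")

# HYPOTHETICAL intercept and coefficients on the log2(MIC) scale.
# These are placeholders for illustration, not the fitted model.
INTERCEPT = -6.0
COEF = {"gyrA#83": 4.0, "gyrA#87": 3.0, "parC#80": 2.0, "qnrS1": 4.0}


def predict_mic(present):
    """Return the predicted MIC in mg/L for a dict mapping feature -> 0/1."""
    log2_mic = INTERCEPT + sum(COEF[f] * present[f] for f in FEATURES)
    return 2.0 ** log2_mic  # convert back from the log2 scale to MIC units


# Enumerate every presence/absence combination: 2**4 = 16 rows.
combinations = [dict(zip(FEATURES, bits)) for bits in product((0, 1), repeat=4)]
predictions = [predict_mic(c) for c in combinations]

for combo, mic in zip(combinations, predictions):
    label = "+".join(f for f in FEATURES if combo[f]) or "no predictor present"
    print(f"{label:35s} {mic:.3f} mg/L")
```

With a model this small, every possible output can be listed in a single lookup table, which is what makes the four-feature model easy to inspect.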

Discussion

Here, we present an accurate method for predicting ciprofloxacin resistance for E. coli. With no built-in prior knowledge on which chromosomal mutations and which acquired resistance genes might be important, using a data-driven approach, we managed to create a machine learning model that was not only accurately predicting the susceptible/non-susceptible labels but also accurately predicting at MIC level. Additionally, the highlighted features of our approach could be narrowed down to four biologically understandable features, making the method simpler and therefore applicable to clinical microbiology practice. It was previously shown that accurate ciprofloxacin susceptible/non-susceptible binary prediction is possible for E. coli[3,12,17]. For some other bacteria-antibiotic combinations even MIC level predictions were performed[19-22]. This study goes beyond by not only predicting MIC level ciprofloxacin resistance for E. coli, but also highlighting the underlying reasoning behind the predictions. Furthermore, this study is one of very few that includes the presence or absence of genes located on mobile genetic elements (MGEs), in combination with chromosomal point mutations, in the machine learning algorithm. This is a crucial step since particularly in Gram-negative microorganisms such as E. coli, AMR is often encoded by genomic determinants located on MGE, or a combination of chromosomal and MGE encoded determinants, as is demonstrated in our study for ciprofloxacin. In addition, this study used data from different countries and regions thus ensuring potential variation in determinants that may contribute to ciprofloxacin resistance are represented in the data set. Notably, a linear regression model based on only the four most important features of the random forest model performed nearly as well as the full model. These features comprise two gyrA mutations, one parC mutation and the presence of the qnrS1 gene. 
All features have been associated with ciprofloxacin resistance before[24]. In addition, the presence of a single determinant versus combinations of several of the four determinants predicted MIC ranges that were comparable to those observed experimentally and in clinical isolates[24]. For example, the presence of the qnrS1 gene alone predicted a relatively low MIC, but the combination of the qnrS1 gene with a single mutation in gyrA increased the predicted MIC substantially (Supplementary Table S2). Our results indicate that for prediction of ciprofloxacin susceptibility on the basis of whole-genome sequencing in E. coli, the analysis could be limited to only these four determinants. It is worth noting that the model was trained on all possible mutations. In addition, all acquired resistance genes that appeared in the ResFinder gene database were considered. Neither SNPs nor resistance genes were pre-selected for ciprofloxacin during data preparation. Therefore the model could potentially discover novel, currently unknown mutation-based resistance-encoding mechanisms, which may be located in genes that are or are not yet known to contribute to resistance. For E. coli the ciprofloxacin resistance determinants that were predicted in our machine learning approach have been experimentally verified, but for other antibiotics, our approach could detect novel genomic variants associated with resistance. Our study also has some limitations, which mostly pertain to the dataset. For strains with measured MICs in the range of 8–64 mg/L, our model performs worse than for strains with lower MICs. This is most likely due to the fact that the majority of resistant strains in our training data have an MIC of 32 mg/L, with only very few other resistant MICs. This hampers accurate prediction of MIC for more resistant E. coli. Additionally, our dataset is not yet diverse and complete enough to be applied on a wide scale.
This is a common problem for many studies aiming to predict AMR from WGS data. Solving this would require continuous updating of databases and an adequate database structure; the latter we have addressed previously[25]. In conclusion, we report a machine learning approach for a quantitative prediction of antibiotic resistance, which we applied to the prediction of ciprofloxacin resistance in E. coli. In combination with continuous database improvements, our approach could allow machine learning methods to enter routine clinical diagnostic and epidemiological practices to continuously improve predictions.

Methods

Data summary

In this study, 704 E. coli genomes combined with MIC measurement for ciprofloxacin were analysed[25]. Paired-end sequencing was performed on all isolates and the results were stored in FASTQ format. The isolates originated from five countries, Denmark, Italy, USA, UK, and Vietnam. The MIC distribution for these isolates is depicted in Table 2. Out of 704, 266 E. coli genomes had no country metadata available and were used as an independent test set. All data were deposited in the AMR Data Hub[25] which consists of raw sequencing data, ciprofloxacin minimum inhibitory concentrations, and additional metadata such as the origin of the samples. All data is publicly available from the SRA EBI database with the following accession codes: PRJEB21131, PRJNA266657, PRJNA292901, PRJNA292904, PRJNA292902, PRJDB7087, PRJEB21880, PRJEB21997, PRJEB14086 and PRJEB16326.
Table 2

The collected and used data in the analysis grouped by country and MIC values.

MIC (mg/L)  Denmark  Italy  NA (a)  USA  UK   Vietnam  Total
0.010       0        0      9       0    0    2        11
0.012       0        0      0       0    0    1        1
0.015       119      13     42      49   92   0        315
0.016       0        0      0       0    0    2        2
0.023       0        0      0       0    0    1        1
0.030       12       0      6       3    4    0        25
0.060       1        0      7       1    0    0        9
0.120       0        0      11      2    0    0        13
0.125       0        0      0       0    0    6        6
0.190       0        0      0       0    0    10       10
0.250       6        0      22      11   3    16       58
0.380       0        0      0       0    0    5        5
0.500       0        0      6       2    0    11       19
0.750       0        0      0       0    0    1        1
1.000       0        0      5       2    0    5        12
2.000       0        0      3       0    0    1        4
4.000       0        0      2       6    0    1        9
8.000       0        0      30      0    1    2        33
12.000      0        0      0       0    0    1        1
16.000      0        0      23      0    0    0        23
24.000      0        0      0       0    0    1        1
32.000      0        0      72      0    0    45       117
64.000      0        0      28      0    0    0        28
Total       138      13     266     76   100  111      704

(a) Country metadata is not available.

Download and analysis scripts are available at https://github.com/patbaa/AMR_ciprofloxacin. iTOL phylogenetic tree is available at https://itol.embl.de/tree/14511722611491391569485969. All used files are listed at https://github.com/patbaa/AMR_ciprofloxacin/blob/master/meta.tsv with URLs provided. Isolates with accession codes and MIC measurements are also available at https://github.com/patbaa/AMR_ciprofloxacin/blob/master/supplementary_meta_table.csv.

Data preprocessing

Raw reads were mapped to the ATCC 25922 reference genome (https://www.ncbi.nlm.nih.gov/assembly/GCF_000743255.1) using BWA-MEM v0.7.17[26] with default settings. Pileup files were generated with bcftools v1.9[27] with the “--min-MQ 50” setting. Single-nucleotide polymorphisms (SNPs) and insertions-deletions (INDELs) were called using bcftools v1.9 with the “--ploidy 1 -m” flags. Further filtering was applied via bcftools v1.9 “ & ” flags. Bcftools output data was expressed as either a SNP (value: 1), an INDEL (value: 5) or no mutation (value: 0) per position in the reference genome. The exact numbers are irrelevant, as tree-based methods are not sensitive to scale. The intention was to differentiate between reference alleles, SNPs and INDELs at a given position. Acquired resistance genes were identified using ResFinder v3.2[28] with a coverage threshold of 90% and an identity threshold of 90%, using a database downloaded on 13th Apr. 2020. ResFinder was used with KMA v1.1.4[29]. The ResFinder output data was expressed as presence (value: 1) or absence (value: 0) of resistance genes. The SNP/INDEL data and ResFinder data were subsequently merged, which provided a matrix with more than 830,000 columns representing reference genome positions with at least one mutation and 175 columns representing detected resistance genes.
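The encoding described above can be sketched as follows. The input dictionaries (`variant_calls`, `resistance_genes`), the column naming and the sample names are hypothetical stand-ins for parsed bcftools and ResFinder output; only the value scheme (0, 1, 5 for positions and 0, 1 for genes) comes from the text.

```python
# Value scheme from the text: reference allele = 0, SNP = 1, INDEL = 5.
VARIANT_CODE = {"SNP": 1, "INDEL": 5}


def build_feature_matrix(variant_calls, resistance_genes):
    """Merge per-sample variant calls and gene hits into one table.

    variant_calls:    {sample: {genome_position: "SNP" or "INDEL"}}
    resistance_genes: {sample: set of gene names}
    Both inputs are hypothetical stand-ins for parsed tool output.
    """
    samples = sorted(set(variant_calls) | set(resistance_genes))
    # Keep only positions that are mutated in at least one sample.
    positions = sorted({p for calls in variant_calls.values() for p in calls})
    genes = sorted({g for hits in resistance_genes.values() for g in hits})

    columns = [f"pos_{p}" for p in positions] + [f"gene_{g}" for g in genes]
    rows = {}
    for s in samples:
        calls = variant_calls.get(s, {})
        hits = resistance_genes.get(s, set())
        row = [VARIANT_CODE[calls[p]] if p in calls else 0 for p in positions]
        row += [1 if g in hits else 0 for g in genes]
        rows[s] = row
    return columns, rows


# Toy example with made-up positions and samples.
calls = {"s1": {1042: "SNP"}, "s2": {1042: "INDEL", 77: "SNP"}, "s3": {}}
genes = {"s2": {"qnrS1"}}
columns, rows = build_feature_matrix(calls, genes)
```

A dense list-of-lists is used here only for readability; with hundreds of thousands of columns a sparse representation would likely be preferable.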

Phylogenetic tree generation

The merged variant call files were converted to a FASTA alignment using vcf2phylip v2.0, retaining positions that were called in at least 50% of isolates[30]. The invariant positions were removed from the alignment using snp-sites v2.4.0[31]. The phylogeny was inferred using RAxML v8.2.9 in rapid bootstrap mode (-f a) with 100 bootstraps using a General Time Reversible model with Gamma rate heterogeneity including Lewis ascertainment bias correction (-m ASC_GTRGAMMA)[32]. The resulting phylogeny was visualized in iTOL[33].

Metrics

We used the following metrics for the evaluation of the model:

Area under the receiver operating curve (AUC): we used the clinical breakpoint for ciprofloxacin, 1 mg/L, based on the Clinical & Laboratory Standards Institute guideline[34] to encode whether samples are resistant or not.

R² score: $$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$ where $$y_i$$ is the measured value for sample i, $$\hat{y}_i$$ is the predicted value for sample i, and $$\bar{y}$$ is the mean of the measured values.

Pearson correlation (R): $$R = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_y \sigma_{\hat{y}}}$$ where cov is the covariance and $$\sigma$$ is the standard deviation.

Major Error (ME): when the sample is non-resistant by measurement, but it is predicted to be resistant. Non-resistant and resistant labels are derived from MIC via thresholding.

Very Major Error (VME): when the sample is resistant by measurement, but it is predicted to be non-resistant. Non-resistant and resistant labels are derived from MIC via thresholding.

Accuracy within a two-fold dilution (ACC-2): the fraction of the samples with MIC properly predicted within a two-fold dilution. If the measured MIC is x, then the prediction is counted as properly predicted within a two-fold dilution if it falls in the [x/2; 2x] interval. The dilution range is the natural scale for comparing MIC predictions and measurements, because the measurements are on a logarithmic scale. As the MIC gives additional clinical information beyond the binary resistant/non-resistant outcome, we report both ACC-2 and ACC-4.

Accuracy within a four-fold dilution (ACC-4): the fraction of the samples with MIC properly predicted within a four-fold dilution. If the measured MIC is x, then the prediction is counted as properly predicted within a four-fold dilution if it falls in the [x/4; 4x] interval.

Mean Absolute Fold Error (MAFE): the mean absolute difference between the log2 values of the predictions and the measurements.
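The dilution-based metrics and the error counts can be written down directly from these definitions. The following is a minimal sketch, not the authors' evaluation code. The function names and toy values are made up, and the direction of the threshold inequality (resistant when MIC > 1 mg/L) is an assumption, since the text states the threshold but not whether it is inclusive.

```python
import math


def acc_within_fold(measured, predicted, fold):
    """Fraction of predictions inside [x / fold, fold * x] of the measured MIC x."""
    hits = sum(1 for x, p in zip(measured, predicted) if x / fold <= p <= x * fold)
    return hits / len(measured)


def mafe(measured, predicted):
    """Mean absolute difference between log2(predicted) and log2(measured)."""
    diffs = [abs(math.log2(p) - math.log2(x)) for x, p in zip(measured, predicted)]
    return sum(diffs) / len(diffs)


def error_counts(measured, predicted, threshold=1.0):
    """Return (ME, VME) counts after binarizing at `threshold` mg/L.

    ASSUMPTION: a sample counts as resistant when MIC > threshold; the text
    gives the threshold but not the direction of the inequality.
    """
    me = vme = 0
    for x, p in zip(measured, predicted):
        measured_r, predicted_r = x > threshold, p > threshold
        if not measured_r and predicted_r:
            me += 1   # major error: non-resistant sample predicted resistant
        elif measured_r and not predicted_r:
            vme += 1  # very major error: resistant sample predicted non-resistant
    return me, vme


# Toy values (made up) to exercise the functions; MICs in mg/L.
measured = [0.25, 0.5, 8.0, 32.0, 4.0]
predicted = [0.5, 0.5, 2.0, 16.0, 0.5]

acc2 = acc_within_fold(measured, predicted, 2)
acc4 = acc_within_fold(measured, predicted, 4)
fold_error = mafe(measured, predicted)
me, vme = error_counts(measured, predicted)
```

In the toy data the third sample (8 mg/L predicted as 2 mg/L) misses the two-fold window but falls inside the four-fold one, which is why ACC-4 is higher than ACC-2.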

Importance of the validation scheme

Proper validation is a key element in machine learning, as most models have a large number of parameters, which makes it easy for them to memorize the training dataset. In image recognition, popular convolutional neural networks can have more than 100 M parameters[35]. This number of parameters is orders of magnitude larger than the number of pixels in a single image, or even the number of images in common benchmark data sets such as ImageNet[36]. With that many parameters it is possible to memorize the training data without generalizing any knowledge to the test data or to future use. However, a proper validation scheme makes it possible to fairly estimate the generalization power of a model. In many cases, simply splitting the samples randomly into a training and a validation set is enough. If the data set is small, cross-validation can help, usually K-fold cross-validation, where the data set is split into K sets of equal size. Then, the model is trained using data from K − 1 sets and predictions are made for the one set that was not used in the training process. Repeating the process K times, predictions can be generated for the whole data set in such a way that the model never saw, at training time, any of the samples for which it is generating predictions. The weights of the model are reset between any two training runs. K-fold cross-validation can produce overly optimistic results if the samples are clustered. For example, when the data collection is biased, bacterial isolates from one country may be predominantly resistant whilst isolates from other countries are predominantly susceptible to an antibiotic. In addition, genetic signatures are often clustered by country[23].
Due to such clustering, the model may predict the country of origin of the bacterial isolate, which may be correlated with the MIC, on both the training and the validation data sets, but it is not guaranteed that the same will happen in real-life usage later.
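The K-fold procedure described above can be illustrated with a small index splitter. This is a sketch under simplifying assumptions: the function name is made up, samples are not shuffled before splitting (in practice they usually are), and folds differ in size by at most one sample when K does not divide the sample count.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for K roughly equal folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))

# Every sample is predicted exactly once, by a model that never saw it.
all_test = sorted(i for _, test in folds for i in test)
```

Because the folds here are formed without regard to origin, isolates from the same country can land in both the training and the test part of a split, which is the weakness the text describes.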

Leave-one-country-out validation

Here we propose a stricter and more reliable validation method. Instead of randomly splitting the data into K different folds, we split the folds by country. Using this approach, the model is not rewarded if it only learns country-specific attributes. Leave-one-country-out validation was performed during the selection of the most important features in the data set, see Supplementary Table S1. The random forest model was fitted once per country, leaving out that country from the training data set each time. Then the feature importances were summed over the folds, resulting in the final feature importance rankings. It is worth looking at Supplementary Table S1, which contains the feature importances calculated as described above. For gyrA#87 we have fairly large values for all splits except when the Vietnam data is left out from the training, suggesting that gyrA#87 mutant strains mainly come from Vietnam. The high feature importance of gyrA#83 for the split in which the Vietnam data is the test set means that for the non-Vietnamese data the gyrA#83 mutation is the most descriptive.
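The scheme can be sketched in a few lines: one fold per country, with per-fold feature importances summed at the end. In this sketch `mean_gap_importance` is a deliberately simple, hypothetical stand-in for the random forest importances used in the study, and the toy data are made up.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) per distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def mean_gap_importance(X_train, y_train):
    """Toy stand-in: |mean(y | feature=1) - mean(y | feature=0)| per feature."""
    out = []
    for j in range(len(X_train[0])):
        on = [y for row, y in zip(X_train, y_train) if row[j] == 1]
        off = [y for row, y in zip(X_train, y_train) if row[j] == 0]
        gap = abs(sum(on) / len(on) - sum(off) / len(off)) if on and off else 0.0
        out.append(gap)
    return out


def summed_importances(X, y, groups, fit_importances):
    """Sum per-fold feature importances over all leave-one-group-out folds."""
    total = [0.0] * len(X[0])
    for _, train, _ in leave_one_group_out(groups):
        fold_imp = fit_importances([X[i] for i in train], [y[i] for i in train])
        total = [t + f for t, f in zip(total, fold_imp)]
    return total


# Made-up toy data: two binary features, log2 MIC targets, three countries.
groups = ["Denmark", "Denmark", "UK", "UK", "Vietnam", "Vietnam"]
X = [[0, 0], [1, 0], [0, 1], [1, 0], [1, 1], [0, 0]]
y = [-6.0, 3.0, -6.0, 3.0, 5.0, -6.0]

folds = list(leave_one_group_out(groups))
total = summed_importances(X, y, groups, mean_gap_importance)
```

Inspecting the per-fold values before summing, as the text does for gyrA#87 and gyrA#83, shows whether a feature's importance depends on a single country being present in the training data.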

Random forest model

For tabular data, tree-based models most often perform best. The random forest model is an ensemble of numerous (usually hundreds of) decision trees. In the training process, each tree is trained separately and each of them uses only a random fraction of the data, which ensures that the decision trees will not be identical. For a new sample, the prediction is the average of the predictions of the trees or, for classification, the category that was predicted most often by the individual trees. This ensemble technique yields an accurate, robust and scalable model. The prediction error is usually large for each individual tree, but as long as the errors of the trees are uncorrelated, averaging their predictions lowers the final error. The random forest regressor was trained with the mean squared error criterion, min_samples_leaf = 1, min_samples_split = 2, and n_estimators = 200 for the feature selection. For the final evaluation, the mean squared error criterion, min_samples_leaf = 1, min_samples_split = 5, and n_estimators = 100 were used. The random seed was fixed. Other parameters remained at their defaults. Scikit-learn v0.21.2[37] was used for fitting the model in Python 3.6.5.
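The claim that averaging uncorrelated errors lowers the final error can be checked with a small simulation. This is an idealized sketch, not a random forest: each "tree" is simulated as the true value plus independent Gaussian noise, whereas the errors of real trees are only partly decorrelated. The target value, noise level and trial count are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 5.0   # hypothetical target, e.g. a log2 MIC
N_TREES = 200      # ensemble size
N_TRIALS = 500     # number of simulated samples


def noisy_prediction():
    """One simulated 'tree': the true value plus independent Gaussian noise."""
    return TRUE_VALUE + random.gauss(0.0, 1.0)


single_errors, ensemble_errors = [], []
for _ in range(N_TRIALS):
    trees = [noisy_prediction() for _ in range(N_TREES)]
    # Error of one tree versus error of the averaged ensemble.
    single_errors.append(abs(trees[0] - TRUE_VALUE))
    ensemble_errors.append(abs(statistics.mean(trees) - TRUE_VALUE))

mean_single = statistics.mean(single_errors)
mean_ensemble = statistics.mean(ensemble_errors)
print(f"single tree: {mean_single:.3f}   ensemble: {mean_ensemble:.3f}")
```

For fully independent errors the standard deviation of the average shrinks with the square root of the number of trees; correlated trees would gain less.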

Random forest feature importance

For decision trees, the input variables (the features) can be ranked by their importance. The importance can be defined in various ways; the scikit-learn v0.21.2[37] implementation used here calculates the mean decrease in impurity averaged over all the trees in the forest[38,39]. With this approach, the identification of the most important predictors becomes feasible even when there are hundreds of thousands of features.
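For a regression tree trained with the mean squared error criterion, the impurity of a node is the variance of the target values in it, and a split is credited with the variance it removes. The sketch below computes this for a single root split on a binary feature. It is a simplification: the library additionally weights each node by the share of samples reaching it and averages over all nodes and trees. The function names and toy values are made up.

```python
def variance(values):
    """Population variance: the MSE impurity of a regression-tree node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(feature, target):
    """Variance reduction obtained by splitting a node on a binary feature."""
    left = [t for f, t in zip(feature, target) if f == 0]
    right = [t for f, t in zip(feature, target) if f == 1]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(target)
    weighted_children = (len(left) * variance(left) + len(right) * variance(right)) / n
    return variance(target) - weighted_children


# Made-up log2 MIC values: three susceptible and three resistant isolates.
target = [-6.0, -6.0, -6.0, 5.0, 5.0, 5.0]
informative = [0, 0, 0, 1, 1, 1]     # separates the two groups perfectly
uninformative = [0, 1, 0, 1, 0, 1]   # nearly unrelated to the target

gain_informative = impurity_decrease(informative, target)
gain_uninformative = impurity_decrease(uninformative, target)
```

The informative feature removes all of the variance in this toy example, while the uninformative one removes only a small part of it, so the former would rank far higher.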

Model fitting

All models were fitted on the log2 values of the MIC, which is the natural scale for the MIC measurement. Later the predicted values were converted back to the MIC units.
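A short illustration of why log2 is a natural scale here: MIC is measured along a two-fold dilution series, so consecutive concentrations are one unit apart after the transformation. The helper names below are made up for this sketch.

```python
import math


def to_log2(mic_mg_per_l):
    """Transform a MIC in mg/L to the log2 scale used for model fitting."""
    return math.log2(mic_mg_per_l)


def from_log2(log2_mic):
    """Convert a prediction on the log2 scale back to MIC units (mg/L)."""
    return 2.0 ** log2_mic


# A two-fold dilution series becomes evenly spaced on the log2 scale.
dilution_series = [0.25, 0.5, 1.0, 2.0, 4.0]
log2_series = [to_log2(m) for m in dilution_series]
steps = [b - a for a, b in zip(log2_series, log2_series[1:])]
```

On this scale an error of one unit corresponds to one two-fold dilution step, regardless of whether the isolate has a low or a high MIC.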

Study pipeline

The pipeline of this study is shown in Fig. 3. First, the raw reads were converted to a numerical table indicating mutations and plasmid-related resistance genes. In the second step, a random forest model was fitted on the training data via leave-one-country-out cross-validation. Feature importances were averaged over the folds. Then the highest-ranking features were kept, which significantly reduced the dimensionality of the data. Using this low-dimensional training data, a random forest model and a linear regression model were fitted. For fitting the models, the log2 MIC values were always used as the natural scale for the MIC measurements. In the last step, the performance of the models was evaluated on the unseen test data using the same restricted feature set.
Figure 3

Workflow of the study. First, a random forest model was fitted to the training data with leave-one-country-out validation. Feature importances of the fitted models are averaged over all the folds and the four best features are kept. Then the random forest model and a linear regression model were fitted on all the training samples using only the four best features. Finally, model performance was tested using the independent test dataset.
View more
