Chunxuan Wang, Zhuo Wang, Hsin-Yao Wang, Chia-Ru Chung, Jorng-Tzong Horng, Jang-Jih Lu, Tzong-Yi Lee.
Abstract
Klebsiella pneumoniae is one of the most common causes of hospital- and community-acquired pneumonia. Resistance to extensively used quinolone antibiotics, such as ciprofloxacin, has increased in Klebsiella pneumoniae, which raises the risk involved in initial antibiotic selection for treatment. Rapid and precise identification of ciprofloxacin-resistant Klebsiella pneumoniae (CIRKP) is therefore essential for clinical therapy. Matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) is an alternative approach to detecting antibiotic-resistant bacteria because of its shorter inspection time and lower cost compared with other current methods. Machine learning methods can help discover significant biomarkers in MALDI-TOF MS data and construct prediction models for rapid antibiotic resistance identification. This study examined 16,997 samples collected from June 2013 to February 2018 as part of a longitudinal investigation by Chang Gung Memorial Hospital (CGMH), Linkou branch. We applied traditional statistical approaches to identify significant biomarkers and then compared the statistically selected features with the high-importance features in the machine learning models. The large scale of the data supported the statistical power of the selected biomarkers. In addition, clustering analysis examined suspicious sub-strains to provide information about their potential influence on antibiotic resistance identification performance. For modeling, to simulate the real challenges of antibiotic resistance prediction, we included basic patient information and the types of specimen carriers in model construction and separated the training and testing sets by time. Final performance reached an area under the receiver operating characteristic curve (AUC) of 0.89 for the support vector machine (SVM) and extreme gradient boosting (XGB) models, while the logistic regression and random forest models both achieved an AUC of around 0.85. In conclusion, the models provide sensitive forecasts of CIRKP, which may aid early antibiotic selection against Klebsiella pneumoniae. The suspicious sub-strains could affect model performance. Further work could search for methods to improve both model accuracy and stability.
Keywords: Klebsiella pneumoniae; MALDI-TOF MS; antibiotic susceptibility test; ciprofloxacin resistance; machine learning
Year: 2022 PMID: 35356528 PMCID: PMC8959214 DOI: 10.3389/fmicb.2022.827451
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
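The abstract describes separating the training and testing sets by time rather than at random. A minimal sketch of that idea is shown here; the record fields, the toy dates, and the cutoff date are all hypothetical, since the abstract does not state where the split was made.

```python
from datetime import date

def split_by_time(samples, cutoff):
    """Samples collected before `cutoff` go to training, the rest to testing."""
    train = [s for s in samples if s["collected"] < cutoff]
    test = [s for s in samples if s["collected"] >= cutoff]
    return train, test

# Toy records; field names, dates, and labels are illustrative only.
samples = [
    {"id": 1, "collected": date(2013, 6, 10), "resistant": 0},
    {"id": 2, "collected": date(2015, 3, 2), "resistant": 1},
    {"id": 3, "collected": date(2017, 8, 21), "resistant": 0},
    {"id": 4, "collected": date(2018, 2, 1), "resistant": 1},
]

# Hypothetical cutoff; the study's actual split date is not given in the abstract.
train, test = split_by_time(samples, cutoff=date(2017, 1, 1))
print([s["id"] for s in train], [s["id"] for s in test])
```

Splitting this way keeps every test sample later in time than every training sample, which mimics how a deployed model would meet future isolates.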
FIGURE 1. Flow chart of the whole study, including sample collection, data cleaning and processing, feature selection, treatment of the class imbalance problem, and model construction and comparison.
FIGURE 2. Demographic and statistical information of the data. (A1,A4) Proportion of CIRKP samples in each year; (A2,A3) age information of samples; (A5,A6) number of samples of each SPC and gender in CIRKP and CISKP; (B1) overall average spectrum plot of CIRKP and CISKP; (B2) distribution of peak numbers; (C1–C6) average spectrum plots of CIRKP and CISKP by year.
FIGURE 3. Data visualization of clustering results. (A1) Distribution of 8 clusters; (A2,A3) distribution of CIRKP and CISKP samples in each cluster; (B) proportion of each cluster in each year; (C) average spectrum plot of each cluster.
Significance of non-spectrometry covariates.
| | Sample population | | | p-value |
| | CISKP | CIRKP | Total number | |
| Total numbers | 9,201 | 3,424 | 12,625 | |
| Gender (ratio %) | | | | < 2.2×10⁻¹⁶ |
| Male | 4,581 (49.7) | 2,030 (59.3) | 6,611 | |
| Female | 4,620 (50.3) | 1,394 (40.7) | 6,014 | |
| Age (ratio %) | | | | < 2.2×10⁻¹⁶ |
| Infant | 856 (9.3) | 41 (1.2) | 897 | |
| Children | 50 (0.5) | 12 (0.4) | 62 | |
| Teenager | 62 (0.7) | 14 (0.4) | 76 | |
| Youth | 1,201 (13.1) | 272 (7.9) | 1,473 | |
| Middle-aged | 3,715 (40.4) | 1,167 (34.1) | 4,882 | |
| Senium | 3,317 (36.0) | 1,918 (56.0) | 5,235 | |
| Specimen type (ratio %) | | | | < 2.2×10⁻¹⁶ |
| B | 1,330 (14.5) | 400 (11.7) | 1,730 | |
| F | 351 (3.8) | 133 (3.9) | 484 | |
| W | 1,490 (16.2) | 373 (10.9) | 1,863 | |
| R | 2,081 (22.6) | 1,148 (33.5) | 3,229 | |
| U | 3,610 (39.2) | 1,276 (37.3) | 4,886 | |
| O | 339 (3.7) | 94 (2.7) | 433 | |
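As an illustration of how the significance of a categorical covariate can be checked, the sketch below runs a Pearson chi-square test on the gender counts from the table above. The record does not say which test the authors used, so the choice of test here is an assumption; the function name is illustrative.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value
    for a 2x2 contingency table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X >= stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: male, female; columns: CISKP, CIRKP (counts taken from the table).
gender_table = [[4581, 2030], [4620, 1394]]
stat, p_value = chi_square_2x2(gender_table)
print(f"chi-square = {stat:.2f}, p = {p_value:.3g}")
```

Under this assumed test, the p-value falls below the 2.2×10⁻¹⁶ threshold reported in the table.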
Top 15 significant pseudo-ions selected by statistical methods.
| Rank | Pseudo-ion | m/z | Mean difference of peak intensity ratio (10−4) | log2(fc) | Wilcoxon rank sum test | KS test | Observation times | |
| 1 | | (4,700, 4,720) | 10.22 | 0.88 | 1.24×10⁻³⁸ | ≤0.01 | ≤0.01 | 3,200 |
| 2 | Pseudo-ion 5 | (2,080, 2,100) | −50.75 | −0.43 | 1.02×10⁻⁵¹ | ≤0.01 | ≤0.01 | 12,528 |
| 3 | Pseudo-ion 27 | (2,520, 2,540) | −13.63 | −0.41 | 1.73×10⁻⁶⁴ | ≤0.01 | ≤0.01 | 10,704 |
| 4 | Pseudo-ion 95 | (3,880, 3,900) | −3.51 | −0.61 | 2.40×10⁻¹⁰ | ≤0.01 | ≤0.01 | 3,358 |
| 5 | Pseudo-ion 353 | (9,040, 9,060) | −1.03 | −0.77 | 5.60×10⁻⁵ | ≤0.01 | ≤0.01 | |
| 6 | Pseudo-ion 1 | (2,000, 2,020) | −13.52 | −0.40 | 1.72×10⁻⁴¹ | ≤0.01 | ≤0.01 | 10,727 |
| 7 | Pseudo-ion 32 | (2,620, 2,640) | −13.09 | −0.40 | 1.01×10⁻³⁴ | ≤0.01 | ≤0.01 | 10,453 |
| 8 | Pseudo-ion 293 | (7,840, 7,860) | −0.26 | −0.67 | 3.20×10⁻⁷ | ≤0.01 | ≤0.01 | |
| 9 | Pseudo-ion 284 | (7,660, 7,680) | −4.76 | −0.72 | 1.87×10⁻³ | ≤0.01 | ≤0.01 | 2,873 |
| 10 | Pseudo-ion 11 | (2,200, 2,220) | −13.45 | −0.37 | 6.23×10⁻⁴² | ≤0.01 | ≤0.01 | 10,405 |
| 11 | Pseudo-ion 2 | (2,020, 2,040) | −20.51 | −0.38 | 9.96×10⁻³⁴ | ≤0.01 | ≤0.01 | 11,433 |
| 12 | Pseudo-ion 4 | (2,060, 2,080) | −97.53 | −0.39 | 2.46×10⁻²³ | ≤0.01 | ≤0.01 | 12,318 |
| 13 | | (9,320, 9,340) | 0.01 | 1.11 | 3.86×10⁻⁴ | ≤0.01 | 0.01 | |
| 14 | Pseudo-ion 50 | (2,980, 3,000) | −12.41 | −0.35 | 6.23×10⁻³³ | ≤0.01 | 0.01 | 10,574 |
| 15 | Pseudo-ion 163 | (3,240, 3,260) | −3.71 | −0.84 | 2.46×10⁻⁹ | ≤0.01 | 0.01 | |
Mean difference is calculated as CIRKP − CISKP; fc denotes the fold change, calculated as CIRKP/CISKP; total number of training samples: 13,414. Bold values indicate pseudo-ions whose statistical quality is relatively lower than that of the other pseudo-ions because fewer samples were observed.
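Most labeled rows in the table are consistent with pseudo-ions being fixed 20-unit m/z bins numbered from 1 at m/z 2,000 (for example, Pseudo-ion 5 covers 2,080 to 2,100). The sketch below encodes that inferred mapping together with the log2 fold change defined in the footnote. The bin origin, the bin width, the half-open bin boundaries, and the function names are assumptions drawn from the table, not a description of the authors' pipeline.

```python
import math

BIN_START = 2000  # assumed lower m/z bound, inferred from Pseudo-ion 1 = (2,000, 2,020)
BIN_WIDTH = 20    # assumed bin width, inferred from the m/z column

def pseudo_ion_index(mz):
    """Map an m/z value to its 1-based pseudo-ion index (bins taken as [low, high))."""
    return int((mz - BIN_START) // BIN_WIDTH) + 1

def pseudo_ion_range(index):
    """Return the (low, high) m/z bounds of a 1-based pseudo-ion index."""
    low = BIN_START + (index - 1) * BIN_WIDTH
    return (low, low + BIN_WIDTH)

def log2_fold_change(mean_cirkp, mean_ciskp):
    """log2 of the fold change CIRKP / CISKP, as defined in the table footnote."""
    return math.log2(mean_cirkp / mean_ciskp)

print(pseudo_ion_index(2085.3))  # a peak at m/z 2085.3 falls in Pseudo-ion 5
print(pseudo_ion_range(27))      # bounds of Pseudo-ion 27
```

A negative log2(fc) therefore means the mean peak intensity ratio is lower in CIRKP than in CISKP, which matches the sign of the mean difference column.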
FIGURE 4. Performance of models. (A1,A2) Training and testing ROC plots of four models; (B–E) optimal probability cutoff of training and testing set for four models.
Top 15 significant features selected by models.
| | Features (observed sample number) | | | | | | | | | | | | | | |
| Model | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 | Rank 9 | Rank 10 | Rank 11 | Rank 12 | Rank 13 | Rank 14 | Rank 15 |
| LR | AGE (13,414) | GEN (13,414) | U (5,218) | PI138 (9,756) | R (3,408) | PI172 (6,666) | PI307 (4,831) | PI228 (1,694) | PI158 (13,137) | PI495 (1,358) | PI103 (3,940) | PI203 (551) | PI290 (2,367) | PI287 (4,403) | PI92 (8,767) |
| SVM | PI154 (3,324) | AGE (13,414) | PI36 (3,200) | PI226 (5,434) | PI494 (2,858) | PI171 (9,543) | PI127 (1,739) | PI306 (5,431) | PI273 (606) | PI103 (3,940) | PI1495 (1,358) | PI230 (7,252) | PI367 (285) | PI198 (4,580) | PI52 (10,181) |
| RF | AGE (13,414) | PI171 (9,543) | R (3,408) | PI154 (3,324) | GEN (13,414) | PI136 (3,200) | PI208 (5,405) | PI91 (8,789) | PI288 (12,122) | PI165 (9,641) | PI316 (9,601) | PI226 (5,434) | PI108 (9,883) | PI286 (12,699) | PI306 (5,431) |
| XGB | R (3,945) | AGE (13,414) | GEN (13,414) | PI171 (9,543) | PI136 (3,200) | PI54 (3,324) | PI208 (5,405) | PI91 (8,789) | PI226 (5,434) | PI306 (5,431) | PI31 (9,619) | PI165 (9,641) | PI266 (5,375) | PI316 (9,601) | PI288 (12,112) |
PI, pseudo-ion; GEN, gender; R, SPC-R; U, SPC-U.
In-cluster performance of the general model.
| | Training (AUC %) | | | | Testing (AUC %) | | | |
| Clusters | LR | SVM | RF | XGB | LR | SVM | RF | XGB |
| 0 | 90.61 | 98.00 | 100.00 | 98.84 | 82.89 | 85.70 | 84.27 | 85.95 |
| 1 | 90.56 | 97.58 | 100.00 | 98.82 | 84.82 | 87.70 | 86.34 | 89.80 |
| 2 | 89.25 | 97.55 | 100.00 | 98.24 | 92.16 | 93.72 | 90.74 | 91.03 |
| 3 | 89.14 | 97.47 | 100.00 | 98.41 | | 84.12 | | 83.15 |
| 4 | 89.01 | 97.45 | 100.00 | 98.28 | 80.84 | 87.31 | 81.43 | 86.70 |
| 5 | 88.26 | 96.73 | 100.00 | 98.55 | 88.95 | 91.44 | 87.94 | 91.31 |
| 6 | 87.10 | 96.70 | 100.00 | 97.22 | | 92.59 | | |
| 7 | 91.15 | 97.63 | 100.00 | 98.73 | 85.30 | 88.67 | 87.98 | 93.81 |
Bold AUC values for clusters 0–7 indicate poor performance of the machine learning models on those clusters.
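An in-cluster AUC like the ones tabulated above can be obtained by grouping test samples by cluster and scoring each group separately. The sketch below shows one way to do this with the pairwise-ranking definition of AUC; the function names and the toy clusters, labels, and scores are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as 0.5. Returns None if a class is absent."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_cluster(clusters, labels, scores):
    """Group samples by cluster id and compute the AUC inside each group."""
    groups = defaultdict(lambda: ([], []))
    for c, y, s in zip(clusters, labels, scores):
        groups[c][0].append(y)
        groups[c][1].append(s)
    return {c: auc(ys, ss) for c, (ys, ss) in sorted(groups.items())}

# Toy data: cluster id, true label (1 = resistant), predicted probability.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.4, 0.5, 0.1, 0.3]
result = auc_by_cluster(clusters, labels, scores)
print(result)
```

The pairwise loop is quadratic in group size, which is fine for a sketch; a rank-based formulation would scale better. A cluster with very few positive samples yields an AUC built from only a handful of pairs, so its value can swing widely.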
Confusion matrix of each model at the optimal probability cutoff of the training set.
| Predicted \ Real | | Cluster 0 | | Cluster 1 | | Cluster 2 | | Cluster 3 | |
| | | CIRKP | CISKP | CIRKP | CISKP | CIRKP | CISKP | CIRKP | CISKP |
| LR | CIRKP | 127 | 62 | 166 | 97 | 15 | 6 | 38 | 35 |
| | CISKP | 41 | 206 | 29 | 167 | 5 | 91 | 24 | 140 |
| SVM | CIRKP | 117 | 36 | 156 | 49 | 16 | 2 | 36 | 18 |
| | CISKP | 51 | 232 | 39 | 215 | 4 | 95 | 26 | 157 |
| RF | CIRKP | | | | | | | | |
| | CISKP | | | | | | | | |
| XGB | CIRKP | 105 | 37 | 161 | 57 | 14 | 5 | 35 | 22 |
| | CISKP | 63 | 231 | 34 | 207 | 6 | 92 | 27 | 153 |
| Predicted \ Real | | Cluster 4 | | Cluster 5 | | Cluster 6 | | Cluster 7 | |
| | | CIRKP | CISKP | CIRKP | CISKP | CIRKP | CISKP | CIRKP | CISKP |
| LR | CIRKP | 91 | 46 | 159 | 57 | | | 28 | 20 |
| | CISKP | 22 | 103 | 61 | 413 | | | 8 | 55 |
| SVM | CIRKP | 85 | 27 | 156 | 37 | | | 30 | 21 |
| | CISKP | 28 | 122 | 64 | 433 | | | 6 | 54 |
| RF | CIRKP | | | | | | | | |
| | CISKP | | | | | | | | |
| XGB | CIRKP | 93 | 36 | 169 | 50 | | | 34 | 20 |
| | CISKP | 20 | 113 | 51 | 420 | | | 2 | 55 |
| Predicted \ Real | LR | | SVM | | RF | | XGB | |
| | CIRKP | CISKP | CIRKP | CISKP | CIRKP | CISKP | CIRKP | CISKP |
| CIRKP | 624 | 326 | 596 | 192 | | | 611 | 229 |
| CISKP | 191 | 1,226 | 219 | 1,360 | | | 204 | 1,323 |
The RF model is severely overfitted to the CIRKP group. The in-cluster performance of cluster 6 is acceptable, but its low AUC value is caused by insufficient positive test samples.
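Sensitivity and specificity follow directly from any one of these confusion matrices. The sketch below uses the LR counts for cluster 0 and assumes that rows hold the predicted class and columns hold the real class, with CIRKP treated as the positive class; the function name is illustrative.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# LR, cluster 0 (assumed layout: rows = predicted, columns = real).
# Predicted CIRKP row: 127 real CIRKP (TP), 62 real CISKP (FP).
# Predicted CISKP row: 41 real CIRKP (FN), 206 real CISKP (TN).
sens, spec = sensitivity_specificity(tp=127, fp=62, fn=41, tn=206)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

If the table orientation were the reverse, the FP and FN counts would swap, so the orientation assumption should be checked against the full text before relying on these values.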
FIGURE 5. Time influence on the performance of four models.