Chia-Ru Chung1, Hsin-Yao Wang2,3, Frank Lien2, Yi-Ju Tseng2,4,5, Chun-Hsien Chen2,4, Tzong-Yi Lee6,7, Tsui-Ping Liu2, Jorng-Tzong Horng1,8, Jang-Jih Lu2,9,10.
Abstract
Staphylococcus haemolyticus is one of the most significant coagulase-negative staphylococci and often causes severe infections. Rapid strain typing of pathogenic S. haemolyticus is indispensable in modern public health infectious disease control because it helps identify the origin of infections and prevent further outbreaks. Rapid identification enables effective control of pathogenic infections, which is highly beneficial to critically ill patients. However, existing strain typing methods, such as multi-locus sequencing, are relatively costly and time-consuming. A practical method for rapid strain typing of pathogens that is suitable for routine use in clinics and hospitals is still not available. Matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) combined with machine learning is a promising approach to rapid strain typing. In this study, we developed a statistical test-based method to determine the reference spectrum for aligning mass spectra datasets, and constructed machine learning-based classifiers for categorizing different strains of S. haemolyticus. The area under the receiver operating characteristic curve (AUC) and accuracy of multi-class predictions were 0.848 and 0.866, respectively. Additionally, we employed a variety of statistical tests and feature-selection strategies to identify the discriminative peaks that contribute substantially to strain typing. This study not only incorporates statistical test-based methods to manage the alignment of mass spectra datasets but also provides a practical means of rapid strain typing of S. haemolyticus.
Keywords: Fisher's exact test; MALDI-TOF MS; Staphylococcus haemolyticus; machine learning; strain typing
Year: 2019 PMID: 31572327 PMCID: PMC6753874 DOI: 10.3389/fmicb.2019.02120
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Figure 1. Flowchart of the preprocessing of spectral data for a tolerance value of 5. The incidence ratio was determined by the number of the isolates among the CPS. D was defined as the total difference between the incidence ratios.
Figure 2. Distribution of the dataset. (A) Pie chart showing the distribution of the dataset. (B) Number of identified peaks in each group.
Figure 3. Proportion of significance for different tolerance values. Fisher's exact test was employed to examine the difference between two different ST types. Each reported p-value is the average of three p-values.
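To make the test in this caption concrete, the following is a minimal sketch of a two-sided Fisher's exact test on the presence or absence of one peak in two groups of isolates. It uses only the Python standard library. The function name `fisher_exact_two_sided` and the example counts are illustrative assumptions, not the authors' code, and the sketch does not reproduce the averaging over three p-values.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are "peak present" and "peak absent".
    The p-value sums the hypergeometric probabilities of every table with
    the same margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 "present" spectra fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # A small relative slack guards against floating-point ties.
    total = sum(p for p in probs if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Illustrative counts (assumption): the peak is present in 49 of 62 isolates
# of one group and in 25 of 145 isolates of another.
p_value = fisher_exact_two_sided(49, 62 - 49, 25, 145 - 25)
print(f"p = {p_value:.3g}")
```

A small p-value here suggests that the peak occurs at different rates in the two groups, which is the sense in which a peak can be called discriminative.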
Figure 4. Mass spectra before and after peak alignment. The left panel shows the number of spectra in which specific peaks appear in the original mass spectra signal; the right panel shows the same count after applying the alignment strategy with a tolerance value of 5.
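The tolerance-matching step of a peak alignment can be sketched as follows. This is a simplified illustration only: it assumes a reference peak list is already given, whereas the study determines the reference spectrum with a statistical test-based method that is not reproduced here. The function name `align_peaks`, the example values, and the assumption that the tolerance is in the same units as the peak positions are all hypothetical.

```python
from bisect import bisect_left


def align_peaks(peaks, reference, tolerance=5.0):
    """Map each observed peak to the nearest reference peak within tolerance.

    `reference` must be sorted in ascending order. Observed peaks with no
    reference peak within the tolerance are dropped. Returns the sorted
    list of reference peaks that were matched.
    """
    matched = set()
    for mz in peaks:
        i = bisect_left(reference, mz)
        # Candidates: the reference peaks just below and just above mz.
        candidates = reference[max(0, i - 1): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda r: abs(r - mz))
        if abs(nearest - mz) <= tolerance:
            matched.add(nearest)
    return sorted(matched)


# Hypothetical reference peaks and one hypothetical observed spectrum.
reference = [4673, 4999, 5129, 6496]
spectrum = [4670.8, 5002.1, 5140.0, 6499.9]
aligned = align_peaks(spectrum, reference, tolerance=5)
print(aligned)
```

After this step every spectrum is expressed on the same set of peak positions, so the presence or intensity of each peak can be compared across isolates.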
Figure 5. Boxplot of the accuracy and AUC for the repeated 5-fold cross validation when the tolerance value is 5.
Figure 6. Performance of different classifiers. Mean and standard deviation of the AUC of the 5-fold cross validations for the different tolerance values using different machine learning methods.
Performance of 5-fold cross validation.
| Method | Accuracy | p-value | AUC | p-value |
| --- | --- | --- | --- | --- |
| MLR | 0.819 ± 0.028 | – | 0.808 ± 0.074 | – |
| SVM | 0.858 ± 0.029 | 0.0937 | 0.839 ± 0.060 | 0.5476 |
| DT | 0.840 ± 0.046 | 0.4005 | 0.804 ± 0.012 | 0.6905 |
| RF | 0.866 ± 0.014 | 0.0196 | 0.848 ± 0.037 | 0.3095 |
Mean ± standard deviation of the accuracy and AUC of the 5-fold cross validations for the multiclass classifications using different machine learning methods when the tolerance value is 5. The p-values were derived by comparison with MLR. MLR, multiclass logistic regression; SVM, support vector machine; DT, decision tree; RF, random forest.
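The "mean ± standard deviation" summaries of 5-fold cross validation can be produced along the lines of the sketch below. It is a generic, standard-library illustration, not the pipeline used in the study: `k_fold_indices`, `cross_validate`, and the placeholder `majority_class` model are hypothetical names, the folds are not stratified, and the data are synthetic.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n, k=5, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit_predict, k=5, seed=0):
    """Return the accuracy of each fold.

    `fit_predict(X_train, y_train, X_test)` must return predicted labels
    for X_test.
    """
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return scores


def majority_class(X_train, y_train, X_test):
    # Placeholder model: always predicts the most common training label.
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)


# Synthetic data (assumption): 20 samples, 14 of class 0 and 6 of class 1.
X = [[i] for i in range(20)]
y = [0] * 14 + [1] * 6
scores = cross_validate(X, y, majority_class)
print(f"accuracy: {mean(scores):.3f} ± {stdev(scores):.3f}")
```

In practice the placeholder model would be replaced by a real classifier such as a random forest, and the same per-fold scores could be collected for AUC.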
Performance of feature selection.
| Strategy | Order of peaks | Number of peaks | Accuracy | AUC | p-value |
| --- | --- | --- | --- | --- | --- |
| Stepwise | AUC | 21 | 0.918 ± 0.024 | 0.921 ± 0.025 | 0.0079 |
| Stepwise | Kendall's tau | 26 | 0.906 ± 0.008 | 0.897 ± 0.049 | 0.2222 |
| Stepwise | KW | 20 | 0.902 ± 0.030 | 0.917 ± 0.042 | 0.0952 |
| Stepwise | IMP | 27 | 0.910 ± 0.008 | 0.926 ± 0.020 | 0.0079 |
| Forward | AUC | 18 | 0.897 ± 0.032 | 0.896 ± 0.026 | 0.0952 |
| Forward | Kendall's tau | 35 | 0.901 ± 0.024 | 0.893 ± 0.047 | 0.2222 |
| Forward | FE | 25 | 0.902 ± 0.023 | 0.898 ± 0.035 | 0.0556 |
| Forward | KW | 26 | 0.906 ± 0.031 | 0.902 ± 0.061 | 0.2222 |
| Forward | IMP-ACC | 22 | 0.882 ± 0.032 | 0.864 ± 0.059 | 0.7533 |
| Forward | IMP-GINI | 28 | 0.874 ± 0.034 | 0.836 ± 0.037 | 0.5476 |
| None | - | 583 | 0.866 ± 0.014 | 0.848 ± 0.037 | - |
Mean ± standard deviation of the accuracy and AUC of the 5-fold cross validation of RF classifiers built from the peaks selected by the forward and stepwise feature selection strategies under different orders of peaks. AUC, area under the curve; FE, Fisher's exact test; KW, Kruskal-Wallis test; IMP-ACC, importance measure calculated by mean decreased accuracy using RF; IMP-GINI, importance measure calculated by mean decreased impurity using RF.
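The general idea of forward selection over a ranked list of peaks can be sketched as below. The exact forward and stepwise procedures used in the study are not described in this record, so this is a generic greedy variant under stated assumptions: `forward_selection` is a hypothetical helper, and `toy_score` is a made-up stand-in for what would in practice be a cross-validated accuracy or AUC.

```python
def forward_selection(ranked_features, score):
    """Greedy forward selection over features in a given order.

    Features are tried one at a time in ranked order; a feature is kept
    only if adding it raises `score(selected)` above the best seen so far.
    Returns the selected features and the best score.
    """
    selected, best = [], float("-inf")
    for feature in ranked_features:
        candidate_score = score(selected + [feature])
        if candidate_score > best:
            selected.append(feature)
            best = candidate_score
    return selected, best


# Toy scoring function (assumption): rewards three "useful" peaks and
# applies a small penalty per selected peak.
USEFUL = {4673, 5129, 6496}


def toy_score(subset):
    return len(set(subset) & USEFUL) - 0.1 * len(subset)


ranked = [5129, 4999, 4673, 2499, 6496]
selected, best = forward_selection(ranked, toy_score)
print(selected, round(best, 3))
```

The order of `ranked_features` matters, which is why the table compares several ways of ordering the peaks before selection.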
Occurrence counts (proportions) and average p-values from Fisher's exact test for the discriminative peaks.
| Peak | Group 1, n (proportion) | Group 2, n (proportion) | Group 3, n (proportion) | Average p-value |
| --- | --- | --- | --- | --- |
| 4673 | 40 (0.645) | 5 (0.034) | 5 (0.106) | 0.022 |
| 5129 | 49 (0.790) | 25 (0.172) | 18 (0.383) | 0.001 |
| 4999 | 62 (1.000) | 138 (0.952) | 35 (0.745) | 0.035 |
| 5635 | 0 (0.000) | 0 (0.000) | 12 (0.255) | 0.333 |
| 6466 | 31 (0.500) | 3 (0.021) | 23 (0.489) | 0.333 |
| 2499 | 52 (0.839) | 59 (0.407) | 33 (0.702) | 0.035 |
| 3390 | 15 (0.242) | 107 (0.738) | 27 (0.574) | 0.015 |
| 3411 | 20 (0.323) | 70 (0.483) | 1 (0.021) | 0.015 |
| 5036 | 43 (0.694) | 43 (0.297) | 17 (0.362) | 0.157 |
| 6496 | 30 (0.484) | 136 (0.938) | 15 (0.319) | 0.039 |
| 6781 | 21 (0.339) | 129 (0.890) | 26 (0.553) | 0.011 |
* Indicates p < 0.01.
Figure 7. Overview of processed MS data. Occurrence proportions among the three groups over the range from 2,000 to 17,000 Da, with a zoomed-in view of the range 4,900 to 7,100 Da. The red areas include peaks 4548, 4673, 4999, 5036, 5129, 5635, 6466, 6496, and 6781, which are the important peaks when constructing the RF-based classifiers.
Means (standard deviations) and p-values from the Kruskal–Wallis test for the discriminative peaks.
| Peak | Group 1, mean (SD) | Group 2, mean (SD) | Group 3, mean (SD) | p-value |
| --- | --- | --- | --- | --- |
| 4673 | 0.052 (0.049) | 0.003 (0.026) | 0.011 (0.032) | <0.001 |
| 5129 | 0.209 (0.200) | 0.031 (0.094) | 0.094 (0.150) | <0.001 |
| 4999 | 0.769 (0.307) | 0.350 (0.236) | 0.455 (0.427) | <0.001 |
| 5635 | 0.000 (0.000) | 0.000 (0.000) | 0.126 (0.266) | <0.001 |
| 6466 | 0.145 (0.185) | 0.003 (0.025) | 0.118 (0.155) | <0.001 |
| 2499 | 0.151 (0.120) | 0.057 (0.083) | 0.222 (0.207) | <0.001 |
| 3390 | 0.030 (0.062) | 0.132 (0.110) | 0.103 (0.111) | <0.001 |
| 3411 | 0.024 (0.040) | 0.054 (0.065) | 0.002 (0.011) | <0.001 |
| 5036 | 0.115 (0.102) | 0.032 (0.082) | 0.076 (0.109) | <0.001 |
| 6496 | 0.108 (0.151) | 0.338 (0.210) | 0.089 (0.174) | <0.001 |
| 6781 | 0.065 (0.119) | 0.247 (0.162) | 0.108 (0.126) | <0.001 |
* Indicates p < 0.01.
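For three groups, a Kruskal-Wallis comparison of peak intensities can be sketched with the standard library alone, because the chi-square approximation then has 2 degrees of freedom and its upper-tail probability is exp(-H/2). The function name `kruskal_wallis_3` and the example intensities are illustrative assumptions, not the study's data or code; each group is assumed to be non-empty.

```python
from math import exp


def kruskal_wallis_3(g1, g2, g3):
    """Kruskal-Wallis H test for exactly three non-empty groups.

    Returns (H, p). H is tie-corrected, and p uses the chi-square
    approximation with 2 degrees of freedom, i.e. p = exp(-H / 2).
    """
    groups = [g1, g2, g3]
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Assign average ranks to tied values and accumulate the tie term.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        return 0.0, 1.0  # every value is identical
    h /= correction
    return h, exp(-h / 2)


# Hypothetical normalized intensities of one peak in three groups.
group_a = [0.05, 0.07, 0.10, 0.04, 0.06]
group_b = [0.00, 0.01, 0.00, 0.02, 0.00]
group_c = [0.01, 0.03, 0.00, 0.02, 0.05]
h_stat, p_value = kruskal_wallis_3(group_a, group_b, group_c)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

Unlike Fisher's exact test, which looks only at whether a peak is present, this rank-based test compares the intensity distributions across the groups.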
Figure 8. Boxplots of the normalized intensity for the discriminative peaks.
Performance of binary classifier.
| Method | Sensitivity | Specificity | Accuracy | AUC |
| --- | --- | --- | --- | --- |
| LR | 0.890 ± 0.062 | 0.968 ± 0.044 | 0.913 ± 0.045 | 0.919 ± 0.051 |
| SVM | 0.931 ± 0.055 | 0.983 ± 0.037 | 0.947 ± 0.032 | 0.969 ± 0.021 |
| DT | 0.938 ± 0.037 | 0.904 ± 0.032 | 0.928 ± 0.033 | 0.919 ± 0.024 |
| RF | 0.951 ± 0.031 | 1.000 ± 0.000 | 0.966 ± 0.022 | 0.972 ± 0.020 |
Mean ± standard deviation of the sensitivity, specificity, accuracy, and AUC of 5-fold cross validation for binary classification using different machine learning methods when the tolerance value is 5. LR, logistic regression; SVM, support vector machine; DT, decision tree; RF, random forest.
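The four metrics reported for the binary classifiers can be computed from predictions as in the sketch below. The helper names `binary_metrics` and `auc`, the 0.5 decision threshold, and the labels and scores are illustrative assumptions; the sketch assumes both classes are present.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for 0/1 labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for eight isolates.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in scores]

metrics = binary_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC summarizes the ranking of the scores over all thresholds.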