| Literature DB >> 36014076 |
Flavia Dematheis, Mathias C Walter, Daniel Lang, Markus Antwerpen, Holger C Scholz, Marie-Theres Pfalzgraf, Enrico Mantel, Christin Hinz, Roman Wölfel, Sabine Zange.
Abstract
(1) Background: MALDI-TOF mass spectrometry (MS) is the gold standard for microbial fingerprinting; however, for phylogenetically closely related species, its resolving power drops to the genus level. In this study, we analyzed MALDI-TOF spectra from 44 strains of B. melitensis, B. suis and B. abortus to identify the optimal classification method among popular supervised and unsupervised machine learning (ML) algorithms.
Keywords: B. abortus; B. suis; Brucella melitensis; MALDI-TOF MS; R; feature selection; machine learning; nested k-fold cross validation
Year: 2022 PMID: 36014076 PMCID: PMC9416640 DOI: 10.3390/microorganisms10081658
Source DB: PubMed Journal: Microorganisms ISSN: 2076-2607
Scheme 1. The MALDI-TOF MS dataset was randomly split into two independent datasets, a training and an external test dataset, representing 70% and 30% of the cases, respectively. The external test dataset, encompassing 12 samples unseen during the modeling procedure and treated as unknowns, was used to assess the true performance of the developed model in a real-world setting. The training dataset was used to create fine-tuned ML models and to estimate their expected performance by means of nested cross-validation (nCV) with a 10-fold outer loop and a 5-fold inner loop. In the outer loop, the training set was split into 10 folds (groups) of approximately equal size, nine of which were assigned to the training set (in blue) and one to the test set (in green). In the inner loop, the training set was further split into five folds, four of which were assigned to the nested training set and one to the validation set (in orange). The nCV process was as follows: the models were configured on the nested training data using every combination of hyperparameters in the grid, and the accuracy on the validation dataset was assessed. Validation folds were rotated until all groups had contributed to the validation data. The best hyperparameters were identified as those with the highest validation performance over the five folds; these were then used to train a classifier on the training dataset of the outer loop and to evaluate the expected performance of the model on the corresponding test dataset (green fold). The test fold was rotated among all folds of the outer CV loop to evaluate the ability of each model to generalize to new data.
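The nested cross-validation scheme described above can be sketched in Python with scikit-learn (the study itself used R; the data, feature counts, and hyperparameter grid below are synthetic placeholders, not the study's spectra):

```python
# Sketch of the nCV scheme: 70/30 holdout, 10-fold outer loop for expected
# accuracy, 5-fold inner loop for hyperparameter tuning. Synthetic data only.
import numpy as np
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)            # three "species" labels (synthetic)
X = rng.normal(size=(60, 16))           # samples x peak intensities (synthetic)
X[:, 0] += y * 2.0                      # make one feature class-dependent

# 70/30 split into training and external test datasets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)

# Inner loop: grid search over hyperparameters on the nested training folds
grid = GridSearchCV(SVC(kernel="rbf"), {"gamma": [0.01, 0.1, 1.0]}, cv=inner)

# Outer loop: expected performance of the tuned model
expected_acc = cross_val_score(grid, X_tr, y_tr, cv=outer)
print(expected_acc.mean())

# True performance: refit on the full training set, score the external test set
grid.fit(X_tr, y_tr)
print(grid.score(X_te, y_te))
```

The key design point is that hyperparameter selection happens only inside the inner loop, so the outer test folds and the external test set never influence tuning.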
Figure 1. Venn diagram created using the R Limma package, displaying the level of overlap among different feature selection strategies based on variable importance assessed by means of Neural Network (NN), Multinomial Logistic Regression (MNR), Random Forest (RF) and the ratio of between-groups to within-groups sum of squares (BSS/WSS).
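Of the four importance measures combined in the Venn diagram, the BSS/WSS ratio is the simplest to state: for each feature, the between-group sum of squares is divided by the within-group sum of squares, and high-ratio features discriminate the groups best. A minimal Python sketch on synthetic data (the study computed this in R):

```python
# BSS/WSS feature ranking: ratio of between-group to within-group sum of
# squares, computed per feature. Synthetic data, not the study's peaks.
import numpy as np

def bss_wss(X, y):
    """Per-feature ratio of between-group to within-group sum of squares."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for g in np.unique(y):
        Xg = X[y == g]
        bss += len(Xg) * (Xg.mean(axis=0) - overall) ** 2
        wss += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    return bss / wss

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 15)
X = rng.normal(size=(45, 6))
X[:, 0] += y * 3.0            # make feature 0 strongly group-dependent

ratios = bss_wss(X, y)
print(ratios.argmax())        # feature 0 should rank highest
```

A consensus selection then intersects the top-ranked features from each method, which is what the Venn diagram visualizes.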
Figure 2. Illustration of the k-means model with the optimal number of clusters (k = 3). The function fviz_cluster from the R-package “factoextra” was used to plot the data points according to the first two principal components that explained the majority of the variance. The figure shows species-specific clustering. The optimal number of clusters is three, with an SC value of 0.39. The silhouette coefficient was computed using the silhouette function from the “cluster” R-package.
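The same analysis, choosing k by the average silhouette coefficient and projecting onto the first two principal components, can be sketched in Python (the study used the R packages “factoextra” and “cluster”; the three-group data below are synthetic):

```python
# k-means with silhouette-based choice of k, plus a 2D PCA projection for
# plotting, analogous to fviz_cluster. Synthetic data only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic groups in a 16-dimensional "peak" space
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(15, 16))
               for c in (0.0, 2.0, 4.0)])

# choose k by the average silhouette coefficient (SC)
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# project onto the first two principal components for plotting
coords = PCA(n_components=2).fit_transform(X)
print(best_k)
```

With clearly separated groups the silhouette peaks at the true number of clusters; the study's SC of 0.39 indicates a weaker but still species-consistent structure.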
Figure 3. Heatmap displaying the intensities of the 16 potential biomarkers identified by means of a consensus feature selection method. Feature intensities are color-coded: red for high-intensity peaks and blue for low-intensity peaks. Samples and features were ordered by means of agglomerative hierarchical clustering with average linkage and Manhattan distance.
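The ordering step behind such a heatmap (agglomerative clustering with average linkage and Manhattan distance, applied to both rows and columns) can be sketched with SciPy; the intensity matrix below is synthetic, standing in for the 16 biomarkers:

```python
# Row/column ordering for a clustered heatmap: average-linkage hierarchical
# clustering on Manhattan (cityblock) distances. Synthetic intensities.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
intensities = rng.normal(size=(12, 16))   # 12 samples x 16 peak intensities

# cluster samples (rows) and features (columns) with the same settings
row_order = leaves_list(
    linkage(pdist(intensities, metric="cityblock"), method="average"))
col_order = leaves_list(
    linkage(pdist(intensities.T, metric="cityblock"), method="average"))

# reorder the matrix as a heatmap routine would before drawing
ordered = intensities[np.ix_(row_order, col_order)]
print(ordered.shape)
```

Reordering places similar samples and co-varying features next to each other, which is what makes species-specific intensity blocks visible in the figure.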
Expected and true performance of the best final classifiers, optimized in terms of feature set and model hyperparameters. The expected ML model accuracy was evaluated across all 10 CV folds of the training dataset, while the true ML accuracy was assessed on the external test dataset.
| ML Model | Feature Number | Features | Best Model Hyperparameters | Expected Accuracy | True Accuracy |
|---|---|---|---|---|---|
| SVM | 4 | Peak.3633, .8700, .6715, .5271 | sigma = 0.57 | 1 | 1 |
| SVM | 3 | Peak.3633, .8700, .6715 | sigma = 0.085 | 1 | 0.92 |
| NN | 4 | Peak.6715, .8324, .9863, .8700 | size = 2 | 1 | 1 |
| NN | 3 | Peak.6715, .8324, .9863 | size = 3 | 1 | 0.83 |
| RF | 4 | Peak.5271, .6715, .8700, .3633 | ntree = 500 | 1 | 1 |
| RF | 3 | Peak.5271, .6715, .8700 | ntree = 1000 | 1 | 1 |
| MNR | 4 | Peak.9863, .6715, .8700, .9978 | decay = 1 | 1 | 1 |
| MNR | 3 | Peak.9863, .6715, .8700 | decay = 1 | 1 | 1 |
Performance of optimized fine-tuned ML models trained on the top four features identified using a feature selection approach specific to Multinomial Logistic Regression (MNR), Random Forest (RF), Neural Network (NN), and Support Vector Machine (SVM), separately. Performance was evaluated in terms of expected accuracy (training CV folds) and true accuracy (external test dataset). The top four features used to train each ML model are also listed.
| | MNR | RF | NN | SVM |
|---|---|---|---|---|
| Features | Peak.3427, .4911, .9863, .4930 | Peak.3427, .4351, .3677, .9978 | Peak.4911, .6715, .4930, .8324 | Peak.2880, .9863, .7377, .4249 |
| Expected accuracy | 0.98 | 0.95 | 0.96 | 0.96 |
| True accuracy | 0.833 | 0.833 | 0.92 | 0.833 |
Species assignment for each strain from the external dataset, performed by the optimized fine-tuned ML models trained on the top four features identified using a standard feature selection approach specific to Multinomial Logistic Regression (MNR), Random Forest (RF), Neural Network (NN) and Support Vector Machine (SVM), separately. Incorrect classification is shown in bold.
[Table: species assignment (Score and Identity) for each of the 12 external-test strains (L3-0515, L3-0529, L3-2529, L3-0638, L3-2519, L3-2525, L3-4556, L3-4561, L3-4564, L3-4565, L3-4590, L3-4602) by the SVM, NN, RF, and MNR models. The individual score and species values were not recoverable from this extraction; only scattered "correct" identity entries survive.]