Mehak Arora, Stephen C Zambrzycki, Joshua M Levy, Annette Esper, Jennifer K Frediani, Cassandra L Quave, Facundo M Fernández, Rishikesan Kamaleswaran.
Abstract
Point-of-care screening tools are essential to expedite patient care and decrease reliance on slow diagnostic tools (e.g., microbial cultures) to identify pathogens and their associated antibiotic resistance. Analysis of volatile organic compounds (VOC) emitted from biological media has seen increased attention in recent years as a potential non-invasive diagnostic procedure. This work explores the use of solid phase micro-extraction (SPME) and ambient plasma ionization mass spectrometry (MS) to rapidly acquire VOC signatures of bacteria and fungi. The MS spectrum of each pathogen goes through a preprocessing and feature extraction pipeline. Various supervised and unsupervised machine learning (ML) classification algorithms are trained and evaluated on the extracted feature set. These are able to classify the type of pathogen as bacteria or fungi with high accuracy, while marked progress is also made in identifying specific strains of bacteria. This study presents a new approach for the identification of pathogens from VOC signatures collected using SPME and ambient ionization MS by training classifiers on just a few samples of data. This ambient plasma ionization and ML approach is robust, rapid, precise, and can potentially be used as a non-invasive clinical diagnostic tool for point-of-care applications.
Keywords: DART-MS; K-means clustering; VOC; ambient plasma ionization; imbalanced learning; machine learning classification algorithms; pathogen identification; point-of-care devices; solid phase micro-extraction
Year: 2022 PMID: 35323675 PMCID: PMC8953436 DOI: 10.3390/metabo12030232
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Figure 1. (a) Feature vectors corresponding to each pathogen as data points in the transformed feature space after k-means clustering for two clusters. The colored shaded regions cover all the points that were clustered together, and red and green differentiate between the two clusters. There is some inherent separability between bacteria (red data points) and fungi (green data points) in this transformed feature space. (b) This plot shows the variability of salient peak locations in each cluster. The clustering algorithm is able to automatically identify peak locations that are commonly seen in one pathogen type and not seen in the other.
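The two-cluster k-means step referenced in this caption can be illustrated with a minimal NumPy sketch of Lloyd's algorithm. This is not the authors' implementation: the function name `kmeans`, the random initialization, and the toy data are assumptions made only for illustration.

```python
import numpy as np

def kmeans(X, k=2, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (illustrative sketch); returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids from k distinct, randomly chosen samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest centroid (squared Euclidean distance).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy feature vectors only; not data from the study.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X_toy, k=2)
```

Because k-means is unsupervised, the cluster indices are arbitrary; comparing them with the known bacteria and fungi labels afterwards is what reveals the separability described in the caption.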
Figure 2. Study overview for identification of pathogens from skin VOCs. (a) Summary diagram of the SPME-DART-MS procedure. SPME blades are incubated in the headspace of the agar slant with the microbes. The blades are then removed and placed in between the plasma stream of the DART plasma ionization source and the ambient pressure interface of the mass spectrometer. VOC adhered to the SPME blade are desorbed and ionized by DART. Then, the ionized VOC enter the ambient pressure interface of the mass spectrometer for measurement to produce a signature. (b) Process flow diagram for MS data preprocessing, peak detection, and ML. The mass spectra of the pathogen and blank were first min-max normalized. Linear interpolation was applied to the pathogen data to match the m/z sampling frequency of the blank to facilitate the blank subtraction step. The pathogen mass spectra were then smoothed by a mean filter, after which adaptive thresholding was applied to windows of 50 m/z intervals to obtain peak locations. These are encoded into a binary feature matrix that indicates whether a peak is present in a unit m/z interval or not. A PCA transform on this matrix was used as the input to our ML classification algorithms. (c) Interpretation of the decision tree trained on the binary feature matrix in terms of feature importance. This figure shows the distribution of bacteria and fungi samples with respect to the peak locations of most discriminatory importance (100 m/z, 377 m/z, 147 m/z).
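The preprocessing chain in panel (b) can be sketched roughly as below. This is a hedged illustration, not the authors' code. The function name `extract_peak_features`, the m/z range (50 to 750), the smoothing width, and the specific thresholding rule (window mean plus `k` standard deviations) are all assumptions; the caption only states that adaptive thresholding was applied over 50 m/z windows.

```python
import numpy as np

def min_max(x):
    """Scale intensities to [0, 1]; a flat spectrum maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def extract_peak_features(mz, intensity, blank_mz, blank_intensity,
                          mz_min=50, mz_max=750, window=50, smooth=5, k=3.0):
    """Return a binary vector with one entry per unit m/z bin.

    Assumed parameters (not from the source): mz_min, mz_max, smooth, k.
    """
    mz = np.asarray(mz, dtype=float)
    blank_mz = np.asarray(blank_mz, dtype=float)
    sample = min_max(intensity)
    blank = min_max(blank_intensity)
    # Linear interpolation puts the sample on the blank's m/z grid.
    sample_on_grid = np.interp(blank_mz, mz, sample)
    # Blank subtraction; negative residuals are clipped to zero.
    corrected = np.clip(sample_on_grid - blank, 0.0, None)
    # Mean (moving-average) filter.
    smoothed = np.convolve(corrected, np.ones(smooth) / smooth, mode="same")
    features = np.zeros(mz_max - mz_min, dtype=np.uint8)
    for start in range(mz_min, mz_max, window):
        stop = min(start + window, mz_max)
        in_win = (blank_mz >= start) & (blank_mz < stop)
        if not in_win.any():
            continue
        vals = smoothed[in_win]
        # Assumed adaptive rule: threshold relative to this window's statistics.
        thresh = vals.mean() + k * vals.std()
        peak_mz = blank_mz[in_win][vals > thresh]
        # Mark the unit m/z bin that contains each above-threshold point.
        features[np.floor(peak_mz).astype(int) - mz_min] = 1
    return features

# Synthetic spectra for illustration only: sample peaks near m/z 100.5 and 377.5.
blank_mz = np.linspace(50, 749.9, 7000)
blank_int = np.exp(-0.5 * ((blank_mz - 200.5) / 0.1) ** 2)
mz = np.arange(50, 750, 0.13)
inten = sum(np.exp(-0.5 * ((mz - c) / 0.1) ** 2) for c in (100.5, 377.5))
feats = extract_peak_features(mz, inten, blank_mz, blank_int)
```

Stacking one such vector per sample gives the binary feature matrix described in the caption. A per-window threshold of this kind is sensitive to windows that contain almost no signal, so a real pipeline would likely need a noise floor as well.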
Results of binary classification (bacteria vs. fungi) for the five supervised ML algorithms (averaged over three folds) on the binary feature matrix and the PCA-transformed data. The numbers in bold mark the highest values for each metric. It can be observed that decision tree classifiers trained on the binary feature matrix perform well, with the highest overall accuracy, F-score, and AUC.
| Classifier | Dataset | Accuracy | F-Score | Area under the ROC Curve (AUC) | Bacteria Precision | Bacteria Sensitivity | Fungi Precision | Fungi Sensitivity |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | Binary Features | 0.846 | 0.748 | 0.865 | 0.903 | 0.899 | 0.639 | 0.667 |
| Logistic Regression | PCA Features | 0.846 | 0.843 | 0.775 | 0.853 | 0.970 | 0.833 | 0.444 |
| Logistic Regression with Lasso | Binary Features | 0.795 | 0.753 | 0.921 | [missing] | 0.870 | 0.633 | [missing] |
| Logistic Regression with Lasso | PCA Features | 0.795 | 0.742 | 0.827 | 0.895 | 0.870 | 0.633 | 0.639 |
| K-Nearest Neighbors | Binary Features | 0.821 | 0.782 | 0.743 | 0.886 | 0.903 | 0.700 | 0.583 |
| K-Nearest Neighbors | PCA Features | 0.820 | 0.812 | 0.752 | 0.886 | 0.936 | 0.750 | 0.583 |
| Support Vector Machines | Binary Features | 0.795 | 0.657 | 0.805 | 0.870 | 0.862 | 0.528 | 0.583 |
| Support Vector Machines | PCA Features | 0.821 | 0.670 | 0.734 | 0.842 | 0.899 | 0.556 | 0.444 |
| Random Forest Classifier | Binary Features | [missing] | [missing] | [missing] | 0.881 | [missing] | [missing] | 0.555 |
| Random Forest Classifier | PCA Features | 0.744 | 0.570 | 0.779 | 0.794 | 0.903 | 0.444 | 0.194 |
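For readers unfamiliar with the per-class columns, the sketch below shows how precision, sensitivity (recall), F-score, and accuracy follow from predicted labels. The helper names and the toy labels are hypothetical. The source does not state how the reported F-score was averaged across classes or folds, so this example computes it per class only.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, sensitivity, and F-score, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "f_score": f_score}

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels only (0 = bacteria, 1 = fungi); not data from the study.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
bacteria = per_class_metrics(y_true, y_pred, positive=0)
fungi = per_class_metrics(y_true, y_pred, positive=1)
overall_accuracy = accuracy(y_true, y_pred)
```

With imbalanced classes, accuracy alone can look good while the minority class is poorly detected, which is why the fungi sensitivity column is worth reading alongside it.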
Figure 3. Receiver operating characteristic (ROC) curves depict the tradeoff between the true positive rate and the false positive rate for each algorithm (traces rising closer to the top left are better). The precision–recall curves (PRC) show the tradeoff between recall (true positive rate) and precision in positive predictions for each algorithm (traces closer to the top right are better). This figure displays the ROC and PRC curves for classifiers trained on the binary feature matrix and on the feature matrix after PCA.
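As a rough illustration of how an ROC curve and its AUC are obtained from classifier scores, the NumPy sketch below sweeps the decision threshold over the sorted scores and integrates with the trapezoidal rule. The function names and the toy scores are assumptions for illustration, not the study's outputs.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by sweeping the threshold from high to low."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s, kind="stable")
    y, s = y[order], s[order]
    tps = np.cumsum(y == 1)  # true positives accepted so far
    fps = np.cumsum(y == 0)  # false positives accepted so far
    # Keep only the last index of each run of tied scores.
    last = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    tpr = np.r_[0.0, tps[last] / tps[-1]]
    fpr = np.r_[0.0, fps[last] / fps[-1]]
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy scores only; 1 marks the positive class.
fpr, tpr = roc_curve_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(fpr, tpr)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking of positives above negatives.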
Figure 4. Detected VOCs at m/z 478.3883, m/z 592.2332, and m/z 666.2437 for all 3 samples of Proteus mirabilis (CDC-0029).
This table lists the specific bacteria and fungus species evaluated using SPME-DART-MS in this study.
| Pathogen | Strain |
|---|---|
| Bacteria | |
| Fungi | |
Figure 5. Principal component analysis on (a) the original VOC data and (b) the dataset after SMOTE upsampling. SMOTE is a dataset upsampling technique that was used in this study to add artificial data points generated via random convex combinations of existing data points in the feature space. This is useful when training ML algorithms on class-imbalanced datasets and in low-data regimes. In our study, the bacteria class (Class 0) had 30 samples and the fungi class (Class 1) had 9 samples. SMOTE was used to combat the limited dataset and class imbalance issues so that our ML algorithms could be trained with less bias.
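The convex-combination idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `smote`, the neighbor count `k=5`, and the random toy features are hypothetical. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (simplified SMOTE sketch).

    Requires at least two minority samples.
    """
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances within the minority class; exclude self-matches.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :k]
    # Pick a base sample and one of its k nearest neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    # Convex combination: base + lam * (neighbor - base), with lam in [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[nb] - X_min[base])

# Toy minority class: 9 samples (as for fungi), with 4 made-up features.
X_fungi = np.random.default_rng(1).normal(size=(9, 4))
X_synth = smote(X_fungi, n_new=21)  # 9 + 21 = 30, matching the bacteria count
```

Synthetic points of this kind should be generated from the training folds only, since upsampling before the train/test split would leak information into the evaluation.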