Hooman H Rashidi1, John Pepper2, Taylor Howard3, Karina Klein4, Larissa May4, Samer Albahra1, Brett Phinney5, Michelle R Salemi5, Nam K Tran3.
Abstract
The 2019 novel coronavirus infectious disease (COVID-19) pandemic has resulted in an unsustainable need for diagnostic tests. Currently, molecular tests are the accepted standard for the detection of SARS-CoV-2. Mass spectrometry (MS) enhanced by machine learning (ML) has recently been postulated to serve as a rapid, high-throughput, and low-cost alternative to molecular methods. Automated ML is a novel approach that could move mass spectrometry techniques beyond the confines of traditional laboratory settings. However, it remains unknown how different automated ML platforms perform for COVID-19 MS analysis. To this end, the goal of our study is to compare algorithms produced by two commercial automated ML platforms (Platforms A and B). Our study consisted of MS data derived from 361 subjects with molecular confirmation of COVID-19 status, including SARS-CoV-2 variants. The top optimized ML models with respect to positive percent agreement (PPA) within Platforms A and B exhibited accuracies of 94.9% and 91.8%, each with a PPA of 100%, and negative percent agreements (NPAs) of 93% and 89%, respectively. These results illustrate the MS method's robustness against SARS-CoV-2 variants and highlight similarities and differences in automated ML platforms in producing optimal predictive algorithms for a given dataset.
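The agreement metrics quoted in the abstract (accuracy, PPA, NPA) reduce to simple confusion-matrix arithmetic against the molecular comparator. A minimal sketch, using hypothetical counts chosen only for illustration (not the study's data):

```python
def agreement_metrics(tp, fp, fn, tn):
    """Accuracy, positive percent agreement (PPA, i.e. sensitivity versus the
    molecular comparator) and negative percent agreement (NPA, i.e.
    specificity) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    ppa = tp / (tp + fn)  # agreement on comparator-positive samples
    npa = tn / (tn + fp)  # agreement on comparator-negative samples
    return accuracy, ppa, npa

# Hypothetical counts for illustration only
acc, ppa, npa = agreement_metrics(tp=50, fp=10, fn=0, tn=134)
print(f"accuracy={acc:.1%} PPA={ppa:.1%} NPA={npa:.1%}")
```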
Year: 2022 PMID: 35905092 PMCID: PMC9337631 DOI: 10.1371/journal.pone.0263954
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1Comparison of molecular, antigen, and ML-enhanced MALDI-TOF-MS workflows.
The figure illustrates the tradeoffs of (A) high throughput molecular platforms, (B) point-of-care molecular platforms, and (C) our proposed MALDI-TOF-MS method.
Fig 2Study datasets.
Datasets A and B were obtained from two different time points (before and after COVID-19 vaccine emergency use authorization). Combined, Datasets A and B totaled 361 asymptomatic and symptomatic patients. These data were randomly divided into Datasets C and D, with Dataset C serving as the training/initial validation dataset. Optimized models produced from Dataset C were then secondarily tested with Dataset D to assess their true generalization performance. Notably, Datasets C and D both contained random samples of the various negative subgroups (negative vaccinated individuals, emergency department [ED] patients, etc.).
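The random split described in the caption can be illustrated with a stratified partition, so that each negative subgroup appears in both Datasets C and D. A sketch with synthetic placeholder data (the subgroup labels, feature count, and 70/30 split ratio are illustrative assumptions, not the study's settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 361-subject cohort: rows are spectra,
# strata mark COVID-19 status plus negative-subgroup membership.
n = 361
X = rng.normal(size=(n, 20))  # placeholder MS features
strata = rng.choice(
    ["positive", "neg_vaccinated", "neg_ed", "neg_other"], size=n
)

# Dataset C (training / initial validation) and Dataset D (generalization),
# stratified so every subgroup is represented in both partitions.
X_c, X_d, s_c, s_d = train_test_split(
    X, strata, test_size=0.3, stratify=strata, random_state=0
)
```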
Comparison of AutoML specifications.
| Specification | Platform A | Platform B |
|---|---|---|
| Manufacturer (Location) | MILO-ML, LLC (Sacramento, CA) | Microsoft Corporation (Redmond, WA) |
| Interface | GUI | ML.NET |
| Machine Learning Methods | k-NN, GBM, LR, MLP-NN, NB, RF, SVM | AP-NN, BDT, LBFGS, LGBM, linear SVM, LR, SDCA, SGD |
| Automated Data Feature Selector | ANOVA F select percentile feature selector and RF Feature Importances Selector | None |
Abbreviations: ANOVA, analysis of variance; AP, averaged perceptron; BDT, boosted decision tree; GBM, gradient boosting machine; GUI, graphical user interface; k-NN, k-nearest neighbors; LBFGS, limited memory Broyden-Fletcher-Goldfarb-Shanno; LGBM, lightGBM; LR, logistic regression; MLP, multilayer perceptron; NB, naïve Bayes; NN, neural network; RF, random forest; SDCA, stochastic dual coordinate ascent; SGD, stochastic gradient descent; SVM, support vector machine.
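Platform A's two automated feature selectors (ANOVA F select-percentile and random-forest feature importances) can be illustrated with standard scikit-learn components; equivalence to the platform's internals is an assumption, and the synthetic data and 25% cutoff here are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in dataset: 200 samples, 100 spectral features
X, y = make_classification(n_samples=200, n_features=100, random_state=0)

# ANOVA F-test select-percentile: keep the top 25% of features by F score
anova_sel = SelectPercentile(f_classif, percentile=25).fit(X, y)
X_anova = anova_sel.transform(X)

# Random-forest importance ranking: keep the top 25% by impurity importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_k = X.shape[1] // 4
keep = np.argsort(rf.feature_importances_)[::-1][:top_k]
X_rf = X[:, keep]
```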
Study population.
| Characteristic | Dataset A (n = 125) | Dataset B (n = 236) | p-value |
|---|---|---|---|
| Mean (SD) Age (Years) | 42.7 (15.3) | 40.9 (16.1) | NS |
| Percent Symptomatic | 88.9% (111/125) | N/A | <0.001 |
| Percent Vaccinated | 0% (0/125) | 31.4% (74/236) | <0.001 |
| Mean (SD) Viral RNA Load (copies/mL) | | | |
| | 37,549.7 (19,236.2) | N/A | N/A |
| | 779.0 (237.3) | N/A | N/A |
Abbreviations: COVID-19, novel coronavirus infectious disease 2019; N/A, not applicable; NS, not significant; RNA, ribonucleic acid; SD, standard deviation.
Machine learning algorithm generalization performance for top models produced by Platforms A and B.
Platform A
| Model | Accuracy % (95% CI) | AUROC % (95% CI) | PPA % (95% CI) | NPA % (95% CI) | F1 Score (%) | Features Used |
|---|---|---|---|---|---|---|
| LBFGS-Logistic Regression | 92.8 (88.2–96.0) | 98.9 (81.9–100) | 100 (92.9–100) | 90.3 (84.2–94.6) | 91.3 | All |
| k-Nearest Neighbor | 92.3 (87.6–95.6) | 96.9 (60.1–100) | 100 (92.9–100) | 89.6 (83.4–94.1) | 90.7 | 25% |
| Naïve Bayes | 91.7 (86.9–95.2) | 99.2 (84.8–100) | 100 (92.9–100) | 88.9 (82.6–93.5) | 90.2 | All |
| Random Forest | 95.4 (91.4–97.9) | 98.1 (83.3–100) | 92.0 (80.8–97.7) | 96.5 (92.1–98.9) | 93.9 | All |
| Support Vector Machine | 93.3 (88.8–96.4) | 98.6 (86.8–100) | 100 (92.9–100) | 91.0 (85.1–95.1) | 91.9 | 75% |
| Neural Network-Multi Layer Perceptron | 94.9 (90.7–97.5) | 99.6 (84.9–100) | 100 (92.9–100) | 93.1 (87.6–96.6) | 92.5 | All |
| Gradient Boosting Machine (XGBoost) | 93.8 (89.4–96.8) | 98.3 (82.0–100) | 94.0 (83.5–98.7) | 93.8 (88.5–97.1) | 92.2 | All |
Platform B
| Model | Accuracy % (95% CI) | AUROC %** | PPA % (95% CI) | NPA % (95% CI) | F1 Score (%) | Features Used |
|---|---|---|---|---|---|---|
| Fast Tree | 87.1 (81.6–91.5) | 98.0 | 98.0 (89.4–99.9) | 83.3 (76.2–89.0) | 79.7 | All |
| Fast Forest | 86.6 (80.9–91.1) | 96.9 | 92.0 (80.8–97.8) | 84.7 (77.8–90.2) | 78.0 | All |
| Gradient Boosting Machine (light) | 86.1 (80.4–90.6) | 98.3 | 98.0 (89.4–99.9) | 81.9 (74.7–87.9) | 78.4 | All |
| Support Vector Machine | 95.4 (91.4–97.9) | 99.5 | 98.0 (89.4–99.9) | 94.4 (89.4–97.6) | 91.6 | All |
| SDCA-Logistic Regression | 91.8 (86.9–95.2) | 99.4 | 100 (92.9–100) | 88.9 (82.6–93.5) | 86.2 | All |
| LBFGS-Logistic Regression | 90.7 (85.7–94.4) | 99.3 | 100 (92.9–100) | 87.5 (80.9–92.4) | 84.8 | All |
| SGD-Calibrated | 91.2 (86.3–94.8) | 99.1 | 98.0 (89.4–99.9) | 88.9 (82.6–93.5) | 85.2 | All |
| Symbolic SGD-Logistic Regression | 85.6 (79.8–90.2) | 97.1 | 92.0 (80.8–97.8) | 83.3 (76.2–89.0) | 76.7 | All |
| Averaged Perceptron | 89.2 (83.9–93.2) | 98.7 | 98.0 (89.4–99.9) | 86.1 (79.4–91.3) | 82.4 | All |
**95% CI is not reported by the Microsoft AutoML platform for the calculated AUROC.
*All features were used on PCA-transformed data.
# 25% of the features were selected by an ANOVA-based select-percentile feature selection approach.
## 75% of the features were selected by a Random Forest-based feature importances selection approach.
SGD (Stochastic Gradient Descent)
SDCA (Stochastic Dual Coordinate Ascent)
LBFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno)
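The binomial confidence intervals in the table appear consistent with exact (Clopper-Pearson) intervals, though the computation method is not restated here. A sketch under that assumption, using the beta-distribution formulation (scipy assumed available):

```python
from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided binomial confidence interval,
    via the beta-distribution quantile formulation."""
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, n - successes + 1)
    hi = 1.0 if successes == n else beta.ppf(1 - alpha / 2, successes + 1, n - successes)
    return lo, hi

# Example: an observed agreement of 50/50 yields an interval whose lower
# bound is 0.025 ** (1/50), roughly 0.93, with an upper bound of 1.0.
lo, hi = clopper_pearson(50, 50)
```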