David Goodman-Meza, Akos Rudas, Jeffrey N Chiang, Paul C Adamson, Joseph Ebinger, Nancy Sun, Patrick Botting, Jennifer A Fulcher, Faysal G Saab, Rachel Brook, Eleazar Eskin, Ulzee An, Misagh Kordi, Brandon Jew, Brunilda Balliu, Zeyuan Chen, Brian L Hill, Elior Rahmani, Eran Halperin, Vladimir Manuel.
Abstract
Worldwide, testing capacity for SARS-CoV-2 is limited and bottlenecks in the scale-up of polymerase chain reaction (PCR)-based testing exist. Our aim was to develop and evaluate a machine learning algorithm to diagnose COVID-19 in the inpatient setting. The algorithm was based on basic demographic and laboratory features to serve as a screening tool at hospitals where testing is scarce or unavailable. We used retrospectively collected data from the UCLA Health System in Los Angeles, California. We included all emergency room or inpatient cases receiving SARS-CoV-2 PCR testing who also had a set of ancillary laboratory features (n = 1,455) between 1 March 2020 and 24 May 2020. We tested seven machine learning models and used a combination of those models for the final diagnostic classification. In the test set (n = 392), our combined model had an area under the receiver operator curve of 0.91 (95% confidence interval [CI] 0.87-0.96). The model achieved a sensitivity of 0.93 (95% CI 0.85-0.98) and a specificity of 0.64 (95% CI 0.58-0.69). We found that our machine learning algorithm had excellent diagnostic metrics compared to SARS-CoV-2 PCR. This ensemble machine learning algorithm to diagnose COVID-19 has the potential to be used as a screening tool in hospital settings where PCR testing is scarce or unavailable.
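For a screening tool, the reported sensitivity and specificity can be translated into predictive values once a prevalence is assumed. The following is a minimal sketch, not part of the original analysis: the function name `predictive_values` is hypothetical, and the prevalence is assumed to equal the positive fraction of the full cohort (182 of 1,455), which may differ from the prevalence in the test set or at another hospital.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' rule for a binary screening test."""
    tp = sensitivity * prevalence               # true-positive fraction
    fn = (1 - sensitivity) * prevalence         # false-negative fraction
    tn = specificity * (1 - prevalence)         # true-negative fraction
    fp = (1 - specificity) * (1 - prevalence)   # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)

# Sensitivity and specificity are the abstract's point estimates.
# ASSUMPTION: prevalence is the positive fraction of the full cohort.
ppv, npv = predictive_values(0.93, 0.64, 182 / 1455)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV = 0.27, NPV = 0.98
```

Under that assumed prevalence, a negative prediction would be correct about 98% of the time, while most positive predictions would still need confirmation. This pattern is consistent with using the model for screening and not as a replacement for PCR.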
Year: 2020 PMID: 32960917 PMCID: PMC7508387 DOI: 10.1371/journal.pone.0239474
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Diagram of eligible, included and excluded cases, and diagnostic cross tabulation.
Characteristics of cases by SARS-CoV-2 status.
| Characteristic | SARS-CoV-2 negative | SARS-CoV-2 positive | Total | p-value |
|---|---|---|---|---|
| Total | 1273 (87.5) | 182 (12.5) | 1455 | |
| Age, years, mean (SD) | 57.2 (22.6) | 64.2 (19.1) | 58.1 (22.3) | <0.001 |
| Gender | | | | 0.030 |
| Female | 610 (47.9) | 71 (39.0) | 681 (46.8) | |
| Male | 663 (52.1) | 111 (61.0) | 774 (53.2) | |
| Race/ethnicity | | | | 0.006 |
| Asian | 91 (7.1) | 16 (8.8) | 107 (7.4) | |
| Black | 156 (12.3) | 18 (9.9) | 174 (12.0) | |
| Latino | 281 (22.1) | 61 (33.5) | 342 (23.5) | |
| Other | 110 (8.6) | 17 (9.3) | 127 (8.7) | |
| White | 635 (49.9) | 70 (38.5) | 705 (48.5) | |
| Immunosuppressed + | 385 (30.2) | 35 (19.2) | 420 (28.9) | 0.003 |
| HIV | 17 (1.3) | 1 (0.5) | 18 (1.2) | 0.590 |
| Transplant | 180 (14.1) | 19 (10.4) | 199 (13.7) | 0.214 |
| Immunosuppressive medications | 312 (24.5) | 29 (15.9) | 341 (23.4) | 0.014 |
| Not immunosuppressed | 888 (69.8) | 147 (80.8) | 1035 (71.1) | |
| Hemoglobin, g/dl, mean (SD) a | 11.80 | 12.60 | 11.90 | <0.001 |
| Absolute neutrophil count x 10^3/uL, median (IQR) | 6.02 | 5.19 | 5.92 | 0.001 |
| Absolute lymphocyte count x 10^3/uL, median (IQR) | 1.22 | 0.96 | 1.18 | <0.001 |
| Neutrophil:lymphocyte ratio, median (IQR) | 4.81 | 5.21 | 4.88 | 0.112 |
| Absolute basophil count x 10^3/uL, median (IQR) | 0.03 | 0.01 | 0.03 | <0.001 |
| Absolute eosinophil count x 10^3/uL, median (IQR) | 0.08 | 0.01 | 0.07 | <0.001 |
| Absolute monocyte count x 10^3/uL, median (IQR) | 0.65 | 0.48 | 0.64 | <0.001 |
| Platelet count x 10^3/uL, mean (SD) b | 231 | 188 | 227 | <0.001 |
| C-reactive protein, mg/dl, mean (SD) c | 1.90 | 6.60 | 2.80 | <0.001 |
| Ferritin, ng/ml, mean (SD) d | 216 | 439 | 261 | <0.001 |
| Lactate dehydrogenase, U/L, mean (SD) e | 245 | 306 | 261 | <0.001 |
Abbreviations: IQR, interquartile range; SD, standard deviation.
Missing values (n, % of total): a hemoglobin 3 (0.2%); b platelets 6 (0.4%); c C-reactive protein 517 (35.5%); d ferritin 737 (50.6%); e lactate dehydrogenase 693 (47.6%).
+ We defined a case as immunosuppressed if it had an HIV diagnosis, a record of receipt of an organ transplant, or a record of an oral immunosuppressive medication taken prior to the SARS-CoV-2 test (e.g., prednisone, tacrolimus, mycophenolate, azathioprine, methotrexate).
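The p-values for the categorical rows can be checked from the counts in the table. The sketch below does this for the gender row, assuming a Pearson chi-square test with Yates' continuity correction; the source does not name the test used, and the helper `chi2_2x2` is a hypothetical name introduced here.

```python
import math

def chi2_2x2(a, b, c, d, yates=True):
    """Chi-square test on the 2x2 table [[a, b], [c, d]]; returns (statistic, p)."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                # Continuity correction (ASSUMPTION: the source may not use it).
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Counts from the table: female (negative, positive), male (negative, positive).
stat, p = chi2_2x2(610, 71, 663, 111)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

With the continuity correction the p-value comes out at roughly 0.03, in line with the 0.030 reported for gender; without it the value is somewhat smaller.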
Fig 2. Performance of the model on the held-out test set (N = 392).
A) Receiver operator curve. B) Precision-recall curve. At a sensitivity-optimized operating threshold, sensitivity and specificity were 0.93 (95% CI 0.85–0.98) and 0.64 (95% CI 0.58–0.69), respectively. Red solid lines are the mean receiver operator curve and mean precision-recall curve, respectively; the purple shaded lines are the curves obtained from the bootstrapping procedure used to calculate the 95% confidence intervals.
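The caption mentions a bootstrapping procedure for the 95% confidence intervals without giving its details. One common variant, shown here only as a sketch, is the percentile bootstrap: resample cases with replacement, recompute the metric each time, and take the 2.5th and 97.5th percentiles. The labels and predictions below are toy values invented for illustration (not the study's data), and all function names are hypothetical.

```python
import random

def sens_spec(y_true, y_pred):
    """Sensitivity and specificity for binary labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def percentile_interval(samples, alpha=0.05):
    """Percentile interval covering 1 - alpha of the bootstrap samples."""
    ordered = sorted(samples)
    lo = ordered[int(alpha / 2 * len(ordered))]
    hi = ordered[min(int((1 - alpha / 2) * len(ordered)), len(ordered) - 1)]
    return lo, hi

def bootstrap_ci(y_true, y_pred, n_boot=1000, seed=0):
    """Percentile-bootstrap CIs for sensitivity and specificity."""
    rng = random.Random(seed)
    n = len(y_true)
    sens_samples, spec_samples = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        if 0 < sum(yt) < n:  # skip resamples that lack one of the classes
            s, c = sens_spec(yt, yp)
            sens_samples.append(s)
            spec_samples.append(c)
    return percentile_interval(sens_samples), percentile_interval(spec_samples)

# TOY DATA (hypothetical): 40 positives with 36 detected, 160 negatives with 112 cleared.
y_true = [1] * 40 + [0] * 160
y_pred = [1] * 36 + [0] * 4 + [0] * 112 + [1] * 48
sens, spec = sens_spec(y_true, y_pred)
(sens_lo, sens_hi), (spec_lo, spec_hi) = bootstrap_ci(y_true, y_pred)
print(f"sensitivity {sens:.2f} ({sens_lo:.2f}-{sens_hi:.2f})")
print(f"specificity {spec:.2f} ({spec_lo:.2f}-{spec_hi:.2f})")
```

The same resampling loop could recompute a full receiver operator curve per resample, which is presumably how the shaded curves in the figure were obtained.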
Fig 3. Combined model feature importance.
Decrease in model performance (F1-score) after randomly shuffling the respective feature values. Larger decreases indicate features that are more important for classification. Abbreviations: LDH, lactate dehydrogenase; NLR, neutrophil to lymphocyte ratio; RBC, red blood cells.
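The shuffling procedure described in this caption is usually called permutation importance. The sketch below illustrates the idea on a toy classifier and toy data, both invented here; it is not the authors' code, and names such as `permutation_importance` and `toy_predict` are hypothetical.

```python
import random

def f1_score(y_true, y_pred):
    """F1-score for binary labels coded 0/1 (0.0 when there are no true positives)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in F1 after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = f1_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - f1_score(y, [predict(r) for r in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# TOY SETUP (hypothetical): feature 0 determines the label, feature 1 is ignored.
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0

X = [[i / 100, (i * 37 % 100) / 100] for i in range(100)]
y = [toy_predict(row) for row in X]
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Shuffling the ignored feature leaves the F1-score unchanged (importance 0), while shuffling the informative feature lowers it, which is how the bars in the figure should be read.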
Fig 4. Performance of models while removing one of the features.
All analyses were performed on the held-out test set (N = 392). A) Receiver operating curve. B) Precision-recall curve. Base model includes only demographic features and complete blood cell count. Abbreviations: CRP, C-reactive protein; LDH, lactate dehydrogenase.