Patrick Schwab, August DuMont Schütte, Benedikt Dietz, Stefan Bauer.
Abstract
BACKGROUND: COVID-19 is a rapidly emerging respiratory disease caused by SARS-CoV-2. Due to the rapid human-to-human transmission of SARS-CoV-2, many health care systems are at risk of exceeding their health care capacities, in particular in terms of SARS-CoV-2 tests, hospital and intensive care unit (ICU) beds, and mechanical ventilators. Predictive algorithms could potentially ease the strain on health care systems by identifying those who are most likely to receive a positive SARS-CoV-2 test, be hospitalized, or admitted to the ICU.
Keywords: COVID-19; SARS-CoV-2; clinical data; clinical prediction; hospitalization; infectious disease; intensive care; machine learning; prediction; testing
Year: 2020 PMID: 32976111 PMCID: PMC7541040 DOI: 10.2196/21439
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. We study the use of predictive models (light purple) to estimate whether patients are likely (i) to be SARS-CoV-2 positive and whether SARS-CoV-2 positive patients are likely (ii) to be admitted to the hospital and (iii) to require critical care, based on clinical, demographic, and blood analysis data. Accurate clinical predictive models stratify patients according to individual risk and, in this manner, help prioritize health care resources such as testing, hospital, and critical care capacity.
Figure 2. The presented multistage machine learning pipeline consists of preprocessing (light purple) the input data x, developing multiple candidate models using the given data set (orange), selecting the best candidate model for evaluation (blue), and evaluating the selected best model's outputs ŷ.
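The select-best-on-validation stage of such a pipeline can be sketched in a few lines; a minimal illustration, not the paper's code (the candidate models and the rank-based AUC helper are illustrative stand-ins):

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based (Mann-Whitney) AUC; assumes no tied scores for brevity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def select_best(candidates, X_val, y_val):
    """Pick the candidate with the highest validation AUC (model selection stage)."""
    scored = {name: auc(y_val, model(X_val)) for name, model in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored

# Toy stand-ins for trained candidate models: each maps features to risk scores.
candidates = {
    "negated_feature": lambda X: -X[:, 0],
    "first_feature": lambda X: X[:, 0],
}
X_val = np.array([[0.1], [0.4], [0.35], [0.8]])
y_val = np.array([0, 0, 1, 1])
best, scores = select_best(candidates, X_val, y_val)
```

Evaluating every candidate on a held-out validation fold, and only the single selected model on the test fold, keeps the test-set estimates unbiased by the search.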
Training, validation, and test fold statistics for all patients and patients who are SARS-CoV-2 positive.

| Property | Training | Validation | Test |
| --- | --- | --- | --- |
| All patients | | | |
| Patients (N=5644), n (%) | 2822 (50) | 1129 (20) | 1693 (30) |
| SARS-CoV-2 (%) | 9.85 | 9.92 | 9.92 |
| Admission (%) | 1.42 | 1.33 | 1.42 |
| ICUa (%) | 1.59 | 1.68 | 1.59 |
| Age (20-quantiles)b | 9.0 (1.0, 17.0) | 9.0 (1.0, 18.0) | 9.0 (2.0, 17.0) |
| Patients who are SARS-CoV-2 positive | | | |
| Patients (n=558), n (%) | 279 (50) | 112 (20) | 167 (30) |
| SARS-CoV-2 (%) | 100 | 100 | 100 |
| Admission (%) | 6.45 | 6.25 | 6.59 |
| ICU (%) | 2.87 | 2.68 | 2.99 |
| Age (20-quantiles)b | 10.0 (4.0, 17.0) | 11.5 (4.5, 18.5) | 10.0 (4.0, 17.5) |

aICU: intensive care unit.
bPatient ages are specified in 20-quantiles to maintain patient privacy (10th and 90th percentiles in parentheses).
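The 50/20/30 fold proportions with near-identical outcome prevalence across folds, as shown above, are what a stratified split produces; a minimal sketch (function name and seed are illustrative, not from the paper):

```python
import numpy as np

def stratified_split(y, fracs=(0.5, 0.2, 0.3), seed=0):
    """Split indices into train/val/test folds, preserving label prevalence per fold."""
    rng = np.random.default_rng(seed)
    folds = {"train": [], "val": [], "test": []}
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        n_train = round(fracs[0] * len(members))
        n_val = round(fracs[1] * len(members))
        folds["train"].extend(members[:n_train])
        folds["val"].extend(members[n_train:n_train + n_val])
        folds["test"].extend(members[n_train + n_val:])
    return {k: np.array(sorted(v)) for k, v in folds.items()}

# 90 negatives + 10 positives -> each fold keeps ~10% positives,
# mirroring the roughly constant SARS-CoV-2 prevalence across folds.
y = np.array([0] * 90 + [1] * 10)
folds = stratified_split(y)
```

Stratifying per label matters here because the outcomes are rare (about 10% positive, and far fewer admissions); a plain random split could leave a fold with almost no positive cases.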
Hyperparameter ranges used for hyperparameter optimization of logistic regression, neural network, random forest, support vector machine, and gradient boosting models for all tasks.

| Model and hyperparameter | Range/choicesa |
| --- | --- |
| Logistic regression | |
| Regularization strength | 0.01, 0.1, 1.0, 10.0 |
| Neural network | |
| Number of hidden units | 16, 32, 64, 128 |
| Number of hidden layers | 1, 2, 3 |
| Activation | ReLUb, SELUc, ELUd |
| Batch size | 16, 32, 64, 128 |
| L2 regularization | 0.0, 0.00001, 0.0001 |
| Learning rate | 0.003, 0.03 |
| Dropout percentage | (0%-25%) |
| Random forest | |
| Tree depth D | 3, 4, 5 |
| Number of trees T | 32, 64, 128, 256 |
| Support vector machine | |
| Regularization strength C | 0.01, 0.1, 1.0, 10.0 |
| Kernel k | polynomial, radial basis function, sigmoid |
| Polynomial degree d | 3, 5, 7 |
| Gradient boosting | |
| Subsample ratio r | 0.25, 0.5, 0.75, 1.0 |
| Maxe tree depth T | 2, 3, 4, 5, 6, 7, 8 |
| Minf partition loss | 0.0, 0.1, 1.0, 10.0 |
| Learning rate | 0.003, 0.03, 0.3, 0.5 |
| L1 regularization | 1.0, 0.1, 0.001, 0.0 |
| L2 regularization | 1.0, 0.1, 0.001, 0.0 |
| Numg boosting rounds B | 5, 10, 15, 20 |

aParentheses indicate continuous ranges within the indicated limits sampled uniformly. Comma-delimited lists indicate discrete choices with equal selection probability.
bReLU: rectified linear unit.
cSELU: scaled exponential linear unit.
dELU: exponential linear unit.
eMax: maximum.
fMin: minimum.
gNum: number.
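Footnote a describes a mixed discrete/continuous search space. Drawing one candidate configuration for the neural network rows could look like the following sketch (names are illustrative; the paper does not specify its sampling code):

```python
import random

# Discrete choices taken from the neural network rows of the hyperparameter table.
NN_SPACE = {
    "hidden_units": [16, 32, 64, 128],
    "hidden_layers": [1, 2, 3],
    "activation": ["ReLU", "SELU", "ELU"],
    "batch_size": [16, 32, 64, 128],
    "l2": [0.0, 0.00001, 0.0001],
    "learning_rate": [0.003, 0.03],
}

def sample_config(rng):
    """Draw one configuration: uniform over discrete choices (equal selection
    probability), uniform over the continuous dropout range, per footnote a."""
    cfg = {k: rng.choice(v) for k, v in NN_SPACE.items()}
    cfg["dropout"] = rng.uniform(0.0, 0.25)  # continuous (0%-25%) range
    return cfg

cfg = sample_config(random.Random(0))
```

Each sampled configuration becomes one candidate model in the pipeline of Figure 2; the candidate with the best validation performance is kept.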
Comparison of LR, NN, RF, SVM, and XGB models in terms of AUC, AUPR, sensitivity, specificity, and Spec@95%Sens for predicting SARS-CoV-2 test results, hospital admission for patients who are SARS-CoV-2 positive, and intensive care unit admission for patients who are SARS-CoV-2 positive on the test set cohort. Some cells were lost in extraction and are left blank.

| Model | AUCa (95% CI)b | AUPRc (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | Spec@95%Sensd (95% CI) |
| --- | --- | --- | --- | --- | --- |
| SARS-CoV-2 test result prediction | | | | | |
| XGBe | | 0.21 (0.15-0.28) | | | 0.49 (0.46-0.51) |
| RFg | 0.65 (0.62-0.69)h | 0.19 (0.14-0.24)h | 0.69 (0.61-0.74) | 0.54 (0.46-0.57)h | 0.19 (0.10-0.25)h |
| NNi | 0.62 (0.57-0.65)h | | 0.60 (0.52-0.67)h | 0.55 (0.46-0.58)h | 0.17 (0.14-0.28)h |
| LRj | 0.61 (0.57-0.65)h | 0.17 (0.13-0.24)h | 0.58 (0.51-0.65)h | 0.55 (0.46-0.57)h | 0.19 (0.16-0.25)h |
| SVMk | 0.61 (0.57-0.65)h | 0.21 (0.15-0.27) | 0.57 (0.51-0.64)h | | 0.14 (0.06-0.16)h |
| Hospital admission prediction (patients who are SARS-CoV-2 positive) | | | | | |
| RF | | 0.43 (0.19-0.81) | 0.55 (0.19-0.85) | | |
| XGB | 0.91 (0.80-0.98) | 0.64 (0.43-0.95)h | | 0.94 (0.90-0.97)h | 0.00 (0.00-0.94)h |
| LR | 0.88 (0.70-0.98)h | 0.44 (0.18-0.83) | | 0.85 (0.79-0.90)h | 0.13 (0.08-0.93)h |
| NN | 0.85 (0.68-0.97)h | 0.31 (0.13-0.66)h | 0.64 (0.33-1.00)h | 0.95 (0.91-0.97)h | 0.11 (0.06-0.93)h |
| SVM | 0.85 (0.70-0.98)h | 0.35 (0.17-0.77)h | 0.64 (0.30-1.00)h | 0.95 (0.91-0.97)h | 0.21 (0.15-0.96)h |
| ICU admission prediction (patients who are SARS-CoV-2 positive) | | | | | |
| SVM | | 0.53 (0.14-1.00) | | 0.96 (0.92-0.98) | |
| LR | 0.93 (0.89-0.96) | | | 0.91 (0.87-1.00)h | |
| NN | 0.97 (0.94-0.99)h | 0.35 (0.10-0.88)h | 0.95 (0.91-0.99)h | 0.94 (0.90-0.99) | |
| RF | 0.97 (0.92-1.00) | 0.56 (0.13-1.00)h | 0.60 (0.15-1.00)h | 0.90 (0.86-1.00)h | |
| XGB | 0.67 (0.53-0.98)h | 0.29 (0.01-0.68)h | 0.40 (0.00-1.00)h | 0.94 (0.91-0.97)h | 0.00 (0.00-0.96)h |

aAUC: area under the receiver operator characteristic curve.
b95% CIs obtained via bootstrap resampling with 100 samples.
cAUPR: area under the precision-recall curve.
dSpec@95%Sens: specificity at greater than 95% sensitivity.
eXGB: gradient boosting.
fItalics represent the best results.
gRF: random forest.
hSignificantly different at P<.05 (t test) from the model with the highest predictive performance in terms of AUC.
iNN: neural network.
jLR: logistic regression.
kSVM: support vector machine.
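The Spec@95%Sens metric and the bootstrap confidence intervals described in the footnotes can be computed as follows; a plain-NumPy sketch, assuming resampling with replacement over the evaluation set (helper names are illustrative):

```python
import numpy as np

def spec_at_sens(y, scores, min_sens=0.95):
    """Specificity at the highest threshold whose sensitivity reaches min_sens."""
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        sens = (pred & (y == 1)).sum() / (y == 1).sum()
        if sens >= min_sens:
            return ((~pred) & (y == 0)).sum() / (y == 0).sum()
    return 0.0

def bootstrap_ci(metric, y, scores, n_boot=100, alpha=0.05, seed=0):
    """95% CI via bootstrap resampling with 100 samples, as in footnote b."""
    rng = np.random.default_rng(seed)
    stats = []
    while len(stats) < n_boot:
        i = rng.integers(0, len(y), len(y))
        if len(np.unique(y[i])) == 2:  # keep only resamples with both classes
            stats.append(metric(y[i], scores[i]))
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))

y = np.array([0, 0, 0, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.8, 0.9])  # a perfectly separating score
```

Spec@95%Sens is the operationally relevant number when a screening model must miss at most 5% of true positives: it states how much testing capacity could still be saved at that sensitivity.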
Figure 3. A comparison of the top 10 features ranked by relative feature importance scores for the best-encountered model for predicting SARS-CoV-2 test results (gradient boosting, top), hospital admissions (random forest, middle), and critical care admission for patients who are SARS-CoV-2 positive (support vector machine, bottom), respectively. The bar length corresponds to the relative marginal importance (in %) of the displayed features toward the predictive performance of the respective model. Feature names that include “MISSING” indicate that the given marginal contribution refers to the importance of that feature being absent, rather than of the feature's value itself.
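Relative marginal importance scores of the kind shown in the figure can be approximated with permutation importance: shuffle one feature at a time and measure the drop in model performance. A sketch under that assumption (the paper's exact attribution method may differ):

```python
import numpy as np

def permutation_importance(score_fn, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn,
    normalized to relative importance in percent."""
    rng = np.random.default_rng(seed)
    base = metric(y, score_fn(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops[j] += (base - metric(y, score_fn(Xp))) / n_repeats
    return drops / max(drops.sum(), 1e-12) * 100

# Toy model that only uses column 0; column 1 should get 0% importance.
X = np.array([[0.0, 5.0], [1.0, 6.0], [2.0, 7.0], [3.0, 8.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
score_fn = lambda X: X[:, 0] / 3.0
neg_mse = lambda y, s: -np.mean((y - s) ** 2)
imp = permutation_importance(score_fn, X, y, neg_mse)
```

The same idea extends to "MISSING" indicator columns: permuting the indicator measures how much the model relies on a value being absent, separately from the value itself.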
Figure 4. Receiver operator characteristic curves for the best-encountered model for predicting SARS-CoV-2 test results (gradient boosting, left), hospital admissions for patients who are SARS-CoV-2 positive (random forest, top right), and critical care admissions for patients who are SARS-CoV-2 positive (support vector machine, bottom right). Numbers in the bottom right of each subgraph show the respective model's AUC. Solid dots on the curves indicate operating thresholds selected on the validation fold. AUC: area under the receiver operator characteristic curve.
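Selecting an operating threshold on the validation fold, as the solid dots indicate, means fixing a score cutoff before the test data are seen. A sketch assuming Youden's J as the selection criterion (the record does not state the exact rule used):

```python
import numpy as np

def pick_threshold(y_val, s_val):
    """Choose the cutoff maximizing Youden's J = sensitivity + specificity - 1
    on the validation fold; the fixed cutoff is then applied to the test fold."""
    best_t, best_j = None, -np.inf
    for t in np.unique(s_val):
        pred = s_val >= t
        sens = (pred & (y_val == 1)).sum() / (y_val == 1).sum()
        spec = ((~pred) & (y_val == 0)).sum() / (y_val == 0).sum()
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t

y_val = np.array([0, 0, 1, 1])
s_val = np.array([0.1, 0.2, 0.8, 0.9])
t = pick_threshold(y_val, s_val)  # separating scores -> cutoff 0.8
```

In a capacity-constrained setting one would instead pick the threshold meeting a target sensitivity (for example the 95% used by Spec@95%Sens); only the selection data, not the criterion, is the essential point.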