Maxim Edelson1, Tsung-Ting Kuo2.
Abstract
Objective: Predicting coronavirus disease 2019 (COVID-19) mortality for patients is critical for early-stage care and intervention. Existing studies mainly built models on datasets with limited geographical range or size. In this study, we developed COVID-19 mortality prediction models on worldwide, large-scale "sparse" data and on a "dense" subset of the data.
Keywords: COVID-19; coronavirus; data mining; machine learning; predictive modeling
Year: 2022 PMID: 35663116 PMCID: PMC9129227 DOI: 10.1093/jamiaopen/ooac036
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. COVID-19 patient data included in this study. (A) The original dataset contained n = 2 676 403 patients. We kept n = 2 567 823 patients after discarding all observations without a valid COVID-19 confirmation date. (B) The data breakdown of the “sparse” dataset with n = 104 047 patients. (C) The data breakdown for the “dense” dataset with n = 6893. (D) The inclusion requirements for the dense dataset. The “Death or Discharge Date” field (*) has no death or discharge indication and is just a date.
Figure 2. Overview of our study’s workflow. (A) We started by preprocessing the original dataset from 33 fields down to the 12 most important and relevant fields. From these 12 remaining fields, we extracted 55 features. (B) We then split the dataset to obtain 90% training data. (C) Next, we performed 10-fold cross-validation with the training data by feeding our data to our 6 classifiers. (D) We calibrated our models using the first 5% of the holdout data. (E) Finally, we evaluated our calibrated models using the second 5% of the holdout data.
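The split described in this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_90_5_5` and `k_fold` are hypothetical, the rows are shuffled with a fixed seed (the caption does not say how rows were ordered before splitting), and the row count is the dense dataset size from Figure 1.

```python
import random


def split_90_5_5(n_rows, seed=0):
    """Shuffle row indices, then cut them into 90% training,
    5% calibration, and 5% evaluation (assumed random split)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = n_rows * 9 // 10            # integer arithmetic avoids float rounding
    n_calib = (n_rows - n_train) // 2     # first half of the 10% holdout
    train = indices[:n_train]
    calib = indices[n_train:n_train + n_calib]
    evaluation = indices[n_train + n_calib:]
    return train, calib, evaluation


def k_fold(indices, k=10):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate


# Dense dataset size reported in Figure 1
train, calib, evaluation = split_90_5_5(6893)
folds = list(k_fold(train, k=10))
```

Each classifier would be fitted on `fit` and scored on `validate` for every fold, then calibrated on `calib` and evaluated once on `evaluation`.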
The 12 relevant fields and statistics of our data for both the sparse and dense datasets
| Nos. | Field | Description | Data type | No. of possible values (NOM), or range of values (NUM/DAT) | Missing value percentage (%), sparse | Missing value percentage (%), dense |
|---|---|---|---|---|---|---|
| 1 | Outcome | Patient outcome from COVID-19 (deceased = 1 or discharged = 0) | NOM | 2 | 0.0 | 0.0 |
| 2 | Age | Age of the patient in years | NUM | 0–101 | 94.5 | 18.9 |
| 3 | Sex | Sex of the patient (male, female, unreported) | NOM | 3 | 93.4 | 0.2 |
| 4 | Chronic disease flag | Binary flag for whether the patient has chronic diseases (true, false) | NOM | 2 | 0.0 | 0.0 |
| 5 | Chronic diseases | List of reported chronic diseases (asthma, chronic kidney disease, diabetes, and hypertension) | NOM | 4 | 99.9 | 98.5 |
| 6 | Symptoms | List of symptoms the patient experienced | NOM | 10 | 99.8 | 97.4 |
| 7 | Country | Name of country in which the case was reported | NOM | 20 | 0.0 | 0.0 |
| 8 | Date confirmation | Date when patient was confirmed to have COVID-19 | DAT | January 2, 2020–June 3, 2020 | 0.0 | 0.0 |
| 9 | Date of onset symptoms | Date when patient began reporting symptoms | DAT | January 2, 2020–May 27, 2020 | 96.6 | 49.4 |
| 10 | Date of admission hospital | Date when patient was recorded to be hospitalized | DAT | January 2, 2020–April 5, 2020 | 99.8 | 96.5 |
| 11 | Date of death or discharge | Date when death or discharge of the patient was reported (only contains a date without revealing outcome information) | DAT | January 2, 2020–June 4, 2020 | 98.9 | 83.6 |
| 12 | Travel history dates | Recorded travel dates to a location | DAT | January 3, 2020–April 3, 2020 | 99.8 | 97.0 |
Notes: The field names and descriptions are adapted from the original dataset. We only enumerate the possible values of nominal fields that have fewer than 10 values in total.
NUM: Numerical, NOM: Nominal, DAT: Date. Dates are given in YYYY/MM/DD format.
Figure 3. The performance of our 6 classifiers with AUC as the evaluation metric. AB outperformed the other 5 classifiers for the sparse dataset and LR was the best performer when trained on the dense dataset. The precision and recall for each result are provided near each respective AUC result, with “P” being the precision and “R” being the recall. We used the default decision threshold of 0.5 when computing the precision and recall values. The classifier abbreviations are as follows: LR: logistic regression; SVM: support vector machine; RF: random forest; MLP: multi-layer perceptron; AB: AdaBoost; NB: Naive Bayes.
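The three metrics in this figure can be illustrated with a small, dependency-free sketch. The function names and the toy labels and probabilities below are made up for illustration and are not taken from the study; treating a predicted probability equal to the threshold as a positive prediction is also an assumption.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall for the positive class (deceased = 1)
    at a fixed decision threshold."""
    predicted = [p >= threshold for p in y_prob]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auc(y_true, y_prob):
    """AUC as the probability that a random positive is scored above
    a random negative; ties count as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = deceased, 0 = discharged
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.4, 0.6, 0.7, 0.3, 0.1]
precision, recall = precision_recall(y_true, y_prob)
area = auc(y_true, y_prob)
```

Unlike precision and recall, the AUC does not depend on the 0.5 threshold, which is why the two kinds of numbers can rank classifiers differently.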
The best hyper-parameter combinations for each of the 6 classifiers on both the sparse and dense datasets
| Classifier | Hyper-parameter | Best sparse data combination | Best dense data combination |
|---|---|---|---|
| LR | Ridge | 10³ | 10² |
| SVM | Cost | 2⁻⁵ | 2⁻⁵ |
| RF | Number of attributes | | |
| | Sample size | 50% | 50% |
| | Number of trees | 100 | 175 |
| MLP | Momentum | 0.1 | 0.3 |
| | Number of epochs | 750 | 500 |
| AB | Weight threshold | 100 | 100 |
| | Number of iterations | 70 | 20 |
| | Resampling for boosting | True | True |
| | Base classifier | J48 | J48 |
| NB | Kernel estimator | True | False |
| | Supervised discretization | False | True |
Note: m is the number of attributes.
The top 10 most important features using both (a) sparse and (b) dense datasets
| Dataset | Nos. | Feature name | Description | Weight |
|---|---|---|---|---|
| (a) Sparse | 1 | Date of death or discharge (absolute) | The number of days that passed between the first recorded date of death or discharge and this patient’s date of death or discharge | −4.439 |
| | 2 | Malaysia | Whether the case was reported in Malaysia | 3.567 |
| | 3 | Algeria | Whether the case was reported in Algeria | −3.162 |
| | 4 | Singapore | Whether the case was reported in Singapore | 3.006 |
| | 5 | South Korea | Whether the case was reported in South Korea | 2.712 |
| | 6 | Australia | Whether the case was reported in Australia | 2.633 |
| | 7 | Vietnam | Whether the case was reported in Vietnam | 2.424 |
| | 8 | Date of death or discharge (missing) | Whether the date of the patient’s death or discharge was reported (binary) | 1.814 |
| | 9 | United States | Whether the case was reported in the United States | −1.760 |
| | 10 | Chills (symptom) | Whether the patient reported suffering from chills because of COVID-19 | 1.708 |
| (b) Dense | 1 | Date of death or discharge (absolute) | The number of days that passed between the first recorded date of death or discharge and this patient’s date of death or discharge | 4.026 |
| | 2 | Algeria | Whether the case was reported in Algeria | 3.535 |
| | 3 | United States | Whether the case was reported in the United States | 2.376 |
| | 4 | India | Whether the COVID-19 case was reported in India | 2.015 |
| | 5 | Age (lower) | The lower age in a patient’s age range | 2.003 |
| | 6 | Age (upper) | The upper age in a patient’s age range | 1.973 |
| | 7 | Date of death or discharge (missing) | Whether the date of the patient’s death or discharge was missing (binary) | 1.918 |
| | 8 | Singapore | Whether the case was reported in Singapore | 1.898 |
| | 9 | Malaysia | Whether the case was reported in Malaysia | 1.873 |
| | 10 | Headache (symptom) | Whether the patient reported suffering from headaches because of COVID-19 | 1.601 |
Notes: These features were obtained from the LR classifier with a ridge parameter of 10³ for the sparse dataset and 10² for the dense dataset. The date of death or discharge only contains a date without outcome information. The features are ordered by descending absolute weight. Negative weights are indicative of discharge and positive weights are indicative of death.
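The ordering and sign convention described in these notes can be shown with a short sketch. The weights are a handful of the sparse-dataset values from the table; the function name `rank_by_absolute_weight` is hypothetical and is not the authors' code.

```python
# A few LR weights from the sparse rows of the table above
weights = {
    "Chills (symptom)": 1.708,
    "Algeria": -3.162,
    "Malaysia": 3.567,
    "Date of death or discharge (absolute)": -4.439,
    "Singapore": 3.006,
}


def rank_by_absolute_weight(weights):
    """Order features by descending |weight| and label the direction:
    positive weights point toward death, negative toward discharge."""
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return [(name, w, "death" if w > 0 else "discharge") for name, w in ranked]


ranking = rank_by_absolute_weight(weights)
```

Ranking by absolute value keeps strongly protective features (large negative weights) next to strongly harmful ones instead of pushing them to the bottom of the list.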
Temporal and calibration test results for the 6 classifiers
| Dataset | Setting | Evaluation period | Metric | LR | SVM | RF | MLP | AB | NB |
|---|---|---|---|---|---|---|---|---|---|
| (a) Sparse | Training/Validation | | AUC Average | 0.665 | 0.604 | 0.699 | 0.676 | 0.697 | 0.665 |
| | | | AUC 95% CI Low | 0.656 | 0.597 | 0.690 | 0.668 | 0.675 | 0.656 |
| | | | AUC 95% CI High | 0.674 | 0.610 | 0.708 | 0.685 | 0.720 | 0.675 |
| | Evaluation | All | AUC | 0.667 | 0.596 | 0.686 | 0.684 | 0.695 | 0.651 |
| | | | H-L Test | 0.975 | 0.856 | 0.375 | 0.207 | 0.296 | 0.381 |
| | | Before May 2, 2020 | AUC | 0.832 | 0.812 | 0.912 | 0.885 | 0.895 | 0.710 |
| | | | H-L Test | 0.267 | 1.000 | 1.000 | 0.999 | 0.997 | 0.357 |
| | | On-and-after May 2, 2020 | AUC | 0.615 | 0.530 | 0.630 | 0.640 | 0.631 | 0.625 |
| | | | H-L Test | 0.981 | 0.962 | 0.999 | 1.000 | 1.000 | 0.122 |
| (b) Dense | Training/Validation | | AUC Average | 0.961 | 0.910 | 0.968 | 0.960 | 0.959 | 0.925 |
| | | | AUC 95% CI Low | 0.952 | 0.894 | 0.960 | 0.948 | 0.939 | 0.913 |
| | | | AUC 95% CI High | 0.971 | 0.926 | 0.976 | 0.972 | 0.979 | 0.938 |
| | Evaluation | All | AUC | 0.963 | 0.913 | 0.962 | 0.959 | 0.956 | 0.931 |
| | | | H-L Test | 0.998 | 1.000 | 0.763 | 1.000 | 0.999 | 0.883 |
| | | Before May 2, 2020 | AUC | 0.982 | 0.960 | 0.997 | 0.993 | 0.992 | 0.974 |
| | | | H-L Test | 0.644 | 0.462 | 0.874 | 0.968 | 0.890 | 1.000 |
| | | On-and-after May 2, 2020 | AUC | 0.919 | 0.900 | 0.924 | 0.959 | 0.945 | 0.838 |
| | | | H-L Test | 0.830 | 0.717 | 0.917 | 0.839 | 0.996 | 0.848 |
Notes: The (a) sparse and (b) dense evaluation data were each split into 2 parts: the first contains all instances before May 2, 2020, and the second contains the remaining instances (May 2, 2020 inclusive). We chose this date because the CDC’s COVID-19 timeline [28] depicts May 2, 2020 as the date when the WHO declared that COVID-19 was a global health crisis. After isotonic regression calibration, the H-L test P values indicate that all models are well calibrated (P > .1).
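Isotonic regression calibration maps raw classifier scores to probabilities through a non-decreasing step function fitted on held-out data. The sketch below implements the standard pool-adjacent-violators algorithm in plain Python to show the idea; it is not the authors' implementation, the function names are hypothetical, and the scores and labels are toy values.

```python
import bisect


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: return the sorted scores and a
    non-decreasing sequence of calibrated values, one per score."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum of labels, number of points]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [x for x, _ in pairs], fitted


def isotonic_predict(x, xs, fitted):
    """Step-function lookup: calibrated value of the largest fitted score <= x
    (scores below the smallest fitted score get the first value)."""
    i = bisect.bisect_right(xs, x) - 1
    return fitted[max(i, 0)]


# Toy calibration set (hypothetical): raw scores and observed outcomes
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 0, 1, 1]
xs, fitted = isotonic_fit(scores, labels)
calibrated = isotonic_predict(0.35, xs, fitted)
```

In this toy set the scores 0.2, 0.3, and 0.4 are pooled into one block because their outcomes (1, 0, 0) violate monotonicity, so all three receive the block mean as their calibrated probability.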
Figure 4. The time taken to train each model on the full 90% training/validation data. The vertical axis on the left corresponds to the sparse dataset and the vertical axis on the right corresponds to the dense dataset.