Generalizable prediction of COVID-19 mortality on worldwide patient data.

Maxim Edelson, Tsung-Ting Kuo.

Abstract

Objective: Predicting Coronavirus disease 2019 (COVID-19) mortality for patients is critical for early-stage care and intervention. Existing studies mainly built models on datasets with limited geographical range or size. In this study, we developed COVID-19 mortality prediction models on worldwide, large-scale "sparse" data and on a "dense" subset of the data.
Materials and Methods: We evaluated 6 classifiers, including logistic regression (LR), support vector machine (SVM), random forest (RF), multilayer perceptron (MLP), AdaBoost (AB), and Naive Bayes (NB). We also conducted temporal analysis and calibrated our models using Isotonic Regression.
Results: The results showed that AB outperformed the other classifiers for the sparse dataset, while LR provided the highest-performing results for the dense dataset (area under the receiver operating characteristic curve, or AUC, ≈ 0.7 for the sparse dataset and 0.963 for the dense one). We also identified impactful features such as symptoms, countries, age, and the date of death/discharge. All our models are well-calibrated (P > .1).
Discussion: Our results highlight the tradeoff of using sparse training data to increase generalizability versus training on denser data, which produces higher discrimination results. We found that covariates such as patient information on symptoms, countries (where the case was reported), age, and the date of discharge from the hospital or death were the most important for mortality prediction.
Conclusion: This study is a stepping-stone towards improving healthcare quality during the COVID-19 era and potentially other pandemics. Our code is publicly available at: https://doi.org/10.5281/zenodo.6336231.
© The Author(s) 2022. Published by Oxford University Press on behalf of the American Medical Informatics Association.

Keywords:  COVID-19; coronavirus; data mining; machine learning; predictive modeling

Year:  2022        PMID: 35663116      PMCID: PMC9129227          DOI: 10.1093/jamiaopen/ooac036

Source DB:  PubMed          Journal:  JAMIA Open        ISSN: 2574-2531


INTRODUCTION

Coronavirus disease 2019 (COVID-19) has resulted in more than 5.2 million confirmed deaths and has spread to almost every country in the world. The World Health Organization (WHO) has declared that the infection fatality ratio of COVID-19 (ie, the mortality rate among all infected individuals) converges at 0.5–1.0%. Thousands of people worldwide continue to die of COVID-19, and this trend is likely to continue for the foreseeable future as cases continue to spike sporadically, vaccine mandates are fiercely resisted, and new mutations emerge. It is therefore imperative to identify patients at higher risk of fatality, so that healthcare institutions can provide adequate early-stage care and interventions to reduce the risk of COVID-19 mortality. The Centers for Disease Control and Prevention (CDC) has recognized older age, kidney disease, lung disease, and certain neurological and developmental conditions as factors that can increase a patient's risk of COVID-19 mortality. Based on these factors, several existing studies have proposed pipelines that leverage artificial intelligence/machine learning (AI/ML) to predict mortality from patients' data. Most of these studies were performed on smaller datasets collected from 1 city, or on moderate cohort sizes (<5000 patients). These datasets contain detailed/curated clinical information on each patient, have a low missing value ratio, and are specific to 1 geographic location. However, in a real clinical COVID-19 setting, overwhelmed hospitals or intensive care units may not have the resources or time to contact patients' primary care providers to complete missing medical history information, so the missing data ratio tends to be high. Also, a model built on data from a particular hospital or city may be less pertinent to COVID-19 patients outside that region. In addition, some studies use a relatively large dataset under the assumption that the dataset is balanced. For example, a recent study used a dataset containing ∼110 000 patients and adopted a preprocessing step to balance the dataset between deceased and discharged patients; this effectively creates a mortality rate of 50%, which may limit its application in real clinical use. Although these studies showed the effectiveness of adopting AI/ML methods to predict patient fatality, the data assumptions of (1) a low missing value ratio, (2) a single region, and (3) a balanced mortality rate may hinder generalizability for real-world clinical applications.

OBJECTIVE

Our goal is to create a model that is generalizable worldwide for retrospectively predicting COVID-19 patient mortality using real-world data (1) with a medium-to-high missing value ratio, (2) encompassing multiple regions, and (3) without manual balancing of the relative ratios of discharged and deceased patients.

MATERIALS AND METHODS

Data

To address the overall goal of model generalizability, we utilized an open-source COVID-19 dataset, collected from government sources, scientific papers, and news websites, which contained 2 676 403 COVID-19-confirmed patients from around the world as of March 31, 2021. The Institutional Review Board at the University of California San Diego (UCSD) approved this study (no. 190385). Although earlier versions of this dataset were also used by previous predictive modeling studies, we used a more recent, and therefore more complete, version. We kept the 2 567 823 patients with a known COVID-19 confirmation date (Figure 1A) and discarded those without one. The average age was 45 years (SD = 20) and 47.6% of patients were female. The countries represented among ≥1% of the total number of patients are: India = 11.3% (positive = 5.1%), United States = 4.5% (positive = 87.5%), France = 4.1% (positive = 42.9%), and China = 1.6% (positive = 20.0%). Demographic statistics were computed over nonmissing data.
Figure 1.

COVID-19 patient data included in this study. (A) The original dataset contained n = 2 676 403 patients. We kept n = 2 567 823 patients after discarding all observations without a valid COVID-19 confirmation date. (B) The data breakdown of the “sparse” dataset with n = 104 047 patients. (C) The data breakdown for the “dense” dataset with n = 6893. (D) The inclusion requirements for the dense dataset. The “Death or Discharge Date” field (*) has no death or discharge indication and is just a date.

We used 2 subsets of the same dataset: a sparse dataset and a dense dataset. For the sparse dataset, our inclusion criteria were (1) patients with a known COVID-19 confirmation date (ie, COVID-19+ patients), and (2) patients with known outcomes (ie, either "deceased" or "discharged"). We manually reviewed the outcome values to combine semantically equivalent ones (eg, "death" was considered the same as "deceased"); this process was executed by extracting all unique outcomes and then manually separating them into "deceased," "discharged," or "ambiguous." All observations with "ambiguous" outcomes (eg, "undertreatment") were then discarded, as illustrated in the sketch below. We did not exclude patient data using any other criteria. Based on these inclusion criteria, the sparse dataset contains 104 047 patients, with a deceased (or positive) rate of 5.73% (5958 positive patients) and a discharged (or negative) rate of 94.27% (98 089 negative patients), as shown in Figure 1B.

To examine the effects of data sparsity (ie, various levels of missing data) and to cross-examine the results with the sparse dataset, we created the dense dataset (Figure 1C), which is a subset of the sparse one. The major difference in the dense dataset is that the fraction of deceased patients is 21.1% (1452 of its 6893 patients deceased, 5441 discharged); just like with the sparse dataset, we left the positive ratio as-is without balancing. Each observation in the dense dataset was extracted from the sparse one, and the basis of inclusion was whether it reported data for age, sex, symptoms, chronic diseases, or optional dates (optional dates are all date features excluding the confirmation date), as shown in Figure 1D. That is, an observation only needed to include one of those fields to merit being placed in the dense dataset. The sparse dataset is a superset of the dense one, meaning that every patient in the dense dataset is also in the sparse dataset.
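As a concrete illustration of this outcome-harmonization step, the minimal Python sketch below maps raw outcome strings to binary labels; the value sets are hypothetical examples, not the authors' full curated mapping.

```python
# Hypothetical value sets for illustration; the paper's manual review covered
# every unique outcome string in the dataset.
DECEASED = {"deceased", "death", "died"}
DISCHARGED = {"discharged", "recovered", "released"}

def harmonize_outcome(raw: str):
    """Return 1 for deceased, 0 for discharged, or None for ambiguous outcomes."""
    value = raw.strip().lower()
    if value in DECEASED:
        return 1
    if value in DISCHARGED:
        return 0
    return None  # eg, "undertreatment": discarded downstream

records = [{"outcome": "Death"}, {"outcome": "undertreatment"}, {"outcome": "discharged"}]
kept = [dict(r, label=harmonize_outcome(r["outcome"]))
        for r in records if harmonize_outcome(r["outcome"]) is not None]
# kept -> labeled observations for the sparse dataset (2 of the 3 records)
```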

Method overview

A high-level overview of our methodology is illustrated in Figure 2. The following sections describe our data preprocessing steps ("Data preprocessing"), the 6 ML classifiers we used ("Classifiers"), and our validation, calibration, and evaluation framework ("Validation, calibration, and evaluation").
Figure 2.

Overview of our study’s workflow. (A) We started by preprocessing the original dataset from 33 fields down to the 12 most important and relevant fields. From these 12 remaining fields, we extracted 55 features. (B) We then split the dataset to obtain 90% training data. (C) Next, we performed 10-fold cross validation with the training data by feeding our data to our 6 classifiers. (D) We calibrated our models using the first 5% of the holdout data. (E) Finally, we evaluated our calibrated models using the second 5% of the holdout data.


Data preprocessing

Both the sparse and the dense datasets originally contained 33 fields; after manual review, we kept the 12 most relevant fields (Table 1). The manual review removed the following fields as potentially irrelevant, redundant, or too specific: ID, City, Province, Latitude, Longitude, Geographic Resolution, Lives in Wuhan, Travel History Location, Reported Market Exposure, Additional Information, Source, Sequence Available, Notes for Discussion, Location, Admin 3, Admin 2, Admin 1, New Country, Admin ID, Data Moderator Initials, and Travel History Binary. The retained fields and their statistics are summarized in Table 1. The missing value ratio for most of the included fields is high, reflecting that a real-world COVID-19 dataset must be flexible in its assumptions to generalize across countries around the entire globe. We preprocessed the 12 data fields to extract 1 binary outcome label (ie, whether the patient was deceased/positive or discharged/negative; no. 1 in Table 1) and 55 features. Specifically, we extracted the features from the following fields:
Table 1.

The 12 relevant fields and statistics of our data for both the sparse and dense datasets

| Nos. | Field | Description | Data type | No. of possible values (NOM), or range of values (NUM/DAT) | Missing (%), sparse | Missing (%), dense |
|---|---|---|---|---|---|---|
| 1 | Outcome | Patient outcome from COVID-19 (deceased = 1 or discharged = 0) | NOM | 2 | 0.0 | 0.0 |
| 2 | Age | Age of the patient in years | NUM | 0–101 | 94.5 | 18.9 |
| 3 | Sex | Sex of the patient (male, female, unreported) | NOM | 3 | 93.4 | 0.2 |
| 4 | Chronic disease flag | Binary flag for whether the patient has chronic diseases (true, false) | NOM | 2 | 0.0 | 0.0 |
| 5 | Chronic diseases | List of reported chronic diseases (asthma, chronic kidney disease, diabetes, and hypertension) | NOM | 4 | 99.9 | 98.5 |
| 6 | Symptoms | List of symptoms the patient experienced | NOM | 10 | 99.8 | 97.4 |
| 7 | Country | Name of the country in which the case was reported | NOM | 20 | 0.0 | 0.0 |
| 8 | Date confirmation | Date when the patient was confirmed to have COVID-19 | DAT | 2020/01/02–2020/06/03 | 0.0 | 0.0 |
| 9 | Date of onset symptoms | Date when the patient began reporting symptoms | DAT | 2020/01/02–2020/05/27 | 96.6 | 49.4 |
| 10 | Date of admission hospital | Date when the patient was recorded to be hospitalized | DAT | 2020/01/02–2020/04/05 | 99.8 | 96.5 |
| 11 | Date of death or discharge | Date when the death or discharge of the patient was reported (only a date, without outcome information) | DAT | 2020/01/02–2020/06/04 | 98.9 | 83.6 |
| 12 | Travel history dates | Recorded travel dates to a location | DAT | 2020/01/03–2020/04/03 | 99.8 | 97.0 |

Notes: The field names and descriptions are adapted from the original dataset. We only enumerate the possible values of a nominal field when the total number of values is <10.

NUM: Numerical; NOM: Nominal; DAT: Date. Dates are given in YYYY/MM/DD format.

- Age (no. 2 in Table 1). We split this field into Age Lower and Age Upper because certain ages were given as ranges. For ages given as a single value, we assigned both Age Lower and Age Upper the same value.
- Sex and chronic disease flag (nos. 3 and 4 in Table 1). We converted the sex field into 2 features, the first indicating male and the second indicating female (if both are zero, the sex was considered unreported). The chronic disease flag field became a single binary feature (one if a patient has chronic diseases, otherwise zero). The chronic disease binary flag does not necessarily align with the chronic diseases field; that is, even if the chronic disease flag is one (meaning a patient suffers from chronic diseases), the chronic diseases field may still contain no data.
- Chronic diseases, symptoms, and country (nos. 5–7 in Table 1, respectively). We manually reviewed the values of these fields to combine equivalent values.
- Date confirmation (no. 8 in Table 1). To enable comparison between dates, we converted this date into an "absolute" day relative to the earliest confirmation date available in our entire dataset (ie, January 6, 2020), inclusive of the last day. For example, if a patient's COVID-19 confirmation date is June 3, 2020, the absolute days for date confirmation for this patient would be 150. We also used this field as the "base date" for other types of dates to compute "relative" days (details in the next bullet).
- Date of onset symptoms, date of admission hospital, and date of death or discharge (nos. 9–11 in Table 1, respectively). We converted each of these dates into both "absolute" and "relative" days. The absolute conversion is the same as for date confirmation (ie, the difference between a specific date and the earliest value for that type of date, inclusive of that specific day). Each relative day value is the difference between the patient's date confirmation and the date in question. For example, if the patient's date confirmation was March 21, 2020 and their date of death or discharge was May 4, 2020, their relative days for date of death or discharge would be 45. Note that the date of death or discharge only contains a date, without revealing outcome information.
- Travel history dates (no. 12 in Table 1). Many values of this field were given as ranges. Therefore, we first split this field into Travel History Dates Begin and Travel History Dates End (similarly to age, we gave these 2 dates the same value if the original field contained only 1 value). Then, for each of the "begin" and "end" fields, we further extracted both absolute day and relative day features, resulting in 4 features in total.

We created dummy variable features for categorical fields. We then normalized numerical features to [0, 1] using (current value − minimum value) / (maximum value − minimum value). For fields with missing values, we added a missing indicator feature.
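To make the date handling concrete, here is a minimal Python sketch of the absolute-day, relative-day, and min-max normalization computations described above (function names are ours, not from the authors' released code; the inclusive "+1" follows the worked examples in the text):

```python
from datetime import date

# Earliest confirmation date in the dataset, per the text's worked example.
REFERENCE_CONFIRMATION = date(2020, 1, 6)

def absolute_days(d: date, reference: date) -> int:
    """Days from the reference date to d, inclusive of the last day."""
    return (d - reference).days + 1

def relative_days(d: date, confirmation: date) -> int:
    """Days from the patient's confirmation date to d, inclusive."""
    return (d - confirmation).days + 1

def min_max(value: float, lo: float, hi: float) -> float:
    """Normalize a numerical feature to [0, 1]."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

# Worked examples from the text:
assert absolute_days(date(2020, 6, 3), REFERENCE_CONFIRMATION) == 150
assert relative_days(date(2020, 5, 4), date(2020, 3, 21)) == 45
```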
For the dense dataset, we further removed all features that were not represented among ≥5 unique observations (see the sketch below), to ensure that a feature could not become unrealistically predictive because only a handful of observations carried it. As many of the fields have high missing value ratios, this allowed us to use a much denser subset of the sparse dataset to examine the effects of sparsity.
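A small pandas sketch of this frequency filter, assuming the features are already encoded as binary dummy columns (column names below are illustrative):

```python
import pandas as pd

def drop_rare_features(X: pd.DataFrame, min_support: int = 5) -> pd.DataFrame:
    """Drop features that are nonzero in fewer than `min_support` observations."""
    support = (X != 0).sum(axis=0)
    return X.loc[:, support >= min_support]

# Example: a dummy feature seen in only 2 observations is removed.
X = pd.DataFrame({"country_gabon": [1, 1, 0, 0, 0, 0],
                  "symptom_fever": [1, 0, 1, 1, 1, 1]})
print(drop_rare_features(X).columns.tolist())  # ['symptom_fever']
```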

Classifiers

We adopted 6 classifiers for our COVID-19 mortality prediction (binary classification) task: logistic regression (LR), support vector machine (SVM), random forest (RF), multilayer perceptron (MLP), AdaBoost (AB), and Naive Bayes (NB). All hyper-parameter combinations are shown in Supplementary Appendix Table SA1. For SVM, we used a linear version. For MLP, we set the learning rate to 0.1, the number of hidden layers to 1, the number of hidden neurons to 110, the learning rate decay to false, and the threshold for consecutive errors to 20. Appropriate hyper-parameter options were discovered through previous studies using similar implementations of the classifiers. Because many of those studies differed in their datasets and applications, we adopted a grid search hyper-parameter tuning approach, selecting the initial grid values based on the hyper-parameter combinations explored in the previous studies to optimize the performance of our models. We expanded the grid search values as necessary; we deemed expansion necessary when the highest-performing hyper-parameter combination was an edge case of the grid. We implemented the classifiers using the WEKA library; SVM was implemented using the LibLINEAR API (also via WEKA).
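The classifiers themselves were implemented in WEKA; purely as an illustration of the edge-aware grid search described above, the following scikit-learn sketch tunes LR by AUC (the grid values are assumptions, not the combinations listed in Supplementary Appendix Table SA1):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumed illustrative grid; WEKA's ridge parameter corresponds only roughly
# to the inverse of scikit-learn's C.
param_grid = {"C": [10.0**k for k in range(-4, 5)]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",   # AUC, the paper's evaluation metric
    cv=10,               # 10-fold cross-validation on the 90% training split
)
# search.fit(X_train, y_train)
# If search.best_params_["C"] lands on the smallest or largest grid value
# (an "edge case"), widen the grid and rerun, as the authors did.
```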

Validation, calibration, and evaluation

Our validation, calibration, and evaluation processes are shown in Figure 2. The data were split into 3 parts: 90% for training/validation (Figure 2B), the first 5% of the holdout for calibration (Figure 2D), and the second 5% for evaluation (Figure 2E). We used the full area under the receiver operating characteristic curve (AUC) as our evaluation metric for the classifiers. We built and tested our models on an Amazon Web Services virtual machine with 2 vCPUs, 8 GB RAM, and 100 GB SSD. For training/validation, we performed 10-fold cross-validation for each classifier on the 90% training data to tune the hyper-parameters, averaged the validation AUC over the 10 folds, and calculated the 95% confidence intervals (CIs) of the AUC for the best-performing hyper-parameter combinations. For calibration, each classifier with its best hyper-parameter combination was applied to the first 5% of the holdout data, and its predictions there provided the input for Isotonic Regression. For evaluation, we computed the AUC and the Hosmer–Lemeshow (H-L) test on the evaluation data (the second 5%). Given the changes in COVID-19 viral variants over time, it is important to further show our model's ability to predict mortality in different epochs. Thus, following the CDC's COVID-19 timeline, we split the evaluation data into 2 parts separated by May 2, 2020 (ie, when the WHO declared that COVID-19 was a global health crisis): the first part contains all the data from before May 2, 2020, and the second part contains all other data (inclusive of May 2, 2020). We split the evaluation data for both the sparse and dense datasets in this manner.
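To make the calibration and evaluation steps concrete, the sketch below (with assumed variable names, not the authors' released code) fits Isotonic Regression on the calibration split and reports the AUC and an H-L P value on the evaluation split; the H-L implementation is a standard decile-based version, not necessarily the exact one used in the paper.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow_p(y_true, y_prob, n_bins=10):
    """Decile-based H-L goodness-of-fit P value; P > .1 is read as well-calibrated."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    order = np.argsort(y_prob)              # sort patients by predicted risk
    stat = 0.0
    for group in np.array_split(order, n_bins):
        n = len(group)
        obs = y_true[group].sum()           # observed deaths in this risk bin
        exp = y_prob[group].sum()           # expected deaths from predicted risk
        if 0 < exp < n:
            stat += (obs - exp) ** 2 / exp + ((n - obs) - (n - exp)) ** 2 / (n - exp)
    return chi2.sf(stat, df=n_bins - 2)

# iso = IsotonicRegression(out_of_bounds="clip").fit(scores_cal, y_cal)  # first 5%
# calibrated = iso.predict(scores_eval)                                  # second 5%
# print(roc_auc_score(y_eval, calibrated), hosmer_lemeshow_p(y_eval, calibrated))
```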

RESULTS

The discrimination results of each classifier on the full evaluation data for both datasets are shown in Figure 3. For the sparse dataset, AB resulted in the highest AUC (AUC ≈ 0.7), followed by RF and MLP (AUC ≈ 0.685). For the dense dataset, LR performed best (AUC = 0.963), with RF, MLP, and AB following closely behind (AUC ≈ 0.96). SVM and NB provided less competitive results for both datasets. The precision and recall results (computed using a decision threshold of 0.5) for each classifier can also be seen in Figure 3. In general, all models provided good precision (>0.89) and recall (>0.73) for the dense dataset; for the sparse dataset, the precision is still high (>0.87 except for NB), whereas the recall is relatively low (∼0.2). The best-performing hyper-parameter combinations for the sparse and the dense datasets are shown in Table 2. We also analyzed the top 10 most important features derived from LR for both the sparse and the dense datasets (Table 3), ordered by decreasing absolute value of their trained weights. Symptoms, countries, ages, and dates of death or discharge were found to be among the most predictive factors. The full LR models for the sparse and the dense datasets are shown in Supplementary Appendix Tables SA2 and SA3, respectively.
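For reference, ordering features by the absolute value of their trained LR weights, as done for Table 3, can be sketched as follows (variable names assumed):

```python
import numpy as np

def top_features(weights, feature_names, k=10):
    """Return the k (name, weight) pairs with the largest absolute weight, descending."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-np.abs(weights))[:k]
    return [(feature_names[i], float(weights[i])) for i in order]
```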
Figure 3.

The performance of our 6 classifiers with AUC as the evaluation metric. AB outperformed the other 5 classifiers for the sparse dataset and LR was the best performer when trained on the dense dataset. The precision and recall for each result are provided near each respective AUC result, with “P” being the precision and “R” being the recall. We used the default decision threshold of 0.5 when computing the precision and recall values. The classifier abbreviations are as follows: LR: logistic regression; SVM: support vector machine; RF: random forest; MLP: multi-layer perceptron; AB: AdaBoost; NB: Naive Bayes.

Table 2.

The best hyper-parameter combinations for each of the 6 classifiers on both the sparse and dense datasets

| Classifier | Hyper-parameter | Best sparse data combination | Best dense data combination |
|---|---|---|---|
| LR | Ridge | 10^3 | 10^2 |
| SVM | Cost | 2^−5 | 2^−5 |
| RF | Number of attributes | m^(1/3) | m^(1/2) |
| RF | Sample size | 50% | 50% |
| RF | Number of trees | 100 | 175 |
| MLP | Momentum | 0.1 | 0.3 |
| MLP | Number of epochs | 750 | 500 |
| AB | Weight threshold | 100 | 100 |
| AB | Number of iterations | 70 | 20 |
| AB | Resampling for boosting | True | True |
| AB | Base classifier | J48 | J48 |
| NB | Kernel estimator | True | False |
| NB | Supervised discretization | False | True |

Note: m denotes the number of attributes.

Table 3.

The top 10 most important features using both (a) sparse and (b) dense datasets

| Dataset | Nos. | Feature name | Description | Weight |
|---|---|---|---|---|
| (a) Sparse | 1 | Date of death or discharge (absolute) | The number of days that passed between the first recorded date of death or discharge and this patient's date of death or discharge | −4.439 |
| | 2 | Malaysia | Whether the case was reported in Malaysia | 3.567 |
| | 3 | Algeria | Whether the case was reported in Algeria | −3.162 |
| | 4 | Singapore | Whether the case was reported in Singapore | 3.006 |
| | 5 | South Korea | Whether the case was reported in South Korea | 2.712 |
| | 6 | Australia | Whether the case was reported in Australia | 2.633 |
| | 7 | Vietnam | Whether the case was reported in Vietnam | 2.424 |
| | 8 | Date of death or discharge (missing) | Whether the date of the patient's death or discharge was reported (binary) | 1.814 |
| | 9 | United States | Whether the case was reported in the United States | −1.760 |
| | 10 | Chills (symptom) | Whether the patient reported suffering from chills because of COVID-19 | 1.708 |
| (b) Dense | 1 | Date of death or discharge (absolute) | The number of days that passed between the first recorded date of death or discharge and this patient's date of death or discharge | 4.026 |
| | 2 | Algeria | Whether the case was reported in Algeria | 3.535 |
| | 3 | United States | Whether the case was reported in the United States | 2.376 |
| | 4 | India | Whether the COVID-19 case was reported in India | 2.015 |
| | 5 | Age (lower) | The lower age in a patient's age range | 2.003 |
| | 6 | Age (upper) | The upper age in a patient's age range | 1.973 |
| | 7 | Date of death or discharge (missing) | Whether the date of the patient's death or discharge was missing (binary) | 1.918 |
| | 8 | Singapore | Whether the case was reported in Singapore | 1.898 |
| | 9 | Malaysia | Whether the case was reported in Malaysia | 1.873 |
| | 10 | Headache (symptom) | Whether the patient reported suffering from headaches because of COVID-19 | 1.601 |

Notes: These features are from the LR classifier with a ridge parameter of 10^3 for the sparse dataset and 10^2 for the dense dataset. The date of death or discharge only contains a date, without outcome information. The features are ordered by descending absolute weight. Negative weights are indicative of discharge and positive weights are indicative of death.

The temporal and calibration results are shown in Table 4. RF outperformed the other classifiers for the "before May 2, 2020" time period, whereas MLP performed best for the "on-and-after May 2, 2020" epoch, for both datasets. All models (evaluated on the full evaluation data, the evaluation data from before May 2, 2020, and the evaluation data from on-and-after May 2, 2020, for both the sparse and the dense datasets) are well-calibrated (P > .1). The training time measurements on the full 90% training/validation data for each classifier are shown in Figure 4; MLP took by far the longest time to train on both datasets (sparse and dense).
Table 4.

Temporal and calibration test results for the 6 classifiers

| Dataset | Setting | Metric | LR | SVM | RF | MLP | AB | NB |
|---|---|---|---|---|---|---|---|---|
| (a) Sparse | Training/validation | AUC average | 0.665 | 0.604 | 0.699 | 0.676 | 0.697 | 0.665 |
| | | AUC 95% CI low | 0.656 | 0.597 | 0.690 | 0.668 | 0.675 | 0.656 |
| | | AUC 95% CI high | 0.674 | 0.610 | 0.708 | 0.685 | 0.720 | 0.675 |
| | Evaluation (all) | AUC | 0.667 | 0.596 | 0.686 | 0.684 | 0.695 | 0.651 |
| | | H-L test P value | 0.975 | 0.856 | 0.375 | 0.207 | 0.296 | 0.381 |
| | Evaluation (before May 2, 2020) | AUC | 0.832 | 0.812 | 0.912 | 0.885 | 0.895 | 0.710 |
| | | H-L test P value | 0.267 | 1.000 | 1.000 | 0.999 | 0.997 | 0.357 |
| | Evaluation (on-and-after May 2, 2020) | AUC | 0.615 | 0.530 | 0.630 | 0.640 | 0.631 | 0.625 |
| | | H-L test P value | 0.981 | 0.962 | 0.999 | 1.000 | 1.000 | 0.122 |
| (b) Dense | Training/validation | AUC average | 0.961 | 0.910 | 0.968 | 0.960 | 0.959 | 0.925 |
| | | AUC 95% CI low | 0.952 | 0.894 | 0.960 | 0.948 | 0.939 | 0.913 |
| | | AUC 95% CI high | 0.971 | 0.926 | 0.976 | 0.972 | 0.979 | 0.938 |
| | Evaluation (all) | AUC | 0.963 | 0.913 | 0.962 | 0.959 | 0.956 | 0.931 |
| | | H-L test P value | 0.998 | 1.000 | 0.763 | 1.000 | 0.999 | 0.883 |
| | Evaluation (before May 2, 2020) | AUC | 0.982 | 0.960 | 0.997 | 0.993 | 0.992 | 0.974 |
| | | H-L test P value | 0.644 | 0.462 | 0.874 | 0.968 | 0.890 | 1.000 |
| | Evaluation (on-and-after May 2, 2020) | AUC | 0.919 | 0.900 | 0.924 | 0.959 | 0.945 | 0.838 |
| | | H-L test P value | 0.830 | 0.717 | 0.917 | 0.839 | 0.996 | 0.848 |

Notes: The (a) sparse and (b) dense evaluation data were each split into 2 parts, the first containing all data from before May 2, 2020 and the second containing the remaining instances (inclusive of that date), because the CDC's COVID-19 timeline gives May 2, 2020 as the date when the WHO declared that COVID-19 was a global health crisis. The H-L test P values show that all models are well-calibrated (P > .1) after isotonic regression calibration.

Figure 4.

The time taken for training on the full 90% training/validation data for each model. The vertical axis on the left corresponds to the sparse dataset and the vertical axis on the right to the dense dataset.


DISCUSSION

Findings

For the full evaluation data, AB provided the best AUC results for the sparse dataset and LR performed best on the dense one. Based on the results of all hyper-parameter combinations, we observed that altering the base classifier for AB produced the greatest direct change in discrimination results; for LR, SVM, RF, MLP, and NB, none of the hyper-parameter combinations changed the results significantly. Our results also highlight the tradeoff we embraced compared with previous studies: models trained on sparse data gain generalizability at the cost of discrimination (AUC), whereas models trained on dense data achieve higher discrimination but decreased generalizability. The relatively high precision and recall values for our models on the dense dataset indicate that a decision threshold of 0.5 may be appropriate there. On the other hand, the low recall on the sparse dataset was expected given the class imbalance (only ∼6% positive examples, versus ∼21% in the dense dataset). Finally, the low recall and precision for NB suggest that a threshold of 0.5 might not be appropriate for the sparse data.

For the temporal analysis, all our models performed even better for the patients whose confirmation dates were before May 2, 2020 (with RF's AUC of 0.912 being the best result for the sparse dataset, and 0.997 for the dense one). This is significantly higher than the performance on the full evaluation dataset (where the best models reached only AUC ≈ 0.7 for the sparse dataset and 0.963 for the dense one). On the other hand, for the patients whose confirmation dates were on-and-after May 2, 2020, all AUC values are lower than those on the full evaluation data for both the sparse and dense datasets. This may be because the first part of the dataset (before May 2, 2020) contains fewer missing values, while later instances have a higher rate of missing values. Another possibility is that before May 2, 2020, COVID-19 mortality may have been easier to predict because case counts were increasing but not yet surging, whereas after May 2, 2020 the number of cases began to surge, making prediction more difficult. Furthermore, our models can provide reliable prediction scores after calibration.

Limitations

There are several limitations of this study:

- Using a longitudinal dataset would allow us to showcase the relative danger that each COVID-19 variant (eg, alpha, delta, and omicron) and their mixes (eg, percent delta/alpha or delta/omicron) poses. Although exploring such a dataset may have the potential to reveal certain symptoms or risk factors that are correlated with specific variants, we have yet to extend our study to model this type of data.
- Calculating the "optimal" decision threshold is often desirable because the threshold is usually problem-specific, and the real threshold value may be biased towards one outcome over the other. Moreover, this "optimal" decision threshold can potentially affect our precision and recall results, especially for the dense dataset. Additionally, performing calibration near the estimated "optimal" decision threshold can be important because even a minute change in prediction scores near the decision threshold can flip the predicted class. We have yet to consult with clinical experts to estimate such an "optimal" decision threshold, perform subsequent calibration around it, and recompute the precision and recall results.
- Our data contain a skewed geographical distribution (ie, most of the observations came from just a handful of countries, and the remaining countries are represented by only a small number of cases). This makes the less represented countries disproportionately influential in the learned feature weights. For example, if a country feature, like Gabon, reported only 5 observations, and all 5 patients died due to COVID-19, then this feature may receive a higher absolute weight (importance) in determining patient outcome than it should. Additionally, the dataset lacks data from certain countries, which may result in a model that does not directly represent a global sample. Therefore, further investigation of potential geographical biases in the dataset may be required.
- Our hyper-parameter exploration process involved iterating through a grid search, which is computationally intensive. Alternative hyper-parameter tuning techniques (eg, random search) may allow us to search hyper-parameter combinations more efficiently and thus warrant further study.
- Information on what treatments patients received was not present in our dataset, so the effects of specific treatments on mortality were not compared. We have yet to include such additional information to examine whether certain treatments directly affect the probability of survival. Moreover, we have yet to consult with clinical experts to perform a "blind assessment" of the features and mortality labels used in our data, or to create risk groups for stratifying which specific groups of patients are most susceptible to death.
- We adopted more traditional classification methods, prioritizing the simplicity and explainability of the models. Advanced techniques such as Deep Learning could also be considered; for example, our tabular datasets can potentially be converted to sequential representations for recurrent neural networks, or to 2D representations for convolutional neural networks. These Deep Learning methodologies for predicting patient mortality warrant further exploration.
- Several COVID-19 clinical prediction instruments have been developed since the start of the pandemic, including the AIFELL and 4C scores. The AIFELL score was designed to differentiate between severe and less severe COVID-19 cases in emergency room environments, and the 4C score was developed to directly inform clinicians in their decision-making process and to separate COVID-19 hospital admittees into risk management groups. We have yet to compare our prediction results with those of these existing tools, or to combine various models into a mortality prediction tool with potentially better predictive capability.

CONCLUSIONS

In this study, we demonstrated the feasibility of building generalizable COVID-19 mortality predictive models. To do this, we used a worldwide dataset with high missing value ratios for most of our included fields. We evaluated 6 classifiers on a COVID-19 dataset featuring patients from around the world and reached AUC ≈ 0.7 for the sparse dataset and AUC = 0.963 for the dense dataset. This study is a stepping-stone to creating highly generalizable models that can predict mortality for COVID-19 patients, with the goal of improving healthcare quality during the COVID-19 era and future pandemics.

FUNDING

The author T-TK was funded by the US National Institutes of Health (NIH) R00HG009680, R01HL136835, R01GM118609, R01HG011066, and U24LM013755. The content is solely the responsibility of the author and does not necessarily represent the official views of the NIH.

AUTHOR CONTRIBUTIONS

ME contributed to the conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing (original draft), and visualization. T-TK contributed to conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing (review and editing), visualization, supervision, project administration, and funding acquisition.

CONFLICT OF INTEREST STATEMENT

None declared.

DATA AVAILABILITY

The data underlying this article are available in Zenodo, at https://doi.org/10.5281/zenodo.6336231. The datasets were derived from sources in the public domain: https://github.com/beoutbreakprepared/nCoV2019.