Machine Learning Models to Predict In-Hospital Mortality among Inpatients with COVID-19: Underestimation and Overestimation Bias Analysis in Subgroup Populations.

Javad Zarei¹, Amir Jamshidnezhad¹, Maryam Haddadzadeh Shoushtari², Ali Mohammad Hadianfard¹, Maria Cheraghi³, Abbas Sheikhtaheri⁴.

Abstract

Predicting death among COVID-19 patients can help healthcare providers manage these patients better. We aimed to develop machine learning models to predict in-hospital death among these patients. We developed different models using different feature sets and datasets created with a data balancing method. We used demographic and clinical data from a multicenter COVID-19 registry. At the end of March 2021, we extracted 10,657 records of patients with COVID-19 confirmed by PCR or CT scan who had been hospitalized for at least 24 hours. The death rate was 16.06%. In general, models with 60 and 40 features performed better. Among the 240 models, the C5 models with 60 and 40 features performed well. The C5 model with 60 features outperformed the rest on all evaluation metrics; however, in external validation, C5 with 32 features performed better. In internal testing, this model had high accuracy (91.18%), F-score (0.916), area under the curve (AUC, 0.96), sensitivity (94.2%), and specificity (88%). The model suggested in this study uses simple, readily available data and can be applied to predict death among COVID-19 patients. Furthermore, we found that machine learning models may perform differently in subpopulations defined by gender and age group.
Copyright © 2022 Javad Zarei et al.


Year:  2022        PMID: 35756093      PMCID: PMC9226971          DOI: 10.1155/2022/1644910

Source DB:  PubMed          Journal:  J Healthc Eng        ISSN: 2040-2295            Impact factor:   3.822


1. Introduction

Despite more than 2 years since the start of the COVID-19 pandemic and vaccination campaigns in many countries, the prevalence of and mortality from the disease have not slowed down, and many countries are still experiencing high peaks [1]. In addition, multiple mutations of the virus have created a new challenge for disease control, leading to further spread of the disease and increased mortality [2-4]. As of April 16, 2022, more than 500 million cases and more than 6 million deaths due to COVID-19 had been reported globally, including more than 7 million cases and 140,000 deaths in Iran [1]. Since the beginning of the pandemic, one of the most critical challenges for healthcare systems has been the growing number of patients with severe symptoms and the increasing demand for hospitalization. In developing countries, which lack sufficient healthcare infrastructure, the rise in inpatients has placed a heavy burden on the healthcare system. Moreover, numerous studies have reported risk factors for the deterioration of COVID-19 patients, such as old age, male gender, and underlying medical conditions (e.g., hypertension, cardiovascular disease, diabetes, COPD, cancer, and obesity) [5-9]. Modern, noninvasive methods that triage patients into specific, known categories at the early stages of the disease are beneficial [10]. One such approach is the use of predictive models based on machine learning [11, 12]. For example, predictive models based on mortality risk factors can help prevent deaths by supporting the control of acute conditions and planning in intensive care units [13]. Furthermore, machine learning can classify patients by their risk of deterioration and predict the likelihood of death, so that resources can be managed optimally [14, 15]. To date, several studies have applied machine learning to develop diagnostic models or to predict death from COVID-19 [14-23].
For example, several deep learning models have been reported to diagnose COVID-19 from images [24]. In one study, researchers developed an enhanced fuzzy-based deep learning model to differentiate between COVID-19 and non-COVID-19 infectious pneumonia based on portable chest X-rays (CXRs) and achieved up to 81% accuracy; their fuzzy model had only three misclassifications on the validation dataset [24]. Several studies have also been published on death prediction [16, 25–28]. These studies indicated that machine learning-based predictive methods have reliable predictive ability and can identify correlations between intervening variables in the complex and ambiguous conditions caused by COVID-19, so they can be used to predict such situations in the future. Although these techniques have been tested on some regional datasets of risk factors, the models' performance may be improved by developing them on datasets from other countries, such as Iran, where the prevalence of COVID-19 and related deaths is high. Iran was one of the first countries to face a widespread outbreak of the disease and has experienced more than four major epidemic waves with high mortality rates [29, 30]. Given the high prevalence and mortality of COVID-19 in Iran and its limited healthcare resources [31, 32], it is vital to have a prediction model based on Iranian conditions and local data. Therefore, this study aimed to develop machine learning models to predict death caused by COVID-19. Many previous models rely on laboratory, imaging, or treatment data [16, 25–28]; in contrast, we propose models based on demographic data, symptoms, and comorbidities, which are readily available and easy to collect. We also conducted a bias analysis of the machine learning models across patient subgroups to examine whether their predictions are biased.

2. Materials and Methods

2.1. Population and Data

We extracted data from the Khuzestan COVID-19 registry system of Ahvaz Jundishapur University of Medical Sciences (AJUMS). Since the beginning of the pandemic, this registry has collected data on suspected (based on clinical signs) and confirmed (based on PCR or CT scan results) outpatients and inpatients in Khuzestan province, Iran. The registry collects demographic data, signs and symptoms, patient outcomes, PCR and CT results, and comorbidities from 38 hospitals. The details of data collection and data quality control have been published elsewhere [30]. For this modeling study, we included only patients with a diagnosis of COVID-19 confirmed by PCR test or CT scan. Furthermore, we included only patients who were hospitalized for more than 24 hours: outpatients and inpatients with short stays (less than 24 hours) had a lot of missing data, so we excluded them from the final analysis. Patients of all age groups were included. Finally, we extracted data for 10,657 patients. The number of nonsurviving patients (until discharge) was 1711 (16.06%); 8946 patients (83.94%) were discharged alive. Figure 1 shows the steps of this study.
Figure 1

Overview of the study steps.
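The inclusion criteria in Section 2.1 can be expressed as a simple record filter. This is a minimal sketch, not the registry's actual extraction query: the `Record` fields (`pcr_positive`, `ct_abnormal`, `inpatient`, `length_of_stay_hours`, `died_in_hospital`) and the function names are hypothetical, because the registry schema is not described here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical registry fields; the real schema is not given in the paper.
    pcr_positive: bool
    ct_abnormal: bool
    inpatient: bool
    length_of_stay_hours: float
    died_in_hospital: bool


def is_included(record: Record) -> bool:
    """Confirmed COVID-19 (positive PCR or abnormal CT) inpatient staying more than 24 hours."""
    confirmed = record.pcr_positive or record.ct_abnormal
    return confirmed and record.inpatient and record.length_of_stay_hours > 24


def in_hospital_death_rate(records: list[Record]) -> float:
    """Share of included patients who died before discharge."""
    cohort = [r for r in records if is_included(r)]
    return sum(r.died_in_hospital for r in cohort) / len(cohort)
```

Applied to the reported cohort, 1711 deaths among 10,657 included patients give a rate of 1711 / 10657 ≈ 0.1606, the 16.06% stated in Section 2.1.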

2.2. Data Preprocessing

2.2.1. Imputing Missing Variables

Because of the data quality controls in the registry, the database had a low rate of missing data: 28 variables had missing rates below 4% (Supplement 1, Table S1). In machine learning, data imputation is a standard approach to improving model performance, and imputation with the mean, median, or mode is common. We imputed missing values with the mean for age and with the most frequent value (mode) for nonnumerical variables [11, 33].
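The imputation rule described above (mean for age, most frequent value for nonnumerical variables) can be sketched in plain Python. This is an illustrative sketch only, assuming each record is a dictionary in which `None` marks a missing value; the `impute` function and the column names are hypothetical and do not reflect the software used in the study.

```python
from collections import Counter
from statistics import mean


def impute(rows, numeric_columns=("age",)):
    """Replace None with the column mean (numeric columns) or the mode (other columns)."""
    columns = {col for row in rows for col in row}
    fill_values = {}
    for col in columns:
        observed = [row[col] for row in rows if row.get(col) is not None]
        if not observed:
            continue  # no observed values to impute from; gaps stay None
        if col in numeric_columns:
            fill_values[col] = mean(observed)
        else:
            # most_common(1) returns [(value, count)] for the most frequent value
            fill_values[col] = Counter(observed).most_common(1)[0][0]
    return [
        {col: (row.get(col) if row.get(col) is not None else fill_values.get(col)) for col in columns}
        for row in rows
    ]
```

Computing the fill values once per column keeps every record imputed from the same statistics, which mirrors a single pass of mean/mode imputation over the whole dataset.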

2.2.2. Features and Feature Selection

The outcome measure of the study is in-hospital mortality until discharge, recorded as a binary variable (yes/no). The dataset contains 60 input variables. Age and the number of comorbidities are numerical; oxygen saturation level (PO2) has two values: below and above 93%. We created three dummy variables for the diagnosis method (only positive PCR; only abnormal CT; both positive PCR and abnormal CT). All other variables are binary (yes/no). For feature selection, we applied univariate analysis, using the chi-square or Fisher exact test for nonnumerical variables and the Mann-Whitney U test for age and number of comorbidities (because they were not normally distributed). We then built four feature sets for the prediction models. Feature set 1 was created with the feature selection node in IBM SPSS Modeler, which identifies important features based on univariate analysis as well as the frequency of missing values and the percentage of records with the same value. Feature set 2 consisted of the variables that were significant in univariate analysis (P value <0.05). Feature set 3 included the variables with P value <0.2 in univariate analysis. Feature set 4 included all 60 variables. Table 1 shows the variables in each of these feature sets.
Table 1

Different feature sets.

| Feature set | Method | Number of features | Features |
| --- | --- | --- | --- |
| 1 | Feature selection node (default settings) | 17 | Age, contact with COVID-19 patients, cough, diabetes, diagnosis only by abnormal CT, diagnosis only by positive PCR, diagnosis by positive PCR and abnormal CT, gender, heart diseases, HTN, ICU admission, intubation, muscle ache, number of comorbidities, oxygen therapy, blood oxygen saturation level, and respiratory distress |
| 2 | Univariate analysis (P value <0.05) | 32 | Age, cancer, chronic kidney disease, chronic liver disease, contact (with a probable or confirmed case in the 14 days before the onset of symptoms), convulsion, cough, diabetes, diagnosis only by abnormal CT, diagnosis only by positive PCR, diagnosis by positive PCR and abnormal CT, dialysis, diarrhea, dizziness, drug abuse, gender, headache, heart diseases, HIV/AIDS, HTN, ICU admission, immune diseases, intubation, nervous system diseases, number of comorbidities, other chronic lung diseases, oxygen therapy, paralysis, blood oxygen saturation level, pregnancy, respiratory distress, and unconsciousness |
| 3 | Univariate analysis (P value <0.2) | 40 | Feature set 2 + asthma, chronic hematology diseases, mental disorders, muscle ache, other diseases (comorbidities), drowsiness, gustatory dysfunction, and weakness |
| 4 | All features | 60 | Feature set 3 + abdominal pain, autoimmune disease, chest pain, chills, constipation, ocular manifestations, fever, GI bleeding, hemoptysis, nausea, anorexia, other GI signs, paresis, runny nose, skin manifestations, sore throat, olfactory dysfunction, smoking, sweating, and vomiting |
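The univariate filter behind feature sets 2 and 3 can be illustrated for a binary feature. This sketch is a simplification under stated assumptions: it uses the Pearson chi-square test without continuity correction on a 2 × 2 table and omits the Fisher exact test (used for sparse tables) and the Mann-Whitney U test (used for age and number of comorbidities). The P value relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(√(x/2)). The function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]: rows = feature yes/no, columns = died/survived.
    Returns (statistic, P value) with 1 degree of freedom."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of chi-square(1): P(Z^2 > x) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2))


def feature_sets(p_values, thresholds=(0.05, 0.2)):
    """Group features by univariate P value cutoffs (as for feature sets 2 and 3)."""
    return {t: sorted(f for f, p in p_values.items() if p < t) for t in thresholds}
```

For example, the cough counts in Table 2 (899 of 1711 deaths and 5296 of 8946 survivors had a cough) give `chi_square_2x2(899, 5296, 812, 3650)` a P value below 0.0001, consistent with the table.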

2.2.3. Data Balancing

We first developed our models with a variety of machine learning algorithms on the original dataset (dataset 1). These models performed poorly in terms of sensitivity because of the small number of samples in the death class (83.94% surviving vs. 16.06% nonsurviving; ratio = 5.23), so they did not predict death well. Various methods, such as oversampling the minority class or undersampling the majority class, can address this problem [11, 12]. We oversampled the death cases to create more balanced datasets: datasets 2 and 3 included 5,133 (36.5%; ratio = 1.74) and 8,938 (49.98%; ratio ≈ 1) nonsurviving patients, respectively. We developed our models with all four feature sets on these three datasets.
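Oversampling the death class can be sketched as follows. The paper does not specify how the death cases were duplicated, so this sketch assumes simple random oversampling with replacement; `oversample` and its parameters are hypothetical.

```python
import random


def oversample(records, labels, minority_label=1, target_count=None, seed=0):
    """Keep every original record and append minority-class duplicates, drawn at
    random with replacement, until the minority class has target_count records."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    if target_count is None:
        # Default: match the size of the remaining (majority) class, i.e. a 1:1 ratio.
        target_count = len(labels) - len(minority_idx)
    n_extra = max(0, target_count - len(minority_idx))
    extra_idx = rng.choices(minority_idx, k=n_extra)
    idx = list(range(len(labels))) + extra_idx
    return [records[i] for i in idx], [labels[i] for i in idx]
```

With 8946 survivors and 1711 deaths, `target_count=5133` would reproduce the class sizes of dataset 2 (8946 / 5133 ≈ 1.74), and `target_count=8938` those of dataset 3.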

2.3. Model Development and Evaluation

We randomly divided the data into training (70%) and testing (30%) sets and developed our models using common machine learning algorithms that are usually reported to perform well in medicine, including Multilayer Perceptron (MLP) neural networks [11, 12, 34]; Chi-squared Automatic Interaction Detection (CHAID), C5, and Random Forest (RF) decision trees [11, 12, 33, 34]; Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel [12, 35, 36]; and Bayesian networks [12, 37–39]. We first developed models with the default parameter settings. We developed CHAID decision trees with a maximum depth of five and a minimum of two records in the nodes, and we implemented the C5 tree with a minimum of two records in the nodes. RF was implemented with a maximum depth of 10 and a minimum of five records in the nodes, using 100 models. The SVM model was implemented with a regularization parameter of 10 and a gamma of 0.1. We additionally developed MLPs with different numbers of neurons (5, 10, 15, and 20) in one and two hidden layers, as well as with the number of neurons suggested by the software. We also implemented the best CHAID, C5, and MLP models with the boosting ensemble method and 10-fold cross-validation. Furthermore, we implemented stacked models (combining individual models) [40]. Because the models developed on dataset 3 generally performed better, we developed the stacked models, based on the best individual models, on this dataset with the different feature sets.
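Two pieces of this setup can be made concrete: the random 70/30 split and the RBF kernel used by the SVM. This is a sketch under stated assumptions: the study used IBM SPSS Modeler, whose internal partitioning is not described here, and the kernel is written in the standard form K(x, z) = exp(−γ‖x − z‖²) with the reported γ = 0.1, assuming the software uses this parameterization. The function names are hypothetical.

```python
import math
import random


def train_test_split(n_records, train_fraction=0.7, seed=42):
    """Randomly partition record indices into training (70%) and testing (30%) sets."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    cut = round(train_fraction * n_records)
    return indices[:cut], indices[cut:]


def rbf_kernel(x, z, gamma=0.1):
    """Standard RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

A smaller gamma makes the kernel decay more slowly with distance, so each training record influences a wider region of the feature space.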

2.4. External Validation

For external validation, we extracted 1734 records from the Khuzestan COVID-19 registry system. These data came from four hospitals and from different time frames, and they were not used in training or testing the models. This dataset contained 1425 surviving and 309 nonsurviving patients. The inclusion and exclusion criteria were the same as for the training/testing dataset, described in Section 2.1. The best-performing models selected in the previous step, as well as the ensemble models, were validated on this dataset.

2.5. Subpopulation Bias Analysis

Previous studies have shown that predictive models may perform differently in different subpopulations, for example, in different sex or age groups [41, 42]. To assess this effect, we adopted the method suggested by Seyyed-Kalantari et al., who used the false-positive rate (FPR) and false-negative rate (FNR) in subpopulations to assess the underdiagnosis and overdiagnosis of machine learning models [41]. We similarly calculated FNR and FPR to assess the underprediction and overprediction of death by our models, using the best-performing models in external validation and the external dataset.
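The subgroup error rates can be computed directly from predictions. A minimal sketch, with death coded as 1 (the positive class): FPR = FP / (FP + TN) is the share of survivors wrongly predicted to die (overprediction), and FNR = FN / (FN + TP) is the share of deaths the model missed (underprediction). The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def subgroup_error_rates(y_true, y_pred, groups):
    """Return {group: (FPR, FNR)} with death coded as 1."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        key = ("tp" if p else "fn") if t else ("fp" if p else "tn")
        counts[g][key] += 1
    rates = {}
    for g, c in counts.items():
        negatives = c["fp"] + c["tn"]  # survivors in this group
        positives = c["fn"] + c["tp"]  # deaths in this group
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[g] = (fpr, fnr)
    return rates


# Hypothetical toy data: 1 = died, 0 = survived; groups are patient sex.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
rates = subgroup_error_rates(y_true, y_pred, groups)
```

In this toy example, the model misses deaths only in the female group (FNR = 0.5) and overpredicts death only in the male group (FPR = 0.5); gaps of this kind across groups are what a subgroup bias analysis is meant to reveal.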

2.6. Analysis

We used IBM SPSS Statistics version 23 for statistical analysis and IBM SPSS Modeler version 18 to develop and evaluate the machine learning models. We evaluated and compared the models using the confusion matrix, accuracy, precision, sensitivity, specificity, F-score, and area under the curve (AUC). To select the best-performing models, we compared the models obtained for each combination of dataset and feature set based on AUC and F-score.
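The threshold-based metrics listed above all derive from the four confusion-matrix counts, and the AUC can be computed as the probability that a randomly chosen death receives a higher predicted score than a randomly chosen survivor. The sketch below uses these standard definitions with death as the positive class; it is not the SPSS Modeler implementation, and the function names are hypothetical.

```python
def f_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (F1)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, specificity, and F-score (death = positive class)."""
    sensitivity = tp / (tp + fn)  # recall: share of deaths correctly predicted
    specificity = tn / (tn + fp)  # share of survivors correctly predicted
    precision = tp / (tp + fp)    # share of predicted deaths that were deaths
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score(precision, sensitivity),
    }


def auc(scores, labels):
    """AUC via pairwise comparison: P(score of a death > score of a survivor), ties count 1/2."""
    deaths = [s for s, y in zip(scores, labels) if y == 1]
    survivors = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if d > s else 0.5 if d == s else 0.0 for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))
```

As a consistency check, the precision (89.1%) and sensitivity (94.2%) reported for the C5 boosting model with feature set 2 in Table 5 give `f_score(0.891, 0.942)` ≈ 0.916, matching the table.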

2.7. Ethical Considerations

This study received ethical approval from the Ethics Research Committee of Ahvaz Jundishapur University of Medical Sciences (IR.AJUMS.REC.1400.325).

3. Results

3.1. Descriptive Data

We extracted data for 10,657 patients from the Khuzestan COVID-19 registry [30]. The number of nonsurviving patients (until discharge) was 1711 (16.06%); 8946 patients (83.94%) were discharged alive. Table 2 shows that death due to COVID-19 was significantly more frequent among men, older patients, and those who had been in contact with infected individuals. In addition, respiratory distress, convulsion, altered consciousness, and paralysis were more common among nonsurviving patients, whereas cough, headache, diarrhea, and dizziness were less prevalent among them. Oxygen saturation status was better among recovered patients than among those who died. Moreover, most comorbidities and risk factors (except pregnancy, which was more common among survivors), as well as intubation, oxygen therapy at the beginning of hospitalization, and ICU admission, were significantly more frequent among those who died.
Table 2

Comparison of surviving and nonsurviving patients.

| Variables | Alive (n = 8946) | Dead (n = 1711) | Total patients (n = 10657) | P value |
| --- | --- | --- | --- | --- |
| Age | | | | |
| Mean (±SD), years | 54 ± 18.3 | 65.7 ± 16.2 | 55.88 ± 18.46 | <0.0001 |
| Median (Q1, Q3) | 56 (42, 67) | 67 (57, 77) | 58 (43, 69) | |
| Sex, male | 4611 (51.5) | 1010 (59) | 5621 (52.7) | <0.0001 |
| Contact with infected people (yes) | 3169 (35.4) | 706 (41.3) | 3875 (36.4) | <0.0001 |
| Signs and symptoms | | | | |
| Cough (yes) | 5296 (59.2) | 899 (52.5) | 6195 (58.1) | <0.0001 |
| Respiratory distress (yes) | 5021 (56.1) | 1288 (75.3) | 6309 (59.2) | <0.0001 |
| Fever (yes) | 4225 (47.2) | 802 (46.9) | 5027 (47.2) | 0.788 |
| Muscle aches (yes) | 2417 (27) | 426 (24.9) | 2843 (26.7) | 0.069 |
| Chills (yes) | 70 (0.8) | 9 (0.5) | 79 (0.7) | 0.257 |
| Vomiting (yes) | 452 (5.1) | 79 (4.6) | 531 (5) | 0.448 |
| Headache (yes) | 480 (5.4) | 51 (3) | 531 (5) | <0.0001 |
| Chest pain (yes) | 304 (3.4) | 61 (3.6) | 365 (3.4) | 0.728 |
| Diarrhea (yes) | 315 (3.5) | 40 (2.3) | 355 (3.3) | 0.012 |
| Sore throat (yes) | 48 (0.5) | 4 (0.2) | 52 (0.5) | 0.100 |
| Gustatory dysfunction (yes) | 98 (1.1) | 10 (0.6) | 108 (1) | 0.053 |
| Olfactory dysfunction (yes) | 123 (1.4) | 19 (1.1) | 142 (1.3) | 0.382 |
| Abdominal pain (yes) | 203 (2.3) | 31 (1.8) | 234 (2.2) | 0.237 |
| Runny nose (yes) | 8 (0.1) | 0 (0.0) | 8 (0.1) | 0.216 |
| Convulsion (yes) | 42 (0.5) | 19 (1.1) | 61 (0.6) | 0.001 |
| Altered consciousness (yes) | 213 (2.4) | 419 (24.5) | 633 (5.9) | <0.0001 |
| GI bleeding (yes) | 5 (0.1) | 0 (0.0) | 5 (0.0) | 0.417 |
| Skin lesion/rash (yes) | 11 (0.1) | 3 (0.2) | 14 (0.1) | 0.584 |
| Dizziness (yes) | 249 (2.8) | 30 (1.8) | 279 (2.6) | 0.014 |
| Paresis (yes) | 54 (0.6) | 11 (0.6) | 65 (0.6) | 0.848 |
| Paralysis (yes) | 22 (0.2) | 13 (0.8) | 35 (0.3) | 0.001 |
| Weakness (yes) | 350 (3.9) | 80 (4.7) | 430 (4) | 0.142 |
| Sweating (yes) | 11 (0.1) | 2 (0.1) | 13 (0.1) | 0.947 |
| Ocular manifestations (yes) | 3 (0.0) | 0 (0.0) | 3 (0.0) | 0.449 |
| Hemoptysis (yes) | 6 (0.1) | 2 (0.1) | 8 (0.1) | 0.491 |
| Drowsiness (yes) | 3 (0.0) | 2 (0.1) | 5 (0.0) | 0.185 |
| Constipation (yes) | 7 (0.1) | 1 (0.1) | 8 (0.1) | 0.784 |
| Nausea (yes) | 478 (5.3) | 89 (5.2) | 567 (5.3) | 0.811 |
| Anorexia (yes) | 724 (8.1) | 138 (8.1) | 862 (8.1) | 0.969 |
| Other GI symptoms (yes) | 7 (0.1) | 0 (0.0) | 7 (0.1) | 0.247 |
| Blood oxygen saturation level | | | | |
| (i) Less than 93% | 2046 (22.9) | 934 (54.6) | 2980 (28) | <0.0001 |
| (ii) More than 93% | 6900 (77.1) | 777 (45.4) | 7677 (72) | |
| Comorbidity | | | | |
| Any comorbidity (yes) | 3314 (37) | 826 (48.3) | 4140 (38.8) | <0.0001 |
| Number of comorbidities | | | | <0.0001 |
| 0 | 5632 (63) | 885 (51.7) | 6517 (61.2) | |
| 1 | 1868 (20.9) | 391 (22.9) | 2259 (21.2) | |
| 2 | 946 (10.6) | 275 (16.1) | 1221 (11.5) | |
| 3 | 396 (4.4) | 112 (6.5) | 508 (4.8) | |
| >3 | 104 (1.1) | 48 (2.8) | 152 (1.5) | |
| Number of comorbidities (mean ± SD) | 0.6 ± 0.9 | 0.87 ± 1.1 | 0.65 ± 0.97 | <0.0001 |
| Hypertension (yes) | 1291 (14.4) | 356 (20.8) | 1647 (15.5) | <0.0001 |
| Heart diseases (yes) | 1102 (12.3) | 294 (17.2) | 1396 (13.11) | <0.0001 |
| Diabetes (yes) | 1577 (17.6) | 376 (22) | 1953 (18.3) | <0.0001 |
| Immunodeficiency diseases (yes) | 32 (0.4) | 13 (0.8) | 45 (0.4) | 0.019 |
| Asthma (yes) | 198 (2.2) | 28 (1.6) | 226 (2.1) | 0.129 |
| Neurological diseases (yes) | 140 (1.6) | 49 (2.9) | 189 (1.8) | <0.0001 |
| Chronic kidney diseases (yes) | 289 (3.2) | 114 (6.7) | 403 (3.8) | <0.0001 |
| Dialysis (yes) | 78 (0.9) | 33 (1.9) | 111 (1) | <0.0001 |
| Other chronic lung diseases (yes) | 136 (1.5) | 44 (2.6) | 180 (1.7) | 0.002 |
| Chronic hematologic diseases (yes) | 74 (0.8) | 20 (1.2) | 94 (0.9) | 0.166 |
| Cancer (yes) | 172 (1.9) | 80 (4.7) | 252 (2.4) | <0.0001 |
| Autoimmune diseases (yes) | 2 (0.0) | 0 (0.0) | 2 (0.0) | 0.536 |
| Chronic liver diseases (yes) | 46 (0.5) | 16 (0.9) | 62 (0.6) | 0.036 |
| HIV/AIDS (yes) | 7 (0.1) | 5 (0.3) | 12 (0.1) | 0.016 |
| Mental disorders (yes) | 26 (0.3) | 2 (0.1) | 28 (0.3) | 0.198 |
| Smoking (yes) | 143 (1.6) | 33 (1.9) | 176 (1.7) | 0.326 |
| Drug abuse (yes) | 54 (0.6) | 21 (1.2) | 75 (0.7) | 0.005 |
| Other comorbidities (yes) | 286 (3.2) | 69 (4) | 355 (3.3) | 0.078 |
| Pregnancy | 63 (0.7) | 2 (0.1) | 65 (0.6) | 0.004 |
| Care and treatment | | | | |
| Intubation (yes) | 308 (3.44) | 962 (56.2) | 1270 (11.9) | <0.0001 |
| ICU care (yes) | 1323 (14.8) | 1088 (63.6) | 2411 (22.6) | <0.0001 |
| Oxygen therapy (yes) | 2921 (32.7) | 682 (39.9) | 3603 (33.8) | <0.0001 |
| Diagnosis method | | | | |
| (i) Only abnormal CT | 3197 (35.7) | 538 (31.4) | 3735 (35) | <0.0001 |
| (ii) Only positive PCR | 1161 (13) | 160 (9.4) | 1321 (12.4) | <0.0001 |
| (iii) Positive PCR and abnormal CT | 4588 (51.3) | 1013 (59.2) | 5601 (52.6) | <0.0001 |

Values are n (%) unless otherwise indicated. P < 0.05 indicates a significant difference.

3.2. The Machine Learning Algorithms and Their Evaluation

The results of the models developed with different settings on the three datasets and four feature sets are reported below.

3.2.1. The Machine Learning Algorithms on Original Dataset 1

Details of model performance are given in Supplement 1 (Tables S2–S5). On the original dataset 1, the lowest and highest accuracies were 84.52% (RF with 32 features) and 91.12% (Bayesian network with 32 features), respectively, and the minimum and maximum AUCs were 0.757 (C5 with 32 features) and 0.914 (Bayesian network with 32 features), respectively. The sensitivity for predicting death on original dataset 1 was low, ranging from 0.484 (MLP network with 60 features) to 0.775 (RF with 32 features), which indicates that the models' sensitivity on imbalanced data is not adequate. Table 3 shows the performance of the top 10 models on the test data of dataset 1. According to the table, the best two models were the Bayesian network and the CHAID tree, both with 32 features. The ROC curves for the best models are presented in Supplementary Figure S1.
Table 3

Top 10 models developed on original dataset 1.

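| Model | Setting | Feature set | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-score | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bayesian network | Default | 2 | 91.12 | 64.7 | 96.2 | 76.4 | 0.701 | 0.914 |
| CHAID | Default | 2 | 90.76 | 54 | 97.8 | 82.6 | 0.653 | 0.909 |
| MLP | 2.5.5 boosting | 1 | 90.63 | 53.6 | 97.7 | 81.5 | 0.647 | 0.904 |
| MLP | Boosting 1.10 | 3 | 90.79 | 54 | 97.8 | 82.3 | 0.652 | 0.903 |
| C5 | Boosting | 2 | 90.7 | 56.4 | 97.3 | 79.9 | 0.662 | 0.901 |
| MLP | 2.10.10 | 2 | 90.55 | 53.4 | 97.7 | 81.5 | 0.646 | 0.901 |
| MLP | 2.5.5 | 1 | 90.31 | 55.4 | 97 | 77.6 | 0.646 | 0.901 |
| RF | Default | 2 | 84.52 | 77.5 | 85.9 | 51.3 | 0.617 | 0.9 |
| MLP | 2.20.20 | 3 | 90.51 | 53.6 | 97.5 | 80.5 | 0.643 | 0.899 |
| Bayesian network | Default | 1 | 90.46 | 55.5 | 97.1 | 78.5 | 0.65 | 0.899 |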

For MLPs, the numbers in the setting indicate the number of hidden layers, the number of neurons in hidden layer 1, and the number of neurons in hidden layer 2.

3.2.2. The Machine Learning Algorithms on Dataset 2

Details of model performance on dataset 2 are given in Supplement 1, Tables S6–S9. The lowest and highest accuracies were 82.64% (MLP with 60 features) and 87.86% (RF with 60 features), respectively, and the minimum and maximum AUCs were 0.888 (MLP with 60 features) and 0.942 (SVM with 60 features), respectively. The sensitivity for predicting death ranged from 0.658 (MLP network) to 0.861 (CHAID tree with 32 features). The best results obtained for each algorithm on dataset 2 are shown in Supplementary Figure S2. According to Table 4, the SVM and C5 models performed best, with 60 and 40 features, respectively.
Table 4

Top 10 models developed on dataset 2.

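| Model | Setting | Feature set | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-score | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVM | RBF default | 4 | 87.83 | 83.4 | 90.3 | 82.9 | 0.832 | 0.942 |
| C5 | Boosting | 3 | 87.44 | 81.8 | 90.6 | 82.7 | 0.822 | 0.94 |
| SVM | RBF default | 3 | 87.59 | 82.7 | 90.3 | 82.4 | 0.826 | 0.938 |
| C5 | Boosting | 4 | 87.88 | 79.9 | 92.4 | 85.5 | 0.826 | 0.938 |
| RF | Default | 4 | 87.86 | 85.7 | 89.1 | 81.5 | 0.836 | 0.931 |
| C5 | Boosting | 2 | 86.68 | 78.5 | 91.5 | 84.3 | 0.813 | 0.927 |
| C5 | Boosting | 1 | 85.99 | 77.2 | 90.8 | 82.2 | 0.797 | 0.926 |
| SVM | RBF default | 2 | 86.61 | 79 | 91.1 | 83.7 | 0.813 | 0.926 |
| MLP | 1.10 | 3 | 85.38 | 77 | 90 | 80.9 | 0.789 | 0.923 |
| RF | Default | 1 | 85.26 | 85.2 | 85.3 | 76.2 | 0.804 | 0.923 |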

For MLPs, the numbers in the setting indicate the number of hidden layers, the number of neurons in hidden layer 1, and the number of neurons in hidden layer 2.

3.2.3. The Machine Learning Algorithms on Dataset 3

Details of model performance on dataset 3 are given in Supplement 1, Tables S10–S13. The lowest and highest accuracies were 81.27% (CHAID tree with 32 features) and 92.77% (C5 with 60 features), respectively, and the minimum and maximum AUCs were 0.899 (CHAID with 32 features) and 0.972 (C5 with 60 features), respectively. The sensitivity for predicting death ranged from 0.752 (MLP with 60 features) to 0.951 (C5 tree with 60 features). The best results obtained for each algorithm on dataset 3 are shown in Supplementary Figure S3. According to Table 5, the C5 model performed best across the different feature sets, and SVM with 60 features was also among the best models.
Table 5

Top 10 models developed on dataset 3.

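| Model | Setting | Feature set | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-score | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C5 | Boosting | 4 | 92.77 | 95.1 | 90.5 | 90.8 | 0.929 | 0.972 |
| C5 | Boosting | 3 | 91.74 | 93.6 | 89.8 | 90.5 | 0.92 | 0.965 |
| C5 | Boosting | 2 | 91.18 | 94.2 | 88 | 89.1 | 0.916 | 0.96 |
| SVM | RBF default | 4 | 90.16 | 92.7 | 87.7 | 88.1 | 0.903 | 0.956 |
| C5 | Boosting | 1 | 89.28 | 91.3 | 87.3 | 87.7 | 0.895 | 0.952 |
| SVM | RBF default | 3 | 88.81 | 90.5 | 87.1 | 87.9 | 0.892 | 0.944 |
| MLP | 2.15.15 boosting | 3 | 88.59 | 90.2 | 86.9 | 87.7 | 0.889 | 0.94 |
| MLP | 2.12.12 boosting | 4 | 87.61 | 88.5 | 86.8 | 86.8 | 0.876 | 0.938 |
| C5 | Default | 3 | 87.4 | 89.8 | 85 | 86.1 | 0.879 | 0.934 |
| SVM | RBF default | 2 | 86.34 | 86.6 | 86.1 | 86.6 | 0.866 | 0.932 |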

For MLPs, the numbers in the setting indicate the number of hidden layers, the number of neurons in hidden layer 1, and the number of neurons in hidden layer 2.

3.3. Ensemble Models

Table 6 shows that the best ensemble model had 89.13% accuracy and an AUC of 0.961. However, comparing these ensemble models with the corresponding individual models (Table 5) shows that the C5 models perform better than the ensemble models, although the ensemble models are comparable to the other individual models.
Table 6

Ensemble models developed on dataset 3.

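| ID | Included models | Feature set | Accuracy (%) | Sensitivity | Specificity | Precision | F-score | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Table S10 | 1 | 86.10 | 0.799 | 0.924 | 0.914 | 0.853 | 0.954 |
| 2 | Table S11 | 2 | 87.39 | 0.859 | 0.889 | 0.888 | 0.873 | 0.954 |
| 3 | Table S12 | 3 | 87.26 | 0.831 | 0.915 | 0.908 | 0.867 | 0.954 |
| 4 | Table S13 | 4 | 89.13 | 0.864 | 0.919 | 0.916 | 0.890 | 0.961 |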

3.4. External Validation

We evaluated all ensemble models (Table 6) and the top 10 models developed on dataset 3 (Table 5) on the external dataset. As shown in Table 7, the C5 boosting models with feature sets 1 and 2 had the best scores.
Table 7

External validation of the models developed on dataset 3.

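| Model | Setting | Feature set | Accuracy (%) | Sensitivity | Specificity | Precision | F-score | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C5 | Boosting | 1 | 92.56 | 0.955 | 0.919 | 0.720 | 0.821 | 0.974 |
| C5 | Boosting | 2 | 91.81 | 0.964 | 0.908 | 0.695 | 0.808 | 0.98 |
| SVM | RBF default | 3 | 91.00 | 0.848 | 0.924 | 0.706 | 0.771 | 0.955 |
| Ensemble 2 | | 2 | 87.77 | 0.861 | 0.881 | 0.611 | 0.715 | 0.954 |
| SVM | RBF default | 2 | 88.24 | 0.890 | 0.881 | 0.618 | 0.729 | 0.953 |
| Ensemble 1 | | 1 | 88.75 | 0.819 | 0.902 | 0.645 | 0.722 | 0.949 |
| C5 | Boosting | 3 | 86.51 | 0.935 | 0.850 | 0.575 | 0.712 | 0.948 |
| Ensemble 3 | | 3 | 88.18 | 0.783 | 0.903 | 0.637 | 0.702 | 0.931 |
| MLP | 2.15.15 boosting | 3 | 87.95 | 0.767 | 0.904 | 0.634 | 0.694 | 0.914 |
| MLP | 2.12.12 boosting | 4 | 87.31 | 0.754 | 0.899 | 0.618 | 0.679 | 0.914 |
| Ensemble 4 | | 4 | 86.62 | 0.770 | 0.887 | 0.596 | 0.672 | 0.91 |
| C5 | Boosting | 4 | 85.64 | 0.748 | 0.880 | 0.575 | 0.650 | 0.889 |
| C5 | Default | 3 | 85.24 | 0.780 | 0.868 | 0.562 | 0.653 | 0.887 |
| SVM | RBF default | 4 | 83.79 | 0.725 | 0.862 | 0.533 | 0.615 | 0.868 |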

For MLPs, the numbers in the setting indicate the number of hidden layers, the number of neurons in hidden layer 1, and the number of neurons in hidden layer 2.

3.5. Subpopulation Bias Analysis

We selected the four best models in external validation for subpopulation bias analysis (Supplement 1, Table S14). Figures 2 and 3 show the FPR and FNR of these models. As these figures indicate, most of these models perform better on female patients than on male patients, and their performance decreases in older patients. In terms of FPR, Figure 2 indicates that the SVM and C5 (feature set 2) models have less biased predictions across gender and age groups. In terms of FNR, Figure 3 shows that C5 (feature set 2) has less biased predictions.
Figure 2

Subgroup false-positive rate (FPR) for different models. (a) C5 model on feature set 1. (b) C5 model on feature set 2. (c) SVM model on feature set 3. (d) Ensemble model on feature set 2.

Figure 3

Subgroup false-negative rate (FNR) for different models. (a) C5 model on feature set 1. (b) C5 model on feature set 2. (c) SVM model on feature set 3. (d) Ensemble model on feature set 2.

3.6. Comparison of the Models

A comparison of the models showed that balancing the data increased sensitivity and AUC. Accuracy, however, decreased on dataset 2 and increased on dataset 3. Furthermore, models with 60 and 40 features performed better. Overall, the C5 model with 60 features outperformed the rest on all evaluation metrics; however, in external validation, the C5 boosting models with feature sets 1 (17 features) and 2 (32 features) showed better external validity. The subpopulation analysis suggests that the C5 boosting model with 32 features is less biased.

3.7. Variable Importance

Figure 4 shows the importance of each variable in the selected C5 model. Intubation, number of comorbidities, age, gender, respiratory distress, blood oxygen saturation level, ICU admission, cough, unconsciousness, positive PCR, and abnormal CT were the most important predictors of death in this model.
Figure 4

Variable importance of the selected model.

4. Discussion

In the first stage of the study, risk factors for death due to COVID-19 were identified using univariate analysis. Then, based on the important features, different machine learning models were developed to predict death. The results showed significant differences between recovered and nonrecovered patients in terms of age, sex, contact with infected people, respiratory distress, convulsion, altered consciousness, paralysis, blood oxygen saturation level, number of comorbidities, intubation, oxygen therapy, and need for ICU services. We found that intubation, number of comorbidities, age, gender, respiratory distress, blood oxygen saturation level, ICU admission, cough, unconsciousness, positive PCR, and abnormal CT were the most important predictors of death. Other studies have shown that age [17, 18, 23, 27, 28, 43], male gender [43], respiratory disease [16, 17], number of comorbidities [43], and low oxygen saturation [17, 18, 23, 43] increase the risk of death from COVID-19. Some researchers have reported that high blood pressure, heart disease, cancer, kidney disease [16, 17], diabetes [18], cerebrovascular diseases [28], smoking [18, 23], and asthma [16] increase mortality from COVID-19. However, our selected model did not rank these factors among the most important predictors. It is worth noting that these risk factors increase the number of comorbidities in a patient, and this factor was important in the C5 model. We developed various models with different features to predict death from COVID-19. Considering both external validation and the subpopulation bias analysis, the C5 decision tree with 32 features performed best. Similarly, several studies have developed machine learning models for predicting death from COVID-19 [16–23, 25–28, 43–45]. Because these studies used a variety of variables (demographic, laboratory, radiographic, therapeutic, signs and symptoms, and comorbidities) and datasets, they are not easy to compare.
For example, some researchers used laboratory data in addition to other variables to develop models [17, 23, 28, 43], and one study used only laboratory variables [45]. In another study, vital signs and imaging results were used to develop models [23]. Nevertheless, the variables used in our study were similar to those of most studies. A comparison with previous studies showed that our selected model performed better than most of those models (Table 8). The model developed by Gao et al. [43] performed better (AUC = 0.976 vs. AUC = 0.972); however, it was developed with a small sample size. In addition, the model developed by Yan et al. [19] had a higher F-score (F = 0.97) than our selected model; however, Barish et al. [46] showed that Yan's model did not perform well in external validation. Khan's model [26] also has a higher F-score than ours. Khan et al. and Gao et al. used imbalanced data, and Barish et al. [46] have shown that models developed on imbalanced data to predict death from COVID-19 may not give accurate results in real-world settings.
Table 8

Some machine learning models suggested in the literature to predict death from COVID-19.

| Author | Number of patients, death rate, number of features | Models | Accuracy | AUC |
| --- | --- | --- | --- | --- |
| Muhammad et al. [44] | 1505, NA, 4 | Decision tree (DT) | 99.85 | NA |
| | | LR | 97.49 | NA |
| | | SVM | 98.85 | NA |
| | | Naive Bayes | 97.52 | NA |
| | | RF | 99.60 | NA |
| | | KNN | 98.06 | NA |
| Pourhomayoun and Shakibi [22] | 307382, NA, 57 | RF | 87.93 | 0.94 |
| | | ANN | 89.98 | 0.93 |
| | | SVM | 89.02 | 0.88 |
| | | KNN | 89.83 | 0.90 |
| | | LR | 87.91 | 0.92 |
| | | DT | 86.87 | 0.93 |
| Li et al. [20] | 2924, 8.8%, different features (83, 152, 5) | Gradient boosting decision tree, 83 features | 88.9 | 0.939 |
| | | LR, 152 features | 86.8 | 0.928 |
| | | LR, 5 features | 88.7 | 0.915 |
| Goncalves and Rouco [21] | 827601, 8.7%, 3 | Adaboost, gradient boosting, and RF | NA | 0.919 |
| | | LR | NA | 0.917 |
| An et al. [16] | 8000, 2.2%, 10 | SVM linear | 91.9 | 0.962 |
| | | LASSO | 91.1 | 0.963 |
| | | LASSO (14 days) | 86.8 | 0.944 |
| | | SVM linear (14 days) | 87.7 | 0.941 |
| | | LASSO (30 days) | 89.5 | 0.953 |
| | | SVM linear (30 days) | 87.7 | 0.948 |
| Yadaw et al. [18] | 3841, 8.1%, 17 and 3 | XGBoost (17 and 3 features) | NA | 0.91 |
| Yan et al. [19] | 375, 35%, 3 | XGBoost | 90 | F1: 0.97 |
| Gao et al. [43] | 2160, 11%, 14 | SVM | 95.8 | 0.976 |
| | | ANN | 95.6 | 0.976 |
| | | Ensemble | 95.5 | 0.976 |
| | | LR | 95.4 | 0.974 |
| | | GBDT | 94.8 | 0.953 |
| Chen et al. [28] | 192, 26% (only critically ill patients), 47 (17 nonlaboratory, 30 laboratory) | SVM linear | 93 (47 features); 87.8 (17 features); 85.6 (30 features) | NA |
| Booth et al. [45] | 398, 10.8%, 5 | SVM-RBF | 93 | |
| Parchure et al. [17] | 567, 17.8%, 55 | RF | 65.5 | 85.5 |
| Zhao et al. [23] | 641, 12.8%, 47 | LR | NA | 0.82 |
| Das et al. [27] | 3524, 2.1%, 4 | LR | 96.5 | 0.83 |
| | | SVM | 97 | 0.825 |
| | | KNN | 92.4 | 0.759 |
| | | RF | 92.4 | 0.787 |
| | | Gradient boosting | 97.1 | 0.787 |
| Chen et al. [25] | 1002 severe and critical cases, 16.1%, 7 | LR | NA | 0.903 |
| Khan et al. [26] | 103888, 5.7%, 15 | Deep neural network | 0.970 | F1: 0.985 |
| | | RF, XGBoost | 0.946 | 0.972 |
| | | LR, DT | 0.945 | 0.972 |
| | | KNN | 0.944 | 0.971 |

These studies did not report the AUC.

We found that machine learning models perform differently in subpopulations defined by gender and age group. Other studies similarly show that predictive models perform differently across ethnic groups, genders, age groups, and patients with different types of insurance [41, 42]. Therefore, researchers and clinicians should apply these models to different population groups with caution, and developing separate models for different patient groups may be necessary. A strength of our model is that it uses demographic data, symptoms, and comorbidities, all of which can be easily collected. Unlike some previous studies, we did not use laboratory, treatment, or imaging data, which can be considered a limitation. However, we assumed that all patients received broadly similar treatments. Moreover, models developed on treatment data may be difficult to apply because patients' treatments change. Furthermore, models that depend on laboratory and imaging data require considerable time and cost to collect their inputs when used in a real clinical environment. A comparison of our study with those that used laboratory and imaging data (Table 8) indicates that our selected model outperforms many of these models. One study also indicated that imaging data did not affect the performance of machine learning models in predicting death from COVID-19 [23]. In addition, the data used in our study were collected from 38 hospitals, which is a strength of the study. A similar study indicated that up to 20% missing data is acceptable for developing machine learning models in COVID-19 studies [18]; by comparison, the missing rate in our study was under 4%. Despite these strengths, some limitations should be considered. First, we analyzed subpopulation bias only by gender and age group; future studies should consider other variables in this analysis. Furthermore, several well-established scoring systems, such as APACHE and SOFA, exist [41, 42]; we recommend that researchers compare the performance of machine learning models with these systems for predicting death from COVID-19.
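A subgroup comparison of this kind can be sketched as follows, assuming binary predictions and a hypothetical record format of (subgroup, observed outcome, predicted outcome). For each subgroup, the function compares the share of patients the model predicts will die with the share who actually died: a positive difference suggests that mortality is overestimated in that subgroup, and a negative one suggests underestimation. It also reports sensitivity and specificity per subgroup. This is an illustrative assumption about how such a bias analysis could be computed, not necessarily the procedure used in this study; the function name `subgroup_report` and the example data are hypothetical.

```python
from collections import defaultdict


def subgroup_report(records):
    """Per-subgroup mortality bias and error rates.

    records: iterable of (group, y_true, y_pred), with 1 = death and 0 = survival.
    bias = predicted death rate - observed death rate
    (> 0 suggests overestimation, < 0 suggests underestimation).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        if y_true == 1 and y_pred == 1:
            c["tp"] += 1
        elif y_true == 0 and y_pred == 1:
            c["fp"] += 1
        elif y_true == 0 and y_pred == 0:
            c["tn"] += 1
        else:  # y_true == 1 and y_pred == 0: a missed death
            c["fn"] += 1

    report = {}
    for group, c in counts.items():
        n = c["tp"] + c["fp"] + c["tn"] + c["fn"]
        deaths = c["tp"] + c["fn"]
        survivors = c["tn"] + c["fp"]
        observed = deaths / n
        predicted = (c["tp"] + c["fp"]) / n
        report[group] = {
            "n": n,
            "observed_rate": observed,
            "predicted_rate": predicted,
            "bias": predicted - observed,
            # NaN when a subgroup has no deaths (or no survivors) to evaluate
            "sensitivity": c["tp"] / deaths if deaths else float("nan"),
            "specificity": c["tn"] / survivors if survivors else float("nan"),
        }
    return report


# Hypothetical toy data: (age group, died?, predicted death?)
records = [
    ("<65", 1, 1), ("<65", 0, 1), ("<65", 0, 0), ("<65", 0, 0),
    (">=65", 1, 0), (">=65", 1, 1), (">=65", 0, 0), (">=65", 1, 1),
]
for group, stats in subgroup_report(records).items():
    print(group, stats)
```

In this toy example the model overestimates mortality among younger patients (bias +0.25) and underestimates it among older patients (bias -0.25), even though its overall predicted and observed death rates are equal, which is the kind of pattern a pooled evaluation would hide.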

5. Conclusions

Different machine learning models were developed to predict the likelihood of death caused by COVID-19. The best prediction model was the C5 decision tree with 32 features (accuracy = 91.18%, AUC = 0.96, and F = 0.916). Therefore, this model could be used to identify high-risk patients and to improve the allocation of facilities, equipment, and medical staff for patients with COVID-19.
References

1.  A Clinical Decision Support System for Predicting the Early Complications of One-Anastomosis Gastric Bypass Surgery.

Authors:  Abbas Sheikhtaheri; Azam Orooji; Abdolreza Pazouki; Maryam Beitollahi
Journal:  Obes Surg       Date:  2019-07       Impact factor: 4.129

2.  Predicting CoVID-19 community mortality risk using machine learning and development of an online prognostic tool.

Authors:  Ashis Kumar Das; Shiba Mishra; Saji Saraswathy Gopalan
Journal:  PeerJ       Date:  2020-09-28       Impact factor: 2.984

3.  Development of a prognostic model for mortality in COVID-19 infection using machine learning.

Authors:  Adam L Booth; Elizabeth Abels; Peter McCaffrey
Journal:  Mod Pathol       Date:  2020-10-16       Impact factor: 7.842

4.  Prognostic Assessment of COVID-19 in the Intensive Care Unit by Machine Learning Methods: Model Development and Validation.

Authors:  Pan Pan; Yichao Li; Yongjiu Xiao; Bingchao Han; Longxiang Su; Mingliang Su; Yansheng Li; Siqi Zhang; Dapeng Jiang; Xia Chen; Fuquan Zhou; Ling Ma; Pengtao Bao; Lixin Xie
Journal:  J Med Internet Res       Date:  2020-11-11       Impact factor: 5.428

5.  Development and external evaluation of predictions models for mortality of COVID-19 patients using machine learning method.

Authors:  Simin Li; Yulan Lin; Tong Zhu; Mengjie Fan; Shicheng Xu; Weihao Qiu; Can Chen; Linfeng Li; Yao Wang; Jun Yan; Justin Wong; Lin Naing; Shabei Xu
Journal:  Neural Comput Appl       Date:  2021-01-05       Impact factor: 5.606

6.  Racial and Ethnic Disparities in COVID-19-Related Infections, Hospitalizations, and Deaths : A Systematic Review.

Authors:  Katherine Mackey; Chelsea K Ayers; Karli K Kondo; Somnath Saha; Shailesh M Advani; Sarah Young; Hunter Spencer; Max Rusek; Johanna Anderson; Stephanie Veazie; Mia Smith; Devan Kansagara
Journal:  Ann Intern Med       Date:  2020-12-01       Impact factor: 25.391

7.  Predictors of mortality in hospitalized COVID-19 patients: A systematic review and meta-analysis.

Authors:  Wenjie Tian; Wanlin Jiang; Jie Yao; Christopher J Nicholson; Rebecca H Li; Haakon H Sigurslid; Luke Wooster; Jerome I Rotter; Xiuqing Guo; Rajeev Malhotra
Journal:  J Med Virol       Date:  2020-07-11       Impact factor: 20.693

8.  Higher body mass index is an important risk factor in COVID-19 patients: a systematic review and meta-analysis.

Authors:  Vivek Singh Malik; Khaiwal Ravindra; Savita Verma Attri; Sanjay Kumar Bhadada; Meenu Singh
Journal:  Environ Sci Pollut Res Int       Date:  2020-07-24       Impact factor: 5.190

9.  Machine learning based early warning system enables accurate mortality risk prediction for COVID-19.

Authors:  Yue Gao; Guang-Yao Cai; Wei Fang; Hua-Yi Li; Si-Yuan Wang; Lingxi Chen; Yang Yu; Dan Liu; Sen Xu; Peng-Fei Cui; Shao-Qing Zeng; Xin-Xia Feng; Rui-Di Yu; Ya Wang; Yuan Yuan; Xiao-Fei Jiao; Jian-Hua Chi; Jia-Hao Liu; Ru-Yuan Li; Xu Zheng; Chun-Yan Song; Ning Jin; Wen-Jian Gong; Xing-Yu Liu; Lei Huang; Xun Tian; Lin Li; Hui Xing; Ding Ma; Chun-Rui Li; Fei Ye; Qing-Lei Gao
Journal:  Nat Commun       Date:  2020-10-06       Impact factor: 14.919
