
Predicting hospital readmission risk in patients with COVID-19: A machine learning approach.

Mohammad Reza Afrash¹, Hadi Kazemi-Arpanahi²,³, Mostafa Shanbehzadeh⁴, Raoof Nopour⁵, Esmat Mirbagheri⁶.

Abstract

Introduction: The coronavirus disease 2019 (COVID-19) epidemic overwhelmed health systems and caused severe shortages of hospital resources. In this critical situation, reducing COVID-19 readmissions could help sustain hospital capacity. This study aimed to select the features most strongly associated with COVID-19 readmission and to compare the capability of Machine Learning (ML) algorithms to predict COVID-19 readmission based on the selected features. Material and methods: The data of 5791 hospitalized patients with COVID-19 were retrospectively collected from a hospital registry system. The LASSO feature selection algorithm was used to select the most important features related to COVID-19 readmission. The HistGradientBoosting (HGB) classifier, Bagging classifier, Multi-Layered Perceptron (MLP), Support Vector Machine (SVM) with a linear kernel, SVM with an RBF kernel, and Extreme Gradient Boosting (XGBoost) classifiers were used for prediction. We evaluated the performance of the ML algorithms with 10-fold cross-validation using six performance evaluation metrics.
Results: Of the 42 features, 14 were identified as the most relevant predictors. The XGBoost classifier outperformed the other six ML models, with an average accuracy of 91.7%, specificity of 91.3%, sensitivity of 91.6%, F-measure of 91.8%, and AUC of 0.91.
Conclusion: The experimental results indicate that ML models can satisfactorily predict COVID-19 readmission. Besides considering the risk factors prioritized in this work, categorizing cases with a high risk of reinfection can make patient triage and hospital resource utilization more effective.


Keywords: AUC, Area under the curve; Artificial intelligence; CDSS, Clinical Decision Support Systems; COVID-19; COVID-19, Coronavirus disease 2019; CRISP, Cross-Industry Standard Process; Coronavirus; HGB, Hist Gradient Boosting; LASSO, Least Absolute Shrinkage and Selection Operator; ML, Machine learning; MLP, Multi-Layered Perceptron; Machine learning; Readmission; SVM, Support Vector Machine; XGBoost, Extreme Gradient Boosting

Year: 2022 PMID: 35280933 PMCID: PMC8901230 DOI: 10.1016/j.imu.2022.100908

Source DB: PubMed Journal: Inform Med Unlocked ISSN: 2352-9148

Introduction

Hospital readmission is a well-accepted metric of hospital care quality [1]. It is defined as a new hospitalization in the same hospital within a specified period, between 30 and 60 days, after the initial hospital discharge [2,3,4]. High readmission rates are most probably related to the quality of care delivered by hospitals and other health centers during or after the previous admission [5,6]. Because of the high costs that readmission imposes on hospitals and patients, it has gained substantial attention as one of the most important criteria for evaluating the quality of care and discharge procedures. Estimates show that 60% of patient readmissions can be prevented [7,8]. As COVID-19 spread, the health care systems of many countries were overwhelmed and could not meet patients' growing needs for diagnosis, treatment, and care services [9,10]. Many patients in such conditions were discharged with only partial recovery [11]. Meanwhile, due to the unknown and aggressive nature of the disease, the readmission rate of patients increased [12]. Readmission imposes additional costs on care organizations and patients. In addition, it lowers the quality indicators of service delivery and increases the rate of serious complications and deaths during the pandemic [13]. According to formal reports, about 5% of patients with confirmed COVID-19 require hospitalization, and reported readmission rates for the disease vary from 2 to 10% [14,15]. In this situation, strengthening the healthcare system against the pandemic requires attention to technological and intelligent solutions such as Clinical Decision Support Systems (CDSSs) [16,17]. CDSSs have attracted increasing interest because of the growing availability of large amounts of patient-level data [18,19].
CDSSs that use the patient data available at the time of admission may provide caregivers with valuable information about the risk of COVID-19 readmission [20,21]. Machine learning (ML) algorithms are complex and flexible modeling techniques that leverage big datasets to reveal new and practical patterns [18,22]. ML algorithms can reduce the uncertainties and ambiguities related to new diseases such as COVID-19 by providing evidence-based diagnostic and predictive models for risk assessment, screening, forecasting, and health planning [23,24]. Recently published works have shown that several ML methods are more accurate than conventional statistical models for predicting clinical outcomes in hospitalized COVID-19 patients, such as Length of Stay (LOS), hospital bed occupancy and turnover, Intensive Care Unit (ICU) admission, and respiratory intubation [25,26,27]. Given the high prevalence of the disease in our country and the limitations and shortage of healthcare resources [28], the purpose of this study is to develop an effective and efficient predictive model by comparing the performance of ML algorithms for COVID-19 readmission prediction. The present study therefore seeks to answer two questions. What are the most important predictor variables affecting readmission and worsening of patients after the first hospitalization? And which ML model is most effective for predicting readmission?

Material and methods

Study roadmap and experiment environment

The present study was a retrospective, single-center study conducted in 2022 to predict readmission in patients with confirmed COVID-19. It followed the Cross-Industry Standard Process (CRISP), one of the most popular process models for data mining projects. The study was carried out in five main steps: (1) data understanding, (2) data preprocessing, (3) feature selection, (4) classification, and (5) evaluation. Fig. 1 shows the study steps and sub-steps based on CRISP, and Fig. 2 shows the patient selection flow. All experiments with the data mining algorithms were run in the Python programming language.

Fig. 1

The roadmap of the proposed system for prediction of readmission based on the CRISP method.

Fig. 2

Flow chart describing patient selection.


Data set description

The included cases were described by 42 variables in three main classes: patient demographics (three variables), hospitalization (eight variables), and clinical (31 variables) (see Table 1). After reviewing the demographic, clinical, and hospitalization information of the patients with confirmed COVID-19, statistical analysis was performed to describe the differences between patients who were readmitted and those who were not. The relationship of each feature with readmission was checked by the Chi-square test.
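As an illustration of the Chi-square check described above, the following minimal sketch computes the Pearson statistic and p-value for a single 2x2 table. It is not the authors' code: the function name `chi_square_2x2` is hypothetical, no continuity correction is applied (the text does not say whether one was used), and the example counts are the ICU admission row as printed in Table 1.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the 2x2 table [[a, b], [c, d]].

    Assumption: no continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# ICU admission counts as printed in Table 1:
# rows = ICU yes / no, columns = readmitted / not readmitted.
stat, p = chi_square_2x2(462, 66, 408, 4855)
print(f"chi2 = {stat:.1f}, p = {p:.3g}")
```

For variables with more than two categories (for example COVID status) the same idea extends to larger contingency tables, with the degrees of freedom adjusted accordingly.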

Table 1

Patient characteristics variable data.

Patient characteristics	Variable	Category	Total (N)	Readmission (N)	Non-readmission (N)	P-value
Demographical	Sex	Female	2720	412	2308	<0.002**
	Sex	Male	3071	332	2739	<0.002**
	Marital status	Single	1219	631	588	<0.004**
	Marital status	Married	4572	239	4333	<0.004**
	Age	0–30	1363	152	1211	<0.001**
		30–60	1836	146	1690
		60–90	2952	572	2380
Hospitalization	Number of admissions	1	4921	0	4921
		2–4	780	780	0	<0.002**
		>4	90	90	0	<0.002**
	Type of admission	Inpatient care	2075	524	1551	<0.001**
	Type of admission	Outpatient care	3716	346	3370	<0.001**
	ICU admission	Yes	528	462	66	<0.002**
	ICU admission	No	5263	408	4855	<0.002**
	Oxygen therapy	Yes	720	543	177	<0.161
	Oxygen therapy	No	5071	327	4744	<0.161
	CRP on admission	Yes	380	329	51	<0.039**
	CRP on admission	No	5411	541	4870	<0.039**
	Duration of hospitalization	<24 h	3917	43	3874	<0.497**
		1–7 days	1465	519	946
		>7 days	409	308	101
	Patient status on discharge	Partial recovery	1430	774	656	<0.041**
		Complete recovery	3970	62	3908
		Dead	391	34	357
	Time to readmission	<30 days	1300	257	1043	<0.052
	Time to readmission	>30 days	4491	613	3878	<0.052
	COVID status	Critical	520	14	506	<0.001**
		Severe	1034	142	892
		Moderate	2300	540	1760
		Mild	1540	98	1442
		Recovered	397	14	383
	Severe kidney disease	Yes	240	49	191	<0.630
	Severe kidney disease	No	5551	821	4730	<0.630
	Solid organ transplantation	Yes	182	94	88	<0.951
	Solid organ transplantation	No	5609	776	4833	<0.951
	Lymphocytes on discharge	Yes	746	297	449	<0.832
	Lymphocytes on discharge	No	5045	573	4472	<0.832
	Coronary artery disease	Yes	570	381	189	<0.267
	Coronary artery disease	No	5221	489	4732	<0.267
	Cancer	Yes	168	119	49	<0.574
	Cancer	No	5623	751	4872	<0.574
	History of CT result	Normal	3321	540	2781	<0.059
	History of CT result	Abnormal	2470	330	2140	<0.059
	Pregnancy	Yes	94	23	71	<0.720
	Pregnancy	No	5697	847	4850	<0.720
	Congestive heart failure	Yes	350	180	170	<0.968
	Congestive heart failure	No	5441	690	4751	<0.968
	Cerebrovascular disease	Yes	49	8	41	<0.602
	Cerebrovascular disease	No	5742	862	4880	<0.602
	C reactive protein on admission	Yes	5308	710	4598	<0.057
	C reactive protein on admission	No	753	160	593	<0.057
	Congestive heart failure	Yes	135	94	41	<0.619
	Congestive heart failure	No	5656	776	4880	<0.619
	Asthma	Yes	74	41	33	<0.570
	Asthma	No	5717	829	4888	<0.570
	Metastatic solid tumor	Yes	14	3	11	<0.924
	Metastatic solid tumor	No	5776	867	4909	<0.924
	Diabetes mellitus	Yes	364	79	285	<0.738
	Diabetes mellitus	No	5427	791	4636	<0.738
	D-dimer	Yes	4680	361	4319	<0.042**
	D-dimer	No	1111	509	602	<0.042**
	Dyspnea	Yes	1640	490	1150	<0.069
	Dyspnea	No	4151	380	3771	<0.069
	Underlying diseases	Yes	839	538	301	<0.073
	Underlying diseases	No	4952	468	4484	<0.073
	Headache	Yes	4981	681	4300	<0.075
	Headache	No	810	189	621	<0.075
	Weakness and lethargy	Yes	5134	526	4608	<0.052
	Weakness and lethargy	No	657	344	313	<0.052
	Body pain	Yes	4391	617	3774	<0.061
	Body pain	No	1400	253	1147	<0.061
	Pain or pressure in the chest	Yes	2670	594	2076	<0.068
	Pain or pressure in the chest	No	3121	276	2845	<0.068
	High fever	Yes	4621	713	3908	<0.072
	High fever	No	1170	157	1013	<0.072
	Nausea & Vomiting	Yes	3910	672	3238	<0.067
	Nausea & Vomiting	No	1881	198	1683	<0.067
	Cough	Yes	4627	593	4034	<0.0512
	Cough	No	1164	277	887	<0.0512
	Gastrointestinal symptoms	Yes	234	56	178	<0.102
	Gastrointestinal symptoms	No	5557	814	4743	<0.102
	Chronic pulmonary	Yes	261	73	188	<0.284
	Chronic pulmonary	No	5530	797	4733	<0.284
	Hypertension	Yes	840	142	698	<0.043**
	Hypertension	No	4951	728	4223	<0.043**
	Consolidation	Yes	461	59	402	<0.0497**
	Consolidation	No	5330	811	4519	<0.0497**
	Pleural fluid	Yes	571	137	434	<0.0581
	Pleural fluid	No	5220	733	4487	<0.0581
	Hypersensitive troponin	Yes	892	261	568	<0.042*
	Hypersensitive troponin	No	4899	609	4290	<0.042*

Of the 5791 hospitalized COVID-19 patients, 3071 (53.04%) were male and 2720 (46.96%) were female, and the median age of participants was 57.25 years (interquartile range 0–100). In total, 528 (13.87%) were hospitalized in the ICU, and 2075 (86.13%) were hospitalized in general wards. Of the 5791 included patients, 870 (15.02%) were readmitted within 30 days after initial discharge.

Ethical consideration

The ethics committee of Ilam University of Medical Sciences approved the study (Ethics code: IR.MEDILAM.REC.1399.294). To protect the privacy and confidentiality of patients, we concealed the unique identifying information of all patients during data collection and presentation.

Preprocessing step

The dataset was preprocessed before the proposed models were trained. Several preprocessing steps were examined, including removing missing values (rows with more than 70% missing values were removed), standard scaling, min-max scaling, data validation, and undersampling, so that the data could be used correctly by the machine learning algorithms. Noisy and abnormal values, duplicates, and meaningless data, which affect the results of ML models, were examined and removed by two authors (M.A. and M.SH.).
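The row filter and the two scalers named above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' pipeline: the `MISSING` marker and all function names are hypothetical, and the standard scaler uses the population standard deviation, which is an assumption.

```python
from statistics import mean, pstdev

MISSING = None  # hypothetical marker for an empty field


def drop_sparse_rows(rows, max_missing_fraction=0.70):
    """Keep rows whose share of missing fields does not exceed the threshold."""
    kept = []
    for row in rows:
        missing = sum(1 for value in row if value is MISSING)
        if row and missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


def min_max_scale(values):
    """Rescale a numeric column linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Centre a numeric column to mean 0 and scale it to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


rows = [[63, MISSING, MISSING, MISSING], [47, 1, MISSING, 0]]
print(drop_sparse_rows(rows))       # only the second row has <= 70% missing fields
print(min_max_scale([20, 40, 60]))
```

Min-max scaling keeps the original shape of the distribution within a fixed range, whereas standard scaling is less sensitive to the exact minimum and maximum; which of the two fed each classifier is not stated in the text.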

Patient selection criteria

After applying the exclusion criteria, 6411 hospitalized cases out of 9180 confirmed COVID-19 patients were included in the study. In the preprocessing steps, 818 patient records were removed, which reduced the number of patient records to 5791 cases. Among them, 870 (15.02%) cases were readmitted within 30 days of the first hospitalization.

Feature selection

Feature selection, or variable selection, is needed before feeding data into ML algorithms, since superfluous dimensions affect classification performance and precision, and removing them decreases run time [29]. To select the most important features for predicting readmission, we used the Least Absolute Shrinkage and Selection Operator (LASSO) feature selection algorithm in this study. LASSO selects the most important and relevant features for predicting readmission in COVID-19 patients by shrinking the absolute values of the variables' coefficients. A variable whose coefficient shrinks to zero is eliminated from the feature subset, whereas a variable that retains a high coefficient value is included in the selected subset.
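The shrink-to-zero behaviour described above can be illustrated with a small coordinate-descent LASSO. This is a sketch under stated assumptions rather than the authors' implementation: it minimises the squared-error objective (1/(2n))·||y − Xβ||² + α·||β||₁ on centred data without an intercept, whereas the paper does not say which loss or solver was used for its binary outcome. All function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Fit LASSO coefficients by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return beta


def selected_features(names, beta, tol=1e-8):
    """Return the names of features whose coefficient was not shrunk to zero."""
    return [name for name, b in zip(names, beta) if abs(b) > tol]


# Toy data: the outcome depends on the first feature only.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
beta = lasso_coordinate_descent(X, y, alpha=0.5)
print(selected_features(["feature_a", "feature_b"], beta))
```

Larger values of `alpha` push more coefficients to exactly zero, so the penalty strength controls how many of the candidate variables survive selection.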

Machine learning methods

In this study, to predict readmission in patients with confirmed COVID-19, we used seven ML classification algorithms: the Hist Gradient Boosting (HGB) classifier, Bagging classifier, Multi-Layered Perceptron (MLP) classifier, Support Vector Machine (SVM) with a linear kernel, SVM with an RBF kernel, Extreme Gradient Boosting (XGBoost) classifier, and K-Nearest Neighbor (KNN) classifier.

Performance metrics

To evaluate the performance and verify the quality of the applied algorithms, we used the k-fold cross-validation method. Cross-validation is a resampling method used to assess ML models on unseen data samples. This method has one parameter, k, which is the number of parts into which the dataset is split. In this study, we used 10-fold cross-validation, in which the algorithms are trained and tested ten times and the evaluation metrics are then averaged. Accuracy, specificity, sensitivity, Kappa statistic, and area under the curve (AUC) were measured at the end of the process (Equations (1), (2), (3), (4), (5)).
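For reference, the metrics named above can be computed from a binary confusion matrix as in the sketch below. It uses the standard textbook definitions (with Cohen's kappa for the Kappa statistic and the rank-based form of AUC), which are assumed here to correspond to the paper's Equations (1) to (5); the function names are hypothetical, and all denominators are assumed to be non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from a binary confusion matrix (readmitted = positive class)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall on readmitted patients
    specificity = tn / (tn + fp)   # recall on non-readmitted patients
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "kappa": kappa,
    }


def auc_from_scores(labels, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because only about 15% of the included patients were readmitted, sensitivity, specificity, and kappa are more informative here than accuracy alone.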

Results

Patient characteristics

The mean age of patients who were readmitted to the hospital was 59 ± 9 years, and the mean age of patients who were not readmitted was 51 ± 6 years (p < 0.002). Table 1 indicates a significant association between several features and readmission: features with a p-value < 0.05, marked with the ** symbol in Table 1, differ significantly between patients who were readmitted and those who were not. For example, ICU admission and COVID status were significantly related to readmission (p-value < 0.002 and p-value < 0.001, respectively). The LASSO feature selection method ranked the relevant variables according to the absolute values of their coefficients. After feature selection, 28 of the 42 variables were not selected to predict readmission and were deleted from the dataset. The top 14 variables selected by the LASSO feature selection method and their scores are presented in Table 2.

Table 2

Important variables selected by the LASSO algorithm.

Order	Feature name	Score	P-Value
1	COVID status	3.78	0.015
2	ICU admission	3.50	0.035
3	Oxygen therapy	3.31	0.012
4	CRP on admission	3.19	0.047
5	Duration of hospitalization	3.08	0.032
6	Solid organ transplantation	2.94	<0.001
7	Lymphocytes on discharge	2.71	0.001
8	Coronary artery disease	2.64	0.023
9	Cerebrovascular disease	2.47	0.027
10	C reactive protein on admission	2.39	0.012
11	Congestive heart failure	2.15	0.017
12	Asthma	2.09	0.021
13	Metastatic solid tumor	2.03	0.006
14	Age	1.74	0.045

Based on Table 2, COVID-19 status, ICU admission, and oxygen therapy obtained the highest scores for predicting readmission in patients with COVID-19. In contrast, age and metastatic solid tumor had the lowest scores among the selected variables, which suggests that they have a comparatively low impact on the prediction of readmission in confirmed COVID-19 patients.

Results of hyper-parameters tuning

The performance of ML algorithms depends heavily on the choice of their hyper-parameters, which are tuned to produce the best model for a given dataset. After the preprocessing step, several ML models were built by adjusting and optimizing hyper-parameters, and the hyper-parameters that yielded the highest F-score were identified. In the present study, to select the most precise and powerful models, the Randomized Search CV method was used to tune the hyper-parameters of the algorithms, including the HGB classifier, Bagging classifier, MLP classifier, SVM (kernel = linear), SVM (kernel = RBF), and XGBoost classifier. Table 3 presents the best hyper-parameters of the ML algorithms for predicting readmission.
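The idea behind a randomized hyper-parameter search can be sketched without any ML library: sample random configurations from a search space, score each one, and keep the best. The code below is a simplified stand-in, not the library routine the authors used; `randomized_search`, `toy_score`, and the example search space are hypothetical, and in practice the scoring function would return a cross-validated F-score for a classifier built with the sampled parameters.

```python
import random


def randomized_search(param_distributions, score_fn, n_iter=20, seed=999):
    """Sample n_iter random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {
            name: rng.choice(values) for name, values in param_distributions.items()
        }
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


# Hypothetical search space and a toy scoring function standing in for
# "mean cross-validated F-score of a model trained with these parameters".
space = {"max_depth": [3, 7, 12], "learning_rate": [0.01, 0.1, 0.3]}


def toy_score(params):
    return -abs(params["max_depth"] - 7) - abs(params["learning_rate"] - 0.1)


best, score = randomized_search(space, toy_score, n_iter=50)
print(best, score)
```

Compared with an exhaustive grid search, random sampling evaluates a fixed budget of configurations, which keeps the cost bounded when several hyper-parameters are tuned at once.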

Table 3

Best hyper-parameters for ML algorithm modeling in prediction of readmission.

Num	Algorithms	Hyper-parameters	f-score
1	HistGradientBoostingClassifier	‘verbose’ = 2, ‘random_state’ = 999, ‘max_leaf_nodes’ = 62, ‘max_iter’ = 150, ‘max_depth’ = 7, ‘learning_rate’ = 0.1	93.7
2	BaggingClassifier	‘verbose’ = 2, ‘random_state’ = 999, ‘n_estimators’ = 12, ‘max_samples’ = 0.5, ‘bootstrap’ = True	91.28
3	MLP Classifier	‘learning_rate’ = ‘constant’, ‘hidden_layer_sizes’ = (100,100,100), ‘alpha’ = 0.05, ‘activation’ = ‘relu’	91.07
4	SVM (kernel = linear)	C = 100, gamma = 0.0001	90.09
5	SVM (kernel = RBF)	C = 10, gamma = 0.001	89.24
6	XGBoost Classifier	‘min_child_weight’ = 1, ‘max_depth’ = 12, ‘learning_rate’ = 0.1, ‘gamma’ = 0.4, ‘colsample_bytree’ = 0.3	89.01
7	K Nearest Neighbor Classifier	K = 3, ‘n_jobs’ = -1, ‘algorithm’ = ‘auto’	87.00


K-fold cross-validation

The features selected by the LASSO method were tested on seven ML algorithms with 10-fold cross-validation. This procedure splits the selected data set into ten subsets and repeats the holdout method ten times: in each run, 90% of the data is used to train the ML algorithms and the remaining 10% is used to test the models. We report the mean of each evaluation metric together with its 95% confidence interval. Table 4 shows the results of the seven prediction models on the LASSO-selected features under 10-fold cross-validation for predicting readmission in COVID-19 patients.
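The splitting and averaging procedure described above can be sketched as follows. This is an illustration, not the authors' code: the function names are hypothetical, the folds are formed by a seeded shuffle without stratification, and the confidence interval uses a normal approximation (z = 1.96), since the text does not state how the intervals in Table 4 were derived.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def mean_with_ci(scores, z=1.96):
    """Mean of per-fold scores with a normal-approximation 95% confidence interval."""
    m = mean(scores)
    half_width = z * stdev(scores) / len(scores) ** 0.5
    return m, (m - half_width, m + half_width)


# With 5791 records, each of the ten test folds holds 579 or 580 patients.
for train_idx, test_idx in k_fold_indices(5791, k=10):
    assert len(train_idx) + len(test_idx) == 5791

print(mean_with_ci([0.90, 0.92, 0.91, 0.93, 0.89]))
```

With only ten folds, a t-based interval would be somewhat wider than the normal approximation used in this sketch.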

Table 4

10-fold CV Classification performance of different classifiers on selected features.

Classifier	Statistic	Mean Accuracy	Mean Specificity	Mean Sensitivity	Mean F-measure	Kappa Statistic (KS)	AUC
HGB Classifier	Mean	0.8176	0.814	0.8296	0.8201	82.4%	0.8233
	95% CI	(0.81, 0.83)	(0.8, 0.82)	(0.81, 0.85)	(0.81, 0.83)	(0.82, 0.86)	(0.81, 0.83)
	STD	0.0154	0.0127	0.0296	0.0148	0.0257	0.0157
Bagging Classifier	Mean	0.847	0.841	0.847	0.845	84.36%	0.843
	95% CI	(0.84, 0.85)	(0.84, 0.85)	(0.84, 0.85)	(0.85, 0.85)	(0.84, 0.85)	(0.84, 0.85)
	STD	0.0172	0.0116	0.00128	0.0194	0.0127	0.0182
MLP Classifier	Mean	0.886	0.889	0.884	0.881	88.6%	0.882
	95% CI	(0.88, 0.89)	(0.88, 0.89)	(0.88, 0.89)	(0.88, 0.89)	(0.88, 0.89)	(0.88, 0.89)
	STD	0.0027	0.0112	0.0134	0.00140	0.010	0.0129
XGBoost Classifier	Mean	0.917	0.913	0.916	0.918	91.37%	0.9145
	95% CI	(0.91, 0.92)	(0.91, 0.92)	(0.91, 0.92)	(0.91, 0.92)	(0.91, 0.92)	(0.91, 0.92)
	STD	0.0146	0.0138	0.0147	0.0175	0.01924	0.0126
SVM (kernel = linear)	Mean	0.8896	0.8733	0.912	0.892	88.7%	0.892
	95% CI	(0.87, 0.90)	(0.66, 0.88)	(0.90, 0.93)	(0.88, 0.90)	(0.88, 0.89)	(0.88, 0.90)
	STD	0.0174	0.0167	0.0129	0.0182	0.0140	0.01864
SVM (kernel = RBF)	Mean	0.857	0.850	0.861	0.859	86.7%	0.863
	95% CI	(0.85, 0.86)	(0.84, 0.86)	(0.85, 0.87)	(0.85, 0.87)	(0.86, 0.87)	(0.86, 0.87)
	STD	0.0127	0.01734	0.0129	0.0134	0.0118	0.01727
K Nearest Neighbor Classifier	Mean	0.8835	0.8785	0.892	0.8937	88.3%	0.886
	95% CI	(0.88, 0.89)	(0.87, 0.89)	(0.89, 0.90)	(0.89, 0.90)	(0.88, 0.89)	(0.88, 0.89)
	STD	0.0014	0.0174	0.018	0.0162	0.0183	0.0163

Table 4 shows the results of the ML models on the features adopted by the LASSO feature selection method over the ten cross-validation runs. The HGB classifier gave a mean accuracy of 81.76%, a mean sensitivity of 82.96%, a mean specificity of 81.4%, a mean F-measure of 82.01%, a mean Kappa statistic of 82.4%, and an AUC of 82.33% when the selected risk factors were used. The Bagging classifier obtained a mean accuracy of 84.7%, a mean sensitivity of 84.7%, a mean specificity of 84.1%, a mean F-measure of 84.5%, a mean Kappa statistic of 84.36%, and an AUC of 84.3%. As shown in Table 4, the MLP classifier performed well, with a mean accuracy of 88.6%, a mean specificity of 88.9%, a mean sensitivity of 88.4%, a mean F-measure of 88.1%, a mean Kappa statistic of 88.6%, and a mean AUC of 88.2%. The XGBoost classifier performed best, achieving a mean accuracy of 91.7%, a mean specificity of 91.3%, a mean sensitivity of 91.6%, a mean F-measure of 91.8%, a mean Kappa statistic of 91.37%, and a mean AUC of 91.45% over the ten runs. The SVM (kernel = linear) was the second-best classifier, with a mean accuracy of 88.96%, a mean specificity of 87.33%, a mean sensitivity of 91.2%, a mean F-measure of 89.2%, a mean Kappa statistic of 88.7%, and a mean AUC of 89.2%. The SVM (kernel = RBF) had a mean accuracy of 85.7%, a mean sensitivity of 86.1%, a mean specificity of 85.0%, a mean F-measure of 85.9%, a mean Kappa statistic of 86.7%, and an AUC of 86.3%. The KNN classifier achieved nearly acceptable performance, with a mean accuracy of 88.35%, specificity of 87.85%, sensitivity of 89.2%, F-measure of 89.37%, Kappa statistic of 88.3%, and AUC of 88.6%. As shown in Fig. 3, the XGBoost classifier outperformed the other six ML models with 91.7% mean accuracy, 91.3% mean specificity, 91.6% mean sensitivity, 91.8% mean F-measure, and 0.9145 AUC. The second-best model was the SVM with the linear kernel (AUC = 0.892), and the worst performance among the seven ML algorithms was observed for the HGB classifier (AUC = 0.8233). The classification report and ROC curve of the XGBoost classifier, the best classification algorithm in the present study in terms of the evaluation metrics, are displayed in Fig. 4.

Fig. 3

Comparison of classification models performance on selected features.

Fig. 4

Classification report and AUC curve of the XGBoost classifier.


Discussion

Given the unknown nature of COVID-19, with its wide range of symptoms and complications, it is important to implement intelligent models for estimating the possibility of its reinfection and recurrence [30,31]. Predicting readmission and disease recurrence is complex and challenging, especially for new and poorly understood diseases such as COVID-19 [32,33]. To our knowledge, this work is one of the few studies that have applied ML algorithms to predict the readmission risk of patients with COVID-19. So far, most previous ML-based studies have focused on predicting readmission for chronic conditions such as cardiovascular disease [1,34,35,36,37,38,39], stroke [40,41,42,43,44], and COPD [5,6,45,46,47]. Few studies have addressed COVID-19 readmission. Rodriguez (2021) presented a predictive model for readmission in COVID-19 patients based on an ML classifier and concluded that ML and data mining-based approaches appear fruitful for readmission prediction [20]. Koteswari (2020) proposed an intelligent model to predict the readmission probability of various COVID-19 cases using ML techniques; the experimental results suggested that ML-based predictive models can help reduce COVID-19 readmission [30]. Raftarai (2021) compared the performance of four ML algorithms for predicting readmission in patients with COVID-19; the AdaBoost ensemble classifier yielded the best performance (accuracy 91.61%) [33]. Similarly, Jia (2021) assessed the performance of several ML algorithms to predict future deterioration among discharged patients with COVID-19. Finally, the best performance was yielded by XGBoost with a mean accuracy of 91.7%, mean specificity of 91.3%, mean sensitivity of 91.6%, mean F-measure of 91.8%, and AUC of 91.45%.
Ryu (2021) [48] showed that the Gradient Boosting Machine (GBM), and Lo (2021) [49] concluded that Categorical Boosting (CatBoost), had the highest AUC (75.1% and 75.15%, respectively) in predicting readmission. In addition, recent studies (performed in 2021) by Zhao [50], Darabi [51], Chen [52], and Shah [53] showed that boosting algorithms performed better in predicting patient readmission. Boosting algorithms such as Adaptive Boosting (AdaBoost), XGBoost, HGB, CatBoost, and GBM are a set of powerful and widely used ML algorithms. Boosting classifiers improve classification accuracy by combining the outputs of a sequence of weak learners into a robust predictive model [54,55]. The results of previous studies showed that these algorithms performed well in predicting hospital readmission risk in patients with COVID-19. In the present study, optimizing the prediction variables through feature selection and data preprocessing before using them as inputs for modeling improved the performance of the implemented models. Similarly, in the current work the XGBoost model outperformed the other six techniques (AUC 0.91, CI 0.91–0.92, STD 0.0146). Since the COVID-19 pandemic began, several studies have identified clinically important predictors of readmission after COVID-19 discharge. For example, Rodriguez (2021) indicated underlying chronic disease, hypoxia (oxygen saturation ≤94%), and increased LDH, CRP, and ESR as the most influential factors for hospital readmission [20]. In another study by Mendito (2021), several clinical features such as age, neutrophil count, sequential organ failure assessment (SOFA), LDH, CRP, and D-dimer were recognized as strongly contributing factors to the readmission of COVID-19 patients [31].
In contrast, Duarte (2021) identified polypharmacy, living in residential care or nursing homes, general illness, chest pain, psychological symptoms, syncope, and superinfection as the most relevant factors for COVID-19 hospital readmission [56]. In the study by Nematshahi et al. (2021), the period between discharge and readmission, age, gender, underlying disease, creatinine level, and pulmonary involvement were identified as influential factors in predicting COVID-19 readmission [57]. Similarly, in Jeon's (2020) research, age, sex, and the presence of underlying disease increased the risk of readmission of COVID-19 patients [58]. The presence of comorbidities, high BMI, adult age, and laboratory indicators such as CRP, creatinine, and ALT/ASP rate were introduced as among the most important underlying factors for readmission in COVID-19 patients in the Verna study [59]. In a systematic review, Akbari et al. (2021) concluded that male sex, white ethnicity, comorbid diseases, and old age affect COVID-19 readmission [60]. Fukushima (2021) also showed that certain comorbidities such as diabetes, hypertension, and cardiovascular diseases have a higher capability in predicting the readmission risk among COVID-19 patients [61]. Age over 60 years, underlying diseases, especially diabetes, high creatinine level, and lung involvement were the essential predictors of readmission in patients with COVID-19 (et al. [32]). The most important variables for readmission prediction in the Green (2021) study were age, LOS, ICU admission, oxygen saturation, D-dimer, and cardiovascular diseases [62]. Similarly, we identified 14 variables highly correlated with the output class.
Major risk factors for readmission in the current study include COVID-19 status, ICU admission, oxygen therapy, CRP on admission, duration of hospitalization, solid organ transplantation, lymphocytes on discharge, coronary artery disease, cerebrovascular disease, C reactive protein on admission, congestive heart failure, asthma, metastatic solid tumor, and age, most of which are non-modifiable. It should be noted that the variables identified in the present study are consistent with previous research. In the reviewed studies, baseline variables (e.g., age and sex), laboratory indicators, underlying diseases (comorbidities), and resource utilization variables such as LOS, ICU admission, and oxygen therapy play a pivotal role in predicting the readmission of patients with COVID-19. However, these studies neglected the importance of radiological data for readmission risk prediction among COVID-19 patients. Similarly, in the present study, the data set obtained after feature selection lacks radiological variables. Therefore, more studies are needed in this regard. In addition, several models for predicting the risk of readmission among COVID-19 patients were developed, one of which achieved reasonable performance in the evaluation phase. The selected ML algorithm (XGBoost) can predict the 30-day readmission risk of patients with high accuracy. The proposed model can help healthcare providers detect patient deterioration in a timely manner and reduce severe complications and the resulting mortality. This study is a retrospective, single-center study including a relatively small number of patient records. Therefore, the findings may not be generalizable to the wider population. In addition, noisy data fields, such as inconsistent, meaningless, missing, error-prone, and abnormal fields, might affect the data mining accuracy.
Moreover, we used only seven ML algorithms for prediction analyses based on some clinical features. Our data set also lacked clinically essential variables such as imaging and radiological indicators. To remove noisy data, the normal range of each variable was first defined using the opinion of two infectious disease specialists. Then, we identified all values outside the defined range and corrected them by referring them to the responsible doctor. In addition, the records with more than 70% empty fields (=439 as shown in Fig. 1) were removed. The missing fields in records with less than 70% missing data were imputed by mean and mode substitution for continuous and discrete variables, respectively. Additional external validation should be used to confirm the results of the present study and further verify their generalizability. As a practical solution, the accuracy and generalizability of our models would be enhanced by testing more ML techniques on larger, multicenter, and prospective datasets.

Conclusion

We implemented and validated several predictive models for stratifying readmission risk in COVID-19 patients. In particular, the XGBoost model achieved higher classification accuracy than the other ML algorithms. This method can provide caregivers and hospital administrators with a useful instrument for allocating limited hospital resources. These models may also support better and more customized care delivery, lessen clinician workload, and reduce severe complications and deaths among COVID-19 patients. In future work, the proposed method is expected to be applied to other hospital resource utilization domains such as ICU bed turnover, LOS, and ventilator use.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
