On the prediction of isolation, release, and decease states for COVID-19 patients: A case study in South Korea.

Tarik Alafif¹, Reem Alotaibi², Ayman Albassam³, Abdulelah Almudhayyani⁴.

Abstract

The COVID-19 respiratory syndrome pandemic has become a serious public health issue. The COVID-19 virus has affected tens of millions of people worldwide. Some of them have recovered and been released, others have been isolated, and a few have unfortunately died. In this paper, we apply and compare different machine learning approaches, namely decision tree models, random forest, and multinomial logistic regression, to predict the isolation, release, and decease states of COVID-19 patients in South Korea. The prediction can help health providers and decision makers distinguish the states of infected patients from their features, supporting early intervention by either releasing or isolating a patient after infection. The proposed approaches are evaluated using the Data Science for COVID-19 (DS4C) dataset, and an analysis of the dataset is also provided. Experimental results and evaluation show that multinomial logistic regression outperforms the other approaches, with a state prediction accuracy of 95% and a weighted average F1-score of 95%.

Keywords: COVID-19; Classification; Decease; Decision tree; Isolation; Multinomial logistic regression; Prediction; Random forest; Release

Year: 2021 PMID: 33451801 PMCID: PMC7785285 DOI: 10.1016/j.isatra.2020.12.053

Source DB: PubMed Journal: ISA Trans ISSN: 0019-0578 Impact factor: 5.911

Introduction

COVID-19 first appeared in Wuhan, Hubei province, China, in December 2019. It has become a serious public health issue worldwide because it spreads quickly. COVID-19 is an infectious disease that directly affects the lungs. The virus that causes it belongs to the coronavirus family and is believed to have been transmitted from animals to humans. It also spreads from person to person through the respiratory route, as has been observed worldwide. The virus may cause mild or severe symptoms as it replicates in the body. Affected people may develop symptoms such as dyspnea, fever, sore throat, cough, fatigue, and pneumonia. The disease is serious because it may cause death after symptoms develop. Fortunately, many people have recovered and been released, while others are still isolated, and the number of deaths is small compared to the number of recoveries.

Machine learning (ML) algorithms play an important role in today’s research because of their ability to automate complex tasks. These algorithms learn from previous experience to predict future outcomes. Some of the well-known ML classifiers are Decision Trees (DTs), Random Forest (RF), and Multinomial Logistic Regression (MLR). The DT, also referred to as CART, is a popular supervised, top-down predictive modeling approach [1]. It uses a simple tree structure to classify negative and positive examples recursively based on their input features. Various decision tree approaches have been proposed over the years; ID3 [2], C4.5 [3], and CART [4] are the most common ones [5]. RF is an ensemble learning approach that generates a large number of decision trees, each using a different subset of the training data [6], [7]. These subsets are chosen by random sampling of the original training data.
The final predictions are made by taking the majority vote over all individual classification trees. One advantage of RF is its ability to handle large datasets with a high-dimensional feature space: it can handle thousands of input variables and identify the most significant ones.

In addition to DT and RF, regression approaches are commonly used to predict future events in many research fields. In the machine learning context, logistic regression is a well-known technique for predicting binary outcomes. MLR extends binary logistic regression to multi-class prediction, where the predictor variables can be either dichotomous or continuous [8].

Our work is motivated by the success of CART, RF, and MLR approaches in many predictive applications, such as vegetation distributions [9], microRNA precursors [10], and response variables [11]. In this paper, we propose to use them to predict the isolation, release, and decease states of COVID-19 patients in South Korea. The proposed approaches are evaluated using the Data Science for COVID-19 (DS4C) dataset, and an analysis of the dataset is provided. Experimental results and evaluation show that MLR outperforms the other approaches, with a state prediction accuracy of 95% and a weighted average F1-score of 95%.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the basic notation and mathematical models used throughout the paper. Section 4 briefly describes and analyzes the DS4C dataset. Section 5 presents our work. Section 6 provides experimental results and evaluation using the public DS4C dataset. Section 7 discusses the results. Finally, Section 8 gives our conclusion and future work.

Related work

Many research works have been published recently to analyze and tackle the COVID-19 problem in many fields. Several methods were tested in [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24] for COVID-19 analysis and prediction. Fanelli and Piazza [12] analyzed and forecasted the spread of COVID-19 in China, Italy, and France. Al-Rousan and Al-Najjar [13] studied the effects of sex, region, infection reason, birth year, release date, and deceased date on the recovered and deceased cases of COVID-19 patients. Hu et al. [14] applied a stacked auto-encoder and K-means to model the transmission dynamics of the epidemic across provinces and cities. The model was built to forecast the confirmed cases of COVID-19. Remuzzi and Remuzzi [15] predicted the number of infected patients in Italy using an exponential curve. Tartai and Varallyay [16] attempted to predict the outcome of the COVID-19 epidemic in a region using a logistic model. Pedersen and Meneghin [17] analyzed and predicted the spread of the COVID-19 virus under restrictions intended to reduce the epidemic in Italy. Peng et al. [18] proposed an SEIR model to analyze the epidemic in China. Petropoulos and Makridakis [19] forecasted the confirmed cases, deaths, and recoveries of COVID-19 using time series forecasting approaches. Chatterjee et al. [20] and Jia et al. [21] developed mathematical models to study the impact of the COVID-19 epidemic. Weissman et al. [22] proposed a non-learning predictive mathematical model, called SIR, to predict hospital capacity needs during the COVID-19 pandemic. However, the SIR model predicts only the epidemic spread over time for the whole population within a specific area. Yang et al. [23] used a long short-term memory (LSTM) neural network to predict the COVID-19 epidemic in China. Similar to [23], Vattay [24] attempted to predict the epidemic in Italy.
Unlike the aforementioned works, which tackle the COVID-19 problem from different perspectives, we apply and compare different machine learning approaches, namely CARTs, RF, and MLR, to predict the isolation, release, and decease states of COVID-19 patients in South Korea. The proposed approaches are evaluated using the DS4C dataset to measure the performance of the predictions.

Preliminaries

Let $D$ be a set of patient records such that $D = \{x_1, x_2, \dots, x_n\}$, where $n$ is the number of patients. Each patient record $x_i$ belongs to one of the classes used for the classification task, such that $y_i \in C$. Each patient record is composed of several input features such that $x_i = (x_{i1}, x_{i2}, \dots, x_{im})$, where $m$ is the total number of patient features. Each feature then becomes a predictor variable of the class variable $Y$, which takes values in the set of classes $C = \{c_1, c_2, \dots, c_K\}$, where $K$ is the total number of classes. Next, we show the basic mathematical equations used in the following approaches:

CARTs

In CARTs, Gini impurity measures the purity of the nodes in the decision tree models, indicating how good a split on an input feature is at each node [25], [26]. The Gini impurity is computed from the training examples $D$ and their classes as shown in Eq. (1):

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2 \tag{1}$$

where $D$, $K$, and $p_k$ are the training examples at the node, the number of classes, and the probability of the $k$th class, respectively. The Gini Index is computed for each partition induced by an input feature, and the average Gini Index for the examined feature is defined as:

$$\mathrm{Gini}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Gini}(D_j) \tag{2}$$

where $A$, $D_j$, and $v$ are the examined feature, a data partition, and the number of discrete values of feature $A$, respectively. The feature with the lowest Gini Index is chosen as the splitting feature.
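As a minimal sketch of the Gini-based split selection described above, the following Python snippet computes the Gini impurity of a set of labels and the weighted average Gini Index of a discrete feature. The function names and the toy records are hypothetical illustrations and are not taken from DS4C or from the rpart implementation used in this paper.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(feature_values, labels):
    """Weighted average Gini impurity over the partitions of a discrete feature."""
    n = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(part) / n * gini_impurity(part) for part in partitions.values())


# Hypothetical toy records (not DS4C data).
states = ["released", "released", "isolated", "isolated", "released", "isolated"]
sex = ["male", "female", "male", "female", "male", "female"]
age_group = ["20s", "20s", "50s", "50s", "20s", "50s"]

gini_sex = gini_index(sex, states)
gini_age = gini_index(age_group, states)

# The feature with the lowest Gini Index is chosen as the splitting feature.
best_feature = min([("sex", gini_sex), ("age_group", gini_age)], key=lambda t: t[1])[0]
```

In this toy case the age group separates the two states perfectly, so its Gini Index is zero and it is selected over sex.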

RF

An RF consists of a collection of decision trees $\{T_1, T_2, \dots, T_B\}$. Each decision tree is trained on a different subset of the data drawn by bootstrap sampling (bagging) out of $D$. Final predictions are made by majority vote. The RF employs random feature selection at each node of every tree in the forest, meaning that the splitting feature is chosen from a randomly selected subset of the features.
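The three ingredients named above (bootstrap sampling, a random feature subset per node, and majority voting) can be sketched in a few lines of Python. This is an illustrative sketch only, not the randomForest implementation: the helper names are hypothetical, the record ids are placeholders, and the subset size of floor(sqrt(p)) is assumed to follow the usual classification default of that package.

```python
import math
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement (one bag per tree)."""
    return [rng.choice(records) for _ in records]


def candidate_features(features, rng):
    """Random subset of floor(sqrt(p)) features considered at one node."""
    mtry = max(1, math.isqrt(len(features)))
    return rng.sample(features, mtry)


def majority_vote(tree_predictions):
    """Final forest prediction: the most frequent class among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(1234)
records = list(range(10))  # hypothetical record ids
features = ["sex", "age", "country", "province", "city", "infection_case",
            "contact_number", "symptom_onset_date", "confirmed_date"]

bag = bootstrap_sample(records, rng)
subset = candidate_features(features, rng)
vote = majority_vote(["released", "isolated", "released", "deceased", "released"])
```

With nine candidate features, three are examined at each node, and the best split among those three would then be chosen by the tree-growing procedure.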

MLR

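Logistic regression is a generalized linear model that is usually used to classify binary outcomes [8]. A ridge estimator has been proposed to improve model prediction [27]. The logistic regression approach is slightly modified to handle multi-class classification, in which case it is called MLR. When predicting $K$ outcome classes for $n$ instances with $m$ observations, the parameter matrix $B$ of regression coefficients to be calculated is an $m \times (K-1)$ matrix, and $\exp$ is the exponential function. The class probability, with the exception of the last class, is shown in Eq. (3):

$$P_j(x_i) = \frac{\exp(x_i B_j)}{1 + \sum_{l=1}^{K-1} \exp(x_i B_l)}, \quad j = 1, \dots, K-1 \tag{3}$$

The last class probability is shown in Eq. (4):

$$P_K(x_i) = 1 - \sum_{j=1}^{K-1} P_j(x_i) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(x_i B_l)} \tag{4}$$

The (negative) multinomial log-likelihood is shown in Eq. (5):

$$L = -\sum_{i=1}^{n} \left[ \sum_{j=1}^{K-1} y_{ij} \ln P_j(x_i) + \left(1 - \sum_{j=1}^{K-1} y_{ij}\right) \ln\!\left(1 - \sum_{j=1}^{K-1} P_j(x_i)\right) \right] + \lambda \lVert B \rVert^2 \tag{5}$$

where $y_{ij}$ equals 1 if instance $i$ belongs to class $j$ and 0 otherwise, $B_j$ is the $j$th column of $B$, and $\lambda$ is the ridge value.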
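To make the MLR formulation concrete, the sketch below computes the class probabilities with the last class as the reference class and the ridge-penalized negative log-likelihood. It is a plain-Python illustration under stated assumptions: the coefficient values and inputs are made up, the function names are hypothetical, and this is not the Weka implementation used in this paper.

```python
import math


def mlr_probabilities(x, B):
    """Class probabilities for one instance.

    x: list of m feature values.
    B: list of K-1 coefficient vectors, each of length m.
    The last class is the reference class and has no coefficients of its own.
    """
    scores = [math.exp(sum(xi * bi for xi, bi in zip(x, b_j))) for b_j in B]
    denom = 1.0 + sum(scores)
    probs = [s / denom for s in scores]
    probs.append(1.0 / denom)  # probability of the last class
    return probs


def negative_log_likelihood(X, Y, B, ridge=0.0):
    """Negative multinomial log-likelihood plus a ridge penalty.

    Y holds class indices 0..K-1, so each instance contributes -ln P_y(x).
    """
    total = 0.0
    for x, y in zip(X, Y):
        total -= math.log(mlr_probabilities(x, B)[y])
    penalty = ridge * sum(b * b for b_j in B for b in b_j)
    return total + penalty


# Hypothetical coefficients for K = 3 classes and m = 2 features.
B_zero = [[0.0, 0.0], [0.0, 0.0]]
B_demo = [[0.5, -0.25], [0.0, 0.1]]
x = [1.0, 2.0]

uniform = mlr_probabilities(x, B_zero)
demo = mlr_probabilities(x, B_demo)
nll = negative_log_likelihood([x, x], [0, 2], B_zero)
```

With all coefficients at zero, every class receives probability 1/3, which is a convenient sanity check before fitting.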

Evaluation metrics

The accuracy for multi-class classification is defined as the number of instances correctly identified as either true positives or true negatives out of the total, as shown in Eq. (6). The error rate is the complement of accuracy.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively. The F1-score is the harmonic mean of precision and recall, as shown in Eq. (7):

$$F1 = 2 \cdot \frac{P \cdot R}{P + R} \tag{7}$$

where $P$ and $R$ represent precision and recall, respectively. The weighted average F1-score is computed by averaging the F1-score class by class, with weights based on each target class frequency, as shown in Eq. (8). This metric accounts for the imbalanced class distribution when computing the F1-score.

$$F1_{\text{weighted}} = \sum_{k=1}^{K} w_k \, F1_k \tag{8}$$

where $K$ is the total number of target classes, $w_k$ is the weighting parameter of the $k$th class, and $F1_k$ is the computed F1-score of the $k$th class.
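As an illustration of these metrics, the following sketch computes accuracy and the weighted average F1-score from a multi-class confusion matrix. The matrix used is the MLR confusion matrix reported in Table 2, read with rows as actual states and columns as predicted states; that orientation is an assumption, chosen because the row sums (78, 2,158, 2,929) match the class counts of the dataset. The weights are taken as each class's share of the records, which is also an assumption about how the weighting parameter is defined.

```python
def accuracy(cm):
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_f1(cm):
    """F1-score of each class, with rows as actual and columns as predicted."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        actual = sum(cm[c])                          # row sum
        predicted = sum(cm[r][c] for r in range(k))  # column sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def weighted_f1(cm):
    """Class-frequency-weighted average of the per-class F1-scores."""
    total = sum(sum(row) for row in cm)
    return sum(sum(cm[c]) / total * f1 for c, f1 in enumerate(per_class_f1(cm)))


# MLR confusion matrix from Table 2 (order: deceased, isolated, released).
mlr_cm = [
    [44, 14, 20],
    [15, 2042, 101],
    [19, 89, 2821],
]

mlr_accuracy = accuracy(mlr_cm)
mlr_weighted_f1 = weighted_f1(mlr_cm)
```

Under these assumptions both values come out at roughly 0.95, in line with the figures reported for MLR.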

DS4C dataset

DS4C [28] was made public on Kaggle on February 24th, 2020, in South Korea. The dataset consists of four sheets: case data, patient data, time-series data, and additional data. The case data describes COVID-19 infection cases. The patient data describes the epidemiological and route data of COVID-19 patients. The time-series data describes the status of COVID-19 patients over time. Additional data is reported for regions, weather, and population. Each sheet is provided as a CSV file. In our work, we use only the patient data file, since it contains the epidemiological data for COVID-19 patients in South Korea. The patient data consists of 5,165 labeled patient records. Each record consists of several features such as sex, age, country, province, city, infection_case, infected_by, contact_number, symptom_onset_date, confirmed_date, released_date, deceased_date, and state. The state feature is the label of each patient record and takes one of three values: isolated, released, or deceased. Fig. 1 shows the number of records in each state. The figure also clearly shows the unbalanced sample distribution in this dataset, which consists of 2,158 isolated, 2,929 released, and 78 deceased records. The sample distributions for the sex and age features are shown in Fig. 2 and Fig. 3, respectively. We notice that the number of infected males is close to the number of infected females. We also notice that the most infected age group is 20 to 29 years old.

Fig. 1

The distribution of samples across states in the DS4C dataset.

Fig. 2

The distribution of samples by sex in the DS4C dataset.

Fig. 3

The distribution of samples by age in the DS4C dataset.

We investigated the causes of COVID-19 infection among the South Korean patients in this dataset. We found that the most frequent cause of infection is “Contact patient”, followed by “Overseas inflow”. Fig. 4 depicts the most common causes of COVID-19 infection among South Korean patients. Therefore, social distancing is needed to halt, or at least slow, the spread of the COVID-19 pandemic.

Fig. 4

The causes of COVID-19 infection in South Korea according to the infection cases in the DS4C dataset.

Our work

We propose to use DT models, RF, and MLR to classify the states (isolated, released, and deceased) of COVID-19 patients in South Korea. The models learn discriminative features from patient records and predict a patient’s state from several feature variables. We train our DT models using features such as sex, age, country, province, city, infection_case, contact_number, symptom_onset_date, confirmed_date, released_date, and deceased_date, with state as the label. The patient case_id is excluded. Before training the models, we pre-process the patients’ records to handle missing data and non-integer values. Missing data are replaced by 1. For the DT models, we use maximum depths of 3, 5, and 10. Fig. 5 shows the DT architecture generated after training with a maximum depth of 3. The DT architectures with maximum depths of 5 and 10 are very large trees with many branches, leaf nodes, and non-leaf nodes. Generalization capability is lost if the depth is more than 10. The RF model is trained with the default parameters of the randomForest package in R [6]: ntree, the number of classification trees (default 500), and mtry, the number of features tested at each node (default $\sqrt{p}$, where $p$ is the total number of features).
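The pre-processing step can be pictured with the following Python sketch. Replacing missing data by 1 comes from the description above; everything else is an assumption made for illustration, in particular the `preprocess` helper, the first-appearance integer coding of non-integer values, and the toy records, none of which are taken from the R or Weka pipelines used in this work.

```python
def preprocess(records, feature_names, missing_value=1):
    """Replace missing entries by `missing_value` and map non-integer values
    to integer codes (hypothetical coding: order of first appearance)."""
    codebooks = {name: {} for name in feature_names}
    encoded = []
    for record in records:
        row = []
        for name in feature_names:
            value = record.get(name)
            if value is None or value == "":
                row.append(missing_value)      # missing data replaced by 1
            elif isinstance(value, int):
                row.append(value)              # integer values kept as they are
            else:
                codes = codebooks[name]
                row.append(codes.setdefault(value, len(codes)))
        encoded.append(row)
    return encoded


# Hypothetical toy records using DS4C feature names (values are made up).
toy_records = [
    {"sex": "male", "age": "20s", "contact_number": 3},
    {"sex": "female", "age": "", "contact_number": None},
    {"sex": "male", "age": "50s"},
]
features = ["sex", "age", "contact_number"]

encoded_rows = preprocess(toy_records, features)
```

One design caveat of this sketch is that the placeholder 1 can coincide with a genuine code or count, so a real pipeline may prefer a value outside the observed range.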

Fig. 5

The generated DT architecture of our approach using the maximum DT depth of 3.

The MLR model is built using the Weka tool with a ridge estimator, which is considered a stable implementation. The predictive model is fed the full set of features. The Quasi-Newton method is used to perform the optimization. Ridge values are used in the log-likelihood calculation.

Experimental results and evaluation

The tree-based approaches are implemented in R using the rpart and randomForest packages [6], [29], while the MLR approach is built using version 3.8.4 of the Weka machine learning tool [30]. A single laptop with a 1.6 GHz dual-core Intel Core i5 CPU and 8 GB of RAM is used to train and test the models. 10-fold cross-validation is applied for training and testing the models on the DS4C dataset. The random seed is set to 1234. Table 1 shows the accuracy, error rates, and weighted average F1-score of the tested models. For the DT models, we choose three different tree depths: 3, 5, and 10. To show the effect of the maximum tree depth parameter on the experimental results, we plot the performance curve against depth. Fig. 6 shows model accuracy versus tree depth. It clearly shows that the accuracy of the DT models stops improving at a maximum depth of 12.
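For readers unfamiliar with the protocol, the sketch below shows how a 10-fold split of the 5,165 records could be generated with the seed 1234. It is a plain, non-stratified k-fold written in Python for illustration; the `k_fold_indices` helper is hypothetical, and the folds produced by R or Weka with the same seed will differ from these.

```python
import random


def k_fold_indices(n_samples, k=10, seed=1234):
    """Yield (train_indices, test_indices) pairs for plain k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(5165))
fold_sizes = [len(test) for _, test in splits]
```

Each record is used for testing exactly once and for training in the other nine folds, so the reported metrics are aggregated over all 5,165 predictions.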

Table 1

Prediction accuracy, error rates and the weighted average F1-scores for the applied algorithms using DS4C dataset.

Algorithms implemented	Accuracy	Error rate	Weighted average F1-score
DT (Depth = 3)	82.92%	17.08%	81.74%
DT (Depth = 5)	85.63%	14.37%	84.67%
DT (Depth = 10)	88.21%	11.79%	87.47%
RF	92.55%	7.45%	92.28%
MLR	95.00%	5.00%	95.00%

Fig. 6

Tuning of the maximum tree depth for the DT models.

Fig. 7 shows the effect of the number of trees generated in the RF on the classification error rate. The error rate becomes stable as the number of trees increases. The black curve represents the out-of-bag (OOB) error rate, and the other colors represent the misclassification error rate curves. The out-of-bag error is estimated internally while the DTs are being built.

Fig. 7

RF error rates with an increase in the number of classification trees. The number of trees and the estimated error rates are shown on x-axis and y-axis, respectively.

The confusion matrices using 10-fold cross-validation are provided in Table 2. The table shows the true positive, true negative, false positive, and false negative test examples after applying each model. We notice that the DT models at all depths are unable to predict any true positives in the deceased state. This may be due to the small number of examples in the deceased state. The RF approach correctly classifies 22 true positive examples from the deceased state, and the MLR model correctly classifies 44. In addition, the small number of false negatives is an encouraging result. From Table 1, the MLR approach outperforms the other approaches with a state prediction accuracy of 95%, a margin of 2.45% over the RF approach. To the best of our knowledge, there is no similar existing work to compare with ours, since the COVID-19 research area and the DS4C dataset are new.

Table 2

Confusion matrices for actual versus predicted patients’ states.

DT (Depth = 3)
	Deceased	Isolated	Released

Deceased	0	0	0
Isolated	21	1,444	90
Released	57	714	2,839

DT (Depth = 5)

	Deceased	Isolated	Released

Deceased	0	0	0
Isolated	23	1,586	92
Released	55	572	2,837

DT (Depth = 10)

	Deceased	Isolated	Released

Deceased	0	0	0
Isolated	25	1,808	181
Released	53	350	2,748

RF

	Deceased	Isolated	Released

Deceased	22	18	38
Isolated	0	1,992	166
Released	0	163	2,766

MLR

	Deceased	Isolated	Released

Deceased	44	14	20
Isolated	15	2,042	101
Released	19	89	2,821

Discussion

The experimental results presented above illustrate the ability of the proposed approaches to produce reliable predictions for COVID-19 patients who are isolated, released, or deceased. The proposed predictive models were built using a small set of features, which gives rise to type I and type II errors, as the features cover only limited aspects of patients’ personal and infection factors. Nevertheless, these features allow quality predictions of a patient’s status even in the absence of medical-related information. The proposed MLR model correctly predicts 96.3% and 94.6% of the released and isolated cases, respectively, but only 56.4% of the deceased instances, as these patients belong to the minority state in the dataset. The DS4C dataset has a significantly unbalanced class distribution, in which the minority instances, belonging to the deceased state, make up only 1.5% of the total instances. MLR performs well because a logistic regression model tends to produce more reliable classification outcomes when it deals with a small number of features and a large number of training examples, which is the case in this dataset. Therefore, MLR provides the best performance in prediction accuracy and weighted average F1-score. The DT models at all depths failed to predict the instances from the deceased state, as shown in Table 2. This is due to the small number of deceased-state examples in the dataset, which lowers the performance of the DT models. However, the DT model with a depth of 3 achieved the best true positive rate in the release state, and only in that state, compared to the other models. The RF model is relatively competitive with the MLR model, trailing it by a margin of 2.45% in state prediction accuracy. However, the RF model has lower true positive rates in all states compared to the MLR model, as shown in Table 2.
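The per-state true positive rates quoted in this discussion can be re-derived from Table 2 with the short sketch below. The orientation of each matrix is an assumption inferred from the totals: for RF and MLR the row sums equal the class counts (78, 2,158, 2,929), so rows are read as actual states, whereas for the DT matrices it is the column sums that equal those counts, so rows are read as predicted states. The helper name is hypothetical.

```python
def true_positive_rates(cm, rows_are_actual=True):
    """Per-class true positive rate (recall) from a square confusion matrix."""
    k = len(cm)
    if not rows_are_actual:
        # Transpose so that rows become the actual states.
        cm = [[cm[r][c] for r in range(k)] for c in range(k)]
    return [cm[c][c] / sum(cm[c]) for c in range(k)]


# Matrices from Table 2 (order: deceased, isolated, released).
mlr = [[44, 14, 20], [15, 2042, 101], [19, 89, 2821]]
rf = [[22, 18, 38], [0, 1992, 166], [0, 163, 2766]]
dt3 = [[0, 0, 0], [21, 1444, 90], [57, 714, 2839]]

tpr_mlr = true_positive_rates(mlr)
tpr_rf = true_positive_rates(rf)
tpr_dt3 = true_positive_rates(dt3, rows_are_actual=False)
```

Read this way, the MLR rates come out at about 56.4%, 94.6%, and 96.3% for the deceased, isolated, and released states, the RF rates are lower in every state, and the depth-3 DT exceeds MLR only in the released state.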
In general, based on the characteristics of the features and the corresponding experimental results on the DS4C dataset, we conclude that the MLR model is more effective than the other classification models for predicting COVID-19 patients’ states. Unfortunately, none of the other existing public COVID-19 datasets provides adequate features and labeling for predicting the states of isolation, recovery, and decease. Most current public COVID-19 datasets focus on statistical data about the number of deaths, the number of infections, infected areas, patients’ tweets, X-ray scans, and chest and brain CT scans of infected patients. Also, the current public DS4C dataset lacks some important features associated with the impact of the COVID-19 virus, such as patients’ vital signs, medication usage, and chronic diseases. The lack of COVID-19 data, and of mechanisms for its collection and for collaboration, is a limitation and a challenge in the COVID-19 research area.

Conclusion

This research work applies and compares different machine learning approaches, namely DT models, RF, and MLR, to predict the isolation, release, and decease states of COVID-19 patients in South Korea. The proposed approaches are evaluated using the DS4C dataset, and an analysis of the dataset is also provided. This study finds that contact with a patient is the most frequent cause of COVID-19 infection. Experimental results and evaluation show that MLR outperforms the other approaches, with a state prediction accuracy of 95% and a weighted average F1-score of 95%. The proposed machine learning models can help health providers and decision makers intervene early by either releasing or isolating a patient after infection. Based on our observations of the South Korean public COVID-19 statistical dashboard of daily confirmed cases, South Korea recorded its highest number of infections in February 2020, with 909 confirmed cases. Fortunately, the number of infections per day has been below 50 since the beginning of March 2020. South Korea still has a lower infection rate than other countries, which may be due to social distancing. In the future, we plan to collect clinical data from local hospitals to analyze more important features and explore their patterns. These patterns may give a better understanding of the relations among the states and may help reduce the impact of the decease state. We also plan to use deep learning based methods, such as deep neural networks, to predict the state more accurately.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

9 in total

1. Forecasting the novel coronavirus COVID-19.

Authors: Fotios Petropoulos; Spyros Makridakis
Journal: PLoS One Date: 2020-03-31 Impact factor: 3.240

2. Forecasting and Evaluating Multiple Interventions for COVID-19 Worldwide.

Authors: Zixin Hu; Qiyang Ge; Shudi Li; Eric Boerwinkle; Li Jin; Momiao Xiong
Journal: Front Artif Intell Date: 2020-05-22

3. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features.

Authors: Peng Jiang; Haonan Wu; Wenkai Wang; Wei Ma; Xiao Sun; Zuhong Lu
Journal: Nucleic Acids Res Date: 2007-06-06 Impact factor: 16.971

4. Data analysis of coronavirus COVID-19 epidemic in South Korea based on recovered and death cases.

Authors: Nadia Al-Rousan; Hazem Al-Najjar
Journal: J Med Virol Date: 2020-06-24 Impact factor: 20.693

5. Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions.

Authors: Zifeng Yang; Zhiqi Zeng; Ke Wang; Sook-San Wong; Wenhua Liang; Mark Zanin; Peng Liu; Xudong Cao; Zhongqiang Gao; Zhitong Mai; Jingyi Liang; Xiaoqing Liu; Shiyue Li; Yimin Li; Feng Ye; Weijie Guan; Yifan Yang; Fei Li; Shengmei Luo; Yuqi Xie; Bin Liu; Zhoulang Wang; Shaobo Zhang; Yaonan Wang; Nanshan Zhong; Jianxing He
Journal: J Thorac Dis Date: 2020-03 Impact factor: 3.005

6. Locally Informed Simulation to Predict Hospital Capacity Needs During the COVID-19 Pandemic.

Authors: Gary E Weissman; Andrew Crane-Droesch; Corey Chivers; ThaiBinh Luong; Asaf Hanish; Michael Z Levy; Jason Lubken; Michael Becker; Michael E Draugelis; George L Anesi; Patrick J Brennan; Jason D Christie; C William Hanson; Mark E Mikkelsen; Scott D Halpern
Journal: Ann Intern Med Date: 2020-04-07 Impact factor: 51.598

7. Healthcare impact of COVID-19 epidemic in India: A stochastic mathematical model.

Authors: Kaustuv Chatterjee; Kaushik Chatterjee; Arun Kumar; Subramanian Shankar
Journal: Med J Armed Forces India Date: 2020-04-02

8. COVID-19 and Italy: what next?

Authors: Andrea Remuzzi; Giuseppe Remuzzi
Journal: Lancet Date: 2020-03-13 Impact factor: 79.321

9. Analysis and forecast of COVID-19 spreading in China, Italy and France.

Authors: Duccio Fanelli; Francesco Piazza
Journal: Chaos Solitons Fractals Date: 2020-03-21 Impact factor: 5.944
