Sadegh Ilbeigipour, Amir Albadvi.
Abstract
The world today faces a challenge unprecedented in the last 100 years: the emergence of a new coronavirus has led to a human catastrophe. Scientists across many disciplines have been searching for solutions to this problem. In addition to general vaccination, maintaining social distance and adhering to government safety guidelines are the best-known strategies for preventing COVID-19 infection. In this research, we examined the symptoms of COVID-19 cases using several supervised machine learning methods. We addressed the class imbalance problem with the synthetic minority over-sampling technique (SMOTE) and then developed classification models to predict the outcome of COVID-19 cases (recovery or death). In addition, we implemented a rule-based technique to identify combinations of variables, within specific ranges of their values, that together affect disease severity. Our results showed that the random forest model, with 95.6% accuracy, 97.1% sensitivity, 94.0% specificity, 94.4% precision, 95.8% F-score, and 99.3% AUC, outperforms state-of-the-art classification models. Finally, we identified the most significant rules, which state that various combinations of 6 features in certain ranges of their values lead to patients' recovery with a confidence value of 90%. In conclusion, the classification results in this study show better performance than recent studies, and the extracted rules help physicians consider other important factors to improve health services and medical decision-making for different groups of COVID-19 patients.
Keywords: Association rules mining; COVID-19; Confidence index; Machine learning; Supervised learning; Support index
Year: 2022 PMID: 35434262 PMCID: PMC9004256 DOI: 10.1016/j.imu.2022.100933
Source DB: PubMed Journal: Inform Med Unlocked ISSN: 2352-9148
Definition and possible values of significant characteristics in this study.
| Variable | Description |
|---|---|
| Age | Patient's age |
| Intubation | 1; patient has undergone intubation, 2; patient has not undergone intubation. |
| Fever, cough, headache, chest pain | 0 stands for absence of symptom, and 1 stands for presence of the symptom. |
| Contact coronavirus | 0; no history of contact with COVID-19 cases, 1; history of contact with COVID-19 cases. |
| Section of hospital | the ward where the patient has been hospitalized. 1; regular ward, 2; intensive care unit, 3; no hospitalization. |
| Presence of underlying diseases | 0 stands for absence of underlying diseases, and 1 stands for presence of the underlying diseases. |
| Rate of partial pressure of oxygen, Po2 | 0; PO2 levels are greater than 93, 2; PO2 levels are less than 93. |
| Shortness of breath | 0 stands for absence of symptom, and 1 stands for presence of the symptom. |
| Hospital duration | number of hospitalization days. |
| Result PCR | 0; negative for COVID-19, 1; positive for COVID-19, 3; test result is pending. |
| Condition entering the hospital | 0; severe, 1; mild |
| CT scan manifestation | 1; CT scan results for COVID-19 are negative, 2; CT scan results for COVID-19 are positive. |
| Death | no; patient has recovered, yes; patient has died. |
Fig. 1. Scatter plot of the age of patients relative to their (a) sex and (b) hospital duration based on death or recovery class labels.
Fig. 2. Bar chart of intubation, hospital unit, rate of Po2, and the class features of the patients based on their age.
Fig. 3. Line chart of intubation, hospital ward, blood oxygen level, and class label variables for 50 COVID-19 cases.
Fig. 4. Facet chart of the age distribution of patients based on different class labels.
Fig. 5. The block diagram of the research methodology.
Fig. 6. Top 10 influential attributes selected using the filter-based technique.
Number of samples in different classes in the train and test data set.
| Data set | Recovery class samples | Death class samples | Total sample |
|---|---|---|---|
| Train (70%) | 729 | 714 | 1443 |
| Test (30%) | 302 | 317 | 619 |
| All | 1031 | 1031 | 2062 |
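The balanced class counts above follow from the SMOTE step named in the abstract. As a rough illustration of the idea only, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, pure-Python version written for this note, not the authors' implementation; the function name `smote`, the parameter names, and the toy points are all hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Simplified sketch: `minority` is a list of numeric tuples (hypothetical data).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy example: 8 majority points versus 3 minority points (made-up values).
majority = [(float(i), float(i % 3)) for i in range(8)]
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two existing minority points, the oversampled class stays inside the region already occupied by real minority samples.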
Fig. 7. The association rules extracted with the expected conditions (support = 0.3, confidence = 0.9, lift > 1).
Fig. 8. Grouped matrix diagram of extracted association rules with the expected conditions (support = 0.3, confidence = 0.9, lift > 1).
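The support, confidence, and lift conditions quoted in these captions can be made concrete with a small calculation. The sketch below computes the three measures for a single rule over a toy set of patient records. The item names and counts are invented for illustration and are not taken from the study's data, and reading the captions' values as minimum thresholds is an assumption.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    n = len(transactions)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_consequent = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n                      # P(A and C)
    confidence = n_both / n_antecedent        # P(C | A)
    lift = confidence / (n_consequent / n)    # P(C | A) / P(C)
    return support, confidence, lift


# Hypothetical records: each set lists the items observed for one patient.
records = (
    [{"po2_high", "no_intubation", "recovered"}] * 10
    + [{"po2_high", "no_intubation", "died"}] * 1
    + [{"po2_low", "no_intubation", "recovered"}] * 3
    + [{"po2_low", "intubation", "died"}] * 6
)

support, confidence, lift = rule_metrics(
    records, {"po2_high", "no_intubation"}, {"recovered"}
)
# Assumed reading of the captions: keep rules that meet all three conditions.
keep = support >= 0.3 and confidence >= 0.9 and lift > 1
print(f"support={support:.2f} confidence={confidence:.3f} lift={lift:.3f} keep={keep}")
```

A lift above 1 means the consequent is more frequent among records matching the antecedent than in the data overall, which is why it is used here as a filter in addition to confidence.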
Fig. 9. Confusion matrices of the models developed in this research.
Classification performance results for predicting the outcome of COVID-19 cases.
| Model | Acc(%) | Pr(%) | Se(%) | Sp(%) | F1_score(%) | AUC_score(%) |
|---|---|---|---|---|---|---|
| Decision tree | 93.21 | 91.29 | 95.89 | 90.39 | 93.53 | 93.14 |
| SVM | 87.07 | 86.23 | 88.95 | 85.09 | 87.57 | 95.66 |
| KNN | 86.91 | 82.24 | 94.95 | 78.47 | 88.14 | 93.02 |
| Logistic Regression | 86.59 | 87.74 | 85.80 | 87.41 | 86.76 | 95.48 |
| Random forest | 95.63 | 94.47 | 97.16 | 94.03 | 95.80 | 99.38 |
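The columns of this table (AUC aside) can all be derived from a single confusion matrix. The sketch below shows the standard formulas. The counts passed in are an assumption: they were back-calculated from the reported random forest percentages and the 619-sample test set, taking death as the positive class, and were not read from the paper's confusion matrices. With these counts, each metric agrees with the random forest row to within about 0.01 percentage points.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Assumed counts (back-calculated, not taken from the paper):
# 317 death cases and 302 recovery cases in the test set.
metrics = binary_metrics(tp=308, fn=9, fp=18, tn=284)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Sensitivity and specificity depend on which class is treated as positive, so swapping the roles of death and recovery would exchange those two columns.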
Fig. 10. ROC curves comparing the performance of the models implemented in this research.
Classification performance of the proposed method and comparison with state-of-the-art methods.
| Research | Method | Acc(%) | Pr(%) | Se(%) | Sp(%) | F1_score(%) |
|---|---|---|---|---|---|---|
| An, Chansik, et al. [ | Linear SVM | 91.9 | 25.6 | 92.0 | 91.8 | 40.0 |
| Chen et al. [ | Logistic Regression | – | – | 91.4 | 76.0 | – |
| Chowdhury et al. [ | Nomogram | – | – | 92.0 | 92.0 | – |
| Iwendi et al. [ | Random Forest | 94.0 | 75.0 | – | 86.0 | |
| Sumayh S et al. [ | Random Forest | 95.2 | 95.0 | 94.9 | 93.6 | 95.5 |
| Mohammad and Mahdi [ | Neural Network | 89.9 | 93.6 | 87.7 | 93.2 | 90.5 |
| Rahila et al. [ | Bayes Net | 89.0 | – | 92.6 | 86.0 | – |
| Proposed method | Random Forest | 95.6 | 94.4 | 97.1 | 94.0 | 95.8 |
Fig. 11. The ten most essential association rules discovered in this research.