Tania Ramírez-Del Real, Mireya Martínez-García, Manlio F Márquez, Laura López-Trejo, Guadalupe Gutiérrez-Esparza, Enrique Hernández-Lemus.
Abstract
The fast, exponential increase of COVID-19 infections and their catastrophic effects on patients' health have required the development of tools that support health systems in the quick and efficient diagnosis and prognosis of this disease. In this context, the present study aims to identify the potential factors associated with COVID-19 infection by applying machine learning techniques, particularly random forest, chi-squared, XGBoost, and rpart for feature selection; ROSE and SMOTE were used as resampling methods to address class imbalance. Similarly, machine and deep learning algorithms such as support vector machines, C4.5, random forest, rpart, and deep neural networks were explored during the train/test phase to select the best prediction model. The dataset used in this study contains clinical data, anthropometric measurements, and other health parameters related to smoking habits, alcohol consumption, quality of sleep, physical activity, and health status during confinement due to the COVID-19 pandemic. The results showed that XGBoost selected the features most strongly associated with COVID-19 infection, and random forest produced the best predictive model, with a balanced accuracy of 90.41% using SMOTE as the resampling technique. The best-performing model provides a tool to help prevent contracting SARS-CoV-2, since it identifies the variables carrying the highest risk, some of which are, to a certain extent, controllable.
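Balanced accuracy, the criterion quoted above for selecting the best model, is the mean of sensitivity and specificity, which keeps it meaningful under the class imbalance the study corrects for. A minimal sketch of the computation (the counts are illustrative, not taken from the study's data):

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Balanced accuracy = (sensitivity + specificity) / 2,
    computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return (sensitivity + specificity) / 2

# Illustrative counts: 90% sensitivity and 80% specificity give ~0.85
score = balanced_accuracy(tp=90, fn=10, tn=80, fp=20)
```

Unlike plain accuracy, this score does not reward a classifier for simply predicting the majority class, which is why it pairs naturally with the resampling experiments reported below.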
Keywords: COVID-19; feature selection; imbalanced data; machine learning; predictive model
Year: 2022 PMID: 35844896 PMCID: PMC9279686 DOI: 10.3389/fpubh.2022.912099
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Dataset variables.
| Variable | Description | Type |
|---|---|---|
| Age | Age | Numeric |
| Sex | Sex | Dichotomous |
| weight | Weight | Numeric |
| height | Height | Numeric |
| BMI | Body mass index | Numeric |
| waist | Waist circumference | Numeric |
| SBP | Systolic blood pressure | Numeric |
| DBP | Diastolic blood pressure | Numeric |
| Phyactmet | Physical activity measured in metabolic equivalent of task (METs) | Dichotomous |
| anxst | State anxiety | Factor: range from 1 to 4 |
| anxtr | Trait anxiety | Factor: range from 1 to 4 |
| slpsnrr1 | Snoring during sleep | Factor: range from 1 to 5 |
| slpsob1 | Sleep short of breath or headache | Factor: range from 1 to 5 |
| slps3 | Sleep somnolence | Factor: range from 1 to 5 |
| slpop1 | Optimal Sleep | Dichotomous |
| smk | Smoking habit | Dichotomous |
| EtOH_avg | Frequency of alcohol consumption | Dichotomous |
| uric | Uric acid | Numeric |
| crea | Creatinine | Numeric |
| HDL | High-density lipoprotein | Numeric |
| LDL | Low-density lipoprotein | Numeric |
| glu | Glucose | Numeric |
| chol | Cholesterol | Numeric |
| trig | Triglycerides | Numeric |
| na1 | Serum sodium | Numeric |
| met_s | Metabolic syndrome | Dichotomous |
| wrk_f | Outdoor work | Dichotomous |
| wrk_h | Home office | Dichotomous |
| umplyd | Unemployed | Dichotomous |
| wrk_hsp | Working in hospital | Dichotomous |
| wrk_off | Working in office | Dichotomous |
| MaritStat | Marital status (single or married) | Dichotomous |
| cocr | Worry about COVID-19 contagion | Factor: range from 0 to 2 |
| trbslpt | Sleep problems during COVID-19 pandemic | Dichotomous |
| quislt | Isolation during COVID-19 pandemic | Factor: range from 0 to 4 |
| outli | Outings limited during COVID-19 pandemic | Dichotomous |
| kpgoing | Keeps going out with precautionary measures | Dichotomous |
| phyact | Physical activity during the pandemic | Factor: range from 0 to 4 |
| violence | Domestic violence during pandemic | Dichotomous |
| EtOH_q | Frequency of alcohol consumption during the pandemic | Dichotomous |
| obsty | Obesity | Numeric |
| ovrw | Overweight | Numeric |
| smk_q | Smoking during pandemic | Dichotomous |
| anxdsr | Anxiety during pandemic | Dichotomous |
| hipert | Hypertension during pandemic | Dichotomous |
| news_f | Gets the news from the family | Dichotomous |
| news_sn | Gets the news from social networks | Dichotomous |
| news_tv | Gets the news from television or radio | Dichotomous |
| lckd_hosp | Hospitalization for COVID-19 infection | Dichotomous |
| COVID | Diagnosis of COVID-19 | Dichotomous |
anxst is 1 = not at all, 2 = a little, 3 = quite, 4 = a lot.
anxtr is 1 = rarely, 2 = sometimes, 3 = frequently, 4 = usually.
slpsnrr1 is 1 = 100, 2 = 80, 3 = 60, 4 = 40, 5 = 20, 6 = 0, where 100 indicates the biggest problem.
slpsob1 is 1 = 100, 2 = 80, 3 = 60, 4 = 40, 5 = 20, 6 = 0, where 100 indicates the biggest problem.
slps3 is 1 = 100, 2 = 80, 3 = 60, 4 = 40, 5 = 20, 6 = 0, where 100 indicates the biggest problem.
cocr is 0 = not at all, 1 = a little, 2 = quite, 4 = a lot.
quislt is 1 = not at all, 2 = a little, 3 = quite, 4 = a lot.
phyact is 1 = not at all, 2 = a little, 3 = quite, 4 = a lot.
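The sleep-item codings above follow one linear scheme (code 1 maps to 100, code 6 to 0, in steps of 20), so a single helper can recover the problem score. A minimal sketch, assuming exactly the mapping stated in the notes (the function name is illustrative):

```python
def sleep_item_score(code):
    """Map a sleep-item code (1..6) to the 0-100 problem scale used for
    slpsnrr1, slpsob1, and slps3: 1 -> 100 (biggest problem), 6 -> 0."""
    if not 1 <= code <= 6:
        raise ValueError("code must be between 1 and 6")
    return (6 - code) * 20

# e.g. sleep_item_score(1) -> 100, sleep_item_score(4) -> 40
```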
Figure 1. Prediction model.
Results of the feature selection process.
| | | | | | | |
|---|---|---|---|---|---|---|
| BMI | Cocr | Weight | BMI | BMI | Weight | BMI |
| Waist | Quislt | BMI | Glu | Cocr | Waist | Cocr |
| Weight | Ovrw | Waist | Cocr | Quislt | BMI | Quislt |
| Uric | Outli | SBP | HDL | Trig | Uric | |
| Trig | DBP | Quislt | Waist | HDL | | |
| HDL | Uric | Trig | HDL | Trig | | |
| LDL | Age | EtOH_q | Age | | | |
| DBP | Slps3 | Wrk_f | Glu | | | |
| Age | LDL | Ovrw | SBP | | | |
| Crea | Crea | Glu | EtOH_q | | | |
| SBP | SBP | Smk | Slps3 | | | |
| Glu | Phyactmet | DBP | | | | |
| Chol | EtOH_q | Weight | | | | |
| Height | Uric | | | | | |
Figure 2. Correlation coefficients of the continuous variables of the dataset.
Feature selection results (SMOTE).
| Model | Hyperparameters | Feature selection | Resampling | Balanced accuracy (%) | Sensitivity (%) | Specificity (%) | | | |
|---|---|---|---|---|---|---|---|---|---|
| rpart | q = 0 | RF | SMOTE | 78.68 | 78.45 | 78.92 | 78.62 | 90.07 | 60.43 |
| | | | | ±2.69 | ±4.49 | ±3.71 | ±0.27 | ±1.62 | ±4.86 |
| rpart | q = 0 | chi-squared | SMOTE | 78.20 | 85.98 | 70.43 | 70.43 | 87.68 | 67.61 |
| | | | | ±2.90 | ±3.07 | ±5.84 | ±5.84 | ±2.05 | ±4.72 |
| rpart | q = 0 | xgboost | SMOTE | | 86.91 | 74.30 | 80.24 | 89.26 | 70.30 |
| | | | | ±3.03 | ±3.27 | ±6.63 | ±3.26 | ±2.41 | ±4.63 |
| rpart | q = 0 | rpart | SMOTE | 79.89 | 85.74 | 74.03 | 79.60 | 88.98 | 68.29 |
| | | | | ±3.12 | ±2.94 | ±5.61 | ±3.27 | ±2.16 | ±4.99 |
| rpart | q = 0 | bst | SMOTE | 79.03 | 87.09 | 70.97 | 78.49 | 88.04 | 69.48 |
| | | | | ±3.63 | ±2.73 | ±7.28 | ±3.99 | ±2.61 | ±4.62 |
| C4.5 | | RF | SMOTE | 82.13 | 80.40 | 83.87 | 82.10 | 92.40 | 63.84 |
| | | | | ±1.58 | ±2.45 | ±2.43 | ±1.59 | ±1.03 | ±2.85 |
| C4.5 | | chi-squared | SMOTE | 84.60 | 90.75 | 78.44 | 84.33 | 91.15 | 77.80 |
| | | | | ±2.25 | ±1.70 | ±4.55 | ±2.41 | ±1.68 | ±3.16 |
| C4.5 | | xgboost | SMOTE | | 88.34 | 82.15 | 82.15 | 92.36 | 74.53 |
| | | | | ±2.35 | ±2.63 | ±3.81 | ±3.81 | ±1.54 | ±4.36 |
| C4.5 | | rpart | SMOTE | 83.29 | 89.80 | 76.77 | 82.99 | 90.42 | 75.78 |
| | | | | ±2.03 | ±2.46 | ±3.87 | ±2.14 | ±1.42 | ±4.16 |
| C4.5 | | bst | SMOTE | 71.87 | 72.28 | 71.47 | 71.84 | 68.50 | 75.08 |
| | | | | ±2.60 | ±3.44 | ±3.16 | ±2.61 | ±2.73 | ±2.70 |
| RF | mtry = 3 | RF | SMOTE | 85.07 | 83.09 | 87.04 | 85.04 | 93.98 | 67.94 |
| | | | | ±1.05 | ±1.64 | ±0.99 | ±1.06 | ±0.47 | ±2.20 |
| RF | mtry = 3 | chi-squared | SMOTE | 88.97 | 93.53 | 84.41 | 88.85 | 93.60 | 84.37 |
| | | | | ±0.69 | ±1.34 | ±1.15 | ±0.69 | ±0.41 | ±2.68 |
| RF | mtry = 3 | xgboost | SMOTE | **90.41** | 94.86 | 85.97 | 90.30 | 94.28 | 87.36 |
| | | | | ±1.05 | ±1.27 | ±1.75 | ±1.07 | ±0.67 | ±2.76 |
| RF | mtry = 3 | rpart | SMOTE | 87.78 | 92.38 | 83.17 | 87.65 | 93.05 | 81.88 |
| | | | | ±1.09 | ±1.55 | ±1.88 | ±1.10 | ±0.71 | ±3.07 |
| RF | mtry = 3 | bst | SMOTE | 88.85 | 92.49 | 85.22 | 88.77 | 93.85 | 82.42 |
| | | | | ±1.16 | ±1.42 | ±1.85 | ±1.18 | ±0.73 | ±2.73 |
| SVM | k = linear | RF | SMOTE | 52.12 | 43.75 | 60.48 | 51.35 | 72.94 | 30.64 |
| | | | | ±2.37 | ±3.53 | ±4.42 | ±2.36 | ±2.35 | ±1.69 |
| SVM | k = linear | chi-squared | SMOTE | 69.44 | 76.47 | 62.42 | 69.05 | 83.21 | 52.34 |
| | | | | ±1.30 | ±3.23 | ±2.12 | ±1.21 | ±0.65 | ±3.05 |
| SVM | k = linear | xgboost | SMOTE | 65.45 | 67.68 | 63.23 | 65.39 | 81.77 | 44.58 |
| | | | | ±1.55 | ±2.08 | ±2.73 | ±1.59 | ±1.11 | ±1.71 |
| SVM | k = linear, c = 1, g = 0.01 | rpart | SMOTE | | 81.59 | 64.03 | 72.27 | 84.68 | 58.86 |
| | | | | ±0.93 | ±1.41 | ±1.54 | ±0.96 | ±0.55 | ±1.77 |
| SVM | k = linear, c = 1, g = 0.01 | bst | SMOTE | 62.53 | 63.44 | 61.61 | 62.46 | 80.12 | 40.93 |
| | | | | ±1.42 | ±2.61 | ±3.45 | ±1.43 | ±1.19 | ±1.41 |
Bold indicates the highest balanced accuracy achieved by each of the models.
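SMOTE, the resampling technique used throughout this table, synthesizes new minority-class points by interpolating between a minority sample and one of its k nearest minority neighbours. A deliberately simplified pure-Python sketch of that idea (not the implementation used in the study; `smote_like`, its parameters, and the toy points are all illustrative):

```python
import math
import random

def smote_like(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority samples by linear interpolation
    between a randomly chosen minority point and one of its k nearest
    minority neighbours (a simplified SMOTE-style scheme)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class, excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy minority class: every synthetic point lies on a segment
# between two of the real points
pts = smote_like([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_synthetic=5)
```

Because the synthetic points are interpolations rather than copies, the enlarged minority class is less prone to the overfitting that plain duplication causes.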
Feature selection results (ROSE).
| Model | Hyperparameters | Feature selection | Resampling | Balanced accuracy (%) | Sensitivity (%) | Specificity (%) | | | |
|---|---|---|---|---|---|---|---|---|---|
| rpart | q = 0 | RF | ROSE | 58.03 | 92.09 | 23.97 | 41.65 | 74.77 | 72.99 |
| | | | | ±4.80 | ±18.2 | ±18.0 | ±14.3 | ±2.28 | ±23.2 |
| rpart | q = 0 | chi-squared | ROSE | | 86.40 | 37.15 | 55.81 | 77.18 | 53.33 |
| | | | | ±4.41 | ±6.22 | ±11.68 | ±6.97 | ±2.85 | ±7.13 |
| rpart | q = 0 | xgboost | ROSE | 60.55 | 87.77 | 33.33 | 53.76 | 76.26 | 52.59 |
| | | | | ±3.46 | ±2.18 | ±7.01 | ±5.74 | ±1.86 | ±6.49 |
| rpart | q = 0 | rpart | ROSE | 61.43 | 85.56 | 37.31 | 55.66 | 77.04 | 51.81 |
| | | | | ±4.21 | ±5.98 | ±11.38 | ±7.05 | ±2.68 | ±7.07 |
| rpart | q = 0 | bst | ROSE | 59.77 | 89.86 | 29.67 | 50.45 | 75.79 | 56.59 |
| | | | | ±3.95 | ±4.79 | ±10.68 | ±9.36 | ±2.17 | ±10.80 |
| C4.5 | | RF | ROSE | 57.27 | 78.04 | 36.51 | 48.00 | 75.17 | 48.96 |
| | | | | ±3.64 | ±22.74 | ±21.51 | ±10.01 | ±2.30 | ±14.65 |
| C4.5 | | chi-squared | ROSE | 66.10 | 85.92 | 46.29 | 62.13 | 79.85 | 59.38 |
| | | | | ±3.99 | ±7.71 | ±12.74 | ±6.52 | ±2.99 | ±8.90 |
| C4.5 | | xgboost | ROSE | 62.44 | 90.95 | 33.92 | 54.95 | 77.12 | 63.13 |
| | | | | ±2.26 | ±5.01 | ±8.60 | ±5.34 | ±1.59 | ±9.19 |
| C4.5 | | rpart | ROSE | | 92.56 | 42.37 | 62.16 | 79.72 | 72.61 |
| | | | | ±3.54 | ±4.92 | ±8.81 | ±5.82 | ±2.08 | ±11.21 |
| C4.5 | | bst | ROSE | 61.81 | 91.15 | 32.47 | 53.79 | 76.74 | 64.11 |
| | | | | ±2.44 | ±5.80 | ±8.14 | ±5.58 | ±1.44 | ±12.64 |
| RF | mtry = 3, ntree = 200 | RF | ROSE | 51.65 | 50.99 | 52.31 | 48.73 | 71.70 | 31.38 |
| | | | | ±4.31 | ±18.89 | ±15.62 | ±6.30 | ±4.01 | ±4.80 |
| RF | mtry = 3 | chi-squared | ROSE | | 83.93 | 46.94 | 60.89 | 79.77 | 61.04 |
| | | | | ±4.16 | ±12.29 | ±15.95 | ±8.47 | ±3.07 | ±14.25 |
| RF | mtry = 3 | xgboost | ROSE | 64.33 | 94.08 | 34.58 | 56.92 | 58.97 | 85.57 |
| | | | | ±1.91 | ±1.74 | ±4.19 | ±3.24 | ±2.72 | ±3.23 |
| RF | mtry = 3 | rpart | ROSE | 64.66 | 92.23 | 37.08 | 58.38 | 59.43 | 82.85 |
| | | | | ±1.81 | ±2.05 | ±4.17 | ±2.95 | ±2.74 | ±3.47 |
| RF | mtry = 3 | bst | ROSE | 64.56 | 93.78 | 35.34 | 57.42 | 59.19 | 85.30 |
| | | | | ±2.12 | ±2.18 | ±4.88 | ±3.56 | ±2.93 | ±3.69 |
| SVM | k = linear | RF | ROSE | 57.55 | 58.28 | 56.83 | 57.10 | 76.71 | 36.18 |
| | | | | ±2.83 | ±8.20 | ±7.42 | ±2.94 | ±2.08 | ±3.26 |
| SVM | k = linear | chi-squared | ROSE | 68.97 | 78.32 | 59.62 | 68.02 | 82.64 | 53.83 |
| | | | | ±1.86 | ±6.32 | ±7.16 | ±2.50 | ±1.60 | ±4.70 |
| SVM | k = linear | xgboost | ROSE | 65.47 | 68.52 | 62.42 | 65.40 | 81.68 | 45.19 |
| | | | | ±1.49 | ±5.59 | ±5.05 | ±5.31 | ±1.17 | ±2.93 |
| SVM | k = linear | rpart | ROSE | | 79.93 | 66.29 | 72.63 | 85.32 | 58.08 |
| | | | | ±1.40 | ±4.74 | ±5.36 | ±1.58 | ±1.45 | ±4.23 |
| SVM | k = linear | bst | ROSE | 65.54 | 69.03 | 62.04 | 65.20 | 81.67 | 45.51 |
| | | | | ±1.45 | ±6.03 | ±5.80 | ±1.45 | ±1.36 | ±3.17 |
| Deep learning | | RF | | 61.52 | 24.02 | 99.01 | 47.83 | 55.60 | 77.82 |
| | | | | ±2.53 | ±5.46 | ±6.00 | ±9.55 | ±4.38 | ±1.11 |
| Deep learning | | chi-squared | | 61.35 | 31.26 | 91.45 | 53.31 | 57.90 | 78.17 |
| | | | | ±2.08 | ±4.65 | ±1.86 | ±3.59 | ±5.17 | ±1.08 |
| Deep learning | | xgboost | | 63.19 | 30.86 | 95.53 | 54.22 | 73.03 | 78.80 |
| | | | | ±1.42 | ±2.91 | ±2.05 | ±2.41 | ±7.85 | ±6.57 |
| Deep learning | | rpart | | | 34.36 | 97.39 | 57.77 | 83.10 | 79.97 |
| | | | | ±1.69 | ±3.56 | ±6.05 | ±2.99 | ±3.06 | ±8.35 |
| Deep learning | | bst | | 63.89 | 29.77 | 98.01 | 53.85 | 84.72 | 78.97 |
| | | | | ±2.31 | ±4.59 | ±0.77 | ±4.25 | ±5.16 | ±1.08 |
Bold indicates the highest balanced accuracy achieved by each of the models.
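ROSE, the other resampling technique compared above, takes a different route from SMOTE: it draws a smoothed bootstrap, redrawing existing samples of the chosen class and perturbing them with kernel noise. A deliberately simplified sketch of that idea (a single global `bandwidth` stands in for ROSE's per-feature kernel widths; the function name and toy values are illustrative, not the R package's implementation):

```python
import random

def rose_like(samples, n_synthetic, bandwidth=0.1, seed=0):
    """Smoothed-bootstrap resampling in the spirit of ROSE: draw an
    existing sample with replacement, then jitter each feature with
    Gaussian noise of scale `bandwidth`."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_synthetic):
        base = rng.choice(samples)
        out.append(tuple(x + rng.gauss(0.0, bandwidth) for x in base))
    return out

# Each synthetic point is a small Gaussian perturbation of a real one
pts = rose_like([(0.0, 0.0), (10.0, 10.0)], n_synthetic=4)
```

Because the noise is drawn from a density centred on observed points, ROSE can also shrink the majority class, which is why its results above differ in character (higher sensitivity, much lower specificity) from the SMOTE runs.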