Abstract
The coronavirus disease (Covid19) pandemic has had a great effect on human health worldwide since it was first detected in late 2019. A clear understanding of the structure of the available Covid19 datasets might help healthcare providers identify some cases at an early stage. In this article, we look into a Covid19 Mexican Patients' Dataset (Covid19MPD) and apply a number of machine learning algorithms to it to select the best possible classification algorithm for the death and survived cases in Mexico. We then study how feature selection enhances the performance of the specified classifiers, in order to predict severe and/or death cases from the available dataset. Results show that the J48 classifier gives the best classification accuracy, 94.41%, with RMSE = 0.2028 and ROC = 0.919, compared with the other classifiers. With feature selection, the J48 classifier predicts a surviving Covid19MPD case with 94.88% accuracy using only 10 of the 19 features.
Keywords: Covid19; classification; feature importance; feature selection; machine learning; prediction
Year: 2021 PMID: 34899078 PMCID: PMC8646298 DOI: 10.1002/cpe.6675
Source DB: PubMed Journal: Concurr Comput ISSN: 1532-0626 Impact factor: 1.831
Available features and their descriptions
| Feature | Code | Value | Note |
|---|---|---|---|
| 1 | sex | 1, 2 | Female/male |
| 2 | patient_type | 1, 2 | PATIENT_TYPE identifies the type of care the patient received in the unit: outpatient if the patient returned home, inpatient if the patient was admitted to hospital. |
| 3 | intubed | 1, 2, 97, 99 | INTUBED identifies if the patient required intubation. |
| 4 | pneumonia | 1, 2, 99 | PNEUMONIA identifies if the patient was diagnosed with pneumonia. |
| 5 | age | Range [0:110] | Age of the tested group |
| 6 | pregnancy | 1, 2, 98, 97 | PREGNANCY identifies if the patient is pregnant. |
| 7 | diabetes | 1, 2, 98 | DIABETES identifies if the patient has a diagnosis of diabetes. |
| 8 | copd | 1, 2, 98 | COPD identifies if the patient has a diagnosis of COPD. |
| 9 | asthma | 1, 2, 98 | ASMA identifies if the patient has a diagnosis of asthma. |
| 10 | inmsupr | 1, 2, 98 | INMUSUPR identifies if the patient has immunosuppression. |
| 11 | hypertension | 1, 2, 98 | HYPERTENSION identifies if the patient has a diagnosis of hypertension. |
| 12 | other_disease | 1, 2, 98 | OTRAS_COM identifies if the patient has a diagnosis of other diseases. |
| 13 | cardiovascular | 1, 2, 98 | CARDIOVASCULAR identifies if the patient has a diagnosis of cardiovascular disease. |
| 14 | obesity | 1, 2, 98 | OBESITY identifies if the patient is diagnosed with obesity. |
| 15 | renal_chronic | 1, 2, 98 | RENAL_CHRONIC identifies if the patient has a diagnosis of chronic kidney failure. |
| 16 | tobacco | 1, 2, 98 | TOBACCO identifies if the patient has a smoking habit. |
| 17 | contact_other_covid | 1, 2, 99 | OTHER_CASE identifies if the patient had contact with any other case diagnosed with SARS CoV‐2. |
| 18 | covid_res | 1, 2, 3 | RESULT identifies the result of the analysis of the sample reported by the laboratory of the National Network of Epidemiological Surveillance Laboratories (INDRE, LESP, and LAVE). |
| 19 | icu | 1, 2, 97 | ICU identifies if the patient required admission to an Intensive Care Unit. |
| 20 | class (died/survived) | 1, 2 | Indicates if the patient passed away or survived Covid19. |
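The value column above mixes ordinary codes (1, 2) with the codes 97, 98, and 99. A minimal sketch of decoding the binary clinical flags follows. It assumes that 1 means yes, 2 means no, and that 97, 98, and 99 are sentinel codes (not applicable or unspecified); the table does not state these meanings, and `decode_flag` and the sample record are hypothetical. The mapping does not apply to `sex`, `age`, `covid_res`, or the class label.

```python
from typing import Optional

# Assumption: 97, 98 and 99 are sentinel codes (not applicable / unspecified),
# and 1 / 2 mean yes / no for the binary clinical flags.
SENTINEL_CODES = {97, 98, 99}


def decode_flag(code: int) -> Optional[bool]:
    """Map a raw flag code to True (1), False (2), or None (sentinel)."""
    if code == 1:
        return True
    if code == 2:
        return False
    if code in SENTINEL_CODES:
        return None
    raise ValueError(f"unexpected code: {code}")


# Hypothetical record using column codes from the table above.
record = {"pneumonia": 1, "diabetes": 2, "pregnancy": 97, "copd": 98}
decoded = {name: decode_flag(code) for name, code in record.items()}
print(decoded)
```

Returning `None` for sentinel codes keeps them distinct from a genuine "no", which matters if a classifier would otherwise treat 97 or 99 as large numeric values.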
The distribution of all values for the available features
| Feature | Code | Female | Male |
|---|---|---|---|
| 1 | sex | 98,827 | 101,173 |

| Feature | Code | | | | |
|---|---|---|---|---|---|
| 2 | patient_type | 157,101 | 42,899 | | |
| 18 | covid_res | 77,709 | 98,859 | 23,432 | – |
| 3 | intubed | 3427 | 39,428 | 44 | 157,101 |
| 4 | pneumonia | 30,891 | 169,107 | 2 | – |
| 6 | pregnancy | 1434 | 96,842 | 551 | 101,173 |
| 7 | diabetes | 25,077 | 174,235 | 688 | – |
| 8 | copd | 3204 | 196,182 | 614 | – |
| 9 | asthma | 6445 | 192,944 | 611 | – |
| 10 | inmsupr | 3105 | 196,202 | 693 | – |
| 11 | hypertension | 32,555 | 166,805 | 640 | – |
| 12 | other_disease | 5953 | 193,138 | 909 | – |
| 13 | cardiovascular | 4507 | 194,852 | 641 | – |
| 14 | obesity | 32,509 | 166,850 | 641 | – |
| 15 | renal_chronic | 4054 | 195,313 | 633 | – |
| 16 | tobacco | 16,949 | 182,366 | 685 | – |
| 17 | contact_other_covid | 78,143 | 60,081 | 61,776 | – |
| 19 | icu | 3524 | 39,330 | 45 | 157,101 |
| 20 | class (died/survived) | 12,714 | 187,286 | – | – |
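Since every row of the distribution table describes the same patients, the counts in each row should add up to the same total. The following sketch checks this for a subset of rows copied from the table; the dictionary name `counts` and the choice of rows are illustrative only.

```python
# Counts per feature, copied from the distribution table (subset of rows).
counts = {
    "sex": [98827, 101173],
    "patient_type": [157101, 42899],
    "covid_res": [77709, 98859, 23432],
    "intubed": [3427, 39428, 44, 157101],
    "pneumonia": [30891, 169107, 2],
    "pregnancy": [1434, 96842, 551, 101173],
    "icu": [3524, 39330, 45, 157101],
    "class": [12714, 187286],
}

# Sum each row; all rows should agree on the dataset size.
totals = {name: sum(values) for name, values in counts.items()}
dataset_size = totals["sex"]
consistent = all(total == dataset_size for total in totals.values())
print(dataset_size, consistent)  # prints: 200000 True
```

The same check also shows that the fourth count for `intubed` and `icu` equals the first `patient_type` count (157,101), and that the fourth `pregnancy` count equals the male count (101,173).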
FIGURE 1 Features count distribution
Age group frequency
| Age | 0–10 | 11–20 | 21–30 | 31–40 | 41–50 | 51–60 | 61–70 | 71–80 | 81–90 | 91–100 | >100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | 377 | 1997 | 3711 | 18,965 | 24,546 | 21,666 | 14,701 | 7237 | 3725 | 1576 | 298 |
| Male | 462 | 2247 | 3558 | 17,454 | 23,687 | 21,636 | 16,182 | 9299 | 4639 | 1767 | 242 |
| Total | 839 | 4244 | 7269 | 36,419 | 48,233 | 43,302 | 30,883 | 16,536 | 8364 | 3343 | 540 |
FIGURE 2 Age group distribution
FIGURE 3 Used methods
Classification results for the selected classifiers
| Classifier used | Accuracy (%) | MAE | RMSE | ROC | Time (s) |
|---|---|---|---|---|---|
| Naïve Bayes | 84.23 | 0.1522 | 0.3771 | 0.927 | 1.37 |
| SGD | 93.64 | 0.0636 | 0.2521 | 0.50 | 23.09 |
| J48 | 94.41 | 0.0798 | 0.2028 | 0.919 | 45.63 |
| Random forest | 93.50 | 0.0774 | 0.2161 | 0.910 | 209.09 |
| K‐NN ( | 92.71 | 0.0771 | 0.2441 | 0.889 | 0.2 |
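When reading the accuracy column, it can help to compare against a trivial model that always predicts the majority class. The sketch below computes that baseline from the class counts in the distribution table (12,714 died, 187,286 survived). It assumes the error measures are taken over hard 0/1 predictions, in which case MAE equals the error rate and RMSE equals its square root; the source does not state how its MAE and RMSE were computed.

```python
import math

died, survived = 12714, 187286  # class counts from the distribution table
total = died + survived

# A trivial model that always predicts the majority class ("survived").
baseline_accuracy = 100 * survived / total

# Assumption: hard 0/1 predictions, so each absolute error is 0 or 1.
# MAE then equals the error rate, and RMSE equals its square root.
error_rate = died / total
mae = error_rate
rmse = math.sqrt(error_rate)

print(f"accuracy={baseline_accuracy:.2f}% MAE={mae:.4f} RMSE={rmse:.4f}")
# prints: accuracy=93.64% MAE=0.0636 RMSE=0.2521
```

These baseline figures coincide with the SGD row (93.64, 0.0636, 0.2521), so with classes this imbalanced, accuracy is best read alongside ROC.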
FIGURE 4 Accuracy percentages per classifier
FIGURE 5 Classifiers' MAE and RMSE results
Classification accuracy with feature selection
| Classifier | Selected features (#) | Accuracy (before/after, %) | MAE (before/after) | RMSE (before/after) | ROC (before/after) |
|---|---|---|---|---|---|
| K‐NN ( | 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 (17) | 92.71/92.82 | 0.0771/0.0706 | 0.2441/0.2585 | 0.889/0.837 |
| J48 | 1, 3, 4, 5, 8, 13, 14, 17, 18, 19 (10) | 94.41/94.88 | 0.0789/0.0752 | 0.2028/0.1994 | 0.919/0.895 |
| Random forest | 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 (17) | 93.50/92.93 | 0.0774/0.0701 | 0.2161/0.2612 | 0.910/0.697 |
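Each metric cell in this table packs two numbers as `before/after`. A small sketch of splitting those cells and computing the change for the J48 row is shown here; `parse_pair` is a hypothetical helper, and the values are copied from the table.

```python
from typing import Dict, Tuple


def parse_pair(cell: str) -> Tuple[float, float]:
    """Split a 'before/after' table cell into two floats."""
    before, after = cell.split("/")
    return float(before), float(after)


# J48 row, copied from the table above.
j48 = {
    "Accuracy": "94.41/94.88",
    "MAE": "0.0789/0.0752",
    "RMSE": "0.2028/0.1994",
    "ROC": "0.919/0.895",
}

# Change in each metric after feature selection (after minus before).
deltas: Dict[str, float] = {}
for metric, cell in j48.items():
    before, after = parse_pair(cell)
    deltas[metric] = round(after - before, 4)

print(deltas)
```

For J48, accuracy rises and both error measures fall after feature selection, while ROC drops from 0.919 to 0.895, so the metrics do not all move in the same direction.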
FIGURE 6 Accuracy comparison after feature selection
FIGURE 7 MAE and RMSE values before and after feature selection
FIGURE 8 Feature importance based on the interaction test results
Classification results comparison for all used methods
| Feature | J48 all features | J48 feature selection | J48 feature selection based on feature importance | Common features |
|---|---|---|---|---|
| 1 | sex | sex | sex | sex |
| 2 | patient_type | – | patient_type | – |
| 3 | intubed | intubed | intubed | intubed |
| 4 | pneumonia | pneumonia | pneumonia | pneumonia |
| 5 | age | age | age | age |
| 6 | pregnancy | – | pregnancy | – |
| 7 | diabetes | – | – | – |
| 8 | copd | copd | copd | copd |
| 9 | asthma | – | asthma | – |
| 10 | inmsupr | – | – | – |
| 11 | hypertension | – | hypertension | – |
| 12 | other_disease | – | – | – |
| 13 | cardiovascular | cardiovascular | cardiovascular | cardiovascular |
| 14 | obesity | obesity | – | – |
| 15 | renal_chronic | – | – | – |
| 16 | tobacco | – | – | – |
| 17 | contact_other_covid | contact_other_covid | – | – |
| 18 | covid_res | covid_res | covid_res | covid_res |
| 19 | icu | icu | – | – |
| Accuracy (%) | 94.41 | 94.88 | 94.65 | 94.61 |
| MAE | 0.0789 | 0.0752 | 0.0778 | 0.0786 |
| RMSE | 0.2028 | 0.1994 | 0.2014 | 0.200 |
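The "Common features" column can be read as the intersection of the two selected feature sets. The sketch below reproduces it from the table's own columns; the variable names are illustrative.

```python
# Features kept by J48 feature selection (third column of the table).
feature_selection = {
    "sex", "intubed", "pneumonia", "age", "copd", "cardiovascular",
    "obesity", "contact_other_covid", "covid_res", "icu",
}

# Features kept by selection based on feature importance (fourth column).
feature_importance = {
    "sex", "patient_type", "intubed", "pneumonia", "age", "pregnancy",
    "copd", "asthma", "hypertension", "cardiovascular", "covid_res",
}

# Features present in both selections.
common = feature_selection & feature_importance
print(sorted(common))
# prints: ['age', 'cardiovascular', 'copd', 'covid_res', 'intubed', 'pneumonia', 'sex']
```

The seven resulting features match the "Common features" column, which reaches 94.61% accuracy according to the table.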