L J Muhammad, Ebrahem A Algehyne, Sani Sharif Usman, Abdulkadir Ahmad, Chinmay Chakraborty, I A Mohammed.
Abstract
COVID-19 or 2019-nCoV is no longer pandemic but rather endemic, with more than 651,247 people around the world having lost their lives after contracting the disease. Currently, there is no specific treatment or cure for COVID-19, and thus living with the disease and its symptoms is inevitable. This reality has placed a massive burden on limited healthcare systems worldwide, especially in developing nations. Although neither an effective, clinically proven antiviral strategy nor an approved vaccine exists to eradicate the COVID-19 pandemic, there are alternatives that may reduce the huge burden on not only limited healthcare systems but also the economic sector; the most promising include non-clinical techniques such as machine learning, data mining, deep learning and other artificial intelligence methods. These alternatives would facilitate diagnosis and prognosis for 2019-nCoV patients. In this work, supervised machine learning models for COVID-19 infection were developed using logistic regression, decision tree, support vector machine, naive Bayes, and artificial neural network algorithms on a labeled epidemiological dataset of positive and negative COVID-19 cases from Mexico. Before the models were developed, a correlation coefficient analysis was carried out to determine the strength of the relationship between each dependent feature and independent feature of the dataset. Eighty percent of the dataset was used for training the models and the remaining 20% for testing them. The performance evaluation showed that the decision tree model has the highest accuracy (94.99%), the support vector machine model has the highest sensitivity (93.34%), and the naive Bayes model has the highest specificity (94.30%). © Springer Nature Singapore Pte Ltd 2020.
Keywords: COVID-19; Dataset; Decision tree; Machine learning; Pandemic
Year: 2020 PMID: 33263111 PMCID: PMC7694891 DOI: 10.1007/s42979-020-00394-7
Source DB: PubMed Journal: SN Comput Sci ISSN: 2661-8907
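The abstract describes an 80%/20% division of the dataset into training and testing portions. A minimal sketch of such a split is given here, using the ten rows listed under "Sample of the dataset" as input. The abstract does not say whether the split was random, stratified, or seeded, so the shuffling, the seed, and the function name `split_80_20` are assumptions made for illustration only.

```python
import random

def split_80_20(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and return (train, test) lists.

    Shuffling and the fixed seed are assumptions; the source only
    states the 80%/20% proportions.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Rows from the sample table, in column order:
# Age, Sex, Pneumonia, Diabetes, Asthma, Hypertension, CVDs, Obesity,
# CKDs, Tobacco, Result
records = [
    (74, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0),
    (71, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
    (50, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
    (25, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1),
    (28, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    (67, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0),
    (44, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0),
    (62, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
    (30, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    (30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
]

train, test = split_80_20(records)
print(len(train), len(test))  # prints "8 2"
```

With only two positive rows in this sample, a plain random split can leave the test portion without any positive case; a stratified split would avoid that, but whether the authors used one is not stated.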
Fig. 1 Essential learning process for the development of predictive models
Fig. 2 Methodology to build machine learning classification models for COVID-19 infection
Dataset description
| S. No. | Feature | Description | Non-null count | Data type |
|---|---|---|---|---|
| 1 | Age | ≥ 0 | 263,007 non-null | int64 |
| 2 | Sex | 0 = female, 1 = male | 263,007 non-null | int64 |
| 3 | Pneumonia | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 4 | Diabetes | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 5 | Asthma | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 6 | Hypertension | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 7 | CVDs | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 8 | Obesity | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 9 | CKDs | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 10 | Tobacco | 0 = negative, 1 = positive | 263,007 non-null | int64 |
| 11 | Result | 0 = negative, 1 = positive | 263,007 non-null | int64 |
Profile information of the dataset
| S. No. | Feature | Minimum | Maximum | Mean | Std. deviation |
|---|---|---|---|---|---|
| 1 | Age | 0 | 120 | 42.59 | 16.90 |
| 2 | Sex | 0 | 1 | 0.49 | 0.50 |
| 3 | Pneumonia | 0 | 99 | 0.17 | 0.81 |
| 4 | Diabetes | 0 | 98 | 0.51 | 6.07 |
| 5 | Asthma | 0 | 98 | 0.38 | 5.80 |
| 6 | Hypertension | 0 | 98 | 0.52 | 5.84 |
| 7 | CVDs | 0 | 98 | 0.38 | 5.91 |
| 8 | Obesity | 0 | 98 | 0.53 | 5.92 |
| 9 | CKDs | 0 | 98 | 0.37 | 5.84 |
| 10 | Tobacco | 0 | 98 | 0.46 | 5.98 |
| 11 | Result | 0 | 1 | 0.39 | 0.49 |
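The profile statistics above can be reproduced with a few lines of code. The sketch below computes the same four quantities for one column and also counts entries outside {0, 1}, since the table reports maxima of 98 or 99 for features that the dataset description codes as 0/1. The source does not explain those values; treating them as placeholder codes for unknown answers is an assumption here, and the `diabetes` column is a hypothetical toy example, not data from the paper. The sample standard deviation is used, which is also an assumption.

```python
import statistics

def profile(values):
    """Return (minimum, maximum, mean, sample standard deviation)."""
    return (min(values), max(values),
            statistics.mean(values), statistics.stdev(values))

def count_outside_binary(values):
    """Count entries that are neither 0 nor 1."""
    return sum(1 for v in values if v not in (0, 1))

# Hypothetical toy column: eight 0/1 answers plus two placeholder-like codes.
diabetes = [0, 1, 0, 0, 1, 0, 0, 0, 98, 98]

lowest, highest, mean, std = profile(diabetes)
print(lowest, highest, round(mean, 2), round(std, 2))
print("entries outside {0, 1}:", count_outside_binary(diabetes))

# Dropping the out-of-range entries shows how strongly they inflate the mean.
valid = [v for v in diabetes if v in (0, 1)]
print("mean of valid entries:", statistics.mean(valid))
```

Even a small number of such entries shifts the mean and standard deviation well away from what a pure 0/1 column would give, which is worth checking before any model is trained on these features.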
Sample of the dataset
| Age | Sex | Pneumonia | Diabetes | Asthma | Hypertension | CVDs | Obesity | CKDs | Tobacco | Result |
|---|---|---|---|---|---|---|---|---|---|---|
| 74 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 71 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 50 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 25 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 67 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 44 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 62 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Fig. 3 Chart presentation of the profile information of the dataset
Fig. 4 Age frequency of the patients
Fig. 5 Sex frequency of the patients
Fig. 6 COVID-19 result frequency of the patients
Fig. 7 Representation of SVM
Fig. 8 Scatterplot of the correlation coefficients of the dataset features
Fig. 9 The correlation matrix of the dataset features
r value and the status of correlation coefficient
| S. No. | Independent feature | Dependent feature | r value | Status |
|---|---|---|---|---|
| 1 | Age | RT-PCR COVID-19 test result | 0.17 | Weak positive correlation coefficient relationship |
| 2 | Sex | RT-PCR COVID-19 test result | 0.085 | Weak positive correlation coefficient relationship |
| 3 | Pneumonia | RT-PCR COVID-19 test result | 0.082 | Weak positive correlation coefficient relationship |
| 4 | Diabetes | RT-PCR COVID-19 test result | 0.028 | Weak positive correlation coefficient relationship |
| 5 | Asthma | RT-PCR COVID-19 test result | 0.024 | Weak positive correlation coefficient relationship |
| 6 | Hypertension | RT-PCR COVID-19 test result | 0.03 | Weak positive correlation coefficient relationship |
| 7 | CVDs | RT-PCR COVID-19 test result | 0.025 | Weak positive correlation coefficient relationship |
| 8 | Obesity | RT-PCR COVID-19 test result | 0.034 | Weak positive correlation coefficient relationship |
| 9 | CKDs | RT-PCR COVID-19 test result | 0.025 | Weak positive correlation coefficient relationship |
| 10 | Tobacco | RT-PCR COVID-19 test result | 0.025 | Weak positive correlation coefficient relationship |
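The r values in the table above are correlation coefficients between each feature and the test result. As a minimal sketch, assuming the Pearson coefficient was the one used (the record does not name it), r can be computed as follows. The cut-offs in `describe_strength` (0.3 and 0.7) are illustrative assumptions and are not taken from the source; the input columns are the Age and Result values of the ten-row sample table.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

def describe_strength(r, weak=0.3, strong=0.7):
    """Label r with illustrative cut-offs (assumed, not from the source)."""
    if r == 0:
        return "no correlation"
    size = abs(r)
    if size < weak:
        strength = "weak"
    elif size < strong:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# Age and Result columns of the ten-row sample table.
ages = [74, 71, 50, 25, 28, 67, 44, 62, 30, 30]
results = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

r = pearson_r(ages, results)
print(round(r, 3), describe_strength(r))
print(describe_strength(0.17))  # prints "weak positive correlation"
```

Ten rows are far too few to reproduce the tabulated values, which were computed on the full dataset; the sample is used here only to show the calculation.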
Fig. 10 Decision tree model for predicting COVID-19 infection
Performance evaluation result
| S. No. | Model | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| 1 | Decision tree | 94.99 | 89.20 | 93.22 |
| 2 | Logistic regression | 94.41 | 86.34 | 87.34 |
| 3 | Naive Bayes | 94.36 | 83.76 | 94.30 |
| 4 | Support vector machine | 92.40 | 93.34 | 76.50 |
| 5 | Artificial neural network | 89.20 | 92.40 | 83.30 |
Fig. 11 Performance evaluation result
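The three metrics reported in the performance evaluation follow from the counts of true and false positives and negatives on the test set. The sketch below uses the standard definitions: accuracy = (TP + TN) / total, sensitivity = TP / (TP + FN), specificity = TN / (TN + FP). The label vectors are hypothetical and serve only to illustrate the arithmetic; they are not results from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def evaluate(y_true, y_pred):
    """Accuracy, sensitivity and specificity as percentages."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": 100 * (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": 100 * tp / (tp + fn),  # true positive rate
        "specificity": 100 * tn / (tn + fp),  # true negative rate
    }

# Hypothetical test labels and predictions (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

scores = evaluate(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.2f}%")
```

When all three figures are computed from the same test set, accuracy is a prevalence-weighted average of sensitivity and specificity, so it always lies between the two. That property gives a quick consistency check for any results table of this kind.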