Ritesh Jha, Vandana Bhattacharjee, Abhijit Mustafi, Sudip Kumar Sahana.
Abstract
The novel coronavirus disease (COVID-19) outbreak, which began in a seafood market in Wuhan, Hubei Province, China, in mid-December 2019, has spread to almost all countries and territories worldwide. Faulty diagnosis of a disease causes psychological harm, and this was clearly visible during the spread of COVID-19. This research aims to address that issue by providing a better solution for diagnosing COVID-19. The paper also addresses the important problem of scarce data for disease prediction models by elaborating on data handling techniques. Special focus is therefore given to data pre-processing and handling, with the aim of developing an improved machine learning model for COVID-19 diagnosis. Random Forest (RF), Decision Tree (DT), K-Nearest Neighbor (KNN), Logistic Regression (LR), Support Vector Machine (SVM), and Deep Neural Network (DNN) models are developed on the Hospital Israelita Albert Einstein (São Paulo, Brazil) dataset to diagnose COVID-19. The dataset is pre-processed, and a distributed DT is applied to rank the features. Data augmentation is applied to generate larger datasets and improve classification accuracy. The DNN model outperforms all other techniques, giving the highest accuracy of 96.99%, recall of 96.98%, and precision of 96.94%, which is better than or comparable to other research work. All algorithms are implemented in a distributed environment on the Spark platform.
Keywords: COVID-19; classification; data augmentation; data pre-processing; disease diagnosis
Year: 2022 PMID: 36033018 PMCID: PMC9416861 DOI: 10.3389/fpsyg.2022.951027
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Dataset description.
| Name of data set | Positive samples | Negative samples | Number of features |
| Hospital Israelita Albert Einstein at Sao Paulo, Brazil | 558 | 5,086 | 111 |
Features selected after applying distributed decision tree.
| Rank | Name of feature | Rank | Name of feature |
| 1 | Leukocytes | 23 | Patient admitted to intensive care unit |
| 2 | Patient age quantile | 24 | Segmented |
| 3 | Red blood Cells | 25 | Relationship (Patient/Normal) |
| 4 | Platelets | 26 | Strepto A |
| 5 | Monocytes | 27 | Proteina C reativa mg/dL |
| 6 | pCO2 | 28 | pO2 (venous blood gas analysis) |
| 7 | Eosinophils | 29 | Sodium |
| 8 | Basophils | 30 | Hb saturation |
| 9 | Mean platelet volume | 31 | Creatinine |
| 10 | Lymphocytes | 32 | Influenza B |
| 11 | Mean corpuscular hemoglobin | 33 | Influenza A |
| 12 | Urea | 34 | Urine—Leukocytes |
| 13 | Rhinovirus/Enterovirus | 35 | Respiratory Syncytial Virus |
| 14 | Adenovirus | 36 | Urine—Red blood cells |
| 15 | Patient admitted to regular ward (1 = yes) | 37 | pH (venous blood gas analysis) |
| 16 | Aspartate transaminase | 38 | Coronavirus229E |
| 17 | Hemoglobin | 39 | Influenza B |
| 18 | CoronavirusNL63 | 40 | Inf A H1N1 2009 |
| 19 | Red blood cell distribution width (RDW) | 41 | Coronavirus HKU1 |
| 20 | Rods # | 42 | Parainfluenza 3 |
| 21 | Urine—Aspect | 43 | Parainfluenza 1 |
| 22 | Mean corpuscular volume (MCV) | 44 | Leukocytes |
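The paper ranks features with a distributed decision tree before classification. The sketch below is a minimal single-node illustration of the underlying idea, ranking features by their best single-split Gini impurity reduction; the toy data and feature names are hypothetical, not taken from the Einstein dataset, and the distributed Spark aspect is omitted.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(xs, ys, threshold):
    """Impurity reduction from splitting feature values xs at threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(ys) - weighted

def rank_features(rows, labels, names):
    """Rank features by the best achievable single-split Gini gain."""
    scores = []
    for j, name in enumerate(names):
        xs = [r[j] for r in rows]
        best = max(split_gain(xs, labels, t) for t in set(xs))
        scores.append((best, name))
    return [name for _, name in sorted(scores, reverse=True)]

# Toy data: feature 0 separates the classes perfectly, feature 1 is noise.
rows = [(0.1, 5), (0.2, 1), (0.9, 5), (0.8, 1)]
labels = [0, 0, 1, 1]
print(rank_features(rows, labels, ["leukocytes", "noise"]))
# → ['leukocytes', 'noise']
```

A full decision tree repeats this split search recursively; ranking by aggregate impurity reduction across the tree is what yields importance orderings like the table above.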
FIGURE 1. Structure of DNN used in our research work.
FIGURE: Spark distributed KNN.
FIGURE: Classifiers framework using MLlib.
Description of datasets used in this research.
| Dataset | Instances of class 0 | Instances of class 1 | Total instances |
| DR1 | 998 | 282 | 1,280 |
| DS1 | 27,000 | 3,000 | 30,000 |
| DS2 | 35,000 | 5,000 | 40,000 |
| DS3 | 32,086 | 3,558 | 35,644 |
| DS4 | 40,086 | 5,558 | 45,644 |
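The original DR1 data are heavily imbalanced (282 positive vs. 998 negative rows), and the paper grows them into the larger DS1–DS4 sets by data augmentation. The exact augmentation method is not reproduced here; a common choice for tabular minority oversampling is SMOTE-style interpolation between same-class samples, sketched below with hypothetical two-feature rows.

```python
import random

def interpolate(a, b, alpha):
    """Point on the line segment between two same-class samples."""
    return [x + alpha * (y - x) for x, y in zip(a, b)]

def augment(samples, target, rng=random.Random(0)):
    """Grow a class to `target` rows by interpolating random sample pairs."""
    out = list(samples)
    while len(out) < target:
        a, b = rng.sample(samples, 2)
        out.append(interpolate(a, b, rng.random()))
    return out

# Hypothetical minority-class rows (two features each).
positives = [[1.0, 0.2], [0.9, 0.3], [1.1, 0.1]]
grown = augment(positives, 10)
print(len(grown))  # → 10
```

Because every synthetic row lies between two real rows, the augmented class stays inside the original feature ranges, which is why such oversampling tends to improve minority-class recall without inventing out-of-distribution values.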
Performance analysis of classifiers on dataset DR1 (#features = 44, #rows = 1,280).
| Classifier | Parameters | F1-score (%) | Precision (%) | Recall (%) | Accuracy (%) |
| DNN | 4 hidden layers | 91.51 | 92.24 | 93.13 | 93.16 |
| KNN | – | 89.31 | 88.54 | 90.94 | 90.95 |
| RF | Max_depth = 10 # Trees = 100 | 90.27 | 89.94 | 91.86 | 91.89 |
| DT | Max_depth = 10 | 89.43 | 88.96 | 90.00 | 90.12 |
| LR | – | 90.42 | 90.42 | 90.41 | 90.55 |
| SVM | – | 90.13 | 90.13 | 90.94 | 90.95 |
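All the performance tables report the same four metrics. For reference, they follow from the confusion-matrix counts (true/false positives and negatives) as sketched here with hypothetical counts, not figures from the paper:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of predicted positives, how many are real
    recall = tp / (tp + fn)             # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=90, fp=10, fn=10, tn=90)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # → 0.9 0.9 0.9 0.9
```

On imbalanced data such as DR1, accuracy alone is misleading (predicting all-negative already scores about 89%), which is why the tables also track precision, recall, and F1.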
Performance analysis of classifiers on DS4 dataset.
| Classifier | Parameters | F1-score (%) | Precision (%) | Recall (%) | Accuracy (%) |
| DNN | 4 hidden layers | 95.80 | 95.80 | 95.80 | 95.80 |
| KNN | – | 78.31 | 79.10 | 80.35 | 80.35 |
| RF | Max_depth = 10 # Trees = 100 | 68.67 | 80.62 | 76.63 | 76.63 |
| DT | Max_depth = 10 | 80.15 | 80.11 | 80.18 | 80.18 |
| LR | – | 93.54 | 93.50 | 93.85 | 93.85 |
| SVM | – | 93.16 | 93.04 | 93.16 | 93.16 |
Performance analysis of classifiers on DS1 dataset.
| Classifier | Parameters | F1-score (%) | Precision (%) | Recall (%) | Accuracy (%) |
| DNN | 4 hidden layers | 95.29 | 95.25 | 95.46 | 95.56 |
| KNN | – | 93.74 | 94.68 | 94.65 | 94.65 |
| RF | Max_depth = 10 # Trees = 100 | 88.50 | 92.39 | 91.7 | 91.7 |
| DT | Max_depth = 10 | 94.66 | 94.62 | 94.7 | 94.7 |
| LR | – | 88.08 | 89.37 | 90.93 | 90.93 |
| SVM | – | 91.96 | 91.51 | 91.65 | 91.97 |
FIGURE 2. Accuracy plot for different datasets.
Comparison of results with other research work.
| Study | Dataset used | Classifier used | Accuracy (%) | AUC | F1-score |
| – | Hospital Israelita Albert Einstein, Brazil | MLP | 93.13 | 0.96 | 0.93 |
| – | Hospital Israelita Albert Einstein, Brazil | SVM, RF | – | 0.84 | 0.72 |
| – | Wenzhou Central Hospital and Cangnan People’s Hospital, China | SVM | 80.00 | – | – |
| – | Hospital Israelita Albert Einstein, Brazil | XGB | – | 0.66 | – |
| – | Hospital Israelita Albert Einstein, Brazil | CNN-LSTM | 92.30 | 0.90 | 0.93 |
| – | Hospital Israelita Albert Einstein, Brazil | SVM | 95 | 0.95 | 0.94 |
| – | Hospital Israelita Albert Einstein, Brazil | CNN | 80 | – | 0.78 |
| – | Hospital Israelita Albert Einstein, Brazil | ANN | 90 | 0.95 | – |
| – | Hospital Israelita Albert Einstein, Brazil | XGBoost | 92.67 | – | 0.93 |
| – | Hospital Israelita Albert Einstein, Brazil | CNN | 92.52 | – | – |
| Our work | Hospital Israelita Albert Einstein, Brazil | DNN (DS2) | 96.99 | – | 96.96 |
| | | KNN (DS1) | 94.65 | – | 93.74 |
| | | RF (DR1) | 91.89 | – | 90.27 |
| | | DT (DS1) | 94.7 | – | 94.66 |
Bold values indicate the highest scores.
Execution time of the DNN classifier on different datasets.
| Datasets | Time (minutes) with DNN classifier |
| DR1 | 0.15 |
| DS1 | 4.22 |
| DS2 | 5.21 |
| DS3 | 4.23 |
| DS4 | 5.23 |
Performance analysis of classifiers on DS2 dataset.
| Classifier | Parameters | F1-score (%) | Precision (%) | Recall (%) | Accuracy (%) |
| DNN | 4 hidden layers | 96.96 | 96.94 | 96.98 | 96.99 |
| KNN | – | 94.43 | 94.53 | 94.50 | 94.50 |
| RF | Max_depth = 10 # Trees = 100 | 85.11 | 88.31 | 86.26 | 86.22 |
| DT | Max_depth = 10 | 90.81 | 90.88 | 90.76 | 90.76 |
| LR | – | 90.48 | 90.50 | 91.40 | 91.40 |
| SVM | – | 92.96 | 93.51 | 93.65 | 93.65 |
Performance analysis of classifiers on DS3 dataset.
| Classifier | Parameters | F1-score (%) | Precision (%) | Recall (%) | Accuracy (%) |
| DNN | 4 hidden layers | 94.55 | 94.51 | 94.73 | 94.73 |
| KNN | – | 87.85 | 87.09 | 90.79 | 90.79 |
| RF | Max_depth = 10 # Trees = 100 | 86.41 | 82.43 | 90.79 | 90.70 |
| DT | Max_depth = 10 | 89.18 | 88.59 | 90.09 | 90.09 |
| LR | – | 93.38 | 93.35 | 93.47 | 93.47 |
| SVM | – | 93.93 | 93.94 | 93.94 | 93.94 |