Ahmed Hamed, Ahmed Sobhy, Hamed Nassar.
Abstract
Great efforts are now underway to control the coronavirus disease 2019 (COVID-19). Millions of people are medically examined, and their data keep piling up awaiting classification. The data are typically both incomplete and heterogeneous, which hampers classical classification algorithms. Some researchers have recently modified the popular KNN algorithm as a solution, handling incompleteness by imputation and heterogeneity by converting categorical data into numbers. In this article, we introduce a novel KNN variant (KNNV) algorithm that provides better results, as demonstrated by thorough experimental work. We employ rough set theoretic techniques to handle both incompleteness and heterogeneity, as well as to find an ideal value for K. The KNNV algorithm takes an incomplete, heterogeneous dataset containing medical records of people and identifies the cases with COVID-19. In the process, we use two popular distance metrics, Euclidean and Mahalanobis, in an effort to widen the operational scope. The KNNV algorithm is implemented and tested on a real dataset from the Italian Society of Medical and Interventional Radiology. The experimental results show that it can efficiently and accurately classify COVID-19 cases. It is also compared to three KNN derivatives. The comparison results show that it greatly outperforms all its competitors in terms of four metrics: precision, recall, accuracy, and F-score. The algorithm given in this article can be easily applied to classify other diseases. Moreover, its methodology can be further extended to general classification tasks outside the medical field. © King Fahd University of Petroleum & Minerals 2020.
Keywords: COVID-19 diagnosis; Euclidean; Heterogeneous data; Incomplete data; KNN; Mahalanobis; Rough set theory
Year: 2021 PMID: 33688457 PMCID: PMC7931985 DOI: 10.1007/s13369-020-05212-z
Source DB: PubMed Journal: Arab J Sci Eng ISSN: 2191-4281 Impact factor: 2.807
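For orientation, the baseline strategy that the abstract attributes to earlier work (imputation for missing values, numeric encoding for categorical values, then plain KNN) can be sketched as follows. This is not the KNNV algorithm, whose rough set steps are not described in this record. The function names, the use of `None` for a missing value, mean imputation, the Yes/No to 1/0 encoding, and the four sample rows are all assumptions made for illustration. Only the Euclidean distance is shown; a Mahalanobis distance would replace the `euclidean` function.

```python
import math
from collections import Counter

def impute_and_encode(records):
    """Encode Yes/No as 1.0/0.0, then fill each missing value (None)
    with the mean of the observed values in its column.
    Assumes every column has at least one observed value."""
    mapping = {"Yes": 1.0, "No": 0.0}
    encoded = [[mapping.get(v, v) for v in row] for row in records]
    n_cols = len(encoded[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in encoded if row[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in encoded]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training rows nearest to query."""
    ranked = sorted(zip(train_x, train_y), key=lambda p: euclidean(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up records (two categorical and two numerical features), not the paper's data.
records = [
    ["Yes", "Yes", 0.97, 0.78],
    ["No",  None,  0.39, 0.64],
    ["No",  "No",  0.77, None],
    ["Yes", "No",  0.08, 0.80],
]
labels = ["COVID-19", "COVID-19", "Flu", "Flu"]

train_x = impute_and_encode(records)
prediction = knn_predict(train_x, labels, [1.0, 1.0, 0.9, 0.8], k=3)
print(prediction)
```

Mean imputation and 0/1 encoding are the simplest choices available; they are used here only to make the contrast with a rough set based treatment concrete.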
A toy IHC dataset made of 13 records
| Male | Yes | Yes | Yes | 0.97 | 0.78 | COVID-19 | |||
| Yes | No | Yes | 0.39 | 0.64 | COVID-19 | ||||
| Female | No | No | Yes | 0.77 | 0.79 | 0.39 | Flu | ||
| Male | Yes | No | Yes | No | 0.08 | 0.8 | 0.37 | Flu | |
| Female | Yes | Yes | No | 0.8 | 0.1 | COVID-19 | |||
| Female | No | Yes | 0.42 | 0.55 | 0.39 | Flu | |||
| Male | Yes | Yes | No | Yes | 0.98 | COVID-19 | |||
| Male | Yes | No | Yes | Yes | 0.43 | 0.42 | 0.36 | Flu | |
| Yes | No | No | Yes | 0.96 | 0.11 | Flu | |||
| Female | No | Yes | Yes | 0.34 | 0.55 | COVID-19 | |||
| Male | Yes | Yes | Yes | 0.38 | 0.39 | 0.81 | Flu | ||
| Male | Yes | Yes | Yes | No | 0.85 | COVID-19 | |||
| Female | No | Yes | 0.31 | 0.59 | 0.37 | Flu |
Each row is a patient and each column is a feature: the categorical features are listed first, followed by the numerical features, with d being the decision label. Because of the missing values, this dataset is roughly 20% incomplete.
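The degree of incompleteness of such a dataset is the share of missing cells among all cells. A minimal sketch of that arithmetic, assuming missing values are stored as `None`; the function name and the two sample rows are hypothetical and are not taken from the 13 records above.

```python
def missing_fraction(records, missing=None):
    """Fraction of cells, over all rows, that hold the missing marker."""
    total = sum(len(row) for row in records)
    absent = sum(1 for row in records for value in row if value is missing)
    return absent / total

# Two made-up rows of five cells each, with two cells missing in total.
sample = [
    ["Male", "Yes", None, 0.97, "COVID-19"],
    ["Female", None, "No", 0.77, "Flu"],
]
fraction = missing_fraction(sample)
print(f"{fraction:.0%} of the cells are missing")
```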
IHC dataset used in the experiments, with 68 COVID patients and 62 Flu patients described by 16 conditional features and one decision feature d
| Feature | Type | Value set |
|---|---|---|
| Age | Numerical | [4–90] |
| Gender | Categorical | {Male, Female} |
| Fever | Categorical | {Yes, No} |
| Dyspnea | Categorical | {Yes, No} |
| Nasal | Categorical | {Yes, No} |
| Cough | Categorical | {Yes, No} |
| Partial pressure of oxygen (PO2) | Numerical | [32–292] |
| C-reactive protein (CRP) | Numerical | [0.75–23] |
| Asthenia | Categorical | {Yes, No} |
| Leukopenia | Categorical | {Yes, No} |
| Exposure to COVID-19 patients | Categorical | {Yes, No} |
| Coming from high risk zone | Categorical | {Yes, No} |
| Temperature | Numerical | [35.7–40] |
| Blood test | Categorical | {Yes, No} |
| Polymerase chain reaction (RT-PCR) | Categorical | {Positive, Negative} |
| Medical history | Categorical | {Cancer, croonic, asthma, COPD, chronic, DM} |
| Decision label | Categorical | {COVID-19, Flu} |
Average values of precision, recall, accuracy, and F-score achieved by KNNV and three related algorithms using both Euclidean and Mahalanobis distances
| Euclidean | Mahalanobis | |||||||
|---|---|---|---|---|---|---|---|---|
| M | cs | M | cs | |||||
| Precision | 0.59 | 0.57 | 0.81 | 0.42 | 0.39 | 0.71 | ||
| Recall | 0.61 | 0.66 | 0.76 | 0.49 | 0.70 | 0.64 | ||
| Accuracy | 0.66 | 0.51 | 0.67 | 0.49 | 0.49 | 0.71 | ||
| F-score | 0.61 | 0.59 | 0.65 | 0.46 | 0.48 | 0.68 | ||
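The four metrics reported in these tables follow the standard definitions based on the counts of true and false positives and negatives. The sketch below applies those definitions to a single run; treating COVID-19 as the positive class is an assumption, the function name and sample labels are hypothetical, and the way the paper aggregates runs into average, maximum, and minimum values is not reproduced here.

```python
def classification_metrics(y_true, y_pred, positive="COVID-19"):
    """Precision, recall, accuracy, and F-score for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    # Guard against empty denominators by reporting 0.0 in those cases.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_score": f_score}

# Made-up labels for six patients: 2 TP, 1 FN, 1 FP, 2 TN.
y_true = ["COVID-19", "COVID-19", "COVID-19", "Flu", "Flu", "Flu"]
y_pred = ["COVID-19", "COVID-19", "Flu", "COVID-19", "Flu", "Flu"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```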
Maximum values of precision, recall, accuracy, and F-score achieved by KNNV and three related algorithms using both Euclidean and Mahalanobis distances
| Euclidean | Mahalanobis | |||||||
|---|---|---|---|---|---|---|---|---|
| M | cs | M | cs | |||||
| Precision | 0.77 | 0.65 | 0.89 | 0.53 | 0.41 | 0.77 | ||
| Recall | 0.81 | 1 | 0.93 | 0.60 | 1 | 0.67 | ||
| Accuracy | 1 | 0.69 | 1 | 0.60 | 0.55 | 0.77 | ||
| F-score | 0.84 | 0.79 | 1 | 0.54 | 0.52 | 0.71 | ||
Minimum values of precision, recall, accuracy, and F-score achieved by KNNV and three related algorithms using both Euclidean and Mahalanobis distances
| Euclidean | Mahalanobis | |||||||
|---|---|---|---|---|---|---|---|---|
| M | cs | M | cs | |||||
| Precision | 0.39 | 0.44 | 0.60 | 0.34 | 0.36 | 0.58 | ||
| Recall | 0.24 | 0.32 | 0.44 | 0.33 | 0.29 | 0.51 | ||
| Accuracy | 0.33 | 0.43 | 0.58 | 0.41 | 0.31 | 0.51 | ||
| F-score | 0.31 | 0.41 | 0.55 | 0.38 | 0.29 | 0.50 | ||
Mean classification accuracy of KNNV for each feature separately
| Feature | Accuracy | Rank |
|---|---|---|
| Age | 0.50 | 9 |
| Gender | 0.52 | 5 |
| Fever | 0.59 | 2 |
| Dyspnea | 0.46 | 13 |
| Nasal | 0.65 | 1 |
| Cough | 0.47 | 12 |
| Partial pressure of oxygen (PO2) | 0.48 | 11 |
| C-reactive protein (CRP) | 0.52 | 6 |
| Asthenia | 0.54 | 3 |
| Leukopenia | 0.53 | 4 |
| Exposure to COVID-19 patients | 0.52 | 7 |
| Coming from high risk zone | 0.52 | 8 |
| Temperature | 0.44 | 14 |
| Blood test | 0.56 | 15 |
| Polymerase chain reaction (RT-PCR) | 0.49 | 10 |
| Medical history | 0.53 | 16 |