Shahadat Uddin, Ibtisham Haque, Haohui Lu, Mohammad Ali Moni, Ergun Gide.
Abstract
Disease risk prediction is a rising challenge in the medical domain. Researchers have widely used machine learning algorithms to solve this challenge. The k-nearest neighbour (KNN) algorithm is the most frequently used among the wide range of machine learning algorithms. This paper presents a study on different KNN variants (Classic, Adaptive, Locally adaptive, k-means clustering, Fuzzy, Mutual, Ensemble, Hassanat and Generalised mean distance) and compares their performance for disease prediction. This study analysed these variants in depth through implementations and experiments using eight machine learning benchmark datasets obtained from Kaggle, the UCI Machine Learning Repository and OpenML. The datasets were related to different disease contexts. We considered the performance measures of accuracy, precision and recall for comparative analysis. The average accuracy values of these variants ranged from 64.22% to 83.62%. The Hassanat KNN showed the highest average accuracy (83.62%), followed by the ensemble approach KNN (82.34%). A relative performance index is also proposed based on each performance measure to assess each variant and compare the results. This study identified Hassanat KNN as the best performing variant based on the accuracy-based version of this index, followed by the ensemble approach KNN. This study also provided a relative comparison among KNN variants based on precision and recall measures. Finally, this paper summarises which KNN variant is the most promising candidate to follow under the consideration of three performance measures (accuracy, precision and recall) for disease prediction. Healthcare researchers and stakeholders could use the findings of this study to select the appropriate KNN variant for predictive disease risk analytics.
Year: 2022 PMID: 35428863 PMCID: PMC9012855 DOI: 10.1038/s41598-022-10358-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Visual illustration of the KNN algorithm.
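Since the figure itself is not reproduced here, the classic KNN idea it illustrates can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes Euclidean distance and a simple majority vote, and the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Assumption: Euclidean distance; ties are resolved by Counter ordering.
    """
    # Distance from the query to every training point, paired with its label.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Sort by distance only, so labels never need to be comparable.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-class data (hypothetical): class 0 near (1, 1), class 1 near (4, 4).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))
print(knn_predict(train_X, train_y, (4.0, 4.0), k=3))
```

The variants compared in this study change one or more of these ingredients, such as the distance function, the choice of k, or the voting rule.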
A brief list of eight disease datasets considered in this study.
| ID | Datasets | Features | Data size | References |
|---|---|---|---|---|
| D1 | Heart Attack Possibilities | 13 | 303 | Bhat |
| D2 | Heart Failure Outcomes | 12 | 299 | Chicco et al. |
| D3 | Diabetes | 8 | 768 | Mahgoub |
| D4 | Heart Disease Prediction | 13 | 270 | Bhat |
| D5 | Chronic Kidney Disease Preprocessed | 24 | 400 | Soundarapandian |
| D6 | Chronic Kidney Disease Prediction | 13 | 400 | Soundarapandian |
| D7 | Pima Indians Diabetes | 8 | 767 | Smith et al. |
| D8 | Breast Cancer | 5 | 569 | Suwal |
Figure 2. Confusion Matrix.
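The three performance measures used in the tables that follow (accuracy, precision and recall) are standard functions of the four confusion-matrix cells. The sketch below uses the usual textbook definitions for a binary problem; the function name and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for illustration only.
acc, prec, rec = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```

Precision penalises false positives and recall penalises false negatives, which is why a variant can rank differently across the three tables.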
Accuracy (%) comparison among KNN variants.
| Dataset ID | Classic KNN | Adaptive KNN | Locally adaptive KNN | Fuzzy KNN | K-means clustering-based KNN | Weight adjusted KNN | Hassanat KNN | Generalised mean distance KNN | Mutual KNN | Ensemble approach KNN |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 76.35 | 73.64 | 69.59 | 73.65 | 39.86 | 73.65 | 69.59 | 71.62 | 77.03 | |
| D2 | 58.87 | 63.83 | 58.87 | 63.83 | 47.52 | 63.83 | 67.38 | 62.41 | 65.96 | |
| D3 | 75.25 | 75.00 | 75.51 | 74.24 | 68.18 | 74.24 | 76.26 | 74.24 | 74.24 | |
| D4 | 79.51 | 78.69 | 76.23 | 68.85 | 80.33 | 77.05 | 80.33 | 81.15 | ||
| D5 | 96.26 | 96.26 | 95.72 | 67.38 | 95.72 | 96.79 | 95.72 | 93.05 | ||
| D6 | 96.92 | 97.44 | 97.44 | 64.10 | 97.44 | 96.41 | 97.44 | 94.87 | 92.31 | |
| D7 | 73.88 | 73.60 | 75.00 | 73.60 | 68.82 | 73.60 | 75.56 | 74.44 | 74.44 | |
| D8 | 90.38 | 90.03 | 91.07 | 89 | 91.07 | 91.07 | 91.41 | 90.72 | 90.38 | |
| Average | 80.93 | 81.32 | 80.20 | 81.44 | 64.22 | 81.44 | 83.62 | 80.62 | 80.99 | 82.34 |
Precision (%) comparison among different KNN variants.
| Dataset ID | Classic KNN | Adaptive KNN | Locally adaptive KNN | Fuzzy KNN | K-means clustering-based KNN | Weight adjusted KNN | Hassanat KNN | Generalised mean distance KNN | Mutual KNN | Ensemble approach KNN |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 80.90 | 77.42 | 77.11 | 76.84 | 51.52 | 76.84 | 77.11 | 73.53 | 78.57 | |
| D2 | 39.13 | 46.51 | 42.65 | 46.15 | 34.15 | 46.15 | 52.78 | 46.27 | 50 | |
| D3 | 62.16 | 60.32 | 60.61 | 61.17 | 49.12 | 61.17 | 63.48 | 58.91 | 60.75 | |
| D4 | 82.98 | 81.25 | 77.55 | 75 | 84.78 | 78 | 78.57 | 85.11 | ||
| D5 | 67.38 | |||||||||
| D6 | 99.18 | 98.40 | 64.10 | 99.17 | 98.39 | |||||
| D7 | 60 | 58.49 | 60 | 59.78 | 50.75 | 59.78 | 62.63 | 58.97 | 62.07 | |
| D8 | 98.98 | 98.06 | 97.52 | 97.49 | 97.12 | 98.98 | ||||
| Average | 78.02 | 77.65 | 76.73 | 78.54 | 61.19 | 78.54 | 81.10 | 76.85 | 77.99 |
Recall (%) comparison among different KNN variants.
| Dataset ID | Classic KNN | Adaptive KNN | Locally adaptive KNN | Fuzzy KNN | K-means clustering-based KNN | Weight adjusted KNN | Hassanat KNN | Generalised mean distance KNN | Mutual KNN | Ensemble approach KNN |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 80 | 80 | 71.11 | 81.11 | 18.89 | 81.11 | 71.11 | 83.33 | 85.56 | |
| D2 | 37.50 | 41.67 | 60.42 | 37.50 | 58.33 | 37.50 | 39.58 | 25 | 56.25 | |
| D3 | 55.20 | 60.80 | 50.40 | 22.40 | 50.40 | 58.40 | 60.80 | 52 | 52 | |
| D4 | 69.64 | 69.64 | 67.86 | 73.21 | 48.21 | 73.21 | 69.64 | 69.64 | 71.43 | |
| D5 | 94.44 | 94.44 | 97.62 | 93.65 | 93.65 | 95.24 | 97.62 | 93.65 | 89.68 | |
| D6 | 95.20 | 96.80 | 98.40 | 96 | 96 | 95.20 | 97.60 | 92 | 88 | |
| D7 | 50.89 | 55.36 | 49.11 | 30.36 | 49.11 | 55.36 | 48.21 | 42.86 | ||
| D8 | 88.24 | 89.14 | 89.14 | 87.78 | 89.14 | 89.14 | 88.69 | 88.24 | ||
| Average | 71.39 | 73.76 | 76.27 | 71.27 | 58.25 | 71.27 | 73.93 | 70.18 | 71.75 |
Comparison of KNN variants showing the number of times they presented the highest measurement values.
| KNN Variants | Accuracy measure (#) | Precision measure (#) | Recall measure (#) |
|---|---|---|---|
| Classic KNN | 0 | 2 | 0 |
| Adaptive KNN | 1 | 1 | 1 |
| Locally Adaptive KNN | 2 | 1 | 2 |
| Fuzzy KNN | 1 | 4 | 0 |
| K-means Clustering-based KNN | 0 | 0 | 2 |
| Weight Adjusted KNN | 1 | 4 | 0 |
| Hassanat KNN | 1 | 4 | 1 |
| Generalised Mean Distance KNN | 1 | 1 | 3 |
| Mutual KNN | 0 | 3 | 1 |
| Ensemble Approach KNN | 3 | 7 | 0 |
Figure 3. Average relative performance index (RPI) scores for the three performance measures.
Summary of the different characterisations of measures of all KNN variants this study considered in terms of revealing the highest values.
| KNN Variants | Accuracy: Average | Accuracy: # of times | Accuracy: RPI | Precision: Average | Precision: # of times | Precision: RPI | Recall: Average | Recall: # of times | Recall: RPI |
|---|---|---|---|---|---|---|---|---|---|
| Adaptive KNN | |||||||||
| Locally Adaptive KNN | |||||||||
| Fuzzy KNN | |||||||||
| Kmeans Clustering KNN | |||||||||
| Weight Adjusted KNN | |||||||||
| Hassanat KNN | |||||||||
| Generalised Mean Distance KNN | |||||||||
| Mutual KNN | |||||||||
| Ensemble Approach KNN | |||||||||
Results from the one-way ANOVA test for checking the significance of the difference of three performance measures across the ten KNN variants considered in this study.
| Measure | Source of variation | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| Accuracy | Between Groups | 2200.496 | 9 | 244.500 | 1.594 | 0.134 |
| | Within Groups | 10,735.168 | 70 | 153.360 | | |
| | Total | 12,935.664 | 79 | | | |
| Precision | Between Groups | 2468.842 | 9 | 274.316 | 0.670 | 0.733 |
| | Within Groups | 28,675.727 | 70 | 409.653 | | |
| | Total | 31,144.569 | 79 | | | |
| Recall | Between Groups | 1915.288 | 9 | 212.810 | 0.424 | 0.918 |
| | Within Groups | 35,172.935 | 70 | 502.470 | | |
| | Total | 37,088.223 | 79 | | | |
Comparison of KNN variants through advantages and limitations.
| KNN Variant | Advantage(s) | Limitation(s) |
|---|---|---|
| Classic KNN | Low time complexity. Can classify at high speeds compared to other machine learning algorithms | It does not consider minority class and weight of data points, which may cause the accuracy to fall for noisy datasets |
| Adaptive KNN (A-KNN) | Perform consistently better with small scale datasets | It does not provide the optimal |
| Locally Adaptive KNN (LA-KNN) | It generally improves the classification performance by considering classes that are discriminated by the classic KNN properly rank the accuracies resulting from multiple | The variant is prone to a higher computational complexity than other variants. The high time complexity makes the algorithm undesirable for large-scale datasets |
| Fuzzy KNN (F-KNN) | It considers class frequency and weight making it more probable in making a correct prediction | This variant does not provide an optimal |
| K-means Clustering-based KNN (KM-KNN) | The KM-KNN variant reduces the time complexity of the classic KNN algorithm by truncating the training dataset through clustering | The algorithm is unsuitable for noisy datasets as it clusters the training data. Noisy datasets would produce uneven clusters and thus affect the classification process |
| Weight Adjusted KNN (WA-KNN) | It considers different | The algorithm discriminates the points which have a greater distance from the query, thus causing a bias |
| Hassanat KNN (H-KNN) | The H-KNN variant uses the Hassanat Distance metric, which allows the algorithm to measure the distance in terms of maximum and minimum vector points, making it prone to biased outcomes | It does not consider minority classes, which may affect its performance. Inconsistent outcomes for noisy datasets |
| Generalised Mean Distance KNN (GMD-KNN) | It breaks the limitations of biasing the majority classes by considering all classes using a generalised distance algorithm formula. It can eliminate biases resulting from variance in class weight and majority | It has many dependable variables, making it a high time complexity KNN variant |
| Mutual KNN (M-KNN) | The M-KNN variant removes noisy data from the dataset, thus improving the neighbourhood findings of the underlying query points and increasing the chances of correct classification | The algorithm incurs a high computational complexity cost because it repeats nearest neighbour searches for the training and the testing datasets. This variant may be unsuitable for large-scale datasets due to its high computational complexity |
| Ensemble Approach KNN (EA-KNN) | The Ensemble Approach variant involves using multiple | If |
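The table describes the Hassanat distance as working "in terms of maximum and minimum vector points". This record does not give the formula, so the sketch below follows the commonly cited definition of the Hassanat distance as an assumption: each dimension contributes 1 - (1 + min) / (1 + max) when the smaller value is non-negative, and both values are shifted by |min| otherwise. The function name is hypothetical.

```python
def hassanat_distance(a, b):
    """Sum of per-dimension Hassanat distances between vectors a and b.

    Assumption: the commonly cited form of the metric, in which each
    dimension contributes a value in [0, 1), so no single feature with a
    large scale can dominate the total.
    """
    total = 0.0
    for ai, bi in zip(a, b):
        lo, hi = min(ai, bi), max(ai, bi)
        if lo >= 0:
            total += 1 - (1 + lo) / (1 + hi)
        else:
            # Shift both values by |lo| so the smaller one becomes zero.
            shift = abs(lo)
            total += 1 - (1 + lo + shift) / (1 + hi + shift)
    return total


print(hassanat_distance((0.0, 5.0), (1.0, 5.0)))     # only the first dimension differs
print(hassanat_distance((0.0,), (1000.0,)))          # large gap, still below 1
```

Because every dimension is bounded, this distance can be substituted for the Euclidean distance in a classic KNN without first rescaling the features.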