| Literature DB >> 29376076 |
Hadi Raeisi Shahraki, Saeedeh Pourahmad, Najaf Zare.
Abstract
K nearest neighbors (KNN) is one of the simplest nonparametric classifiers, but in high dimensional settings its accuracy suffers from nuisance features. In this study, we propose K important neighbors (KIN), a novel approach to binary classification in high dimensional problems. To avoid the curse of dimensionality, we fit smoothly clipped absolute deviation (SCAD) penalized logistic regression at an initial stage and account for the importance of each feature in the dissimilarity measure by imposing feature contributions, as a function of the SCAD coefficients, on the Euclidean distance. This hybrid dissimilarity measure, which combines information on both features and distances, enjoys the good properties of SCAD penalized regression and of KNN simultaneously. In simulation studies, KIN outperformed KNN in terms of both accuracy and dimension reduction. Because it exploits the oracle property of SCAD penalized regression in constructing the dissimilarity measure, the proposed approach eliminated nearly all of the noninformative features. In very sparse settings, KIN also outperformed support vector machine (SVM) and random forest (RF), the best competing classifiers.
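For reference, the SCAD penalty mentioned in the abstract is the standard piecewise form of Fan and Li; it is stated here with tuning parameters λ > 0 and a > 2 (a = 3.7 is the conventional choice), details the record above does not spell out:

$$
p_\lambda(|\beta|) =
\begin{cases}
\lambda|\beta|, & |\beta| \le \lambda, \\
\dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda, \\
\dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda.
\end{cases}
$$

The penalty is continuous at both breakpoints and flat beyond aλ, which is what gives SCAD its unbiasedness for large coefficients and its oracle property.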
Year: 2017 PMID: 29376076 PMCID: PMC5742505 DOI: 10.1155/2017/7560807
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1: K important neighbors (KIN) algorithm for classification.
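The pipeline of Figure 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: scikit-learn has no SCAD penalty, so an L1 (lasso) penalized logistic regression stands in for the SCAD stage, and weighting the squared Euclidean distance by the absolute coefficients is one assumed form of the paper's feature-contribution function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def kin_predict(X_train, y_train, X_test, k=5, C=0.5):
    # Stage 1: sparse penalized logistic regression gives feature importances.
    # L1 is a stand-in for SCAD here; features with zero coefficients drop out
    # of the distance entirely, mimicking KIN's elimination of nuisance features.
    lr = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    lr.fit(X_train, y_train)
    w = np.abs(lr.coef_.ravel())

    # Stage 2: importance-weighted Euclidean distance + majority vote
    # among the k "important" neighbors.
    preds = []
    for x in X_test:
        d = np.sqrt((w * (X_train - x) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]
        preds.append(np.bincount(y_train[nn]).argmax())
    return np.array(preds)

# Toy sparse setting: 2 informative features out of 20.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 20))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)
yhat = kin_predict(X[:150], y[:150], X[150:], k=5)
acc = (yhat == y[150:]).mean()
```

With informative features dominating the weights, the weighted distance behaves like a KNN run on the true low-dimensional signal, which is the intuition behind the simulation results below.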
Misclassification rate (MC) of KIN versus KNN in simulation study.

| Number of features | Degree of sparsity (%) | Sample size | Ratio of the first class | KNN MC | KIN MC | TC (%) | #FP | KNN MC | KIN MC | TC (%) | #FP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 98 | | 0.5 | 43.2 | 30.7 | 66 | 2.7 | 40.0 | 27.5 | 69 | 2.5 |
| | | | 0.8 | 21.4 | 19.1 | 71 | 1.4 | 21.6 | 17.9 | 65 | 1.5 |
| | | | 0.5 | 40.2 | 27.6 | 82 | 2.5 | 37.2 | 25.4 | 85 | 2.1 |
| | | | 0.8 | 20.1 | 17.8 | 79 | 1.3 | 20.2 | 16.6 | 83 | 1.1 |
| | 95 | | 0.5 | 39.6 | 28.0 | 76 | 2.4 | 35.0 | 21.0 | 84 | 2.1 |
| | | | 0.8 | 21.1 | 19.1 | 64 | 1.7 | 21.0 | 16.3 | 83 | 1.7 |
| | | | 0.5 | 36.8 | 23.8 | 87 | 3.0 | 30.8 | 17.9 | 88 | 2.6 |
| | | | 0.8 | 20.0 | 16.5 | 87 | 1.9 | 19.2 | 14.2 | 92 | 1.3 |
| | 90 | | 0.5 | 39.1 | 33.9 | 68 | 2.1 | 30.8 | 20.5 | 89 | 1.5 |
| | | | 0.8 | 21.6 | 22.0 | 59 | 1.6 | 20.8 | 16.5 | 89 | 1.5 |
| | | | 0.5 | 36.9 | 28.0 | 87 | 2.7 | 28.1 | 18.5 | 92 | 2.1 |
| | | | 0.8 | 20.0 | 18.6 | 85 | 1.9 | 18.4 | 14.3 | 94 | 1.6 |
| 300 | 98 | | 0.5 | 43.5 | 35.4 | 68 | 2.8 | 40.1 | 20.4 | 82 | 2.6 |
| | | | 0.8 | 22.1 | 21.7 | 60 | 2.2 | 22.9 | 17.0 | 80 | 2.4 |
| | | | 0.5 | 41.2 | 25.3 | 77 | 5.3 | 36.6 | 17.4 | 87 | 3.2 |
| | | | 0.8 | 20.8 | 18.3 | 77 | 3.1 | 21.0 | 14.9 | 88 | 2.7 |
| | 95 | | 0.5 | 43.6 | 41.9 | 57 | 2.1 | 37.6 | 23.8 | 85 | 2.1 |
| | | | 0.8 | 22.3 | 23.3 | 54 | 2.4 | 23.5 | 19.5 | 83 | 2.1 |
| | | | 0.5 | 41.8 | 33.0 | 79 | 3.6 | 33.8 | 19.4 | 90 | 2.8 |
| | | | 0.8 | 21.1 | 21.9 | 70 | 3.0 | 21.2 | 16.7 | 90 | 2.6 |
| | 90 | | 0.5 | 43.6 | 41.8 | 53 | 2.3 | 36.6 | 23.7 | 85 | 2.1 |
| | | | 0.8 | 32.1 | 29.7 | 64 | 2.1 | 30.3 | 22.9 | 85 | 1.9 |
| | | | 0.5 | 41.5 | 33.1 | 76 | 3.7 | 34.1 | 19.9 | 91 | 2.7 |
| | | | 0.8 | 28.3 | 28.0 | 77 | 3.5 | 26.2 | 16.9 | 92 | 2.5 |
| 500 | 98 | | 0.5 | 44.7 | 41.3 | 58 | 2.5 | 39.9 | 21.6 | 82 | 2.5 |
| | | | 0.8 | 22.5 | 22.5 | 51 | 2.3 | 23.1 | 17.1 | 78 | 2.5 |
| | | | 0.5 | 43.5 | 29.0 | 75 | 4.9 | 37.9 | 17.2 | 87 | 4.0 |
| | | | 0.8 | 20.5 | 18.9 | 69 | 3.3 | 21.3 | 14.8 | 88 | 3.2 |
| | 95 | | 0.5 | 45.7 | 44.8 | 44 | 6.2 | 38.8 | 29.3 | 74 | 2.3 |
| | | | 0.8 | 23.2 | 25.3 | 40 | 2.3 | 26.6 | 22.5 | 74 | 2.2 |
| | | | 0.5 | 44.5 | 40.5 | 69 | 3.3 | 35.8 | 22.3 | 88 | 3.3 |
| | | | 0.8 | 21.1 | 24.6 | 66 | 3.1 | 21.9 | 18.1 | 88 | 3.0 |
| | 90 | | 0.5 | 45.7 | 45.1 | 46 | 2.5 | 39.3 | 26.0 | 83 | 2.0 |
| | | | 0.8 | 28.2 | 32.2 | 50 | 2.3 | 29.7 | 25.4 | 77 | 2.1 |
| | | | 0.5 | 46.5 | 45.2 | 58 | 2.8 | 36.9 | 29.5 | 86 | 3.3 |
| | | | 0.8 | 26.1 | 30.3 | 59 | 3.2 | 21.8 | 23.2 | 86 | 2.8 |
Figure 2: Misclassification rate of the proposed KIN versus SVM, RF, and KNN for 100, 300, and 500 features (top to bottom; N indicates sample size).
Comparison of methods in terms of probability of achieving the maximum accuracy (PAMA) and probability of achieving more than 95% of the maximum accuracy (P95).

| Degree of sparsity | Method | PAMA | P95 |
|---|---|---|---|
| 90% | Random forest (RF) | 45.8% | 66.7% |
| | Support vector machine (SVM) | 50.0% | 87.5% |
| | K nearest neighbors (KNN) | 0.0% | 20.8% |
| | K important neighbors (KIN) | 8.3% | 41.7% |
| 95% | Random forest (RF) | 50.0% | 79.2% |
| | Support vector machine (SVM) | 29.2% | 79.2% |
| | K nearest neighbors (KNN) | 0.0% | 29.2% |
| | K important neighbors (KIN) | 20.8% | 75.0% |
| 98% | Random forest (RF) | 20.8% | 75.0% |
| | Support vector machine (SVM) | 12.5% | 41.7% |
| | K nearest neighbors (KNN) | 0.0% | 33.3% |
| | K important neighbors (KIN) | 66.7% | 100% |
| Total | Random forest (RF) | 38.9% | 73.6% |
| | Support vector machine (SVM) | 30.6% | 69.4% |
| | K nearest neighbors (KNN) | 0.0% | 27.8% |
| | K important neighbors (KIN) | 32.0% | 72.2% |
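The two criteria in the table above are straightforward to compute. This is an illustrative sketch, assuming the definitions given in the caption: PAMA is the share of simulation scenarios in which a method attains the best accuracy, and P95 the share in which it reaches at least 95% of the best; the function name and toy matrix are made up for the example.

```python
import numpy as np

def pama_p95(acc):
    """acc: (n_scenarios, n_methods) accuracy matrix.

    Returns per-method PAMA (share of scenarios where the method ties the
    best accuracy) and P95 (share where it reaches >= 95% of the best).
    """
    best = acc.max(axis=1, keepdims=True)          # per-scenario maximum
    pama = (acc == best).mean(axis=0)              # ties with the best count
    p95 = (acc >= 0.95 * best).mean(axis=0)
    return pama, p95

# Toy example: 4 scenarios x 3 methods.
acc = np.array([[0.90, 0.85, 0.88],
                [0.80, 0.82, 0.82],
                [0.70, 0.75, 0.74],
                [0.95, 0.90, 0.93]])
pama, p95 = pama_p95(acc)
```

Note that PAMA columns can sum to more than 100% across methods, since ties at the maximum are credited to every method involved.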
Accuracy of different classifiers on benchmark data sets.

| Data set | Number of features | Train | Test | RF | SVM | KNN | KIN |
|---|---|---|---|---|---|---|---|
| Connectionist bench | 60 | 104 | 104 | 78.7 | 74.4 | 71.5 | 74.0 |
| Ozone | 72 | 102 | 410 | 80.6 | 79.5 | 72.4 | 75.4 |
| Prostate cancer | 600 | 70 | 34 | 69.1 | 75.2 | 66.0 | 72.3 |
| Colon cancer | 2000 | 32 | 32 | 65.7 | 77.9 | 62.5 | 74.3 |
| Liver transplant | 39 | 102 | 578 | 88.7 | 85.7 | 88.6 | 89.1 |