Asma Gul, Aris Perperoglou, Zardad Khan, Osama Mahmoud, Miftahuddin Miftahuddin, Werner Adler, Berthold Lausen.
Abstract
Combining multiple classifiers, known as ensemble methods, can substantially improve the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based on their individual performance, measured by out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We evaluate the method on benchmark data sets with their original features and with some added non-informative features. The results are compared with usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
Keywords: Non-informative features; Bagging; Ensemble methods; Nearest neighbour classifier
Year: 2016 PMID: 30931011 PMCID: PMC6404785 DOI: 10.1007/s11634-015-0227-5
Source DB: PubMed Journal: Adv Data Anal Classif ISSN: 1862-5355
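The two-step selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the candidate kNN models are built, so the sketch assumes each model is fitted on a bootstrap sample with a random feature subset, that "out-of-sample accuracy" is measured on the observations left out of that bootstrap sample, that models are combined by majority vote, and that a model is kept only if it raises validation accuracy. All function and parameter names (`esknn_sketch`, `n_models`, `n_feats`, `keep`) are hypothetical.

```python
import random
from collections import Counter

def knn_predict(model, x, k=3):
    """Majority label among the k nearest training points of x,
    using squared Euclidean distance on the model's feature subset."""
    train_X, train_y, feats = model
    nearest = sorted(
        (sum((row[j] - x[j]) ** 2 for j in feats), label)
        for row, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def vote(models, x, k=3):
    """Majority vote over the models (ties go to the label seen first)."""
    return Counter(knn_predict(m, x, k) for m in models).most_common(1)[0][0]

def accuracy(models, X, y, k=3):
    """Fraction of points in (X, y) that the voted ensemble labels correctly."""
    return sum(vote(models, x, k) == t for x, t in zip(X, y)) / len(y)

def esknn_sketch(X, y, X_val, y_val, n_models=30, n_feats=2, keep=0.3, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    scored = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample (assumption)
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]  # left-out observations
        if not oob:
            continue
        feats = rng.sample(range(p), n_feats)           # random feature subset (assumption)
        model = ([X[i] for i in idx], [y[i] for i in idx], feats)
        # Step 1: score each model individually by its out-of-sample accuracy.
        acc = accuracy([model], [X[i] for i in oob], [y[i] for i in oob])
        scored.append((acc, model))
    scored.sort(key=lambda t: t[0], reverse=True)
    top = [m for _, m in scored[: max(1, int(keep * len(scored)))]]
    # Step 2: add models one at a time, best first, and keep a model only
    # if it improves the ensemble's accuracy on the validation set.
    ensemble = [top[0]]
    best = accuracy(ensemble, X_val, y_val)
    for model in top[1:]:
        candidate = accuracy(ensemble + [model], X_val, y_val)
        if candidate > best:
            ensemble.append(model)
            best = candidate
    return ensemble
```

Calling `esknn_sketch(X, y, X_val, y_val)` with lists of numeric feature vectors and class labels returns the retained models, and `vote(ensemble, x)` then classifies a new point. Validation accuracy is only one possible measure of collective performance; another score could be substituted in step 2 without changing the structure.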
Misclassification rate of the methods on the data sets with added non-informative features from model 1
| Features | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN |
|---|---|---|---|---|---|---|---|
| 20 | 0.050 | 0.047 | 0.046 | 0.048 | 0.052 | | 0.044 |
| | 0.063 | 0.058 | 0.062 | 0.061 | 0.055 | 0.055 | |
| | 0.076 | 0.067 | 0.071 | 0.066 | 0.066 | 0.057 | |
| | 0.114 | 0.104 | 0.089 | 0.084 | 0.063 | 0.065 | |
| | 0.146 | 0.127 | 0.142 | 0.112 | 0.062 | 0.084 | |
The first column shows the number of non-informative features added to the data set. Results of the best performing method are highlighted in italics. The value of
Misclassification rate of the classifiers on the data sets from model 1 for different values of w, on 70 features ( noninformative), listed in column 1
| w | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN |
|---|---|---|---|---|---|---|---|
| 3 | 0.198 | 0.196 | 0.185 | 0.168 | | 0.103 | 0.147 |
| 5 | 0.221 | 0.213 | 0.182 | 0.169 | | 0.115 | 0.162 |
| 10 | 0.225 | 0.198 | 0.114 | 0.104 | | 0.100 | 0.114 |
| 15 | 0.200 | 0.180 | 0.057 | 0.061 | | 0.086 | 0.076 |
| 20 | 0.185 | 0.164 | 0.035 | 0.041 | | 0.077 | 0.039 |
Results of the best performing methods for the corresponding value of w are shown in italics
Misclassification rate of the methods on the data sets with added non-informative features from model 2
| Features | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN |
|---|---|---|---|---|---|---|---|
| 4 | 0.125 | 0.122 | 0.169 | 0.122 | 0.159 | | 0.119 |
| | 0.170 | 0.170 | 0.175 | 0.169 | 0.193 | 0.164 | |
| | 0.194 | 0.187 | 0.185 | 0.205 | 0.203 | 0.205 | |
| | 0.242 | 0.232 | 0.201 | 0.216 | 0.199 | 0.443 | |
| | 0.276 | 0.269 | 0.231 | 0.249 | 0.211 | 0.524 | |
The first column shows the number of non-informative features added to the data set. Results of the best performing method are shown in italics
Fig. 1 Misclassification rate on simulated data from model 2 with added non-informative features. a 50 added non-informative features; b 100 added non-informative features; c 200 added non-informative features; d 500 added non-informative features
Summary of the data sets
| Data sets | Sample size | Features | Feature type (continuous/discrete/categorical) |
|---|---|---|---|
| Haberman | 306 | 3 | (0/3/0) |
| Dystrophy | 164 | 5 | (2/3/0) |
| Mammographic | 830 | 5 | (0/5/0) |
| Transfusion | 748 | 5 | (2/3/0) |
| Phoneme | 1000 | 5 | (5/0/0) |
| Bupa | 345 | 6 | (1/5/0) |
| Appendicitis | 106 | 7 | (7/0/0) |
| Diabetes | 768 | 8 | (8/0/0) |
| Biopsy | 683 | 9 | (0/9/0) |
| SAheart | 462 | 9 | (5/3/1) |
| Indian liver | 579 | 10 | (5/4/1) |
| Solar-Flare | 322 | 12 | (0/10/2) |
| Credit approval | 690 | 15 | (2/13/0) |
| House vote | 232 | 17 | (0/0/17) |
| Bands | 365 | 19 | (13/6/0) |
| Hepatitis | 80 | 19 | (2/17/0) |
| Two norms | 1000 | 20 | (20/0/0) |
| German credit | 1000 | 20 | (0/7/13) |
| Body | 507 | 24 | (24/0/0) |
| WPBC | 194 | 33 | (31/2/0) |
| Sonar | 208 | 60 | (60/0/0) |
| Glaucoma | 196 | 61 | (61/0/0) |
| Musk | 476 | 166 | (0/166/0) |
Number of observations, features and feature type. The first 8 are microarray data sets, the rest are from life, finance, physical, and social science
Misclassification rate of kNN, RkNN, BkNN, MFS, RF, SVM and ESkNN
| Data sets | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN |
|---|---|---|---|---|---|---|---|
| Haberman | 0.243 | 0.24 | 0.255 | 0.241 | 0.271 | 0.325 | |
| Dystrophy | 0.117 | 0.118 | 0.121 | 0.110 | 0.115 | | 0.105 |
| Mammographic | 0.190 | 0.193 | 0.178 | 0.183 | | 0.191 | 0.174 |
| Transfusion | 0.233 | 0.235 | 0.23 | 0.225 | 0.217 | 0.317 | 0.218 |
| Phoneme | 0.167 | 0.184 | 0.171 | 0.174 | 0.145 | 0.204 | |
| Bupa | 0.320 | 0.327 | | 0.327 | 0.271 | 0.319 | 0.319 |
| Appendicitis | 0.142 | 0.139 | 0.144 | 0.149 | 0.145 | 0.224 | |
| Diabetes | 0.264 | 0.259 | 0.263 | 0.262 | | 0.27 | 0.256 |
| Biopsy | 0.032 | 0.0311 | 0.028 | 0.039 | 0.027 | 0.058 | |
| SAheart | 0.336 | 0.334 | 0.343 | 0.337 | | 0.307 | 0.317 |
| Indian liver | 0.314 | 0.320 | 0.290 | 0.312 | 0.293 | 0.373 | |
| Solar-Flare | 0.027 | 0.026 | 0.025 | 0.026 | 0.025 | 0.042 | |
| Credit approval | 0.319 | 0.317 | 0.336 | 0.194 | | 0.142 | 0.166 |
| House vote | 0.082 | 0.082 | 0.089 | 0.072 | 0.036 | | 0.042 |
| Bands | 0.389 | 0.393 | 0.342 | 0.383 | | 0.367 | 0.350 |
| Hepatitis | 0.423 | 0.372 | 0.288 | 0.362 | 0.276 | | 0.321 |
| Two norms | 0.040 | 0.039 | 0.029 | 0.036 | 0.04 | | 0.033 |
| German credit | 0.307 | 0.306 | 0.296 | 0.308 | | 0.291 | 0.286 |
| Body | 0.023 | 0.024 | 0.036 | 0.025 | 0.037 | | 0.020 |
| WPBC | 0.241 | 0.240 | 0.235 | 0.244 | | 0.285 | 0.235 |
| Sonar | 0.179 | 0.179 | 0.157 | 0.189 | 0.161 | 0.169 | |
| Glaucoma | 0.193 | 0.193 | 0.192 | 0.196 | | 0.122 | 0.176 |
| Musk | 0.142 | 0.142 | 0.113 | 0.114 | 0.110 | 0.133 | |
The results of best performing methods on the corresponding data set are highlighted in italics
Misclassification rate of kNN, RkNN, BkNN, MFS, RF, SVM and ESkNN with added non-informative features to the data sets
| Data sets | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN |
|---|---|---|---|---|---|---|---|
| Haberman | 0.278 | 0.274 | 0.279 | 0.269 | 0.263 | 0.429 | |
| Dystrophy | 0.249 | 0.248 | 0.291 | 0.237 | | 0.252 | 0.204 |
| Mammographic | 0.217 | 0.223 | 0.180 | 0.225 | | 0.527 | 0.189 |
| Transfusion | 0.238 | 0.237 | 0.237 | 0.239 | 0.236 | 0.517 | |
| Phoneme | 0.279 | 0.279 | 0.252 | 0.351 | 0.269 | 0.538 | |
| Bupa | 0.362 | 0.352 | 0.389 | 0.376 | 0.342 | 0.560 | |
| Appendicitis | 0.207 | 0.209 | 0.277 | 0.209 | | 0.215 | 0.197 |
| Diabetes | 0.358 | 0.354 | 0.349 | 0.348 | | 0.530 | 0.328 |
| Biopsy | 0.065 | 0.067 | 0.086 | 0.102 | | 0.067 | 0.052 |
| SAheart | 0.414 | 0.395 | 0.349 | 0.347 | 0.345 | 0.509 | 0.345 |
| Indian liver | 0.316 | 0.315 | 0.286 | 0.286 | 0.286 | 0.519 | |
| Solar-Flare | 0.027 | 0.022 | 0.021 | 0.025 | 0.022 | 0.022 | 0.022 |
| Credit approval | 0.354 | 0.354 | 0.320 | 0.345 | 0.322 | 0.546 | |
| House vote | 0.128 | 0.125 | 0.126 | 0.112 | | 0.109 | 0.095 |
| Bands | 0.405 | 0.396 | 0.358 | 0.354 | 0.359 | 0.549 | |
| Hepatitis | 0.362 | 0.371 | 0.380 | 0.410 | 0.387 | | 0.333 |
| Two norms | 0.047 | 0.045 | 0.038 | 0.052 | 0.038 | 0.052 | |
| German credit | 0.308 | 0.305 | 0.301 | 0.371 | | 0.517 | 0.300 |
| Body | 0.098 | 0.098 | 0.099 | 0.098 | | 0.092 | 0.088 |
| WPBC | 0.262 | 0.251 | 0.235 | 0.235 | 0.235 | 0.252 | |
| Sonar | 0.164 | 0.164 | 0.161 | 0.225 | 0.242 | 0.314 | |
| Glaucoma | 0.256 | 0.249 | 0.242 | 0.272 | | 0.236 | 0.242 |
| Musk | 0.184 | 0.182 | 0.169 | 0.168 | 0.165 | 0.290 | |
The results of best performing methods on the corresponding data set are highlighted in italics