Jaegyun Park, Min-Woo Park, Dae-Won Kim, Jaesung Lee.
Abstract
Multilabel feature selection is an effective preprocessing step for improving multilabel classification accuracy, because it highlights discriminative features for multiple labels. Recently, multi-population genetic algorithms have gained significant attention in feature selection studies. Their search capability, which stems from communication among multiple populations, is stronger than that of traditional genetic algorithms. However, conventional methods employ a simple communication process without adapting it to the multilabel feature selection problem, which results in poor-quality final solutions. In this paper, we propose a new multi-population genetic algorithm, based on a novel communication process, which is specialized for the multilabel feature selection problem. Our experimental results on 17 multilabel datasets demonstrate that the proposed method is superior to other multi-population-based feature selection methods.
Keywords: communication; evolutionary algorithm; multi-population genetic algorithm; multilabel feature selection
Year: 2020 PMID: 33286647 PMCID: PMC7517480 DOI: 10.3390/e22080876
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Notation used to describe the proposed method.
| Terms | Meanings |
|---|---|
| | A multilabel dataset |
| | A label set in |
| | A feature set in |
| | A final feature subset, |
| | Number of generations |
| | Number of sub-populations |
| | Maximum number of selected features |
| | An |
| | A |
| | Fitness values for the individuals of the |
| | Label-specific accuracy matrix for individuals of |
| | A complementary individual |
| | A degree of complementarity for |
Figure 1. Schematic overview of the proposed method.
Multilabel toy dataset (feature columns first, then label columns).
| Pattern | Boring | Music | The | Funny | Lovely | Comedy | Documentary | Disney |
|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
| | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
Standard statistics of multilabel datasets.
| Dataset | Patterns | Features | Type | Labels | Cardinality | Density | Distinct label sets | Domain |
|---|---|---|---|---|---|---|---|---|
| Arts | 7484 | 23,146 | Numeric | 26 | 1.654 | 0.064 | 599 | Text |
| Business | 11,214 | 21,924 | Numeric | 30 | 1.599 | 0.053 | 233 | Text |
| Computers | 12,444 | 34,096 | Numeric | 33 | 1.507 | 0.046 | 428 | Text |
| Education | 12,030 | 27,534 | Numeric | 33 | 1.463 | 0.044 | 511 | Text |
| Emotions | 593 | 72 | Numeric | 6 | 1.869 | 0.311 | 27 | Music |
| Enron | 1702 | 1001 | Nominal | 53 | 3.378 | 0.064 | 753 | Text |
| Entertainment | 12,730 | 32,001 | Numeric | 21 | 1.414 | 0.067 | 337 | Text |
| Genbase | 662 | 1185 | Nominal | 27 | 1.252 | 0.046 | 32 | Biology |
| Health | 9205 | 30,605 | Numeric | 32 | 1.644 | 0.051 | 335 | Text |
| Medical | 978 | 1449 | Nominal | 45 | 1.245 | 0.028 | 94 | Text |
| Recreation | 12,828 | 30,324 | Numeric | 22 | 1.429 | 0.065 | 530 | Text |
| Reference | 8027 | 39,679 | Numeric | 33 | 1.174 | 0.036 | 275 | Text |
| Scene | 2407 | 294 | Numeric | 6 | 1.074 | 0.179 | 15 | Image |
| Science | 6428 | 37,187 | Numeric | 40 | 1.450 | 0.036 | 457 | Text |
| Social | 12,111 | 52,350 | Numeric | 29 | 1.279 | 0.033 | 361 | Text |
| Society | 14,512 | 31,802 | Numeric | 27 | 1.670 | 0.062 | 1054 | Text |
| Yeast | 2417 | 103 | Numeric | 14 | 4.237 | 0.303 | 198 | Biology |
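The label statistics in the table above follow the usual multilabel definitions: cardinality is the mean number of labels per pattern, density is cardinality divided by the number of labels (for Emotions, 1.869 / 6 ≈ 0.311), and the distinct count is the number of unique label combinations. The following is a minimal sketch of these definitions; the function name `label_statistics` and the small label matrix `Y` are hypothetical and are not taken from the paper.

```python
def label_statistics(Y):
    """Return (cardinality, density, distinct label sets) for a binary label matrix.

    Y is a list of rows, one per pattern; each row holds 0/1 label indicators.
    """
    n_patterns = len(Y)
    n_labels = len(Y[0])
    # Cardinality: average number of relevant labels per pattern.
    cardinality = sum(sum(row) for row in Y) / n_patterns
    # Density: cardinality normalized by the number of labels.
    density = cardinality / n_labels
    # Distinct label sets: number of unique label combinations.
    distinct = len({tuple(row) for row in Y})
    return cardinality, density, distinct


# Hypothetical label matrix with 4 patterns and 3 labels.
Y = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
]
cardinality, density, distinct = label_statistics(Y)
print(cardinality, round(density, 3), distinct)
```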
Comparison results of four methods in terms of Hamming loss (↓). ▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on a paired t-test at the 95% confidence level.
| Dataset | Proposed | TCbGA | EMPNGA | BCO-MDP |
|---|---|---|---|---|
| Arts | | 0.0635 ± 0.001 | 0.0642 ± 0.001 ▼ | 0.0638 ± 0.001 ▼ |
| Business | 0.0297 ± 0.001 | 0.0297 ± 0.001 | 0.0293 ± 0.001 | |
| Computers | | 0.0432 ± 0.001 | 0.0435 ± 0.001 | 0.0435 ± 0.001 |
| Education | | 0.0444 ± 0.000 | 0.0449 ± 0.001 | 0.0447 ± 0.001 |
| Emotions | | 0.2370 ± 0.013 | 0.2376 ± 0.023 | 0.2366 ± 0.032 |
| Enron | 0.0663 ± 0.006 | | 0.0892 ± 0.008 ▼ | 0.0840 ± 0.007 ▼ |
| Entertainment | | 0.0650 ± 0.002 | 0.0646 ± 0.002 | 0.0650 ± 0.002 |
| Genbase | | 0.0338 ± 0.006 ▼ | 0.0315 ± 0.004 ▼ | 0.0277 ± 0.006 ▼ |
| Health | | 0.0498 ± 0.001 ▼ | 0.0490 ± 0.001 ▼ | 0.0489 ± 0.002 |
| Medical | | 0.0206 ± 0.003 ▼ | 0.0186 ± 0.001 ▼ | 0.0181 ± 0.003 ▼ |
| Recreation | | 0.0638 ± 0.001 ▼ | 0.0638 ± 0.001 ▼ | 0.0641 ± 0.002 ▼ |
| Reference | | 0.0359 ± 0.000 ▼ | 0.0358 ± 0.001 | 0.0358 ± 0.001 ▼ |
| Scene | | 0.1372 ± 0.007 | 0.1416 ± 0.006 ▼ | 0.1396 ± 0.012 |
| Science | 0.0367 ± 0.001 | 0.0376 ± 0.001 | 0.0368 ± 0.001 | |
| Social | | 0.0323 ± 0.001 ▼ | 0.0309 ± 0.001 ▼ | 0.0315 ± 0.002 ▼ |
| Society | | 0.0598 ± 0.001 ▼ | 0.0595 ± 0.001 ▼ | 0.0590 ± 0.001 |
| Yeast | | 0.2233 ± 0.007 | 0.2253 ± 0.005 | 0.2241 ± 0.006 |
| Avg. Rank | | 2.71 | 3.35 | 2.71 |
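For readers unfamiliar with the measure, Hamming loss is the fraction of pattern-label pairs that are predicted incorrectly, so lower values are better. The sketch below uses the standard definition; the function name `hamming_loss` and the toy label matrices are illustrative assumptions, not the evaluation code used in the paper.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of pattern-label pairs where the prediction differs from the truth."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += int(t != p)
            total += 1
    return wrong / total


# Hypothetical ground truth and predictions: 2 patterns, 3 labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(Y_true, Y_pred)
print(round(loss, 4))  # 2 of 6 pairs are wrong, so 0.3333
```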
Comparison results of four methods in terms of one-error (↓). ▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on a paired t-test at the 95% confidence level.
| Dataset | Proposed | TCbGA | EMPNGA | BCO-MDP |
|---|---|---|---|---|
| Arts | | 0.7717 ± 0.120 ▼ | 0.7684 ± 0.122 ▼ | 0.7640 ± 0.126 ▼ |
| Business | | 0.3935 ± 0.418 | 0.3933 ± 0.418 | 0.3935 ± 0.418 |
| Computers | | 0.4616 ± 0.008 ▼ | 0.4566 ± 0.009 | 0.4626 ± 0.008 ▼ |
| Education | | 0.6756 ± 0.011 ▼ | 0.6777 ± 0.011 ▼ | 0.6776 ± 0.014 ▼ |
| Emotions | 0.2992 ± 0.029 | 0.3085 ± 0.054 | | 0.2992 ± 0.068 |
| Enron | | 0.5982 ± 0.318 | 0.6074 ± 0.317 ▼ | 0.5976 ± 0.316 ▼ |
| Entertainment | | 0.6710 ± 0.023 ▼ | 0.6339 ± 0.014 ▼ | 0.6483 ± 0.023 ▼ |
| Genbase | | 0.8652 ± 0.207 | 0.8235 ± 0.272 | 0.8045 ± 0.303 |
| Health | | 0.7935 ± 0.266 ▼ | 0.7900 ± 0.270 ▼ | 0.7885 ± 0.272 ▼ |
| Medical | | 0.8395 ± 0.206 | 0.8138 ± 0.236 | 0.8287 ± 0.216 |
| Recreation | | 0.7531 ± 0.010 ▼ | 0.7533 ± 0.014 ▼ | 0.7482 ± 0.021 ▼ |
| Reference | 0.7130 ± 0.247 | 0.7171 ± 0.243 | | 0.7164 ± 0.244 |
| Scene | 0.3168 ± 0.029 | 0.2927 ± 0.027 Δ | 0.2871 ± 0.023 Δ | |
| Science | | 0.7342 ± 0.019 ▼ | 0.7265 ± 0.018 ▼ | 0.7445 ± 0.013 ▼ |
| Social | | 0.5637 ± 0.161 ▼ | 0.5441 ± 0.164 ▼ | 0.5677 ± 0.156 ▼ |
| Society | 0.4880 ± 0.019 | 0.4963 ± 0.013 | | 0.4901 ± 0.014 |
| Yeast | | 0.2431 ± 0.019 | 0.2652 ± 0.020 ▼ | 0.2513 ± 0.019 ▼ |
| Avg. Rank | | 3.41 | 2.41 | 2.76 |
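One-error is computed from label scores rather than hard predictions: it is the fraction of patterns whose top-ranked label is not a relevant label. A minimal sketch under the standard definition follows; the name `one_error` and the example scores are hypothetical, and ties are broken in favor of the lowest label index, which is an assumption of this sketch.

```python
def one_error(Y_true, scores):
    """Fraction of patterns whose highest-scoring label is not relevant."""
    errors = 0
    for true_row, score_row in zip(Y_true, scores):
        # Index of the top-ranked label (first index wins on ties).
        top = max(range(len(score_row)), key=lambda j: score_row[j])
        errors += int(true_row[top] == 0)
    return errors / len(Y_true)


# Hypothetical ground truth and classifier scores: 3 patterns, 3 labels.
Y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
scores = [[0.9, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
err = one_error(Y_true, scores)
print(round(err, 4))  # only the second pattern ranks an irrelevant label first
```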
Comparison results of four methods in terms of multilabel accuracy (↑). ▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on a paired t-test at the 95% confidence level.
| Dataset | Proposed | TCbGA | EMPNGA | BCO-MDP |
|---|---|---|---|---|
| Arts | | 0.0330 ± 0.007 ▼ | 0.0464 ± 0.009 ▼ | 0.0518 ± 0.016 ▼ |
| Business | 0.6772 ± 0.009 | | 0.6767 ± 0.011 | 0.6760 ± 0.010 |
| Computers | 0.4155 ± 0.008 | 0.4148 ± 0.007 | | 0.4147 ± 0.010 |
| Education | | 0.0291 ± 0.007 ▼ | 0.0367 ± 0.015 ▼ | 0.0410 ± 0.022 ▼ |
| Emotions | 0.5323 ± 0.036 | 0.5267 ± 0.035 | 0.5202 ± 0.031 | |
| Enron | | 0.3315 ± 0.019 | 0.3173 ± 0.019 ▼ | 0.3389 ± 0.034 |
| Entertainment | | 0.0586 ± 0.022 ▼ | 0.1116 ± 0.016 ▼ | 0.1218 ± 0.046 ▼ |
| Genbase | | 0.3789 ± 0.130 ▼ | 0.4238 ± 0.088 ▼ | 0.5471 ± 0.157 ▼ |
| Health | | 0.4074 ± 0.016 | 0.4120 ± 0.019 | 0.4026 ± 0.015 ▼ |
| Medical | | 0.3545 ± 0.084 ▼ | 0.3628 ± 0.055 ▼ | 0.4498 ± 0.117 ▼ |
| Recreation | | 0.0477 ± 0.012 ▼ | 0.0574 ± 0.007 ▼ | 0.0573 ± 0.017 ▼ |
| Reference | 0.4048 ± 0.015 | 0.3568 ± 0.125 | | 0.4005 ± 0.011 |
| Scene | | 0.5663 ± 0.021 | 0.5705 ± 0.016 | 0.5712 ± 0.034 |
| Science | | 0.0256 ± 0.008 ▼ | 0.0360 ± 0.011 ▼ | 0.0385 ± 0.011 ▼ |
| Social | | 0.0720 ± 0.027 ▼ | 0.1907 ± 0.168 ▼ | 0.1187 ± 0.033 ▼ |
| Society | 0.2423 ± 0.135 | 0.1617 ± 0.162 | 0.2586 ± 0.165 | |
| Yeast | | 0.4435 ± 0.012 | 0.4448 ± 0.012 | 0.4418 ± 0.013 |
| Avg. Rank | | 3.53 | 2.59 | 2.53 |
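Multilabel accuracy is commonly defined as the mean Jaccard similarity between the true and predicted label sets of each pattern. The sketch below follows that common definition; the function name `multilabel_accuracy` and the example matrices are hypothetical, and scoring a pattern as 1.0 when both sets are empty is a convention assumed here rather than something stated in the source.

```python
def multilabel_accuracy(Y_true, Y_pred):
    """Mean over patterns of |true AND pred| / |true OR pred|."""
    total = 0.0
    for true_row, pred_row in zip(Y_true, Y_pred):
        intersection = sum(1 for t, p in zip(true_row, pred_row) if t and p)
        union = sum(1 for t, p in zip(true_row, pred_row) if t or p)
        # Assumed convention: two empty label sets count as a perfect match.
        total += intersection / union if union else 1.0
    return total / len(Y_true)


# Hypothetical ground truth and predictions: 2 patterns, 3 labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 0]]
acc = multilabel_accuracy(Y_true, Y_pred)
print(round(acc, 4))  # (2/3 + 1) / 2 = 0.8333
```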
Comparison results of four methods in terms of subset accuracy (↑). ▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on a paired t-test at the 95% confidence level.
| Dataset | Proposed | TCbGA | EMPNGA | BCO-MDP |
|---|---|---|---|---|
| Arts | | 0.0287 ± 0.009 ▼ | 0.0422 ± 0.010 ▼ | 0.0438 ± 0.016 ▼ |
| Business | 0.5326 ± 0.011 | 0.5326 ± 0.012 | 0.5322 ± 0.012 | |
| Computers | | 0.3365 ± 0.007 | 0.3379 ± 0.009 | 0.3318 ± 0.007 ▼ |
| Education | | 0.0162 ± 0.006 ▼ | 0.0327 ± 0.013 ▼ | 0.0317 ± 0.010 ▼ |
| Emotions | 0.2534 ± 0.039 | | 0.2508 ± 0.041 | 0.2525 ± 0.055 |
| Enron | 0.1076 ± 0.020 | | 0.0418 ± 0.028 ▼ | 0.0947 ± 0.034 |
| Entertainment | | 0.0791 ± 0.025 ▼ | 0.0903 ± 0.023 ▼ | 0.0862 ± 0.033 ▼ |
| Genbase | | 0.2576 ± 0.092 ▼ | 0.4288 ± 0.070 ▼ | 0.5098 ± 0.123 ▼ |
| Health | | 0.3160 ± 0.014 ▼ | 0.3293 ± 0.017 | 0.3129 ± 0.017 ▼ |
| Medical | | 0.2600 ± 0.047 ▼ | 0.3472 ± 0.049 ▼ | 0.3138 ± 0.096 ▼ |
| Recreation | | 0.0393 ± 0.016 ▼ | 0.0475 ± 0.011 ▼ | 0.0478 ± 0.021 ▼ |
| Reference | 0.3579 ± 0.014 | 0.3532 ± 0.009 | | 0.3265 ± 0.112 |
| Scene | 0.4341 ± 0.025 | 0.4168 ± 0.033 | 0.3819 ± 0.027 ▼ | |
| Science | | 0.0258 ± 0.003 ▼ | 0.0351 ± 0.011 ▼ | 0.0311 ± 0.008 ▼ |
| Social | | 0.0667 ± 0.036 ▼ | 0.2850 ± 0.183 ▼ | 0.0981 ± 0.041 ▼ |
| Society | 0.2222 ± 0.060 | 0.1187 ± 0.132 | 0.1926 ± 0.125 | |
| Yeast | 0.1029 ± 0.014 | 0.0969 ± 0.014 | | 0.0988 ± 0.016 |
| Avg. Rank | | 3.29 | 2.59 | 2.65 |
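Subset accuracy is the strictest of the four measures: a pattern counts as correct only when its entire predicted label vector matches the ground truth. A minimal sketch of the standard definition is given below; the function name `subset_accuracy` and the example matrices are illustrative assumptions.

```python
def subset_accuracy(Y_true, Y_pred):
    """Fraction of patterns whose predicted label vector matches exactly."""
    exact = sum(
        1 for true_row, pred_row in zip(Y_true, Y_pred)
        if list(true_row) == list(pred_row)
    )
    return exact / len(Y_true)


# Hypothetical ground truth and predictions: 2 patterns, 3 labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 0]]
subset_acc = subset_accuracy(Y_true, Y_pred)
print(subset_acc)  # only the second pattern matches exactly, so 0.5
```

Because a single wrong label makes the whole pattern count as an error, subset accuracy is never higher than multilabel accuracy on the same predictions.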
Figure 2. Bonferroni-Dunn test results of four comparison methods with four evaluation measures.
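The Bonferroni-Dunn test compares average ranks across datasets against a critical difference, CD = q_α · sqrt(k(k+1) / (6N)), where k is the number of methods and N the number of datasets. The sketch below computes average ranks (ties share the mean rank) and the critical difference. The function names and the small score matrix are hypothetical; the value q_α = 2.394 is the commonly tabulated Bonferroni-Dunn critical value for k = 4 at α = 0.05, and its use here is an assumption, since the source does not state which value was applied.

```python
import math


def average_ranks(scores, higher_is_better=True):
    """Average rank of each method; scores has one row per dataset."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=higher_is_better)
        i = 0
        while i < k:
            # Extend j over the run of methods tied with position i.
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
            for pos in range(i, j + 1):
                totals[order[pos]] += shared_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def bonferroni_dunn_cd(k, n_datasets, q_alpha):
    """Critical difference between average ranks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


# Hypothetical accuracies: 2 datasets (rows), 3 methods (columns).
ranks = average_ranks([[0.5, 0.4, 0.4], [0.7, 0.9, 0.8]])
# Four methods on 17 datasets, as in the experiments; q_alpha is assumed.
cd = bonferroni_dunn_cd(k=4, n_datasets=17, q_alpha=2.394)
print(ranks, round(cd, 3))
```

Under these assumptions, two methods would be considered significantly different when their average ranks differ by more than the critical difference.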
Figure 3. Multilabel accuracy (↑) for the best individual in each sub-population, obtained using three methods.
Figure 4. Pairwise comparison results of the paired t-test at the 95% confidence level in terms of multilabel accuracy (↑).
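The paired t-test used for the pairwise comparisons works on per-run differences between two methods: t = mean(d) / (sd(d) / sqrt(n)), with n - 1 degrees of freedom. The sketch below computes only the statistic; the function name `paired_t_statistic` and the five accuracy values per method are hypothetical and are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same length, same runs)."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation of the differences (n - 1 in the denominator).
    sd_diff = statistics.stdev(diffs)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Hypothetical accuracies of two methods over the same 5 runs.
method_a = [0.50, 0.52, 0.48, 0.51, 0.49]
method_b = [0.45, 0.47, 0.46, 0.44, 0.46]
t_stat = paired_t_statistic(method_a, method_b)
print(round(t_stat, 2))
```

With 4 degrees of freedom, the two-sided critical value at α = 0.05 is 2.776, so a statistic larger than that in absolute value would be marked as significant in this hypothetical example.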