Jaesung Lee, Jaegyun Park, Hae-Cheon Kim, Dae-Won Kim.
Abstract
Multi-label feature selection is an important task for text categorization because it enables learning algorithms to focus on the essential features that indicate relevant categories, thereby improving categorization accuracy. Recent studies have hybridized evolutionary feature wrappers with feature filters to enhance the evolutionary search process. However, these studies have not considered the relative effectiveness of the evolutionary and filter operators in searching for feature subsets, which results in degenerated final feature subsets. In this paper, we propose a novel hybridization approach based on competition between the operators. Unlike conventional methods, the proposed algorithm applies each operator selectively and modifies the feature subset according to that operator's relative effectiveness. Experimental results on 16 text datasets verify that the proposed method is superior to conventional methods.
Keywords: evolutionary algorithm; feature selection; hybrid search; multi-label text categorization; particle swarm optimization
Year: 2019 PMID: 33267316 PMCID: PMC7515086 DOI: 10.3390/e21060602
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Brief summary of conventional feature selection approaches.
| Method | Advantages | Disadvantages |
|---|---|---|
| Filter methods | Rapid identification of a feature subset | Lower performance than wrapper methods |
| Wrapper methods | Higher performance than filter methods | High complexity |
| Hybrid methods (first type) | Search starts in a promising region | Premature convergence |
| Hybrid methods (second type) | Improved search capability | Randomized engagement of operators |
Figure 1. Schematic overview of the competitive particle swarm optimization.
Notations used in the design of the proposed method.
| Terms | Meanings |
|---|---|
| | The evolution-based particle group |
| | The filter-based particle group |
| | The number of the evolution-based particles |
| | The number of the filter-based particles |
| | The fitness values for feature subsets generated from |
| | The fitness values for feature subsets generated from |
| | The number of spent fitness function calls (FFCs) |
| | Maximum number of permitted FFCs |
| | The best feature subset |
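The quantities in the notation table (two particle groups, their sizes, their fitness values, an FFC budget, and a best feature subset) can be tied together in a small toy loop. The sketch below is not the authors' algorithm and is not a particle swarm optimizer: `evo_op`, `flt_op`, the fixed subset size, and the "winner takes one slot" rule are all hypothetical stand-ins chosen only to illustrate how a competition between two operators might be bookkept under a fitness-call budget.

```python
import random

def competitive_search(n_features, fitness, filter_scores, n_evo=5, n_flt=5,
                       max_ffc=200, subset_size=3, seed=0):
    """Toy competition between two operator groups under an FFC budget.

    Every rule here is an illustrative assumption, not the paper's method.
    """
    rng = random.Random(seed)
    ranked = sorted(range(n_features), key=lambda f: -filter_scores[f])
    best = frozenset(rng.sample(range(n_features), subset_size))
    best_fit, ffc = fitness(best), 1  # one fitness function call spent

    def evo_op(s):  # stand-in "evolutionary" move: random feature swap
        s = set(s)
        s.remove(rng.choice(sorted(s)))
        s.add(rng.choice([f for f in range(n_features) if f not in s]))
        return frozenset(s)

    def flt_op(s):  # stand-in "filter" move: swap in the top-scored unused feature
        s = set(s)
        s.remove(min(s, key=lambda f: filter_scores[f]))
        s.add(next(f for f in ranked if f not in s))
        return frozenset(s)

    while ffc < max_ffc:
        mean_fit = {}
        for name, op, n in (("evo", evo_op, n_evo), ("flt", flt_op, n_flt)):
            fits = []
            for _ in range(n):
                if ffc >= max_ffc:  # never exceed the permitted FFCs
                    break
                cand = op(best)
                fit = fitness(cand)
                ffc += 1
                fits.append(fit)
                if fit > best_fit:
                    best, best_fit = cand, fit
            mean_fit[name] = sum(fits) / len(fits) if fits else float("-inf")
        # Hypothetical competition rule: the better group takes one particle slot.
        if mean_fit["evo"] > mean_fit["flt"] and n_flt > 1:
            n_evo, n_flt = n_evo + 1, n_flt - 1
        elif mean_fit["flt"] > mean_fit["evo"] and n_evo > 1:
            n_evo, n_flt = n_evo - 1, n_flt + 1
    return best, best_fit, ffc, (n_evo, n_flt)

# Toy usage: fitness is the sum of per-feature weights (a made-up objective).
weights = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
best, best_fit, ffc, sizes = competitive_search(
    6, lambda s: sum(weights[f] for f in s), weights)
```

In this toy setting the group sizes drift toward whichever operator has recently produced fitter subsets, which is the general idea of applying operators according to their relative effectiveness.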
The standard statistics of multi-label text datasets.
| Dataset | Instances | Features | Type | Labels | Label cardinality | Label density | Distinct label sets | Domain |
|---|---|---|---|---|---|---|---|---|
| RCV1 (S1) | 6000 | 945 | Numeric | 101 | 2.880 | 0.029 | 1028 | Text |
| RCV1 (S2) | 6000 | 945 | Numeric | 101 | 2.634 | 0.026 | 954 | Text |
| RCV1 (S3) | 6000 | 945 | Numeric | 101 | 2.614 | 0.026 | 939 | Text |
| RCV1 (S4) | 6000 | 945 | Numeric | 101 | 2.484 | 0.025 | 816 | Text |
| RCV1 (S5) | 6000 | 945 | Numeric | 101 | 2.642 | 0.026 | 946 | Text |
| Arts | 7484 | 1157 | Numeric | 26 | 1.654 | 0.064 | 599 | Text |
| Business | 11,214 | 1096 | Numeric | 30 | 1.599 | 0.053 | 233 | Text |
| Computers | 12,444 | 1705 | Numeric | 33 | 1.507 | 0.046 | 428 | Text |
| Education | 12,030 | 1377 | Numeric | 33 | 1.463 | 0.044 | 511 | Text |
| Entertainment | 12,730 | 1600 | Numeric | 21 | 1.414 | 0.067 | 337 | Text |
| Health | 9205 | 1530 | Numeric | 32 | 1.644 | 0.051 | 335 | Text |
| Recreation | 12,828 | 1516 | Numeric | 22 | 1.429 | 0.065 | 530 | Text |
| Reference | 8027 | 1984 | Numeric | 33 | 1.174 | 0.036 | 275 | Text |
| Science | 6428 | 1859 | Numeric | 40 | 1.450 | 0.036 | 457 | Text |
| Social | 12,111 | 2618 | Numeric | 29 | 1.279 | 0.033 | 361 | Text |
| Society | 14,512 | 1590 | Numeric | 27 | 1.670 | 0.062 | 1054 | Text |
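The numeric columns above behave like the usual multi-label dataset statistics: for RCV1 (S1), for instance, 2.880 / 101 ≈ 0.029, which is what one expects when label density is label cardinality divided by the number of labels. A minimal sketch of how such statistics are typically computed from a binary label matrix follows; the function name `label_statistics` and the toy matrix are illustrative assumptions, not taken from the source.

```python
def label_statistics(Y):
    """Y: list of per-instance 0/1 label lists, all of the same length.

    Returns (label cardinality, label density, number of distinct label sets).
    """
    n_instances, n_labels = len(Y), len(Y[0])
    # Cardinality: average number of relevant labels per instance.
    cardinality = sum(sum(row) for row in Y) / n_instances
    # Density: cardinality normalized by the size of the label space.
    density = cardinality / n_labels
    # Distinct label sets: number of unique label combinations observed.
    distinct = len({tuple(row) for row in Y})
    return cardinality, density, distinct

# Toy label matrix with 4 instances and 3 labels (made-up data).
Y_toy = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1]]
card, dens, dist = label_statistics(Y_toy)
```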
Comparison results of four compared methods in terms of Hamming loss for MLNB (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | 0.029 ± 0.001 | 0.030 ± 0.001 | 0.030 ± 0.000 | |
| RCV1 (S2) | | 0.028 ± 0.003 | 0.027 ± 0.001 | 0.027 ± 0.001 |
| RCV1 (S3) | | 0.027 ± 0.001 | 0.027 ± 0.001 | 0.026 ± 0.001 |
| RCV1 (S4) | | 0.025 ± 0.001 | 0.025 ± 0.001 | 0.024 ± 0.001 |
| RCV1 (S5) | | 0.028 ± 0.003 | 0.028 ± 0.001 | 0.026 ± 0.001 |
| Arts | | 0.067 ± 0.002 | 0.069 ± 0.002 | 0.066 ± 0.002 |
| Business | | 0.036 ± 0.004 | 0.034 ± 0.001 | 0.034 ± 0.002 |
| Computers | | 0.051 ± 0.004 | 0.046 ± 0.001 | 0.047 ± 0.001 |
| Education | | 0.048 ± 0.002 | 0.048 ± 0.002 | 0.047 ± 0.001 |
| Entertainment | | 0.069 ± 0.003 | 0.065 ± 0.001 | 0.065 ± 0.001 |
| Health | | 0.050 ± 0.003 | 0.047 ± 0.001 | 0.047 ± 0.002 |
| Recreation | | 0.070 ± 0.003 | 0.067 ± 0.002 | 0.065 ± 0.001 |
| Reference | | 0.040 ± 0.003 | 0.037 ± 0.002 | 0.037 ± 0.001 |
| Science | | 0.043 ± 0.003 | 0.042 ± 0.001 | 0.042 ± 0.001 |
| Social | | 0.042 ± 0.004 | 0.032 ± 0.002 | 0.032 ± 0.001 |
| Society | | 0.065 ± 0.004 | 0.064 ± 0.001 | 0.063 ± 0.001 |
| Avg. Rank | | 3.88 | 2.94 | 2.13 |
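For reference, Hamming loss is commonly defined as the fraction of instance-label pairs that are predicted incorrectly, so lower values are better. The sketch below uses that standard definition; the function name and the toy matrices are illustrative assumptions, and the source does not show how the authors computed the measure.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs where prediction and truth disagree."""
    n_instances, n_labels = len(Y_true), len(Y_true[0])
    wrong = sum(t != p
                for y_t, y_p in zip(Y_true, Y_pred)
                for t, p in zip(y_t, y_p))
    return wrong / (n_instances * n_labels)

# Toy example (made-up data): one wrong label out of six instance-label pairs.
hl = hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 0]])
```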
Comparison results of four compared methods in terms of one-error for MLNB (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | | 0.637 ± 0.129 | 0.621 ± 0.134 | 0.648 ± 0.125 |
| RCV1 (S2) | | 0.654 ± 0.023 | 0.580 ± 0.015 | 0.599 ± 0.019 |
| RCV1 (S3) | | 0.718 ± 0.150 | 0.671 ± 0.174 | 0.683 ± 0.168 |
| RCV1 (S4) | | 0.696 ± 0.160 | 0.671 ± 0.175 | 0.672 ± 0.174 |
| RCV1 (S5) | | 0.695 ± 0.161 | 0.656 ± 0.182 | 0.652 ± 0.185 |
| Arts | | 0.712 ± 0.149 | 0.710 ± 0.149 | 0.712 ± 0.149 |
| Business | | 0.399 ± 0.409 | 0.398 ± 0.400 | 0.396 ± 0.406 |
| Computers | | 0.469 ± 0.006 | 0.445 ± 0.009 | 0.448 ± 0.007 |
| Education | | 0.661 ± 0.008 | 0.616 ± 0.020 | 0.639 ± 0.016 |
| Entertainment | | 0.605 ± 0.019 | 0.563 ± 0.015 | 0.586 ± 0.015 |
| Health | | 0.774 ± 0.282 | 0.764 ± 0.300 | 0.778 ± 0.238 |
| Recreation | | 0.739 ± 0.013 | 0.675 ± 0.013 | 0.675 ± 0.011 |
| Reference | | 0.715 ± 0.243 | 0.718 ± 0.241 | 0.715 ± 0.243 |
| Science | | 0.707 ± 0.018 | 0.696 ± 0.027 | 0.696 ± 0.023 |
| Social | | 0.571 ± 0.152 | 0.472 ± 0.186 | 0.490 ± 0.179 |
| Society | | 0.510 ± 0.017 | 0.489 ± 0.019 | 0.479 ± 0.016 |
| Avg. Rank | | 3.75 | 2.31 | 2.94 |
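One-error is usually defined on real-valued label scores rather than hard predictions: it is the fraction of instances whose top-ranked label is not a relevant label, so lower values are better. A minimal sketch under that standard definition follows; the function name, the tie-breaking by first index, and the toy scores are assumptions for illustration.

```python
def one_error(Y_true, scores):
    """Fraction of instances whose highest-scored label is not relevant.

    Ties in the scores are broken by taking the first maximal index (assumption).
    """
    misses = 0
    for y_t, s in zip(Y_true, scores):
        top = max(range(len(s)), key=s.__getitem__)  # index of the top-ranked label
        misses += (y_t[top] == 0)
    return misses / len(Y_true)

# Toy example (made-up data): the first instance's top label is wrong, the second is right.
oe = one_error([[1, 0, 0], [0, 1, 0]],
               [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1]])
```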
Comparison results of four compared methods in terms of Multi-label accuracy for MLNB (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | | 0.176 ± 0.011 | 0.168 ± 0.013 | 0.124 ± 0.013 |
| RCV1 (S2) | | 0.177 ± 0.011 | 0.179 ± 0.014 | 0.157 ± 0.018 |
| RCV1 (S3) | | 0.161 ± 0.004 | 0.178 ± 0.019 | 0.168 ± 0.014 |
| RCV1 (S4) | | 0.170 ± 0.007 | 0.192 ± 0.014 | 0.183 ± 0.019 |
| RCV1 (S5) | | 0.187 ± 0.009 | 0.191 ± 0.016 | 0.165 ± 0.012 |
| Arts | | 0.094 ± 0.008 | 0.099 ± 0.008 | 0.106 ± 0.012 |
| Business | | 0.662 ± 0.009 | 0.654 ± 0.008 | 0.656 ± 0.011 |
| Computers | | 0.369 ± 0.010 | 0.388 ± 0.006 | 0.391 ± 0.008 |
| Education | | 0.075 ± 0.008 | 0.109 ± 0.012 | 0.085 ± 0.018 |
| Entertainment | | 0.173 ± 0.007 | 0.220 ± 0.011 | 0.188 ± 0.011 |
| Health | | 0.410 ± 0.017 | 0.397 ± 0.019 | 0.423 ± 0.020 |
| Recreation | | 0.045 ± 0.004 | 0.111 ± 0.011 | 0.119 ± 0.008 |
| Reference | | 0.360 ± 0.010 | 0.352 ± 0.009 | 0.350 ± 0.013 |
| Science | | 0.075 ± 0.007 | 0.064 ± 0.010 | 0.070 ± 0.017 |
| Social | | 0.340 ± 0.021 | 0.471 ± 0.014 | 0.449 ± 0.025 |
| Society | | 0.290 ± 0.019 | 0.254 ± 0.012 | 0.211 ± 0.041 |
| Avg. Rank | | 3.19 | 2.75 | 3.06 |
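Multi-label accuracy is commonly taken to be the mean Jaccard similarity between the predicted and true label sets, so higher values are better. The sketch below follows that definition; the function name is illustrative, and scoring an instance as 1.0 when both label sets are empty is an assumed convention that the source does not specify.

```python
def multilabel_accuracy(Y_true, Y_pred):
    """Mean Jaccard similarity |true ∩ pred| / |true ∪ pred| over instances."""
    total = 0.0
    for y_t, y_p in zip(Y_true, Y_pred):
        inter = sum(1 for t, p in zip(y_t, y_p) if t and p)
        union = sum(1 for t, p in zip(y_t, y_p) if t or p)
        # Assumption: an instance with no true and no predicted labels counts as correct.
        total += inter / union if union else 1.0
    return total / len(Y_true)

# Toy example (made-up data): Jaccard scores 2/3 and 1/1, mean 5/6.
acc = multilabel_accuracy([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 0]])
```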
Comparison results of four compared methods in terms of subset accuracy for MLNB (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | | 0.009 ± 0.002 | 0.012 ± 0.007 | 0.012 ± 0.005 |
| RCV1 (S2) | | 0.011 ± 0.005 | 0.087 ± 0.010 | 0.087 ± 0.004 |
| RCV1 (S3) | | 0.025 ± 0.005 | 0.093 ± 0.009 | 0.102 ± 0.005 |
| RCV1 (S4) | | 0.033 ± 0.014 | 0.120 ± 0.016 | 0.126 ± 0.016 |
| RCV1 (S5) | | 0.013 ± 0.003 | 0.082 ± 0.012 | 0.091 ± 0.011 |
| Arts | | 0.058 ± 0.007 | 0.071 ± 0.007 | 0.075 ± 0.006 |
| Business | | 0.514 ± 0.016 | 0.507 ± 0.011 | 0.512 ± 0.011 |
| Computers | | 0.299 ± 0.011 | 0.316 ± 0.010 | 0.319 ± 0.009 |
| Education | | 0.047 ± 0.009 | 0.074 ± 0.007 | 0.064 ± 0.013 |
| Entertainment | | 0.130 ± 0.009 | 0.188 ± 0.010 | 0.176 ± 0.022 |
| Health | | 0.307 ± 0.016 | 0.314 ± 0.009 | 0.308 ± 0.054 |
| Recreation | | 0.020 ± 0.003 | 0.093 ± 0.013 | 0.106 ± 0.016 |
| Reference | | 0.321 ± 0.006 | 0.316 ± 0.011 | 0.294 ± 0.074 |
| Science | | 0.053 ± 0.008 | 0.048 ± 0.005 | 0.055 ± 0.011 |
| Social | | 0.287 ± 0.022 | 0.432 ± 0.016 | 0.412 ± 0.012 |
| Society | | 0.215 ± 0.012 | 0.179 ± 0.028 | 0.157 ± 0.021 |
| Avg. Rank | | 3.56 | 2.81 | 2.63 |
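Subset accuracy is the strictest of the four measures: an instance counts as correct only if its entire predicted label set matches the true label set exactly, which helps explain why its values are much lower than multi-label accuracy on the same datasets. A minimal sketch under that standard definition, with an illustrative function name and toy data:

```python
def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose predicted label vector matches the truth exactly."""
    exact = sum(list(y_t) == list(y_p) for y_t, y_p in zip(Y_true, Y_pred))
    return exact / len(Y_true)

# Toy example (made-up data): only the second instance is an exact match.
sa = subset_accuracy([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 0]])
```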
Comparison results of four compared methods in terms of Hamming loss for ML-ELM (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | | 0.038 ± 0.001 | 0.039 ± 0.001 | 0.040 ± 0.001 |
| RCV1 (S2) | | 0.037 ± 0.003 | 0.037 ± 0.001 | 0.036 ± 0.000 |
| RCV1 (S3) | | 0.037 ± 0.001 | 0.037 ± 0.003 | 0.036 ± 0.001 |
| RCV1 (S4) | | 0.036 ± 0.002 | 0.035 ± 0.001 | 0.034 ± 0.001 |
| RCV1 (S5) | | 0.036 ± 0.001 | 0.035 ± 0.001 | 0.035 ± 0.001 |
| Arts | | 0.092 ± 0.005 | 0.089 ± 0.001 | 0.088 ± 0.002 |
| Business | | 0.029 ± 0.001 | 0.029 ± 0.001 | 0.029 ± 0.001 |
| Computers | | 0.045 ± 0.001 | 0.044 ± 0.001 | 0.044 ± 0.001 |
| Education | | 0.060 ± 0.002 | 0.057 ± 0.001 | 0.056 ± 0.001 |
| Entertainment | | 0.088 ± 0.003 | 0.088 ± 0.004 | 0.083 ± 0.002 |
| Health | | 0.049 ± 0.002 | 0.047 ± 0.002 | 0.046 ± 0.001 |
| Recreation | | 0.115 ± 0.006 | 0.102 ± 0.003 | 0.100 ± 0.005 |
| Reference | | 0.038 ± 0.001 | 0.037 ± 0.001 | 0.037 ± 0.001 |
| Science | | 0.053 ± 0.002 | 0.051 ± 0.001 | 0.050 ± 0.001 |
| Social | | 0.036 ± 0.001 | 0.028 ± 0.001 | 0.029 ± 0.001 |
| Society | | 0.064 ± 0.002 | 0.062 ± 0.001 | 0.062 ± 0.001 |
| Avg. Rank | | 3.75 | 2.88 | 2.38 |
Comparison results of four compared methods in terms of one-error for ML-ELM (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | | 0.704 ± 0.026 | 0.602 ± 0.018 | 0.614 ± 0.014 |
| RCV1 (S2) | | 0.715 ± 0.023 | 0.612 ± 0.017 | 0.611 ± 0.017 |
| RCV1 (S3) | | 0.727 ± 0.010 | 0.598 ± 0.020 | 0.606 ± 0.014 |
| RCV1 (S4) | | 0.698 ± 0.011 | 0.589 ± 0.020 | 0.567 ± 0.018 |
| RCV1 (S5) | | 0.692 ± 0.014 | 0.580 ± 0.025 | 0.588 ± 0.029 |
| Arts | | 0.633 ± 0.021 | 0.637 ± 0.018 | 0.626 ± 0.019 |
| Business | | 0.132 ± 0.007 | 0.133 ± 0.006 | 0.131 ± 0.007 |
| Computers | | 0.455 ± 0.009 | 0.441 ± 0.006 | 0.439 ± 0.009 |
| Education | | 0.636 ± 0.014 | 0.598 ± 0.013 | 0.620 ± 0.020 |
| Entertainment | | 0.591 ± 0.016 | 0.556 ± 0.019 | 0.569 ± 0.022 |
| Health | | 0.433 ± 0.017 | 0.422 ± 0.017 | 0.398 ± 0.023 |
| Recreation | | 0.741 ± 0.025 | 0.661 ± 0.019 | 0.666 ± 0.021 |
| Reference | | 0.511 ± 0.017 | 0.507 ± 0.014 | 0.502 ± 0.012 |
| Science | | 0.689 ± 0.025 | 0.663 ± 0.016 | 0.674 ± 0.021 |
| Social | | 0.512 ± 0.021 | 0.386 ± 0.017 | 0.421 ± 0.020 |
| Society | | 0.479 ± 0.018 | 0.470 ± 0.014 | 0.463 ± 0.015 |
| Avg. Rank | | 3.88 | 2.63 | 2.50 |
Comparison results of four compared methods in terms of Multi-label accuracy for ML-ELM (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | | 0.214 ± 0.007 | 0.220 ± 0.006 | 0.215 ± 0.011 |
| RCV1 (S2) | | 0.198 ± 0.010 | 0.242 ± 0.016 | 0.243 ± 0.013 |
| RCV1 (S3) | | 0.202 ± 0.006 | 0.251 ± 0.010 | 0.258 ± 0.010 |
| RCV1 (S4) | | 0.215 ± 0.009 | 0.266 ± 0.009 | 0.275 ± 0.010 |
| RCV1 (S5) | | 0.206 ± 0.006 | 0.256 ± 0.012 | 0.256 ± 0.012 |
| Arts | | 0.275 ± 0.012 | 0.283 ± 0.009 | 0.284 ± 0.014 |
| Business | 0.686 ± 0.007 | | 0.680 ± 0.010 | 0.681 ± 0.008 |
| Computers | | 0.427 ± 0.010 | 0.441 ± 0.010 | 0.441 ± 0.008 |
| Education | | 0.286 ± 0.011 | 0.315 ± 0.012 | 0.318 ± 0.013 |
| Entertainment | | 0.336 ± 0.015 | 0.362 ± 0.014 | 0.362 ± 0.009 |
| Health | | 0.449 ± 0.011 | 0.462 ± 0.019 | 0.466 ± 0.012 |
| Recreation | | 0.210 ± 0.007 | 0.263 ± 0.017 | 0.285 ± 0.023 |
| Reference | | 0.437 ± 0.007 | 0.437 ± 0.016 | 0.447 ± 0.009 |
| Science | | 0.246 ± 0.011 | 0.254 ± 0.017 | 0.270 ± 0.014 |
| Social | | 0.435 ± 0.021 | 0.543 ± 0.015 | 0.519 ± 0.021 |
| Society | | 0.392 ± 0.010 | 0.398 ± 0.011 | 0.402 ± 0.011 |
| Avg. Rank | | 3.81 | 2.81 | 2.31 |
Comparison results of four compared methods in terms of subset accuracy for ML-ELM (The highest performance is shown in bold font and indicated by a check mark).
| Dataset | Proposed | EGA+CDM | bALO-QR | CSO |
|---|---|---|---|---|
| RCV1 (S1) | | 0.012 ± 0.002 | 0.011 ± 0.006 | 0.013 ± 0.006 |
| RCV1 (S2) | | 0.011 ± 0.003 | 0.090 ± 0.012 | 0.099 ± 0.009 |
| RCV1 (S3) | | 0.011 ± 0.004 | 0.108 ± 0.007 | 0.111 ± 0.009 |
| RCV1 (S4) | | 0.023 ± 0.005 | 0.120 ± 0.014 | 0.126 ± 0.012 |
| RCV1 (S5) | | 0.008 ± 0.003 | 0.090 ± 0.014 | 0.092 ± 0.012 |
| Arts | | 0.118 ± 0.009 | 0.143 ± 0.009 | 0.140 ± 0.020 |
| Business | 0.528 ± 0.008 | 0.527 ± 0.015 | 0.526 ± 0.013 | |
| Computers | | 0.323 ± 0.011 | 0.338 ± 0.011 | 0.340 ± 0.007 |
| Education | | 0.186 ± 0.016 | 0.197 ± 0.019 | 0.214 ± 0.010 |
| Entertainment | | 0.231 ± 0.021 | 0.243 ± 0.017 | 0.276 ± 0.013 |
| Health | | 0.315 ± 0.015 | 0.325 ± 0.011 | 0.352 ± 0.014 |
| Recreation | | 0.086 ± 0.010 | 0.137 ± 0.016 | 0.146 ± 0.017 |
| Reference | | 0.376 ± 0.007 | 0.379 ± 0.013 | 0.386 ± 0.015 |
| Science | | 0.153 ± 0.015 | 0.176 ± 0.015 | 0.179 ± 0.013 |
| Social | | 0.333 ± 0.018 | 0.482 ± 0.017 | 0.468 ± 0.014 |
| Society | | 0.274 ± 0.012 | 0.281 ± 0.010 | 0.289 ± 0.014 |
| Avg. Rank | | 3.88 | 3.00 | 2.06 |
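The Avg. Rank rows summarize each method's per-dataset rank (1 = best) averaged over the 16 datasets. A minimal sketch of the usual procedure follows, in which tied scores share the mean of the ranks they occupy. The function name, the tie handling, and the toy score matrix are assumptions; the source does not state how the authors handled ties.

```python
def average_ranks(results, higher_is_better=True):
    """results[i][j] is the score of method j on dataset i.

    Returns the per-method rank averaged over datasets; tied scores share
    the mean of the rank positions they occupy (assumed convention).
    """
    k = len(results[0])
    totals = [0.0] * k
    for row in results:
        keyed = [-v if higher_is_better else v for v in row]
        order = sorted(range(k), key=keyed.__getitem__)  # best method first
        i = 0
        while i < k:
            j = i
            while j + 1 < k and keyed[order[j + 1]] == keyed[order[i]]:
                j += 1  # extend the block of tied methods
            mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
            for pos in range(i, j + 1):
                totals[order[pos]] += mean_rank
            i = j + 1
    return [t / len(results) for t in totals]

# Toy example (made-up scores): 2 datasets, 3 methods, one tie in the first row.
ranks = average_ranks([[0.9, 0.8, 0.8], [0.7, 0.9, 0.6]])
```

For loss-type measures such as Hamming loss and one-error, `higher_is_better=False` would be passed so that the smallest value receives rank 1.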
Friedman statistics and critical value in terms of each evaluation measure for MLNB.
| Evaluation Measure | Friedman Statistics | Critical Values ( |
|---|---|---|
| Hamming loss | 101.914 | 2.812 |
| One-error | 63.304 | 2.812 |
| Multi-label accuracy | 24.520 | 2.812 |
| Subset accuracy | 34.557 | 2.812 |
Friedman statistics and critical value in terms of each evaluation measure for ML-ELM.
| Evaluation Measure | Friedman Statistics | Critical values ( |
|---|---|---|
| Hamming loss | 61.632 | 2.812 |
| One-error | 81.314 | 2.812 |
| Multi-label accuracy | 51.484 | 2.812 |
| Subset accuracy | 114.668 | 2.812 |
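The Friedman statistics in the two tables above are compared against a single critical value of 2.812. The source does not say which form of the statistic is reported; the sketch below assumes the F-distributed Iman–Davenport variant computed from average ranks, with k = 4 methods and N = 16 datasets. The function name and the example ranks are hypothetical and are not the paper's values.

```python
def iman_davenport(avg_ranks, n_datasets):
    """Friedman chi-square and its F-distributed (Iman-Davenport) correction.

    avg_ranks: average rank of each of the k methods over n_datasets datasets.
    """
    k, n = len(avg_ranks), n_datasets
    # Friedman chi-square computed from the average ranks.
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    # Iman-Davenport statistic, F-distributed with (k-1, (k-1)(n-1)) degrees of freedom.
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat

# Hypothetical average ranks for 4 methods over 16 datasets (not from the paper).
chi2, f_stat = iman_davenport([1.5, 3.5, 3.0, 2.0], n_datasets=16)
reject_null = f_stat > 2.812  # critical value reported in the tables above
```

Under this reading, every statistic in the two tables exceeds 2.812, so the null hypothesis that all four methods perform equally would be rejected for each measure.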
Figure 2. Bonferroni–Dunn test results of four comparison methods with four evaluation measures for MLNB.
Figure 3. Bonferroni–Dunn test results of four comparison methods with four evaluation measures for ML-ELM.
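In a Bonferroni–Dunn test, a method is typically judged significantly different from the control method when their average ranks differ by at least a critical difference (CD). The sketch below computes the CD for four methods and 16 datasets. The significance level is an assumption (α = 0.05, which is not stated in this excerpt), as is the corresponding constant q = 2.394 taken from the commonly used Bonferroni–Dunn table for four methods; the function names and the control rank in the usage line are hypothetical.

```python
import math

def critical_difference(k, n_datasets, q_alpha):
    """Critical difference between average ranks for k methods on n_datasets datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

def differs_from_control(control_rank, other_rank, cd):
    """True when the average-rank gap to the control reaches the critical difference."""
    return abs(other_rank - control_rank) >= cd

# Assumed alpha = 0.05, hence q_alpha = 2.394 for k = 4 methods (assumption).
cd = critical_difference(k=4, n_datasets=16, q_alpha=2.394)
# Hypothetical control rank of 1.0 compared with a method ranked 2.5 on average.
significant = differs_from_control(1.0, 2.5, cd)
```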
Figure 4. Competition results between particle groups for proposed method in terms of subset accuracy.
Figure 5. Comparison results between two methods in terms of subset accuracy.