Li Zhang.
Abstract
Feature selection is a key step in the analysis of high-dimensional, small-sample data. Its core task is to analyse and quantify the relevance of features to class labels and the redundancy among features. However, most existing feature selection algorithms consider only the classification contribution of individual features and ignore the influence of inter-feature redundancy and correlation. This paper therefore proposes a feature selection algorithm based on nonlinear dynamic conditional relevance (NDCRFS), building on a study of the ideas and methods of existing feature selection algorithms. First, mutual information, conditional mutual information, and interactive mutual information are used to distinguish redundancy and relevance, both between features and between features and class labels. Second, the selected features and candidate features are dynamically weighted using information gain factors. Finally, to evaluate its performance, NDCRFS was compared with six other feature selection algorithms (MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE) on 12 data sets using three classifiers (KNN, SVM, and C4.5), in terms of both the variability between the algorithms and classification metrics. The experimental results show that NDCRFS can improve the quality of the selected feature subsets and obtain better classification results.
Year: 2021 PMID: 34992644 PMCID: PMC8727115 DOI: 10.1155/2021/3569632
Source DB: PubMed Journal: Comput Intell Neurosci
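The three information-theoretic quantities named in the abstract can be illustrated for discrete data with plug-in (frequency-count) entropy estimates. The following is a minimal sketch, not the NDCRFS scoring function: the function names are hypothetical, the features are assumed to be already discretized, and "interactive mutual information" is assumed here to mean the interaction information I(X;Y|Z) − I(X;Y), whose sign convention varies between authors.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())


def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def interaction_information(x, y, z):
    # Assumed convention: I(X;Y|Z) - I(X;Y); positive values indicate
    # that Z makes X more informative about Y (synergy).
    return conditional_mutual_information(x, y, z) - mutual_information(x, y)


# Toy XOR example: neither feature alone predicts the label,
# but the two together determine it completely.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
label = [a ^ b for a, b in zip(f1, f2)]

print(mutual_information(f1, label))
print(conditional_mutual_information(f1, label, f2))
print(interaction_information(f1, label, f2))
```

The XOR case shows why a criterion that scores features one at a time can miss useful features: the marginal relevance of `f1` is zero, while its relevance conditioned on `f2` is one bit.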
Experimental data set description.
| No. | Data set | Samples | Features | Categories | Data sources |
|---|---|---|---|---|---|
| 1 | Lymphography | 148 | 18 | 8 | UCI |
| 2 | Dermatology | 358 | 34 | 6 | UCI |
| 3 | Cardiotocography | 2126 | 41 | 3 | UCI |
| 4 | Pendigits | 7494 | 16 | 10 | UCI |
| 5 | Lung | 203 | 3312 | 5 | ASU |
| 6 | Carcinom | 174 | 9182 | 11 | ASU |
| 7 | Nci9 | 60 | 9712 | 9 | ASU |
| 8 | PCMAC | 1943 | 3289 | 2 | ASU |
| 9 | Pixraw10P | 100 | 10000 | 10 | ASU |
| 10 | SMK-CAN-187 | 187 | 19993 | 2 | ASU |
| 11 | Lymphoma | 96 | 4026 | 9 | ASU |
| 12 | COIL20 | 1440 | 1024 | 20 | ASU |
The difference between NDCRFS and the comparison algorithms.
| No. | MIM | IG-RFE | IWFS | CMIM | DWFS | CIFE |
|---|---|---|---|---|---|---|
| 1 | 0.667 | 0.818 | 0.333 | 0.333 | 0.429 | 0.176 |
| 2 | 0.935 | 0.935 | 0.765 | 0.765 | 0.818 | 0.765 |
| 3 | 0.538 | 0.579 | 0.5 | 0.463 | 0.5 | 0.5 |
| 4 | 0.818 | 0.818 | 0.333 | 0.333 | 0.25 | 0.25 |
| 5 | 0.017 | 0.017 | 0.017 | 0.0 | 0.132 | 0.0 |
| 6 | 0.0 | 0.017 | 0.017 | 0.0 | 0.034 | 0.091 |
| 7 | 0.579 | 0.622 | 0.053 | 0.224 | 0.017 | 0.034 |
| 8 | 0.429 | 0.5 | 0.224 | 0.395 | 0.25 | 0.091 |
| 9 | 0.034 | 0.017 | 0.017 | 0.017 | 0.091 | 0.017 |
| 10 | 0.091 | 0.017 | 0.818 | 0.0 | 0.765 | 0.0 |
| 11 | 0.132 | 0.132 | 0.034 | 0.132 | 0.071 | 0.071 |
| 12 | 0.017 | 0.2 | 0.017 | 0.0 | 0.071 | 0.0 |
| Average | 0.355 | 0.389 | 0.261 | 0.222 | 0.286 | 0.166 |
Average classification accuracy (%) of KNN classifier.
| Data set | NDCRFS | MIM | IG-RFE | IWFS | CMIM | DWFS | CIFE |
|---|---|---|---|---|---|---|---|
| Lymphography | | 34.78 | 35.59 | 35.59 | 34.88 | 35.28 | 34.78 |
| Dermatology | | 92.164 | 92.164 | 88.512 | 90.79 | 96.68 | 87.139 |
| Cardiotocography | | 98.401 | 98.401 | 98.401 | 98.401 | 98.589 | 98.401 |
| Pendigits | | 97.145 | 97.145 | 97.238 | 97.505 | 98.159 | 97.625 |
| Lung | | 88.064 | 83.712 | 76.391 | 81.678 | 87.681 | 74.922 |
| Carcinom | | 68.037 | 32.255 | 60.035 | 65.84 | 67.026 | 31.952 |
| Nci9 | | 75.44 | 74.012 | 69.024 | 76.119 | 48.429 | 57.25 |
| PCMAC | | 85.538 | 86.155 | 82.348 | 84.765 | 85.743 | 78.952 |
| Pixraw10P | | 88.0 | 91.0 | 88.0 | 92.0 | 88.0 | 92.0 |
| SMK-CAN-187 | | 68.393 | 69.004 | 70.0 | 65.747 | 68.421 | 58.876 |
| Lymphoma | | 84.722 | 84.75 | 69.806 | 90.083 | 72.056 | 82.833 |
| COIL20 | | 80.733 | 79.743 | 71.667 | 77.114 | 72.024 | 60.652 |
| Average accuracy rate | | 84.24 | 76.994 | 75.584 | 83.64 | 76.507 | 71.28 |
| Wins/Ties/Losses | 12/0/0 | 12/0/0 | 12/0/0 | 12/0/0 | 12/0/0 | 12/0/0 | |
The “Average accuracy rate” row gives the mean accuracy of each feature selection algorithm over all data sets. Bold marks the highest classification accuracy on each data set.
Average classification accuracy (%) of SVM classifier.
| Data set | NDCRFS | MIM | IG-RFE | IWFS | CMIM | DWFS | CIFE |
|---|---|---|---|---|---|---|---|
| Lymphography | | 42.499 | 43.329 | 41.45 | 42.825 | 43.329 | 42.825 |
| Dermatology | | 93.777 | 93.824 | 93.283 | 94.079 | 97.761 | 93.53 |
| Cardiotocography | | 98.401 | 98.401 | 98.401 | 98.401 | 98.401 | 98.401 |
| Pendigits | | 63.331 | 63.331 | 55.35 | 59.741 | 56.979 | 57.219 |
| Lung | 84.788 | 77.89 | 78.391 | 77.891 | | 85.311 | 77.402 |
| Carcinom | | 50.998 | 25.028 | 50.447 | 51.545 | 55.773 | 20.915 |
| Nci9 | 76.512 | | 76.69 | 62.595 | 74.429 | 57.929 | 58.821 |
| PCMAC | | 85.588 | 85.486 | 82.194 | 85.333 | 85.382 | 80.394 |
| Pixraw10P | | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 |
| SMK-CAN-187 | 70.982 | 70.569 | 62.532 | | 65.32 | 71.053 | 57.255 |
| Lymphoma | 85.5 | 81.278 | 79.611 | 67.056 | 81.972 | 72.194 | |
| COIL20 | | 63.886 | 62.067 | 52.824 | 55.933 | 48.638 | 40.905 |
| Average accuracy rate | | 73.363 | 71.641 | 70.226 | 73.898 | 71.979 | 65.333 |
| Wins/Ties/Losses | 10/1/1 | 12/0/0 | 12/0/0 | 11/0/1 | 10/0/2 | 11/0/1 | |
The “Average accuracy rate” row gives the mean accuracy of each feature selection algorithm over all data sets. Bold marks the highest classification accuracy on each data set.
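A Wins/Ties/Losses entry can be read as a per-data-set tally of one algorithm's accuracy against another's. The sketch below is one plausible way to compute such a tally, not necessarily the procedure used in the paper: the function name and the `tol` parameter are hypothetical, and a tie is assumed to mean equal accuracy at the reported precision. The example uses only the three SVM rows (Lung, SMK-CAN-187, Lymphoma) in which both the NDCRFS and MIM values appear in the table above.

```python
def wins_ties_losses(reference, other, tol=0.0):
    """Count data sets where `reference` beats, ties with, or loses to `other`.

    `tol` is a hypothetical tolerance: differences no larger than `tol`
    are counted as ties.
    """
    wins = ties = losses = 0
    for ref_acc, other_acc in zip(reference, other):
        if abs(ref_acc - other_acc) <= tol:
            ties += 1
        elif ref_acc > other_acc:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# SVM accuracies (%) on Lung, SMK-CAN-187, and Lymphoma, copied from the table.
ndcrfs = [84.788, 70.982, 85.5]
mim = [77.89, 70.569, 81.278]

w, t, l = wins_ties_losses(ndcrfs, mim)
print(f"{w}/{t}/{l}")  # prints "3/0/0"
```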
Average classification accuracy (%) of C4.5 classifier.
| Data set | NDCRFS | MIM | IG-RFE | IWFS | CMIM | DWFS | CIFE |
|---|---|---|---|---|---|---|---|
| Lymphography | | 41.893 | 41.473 | 41.347 | 42.322 | 43.002 | 42.322 |
| Dermatology | | 94.434 | 94.149 | 94.187 | 95.021 | 93.337 | 94.727 |
| Cardiotocography | | 98.401 | 98.401 | 98.401 | 98.401 | 98.401 | 98.401 |
| Pendigits | | 94.343 | 94.196 | 93.782 | 93.768 | 94.222 | 93.675 |
| Lung | | 79.918 | 85.113 | 75.964 | 83.842 | 84.157 | 77.236 |
| Carcinom | | 54.586 | 25.79 | 48.292 | 56.822 | 53.999 | 24.3 |
| Nci9 | 69.929 | 61.012 | 65.095 | 60.667 | | 57.929 | 60.226 |
| PCMAC | | 86.464 | 86.515 | 82.502 | 85.897 | 86.669 | 80.805 |
| Pixraw10P | | 97.0 | 96.0 | 92.0 | 95.0 | 92.0 | 95.0 |
| SMK-CAN-187 | 64.125 | 62.006 | 61.494 | 63.656 | 62.077 | | 57.852 |
| Lymphoma | | 79.75 | 80.0 | 69.528 | 82.806 | 69.417 | 86.917 |
| COIL20 | | 67.614 | 72.762 | 63.186 | 62.895 | 70.629 | 58.295 |
| Average accuracy rate | | 76.452 | 75.082 | 73.626 | 77.495 | 75.792 | 72.48 |
| Wins/Ties/Losses | 11/1/0 | 11/1/0 | 11/1/0 | 10/1/1 | 10/1/1 | 11/1/0 | |
Figure 1: Comparison of accuracy in KNN classifier.
Figure 2: Comparison of accuracy in C4.5 classifier.
Figure 3: Comparison of accuracy in SVM classifier.
The runtimes (in seconds) of different feature selection algorithms.
| Data set | NDCRFS | MIM | IG-RFE | IWFS | CMIM | DWFS | CIFE |
|---|---|---|---|---|---|---|---|
| Lymphography | 0.141 | 0.089 | 0.171 | 0.078 | 0.062 | 0.078 | 0.09 |
| Dermatology | 1.373 | 0.712 | 1.576 | 0.811 | 0.671 | 0.843 | 0.824 |
| Cardiotocography | 9.952 | 5.976 | 12.215 | 6.303 | 5.523 | 6.38 | 5.599 |
| Pendigits | 5.725 | 4.177 | 6.568 | 3.588 | 3.198 | 3.807 | 3.878 |
| Lung | 216.292 | 155.033 | 322.127 | 134.73 | 127.425 | 166.766 | 131.861 |
| Carcinom | 629.731 | 577.148 | 744.337 | 351.026 | 315.636 | 407.857 | 502.515 |
| Nci9 | 149.744 | 130.371 | 167.166 | 100.876 | 81.922 | 104.424 | 133.998 |
| PCMAC | 1206.53 | 1130.49 | 1689.445 | 878.968 | 615.348 | 836.969 | 1133.675 |
| Pixraw10P | 335.022 | 242.977 | 415.235 | 216.42 | 188.65 | 171.897 | 259.263 |
| SMK-CAN-187 | 1649.124 | 731.813 | 1905.724 | 812.859 | 727.913 | 995.003 | 749.035 |
| Lymphoma | 102.755 | 45.368 | 113.09 | 96.2 | 43.591 | 94.084 | 248.495 |
| COIL20 | 414.124 | 307.934 | 570.717 | 290.075 | 273.888 | 264.382 | 248.495 |
| Average | 393.376 | 277.674 | 495.698 | 240.995 | 198.652 | 254.374 | 284.811 |
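The Average row is easier to compare when each runtime is expressed relative to the fastest algorithm. The short sketch below does this with the averages copied from the table; the variable names are illustrative only.

```python
# Average runtimes in seconds, copied from the "Average" row of the table.
avg_runtime_s = {
    "NDCRFS": 393.376,
    "MIM": 277.674,
    "IG-RFE": 495.698,
    "IWFS": 240.995,
    "CMIM": 198.652,
    "DWFS": 254.374,
    "CIFE": 284.811,
}

# Normalize every average by the smallest one.
fastest = min(avg_runtime_s, key=avg_runtime_s.get)
relative = {name: t / avg_runtime_s[fastest] for name, t in avg_runtime_s.items()}

# Print the algorithms from fastest to slowest.
for name, ratio in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {ratio:.2f}x")
```

On these averages, NDCRFS takes roughly twice as long as CMIM, the fastest method, and is faster only than IG-RFE.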