Lin Sun, Xiaoyu Zhang, Jiucheng Xu, Shiguang Zhang.
Abstract
Attribute reduction is an important preprocessing step for data mining and has become a hot research topic in rough set theory. Neighborhood rough set theory overcomes a shortcoming of classical rough set theory, which may lose useful information when continuous-valued data sets are discretized. In this paper, to improve the classification performance of complex data, a novel attribute reduction method using neighborhood entropy measures in neighborhood rough sets is proposed. The method combines the algebra view with the information view and can handle continuous data whilst maintaining the classification information of the original attributes. First, to efficiently analyze the uncertainty of knowledge in neighborhood rough sets, a new average neighborhood entropy is presented by combining neighborhood approximate precision with neighborhood entropy, based on the strong complementarity between the algebra definition of attribute significance and the definition of information view. Then, a concept of decision neighborhood entropy is investigated for handling the uncertainty and noisiness of neighborhood decision systems; it integrates the credibility degree with the coverage degree of neighborhood decision systems to fully reflect the decision ability of attributes. Moreover, some of their properties are derived and the relationships among these measures are established, which helps to understand the essence of knowledge content and the uncertainty of neighborhood decision systems. Finally, a heuristic attribute reduction algorithm is proposed to improve the classification performance of complex data sets. Experimental results on an illustrative instance and several public data sets demonstrate that the proposed method is effective at selecting the most relevant attributes while achieving high classification performance.
Keywords: attribute reduction; classification; neighborhood entropy; neighborhood rough sets; rough sets
Year: 2019 PMID: 33266871 PMCID: PMC7514638 DOI: 10.3390/e21020155
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. The process of the attribute reduction method based on decision neighborhood entropy.
A neighborhood decision system.
| U | a | b | c | d |
|---|---|---|---|---|
| | 0.12 | 0.41 | 0.61 | Y |
| | 0.21 | 0.15 | 0.14 | Y |
| | 0.31 | 0.11 | 0.26 | N |
| | 0.61 | 0.13 | 0.23 | N |
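As a hedged illustration of how neighborhoods arise in a decision system like the one above, the following sketch computes the δ-neighborhood of each sample over the condition attributes a, b, c, and the fraction of samples whose neighborhood is pure with respect to the decision d. The Euclidean distance, the value δ = 0.3, and all function names are assumptions made for illustration; this is the standard neighborhood dependency degree, not the paper's decision neighborhood entropy.

```python
import math

# Condition attributes (a, b, c) and decision d, copied row by row from the table.
samples = [
    (0.12, 0.41, 0.61),
    (0.21, 0.15, 0.14),
    (0.31, 0.11, 0.26),
    (0.61, 0.13, 0.23),
]
decisions = ["Y", "Y", "N", "N"]


def neighborhoods(points, delta):
    """For each sample, list the indices of all samples within distance delta."""
    return [
        [j for j, q in enumerate(points) if math.dist(p, q) <= delta]
        for p in points
    ]


def dependency_degree(points, labels, delta):
    """Fraction of samples whose whole neighborhood shares their decision label."""
    hoods = neighborhoods(points, delta)
    consistent = sum(
        all(labels[j] == labels[i] for j in hood)
        for i, hood in enumerate(hoods)
    )
    return consistent / len(points)


# delta = 0.3 is an assumed neighborhood parameter, chosen only for illustration.
hoods = neighborhoods(samples, delta=0.3)
gamma = dependency_degree(samples, decisions, delta=0.3)
print(hoods, gamma)
```

Under these assumptions the second and third samples fall into each other's neighborhood although their decisions differ, so only the first and fourth samples are consistent and the dependency degree is 0.5.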
Description of the eleven public data sets.
| No. | Data Sets | Samples | Attributes | Classes | Reference |
|---|---|---|---|---|---|
| 1 | Ionosphere | 351 | 33 | 2 | Fen et al. [ |
| 2 | Wdbc | 569 | 31 | 2 | |
| 3 | Wine | 178 | 13 | 3 | |
| 4 | Wpbc | 198 | 33 | 2 | |
| 5 | Brain_Tumor1 | 90 | 5920 | 5 | Huang et al. [ |
| 6 | Colon | 62 | 2000 | 2 | Mu et al. [ |
| 7 | DLBCL | 77 | 5469 | 2 | Wang et al. [ |
| 8 | Leukemia | 72 | 7129 | 2 | Dong et al. [ |
| 9 | Lung | 181 | 12,533 | 2 | Sun et al. [ |
| 10 | Prostate | 136 | 12,600 | 2 | |
| 11 | SRBCT | 63 | 2308 | 4 | Tibshirani et al. [ |
Wdbc—Wisconsin Diagnostic Breast Cancer; Wpbc—Wisconsin Prognostic Breast Cancer; DLBCL—Diffuse Large B Cell Lymphoma; and SRBCT—Small Round Blue Cell Tumor.
Figure 2. The number of selected attributes and the classification accuracy of the eleven data sets for different neighborhood parameter values. (a) Ionosphere data set; (b) Wdbc data set; (c) Wine data set; (d) Wpbc data set; (e) Brain_Tumor1 data set; (f) Colon data set; (g) DLBCL data set; (h) Leukemia data set; (i) Lung data set; (j) Prostate data set; (k) SRBCT data set.
The number of selected attributes and the classification accuracy under the SVM and KNN classifiers on the raw data and the reduced data with Algorithm 1.
| Data Sets | Attributes (Raw Data) | SVM (Raw Data) | KNN (Raw Data) | Attributes (Reduced Data) | SVM (Reduced Data) | KNN (Reduced Data) | Neighborhood Parameter |
|---|---|---|---|---|---|---|---|
| Ionosphere | 33 | 0.874 | 0.857 | 11 | 0.909 | 0.893 | 0.3 |
| Wdbc | 31 | 0.538 | 0.896 | 10 | 0.959 | 0.959 | 0.15 |
| Wine | 13 | 0.401 | 0.69 | 7 | 0.959 | 0.96 | 0.15 |
| Wpbc | 33 | 0.667 | 0.752 | 6 | 0.772 | 0.753 | 0.2 |
| Brain_Tumor1 | 5920 | 0.86 | 0.783 | 13 | 0.83 | 0.897 | 0.15 |
| DLBCL | 5469 | 0.965 | 0.896 | 10 | 0.993 | 0.998 | 0.05 |
| Colon | 2000 | 0.811 | 0.776 | 5 | 0.808 | 0.818 | 0.15 |
| Lung | 12,533 | 0.979 | 0.975 | 6 | 0.99 | 0.99 | 0.1 |
| Leukemia | 7129 | 0.973 | 0.842 | 6 | 0.967 | 0.981 | 0.3 |
| Prostate | 12,600 | 0.916 | 0.796 | 3 | 0.829 | 0.858 | 0.5 |
| SRBCT | 2308 | 0.984 | 0.808 | 6 | 1 | 1 | 0.25 |
| Average | 4369.9 | 0.815 | 0.825 | 7.5 | 0.911 | 0.919 | |
SVM—support vector machine and KNN—k-nearest neighbors.
The number of selected attributes of the five reduction algorithms on the four UCI data sets.
| Data Sets | RS | NRS | CDA | MDNRS | ARDNE |
|---|---|---|---|---|---|
| Ionosphere | 17 | 8 | 9 | 8 | 11 |
| Wdbc | 8 | 2 | 2 | 2 | 10 |
| Wine | 5 | 3 | 2 | 4 | 7 |
| Wpbc | 7 | 2 | 2 | 4 | 6 |
| Average | 9.25 | 3.75 | 3.75 | 4.5 | 8.5 |
RS—classical rough set algorithm; NRS—neighborhood rough set algorithm; CDA—covering decision algorithm; MDNRS—max-decision neighborhood rough set algorithm; and ARDNE—attribute reduction algorithm based on decision neighborhood entropy.
Classification accuracy of the five reduction algorithms on the four UCI data sets with KNN.
| Data Sets | RS | NRS | CDA | MDNRS | ARDNE |
|---|---|---|---|---|---|
| Ionosphere | 0.866 | 0.859 | 0.848 | 0.891 | 0.893 |
| Wdbc | 0.911 | 0.923 | 0.923 | 0.930 | 0.959 |
| Wine | 0.863 | 0.752 | 0.727 | 0.911 | 0.960 |
| Wpbc | 0.743 | 0.738 | 0.738 | 0.761 | 0.753 |
| Average | 0.846 | 0.818 | 0.809 | 0.873 | 0.891 |
Classification accuracy of the five reduction algorithms on the four UCI data sets with SVM.
| Data Sets | RS | NRS | CDA | MDNRS | ARDNE |
|---|---|---|---|---|---|
| Ionosphere | 0.881 | 0.872 | 0.878 | 0.870 | 0.909 |
| Wdbc | 0.589 | 0.595 | 0.595 | 0.861 | 0.959 |
| Wine | 0.640 | 0.402 | 0.643 | 0.910 | 0.959 |
| Wpbc | 0.778 | 0.757 | 0.757 | 0.692 | 0.772 |
| Average | 0.722 | 0.657 | 0.718 | 0.833 | 0.900 |
Classification results of the four entropy-based reduction algorithms with KNN.
| Data Sets | MEAR Genes | MEAR Accuracy | EGAR Genes | EGAR Accuracy | ADNEAR Genes | ADNEAR Accuracy | ARDNE Genes | ARDNE Accuracy |
|---|---|---|---|---|---|---|---|---|
| Brain_Tumor1 | 2 | 0.683 | 8 | 0.667 | 9 | 0.711 | 13 | 0.897 |
| Colon | 5 | 0.77 | 5 | 0.540 | 5 | 0.555 | 5 | 0.817 |
| DLBCL | 2 | 0.765 | 20 | 0.752 | 7 | 0.757 | 10 | 0.998 |
| Leukemia | 3 | 0.928 | 3 | 0.587 | 3 | 0.587 | 6 | 0.981 |
| SRBCT | 4 | 0.537 | 8 | 0.503 | 8 | 0.503 | 6 | 1 |
| Average | 3.2 | 0.737 | 8.2 | 0.610 | 6.4 | 0.622 | 8 | 0.938 |
MEAR—mutual entropy-based attribute reduction algorithm; EGAR—entropy gain-based attribute reduction algorithm; and ADNEAR—average decision neighborhood entropy-based attribute reduction algorithm.
Classification results of the four entropy-based reduction algorithms with SVM.
| Data Sets | MEAR Genes | MEAR Accuracy | EGAR Genes | EGAR Accuracy | ADNEAR Genes | ADNEAR Accuracy | ARDNE Genes | ARDNE Accuracy |
|---|---|---|---|---|---|---|---|---|
| Brain_Tumor1 | 2 | 0.691 | 8 | 0.666 | 9 | 0.666 | 13 | 0.830 |
| Colon | 5 | 0.849 | 5 | 0.643 | 5 | 0.643 | 5 | 0.808 |
| DLBCL | 2 | 0.777 | 20 | 0.862 | 7 | 0.862 | 10 | 0.993 |
| Leukemia | 3 | 0.920 | 3 | 0.536 | 3 | 0.536 | 6 | 0.967 |
| SRBCT | 4 | 0.539 | 8 | 0.535 | 8 | 0.535 | 6 | 1 |
| Average | 3.2 | 0.755 | 8.2 | 0.648 | 6.4 | 0.648 | 8 | 0.919 |
The number of selected genes of the eight reduction algorithms on the four gene expression data sets.
| Data Sets | SFS | SGL | ASGL-CMI | SC2 | FLD-NRS | LLE-NRS | RelieF+NRS | ARDNE |
|---|---|---|---|---|---|---|---|---|
| Colon | 19 | 55 | 33 | 4 | 6 | 16 | 9 | 5 |
| Leukemia | 7 | - | - | 5 | 6 | 22 | 17 | 6 |
| Lung | 3 | 43 | 32 | 3 | 3 | 16 | 23 | 6 |
| Prostate | 3 | 34 | 29 | 5 | 4 | 19 | 16 | 3 |
| Average | 8 | 44 | 31.3 | 4.25 | 4.75 | 18.25 | 16.25 | 5 |
SFS—sequential forward selection algorithm; SGL—sparse group lasso algorithm; ASGL-CMI—adaptive sparse group lasso based on conditional mutual information algorithm; SC2—Spearman’s rank correlation coefficient algorithm; FLD-NRS—gene selection algorithm based on Fisher linear discriminant and neighborhood rough set; LLE-NRS—gene selection algorithm based on locally linear embedding and neighborhood rough set; and RelieF+NRS—RelieF algorithm combined with NRS algorithm.
The classification accuracy of the eight reduction algorithms on the four gene expression data sets.
| Data Sets | SFS | SGL | ASGL-CMI | SC2 | FLD-NRS | LLE-NRS | RelieF+NRS | ARDNE |
|---|---|---|---|---|---|---|---|---|
| Colon | 0.521 | 0.826 | 0.851 | 0.805 | 0.88 | 0.84 | 0.564 | 0.81 |
| Leukemia | 0.969 | - | - | 0.852 | 0.828 | 0.868 | 0.563 | 0.967 |
| Lung | 0.833 | 0.827 | 0.841 | 0.806 | 0.889 | 0.907 | 0.919 | 0.987 |
| Prostate | 0.840 | 0.834 | 0.858 | 0.795 | 0.8 | 0.711 | 0.642 | 0.858 |
| Average | 0.791 | 0.829 | 0.85 | 0.815 | 0.849 | 0.832 | 0.672 | 0.898 |
Ranking of the five attribute reduction algorithms with KNN.
| Data Sets | RS | NRS | CDA | MDNRS | ARDNE |
|---|---|---|---|---|---|
| Ionosphere | 3 | 4 | 5 | 2 | 1 |
| Wdbc | 5 | 3.5 | 3.5 | 2 | 1 |
| Wine | 3 | 4 | 5 | 2 | 1 |
| Wpbc | 3 | 4.5 | 4.5 | 1 | 2 |
| Average | 3.5 | 4 | 4.5 | 1.75 | 1.25 |
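The ranks in this table match the usual convention for Friedman-style comparisons: within each data set the most accurate algorithm gets rank 1, and tied algorithms share the mean of the ranks they would occupy. A minimal sketch of that convention follows; the function name `average_ranks` is hypothetical, and the input values are the Wdbc and Wpbc rows of the KNN accuracy table.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the tie group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # Mean of the 1-based positions pos+1 .. end+1 occupied by the group.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


# KNN accuracies in the column order RS, NRS, CDA, MDNRS, ARDNE.
wdbc_ranks = average_ranks([0.911, 0.923, 0.923, 0.930, 0.959])
wpbc_ranks = average_ranks([0.743, 0.738, 0.738, 0.761, 0.753])
print(wdbc_ranks, wpbc_ranks)
```

For Wdbc, NRS and CDA tie at 0.923 and would occupy ranks 3 and 4, so each receives 3.5, which reproduces the Wdbc row above.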
Ranking of the five attribute reduction algorithms with SVM.
| Data Sets | RS | NRS | CDA | MDNRS | ARDNE |
|---|---|---|---|---|---|
| Ionosphere | 2 | 4 | 3 | 5 | 1 |
| Wdbc | 5 | 3.5 | 3.5 | 2 | 1 |
| Wine | 4 | 5 | 3 | 2 | 1 |
| Wpbc | 1 | 3.5 | 3.5 | 5 | 2 |
| Average | 3 | 4 | 3.25 | 3.5 | 1.25 |
F Values for the two classifiers.
| Statistic | KNN | SVM |
|---|---|---|
| $\chi_F^2$ | 13 | 7 |
| $F_F$ | 13 | 2.33 |
$F_F$—Iman-Davenport test and $\chi_F^2$—Friedman statistic.
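The values in this table are consistent with the standard forms of the two statistics, $\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$ and $F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}$, where $k$ is the number of algorithms, $N$ the number of data sets, and $R_j$ the average rank of algorithm $j$. The sketch below applies them to the average ranks of the five algorithms on the four UCI data sets; the formulas are assumed here because the record does not state them, and the function names are hypothetical.

```python
def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square computed from the average rank of each algorithm."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def iman_davenport(chi2, k, n_datasets):
    """Iman-Davenport F statistic derived from the Friedman chi-square."""
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


# Average ranks in the order RS, NRS, CDA, MDNRS, ARDNE (four data sets).
knn_ranks = [3.5, 4, 4.5, 1.75, 1.25]
svm_ranks = [3, 4, 3.25, 3.5, 1.25]

chi2_knn = friedman_statistic(knn_ranks, n_datasets=4)
chi2_svm = friedman_statistic(svm_ranks, n_datasets=4)
f_knn = iman_davenport(chi2_knn, k=5, n_datasets=4)
f_svm = iman_davenport(chi2_svm, k=5, n_datasets=4)
print(round(chi2_knn, 2), round(f_knn, 2), round(chi2_svm, 2), round(f_svm, 2))
```

With these inputs the Friedman statistic comes out at 13 for KNN and 7 for SVM, and the Iman-Davenport statistic at 13 and about 2.33, matching the table.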
Ranking of the four attribute reduction algorithms with KNN.
| Data Sets | MEAR | EGAR | ADNEAR | ARDNE |
|---|---|---|---|---|
| Brain_Tumor1 | 3 | 4 | 2 | 1 |
| Colon | 2 | 4 | 3 | 1 |
| DLBCL | 2 | 4 | 3 | 1 |
| Leukemia | 2 | 3.5 | 3.5 | 1 |
| SRBCT | 2 | 3.5 | 3.5 | 1 |
| Average | 2.2 | 3.8 | 3 | 1 |
Ranking of the four attribute reduction algorithms with SVM.
| Data Sets | MEAR | EGAR | ADNEAR | ARDNE |
|---|---|---|---|---|
| Brain_Tumor1 | 2 | 3.5 | 3.5 | 1 |
| Colon | 1 | 3.5 | 3.5 | 2 |
| DLBCL | 4 | 2.5 | 2.5 | 1 |
| Leukemia | 2 | 3.5 | 3.5 | 1 |
| SRBCT | 2 | 3.5 | 3.5 | 1 |
| Average | 2.2 | 3.3 | 3.3 | 1.2 |
F Values for the two classifiers.
| Statistic | KNN | SVM |
|---|---|---|
| $\chi_F^2$ | 12.84 | 9.18 |
| $F_F$ | 23.78 | 6.31 |
Ranking of the eight attribute reduction algorithms with SVM.
| Data Sets | SFS | SGL | ASGL-CMI | SC2 | FLD-NRS | LLE-NRS | RelieF+NRS | ARDNE |
|---|---|---|---|---|---|---|---|---|
| Lung | 6 | 7 | 5 | 8 | 4 | 3 | 2 | 1 |
| Prostate | 3 | 4 | 1.5 | 6 | 5 | 7 | 8 | 1.5 |
| Average | 4.5 | 5.5 | 3.25 | 7 | 4.5 | 5 | 5 | 1.25 |
F Values for the two classifiers.
| Statistic | SVM |
|---|---|
| $\chi_F^2$ | 6.63 |
| $F_F$ | 0.9 |