Lin Sun, Lanying Wang, Jiucheng Xu, Shiguang Zhang.
Abstract
For continuous numerical data sets, attribute reduction based on neighborhood rough sets is an important step for improving classification performance. However, most traditional reduction algorithms can only handle finite sets, and they yield low accuracy and high cardinality. In this paper, a novel attribute reduction method using Lebesgue and entropy measures in neighborhood rough sets is proposed, which can deal with continuous numerical data while maintaining the original classification information. First, the Fisher score method is employed to eliminate irrelevant attributes, significantly reducing the computational complexity for high-dimensional data sets. Then, the Lebesgue measure is introduced into neighborhood rough sets to investigate uncertainty measures. To analyze the uncertainty and noise of neighborhood decision systems, several neighborhood entropy-based uncertainty measures are presented on the basis of Lebesgue and entropy measures, and by combining the algebra view with the information view in neighborhood rough sets, a neighborhood roughness joint entropy is developed for neighborhood decision systems. Moreover, some of their properties are derived and the relationships among them are established, which helps to reveal the essence of knowledge and the uncertainty of neighborhood decision systems. Finally, a heuristic attribute reduction algorithm is designed to improve the classification performance of large-scale complex data. Experimental results on an illustrative example and several public data sets show that the proposed method is very effective at selecting the most relevant attributes with high classification accuracy.
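The Fisher score pre-filtering step mentioned in the abstract ranks each attribute by its between-class scatter relative to its within-class scatter, so irrelevant attributes can be dropped before the more expensive rough-set computation. A minimal sketch of the standard Fisher score (an illustrative implementation, not the authors' exact code):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per attribute: between-class scatter divided by
    within-class scatter. Higher scores mark more discriminative
    attributes; low-scoring attributes can be filtered out."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    num = np.zeros(X.shape[1])  # between-class scatter
    den = np.zeros(X.shape[1])  # within-class scatter
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)  # small constant guards against zero variance
```

In practice one would keep the top-k attributes by score and pass only those to the reduction algorithm.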
Keywords: Lebesgue measure; attribute reduction; neighborhood entropy; neighborhood rough sets; rough sets
Year: 2019 PMID: 33266854 PMCID: PMC7514624 DOI: 10.3390/e21020138
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1Flowchart of the attribute reduction algorithm for data classification.
A neighborhood decision system.
| Sample | a1 | a2 | a3 |
|---|---|---|---|
| x1 | 0.12 | 0.41 | 0.61 |
| x2 | 0.21 | 0.15 | 0.14 |
| x3 | 0.31 | 0.11 | 0.26 |
| x4 | 0.61 | 0.13 | 0.23 |
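In a neighborhood decision system like the sample above, the δ-neighborhood of each object collects all objects whose distance from it is at most δ; these neighborhoods drive the lower/upper approximations and the entropy measures. A minimal sketch, assuming a Euclidean metric, δ = 0.3, and row labels x1–x4 (all illustrative choices, not necessarily the paper's):

```python
import numpy as np

# Condition-attribute values from the sample table (rows x1..x4).
U = np.array([
    [0.12, 0.41, 0.61],
    [0.21, 0.15, 0.14],
    [0.31, 0.11, 0.26],
    [0.61, 0.13, 0.23],
])

def neighborhood(U, i, delta):
    """Indices of all samples within Euclidean distance delta of sample i
    (the delta-neighborhood always contains the sample itself)."""
    d = np.linalg.norm(U - U[i], axis=1)
    return set(np.flatnonzero(d <= delta))

delta = 0.3
nbrs = [neighborhood(U, i, delta) for i in range(len(U))]
```

With these values, x2 and x3 fall in each other's neighborhoods while x1 and x4 are isolated, showing how δ controls the granularity of the approximation space.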
Description of the nine public data sets.
| No. | Data sets | Samples | Attributes | Classes | Author |
|---|---|---|---|---|---|
| 1 | Wine | 178 | 13 | 3 | Faris et al. |
| 2 | Sonar | 208 | 60 | 2 | Wang and Li |
| 3 | Segmentation | 2310 | 19 | 7 | Liu et al. |
| 4 | Wdbc | 569 | 31 | 2 | Li et al. |
| 5 | Wpbc | 198 | 33 | 2 | Chen et al. |
| 6 | Prostate | 136 | 12600 | 2 | Mu et al. |
| 7 | DLBCL | 77 | 5469 | 2 | Sun et al. |
| 8 | Leukemia | 72 | 11225 | 3 | Sun et al. |
| 9 | Tumors | 327 | 12558 | 7 | Wang et al. |
Figure 2The classification accuracy versus the number of genes on the four gene expression data sets.
Figure 3Reduction rate and classification accuracy for nine data sets with neighborhood parameter values.
The reduction results and the number of selected attributes with the three reduction algorithms.
| Data Sets | FINEN | NEIEN | ARNRJE | δ |
|---|---|---|---|---|
| Wine | {1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13}/12 | {1, 2, 3, 4, 5, 7, 10, 11, 12, 13}/10 | {1, 2, 3, 5, 7, 10, 11, 13}/8 | 0.65 |
| Sonar | {1, 5, 9, 10, 11, 12, 18, 19, 22, 26, 27, 28, 29, 32, 35, 36, 37, 40, 45, 46, 48, 53, 57, 58, 59, 60}/26 | {6, 10, 11, 12, 15, 17, 18, 20, 21, 23, 24, 26, 28, 29, 30, 32, 33, 36, 37, 39, 40, 41, 42, 45, 48, 50, 54, 57}/29 | {2, 3, 4, 5, 9, 10, 11, 12, 13, 14, 16, 22, 24, 30, 32, 34, 36, 37, 38, 39, 46, 57, 60}/23 | 0.4 |
| Segmentation | {2, 5, 6, 7, 11, 12, 13, 17, 18}/9 | {2, 5, 6, 7, 11, 12, 13, 17, 18}/9 | {2, 5, 6, 10, 11, 13, 17, 18}/8 | 0.75 |
| Wdbc | {7, 8, 10, 12, 13, 16, 21, 22, 25, 27, 28, 29}/12 | {1, 7, 8, 10, 13, 16, 21, 22, 25, 27, 28, 29}/12 | {6, 8, 9, 11, 12, 14, 16, 19, 20, 25, 27, 28, 29}/13 | 0.35 |
| Wpbc | {1, 12, 13, 16, 24, 32}/6 | {1, 5, 12, 24, 32}/5 | {2, 19, 23, 24, 29, 31}/6 | 0.55 |
| Prostate | {4483, 6185, 8129, 8623, 8850, 9850, 10753, 12067}/8 | {4483, 4847, 6185, 6627, 8623, 8850, 9587, 12067}/8 | {11052, 6185, 8986, 5486, 6392, 5757, 8850, 4483}/8 | 0.9 |
| DLBCL | {453, 1570, 1698, 3127, 3257, 4767}/6 | {453, 2930, 3574, 4767, 5283}/5 | {453, 1479, 1570, 3127, 3257, 4767}/6 | 0.5 |
| Leukemia | {2833, 6720, 5555, 10127, 10038, 4839, 8952, 9053}/8 | {2833, 6720, 5555, 10127, 10038, 3479, 8964, 515}/8 | {461, 1787, 1834, 1962, 2131, 2356, 3821, 5552}/8 | 0.4 |
| Tumors | {2543, 7648, 3264, 6320, 5411, 6671, 8548, 7781, 10126, 6764, 4178, 4448, 8337, 3043, 4831, 3880}/16 | {5411, 6320, 7648, 3264, 3324, 6671, 4300, 6079, 6764, 10126, 8397, 8383, 9046, 7922, 10865, 8687, 2132}/17 | {3880, 843, 1730, 3342, 6151, 2960, 3264, 3596, 5624, 4026, 7648, 8383, 8332, 9788, 5412, 8556, 3324, 10126}/18 | 0.25 |
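The reducts in the table above are produced by heuristic forward selection: attributes are added one at a time, each step taking the attribute with the largest significance gain, until no attribute improves the measure. A generic sketch of such a greedy loop, with the significance function left abstract (a placeholder for the paper's neighborhood roughness joint entropy, not its exact definition):

```python
def greedy_reduct(attrs, significance):
    """Heuristic forward attribute reduction. `significance` maps an
    attribute subset (list) to a quality score; the loop repeatedly
    adds the attribute with the largest positive gain and stops when
    no candidate improves the score."""
    red = []
    best = significance(red)
    while True:
        gains = {a: significance(red + [a]) - best
                 for a in attrs if a not in red}
        if not gains:
            break  # every attribute already selected
        a_star = max(gains, key=gains.get)
        if gains[a_star] <= 0:
            break  # no candidate improves the measure
        red.append(a_star)
        best += gains[a_star]
    return red
```

The loop evaluates `significance` O(|attrs|) times per selected attribute, which is why the Fisher-score pre-filter matters for the gene-expression data sets with thousands of attributes.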
Average sizes of attribute subsets selected by the six methods using 10-fold cross validation.
| Data Sets | ODP | NRS | FRSINT | FINEN | NEIEN | ARNRJE |
|---|---|---|---|---|---|---|
| Wine | 13 | 9.1 | 8.1 | 12.3 | 10.2 | 9.5 |
| Sonar | 60 | 24.8 | 18.7 | 25.8 | 28.9 | 24.6 |
| Segmentation | 19 | 10.7 | 8.4 | 9.5 | 9.2 | 8.9 |
| Wdbc | 30 | 17.3 | 11.9 | 12.1 | 11.8 | 13.3 |
| Wpbc | 32 | 11.6 | 7.8 | 6.4 | 5.3 | 6.2 |
| Prostate | 12600 | 6.5 | 8.9 | 8.4 | 7.7 | 8.0 |
| DLBCL | 5469 | 8.3 | 8.8 | 6.1 | 5.3 | 7.6 |
| Leukemia | 11225 | 14.7 | 9.8 | 8.5 | 8.2 | 9.1 |
| Tumors | 12558 | 10.6 | 9.5 | 15.8 | 17.1 | 18.2 |
| Average | 4667.3 | 12.6 | 10.2 | 11.7 | 11.5 | 11.7 |
Classification accuracy of the six methods under the 3NN classifier.
| Data Sets | ODP | NRS | FRSINT | FINEN | NEIEN | ARNRJE |
|---|---|---|---|---|---|---|
| Wine | 0.9192 | 0.9453 | 0.9281 | 0.9577 | 0.9620 | 0.9775 |
| Sonar | 0.8605 | 0.8588 | 0.8504 | 0.8513 | 0.8326 | 0.8942 |
| Segmentation | 0.8714 | 0.8021 | 0.9506 | 0.9504 | 0.9488 | 0.8381 |
| Wdbc | 0.9432 | 0.9553 | 0.9366 | 0.9226 | 0.9385 | 0.9456 |
| Wpbc | 0.6667 | 0.6752 | 0.6312 | 0.6613 | 0.6263 | 0.6919 |
| Prostate | 0.8235 | 0.8329 | 0.8503 | 0.8689 | 0.8431 | 0.8897 |
| DLBCL | 0.8831 | 0.9610 | 0.9635 | 0.9635 | 0.9585 | 0.9740 |
| Leukemia | 0.7339 | 0.9274 | 0.8655 | 0.9246 | 0.8860 | 0.9306 |
| Tumors | 0.7074 | 0.7250 | 0.7239 | 0.7781 | 0.7372 | 0.7194 |
| Average | 0.8232 | 0.8537 | 0.8556 | 0.8754 | 0.8592 | 0.8734 |
Classification accuracy of the six methods under the LibSVM classifier.
| Data Sets | ODP | NRS | FRSINT | FINEN | NEIEN | ARNRJE |
|---|---|---|---|---|---|---|
| Wine | 0.9210 | 0.9213 | 0.9295 | 0.9503 | 0.9210 | 0.9719 |
| Sonar | 0.6587 | 0.7735 | 0.7909 | 0.8168 | 0.8036 | 0.7885 |
| Segmentation | 0.9048 | 0.8606 | 0.9438 | 0.9317 | 0.9356 | 0.9095 |
| Wdbc | 0.5167 | 0.9453 | 0.9362 | 0.9230 | 0.9449 | 0.9051 |
| Wpbc | 0.7374 | 0.7029 | 0.7002 | 0.6875 | 0.7188 | 0.7374 |
| Prostate | 0.8750 | 0.8353 | 0.8527 | 0.9039 | 0.8691 | 0.9118 |
| DLBCL | 0.8701 | 0.9231 | 0.9240 | 0.9051 | 0.8758 | 0.9351 |
| Leukemia | 0.4679 | 0.9165 | 0.9454 | 0.9122 | 0.9420 | 0.9583 |
| Tumors | 0.2788 | 0.7516 | 0.7308 | 0.7742 | 0.7712 | 0.7627 |
| Average | 0.6923 | 0.8478 | 0.8615 | 0.8672 | 0.8647 | 0.8756 |
The recall rate with the three reduction algorithms under 3NN.
| Data Sets | FINEN | NEIEN | ARNRJE |
|---|---|---|---|
| Wine | 0.961 | 0.978 | 1.000 |
| Sonar | 0.910 | 0.937 | 0.955 |
| Segmentation | 0.833 | 0.833 | 0.838 |
| Wdbc | 0.951 | 0.953 | 0.966 |
| Wpbc | 0.815 | 0.753 | 0.849 |
| Prostate | 0.934 | 0.882 | 0.890 |
| DLBCL | 1.000 | 0.974 | 0.974 |
| Leukemia | 0.917 | 0.958 | 0.979 |
| Tumors | 0.567 | 0.733 | 0.867 |
| Average | 0.876 | 0.889 | 0.924 |
The recall rate with the three reduction algorithms under LibSVM.
| Data Sets | FINEN | NEIEN | ARNRJE |
|---|---|---|---|
| Wine | 0.958 | 1.000 | 1.000 |
| Sonar | 0.771 | 0.688 | 0.874 |
| Segmentation | 0.567 | 0.567 | 0.667 |
| Wdbc | 0.930 | 0.627 | 0.924 |
| Wpbc | 0.737 | 0.737 | 0.768 |
| Prostate | 0.566 | 0.566 | 0.566 |
| DLBCL | 0.753 | 0.753 | 1.000 |
| Leukemia | 0.389 | 0.389 | 0.653 |
| Tumors | 0.567 | 0.691 | 0.566 |
| Average | 0.693 | 0.669 | 0.780 |