Shenglong Yu, Hong Zhao.
Abstract
Cost-sensitive feature selection is an important preprocessing step in machine learning and data mining. Most existing cost-sensitive feature selection algorithms are heuristic: they evaluate the importance of each feature individually and select features one by one, and therefore do not consider the relationships among features. In this paper, we propose a new algorithm for minimal-cost feature selection, called rough sets and Laplacian score based cost-sensitive feature selection. The importance of each feature is evaluated by both rough sets and the Laplacian score. Compared with heuristic algorithms, the proposed algorithm takes the relationships among features into account through the locality-preserving property of the Laplacian score. We select a feature subset with maximal feature importance and minimal cost when cost is incurred in parallel, where the cost is drawn from three different distributions to simulate different applications. Unlike existing cost-sensitive feature selection algorithms, our algorithm selects a predetermined number of "good" features simultaneously. Extensive experimental results show that the approach is efficient and effectively obtains the minimum-cost subset. In addition, the results of our method are more promising than those of other cost-sensitive feature selection algorithms.
Year: 2018 PMID: 29912884 PMCID: PMC6005488 DOI: 10.1371/journal.pone.0197564
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
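The abstract relies on the Laplacian score to capture locality among samples. The sketch below shows the standard Laplacian score computation (nearest-neighbour graph with heat-kernel weights), not the authors' combined rough-set measure. The function name `laplacian_score`, the parameters `k` and `t`, and the choice of a symmetric k-nearest-neighbour graph are assumptions made for illustration; the sample rows are the six rows of the Liver subtable listed on this page.

```python
import numpy as np

def laplacian_score(X, k=2, t=1.0):
    """Standard Laplacian score per feature (lower = better locality preservation).

    k and t are illustrative defaults, not values taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # Pairwise squared Euclidean distances between samples.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Exclude each sample from its own neighbour list.
    masked = sq.copy()
    np.fill_diagonal(masked, np.inf)
    nn = np.argsort(masked, axis=1)[:, :k]
    # Symmetric k-nearest-neighbour adjacency.
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), nn.ravel()] = True
    A = A | A.T
    # Heat-kernel weights on graph edges, zero elsewhere.
    S = np.where(A, np.exp(-sq / t), 0.0)
    d = S.sum(axis=1)          # degree of each sample
    L = np.diag(d) - S         # graph Laplacian
    scores = np.empty(m)
    for r in range(m):
        f = X[:, r]
        # Remove the degree-weighted mean of the feature.
        f_tilde = f - (f @ d) / d.sum()
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde ** 2 @ d)
    return scores

# Columns: Mcv, Alkphos, Sgpt, Sgot, Gammagt (Liver subtable rows).
X_sub = [
    [0.53, 0.60, 0.27, 0.29, 0.09],
    [0.68, 0.36, 0.20, 0.35, 0.12],
    [0.55, 0.27, 0.19, 0.14, 0.17],
    [0.68, 0.48, 0.20, 0.25, 0.11],
    [0.58, 0.57, 0.05, 0.30, 0.08],
    [0.87, 0.28, 0.06, 0.20, 0.09],
]
scores = laplacian_score(X_sub, k=2, t=1.0)
print(scores)
```

These scores are not expected to reproduce the feature importance vector reported for the subtable, since that vector also involves rough sets and the paper's own parameter choices.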
A numerical decision system (Liver).
| Mcv | Alkphos | Sgpt | Sgot | Gammagt | Selector |
|---|---|---|---|---|---|
| 0.53 | 0.60 | 0.27 | 0.29 | 0.09 | 1 |
| 0.53 | 0.36 | 0.36 | 0.35 | 0.06 | 2 |
| 0.55 | 0.27 | 0.19 | 0.14 | 0.17 | 2 |
| 0.68 | 0.48 | 0.20 | 0.25 | 0.11 | 2 |
| 0.58 | 0.41 | 0.05 | 0.30 | 0.02 | 2 |
| 0.87 | 0.28 | 0.06 | 0.16 | 0.04 | 2 |
| 0.61 | 0.34 | 0.11 | 0.16 | 0.01 | 2 |
| ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ |
| 0.68 | 0.39 | 0.15 | 0.27 | 0.03 | 1 |
| 0.87 | 0.66 | 0.35 | 0.52 | 0.21 | 1 |
An example of test cost vector.
| Mcv | Alkphos | Sgpt | Sgot | Gammagt |
|---|---|---|---|---|
| $16.00 | $20.00 | $45.00 | $28.00 | $33.00 |
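To make the test cost vector concrete, here is a minimal sketch of costing a candidate feature subset. It assumes an additive model in which the cost of a subset is the sum of its members' test costs; the page does not spell out how the paper aggregates costs, so treat this as an assumption. The names `TEST_COSTS` and `subset_cost` are hypothetical.

```python
# Test costs in dollars, copied from the example test cost vector.
TEST_COSTS = {
    "Mcv": 16.00,
    "Alkphos": 20.00,
    "Sgpt": 45.00,
    "Sgot": 28.00,
    "Gammagt": 33.00,
}

def subset_cost(features, costs=TEST_COSTS):
    """Total cost of a feature subset under an assumed additive cost model."""
    return sum(costs[f] for f in features)

print(subset_cost(["Mcv", "Sgot"]))   # 44.0
print(subset_cost(TEST_COSTS))        # 142.0 (all five tests)
```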
A subtable of the Liver decision system.
| Mcv | Alkphos | Sgpt | Sgot | Gammagt | Selector |
|---|---|---|---|---|---|
| 0.53 | 0.60 | 0.27 | 0.29 | 0.09 | 1 |
| 0.68 | 0.36 | 0.20 | 0.35 | 0.12 | 2 |
| 0.55 | 0.27 | 0.19 | 0.14 | 0.17 | 2 |
| 0.68 | 0.48 | 0.20 | 0.25 | 0.11 | 1 |
| 0.58 | 0.57 | 0.05 | 0.30 | 0.08 | 2 |
| 0.87 | 0.28 | 0.06 | 0.20 | 0.09 | 1 |
An error range vector.
| Mcv | Alkphos | Sgpt | Sgot | Gammagt |
|---|---|---|---|---|
| 0.06 | 0.04 | 0.02 | 0.03 | 0.01 |
A feature importance vector of the Liver subtable.
| Mcv | Alkphos | Sgpt | Sgot | Gammagt |
|---|---|---|---|---|
| 0.8456 | 0.7439 | 0.9506 | 0.9345 | 0.9680 |
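The abstract says the algorithm picks a predetermined number of features at once, favouring high importance and low cost. The sketch below illustrates one way such a trade-off could be scored, ranking features by importance divided by cost raised to an exponent. This scoring rule, the function `select_top_k`, and the exponent `alpha` are hypothetical; they are not taken from the paper and are not claimed to match the α reported in its results. The numbers are the importance and test cost vectors shown on this page, paired here purely for illustration.

```python
import numpy as np

FEATURES = ["Mcv", "Alkphos", "Sgpt", "Sgot", "Gammagt"]
importance = np.array([0.8456, 0.7439, 0.9506, 0.9345, 0.9680])
cost = np.array([16.0, 20.0, 45.0, 28.0, 33.0])

def select_top_k(importance, cost, k, alpha):
    """Pick k features at once by a hypothetical score importance / cost**alpha.

    alpha = 0 ignores cost entirely; larger alpha penalises expensive tests more.
    """
    score = importance / cost ** alpha
    # Stable sort on the negated score keeps ties in original order.
    top = np.argsort(-score, kind="stable")[:k]
    return [FEATURES[i] for i in top]

print(select_top_k(importance, cost, k=2, alpha=0.0))  # ['Gammagt', 'Sgpt']
print(select_top_k(importance, cost, k=2, alpha=1.0))  # ['Mcv', 'Alkphos']
```

With `alpha = 0` the two most important features are chosen regardless of price; with `alpha = 1` the selection shifts to the two cheapest tests, which shows how strongly the chosen subset can depend on the trade-off setting.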
Datasets information.
| No. | Name | Domain | Instances | Features | Classes |
|---|---|---|---|---|---|
| 1 | Liver | Clinic | 345 | 6 | 2 |
| 2 | Wpbc | Clinic | 198 | 33 | 2 |
| 3 | Promoters | Game | 106 | 57 | 2 |
| 4 | Voting | Society | 435 | 16 | 2 |
| 5 | Ionosphere | Physics | 351 | 34 | 2 |
| 6 | Credit-g | Commerce | 1000 | 20 | 2 |
| 7 | Waveform | Vocality | 5000 | 40 | 3 |
| 8 | Prostate-GE | Clinic | 102 | 5966 | 2 |
| 9 | SMK-CAN-187 | Society | 187 | 19993 | 2 |
Fig 1. Finding optimal factor of Liver dataset.
Fig 9. Finding optimal factor of Waveform dataset.
Fig 10. Average below factor of Liver dataset.
Fig 18. Average below factor of Waveform dataset.
Fig 19. Average exceeding factor of Liver dataset.
Fig 27. Average exceeding factor of Waveform dataset.
Results for α = 0 and α with the optimal setting.
| Dataset | Uniform (α = 0) | Normal (α = 0) | Pareto (α = 0) | Uniform (optimal α) | Normal (optimal α) | Pareto (optimal α) |
|---|---|---|---|---|---|---|
| Liver | 0.145 | 0.220 | 0.443 | 0.894 | 0.319 | 0.979 |
| Wpbc | 0.018 | 0.235 | 0.703 | 0.854 | 0.337 | 1.000 |
| Promoters | 0.000 | 0.040 | 0.295 | 0.920 | 0.415 | 1.000 |
| Voting | 0.440 | 0.510 | 0.523 | 0.979 | 0.661 | 0.991 |
| Ionosphere | 0.086 | 1.000 | 0.898 | 0.929 | 1.000 | 1.000 |
| Credit-g | 0.220 | 0.553 | 0.548 | 0.957 | 0.796 | 0.999 |
| Prostate-GE | 0.003 | 0.126 | 0.716 | 0.980 | 0.488 | 1.000 |
| SMK-CAN-187 | 0.003 | 0.018 | 0.741 | 0.994 | 0.565 | 1.000 |
| Waveform | 0.000 | 0.000 | 0.438 | 0.678 | 0.201 | 1.000 |
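The results are reported separately for test costs drawn from Uniform, Normal, and Pareto distributions. As a rough sketch of how such cost vectors could be simulated, the code below samples integer costs for a given number of features. The cost range of 1 to 100, the distribution parameters, the clipping of the Pareto tail, and the function name `sample_test_costs` are all assumptions for illustration; the page does not state the settings the authors used.

```python
import numpy as np

def sample_test_costs(n_features, distribution, low=1, high=100, seed=0):
    """Draw integer test costs in [low, high] from one of three distributions.

    All parameters below are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    if distribution == "uniform":
        raw = rng.uniform(0.0, 1.0, n_features)
    elif distribution == "normal":
        # Centered in the middle of the range, clipped to stay inside it.
        raw = np.clip(rng.normal(0.5, 0.15, n_features), 0.0, 1.0)
    elif distribution == "pareto":
        # NumPy samples a Lomax (Pareto II) variate >= 0; cap the heavy tail at 5.
        raw = np.minimum(rng.pareto(2.0, n_features), 5.0) / 5.0
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    # Map [0, 1] onto the integer cost range.
    return np.rint(low + raw * (high - low)).astype(int)

for name in ("uniform", "normal", "pareto"):
    print(name, sample_test_costs(5, name))
```

Under these assumptions the Pareto setting produces mostly cheap tests with an occasional expensive one, whereas the Normal setting concentrates costs around the middle of the range.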
Fig 28. Classification accuracy.
Fig 29. Finding optimal factor.