Xuyang Teng, Hongbin Dong, Xiurong Zhou.
Abstract
Feature selection is an important preprocessing method in machine learning and data mining. This process can be used not only to reduce the amount of data to be analyzed but also to build models with stronger interpretability based on fewer features. Traditional feature selection methods evaluate the dependency and redundancy of features separately, which leads to a lack of measurement of their combined effect. Moreover, a greedy search considers only the optimization of the current round and thus cannot be a global search. To evaluate the combined effect of different subsets in the entire feature space, an adaptive feature selection method based on V-shaped binary particle swarm optimization is proposed. In this method, the fitness function is constructed using the correlation information entropy. Feature subsets are regarded as individuals in a population, and the feature space is searched using V-shaped binary particle swarm optimization. The above procedure overcomes the hard constraint on the number of features, enables the combined evaluation of each subset as a whole, and improves the search ability of conventional binary particle swarm optimization. The proposed algorithm is an adaptive method with respect to the number of feature subsets. The experimental results show the advantages of optimizing the feature subsets using the V-shaped transfer function and confirm the effectiveness and efficiency of the feature subsets obtained under different classifiers.
Year: 2017 PMID: 28358850 PMCID: PMC5373580 DOI: 10.1371/journal.pone.0173907
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
S-shaped and V-shaped families of transfer functions.
| S-shaped name | S-shaped transfer function | V-shaped name | V-shaped transfer function |
|---|---|---|---|
| S1 | | V1 | |
| S2 | | V2 | $\lvert\tanh(x)\rvert$ |
| S3 | | V3 | |
| S4 | | V4 | |
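For orientation, the sketch below writes out the S-shaped and V-shaped families as they are commonly defined in the binary PSO literature. Only the `tanh` form of V2 survives in the table above; the other seven formulas are assumptions taken from that general literature and may differ from the ones the authors used. The names `S_SHAPED`, `V_SHAPED`, `update_bit_s`, and `update_bit_v` are illustrative, not from the paper.

```python
import math
import random

# S-shaped family: sigmoid variants (assumed standard definitions).
S_SHAPED = {
    "S1": lambda x: 1.0 / (1.0 + math.exp(-2.0 * x)),
    "S2": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "S3": lambda x: 1.0 / (1.0 + math.exp(-x / 2.0)),
    "S4": lambda x: 1.0 / (1.0 + math.exp(-x / 3.0)),
}

# V-shaped family: symmetric around zero (V2 matches the table; the rest are assumed).
V_SHAPED = {
    "V1": lambda x: abs(math.erf(math.sqrt(math.pi) / 2.0 * x)),
    "V2": lambda x: abs(math.tanh(x)),
    "V3": lambda x: abs(x / math.sqrt(1.0 + x * x)),
    "V4": lambda x: abs(2.0 / math.pi * math.atan(math.pi / 2.0 * x)),
}


def update_bit_s(velocity, transfer, rng):
    """S-shaped rule: set the bit to 1 with probability T(v), otherwise to 0."""
    return 1 if rng.random() < transfer(velocity) else 0


def update_bit_v(bit, velocity, transfer, rng):
    """V-shaped rule: flip the current bit with probability T(v), otherwise keep it."""
    return 1 - bit if rng.random() < transfer(velocity) else bit


if __name__ == "__main__":
    rng = random.Random(0)
    for name, fn in V_SHAPED.items():
        print(name, round(fn(1.0), 4))
    print(update_bit_v(1, 0.0, V_SHAPED["V2"], rng))  # zero velocity keeps the bit
```

The practical difference is that an S-shaped rule overwrites the bit regardless of its current value, whereas a V-shaped rule leaves the bit unchanged when the velocity is near zero and flips it only when the velocity magnitude is large.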
Descriptions of UCI benchmark datasets.
| No. | Dataset | Number of Instances | Number of Features | Number of Classes | Scientific Area |
|---|---|---|---|---|---|
| 1 | Breast Cancer | 569 | 32 | 2 | Biology |
| 2 | Dermatology | 366 | 33 | 6 | Biology |
| 3 | Soybean | 683 | 35 | 19 | Biology |
| 4 | QSAR | 1055 | 41 | 2 | Chemometrics |
| 5 | Synthetic Control | 600 | 60 | 6 | Computer |
| 6 | Mice Protein | 1080 | 82 | 8 | Biology |
| 7 | Gas Sensor Array | 13910 | 129 | 6 | Computer |
| 8 | Musk | 6598 | 168 | 2 | Physical |
| 9 | Multi-feature pixel | 2000 | 240 | 10 | Computer |
| 10 | Isolet | 1559 | 618 | 26 | Computer |
Minimization results for fitness function 1 using different transfer functions.
| Dataset | S1 | S2 | S3 | S4 | V1 | V2 | V3 | V4 |
|---|---|---|---|---|---|---|---|---|
| Dermatology | 5.85E-06 | 5.11E-06 | 9.83E-06 | 9.27E-06 | 5.39E-06 | 4.61E-06 | ||
| Soybean | 6.59E-03 | 1.36E-02 | 5.60E-02 | 9.47E-02 | 2.58E-03 | 1.94E-03 | ||
| Synthetic control | 3.47E-04 | 3.75E-03 | 4.39E-02 | 8.20E-02 | 4.29E-05 | 4.26E-05 | ||
| Mice Protein | 9.57E-05 | 6.91E-04 | 1.52E-02 | 3.10E-02 | 8.90E-06 | 1.01E-05 | ||
| Pixel | 5.88E-02 | 7.39E-02 | 7.64E-02 | 8.08E-02 | 5.44E-06 | 7.77E-06 | ||
| Isolet | 8.86E-02 | 9.43E-02 | 9.36E-02 | 9.27E-02 | 4.95E-06 | 5.39E-06 | ||
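To make the kind of experiment behind this table concrete, here is a minimal sketch of a binary PSO loop that minimizes a fitness function over bit masks using a V-shaped (`tanh`) transfer function. It is not the authors' implementation: the fitness function, the parameter values (`w`, `c1`, `c2`, `v_max`, swarm size), and all function names are hypothetical, and the paper's correlation-information-entropy fitness is replaced by a toy Hamming-distance objective.

```python
import math
import random


def v2(x):
    """V-shaped transfer function |tanh(x)|."""
    return abs(math.tanh(x))


def bpso_minimize(fitness, n_bits, n_particles=20, n_iter=100,
                  w=1.0, c1=2.0, c2=2.0, v_max=6.0, seed=0):
    """Minimize `fitness` over bit lists of length `n_bits` (illustrative parameters)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_bits):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # clamp the velocity
                # V-shaped rule: flip the bit with probability T(v).
                if rng.random() < v2(vel[i][d]):
                    pos[i][d] = 1 - pos[i][d]
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


# Toy objective (an assumption for illustration): distance to a hidden feature mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]


def toy_fitness(bits):
    return sum(b != t for b, t in zip(bits, TARGET))


if __name__ == "__main__":
    best, best_fit = bpso_minimize(toy_fitness, n_bits=len(TARGET))
    print(best, best_fit)
```

Swapping `v2` for another transfer function, and the flip rule for the S-shaped overwrite rule, is the only change needed to compare families on the same objective.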
Parameter testing of probability p and inertia weight w.
| p | Features selected: Dermatology | Features selected: QSAR | Features selected: Synthetic | Features selected: Pixel | w | Fitness value: Dermatology | Fitness value: QSAR | Fitness value: Synthetic | Fitness value: Pixel |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 3 | 4 | 5 | 19 | 0.5 | 6.09E-06 | 1.17E-02 | 1.04E-02 | 8.62E-06 |
| 0.2 | 4 | 10 | 8 | 36 | 1.0 | 4.52E-06 | 9.15E-03 | 9.97E-03 | 4.69E-06 |
| 0.3 | 8 | 14 | 13 | 58 | 1.5 | 4.39E-06 | 2.94E-06 | ||
| 0.4 | 10 | 16 | 20 | 89 | 2.0 | 1.76E-05 | 3.74E-05 | ||
| 0.5 | 13 | 20 | 27 | 96 | 2.5 | 4.54E-06 | 1.26E-04 | 8.33E-05 | 3.05E-07 |
Comparison of classification accuracy for three classifiers on full set.
| Dataset | SVM accuracy/% | 1-NN accuracy/% | Naïve Bayes accuracy/% |
|---|---|---|---|
| Breast Cancer | 97.92 | 95.96 | 92.97 |
| Dermatology | 95.35 | 94.54 | 97.54 |
| Soybean | 93.85 | 91.22 | 92.97 |
| QSAR | 85.59 | 84.46 | 75.92 |
| Synthetic Control | 99.17 | 96.50 | 94.67 |
| Mice Protein | 100 | 99.26 | 87.50 |
| Gas Sensor Array | 97.14 | 99.47 | 59.47 |
| Musk | 94.92 | 95.80 | 83.86 |
| Multi-feature Pixel | 97.55 | 96.15 | 93.3 |
| Isolet | 96.81 | 89.58 | 84.21 |
| Avg A | |||
Comparison of classification accuracy for the SVM classifier. Each cell gives the classification accuracy/% with the rank of that accuracy in parentheses.
| No. | FCBF | IG | ReliefF | mRMR | SFS | CMFS- | VPFS-Avg | VPFS-Best |
|---|---|---|---|---|---|---|---|---|
| 1 | 95.78(2.5) | 93.50(6) | 93.32(7) | 94.55(5) | 95.78(2.5) | 94.90(4) | 96.17(1) | |
| 2 | 95.67(3) | 85.25(7) | 95.36(4) | 93.99(5.5) | 93.99(5.5) | 97.00(2) | 97.87(1) | |
| 3 | 91.80(5) | 92.83(2) | 92.68(3) | 92.97(1) | 90.19(6) | 90.04(7) | 92.31(4) | |
| 4 | 73.74(7) | 83.32(4) | 82.46(5) | 80.47(6) | 84.08(2) | 83.98(3) | 84.99(1) | |
| 5 | 82.67(4) | 72.33(7) | 77.00(5) | 73.83(6) | 93.83(2) | 90.83(3) | 94.86(1) | |
| 6 | 96.29(3) | 99.73(2) | 95.28(4) | 93.25(6) | 92.50(7) | 94.45(5) | 96.48 | |
| 7 | 84.06(7) | 84.14(6) | 84.28(5) | 95.14(2) | 85.44(4) | 93.67(3) | 97.86(1) | |
| 8 | 84.58(7) | 91.06(4) | 88.31(5) | 87.59(6) | 93.48(2) | 92.54(3) | 94.59(1) | |
| 9 | 93.90(5) | 90.45(7) | 92.10(6) | 94.40(4) | 94.95(2) | 94.90(3) | 96.22(1) | |
| 10 | 84.67(3) | 73.32(6) | 51.21(7) | 85.54(1) | 80.56(5) | 82.48(4) | 85.12(2) | |
| W/T/L | 9/0/1 | 8/0/2 | 8/0/2 | 7/0/3 | 10/0/0 | 10/0/0 | | |
| Avg. rank | 4.65 | 5.1 | 4.8 | 4.05 | 3.7 | 3.9 | 1.8 | |
| p-value | 0.0032 | 0.0006 | 0.0019 | 0.0198 | 0.0486 | 0.0295 | | |
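The parenthesized ranks in this table can be reproduced by ranking each row's accuracies in descending order and giving tied values the mean of the ranks they span. The sketch below is one way to do that; the function name `average_ranks` is illustrative, and reading the row of values beginning 4.65 as the per-method mean rank is an inference from the arithmetic rather than something the table states.

```python
def average_ranks(accuracies):
    """Rank values in descending order (1 = best); tied values share their mean rank."""
    order = sorted(range(len(accuracies)), key=lambda i: -accuracies[i])
    ranks = [0.0] * len(accuracies)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and accuracies[order[end + 1]] == accuracies[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


# Dataset 1, SVM classifier, seven populated columns as they appear in the table.
row_1 = [95.78, 93.50, 93.32, 94.55, 95.78, 94.90, 96.17]
print(average_ranks(row_1))  # [2.5, 6.0, 7.0, 5.0, 2.5, 4.0, 1.0]

# First-column (FCBF) ranks over the ten datasets, copied from the table.
fcbf_ranks = [2.5, 3, 5, 7, 4, 3, 7, 7, 5, 3]
print(sum(fcbf_ranks) / len(fcbf_ranks))  # 4.65
```

The two FCBF and SFS entries of 95.78 tie for second and third place, which is why both receive 2.5.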
Comparison of classification accuracy for the naïve Bayes classifier. Each cell gives the classification accuracy/% with the rank of that accuracy in parentheses.
| No. | FCBF | IG | ReliefF | mRMR | SFS | CMFS- | VPFS-Avg | VPFS-Best |
|---|---|---|---|---|---|---|---|---|
| 1 | 95.08(2) | 94.73(3) | 92.79(7) | 94.55(4) | 94.38(5.5) | 94.38(5.5) | 95.52(1) | |
| 2 | 96.98(1) | 86.89(7) | 96.72(3) | 96.45(4.5) | 94.81(6) | 96.45(4.5) | 96.83(2) | |
| 3 | 90.04(2) | 87.99(5) | 89.31(3.5) | 89.31(3.5) | 85.94(7) | 86.68(6) | 91.21(1) | |
| 4 | 63.98(7) | 73.65(6) | 75.17(5) | 75.83(4) | 77.44(3) | 81.61(2) | 81.73(1) | |
| 5 | 80.00(6) | 77.67(7) | 80.67(5) | 82.67(4) | 94.50(1) | 91.12(3) | 94.29(2) | |
| 6 | 94.07(3) | 98.42(2) | 83.89(4) | 75.46(7) | 82.68(6) | 83.42(5) | 87.96 | |
| 7 | 55.75(7) | 65.86(4) | 61.92(6) | 77.89(2) | 62.06(5) | 68.03(3) | 82.92(1) | |
| 8 | 76.46(7) | 86.47(4) | 84.72(5) | 84.59(6) | 89.13(3) | 90.65(2) | 90.95(1) | |
| 9 | 91.15(2) | 82.7(7) | 84.25(6) | 88.15(5) | 91.40(1) | 90.45(4) | 90.99(3) | |
| 10 | 84.22(2) | 55.23(6) | 35.23(7) | 68.77(5) | 70.37(4) | 82.80(3) | 85.17(1) | |
| W/T/L | 7/0/3 | 9/0/1 | 9/0/1 | 9/0/1 | 8/0/2 | 10/0/0 | | |
| Avg. rank | 3.9 | 5 | 4.95 | 4.2 | 4.25 | 3.9 | 1.8 | |
| p-value | 0.0295 | 0.0009 | 0.0011 | 0.0129 | 0.0112 | 0.0295 | | |
Comparison of classification accuracy for the 1-NN classifier. Each cell gives the classification accuracy/% with the rank of that accuracy in parentheses.
| No. | FCBF | IG | ReliefF | mRMR | SFS | CMFS- | VPFS-Avg | VPFS-Best |
|---|---|---|---|---|---|---|---|---|
| 1 | 94.55(4.5) | 94.55(4.5) | 94.20(6) | 93.32(7) | 94.38(3) | 94.90(2) | 95.06(1) | |
| 2 | 95.36(1) | 83.33(7) | 94.26(2) | 93.44(4) | 89.87(5) | 88.25(6) | 93.96(3) | |
| 3 | 86.38(5) | 86.53(4) | 89.02(3) | 83.31(7) | 83.89(6) | 89.31(2) | 89.59(1) | |
| 4 | 81.13(7) | 82.08(5) | 83.13(2) | 82.94(3.5) | 82.94(3.5) | 81.52(6) | 83.14(1) | |
| 5 | 80.50(7) | 83.00(6) | 84.00(4) | 85.67(3) | 83.17(5) | 89.33(2) | 91.13(1) | |
| 6 | 94.63(6) | 98.52(3) | 70.37(7) | 97.50(5) | 98.00(4) | 99.26 | ||
| 7 | 99.22(3) | 99.30(2) | 97.64(6) | 98.92(4) | 95.69(7) | 98.48(5) | 99.31 | |
| 8 | 93.14(7) | 93.66(5) | 94.14(4) | 93.45(6) | 94.60(2) | 94.40(3) | 94.68 | |
| 9 | 92.80(2) | 84.4(7) | 86.75(6) | 90.40(5) | 92.25(4) | 93.75(1) | 92.60(3) | |
| 10 | 66.45(6) | 68.30(4) | 48.93(7) | 80.07(1) | 71.07(3) | 67.12(5) | 74.88(2) | |
| W/T/L | 7/0/3 | 7/0/3 | 7/0/3 | 8/0/2 | 9/0/1 | 9/0/1 | | |
| Avg. rank | 4.85 | 4.1 | 3.85 | 4.35 | 4.65 | 3.8 | 2.4 | |
| p-value | 0.0112 | 0.0769 | 0.1289 | 0.0431 | 0.0198 | 0.1418 | | |
Comparison of compression ratio. Each cell gives the number of features in the subset with the compression ratio/% in parentheses.
| No. | FCBF | IG | ReliefF | mRMR | SFS | CMFS- | VPFS-Avg | VPFS-Best |
|---|---|---|---|---|---|---|---|---|
| 1 | 7( | 7( | 7( | 7( | 7( | 7( | 7( | 7( |
| 2 | 16(51.52) | 16(51.52) | 16(51.52) | 16(51.52) | 16(51.52) | 13( | 16(51.52) | 16(51.52) |
| 3 | 16(54.29) | 16(54.29) | 16(54.29) | 16(54.29) | 16(54.29) | 12( | 17(51.43) | 16(54.29) |
| 4 | 5( | 23(43.90) | 23(43.90) | 23(43.90) | 23(43.90) | 23(43.90) | 22(46.34) | 21(48.78) |
| 5 | 15(75.00) | 14(76.67) | 14(76.67) | 14(76.67) | 14(76.67) | 11( | 12(80.00) | 14(76.67) |
| 6 | 17( | 36(56.10) | 36(56.10) | 36(56.10) | 36(56.10) | 28(65.85) | 35(57.32) | 36(56.10) |
| 7 | 12( | 14(89.15) | 14(89.15) | 14(89.15) | 14(89.15) | 14(89.15) | 13(89.92) | 13(89.92) |
| 8 | 6( | 12(92.86) | 12(92.86) | 14(91.67) | 14(91.67) | 12(92.86) | 14(91.67) | 15(91.07) |
| 9 | 27( | 58(75.83) | 58(75.83) | 58(75.83) | 58(75.83) | 44(81.67) | 59(75.42) | 60(75.00) |
| 10 | 31( | 49(92.07) | 49(92.07) | 49(92.07) | 49(92.07) | 39(93.69) | 49(92.07) | 55(91.10) |
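The intact cells in this table are consistent with a compression ratio defined as the percentage of features removed, using the feature totals from the dataset description table (for example, 16 of 33 Dermatology features gives 51.52%). The sketch below states that relationship; the formula is inferred from the surviving values rather than quoted from the paper, and the name `compression_ratio` is illustrative.

```python
def compression_ratio(n_selected, n_total):
    """Percentage of features removed, rounded to two decimals (inferred definition)."""
    if not 0 <= n_selected <= n_total or n_total == 0:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    return round(100.0 * (1.0 - n_selected / n_total), 2)


# Checks against intact table cells: (selected, total features) -> ratio/%.
print(compression_ratio(16, 33))   # 51.52  (Dermatology)
print(compression_ratio(23, 41))   # 43.9   (QSAR)
print(compression_ratio(14, 129))  # 89.15  (Gas Sensor Array)
print(compression_ratio(49, 618))  # 92.07  (Isolet)
```

A higher ratio means a smaller retained subset, so the accuracy tables and this table need to be read together when comparing methods.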
Run times on UCI datasets, in seconds.
| Dataset | CoFS | mRMR | CMFS- | SFS | VPFS |
|---|---|---|---|---|---|
| Synthetic Control | 1.06 | 0.31 | 0.16 | 0.27 | 1.01 |
| Multi-Feature Pixel | 44.37 | 22.28 | 2.48 | 15.09 | 15.92 |
| Isolet | 124.02 | 57.95 | 42.17 | 90.92 | 18.86 |