Abstract
Feature selection (FS) is an important preprocessing step in data mining that removes redundant or irrelevant features from high-dimensional data. Most optimization algorithms for FS problems do not search in a balanced way. This paper proposes a hybrid algorithm, the nonlinear binary grasshopper whale optimization algorithm (NL-BGWOA), to address this problem. The proposed method introduces a new position updating strategy that combines the position changes of the whale and grasshopper populations and improves the diversity of the search in the target domain. Ten distinct high-dimensional UCI datasets, the multi-modal Parkinson's speech datasets, and the COVID-19 symptom dataset are used to validate the proposed method. NL-BGWOA performs well on most of the high-dimensional datasets, reaching an accuracy of up to 0.9895. The experimental results on the medical datasets also demonstrate the advantages of the proposed method on a real FS problem, with best values of 0.913 for accuracy, 5.7 for feature subset size, and 0.0873 for fitness. The results indicate that the proposed NL-BGWOA is superior overall in solving the FS problem for high-dimensional data.
© Jilin University 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Keywords: Biomimetic position updating strategy; Feature selection; High-dimensional UCI datasets; Hybrid bionic optimization algorithm; Multi-modal medical datasets; Nature-inspired algorithm
Year: 2022; PMID: 36089930; PMCID: PMC9449924; DOI: 10.1007/s42235-022-00253-6
Source DB: PubMed; Journal: J Bionic Eng; ISSN: 1672-6529; Impact factor: 2.995
Differences in bionic behavior between some nature-inspired meta-heuristic algorithms
| Algorithm | Inspiration | References |
|---|---|---|
| Aquila Optimizer (AO) | Aquila Bird | [ |
| Arithmetic Optimization Algorithm (AOA) | Arithmetic Operators | [ |
| Butterfly Optimization Algorithm (BOA) | Food Search and Mating Behavior of Butterflies | [ |
| Dwarf Mongoose Optimization algorithm (DMO) | Behavior of The Dwarf Mongoose | [ |
| Ebola Optimization Search Algorithm (EOSA) | Ebola Virus | [ |
| Genetic Algorithm (GA) | Evolutionary Biology | [ |
| Grasshopper Optimization Algorithm (GOA) | Foraging and Swarming Behavior of Grasshoppers | [ |
| Grey Wolf Optimization (GWO) | Hunting Process of Grey Wolves | [ |
| Particle Swarm Optimization (PSO) | Simplified Social Model | [ |
| Reptile Search Algorithm (RSA) | Behavior of Crocodiles | [ |
| Whale Optimization Algorithm (WOA) | Social Behavior of Humpback Whales | [ |
Some hybrid optimization algorithms
| References | Hybrid methods | Abbreviation |
|---|---|---|
| [ | WOA and SA | WOASA(T)-1, 2 |
| [ | GWO and PSO | BGWOPSO |
| [ | GWO and GOA | GWO–GOA |
| [ | MA and KHA | MAKHA |
| [ | SOA and TEO | SOA-TEO1, 2, 3 |
| [ | MPMD and WOA | MPMDIWOA |
| [ | GWO and CSA | GWOCSA |
| [ | BA and PSO | HBBEPSO |
| [ | CRO and SA | BCROSAT |
| [ | ACO and BCO | AC-ABC |
Fig. 1 Behavior of individual grasshoppers and whale
Fig. 2 The comparison between the nonlinear and linear coefficients
Fig. 3 Comparison of adaptive weight with different
Fig. 4 The proposed NL-BGWOA
Fig. 5 The overall structure of the proposed NL-BGWOA
Parameter settings of the algorithms used for comparison in the current study
| Algorithm | Parameter | Values |
|---|---|---|
| PSO(BPSO) | Number of particles | 10 |
| | Maximum number of iterations | 100 |
| | Inertia weight | 1 |
| | Acceleration constants in PSO(BPSO) | [2,2] |
| WOA(BWOA) | Number of whales | 10 |
| | Maximum number of iterations | 100 |
| | Classification quality coefficient | 0.99 |
| GOA(BGOA) | Number of grasshoppers | 10 |
| | Maximum number of iterations | 100 |
| | Linear decrease coefficient | (1, 1.0E−04) |
| | Classification quality coefficient | 0.99 |
| NL-BGWOA | Number of individuals | 10 |
| | Maximum number of iterations | 100 |
| | Classification quality coefficient | 0.99 |
| | Linear decrease coefficient | (1, 1.0E−04) |
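The parameter table lists a classification quality coefficient of 0.99 and a linear decrease coefficient of (1, 1.0E−04), but this record does not give the formulas they enter. The minimal Python sketch below shows one common reading from the wrapper feature selection literature, which is an assumption and not confirmed here: the fitness is a weighted sum of the classification error and the fraction of selected features (lower is better), and the coefficient falls linearly from 1 to 1.0E−04 over the iterations. The function names `fitness` and `linear_coefficient` and their parameters are hypothetical.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Weighted sum of classification error and relative subset size.

    Assumption: alpha plays the role of the "classification quality
    coefficient" (0.99 in the parameter table). Lower values are better.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return alpha * error_rate + (1.0 - alpha) * n_selected / n_total


def linear_coefficient(t, max_iter, c_max=1.0, c_min=1.0e-4):
    """Coefficient decreasing linearly from c_max (t = 0) to c_min (t = max_iter).

    Assumption: the pair (1, 1.0E-04) in the table is read as (c_max, c_min).
    """
    if max_iter <= 0:
        raise ValueError("max_iter must be positive")
    return c_max - t * (c_max - c_min) / max_iter


# Example: a subset of 10 out of 34 features with a 5% error rate,
# and the coefficient halfway through a 100-iteration run.
example_fitness = fitness(error_rate=0.05, n_selected=10, n_total=34)
example_coefficient = linear_coefficient(t=50, max_iter=100)
```

With `alpha = 0.99` the error term dominates, so under this reading the subset size mostly acts as a tie-breaker between subsets of similar accuracy.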
Benchmark datasets
| Dataset | Name | No. of features | No. of samples |
|---|---|---|---|
| D1 | Arrhythmia | 279 | 452 |
| D2 | BreastEW | 32 | 596 |
| D3 | Clean1 | 165 | 476 |
| D4 | Clean2 | 165 | 6598 |
| D5 | Dermatology | 34 | 366 |
| D6 | Hill-Valley | 100 | 606 |
| D7 | IonosphereEW | 34 | 351 |
| D8 | SonarEW | 60 | 208 |
| D9 | Spambase | 57 | 4601 |
| D10 | WaveformEW | 40 | 5000 |
Multi-modal Parkinson's disease speech datasets
| Dataset | No. of features | No. of samples |
|---|---|---|
| P1 | 754 | 756 |
| P2 | 26 | 1040 |
COVID-19 symptom dataset
| No. of samples | Label | Symptoms |
|---|---|---|
| 2575 | Boolean (COVID-19 positive or negative) | Age, fever, body pain, runny nose, difficult breathing, and the infection probability of COVID-19 patients |
Mean classification accuracy
| Datasets | PSO | WOA | GOA | BPSO | BWOA | BGOA | NL-BGWOA |
|---|---|---|---|---|---|---|---|
| D1 | 0.5841 | 0.5911 | 0.5833 | 0.5810 | 0.5977 | 0.5926 | |
| D2 | 0.9588 | 0.9566 | 0.9737 | 0.9558 | 0.9823 | 0.9823 | |
| D3 | 0.8549 | 0.9100 | 0.8931 | 0.9021 | 0.9205 | 0.9180 | |
| D4 | 0.9465 | 0.9713 | 0.9645 | 0.9692 | 0.9723 | 0.9604 | |
| D5 | 0.9863 | 0.9863 | 0.9534 | 0.9863 | 0.9786 | 0.9863 | |
| D6 | 0.5816 | 0.5560 | 0.5818 | 0.5597 | 0.6237 | 0.6045 | |
| D7 | 0.9307 | 0.9420 | 0.8893 | 0.9200 | 0.9371 | 0.9164 | |
| D8 | 0.8667 | 0.8988 | 0.8667 | 0.8654 | 0.8805 | 0.8857 | |
| D9 | 0.9105 | 0.8850 | 0.8934 | 0.8787 | 0.9160 | 0.9167 | |
| D10 | 0.7120 | 0.8056 | 0.7582 | 0.7610 | 0.8120 | 0.8104 | |
Fig. 6 Mean classification accuracy for several representative datasets
Mean size of selected feature subsets
| Datasets | PSO | WOA | GOA | BPSO | BWOA | BGOA | NL-BGWOA |
|---|---|---|---|---|---|---|---|
| D1 | 137.6 | 143.0 | 128.1 | 140.0 | 134.3 | 173.9 | |
| D2 | 14.7 | 13.5 | 12.8 | 14.8 | 10.2 | 10.9 | |
| D3 | 104.9 | 86.8 | 78.7 | 82.1 | 64.9 | 86.6 | |
| D4 | 109.4 | 78 | 79.7 | 84.6 | 68.3 | 71.5 | |
| D5 | 19.7 | 22.1 | 17.3 | 18.3 | 20.5 | 15.8 | |
| D6 | 51.3 | 48.5 | 47.9 | 46.0 | 39.6 | 55.0 | |
| D7 | 14.7 | 13.6 | 15.4 | 15.0 | 5.1 | 13.1 | |
| D8 | 37.6 | 28.6 | 29.1 | 27.6 | 14.5 | 27.4 | |
| D9 | 29.9 | 31.2 | 27 | 29.9 | 24.9 | 29.5 | |
| D10 | 35.8 | 32.5 | 17.7 | 22.7 | 11.0 | 21.7 | |
Fig. 7 Mean size of selected feature subsets for several representative datasets
Mean fitness values
| Datasets | PSO | WOA | GOA | BPSO | BWOA | BGOA | NL-BGWOA |
|---|---|---|---|---|---|---|---|
| D1 | 0.4569 | 0.4404 | 0.4207 | 0.4147 | 0.4113 | 0.4141 | |
| D2 | 0.0412 | 0.0434 | 0.0659 | 0.0442 | 0.0205 | 0.0205 | |
| D3 | 0.1508 | 0.1719 | 0.2170 | 0.1783 | 0.1636 | 0.2139 | |
| D4 | 0.0596 | 0.0382 | 0.0800 | 0.0368 | 0.0371 | 0.0747 | |
| D5 | 0.0549 | 0.0234 | 0.0697 | 0.0623 | 0.0215 | 0.0526 | |
| D6 | 0.4468 | 0.4450 | 0.4243 | 0.4405 | 0.4256 | 0.4109 | |
| D7 | 0.0693 | 0.0579 | 0.1449 | 0.0800 | 0.1268 | 0.1736 | |
| D8 | 0.1382 | 0.1684 | 0.1378 | 0.1207 | 0.2305 | 0.1143 | |
| D9 | 0.1305 | 0.1227 | 0.1911 | 0.1970 | 0.1241 | 0.1258 | |
| D10 | 0.2871 | 0.1944 | 0.3063 | 0.2243 | 0.1965 | 0.2249 | |
Fig. 8 Mean fitness values comparison for several representative datasets
Evaluation of the current study on the multi-modal Parkinson datasets
| Algorithms | D1 Accuracy | D1 Num-FS | D1 Fitness | D2 Accuracy | D2 Num-FS | D2 Fitness |
|---|---|---|---|---|---|---|
| PSO | 0.765 | 378.6 | 0.2354 | 0.694 | 13.6 | 0.3060 |
| WOA | 0.782 | 183.4 | 0.2179 | 0.665 | 11.5 | 0.3351 |
| GOA | 0.866 | 365.5 | 0.2393 | 0.665 | 11.1 | 0.3437 |
| BPSO | 0.766 | 371.1 | 0.2341 | 0.688 | 13.1 | 0.3118 |
| BWOA | 0.905 | 180.7 | 0.0955 | 0.695 | 10.5 | 0.3053 |
| BGOA | 0.911 | 485.7 | 0.0900 | 0.692 | 9.2 | 0.3151 |
| NL-BGWOA | | | | | | |
Fig. 9 The comparison of different algorithms on the multi-modal Parkinson datasets
Evaluation of the COVID-19 symptom dataset
| Algorithms | Accuracy | Num-FS | Fitness |
|---|---|---|---|
| PSO | 0.5184 | 2.0 | 0.4805 |
| WOA | 0.5073 | 3.0 | 0.4629 |
| GOA | 0.5188 | 2.0 | 0.4431 |
| BPSO | 0.5167 | 3.0 | 0.4386 |
| BWOA | 0.5204 | 3.0 | 0.4837 |
| BGOA | 0.5201 | 2.0 | 0.4353 |
| NL-BGWOA | | | |
Fig. 10 The comparison of different algorithms on the COVID-19 symptom datasets
Results of Friedman test
| Algorithms | Mean | Ranking |
|---|---|---|
| PSO | 2.0000 | 7 |
| WOA | 2.8077 | 5 |
| GOA | 2.0385 | 6 |
| BPSO | 3.2091 | 4 |
| BWOA | 4.3462 | 2 |
| BGOA | 4.0769 | 3 |
| NL-BGWOA | | |
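The Friedman test compares algorithms through their mean ranks over the datasets. The Python sketch below is an illustration under stated assumptions, not the procedure used in the study: it assumes that a higher score receives a higher rank, that ties share the average rank, and that the six values surviving in each row of the mean classification accuracy table belong to PSO, WOA, GOA, BPSO, BWOA, and BGOA in that order. It uses rows D1 to D3 only and will not reproduce the means in the table above, since the NL-BGWOA values are missing from this record and the record does not say which measures were ranked. All function names are hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def friedman_mean_ranks(scores):
    """scores[d][a] is the score of algorithm a on dataset d."""
    totals = [0.0] * len(scores[0])
    for row in scores:
        for a, rank in enumerate(average_ranks(row)):
            totals[a] += rank
    return [total / len(scores) for total in totals]


def friedman_statistic(mean_ranks, n_datasets):
    """Friedman chi-square statistic without tie correction."""
    k = len(mean_ranks)
    centre = (k + 1) / 2.0
    return 12.0 * n_datasets / (k * (k + 1)) * sum(
        (r - centre) ** 2 for r in mean_ranks
    )


algorithms = ["PSO", "WOA", "GOA", "BPSO", "BWOA", "BGOA"]
accuracy_rows = [
    [0.5841, 0.5911, 0.5833, 0.5810, 0.5977, 0.5926],  # D1
    [0.9588, 0.9566, 0.9737, 0.9558, 0.9823, 0.9823],  # D2
    [0.8549, 0.9100, 0.8931, 0.9021, 0.9205, 0.9180],  # D3
]
mean_ranks = friedman_mean_ranks(accuracy_rows)
statistic = friedman_statistic(mean_ranks, len(accuracy_rows))
for name, rank in zip(algorithms, mean_ranks):
    print(f"{name}: {rank:.4f}")
```

The mean ranks of k algorithms always sum to k(k+1)/2, which gives a quick sanity check on any reported Friedman table.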
| Dataset | Features | Description |
|---|---|---|
| P1 | Baseline features | Jitter variants; Shimmer variants; Fundamental frequency parameters; Harmonicity parameters; Recurrence period density entropy (RPDE); Detrended fluctuation analysis (DFA); Pitch period entropy (PPE) |
| | Time frequency features | Intensity parameters; Formant frequencies; Bandwidth |
| | Mel frequency cepstral coefficients | MFCCs |
| | Wavelet transform based features | Wavelet transform (WT) features related with F0 |
| | Vocal fold features | Glottis quotient (GQ); Glottal to noise excitation (GNE); Vocal fold excitation ratio (VFER); Empirical mode decomposition (EMD) |
| P2 | Frequency features | Jitter (local); Jitter (local, absolute); Jitter (rap); Jitter (ppq5); Jitter (ddp) |
| | Amplitude features | Shimmer (local); Shimmer (local, dB); Shimmer (apq3); Shimmer (apq5); Shimmer (apq11); Shimmer (dda) |
| | Harmonicity features | Autocorrelation; Noise-to-harmonic; Harmonic-to-noise |
| | Pitch features | Median pitch; Mean pitch; Standard deviation; Minimum pitch; Maximum pitch |
| | Pulse features | Number of pulses; Number of periods; Mean period; Standard deviation of period |
| | Voicing features | Fraction of locally unvoiced frames; Number of voice breaks; Degree of voice breaks |