Yelleti Vivek1,2, Vadlamani Ravi1, P Radha Krishna2.
Abstract
Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and perform poorly when applied to big datasets. To circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithm (EA) based wrappers under Apache Spark. We propose two hybrid EAs based on Binary Differential Evolution (BDE) and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms, adaptive DE (ADE) and permutation-based DE (DE-FSPM), and named them PB-ADE and P-DE-FSPM, respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, the area under the receiver operating characteristic curve (AUC). The effectiveness of the proposed algorithms is tested on five big datasets of varying dimensions. Notably, PB-TADE turned out to be statistically significantly better than the rest. All the algorithms exhibited the repeatability property. The proposed parallel model attained a speedup of 2.2-2.9. We also report the feature subsets with high AUC and least cardinality.
Keywords: Apache Spark; Differential evolution; Feature subset selection; MapReduce; Multithreading; Threshold accepting
Year: 2022 PMID: 36105649 PMCID: PMC9463682 DOI: 10.1007/s10586-022-03725-w
Source DB: PubMed Journal: Cluster Comput ISSN: 1386-7857 Impact factor: 2.303
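The abstract says that BTA is invoked to help BDE escape premature convergence. As a rough illustration of the general threshold accepting idea on binary feature masks, the sketch below flips one bit at a time and accepts the move whenever the fitness does not drop by more than a shrinking threshold. It is not the authors' PB-DETA or PB-TADE implementation: the names `toy_fitness` and `binary_threshold_accepting`, the single-bit-flip neighbourhood, and the parameters `threshold` and `decay` are all assumptions made for illustration, and the toy fitness merely stands in for the LR-based AUC used in the paper.

```python
import random


def toy_fitness(mask):
    """Hypothetical stand-in for the wrapper's AUC (not the paper's fitness).

    Rewards selecting the first three features and mildly penalises
    the size of the selected subset.
    """
    informative = sum(mask[:3])
    return informative / 3.0 - 0.01 * sum(mask)


def binary_threshold_accepting(fitness, n_features, iters=200,
                               threshold=0.05, decay=0.98, seed=0):
    """Generic threshold accepting over binary masks (maximisation)."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_features)]
    current_fit = fitness(current)
    best, best_fit = current[:], current_fit

    for _ in range(iters):
        # Neighbour: flip a single randomly chosen bit (assumed move).
        candidate = current[:]
        j = rng.randrange(n_features)
        candidate[j] = 1 - candidate[j]
        cand_fit = fitness(candidate)

        # Accept if the candidate is not worse than the current
        # solution by more than the threshold.
        if cand_fit >= current_fit - threshold:
            current, current_fit = candidate, cand_fit
            if current_fit > best_fit:
                best, best_fit = current[:], current_fit

        # Tighten the threshold so later moves must be near-improving.
        threshold *= decay

    return best, best_fit


if __name__ == "__main__":
    mask, fit = binary_threshold_accepting(toy_fitness, n_features=10)
    print(mask, round(fit, 4))
```

In a wrapper setting, `fitness` would train a classifier on the columns selected by the mask and return its AUC. Because accepted moves may be slightly worse than the current solution, the sketch tracks the best mask seen separately from the current one.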
Sequential versions of DE and its wrapper variants
| Authors | # objectives | Algorithm | Wrapper (classifier)/filter | Parallel/sequential |
|---|---|---|---|---|
| Zhang et al. [ | Multi Objective | Self-Learning DE | Wrapper (KNN) | Sequential |
| Vivekanandan and Sriraman [ | Single Objective | Modified DE | Filter | Sequential |
| Nayak et al. [ | Multi Objective | FAEMODE | Filter | Sequential |
| Mlakar et al. [ | Multi Objective | DE + HOG | Wrapper (SVM) | Sequential |
| Khushaba et al. [ | Single Objective | DEFS | Filter | Sequential |
| Hancer [ | Multi Objective | MODE-CFS | Filter | Sequential |
| Hancer et al. [ | Multi Objective | DE + MIFS | Filter | Sequential |
| Ghosh et al. [ | Multi Objective | SADE | Wrapper (Fuzzy-KNN) | Sequential |
| Bhadra and Bandyopadhyay [ | Multi Objective | MoDE | Filter MI | Sequential |
| Bhaig [ | Multi Objective | Modified DE | Wrapper (SVM) | Sequential |
| Almasoudy et al. [ | Multi Objective | Modified DE | Wrapper (ELM) | Sequential |
| Zorarpaci et al. [ | Single Objective | DE + ACO | Weka J48 classifier | Sequential |
| Srikrishna et al. [ | Single Objective | Quantum DE | Wrapper (LR) | Sequential |
| Lopez et al. [ | Single Objective | DE-FSPM | Wrapper (SVM) | Sequential |
| Al-Ani [ | Single Objective | DE + Wheel based strategy | Filter | Sequential |
| Zhao et al. [ | Single Objective | Modified DE | Wrapper (SVM) | Sequential |
| Hancer [ | Multi Objective | DE | Filter (Fuzzy + Kernel) | Sequential |
| Li et al. [ | Single Objective | DE | Wrapper (SVM) | Sequential |
| Wang et al. [ | Single Objective | DE | Wrapper (KNN) | Sequential |
| Krishna and Ravi [ | Single Objective | Adaptive DE | Wrapper (LR) | Sequential |
| Current study | Single Objective | PB-TADE, PB-DETA, PB-DE, PB-ADE, P-DE-FSPM | Wrapper (LR) | Parallel |
Parallel and distributed versions of DE and its variants
| Authors | Algorithm | Environment | Problem solved |
|---|---|---|---|
| Zhou [ | DE | Spark | Pros and cons of various approaches are discussed |
| Teijeiro et al. [ | DE | Spark + AWS | Tested on benchmark functions |
| Chou et al. [ | DE | Spark | Clustering |
| Al-Sawwa and Ludwig [ | DE | Spark | Designed a DE based classifier |
| Chen et al. [ | Modified DE | SPMD | Cluster Optimization |
| Adhianto et al. [ | DE | OpenMP | Optical Network problem |
| Liu et al. [ | DE | Distributed Cloud | Power electronic circuit optimization |
| Deng et al. [ | DE | Spark | Tested on benchmark functions and reported speedup |
| Wong et al. [ | Self-Adaptive DE | CUDA | Tested on benchmark functions and reported speedup |
| He et al. [ | Five variants of DE | Spark + Cloud | Developed a ring topology model and evaluated benchmark functions to report speedup |
| Cao et al. [ | DPCCMOEA | MPI | Developed co-evolutionary based DE to solve large scale optimization |
| Ge et al. [ | DDE-AMS | MPI | Developed adaptive population model to solve large scale optimization |
| Falco et al. [ | DE | MPI | Resource allocation |
| Veronese and Krohling [ | DE | CUDA | To solve large scale optimization in GPU environment |
| Glotik et al. [ | PSADE | MATLAB | Hydro Scheduling algorithm |
| Thomert et al. [ | NSDE-II | OpenMP | Cloud work placements |
| Daoudi et al. [ | DE | Hadoop | Clustering |
| Kromer et al. [ | DE | Unified Parallel C | To solve large scale optimization problems |
| Current study | PB-TADE, PB-DETA, PB-DE, PB-ADE, P-DE-FSPM | Spark | A parallel EA based wrapper algorithm solving FSS |
Fig. 1 Schema of the population RDD
Fig. 5 Flowchart of PB-DE based wrapper
Fig. 2 Schematic representation of the DETA based wrapper
Fig. 6 Flowchart of the PB-DETA based wrapper
Fig. 3 Schematic representation of the PB-TADE based wrapper
Fig. 7 Flowchart of the parallel PB-TADE based wrapper
Fig. 8 Flowchart of PB-ADE based wrapper
Fig. 9 Flowchart of P-DE-FSPM based wrapper
Fig. 4 Schema of the population RDD for the P-DE-FSPM
Time complexity of the algorithms
| Algorithm | Time complexity |
|---|---|
| DE | O( |
| DETA | O( |
| TADE | O( |
| ADE | O( |
| DE-FSPM | O( |
Description of the benchmark datasets
| Name of the dataset | # objects | # features | # classes | Size of the dataset |
|---|---|---|---|---|
| Epsilon | 500,000 | 2000 | 2 | 10.8 GB |
| Microsoft Malware | 3,259,724 | 76 | 2 | 1.8 GB |
| IEEE Malware | 1,500,000 | 1000 | 2 | 3.2 GB |
| OVA_Omentum | 1584 | 10,935 | 2 | 108.3 MB |
| OVA_Uterus | 1584 | 10,935 | 2 | 108.3 MB |
Hyperparameters for all the approaches
| Dataset | PB-DE MF | PB-DE CR | PB-DETA MF | PB-DETA CR | PB-TADE MF | PB-TADE CR | P-DE-FSPM MF | P-DE-FSPM CR |
|---|---|---|---|---|---|---|---|---|
| Epsilon | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 |
| Microsoft Malware | 0.8 | 0.9 | 0.8 | 0.9 | 0.8 | 0.9 | 0.8 | 0.9 |
| IEEE Malware | 0.8 | 0.9 | 0.8 | 0.9 | 0.8 | 0.9 | 0.8 | 0.9 |
| OVA_Omentum | 0.75 | 0.9 | 0.75 | 0.9 | 0.75 | 0.9 | 0.75 | 0.9 |
| OVA_Uterus | 0.85 | 0.9 | 0.85 | 0.9 | 0.85 | 0.9 | 0.85 | 0.9 |
Average Cardinality and mean AUC obtained
| Dataset | PB-DE Avg. cardinality | PB-DE Mean AUC | PB-DETA Avg. cardinality | PB-DETA Mean AUC | PB-TADE Avg. cardinality | PB-TADE Mean AUC | PB-ADE Avg. cardinality | PB-ADE Mean AUC | P-DE-FSPM Avg. cardinality | P-DE-FSPM Mean AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Epsilon | 617.3 | 0.7932 | 486.1 | 0.8029 | 457.7 | | 555.65 | 0.797 | 558 | 0.7971 |
| Microsoft Malware | 29.6 | 0.6872 | 21.7 | 0.7002 | 18.60 | | 16.95 | 0.682 | 15.8 | 0.6924 |
| IEEE Malware | 643.45 | 0.7929 | 477.9 | 0.8035 | 463.9 | | 463.5 | 0.790 | 499.55 | 0.7937 |
| OVA_Omentum | 47.28 | 0.8607 | 35.54 | 0.8722 | 26.15 | | 49.3 | 0.846 | 32.9 | 0.870 |
| OVA_Uterus | 37.3 | 0.8607 | 28.60 | 0.8712 | 27.12 | | 46.2 | 0.845 | 49.7 | 0.871 |
Most often repeated features selected by each approach
| Dataset | Approach | Most repeated features |
|---|---|---|
| Epsilon | PB-DE | 1, 3, 5, 7, 9 |
| | PB-DETA | 1, 3, 6, 12, 19 |
| | PB-TADE | 1, 3, 6, 12, 19 |
| | PB-ADE | 1, 3, 5, 7, 9 |
| | P-DE-FSPM | 1, 3, 6, 12, 9 |
| Microsoft Malware | PB-DE | AVProductsInstalled, HasTpm, Isprotected, Census_OEMN_Name Identifier, SmartScreen |
| | PB-DETA | AVProductsInstalled, HasTpm, IsPassiveMode, OsSuite, SmartScreen |
| | PB-TADE | AVProductsInstalled, HasTpm, OsSuite, RipStateBuild, SmartScreen |
| | PB-ADE | AVProductsInstalled, HasTpm, Isprotected, Census_OEMN_Name Identifier, SmartScreen |
| | P-DE-FSPM | AVProductsInstalled, HasTpm, OsSuite, RipStateBuild, SmartScreen |
| IEEE Malware | PB-DE | GetProcAddress, GetThreadId, Sleep, FindClose, RaiseException |
| | PB-DETA | GetProcAddress, GetLastError, Sleep, ReadFile, RaiseException |
| | PB-TADE | GetProcAddress, GetLastError, Sleep, ReadFile, RaiseException |
| | PB-ADE | GetProcAddress, GetThreadId, Sleep, FindClose, RaiseException |
| | P-DE-FSPM | GetProcAddress, GetThreadId, Sleep, FindClose, RaiseException |
| OVA_Omentum | PB-DE | 158765_at, 201608_s_at, 206442_at, 207096_s_at, 210002_s_at |
| | PB-DETA | 1554436_s_at, 201669_s_at, 20644_s_at, 207442_s_at, 208970_s_at |
| | PB-TADE | 1554436_s_at, 201669_s_at, 20644_s_at, 207442_s_at, 208970_s_at |
| | PB-ADE | 158765_at, 201608_s_at, 206442_at, 207096_s_at, 210002_s_at |
| | P-DE-FSPM | 158765_at, 201608_s_at, 206442_at, 207096_s_at, 210002_s_at |
| OVA_Uterus | PB-DE | 205866_s_at, 209682_s_at, 217294_s_at, 222421_s_at, 220148_s_at |
| | PB-DETA | 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at |
| | PB-TADE | 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at |
| | PB-ADE | 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at |
| | P-DE-FSPM | 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at |
Cardinalities and the corresponding AUC of the Top-most repeated feature subsets
| Dataset | PB-DE #s1 | PB-DE AUC | PB-DETA #s1 | PB-DETA AUC | PB-TADE #s1 | PB-TADE AUC | PB-ADE #s1 | PB-ADE AUC | P-DE-FSPM #s1 | P-DE-FSPM AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Epsilon | 639 | 0.7967 | 564 | 0.8068 | 488 | | 505 | 0.797 | 555 | 0.801 |
| Microsoft Malware | 31 | 0.6983 | 24 | 0.7057 | 17 | | 17 | 0.7 | 20 | 0.70 |
| IEEE Malware | 550 | 0.7956 | 486 | 0.8057 | 487 | | 483 | 0.804 | 588 | 0.798 |
| OVA_Omentum | 55 | 0.8701 | 37 | 0.8723 | 33 | | 66 | 0.866 | 33 | 0.876 |
| OVA_Uterus | 37 | 0.8607 | 28 | 0.8712 | 27 | | 60 | 0.86 | 50 | 0.877 |
*Where #s1 is the cardinality of the top-most repeated feature subset
Least Cardinal Feature subset with the highest AUC
| Dataset | PB-DE #s1 | PB-DE AUC | PB-DETA #s1 | PB-DETA AUC | PB-TADE #s1 | PB-TADE AUC | PB-ADE #s1 | PB-ADE AUC | P-DE-FSPM #s1 | P-DE-FSPM AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Epsilon | 588 | 0.7967 | 471 | 0.8068 | | | 505 | 0.797 | 555 | 0.801 |
| Microsoft Malware | 27 | 0.6915 | 22 | 0.7007 | | | 17 | 0.7 | 20 | 0.70 |
| IEEE Malware | 550 | 0.7956 | 484 | 0.8057 | | | 388 | 0.799 | 464 | 0.804 |
| OVA_Omentum | 41 | 0.8504 | 31 | 0.8699 | | | 66 | 0.866 | 27 | 0.864 |
| OVA_Uterus | 31 | 0.8504 | 24 | 0.8699 | | | 60 | 0.86 | 48 | 0.870 |
*Where #s1 is the cardinality of the smallest feature subset that attains the highest AUC
Speedup Analysis of parallel versions over the sequential ones
| Dataset | Algorithm | Sequential E.T (s) | Parallel E.T (s) | S.U (speedup) |
|---|---|---|---|---|
| Epsilon | PB-DE | 12,780 | 4485 | 2.84 |
| | PB-DETA | 12,680 | 4361 | 2.90 |
| | PB-ADE | 12,780 | 4485 | 2.84 |
| | P-DE-FSPM | 12,912 | 4860 | 2.65 |
| Microsoft Malware | PB-DE | 16,412 | 6741 | 2.43 |
| | PB-DETA | 15,781 | 6447 | 2.44 |
| | PB-ADE | 16,222 | 6741 | 2.40 |
| | P-DE-FSPM | 15,640 | 7077 | 2.21 |
| IEEE Malware | | | | |
| | PB-DETA | 19,793 | 7936 | 2.49 |
| | PB-TADE | 19,801 | 7938 | 2.49 |
| | PB-ADE | 20,517 | 8817 | 2.32 |
| | P-DE-FSPM | 20,331 | 8996 | 2.26 |
| OVA_Omentum | PB-DE | 14,892 | 5428 | 2.74 |
| | PB-DETA | 14,651 | 5226 | 2.80 |
| | PB-ADE | 14,891 | 5407 | 2.75 |
| | P-DE-FSPM | 16,860 | 6882 | 2.45 |
| OVA_Uterus | PB-DE | 14,108 | 5368 | 2.62 |
| | PB-DETA | 13,979 | 5222 | 2.68 |
| | PB-TADE | 13,968 | 5378 | 2.68 |
| | P-DE-FSPM | 16,042 | 6712 | 2.39 |
Top results are highlighted in bold
*Where E.T is the execution time given in seconds
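The S.U column is consistent with the usual definition of speedup, the sequential execution time divided by the parallel execution time. The minimal sketch below applies that ratio to three Epsilon rows from the table; the function name `speedup` and the dictionary layout are illustrative choices, not taken from the paper.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Speedup as the ratio of sequential to parallel execution time."""
    if parallel_seconds <= 0:
        raise ValueError("parallel execution time must be positive")
    return sequential_seconds / parallel_seconds


# (sequential E.T, parallel E.T) in seconds, copied from the Epsilon rows.
epsilon_times = {
    "PB-DE": (12780, 4485),
    "PB-DETA": (12680, 4361),
    "P-DE-FSPM": (12912, 4860),
}

if __name__ == "__main__":
    for name, (seq, par) in epsilon_times.items():
        print(f"{name}: {speedup(seq, par):.2f}")
```

The computed ratios come out at roughly 2.85, 2.91 and 2.66, within 0.01 of the tabulated 2.84, 2.90 and 2.65; the small gap is presumably a matter of rounding convention.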
Paired t-test results
| Model | Parameter | Epsilon | Microsoft Malware | IEEE Malware | OVA_ | OVA_ |
|---|---|---|---|---|---|---|
| PB-DE vs PB-TADE | t-statistic | 7.72 | 4.62 | 8.045 | 4.168 | 3.69 |
| | p-value | 2.66 × 10^-09 | 4.25 × 10^-05 | 9.91 × 10^-10 | 0.00017 | 0.00069 |
| PB-DETA vs PB-TADE | t-statistic | 3.56 | 3.106 | 3.63 | 2.06 | 1.744 |
| | p-value | 0.001 | 0.0035 | 0.0008 | 0.045 | 0.0891 |
| PB-ADE vs PB-TADE | t-statistic | 7.21 | 3.648 | 5.22 | 3.54 | 6.084 |
| | p-value | 1.25 × 10^-09 | 0.0007 | 6.57 × 10^-06 | 0.0010 | 4.36 × 10^-07 |
| P-DE-FSPM vs PB-TADE | t-statistic | 14.21 | 4.035 | 6.90 | 2.45 | 1.78 |
| | p-value | 8.54 × 10^-17 | 0.00025 | 3.32 × 10^-08 | 0.0186 | 0.0818 |
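For readers who want to see how a paired t-statistic of the kind tabulated above is formed, the sketch below computes it from two equal-length lists of per-run scores. The function name `paired_t_statistic` and the AUC values in the usage example are made up for illustration; they are not the paper's per-run results, and the p-value step (which needs the Student t distribution) is left out.

```python
import math


def paired_t_statistic(a, b):
    """Paired t-statistic: mean difference over its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("differences have zero variance")
    return mean_d / math.sqrt(var_d / n)


if __name__ == "__main__":
    # Hypothetical per-run AUCs of two algorithms on the same runs.
    auc_a = [0.80, 0.81, 0.79, 0.82]
    auc_b = [0.78, 0.80, 0.78, 0.79]
    print(round(paired_t_statistic(auc_a, auc_b), 3))
```

The statistic has n - 1 degrees of freedom, where n is the number of paired runs; swapping the two samples only flips its sign.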