Literature DB >> 36034050

Binary Simulated Normal Distribution Optimizer for feature selection: Theory and application in COVID-19 datasets.

Shameem Ahmed1, Khalid Hassan Sheikh1, Seyedali Mirjalili2,3,4, Ram Sarkar1.   

Abstract

Classification accuracy achieved by a machine learning technique depends on the feature set used in the learning process. However, it is often found that all the features extracted by some means for a particular task do not contribute to the classification process. Feature selection (FS) is an imperative and challenging pre-processing technique that helps to discard the unnecessary and irrelevant features while reducing the computational time and space requirement and increasing the classification accuracy. Generalized Normal Distribution Optimizer (GNDO), a recently proposed meta-heuristic algorithm, can be used to solve any optimization problem. In this paper, a hybrid version of GNDO with Simulated Annealing (SA) called Binary Simulated Normal Distribution Optimizer (BSNDO) is proposed which uses SA as a local search to achieve higher classification accuracy. The proposed method is evaluated on 18 well-known UCI datasets and compared with its predecessor as well as some popular FS methods. Moreover, this method is tested on high dimensional microarray datasets to prove its worth in real-life datasets. On top of that, it is also applied to a COVID-19 dataset for classification purposes. The obtained results prove the usefulness of BSNDO as a FS method. The source code of this work is publicly available at https://github.com/ahmed-shameem/Feature_selection.
© 2022 Elsevier Ltd. All rights reserved.


Keywords:  Algorithm; COVID-19; Feature selection; Generalized Normal Distribution Optimizer; Meta-heuristic; Optimization; Simulated annealing

Year:  2022        PMID: 36034050      PMCID: PMC9396289          DOI: 10.1016/j.eswa.2022.116834

Source DB:  PubMed          Journal:  Expert Syst Appl        ISSN: 0957-4174            Impact factor:   8.665


Introduction

Data mining and machine learning are among the fastest-growing research topics in the information industry due to the availability of ample amounts of data that can be converted into potentially useful information. These fields are an essential and integral part of the knowledge discovery in databases (KDD) process, which consists of an iterative sequence of tasks such as data cleaning, data reduction, data integration, and data transformation (Han et al., 2011). These pre-processing steps have a major impact on the performance of data mining and machine learning algorithms. Data can be considered the ’new currency’ of this decade, which simply underlines its importance; hence, handling data properly for our needs is a new adventure. With the growing popularity of these fields, we are receiving data in abundance, which makes our job difficult as the dimensionality of these data is very high, and any data mining or machine learning algorithm consequently takes a huge amount of time during training. To address this ’curse of dimensionality’, researchers have come up with various techniques. Feature selection (FS) is one of the most popular such techniques: it removes unnecessary and irrelevant features, thereby reducing the number of attributes that do not help in classification but rather act as noise and increase the space requirement as well as the computational cost. Generally speaking, there are two ways to perform FS: filter and wrapper (Liu & Motoda, 2012). Filter methods evaluate the feature subset using designated measures such as Information gain (IG), Chi-square (Zheng et al., 2004), and the Laplacian score (He et al., 2006), whereas wrapper methods use a learning algorithm to evaluate the selected feature subset. Filter methods are usually faster than wrapper methods, but wrapper methods generally produce better classification accuracy (Liu & Motoda, 2012).
Some of the recent and promising approaches, like the column-subset selection problem (Boutsidis et al., 2014, Cortinovis and Kressner, 2020, Drineas et al., 2008, Tripathi and Reza, 2020), are known to perform FS with provable theoretical bounds. These methods have been used to perform FS for k-means (Boutsidis et al., 2009) and SVM (Paul et al., 2016), providing a significant performance enhancement, and are known to outperform existing methods like mutual information, recursive feature elimination, etc. Finding the most functional feature subset, i.e., the truly necessary features, is a challenging task. For the last few years, meta-heuristic algorithms have been employed to address the FS problem, and these works have widened the way to performing FS efficiently. If a dataset consists of N features/attributes, then there are 2^N − 1 possible non-empty feature combinations. Evaluating all these feature subsets is a hectic task, i.e., it is very time consuming and hence inefficient. Random search is one possible solution to this problem (Lai et al., 2006); however, meta-heuristic procedures are considered more appropriate as they can handle the worst-case scenario (Talbi, 2009). There are many such meta-heuristic algorithms in the literature, like the genetic algorithm (GA) (Davis, 1991), the particle swarm optimization (PSO) algorithm (Kennedy & Eberhart, 1995), the artificial bee colony (ABC) algorithm (Karaboga & Basturk, 2007), the harmony search (HS) algorithm (Geem et al., 2001), and the sine–cosine algorithm (SCA) (Mirjalili, 2016). The search process of any meta-heuristic algorithm depends on the balance between its exploration and exploitation phases. Exploration means diversification of solutions, i.e., evaluating candidate solutions that are not neighbouring solutions. Exploitation, on the other hand, means intensification, i.e., searching the neighbourhood for possible better solutions. These two traits become the deciding factor in finding an optimal solution.
Hence, proper tuning between these two is very important. In this paper, we try to maintain a fine balance between these two phases of the Generalized Normal Distribution Optimizer (GNDO) (Zhang et al., 2020) with the help of Simulated Annealing (SA) (Kirkpatrick et al., 1983), which acts as a local search to enhance the exploitation capability of GNDO. The proposed method has been applied to various datasets to prove its worth and effectiveness. The rest of the paper is organized as follows: Section 2 discusses some popular and recent meta-heuristic algorithms found in the literature, Section 3 presents the motivation and contributions of this work, Section 4 describes the search process of GNDO and SA, Section 5 discusses the fitness function and transfer function used here, as well as the time complexity of the method, Section 6 reports the detailed experiments performed to prove the effectiveness of the proposed method, Section 7 proves the robustness of the model, Section 8 shows the effectiveness of the proposed method in COVID-19 detection, and finally Section 9 concludes the paper and outlines future work.

Related work

In recent times, optimization algorithms have attracted a lot of attention from researchers. In particular, meta-heuristic algorithms have seen numerous improvements over the years. Meta-heuristics are a genre of randomized algorithms in which the algorithm learns to find the optimal solution through an iterative process. Meta-heuristic algorithms can be divided into different categories: single-solution based and population-based (Gendreau & Potvin, 2005), nature-inspired and non-nature-inspired (Abdel-Basset et al., 2018, au2 et al., 2013), etc. From the ‘inspiration’ point of view, these algorithms can broadly be divided into four categories (Nematollahi et al., 2019): evolutionary, swarm-inspired, physics-based, and human-related. Evolutionary algorithms: these algorithms are inspired by the biological process of evolution, in which the fittest individuals are generated through crossover and mutation in each generation; this inspired the pioneering algorithm in this field, the Genetic Algorithm (GA) (Davis, 1991). Other evolutionary algorithms are Genetic programming (Koza, 1994), the Co-evolving algorithm (Hillis, 1990), the Cultural algorithm (Xue et al., 2011), Biogeography-Based Optimization (Simon, 2008), Grammatical evolution (Ryan et al., 1998), etc. Swarm-inspired algorithms: this genre of algorithm mimics the individual and social behaviour of swarms, herds, schools, groups, and teams. The key idea behind such algorithms in the optimization field is that, although each individual in a swarm has only simple behaviour, through collective effort the swarm can solve very complex optimization problems. One of the most popular algorithms in this field is PSO (Kennedy & Eberhart, 1995), which is inspired by the behaviour of a flock of birds.
Other famous swarm-based algorithms are the Shuffled frog-leaping algorithm (Eusuff et al., 2006), Bacterial foraging (Passino, 2002), ABC (Karaboga & Basturk, 2007), the Firefly Algorithm (Yang, 2009), the Grey Wolf Optimizer (GWO) (Mirjalili et al., 2014), the Crow search algorithm (Askarzadeh, 2016), the Whale Optimization Algorithm (Mirjalili & Lewis, 2016), the Grasshopper Optimization Algorithm (Saremi et al., 2017), and the Squirrel Search Algorithm (Jain et al., 2019). Physics-based algorithms: this type of algorithm is inspired by the working principles of the physical world; processes ranging from music and metallurgy to mathematics, physics, chemistry, and complex dynamic systems have inspired physics-based meta-heuristic algorithms. Some noted algorithms are the Gravitational Search Algorithm (GSA) (Rashedi et al., 2009), SA (Kirkpatrick et al., 1983), Self-propelled particles (Vicsek et al., 1995), the HS algorithm (Geem et al., 2001), Black hole optimization (Hatamlou, 2013), the Multi-verse optimizer (Mirjalili et al., 2015), Find-Fix-Finish-Exploit-Analyze (Kashan et al., 2019), etc. Human-related algorithms: these are developed based on human behaviour; Teaching–Learning-Based Optimization (Rao et al., 2011), Society and civilization (Ray & Liew, 2003), and the Fireworks algorithm (Tan & Zhu, 2010) are some algorithms in this genre. However, one of the issues with meta-heuristic algorithms is premature convergence, which leads to suboptimal solutions. Therefore, these algorithms are often coupled with other techniques (e.g., local search algorithms). In this case, the local search algorithm tries to find solutions that are locally adjacent to an existing solution and may outperform it. Some commonly used local search algorithms are Hill Climbing (HC), SA (Kirkpatrick et al., 1983), Tabu Search (TS) (Glover & Laguna, 1998), and Late Acceptance Hill Climbing (LAHC).
Some modifications of the HC algorithm are β-Hill Climbing (Al-Betar, 2016) and Adaptive β-Hill Climbing (Al-Betar et al., 2019). Some works based on the hybridization of local search and meta-heuristic algorithms are Elgamal et al. (2020), Kurtuluş et al. (2020) and Mafarja and Mirjalili (2017).

Motivation and contributions

For the past few decades, meta-heuristic algorithms have proved their utility in several research fields. Because of their immense usefulness, researchers are investing more time in developing better-performing algorithms. At the end of the day, we want to find the most optimal solution to such NP-hard problems, and there is no single final best result: we can always improve our findings with new or modified algorithms. Moreover, according to the No Free Lunch (NFL) theorem (Wolpert & Macready, 1997), any two algorithms produce equivalent results when their performance is averaged over all possible optimization problems. It has been observed that an algorithm may achieve superior results on some problems, but that does not ensure the same on other problems. Hence, we can say that there is no universal algorithm qualified to be used on all optimization problems and produce the best results. These inferences keep research in this field resilient. As FS is considered an optimization problem (Ghosh et al., 2020), researchers keep coming up with new and efficient FS methods using meta-heuristic algorithms. This is the motivation of our proposed work, in which we have designed a new algorithm by modifying GNDO (Zhang et al., 2020). The GNDO algorithm is inspired by the generalized normal distribution model, in which each individual uses a generalized normal distribution curve to update its current position in the hope of finding a better position. GNDO was originally employed to increase the accuracy of extracting the unknown parameters of the single diode model, the double diode model and the photovoltaic module model. There are two ways of hybridizing meta-heuristic algorithms (Talbi, 2009): low-level and high-level. A low-level approach embeds one algorithm within the other, whereas in the high-level approach the algorithms are executed in succession.
This work follows the high-level version to hybridize GNDO and SA, maintaining a pipeline model in which the output of one meta-heuristic algorithm is considered the input to the other. To the best of our knowledge, this is the first time GNDO has been hybridized with SA for solving FS problems. The present work proposes an improved version of the binary form of GNDO (BGNDO), known as the Binary Simulated Normal Distribution Optimizer (BSNDO), hybridized with another meta-heuristic algorithm called SA (Kirkpatrick et al., 1983). Recently, some hybrid FS methods have been proposed (Ahmed et al., 2021, Ahmed et al., 2020b, Bhattacharyya et al., 2020, Sheikh et al., 2020) that have demonstrated their effectiveness and superiority over other methods; this has also motivated us to come up with a hybrid version of GNDO. Also, COVID-19 is a threat to humanity, as many people are suffering from it and many have died, and our normal day-to-day life has been disrupted because of this uncertainty. Many works have been proposed for the detection of COVID-19; a few of them are Das et al. (2021), Garain et al. (2021) and Karbhari et al. (2021). We have performed FS on a publicly available COVID-19 dataset for COVID-19 classification. In a nutshell, the main contributions of this work are as follows: (1) a new FS method called BSNDO is introduced using BGNDO and another popular meta-heuristic called SA; (2) the proposed hybrid FS method is assessed on 18 standard UCI datasets (Dua & Graff, 2017) using the K-nearest Neighbours (KNN), Random Forest and Naive Bayes classifiers; (3) BSNDO is also applied to high-dimensional microarray datasets to prove its effectiveness; (4) it is also applied to a publicly available COVID-19 dataset for classification purposes; and (5) the proposed FS method is compared with many state-of-the-art meta-heuristic based FS methods.

Preliminaries

Generalized normal distribution optimizer

GNDO (Zhang et al., 2020) is inspired by normal distribution (Gaussian distribution) theory, which is widely used to describe natural phenomena. A normal distribution is described as follows: assume a random variable r obeys a probability distribution with location parameter μ and scale parameter σ; its probability density function can be written as:

f(r) = (1 / (√(2π) σ)) × exp(−(r − μ)² / (2σ²))

Then r is a normally distributed random variable, i.e., r ~ N(μ, σ²). Any population-based optimization algorithm starts with random initialization, and then all solutions converge towards the global optimum following the rules of exploration and exploitation. In the end, all individuals assemble around the best solution achieved so far. This search process can therefore be visualized as multiple normal distributions: the position of every individual is regarded as a random variable subject to a normal distribution. The exploration of GNDO depends on three randomly selected agents, while its exploitation is based on the generalized normal distribution model, which is guided by the current mean position and the current optimal position. Based on the correspondence between the distribution of the solutions in the population and the normal distribution, a generalized distribution model can be built as:

v_i^t = α_i + β_i × γ

where v_i^t is the trail vector of the ith agent at time t, α_i is the generalized mean position of the ith agent, β_i is the generalized standard variance and γ is a penalty factor. Also, α_i, β_i and γ can be defined as follows:

α_i = (x_i^t + x_best^t + M) / 3
β_i = √( [ (x_i^t − α_i)² + (x_best^t − α_i)² + (M − α_i)² ] / 3 )
γ = √(−log λ1) × cos(2πλ2),      if a ≤ b
γ = √(−log λ1) × cos(2πλ2 + π),  otherwise

where a, b, λ1 and λ2 are random numbers in [0, 1], x_best^t is the current best position and M is the mean position of the current population, which is calculated by:

M = ( Σ_{i=1..N} x_i^t ) / N

As the current best individual contains useful information related to the global optimal solution, the ith individual is pulled towards the direction of x_best^t. It is to be noted that when x_best^t gets confined in a local optimum, all agents still move towards the direction of x_best^t, which will lead the algorithm to premature convergence.
To resolve this concern, the mean position of the current population M is introduced. Although the position of x_best^t may not change in some generations, the mean position M changes over the generations, which becomes useful for finding better solutions. Thus, the mean position M is introduced in the searching process, which increases the probability of avoiding local optima. β_i is employed to amplify the local search ability of GNDO; further, β_i × γ can be interpreted as a random progression that performs the local search around the generalized mean position α_i. Moreover, the larger the distance between the position of the ith individual and both the mean position M and the position of the best individual, the more prominent the oscillation of the generated random sequence. Hence, when x_i^t has a very bad fitness value, the probability of finding a better solution in its immediate neighbourhood is very small, and the strongly oscillating random sequence helps that individual search farther for a better solution. On the contrary, when an individual has good fitness, there is a large probability of finding a better solution around it, so a random sequence with weak oscillation may help it achieve a better solution. In the GNDO algorithm, the penalty factor γ is used to increase the randomness of the generated generalized standard variance. Most penalty factors are located in [−1, 1]. As the generated generalized standard variances are all positive, the penalty factor can increase the search directions of GNDO, which enhances its search ability. The global exploration of GNDO depends on three randomly selected individuals and is given by:

v_i^t = x_i^t + δ × (|λ3| × v1) + (1 − δ) × (|λ4| × v2)

where λ3 and λ4 are randomly generated numbers subject to the standard normal distribution, δ is the adjust parameter, a random number in [0, 1], and v1 and v2 are trail vectors calculated as follows:

v1 = x_i^t − x_{p1}^t,     if f(x_i^t) < f(x_{p1}^t);    otherwise v1 = x_{p1}^t − x_i^t
v2 = x_{p2}^t − x_{p3}^t,  if f(x_{p2}^t) < f(x_{p3}^t); otherwise v2 = x_{p3}^t − x_{p2}^t

where p1, p2 and p3 are three randomly selected distinct individual indices satisfying p1 ≠ p2 ≠ p3 ≠ i. The ith individual is given information by the individuals p2 and p3, while the solution x_{p1}^t shares information with the ith solution x_i^t.
The adjust parameter δ is used to balance these two information-sharing procedures. Moreover, λ3 and λ4 are random numbers with standard normal distribution, which give GNDO a larger search space while performing the global search. In order to carry the better solution into the next-generation population, a greedy selection mechanism is designed, represented as:

x_i^{t+1} = v_i^t,  if f(v_i^t) < f(x_i^t);  otherwise x_i^{t+1} = x_i^t

A brief description of the parameters used in GNDO is summarized in Table 1.
Table 1

Brief description of parameters used in GNDO.

Parameter | Description | Value
α | Generalized mean position | NA
β | Generalized standard variance (enhances local search ability) | NA
γ | Penalty factor (enhances randomness of the generated generalized standard variance) | [−1, 1]
a, b, λ1, λ2 | Random numbers | [0, 1]
λ3, λ4 | Random numbers subject to standard normal distribution | [0, 1]
δ | Adjust parameter | [0, 1]
D | Dimension of search space | NA
The pseudocode of the GNDO algorithm is given in Algorithm 1.
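The update rules above can be sketched in Python. The following is a minimal illustrative implementation of one GNDO generation for a minimization problem, not the authors' released code; in particular, the 50/50 random switch between exploitation and exploration is an assumption made for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def gndo_step(X, fitness, f):
    """One GNDO generation over positions X (n x d) with fitness values f
    (lower is better). Returns the updated population."""
    n, d = X.shape
    M = X.mean(axis=0)                        # mean position of the population
    best = X[np.argmin(f)]                    # current best position
    X_new = X.copy()
    for i in range(n):
        if rng.random() < 0.5:                # assumed 50/50 switch for this sketch
            # Exploitation: generalized model v = alpha + beta * gamma
            alpha = (X[i] + best + M) / 3.0
            beta = np.sqrt(((X[i] - alpha) ** 2 + (best - alpha) ** 2
                            + (M - alpha) ** 2) / 3.0)
            a, b = rng.random(), rng.random()
            l1, l2 = rng.random(d), rng.random(d)
            phase = 0.0 if a <= b else np.pi  # penalty-factor sign switch
            gamma = np.sqrt(-np.log(l1)) * np.cos(2.0 * np.pi * l2 + phase)
            v = alpha + beta * gamma
        else:
            # Exploration: information sharing with three random distinct agents
            p1, p2, p3 = rng.choice([j for j in range(n) if j != i], 3, replace=False)
            v1 = X[i] - X[p1] if f[i] < f[p1] else X[p1] - X[i]
            v2 = X[p2] - X[p3] if f[p2] < f[p3] else X[p3] - X[p2]
            delta = rng.random()              # adjust parameter
            l3, l4 = rng.standard_normal(d), rng.standard_normal(d)
            v = X[i] + delta * np.abs(l3) * v1 + (1.0 - delta) * np.abs(l4) * v2
        if fitness(v) < f[i]:                 # greedy selection into next generation
            X_new[i] = v
    return X_new
```

Because of the greedy selection step, the best fitness in the population is non-increasing across generations.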

Simulated annealing

SA, proposed by Kirkpatrick et al. (1983), is inspired by an analogy between the simulated annealing of solids and large combinatorial optimization problems. Meta-heuristic algorithms often fail to find the global optimum and instead stagnate in local optima. To overcome this issue, SA uses a probabilistic approach to accept a poor solution; by accepting poor solutions with a certain probability, exploration increases. The algorithm starts with a randomly generated initial solution. In each iteration, a solution neighbouring the current solution is generated at random based on the existing neighbourhood structure, and the neighbouring solution is evaluated using the fitness function. Two possibilities may occur: (i) the neighbouring solution performs better than the existing solution, in which case the new solution is always accepted; (ii) the neighbouring solution performs worse than the existing solution, in which case the worse solution can still be accepted with a certain probability determined by the Boltzmann probability P = exp(−ΔE/T). Here, ΔE is the difference between the fitness value of the neighbouring solution and that of the existing best solution, and T is the temperature of the “simulated annealing” process. The temperature is initialized based on the feature-length N and is periodically reduced over the iterations according to a cooling schedule. It is to be noted that the temperature decay and the probability of exploration/exploitation are taken from the works of Kirkpatrick et al. (1983) and Zhang et al. (2020), respectively. A brief description of the parameters used in SA is summarized in Table 2.
Table 2

Brief description of parameters used in SA.

Parameter | Description | Value
P | Boltzmann probability | [0, 1]
T | Temperature | NA
N | Number of attributes for each dataset | NA
Algorithm 2 shows the pseudo-code of SA.
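The acceptance scheme described above can be illustrated with a short Python sketch of SA as a local search over binary feature masks. The one-bit-flip neighbourhood, cooling rate and iteration budget below are illustrative assumptions, not the paper's exact settings:

```python
import math
import random

def simulated_annealing(solution, fitness, t0, cooling=0.95, iters=400, seed=42):
    """SA as a local search over a binary feature mask (minimization).
    Neighbour = current mask with one randomly flipped bit; worse neighbours
    are accepted with the Boltzmann probability exp(-dE / T)."""
    rnd = random.Random(seed)
    current = list(solution)
    current_fit = fitness(current)
    best, best_fit = list(current), current_fit
    T = t0
    for _ in range(iters):
        neighbour = list(current)
        j = rnd.randrange(len(neighbour))
        neighbour[j] ^= 1                    # flip one feature in/out
        neigh_fit = fitness(neighbour)
        dE = neigh_fit - current_fit         # dE > 0 means a worse neighbour
        if dE <= 0 or rnd.random() < math.exp(-dE / T):
            current, current_fit = neighbour, neigh_fit
        if current_fit < best_fit:           # keep the best solution seen so far
            best, best_fit = list(current), current_fit
        T *= cooling                         # reduce the temperature each step
    return best, best_fit
```

As T shrinks, exp(−ΔE/T) approaches zero for any worsening move, so the search gradually shifts from exploration to pure exploitation.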

Proposed method

This section elaborates the fitness function and transfer function used, and the computational complexity of the proposed algorithm. At every iteration, the agents update their position following the rules of GNDO and at the end, they try to find a better solution in their neighbourhood using SA.

Fitness function

Selecting the relevant features from a dataset that actually help the classifier to identify the class of a sample is the main challenge. During the process of selecting relevant features, we have to automatically rule out the redundant ones and maximize the classification accuracy obtained when the selected feature subset is used for classification (Pudil et al., 1994). This work applies BSNDO to find the best feature subset and calculates the classification accuracy of this subset using a classifier. Let A be the classification accuracy of the model calculated using a classifier, S be the dimension of the selected feature subset and N be the total number of attributes present in the original dataset. Then (1 − A) is the classification error and S/N is the fraction of features selected from the original dataset. We define the fitness function (to be minimized) as:

fitness = ω × (1 − A) + (1 − ω) × (S / N)

where ω denotes the weightage given to the classification error.

Transfer function

As FS is a binary optimization problem (Ghosh et al., 2020), its output lies in {0, 1}, where zero represents that the feature is rejected as redundant and one represents that the feature is useful and hence selected. However, we cannot discard the possibility of the obtained result going out of the desired range. To ensure that the output always stays within the expected range, we apply a binarization function to each agent. Here, this task is performed by the sigmoid (S-shaped) transfer function (Mirjalili & Lewis, 2013). The S-shaped transfer function, depicted in Fig. 1, is given by:

S(x) = 1 / (1 + e^(−x))
Fig. 1

S-shaped transfer function.

The range of this function is [0, 1]. If the transfer function produces an output ≥ rnd, where rnd is a random number drawn from a uniform distribution in the range (0, 1), we set the value to 1, i.e., we consider that the attribute is useful; if the output is < rnd, we set the value to 0, i.e., the attribute is redundant and hence not considered (Mafarja et al., 2019).
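The binarization rule amounts to a couple of lines of Python; the function below is an illustrative sketch of the S-shaped transfer followed by the random-threshold decision:

```python
import math
import random

def s_transfer(x):
    """S-shaped (sigmoid) transfer function S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, rnd=None):
    """Select the feature (1) when S(position) >= a uniform random
    threshold in (0, 1); otherwise reject it (0)."""
    threshold = random.random() if rnd is None else rnd
    return 1 if s_transfer(position) >= threshold else 0
```

Large positive positions are thus selected with probability close to 1, and large negative positions are almost always rejected.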

Computational complexity

For any meta-heuristic algorithm, the computational complexity depends on the time taken by each individual to update its position, the maximum number of iterations, and some other operations such as comparison/sorting and updating variables. The computational complexity of BSNDO is O(I × P × (D + C)), where I represents the maximum number of iterations, P the number of agents, D the dimension of the search space, and C the time required to calculate the fitness of a particular solution using a classifier. SA is used to find a better solution, if one is available, in the neighbourhood of the current solution; it does not affect the computational cost in terms of O-notation.

Experiments

Dataset details

To investigate the performance of BGNDO and BSNDO, 18 standard UCI datasets (Dua & Graff, 2017) are considered here. These datasets are from diverse domains; some basic information regarding them is provided in Table 3. As the datasets used here are assorted in terms of the number of features and instances, they help us understand the robustness of the proposed FS method.
Table 3

Brief idea of the datasets employed here to assess the proposed FS method.

Sl. No. | Dataset | #Attributes | #Samples | #Classes | Domain
1 | Breastcancer | 9 | 699 | 2 | Biology
2 | BreastEW | 30 | 569 | 2 | Biology
3 | CongressEW | 16 | 435 | 2 | Politics
4 | Exactly | 13 | 1000 | 2 | Biology
5 | Exactly2 | 13 | 1000 | 2 | Biology
6 | HeartEW | 13 | 270 | 2 | Biology
7 | IonosphereEW | 34 | 351 | 2 | Electromagnetic
8 | KrvskpEW | 36 | 3196 | 2 | Game
9 | Lymphography | 18 | 148 | 4 | Biology
10 | M-of-n | 13 | 1000 | 2 | Biology
11 | PenglungEW | 325 | 73 | 2 | Biology
12 | SonarEW | 60 | 208 | 2 | Biology
13 | SpectEW | 22 | 267 | 2 | Biology
14 | Tic-tac-toe | 9 | 958 | 2 | Game
15 | Vote | 16 | 300 | 2 | Politics
16 | WaveformEW | 40 | 5000 | 3 | Physics
17 | WineEW | 13 | 178 | 3 | Chemistry
18 | Zoo | 16 | 101 | 6 | Artificial

Parameter settings

For any multi-agent evolutionary algorithm, the parameters always play an important role in determining the outcome. Especially, the population size and the total number of iterations (number of generations) heavily affect the outcome of the algorithm. Hence, we have performed some experiments to determine the approximately ideal population size and total number of iterations. The classification accuracy achieved by BGNDO and BSNDO for population sizes varying from 10 to 50 is provided in Table 4. Similarly, the numbers of features selected by BGNDO and BSNDO for population sizes varying from 10 to 50 are depicted in Table 5. To observe the convergence of the solution to the optimal position, convergence graphs have been plotted over 50 iterations (given in Fig. 2). To maintain the fairness of the comparison, we have run each dataset 10 times and taken the average over these runs.
Table 4

Achieved classification accuracy obtained by BGNDO and BSNDO with different population sizes.

Pop_size | 10 | 10 | 20 | 20 | 30 | 30 | 40 | 40 | 50 | 50
Dataset | BGNDO | BSNDO | BGNDO | BSNDO | BGNDO | BSNDO | BGNDO | BSNDO | BGNDO | BSNDO
Breastcancer | 98.57 | 100 | 97.14 | 100 | 99.28 | 98.57 | 99.28 | 99.28 | 98.57 | 100
BreastEW | 96.49 | 97.36 | 97.37 | 98.25 | 96.49 | 98.24 | 97.36 | 97.36 | 95.61 | 99.122
CongressEW | 96.55 | 100 | 97.7 | 100 | 98.85 | 98.85 | 96.55 | 97.7 | 98.85 | 98.85
Exactly | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.5 | 100 | 100
Exactly2 | 77 | 78.5 | 80.5 | 80.5 | 80 | 78.5 | 79 | 80 | 80 | 79.5
HeartEW | 85.18 | 83.33 | 90.74 | 94.44 | 88.88 | 90.74 | 85.18 | 94.44 | 85.18 | 90.74
IonosphereEW | 91.43 | 92.86 | 95.71 | 95.74 | 91.43 | 94.28 | 94.29 | 92.86 | 92.85 | 94.28
KrvskpEW | 97.65 | 98.12 | 98.12 | 98.44 | 98.43 | 97.49 | 98.59 | 97.81 | 97.96 | 97.33
Lymphography | 93.33 | 93.33 | 96.67 | 96.67 | 90 | 96.67 | 96.67 | 90 | 96.67 | 93.33
M-of-n | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
PenglungEW | 93.33 | 93.33 | 100 | 100 | 93.33 | 93.33 | 86.67 | 100 | 93.33 | 100
SonarEW | 95.23 | 88.09 | 97.62 | 95.24 | 92.85 | 92.86 | 97.62 | 95.23 | 92.85 | 97.62
SpectEW | 90.56 | 94.44 | 92.45 | 96.22 | 92.45 | 94.44 | 94.33 | 90.74 | 92.45 | 88.89
Tic-tac-toe | 83.33 | 86.46 | 89.58 | 87.5 | 84.89 | 86.46 | 83.85 | 86.46 | 88.54 | 84.89
Vote | 98.33 | 100 | 100 | 100 | 98.33 | 98.33 | 100 | 100 | 98.33 | 100
WaveformEW | 84.3 | 84.4 | 84.5 | 87 | 83.4 | 84.6 | 85.3 | 85.6 | 85.7 | 83.8
WineEW | 97.22 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97.22 | 100
Zoo | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Table 5

Number of selected features by BGNDO and BSNDO for different population sizes.

Pop_size | 10 | 10 | 20 | 20 | 30 | 30 | 40 | 40 | 50 | 50
Dataset | BGNDO | BSNDO | BGNDO | BSNDO | BGNDO | BSNDO | BGNDO | BSNDO | BGNDO | BSNDO
Breastcancer | 4 | 4 | 7 | 4 | 4 | 6 | 4 | 3 | 4 | 4
BreastEW | 13 | 8 | 14 | 4 | 15 | 11 | 16 | 5 | 13 | 12
CongressEW | 6 | 6 | 9 | 7 | 7 | 9 | 8 | 10 | 7 | 8
Exactly | 7 | 6 | 10 | 6 | 7 | 6 | 6 | 7 | 7 | 6
Exactly2 | 6 | 4 | 9 | 8 | 8 | 6 | 9 | 12 | 7 | 6
HeartEW | 5 | 5 | 6 | 4 | 6 | 4 | 5 | 4 | 4 | 5
IonosphereEW | 17 | 12 | 26 | 16 | 15 | 16 | 20 | 12 | 10 | 8
KrvskpEW | 24 | 24 | 22 | 22 | 21 | 25 | 26 | 24 | 22 | 17
Lymphography | 10 | 5 | 11 | 5 | 8 | 8 | 6 | 6 | 9 | 6
M-of-n | 7 | 6 | 8 | 6 | 7 | 6 | 7 | 6 | 7 | 6
PenglungEW | 48 | 132 | 209 | 187 | 129 | 132 | 171 | 179 | 177 | 139
SonarEW | 30 | 24 | 39 | 27 | 33 | 31 | 30 | 36 | 27 | 28
SpectEW | 9 | 13 | 14 | 6 | 10 | 7 | 11 | 6 | 11 | 12
Tic-tac-toe | 6 | 9 | 9 | 9 | 7 | 9 | 9 | 9 | 9 | 9
Vote | 10 | 7 | 10 | 3 | 7 | 3 | 8 | 6 | 6 | 7
WaveformEW | 34 | 25 | 27 | 33 | 27 | 26 | 31 | 28 | 26 | 4
WineEW | 6 | 4 | 9 | 3 | 7 | 1 | 4 | 4 | 4 | 4
Zoo | 8 | 6 | 11 | 5 | 6 | 1 | 7 | 6 | 5 | 8
Fig. 2

Convergence graphs depicting the convergence of best individual at every iteration for 18 UCI datasets using BGNDO and BSNDO.

From the initial experiments, we have found that a population size of 20 leads to noteworthy results. Keeping the computational cost in mind, this population size is considered for further experiments. At the same time, the convergence graphs show that after approximately 30 iterations the best solution is almost at the optimal position; hence, this number of iterations has been used for further experiments.

Result and discussion

This section discusses the outcomes produced by BGNDO and BSNDO on the UCI datasets detailed in Table 3, evaluated using the KNN, Random Forest and Naive Bayes classifiers. These results establish the superiority of BSNDO over BGNDO. Table 6, Table 7 and Table 8 present the outcomes obtained by the proposed BSNDO algorithm when evaluated with the KNN, Random Forest and Naive Bayes classifiers, respectively. Compared to the BGNDO algorithm, the obtained results clearly depict the effect of BSNDO in finding better solutions. Observing these results, we can conclude that BSNDO performs better than BGNDO on the UCI datasets. Furthermore, as KNN is widely used in the literature for FS on UCI datasets (Emary et al., 2016, Mafarja and Mirjalili, 2017, Mafarja et al., 2019), for further experiments and discussion we have utilized the KNN classifier.
Table 6

Achieved classification accuracy and number of selected features by BGNDO and BSNDO using KNN classifier (highest classification accuracies and lowest no. of selected features are highlighted in bold font).

Sl. No. | Dataset | Original Accuracy | Original #Features | BGNDO Accuracy | BGNDO #Features | BSNDO Accuracy | BSNDO #Features
1 | Breastcancer | 96 | 9 | 97.14 | 7 | 100 | 4
2 | BreastEW | 92.63 | 30 | 97.37 | 14 | 98.25 | 4
3 | CongressEW | 92.18 | 16 | 97.7 | 9 | 100 | 7
4 | Exactly | 72.3 | 13 | 100 | 10 | 100 | 6
5 | Exactly2 | 73.3 | 13 | 80.5 | 9 | 80.5 | 8
6 | HeartEW | 68.15 | 13 | 90.74 | 6 | 94.44 | 4
7 | IonosphereEW | 83.43 | 34 | 95.71 | 26 | 95.74 | 16
8 | KrvskpEW | 96.1 | 36 | 98.12 | 22 | 98.44 | 22
9 | Lymphography | 81.33 | 18 | 96.67 | 11 | 96.67 | 5
10 | M-of-n | 87.4 | 13 | 100 | 8 | 100 | 6
11 | PenglungEW | 81.33 | 325 | 100 | 209 | 100 | 187
12 | SonarEW | 80.95 | 60 | 94.62 | 39 | 95.24 | 27
13 | SpectEW | 82.22 | 22 | 92.45 | 14 | 96.22 | 6
14 | Tic-tac-toe | 81.1 | 9 | 83.85 | 4 | 87.5 | 8
15 | Vote | 92.33 | 16 | 100 | 10 | 100 | 3
16 | WaveformEW | 81.44 | 40 | 84.5 | 27 | 87 | 33
17 | WineEW | 66.67 | 13 | 100 | 9 | 100 | 3
18 | Zoo | 87 | 16 | 100 | 11 | 100 | 5
Table 7

Achieved classification accuracy and number of selected features by BGNDO and BSNDO using Random Forest classifier (highest classification accuracies and lowest no. of selected features are highlighted in bold font).

Sl. No. | Dataset | Original Accuracy | Original #Features | BGNDO Accuracy | BGNDO #Features | BSNDO Accuracy | BSNDO #Features
1 | Breastcancer | 97.8 | 9 | 97.14 | 7 | 97.86 | 2
2 | BreastEW | 98.2 | 30 | 95.61 | 2 | 100 | 4
3 | CongressEW | 97.7 | 16 | 96 | 1 | 98.85 | 5
4 | Exactly | 78.5 | 13 | 100 | 6 | 100 | 6
5 | Exactly2 | 74 | 13 | 76 | 1 | 76 | 1
6 | HeartEW | 81.5 | 13 | 88.89 | 5 | 94.44 | 5
7 | IonosphereEW | 91.4 | 34 | 95.71 | 24 | 98.57 | 20
8 | KrvskpEW | 99.5 | 36 | 98.12 | 28 | 99.53 | 17
9 | Lymphography | 90 | 18 | 93.33 | 8 | 96.67 | 4
10 | M-of-n | 100 | 13 | 100 | 8 | 100 | 6
11 | PenglungEW | 86.7 | 325 | 93.33 | 140 | 100 | 193
12 | SonarEW | 90.7 | 60 | 92.86 | 42 | 95.24 | 14
13 | SpectEW | 88.9 | 22 | 90.74 | 15 | 96.3 | 7
14 | Tic-tac-toe | 95.8 | 9 | 82.94 | 5 | 94.37 | 8
15 | Vote | 95 | 16 | 98.33 | 12 | 98.33 | 6
16 | WaveformEW | 85.8 | 40 | 83 | 33 | 86.2 | 29
17 | WineEW | 100 | 13 | 100 | 4 | 100 | 3
18 | Zoo | 100 | 16 | 100 | 4 | 100 | 3
Table 8

Achieved classification accuracy and number of selected features by BGNDO and BSNDO using Naive Bayes classifier (highest classification accuracies and lowest no. of selected features are highlighted in bold font).

| Sl. No. | Dataset | Original Acc. | Original Feat. | BGNDO Acc. | BGNDO Feat. | BSNDO Acc. | BSNDO Feat. |
|---|---|---|---|---|---|---|---|
| 1 | Breastcancer | 89.28 | 9 | 97.87 | 7 | 99.24 | 5 |
| 2 | BreastEW | 96.49 | 30 | 97.36 | 14 | 98.2 | 5 |
| 3 | CongressEW | 98.85 | 16 | 98.85 | 8 | 100 | 6 |
| 4 | Exactly | 69.5 | 13 | 96 | 8 | 100 | 6 |
| 5 | Exactly2 | 76 | 13 | 76 | 9 | 76 | 6 |
| 6 | HeartEW | 94.44 | 13 | 92.3 | 9 | 94.44 | 4 |
| 7 | IonosphereEW | 95.71 | 34 | 92.88 | 18 | 95.74 | 15 |
| 8 | KrvskpEW | 65.88 | 36 | 95.31 | 9 | 97.18 | 12 |
| 9 | Lymphography | 86.67 | 18 | 90 | 15 | 100 | 8 |
| 10 | M-of-n | 96.5 | 13 | 98.5 | 6 | 100 | 6 |
| 11 | PenglungEW | 60 | 325 | 73.33 | 120 | 93.33 | 157 |
| 12 | SonarEW | 80.95 | 60 | 80.95 | 23 | 97.61 | 21 |
| 13 | SpectEW | 72.22 | 22 | 90.24 | 14 | 92.59 | 14 |
| 14 | Tic-tac-toe | 75.52 | 9 | 72.92 | 6 | 82.92 | 5 |
| 15 | Vote | 98.33 | 16 | 100 | 8 | 100 | 3 |
| 16 | WaveformEW | 82.2 | 40 | 82.5 | 25 | 85.8 | 18 |
| 17 | WineEW | 100 | 13 | 100 | 3 | 100 | 2 |
| 18 | Zoo | 100 | 16 | 100 | 5 | 100 | 3 |
Inspecting the results in these tables, we can observe that BSNDO provides better results than BGNDO on every dataset, irrespective of the classifier used. From Table 6 we can see that, when evaluated using the KNN classifier, BSNDO achieves at least 90% accuracy on 15 datasets (83.33%), out of which it produces 100% classification accuracy on 8 datasets (44.44%). It produces better classification accuracy than BGNDO on every dataset except Exactly, Exactly2, Lymphography, M-of-n, PenglungEW, Vote, WineEW and Zoo, where both produce equivalent accuracy. Considering the number of selected features, BSNDO beats BGNDO on every dataset except WaveformEW; the two select the same number of features in the case of KrvskpEW. Similarly, from Table 7 we can see that BSNDO achieves at least 90% accuracy on 16 datasets (88.89%) when evaluated using the Random Forest classifier, including 100% accuracy on 6 datasets (33.33%). In the case of Exactly, Exactly2, M-of-n, Vote, WineEW and Zoo, BSNDO and BGNDO produce the same classification accuracy; BSNDO has the upper hand over BGNDO in the remaining cases. Regarding the number of selected features, only in the case of BreastEW, CongressEW, PenglungEW and Tic-tac-toe does BGNDO produce better results; the two select the same number of features for Exactly, Exactly2 and HeartEW, and BSNDO selects fewer features than BGNDO in the remaining cases. When evaluated using the Naive Bayes classifier (Table 8), BSNDO produces at least 90% classification accuracy on 15 datasets (83.33%) and 100% accuracy on 7 datasets (38.89%). BSNDO and BGNDO produce equivalent results in the case of Exactly2, Vote, WineEW and Zoo; BSNDO produces better results in the remaining cases. It also selects fewer features than BGNDO on every dataset except KrvskpEW and PenglungEW; both select the same number of features in the case of M-of-n and SpectEW.
From the above discussion, we can say that BSNDO is superior to BGNDO when evaluated using the KNN, Random Forest and Naive Bayes classifiers. The results achieved by BSNDO using these classifiers establish the fact that BSNDO produces noteworthy results across different classifiers.

Comparison

The previous discussion established that BSNDO produces better results than BGNDO. This section compares the performance of BSNDO with eight state-of-the-art FS methods, consisting of a few popular methods and some recently proposed hybrid methods: binary GA (BGA), binary PSO (BPSO), adaptive switching grey-whale optimizer (ASGW), hybrid serial grey-whale optimizer (HSGW), random switching grey-whale optimizer (RSGW), social ski driver algorithm with late acceptance hill-climbing (SSDsLAHC) (Chatterjee et al., 2020), electrical harmony based hybrid meta-heuristic (EHHM) (Sheikh et al., 2020) and embedded chaotic whale survival algorithm (ECWSA-4) (Guha et al., 2020). From Table 9 we can say that BSNDO produces the overall best result. In the case of Breastcancer, BSNDO and EHHM produce 100% accuracy. On BreastEW, BSNDO holds the second position along with SSDsLAHC, after ASGW and EHHM. BSNDO holds the top position along with SSDsLAHC on CongressEW, producing 100% accuracy. In the case of Exactly, BSNDO produces the best result along with SSDsLAHC, HSGW, BGA, BPSO and EHHM. HSGW beats BSNDO on Exactly2 by a very narrow margin. BSNDO stands at third position in the case of HeartEW, and at fifth position on IonosphereEW and SonarEW. In the case of KrvskpEW and Lymphography, it attains the second position after BGA and EHHM respectively. In the case of M-of-n, PenglungEW, Vote, WineEW and Zoo, BSNDO stands at first position along with a few other methods. It achieves the highest classification accuracy in the case of SpectEW and Tic-tac-toe. In the case of WaveformEW, it produces the second best result after EHHM.
Table 9

Comparison of BSNDO with state-of-the-art FS methods based on achieved classification accuracy tested on UCI datasets (highest classification accuracies are highlighted).

| Dataset | BSNDO | SSDs+LAHC | HSGW | RSGW | ASGW | BGA | BPSO | EHHM | ECWSA-4 |
|---|---|---|---|---|---|---|---|---|---|
| Breastcancer | 100 | 98.93 | 98.6 | 97.1 | 98.5 | 97.43 | 96.29 | 100 | 95.21 |
| BreastEW | 98.25 | 98.25 | 98.1 | 98.2 | 100 | 97.54 | 97.19 | 100 | 97.38 |
| CongressEW | 100 | 100 | 97.5 | 96.1 | 99.4 | 96.79 | 96.33 | 98.85 | 96.23 |
| Exactly | 100 | 100 | 100 | 99.7 | 99.9 | 100 | 100 | 100 | 78.09 |
| Exactly2 | 80.5 | 79 | 81.5 | 77.9 | 77.7 | 77 | 76.8 | 79.1 | 78.9 |
| HeartEW | 90.74 | 91.67 | 92.3 | 84.8 | 83.1 | 87.41 | 83.7 | 90.7 | 85.63 |
| IonosphereEW | 95.74 | 96.43 | 94.4 | 97.8 | 97.2 | 94.89 | 94.89 | 98.6 | 86.79 |
| KrvskpEW | 98.44 | 97.81 | 97.3 | 97.2 | 97.1 | 98.5 | 97.31 | 97.81 | 93.53 |
| Lymphography | 96.67 | 96.67 | 93.4 | 89.3 | 88.4 | 83.78 | 89.19 | 96.9 | 87.02 |
| M-of-n | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92.47 |
| PenglungEW | 100 | 100 | 94.2 | 100 | 100 | 91.89 | 91.89 | 100 | 87.63 |
| SonarEW | 95.24 | 97.62 | 96.4 | 97.9 | 94.8 | 99.04 | 94.23 | 92.85 | 76.84 |
| SpectEW | 96.22 | 95.15 | 86.2 | 81.5 | 87 | 89.55 | 88.81 | 90.74 | 79.84 |
| Tic-tac-toe | 87.5 | 87.24 | 82.8 | 85.9 | 86.5 | 79.96 | 79.96 | 85 | 78.75 |
| Vote | 100 | 100 | 98.3 | 99.6 | 98.4 | 97.33 | 96 | 98.4 | 95.08 |
| WaveformEW | 85 | 84.4 | 74.8 | 75.7 | 74.6 | 78.36 | 75.6 | 86.8 | 80.18 |
| WineEW | 100 | 100 | 100 | 100 | 100 | 98.88 | 97.75 | 100 | 98.02 |
| Zoo | 100 | 100 | 100 | 100 | 100 | 90.2 | 96.08 | 100 | 98.95 |
| Avg. rank | 1.833 | 2 | 3.5 | 3.944 | 4 | 4.22 | 5.33 | 2.33 | 5.944 |
| Assigned rank | 1 | 2 | 4 | 5 | 6 | 7 | 8 | 3 | 9 |
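The "Avg. rank" and assigned-rank rows of Table 9 and Table 10 can be reproduced by ranking the methods on each dataset (rank 1 is best, with tied values sharing the average of their positions) and then averaging across datasets. A small stdlib-only sketch, using two accuracy rows from Table 9 for illustration:

```python
def fractional_ranks(values, higher_better=True):
    """Rank positions (1 = best); tied values share the average of their positions."""
    order = sorted(values, reverse=higher_better)
    return [sum(i + 1 for i, v in enumerate(order) if v == x) / order.count(x)
            for x in values]

# Accuracy rows from Table 9, in the table's column order:
# BSNDO, SSDs+LAHC, HSGW, RSGW, ASGW, BGA, BPSO, EHHM, ECWSA-4.
rows = [
    [100, 98.93, 98.6, 97.1, 98.5, 97.43, 96.29, 100, 95.21],   # Breastcancer
    [87.5, 87.24, 82.8, 85.9, 86.5, 79.96, 79.96, 85, 78.75],   # Tic-tac-toe
]
per_dataset = [fractional_ranks(r) for r in rows]
avg_rank = [sum(col) / len(rows) for col in zip(*per_dataset)]
# On Breastcancer, BSNDO and EHHM tie for first, so each gets rank 1.5.
```

Over all 18 datasets (and ranking feature counts with lower-is-better for Table 10), this procedure yields the average ranks reported in the tables.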
Table 10 compares BSNDO with the state-of-the-art FS methods based on the number of selected features. BSNDO selects the fewest features on BreastEW; on Exactly along with SSDsLAHC, BGA and BPSO; on Lymphography along with BGA and BPSO; and on Vote along with BPSO. It stands at second position in the case of Breastcancer along with BGA and BPSO, HeartEW, M-of-n with SSDsLAHC, BGA and BPSO, SpectEW with BPSO, and WineEW along with SSDsLAHC. BSNDO stands at third position in the case of Exactly2 with SSDsLAHC and Zoo along with BPSO, and at fourth position in the case of IonosphereEW and Tic-tac-toe along with ECWSA-4. It selects the same number of features as EHHM in the case of CongressEW, attaining the fifth position. In the case of PenglungEW and WaveformEW, it stands at ninth position.
Table 10

Comparison of BSNDO with state-of-the-art methods based on number of selected features tested on UCI datasets (least number of selected features are highlighted).

| Dataset | BSNDO | SSDs+LAHC | HSGW | RSGW | ASGW | BGA | BPSO | EHHM | ECWSA-4 |
|---|---|---|---|---|---|---|---|---|---|
| Breastcancer | 4 | 2.55 | 5.933 | 4.867 | 4 | 4 | 4 | 4 | 7 |
| BreastEW | 4 | 9 | 16.667 | 17.5 | 15.833 | 8 | 9 | 13 | 15 |
| CongressEW | 7 | 5.5 | 8.867 | 9.7 | 8.833 | 2 | 3 | 7 | 4 |
| Exactly | 6 | 6 | 6.7 | 7.1 | 6.867 | 6 | 6 | 7 | 7 |
| Exactly2 | 8 | 8 | 9.033 | 9.2 | 7.933 | 1 | 1 | 5 | 9 |
| HeartEW | 4 | 5 | 8.767 | 6.133 | 6.367 | 5 | 3 | 8 | 9 |
| IonosphereEW | 16 | 12 | 18.167 | 20.5 | 17.3 | 7 | 7 | 7 | 10 |
| KrvskpEW | 22 | 20 | 24.8 | 24.8 | 24.5 | 11 | 12 | 15 | 16 |
| Lymphography | 5 | 6.5 | 10.567 | 10.567 | 11.2 | 5 | 5 | 6 | 10 |
| M-of-n | 6 | 6 | 6.8 | 7.1 | 6.867 | 6 | 6 | 7 | 5 |
| PenglungEW | 187 | 140 | 135.33 | 181.2 | 170.3 | 84 | 130 | 74 | 93 |
| SonarEW | 27 | 23.5 | 34.3 | 36.433 | 35.5 | 19 | 22 | 22 | 23 |
| SpectEW | 6 | 9 | 10.233 | 13.3 | 10.167 | 5 | 6 | 11 | 7 |
| Tic-tac-toe | 8 | 9 | 7 | 7 | 7 | 5 | 6 | 6 | 8 |
| Vote | 3 | 4.5 | 7.567 | 8.8 | 8.967 | 5 | 3 | 5 | 6 |
| WaveformEW | 33 | 22.5 | 26.933 | 27.533 | 25.833 | 15 | 15 | 20 | 15 |
| WineEW | 3 | 3 | 4.533 | 5.867 | 5.933 | 4 | 5 | 1 | 7 |
| Zoo | 5 | 4.5 | 5.533 | 5.3 | 7.6 | 4 | 5 | 1 | 7 |
| Avg. rank | 3.5 | 3.22 | 5.277 | 6.277 | 5.33 | 1.61 | 2.055 | 2.944 | 4.16 |
| Assigned rank | 5 | 4 | 7 | 9 | 8 | 1 | 2 | 3 | 6 |
To make a quantitative decision about a process, we perform a statistical test. The goal of such a test is to determine whether there is enough evidence to reject a conjecture about the process, called the null hypothesis. In our case, the null hypothesis states that the two sets of results come from the same distribution; if the two sets of results are statistically different, the p-value generated from the test statistic will be less than 0.05 when the test is performed at the 0.05 significance level, resulting in the rejection of the null hypothesis. So, to determine the statistical significance of the BSNDO algorithm, the Wilcoxon rank-sum test (Wilcoxon, 1992) has been performed. It is a non-parametric statistical test in which a pairwise comparison is performed. Each meta-heuristic algorithm was run 20 times on each UCI dataset used here to perform the statistical test. From the test results provided in Table 11, we can conclude that the results of the proposed BSNDO algorithm are statistically significant.
Table 11

p-values generated via the pairwise Wilcoxon rank-sum test using the results obtained from 20 independent runs of the proposed BSNDO method and the state-of-the-art FS methods used for comparison.

| Dataset | SSDs+LAHC | HSGW | RSGW | ASGW | BGA | BPSO | EHHM | ECWSA-4 |
|---|---|---|---|---|---|---|---|---|
| Breastcancer | 0.031623 | 0.000212 | 0.000292 | 0.013724 | 0.000126 | 0.000392 | 0.000455 | 0.000392 |
| BreastEW | 0.275234 | 0.00029 | 0.000119 | 0.284088 | 0.003088 | 0.001373 | 1.91E−06 | 1.91E−06 |
| CongressEW | 0.599266 | 0.000138 | 0.000297 | 0.017474 | 0.000161 | 0.008211 | 0.176853 | 0.000212 |
| Exactly | 0.022958 | 8.83E−05 | 8.73E−05 | 0.000131 | 8.54E−05 | 8.72E−05 | 0.000131 | 0.000618 |
| Exactly2 | 7.88E−05 | 8.72E−05 | 8.66E−05 | 0.000127 | 0.000153 | 7.99E−05 | 0.00021 | 0.974353 |
| HeartEW | 0.359797 | 8.81E−05 | 0.000102 | 0.00023 | 0.000153 | 8.72E−05 | 0.521673 | 8.72E−05 |
| IonosphereEW | 0.925575 | 0.000291 | 0.000127 | 0.003191 | 0.000845 | 0.047031 | 0.294252 | 1.91E−06 |
| KrvskpEW | 0.000155 | 8.86E−05 | 8.83E−05 | 0.000515 | 8.81E−05 | 0.000132 | 1.91E−06 | 0.009436 |
| Lymphography | 0.510917 | 0.000127 | 8.34E−05 | 0.001458 | 0.000285 | 0.000213 | 3.62E−05 | 0.000119 |
| M-of-n | 0.072789 | 8.77E−05 | 8.79E−05 | 8.77E−05 | 8.82E−05 | 0.000131 | 0.000618 | 8.72E−05 |
| PenglungEW | 0.054977 | 0.000234 | 0.014786 | 0.145537 | 0.213681 | 0.474082 | 0.009436 | 8.77E−05 |
| SonarEW | 0.264962 | 0.000115 | 0.00013 | 0.000527 | 0.937537 | 0.011806 | 1.91E−06 | 0.000127 |
| SpectEW | 0.262646 | 8.77E−05 | 0.000212 | 0.011068 | 0.000178 | 0.00013 | 1.91E−06 | 0.000127 |
| Tic-tac-toe | 0.155787 | 8.78E−05 | 8.83E−05 | 0.021748 | 0.000187 | 8.84E−05 | 0.452375 | 8.84E−05 |
| Vote | 0.166793 | 0.000128 | 0.000127 | 0.029028 | 0.00019 | 0.000406 | 4.77E−05 | 9.02E−05 |
| WaveformEW | 0.000182 | 8.86E−05 | 0.000103 | 8.77E−05 | 8.86E−05 | 8.83E−05 | 1.91E−06 | 0.000297 |
| WineEW | 0.365712 | 9.95E−05 | 8.71E−05 | 0.000859 | 0.05349 | 0.000269 | 0.001341 | 0.000269 |
| Zoo | 0.763025 | 9.02E−05 | 0.000104 | 0.0455 | 0.001689 | 0.000147 | 0.974353 | 8.77E−05 |
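For reference, the rank-sum test used here is straightforward to sketch with the usual normal approximation. The stdlib-only code below is illustrative; the 20-run accuracy samples are made-up numbers, not the paper's actual runs.

```python
import math

def rank_sum_p(a, b):
    """Two-sided p-value for the Wilcoxon rank-sum test via the normal
    approximation (adequate for ~20 runs; tied values share average ranks)."""
    pooled = sorted(a + b)
    def avg_rank(v):                     # 1-based average rank of v in the pool
        pos = [i + 1 for i, x in enumerate(pooled) if x == v]
        return sum(pos) / len(pos)
    n1, n2 = len(a), len(b)
    w = sum(avg_rank(v) for v in a)      # rank-sum statistic of the first sample
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    # two-sided p from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical accuracy samples from 20 independent runs of two methods:
method_a = [0.95 + 0.001 * i for i in range(20)]
method_b = [0.88 + 0.001 * i for i in range(20)]
p = rank_sum_p(method_a, method_b)       # p < 0.05 -> reject the null hypothesis
```

A p-value below 0.05, as in most cells of Table 11, rejects the null hypothesis that the two methods' results share the same distribution.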

Additional testing on microarray datasets

The results reported above establish the fact that BSNDO performs better than the state-of-the-art methods considered here for comparison. To check the robustness of the proposed method, we have applied it to several high-dimensional microarray datasets (Ahmed et al., 2020a). The description of these datasets is given in Table 12. To confirm the superiority of the proposed method, it is compared with some state-of-the-art methods, namely GA (Ghosh et al., 2018a), Memetic algorithm (MA) (Ghosh et al., 2019b, Ghosh et al., 2018b), WFACOFS (Ghosh et al., 2019a) and ECWSA (Guha et al., 2020). The comparison is given in Table 13.
Table 12

Description of datasets used to check the robustness of BSNDO.

| Dataset | Number of features | Number of samples | Number of classes |
|---|---|---|---|
| AMLGSE2191 | 12,616 | 54 | 2 |
| DLBCL | 7,070 | 77 | 2 |
| Leukaemia | 5,147 | 72 | 2 |
| Prostate | 12,533 | 102 | 2 |
| MLL | 12,533 | 72 | 3 |
| SRBCT | 2,308 | 83 | 4 |
Table 13

Comparison of the results of BSNDO on microarray with state-of-the-art methods. The number of features selected is provided in brackets at the side of the accuracy.

| Dataset | GA | MA | WFACOFS | ECWSA-1 | ECWSA-2 | ECWSA-3 | ECWSA-4 | BSNDO |
|---|---|---|---|---|---|---|---|---|
| AMLGSE2191 | 100 (98) | 100 (91) | 96.3 (17) | 96.67 (17) | 100 (9) | 95.83 (16) | 95.83 (18) | 100 (9) |
| DLBCL | 100 (88) | 100 (105) | 100 (3) | 100 (29) | 100 (24) | 100 (26) | 100 (31) | 100 (10) |
| Leukaemia | 100 (85) | 100 (65) | 100 (5) | 97.22 (7) | 100 (8) | 100 (4) | 97.22 (5) | 100 (12) |
| Prostate | 100 (99) | 100 (107) | 100 (22) | 96.3 (16) | 98.15 (16) | 96.3 (9) | 96.3 (19) | 95.24 (20) |
| MLL | 100 (94) | 100 (80) | 100 (25) | 100 (16) | 100 (17) | 100 (8) | 100 (15) | 100 (16) |
| SRBCT | 100 (78) | 100 (50) | 100 (19) | 100 (45) | 100 (32) | 100 (34) | 100 (30) | 100 (11) |
As microarray datasets consist of a very large number of attributes, ruling out the irrelevant ones becomes a challenging task. The obtained results again demonstrate the effectiveness of BSNDO. From Table 13, we can see that BSNDO produces noteworthy results as compared to the other methods considered here for comparison. It produces 100% accuracy on every dataset except Prostate. It also selects the fewest features in the case of AMLGSE2191 and SRBCT.

Testing on COVID-19 dataset

COVID-19 is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The common symptoms of COVID-19 are fever, cough, fatigue, breathing difficulties, and loss of smell and taste. Symptoms begin 1 to 14 days after exposure to the virus. While most people have mild symptoms, some develop acute respiratory distress syndrome (ARDS). Because of the nature of the disease, obtaining an accurate COVID-19 test result remains a challenging task. Some recent COVID-19 screening techniques are reported in Bandyopadhyay et al., 2021, Barnes et al., 2021, Dey et al., 2021, Ismael and Şengür, 2021, Kundu et al., 2021 and Nigam et al. (2021). More than 190 million people have been infected with COVID-19, and more than 4 million people have died because of it. Hence, detecting COVID-19 cases and keeping the affected people in quarantine has become one of the topmost priorities of every country. Though the vaccination process has started, it will take time to reach everyone, especially in developing countries. We have tested our FS method on a COVID-19 dataset, which is publicly available at https://github.com/Atharva-Peshkar/Covid-19-Patient-Health-Analytics in CSV format. This dataset contains 1086 instances and 74 attributes. The obtained results are compared with some meta-heuristic based FS methods: SSDsLAHC, ASGW, HSGW, RSGW, GA, PSO and adaptive β-coral reefs optimization (AβCRO) (Ahmed et al., 2020a). Table 14 shows the achieved classification accuracy and the number of selected features (in brackets).
Table 14

Comparison of achieved classification accuracy evaluated on mentioned COVID-19 dataset.

| BSNDO (GNDO+SA) | SSDs+LAHC | ASGW | HSGW | RSGW | GA | PSO | AβCRO |
|---|---|---|---|---|---|---|---|
| 98.61 (26) | 97.69 (23) | 97.69 (40) | 96.31 (50) | 97.75 (54) | 94.9 (23) | 97.24 (31) | 98.2 (20) |

Conclusion and future work

In this work, a new FS method called BSNDO, based on GNDO and SA, has been proposed. SA has been used as a local search to enhance the exploitation of GNDO and to create a proper balance between exploration and exploitation of the overall method. The proposed method shows significant improvement in classification accuracy over BGNDO and some state-of-the-art methods. The method has primarily been tested on various UCI datasets. To prove the robustness of the model, it has also been applied to high-dimensional microarray datasets. Furthermore, it has been tested on a COVID-19 dataset for detecting COVID-19 cases. It is to be noted that all the datasets used here are publicly available. The obtained results show the applicability of BSNDO to varied datasets. One limitation of this method may be the added computational cost of attaching the local search to the GNDO algorithm. As future work, a deeper analysis of the gene selections made by BSNDO and their biological impact can be studied. The proposed FS method can also be applied to other real-world problems, such as handwritten word or digit recognition and face recognition, where researchers sometimes use very high-dimensional feature vectors without knowing the importance of all the features.
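The SA-as-local-search idea summarized above can be sketched as follows. This is an illustrative, stdlib-only sketch of SA refining a binary feature mask; the temperature schedule, step count and toy fitness function are assumptions, not the paper's settings.

```python
import math, random

def sa_refine(mask, fitness, t0=1.0, cooling=0.95, steps=100, rng=None):
    """Refine a binary feature mask by simulated annealing: flip one random bit,
    always accept improvements, and accept worse moves with probability e^(dF/T)."""
    rng = rng or random.Random()
    cur, cur_f = mask[:], fitness(mask)
    best, best_f = cur[:], cur_f
    t = t0
    for _ in range(steps):
        cand = cur[:]
        cand[rng.randrange(len(cand))] ^= 1          # flip one feature bit
        cand_f = fitness(cand)
        if cand_f >= cur_f or rng.random() < math.exp((cand_f - cur_f) / t):
            cur, cur_f = cand, cand_f                # accept the move
        if cur_f > best_f:
            best, best_f = cur[:], cur_f             # track the best mask seen
        t *= cooling                                 # geometric cooling schedule
    return best, best_f

# Toy fitness: reward matching a hidden "ideal" subset (illustrative only; in
# the FS setting this would be a classifier-accuracy-based objective).
ideal = [1, 0, 1, 0, 0, 1, 0, 0]
fit = lambda m: -sum(x != y for x, y in zip(m, ideal))
best, best_f = sa_refine([0] * 8, fit, rng=random.Random(1))
```

Applied to the best solution found by GNDO in each iteration, such a step strengthens exploitation without changing the population-level exploration.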

CRediT authorship contribution statement

Shameem Ahmed: Conceptualization, Methodology, Writing – original draft, Software, Investigation. Khalid Hassan Sheikh: Writing – review & editing, Software, Investigation, Conceptualization. Seyedali Mirjalili: Writing – review & editing, Supervision, Project administration, Validation. Ram Sarkar: Writing – review & editing, Supervision, Project administration, Validation, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References

1. Novel type of phase transition in a system of self-driven particles. Phys Rev Lett, 1995.
2. S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi. Optimization by simulated annealing. Science, 1983.
3. M. Ghosh, S. Adhikary, K. K. Ghosh, A. Sardar, S. Begum, R. Sarkar. Genetic algorithm based cancerous gene identification from microarray data using ensemble of filter methods. Med Biol Eng Comput, 2018.
4. A. Garain, A. Basu, F. Giampaolo, J. D. Velasquez, R. Sarkar. Detection of COVID-19 from CT scan images: A spiking neural network-based approach. Neural Comput Appl, 2021.
5. B. Nigam, A. Nigam, R. Jain, S. Dodia, N. Arora, B. Annappa. COVID-19: Automatic detection from X-ray images by utilizing deep learning methods. Expert Syst Appl, 2021.
6. Y. Karbhari, A. Basu, Z.-W. Geem, G.-T. Han, R. Sarkar. Generation of Synthetic Chest X-ray Images and Detection of COVID-19: A Deep Learning Based Approach. Diagnostics (Basel), 2021.
7. R. Kundu, H. Basak, P. K. Singh, A. Ahmadian, M. Ferrara, R. Sarkar. Fuzzy rank-based fusion of CNN models using Gompertz function for screening COVID-19 CT-scans. Sci Rep, 2021.
8. R. Bandyopadhyay, A. Basu, E. Cuevas, R. Sarkar. Harris Hawks optimisation with Simulated Annealing as a deep feature selection method for screening of COVID-19 CT-scans. Appl Soft Comput, 2021.
9. S. Dey, R. Bhattacharya, S. Malakar, S. Mirjalili, R. Sarkar. Choquet fuzzy integral-based classifier ensemble technique for COVID-19 detection. Comput Biol Med, 2021.

