Multi-Population Genetic Algorithm for Multilabel Feature Selection Based on Label Complementary Communication.

Jaegyun Park¹, Min-Woo Park¹, Dae-Won Kim¹, Jaesung Lee¹.

Abstract

Multilabel feature selection is an effective preprocessing step for improving multilabel classification accuracy, because it highlights discriminative features for multiple labels. Recently, multi-population genetic algorithms have gained significant attention in feature selection studies, because communication among multiple populations gives them a stronger search capability than traditional single-population genetic algorithms. However, conventional methods employ a simple communication process without adapting it to the multilabel feature selection problem, which results in poor-quality final solutions. In this paper, we propose a new multi-population genetic algorithm based on a novel communication process that is specialized for the multilabel feature selection problem. Our experimental results on 17 multilabel datasets demonstrate that the proposed method is superior to other multi-population-based feature selection methods.

Keywords: communication; evolutionary algorithm; multi-population genetic algorithm; multilabel feature selection

Year: 2020; PMID: 33286647; PMCID: PMC7517480; DOI: 10.3390/e22080876

Source DB: PubMed; Journal: Entropy (Basel); ISSN: 1099-4300; Impact factor: 2.524

1. Introduction

Multilabel feature selection (MLFS) involves the identification of important features that depend on a given set of labels. It is often used as an effective preprocessing step for complicated learning processes, because noisy features in relationships between multiple labels can be eliminated from subsequent training, resulting in improved multilabel classification performance [1,2]. Given an original feature set $F$, MLFS identifies a feature subset $S \subseteq F$ composed of features that are dependent on the label set $L$.

Conventional studies on MLFS have indicated that population-based evolutionary algorithms are promising, owing to their global search capability [3,4]. However, conventional genetic algorithms that are based on a single population suffer from premature convergence of the population, resulting in locally optimal solutions [5]. Multi-population genetic algorithms (MPGAs) have recently gained significant attention as a means of circumventing this issue, because they enable one sub-population to avoid premature convergence by referencing individuals or solutions from other sub-populations [6,7,8]. With regard to the feature selection problem, the communication process can improve the search capability of the sub-populations, because they acquire hints regarding important features by referencing the best individuals from other sub-populations [9].

To the best of our knowledge, most studies have used a traditional communication process to solve the MLFS problem, even though it is intended for the single-label feature selection problem [4,5]. A novel communication process should be designed to maximize the benefit of using the MPGA for solving the MLFS problem. In this paper, we propose a new MPGA that specializes in solving the MLFS problem by enhancing the communication process. Specifically, an individual to be referenced is chosen from other sub-populations based on the concept of label complementarity, that is, from the viewpoint of the discriminating power corresponding to each label; the chosen individual is then used in our improved update process. Our primary contributions are as follows. First, we propose an MPGA that specializes in solving the MLFS problem by introducing a novel communication process and improving the update process. Second, we introduce a new concept of label complementarity, derived from the fact that feature subsets with a high discriminating power for different label subsets can complement each other.

2. Related Work

Recent MLFS methods can be broadly classified into filter-based and wrapper-based methods. The filter-based methods assess the importance of features through their own measure based on feature and label distributions; thereafter, the top-n features with the highest scores are selected. Li et al. [10] proposed a granular MLFS method that attempts to select a more compact feature subset using information granules of the labels instead of the entire label set. Kashef and Nezamabadi-pour [11] proposed a Pareto dominance-based multilabel feature filter for online feature selection, a setting in which features are added sequentially. Gonzalez-Lopez et al. [12,13] proposed distributed models on Apache Spark that measure the quality of each feature based on mutual information. Seo et al. [14] proposed a generalized information-theoretic criterion for MLFS; they introduced an entropy approximation generalized to a cardinality that is chosen by the user based on the trade-off between approximation precision and computational cost. However, the classification performance of these methods is limited, because they work independently of the subsequent learning algorithm.

In contrast, wrapper-based methods evaluate the superiority of candidate feature subsets by using a specific learning algorithm, such as a multilabel naive Bayes classifier [15]. They generally outperform the filter-based methods in terms of classification accuracy [16]. Among the wrapper-based methods, population-based evolutionary search methods are frequently used for feature selection, owing to their stochastic global search capability [17]. Lu et al. [18] proposed a new functional constriction factor to avoid premature convergence in traditional particle swarm optimization. Mafarja and Mirjalili [19] proposed binary variants of a whale optimization algorithm and applied them to feature selection. Nakisa et al. [20] used five population-based methods to determine the best subset of electroencephalogram features. Dong et al. [21] improved a genetic algorithm using granular computing to select important features in high-dimensional data with a low sample size. Moreover, Lim and Kim [22] proposed an initialization method for evolutionary search-based MLFS algorithms by approximating conditional mutual information. Lee et al. [23] introduced a score function to deal with multilabel text datasets without problem transformation in a memetic search. However, these single population-based methods suffer from premature convergence of the population, resulting in limited search capability. Although methods such as a multi-niche crowding genetic algorithm [24] can be used to mitigate premature convergence, they are still sensitive to the initialization of the population.

To resolve these issues, recent single-label feature selection studies have considered multi-population-based methods that use multiple isolated sub-populations. Ma and Xia [25] proposed a tribe competition-based genetic algorithm that attempts to ensure the diversity of solutions by allowing the sub-populations to generate feature subsets with different numbers of features. Additionally, it explores the entire search space by competitively allocating computing resources to the sub-populations. Zhang et al. [26] proposed an enhanced multi-population niche genetic algorithm. To avoid local optima, it includes a process of exchanging the best individuals or solutions between the sub-populations during the search process. It also reduces the chances of similar individuals being selected as parents, based on the Hamming distance. Wang et al. [27] proposed a bacterial colony optimization method that considers a multi-dimensional population. Similar to the study conducted by Ma and Xia, the entire search space is divided based on the number of features selected, and the sub-populations explore different search spaces.

3. Label Complementary Multi-Population Genetic Algorithm for MLFS

3.1. Preliminary

Table 1 summarizes the terms used for elucidating the proposed method. Conventional MPGAs for single-label feature selection entail the following processes.

Table 1

Notation used for describing the proposed method.

Terms	Meanings
$D$	A multilabel dataset
$L$	A label set in $D$, $L = \{l_1, \ldots, l_{|L|}\}$
$F$	A feature set in $D$, $F = \{f_1, \ldots, f_{|F|}\}$
$S$	A final feature subset, $|S| \le n$
$t$	Number of generations
$m$	Number of sub-populations
$n$	Maximum number of selected features
$ind_i$	The i-th individual
$P_k$	The k-th sub-population, $P_k = \{ind_1, \ldots, ind_{|P_k|}\}$
$v_k$	Fitness values for the individuals of $P_k$
$A_k$	Label-specific accuracy matrix for the individuals of $P_k$, $A_k = (a_{ij}) \in \mathbb{R}^{|P_k| \times |L|}$
$ind_c$	A complementary individual
$c_i$	A degree of complementarity for $ind_i$

Step 1: Initialization of sub-populations. Each sub-population consists of individuals whose number is a pre-defined parameter, and each individual represents a feature subset. For example, in the genetic algorithm, each individual is represented as a binary vector called a chromosome, which comprises ones and zeros that represent selected and unselected features, respectively. In particle swarm optimization, each individual is represented as a probability vector; the components of a particle are regarded as the probabilities that the corresponding features will be selected. In most studies, the individuals are initialized randomly.

Step 2: Evaluation using a fitness function. The individuals of each sub-population are evaluated using a fitness function. Given the feature subset represented by each individual $ind_i$, a learning algorithm, such as a naive Bayes classifier, is trained, and the trained classifier is used to predict the label for each test pattern. Given the correct label and the predicted label, a fitness value can be computed using an evaluation metric, such as accuracy. Intuitively, a feature subset that results in better single-label prediction has a better fitness value.

Step 3: Communication among sub-populations. The sub-populations communicate with each other based on the best individuals in terms of the fitness value. In each sub-population, the worst individual (with the lowest fitness value) is replaced by the best individual of another sub-population.

Step 4: Sub-population update. The individuals generate offspring via genetic operators. First, each sub-population chooses the parents based on fitness values. For example, roulette-wheel selection uses each individual's share of the total fitness in its sub-population as the probability that the individual will be chosen as a parent. Subsequently, the offspring are generated via the crossover of parents or via mutation.

Whenever the individuals are modified in Step 4, they are evaluated in the same manner as in Step 2. During the search process, MPGAs repeat Step 3→Step 4→Step 2 until a stopping criterion is met. This process is presented as a flowchart on the left side of Figure 1.
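The conventional communication of Step 3 and the roulette-wheel selection of Step 4 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in any of the cited studies: the function names, the list-of-lists data layout, and the ring order in which sub-populations import from each other are assumptions made for this sketch.

```python
import random

def exchange_best(subpops, fitness):
    """Conventional communication (Step 3): in each sub-population, the worst
    individual is replaced by the best individual of another sub-population.
    ASSUMPTION: sub-population k imports from sub-population k + 1 (ring order)."""
    m = len(subpops)
    # Snapshot the best individuals first, so that every import refers to the
    # state of the sub-populations before any replacement took place.
    best = []
    for k in range(m):
        b = max(range(len(subpops[k])), key=lambda i: fitness[k][i])
        best.append((list(subpops[k][b]), fitness[k][b]))
    for k in range(m):
        w = min(range(len(subpops[k])), key=lambda i: fitness[k][i])
        donor_ind, donor_fit = best[(k + 1) % m]
        subpops[k][w] = list(donor_ind)
        fitness[k][w] = donor_fit

def roulette_wheel(fitness_k, rng=random):
    """Fitness-based parent selection (Step 4): return an index chosen with
    probability proportional to its share of the total fitness."""
    total = sum(fitness_k)
    if total <= 0:
        return rng.randrange(len(fitness_k))
    r = rng.uniform(0, total)
    acc = 0.0
    for i, v in enumerate(fitness_k):
        acc += v
        if v > 0 and r <= acc:
            return i
    return len(fitness_k) - 1

# Hypothetical example: two sub-populations of two binary chromosomes each.
subpops = [[[1, 0], [0, 1]], [[1, 1], [0, 0]]]
fitness = [[0.9, 0.2], [0.4, 0.7]]
exchange_best(subpops, fitness)
parent_index = roulette_wheel(fitness[0])
```

Taking the snapshot before replacing anything keeps the exchange symmetric: each sub-population receives the individual that was best before the communication step started.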

Figure 1

Schematic overview of the proposed method.

3.2. Motivation and Approach

We designed a novel MPGA that specializes in solving the MLFS problem. To extend the benefits of the communication process used in conventional studies to the MLFS problem, the following two issues should be considered. First, the discriminating power for multiple labels should be complemented through communication between the sub-populations. Second, the referenced individuals should be used to generate offspring that are superior to the previous generation.

Regarding the first issue, feature subsets with high discriminating power for different label subsets can complement each other. Therefore, each sub-population should refer to an individual with the highest discriminating power for a subset of labels that are relatively difficult to discriminate, resulting in improved search capability for MLFS. Regarding the second issue, existing fitness-based parent selection methods may not fully use the individuals referenced from other sub-populations, because in our method these individuals are selected regardless of fitness. This issue can be resolved by ensuring that one of the important individuals in each sub-population is involved when generating the offspring.

Figure 1 presents a schematic overview of the proposed MPGA for solving the MLFS problem. In particular, we modified the communication and update processes of the existing MPGA. First, with regard to sub-population communication, the conventional method communicates by exchanging the best individuals among sub-populations. Specifically, each sub-population imports the best individual of another sub-population, and its own worst individual is replaced by the imported one; the exchange in the opposite direction is performed in the same way. In the proposed label complementary communication for MLFS, the evaluation of the individuals is performed similarly to that in the conventional methods for single-label feature selection; however, the learning algorithm is replaced by a multilabel classification algorithm, such as a multilabel naive Bayes classifier (MLNB) [15], which uses a series of functions that each predict one label. Therefore, the discriminating power corresponding to each label can be obtained by reusing the learning algorithm that was trained to evaluate the fitness values of individuals; a detailed description of this process is presented in Section 3.3. As shown in Figure 1, the best individual of a sub-population may lack sufficient classification performance with regard to a particular label. To complement the discriminating power with regard to that label, the sub-population refers to the individual of another sub-population that best discriminates it.

In the sub-population updating step, the conventional method stochastically or deterministically selects the parents via fitness-based selection. Here, the individual that is imported from the other sub-population is selected and used with a high probability, because it had the highest fitness in its original sub-population. In contrast, in the proposed label complementary updating step, the complementary individual $ind_c$ referenced from the other sub-population is chosen regardless of fitness. Because the important individuals of a sub-population are the complementary individual $ind_c$ and its own best individual, one of them is selected as a parent. In other words, one of the important individuals is always involved in the generation of offspring. For diversity, the other parent is selected at random from the remaining individuals. Finally, the selected parents generate offspring using a genetic operator.

If an MPGA begins with promising initial sub-populations, then a good-quality feature subset can be found in less time than when it begins with randomly initialized sub-populations. In this study, we introduce a simple but effective initialization method. Given an original feature set $F$ and the number of sub-populations $m$, the spherical k-means algorithm partitions $F$ into $m$ clusters [28]; the clusters are composed of different features and do not overlap. Subsequently, each sub-population $P_k$ is initialized based on repetitive entropy-based stochastic sampling from the k-th cluster. Section 3.3 presents a detailed description of the sampling process.
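The entropy-based stochastic sampling used for initialization can be illustrated with the following Python sketch. It assumes that the features have already been partitioned into clusters (the spherical k-means step is not shown) and that features are discrete. The sampling weight used here, the plain Shannon entropy of each feature, is an assumption of this sketch and is not claimed to be the importance score of the original method; the function names are likewise hypothetical.

```python
import math
import random
from collections import Counter

def entropy(column):
    """Shannon entropy (in bits) of a discrete feature column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def init_individual(columns, cluster, n_select, rng=random):
    """Build one binary chromosome by sampling up to n_select features from
    `cluster` (a list of feature indices) without replacement.
    ASSUMPTION: a feature is drawn with probability proportional to its entropy."""
    scores = {f: entropy(columns[f]) for f in cluster}
    # Zero-entropy features carry no information and are never sampled.
    candidates = [f for f in cluster if scores[f] > 0]
    chosen = []
    while candidates and len(chosen) < n_select:
        weights = [scores[f] for f in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    chromosome = [0] * len(columns)
    for f in chosen:
        chromosome[f] = 1
    return chromosome

# Hypothetical data: three informative binary features and one constant feature.
columns = [
    [0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
individual = init_individual(columns, cluster=[0, 1, 3], n_select=2)
```

Repeating `init_individual` once per individual, with the cluster that belongs to the sub-population being initialized, yields sub-populations that start from different regions of the feature space.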

3.3. Algorithm

Algorithm 1 presents the pseudocode of the proposed method. Each individual (chromosome) is represented by a binary string composed of ones and zeros, representing selected and unselected features, respectively. For simplicity, each sub-population is represented as a set of individuals, i.e., $P_k = \{ind_1, \ldots, ind_{|P_k|}\}$. Additionally, all of the sub-populations have the same number of individuals. In the initialization step (line 4), the individuals of each sub-population are initialized by Algorithm 2 and then evaluated to obtain their fitness values (line 6). In this study, the MLNB is used as the learning algorithm. Given the trained learning algorithm, the fitness values are computed according to the multilabel evaluation metrics detailed in Section 4.1. To evaluate the discriminating power corresponding to each label, our algorithm uses the accuracy metric used in the fitness evaluation of single-label feature selection methods. For each individual $ind_i$ that belongs to $P_k$, the label-specific accuracy vector $(a_{i1}, \ldots, a_{i|L|})$ is computed by reusing the already trained learning algorithm; here, $a_{ij}$ is the accuracy corresponding to the j-th label predicted using $ind_i$. Consequently, the label-specific accuracy matrix $A_k$ is computed across all individuals of $P_k$ (line 7).

Algorithm 1

Pseudocode of the proposed method.

 1: Input: $D$, $m$; ▹ the multilabel dataset $D$, the number of sub-populations $m$
 2: Output: $S$; ▹ the final feature subset $S$
 3: …;
 4: $P_1, \ldots, P_m$ ← initialization; ▹ use Algorithm 2
 5: for each sub-population $P_k$ do
 6:     $v_k$ ← evaluate $P_k$ using $D$; ▹ compute fitness values via a fitness function
 7:     compute the label-specific accuracy matrix $A_k$ for individuals of $P_k$; ▹ reuse the fitness function
 8: end for
 9: while (not termination-condition) do
10:     for each sub-population $P_k$ do
11:         $ind_c$ ← communication; ▹ use Algorithm 3
12:         $P_k$ ← update; ▹ use Algorithm 4
13:         $v_k$ ← evaluate $P_k$ using $D$;
14:         compute the label-specific accuracy matrix $A_k$ for individuals of $P_k$;
15:         …;
16:     end for
17: end while
18: $S$ ← the best feature subset so far;

Algorithm 2

Initialization.

 1: Input: $m$; ▹ the number of sub-populations $m$
 2: Output: $P_1, \ldots, P_m$; ▹ the initial sub-populations
 3: for each feature $f \in F$ do ▹ the original feature set $F$
 4:     if $H(f) = 0$ then
 5:         $F \leftarrow F \setminus \{f\}$;
 6:     end if
 7: end for
 8: partition $F$ into $m$ clusters; ▹ use the spherical k-means algorithm
 9: for $k = 1$ to $m$ do
10:     for each individual $ind_i \in P_k$ do
11:         initialize $ind_i$ by selecting $n$ features via stochastic sampling; ▹ use Equation (1)
12:     end for
13: end for

After the initialization process, the sub-populations complement each other via the proposed label complementary communication (line 11), i.e., Algorithm 3. Specifically, each sub-population identifies, among the other sub-populations, a complementary individual $ind_c$ that can complement it. Next, our algorithm updates the sub-populations using $ind_c$ via Algorithm 4. All of the sub-populations repeat these processes until the termination condition is met. We use the number of fitness function calls (FFCs) as the termination criterion, and the algorithm conducts the search until the available FFCs are exhausted. Finally, Algorithm 1 outputs the best feature subset.

Algorithm 2 presents the initialization procedure for each sub-population. With regard to lines 3–7, if the entropy of a feature is zero, the feature is removed first, because it does not carry any information. The clusters of features are generated by the spherical k-means algorithm (line 8), and each cluster is used to initialize one sub-population (lines 9–13). For each feature in a cluster, an importance score is calculated according to Equation (1), which is based on the entropy $H(x)$ of a variable $x$. Finally, each individual of $P_k$ is initialized via stochastic sampling based on the importance scores (line 11).

Algorithm 3

Label complementary communication.

 1: Input: $P$ and the label-specific accuracy matrix; ▹ the sub-population $P$, the label-specific accuracy matrix
 2: Output: $ind_c$; ▹ the complementary individual
 3: find the index of the best individual in $P$; ▹ the best individual
 4: find the index set of labels with the highest error for the best individual, based on the label-specific accuracy matrix;
 5: for each individual $ind_i$ in the other sub-populations do ▹ the other sub-populations
 6:     $c_i$ ← sum of the accuracies $a_{ij}$ over the labels $j$ in the index set; ▹ the degree of complementarity $c$
 7: end for
 8: $ind_c$ ← the individual with the highest $c$;

Algorithm 3 presents the procedure that realizes the label complementary communication between the sub-populations for multiple labels. For simplicity, the input sub-population is represented as $P$, and the remaining sub-populations are referred to as the other sub-populations. With regard to lines 3–4, our algorithm finds the index set of labels for which the best individual in $P$ yields the lowest accuracies, where the size of the index set is set to half the size of the entire label set $L$. To find the complementary individual $ind_c$ among the other sub-populations, our algorithm computes the degree of complementarity $c_i$ for each individual $ind_i$ in them, where $c_i$ is regarded as the discriminating power with regard to the labels in the index set. Specifically, $c_i$ is calculated by adding the accuracies corresponding to the labels in the index set (line 6). In contrast with the simple communication of exchanging the best individuals, the individual referenced from the other sub-populations can complement the discriminating power of the sub-population $P$ for the entire label set $L$, which results in an improved search capability for MLFS.

Algorithm 4

Label complementary update.

 1: Input: $P_k$, $ind_c$; ▹ the sub-population, the complementary individual
 2: Output: the new sub-population;
 3: generate new offspring by crossover from $ind_c$ and the best individual of $P_k$;
 4: add the offspring to the new sub-population;
 5: while the new sub-population has fewer individuals than $P_k$ do ▹ keep the number of individuals in $P_k$
 6:     select an individual at random among $ind_c$ and the best individual of $P_k$;
 7:     select an individual at random among the remaining individuals of $P_k$;
 8:     generate new offspring by crossover from the two selected individuals;
 9:     add the offspring to the new sub-population;
10: end while
11: run a mutation on overlapping individuals;

Algorithm 4 presents the detailed procedure for generating new offspring. Because the complementary individual $ind_c$ and the best individual in $P_k$ are considered to be important, our algorithm first generates offspring from them once (lines 3–4). With regard to lines 6–7, our algorithm conducts parent selection to generate further offspring. The first parent is randomly selected between $ind_c$ and the best individual; consequently, the important individuals are always involved in the generation of offspring. Furthermore, to generate diverse offspring, the other parent is selected from the remaining individuals. As shown in line 8, the selected parent pair generates offspring via a restrictive crossover method that is frequently used to control the number of selected features in feature selection [29]. Compared with updating based on fitness-based parent selection, our algorithm can generate offspring that are superior to the previous generation by actively using the complementary individual $ind_c$. The generated offspring are sequentially added to the new sub-population (line 9). To maintain the number of individuals in each sub-population, the generation process is repeated until the offspring are as numerous as the individuals in $P_k$. Furthermore, as described in line 11, a restrictive mutation is conducted on overlapping individuals.

Finally, we analyzed the time complexity of the proposed method. Most of the time is spent evaluating feature subsets, because the learning algorithm must be trained through complicated sub-procedures for multiple labels [30]. Because the numbers of training patterns and given labels are regarded as constant during the evaluation process, the computation time required to evaluate a feature subset $S$ is determined by the number of selected features $|S| \le n$; that is, it is proportional to $n$ times the assumed basic time associated with the evaluation of a single feature [3]. Because all $m$ sub-populations have the same number of individuals $|P_k|$, the total number of individuals is $m|P_k|$; with a maximum of $t$ iterations, the feature subset evaluation is conducted $m|P_k|t$ times. Thus, the time complexity of the proposed method is proportional to $m|P_k| \cdot t \cdot n$ times the basic time for a single feature.
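The label complementary communication and the parent selection of the update step can be sketched in Python as follows. The sketch is illustrative and rests on stated assumptions: the index set size is taken as half the number of labels rounded down (with a minimum of one), the function names and data layout are hypothetical, the accuracy values are made up for illustration, and the restrictive crossover of [29] is not reproduced.

```python
import random

def complementary_individual(fitness_p, acc_p, others, acc_others):
    """Pick, among the individuals of the other sub-populations, the one with
    the highest summed accuracy on the labels that the best individual of P
    discriminates worst. acc_p[i][j] and acc_others[i][j] hold the accuracy of
    individual i on label j."""
    best = max(range(len(fitness_p)), key=lambda i: fitness_p[i])
    n_labels = len(acc_p[best])
    # ASSUMPTION: half of the label set, rounded down, but at least one label.
    size = max(1, n_labels // 2)
    weak = sorted(range(n_labels), key=lambda j: acc_p[best][j])[:size]
    degrees = [sum(row[j] for j in weak) for row in acc_others]
    c = max(range(len(others)), key=lambda i: degrees[i])
    return others[c], degrees[c], weak

def select_parents(best_ind, ind_c, remaining, rng=random):
    """First parent: the best or the complementary individual, at random.
    Second parent: any of the remaining individuals, for diversity."""
    return rng.choice([best_ind, ind_c]), rng.choice(remaining)

# Hypothetical values: sub-population P with three individuals and three labels.
fitness_p = [0.8, 0.6, 0.5]
acc_p = [[0.9, 0.9, 0.3], [0.7, 0.6, 0.5], [0.6, 0.5, 0.4]]
others = [[1, 0, 0, 0, 1], [0, 1, 1, 0, 0], [0, 0, 1, 1, 0]]
acc_others = [[0.9, 0.8, 0.4], [0.5, 0.6, 0.95], [0.7, 0.7, 0.6]]

ind_c, degree, weak_labels = complementary_individual(
    fitness_p, acc_p, others, acc_others)
first, second = select_parents([1, 1, 0, 0, 0], ind_c, [[0, 0, 0, 1, 1]])
```

In this made-up example, the first individual of the other sub-population has the best overall accuracies but is weak on the third label, so the second individual is referenced instead, which is the behavior that distinguishes the complementary communication from exchanging the fittest individuals.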

3.4. Algorithm: Example

We apply the proposed method to the multilabel toy dataset provided in Table 2 as a representative example. In the table, each text pattern is relevant to multiple labels, where the labels are represented as one if relevant and zero otherwise. Specifically, the first pattern includes the terms “Music”, “The”, “Funny”, and “Lovely”, but not “Boring.” This pattern can be assigned to the labels “Comedy” and “Disney” simultaneously. For simplicity, we set both the number of sub-populations and the number of selected features to two. Additionally, the number of individuals in each sub-population was set to three. To focus on the communication process, the two sub-populations were initialized at random in the initialization step.

Table 2

Multilabel toy dataset.

	Features					Labels
Pattern	f1	f2	f3	f4	f5	l1	l2	l3
	Boring	Music	The	Funny	Lovely	Comedy	Documentary	Disney
w1	0	1	1	1	1	1	0	1
w2	1	0	1	0	1	0	1	1
w3	1	1	1	0	1	0	1	1
w4	0	1	0	1	0	1	0	0
w5	0	0	1	1	0	1	0	0
w6	1	0	0	0	0	0	1	0
w7	0	0	1	1	1	1	0	1

MLNB and multilabel accuracy are used to evaluate each individual. A detailed description of the evaluation metrics, including multilabel accuracy, is given in Section 4.1. The fitness values for each sub-population are calculated as the average value obtained from 10 repeated experiments, and the label-specific accuracy matrix of each sub-population is calculated using the MLNB that was already trained for fitness evaluation. Important individuals in the sub-population are indicated using bold font.

In the communication process for the first sub-population $P_1$, our algorithm determines the index set of labels for which the lowest accuracies are yielded by the best individual of $P_1$, that is, the individual with the highest fitness in $P_1$. This best individual has its lowest accuracy, 30%, for a single label, and the index set contains only that label, as its size is one when $|L| = 3$. To complement $P_1$, our algorithm finds the complementary individual $ind_c$ from the second sub-population $P_2$: the degree of complementarity $c_i$ is calculated for each individual of $P_2$ from the accuracy matrix of $P_2$ and the index set, and the individual with the highest degree becomes the complementary individual for $P_1$. Conventional methods instead import the best individual of $P_2$, which in our example exhibits a low accuracy of 40% for the label in question. Our method, however, refers to the individual of $P_2$ that has the highest accuracy with regard to that label. This indicates that our method can further complement the discriminating power of $P_1$ for multiple labels and increase the likelihood of avoiding local optima, resulting in improved multilabel accuracy. The process is similar for $P_2$.

In the update process, $P_1$ first selects its best individual and $ind_c$ as the parental pair once. Next, either the best individual or $ind_c$ is selected as one parent, and one of the two remaining individuals is selected at random as the other parent. The selected parent pair generates offspring via the genetic operators used in conventional methods. For the selected parent pair, our algorithm generates the offspring 00110 and 10001 via the restrictive crossover. As a result, the feature subset represented by the offspring 10001 achieved a multilabel accuracy of 91%. This search process is repeated until the stopping criterion is met.

4. Experimental Results

4.1. Datasets and Evaluation

We conducted experiments using 17 multilabel datasets corresponding to various domains; these datasets can be obtained from http://mulan.sourceforge.net/datasets-mlc.html [31]. Specifically, the Emotions dataset [32] consists of 8 rhythmic features and 64 timbre features. The Enron dataset [33] was sampled from a large email message set, the Enron corpus. The Genbase and Yeast datasets [34,35] contain information regarding the functions of biological proteins and genes. The Medical dataset [36] is a subset of a large corpus that is associated with suicide letters in clinical free text. The Scene dataset [37] has indexing information on still images containing multiple objects. The remaining 11 datasets were obtained from the Yahoo dataset collection [38], and each is composed of more than 10,000 features.

Table 3 lists standard statistics for the 17 datasets used in our experiments. It includes the number of patterns $|W|$, the number of features $|F|$, the type of features, and the number of labels $|L|$. If the feature type was numeric, we discretized the features using label-attribute interdependence maximization, which is a discretization method that is specialized for multilabel data [39]. The label cardinality (Card.) represents the average number of labels in each pattern, and the label density (Den.) is the label cardinality divided by the total number of labels. Further, Distinct. indicates the number of unique label subsets in $L$, and Domain represents the application that is related to each dataset.

Table 3

Standard statistics of multilabel datasets.

Dataset	|W|	|F|	Type	|L|	Card.	Den.	Distinct.	Domain
Arts	7484	23,146	Numeric	26	1.654	0.064	599	Text
Business	11,214	21,924	Numeric	30	1.599	0.053	233	Text
Computers	12,444	34,096	Numeric	33	1.507	0.046	428	Text
Education	12,030	27,534	Numeric	33	1.463	0.044	511	Text
Emotions	593	72	Numeric	6	1.869	0.311	27	Music
Enron	1702	1001	Nominal	53	3.378	0.064	753	Text
Entertainment	12,730	32,001	Numeric	21	1.414	0.067	337	Text
Genbase	662	1185	Nominal	27	1.252	0.046	32	Biology
Health	9205	30,605	Numeric	32	1.644	0.051	335	Text
Medical	978	1449	Nominal	45	1.245	0.028	94	Text
Recreation	12,828	30,324	Numeric	22	1.429	0.065	530	Text
Reference	8027	39,679	Numeric	33	1.174	0.036	275	Text
Scene	2407	294	Numeric	6	1.074	0.179	15	Image
Science	6428	37,187	Numeric	40	1.450	0.036	457	Text
Social	12,111	52,350	Numeric	29	1.279	0.033	361	Text
Society	14,512	31,802	Numeric	27	1.670	0.062	1054	Text
Yeast	2417	103	Numeric	14	4.237	0.303	198	Biology
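The Card., Den., and Distinct. columns of Table 3 follow directly from a binary label matrix. As a small illustration, the following Python sketch computes them for the label part of the toy dataset in Table 2; the function name is hypothetical and the sketch is not taken from the original study.

```python
def label_statistics(label_rows):
    """Return (cardinality, density, distinct) for a binary label matrix,
    where label_rows[i][j] is 1 if pattern i is relevant to label j."""
    n_patterns = len(label_rows)
    n_labels = len(label_rows[0])
    # Cardinality: average number of relevant labels per pattern.
    cardinality = sum(sum(row) for row in label_rows) / n_patterns
    # Density: cardinality divided by the total number of labels.
    density = cardinality / n_labels
    # Distinct: number of unique label subsets that occur in the data.
    distinct = len({tuple(row) for row in label_rows})
    return cardinality, density, distinct

# Label columns (Comedy, Documentary, Disney) of patterns w1..w7 in Table 2.
toy_labels = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
]
cardinality, density, distinct = label_statistics(toy_labels)
```

For the toy dataset this gives a cardinality of 11/7 (about 1.571), a density of 11/21 (about 0.524), and four distinct label subsets. The same relation can be checked against Table 3: for Emotions, 1.869 / 6 is approximately 0.311.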

We compared the proposed method with three state-of-the-art multi-population-based methods that have exhibited promising performance for solving the feature selection problem: TCbGA [25], EMPNGA [26], and BCO-MDP [27]. We set the parameters for each method to the values used in the corresponding original study. For fairness, we set the maximum number of allowable FFCs and the maximum number of selected features to 300 and 50, respectively. The total population size was set to 50. The MLNB and a holdout cross-validation method were used to evaluate the quality of the feature subsets obtained by each method; 80% and 20% of each dataset were used as the training and test sets, respectively. We repeated each experiment 10 times and used the average value of the results. In the proposed method, we set the number of sub-populations to five; thus, each sub-population size was 10.

We used four evaluation metrics to evaluate the quality of the feature subsets: Hamming loss, one-error, multilabel accuracy, and subset accuracy [40,41,42]. Let $T = \{(w_i, \lambda_i) \mid 1 \le i \le |T|\}$ be a given test set, where $\lambda_i \subseteq L$ is a correct label subset that is associated with a pattern $w_i$. Given a test pattern $w_i$, a multilabel classifier, such as MLNB, estimates a predicted label set $Y_i \subseteq L$. Specifically, a series of functions $\psi_1, \ldots, \psi_{|L|}$ is induced from the training patterns. Next, each function $\psi_j$ determines the class membership of $l_j$ with respect to each pattern, i.e., $Y_i = \{l_j \mid \psi_j(w_i) > \theta, 1 \le j \le |L|\}$, where $\theta$ is a predetermined threshold. The four metrics can be computed given $\lambda_i$ and $Y_i$ for the test patterns. The Hamming loss is defined as

$$\text{Hamming loss} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{1}{|L|} \left| \lambda_i \,\triangle\, Y_i \right|,$$

where $\triangle$ denotes the symmetric difference between two sets. Furthermore, one-error is defined as

$$\text{one-error} = \frac{1}{|T|} \sum_{i=1}^{|T|} \left[ \arg\max_{l_j \in L} \psi_j(w_i) \notin \lambda_i \right],$$

where $[\cdot]$ returns one if the proposition stated in the brackets is true and zero otherwise. Multilabel accuracy is defined as

$$\text{multilabel accuracy} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{|\lambda_i \cap Y_i|}{|\lambda_i \cup Y_i|}.$$

It computes the Jaccard coefficient between two sets. Finally, subset accuracy is defined as

$$\text{subset accuracy} = \frac{1}{|T|} \sum_{i=1}^{|T|} \left[ \lambda_i = Y_i \right].$$

It determines whether two sets are exactly identical. A superior feature subset will exhibit higher values of the multilabel and subset accuracies and lower values of the Hamming loss and one-error metrics.

We conducted additional statistical tests to verify the statistical significance of our results. First, we conducted a paired t-test [43] at the 95% confidence level to compare the proposed method with each of the other MLFS methods on each dataset; because there are three comparison algorithms, the paired t-test is performed three times per dataset. Here, each of the three null hypotheses (i.e., that the two methods have equal performance) can either be rejected or accepted. We also performed the Bonferroni–Dunn test to compare the average ranks of the proposed and other methods [44]. If the difference between the average rank of one comparison method and that of the proposed method is within the critical difference (CD), its performance is considered to be similar to that of the proposed method. In our experiments, we set the significance level to $\alpha = 0.05$; thus, the CD can be computed as 1.0601 [45].
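The four metrics and the critical difference can be written compactly in Python, as in the minimal sketch below. Label sets are represented as Python sets of label indices; the function names, the threshold value of 0.5 in the example, and the convention that two empty label sets count as a perfect Jaccard match are assumptions of this sketch. The critical value 2.394 is the tabulated two-tailed Bonferroni–Dunn value for four methods at a significance level of 0.05, and it reproduces the reported CD for 17 datasets.

```python
import math

def predict_sets(scores, theta):
    """Threshold per-label scores into predicted label sets."""
    return [{j for j, s in enumerate(row) if s > theta} for row in scores]

def hamming_loss(true_sets, pred_sets, n_labels):
    diff = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return diff / (len(true_sets) * n_labels)

def one_error(true_sets, scores):
    """Fraction of patterns whose top-scored label is not a correct label."""
    errors = 0
    for t, row in zip(true_sets, scores):
        top = max(range(len(row)), key=lambda j: row[j])
        errors += top not in t
    return errors / len(true_sets)

def multilabel_accuracy(true_sets, pred_sets):
    total = 0.0
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # ASSUMPTION: two empty sets are treated as a perfect match.
        total += len(t & p) / len(union) if union else 1.0
    return total / len(true_sets)

def subset_accuracy(true_sets, pred_sets):
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)

def critical_difference(q_alpha, n_methods, n_datasets):
    """Bonferroni-Dunn critical difference of average ranks."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))

# Hypothetical test set with three patterns and three labels.
true_sets = [{0, 2}, {1, 2}, {0}]
scores = [[0.9, 0.2, 0.6], [0.7, 0.6, 0.8], [0.4, 0.1, 0.3]]
pred_sets = predict_sets(scores, theta=0.5)

hl = hamming_loss(true_sets, pred_sets, n_labels=3)
oe = one_error(true_sets, scores)
ma = multilabel_accuracy(true_sets, pred_sets)
sa = subset_accuracy(true_sets, pred_sets)
cd = critical_difference(2.394, n_methods=4, n_datasets=17)
```

Note that one-error depends on the raw scores rather than on the thresholded label sets, which is why it can be zero even when the predicted sets are imperfect, as in this example.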

4.2. Comparison Results

Table 4, Table 5, Table 6 and Table 7 present the experimental results of the proposed method and compare them with those of the other methods on 17 multilabel datasets. The results are reported as average performances with the corresponding standard deviations; the best average value on each dataset is indicated by bold font. In addition, for each dataset, the paired t-test was conducted at the 95% confidence level. In Table 4, Table 5, Table 6 and Table 7, ▼ (Δ) indicates that the corresponding method is significantly worse (better) than the proposed method based on the paired t-test. Table 4 shows that the proposed method is statistically superior or similar to TCbGA on 88% of the datasets and to EMPNGA and BCO-MDP on all datasets in terms of the Hamming loss. Table 5 shows that the proposed method is statistically superior or similar to the other methods on 94% of the datasets in terms of the one-error. In particular, Table 6 and Table 7 show that the proposed method is statistically superior or similar to the other methods on all datasets in terms of the multilabel accuracy and the subset accuracy.

Table 4

Comparison results of four methods in terms of Hamming loss (↓) (▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on the paired t-test at the 95% confidence level).

Dataset	Proposed	TCbGA	EMPNGA	BCO-MDP
Arts	0.0629 ± 0.001	0.0635 ± 0.001	0.0642 ± 0.001 ▼	0.0638 ± 0.001 ▼
Business	0.0297 ± 0.001	0.0289 ± 0.001 Δ	0.0297 ± 0.001	0.0293 ± 0.001
Computers	0.0428 ± 0.001	0.0432 ± 0.001	0.0435 ± 0.001	0.0435 ± 0.001
Education	0.0443 ± 0.001	0.0444 ± 0.000	0.0449 ± 0.001	0.0447 ± 0.001
Emotions	0.2336 ± 0.022	0.2370 ± 0.013	0.2376 ± 0.023	0.2366 ± 0.032
Enron	0.0663 ± 0.006	0.0628 ± 0.004	0.0892 ± 0.008 ▼	0.0840 ± 0.007 ▼
Entertainment	0.0641 ± 0.001	0.0650 ± 0.002	0.0646 ± 0.002	0.0650 ± 0.002
Genbase	0.0074 ± 0.003	0.0338 ± 0.006 ▼	0.0315 ± 0.004 ▼	0.0277 ± 0.006 ▼
Health	0.0465 ± 0.003	0.0498 ± 0.001 ▼	0.0490 ± 0.001 ▼	0.0489 ± 0.002
Medical	0.0138 ± 0.002	0.0206 ± 0.003 ▼	0.0186 ± 0.001 ▼	0.0181 ± 0.003 ▼
Recreation	0.0626 ± 0.001	0.0638 ± 0.001 ▼	0.0638 ± 0.001 ▼	0.0641 ± 0.002 ▼
Reference	0.0342 ± 0.002	0.0359 ± 0.000 ▼	0.0358 ± 0.001	0.0358 ± 0.001 ▼
Scene	0.1341 ± 0.007	0.1372 ± 0.007	0.1416 ± 0.006 ▼	0.1396 ± 0.012
Science	0.0367 ± 0.001	0.0362 ± 0.001 Δ	0.0376 ± 0.001	0.0368 ± 0.001
Social	0.0297 ± 0.002	0.0323 ± 0.001 ▼	0.0309 ± 0.001 ▼	0.0315 ± 0.002 ▼
Society	0.0586 ± 0.001	0.0598 ± 0.001 ▼	0.0595 ± 0.001 ▼	0.0590 ± 0.001
Yeast	0.2208 ± 0.009	0.2233 ± 0.007	0.2253 ± 0.005	0.2241 ± 0.006
Avg. Rank	1.24	2.71	3.35	2.71
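The 88% figure quoted for TCbGA in terms of the Hamming loss can be checked by tallying the markers in the TCbGA column of Table 4. The short sketch below transcribes those markers by hand; the dictionary and function names are illustrative only.

```python
# Markers from the TCbGA column of Table 4 ('' means no significant difference).
TCBGA_MARKERS = {
    "Arts": "", "Business": "Δ", "Computers": "", "Education": "",
    "Emotions": "", "Enron": "", "Entertainment": "", "Genbase": "▼",
    "Health": "▼", "Medical": "▼", "Recreation": "▼", "Reference": "▼",
    "Scene": "", "Science": "Δ", "Social": "▼", "Society": "▼", "Yeast": "",
}


def share_superior_or_similar(markers):
    """Fraction of datasets on which the competitor is NOT significantly
    better than the proposed method (i.e., the marker is not 'Δ')."""
    not_worse = sum(1 for m in markers.values() if m != "Δ")
    return not_worse / len(markers)


share = share_superior_or_similar(TCBGA_MARKERS)
print(f"{share:.1%}")  # 15 of 17 datasets
```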

Table 5

Comparison results of four methods in terms of one-error (↓) (▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on the paired t-test at the 95% confidence level).

Dataset	Proposed	TCbGA	EMPNGA	BCO-MDP
Arts	0.7354 ± 0.140	0.7717 ± 0.120 ▼	0.7684 ± 0.122 ▼	0.7640 ± 0.126 ▼
Business	0.3930 ± 0.417	0.3935 ± 0.418	0.3933 ± 0.418	0.3935 ± 0.418
Computers	0.4530 ± 0.011	0.4616 ± 0.008 ▼	0.4566 ± 0.009	0.4626 ± 0.008 ▼
Education	0.6520 ± 0.020	0.6756 ± 0.011 ▼	0.6777 ± 0.011 ▼	0.6776 ± 0.014 ▼
Emotions	0.2992 ± 0.029	0.3085 ± 0.054	0.2915 ± 0.060	0.2992 ± 0.068
Enron	0.5797 ± 0.327	0.5982 ± 0.318	0.6074 ± 0.317 ▼	0.5976 ± 0.316 ▼
Entertainment	0.6085 ± 0.023	0.6710 ± 0.023 ▼	0.6339 ± 0.014 ▼	0.6483 ± 0.023 ▼
Genbase	0.7197 ± 0.441	0.8652 ± 0.207	0.8235 ± 0.272	0.8045 ± 0.303
Health	0.7659 ± 0.299	0.7935 ± 0.266 ▼	0.7900 ± 0.270 ▼	0.7885 ± 0.272 ▼
Medical	0.7713 ± 0.293	0.8395 ± 0.206	0.8138 ± 0.236	0.8287 ± 0.216
Recreation	0.7062 ± 0.035	0.7531 ± 0.010 ▼	0.7533 ± 0.014 ▼	0.7482 ± 0.021 ▼
Reference	0.7130 ± 0.247	0.7171 ± 0.243	0.7126 ± 0.247	0.7164 ± 0.244
Scene	0.3168 ± 0.029	0.2927 ± 0.027 Δ	0.2844 ± 0.026 Δ	0.2871 ± 0.023 Δ
Science	0.7097 ± 0.019	0.7342 ± 0.019 ▼	0.7265 ± 0.018 ▼	0.7445 ± 0.013 ▼
Social	0.4872 ± 0.183	0.5637 ± 0.161 ▼	0.5441 ± 0.164 ▼	0.5677 ± 0.156 ▼
Society	0.4880 ± 0.019	0.4963 ± 0.013	0.4859 ± 0.019	0.4901 ± 0.014
Yeast	0.2369 ± 0.023	0.2431 ± 0.019	0.2652 ± 0.020 ▼	0.2513 ± 0.019 ▼
Avg. Rank	1.35	3.41	2.41	2.76

Table 6

Comparison results of four methods in terms of multilabel accuracy (↑) (▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on the paired t-test at the 95% confidence level).

Dataset	Proposed	TCbGA	EMPNGA	BCO-MDP
Arts	0.0924 ± 0.021	0.0330 ± 0.007 ▼	0.0464 ± 0.009 ▼	0.0518 ± 0.016 ▼
Business	0.6772 ± 0.009	0.6784 ± 0.008	0.6767 ± 0.011	0.6760 ± 0.010
Computers	0.4155 ± 0.008	0.4148 ± 0.007	0.4159 ± 0.010	0.4147 ± 0.010
Education	0.0748 ± 0.026	0.0291 ± 0.007 ▼	0.0367 ± 0.015 ▼	0.0410 ± 0.022 ▼
Emotions	0.5323 ± 0.036	0.5267 ± 0.035	0.5202 ± 0.031	0.5329 ± 0.031
Enron	0.3445 ± 0.021	0.3315 ± 0.019	0.3173 ± 0.019 ▼	0.3389 ± 0.034
Entertainment	0.1904 ± 0.051	0.0586 ± 0.022 ▼	0.1116 ± 0.016 ▼	0.1218 ± 0.046 ▼
Genbase	0.8907 ± 0.058	0.3789 ± 0.130 ▼	0.4238 ± 0.088 ▼	0.5471 ± 0.157 ▼
Health	0.4277 ± 0.027	0.4074 ± 0.016	0.4120 ± 0.019	0.4026 ± 0.015 ▼
Medical	0.5772 ± 0.089	0.3545 ± 0.084 ▼	0.3628 ± 0.055 ▼	0.4498 ± 0.117 ▼
Recreation	0.1001 ± 0.026	0.0477 ± 0.012 ▼	0.0574 ± 0.007 ▼	0.0573 ± 0.017 ▼
Reference	0.4048 ± 0.015	0.3568 ± 0.125	0.4066 ± 0.012	0.4005 ± 0.011
Scene	0.5730 ± 0.038	0.5663 ± 0.021	0.5705 ± 0.016	0.5712 ± 0.034
Science	0.0744 ± 0.041	0.0256 ± 0.008 ▼	0.0360 ± 0.011 ▼	0.0385 ± 0.011 ▼
Social	0.4935 ± 0.047	0.0720 ± 0.027 ▼	0.1907 ± 0.168 ▼	0.1187 ± 0.033 ▼
Society	0.2423 ± 0.135	0.1617 ± 0.162	0.2586 ± 0.165	0.2873 ± 0.126
Yeast	0.4468 ± 0.012	0.4435 ± 0.012	0.4448 ± 0.012	0.4418 ± 0.013
Avg. Rank	1.35	3.53	2.59	2.53

Table 7

Comparison results of four methods in terms of subset accuracy (↑) (▼/Δ indicates that the corresponding method is significantly worse/better than the proposed method based on the paired t-test at the 95% confidence level).

Dataset	Proposed	TCbGA	EMPNGA	BCO-MDP
Arts	0.0666 ± 0.015	0.0287 ± 0.009 ▼	0.0422 ± 0.010 ▼	0.0438 ± 0.016 ▼
Business	0.5326 ± 0.011	0.5326 ± 0.012	0.5322 ± 0.012	0.5334 ± 0.013
Computers	0.3386 ± 0.012	0.3365 ± 0.007	0.3379 ± 0.009	0.3318 ± 0.007 ▼
Education	0.0599 ± 0.019	0.0162 ± 0.006 ▼	0.0327 ± 0.013 ▼	0.0317 ± 0.010 ▼
Emotions	0.2534 ± 0.039	0.2593 ± 0.043	0.2508 ± 0.041	0.2525 ± 0.055
Enron	0.1076 ± 0.020	0.1168 ± 0.020	0.0418 ± 0.028 ▼	0.0947 ± 0.034
Entertainment	0.1709 ± 0.051	0.0791 ± 0.025 ▼	0.0903 ± 0.023 ▼	0.0862 ± 0.033 ▼
Genbase	0.8485 ± 0.041	0.2576 ± 0.092 ▼	0.4288 ± 0.070 ▼	0.5098 ± 0.123 ▼
Health	0.3386 ± 0.028	0.3160 ± 0.014 ▼	0.3293 ± 0.017	0.3129 ± 0.017 ▼
Medical	0.4636 ± 0.071	0.2600 ± 0.047 ▼	0.3472 ± 0.049 ▼	0.3138 ± 0.096 ▼
Recreation	0.0829 ± 0.021	0.0393 ± 0.016 ▼	0.0475 ± 0.011 ▼	0.0478 ± 0.021 ▼
Reference	0.3579 ± 0.014	0.3532 ± 0.009	0.3654 ± 0.013	0.3265 ± 0.112
Scene	0.4341 ± 0.025	0.4168 ± 0.033	0.3819 ± 0.027 ▼	0.4472 ± 0.028
Science	0.0602 ± 0.030	0.0258 ± 0.003 ▼	0.0351 ± 0.011 ▼	0.0311 ± 0.008 ▼
Social	0.4185 ± 0.051	0.0667 ± 0.036 ▼	0.2850 ± 0.183 ▼	0.0981 ± 0.041 ▼
Society	0.2222 ± 0.060	0.1187 ± 0.132	0.1926 ± 0.125	0.2257 ± 0.116
Yeast	0.1029 ± 0.014	0.0969 ± 0.014	0.1085 ± 0.006	0.0988 ± 0.016
Avg. Rank	1.41	3.29	2.59	2.65

Figure 2 illustrates the CD diagrams, showing the relative performance of the four methods. Here, the horizontal axis represents the average rank of each method, where the higher ranks are placed on the right side of each subfigure. In addition, the methods within the CD of the proposed method are connected to it by a bold red line, which means that the difference among them is not significant. Figure 2b indicates that the proposed method significantly outperformed TCbGA and BCO-MDP in terms of the one-error, whereas its difference from EMPNGA was within the CD. This result indicates that the simple communication of exchanging the best individuals in EMPNGA can also yield good results for the one-error, because the one-error is evaluated based only on the label predicted with the highest probability. In contrast, Figure 2a,c,d indicate that the proposed method significantly outperformed all other methods in terms of the Hamming loss, multilabel accuracy, and subset accuracy. These three metrics are evaluated based on the predicted label subsets; thus, the proposed method, which employs label complementary communication, can outperform the existing methods.

Figure 2

Bonferroni–Dunn test results of four comparison methods with four evaluation measures.
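The CD comparison described above can be reproduced from the average ranks in Tables 4 to 7. The following sketch assumes the usual Bonferroni–Dunn formula with a critical value of 2.394 for four methods (an assumption from the standard table, not stated in the text), and it uses the rounded average ranks as printed, so borderline cases should be read with care.

```python
import math

Q_ALPHA = 2.394   # assumed Bonferroni-Dunn critical value (4 methods, alpha = 0.05)
N_METHODS = 4
N_DATASETS = 17

# Average ranks copied from the last rows of Tables 4-7.
AVG_RANKS = {
    "Hamming loss": {"Proposed": 1.24, "TCbGA": 2.71, "EMPNGA": 3.35, "BCO-MDP": 2.71},
    "One-error": {"Proposed": 1.35, "TCbGA": 3.41, "EMPNGA": 2.41, "BCO-MDP": 2.76},
    "Multilabel accuracy": {"Proposed": 1.35, "TCbGA": 3.53, "EMPNGA": 2.59, "BCO-MDP": 2.53},
    "Subset accuracy": {"Proposed": 1.41, "TCbGA": 3.29, "EMPNGA": 2.59, "BCO-MDP": 2.65},
}


def critical_difference(q_alpha, k, n):
    """Critical difference for comparing k methods over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


def within_cd(ranks, cd, control="Proposed"):
    """Methods whose average rank differs from the control's by at most cd."""
    return [m for m, r in ranks.items()
            if m != control and r - ranks[control] <= cd]


cd = critical_difference(Q_ALPHA, N_METHODS, N_DATASETS)
similar = {metric: within_cd(ranks, cd) for metric, ranks in AVG_RANKS.items()}

print(f"CD = {cd:.4f}")
for metric, methods in similar.items():
    print(metric, "->", methods if methods else "none within CD")
```

Under these assumptions, only EMPNGA on the one-error falls within the CD of the proposed method, which matches the discussion of Figure 2b.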

4.3. Analysis

We conducted additional experiments on eight datasets using the MLNB to analyze in depth whether the proposed communication process is effective for solving the MLFS problem. To validate the effectiveness of label complementary communication in the proposed method, we designed Proposed-SC, which is equivalent to the proposed method, except that it does not include the proposed communication process, i.e., Algorithm 3. Specifically, Proposed-SC uses the simple communication method of exchanging the best individuals and roulette wheel selection as the fitness-based parent selection method. For improved readability, we refer to the proposed method described in Section 3 as Proposed-LCC. In addition, we designed Proposed-NC, which is equivalent to Proposed-SC, except that it does not conduct any communication process. Figure 3 shows the search capability of each sub-population during the search process. The vertical axis indicates the multilabel accuracy of the best individual in each method; herein, the baseline is the multilabel accuracy obtained by random prediction over 10 repetitions. As stated in Section 4, the maximum number of FFCs and the total number of individuals are 300 and 50, respectively. Therefore, the sub-populations communicate with each other every 50 FFCs. Additionally, the number of sub-populations is five.

Figure 3

Multilabel accuracy (↑) for the best individual in each sub-population, obtained using three methods.
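To make the Proposed-SC baseline more concrete, the sketch below shows one common form of roulette wheel (fitness-proportionate) selection and a simple exchange of best individuals between sub-populations. It is an illustration under stated assumptions rather than the authors' implementation: individuals are assumed to be lists of 0/1 feature flags, fitness is assumed to be non-negative with larger values being better, and the ring-shaped exchange that overwrites each sub-population's worst individual is a hypothetical choice.

```python
import random


def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate case (all fitness values are zero): uniform choice.
        return rng.choice(population)
    pick = rng.random() * total
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if pick < cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off


def exchange_best(subpops, fitness_fn):
    """Simple communication: the worst individual of each sub-population is
    replaced by a copy of the best individual of the preceding sub-population
    (ring order is an assumption). Modifies `subpops` in place."""
    bests = [max(pop, key=fitness_fn) for pop in subpops]
    for i, pop in enumerate(subpops):
        worst = min(range(len(pop)), key=lambda j: fitness_fn(pop[j]))
        pop[worst] = list(bests[i - 1])
    return subpops


# Toy example: fitness is the number of selected features (hypothetical).
subpops = [
    [[1, 1, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 1], [0, 0, 1]],
]
exchange_best(subpops, fitness_fn=sum)
print(subpops)

rng = random.Random(0)
parent = roulette_wheel_select(subpops[0], [sum(ind) for ind in subpops[0]], rng)
print(parent)
```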

As shown in Figure 3, Proposed-LCC exhibited a better search capability than Proposed-SC and Proposed-NC on the eight multilabel datasets. We note that Proposed-SC and Proposed-NC exhibited a similar level of search capability in MLFS, and Proposed-SC even revealed a worse search capability than Proposed-NC, the method without communication, on the Education dataset. This implies that the simple communication method of exchanging the best individuals failed to deal with the multiple labels. In contrast, Proposed-LCC conducted effective MLFS searches. In particular, in Figure 3c, the initial sub-populations of Proposed-LCC revealed relatively low multilabel accuracy (at 50 FFCs). This is because each sub-population consists of different features owing to our initialization method and, thus, may not be related to the entire label set. During the search process, Proposed-LCC exhibited an effective improvement in multilabel accuracy. This indicates that the proposed label complementary communication method can improve the search capability of the sub-populations by referencing individuals from other sub-populations based on the discriminating power of subsets with regard to labels that are difficult to classify. We also conducted the paired t-test at the 95% confidence level to determine whether the three methods were statistically different. For fairness, the results of Proposed-SC and Proposed-NC were also obtained from 10 repetitions on each of the eight datasets. Figure 4 presents the pairwise comparison results on each dataset in terms of the multilabel accuracy; the p-value for each test is shown in each subfigure, and an asterisk indicates that the corresponding null hypothesis was rejected. As shown in Figure 4, Proposed-LCC significantly outperformed Proposed-SC on seven datasets (all except the Recreation dataset) and outperformed Proposed-NC on all datasets. On the other hand, no significant difference was found between Proposed-SC and Proposed-NC on any dataset.
As a result, the additional experiment and statistical test verify that the proposed label complementary communication successfully improves the search capability of the sub-populations with regard to MLFS.

Figure 4

Pairwise comparison results of the paired t-test at the 95% confidence level in terms of multilabel accuracy (↑).

5. Conclusions

In this paper, we proposed a novel MPGA with label complementary communication, which specializes in solving the MLFS problem. It is aimed at improving the search capability of sub-populations through a communication process that employs the complementary discriminating powers of sub-populations with regard to multiple labels. Our experimental results and statistical tests verified that the proposed method significantly outperformed three state-of-the-art multi-population-based feature selection methods on 17 multilabel datasets. Future studies can address a limitation of the proposed method: we have simply set the number of labels to be complemented to half the total number of labels. As the search progresses, this value could instead be adjusted according to the improvement in the discriminating power for each label. For example, the proposed label complementary communication may be conducted only for labels for which the discrimination performance is not better than that in the previous generation.
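The adaptive rule suggested as future work can be sketched as follows. This is only a hypothetical illustration of the idea: the function name, the per-label score lists, and the example values are assumptions, and the proposed method itself uses the fixed rule of complementing half of the labels.

```python
def labels_to_complement(prev_scores, curr_scores):
    """Return the indices of labels whose discrimination score did not
    improve over the previous generation (larger score = better)."""
    return [i for i, (prev, curr) in enumerate(zip(prev_scores, curr_scores))
            if curr <= prev]


# Hypothetical per-label scores in two consecutive generations.
prev_scores = [0.60, 0.72, 0.55, 0.80]
curr_scores = [0.65, 0.72, 0.50, 0.85]

fixed_half = len(curr_scores) // 2          # current rule: half of the labels
adaptive = labels_to_complement(prev_scores, curr_scores)

print(fixed_half)  # number of labels complemented under the fixed rule
print(adaptive)    # labels 1 and 2 did not improve
```

With such a rule, the number of complemented labels would vary from generation to generation instead of staying at half the label count.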
