Zoo: Selecting Transcriptomic and Methylomic Biomarkers by Ensembling Animal-Inspired Swarm Intelligence Feature Selection Algorithms.

Yuanyuan Han1, Lan Huang1, Fengfeng Zhou1.   

Abstract

Biological omics data such as transcriptomes and methylomes have the inherent "large p small n" paradigm, i.e., the number of features is much larger than that of the samples. A feature selection (FS) algorithm selects a subset of the transcriptomic or methylomic biomarkers in order to build a better prediction model. The hidden patterns in the FS solution space make it challenging to achieve a feature subset with satisfying prediction performances. Swarm intelligence (SI) algorithms mimic the target searching behaviors of various animals and have demonstrated promising capabilities in selecting features with good machine learning performances. Our study revealed that different SI-based feature selection algorithms contributed complementary searching capabilities in the FS solution space, and their collaboration generated a better feature subset than the individual SI feature selection algorithms. Nine SI-based feature selection algorithms were integrated to vote for the selected features, which were further refined by the dynamic recursive feature elimination framework. In most cases, the proposed Zoo algorithm outperformed the existing feature selection algorithms on transcriptomics and methylomics datasets.

Keywords:  feature selection; machine learning; prediction; program code; swarm intelligence

Year:  2021        PMID: 34828418      PMCID: PMC8621246          DOI: 10.3390/genes12111814

Source DB:  PubMed          Journal:  Genes (Basel)        ISSN: 2073-4425            Impact factor:   4.096


1. Introduction

The accelerated accumulation of omics data has benefited from the rapid innovation and development of various high-throughput omics technologies [1]. There are many types of omics data, including genomics, transcriptomics, methylomics, metabolomics and proteomics data, that describe biological systems from different perspectives [2]. They also introduce the challenge of high feature dimensionality for data analysis, i.e., the number of features in a dataset usually far exceeds that of the samples [3]. This curse of dimensionality may be partly addressed by dimension reduction [4] or feature selection [5,6]. Feature selection is an NP-hard problem whose global optimal solution cannot be found within polynomial time [7]. Thus, except for the exhaustive searching strategy, all the existing feature selection algorithms try to find feature subsets with locally optimized performances. Feature selection algorithms may be roughly grouped into filters and wrappers [8]. A filter ranks the features in the descending order of their associations with the class labels, and the association between a feature and the class label may be measured by various metrics such as the t-test [9] and the correlation coefficient [10]. A wrapper iteratively evaluates heuristically generated feature subsets with a predefined classifier and outputs the feature subset with the best optimization performance [11,12]. More complicated frameworks have also been designed to find feature subsets with better prediction performances, e.g., embedded [13] and meta-heuristic [14] feature selection algorithms. Swarm intelligence (SI) is a type of meta-heuristic feature selection algorithm that imitates living organisms' behaviors to generate intermediate feature subsets for performance evaluations [15]. 
An SI feature selection algorithm abstracts the living organisms' behaviors into algorithmic operations on feature subsets, including genetic information exchanges and dynamic searching strategies [16]. Popular SI feature selection algorithms include Grey Wolf Optimization (GWO) [17], Cuckoo Search (CS) [18], the Whale Optimization Algorithm (WOA) [19], the Bat Algorithm (BA) [20], the Firefly Algorithm (FA) [21], the Moth–Flame Optimization algorithm (MFO) [22], Particle Swarm Optimization (PSO) [23,24], the Manta Ray Foraging Optimization algorithm (MRFO) [25] and the Dragonfly Algorithm (DF) [26]. Datasets have inherent patterns, and an SI algorithm usually cannot guarantee the choice of the best feature subset on all datasets. Rostami et al. studied 11 state-of-the-art swarm intelligence algorithms for feature selection problems. The results showed that swarm intelligence algorithms tend to fall into locally optimal solutions on high-dimensional datasets, and that different swarm intelligence algorithms perform differently [27]. Brezocnik et al. found that the promising swarm intelligence algorithms used for feature selection included PSO, BA, GWO, FA, DF and ant colony optimization (ACO), while many swarm intelligence algorithms were rarely applied to feature selection problems. Some of the latest algorithms, such as BCO, CS, FA and GWO, were also used in conjunction with other techniques and showed very promising results in FS [28]. Our study revealed that the integration of multiple SI feature selection algorithms might deliver satisfying solutions for most datasets. Thus, this study integrated and evaluated the recommended feature subsets of nine SI-based feature selection algorithms: WOA, BA, CS, FA, MFO, PSO, MRFO, DF and GWO. 
A majority voting strategy was used to find the features recommended by more than four SI feature selection algorithms, and the redundant features were further refined by the dynamical recursive feature elimination (dRFE) strategy [29]. The proposed feature selection algorithm Zoo was comprehensively evaluated for the prediction performances of its recommended feature subsets, and its source code is publicly available at http://www.healthinformaticslab.org/supp/resources.php (accessed on 9 November 2021).

2. Materials and Methods

2.1. Summary of Datasets

This study evaluated the performances of feature selection algorithms using the binary classification problems of transcriptome and methylome datasets, as shown in Table S1. Firstly, the proposed swarm intelligence (SI) feature selection algorithm Zoo was tuned using 17 popular publicly available transcriptome datasets [30], consisting of 15 cancer datasets and 2 cardiovascular disease ones. They include the 5 datasets of Myeloma (accession: GDS531) [31], Gastric (accession: GSE37023) [32], Gastric1/Gastric2 (accession: GSE29272) [33], T1D (accession: GSE35725) [34] and Stroke (accession: GSE22255) [35] obtained from the NCBI Gene Expression Omnibus (GEO) database; the 6 datasets of DLBCL [36], Prostate [37], ALL [38], CNS [39], Lymphoma [40] and Adenoma [41] provided by the Broad Institute Genome Data Analysis Center; and the 2 datasets of Colon [42] and Leukemia [43] obtained from the R/Bioconductor packages colonCA and golubEsets, respectively. The ALL dataset was divided into 4 datasets, ALL1, ALL2, ALL3 and ALL4, according to different phenotypic annotations. Ten additional transcriptome datasets were chosen in order to compare the prediction performances of the proposed algorithm Zoo and the existing feature selection algorithms, as shown in Table S1. These ten binary classification datasets were retrieved from the Gene Expression Omnibus (GEO) database [44]. The thyroid cancer samples with different phenotypes (GSE35570-1 and GSE35570-2, under the accession number GSE35570) were profiled using the platform GPL570 (HG-U133_Plus_2, Affymetrix Human Genome U133 Plus 2.0 Array), which has 54,675 features. 
This GPL570 platform was also used to profile the transcriptomes of peripheral blood lymphocytes with and without autism (GSE25507) [45], Parkinson's disease and controls (GSE99039) [46], metastatic recurrent and primary colorectal cancers (GSE21510) [47], lung cancers and the matched distant normal lung tissues (GSE33532) [48], female lung cancers and controls (GSE19804) [49], breast cancers and controls (GSE27562) [50] and lung cancers in early and late stages (GSE30219) [51]. The transcriptomes of lung cancers in males and females (GSE4824) [52] were profiled using another platform, GPL96 (HG-U133A, Affymetrix Human Genome U133A Array), which has 22,283 features. Five methylome datasets were chosen to evaluate how the investigated feature selection algorithms perform on different types of omics data, as shown in Table S1. The methylation platform GPL13534 (Illumina HumanMethylation450 BeadChip, HumanMethylation450_15017482) was used to profile the methylomes of these 5 datasets, providing 485,577 methylation features. This study abstracted binary classification problems from the methylomes of peripheral blood mononuclear cells of smokers and non-smokers (GSE53045) [53], breast cancers and normal samples (GSE66695) [54], normal fallopian tube samples with and without BRCA1/2 mutations (GSE74845) [55], Alzheimer's disease and controls (GSE80970) [56] and gastric light or mild intestinal metaplasia (GSE103186) [57]. Features with missing data were removed from further analysis. A stratified split strategy with the ratios 1:1:1 was used to divide each dataset into the training, validation and testing subsets. The features were selected based on the training dataset, and the parameters were optimized based on the validation dataset. The final performance was calculated using the test dataset.
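The stratified 1:1:1 split described above can be sketched with two calls to scikit-learn's `train_test_split`; the toy data, sample counts and random seed below are illustrative assumptions, not the study's actual datasets.

```python
# Sketch of a 1:1:1 stratified split into training, validation and test
# subsets; the 90-sample toy dataset is an illustrative assumption.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 10))          # 90 samples x 10 features (toy data)
y = np.array([1] * 45 + [0] * 45)      # balanced binary class labels

# First hold out 1/3 as the test set, then split the remainder in half,
# stratifying on the labels so each subset keeps the class ratio.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```

Stratification guarantees that the positive:negative ratio is preserved in all three subsets, which matters for the imbalanced omics datasets described above.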

2.2. Performance Metrics

This study evaluated a feature selection algorithm according to the binary classification performances of its recommended feature subset. A binary classification problem had two classes of samples, i.e., positive and negative ones. The numbers of positive and negative samples were denoted as P and N [58]. The prediction accuracy of the positive samples was calculated as sensitivity, i.e., Sn = TP/(TP + FN), where TP and FN were the numbers of correctly and incorrectly predicted positive samples, respectively. The specificity (Sp) was similarly defined for the negative samples, and Sp = TN/(TN + FP), where TN and FP were the numbers of true negatives and false positives, respectively. The overall accuracy was defined as Acc = (TP + TN)/(TP + FN + TN + FP). The metric Acc was used to evaluate all the feature selection algorithms.
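The three metrics defined above can be sketched directly from their formulas; the label encoding (1 = positive, 0 = negative) and the toy predictions are illustrative assumptions.

```python
# Minimal sketch of the metrics defined above: Sn, Sp and Acc computed
# from true and predicted binary labels (1 = positive, 0 = negative).
def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sn = tp / (tp + fn)                      # sensitivity on positives
    sp = tn / (tn + fp)                      # specificity on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    return sn, sp, acc

# Toy example: one missed positive and one missed negative out of six samples.
sn, sp, acc = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```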

2.3. Stratified k-Fold Cross Validation Strategy

A stratified three-fold cross-validation (S3FCV) strategy [59] was utilized to evaluate the classification performances. The random seed was set to 0. S3FCV randomly split the positive and negative samples into three equally sized subsets. In each iteration, one positive and one negative subset was combined as the test set, and the remaining samples were used to train the classification model. S3FCV ensured that each sample was used as a test sample once and only once, and the same ratio between positive and negative samples was maintained in the training and test datasets. This study implemented and carried out all the experiments in the Python programming language version 3.7.6.
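The S3FCV strategy above maps directly onto scikit-learn's `StratifiedKFold`; the SVM classifier and toy data below are illustrative assumptions rather than the study's exact setup.

```python
# Sketch of stratified three-fold cross-validation (S3FCV) with seed 0;
# the toy data and SVC classifier are illustrative assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(60, 20)                  # 60 samples x 20 features (toy data)
y = np.array([1] * 30 + [0] * 30)      # balanced positive/negative labels

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
accs = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))

# Each sample lands in exactly one test fold, and every fold keeps the
# positive:negative ratio of the full dataset.
mean_acc = sum(accs) / len(accs)
```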

2.4. Nine Swarm Intelligence Feature Selection Algorithms

Swarm intelligence (SI) optimization algorithms have demonstrated powerful capabilities in many combinatorial optimization problems, and many SI algorithms have been adapted for the feature selection task [27,60]. The Whale Optimization Algorithm (WOA) mimics the bubble-net hunting behavior of humpback whales [19,61]. WOA randomly searches for solutions in the exploration stage, while the exploitation stage carries out a delicate local search in the search space around a promising solution revealed in the exploration stage. WOA uses a logarithmic spiral function to mathematically formulate the behavior whereby a humpback whale creates a spiral bubble net around the prey. The Bat Algorithm (BA) carries out its optimization procedure using operations inspired by the bat's echolocation behaviors [62]. A bat's flight is affected by the echolocation's frequency, speed and loudness, and these variables are adjusted based on the proximity to the target. Cuckoo Search (CS) searches for the optimization target using three rules inspired by the brood parasitism of certain species of cuckoos [63,64]. CS assumes that each cuckoo lays an egg in one randomly selected nest, that the best nest among the selected ones will be reserved for the next generation of cuckoos, and that the number of available bird nests is fixed. The host bird of a nest has a probability of finding the cuckoo egg in its nest. If this happens, the host bird will remove the cuckoo egg or build a new nest instead. Yang X. S. developed the Firefly Algorithm (FA) in 2008 by mimicking the flashing behaviors of fireflies [65,66]. Fireflies are unisex, and a firefly with a brighter flashing light attracts neighboring fireflies to move toward it. The Moth–Flame Optimization (MFO) algorithm is a meta-heuristic algorithm simulating the navigation mode of moths [22,67]. A moth flies in a straight line toward a remote target by maintaining a fixed angle with the moon at night. 
This habit causes moths to be trapped in a spiral path around artificial lights, and MFO mathematically formulates this behavior to optimize the feature selection procedure. Particle Swarm Optimization (PSO) places a swarm of particles in the solution space and evaluates the fitness of each particle [68,69]. The movement of each particle is defined by its own historical locations, the best known locations and the other particles' information, and random perturbations are also considered. The whole swarm is expected to move close to a locally optimal solution with regard to the fitness function. The Manta Ray Foraging Optimization (MRFO) algorithm mathematically formulates the foraging strategies of manta rays [25,70]. Three foraging strategies of manta rays are abstracted as optimization rules, i.e., chain foraging, cyclone foraging and somersault foraging. The Dragonfly Algorithm (DF) is another popular optimization algorithm inspired by the foraging and migration behaviors of dragonflies [26,71]. The operation separation mimics the mechanism whereby two neighboring dragonflies avoid collisions with each other. The second operation, alignment, models how dragonflies match their movement velocities with their neighbors. The last operation, cohesion, models the dragonflies' tendency toward the neighborhood's mass center. Grey Wolf Optimization (GWO) is a bio-inspired SI optimization algorithm that mimics the hunting process of grey wolves in nature [72,73]. A wolf pack consists of four levels of social hierarchy, i.e., alpha, beta, delta and omega wolves. The alpha wolves make decisions, and the betas assist the alphas in decision making. The deltas are subordinate to the alphas and betas and are responsible for scouting and hunting, while the omegas have the lowest priority in eating the prey. The best feature selection solution is defined as the alpha, while the second and third best solutions are the beta and delta. The rest of the solutions are the omega wolves. 
The next generation of wolves is updated using the combined information of alpha, beta, delta and the random information.
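The GWO update just described can be illustrated with a minimal binary position step. This is not the authors' implementation: the sigmoid transfer function, its slope of 10, the equal weighting of the three leaders and the toy masks are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's code) of one binary GWO update:
# a wolf's continuous position is pulled toward the alpha/beta/delta
# leaders, then mapped through a sigmoid and thresholded to a 0/1 mask.
import numpy as np

def binary_gwo_step(wolf, leaders, a, rng):
    """One position update for one wolf; leaders = (alpha, beta, delta) masks."""
    candidates = []
    for leader in leaders:
        r1, r2 = rng.random(wolf.shape), rng.random(wolf.shape)
        A = 2 * a * r1 - a                        # exploration/exploitation factor
        D = np.abs(2 * r2 * leader - wolf)        # distance to this leader
        candidates.append(leader - A * D)         # continuous candidate position
    x = np.mean(candidates, axis=0)               # combine alpha/beta/delta information
    prob = 1 / (1 + np.exp(-10 * (x - 0.5)))      # sigmoid transfer to [0, 1]
    return (rng.random(wolf.shape) < prob).astype(int)  # stochastic 0/1 feature mask

rng = np.random.default_rng(0)
wolf = rng.integers(0, 2, 8)                      # toy 8-feature mask
leaders = [rng.integers(0, 2, 8) for _ in range(3)]
new_wolf = binary_gwo_step(wolf, leaders, a=1.0, rng=rng)
```

The same transfer-function idea underlies the binary versions of the other equation-based SI algorithms discussed in Section 2.6.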

2.5. The Ensemble SI-Based Feature Selection Algorithm Zoo

The first step of the proposed Zoo algorithm evaluated the association of each feature with the class label in the training dataset using the t-test, and ranked the features in ascending order of the t-test p-values, as shown in Figure 1. Most swarm intelligence (SI) algorithms had high time complexities due to the population-based random solution searching strategy. In order to avoid an extremely long running time, this study retrieved the top-ranked 1000 features to evaluate the SI algorithms.
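The first Zoo step can be sketched with SciPy's two-sample t-test; the toy dataset and the 10-feature cutoff (standing in for the 1000-feature cutoff in the text) are illustrative assumptions.

```python
# Sketch of the first Zoo step: rank features by two-sample t-test p-value
# (ascending) and keep the top-ranked ones. The toy data and the cutoff of
# 10 (1000 in the actual study) are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))          # 30 samples x 50 features (toy data)
y = np.array([1] * 15 + [0] * 15)
X[y == 1, :5] += 2.0                   # make the first 5 features informative

_, pvals = ttest_ind(X[y == 1], X[y == 0], axis=0)
top_k = np.argsort(pvals)[:10]         # indices of the 10 smallest p-values
```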
Figure 1

Flowchart of the proposed feature selection algorithm Zoo.

Secondly, the 9 SI feature selection algorithms described in the above section were applied to the datasets using the selected 1000 features. The binary version of each SI algorithm was used as a feature selection algorithm in this study. Ten random runs of each SI algorithm were carried out, and the feature subset with the highest prediction accuracy on the validation dataset was output as the final solution. Thirdly, the votes for each feature were counted across the nine SI feature selection algorithms, and the majority rule was used to generate the subset of features. A dynamic recursive feature elimination (dRFE) strategy was used to further refine this subset of features. The S3FCV strategy was used in the SVM-based dRFE framework, with 7 as the maximal number of features removed in each iteration. The feature subset achieving the best prediction accuracy was delivered as the final output.
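The majority-voting step above can be sketched over the nine 0/1 selection masks; the toy masks and the 6-feature dimension are illustrative assumptions.

```python
# Sketch of the majority-voting step: keep features chosen by more than
# four of the nine SI algorithms. The toy 9 x 6 mask matrix (rows =
# algorithms, columns = features) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n_features = 6
masks = rng.integers(0, 2, size=(9, n_features))  # 0/1 selection masks

votes = masks.sum(axis=0)                 # votes per feature, in 0..9
selected = np.where(votes > 4)[0]         # majority rule: > 4 of 9 votes
```

In Zoo, this voted subset is then passed to the SVM-based dRFE refinement rather than used directly.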

2.6. Binary Animal-Inspired SI-Based Feature Selection Algorithms

Feature selection may be formulated as a binary SI algorithm, in which a binary-valued array represents a feature subset, and the value 1 or 0 in each position of this array denotes whether the corresponding feature is chosen or not. All of the nine animal-inspired SI algorithms investigated in this study are equation-based algorithms [74], and they randomly initiate a set of feature subsets for their own optimization procedures. The binary version of the Manta Ray Foraging Optimization (MRFO) algorithm was re-implemented in Python from the original Matlab code [72]. Additionally, the Dragonfly Algorithm (DF) was implemented based on the original Matlab codes. The binary feature selection versions of the other seven SI algorithms were implemented using the open-source framework Evolopy-FS [75,76]. The fitness function is defined so as to integrate the effects of both the classification error rate and the number of selected features, similar to [77]. The parameter ω balances the two factors, the error rate E and the rate of selected features Selected/Dimension, where Selected and Dimension are the numbers of selected features and of all features, respectively. This study set ω = 0.9. Three classifiers were used to calculate the classification performances of the fitness functions, using the training and testing subsets split by the ratio 2:1 from the training dataset. The three classifiers are the Support Vector Machine (SVM), Naïve Bayes (NBayes) and k Nearest Neighbor (KNN).
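The fitness function above, Fitness = ω × E + (1 − ω) × Selected/Dimension with ω = 0.9, can be sketched in a few lines; the example subset size and error rate are illustrative assumptions.

```python
# Sketch of the SI fitness function from the text: lower is better,
# combining the classification error rate and the fraction of selected
# features; omega = 0.9 follows the study's setting.
def fitness(error_rate, n_selected, n_total, omega=0.9):
    return omega * error_rate + (1 - omega) * (n_selected / n_total)

# Illustrative subset: 100 of 1000 features with a 10% error rate,
# giving 0.9 * 0.1 + 0.1 * 0.1 = 0.10.
f = fitness(error_rate=0.10, n_selected=100, n_total=1000)
```

Because ω = 0.9, the error rate dominates the score, and the feature-count term acts as a mild tie-breaker toward smaller subsets.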

2.7. The Existing Feature Selection Algorithms

The proposed Zoo algorithm was compared with nine existing feature selection algorithms using three binary classifiers. In order to maintain a fair comparison, the number of features selected by a feature selection algorithm was set to be the same as Zoo. The parameters of the nine feature selection algorithms for comparison are described in Table S2. Each algorithm is abbreviated in the brackets and referenced as a function in the Python package sklearn version 0.19.2. The features may be ranked by four algorithms, i.e., adaptive boosting (AdaBoost), the gini index of the decision tree classifier (DT_gini), Gradient Boosting (GB) and Random Forest (RF). A binary classification model was trained using one of two algorithms, i.e., L1 regularized logistic regression (LR_L1) and Linear Support Vector Machine (lSVC_L1). The model coefficients are used to rank the features in descending order. The Recursive Feature Elimination (RFE) strategy may be used with the two classifiers: Support Vector Machine (RFE_SVC) and Random Forest (RFE_RF). The function SelectKBest() was also used to select the top-ranked k features (abbreviated as SK_mic). The performance metric maximum accuracy (mAcc) was used to evaluate the feature selection algorithm. The S3FCV strategy was used to calculate the classification performances using the five classifiers, i.e., logistic regression (LR), k Nearest Neighbor (KNN), Gaussian Naïve Bayes classifier (NBayes), Decision Tree (DT) and Support Vector Machine (SVM).
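Two of the baselines above can be sketched with scikit-learn; the toy data and k = 5 (which in practice is set to the size of the Zoo subset) are illustrative assumptions, and modern sklearn versions are used rather than the paper's 0.19.2.

```python
# Sketch of two comparison baselines: RFE with a linear SVC (RFE_SVC) and
# SelectKBest with mutual information (SK_mic). Toy data and k = 5 are
# illustrative assumptions; k would match the Zoo subset size in practice.
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))               # 40 samples x 30 features (toy)
y = rng.integers(0, 2, size=40)
k = 5

rfe = RFE(LinearSVC(max_iter=5000), n_features_to_select=k).fit(X, y)
rfe_idx = np.where(rfe.support_)[0]         # features kept by RFE_SVC

skb = SelectKBest(mutual_info_classif, k=k).fit(X, y)
skb_idx = skb.get_support(indices=True)     # features kept by SK_mic
```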

3. Results

3.1. Evaluating the Classifiers for the Selected Features

Seven among the first seventeen transcriptome datasets received the worst prediction performances in the previous study [30], and these datasets were used to tune the algorithmic parameters in this study. The details of these datasets are annotated in Table S2. Figure 2 shows the experimental results of the t-test-based Incremental Feature Selection (IFS) strategies [78] with at most 100 features. The SVM classifier only achieved Acc = 0.7500 using 66 features for the dataset CNS, and the best accuracy was only 0.9247 using 30 features for the dataset ALL4. Thus, better features need to be found for the prediction tasks on these datasets, which were used in the following sections to tune the parameters.
Figure 2

Incremental feature selection based on t-test for seven datasets. The vertical axis gives the accuracy (Acc) of the top-ranked k features by the SVM classifier. The horizontal axis lists the value of k. Acc was calculated using the S10FCV strategy. (a) The performance evaluation on the three leukemia datasets: ALL2, ALL3 and ALL4. (b) The performance evaluation on the four datasets: CNS, Colon, Mye and T1D.

Three classifiers, SVM, NBayes and KNN, were evaluated for their classification performances when each was used in the fitness function of the Zoo feature selection algorithm, as shown in Figure 3. The fitness function was defined as Fitness = ω × E + (1 − ω) × R, where E was the error rate of the classification model, and R was the ratio of the selected features among all of them. This study set ω = 0.9.
Figure 3

Performance comparison of the three classifiers for their integration in the fitness function of the Zoo feature selection algorithm. The horizontal axis lists the seven datasets, and the vertical axis gives the data of the performance metric mAcc using the S3FCV strategy. The metric mAcc was calculated as the maximum Acc using the five classifiers: NBayes, SVM, LR, DT and KNN on the Zoo-recommended features.

The population size and the maximum number of iterations were set to 50 and 100 for all of the nine SI feature selection algorithms. The major parameters of the nine SI feature selection algorithms were set to the default values, as listed in Table S3. Each dataset was filtered by the t-test, and the top-ranked 1000 features were screened by a random run of each of the nine SI feature selection algorithms. A majority voting strategy was used to find the features recommended by more than four out of the nine SI feature selection algorithms. A further refining step using the dRFE algorithm was carried out to remove potentially redundant features in each feature subset. The remaining features were used to build the prediction model using the same classifier integrated in the fitness function. Figure 3 shows that the classifier NBayes achieved the best classification performances for five out of the seven datasets, while the classifiers KNN and SVM performed the best only for four and three datasets, respectively. Thus, NBayes was used as the classifier integrated into the fitness function of the Zoo algorithm.

3.2. Finding the Best Population Size for Five SI Algorithms

The internal parameters of the five SI feature selection algorithms GWO/WOA/FA/MFO/MRFO were randomly generated, and their population sizes (variable N) were evaluated for the classification accuracies of their recommended features, as shown in Figure 4. Due to the high time complexities of the SI algorithms, all the seven datasets evaluated in this experiment were firstly screened by the t-test, and only the top-ranked 1000 features between the two groups of each dataset were loaded to the SI feature selection algorithms. Each SI algorithm selected features from the training dataset and evaluated these features on the validation dataset. The classification accuracy of the finally recommended features was calculated on the test dataset. For a fair comparison, the maximum number of iterations was set to 100 for all the five SI feature selection algorithms evaluated in this section.
Figure 4

Evaluation of population sizes for the five SI feature selection algorithms. The prediction accuracies of the default classifier NBayes using the features recommended by the SI feature selection algorithms (a) GWO, (b) WOA, (c) FA, (d) MFO and (e) MRFO. The rows give the data for the population sizes N = 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. Each classification accuracy was colored on a red scale, with a deeper red color for a smaller accuracy.

GWO achieved the best averaged rank of 3.1429 for N = 10, as shown in Figure 4a. The prediction accuracies of the GWO-recommended features were ranked 7, 5, 2, 1, 2, 3 and 2 for the seven difficult datasets: ALL2, ALL3, ALL4, CNS, Colon, Mye and T1D, while the second-best averaged rank of 3.7143 was achieved by N = 30. From the perspective of prediction accuracies, GWO recommended the best averaged prediction accuracy of 0.7021 for the seven datasets when N = 10. The second-best averaged prediction accuracy was 0.6910 for N = 80. Thus, the following sections used N = 10 for the GWO algorithm. WOA achieved the best averaged rank of 1.5714 on the seven datasets for N = 10, as shown in Figure 4b. The data showed that the WOA-selected features with N = 10 achieved the best prediction accuracies on four out of the seven datasets, i.e., ALL4, CNS, Mye and T1D. Although the WOA-selected features with N = 60 achieved a slightly better averaged accuracy of 0.6928 than that (averaged accuracy 0.6899) with N = 10, N = 60 only achieved the third-best averaged rank over the seven datasets. Thus, WOA used N = 10 in the following sections. The FA-selected features achieved the best averaged rank for N = 30 and 50, as shown in Figure 4c. N = 70 achieved the best averaged accuracy of 0.7000, which was only slightly better than that (0.6861) for N = 30 and 50. A larger population size (N) required a longer running time. Thus, this study set N = 30 as the default population size for the FA feature selection algorithm. MFO recommended features with N = 90 to achieve the best averaged rank (1.2857) and the best prediction accuracy (0.7076), as shown in Figure 4d. Actually, the MFO-selected features achieved the best prediction accuracies on six out of the seven evaluated datasets. Thus, the population size of MFO was set as 90 by default in this study. 
Figure 4e shows that MRFO recommended the features achieving the best averaged rank (3.2857) and the best averaged prediction accuracy (0.6877) with N = 10. The second-best averaged rank (4.2857) was achieved with N = 100. Thus, the remainder of this study set the default population size N = 10 for MRFO.

3.3. Parameter Tunings of the Other Four SI Algorithms

The other four SI feature selection algorithms carried different parameters and were optimized separately. Due to the high complexities in the various parameters of these SI algorithms, the population size N and the number of iterations T were initialized as N = 50 and T = 100. The Bat Algorithm (BA) had three parameters: the pulse emission rate (R), the loudness (A) and the population size (N), which are evaluated in Figure S1a. To simplify the evaluation procedure, this study assumed R = A. Figure S1a shows that R = A = 0.8 achieved the best averaged rank of 1.2857 for the BA algorithm, and 474.00 features were recommended by BA on average. Since R = A = 0.2 achieved a slightly worse averaged rank of 1.4286 with a better averaged number of features (460.14), this study chose R = A = 0.2 as the default value for BA. Then, the BA algorithm was evaluated for its different population sizes. Both the best averaged accuracy (0.6995) and the best averaged rank (1.4286) were achieved by N = 30 for the BA algorithm. Thus, the default population size N was set as 30. The Particle Swarm Optimization (PSO) algorithm needed to set the lower bound of the inertia weight (denoted as MinW), which is evaluated in Figure S1b. The population size (N) was also evaluated. Both MinW = 0.1 and 0.2 achieved the best averaged accuracy of 0.6957 and the best averaged rank of 1.2857. However, the PSO algorithm recommended more than 27 additional features using MinW = 0.2 compared with MinW = 0.1. Thus, the remaining sections of this study set MinW = 0.1 as the default value. The PSO-selected features achieved the overall highest accuracy of 0.9500 using N = 80 on the dataset Colon, which was at least 0.1000 larger than the second-best accuracy of 0.8500. The averaged rank for N = 80 was 2.2857, the fourth-best averaged rank. This was mainly due to N = 80 achieving an accuracy of 0.6000 on the dataset CNS, which was smaller than that (0.6500) of the cases N = 40 and 90 with the best averaged rank of 2.0000. 
Thus, this study set N = 80 as the default population size of the PSO algorithm. The Cuckoo Search (CS) algorithm mimicked the cuckoo's brood parasitism and the risk of its eggs being discovered by the host birds [65,66]. The CS parameter ProbF and the population size (N) are evaluated in Figure S1c. The parameter ProbF = 0.8 achieved the best averaged rank of 1.8571 and the best averaged accuracy of 0.6933. Another value, ProbF = 0.3, achieved the second best in both the averaged rank (2.2857) and the averaged accuracy (0.6861). Since the value ProbF = 0.3 was closer to the popular value choice of 0.25 [65,66] and recommended 6.86 fewer features than ProbF = 0.8, this study set ProbF = 0.3 as the default choice. The population size N = 60 achieved the best averaged rank of 2.1429, while the value N = 80 achieved the best averaged accuracy of 0.6869. This was due to the four values (N = 30, 50, 60 and 90) achieving the best accuracy of 0.8103, while N = 80 achieved a slightly worse accuracy of 0.7931. Since N = 80 achieved 0.0500 accuracy improvements on the two other datasets CNS and Colon, this study set N = 80 as the default population size of the CS algorithm. The lower bound of the inertia weight (denoted as MinW) and the population size (N) of the Dragonfly (DF) algorithm are evaluated in Figure S1d. The parameters MinW = 0.5 and 0.9 achieved the top two averaged accuracies of 0.6670 and 0.6645, respectively. Although these two values of the parameter MinW performed only slightly differently, the DF algorithm with MinW = 0.5 selected 97.71 features on average, which was much fewer than the 147.14 features selected with MinW = 0.9. Thus, the default value of MinW was set as 0.5 in this study. The population size N = 30 achieved the best in both the averaged rank (1.7143) and the averaged accuracy (0.6937). Thus, this value (30) was set as the default population size of the DF algorithm.

3.4. Finding the Best Classifier for Zoo

The Zoo-selected features were evaluated by five popular classifiers, i.e., KNN, NBayes, SVM, LR and DT, as shown in Figure 5. Each of the nine SI feature selection algorithms was executed for ten random runs on the training dataset, and the selected feature subset with the best prediction accuracy on the validation dataset was returned. The Zoo feature selection algorithm combined the nine feature subsets and carried out an additional feature screening using the dRFE algorithm to remove potentially redundant features [76]. The five classifiers evaluated the Zoo-selected features on the test dataset.
Figure 5

Performance comparison of the five classifiers on the features selected by the Zoo feature selection algorithm. The horizontal axis lists the seven datasets, and the last group of data columns gives the averaged performances of the five classifiers on the seven datasets. The vertical axis gives the prediction accuracies of the classifiers.

Figure 5 shows that the classifier KNN achieved the overall highest prediction accuracies on the seven datasets. Both KNN and LR achieved the best prediction accuracies on three datasets. It is interesting to observe that these two classifiers achieved the worst accuracy of 0.9000 on the ALL4 dataset, compared with the best accuracy of 0.9667 achieved by the NBayes and DT classifiers. Unfortunately, the NBayes and DT classifiers did not perform well on the other six datasets. This study recommends KNN as the default classifier to build prediction models using the Zoo-selected features.

3.5. Choosing the Maximum Number of Iterations

We screened 500 iterations for each of the nine investigated SI feature selection algorithms, as shown in Figure 6. The curves in Figure 6 show that some SI algorithms converged to their minimum fitness values very early; the FA and DF algorithms converged at the first and eighth iterations, respectively. All of the SI feature selection algorithms reached stable averaged fitness values after 150 iterations. We therefore evaluated, for each algorithm, the difference between the minimum fitness value over all 500 iterations and the fitness at the 150th iteration. Besides FA and DF, the PSO algorithm also reached a difference of 0. The BA algorithm achieved a difference of 4.29 × 10−5, and the maximum difference of 1.2 × 10−3 was achieved by the GWO algorithm. Considering these minor differences in the fitness values and the time costs proportional to the number of iterations, this study chose the maximum number of iterations T = 150 for all nine SI feature selection algorithms.
Figure 6

Evaluation of the maximum numbers of iterations for the nine SI feature selection algorithms. The horizontal axis lists the maximum numbers of iterations. The vertical axis gives the averaged fitness values of the selected feature subsets over the seven datasets.
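The stopping criterion above can be reproduced generically: record each algorithm's best-so-far fitness curve and compare the value already reached at iteration 150 against the minimum over all 500 iterations. The sketch below uses a toy random search as a stand-in for the SI algorithms; the fitness function and seeds are made up for illustration.

```python
import random

def fitness_curve(seed, n_iter=500):
    """Best-so-far (monotonically non-increasing) fitness of a toy random
    search minimizing x^2 on [-1, 1], standing in for one SI algorithm's
    convergence curve."""
    rng = random.Random(seed)
    best = float("inf")
    curve = []
    for _ in range(n_iter):
        best = min(best, rng.uniform(-1, 1) ** 2)
        curve.append(best)
    return curve

# The paper's criterion: fitness at iteration 150 minus the minimum
# fitness over all 500 iterations, per algorithm.
curves = [fitness_curve(s) for s in range(9)]  # nine stand-in algorithms
diffs = [c[149] - min(c) for c in curves]
cap_is_safe = all(d < 1e-2 for d in diffs)
```

When every difference is negligible, capping the iterations at T = 150 sacrifices essentially no fitness while cutting the run time by more than two-thirds.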

3.6. Comparison with Other Feature Selection Algorithms

The features selected by Zoo achieved generally satisfactory prediction accuracies on the 32 transcriptome and methylome datasets, as shown in Figure 7. Firstly, the Zoo-recommended features achieved the best averaged accuracy of 0.7982 over the 32 datasets. The feature selection algorithm LR_L1 achieved the second-best averaged accuracy of 0.7730, while none of the other eight feature selection algorithms achieved an averaged accuracy above 0.7600. Secondly, the Zoo-recommended features also achieved the best averaged rank of 2.7813 on the 32 datasets, and were ranked the best on 15 of the 32 datasets.
Figure 7

Heatmap table of the classification performances using the features recommended by the nine existing feature selection algorithms and Zoo. All of the 32 datasets were evaluated using the KNN classifier. Each row gives the data of one dataset, and the last row gives the averaged accuracy of each feature selection algorithm on the 32 datasets. A darker background represents a smaller accuracy in that row, and a white background represents the best accuracy in the same row. All nine feature selection algorithms compared with Zoo are available as functions in the Python package sklearn version 0.19.2, as shown in Table S2.
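The two comparison metrics used above (averaged accuracy and averaged rank across datasets) can be computed as below. The accuracy table here is made-up toy data for three hypothetical methods on three datasets, not the paper's results; ties are broken arbitrarily in this sketch.

```python
# acc[m][d] = accuracy of method m on dataset d (toy numbers).
acc = {
    "Zoo":   [0.90, 0.85, 0.80],
    "LR_L1": [0.88, 0.80, 0.82],
    "RFE":   [0.70, 0.75, 0.78],
}

def averaged_rank(acc):
    """Per dataset, rank the methods by accuracy (1 = best), then
    average each method's ranks across all datasets."""
    methods = list(acc)
    n_datasets = len(next(iter(acc.values())))
    ranks = {m: [] for m in methods}
    for d in range(n_datasets):
        ordered = sorted(methods, key=lambda m: acc[m][d], reverse=True)
        for r, m in enumerate(ordered, start=1):
            ranks[m].append(r)
    return {m: sum(v) / len(v) for m, v in ranks.items()}

avg_rank = averaged_rank(acc)
avg_acc = {m: sum(v) / len(v) for m, v in acc.items()}
```

A method can lead on averaged accuracy without leading on every dataset, which is why the paper reports both metrics: the averaged rank rewards consistency across datasets rather than a few large wins.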

The experimental data showed that the proposed feature selection algorithm Zoo tended to select features with very promising prediction accuracies compared with the nine existing algorithms.

4. Conclusions

This study proposed a novel feature selection algorithm, the Zoo algorithm, which integrates nine SI-based feature selection algorithms. Seven transcriptome datasets with small prediction accuracies in a previous study were used to tune the parameters of Zoo. The experimental data analysis showed that the SI-based feature selection algorithms recommended features with complementary contributions to each other, and that their union needed an additional redundancy-removal step by a feature selection algorithm such as dRFE. The comparison with nine existing feature selection algorithms showed that the Zoo-recommended features achieved promising prediction accuracies on transcriptomics and methylomics datasets. It is recommended that the Zoo algorithm be combined with the KNN classifier when building prediction models on the selected feature subset. The main limitation of Zoo is its running time of usually several hours, due to the high time complexities of the SI-based feature selection algorithms. Additionally, the current version of Zoo does not efficiently integrate the internal operators of the nine SI feature selection algorithms.
References (46 in total)

1.  Ectopic activation of germline and placental genes identifies aggressive metastasis-prone lung cancers.

Authors:  Sophie Rousseaux; Alexandra Debernardi; Baptiste Jacquiau; Anne-Laure Vitte; Aurélien Vesin; Hélène Nagy-Mignotte; Denis Moro-Sibilot; Pierre-Yves Brichon; Sylvie Lantuejoul; Pierre Hainaut; Julien Laffaire; Aurélien de Reyniès; David G Beer; Jean-François Timsit; Christian Brambilla; Elisabeth Brambilla; Saadi Khochbin
Journal:  Sci Transl Med       Date:  2013-05-22       Impact factor: 17.956

2.  A dynamic recursive feature elimination framework (dRFE) to further refine a set of OMIC biomarkers.

Authors:  Yuanyuan Han; Lan Huang; Fengfeng Zhou
Journal:  Bioinformatics       Date:  2021-01-30       Impact factor: 6.937

3.  FeSTwo, a two-step feature selection algorithm based on feature engineering and sampling for the chronological age regression problem.

Authors:  Zhipeng Wei; Shiying Ding; Meiyu Duan; Shuai Liu; Lan Huang; Fengfeng Zhou
Journal:  Comput Biol Med       Date:  2020-09-26       Impact factor: 4.589

4.  Clinical data classification using an enhanced SMOTE and chaotic evolutionary feature selection.

Authors:  S Sreejith; H Khanna Nehemiah; A Kannan
Journal:  Comput Biol Med       Date:  2020-09-18       Impact factor: 4.589

5.  Region of Interest Selection for Functional Features.

Authors:  Qiyue Wang; Yao Lu; Xiaoke Zhang; James Hahn
Journal:  Neurocomputing       Date:  2020-10-14       Impact factor: 5.719

6.  Integrating factor analysis and a transgenic mouse model to reveal a peripheral blood predictor of breast tumors.

Authors:  Heather G LaBreche; Joseph R Nevins; Erich Huang
Journal:  BMC Med Genomics       Date:  2011-07-22       Impact factor: 3.063

7.  Distant metastasis time to event analysis with CNNs in independent head and neck cancer cohorts.

Authors:  Marco Riboldi; Guillaume Landry; Elia Lombardo; Christopher Kurz; Sebastian Marschner; Michele Avanzo; Vito Gagliardi; Giuseppe Fanetti; Giovanni Franchin; Joseph Stancanello; Stefanie Corradini; Maximilian Niyazi; Claus Belka; Katia Parodi
Journal:  Sci Rep       Date:  2021-03-19       Impact factor: 4.379

8.  McTwo: a two-step feature selection algorithm based on maximal information coefficient.

Authors:  Ruiquan Ge; Manli Zhou; Youxi Luo; Qinghan Meng; Guoqin Mai; Dongli Ma; Guoqing Wang; Fengfeng Zhou
Journal:  BMC Bioinformatics       Date:  2016-03-23       Impact factor: 3.169

9.  Elevated DNA methylation across a 48-kb region spanning the HOXA gene cluster is associated with Alzheimer's disease neuropathology.

Authors:  Rebecca G Smith; Eilis Hannon; Philip L De Jager; Lori Chibnik; Simon J Lott; Daniel Condliffe; Adam R Smith; Vahram Haroutunian; Claire Troakes; Safa Al-Sarraj; David A Bennett; John Powell; Simon Lovestone; Leonard Schalkwyk; Jonathan Mill; Katie Lunnon
Journal:  Alzheimers Dement       Date:  2018-03-15       Impact factor: 21.566

10.  GARS: Genetic Algorithm for the identification of a Robust Subset of features in high-dimensional datasets.

Authors:  Mattia Chiesa; Giada Maioli; Gualtiero I Colombo; Luca Piacentini
Journal:  BMC Bioinformatics       Date:  2020-02-11       Impact factor: 3.169

