Literature DB >> 31182017

A hybrid gene selection method based on gene scoring strategy and improved particle swarm optimization.

Fei Han1,2, Di Tang3,4, Yu-Wen-Tian Sun3,4, Zhun Cheng5, Jing Jiang3,4, Qiu-Wei Li3,4.   

Abstract

BACKGROUND: Gene selection is one of the critical steps in the course of the classification of microarray data. Since particle swarm optimization has no complicated evolutionary operators and fewer parameters need to be adjusted, it has been used increasingly as an effective technique for gene selection. Since particle swarm optimization is apt to converge to local minima which lead to premature convergence, some particle swarm optimization based gene selection methods may select non-optimal genes with high probability. To select predictive genes with low redundancy as well as not filtering out key genes is still a challenge.
RESULTS: To obtain predictive genes with lower redundancy as well as overcome the deficiencies of traditional particle swarm optimization based gene selection methods, a hybrid gene selection method based on gene scoring strategy and improved particle swarm optimization is proposed in this paper. To select the genes highly related to out samples' classes, a gene scoring strategy based on randomization and extreme learning machine is proposed to filter much irrelevant genes. With the third-level gene pool established by multiple filter strategy, an improved particle swarm optimization is proposed to perform gene selection. In the improved particle swarm optimization, to decrease the likelihood of the premature of the swarm the Metropolis criterion of simulated annealing algorithm is introduced to update the particles, and the half of the swarm are reinitialized when the swarm is trapped into local minima.
CONCLUSIONS: Combining the gene scoring strategy with the improved particle swarm optimization, the new method could select functional gene subsets which are significantly sensitive to the samples' classes. With the few discriminative genes selected by the proposed method, extreme learning machine and support vector machine classifiers achieve much high prediction accuracy on several public microarray data, which in turn verifies the efficiency and effectiveness of the proposed gene selection method.

Entities:  

Keywords:  Gene scoring; Gene selection; Microarray data; Particle swarm optimization

Mesh:

Year:  2019        PMID: 31182017      PMCID: PMC6557739          DOI: 10.1186/s12859-019-2773-x

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

One of the major applications of microarray data analysis is to perform sample classification between different disease phenotypes, for diagnostic and prognostic purposes [1]. However, for small size of samples in comparison to high dimensionality, along with experimental variations in measured gene expression levels, it is difficult to implement a particular biological classification problem as well as gain deeper understanding of the functions of particular genes [1]. Gene selection is one of the critical steps in the course of the classification of microarray data [2]. Selecting a useful gene subset not only decreases the computational complexity, but also increases the classification accuracy. The methods for gene selection are broadly divided into three categories: filter, wrapper and embedded methods [3]. A filter method relies on general characteristics of the training data to select genes without involving any classifier for evaluation. Most filter methods consider each feature separately with ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection methods [4]. In addition to considering feature dependencies, wrapper methods take into account the interaction between feature subset search and model selection. However, wrapper methods have a higher risk of overfitting than filter ones and are very computationally intensive [5]. Embedded methods have the advantage that they include the interaction with the classification model, while being far less computationally intensive than wrapper methods [6]. Since it has no complicated evolutionary operators and fewer parameters need to be adjusted [7, 8], particle swarm optimization (PSO) [9, 10] has been used increasingly as an effective technique for global optimization in past decades. In recent years, PSO has been also implemented to perform gene selection. In [11], a combination of Integer-Coded GA (ICGA) and particle swarm optimization, coupled with extreme learning machine (ELM) was used to select an optimal set of genes. In [12, 13], binary PSO (BPSO) combined with filter method was applied to search optimal gene subsets. The method in [12] simplified gene selection and obtained a higher classification accuracy compared with some similar gene selection methods based on GA, while the method in [13] could determine the appropriate number of genes and obtained high classification accuracy by support vector machine. In [14], the Kmeans-PSO-ELM method used K-means method to group the initial gene pool into several clusters, and the standard PSO combined with ELM was used to perform gene selection, which could obtain a compact set of informative genes. Since traditional PSO is apt to converge to local minima which lead to premature convergence, the above PSO based gene selection method still has much room to improve. To overcome the deficiencies of the above PSO based gene selection methods and obtain predictive genes with more interpretability, two gene selection methods based on binary PSO and gene-to-class sensitivity (GCS) information were proposed in [15, 16]. In the KMeans-GCSI-MBPSO-ELM [16], GCS information combined with K-means method was used to identify relevant genes for subsequent sample classification, and a modified BPSO coupling GCS information (GCSI) combined with ELM was used to select smallest possible gene subsets. Although the KMeans-GCSI-MBPSO-ELM could obtain predictive genes with lower redundancy and better interpretability, it might filter out a few critical genes highly related to samples’ classes in some cases and thus lead into worse classification accuracy [16]. To overcome the weakness of the KMeans-GCSI-MBPSO-ELM, the BPSO-GCSI-ELM [15] method also encoded GCS information into binary PSO to perform gene selection by initializing particles, updating the particles, modifying maximum velocity, and adopting mutation operation adaptively. Although the BPSO-GCSI-ELM method could avoid filtering out some critical genes, it may increase the computational cost because of the large initial gene pool. To obtain predictive genes with lower redundancy as well as overcome the deficiencies of the above mentioned gene selection methods, a hybrid gene selection method based on gene scoring strategy and improved particle swarm optimization (PSO) is proposed in this paper. Firstly, with the initial gene pool obtained with double filter strategies, randomization method combined with ELM is proposed to score each gene, and the third-level gene pool for further gene selection is established. Secondly, an improved PSO aiming at improving the search ability of the swarm is proposed to perform gene selection. In the improved PSO, to decrease the probability of converging into local minima, the Metropolis criterion of simulated annealing (SA) algorithm is introduced to update the particles, and the half of the swarm are reinitialized when the swarm is trapped into local minima. With the compact and relevant gene pool obtained by multiple filter strategies, the improved PSO could select the optimal gene subsets with high probability. Finally, experimental results on six public microarray data verify the effectiveness and efficiency of the proposed hybrid gene selection method. The remainder of this paper is organized as follows. The related preliminaries are briefly described in “Background” section. The proposed gene selection method is introduced in “Methods” section. “Results” section gives the experimental results on six public microarray data. Finally, the concluding remarks are offered in “Conclusions” section.

Methods

Particle swarm optimization

Particle swarm optimization (PSO) is a population-based stochastic optimization technique developed by Eberhart and Kennedy [9]. PSO works by initializing a flock of birds randomly over the searching space, where each bird is called a particle with no quality or volume. Each particle flies with a certain velocity according to its momentum and the influence of its own previous best position (P) as well as the best position of all particles (P). Assume that the dimension of searching space is D and the total number of particles is n. Then the original PSO is described as follows where v(t) and x(t) denote the velocity vector and the position of the i-th particle, respectively, at the t-th iteration; P(t) and P(t) denote the previous best position of the i-th particle and the best position of all particle, respectively; c1 and c2 are the positive acceleration constants; Y1() and r2() are random number between 0 and 1. In addition, it needs to place a limit on the velocity. To improve the convergence performance of the original PSO, a modified particle swarm optimization [10] was proposed. An inertial weight was introduced in the velocity vector evolution equation described as follows: where w is the inertial weight. Shi & Eberhart [10] advised the linearly decreasing method to adjust the weight as follows: where t is the current iteration number; w,w and T are the initial inertial weight, the final inertial weight and the maximum number of iteration, respectively.

Extreme learning machine

In [17], a learning algorithm for single-hidden layer feedforward neural networks (SLFN) called extreme learning machine (ELM) was proposed to solve the problem caused by gradient-based learning algorithms. ELM randomly chooses the input weights and hidden biases, and analytically determines the output weights of SLFN. ELM has much better generalization performance with much faster learning speed than gradient-based algorithms [18, 19]. For N arbitrary distinct samples (XX,T)(i=1,2,…,N.), where XX=[xx,xx,…,xx]∈R, T=[ t, t, …, t] ∈ R. A SLFN with N hidden neurons and activation function g() can approximate these N samples with zero error. This means that where The wh=[wh,wh,...,wh] is the input weight vector connecting the i-th hidden neuron and the input neurons, the wo=[wo,wo,...,wo] is the output weight vector connecting the i-th hidden neuron and the output neurons, and the b is the bias of the i-th hidden neuron. In the course of learning, first, the input weights and the hidden biases are arbitrarily chosen and need not be adjusted at all. Second, the smallest norm least-squares solution of the Eq. 5 is obtained as follows: where H+ is the Moore-Penrose (MP) generalized inverse of matrix H. It was concluded that the ELM has the minimum train-ing error and smallest norm of weights [18, 19]. The smallest norm of weights tends to have the best generalization performance [18, 19]. Since the solution is obtained by an analytical method and all the parameters of SLFN need not be adjusted, ELM converges much faster than gradient-based algorithm.

The proposed gene selection method

Gene selection generally consists of two steps, which are to identify relevant genes and to tend to select smallest subsets from the relevant genes. Different from the KMeans-GCSI-MBPSO-ELM [16] and BPSO-GCSI-ELM [15] methods, a scoring criterion following the double filter strategy is proposed to select highly relevant genes in this paper, which may decrease the size of the gene pool dramatically. For selecting compact gene subset from the refined gene pool, an improved PSO with the new strategies for reinitializing the swarm and updating of the P is proposed. Since the proposed method combines the scoring criterion with the improved PSO, coupled with ELM, to perform gene selection, it is referred to as the SC-IPSO-ELM method. The rough frame of the proposed method is shown in Fig. 1, and the detailed steps are described as follows.
Fig. 1

The frame of the proposed hybrid gene selection method

The frame of the proposed hybrid gene selection method Step 1: Form a first-level initial gene pool. The dataset is divided into training and testing datasets. Select 200–400 genes from all original genes by using information index to classification (IIC) method [16, 20] as follows: where and are the means of expression value of the gene g in the j-th and k-th classes, respectively, and and are the standard deviations of expression value of gene g in the j-th and k-th classes, respectively. c is the total number of classes. From [16, 20], the higher the value of d(g), the more classification information the gene g contains, so the gene g is more relevant to samples categories. The high classification accuracy will be obtained with high probability by a classifier if the microarray data is projected onto the gene g whose IIC value, d(g), is high. The genes are ranked by their IIC values on the training dataset, and those genes with higher IIC values are chosen to establish the first-level gene pool. Step 2: Establish a second-level initial gene pool. Randomly generate different gene subsets from the first-level gene pool. Then, each gene subset’s predictive ability is evaluated according to the 5-fold cross validation (CV) classification accuracy obtained by ELM on the training dataset projected onto the gene subset. When the 5-fold cross validation classification accuracy is less than the predetermined value (θ), the corresponding gene subset is deleted. Thus, the genes in the remained gene subsets have comparatively high predictive ability and form the second-level initial gene pool. The number of the gene subsets in the second-level gene pool is noted as l. Each gene subset is ranked as integer number (from 1 to l) according to the corresponding 5-fold cross validation classification accuracy. The higher the classification accuracy is, the smaller the rank number of the corresponding gene subset is. Step 3: Establish a third-level initial gene pool by scoring strategy. The psedo-code of the scoring rule for the i-th gene in the second-level gene pool is listed as Algorithm 1. where R is the ranked number of the j-th gene sub-set in the second-level gene pool. After obtaining the scores of all genes in the second-level gene pool, they are normalized into the interval of [0, 1] with linear transformation. Obviously, the higher value of the gene score is, the more relevant to the samples classes of the gene is. Further filter out those genes with much lower score values, and the remaining genes in the second-level pool form the third-level gene pool. Step 4: Use an improved PSO to select the optimal gene subsets from the third-level initial gene pool. The i-th particle X=(x,x,…,x) represents a candidate gene subset, and the element x is the serial number of the selected gene. The dimension of the particles is equal to the number of the selected genes from the third-level initial gene pool, which is predetermined according to [15, 16]. The fitness function of the i-th particle, f(X), is the 5-fold cross validation classification accuracy obtained by ELM on the training dataset projected onto the selected gene subset represented by the i-th particle. The optimization process of the improved PSO is the same as the traditional PSO except the following respects. One is the strategy of updating the best position of the swarm. To decrease the probability of premature convergence of the swarm, the Metropolis criterion in SA [21] is introduced to update the best position of the swarm. In the (i+1)-th optimization generation, the best position of the swarm, p, is updated by the Eq. 8 as follows: where T(i+1) is the annealing temperature which decreases linearly as the following equation In Eq. 7, T0, T, and It are the initial annealing temperature, final annealing temperature and maximum optimization generation number. The other is the strategy of mutating the swarm. When the swarm converges to the local minima, the particles in the swarm are close to each other, and the swarm loses its diversity. Mutating the swarm makes the particles repel each other and improves the diversity of the swarm, so the swarm jumps the local minima with high probability. In the improved PSO, the mutation operation is taken if the global best fitness value of the swarm does not change for predetermined generations (Nmu), which randomly select half number of particles in the swarm to reinitialize. The SC-IPSO-ELM method firstly identifies the relevant genes by the randomization method combined with ELM. Then, with the proposed gene scoring criterion, the much more relevant and compact gene pool is obtained. Finally, to obtain the optimal gene subsets, the tradition PSO is modified to improve its global search ability. Although the SC-IPSO-ELM method does not encode prior information to perform gene selection as the KMeans-GCSI-MBPSO-ELM [16] and BPSO-GCSI-ELM [15] methods, it could also select the most predictive genes with low redundancy effectively. Moreover, the multiple filter strategies produce much more compact gene pool than the methods in [15, 16], which could decrease the computational cost of PSO searching the optimal gene subsets. Compared to the gene-to-class sensitivity information, genes’ rank information obtained by the scoring strategy is more robust, so the SC-IPSO-ELM method may not filter out predictive genes with higher probability than the methods in [15, 16]. The proposed gene selection method contains filtering irrelevant genes to establish the gene pool and using PSO to select functional gene subsets from the gene pool, and its computational complexity can be calculated as follows: where N,N,l,N,l,N,N and Iter are the number of the original total genes, the number of training data, the number of the initial randomly generated gene subsets in Step 2, the size the first-level gene pool, the number of the selected gene subsets in Step 2, the size of the second-level gene pool, the swarm size and the maximum iteration number in the improved PSO, respectively. The four items on the right side of Eq. 10 are the computational complexity of Step 1, Step 2, Step 3 and Step 4 of the proposed method, respectively. The first and fourth terms are as the same as those of the methods in [15, 16]. The N and N both are much smaller than N. Generally, the l and l are not greater than N. The computational complexity of the SC-IPSO-ELM method can be approximated as the sum of the first and fourth terms on the right side of Eq. 10 which is similar to the methods in [15, 16], so the time complexity of the proposed method is at the same order of magnitude of that of the methods in [15, 16]. Since the third-level gene pool is established by multiple filter strategy, the size of the third-level gene pool is small. The small third-level gene pool leads to small N and Iter, which may decrease the computational cost of Step 4.

Results

Datasets

To verify the effectiveness and efficiency of the proposed gene selection method, we conduct experiments on the six public microarray datasets including Leukemia, Colon, SRBCT, Brain cancer data,Lung and Lymphoma data. The detailed description of the datasets is listed in Table 1.
Table 1

Six microarray datasets

DataTotal SamplesTraining samplesTesting samplesNumber of classesNumber of genes
Leukemia72383427129
Brain Cancer60303027129
Colon62402222000
SRBCT83632042308
LUNG20310310053312
Lymphoma58292927129
Six microarray datasets The Leukemia data [22] contains total 72 samples in two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), which contain 47 and 25 samples, respectively. Every sample contains 7129 gene expression values. The Leukemia data are available at https://link.springer.com/article/10.1186/1471-2105-7-228#SupplementaryMaterial. The Brain cancer data contains 60 samples in two classes, 46 patients with classic and 14 patients with desmoplastic brain cancer. The Lymphoma data includes 58 samples where 32 patients did cured and 26 patients did not cured. Each sample in the Brain cancer and Lymphoma has 7129 genes. These two data are available at http://linus.nci.nih.gov/~brb/DataArchive_New.html. The Colon data consists of expression levels of 62 samples of which 40 samples are colon cancer samples and the remaining are normal samples. Although original expression levels for 6000 genes are measured, 4000 genes out of all the 6000 genes were removed considering the reliability of measured values in the measured expression levels. The measured expression values of 2000 genes are publicly available at http://microarray.princeton.edu/oncology/. The entire SRBCT data [23] includes the expression data of 2308 genes. There are totally 63 training samples and 25 testing samples, five of the testing samples being not SRBCT. The 63 training samples contain 23 Ewing family of tumors (EWS), 20 rhabdomyosarcoma (RMS), 12 neuroblastoma (NB), and 8 Burkitt lymphomas (BL). The 20 testing samples contain 6 EWS, 5 RMS, 6 NB, and 3 BL. The data are available at https://link.springer.com/article/10.1186/1471-2105-7-228#SupplementaryMaterial. The LUNG data [24, 25] contains in total 203 samples in five classes, adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, which have 139, 21, 20, 6,17 samples, respectively. Each sample has 12600 genes. The genes with standard deviations smaller than 50 expression units were removed and a dataset with 203 samples and 3312 genes was obtained [24, 25]. The data is also available at https://link.springer.com/article/10.1186/1471-2105-7-228#SupplementaryMaterial. In the experiments on all data, the swarm size is 60, the maximum iteration number is selected as 20, the acceleration constants c1 and c2 are both selected as 1.49445, and the inertial weight varies from 0.9 to 0.4. The size of the third-level gene pool is 40 on all data. The parameter N is fixed as 3 on all data. The values of these parameters are determined by the cross-validation runs on the training datasets and according to [15, 16].

The prediction ability of the selected gene subsets

To verify the prediction ability of the selected gene subsets obtained by the proposed method, ELM is used to perform sample classification with some gene subsets selected by the SC-IPSO-ELM method on the six datasets. Each experiment is conducted 100 times, and the mean classification accuracies are listed in Table 2.
Table 2

The classification accuracy obtained by elm with different gene subsets selected by the sc-ipso-elm method on the six microarray data

DataSelected gene subsets5-fold CV Accuracy Mean(%) ±stdTest Accuracy Mean(%) ±std
Leukemia4050,2642,2121100 ±0.00100 ±0.00
4050,2642,1882100 ±0.00100 ±0.00
4050,2642,3258100 ±0.00100 ±0.00
42335,2642,1843,4050100 ±0.00100 ±0.00
Brain cancer1091,798,33790.14 ±0.03689.62 ±0.025
3052,973,3041,3692,479692.00 ±0.02391.78 ±0.046
4628,7129,7045,4413,79892.29 ±0.02090.22 ±0.022
7129,2881,3052,865,1970,2935,487192.78 ±0.01291.88 ±0.019
Colon14,1976,1325,1993,1870,1892,653,1917,187,22,1209,106093.63 ±0.02597.27 ±0.013
377,792,14,1976,765,187,251,1110,175,53,1293,1740,20093.00 ±0.03598.06 ±0.013
792,1423,14,1976,1909,1110,1589,102,107,1916,175,115193.73 ±0.03198.71 ±0.013
792,14,1976,765,1909,1524,1110,175,43,53,1293,1740,25196.86 ±0.03399.05 ±0.011
SRBCT742,1003,1954,430,2050,123100 ±0.00100 ±0.00
545,1955,1434,509,971,255100 ±0.00100 ±0.00
1003,545,1911,153,123,1489,2161100 ±0.00100 ±0.00
1955,2050,545,2144,2045,123,1489100 ±0.00100 ±0.00
LUNG1765,2779,2841,1474,2045,3191,2763,2817,525,163098.27 ±0.01493.33 ±0.011
525,1493,607,2763,792,580,867,368,3279,2158,122598.39 ±0.02393.47 ±0.012
1765,883,2763,792,580,867,985,3279,2988,2045,81498.67 ±0.02193.60 ±0.019
1765,525,2763,2841,1474,2583,867,985,2045,814,91898.67 ±0.01994.01 ±0.024
Lymphoma152,2347,2650,5679,438,1855,586390.60 ±0.02385.11 ±0.020
1855,2828,152,2437,806,530,110292.36 ±0.02789.33 ±0.019
5279,4687,4940,5449,1133,1855,451993.51 ±0.02290.47 ±0.029
152,2437,4829,2828,6441,806,250893.79 ±0.02090.45 ±0.023
The classification accuracy obtained by elm with different gene subsets selected by the sc-ipso-elm method on the six microarray data From Table 2, with the small gene subsets selected by the proposed approach, ELM obtains 100% 5-fold cross validation and test accuracies both on the Leukemia and SRBCT data, With the about five and thirteen genes selected by the SC-IPSO-ELM method on the Brain cancer and Colon, respectively, ELM obtains high prediction accuracies. These results indicate that the SC-IPSO-ELM method has the ability of selecting those predictive genes highly related to samples’ classes.

Biological and functional analysis of the selected gene subsets

The experiment on each microarray data is conducted 500 times, and the top ten frequently selected genes are listed in Tables 3, 4, 5, 6, 7 and 8 for the six microarray data.
Table 3

The top ten frequently selected genes with the sc-ipso-elm method on the leukemia data

Gene No.Gene NameDescription
2354M92287CCND3 Cyclin D3 ∗∘
6855M31523CF3 Transcription factor 3 (E2A immunoglobulin enhancer bind-ing factors E12/E47)
2642U05259MB-1 gene ∗∘⋆∙
4050X03934GB DEF = T-cell antigen receptor gene T3-delta ∗⋆
1834M23197CD33 CD33 antigen (differenti-ation antigen) ∗∘
1882M27891CST3 Cystatin C (amyloid an-giopathy and cerebral hemor-rhage) ∗∘⋆∙
4377X62654ME491 gene extracted from H.sapiens gene for Me491/CD63 antigen
2121M63138CTSD Cathepsin D (lysosomal aspartyl protease) ∗∘
2288M84526DF D component of comple-ment (adipsin)
6271M33493Tryptase-III mRNA, 3’ end

*also selected in [15]; ∘also selected in [26]; ⊲also selected in [22]; ⋆also selected in [16]; ∙also selected in [27]

Table 4

The top ten frequently selected genes with the sc-ipso-elm method on the brain cancer data

Gene No.Gene NameDescription
798D86961Lipoma HMGIC fusion partner-like 2
865D87454KIAA0265 protein
2648M28879Granzyme B (granzyme 2, cytotoxic T-lymphocyte-associated serine esterase 1)
2881M57506Chemokine (C-C motif) ligand 1
3041M64934Kell blood group ∗
3052M65254Protein phosphatase 2 (formerly 2A), regulatory subunit A (PR 65), beta isoform
3692U03644CBF1 interacting corepressor
4628U50079Histone deacetylase 1
6571X93036FXYD domain containing ion transport regulator 3
7129Z97074Rab9 effector protein with kelch motifs

*also selected in [15]

Table 5

The top ten frequently selected genes with the sc-ipso-elm method on the colon data

Gene No.Gene NameDescription
14H20709MYOSIN LIGHT CHAIN ALKALI, SMOOTH-MUSCLE ISOFORM (HU-MAN) ∗∘
1772H08393COLLAGEN ALPHA 2(XI) CHAIN (Homo sapiens)
1935X62048H.sapiens Wee1 hu gene
286H64489LEUKOCYTE ANTIGEN CD37 (Homo sapiens)
792R88740ATP SYNTHASE COUPLING FACTOR 6MITOCHONDRIAL PRE-CURSOR (HUMAN) ∘⋆
187T51023HEAT SHOCK PROTEIN HSP 90-BETA (HUMAN)
1976K03474Human Mullerian inhibiting substance gene, complete cds
493R87126MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gal-lus)
1635M36634Human vasoactive intestinal peptide (VIP) mRNA, com-plete cds
698T51261GLIA DERIVED NEXIN PRECURSOR (Mus muscu-lus)

*also selected in [28]; ∘also selected in [29]; ⊲also selected in [15]; ⋆also selected in [16]

Table 6

The top ten frequently selected genes with the sc-ipso-elm method on the srbct data

Gene No.Gene NameDescription
742812105Transmembrane protein ∗∘
1003796258Sarcoglycan, alpha (50kD dystrophin-associated glycoprotein) ∗⋆∙
255325182Cadherin 2, N-cadherin (neuronal) ∘
123236282Wiskott-Aldrich syndrome (ecezema-thrombocytopenia
5451435862Antigen identified by monoclonal antibodies 12E7, F21 and O13 ∗⋆
1319866702Protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase)
1606624360Proteasome (prosome, macropain) subunit, beta type, 8 (large multifunctional protease 7)
2046244618ESTs
246377461Caveolin 1, caveolae protein, 22kD
509207274Human DNA for insulin-like growth factor II (IGF-2); exon 7 and additional ORF

*also selected in [23]; ∘also selected in [30]; ⊲also selected in [15]; ⋆also selected in [16]; ∙also selected in [31]

Table 7

The top ten frequently selected genes with the sc-ipso-elm method on the lung data

Gene No.Gene NameDescription
2763185_atNeuro-oncological ventral antigen 1
58039333_atCollagen, type IV, alpha 1 ∘
79238704_atCadherin 2, N-cadherin (neuronal) ∗∘
284132696_atPre-B-cell leukemia transcription factor 3
204535276_atClaudin 4
265732648_atDelta-like homolog (Drosophila)
176539722_atNuclear receptor co-repressor 1 ∗∘
149338967_atChromosome 14 open reading frame 2
319139383_atAdenylate cyclase 6
23381315_atOrnithine decarboxylase antizyme 1

*also selected in [16]; ∘also selected in [15]

Table 8

The top ten frequently selected genes with the sc-ipso-elm method on the lymphoma data

Gene No.Gene NameDescription
152M97935_5_atSignal transducer and activator of transcription 1, 91kDa
1855L17328_atFasciculation and elongation protein zeta 2 (zygin II)
2437M18185_atGastric inhibitory polypeptide
2347M14091_atSerine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 7
2828M37763_atNeurotrophin 3
5279U83843_atChaperonin containing TCP1, subunit 7 (eta)
806D86968_atMitogen-activated protein kinase kinase kinase 4 ∗
4092U22178_s_atMicroseminoprotein, beta-
4940U66559_atAnaplastic lymphoma kinase (Ki-1)
4194U28150_atATP-binding cassette, sub-family D (ALD), member 2

*also selected in [16]

The top ten frequently selected genes with the sc-ipso-elm method on the leukemia data *also selected in [15]; ∘also selected in [26]; ⊲also selected in [22]; ⋆also selected in [16]; ∙also selected in [27] The top ten frequently selected genes with the sc-ipso-elm method on the brain cancer data *also selected in [15] The top ten frequently selected genes with the sc-ipso-elm method on the colon data *also selected in [28]; ∘also selected in [29]; ⊲also selected in [15]; ⋆also selected in [16] The top ten frequently selected genes with the sc-ipso-elm method on the srbct data *also selected in [23]; ∘also selected in [30]; ⊲also selected in [15]; ⋆also selected in [16]; ∙also selected in [31] The top ten frequently selected genes with the sc-ipso-elm method on the lung data *also selected in [16]; ∘also selected in [15] The top ten frequently selected genes with the sc-ipso-elm method on the lymphoma data *also selected in [16] From Tables 3, 4, 5, 6, 7 and 8, many genes selected by the SC-IPSO-ELM method were also selected by one or more methods proposed in [15, 16, 22, 23, 26–31]. On the Leukemia data, gene U05259, a B lymphocyte antigen receptor, encodes cell surface proteins for which monoclonal antibodies have been demonstrated to be useful in distinguishing lymphoid from myeloid lineage cells [18]. Gene M63138 is the member of the peptidase C1 family involved in the pathogenesis of breast cancer and possibly Alzheimer’s disease [18]. A muscle index can be calculated based on an average intensity of 17 ESTs in the array that are homologous to smooth muscle genes which included gene H20709 in the Colon data. Although the SC-IPSO-ELM method does not encode gene-to-class sensitivity (GCS) information extracted from the microarray data, it could also select some genes with comparatively high GCS values selected by the GCSI-based methods. Since the expression levels of all genes in the Brain cancer and Lymphoma data are not distinct in two classes, the different approaches considering different factors may select different discriminative gene subsets. Thus, the genes selected by the SC-IPSO-ELM are surely different from ones selected by other gene selected methods, which is verified by Tables 4 and 8. Figure 2 shows the heatmap with top ten frequently selected genes for the six data. It can be found that most of frequently selected genes’ expression levels clearly differentiate between/among two/multi classes on all data but the Brain cancer and Lymphoma data. From Fig. 2b and e, there has no single gene whose expression levels are distinct between two classes, which was verified in [15, 16]. Hence, the proposed method is capable of selecting predictive genes whose expression levels are distinct among different classes in most cases.
Fig. 2

The heatmap of expression levels based on the top ten frequently selected genes on the six data

The heatmap of expression levels based on the top ten frequently selected genes on the six data

Comparison with the GCSI based gene selection methods

In [15, 16], two effective gene selection methods by considering GCS information we proposed. Experimental results on several public microarray data verified that the two methods, the KMeans-GCSI-MBPSO-ELM and BPSO-GCSI-ELM methods, outperformed than some PSO-based methods and other classical gene selection methods such as GS2, GS1, Cho’s and F-test. To avoid repetition of the comparison with the PSO-based and other classical gene selection methods, the SC-IPSO-ELM method is compared with only the KMeans-GCSI-MBPSO-ELM and BPSO-GCSI-ELM methods on the six data by using ELM and support vector machine (SVM), and the corresponding results of the average of 100 trials are listed in Tables 9 and 10.
Table 9

The 5-fold cv classification accuracies of elm based on the three gene selection methods on the six microarray data

DataKMeans-GCSI-MBPSO-ELMBPSO-GCSI-ELMSC-IPSO-ELM
5-fold CV Accuracy(%) ± stdgenes5-fold CV Accuracy(%) ± stdgenes5-fold CV Accuracy(%) ± stdgenes
Leukemia100.00 ±0.003100.00 ±0.003100.00 ±0.003
Brain cancer88.63 ±0.0216689.88 ±0.0223791.88 ±0.0197
Colon97.61 ±0.0137697.82 ±0.0132999.05 ±0.01113
SRBCT100.00 ±0.006100.00 ±0.006100.00 ±0.006
LUNG97.10 ±0.0631196.28 ±0.0721298.67 ±0.01911
Lymphoma86.97 ±0.024884.50 ±0.023893.79 ±0.0207
Table 10

The classification accuracies of svm based on the three gene selection methods on the six microarray data

DataKMeans-GCSI-MBPSO-ELMBPSO-GCSI-ELMSC-IPSO-ELM
5-fold CV Accuracy(%) ± stdgenes5-fold CV Accuracy(%) ± stdgenes5-fold CV Accuracy(%) ± stdgenes
Leukemia99.99 ±0.0014399.99 ±0.0014399.99 ±0.00143
Brain cancer84.05 ±0.0301682.70 ±0.0319786.55 ±0.02997
Colon90.69 ±0.0226692.02 ±0.0275993.35 ±0.031013
SRBCT99.24 ±0.0119698.34 ±0.0100699.39 ±0.00746
LUNG94.63 ±0.0541196.65 ±0.0581195.38 ±0.04711
Lymphoma77.59 ±0.032872.41 ±0.034881.03 ±0.0257
The 5-fold cv classification accuracies of elm based on the three gene selection methods on the six microarray data The classification accuracies of svm based on the three gene selection methods on the six microarray data From Tables 9 and 10, the SC-IPSO-ELM method selects the almost same number of genes as the two GCSI based methods on the Leukemia, Brain cancer, SRBCT, LUNG and Lymphoma data, while it selects the most number of genes on the Colon data among three methods. ELM achieves 100% 5-fold CV accuracy on the Leukemia and SRBCT data with the genes selected by the three methods, and SVM achieves the same 5-fold CV accuracy on the Leukemia data with the genes selected by the three methods. ELM and SVM both obtain the highest 5-fold CV accuracy on the Brain cancer, Colon data and Lymphoma data with the genes selected by the SC-IPSO-ELM method, SVM obtains the slightly higher 5-fold CV accuracy on the SRBCT data with the SC-IPSO-ELM than that with the two GCSI based methods, and SVM obtains the highest 5-fold CV accuracy on the LUNG data with the BPSO-GCSI-ELM. On the whole, the SC-IPSO-ELM could select more predictive gene subsets than the two GCSI based methods.

Discussion on the parameter selection

To establish second-level gene pool, it is critical to determine the value of the parameter, θ. Figure 3 shows the relationship between the classification accuracy on the training data obtained by ELM and the parameter, θ. On the Leukemia, Colon data, LUNG and Lymphoma data, the 5-fold CV and test accuracy both have an upward trend as the values of the parameter, increases, while they have a downward trend as the values of the parameter increases on the Brain cancer data. On the SRBCT data, the test accuracy decreases as the value of the parameter increases, while the 5-fold CV accuracy increases as the value of the parameter increases.
Fig. 3

The parameter, θac versus the classification accuracy on the training dataset obtained by ELM

The parameter, θac versus the classification accuracy on the training dataset obtained by ELM For using the improved PSO to select the gene subset, the dimension of the particle is the number of the selected genes. Figure 4 shows the effect on different number of the selected genes. The 5-fold CV accuracy obtained by ELM has an upward trend as the number of the selected genes increases on the six data but the Colon data, while the curves of the test accuracy obtained by ELM fluctuate as the number of the selected genes increases on the six data.
Fig. 4

The number of the selected genes versus the classification accuracy on the training dataset obtained by ELM

The number of the selected genes versus the classification accuracy on the training dataset obtained by ELM Figures 3 and 4 provide a guide on how to select the values of the parameters θ and the number of the selected genes in the SC-IPSO-ELM. In general, these parameters should be selected empirically in particular applications.

Conclusions

To obtain predictive genes with lower redundancy, a hybrid gene selection method based on gene scoring strategy and improved PSO was proposed in this paper. To decrease the computational cost, the relevant genes are filtered out through different strategies to establish more compact gene pool for further gene selection. Then, the improved PSO was proposed to select the most predictive gene subsets from the gene pool. Experimental results verified the proposed method could select highly predictive and compact gene subsets and outperformed than other PSO-based and GCSI-based gene selection methods. However, the proposed method selects genes lack of much interpretability. Future work will include how to encode some prior information into the proposed method for gene selection and apply it to RNA-Seq data analysis.
  16 in total

1.  Gene selection: a Bayesian variable selection approach.

Authors:  Kyeong Eun Lee; Naijun Sha; Edward R Dougherty; Marina Vannucci; Bani K Mallick
Journal:  Bioinformatics       Date:  2003-01       Impact factor: 6.937

2.  Applications of support vector machines to cancer classification with microarray data.

Authors:  Feng Chu; Lipo Wang
Journal:  Int J Neural Syst       Date:  2005-12       Impact factor: 5.866

3.  Gene extraction for cancer diagnosis by support vector machines--an improvement.

Authors:  Te Ming Huang; Vojislav Kecman
Journal:  Artif Intell Med       Date:  2005 Sep-Oct       Impact factor: 5.326

Review 4.  A review of feature selection techniques in bioinformatics.

Authors:  Yvan Saeys; Iñaki Inza; Pedro Larrañaga
Journal:  Bioinformatics       Date:  2007-08-24       Impact factor: 6.937

5.  A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification.

Authors:  Qi Shen; Wei-Min Shi; Wei Kong; Bao-Xian Ye
Journal:  Talanta       Date:  2006-09-01       Impact factor: 6.057

6.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.

Authors:  U Alon; N Barkai; D A Notterman; K Gish; S Ybarra; D Mack; A J Levine
Journal:  Proc Natl Acad Sci U S A       Date:  1999-06-08       Impact factor: 11.205

7.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

Authors:  T R Golub; D K Slonim; P Tamayo; C Huard; M Gaasenbeek; J P Mesirov; H Coller; M L Loh; J R Downing; M A Caligiuri; C D Bloomfield; E S Lander
Journal:  Science       Date:  1999-10-15       Impact factor: 47.728

8.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks.

Authors:  J Khan; J S Wei; M Ringnér; L H Saal; M Ladanyi; F Westermann; F Berthold; M Schwab; C R Antonescu; C Peterson; P S Meltzer
Journal:  Nat Med       Date:  2001-06       Impact factor: 53.440

9.  Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses.

Authors:  A Bhattacharjee; W G Richards; J Staunton; C Li; S Monti; P Vasa; C Ladd; J Beheshti; R Bueno; M Gillette; M Loda; G Weber; E J Mark; E S Lander; W Wong; B E Johnson; T R Golub; D J Sugarbaker; M Meyerson
Journal:  Proc Natl Acad Sci U S A       Date:  2001-11-13       Impact factor: 11.205

10.  A stable gene selection in microarray data analysis.

Authors:  Kun Yang; Zhipeng Cai; Jianzhong Li; Guohui Lin
Journal:  BMC Bioinformatics       Date:  2006-04-27       Impact factor: 3.169

View more
  3 in total

1.  A Novel Biomarker Identification Approach for Gastric Cancer Using Gene Expression and DNA Methylation Dataset.

Authors:  Ge Zhang; Zijing Xue; Chaokun Yan; Jianlin Wang; Huimin Luo
Journal:  Front Genet       Date:  2021-03-25       Impact factor: 4.599

2.  A graph-based gene selection method for medical diagnosis problems using a many-objective PSO algorithm.

Authors:  Saeid Azadifar; Ali Ahmadi
Journal:  BMC Med Inform Decis Mak       Date:  2021-11-27       Impact factor: 2.796

3.  An efficient gene selection method for microarray data based on LASSO and BPSO.

Authors:  Ying Xiong; Qing-Hua Ling; Fei Han; Qing-Hua Liu
Journal:  BMC Bioinformatics       Date:  2019-12-30       Impact factor: 3.169

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.