Literature DB >> 29535919

Improving Classification of Cancer and Mining Biomarkers from Gene Expression Profiles Using Hybrid Optimization Algorithms and Fuzzy Support Vector Machine.

Niloofar Yousefi Moteghaed1, Keivan Maghooli1, Masoud Garshasbi2.   

Abstract

BACKGROUND: Gene expression data are characteristically high dimensional, with a sample size that is small relative to the number of features, and exhibit the variability inherent in biological processes; both properties contribute to difficulties in analysis. Selecting highly discriminative features decreases the computational cost and complexity of the classifier and improves its reliability for predicting the class of new samples.
METHODS: The present study used hybrid particle swarm optimization and genetic algorithms for gene selection and a fuzzy support vector machine (SVM) as the classifier. Fuzzy logic is used to infer the importance of each sample in the training phase and decrease the outlier sensitivity of the system to increase the ability to generalize the classifier. A decision-tree algorithm was applied to the most frequent genes to develop a set of rules for each type of cancer. This improved the abilities of the algorithm by finding the best parameters for the classifier during the training phase without the need for trial-and-error by the user. The proposed approach was tested on four benchmark gene expression profiles.
RESULTS: Good results have been demonstrated for the proposed algorithm. The classification accuracy for leukemia data is 100%, for colon cancer is 96.67% and for breast cancer is 98%. The results show that the best kernel used in training the SVM classifier is the radial basis function.
CONCLUSIONS: The experimental results show that the proposed algorithm can decrease the dimensionality of the dataset, determine the most informative gene subset, and improve classification accuracy using the optimal parameters of the classifier without user intervention.


Keywords:  Cancer classification; fuzzy support vector machine; gene expression; genetic algorithm; particle swarm optimization algorithm

Year:  2018        PMID: 29535919      PMCID: PMC5840891     

Source DB:  PubMed          Journal:  J Med Signals Sens        ISSN: 2228-7477


Introduction

DNA microarray technology allows monitoring of thousands of genes simultaneously in a single experiment. Using this technology to monitor changes in gene expression levels among samples can help physicians efficiently and accurately diagnose disease, classify tumors and cancer types, and propose effective treatment procedures. Gene expression is a dynamic process that provides valuable knowledge about biological networks and cellular states. The expression level of each gene indicates the activation and transcription of that gene in a given cell state. The gene expression pattern of a cell or tissue determines the structure and function of that cell or tissue. On a microarray chip, the number of genes exceeds a thousand, in contrast to the small number of samples. Thus, the curse of dimensionality and the noisy, stochastic nature of these data are major problems in microarray data analysis and lead to many data mining and machine learning challenges.[1,2,3,4] Determining a small subset of relevant genes in a given dataset, as a solution to the high-dimensionality problem, can improve classification accuracy.[3,4] Furthermore, the problem of stability can be tackled using other biological databases and bioinformatics tools such as protein–protein interaction and pathway databases.[4,5] Several methods have been proposed for informative gene selection and classification. The Taguchi-genetic algorithm (GA) and Taguchi-particle swarm optimization (PSO) use correlation-based feature selection and are hybrid methods in which k-NN serves as the classifier,[6,7,8] and Shen et al.[9] used a modified PSO and a support vector machine (SVM). Li et al.[10] and Hernandez et al.[11] developed hybrid GA and SVM models. Tong and Schierz developed a hybrid of a GA and a neural network classifier,[12] and Li et al. and Yang et al.[13,14] applied k-NN to microarray data. Chuang et al.[15] proposed an improved PSO and used the k-NN method for tumor classification. 
Shen et al.[16] developed a hybrid PSO and Tabu search with LDA classification for cancer classification. Martinez et al.[17] proposed an algorithm based on swarm-intelligence feature selection. Lee and Leu[18] used a GA with dynamic parameter settings, a Chi-square test for homogeneity, and an SVM for cancer classification. Alba et al.[19] combined a PSO and a GA, each individually with an SVM, to find small subsets of informative genes. Zhenyu et al.[20] proposed a multiple-kernel SVM-based data mining and knowledge discovery system. Wang and Simon[21] used single genes to create classification models such as k-NN, SVM, and random forest models. Shah and Kusiak[22] developed an integrated algorithm involving a GA and correlation-based heuristics for data preprocessing, with a decision tree and SVM to make predictions. Chuang et al.[23] and Mao et al.[24] applied fuzzy SVMs to gene expression profiles to classify multiple cancer types. Ng and Chan[25] combined an information-theoretic approach with sequential forward floating searches and a decision tree. Yeh et al.[26] applied a GA and decision tree to build a model of selected genes. In [27], hybrid PSO and GA algorithms were used as a feature selection method, and in [28], a novel weighted SVM based on PSO was applied to gene expression data for gene selection and tumor classification. Chu and Wang[29] used principal component analysis, a class separability measure, Fisher's ratio, and the t-test for gene selection, with a voting scheme for multigroup classification using a binary SVM. The present study used a hybrid GA and PSO algorithm as the feature selection method. The fitness of each gene subset was determined using a fuzzy support vector machine (FSVM) classifier. The use of fuzzy logic in the SVM training phase decreased the effect of redundant noisy data by determining the importance of each sample in the training stage. 
The t-test method was initially used to preprocess the original gene expression data, and the proposed hybrid method was then applied to select the most important subsets of genes using 10-fold cross-validation. The 10-fold cross-validation accuracy of each gene subset was the evaluation criterion. One purpose of this study was to increase classification accuracy by selecting the best parameters for the classifier using the proposed hybrid PSO/GA/FSVM algorithm, without the need for user trial and error. A suitable combination of optimization algorithms for feature selection, together with proper model selection for the classifier, improves classification results and allows accurate prediction of blind test samples.

Materials and Methods

The proposed method was evaluated using four public microarray datasets. There are several types of blood cancer, and it is important to distinguish between them. The first dataset comprised 72 samples of acute lymphoblastic leukemia (ALL) and mixed lineage leukemia (MLL) cancer types with 12582 genes, from Armstrong et al.[30] The second dataset comprised 72 samples of ALL and acute myeloid leukemia (AML) cancer types with 7129 genes, from Golub et al.[31] The third dataset, generated by Alon et al.,[32] contains the expression of 2000 genes in 62 samples of normal and colon tumor tissues. The last dataset comprised 49 samples with 7129 genes, from West et al.[33] Table 1 provides the details of the datasets.
Table 1

Datasets used for testing the efficiency of the proposed method


Genetic algorithm and particle swarm optimization

A GA is a computational optimization method that searches the solution space using different groups of feature subsets to find the best answer. The initial population is generated randomly, and then all chromosomes are evaluated using a fitness function. The GA operators are selection, crossover, and mutation. The crossover operator creates a new population by combining two chromosomes chosen by the selection operator; crossover recombines genetic material and helps maintain variation in the population. Mutation is another operator that creates a variety of solutions. The process continues until the last generation, in which the best fitness is attained. PSO, like the GA, is an algorithm inspired by the social behavior of birds in a flock.[34] This algorithm was developed by Eberhart and Kennedy.[35] In PSO, each particle moves through the search space at a velocity that is adjusted using its own memory and its neighbors' experiences. The fitness values are obtained using a fitness function.
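As a sketch of the GA operators described above, the following Python fragment shows one-point crossover and bit-flip mutation on binary chromosomes. It is illustrative only (the paper's implementation was in MATLAB); the function names and the mutation rate are ours.

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Combine two binary chromosomes at a random cut point."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def bit_flip_mutation(chromosome, rate=0.01):
    """Flip each bit independently with a small probability."""
    return [1 - g if random.random() < rate else g for g in chromosome]

# One generation step on two toy chromosomes:
child1, child2 = one_point_crossover([1, 1, 1, 1], [0, 0, 0, 0])
child1 = bit_flip_mutation(child1)
```

Every crossover of an all-ones and an all-zeros parent conserves the total number of set bits, which is a quick sanity check on the operator.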

Support vector machine

In machine learning and data mining tasks, SVMs are supervised learning algorithms used for classification and regression problems. The current standard incarnation was proposed by Cortes and Vapnik.[36] The SVM is specifically designed for two-class problems. Let the training set be {(x_i, y_i)}, i = 1, …, n, where the x_i are the training samples and the y_i are the associated labels; each y_i takes one of two values (+1 or −1) depending on the class.[36,37] In the linear case, training solves the dual optimization problem

max_α  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j),   subject to  Σ_i α_i y_i = 0  and  0 ≤ α_i ≤ C,

where C is the soft-margin constant, acting as an upper bound on the Lagrange multipliers α_i; a new sample x is then classified by f(x) = sign(Σ_i α_i y_i (x_i · x) + b). For the nonlinear case, the SVM transforms the input data into a higher-dimensional feature space using a kernel function K, so the problem can be solved as a separable case. With a kernel function, the dual objective becomes

max_α  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j),

under the same constraints. The most familiar kernel functions are the linear kernel K(x, z) = x · z, the polynomial kernel K(x, z) = (x · z + 1)^p (p: degree), the Gaussian kernel K(x, z) = exp(−‖x − z‖² / 2σ²) (σ: standard deviation), and the sigmoid kernel K(x, z) = tanh(β0 (x · z) + β1) (β0: slope, β1: intercept constant). The Gaussian kernel is one of the most useful and widely applicable SVM kernels. Each kernel function has its own parameters, which must be set properly to obtain high classification accuracy.[38]
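The four kernel functions named above can be written down directly. This Python sketch mirrors the formulas; the function names and default parameter values are ours, not from the paper.

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return float(np.dot(x, z))

def polynomial_kernel(x, z, p=2):
    """K(x, z) = (x . z + 1)^p, with p the polynomial degree."""
    return float((np.dot(x, z) + 1) ** p)

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))

def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    """K(x, z) = tanh(beta0 (x . z) + beta1)."""
    return float(np.tanh(beta0 * np.dot(x, z) + beta1))
```

Note that the RBF kernel of any point with itself is exactly 1, which is why its values are often read as similarities.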

Proposed algorithm

The proposed algorithm is a combination of the GA and PSO algorithms. The goal is to combine the properties of both algorithms by integrating GA operators into the PSO algorithm. The main difference between the GA and PSO is that PSO has no crossover or mutation operators; thus, it is more likely to be caught in a local minimum. The best particle found by PSO is remembered, so it has an effect on the other particles; this property increases convergence.[39,40,41,42] The hybrid PSO/GA requires the following 11 steps.

Step 1

Step 1 is the preparation of data by filtering and normalization. Most genes in the databases are not useful and do not have the desired patterns for microarray data analysis. These genes must be removed because: (a) their expression value is very low; (b) they show little change in expression value across all samples; (c) they have low standard deviations and do not change substantially around the mean expression value; and (d) they have low information entropy. The t-test can then be used to examine the data, select the top-ranked genes, and apply them as input to the hybrid PSO/GA system.
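As an illustration of the t-test filtering in this step, the following sketch ranks genes by a Welch-style two-sample t-statistic and keeps the top k. The function names and the use of the absolute statistic are our assumptions; the paper does not spell out its exact variant.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic per gene.

    X: (samples, genes) expression matrix; y: binary labels (0/1).
    """
    a, b = X[y == 0], X[y == 1]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    pooled_se = np.sqrt(a.var(axis=0, ddof=1) / len(a)
                        + b.var(axis=0, ddof=1) / len(b))
    return np.abs(mean_diff / pooled_se)

def top_k_genes(X, y, k):
    """Indices of the k genes with the largest absolute t-score."""
    return np.argsort(-t_scores(X, y))[:k]
```

The selected indices then define segment 1 of each chromosome in the hybrid PSO/GA system.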

Step 2

The initial values of each parameter used in the algorithm are set as shown in Table 2.
Table 2

Parameters in particle swarm optimization genetic algorithm


Step 3

Step 3 is to create the initial population. First, a population of N chromosomes is randomly generated. Binary initialization is applied, so that 1 denotes the presence of a feature in the training system and 0 denotes its absence. The length of each particle or chromosome is the number of features selected by the statistical method (segment 1) plus 17 additional bits used to encode the optimal parameters of the classifier in the hybrid algorithm. Table 3 shows the details of the subparts (segments 2 through 6). The first subpart (2 bits) determines the type of kernel function: linear, polynomial, radial basis function (RBF), or sigmoid. The next subpart (5 bits) represents the value of C (penalty factor), which lies between 0.1 and 100000. The following subpart (6 bits) determines the RBF kernel parameter, which is between 0.001 and 0.128. The fifth segment (2 bits) represents the value of the polynomial kernel parameter (d), which can be 1, 2, 3, or 4. The sixth segment (2 bits) represents the value of the sigmoid kernel parameter, which can also be 1, 2, 3, or 4.[43]
Table 3

A sample chromosome of particle swarm optimization genetic algorithm/fuzzy support vector machine population

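The chromosome layout described in Step 3 can be decoded as in the sketch below. The segment widths (2 + 5 + 6 + 2 + 2 = 17 bits) match the description above, but the value mappings for C (log scale) and the RBF parameter (linear scale) are assumptions for illustration, since the record does not spell them out.

```python
KERNELS = ["linear", "polynomial", "rbf", "sigmoid"]

def bits_to_int(bits):
    """Interpret a list of 0/1 values as a big-endian binary integer."""
    return int("".join(str(b) for b in bits), 2)

def decode(chromosome, n_genes):
    """Split a binary chromosome into a gene mask plus classifier parameters."""
    mask = chromosome[:n_genes]          # segment 1: gene selection bits
    tail = chromosome[n_genes:]          # segments 2-6: 17 parameter bits
    kernel = KERNELS[bits_to_int(tail[0:2])]
    # 5 bits -> C in [0.1, 1e5] on a log scale (assumed mapping)
    C = 0.1 * (1e6) ** (bits_to_int(tail[2:7]) / 31)
    # 6 bits -> RBF parameter in [0.001, 0.128] (assumed linear mapping)
    gamma = 0.001 + (0.128 - 0.001) * bits_to_int(tail[7:13]) / 63
    degree = bits_to_int(tail[13:15]) + 1      # polynomial d: 1..4
    sig_param = bits_to_int(tail[15:17]) + 1   # sigmoid parameter: 1..4
    return mask, kernel, C, gamma, degree, sig_param
```

Decoding happens once per fitness evaluation, so each particle carries both its feature subset and a complete classifier configuration.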

Step 4

In this step, the fitness value of every particle is calculated to determine its quality; this is called validation of particles. The data are divided into training and evaluation parts using 10-fold cross-validation as input for the cost function. This step is carried out for every particle, training and testing on the features that are selected in that particle. The importance of each sample in the SVM training phase is then examined. The standard SVM assumes that the training samples occur in pairs (x_i, y_i) with y_i ∈ {−1, +1}; the FSVM augments each pair to a triple (x_i, y_i, s_i), where s_i denotes the level of importance of that sample. The membership degree of sample x_i, rather than only its class, is used, which is achieved by a slight alteration of the main formulation: the constraint on the Lagrange multipliers becomes 0 ≤ α_i ≤ s_i C. Thus, the difference between the standard SVM and the FSVM is the upper limit of the Lagrange multipliers: α_i in the FSVM is bounded by s_i C, whereas in the SVM it is bounded by C. Next, the membership s_i of each sample x_i is computed. Lin and Wang[44] obtained the value of s_i from the ratio of the distance of a sample from the center of its class to the distance of the farthest sample of the same class from that center. This method is sensitive to outliers and is not suitable for this kind of problem. The proposed method instead computes the weight of each sample from its distance to the class mean under the class covariance, where ε is a small value equal to 0.001 and μ and Σ are the mean vector and covariance matrix of the sample's class, respectively. For simplicity and to decrease computation, the covariance matrix is assumed to be diagonal. Entering the weight of each sample into the training phase in this way decreases the effect of outliers, because each sample's error term is multiplied by its weight.[45]
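A possible implementation of the membership computation is sketched below. The exact formula is not reproduced in this record, so the normalization 1 − d_i/(d_max + ε) is an assumed Lin-and-Wang-style form applied to a Mahalanobis-type distance with the diagonal covariance simplification mentioned above.

```python
import numpy as np

def fuzzy_memberships(X, eps=1e-3):
    """Membership s_i in (0, 1] for each row of X (one class's samples).

    Uses distance to the class mean scaled by a diagonal covariance
    (the paper's simplification); the final normalization is an
    assumption, not the paper's exact equation.
    """
    mu = X.mean(axis=0)
    var = X.var(axis=0) + eps            # diagonal covariance, regularized
    d = np.sqrt((((X - mu) ** 2) / var).sum(axis=1))
    return 1.0 - d / (d.max() + eps)
```

Samples close to the class mean receive weights near 1, while a far-away outlier receives the smallest weight and therefore contributes least to the FSVM's error term.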

Step 5

Update the global best and the personal best memory of each particle, then update the velocity and position of the particles with the standard PSO rule

v_i ← w·v_i + c1·r1·(pbest_i − x_i) + c2·r2·(gbest − x_i).

In the binary algorithm, velocity is interpreted as a change in probability: the velocity, mapped into the range 0 to 1 by the sigmoid function S, gives the probability that a bit takes the value 1.[46] The final position of bit i is then determined as x_i = 1 if σ < S(v_i) and x_i = 0 otherwise, where σ is a random number with uniform distribution in the range 0 to 1. To increase the convergence speed of the system, the velocity must be limited between a maximum and a minimum value. The roulette-wheel approach is used for parent selection in the proposed method. After parent selection, the genetic operators are executed. Single-point, double-point, and uniform crossover are applied with random probability, so that the benefits of these crossover methods are exploited simultaneously.
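The binary PSO update of this step can be sketched as follows; the inertia and acceleration constants are typical defaults, not values from the paper.

```python
import numpy as np

def update_particle(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    """One binary-PSO step.

    The real-valued velocity is updated from personal and global bests,
    clipped to [-vmax, vmax], then mapped through the sigmoid to the
    probability that each bit becomes 1.
    """
    r1, r2 = np.random.rand(len(x)), np.random.rand(len(x))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v = np.clip(v, -vmax, vmax)              # velocity limits
    prob = 1.0 / (1.0 + np.exp(-v))          # sigmoid mapping
    x = (np.random.rand(len(x)) < prob).astype(int)
    return x, v
```

The velocity clipping keeps the sigmoid away from saturation, so no bit's flip probability is ever frozen at exactly 0 or 1.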

Step 6

Evaluate the fitness function again.

Step 7

Combine the offspring with the parents and sort them by fitness value. Then select the best individuals using the elitism method and the defined population size.
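This elitist survivor selection amounts to pooling parents and offspring and keeping the fittest; a minimal sketch (function name is ours):

```python
def next_generation(parents, offspring, pop_size, fitness):
    """Elitism: pool parents and offspring, keep the best pop_size."""
    pool = parents + offspring
    return sorted(pool, key=fitness, reverse=True)[:pop_size]
```

Because the pool always contains the current parents, the best solution found so far can never be lost between generations.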

Step 8

Go back to step 5 and repeat the steps until the termination condition is reached. The termination condition is the number of generations.

Step 9

When there is no further progress, the best features and the best parameters for the classifier have been selected. These features and parameters can then be applied to a blind test set that played no part in the training and validation phases.

Step 10

Determine the occurrence frequency of each feature over the whole process. Biomarkers that were repeated more than six times among the best solutions are reported.
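Counting feature occurrences across runs is straightforward; this sketch applies the paper's threshold of more than six repeats over the set of runs.

```python
from collections import Counter

def frequent_biomarkers(runs, min_count=6):
    """Genes appearing in the best subsets of more than min_count runs.

    runs: list of gene-ID lists, one per algorithm run.
    """
    counts = Counter(g for subset in runs for g in set(subset))
    return sorted(g for g, c in counts.items() if c > min_count)
```

With 10 runs, a gene must therefore survive selection in at least 7 of them to be reported as a biomarker.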

Step 11

The rules can be found by applying the decision-tree algorithm to the best extracted features. Figure 1 is a flowchart of the process. This flowchart summarizes how the system works and the relationships between the feature selection method and the classifier.
Figure 1

Hybrid algorithm flowchart (particle swarm optimization/genetic algorithm/fuzzy support vector machine)

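The rule extraction of Step 11 used the C5.0 decision tree (see the Discussion). As a toy stand-in, the following sketch finds the single-gene expression threshold that best separates two classes, which is the kind of rule ultimately reported in Table 6; the function name and accuracy criterion are ours.

```python
import numpy as np

def best_split(values, labels):
    """Best expression threshold on one gene for two classes (0/1).

    Scans midpoints between consecutive sorted values and keeps the
    threshold with the highest accuracy in either direction -- a
    one-level approximation of a decision-tree split.
    """
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    y = np.asarray(labels)[order]
    best_t, best_acc = None, 0.0
    for i in range(1, len(v)):
        t = (v[i - 1] + v[i]) / 2
        acc = max(np.mean((v >= t) == y), np.mean((v < t) == y))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

A real C5.0 tree adds information-gain-based attribute selection and pruning, but each leaf rule it emits has exactly this "gene above/below threshold" form.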

Results

The accuracy, sensitivity, precision, and specificity values were evaluated by applying the proposed algorithm to four public datasets. These values are statistical indicators for evaluating binary classification. The goal was to find the best possible combination and to compare this modified algorithm with other methods. Tables 4 and 5 show the results of applying the algorithm to the databases. In proportion to the number of samples in each database, 5–60 genes were selected and the hybrid algorithm was applied to them. All algorithms were implemented in MATLAB with the LIBSVM library.
Table 4

The results of applying hybrid (particle swarm optimization/genetic algorithm) to support vector machine classifier

Table 5

The results of applying hybrid (particle swarm optimization/genetic algorithm) to fuzzy support vector machine classifier

This section introduces the biomarkers obtained using the hybrid algorithm. The results indicate the good performance of the algorithm in finding small subsets of features with high accuracy by decreasing the effect of outliers and noisy data, and show good agreement between these biomarkers and the biomarkers reported by others in the literature.

Discussion and Analysis of Results

To investigate the accuracy of the proposed PSO/GA/FSVM hybrid algorithm, the results were examined in greater detail. Figure 2 shows the most frequent genes identified while running the algorithm with 10-fold cross-validation, to determine which genes occurred most often in each database. Figure 2a shows the results for the leukemia cancer types (ALL, AML), for which 25 biomarkers were selected by the proposed hybrid algorithm. The most frequent genes selected comprised 19 genes for cancer types ALL and MLL in Figure 2b, 14 genes for colon cancer in Figure 2c, and 18 genes for breast cancer in Figure 2d. All of these genes appeared more than six times in the 10 runs of the algorithm.
Figure 2

Occurrence frequency of genes by hybrid particle swarm optimization/genetic algorithm/fuzzy support vector machine algorithm with 10-fold cross validation. (a) Acute lymphoblastic leukemia, acute myeloid leukemia (b) acute lymphoblastic leukemia, mixed lineage leukemia (c) colon cancer (d) breast cancer

A heat map was used to examine the biomarkers as a graphical representation of the changes in the behavior of the genes in the dataset. For example, it is desirable for the behavior of genes in cancer samples to be similar to one another and different from healthy samples. One group of genes may exhibit low expression in normal samples, and another group may exhibit high expression in normal samples. These genes can interact to aid in the accurate separation of cancer samples from normal samples. Figure 3 shows heatmaps of the two types of leukemia [Figure 3a and b], colon cancer [Figure 3c], and breast cancer [Figure 3d]. Red denotes values above the mean, black denotes the mean, and green denotes values below the mean of a gene across all columns. The decision-tree algorithm was applied to the biomarkers obtained using the proposed hybrid approach to find rules common to them. Several criteria can be used to select features, including information gain, the gain ratio, and the Gini index. The C5.0 decision-tree algorithm in SPSS Clementine 12[47] (a statistical analysis software package acquired by IBM in 2009) was employed. Table 6 shows the rules discovered using the hybrid PSO/GA/FSVM. Three rules with 93% accuracy were found using 10-fold cross-validation on the blood cancer types (ALL, AML). Classification was performed using the u29175 and X95735 genes: gene X95735 has high expression in AML samples, whereas gene u29175 has low expression in this cancer type (AML). 
The table also shows the rules for the other databases: blood cancer types ALL and MLL, breast cancer, and colon cancer. The classification accuracies for these cancer data using the highly ranked genes were 93%, 89%, and 80%, respectively.
Figure 3

Heatmaps of the 4 cancer datasets show the different behavior of genes in the 2 classes of data. (a-d) Results for leukemia types acute lymphoblastic leukemia and mixed lineage leukemia, acute lymphoblastic leukemia and acute myeloid leukemia, colon cancer, and breast cancer data, respectively

Table 6

Rules extracted by the decision tree on the 4 cancer databases

Comparisons were made between the proposed algorithm and other algorithms. Table 7 shows the results of the comparison based on classification accuracy.
Table 7

Summary of results and comparison with the literature

An earlier paper[48] by the same authors employed a multilayer perceptron (MLP) for classification. However, running that algorithm takes more time than the FSVM and SVM classifiers; one advantage of the FSVM classifier is its high running speed. In the MLP classifier, all samples have the same weight in the training phase, whereas the present work uses the FSVM as the classifier and enters the importance of each sample into the training phase. The biomarkers extracted by the proposed algorithm were also compared with those reported in other studies, as presented in Tables 8 and 9. For blood cancer types ALL and MLL, the proposed algorithm found 24 biomarkers; 11 were the same as the biomarkers from Armstrong. The common biomarkers for this blood cancer were 36678, 34699, 33305, 32579, 41710, 32533, 33412, 32749, 37027, 2036, and 40570. For the ALL and AML cancers, the algorithm found 19 biomarkers, 11 of which were the same as those presented by Golub et al. These common biomarkers were X17042, U50136, X95735, M55150, M92287, U29175, M31211, M16038, U05259, M31303, and M31523.
Table 8

Discovered biomarkers for leukemia and blood cancer (acute lymphoblastic leukemia, acute myeloid leukemia, mixed lineage leukemia)

Table 9

Discovered biomarkers for colon and breast cancer by particle swarm optimization/genetic algorithm/fuzzy support vector machine

For breast and colon cancer, 7 out of 18 biomarkers were in common with the results presented by West for breast cancer, and 3 out of 14 biomarkers were in common with the results presented by Alon et al. for colon cancer. The biomarkers in common for breast cancer were M35851, X52003, X58072, X14474, U95740, U68385, and U22376. The biomarkers in common for colon cancer were T57619, T58861, and X55715.

Conclusions

The results of the present study provide a comprehensive comparison of the proposed algorithm with those from previously published sources. The proposed algorithm is a hybrid of PSO and GA with an FSVM classifier. This classifier is able to enter the importance of each sample into the training of the system for later prediction, without the need for trial-and-error determination of the classifier parameters. Good results have been demonstrated for the proposed algorithm: the classification accuracy for the leukemia data is 100%, for colon cancer 96.67%, and for breast cancer 98%. These results are better than those of other works because the algorithm can determine the training parameters and small feature subsets in the databases with no user intervention. The results show that the best kernel for training the SVM classifier is the RBF.

Financial support and sponsorship

None.

Conflicts of interest

There are no conflicts of interest.