
Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection.

Yong Mao, Xiaobo Zhou, Daoying Pi, Youxian Sun, Stephen T C Wong.

Abstract

We investigate the problem of multiclass cancer classification with gene selection from gene expression data. Two differently constructed multiclass classifiers with gene selection are proposed: a fuzzy support vector machine (FSVM) with gene selection and a binary classification tree based on SVM with gene selection. Using the F test and recursive feature elimination based on SVM (SVM-RFE) as gene selection methods, the binary classification tree based on SVM with the F test, the binary classification tree based on SVM with SVM-RFE, and the FSVM with SVM-RFE are tested in our experiments. To accelerate computation, preselection of the strongest genes is also used. The proposed techniques are applied to breast cancer data, small round blue-cell tumor data, and acute leukemia data. Compared to existing multiclass cancer classifiers and to the binary classification tree based on SVM with the F test or with SVM-RFE discussed in this paper, the FSVM with SVM-RFE finds the most important genes affecting certain types of cancer with high recognition accuracy.


Year:  2005        PMID: 16046822      PMCID: PMC1184049          DOI: 10.1155/JBB.2005.160

Source DB:  PubMed          Journal:  J Biomed Biotechnol        ISSN: 1110-7243


INTRODUCTION

By comparing gene expression in normal and diseased cells, microarrays are used to identify disease genes and targets for therapeutic drugs. However, the huge amount of data provided by cDNA microarray measurements must be explored in order to answer fundamental questions about gene functions and their interdependence [1], and hopefully to answer questions such as what type of disease is affecting the cells or which genes strongly influence this disease. Questions like these lead to the study of gene classification problems. Many factors may affect the results of the analysis; one of them is the huge number of genes included in the original dataset. Key issues that need to be addressed under such circumstances are the efficient selection of good predictive gene groups from datasets that are inherently noisy, and the development of new methodologies that can enhance the successful classification of these complex datasets. For multiclass cancer classification and discovery, the performance of different discrimination methods, including nearest-neighbor classifiers, linear discriminant analysis, classification trees, and bagging and boosting learning methods, is compared in [2]. This problem has also been studied using partial least squares [3], Bayesian probit regression [4], and iterative classification trees [5]. However, multiclass cancer classification combined with gene selection has not been investigated intensively. In multiclass classification with gene selection, every classification operation is accompanied by a gene selection operation, and this combination is the focus of this paper.

In the past decade, a number of variable (gene) selection methods for two-class classification have been proposed, notably the support vector machine (SVM) method [6], the perceptron method [7], mutual-information-based selection [8], Bayesian variable selection [2, 9, 10, 11, 12], the minimum description length principle for model selection [13], voting techniques [14], and so on. In [6], gene selection using recursive feature elimination based on SVM (SVM-RFE) is proposed. In two-class settings, it has been demonstrated experimentally that the genes selected by this technique yield better classification performance and are more biologically relevant to cancer than those selected by the other methods mentioned in [6], such as feature ranking with correlation coefficients or sensitivity analysis. However, its application to multiclass gene selection has not been reported because of its expensive computational burden; thus, gene preselection is adopted to overcome this shortcoming, and SVM-RFE is a key gene selection method used in our study. As a two-class classification method, the SVM's remarkably robust performance with respect to sparse and noisy data makes it a first choice in a number of applications; its application to cancer diagnosis using gene expression profiles is reported in [15, 16]. In recent years, the binary SVM has been used as a component in many multiclass classification algorithms, such as the binary classification tree and the fuzzy SVM (FSVM). These multiclass classification methods all have excellent performance, which benefits from their roots in the binary SVM and from their own constructions. Accordingly, we propose two differently constructed multiclass classifiers with gene selection: one uses a binary classification tree based on SVM (BCT-SVM) with gene selection, while the other is an FSVM with gene selection.
In this paper, the F test and SVM-RFE are used as our gene selection methods. Three groups of experiments are performed, using FSVM with SVM-RFE, BCT-SVM with SVM-RFE, and BCT-SVM with the F test, respectively. Compared to the methods in [2, 3, 5], our proposed methods can identify which genes are the most important for certain types of cancer. In these experiments, with most of the strongest genes selected, the prediction error rate of our algorithms is extremely low, and FSVM with SVM-RFE shows the best performance of all. The paper is organized as follows. The problem statement is given in "Problem statement." BCT-SVM with gene selection is outlined in "Binary classification tree based on SVM with gene selection." FSVM with gene selection is described in "FSVM with gene selection." Experimental results on breast cancer data, small round blue-cell tumor data, and acute leukemia data are reported in "Experimental results." Analysis and discussion are presented in "Analysis and discussion." "Conclusion" concludes the paper.

PROBLEM STATEMENT

Assume there are K classes of cancers. Let w = [w_1, ..., w_m] denote the class labels of m samples, where w_i = k indicates that sample i is cancer k, k = 1, ..., K. Assume x_1, ..., x_n are n genes, and let x_{ij} be the measurement of the expression level of the jth gene for the ith sample, where j = 1, 2, ..., n. Then X = [x_{ij}] denotes the expression levels of all genes, that is,

\[ X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}. \tag{1} \]

In the two proposed methods, every sample is partitioned by a series of optimal hyperplanes. An optimal hyperplane is one from which the training data are maximally distant and which achieves the lowest classification error rate when used to classify the current training set. These hyperplanes can be modeled as

\[ \omega \cdot X_i + b = 0, \tag{2} \]

and the classification functions are defined as f_{st}(X_i) = \omega \cdot X_i + b, where X_i denotes the ith row of matrix X; s and t denote the two partitions separated by an optimal hyperplane, and what these partitions mean depends on the construction of the multiclass classification algorithm: in a binary classification tree, s and t are the two halves separated at an internal node, which may be the root node or a common internal node; in FSVM, s and t are two arbitrary classes among the K classes. \omega is an n-dimensional weight vector and b is a bias term. The SVM algorithm is used to determine these optimal hyperplanes. SVM is a learning algorithm originally introduced by Vapnik [17, 18] and successively extended by many other researchers. SVMs can work in combination with the "kernel" technique, which automatically performs a nonlinear mapping into a feature space, so that SVMs can also handle nonlinearly separable problems. In SVM training, a convex quadratic programming problem is solved, finally yielding the optimal solutions for \omega and b; detailed solution procedures are found in [17, 18]. Along with each binary classification using SVM, one gene selection operation is done in advance; the specific gene selection methods used in this paper are described briefly in "Experimental results." Performing gene selection before the SVM is trained means that, whenever an SVM is trained or used for prediction, dimensionality reduction is applied to the input data X, keeping only the strongest genes selected. We represent this procedure by the function I(X_i \beta), where \beta is an n × n matrix whose diagonal elements may be equal to 1 or 0 and whose other elements are all equal to 0; genes corresponding to the nonzero diagonal elements are the important ones, and \beta is obtained by a specific gene selection method. The function I(·) selects all nonzero elements of the input vector to construct a new vector; for example, I([1 0 2]) = [1 2]. So (2) is rewritten as

\[ \omega \cdot I(X_i \beta) + b = 0, \tag{3} \]

and the classification functions are rewritten accordingly as f_{st}(X_i) = \omega \cdot I(X_i \beta) + b. To accelerate computation, genes are preselected before the multiclass classifiers are trained. Based on all of the above, we propose two differently constructed multiclass classifiers with gene selection: (1) a binary classification tree based on SVM with gene selection, and (2) FSVM with gene selection.
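To make the gene-selection operator concrete, the following minimal sketch (not from the paper; it assumes NumPy and uses a hypothetical helper name) shows how the diagonal 0/1 matrix β reduces one sample's expression vector to the selected genes, i.e., the effect of the I(·) operation described above.

```python
import numpy as np

def apply_gene_selection(x_row, beta_diag):
    """I(.) from the text: keep only the entries of the sample vector
    whose corresponding diagonal entry of beta equals 1."""
    return x_row[beta_diag.astype(bool)]

# Hypothetical example with 5 genes; the 2nd and 4th are marked important.
x_row = np.array([0.3, 1.2, -0.7, 2.1, 0.05])   # expression levels of one sample
beta_diag = np.array([0, 1, 0, 1, 0])            # diagonal of the selection matrix beta
print(apply_gene_selection(x_row, beta_diag))    # -> [1.2  2.1]
```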

BINARY CLASSIFICATION TREE BASED ON SVM WITH GENE SELECTION

A binary classification tree is an important class of machine-learning algorithms for multiclass classification. We construct a binary classification tree with SVM, which we call BCT-SVM for short. In BCT-SVM, there are K − 1 internal nodes and K terminal nodes. When building the tree, the solution of (3) is searched by SVM at each internal node to separate the data at the current node into the left child node and the right child node, using the appointed gene selection method mentioned in "Experimental results." Which class or classes should be partitioned into the left (or right) child node is decided at each internal node by impurity reduction [19], which is used to find the optimal construction of the classifier; the partition scheme with the largest impurity reduction (IR) is optimal. Here we use the Gini index as our IR measurement criterion, which is also used in classification and regression trees (CARTs) [20] as a measure of class diversity. Denote by M the training dataset at the current node, by M_L and M_R the training datasets at the left and right child nodes, by M^i the sample set of class i in the training set at the current node, by M_L^i and M_R^i the sample sets of class i at the left and right child nodes, and by λ_Θ the number of samples in dataset Θ. The current IR can be calculated as follows, where c is the number of classes at the current node:

\[ IR(M) = \left(1 - \sum_{i=1}^{c} \left(\frac{\lambda_{M^i}}{\lambda_M}\right)^{2}\right) - \frac{\lambda_{M_L}}{\lambda_M}\left(1 - \sum_{i=1}^{c} \left(\frac{\lambda_{M_L^i}}{\lambda_{M_L}}\right)^{2}\right) - \frac{\lambda_{M_R}}{\lambda_M}\left(1 - \sum_{i=1}^{c} \left(\frac{\lambda_{M_R^i}}{\lambda_{M_R}}\right)^{2}\right). \]

When the maximum of IR(M) is found over all potential combinations of classes at the current internal node, which part of the data should be partitioned into the left child node is decided. For the details of constructing a standard binary decision tree, we refer to [19, 20]. After this problem is solved, samples partitioned into the left child node are labeled with −1 and the others are labeled with 1; based on these labels, a binary SVM classifier with gene selection is trained using the data of the two current child nodes. Gene selection is necessary because cancer classification is a typical small-sample, high-dimensional problem, and directly training the classifier with all genes would cause overfitting; here, any gene selection method based on two-class classification can be used to construct β in (3). The process of building the whole tree is recursive, as seen in Figure 1.
Figure 1

Binary classification tree based on SVM with gene selection.
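As a rough illustration of the impurity-reduction criterion used to choose the class partition at each internal node, the sketch below (an illustrative reimplementation under our own assumptions, not the authors' code) computes the Gini-based IR for every candidate assignment of classes to the left child and returns the best one.

```python
import numpy as np
from itertools import combinations

def gini(labels):
    """Gini index: 1 - sum_i p_i^2 over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(labels, left_mask):
    """IR(M) for splitting the node's samples into left/right children."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_class_partition(labels):
    """Exhaustively try sending each subset of classes to the left child
    and keep the partition with the largest impurity reduction."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    best_subset, best_ir = None, -np.inf
    for r in range(1, len(classes)):
        for subset in combinations(classes, r):
            ir = impurity_reduction(labels, np.isin(labels, subset))
            if ir > best_ir:
                best_subset, best_ir = set(subset), ir
    return best_subset, best_ir

# Toy node containing classes 0, 1, and 2; class 2 tends to be split off first.
print(best_class_partition([0, 0, 1, 1, 2, 2, 2, 2]))
```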

When the training data at a node cannot be split any further, that node is identified as a terminal node, and the value obtained from the decision function corresponds to the label of a particular class. Once the tree is built, we can predict the class of a sample using the genes selected by the tree; the trained SVMs route it to a terminal node, which carries its own class label. In the process of building BCT-SVM, K − 1 gene selection operations are performed, because the construction of BCT-SVM contains K − 1 SVMs.

FSVM WITH GENE SELECTION

Unlike BCT-SVM, FSVM has a pairwise construction, which means that the hyperplane between every pair of classes is searched using SVM with gene selection; these processes are modeled by (3). FSVM was first proposed by Abe and Inoue [21, 22] to deal with the unclassifiable regions that arise when the one-versus-rest or pairwise classification method based on binary SVM is applied to problems with more than two classes. FSVM is an improved pairwise classification method with SVM, in which a fuzzy membership function is introduced into the decision function based on pairwise classification. For data in the classifiable regions, FSVM gives the same classification results as the pairwise classification method with SVM, and for data in the unclassifiable regions, FSVM generates better classification results than the pairwise classification method with SVM. In the training process, FSVM is the same as the pairwise classification method with SVM referred to in [23]. In order to describe our proposed algorithm clearly, we define four input variables: the sample matrix X_0 = {x_1, x_2, ..., x_i, ..., x_m}, that is, X_0 is a matrix composed of some columns of the original training dataset X corresponding to the preselected important genes; the class-label vector y = {y_1, y_2, ..., y_i, ..., y_m}; the number of classes in the training set ν; and the number of important genes used in gene selection κ. With these four input variables, the training process of FSVM with gene selection is expressed in Algorithm 1.
Algorithm 1

The FSVM with gene selection training algorithm.

In Algorithm 1, υ = GeneSelection(μ, ·) is the realization of a specific binary gene selection algorithm; υ denotes the genes important for the two specific drawn-out classes and is used to construct β in (3); SVMTrain() is the realization of the binary SVM algorithm; α is a Lagrange multiplier vector, and ϵ is a bias term. γ, alpha, and bias are the output matrices: γ is made up of all important genes selected, in which each row corresponds to the list of important genes selected between two specific classes; alpha is a matrix in which each row corresponds to the Lagrange multiplier vector of an SVM classifier trained between two specific classes; and bias is the vector made up of the bias terms of these SVM classifiers. In this process, K(K − 1)/2 SVMs are trained and K(K − 1)/2 gene selection operations are executed, which means that many important genes relevant to two specific classes of samples will be selected. Based on the K(K − 1)/2 optimal hyperplanes and the strongest genes selected, the decision function is constructed from (3). Define f_{ts}(X) = −f_{st}(X) (s ≠ t); the fuzzy membership function m_{st}(X) is introduced on the directions orthogonal to f_{st}(X) = 0 as

\[ m_{st}(X) = \begin{cases} 1, & f_{st}(X) \ge 1, \\ f_{st}(X), & \text{otherwise}. \end{cases} \]

Using m_{st}(X) (t ≠ s, t = 1, ..., K), the class s membership function of X is defined as m_s(X) = min_{t ≠ s} m_{st}(X), which is equivalent to m_s(X) = min(1, min_{t ≠ s} f_{st}(X)); an unknown sample X is then classified by arg max_{s = 1, ..., K} m_s(X).
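The decision rule just described can be sketched as follows; this is an illustrative reimplementation (not the authors' code) that assumes the K(K − 1)/2 pairwise decision values f_{st}(X) for one sample have already been computed and stored in an antisymmetric K × K array.

```python
import numpy as np

def fsvm_classify(f):
    """FSVM decision for one sample.

    f: K x K array with f[s, t] = f_st(X), the decision value of the SVM
    trained between classes s and t, and f[t, s] = -f[s, t]; the diagonal
    is ignored. Class membership is m_s(X) = min(1, min_{t != s} f[s, t]),
    and the predicted class is argmax_s m_s(X).
    """
    K = f.shape[0]
    memberships = np.array([
        min(1.0, min(f[s, t] for t in range(K) if t != s))
        for s in range(K)
    ])
    return int(np.argmax(memberships)), memberships

# Hypothetical 3-class example: class 1 wins both of its pairwise comparisons.
f = np.array([[ 0.0, -0.4,  0.7],
              [ 0.4,  0.0,  1.3],
              [-0.7, -1.3,  0.0]])
print(fsvm_classify(f))   # predicted class 1
```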

EXPERIMENTAL RESULTS

The F test and SVM-RFE are the gene selection methods used in our experiments. In the F test, the ratio

\[ R(j) = \frac{\sum_{i=1}^{m}\sum_{k=1}^{K} 1(w_i = k)\,(\bar{x}_{kj} - \bar{x}_{j})^{2}}{\sum_{i=1}^{m}\sum_{k=1}^{K} 1(w_i = k)\,(x_{ij} - \bar{x}_{kj})^{2}} \]

is used to select genes, where \bar{x}_{j} denotes the average expression level of gene j across all samples and \bar{x}_{kj} denotes the average expression level of gene j across the samples belonging to class k; the indicator function 1(·) is equal to one if its argument is true and zero otherwise. Genes with larger R(j) are selected. From the expression of R(j), it can be seen that the F test can select genes among multiple (more than two) classes [14]. SVM-RFE is recursive feature elimination based on SVM: an iterative procedure that eliminates features while training an SVM classifier, where each elimination step consists of three parts: (1) train the SVM classifier, (2) compute the ranking criterion for all features, and (3) remove the feature(s) with the smallest ranking scores, the ranking criteria all being derived from the decision function of the SVM. Since a linear-kernel SVM is used as the classifier between two specific classes s and t, the square of each element of the weight vector ω in (2) is used as the score evaluating the contribution of the corresponding gene, and the genes with the smallest scores are eliminated; details are given in [6]. To speed up the computation, gene preselection is generally used: on every dataset, the 200 most important genes are preselected by the F test before the multiclass classifiers with gene selection are trained. Note that the F test requires normality of the data to be efficient, which is not always the case for gene expression data; this is exactly why we do not rely on the F test alone to select genes. Since the P values of important genes are relatively low, their F test scores should be relatively high. Considering that the number of important genes is typically on the order of tens, we preselect 200 genes, based on our experience, in order to avoid losing important ones; the following experiments show that this procedure works effectively. Combining these two gene selection methods with the multiclass classification methods, we propose three algorithms: (1) BCT-SVM with the F test, (2) BCT-SVM with SVM-RFE, and (3) FSVM with SVM-RFE. As in [4, 9], every algorithm is tested with leave-one-out cross-validation based on the top 5, top 10, and top 20 genes selected by its own gene selection method.
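For concreteness, here is a minimal sketch of the SVM-RFE loop described above between two classes, written with scikit-learn's linear SVM; the function name, parameters, and elimination step size are our own choices, not taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep, n_drop_per_step=1):
    """Recursive feature elimination with a linear SVM (binary labels y):
    train, score each remaining gene by its squared weight w_j^2, remove
    the lowest-scoring gene(s), and repeat until n_keep genes remain.
    Returns the indices of the kept columns of X."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
        scores = (clf.coef_ ** 2).ravel()                 # ranking criterion w_j^2
        n_drop = min(n_drop_per_step, len(remaining) - n_keep)
        worst = np.argsort(scores)[:n_drop]               # smallest scores eliminated
        remaining = np.delete(remaining, worst)
    return remaining
```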

Breast cancer dataset

In our first experiment, we focus on the hereditary breast cancer data, which can be downloaded from the web page of the original paper [24]. In [24], cDNA microarrays are used in conjunction with classification algorithms to show the feasibility of using differences in global gene expression profiles to separate BRCA1 and BRCA2 mutation-positive breast cancers. Twenty-two breast tumor samples from 21 patients were examined: 7 BRCA1, 8 BRCA2, and 7 sporadic. There are 3226 genes for each tumor sample. We use our methods to classify BRCA1, BRCA2, and sporadic tumors. The ratio data are truncated from below at 0.1 and from above at 20. Table 1 lists the top 20 strongest genes selected by our methods. (For readability, we sometimes use the gene index number in the database [24] instead of the clone ID.) The clone ID and gene description of a typical column of the top 20 genes selected by SVM-RFE are listed in Table 2; more information about all selected genes corresponding to the lists in Table 1 can be found at http://www.sensornet.cn/fxia/top_20_genes.zip. It can be seen that gene 1008 (keratin 8) is selected by all three methods. This gene is also an important gene listed in [4, 7, 9]. Keratin 8 is a member of the cytokeratin family of genes; cytokeratins are frequently used to identify breast cancer metastases by immunohistochemistry [24]. Gene 10 (phosphofructokinase, platelet) and gene 336 (transducer of ERBB2, 1) are also important genes listed in [7]. Gene 336 is selected by FSVM with SVM-RFE and BCT-SVM with SVM-RFE; gene 10 is selected by FSVM with SVM-RFE.
Table 1

Index numbers of the strongest genes selected in the hereditary breast cancer dataset.

No    FSVM with SVM-RFE (1  2  3)    BCT-SVM with F test (1  2)    BCT-SVM with SVM-RFE (1  2)

11008185942250111487501999
29551008288629848388603009
3147910343310418591008158
428703365014222724222761
553815892297710082804247
6336199930042578117918361859
7315424717093010106530041148
82259144675028042423420838
97397392299335199917091628
10289312003412456269930651068
1181628861836111612772977819
122804276121926810685851797
13150316581567509631475336
145855602867229415832172893
15162083831041566095012219
1618152300141222991417146585
1730655383217271511903431008
18315549829772753221914172886
19128880916122979560229936
20234210922804242824722941446
Table 2

A part of the strongest genes selected in the hereditary breast cancer dataset (the first column of genes in Table 1).

Rank    Index no    Clone ID    Gene description

11008897781Keratin 8
2955950682Phosphofructokinase, platelet
31479841641Cyclin D1 (PRAD1: parathyroid adenomatosis 1)
4287082991Phosphodiesterase I/nucleotide pyrophosphatase 1
(homologous to mouse Ly-41 antigen)
5538563598Human GABA-A receptor π subunit mRNA, complete cds
6336823940Transducer of ERBB2, 1
73154135118GATA-binding protein 3
82259814270Polymyositis/scleroderma autoantigen 1 (75kd)
9739214068GATA-binding protein 3
10289332790mutS (E coli) homolog 2 (colon cancer, nonpolyposis type 1)
11816123926Cathepsin K (pycnodysostosis)
12280451209Protein phosphatase 1, catalytic subunit, beta isoform
131503838568Cytochrome c oxidase subunit VIc
14585293104Phytanoyl-CoA hydroxylase (Refsum disease)
151620137638ESTs
161815141959Homo sapiens mRNA; cDNA DKFZp566J2446
(from clone DKFZp566J2446)
173065199381ESTs
183155136769TATA box binding protein (TBP)-associated factor,
RNA polymerase II, A, 250kd
191288564803Forkhead (drosophila)-like 16
202342284592Platelet-derived growth factor receptor, alpha polypeptide
Using the top 5, 10, and 20 genes for each of the three methods, the recognition accuracy is shown in Table 3. When using the top 5 genes for classification, there is one error for BCT-SVM with the F test and no error for the other two methods. When using the top 10 and top 20 genes, there is no error for any of the three methods. Note that the performance of our methods is similar to that in [4], where the authors diagnosed the tumor types using a multinomial probit regression model with Bayesian gene selection; using the top 10 genes, they also obtained zero misclassifications.
Table 3

Classifiers' performance on the hereditary breast cancer dataset by cross-validation (number of misclassified samples in the leave-one-out test).

Classification method       Top 5    Top 10    Top 20

FSVM with SVM-RFE             0        0         0
BCT-SVM with F test           1        0         0
BCT-SVM with SVM-RFE          0        0         0

Small round blue-cell tumors

In this experiment, we consider the small round blue-cell tumors (SRBCTs) of childhood, which include neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and Ewing sarcoma (EWS) [25]. The dataset of the four cancers is composed of 2308 genes and 63 samples, where NB has 12 samples, RMS has 23 samples, NHL has 8 samples, and EWS has 20 samples. We use our methods to classify the four cancers. The ratio data are truncated from below at 0.01. Table 4 lists the top 20 strongest genes selected by our methods. The clone ID and gene description of a typical column of the top 20 genes selected by SVM-RFE are listed in Table 5; more information about all selected genes corresponding to the lists in Table 4 can be found at http://www.sensornet.cn/fxia/top_20_genes.zip. It can be seen that gene 244 (clone ID 377461), gene 2050 (clone ID 295985), and gene 1389 (clone ID 770394) are selected by all three methods, and these genes are also important genes listed in [25]. Gene 255 (clone ID 325182), gene 107 (clone ID 365826), and gene 1 (clone ID 21652, catenin alpha 1), selected by BCT-SVM with SVM-RFE and FSVM with SVM-RFE, are also listed in [25] as important genes.
Table 4

Index numbers of the strongest genes selected in the small round blue-cell tumors dataset.

No    FSVM with SVM-RFE (1  2  3  4  5  6)    BCT-SVM with F test (1  2  3)    BCT-SVM with SVM-RFE (1  2  3)

1246255195485118716011074169422545174851
2138986717088465098422461055109913891353846
38512461955191521621955170833875820508421915
417501389509160110725513894221387131918841601
5107842205074275820461954173876116131003742
621982050545191620461764607135312310037071916
7205036513892144219850916138008424619552144
821627422046219820226031645714188886720462198
960710734814271606707131975895119542551427
1019809761291169174566910160616451691
1156713195661066113533682047191411108191066
12202219912468671915169132721621634368509867
131626819120778878810032442227867129166788
1419162511003153188674254520497833481207153
15544236368198055422031888188421683656031980
16164519541105219913531072050195516011077962199
17142717081158783338719430120733517081764783
18170810841645143484616636532610841877191434
19230356613197991884188417727968361626107799
20256111017991886223519801298230849177222031886
Table 5

A part of the strongest genes selected in the small round blue-cell tumors dataset (the first column of genes in Table 4).

Rank    Index no    Clone ID    Gene description

1246377461Caveolin 1, caveolae protein, 22kd
21389770394Fc fragment of IgG, receptor, transporter, alpha
3851563673Antiquitin 1
41750233721Insulin-like growth factor binding protein 2 (36kd)
5107365826Growth arrest-specific 1
62198212542H sapiens mRNA; cDNA DKFZp586J2118
(from clone DKFZp586J2118)
72050295985ESTs
82162308163ESTs
9607811108Thyroid hormone receptor interactor 6
101980841641Cyclin D1 (PRAD1: parathyroid adenomatosis 1)
11567768370tissue inhibitor of metalloproteinase 3
(Sorsby fundus dystrophy, pseudoinflammatory)
122022204545ESTs
131626811000Lectin, galactoside-binding, soluble, 3 binding
protein (galectin 6 binding protein)
14191680109Major histocompatibility complex, class II, DQ alpha 1
155441416782Creatine kinase, brain
16164552076Olfactomedin-related ER localized protein
171427504791Glutathione S-transferase A4
18170843733Glycogenin 2
192303782503H sapiens clone 23716 mRNA sequence
20256154472Fibroblast growth factor receptor 1
(fms-related tyrosine kinase 2, Pfeiffer syndrome)
Using the top 5, 10, and 20 genes for each of the three methods, the recognition accuracy is shown in Table 6. When using the top 5 genes for classification, there is one error for BCT-SVM with the F test and no error for the other two methods. When using the top 10 and top 20 genes, there is no error for any of the three methods.
Table 6

Classifiers' performance on the small round blue-cell tumors dataset by cross-validation (number of misclassified samples in the leave-one-out test).

Classification method       Top 5    Top 10    Top 20

FSVM with SVM-RFE             0        0         0
BCT-SVM with F test           1        0         0
BCT-SVM with SVM-RFE          0        0         0
In [26], Yeo et al applied k-nearest neighbor (kNN), weighted voting, and linear SVM in a one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem. Using the top 5, top 10, or top 20 genes, kNN, weighted voting, and SVM combined with each of the three feature selection methods, without rejection, all have two or more errors. In [27], Lee et al used a multicategory SVM with gene selection; using the top 20 genes, they also achieved zero misclassifications.

Acute leukemia data

We have also applied the proposed methods to the leukemia data of [14], which is available at http://www.sensornet.cn/fxia/top_20_genes.zip. The microarray data contain 7129 human genes sampled from 72 cases of cancer, of which 38 are of type B-cell ALL, 9 are of type T-cell ALL, and 25 are of type AML. The data are preprocessed as recommended in [2]: gene values are truncated from below at 100 and from above at 16 000; genes having a ratio of the maximum over the minimum less than 5 or a difference between the maximum and the minimum less than 500 are excluded; and finally the base-10 logarithm is applied to the 3571 remaining genes. Here we study the 38 samples in the training set, which is composed of 19 B-cell ALL, 8 T-cell ALL, and 11 AML samples. Table 7 lists the top 20 strongest genes selected by our methods. The gene accession number and gene description of a typical column of the top 20 genes selected by SVM-RFE are listed in Table 8; more information about all selected genes corresponding to the lists in Table 7 can be found at http://www.sensornet.cn/fxia/top_20_genes.zip. It can be seen that gene 1882 (CST3 cystatin C (amyloid angiopathy and cerebral hemorrhage)), gene 4847 (zyxin), and gene 4342 (TCF7 transcription factor 7 (T-cell specific)) are selected by all three methods. Of these three genes, the first two are among the most important genes listed in many studies. Gene 2288 (DF D component of complement (adipsin)) is another important gene with biological significance, and it is selected by FSVM with SVM-RFE.
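The preprocessing recipe above can be sketched as follows; this is a hypothetical helper (name and array layout are our assumptions, thresholds taken from the text) that expects a samples-by-genes NumPy array.

```python
import numpy as np

def preprocess_leukemia(X):
    """Clip expression values to [100, 16000], drop genes whose max/min
    ratio is below 5 or whose max - min difference is below 500, then
    apply the base-10 logarithm to the remaining genes."""
    X = np.clip(X, 100, 16000)                       # truncate below/above
    gene_max, gene_min = X.max(axis=0), X.min(axis=0)
    keep = (gene_max / gene_min >= 5) & (gene_max - gene_min >= 500)
    return np.log10(X[:, keep]), keep                # filtered, log-transformed data
```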
Table 7

Index numbers of the strongest genes selected in the acute leukemia dataset.

No    FSVM with SVM-RFE (1  2  3)    BCT-SVM with F test (1  2)    BCT-SVM with SVM-RFE (1  2)

16696188266062335434218824342
26606468066964680405066964050
34342620146802642120755525808
41694228843421882651063781106
51046620067896225405238473969
6177976043184318405553001046
76200233518935300110626426606
8618075816945554126824026696
96510264243795688484733322833
10189324022215758554316851268
114050621833324913104641774847
124379637639694082283366066510
131268630865106573435739692215
144375177923356974437563081834
15484761856168649760417604535
166789408220101078623623351817
172288637811062995669620104375
181106484753005442163065735039
192833530040822215618045864379
206539168510464177410722155300
Table 8

A part of the strongest genes selected in the acute leukemia dataset (the second column of genes in Table 7).

Rank    Index no    Gene accession number    Gene description

11882M27891_atCST3 cystatin C (amyloid angiopathy and cerebral hemorrhage)
24680X82240_rna1_atTCL1 gene (T-cell leukemia) extracted from H sapiens
mRNA for T-cell leukemia/lymphoma 1
36201Y00787_s_atInterleukin-8 precursor
42288M84526_atDF D component of complement (adipsin)
56200M28130_rna1_s_atInterleukin-8 (IL-8) gene
6760D88422_atCystatin A
72335M89957_atIGB immunoglobulin-associated beta (B29)
8758D88270_atGB DEF = (lambda) DNA for immunoglobin light chain
92642U05259_rna1_atMEF2C MADS box transcription enhancer factor 2,
polypeptide C (myocyte enhancer factor 2C)
102402M96326_rna1_atAzurocidin gene
116218M27783_s_atELA2 elastase 2, neutrophil
126376M83652_s_atPFC properdin P factor, complement
136308M57731_s_atGRO2 GRO2 oncogene
141779M19507_atMPO myeloperoxidase
156185X64072_s_atSELL leukocyte adhesion protein beta subunit
164082X05908_atANX1 annexin I (lipocortin I)
176378M83667_rna1_s_atNF-IL6-beta protein mRNA
184847X95735_atZyxin
195300L08895_atMEF2C MADS box transcription enhancer factor 2,
polypeptide C (myocyte enhancer factor 2C)
201685M11722_atTerminal transferase mRNA
Using the top 5, 10, and 20 genes for each of the three methods, the recognition accuracy is shown in Table 9. When using the top 5 genes for classification, there is one error for FSVM with SVM-RFE and two errors each for BCT-SVM with SVM-RFE and BCT-SVM with the F test. When using the top 10 genes, there is no error for FSVM with SVM-RFE, two errors for BCT-SVM with SVM-RFE, and four errors for BCT-SVM with the F test. When using the top 20 genes, there is one error for FSVM with SVM-RFE, two errors for BCT-SVM with SVM-RFE, and two errors for BCT-SVM with the F test. Again, the performance of our methods is similar to that in [4], where the authors diagnosed the tumor types using a multinomial probit regression model with Bayesian gene selection; using the top 10 genes, they also obtained zero misclassifications.
Table 9

Classifiers' performance on the acute leukemia dataset by cross-validation (number of misclassified samples in the leave-one-out test).

Classification method       Top 5    Top 10    Top 20

FSVM with SVM-RFE             1        0         1
BCT-SVM with F test           2        4         2
BCT-SVM with SVM-RFE          2        1         2

ANALYSIS AND DISCUSSION

According to Tables 1–9, many important genes are selected by these three multiclass classification algorithms with gene selection, and based on these selected genes, the prediction error rates of the three algorithms are low. Comparing the results of the three algorithms, we consider that FSVM with SVM-RFE generates the best results. BCT-SVM with SVM-RFE and BCT-SVM with the F test have the same multiclass classification structure; the results of BCT-SVM with SVM-RFE are better than those of BCT-SVM with the F test because their gene selection methods differ, and a better gene selection method combined with the same multiclass classification method performs better. This means that SVM-RFE is better than the F test when combined with multiclass classification methods; the results are similar to those reported in [6], in which the two gene selection methods are combined with two-class classification methods. FSVM with SVM-RFE and BCT-SVM with SVM-RFE use the same gene selection method; the results of FSVM with SVM-RFE are better than those of BCT-SVM with SVM-RFE, both in gene selection and in recognition accuracy, because the constructions of their multiclass classification methods differ, which can be explained in two respects. (1) More genes are selected by FSVM with SVM-RFE than by BCT-SVM with SVM-RFE: in FSVM there are K(K − 1)/2 gene selection operations, one between every two classes, whereas in BCT-SVM there are only K − 1 gene selection operations. (2) FSVM is an improved pairwise classification method, in which the unclassifiable regions are handled by the fuzzy membership function [21, 22]. Therefore, FSVM with SVM-RFE is considered the best of the three.

CONCLUSION

In this paper, we have studied the problem of multiclass cancer classification with gene selection from gene expression data. We proposed two differently constructed classifiers with gene selection: FSVM with gene selection and BCT-SVM with gene selection. The F test and SVM-RFE are used as the gene selection methods combined with the multiclass classification methods. In our experiments, three algorithms (FSVM with SVM-RFE, BCT-SVM with SVM-RFE, and BCT-SVM with the F test) are tested on three datasets (the hereditary breast cancer data, the small round blue-cell tumor data, and the acute leukemia data). The results of these three groups of experiments show that more important genes are selected by FSVM with SVM-RFE, and that with these selected genes it achieves higher prediction accuracy than the other two algorithms. Compared to some existing multiclass cancer classifiers with gene selection, FSVM based on SVM-RFE also performs very well. Finally, an explanation of the experimental results of this study is provided.
REFERENCES

1. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000.

2. Zhou X, Wang X, Dougherty ER. Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics. 2003.

3. Jörnsten R, Yu B. Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics. 2003.

4. Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK. Gene selection: a Bayesian variable selection approach. Bioinformatics. 2003.

5. Kim S, Dougherty ER, Barrera J, Chen Y, Bittner ML, Trent JM. Strong feature sets from small samples. J Comput Biol. 2002.

6. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J, Raffeld M, Yakhini Z, Ben-Dor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G. Gene-expression profiles in hereditary breast cancer. N Engl J Med. 2001.

7. Zhou X, Wang X, Dougherty ER. Multi-class cancer classification using multinomial probit regression with Bayesian gene selection. Syst Biol (Stevenage). 2006.

8. Zhang H, Yu CY, Singer B, Xiong M. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci U S A. 2001.

9. Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001.

10. Nguyen DV, Rocke DM. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics. 2002.