
Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection.

Yong Mao, Xiaobo Zhou, Daoying Pi, Youxian Sun, Stephen T C Wong.

Abstract

We investigate the problem of multiclass cancer classification with gene selection from gene expression data. Two differently constructed multiclass classifiers with gene selection are proposed: a fuzzy support vector machine (FSVM) with gene selection and a binary classification tree based on SVM with gene selection. Using the F test and recursive feature elimination based on SVM (SVM-RFE) as gene selection methods, the binary classification tree based on SVM with the F test, the binary classification tree based on SVM with SVM-RFE, and the FSVM with SVM-RFE are tested in our experiments. To accelerate computation, preselection of the strongest genes is also used. The proposed techniques are applied to breast cancer data, small round blue-cell tumor data, and acute leukemia data. Compared to existing multiclass cancer classifiers and to the binary classification tree based on SVM with the F test or with SVM-RFE discussed in this paper, the FSVM with SVM-RFE finds the most important genes affecting certain types of cancer with high recognition accuracy.


Year:  2005        PMID: 16046822      PMCID: PMC1184049          DOI: 10.1155/JBB.2005.160

Source DB:  PubMed          Journal:  J Biomed Biotechnol        ISSN: 1110-7243


INTRODUCTION

By comparing gene expression in normal and diseased cells, microarrays are used to identify disease genes and targets for therapeutic drugs. However, the huge amount of data provided by cDNA microarray measurements must be explored in order to answer fundamental questions about gene functions and their interdependence [1], and hopefully to answer questions such as what type of disease is affecting the cells or which genes strongly influence this disease. Questions like these lead to the study of gene classification problems. Many factors may affect the results of the analysis; one of them is the huge number of genes included in the original dataset. Key issues that need to be addressed under such circumstances are the efficient selection of good predictive gene groups from datasets that are inherently noisy, and the development of new methodologies that can enhance the successful classification of these complex datasets. For multiclass cancer classification and discovery, the performance of different discrimination methods, including nearest-neighbor classifiers, linear discriminant analysis, classification trees, and bagging and boosting learning methods, is compared in [2]. This problem has also been studied using partial least squares [3], Bayesian probit regression [4], and iterative classification trees [5]. However, multiclass cancer classification combined with gene selection has not been investigated intensively. In multiclass classification with gene selection, every classification operation is accompanied by a gene selection operation, and this combination is the focus of this paper.

In the past decade, a number of variable (gene) selection methods for two-class classification have been proposed, notably the support vector machine (SVM) method [6], the perceptron method [7], mutual-information-based selection [8], Bayesian variable selection [2, 9, 10, 11, 12], the minimum description length principle for model selection [13], voting techniques [14], and so on. In [6], gene selection using recursive feature elimination based on SVM (SVM-RFE) is proposed. In two-class settings, it has been demonstrated experimentally that the genes selected by this technique yield better classification performance and are more biologically relevant to cancer than those selected by the other methods mentioned in [6], such as feature ranking with correlation coefficients or sensitivity analysis. However, its application to multiclass gene selection has not been reported because of its expensive computational burden; thus, gene preselection is adopted to overcome this shortcoming, and SVM-RFE is a key gene selection method used in our study. As a two-class classification method, the SVM's remarkably robust performance with respect to sparse and noisy data makes it a first choice in a number of applications; its application to cancer diagnosis using gene expression profiles is reported in [15, 16]. In recent years, the binary SVM has been used as a component in many multiclass classification algorithms, such as the binary classification tree and the fuzzy SVM (FSVM). These multiclass classification methods all have excellent performance, which benefits from their roots in the binary SVM and from their own constructions. Accordingly, we propose two differently constructed multiclass classifiers with gene selection: one uses a binary classification tree based on SVM (BCT-SVM) with gene selection, while the other is an FSVM with gene selection.
In this paper, the F test and SVM-RFE are used as our gene selection methods. Three groups of experiments are performed, using FSVM with SVM-RFE, BCT-SVM with SVM-RFE, and BCT-SVM with the F test, respectively. Compared to the methods in [2, 3, 5], our proposed methods can identify which genes are the most important for certain types of cancer. In these experiments, with most of the strongest genes selected, the prediction error rate of our algorithms is extremely low, and FSVM with SVM-RFE shows the best performance of all. The paper is organized as follows. The problem statement is given in "Problem statement." BCT-SVM with gene selection is outlined in "Binary classification tree based on SVM with gene selection." FSVM with gene selection is described in "FSVM with gene selection." Experimental results on breast cancer data, small round blue-cell tumor data, and acute leukemia data are reported in "Experimental results." Analysis and discussion are presented in "Analysis and discussion." "Conclusion" concludes the paper.

PROBLEM STATEMENT

Assume there are K classes of cancers. Let w = [w_1, ..., w_m] denote the class labels of m samples, where w_i = k indicates that sample i is cancer k, k = 1, ..., K. Assume x_1, ..., x_n are n genes, and let x_{ij} be the measurement of the expression level of the jth gene for the ith sample, where j = 1, 2, ..., n. Then X = [x_{ij}] denotes the expression levels of all genes, that is,

\[ X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}. \tag{1} \]

In the two proposed methods, every sample is partitioned by a series of optimal hyperplanes. An optimal hyperplane is one from which the training data are maximally distant and which achieves the lowest classification error rate when used to classify the current training set. These hyperplanes can be modeled as

\[ \omega \cdot X_i + b = 0, \tag{2} \]

and the classification functions are defined as f_{st}(X_i) = \omega \cdot X_i + b, where X_i denotes the ith row of matrix X; s and t denote the two partitions separated by an optimal hyperplane, and what these partitions mean depends on the construction of the multiclass classification algorithm: in a binary classification tree, s and t are the two halves separated at an internal node, which may be the root node or a common internal node; in FSVM, s and t are two arbitrary classes among the K classes. \omega is an n-dimensional weight vector and b is a bias term. The SVM algorithm is used to determine these optimal hyperplanes. SVM is a learning algorithm originally introduced by Vapnik [17, 18] and successively extended by many other researchers. SVMs can work in combination with the "kernel" technique, which automatically performs a nonlinear mapping into a feature space, so that SVMs can also handle nonlinearly separable problems. In SVM training, a convex quadratic programming problem is solved, finally yielding the optimal solutions for \omega and b; detailed solution procedures are found in [17, 18]. Along with each binary classification using SVM, one gene selection operation is done in advance; the specific gene selection methods used in this paper are described briefly in "Experimental results." Performing gene selection before the SVM is trained means that, whenever an SVM is trained or used for prediction, dimensionality reduction is applied to the input data X, keeping only the strongest genes selected. We represent this procedure by the function I(X_i \beta), where \beta is an n × n matrix whose diagonal elements may be equal to 1 or 0 and whose other elements are all equal to 0; genes corresponding to the nonzero diagonal elements are the important ones, and \beta is obtained by a specific gene selection method. The function I(·) selects all nonzero elements of the input vector to construct a new vector; for example, I([1 0 2]) = [1 2]. So (2) is rewritten as

\[ \omega \cdot I(X_i \beta) + b = 0, \tag{3} \]

and the classification functions are rewritten accordingly as f_{st}(X_i) = \omega \cdot I(X_i \beta) + b. To accelerate computation, genes are preselected before the multiclass classifiers are trained. Based on all of the above, we propose two differently constructed multiclass classifiers with gene selection: (1) a binary classification tree based on SVM with gene selection, and (2) FSVM with gene selection.
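To make the gene-selection operator concrete, the following minimal sketch (not from the paper; it assumes NumPy and uses a hypothetical helper name) shows how the diagonal 0/1 matrix β reduces one sample's expression vector to the selected genes, i.e., the effect of the I(·) operation described above.

```python
import numpy as np

def apply_gene_selection(x_row, beta_diag):
    """I(.) from the text: keep only the entries of the sample vector
    whose corresponding diagonal entry of beta equals 1."""
    return x_row[beta_diag.astype(bool)]

# Hypothetical example with 5 genes; the 2nd and 4th are marked important.
x_row = np.array([0.3, 1.2, -0.7, 2.1, 0.05])   # expression levels of one sample
beta_diag = np.array([0, 1, 0, 1, 0])            # diagonal of the selection matrix beta
print(apply_gene_selection(x_row, beta_diag))    # -> [1.2  2.1]
```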

BINARY CLASSIFICATION TREE BASED ON SVM WITH GENE SELECTION

A binary classification tree is an important class of machine-learning algorithms for multiclass classification. We construct a binary classification tree with SVM, which we call BCT-SVM for short. In BCT-SVM, there are K − 1 internal nodes and K terminal nodes. When building the tree, the solution of (3) is searched by SVM at each internal node to separate the data at the current node into the left child node and the right child node, using the appointed gene selection method mentioned in "Experimental results." Which class or classes should be partitioned into the left (or right) child node is decided at each internal node by impurity reduction [19], which is used to find the optimal construction of the classifier; the partition scheme with the largest impurity reduction (IR) is optimal. Here we use the Gini index as our IR measurement criterion, which is also used in classification and regression trees (CARTs) [20] as a measure of class diversity. Denote by M the training dataset at the current node, by M_L and M_R the training datasets at the left and right child nodes, by M^i the sample set of class i in the training set at the current node, by M_L^i and M_R^i the sample sets of class i at the left and right child nodes, and by λ_Θ the number of samples in dataset Θ. The current IR can be calculated as follows, where c is the number of classes at the current node:

\[ IR(M) = \left(1 - \sum_{i=1}^{c} \left(\frac{\lambda_{M^i}}{\lambda_M}\right)^{2}\right) - \frac{\lambda_{M_L}}{\lambda_M}\left(1 - \sum_{i=1}^{c} \left(\frac{\lambda_{M_L^i}}{\lambda_{M_L}}\right)^{2}\right) - \frac{\lambda_{M_R}}{\lambda_M}\left(1 - \sum_{i=1}^{c} \left(\frac{\lambda_{M_R^i}}{\lambda_{M_R}}\right)^{2}\right). \]

When the maximum of IR(M) is found over all potential combinations of classes at the current internal node, which part of the data should be partitioned into the left child node is decided. For the details of constructing a standard binary decision tree, we refer to [19, 20]. After this problem is solved, samples partitioned into the left child node are labeled with −1 and the others are labeled with 1; based on these labels, a binary SVM classifier with gene selection is trained using the data of the two current child nodes. Gene selection is necessary because cancer classification is a typical small-sample, high-dimensional problem, and directly training the classifier with all genes would cause overfitting; here, any gene selection method based on two-class classification can be used to construct β in (3). The process of building the whole tree is recursive, as seen in Figure 1.
Figure 1

Binary classification tree based on SVM with gene selection.
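As a rough illustration of the impurity-reduction criterion used to choose the class partition at each internal node, the sketch below (an illustrative reimplementation under our own assumptions, not the authors' code) computes the Gini-based IR for every candidate assignment of classes to the left child and returns the best one.

```python
import numpy as np
from itertools import combinations

def gini(labels):
    """Gini index: 1 - sum_i p_i^2 over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(labels, left_mask):
    """IR(M) for splitting the node's samples into left/right children."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_class_partition(labels):
    """Exhaustively try sending each subset of classes to the left child
    and keep the partition with the largest impurity reduction."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    best_subset, best_ir = None, -np.inf
    for r in range(1, len(classes)):
        for subset in combinations(classes, r):
            ir = impurity_reduction(labels, np.isin(labels, subset))
            if ir > best_ir:
                best_subset, best_ir = set(subset), ir
    return best_subset, best_ir

# Toy node containing classes 0, 1, and 2; class 2 tends to be split off first.
print(best_class_partition([0, 0, 1, 1, 2, 2, 2, 2]))
```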

When the training data at a node cannot be split any further, that node is identified as a terminal node, and the value obtained from the decision function corresponds to the label of a particular class. Once the tree is built, we can predict the class of a sample using the genes selected by the tree; the trained SVMs route it to a terminal node, which carries its own class label. In the process of building BCT-SVM, K − 1 gene selection operations are performed, because the construction of BCT-SVM contains K − 1 SVMs.

FSVM WITH GENE SELECTION

Unlike BCT-SVM, FSVM has a pairwise construction, which means that the hyperplane between every pair of classes is searched using SVM with gene selection; these processes are modeled by (3). FSVM was first proposed by Abe and Inoue [21, 22] to deal with the unclassifiable regions that arise when the one-versus-rest or pairwise classification method based on binary SVM is applied to problems with more than two classes. FSVM is an improved pairwise classification method with SVM, in which a fuzzy membership function is introduced into the decision function based on pairwise classification. For data in the classifiable regions, FSVM gives the same classification results as the pairwise classification method with SVM, and for data in the unclassifiable regions, FSVM generates better classification results than the pairwise classification method with SVM. In the training process, FSVM is the same as the pairwise classification method with SVM referred to in [23]. In order to describe our proposed algorithm clearly, we define four input variables: the sample matrix X_0 = {x_1, x_2, ..., x_i, ..., x_m}, that is, X_0 is a matrix composed of some columns of the original training dataset X corresponding to the preselected important genes; the class-label vector y = {y_1, y_2, ..., y_i, ..., y_m}; the number of classes in the training set ν; and the number of important genes used in gene selection κ. With these four input variables, the training process of FSVM with gene selection is expressed in Algorithm 1.
Algorithm 1

The FSVM with gene selection training algorithm.

In Algorithm 1, υ = GeneSelection(μ, ·) is the realization of a specific binary gene selection algorithm; υ denotes the genes important for the two specific drawn-out classes and is used to construct β in (3); SVMTrain() is the realization of the binary SVM algorithm; α is a Lagrange multiplier vector, and ϵ is a bias term. γ, alpha, and bias are the output matrices: γ is made up of all important genes selected, in which each row corresponds to the list of important genes selected between two specific classes; alpha is a matrix in which each row corresponds to the Lagrange multiplier vector of an SVM classifier trained between two specific classes; and bias is the vector made up of the bias terms of these SVM classifiers. In this process, K(K − 1)/2 SVMs are trained and K(K − 1)/2 gene selection operations are executed, which means that many important genes relevant to two specific classes of samples will be selected. Based on the K(K − 1)/2 optimal hyperplanes and the strongest genes selected, the decision function is constructed from (3). Define f_{ts}(X) = −f_{st}(X) (s ≠ t); the fuzzy membership function m_{st}(X) is introduced on the directions orthogonal to f_{st}(X) = 0 as

\[ m_{st}(X) = \begin{cases} 1, & f_{st}(X) \ge 1, \\ f_{st}(X), & \text{otherwise}. \end{cases} \]

Using m_{st}(X) (t ≠ s, t = 1, ..., K), the class s membership function of X is defined as m_s(X) = min_{t ≠ s} m_{st}(X), which is equivalent to m_s(X) = min(1, min_{t ≠ s} f_{st}(X)); an unknown sample X is then classified by arg max_{s = 1, ..., K} m_s(X).
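The decision rule just described can be sketched as follows; this is an illustrative reimplementation (not the authors' code) that assumes the K(K − 1)/2 pairwise decision values f_{st}(X) for one sample have already been computed and stored in an antisymmetric K × K array.

```python
import numpy as np

def fsvm_classify(f):
    """FSVM decision for one sample.

    f: K x K array with f[s, t] = f_st(X), the decision value of the SVM
    trained between classes s and t, and f[t, s] = -f[s, t]; the diagonal
    is ignored. Class membership is m_s(X) = min(1, min_{t != s} f[s, t]),
    and the predicted class is argmax_s m_s(X).
    """
    K = f.shape[0]
    memberships = np.array([
        min(1.0, min(f[s, t] for t in range(K) if t != s))
        for s in range(K)
    ])
    return int(np.argmax(memberships)), memberships

# Hypothetical 3-class example: class 1 wins both of its pairwise comparisons.
f = np.array([[ 0.0, -0.4,  0.7],
              [ 0.4,  0.0,  1.3],
              [-0.7, -1.3,  0.0]])
print(fsvm_classify(f))   # predicted class 1
```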

EXPERIMENTAL RESULTS

The F test and SVM-RFE are the gene selection methods used in our experiments. In the F test, the ratio

\[ R(j) = \frac{\sum_{i=1}^{m}\sum_{k=1}^{K} 1(w_i = k)\,(\bar{x}_{kj} - \bar{x}_{j})^{2}}{\sum_{i=1}^{m}\sum_{k=1}^{K} 1(w_i = k)\,(x_{ij} - \bar{x}_{kj})^{2}} \]

is used to select genes, where \bar{x}_{j} denotes the average expression level of gene j across all samples and \bar{x}_{kj} denotes the average expression level of gene j across the samples belonging to class k; the indicator function 1(·) is equal to one if its argument is true and zero otherwise. Genes with larger R(j) are selected. From the expression of R(j), it can be seen that the F test can select genes among multiple (more than two) classes [14]. SVM-RFE is recursive feature elimination based on SVM: an iterative procedure that eliminates features while training an SVM classifier, where each elimination step consists of three parts: (1) train the SVM classifier, (2) compute the ranking criterion for all features, and (3) remove the feature(s) with the smallest ranking scores, the ranking criteria all being derived from the decision function of the SVM. Since a linear-kernel SVM is used as the classifier between two specific classes s and t, the square of each element of the weight vector ω in (2) is used as the score evaluating the contribution of the corresponding gene, and the genes with the smallest scores are eliminated; details are given in [6]. To speed up the computation, gene preselection is generally used: on every dataset, the 200 most important genes are preselected by the F test before the multiclass classifiers with gene selection are trained. Note that the F test requires normality of the data to be efficient, which is not always the case for gene expression data; this is exactly why we do not rely on the F test alone to select genes. Since the P values of important genes are relatively low, their F test scores should be relatively high. Considering that the number of important genes is typically on the order of tens, we preselect 200 genes, based on our experience, in order to avoid losing important ones; the following experiments show that this procedure works effectively. Combining these two gene selection methods with the multiclass classification methods, we propose three algorithms: (1) BCT-SVM with the F test, (2) BCT-SVM with SVM-RFE, and (3) FSVM with SVM-RFE. As in [4, 9], every algorithm is tested with leave-one-out cross-validation based on the top 5, top 10, and top 20 genes selected by its own gene selection method.
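For concreteness, here is a minimal sketch of the SVM-RFE loop described above between two classes, written with scikit-learn's linear SVM; the function name, parameters, and elimination step size are our own choices, not taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep, n_drop_per_step=1):
    """Recursive feature elimination with a linear SVM (binary labels y):
    train, score each remaining gene by its squared weight w_j^2, remove
    the lowest-scoring gene(s), and repeat until n_keep genes remain.
    Returns the indices of the kept columns of X."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
        scores = (clf.coef_ ** 2).ravel()                 # ranking criterion w_j^2
        n_drop = min(n_drop_per_step, len(remaining) - n_keep)
        worst = np.argsort(scores)[:n_drop]               # smallest scores eliminated
        remaining = np.delete(remaining, worst)
    return remaining
```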

Breast cancer dataset

In our first experiment, we focus on the hereditary breast cancer data, which can be downloaded from the web page of the original paper [24]. In [24], cDNA microarrays are used in conjunction with classification algorithms to show the feasibility of using differences in global gene expression profiles to separate BRCA1 and BRCA2 mutation-positive breast cancers. Twenty-two breast tumor samples from 21 patients were examined: 7 BRCA1, 8 BRCA2, and 7 sporadic. There are 3226 genes for each tumor sample. We use our methods to classify BRCA1, BRCA2, and sporadic tumors. The ratio data are truncated from below at 0.1 and from above at 20. Table 1 lists the top 20 strongest genes selected by our methods. (For readability, we sometimes use the gene index number in the database [24] instead of the clone ID.) The clone ID and gene description of a typical column of the top 20 genes selected by SVM-RFE are listed in Table 2; more information about all selected genes corresponding to the lists in Table 1 can be found at http://www.sensornet.cn/fxia/top_20_genes.zip. It can be seen that gene 1008 (keratin 8) is selected by all three methods. This gene is also an important gene listed in [4, 7, 9]. Keratin 8 is a member of the cytokeratin family of genes; cytokeratins are frequently used to identify breast cancer metastases by immunohistochemistry [24]. Gene 10 (phosphofructokinase, platelet) and gene 336 (transducer of ERBB2, 1) are also important genes listed in [7]. Gene 336 is selected by FSVM with SVM-RFE and BCT-SVM with SVM-RFE; gene 10 is selected by FSVM with SVM-RFE.
Table 1

Index numbers of the strongest genes selected in the hereditary breast cancer dataset.

No    FSVM with SVM-RFE (1  2  3)    BCT-SVM with F test (1  2)    BCT-SVM with SVM-RFE (1  2)

11008185942250111487501999
29551008288629848388603009
3147910343310418591008158
428703365014222724222761
553815892297710082804247
6336199930042578117918361859
7315424717093010106530041148
82259144675028042423420838
97397392299335199917091628
10289312003412456269930651068
1181628861836111612772977819
122804276121926810685851797
13150316581567509631475336
145855602867229415832172893
15162083831041566095012219
1618152300141222991417146585
1730655383217271511903431008
18315549829772753221914172886
19128880916122979560229936
20234210922804242824722941446
Table 2

A part of the strongest genes selected in the hereditary breast cancer dataset (the first column of genes in Table 1).

Rank    Index no    Clone ID    Gene description

11008897781Keratin 8
2955950682Phosphofructokinase, platelet
31479841641Cyclin D1 (PRAD1: parathyroid adenomatosis 1)
4287082991Phosphodiesterase I/nucleotide pyrophosphatase 1
(homologous to mouse Ly-41 antigen)
5538563598Human GABA-A receptor π subunit mRNA, complete cds
6336823940Transducer of ERBB2, 1
73154135118GATA-binding protein 3
82259814270Polymyositis/scleroderma autoantigen 1 (75kd)
9739214068GATA-binding protein 3
10289332790mutS (E coli) homolog 2 (colon cancer, nonpolyposis type 1)
11816123926Cathepsin K (pycnodysostosis)
12280451209Protein phosphatase 1, catalytic subunit, beta isoform
131503838568Cytochrome c oxidase subunit VIc
14585293104Phytanoyl-CoA hydroxylase (Refsum disease)
151620137638ESTs
161815141959Homo sapiens mRNA; cDNA DKFZp566J2446
(from clone DKFZp566J2446)
173065199381ESTs
183155136769TATA box binding protein (TBP)-associated factor,
RNA polymerase II, A, 250kd
191288564803Forkhead (drosophila)-like 16
202342284592Platelet-derived growth factor receptor, alpha polypeptide
Using the top 5, 10, and 20 genes for each of the three methods, the recognition accuracy is shown in Table 3. When using the top 5 genes for classification, there is one error for BCT-SVM with the F test and no error for the other two methods. When using the top 10 and top 20 genes, there is no error for any of the three methods. Note that the performance of our methods is similar to that in [4], where the authors diagnosed the tumor types using a multinomial probit regression model with Bayesian gene selection; using the top 10 genes, they also obtained zero misclassifications.
Table 3

Classifiers' performance on the hereditary breast cancer dataset by cross-validation (number of misclassified samples in the leave-one-out test).

Classification method       Top 5    Top 10    Top 20

FSVM with SVM-RFE             0        0         0
BCT-SVM with F test           1        0         0
BCT-SVM with SVM-RFE          0        0         0

Small round blue-cell tumors

In this experiment, we consider the small round blue-cell tumors (SRBCTs) of childhood, which include neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and Ewing sarcoma (EWS) [25]. The dataset of the four cancers is composed of 2308 genes and 63 samples, where NB has 12 samples, RMS has 23 samples, NHL has 8 samples, and EWS has 20 samples. We use our methods to classify the four cancers. The ratio data are truncated from below at 0.01. Table 4 lists the top 20 strongest genes selected by our methods. The clone ID and gene description of a typical column of the top 20 genes selected by SVM-RFE are listed in Table 5; more information about all selected genes corresponding to the lists in Table 4 can be found at http://www.sensornet.cn/fxia/top_20_genes.zip. It can be seen that gene 244 (clone ID 377461), gene 2050 (clone ID 295985), and gene 1389 (clone ID 770394) are selected by all three methods, and these genes are also important genes listed in [25]. Gene 255 (clone ID 325182), gene 107 (clone ID 365826), and gene 1 (clone ID 21652, catenin alpha 1), selected by BCT-SVM with SVM-RFE and FSVM with SVM-RFE, are also listed in [25] as important genes.
Table 4

Index numbers of the strongest genes selected in the small round blue-cell tumors dataset.

No    FSVM with SVM-RFE (1  2  3  4  5  6)    BCT-SVM with F test (1  2  3)    BCT-SVM with SVM-RFE (1  2  3)

1246255195485118716011074169422545174851
2138986717088465098422461055109913891353846
38512461955191521621955170833875820508421915
417501389509160110725513894221387131918841601
5107842205074275820461954173876116131003742
621982050545191620461764607135312310037071916
7205036513892144219850916138008424619552144
821627422046219820226031645714188886720462198
960710734814271606707131975895119542551427
1019809761291169174566910160616451691
1156713195661066113533682047191411108191066
12202219912468671915169132721621634368509867
131626819120778878810032442227867129166788
1419162511003153188674254520497833481207153
15544236368198055422031888188421683656031980
16164519541105219913531072050195516011077962199
17142717081158783338719430120733517081764783
18170810841645143484616636532610841877191434
19230356613197991884188417727968361626107799
20256111017991886223519801298230849177222031886
Table 5

A part of the strongest genes selected in the small round blue-cell tumors dataset (the first column of genes in Table 4).

Rank    Index no    Clone ID    Gene description

1246377461Caveolin 1, caveolae protein, 22kd
21389770394Fc fragment of IgG, receptor, transporter, alpha
3851563673Antiquitin 1
41750233721Insulin-like growth factor binding protein 2 (36kd)
5107365826Growth arrest-specific 1
62198212542H sapiens mRNA; cDNA DKFZp586J2118
(from clone DKFZp586J2118)
72050295985ESTs
82162308163ESTs
9607811108Thyroid hormone receptor interactor 6
101980841641Cyclin D1 (PRAD1: parathyroid adenomatosis 1)
11567768370tissue inhibitor of metalloproteinase 3
(Sorsby fundus dystrophy, pseudoinflammatory)
122022204545ESTs
131626811000Lectin, galactoside-binding, soluble, 3 binding
protein (galectin 6 binding protein)
14191680109Major histocompatibility complex, class II, DQ alpha 1
155441416782Creatine kinase, brain
16164552076Olfactomedin-related ER localized protein
171427504791Glutathione S-transferase A4
18170843733Glycogenin 2
192303782503H sapiens clone 23716 mRNA sequence
20256154472Fibroblast growth factor receptor 1
(fms-related tyrosine kinase 2, Pfeiffer syndrome)
Using the top 5, 10, and 20 genes for each of the three methods, the recognition accuracy is shown in Table 6. When using the top 5 genes for classification, there is one error for BCT-SVM with the F test and no error for the other two methods. When using the top 10 and top 20 genes, there is no error for any of the three methods.
Table 6

Classifiers' performance on the small round blue-cell tumors dataset by cross-validation (number of misclassified samples in the leave-one-out test).

Classification method       Top 5    Top 10    Top 20

FSVM with SVM-RFE             0        0         0
BCT-SVM with F test           1        0         0
BCT-SVM with SVM-RFE          0        0         0
In [26], Yeo et al applied k-nearest neighbor (kNN), weighted voting, and linear SVM in a one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem. Using the top 5, top 10, or top 20 genes, kNN, weighted voting, and SVM combined with each of the three feature selection methods, without rejection, all have two or more errors. In [27], Lee et al used a multicategory SVM with gene selection; using the top 20 genes, they also achieved zero misclassifications.

Acute leukemia data

We have also applied the proposed methods to the leukemia data of [14], which is available at http://www.sensornet.cn/fxia/top_20_genes.zip. The microarray data contain 7129 human genes sampled from 72 cases of cancer, of which 38 are of type B-cell ALL, 9 are of type T-cell ALL, and 25 are of type AML. The data are preprocessed as recommended in [2]: gene values are truncated from below at 100 and from above at 16 000; genes having a ratio of the maximum over the minimum less than 5 or a difference between the maximum and the minimum less than 500 are excluded; and finally the base-10 logarithm is applied to the 3571 remaining genes. Here we study the 38 samples in the training set, which is composed of 19 B-cell ALL, 8 T-cell ALL, and 11 AML samples. Table 7 lists the top 20 strongest genes selected by our methods. The gene accession number and gene description of a typical column of the top 20 genes selected by SVM-RFE are listed in Table 8; more information about all selected genes corresponding to the lists in Table 7 can be found at http://www.sensornet.cn/fxia/top_20_genes.zip. It can be seen that gene 1882 (CST3 cystatin C (amyloid angiopathy and cerebral hemorrhage)), gene 4847 (zyxin), and gene 4342 (TCF7 transcription factor 7 (T-cell specific)) are selected by all three methods. Of these three genes, the first two are among the most important genes listed in many studies. Gene 2288 (DF D component of complement (adipsin)) is another important gene with biological significance, and it is selected by FSVM with SVM-RFE.
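The preprocessing recipe above can be sketched as follows; this is a hypothetical helper (name and array layout are our assumptions, thresholds taken from the text) that expects a samples-by-genes NumPy array.

```python
import numpy as np

def preprocess_leukemia(X):
    """Clip expression values to [100, 16000], drop genes whose max/min
    ratio is below 5 or whose max - min difference is below 500, then
    apply the base-10 logarithm to the remaining genes."""
    X = np.clip(X, 100, 16000)                       # truncate below/above
    gene_max, gene_min = X.max(axis=0), X.min(axis=0)
    keep = (gene_max / gene_min >= 5) & (gene_max - gene_min >= 500)
    return np.log10(X[:, keep]), keep                # filtered, log-transformed data
```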
Table 7

Index numbers of the strongest genes selected in the acute leukemia dataset.

No    FSVM with SVM-RFE (1  2  3)    BCT-SVM with F test (1  2)    BCT-SVM with SVM-RFE (1  2)

16696188266062335434218824342
26606468066964680405066964050
34342620146802642120755525808
41694228843421882651063781106
51046620067896225405238473969
6177976043184318405553001046
76200233518935300110626426606
8618075816945554126824026696
96510264243795688484733322833
10189324022215758554316851268
114050621833324913104641774847
124379637639694082283366066510
131268630865106573435739692215
144375177923356974437563081834
15484761856168649760417604535
166789408220101078623623351817
172288637811062995669620104375
181106484753005442163065735039
192833530040822215618045864379
206539168510464177410722155300
Table 8

A part of the strongest genes selected in the acute leukemia dataset (the second column of genes in Table 7).

Rank    Index no    Gene accession number    Gene description

11882M27891_atCST3 cystatin C (amyloid angiopathy and cerebral hemorrhage)
24680X82240_rna1_atTCL1 gene (T-cell leukemia) extracted from H sapiens
mRNA for T-cell leukemia/lymphoma 1
36201Y00787_s_atInterleukin-8 precursor
42288M84526_atDF D component of complement (adipsin)
56200M28130_rna1_s_atInterleukin-8 (IL-8) gene
6760D88422_atCystatin A
72335M89957_atIGB immunoglobulin-associated beta (B29)
8758D88270_atGB DEF = (lambda) DNA for immunoglobin light chain
92642U05259_rna1_atMEF2C MADS box transcription enhancer factor 2,
polypeptide C (myocyte enhancer factor 2C)
102402M96326_rna1_atAzurocidin gene
116218M27783_s_atELA2 elastase 2, neutrophil
126376M83652_s_atPFC properdin P factor, complement
136308M57731_s_atGRO2 GRO2 oncogene
141779M19507_atMPO myeloperoxidase
156185X64072_s_atSELL leukocyte adhesion protein beta subunit
164082X05908_atANX1 annexin I (lipocortin I)
176378M83667_rna1_s_atNF-IL6-beta protein mRNA
184847X95735_atZyxin
195300L08895_atMEF2C MADS box transcription enhancer factor 2,
polypeptide C (myocyte enhancer factor 2C)
201685M11722_atTerminal transferase mRNA
Using the top 5, 10, and 20 genes for each of the three methods, the recognition accuracy is shown in Table 9. When using the top 5 genes for classification, there is one error for FSVM with SVM-RFE and two errors each for BCT-SVM with SVM-RFE and BCT-SVM with the F test. When using the top 10 genes, there is no error for FSVM with SVM-RFE, two errors for BCT-SVM with SVM-RFE, and four errors for BCT-SVM with the F test. When using the top 20 genes, there is one error for FSVM with SVM-RFE, two errors for BCT-SVM with SVM-RFE, and two errors for BCT-SVM with the F test. Again, the performance of our methods is similar to that in [4], where the authors diagnosed the tumor types using a multinomial probit regression model with Bayesian gene selection; using the top 10 genes, they also obtained zero misclassifications.
Table 9

Classifiers' performance on the acute leukemia dataset by cross-validation (number of misclassified samples in the leave-one-out test).

Classification method       Top 5    Top 10    Top 20

FSVM with SVM-RFE             1        0         1
BCT-SVM with F test           2        4         2
BCT-SVM with SVM-RFE          2        1         2

ANALYSIS AND DISCUSSION

According to Tables 1–9, many important genes are selected by these three multiclass classification algorithms with gene selection, and based on these selected genes, the prediction error rates of the three algorithms are low. Comparing the results of the three algorithms, we consider that FSVM with SVM-RFE generates the best results. BCT-SVM with SVM-RFE and BCT-SVM with the F test have the same multiclass classification structure; the results of BCT-SVM with SVM-RFE are better than those of BCT-SVM with the F test because their gene selection methods differ, and a better gene selection method combined with the same multiclass classification method performs better. This means that SVM-RFE is better than the F test when combined with multiclass classification methods; the results are similar to those reported in [6], in which the two gene selection methods are combined with two-class classification methods. FSVM with SVM-RFE and BCT-SVM with SVM-RFE use the same gene selection method; the results of FSVM with SVM-RFE are better than those of BCT-SVM with SVM-RFE, both in gene selection and in recognition accuracy, because the constructions of their multiclass classification methods differ, which can be explained in two respects. (1) More genes are selected by FSVM with SVM-RFE than by BCT-SVM with SVM-RFE: in FSVM there are K(K − 1)/2 gene selection operations, one between every two classes, whereas in BCT-SVM there are only K − 1 gene selection operations. (2) FSVM is an improved pairwise classification method, in which the unclassifiable regions are handled by the fuzzy membership function [21, 22]. Therefore, FSVM with SVM-RFE is considered the best of the three.

CONCLUSION

In this paper, we have studied the problem of multiclass cancer classification with gene selection from gene expression data. We proposed two differently constructed classifiers with gene selection: FSVM with gene selection and BCT-SVM with gene selection. The F test and SVM-RFE are used as the gene selection methods combined with the multiclass classification methods. In our experiments, three algorithms (FSVM with SVM-RFE, BCT-SVM with SVM-RFE, and BCT-SVM with the F test) are tested on three datasets (the hereditary breast cancer data, the small round blue-cell tumor data, and the acute leukemia data). The results of these three groups of experiments show that more important genes are selected by FSVM with SVM-RFE, and that with these selected genes it achieves higher prediction accuracy than the other two algorithms. Compared to some existing multiclass cancer classifiers with gene selection, FSVM based on SVM-RFE also performs very well. Finally, an explanation of the experimental results of this study is provided.
REFERENCES

1. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000.

2. Zhou X, Wang X, Dougherty ER. Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics. 2003.

3. Jörnsten R, Yu B. Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics. 2003.

4. Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK. Gene selection: a Bayesian variable selection approach. Bioinformatics. 2003.

5. Kim S, Dougherty ER, Barrera J, Chen Y, Bittner ML, Trent JM. Strong feature sets from small samples. J Comput Biol. 2002.

6. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J, Raffeld M, Yakhini Z, Ben-Dor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G. Gene-expression profiles in hereditary breast cancer. N Engl J Med. 2001.

7. Zhou X, Wang X, Dougherty ER. Multi-class cancer classification using multinomial probit regression with Bayesian gene selection. Syst Biol (Stevenage). 2006.

8. Zhang H, Yu CY, Singer B, Xiong M. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci U S A. 2001.

9. Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001.

10. Nguyen DV, Rocke DM. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics. 2002.