
SDED: a novel filter method for cancer-related gene selection.

Wenlong Xu, Minghui Wang, Xianghua Zhang, Lirong Wang, Huanqing Feng.

Abstract

Gene selection aims to detect the genes that are most significantly differentially expressed under different conditions in expression data. The current challenge in gene selection is comparing a large number of genes using a limited number of patient samples, which makes it a non-trivial task for simple statistical analysis. Filter methods applied in gene selection studies adopt various statistical measurements, and the ability of these measurements to discriminate phenotypes is crucial for classification and selection. Here we describe the standard deviation error distribution (SDED) method for gene selection. It utilizes within-class and among-class variation in gene expression data. We tested the method on 4 leukemia datasets available in the public domain and compared it with the GS2 and CHO methods. The prediction accuracies of SDED are better than those of both GS2 and CHO across the datasets, by 0.8-4.2% and 1.6-8.4%, respectively. Analyses of the related OMIM annotations and KEGG pathways showed that SDED picks out 4.0% and 6.1% more genes with biological significance than GS2 and CHO, respectively.


Keywords: SDED; filter method; gene selection; support vector machine

Year: 2008 PMID: 18478083 PMCID: PMC2374374 DOI: 10.6026/97320630002301

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

DNA microarray technology has enabled biologists to associate phenotypes with molecular genetics [1,2]. It is commonly used to compare gene expression levels between phenotypes (for example, normal versus cancer), and it allows the expression of thousands of genes to be studied simultaneously. The difficulty lies in interpreting the expression data. Genes whose expression differs significantly across the sample set are selected using sound statistical techniques, and these discriminatory genes help to classify different cancer subtypes [3,4]. There are two categories of gene selection strategies, namely filter and wrapper methods [1]. Many filter methods have been proposed that eliminate redundant genes. Golub et al. [5] (1999) provided a signal-to-noise statistic for binary classification. Baldi and Long [6] (2001) proposed a multivariate test statistic to identify differentially expressed gene combinations. Cho et al. [7] (2003) used a new statistic that considers within-class variation (CHO). Yang et al. [8] (2006) proposed a stable gene selection method for microarray data analysis (GS2). In wrapper methods, genes are tested in groups according to their performance in the classification model. Xiong et al. [9] suggested a method that selects genes by searching the space of feature subsets using classification errors. Guyon et al. [10] proposed a gene selection approach utilizing Support Vector Machines (SVM) based on recursive feature elimination. Both categories of gene selection strategies have their disadvantages. Although GS2 is a stable method, its calculations are complex and its results are difficult to annotate with biological meaning. The CHO method considers within-class information but loses the among-class information. For wrapper methods, the space of feature subsets grows exponentially with the number of genes, so wrappers are computationally intractable for high-dimensional gene data [1].
Their inherent linear nature is a further disadvantage, making it difficult to identify important genes with wrapper methods [11]. Here, we propose a statistical measurement that better scores genes with subtle expression patterns by incorporating the within-class and among-class variations in gene expression data.
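To make the idea of within-class and among-class variation concrete, the sketch below scores one gene by the ratio of its among-class sum of squares to its within-class sum of squares. This is a generic illustration and not the SDED formula, which is not given in this section; the function name `variation_ratio` and the toy expression values are assumptions made for the example.

```python
from statistics import mean

def variation_ratio(values, labels):
    """Among-class over within-class sum of squares for one gene.

    Generic illustration only; this is NOT the SDED statistic.
    """
    overall = mean(values)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    # Among-class: how far each class mean lies from the overall mean.
    among = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    # Within-class: spread of the samples around their own class mean.
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return among / within if within > 0 else float("inf")

# Toy data (hypothetical): six samples, two classes, two genes.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
gene_a = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]  # classes well separated
gene_b = [1.0, 3.0, 2.0, 1.5, 2.5, 2.0]  # identical class means

scores = {
    "gene_a": variation_ratio(gene_a, labels),
    "gene_b": variation_ratio(gene_b, labels),
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A filter method of this kind ranks genes by such a score and keeps the top x genes; a statistic that uses only the within-class term would not distinguish `gene_a` from a gene with equally tight but overlapping classes.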

Methodology

Datasets

MLL dataset

We used the MLL dataset from the KORSMEYER Laboratory [12], which contains 72 samples in three classes: (1) acute lymphoblastic leukaemia (ALL); (2) acute myeloid leukaemia (AML); and (3) mixed-lineage leukaemia (MLL), with 24, 28 and 20 samples, respectively. Each sample contains 12,582 gene expression values.

ALL-AML dataset

The ALL-AML dataset was obtained from the cancer program of the BROAD Institute [13]. It consists of expression profiles of 7129 genes for two types of acute leukaemia: (1) acute lymphoblastic leukaemia (ALL, 47 samples) and (2) acute myeloblastic leukaemia (AML, 25 samples). The ALL samples are of B-cell (ALL-B, 38 samples) and T-cell (ALL-T, 9 samples) origin, and the AML samples are from bone marrow (AML-BM, 21 samples) and peripheral blood (AML-PB, 4 samples). Because each class splits into two subgroups, the dataset can be treated both as a three-class dataset (ALL-B, ALL-T and AML) and as a four-class dataset (ALL-B, ALL-T, AML-BM and AML-PB). Here, the three-class version is referred to as ALL-AML-3 and the four-class version as ALL-AML-4.
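The relation between the two labelings can be sketched as follows, using only the sample counts stated above. The helper name `to_three_class` is hypothetical; the snippet simply merges the two AML subgroups and checks that the class sizes add up.

```python
from collections import Counter

# Subgroup sizes as stated in the text.
subtype_counts = {"ALL-B": 38, "ALL-T": 9, "AML-BM": 21, "AML-PB": 4}

# Four-class labels (ALL-AML-4): one label per sample.
four_class = [s for s, n in subtype_counts.items() for _ in range(n)]

def to_three_class(label):
    """Merge AML-BM and AML-PB into a single AML class (ALL-AML-3)."""
    return "AML" if label.startswith("AML") else label

three_class = [to_three_class(s) for s in four_class]

counts4 = Counter(four_class)
counts3 = Counter(three_class)
print(len(four_class), dict(counts3))
```

The smallest class in ALL-AML-4 (AML-PB) has only 4 samples, which is worth keeping in mind when reading accuracies for the four-class version.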

ALL dataset

The ALL dataset from St. Jude Children's Research Hospital [14] contains 248 samples in six ALL subtypes: (a) TEL, (b) Hyper, (c) T, (d) E2A, (e) MLL, and (f) BCR, with 79, 64, 43, 27, 20 and 15 samples, respectively. Every sample contains 12,625 gene expression values.

Data normalization

These 4 datasets (MLL, ALL-AML-3, ALL-AML-4 and ALL) were used in the analyses. Each sample was normalized to the standard normal distribution N(0,1) before scoring for gene selection; that is, the expression of each gene was normalized based on the expression levels within its sample.
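A minimal sketch of this per-sample normalization is given below: each sample's expression values are shifted to zero mean and scaled to unit standard deviation. The use of the population standard deviation and the handling of a constant sample are assumptions, since the text does not specify them, and the function name `normalize_sample` is hypothetical.

```python
from statistics import mean, pstdev

def normalize_sample(expression):
    """Z-score one sample's expression values to mean 0 and std 1.

    Assumption: population standard deviation (pstdev) is used.
    """
    mu = mean(expression)
    sigma = pstdev(expression)
    if sigma == 0:
        # A constant sample carries no information; map it to zeros.
        return [0.0 for _ in expression]
    return [(x - mu) / sigma for x in expression]

# Toy sample (hypothetical values) with five genes.
sample = [120.0, 80.0, 100.0, 140.0, 60.0]
z = normalize_sample(sample)
print([round(v, 3) for v in z])
```

Normalizing within each sample makes expression values comparable across arrays before any gene is scored.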

SVM classifier

SVM is a powerful and popular machine-learning method that has been widely used in biological classification. The key idea of SVM is to maximize the margin separating two classes while minimizing the total classification error. A number of kernels can be used in SVM models to compute the decision plane; we chose the radial basis function (RBF) kernel. For the multi-class SVM classifier, we used the one-versus-one method. The final prediction was made by voting: a sample is assigned to the class that receives the maximum number of votes. If more than one class receives the same maximum vote, the classifier makes a random prediction. Because proper parameter selection is very important for SVM, the grid search strategy by Chih-Jen Lin [15] was used to find the best combination of parameters for each prediction process. The SVM toolkit we used in MATLAB was LIBSVM version 2.82 [15].
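The one-versus-one voting rule can be sketched independently of the SVM itself. In the example below, `pairwise_decide` stands in for a trained binary classifier (in the paper, an RBF-kernel SVM from LIBSVM); the toy decider `nearest_prototype` and its prototype values are hypothetical. Breaking ties by choosing at random among the tied classes is an assumption about how the random prediction is made.

```python
import random
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, pairwise_decide, sample, rng=random):
    """Run every pairwise classifier and return the class with most votes.

    pairwise_decide(a, b, sample) must return either a or b.
    Ties for the maximum vote are broken at random (assumption).
    """
    votes = Counter({c: 0 for c in classes})
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, sample)] += 1
    top = max(votes.values())
    tied = [c for c in classes if votes[c] == top]
    return tied[0] if len(tied) == 1 else rng.choice(tied)

# Stand-in for trained binary SVMs: pick the class whose (hypothetical)
# prototype value is closer to a one-dimensional sample.
prototypes = {"ALL": 0.0, "AML": 5.0, "MLL": 10.0}

def nearest_prototype(a, b, sample):
    return a if abs(sample - prototypes[a]) <= abs(sample - prototypes[b]) else b

prediction = one_vs_one_predict(list(prototypes), nearest_prototype, 4.2)
print(prediction)  # AML wins two of the three pairwise contests
```

With k classes this scheme trains k(k-1)/2 binary classifiers, so the six-class ALL dataset needs 15 of them.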

Discussion

Samples are first divided into testing and training data for each dataset. We used the training samples for scoring the genes. The quality of the top ranked x genes is assessed on two aspects: (1) classification accuracy; and (2) relevance to inheritance or disease association in related pathways.

Classification accuracies

We used the top ranked genes selected by a gene selection method, together with their expression values in the training dataset, to build a classifier for each testing sample. We defined the classification accuracy as the percentage of correct decisions made by the classifier on the testing samples. We adopted the SVM classifier to compare the performance of SDED with that of GS2 and CHO. The classification accuracy was obtained through leave-one-out cross-validation (LOO_CV): one sample was taken as testing data and the remaining samples were used as training data. This was done for all samples and for every number x of top ranked genes (x from 1 to 100, with p < 0.01) in the datasets. Figure 1 shows the classification accuracy of the SVM classifier based on SDED, GS2 and CHO on the MLL dataset. On the MLL, ALL-AML-3 and ALL-AML-4 datasets, SDED achieved 97.222%/48, 98.611%/16 and 97.222%/57 (given as accuracy/number of genes), respectively, which is better than GS2 (94.444%/91, 97.222%/48, 93.056%/36) and CHO (88.889%/82, 95.833%/74, 93.056%/69), in most cases with fewer genes. On the ALL dataset, the performance of SDED (98.387%/96) was only comparable with that of GS2 (97.581%/68) and CHO (96.774%/87). In summary, the SDED filter method achieved classification accuracies about 0.8-4.2% and 1.6-8.4% higher than GS2 and CHO, respectively.
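The leave-one-out procedure can be sketched as a loop that holds out each sample in turn. To keep the example self-contained, a nearest-centroid rule is swapped in for the SVM used in the paper; the function names and the toy data are hypothetical.

```python
from statistics import mean

def nearest_centroid_fit_predict(train_x, train_y, test_x):
    """Stand-in classifier (NOT the paper's SVM): nearest class centroid."""
    by_class = {}
    for x, y in zip(train_x, train_y):
        by_class.setdefault(y, []).append(x)
    centroids = {y: [mean(col) for col in zip(*rows)]
                 for y, rows in by_class.items()}

    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(test_x, c))

    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def loo_cv_accuracy(samples, labels, fit_predict):
    """Percentage of held-out samples that are classified correctly."""
    correct = 0
    for i in range(len(samples)):
        # Hold out sample i; train on all the others.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / len(samples)

# Toy data (hypothetical): two selected genes, six samples, two classes.
samples = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1],
           [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
accuracy = loo_cv_accuracy(samples, labels, nearest_centroid_fit_predict)
print(accuracy)
```

With 72 samples, each misclassified sample lowers the accuracy by about 1.389 percentage points, which matches values such as 97.222% (70 of 72 correct) and 98.611% (71 of 72 correct).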

Figure 1

Classification accuracy by SDED, GS2 and CHO on MLL dataset.

Biological meaning

We examined the selected genes and their associations in pathways to demonstrate the biological significance of gene selection. The top 100 ranked genes were chosen (p < 0.01) for each method and dataset. The numbers of these genes found in OMIM (Online Mendelian Inheritance in Man) and KEGG Pathways are listed in Table 1 (see supplementary material). SDED selected more such genes than the other methods in the ALL-AML-3, ALL-AML-4 and ALL datasets. Overall, it selected about 4.0% (570/800 versus 538/800) and 6.1% (570/800 versus 521/800) more genes with biological significance than GS2 and CHO, respectively.
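The two percentages follow directly from the counts quoted in the text, as the short calculation below shows. The denominator of 800 is taken from the text as given; how it is composed is not stated in this section. The function name `gain_in_points` is hypothetical.

```python
# Counts of biologically annotated genes, as quoted in the text.
total = 800
found = {"SDED": 570, "GS2": 538, "CHO": 521}

def gain_in_points(method, baseline):
    """Difference between two methods, in percentage points of the total."""
    return 100.0 * (found[method] - found[baseline]) / total

gain_gs2 = gain_in_points("SDED", "GS2")  # (570 - 538) / 800 = 4.0
gain_cho = gain_in_points("SDED", "CHO")  # (570 - 521) / 800 = 6.125
print(gain_gs2, gain_cho)
```

The gains are therefore differences in percentage points of the 800 total, not relative increases over the GS2 or CHO counts.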

Conclusion

In this paper, we described an effective gene selection method named SDED. The method was tested on 4 leukaemia datasets and compared with the GS2 and CHO methods. SDED achieved 0.8-4.2% and 1.6-8.4% better classification accuracy than GS2 and CHO, respectively. The related OMIM annotation and KEGG pathway analyses showed that SDED can pick out more genes with biological significance.

10 in total

1. Biomarker identification by feature wrappers.

Authors: M Xiong; X Fang; J Zhao
Journal: Genome Res Date: 2001-11 Impact factor: 9.043

2. A Bayesian framework for the analysis of microarray expression data: regularized t -test and statistical inferences of gene changes.

Authors: P Baldi; A D Long
Journal: Bioinformatics Date: 2001-06 Impact factor: 6.937

3. Singular value decomposition for genome-wide expression data processing and modeling.

Authors: O Alter; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 2000-08-29 Impact factor: 11.205

4. New gene selection method for classification of cancer subtypes considering within-class variation.

Authors: Ji-Hoon Cho; Dongkwon Lee; Jin Hyun Park; In-Beum Lee
Journal: FEBS Lett Date: 2003-09-11 Impact factor: 4.124

5. Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments. (Review)

Authors: Yulan Liang; Arpad Kelemen
Journal: Funct Integr Genomics Date: 2005-11-15 Impact factor: 3.410

6. Ensemble dependence model for classification and prediction of cancer and normal gene expression data.

Authors: Peng Qiu; Z Jane Wang; K J Ray Liu
Journal: Bioinformatics Date: 2005-05-06 Impact factor: 6.937

7. Application of bioinformatics for DNA microarray data to bioscience, bioengineering and medical fields. (Review)

Authors: Taizo Hanai; Hiroyuki Hamada; Masahiro Okamoto
Journal: J Biosci Bioeng Date: 2006-05 Impact factor: 2.894

8. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

Authors: T R Golub; D K Slonim; P Tamayo; C Huard; M Gaasenbeek; J P Mesirov; H Coller; M L Loh; J R Downing; M A Caligiuri; C D Bloomfield; E S Lander
Journal: Science Date: 1999-10-15 Impact factor: 47.728

9. A stable gene selection in microarray data analysis.

Authors: Kun Yang; Zhipeng Cai; Jianzhong Li; Guohui Lin
Journal: BMC Bioinformatics Date: 2006-04-27 Impact factor: 3.169

10. Translating microarray data for diagnostic testing in childhood leukaemia.

Authors: Katrin Hoffmann; Martin J Firth; Alex H Beesley; Nicholas H de Klerk; Ursula R Kees
Journal: BMC Cancer Date: 2006-09-26 Impact factor: 4.430

