On sparse Fisher discriminant method for microarray data analysis.

Abstract

One application of discriminant analysis to microarray data is the classification of patient and normal samples based on gene expression values. Such analysis is especially important in medical trials and in the diagnosis of cancer subtypes. The main contribution of this paper is a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as indicators to identify the subsets of relevant genes that categorize patient and normal samples. An l2-l1 norm minimization method is applied in the discriminant process to automatically compute the weights of all genes in the samples. Experiments on two microarray data sets show that the new algorithm generates classification results as good as those of other classification methods and effectively determines the genes relevant for classification. In this study, we demonstrate the gene selection ability and the computational effectiveness of the proposed algorithm. Experimental results are given to illustrate the usefulness of the proposed model.

Keywords: Fisher discriminant method; algorithm; data; genes; microarray

Year: 2007 PMID: 18305833 PMCID: PMC2241932 DOI: 10.6026/97320630002230

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Microarray technologies for the analysis of biological samples provide information on a genomic scale. A major challenge in microarray analysis is sample classification. One key problem is that the number of features (gene expression levels) is extremely large compared to the number of observations (samples), and traditional pattern recognition methods may not handle this properly. It is essential to identify which genes are relevant to the classification of disease so that better RNA-based diagnostic tests using laboratory techniques such as RT-PCR, and better treatments, can be developed. Researchers [3,6] have developed methods to identify optimal sets of genes that together provide good discrimination of classes, but these algorithms are generally very computationally intensive. Recently, various machine learning methods for gene selection have been developed, for instance the relevance vector machine [11], Gaussian process models [5] and simple decision rules [12]. Fisher discriminant analysis and least squares support vector machines have been used for sample classification [9]. Another approach is to use optimization algorithms for feature selection, such as sparse logistic regression [14] and a modified Fisher optimization model [7]. The main contribution of this paper is a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as indicators to identify the subsets of relevant genes that categorize patient and normal samples in two-class classification problems. This is achieved by including a weight sparsity term in the Fisher objective function that is minimized in the discriminant process, as described in equation 1 (see supplementary material). Each entry of u represents the weight of one gene.
An efficient l2-l1 norm minimization method [8] is applied to the above discriminant model to automatically compute the weights of all genes in the samples. Experiments on two microarray data sets show that the new algorithm can effectively determine a small set of genes for the purpose of classification and can generate classification results that are as good as those of the other methods.
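
Equation 1 itself is given only in the supplementary material and is not reproduced here. To illustrate what an l2-l1 norm minimization does in general, the following sketch assumes a least-squares form of the two-class discriminant, min over u of ||Xu - y||^2 + α||u||_1, and solves it by iterative soft-thresholding. The function names, the +1/-1 label coding and the choice of solver are assumptions made for illustration; they are not the authors' formulation or the method of [8].

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values with |v| <= t become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def l2_l1_weights(X, y, alpha, n_iter=500):
    """Minimise ||X u - y||^2 + alpha * ||u||_1 by iterative soft-thresholding.

    X: list of samples, each a list of gene expression values.
    y: class labels coded as +1 / -1 (an assumed coding).
    Returns one weight per gene; irrelevant genes receive weight 0.
    """
    n, p = len(X), len(X[0])
    # 2 * ||X||_F^2 is an upper bound on the Lipschitz constant of the gradient.
    lipschitz = 2.0 * sum(v * v for row in X for v in row) or 1.0
    step = 1.0 / lipschitz
    u = [0.0] * p
    for _ in range(n_iter):
        residual = [sum(X[i][k] * u[k] for k in range(p)) - y[i] for i in range(n)]
        grad = [2.0 * sum(X[i][k] * residual[i] for i in range(n)) for k in range(p)]
        # Gradient step on the l2 term, then shrinkage for the l1 term.
        u = [soft_threshold(u[k] - step * grad[k], step * alpha) for k in range(p)]
    return u
```

In this sketch a larger α drives more weights to exactly zero, which is the mechanism by which a sparsity term selects a small subset of genes.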

Results and discussion

Datasets

In this paper, we apply the proposed method to two public microarray data sets, namely the colon cancer data set from [1] and the Leukaemia MIT AML/ALL data set from [10].

Colon cancer data

In order to obtain more reliable results [15], we performed ten-fold cross validation in the experiments. The k-nearest neighbor method is used as the classifier to predict the class of the expression profiles of test samples. In the experiments, we tried several values of α. For each value of α, ten cross validation cases are generated and therefore ten sets of gene weights are obtained. From these ten sets, the mean weight of each gene is calculated and the genes are ranked according to the magnitude of their mean weights. We apply this ranking to the ten cross validation cases and determine how many important (relevant) genes should be selected to obtain the highest classification accuracy. Among all tested values of α, the highest classification accuracy is achieved when α = 1609. Figure 1a shows the classification accuracy curve for 10-fold cross validation based on the ranking of average gene weights when α = 1609. The classification accuracy is still 82.4% when more than 30 genes are selected, i.e., including more genes in the classifier does not improve the classification accuracy. Figure 1a also shows that the highest classification accuracy (86.7%) is obtained when three genes are selected. Among the ten cross validation cases, 5 are 100% correct. The type I and type II errors are 25.0% and 7.5% respectively when α = 1609.
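
The ranking and evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the value k = 3 and the squared Euclidean distance are assumptions (the text does not state them), and accuracy is pooled over all folds, which may differ from how the reported figures were averaged.

```python
def rank_genes_by_mean_weight(fold_weights):
    """Rank gene indices by the magnitude of their mean weight across folds.

    fold_weights: one weight vector per cross-validation case (ten in the text).
    Returns (ranking, mean_weights), with the most important gene first.
    """
    n_folds = len(fold_weights)
    n_genes = len(fold_weights[0])
    mean_weights = [sum(w[g] for w in fold_weights) / n_folds for g in range(n_genes)]
    ranking = sorted(range(n_genes), key=lambda g: abs(mean_weights[g]), reverse=True)
    return ranking, mean_weights


def knn_predict(train_X, train_y, x, genes, k=3):
    """Majority vote among the k nearest training samples, using only `genes`."""
    def dist(a, b):
        return sum((a[g] - b[g]) ** 2 for g in genes)
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def accuracy_by_gene_count(folds, ranking, max_genes, k=3):
    """Pooled test accuracy when the top 1..max_genes ranked genes are used.

    folds: list of (train_X, train_y, test_X, test_y) tuples, one per case.
    """
    curve = []
    for m in range(1, max_genes + 1):
        genes = ranking[:m]
        correct = total = 0
        for train_X, train_y, test_X, test_y in folds:
            for x, label in zip(test_X, test_y):
                correct += knn_predict(train_X, train_y, x, genes, k) == label
                total += 1
        curve.append(correct / total)
    return curve
```

The position of the maximum of the returned curve corresponds to the number of genes at which the highest accuracy is reached, as in the accuracy curves of Figure 1.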

Figure 1

Classification accuracy and projection values. (a) classification accuracy (%) when α = 1609; (b) projection values when α = 1609; (c) classification accuracy (%) when α = 10; (d) projection values when α = 10

In Table 1 (supplementary material), we list the mean weights, the mean values of the cancer samples and the mean values of the normal samples for the three selected genes. The differences between the sample means of the two classes are quite large, which may explain why these genes are selected and why they are relevant to normal/disease sample classification. In Figure 1b, we plot the value of equation (2) (see supplementary material) for each training sample j, where x_j is a vector containing the expression values of the selected genes for the j-th sample and ū is a projection vector formed from the average weights of the three selected genes.
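
Equation (2) is available only in the supplementary material. Assuming it is the inner product of the projection vector with each sample's selected gene expressions (any offset or scaling in the actual equation is not modelled), the projection values could be computed as in this short sketch; the function and argument names are hypothetical.

```python
def projection_values(samples, mean_weights, selected_genes):
    """Project each sample onto the mean weights of the selected genes.

    samples: list of expression vectors, one per sample.
    mean_weights: mean weight of every gene across cross-validation cases.
    selected_genes: indices of the genes kept after ranking.
    """
    return [sum(mean_weights[g] * x[g] for g in selected_genes) for x in samples]
```

Plotting these values per sample, grouped by class, shows how well a handful of genes separates the two classes.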

Leukaemia MIT AML/ALL data

We also performed ten-fold cross validation for the Leukaemia data set. Among all tested values of α, the highest average classification accuracy is achieved when α = 10. Figure 1c shows the classification accuracy curve for 10-fold cross validation based on the ranking of average gene weights. The classification accuracy is still 91.5% when more than 120 genes are selected. The highest classification accuracy (95.8%) is obtained when 39 genes are selected. It is interesting to note that 7 out of 10 cases are 100% correct. The type I and type II errors are 0.0% and 11.7% respectively. In Table 2 (supplementary material), we observe that the differences between the sample means of the two classes are again quite large for the selected genes. In Figure 1d, we plot the value of equation (3) (see supplementary material) for each training sample j; it is clear from the figure that the two classes of samples are well separated by the selected genes.
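
The text does not state which class is treated as positive when reporting type I and type II errors. The sketch below assumes the usual convention: the type I error is the fraction of negative samples predicted as positive, and the type II error is the fraction of positive samples predicted as negative. The function name and the `positive` argument are hypothetical.

```python
def error_rates(true_labels, predicted_labels, positive):
    """Return (type I error, type II error) for a two-class problem.

    Type I: negatives wrongly predicted as `positive`.
    Type II: positives wrongly predicted as the other class.
    """
    pairs = list(zip(true_labels, predicted_labels))
    negatives = [p for t, p in pairs if t != positive]
    positives = [p for t, p in pairs if t == positive]
    type_1 = sum(p == positive for p in negatives) / len(negatives)
    type_2 = sum(p != positive for p in positives) / len(positives)
    return type_1, type_2
```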

Comparison of methods

In this section, we compare the proposed method with other classification methods.

Modified Fisher discriminant method

In this subsection, we compare the performance of the proposed method with the modified Fisher discriminant method described in [7]. Using the colon cancer data set, we randomly selected half of the normal samples and half of the patient samples as training samples and used the rest as testing samples; this was repeated 100 times. Here we fix α = 1609, as used in the previous subsection, and compare the results of the two methods. For the proposed method, the classification accuracy on the testing samples is 85.0 ± 13.8% and only one gene (“Hsa.8147”) is selected. In contrast, the classification accuracy on the testing samples in [7] is 86.0 ± 5.7% and the number of genes selected is 29.9 ± 4.8. The proposed method is therefore quite competitive with the modified Fisher discriminant method while selecting far fewer genes. Secondly, we perform the same experiment using the Leukaemia data set. We randomly selected half of the normal samples and half of the patient samples as training samples and used the rest as testing samples, giving 36 training samples and 36 testing samples; this was repeated 100 times. Here we fix α = 10, as used in the previous subsection. The classification accuracy on the testing samples is 86.9 ± 14.7% and the number of genes selected is 58. No average result is given in [7] for this data set because that method requires large memory storage and is time-consuming. In contrast, the proposed method generates classification results efficiently.
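
The repeated random-split protocol above can be sketched as follows. This is an illustration under stated assumptions: `evaluate` is a hypothetical callback that trains on the given training indices and returns the test accuracy, the fixed seed is only for reproducibility, and the sample standard deviation is used for the "±" term, which the text does not specify.

```python
import random
import statistics


def stratified_half_split(labels, rng):
    """Randomly assign half of each class to training and the rest to testing."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        train += idx[:half]
        test += idx[half:]
    return train, test


def repeated_holdout(labels, evaluate, n_repeats=100, seed=0):
    """Mean and standard deviation of test accuracy over repeated random splits."""
    rng = random.Random(seed)
    scores = [evaluate(*stratified_half_split(labels, rng)) for _ in range(n_repeats)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Splitting within each class keeps the class proportions of the training and testing halves close to those of the full data set.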

Sparse logistic regression

In order to make a fair comparison with sparse logistic regression [4], we also perform a leave-one-out validation procedure to test the performance of the proposed method. We calculate the mean weights of the genes in the procedure and determine how many genes should be selected to obtain the highest classification accuracy. For the colon cancer data set, we find that when α is equal to 1, the classification accuracy, cross-entropy and number of selected genes of the proposed method are 83.9%, 0.31 and 9 respectively. These results are better than those of the method (BLogReg) in [4], which gives a lower classification accuracy (82.3%), a higher cross-entropy (0.51) and more selected genes (11). For the Leukaemia MIT AML/ALL data set, we find that when α is equal to 0.1, the classification accuracy, cross-entropy and number of selected genes of the proposed method are 95.8%, 0.087 and 8 respectively. These results are again better than those of BLogReg in [4], which gives a lower classification accuracy (93.1%), a higher cross-entropy (0.259) and more selected genes (11). We remark that the lower the cross-entropy, the better the classification result.
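
For reference, a leave-one-out split and a cross-entropy score can be sketched as below. The text does not say how class probabilities are derived from the discriminant output, nor whether the cross-entropy is averaged or uses natural logarithms; the mean natural-log form, the 0/1 target coding and the clipping constant `eps` are assumptions.

```python
import math


def leave_one_out(n_samples):
    """Yield (training indices, held-out index) for every sample in turn."""
    for held_out in range(n_samples):
        yield [i for i in range(n_samples) if i != held_out], held_out


def cross_entropy(targets, probabilities, eps=1e-12):
    """Mean binary cross-entropy; lower values indicate better predictions.

    targets: true classes coded as 0 or 1.
    probabilities: predicted probability of class 1 for each sample.
    """
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)
```

Unlike accuracy, this score also penalises predictions that are correct but made with low confidence, which is why it is reported alongside the accuracy.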

PAM

PAM is a tool for classifying normal/disease samples based on microarray data [2]. The idea behind nearest shrunken centroids [3][13] is to compute a centroid for each class, as in a nearest centroid classifier. Each centroid is divided by the within-class standard deviation for each gene, which gives greater weight to genes whose expression is stable among samples in the same class. Soft thresholding is then applied to the resulting normalized class centroids: if a normalized centroid value is small, it is set to zero. This procedure reduces the number of genes used in the final classification model. The method is very efficient because it does not involve the covariance matrix of the genes, and the nearest shrunken centroids can be computed independently. In [2], it is mentioned that the discriminant weights in PAM are similar to those used in linear discriminant analysis. The main difference lies in the calculation of the distance between a given test observation and the class centroids: linear discriminant analysis uses the pooled within-class variance/covariance matrix of the expression data, whereas PAM assumes that the covariance matrix is diagonal. In the proposed method, we use the covariance matrix in the formulation so that pairwise relations between any two genes are used in the calculation of the discriminant weights. Furthermore, whereas PAM uses shrunken centroids, the proposed method uses a weight sparsity term ‖u‖_1 in the objective function to control the discriminant weights. Similar to PAM, a cross-validation procedure is used to find a good balance (α) between equation (4) (see supplementary material) and ‖u‖_1. We remark that α is the regularization parameter that controls the sparsity of u: very small weights are set to zero, and the corresponding genes do not contribute to the final classification.
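
The soft-thresholding step of nearest shrunken centroids, as described above, can be illustrated with the following sketch. It takes already standardized centroid values as input and omits the remaining details of PAM; the function names and the threshold name `delta` are assumptions for illustration.

```python
import math


def shrink_centroids(standardized, delta):
    """Soft-threshold standardized class centroids gene by gene.

    standardized: dict mapping class label -> list of standardized centroid values.
    Values with magnitude <= delta become zero; the rest move toward zero by delta.
    """
    return {
        cls: [math.copysign(max(abs(d) - delta, 0.0), d) for d in values]
        for cls, values in standardized.items()
    }


def active_genes(shrunken):
    """Indices of genes whose shrunken centroid is nonzero in at least one class."""
    n_genes = len(next(iter(shrunken.values())))
    return [g for g in range(n_genes) if any(v[g] != 0.0 for v in shrunken.values())]
```

A gene whose shrunken value is zero in every class no longer influences the classifier, in the same way that a zero entry of u removes a gene in the proposed method.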

Conclusion

In this paper, we study a new Fisher discriminant method for gene selection in microarray data and propose an l2-l1 norm minimization method for finding the projection vector in the discriminant process. Experiments on two microarray data sets show that the new algorithm generates classification results that are competitive with those of other classification methods and can effectively determine relevant genes.

11 in total

1. Support vector machine classification and validation of cancer tissue samples using microarray expression data.

Authors: T S Furey; N Cristianini; N Duffy; D W Bednarski; M Schummer; D Haussler
Journal: Bioinformatics Date: 2000-10 Impact factor: 6.937

2. Bayesian automatic relevance determination algorithms for classifying gene expression data.

Authors: Yi Li; Colin Campbell; Michael Tipping
Journal: Bioinformatics Date: 2002-10 Impact factor: 6.937

3. A simple and efficient algorithm for gene selection using sparse logistic regression.

Authors: S K Shevade; S S Keerthi
Journal: Bioinformatics Date: 2003-11-22 Impact factor: 6.937

4. Biomarker discovery in microarray gene expression data with Gaussian processes.

Authors: Wei Chu; Zoubin Ghahramani; Francesco Falciani; David L Wild
Journal: Bioinformatics Date: 2005-06-02 Impact factor: 6.937

5. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization.

Authors: Gavin C Cawley; Nicola L C Talbot
Journal: Bioinformatics Date: 2006-07-14 Impact factor: 6.937

Review 6. Classification based upon gene expression data: bias and precision of error rates.

Authors: Ian A Wood; Peter M Visscher; Kerrie L Mengersen
Journal: Bioinformatics Date: 2007-03-28 Impact factor: 6.937

7. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.

Authors: U Alon; N Barkai; D A Notterman; K Gish; S Ybarra; D Mack; A J Levine
Journal: Proc Natl Acad Sci U S A Date: 1999-06-08 Impact factor: 11.205

8. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

Authors: T R Golub; D K Slonim; P Tamayo; C Huard; M Gaasenbeek; J P Mesirov; H Coller; M L Loh; J R Downing; M A Caligiuri; C D Bloomfield; E S Lander
Journal: Science Date: 1999-10-15 Impact factor: 47.728

9. Evolutionary algorithms for finding optimal gene sets in microarray prediction.

Authors: J M Deutsch
Journal: Bioinformatics Date: 2003-01 Impact factor: 6.937

10. New feature subset selection procedures for classification of expression profiles.

Authors: Trond Bø; Inge Jonassen
Journal: Genome Biol Date: 2002-03-14 Impact factor: 13.583
