Entropy based sub-dimensional evaluation and selection method for DNA microarray data classification.

Abstract

DNA microarrays allow the expression levels of tens of thousands of genes to be measured simultaneously and have many applications in biology and medicine. Microarray data are very noisy, which makes data analysis and classification difficult. Sub-dimension based methods can overcome the noise problem by partitioning the conditions into sub-groups, performing classification with each group and integrating the results. However, there can be many sub-dimensional groups, which leads to high computational complexity. In this paper, we propose an entropy-based method to evaluate and select important sub-dimensions and eliminate unimportant ones. This improves the computational efficiency considerably. We have tested our method on four microarray datasets and two other real-world datasets, and the experimental results demonstrate the effectiveness of our method.

Keywords: DNA microarray; datasets; entropy; probabilistic neural network; sub-dimension

Year: 2008 PMID: 19238249 PMCID: PMC2639693 DOI: 10.6026/97320630003124

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

The development of microarray technology has made it possible to measure the expression levels of tens of thousands of genes in parallel and has enhanced our understanding of functional genomics. An important task in DNA microarray data analysis is to identify genes that have similar expression patterns in order to understand their biological functions and cellular processes. Doing this manually would require a tremendous amount of effort. It is therefore important to develop computerized data analysis techniques, such as classification algorithms, which are needed in many applications. In our previous study, we proposed a sub-dimension based probabilistic neural network to solve this problem [1].
The probabilistic neural network (PNN) was first developed by D. Specht [2], [3]. It provides a general solution to pattern classification problems by applying the Bayes strategy to probability density functions. It is frequently employed in pattern classification and microarray data clustering because of its time efficiency: it trains considerably faster than the conventional back-propagation network (BPN). Furthermore, as discussed in [4], a PNN can attain the same accuracy as a BPN.
We assume that the input data consist of an n by d matrix X, where n is the number of genes (objects) and d is the number of conditions (features). The sub-dimension based method partitions the dataset into several smaller parts called sub-dimensions, which may or may not be disjoint [5], and clusters the data based on these sub-dimensions. In our previous study, a voting system was used to combine the class results of all sub-dimensions: two objects x1 and x2 are assigned to the same group if more than half of their sub-dimensions x1j and x2j belong to the same group. Experimental results show that the method is effective [1].
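As an illustration of the two ideas above, the following Python sketch implements a minimal Parzen-window PNN and a majority vote over sub-dimensions. It is not the implementation from [1]: the Gaussian kernel width `sigma`, the equal class priors, the helper names (`take`, `pnn_classify`, `subdim_vote`), the toy data and the choice of column groups are all assumptions made for the example. The vote is applied to predicted class labels, which is only one plausible reading of the voting rule described above.

```python
import math
from collections import Counter


def take(x, cols):
    """Project a feature vector onto the columns of one sub-dimension."""
    return [x[j] for j in cols]


def pnn_classify(x, train_X, train_y, sigma=1.0):
    """Pick the class with the largest mean Gaussian kernel response.

    Assumption: equal class priors and one shared kernel width sigma.
    """
    sums = Counter()
    counts = Counter(train_y)
    for xi, yi in zip(train_X, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        sums[yi] += math.exp(-d2 / (2.0 * sigma ** 2))
    return max(counts, key=lambda c: sums[c] / counts[c])


def subdim_vote(x, train_X, train_y, subdims, sigma=1.0):
    """Classify x in every sub-dimension and return the majority label."""
    votes = Counter()
    for cols in subdims:
        sub_train = [take(xi, cols) for xi in train_X]
        votes[pnn_classify(take(x, cols), sub_train, train_y, sigma)] += 1
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two classes, four features.
train_X = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.0, 0.1],
           [1.0, 1.0, 1.0, 1.0], [0.9, 1.1, 1.0, 0.8]]
train_y = ["a", "a", "b", "b"]
subdims = [(0, 1), (2, 3), (0, 3)]   # overlapping groups are allowed

# A class "a" sample whose third feature has been heavily corrupted.
x = [0.05, 0.1, 5.0, 0.0]
print("all features :", pnn_classify(x, train_X, train_y))
print("sub-dim vote :", subdim_vote(x, train_X, train_y, subdims))
```

In this toy setting the corrupted feature dominates the distance computed over all four features, whereas it can influence only the one sub-dimension that contains it, so the other two sub-dimensions still carry the vote.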
However, the enormous number of features in real-world microarray datasets makes it difficult to select the optimal sub-dimensions. One remedy is to reduce the dimensionality. In classification, the sub-dimensions do not contribute equally. Some may be corrupted or less relevant than others and can be discarded without degrading the performance of the system. In this paper, we employ a feature evaluation and selection technique to identify the less important sub-dimensions, so that the number of sub-dimensions can be reduced without affecting the classification accuracy. The aim of feature selection is to distinguish the features that carry the most effective information from those that carry the least within an original candidate set. Feature selection algorithms have been well researched in this area. In our study, we combine an entropy-based measure with the sub-dimension method. Entropy-based methods have been used in many areas, such as mathematics, communication theory, and economics. In 1948, Shannon [6] first introduced the basic entropy and information gain concepts to the information domain. “Entropy is a measure of the amount of uncertainty in the outcome of a random experiment, or equivalently, a measure of the information obtained when the outcome is observed.” [7] In our study, the entropy can be regarded as a measure of the contribution that a single sub-dimension makes to the overall classification. To demonstrate the performance of the proposed method, the normal PNN and the sub-dimension based PNN are used for experimental comparison. In this paper, we first briefly review the structure of the PNN, discuss the sub-dimension formulation, and introduce the entropy concept. Then, we describe the proposed method and present experimental results from six datasets.
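The paper's exact entropy definition is given in the supplementary material and is not reproduced here. As a hedged illustration only, the sketch below uses the standard Shannon entropy and information gain to score how much the labels predicted from each sub-dimension reveal about the true classes, and then ranks the sub-dimensions by that score. The function names, the sub-dimension names and the decision to compute the gain from per-sub-dimension predictions are assumptions of this example, not the authors' formulation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(true_labels, predicted):
    """H(Y) - H(Y | prediction): how much a prediction reduces uncertainty."""
    n = len(true_labels)
    groups = {}
    for y, p in zip(true_labels, predicted):
        groups.setdefault(p, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(true_labels) - conditional


def rank_subdims(true_labels, predictions_by_subdim):
    """Return sub-dimension names sorted from most to least informative."""
    gains = {name: information_gain(true_labels, preds)
             for name, preds in predictions_by_subdim.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical labels predicted by three sub-dimension classifiers.
truth = ["a", "a", "b", "b"]
predictions = {
    "sub1": ["a", "a", "b", "b"],   # agrees with the truth everywhere
    "sub2": ["a", "b", "a", "b"],   # carries no information about the truth
    "sub3": ["a", "a", "b", "a"],   # one mistake
}
print(rank_subdims(truth, predictions))
```

Keeping only the first k entries of such a ranking would discard the least informative sub-dimensions; how k should be chosen is left open here.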

Methodology

Please see supplementary material.

Discussion

Experiments based on the proposed method are performed on four microarray datasets: the yeast cell cycle data, the sporulation data, the Rodrigues data, and the annot data [11-14]. To verify the proposed method, we also present experimental results on two other datasets, the wine data and the Wisconsin diagnostic breast cancer (wdbc) data. For each dataset, we run the steps in section II 30 times and average the results to evaluate the performance.

Real world data

In order to evaluate the performance of the proposed method on noisy data, we added white Gaussian noise (wgn) randomly to the features of the entire datasets as a form of corruption. The wine dataset contains 178 objects in three groups and 13 features. In our experiment, we use 78 objects as training samples and the remaining 100 objects for testing. As shown in Table 1 (supplementary material), the sub-dimension based PNN classifies 90 out of 100 correctly, compared with 71 out of 100 for the normal PNN. With 89% accuracy, the proposed method provides performance comparable to that of the sub-dimension based PNN. The wdbc dataset has 576 objects in two classes and 30 features; 276 training samples and 300 testing samples are used. As with the wine data, the proposed method gives close results on the wdbc dataset, with 279 correct classifications compared with 280 by the sub-dimension based PNN, and is superior to the normal PNN.
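The corruption step described above can be sketched as follows. The text does not state the noise level or how the affected features were chosen, so the standard deviation `sigma`, the `fraction` of corrupted columns, the fixed `seed` and the function names are all assumptions made for illustration.

```python
import random


def add_white_gaussian_noise(X, sigma=0.5, fraction=0.2, seed=0):
    """Return a noisy copy of X and the list of corrupted column indices.

    Assumption: a random subset of feature columns is chosen once, and
    zero-mean Gaussian noise is added to those columns for every object.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in X]          # leave the input untouched
    n_feat = len(X[0])
    n_noisy = max(1, round(fraction * n_feat))
    cols = rng.sample(range(n_feat), n_noisy)
    for row in noisy:
        for j in cols:
            row[j] += rng.gauss(0.0, sigma)
    return noisy, sorted(cols)


def accuracy(predicted, truth):
    """Fraction of test objects classified correctly."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical 3 x 5 dataset; two of the five columns are corrupted.
X = [[1.0, 2.0, 3.0, 4.0, 5.0] for _ in range(3)]
noisy, cols = add_white_gaussian_noise(X, sigma=0.5, fraction=0.4, seed=1)
print("corrupted columns:", cols)
```

With `accuracy`, the counts reported in this section map directly to rates: for example, 90 correct out of 100 test objects is an accuracy of 0.9.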

Microarray data

The yeast cell cycle dataset, consisting of 6220 genes, was published by Cho and colleagues [11]. As in the study of the sub-dimension method [5], we adopt 384 genes and normalize each gene expression profile so that it has zero mean and unit variance. The dataset has five cycle phases, which are the G1 phase, late G1 phase, S phase, G2 phase and M phase, and 17 time points. The results are given in Table 3 (supplementary material). The proposed method correctly classifies 149 out of 200 testing samples and the sub-dimension based PNN correctly classifies 150, a difference of only 0.5%. The sporulation dataset contains 6118 genes with seven features. As in [5], after pre-processing we use only the 1136 genes for which the root mean square of the log2-transformed data is greater than 1.13. The dataset has seven phases: metabolic, early I, early II, early middle, middle, mid-late, and late. We use 736 genes for training and the remaining 400 genes for testing. As shown in Table 4 (supplementary material), the proposed method works well, with an accuracy of 48.5% (194 out of 400) compared with 49.5% for the sub-dimension based PNN. The Rodrigues dataset is available elsewhere [13]. It contains 974 genes clustered into nine groups with 47 features, and 500 of the genes are used for testing. Table 5 (supplementary material) shows that the proposed method achieves the same recognition accuracy as the sub-dimension based PNN (82.4%). For comparison, the normal PNN achieves 79.6% accuracy. Similar results are obtained on the Annton dataset, which contains 639 genes in five classes and 47 features, of which half are in the test set. As expected, the proposed method achieves almost the same success on the test set as the sub-dimension based PNN, at 73% accuracy. The normal PNN could only obtain 283 correct out of 400 testing data.
As shown in the tables (supplementary material), the proposed method performs very close to the sub-dimension based PNN, which uses all sub-dimension features.
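The two pre-processing steps mentioned in this section, per-gene normalization to zero mean and unit variance and the root-mean-square filter on log2-transformed data, can be sketched as below. This is an illustrative reading rather than the authors' code: the use of the population variance, the assumption that the raw values are positive expression ratios, and the function and gene names are all choices made for the example.

```python
import math


def zscore(profile):
    """Scale one gene's expression profile to zero mean and unit variance.

    Assumption: population variance (divide by n); the text does not say
    whether n or n - 1 was used.
    """
    n = len(profile)
    mean = sum(profile) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    if std == 0.0:
        return [0.0] * n                   # flat profile: nothing to scale
    return [(v - mean) / std for v in profile]


def rms_log2(ratios):
    """Root mean square of the log2-transformed values of one gene."""
    logs = [math.log2(r) for r in ratios]
    return math.sqrt(sum(v * v for v in logs) / len(logs))


def filter_by_rms(genes, threshold=1.13):
    """Keep genes whose RMS of log2 values exceeds the threshold."""
    return {name: r for name, r in genes.items() if rms_log2(r) > threshold}


# Hypothetical expression ratios for two genes.
genes = {
    "geneA": [2.0, 0.5, 4.0, 0.25],   # strongly varying
    "geneB": [1.0, 1.1, 0.9, 1.0],    # nearly flat
}
kept = filter_by_rms(genes)
print(sorted(kept))
print(zscore(genes["geneA"]))
```

Normalizing each profile separately removes differences in absolute expression level, so that genes are compared by the shape of their profiles across conditions.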

Conclusion

Instead of considering all features of the datasets in a single classifier, our previous paper [1] implemented PNN classification on single sub-dimensions. However, the number of combinations of sub-dimensions is large, and this makes the overall system computationally too complicated. In this paper, a feature evaluation and selection technique based on an entropy definition is used to measure the contribution of each sub-dimension. The sub-dimension with the lowest contribution to the overall classification is discarded. Experiments on two real-world datasets and four microarray datasets show clearly that the proposed technique performs remarkably better than the normal PNN and as well as the sub-dimension based PNN, while the system complexity is significantly reduced and the classification speed is increased. Feature evaluation and selection are especially effective and convenient when the number of input features is large and the datasets are noisy. In the ranking by information gain, the importance of a sub-dimension decreases as its information gain decreases. Good selection performance occurs particularly at the top of the ranking. However, how many sub-dimensions should be considered important is a critical issue that needs to be investigated further.

6 in total

1. Subdimension-based similarity measure for DNA microarray data clustering.

Authors: Benson S Y Lam; Hong Yan
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2006-10-09

2. Probabilistic neural networks and the polynomial Adaline as complementary techniques for classification.

Authors: D F Specht
Journal: IEEE Trans Neural Netw Date: 1990

3. An efficient fuzzy classifier with feature selection based on fuzzy entropy.

Authors: H M Lee; C M Chen; J M Chen; Y L Jou
Journal: IEEE Trans Syst Man Cybern B Cybern Date: 2001

4. Global map of growth-regulated gene expression in Burkholderia pseudomallei, the causative agent of melioidosis.

Authors: Fiona Rodrigues; Mitali Sarkar-Tyson; Sarah V Harding; Siew Hoon Sim; Hui Hoon Chua; Chi Ho Lin; Xu Han; R Krishna M Karuturi; Ken Sung; Kun Yu; Wei Chen; Timothy P Atkins; Richard W Titball; Patrick Tan
Journal: J Bacteriol Date: 2006-09-22 Impact factor: 3.490

5. A genome-wide transcriptional analysis of the mitotic cell cycle.

Authors: R J Cho; M J Campbell; E A Winzeler; L Steinmetz; A Conway; L Wodicka; T G Wolfsberg; A E Gabrielian; D Landsman; D J Lockhart; R W Davis
Journal: Mol Cell Date: 1998-07 Impact factor: 17.970

6. Identification of genes periodically expressed in the human cell cycle and their expression in tumors.

Authors: Michael L Whitfield; Gavin Sherlock; Alok J Saldanha; John I Murray; Catherine A Ball; Karen E Alexander; John C Matese; Charles M Perou; Myra M Hurt; Patrick O Brown; David Botstein
Journal: Mol Biol Cell Date: 2002-06 Impact factor: 4.138