Literature DB >> 21738314

A heuristic method for discovering biomarker candidates based on rough set theory.

Abstract

We apply a combined method of heuristic attribute reduction and evaluation of relative reducts in rough set theory to gene expression data analysis. Our method extracts as many relative reducts as possible from the gene-expression data and selects the best relative reduct from the viewpoint of constructing useful decision rules. Using a breast cancer dataset and a leukemia dataset, we evaluated the classification accuracy for the test samples and biological meanings of the rules. As a result, our method presented superior classification accuracy comparable to existing salient classifiers. Moreover, our method extracted interesting rules including a novel biomarker gene identified in recent studies. These results indicate the possibility that our method can serve as a useful tool for gene expression data analysis.

Entities: Disease Gene Species

Year: 2011 PMID： 21738314 PMCID： PMC3124797 DOI： 10.6026/97320630006200

Source DB: PubMed Journal: Bioinformation ISSN： 0973-2063

Background

DNA microarray technology has enabled us to monitor the expression levels of thousands of genes simultaneously under certain conditions, and has been yielded various applications in the field of disease diagnosis [1], drug discovery [2], and toxicological research [3]. Among them, cancer informatics based on gene-expression data is an important domain that has promising prospects for both clinical treatment and biomedical research. One of the key issues in this domain is to discover biomarker genes for cancer diagnosis from a massive amount of gene-expression data by using a bioinformatics approach called gene selection. A typical gene-selection approach is a statistical test such as t-test and ANOVA [4]. This approach detects differentially expressed genes between groups of samples obtained from different cells/tissues. Most of the statistical tests assume that the expression values of each gene across the samples follow a prior probability distribution; hence a sufficiently large number of samples are required to obtain statistically reliable results. Rough set theory [5] provides a theoretical basis for set-theoretical approximation and rule generation from categorical data. Computation of relative reducts is one of the hottest and most important research topics in rough set theory as a basis for rule generation. Relative reducts are minimal sets of attributes for correctly classifying all samples to those classes. We then expect that computation of relative reducts from gene-expression data is useful for discovering biologically-meaningful information such as biomarker candidates for cancer diagnosis. Because computing all relative reducts of the given data requires very high computational cost, there have been many proposals of heuristic algorithms to compute some of the candidates of relative reducts [6-10]. Kudo and Murai proposed attribute-reduction algorithms to compute as many relative reducts as possible from a decision table with numerous condition attributes [11]. They also proposed an evaluation criterion of relative reducts that evaluates the usefulness of relative reducts from the viewpoint of decision-rule generation [12]. In this paper, we introduce Kudo and Murai’s heuristic attribute reduction algorithms [11] and a criterion of relative reducts [12] for gene-expression data analysis. We use these algorithms and criterion in two gene-expression datasets, breast cancer [13] and leukemia [14], and discuss the extracted decision rules from these datasets and their biological meanings. The experimental results indicate that the method used in this paper can identify differentially expressed genes between different classes in gene-expression datasets and that it can be useful for gene-expression data analysis.

Methodology

The method we use in this paper to extract decision rules from gene-expression data based on rough set theory consists of the following three components: Extraction of as many relative reducts as possible from gene-expression data; Selection of relative reducts in accordance with an evaluation criterion of relative reducts; Construction of decision rules from the selected relative reducts. Figure 1 illustrates the processing flow of our method. In the following section, in terms of the method we use in this paper, we introduce heuristic attribute-reduction algorithms for generating as many relative reducts as possible [11] used in the first step of the above method and a criterion for evaluating the usefulness of relative reducts [12] as in the second step. Note that the details of these algorithms and the criterion of relative reducts are in supplementary material.

Figure 1

The method of discovering biomarker candidates based on rough set theory

Datasets

To evaluate the usefulness of our method, we use two gene-expression datasets: breast cancer [13] and leukemia [14]. Both of them are two-class datasets. The leukemia dataset is composed of the gene-expression values for 12,582 genes in 24 acute lymphocytic leukemia (ALL) samples and 28 acute myeloid leukemia (AML) samples. The breast cancer dataset includes the geneexpression values for 7,129 genes in 25 positive and 24 negative samples. For each dataset, the expression values from each gene are linearly normalized to have mean 0 and variance 1. Subsequently, they are discretized into six bins (-3, -2, -1, 1, 2, 3) by uniformly dividing the difference between the maximum and the minimum in the normalized data and into one bin that represents the lack of gene-expression values. Discretized positive values represent that the genes are up-regulated, while negative values represent that genes are downregulated.

Results and Discussion

Parameters

Our method was implemented in Java on a Linux workstation (CPU: Intel Xeon X5460 (3.16GB) x2, Memory: 8GB, HDD: 160GB, OS: SUSE Linux 10.1). All experiments were conducted with the following parameters: base size b = 10, size limitation L = 25, and number of iterations I = 100.

Classification Accuracy

First, we evaluate the classification accuracy of our method. The evaluation is conducted by Leave-One-Out Cross Validation (LOOCV). In LOOCV, first, we extract one sample as a test sample from the dataset and generate rules using the remaining samples. Second, we check whether the test sample is correctly classified by the rules. These processes are repeated for all samples. Finally we calculate the rate of correctly classified samples. The classification accuracy is compared to those of the two salient classifiers, decision tree (C4.5) and support vector machine (SVM). (see Table 1) shows the results of LOOCV on our method, C4.5, and SVM. For the breast cancer dataset, our method exhibits the similar classification ability with C4.5 and SVM. For the leukemia dataset, the classification ability of our method exceeds greatly that of C4.5.

Biological meanings of extracted rules

Next, we discuss the biological meanings of the best results by applying our method 10 times for each dataset. In these experiments, we used the same parameter settings with the comparison experiments. The best relative reducts of two datasets are as follows: (1) Breast cancer dataset: {CRIP1, M34715_at}, ACov = 0.08 (= 2/26). (2) Leukemia dataset: We extracted rules from each dataset by performing the following three steps: {POU2AF1}, ACov = 0.29 (= 2/7), where the score ACov is the average of coverage of decision rules generated from the relative reduct defined by Eq. in Supplementary material. For example, the relative reduct {POU2AF1} of leukemia dataset generates 7 decision rules from 2 classes, i.e., AML and ALL; hence ACov score of the relative reduct {POU2AF1} is 2/7 (= 0.29). generating all decision rules by the best relative reduct of each dataset, removing decision rules that contain null values in the antecedents, and combining the generated decision rules as long as possible by interpreting the meanings of decision rules. As a result, we obtained the rules for each dataset. (see supplementary material) The extracted rules are evaluated on the basis of known biological findings. To this end, we investigate the functions of genes in the rules by reference to a genetic disease database (OMIM) [15] and a protein sequence database (Swiss- Prot) [16]. For the breast-cancer dataset, the samples can be discriminated into a true class with an accuracy of 88 percent according to the expression level of the Cystein-rich intestinal protein 1 (CRIP1). CRIP1 is a transcription-factor gene that induces apoptosis in cancer cells. Interestingly, this gene has been identified as a novel biomarker of human breast cancer in recent studies [17, 18]. In the extracted rule, we can see that the CRIP1 expression is more upregulated in the positive samples. Indeed, this is consistent with the recent findings by Ma et al. [17] that CRIP1 in human breast cancer was overexpressed, compared to normal breast tissue in in situ experiments. For the leukemia dataset, all samples can be perfectly discriminated by the expression level of the POU class 2 associating factor 1 (POU2AF1). POU2AF1 is known as a gene responsible for leukocyte differentiation. In Swiss-Prot, we can see the description that “a chromosomal aberration involving POU2AF may be a cause of a form of B-cell leukemia.” Namely, it suggests that this gene can be inactivated/down-regulated in lymphocytic leukemia, such as ALL. In contrast, it should be noted that POU2AF1 in the extracted rule shows a weaker expression in AML than ALL. At present, the detailed role of POU2AF1 in AML has not been revealed [19], whereas we expect that its biological relevance will be unveiled by experimental biologists in the near future.

Conclusion

In this paper, we introduced a combined method of heuristic attribute reduction and evaluation of relative reducts in rough set theory for gene-expression data analysis. Our method is based on a heuristic attribute-reduction algorithm for generating as many relative reducts as possible and a criterion for evaluating the usefulness of relative reducts. We applied our method to two geneexpression datasets: breast cancer and leukemia. In the comparison of our method with C4.5 and SVM, our method showed good classification accuracy that is comparable to the results of SVM and considerably exceeds that of C4.5.The experimental results also showed that our method can identify differentially expressed genes among different classes in gene-expression datasets. For the breast-cancer dataset, our method extracted decision rules regarding a gene that has been identified as a novel biomarker of human breast cancer in recent studies. For the leukemia dataset, rules about a gene responsible for leukocyte differentiation were extracted. Thus, these results indicate a possibility that our method can be a useful tool for gene-expression data analysis. As a future work, we reduce the computation time via improvement of the program and will provide the freely available tool.

9 in total

1. Gene expression profiles of human breast cancer progression.

Authors: Xiao-Jun Ma; Ranelle Salunga; J Todd Tuggle; Justin Gaudet; Edward Enright; Philip McQuary; Terry Payette; Maria Pistone; Kimberly Stecker; Brian M Zhang; Yi-Xiong Zhou; Heike Varnholt; Barbara Smith; Michelle Gadd; Erica Chatfield; Jessica Kessler; Thomas M Baer; Mark G Erlander; Dennis C Sgroi
Journal: Proc Natl Acad Sci U S A Date: 2003-04-24 Impact factor: 11.205

Review 2. Use of microarray technologies in toxicology research.

Authors: Kent E Vrana; Willard M Freeman; Michael Aschner
Journal: Neurotoxicology Date: 2003-06 Impact factor: 4.294

Review 3. Applications of DNA microarray in disease diagnostics.

Authors: Seung Min Yoo; Jong Hyun Choi; Sang Yup Lee; Nae Choon Yoo
Journal: J Microbiol Biotechnol Date: 2009-07 Impact factor: 2.351

Review 4. DNA microarrays in drug discovery and development.

Authors: C Debouck; P N Goodfellow
Journal: Nat Genet Date: 1999-01 Impact factor: 38.330

5. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia.

Authors: Scott A Armstrong; Jane E Staunton; Lewis B Silverman; Rob Pieters; Monique L den Boer; Mark D Minden; Stephen E Sallan; Eric S Lander; Todd R Golub; Stanley J Korsmeyer
Journal: Nat Genet Date: 2001-12-03 Impact factor: 38.330

6. Predicting the clinical status of human breast cancer by using gene expression profiles.

Authors: M West; C Blanchette; H Dressman; E Huang; S Ishida; R Spang; H Zuzan; J A Olson; J R Marks; J R Nevins
Journal: Proc Natl Acad Sci U S A Date: 2001-09-18 Impact factor: 11.205

7. Expression of the B cell-associated transcription factors PAX5, OCT-2, and BOB.1 in acute myeloid leukemia: associations with B-cell antigen expression and myelomonocytic maturation.

Authors: Sarah E Gibson; Henry Y Dong; Anjali S Advani; Eric D Hsi
Journal: Am J Clin Pathol Date: 2006-12 Impact factor: 2.493

8. Thiamine transporter gene expression and exogenous thiamine modulate the expression of genes involved in drug and prostaglandin metabolism in breast cancer cells.

Authors: Shuqian Liu; Arnold Stromberg; Hsin-Hsiung Tai; Jeffrey A Moscow
Journal: Mol Cancer Res Date: 2004-08 Impact factor: 5.852

Review 9. Statistical tests for differential expression in cDNA microarray experiments.

Authors: Xiangqin Cui; Gary A Churchill
Journal: Genome Biol Date: 2003-03-17 Impact factor: 13.583

9 in total