Literature DB >> 19259397

Constructing tumor progression pathways and biomarker discovery with fuzzy kernel kmeans and DNA methylation data.

Abstract

Constructing pathways of tumor progression and discovering the biomarkers associated with cancer is critical for understanding the molecular basis of the disease and for the establishment of novel chemotherapeutic approaches and in turn improving the clinical efficiency of the drugs. It has recently received a lot of attention from bioinformatics researchers. However, relatively few methods are available for constructing pathways. This article develops a novel entropy kernel based kernel clustering and fuzzy kernel clustering algorithms to construct the tumor progression pathways using CpG island methylation data. The methylation data which come from tumor tissues diagnosed at different stages can be used to distinguish epigenotype and phenotypes the describe the molecular events of different phases. Using kernel and fuzzy kernel kmeans, we built tumor progression trees to describe the pathways of tumor progression and find the possible biomarkers associated with cancer. Our results indicate that the proposed algorithms together with methylation profiles can predict the tumor progression stages and discover the biomarkers efficiently. Software is available upon request.

Entities: Chemical Disease Gene Species

Keywords: biomarkers; fuzzy kernel kmeans; kernel kmeans; progression pathway

Year: 2008 PMID： 19259397 PMCID： PMC2623296

Source DB: PubMed Journal: Cancer Inform ISSN： 1176-9351

Introduction

DNA methylation is a post-replication modification predominantly found in cytosines of the dinucleotide CpG that is infrarepresented throughout the genome except at small regions named CpG islands. It is noted that methylation events responsible for silencing critical tumor suppressor genes that lead to tumorigenesis can be captured in the DNA epigenetic code of a tumor. As DNA is heritable and stable, retrospective methylation analysis on samples collected earlier tends to provide more complete clinicopathological information. A few studies (Baylin, 2005; Robertson, 2005; Yan et al. 2001, Chen et al. 2003, Wei et al. 2002) have reported that normally unmethylated CpG island (a short stretch of DNA in which the frequency of the CG sequence is higher than other region) located in the promoter region of cancer cells undergoes dense hypermethylation during tumor progression. Hypomethylation can also be found at different parts of the genome in cancer. The extent of both DNA hypomethylation and hypermethylation in the tumor cell is likely to reflect distinctive biological and clinical features. Researchers found that DNA hypermethylation and hypomethylation are independent processes and appear to play different roles in tumor progression (Frigola et al. 2005). Methylation of the CpG sites influences the activity of nearby genes and is critical to the regulation of gene expression. With the advances in high throughput technology, we can get the methylation signature of multiple genes simultaneously and classify tumors based on the global patterns of DNA methylation. In this research, we intend to show that solid tumor progression can be characterized by the progressive accumulation of epigenetic events. We will concentrate on the hypermethylation analysis in this paper, even the proposed method can be applied to hypomethylation with little change. Several challenges we have to face in collecting and analyzing the multilocus hypermethylation data. First tumor tissues from different patients at different progression stages with distinct phenotypes have to be studied since it is difficult to collect tumor tissues of the same patient at different stages. Second we need to define an appropriate similarity (distance) measure before we can cluster the epigenetic data and construct the pathway. Wang et al. (2006) have used the weighted Euclidian distance to define the similarity of the binary hypermethylation data, which is not quite appropriate. Finally we have to build the tumor progression pathway with both methylation and phenotype data and each node (cluster) should make biological sense. Directed acyclic graphs (DAGs) or tree diagrams (Wang et al. 2007) have been used to represent possible tumor progression pathway. The main objectives of pathway construction is to construct patterns and relationships among hypermethylated genes that are progressively accumulated during tumorigenesis. Therefore the phenotypes of the progeny node in the pathway tree are supposed to be more aggressive than the parent nodes and the parent’s hypermethylated loci are inherited by their progeny nodes. In this paper, we proposed a two-step pathway construction algorithm. First we define a similarity measure using normalized mutual information and cluster the data with the measure to find the pathway nodes. Second we build the pathway tree based on the center of each cluster and the heritability of the genotype and phenotype data. By first clustering the genotype data, one reduces the number of multiple comparisons substantially. Moreover it is much more robust and more likely to be platform independent to look at aggregates of genes (clusters). Most importantly combining the genotypes along pathways is biologically meaningful. Pathways are closer to the clinical phenotype than the individual constituents of these pathways. The whole is usually more than the sum of its parts. This paper is organized as follows. In section 2, we develop the clustering procedures for pathway construction. We construct a pathway and discover the associated biomarkers with real data in section 3. Conclusions and discussions are given in section 4.

Methods

Kernel based clustering

Data clustering algorithms partition the data into pre-specified number of groups such that a well defined cost function is minimized. Many clustering algorithms have been developed in recent years. The most well-known and widely used algorithm is the iterative relocation scheme of Euclidean kmeans (Jain and Dubes, 1988). The popularity of this algorithm stems from its simplicity and scalability. However the squared-Euclidean cost function of kmeans is not a good match with the data and consequently kmeans performs poorly as compared with other approaches (Strehl, 2000). Another major drawback of kmeans is that it can not separate clusters that are nonlinearly separable in the input space. All of these have lead to the search for more appropriate distance (similarity) functions (Aggarwal, 2003; Strehl et al. 2000). Entropy based distance measure has been used for text classification (Dhillon et al. 2003). Data produced by methylation array are binary in nature (1: hypermethylated; 0: unmethylated) for which clustering based on Euclidean distance is not meaningful. In this paper we propose the mutual information based distance (similarity) function which capture the nonlinear correlation between the variables. We define an entropy kernel by introducing the normalized mutual information. Mutual information also called Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) was first proposed to measure the distance between two probability distributions. Unlike correlation coefficient, mutual information captures both the linear and nonlinear correlations. Give two sequences x and y, the normalized mutual information is defined as (Liu and Chen, 2005): where is the entropy for sequence x and h(y) is the entropy for sequence y and 0 ≤ K(x,y) ≤ 1. K(x,y) = 0 when two sequences are independent and K(x,y) = 1 when they are completely correlated. K(x,y) is also symmetric with respect to the two sequences x and y. It follows from the Mercer’s theorem that Equation (1) defines a kernel function. To form a kernel function, data points are mapped to a higher-dimensional feature space using a nonlinear function φ(x), then a kernel function is formed with the inner product, K(x,y) = φ(x)φ(y), where Φ(x) is the transpose of φ(x). However we can define a kernel function directly as in Equation (1) without knowing the transformation function φ(x) explicitly. Kernel kmeans partitions the data in the new feature space with the kernel matrix. Given a set of input data x1, x2,..., x, the kernel kmeans algorithm seeks to find clusters C1,C2,...,C that minimize the objective (error) function: where Note that the p-th cluster is denoted by a clustering of the data by {C} 1, and the mean of cluster C by m. We obtain the distance computation with each center as follows: Note the final result of Equation (3) is associated with kernel values only. It can be shown that kernel kmeans algorithm is equivalent to spectral clustering and graph partitioning (Dhillon et al. 2005). Spectral clustering is a eigenvector-based algorithm with kernel principal analysis, which can be computationally prohibitive. It also has the drawback that the number of eigenvectors used has to be predetermined. The equivalence of kernel kmeans and kernel principal component clustering implies that we can use the iteration algorithms for directly minimizing normalized-cut of graph. Therefore kernel kmeans is more computational efficient. With Equation (3), the standard kmeans iteration procedure can be applied without any difficulty. Details of the algorithm are given below: Calculating the kernel matrix K with the epigenetic data. Given the number of clusters k, the optional maximum number of iterations, and optional initial clusters: Initialize the k clusters C1(0),…,C(0) randomly if no initial clustering is given. Set t = 0. for each sample x and each cluster p, compute Update the clusters through sending x to cluster p* such that p* = arg min d(x, m). continue t = t + 1 until m is converged or the maximum iteration is exceeded. Output the final clusters.

Fuzzy kernel kmeans

The hard partitioning kernel kmeans is simple and popular, but its results are not always re liable and these algorithms have numerical problems as well. Fuzzy kernel kmeans algorithm is a generalization of kernel kmean. It minimizes the following error function: with constraints: where μ represents the membership degree and the weighting exponent q > 1 indicates the fuzziness of the clusters (q = 2 in our application). The function is minimized only if and We have the squared distance Calculating the kernel matrix K with the epigenetic data. Given the number of clusters k, the optional maximum number of iterations, and the termination tolerance ɛ > 0. Initialize the partition matrix U0 = [μ], and set t = 0. Compute the distances: Update the partition matrix: Stop until ||U − U−1|| < ɛ

Cluster validation

The number of clusters is predefined in the two kernel kmeans algorithm. There is no generally accepted procedure for determining the number of clusters. This decision should be guided by theory and practicality of the results, along with the use of the inter-cluster distances at successive steps. The distance between two samples can be calculated with d(x, x) = K(x, x) + K(x, x) − 2K(x, x). One approach to determining the number of clusters is to utilize the validation function. Different validity measures have been proposed in the literature, non of them is perfect by itself. We have utilized Dunn’s index (DI) to determine the number of clusters. The number of clusters is chosen with the maximal DI value.

Pathway Discovery

The clusters found with kernel kmenas can be used as the nodes of the pathway tree. We can build the pathway with the centers of genotype and phenotype data of each cluster. Because both genotype and phenotype data are ordinal in nature, their centers are defined to be the medians of each cluster. Given K genotype centers {GC} = 1 and phenotype centers {PC} =, the tumor progression tree are built with the heritability of each node, i.e. a child node j from its parent node i must satisfy GC ≥ GC and PC ≥ PC. Under this condition, all hypermethylated loci in a parent node is inherited by its progeny node and the progeny node has at least the same serious phenotype condition as its parent node.

Computational experiments

Our experiments are performed with the methylation data of 50 breast carcinomas of unrelated patients (http://www.stat.ohio-state.edu/statgen/). There are 9 genes for their methylation status (0: unmethylated; 1: hypermethylated). The 9 genes are: GPC3 RASSF1A, WT1, uPA, HOXA5, p16, 3OST3B, BRCA1, and DAPK1. The 4 phenotype measurements used in our analysis are ER/PR (1: +/+; 2: +/− 3; −/−), histology (1: well-differentiated, WD; 2: moderately-differentiated, MD; 3: poorly-differentiated, PD), clinical stage (1, 2, 3, or 4), and metastasis status (0, M0; 1, M1). We exclude age from the phenotype data since it is not a strong cancer indicator. We find the cluster nodes either solely with epigenetic data or with the phenotype and epigenetic data together. Because the aim of clustering is to group the samples into different clusters, only one dataset is used. The cluster is validated with Dunn’s index (DI).

Clustering with kernel Kmeans

In this approach, we first find the clusters (nodes) based on epigenetic data and kernel Kmeans, and then the node centers are estimated to be the median of the phenotype and genotype within each cluster. We then build the pathway based on nodes heritability. The output for the node centers are given in Table (1).

Table 1

Node centers of genotype and phenotype data.

Genotype center	phenotype center
111010000	2230
110000110	3241
111110000	2220
010000000	1220
110100100	1220
100000010	2220
110100000	1220
010011100	1220
101100000	1220

Based on the node centers, the pathway tree built with the proposed algorithm is given in Figure (1). In Figure (1), the green spots represent the unmethylated loci and the red spots are the hypermethylated loci. The gene name of each locus is shown in the upper right corner of the figure. The numbers beside each node plot are the phenotype centers (medians).

Figure 1

Progression pathway clustered with kernel kmeans.

Fuzzy kernel Kmeans

We also found the nodes (clusters) with fuzzy kernel Kmeans and genotype data. There is one more cluster (node) in the tumor progression tree and multiple pathways with Fuzzy kernel Kmeans. Figure (2) is the corresponding progression pathway.

Figure 2

Progression pathway clustered with fuzzy kernel kmeans.

Results

The progression Pathways presented both in Figure (1) and Figure (2) validate the notion that tumors with more aggressive phenotypes should have higher level of methylation than that with less aggressive phenotypes. The first phenotype is the Hormone receptor status (ER/PR). The progression becomes more aggressive when more hypermethylation happened among the 9 genes. The other two phenotypes, histology and clinical stage, have the similar trend to progress from lower stage to higher stage in both progression trees. Metastasis happened in the terminal nodes of the pathways. The hypermethylation in the promoters of GPC3 and RASSF1A happened in several intermediate and terminal nodes of both trees. This is consistent with the observation that a large number of tumors have concurrent hypermethylation of GPC3 and RASSF1A. The progression pathways also indicated that hypermethylation in the promoter of 4 genes: 3OST3B, BRCA1, GPC3, and RASSF1A, leads to the most aggressive tumors whereas methylation in other genes has much less effect. Therefore, genes 3OST3B and BRCA1 should be studied carefully in functional analysis. Another gene HOXA5 may also play an important role in tumor aggression. Fuzzy kernel Kmeans seems to perform better in the sense that it finds one more node and multiple pathways.

Conclusions and Remarks

We proposed an entropy kernel and kernel based algorithm for recapitulating tumor progression pathway and biomarker finding. The proposed fuzzy kernel kmean algorithm seems to perform better than the standard kernel kmean, since the DI for fuzzy kernel kmeans (0.52) is greater than that for kernel kmeans (0.41). It can also provides multiple pathways. In the constructed progression tree, the progeny nodes have more hypermethylated gene promoters than its parent nodes and the progeny node has more aggressive tumor than its parents. The proposed method can not only discover different tumor progression paths but also find genes (biomarkers) that lead to more aggressive tumors and have biological and clinical significance. Fuzzy kernel Kmean algorithm is a novel idea that is not found in the current literature. The proposed algorithms can be applied to analyze hypomethylation data and build corresponding pathways with little revision. With appropriate kernel functions, the method can be utilized to analyze other sequential and microarray data.

7 in total

Review 1. DNA methylation and human disease.

Authors: Keith D Robertson
Journal: Nat Rev Genet Date: 2005-08 Impact factor: 53.242

2. Heritable clustering and pathway discovery in breast cancer integrating epigenetic and phenotypic data.

Authors: Zailong Wang; Pearlly Yan; Dustin Potter; Charis Eng; Tim H-M Huang; Shili Lin
Journal: BMC Bioinformatics Date: 2007-02-01 Impact factor: 3.169

Review 3. DNA methylation and gene silencing in cancer.

Authors: Stephen B Baylin
Journal: Nat Clin Pract Oncol Date: 2005-12

4. Dissecting complex epigenetic alterations in breast cancer using CpG island microarrays.

Authors: P S Yan; C M Chen; H Shi; F Rahmatpanah; S H Wei; C W Caldwell; T H Huang
Journal: Cancer Res Date: 2001-12-01 Impact factor: 12.701

5. Differential DNA hypermethylation and hypomethylation signatures in colorectal cancer.

Authors: Jordi Frigola; Xavier Solé; Maria F Paz; Victor Moreno; Manel Esteller; Gabriel Capellà; Miguel A Peinado
Journal: Hum Mol Genet Date: 2004-12-01 Impact factor: 6.150

6. Methylation target array for rapid analysis of CpG island hypermethylation in multiple tissue genomes.

Authors: Chuan-Mu Chen; Hsiao-Ling Chen; Timothy H-C Hsiau; Andrew H-A Hsiau; Huidong Shi; Graham J R Brock; Susan H Wei; Charles W Caldwell; Pearlly S Yan; Tim Hui-Ming Huang
Journal: Am J Pathol Date: 2003-07 Impact factor: 4.307

7. Methylation microarray analysis of late-stage ovarian carcinomas distinguishes progression-free survival in patients and identifies candidate epigenetic markers.

Authors: Susan H Wei; Chuan-Mu Chen; Gordon Strathdee; Jaturon Harnsomburana; Chi-Ren Shyu; Farahnaz Rahmatpanah; Huidong Shi; Shu-Wing Ng; Pearlly S Yan; Kenneth P Nephew; Robert Brown; Tim Hui-Ming Huang
Journal: Clin Cancer Res Date: 2002-07 Impact factor: 12.531

7 in total