Ito Wasito¹, Siti Zaiton M Hashim, Sri Sukmaningrum
¹ Department of Software Engineering, Faculty of Computer Science and Information Systems, University Technology Malaysia, Skudai, Johor Bahru, Malaysia. ito@gmx.co.uk
Abstract
Gene expression profiling plays an important role in the identification of biological and clinical properties of human solid tumors such as colorectal carcinoma. Profiling is required to reveal underlying molecular features for diagnostic and therapeutic purposes. A non-parametric density-estimation-based approach called iterative local Gaussian clustering (ILGC) was used to identify clusters of expressed genes. We used experimental data from a previous study by Muro and others consisting of 1,536 genes in 100 colorectal cancer and 11 normal tissues. In this dataset, ILGC finds three gene clusters, two large and one small, similar to their results, which were obtained with Gaussian mixture clustering. The correlation between each cluster of genes and the clinical properties of malignancy of human colorectal cancer was analysed for three parameters: tumor versus normal tissue, the existence of distant metastasis and the existence of lymph node metastasis.
Gene expression profiling is an effective approach for extracting useful information from a large number of simultaneously expressed genes within specific cell types. The approach is useful not only for investigating known biological cells but also for exploring unknown biological cells in relation to specific gene functions [1]. Comprehensive profiles of mRNA levels can be obtained and used to discriminate cancer cells from normal cells and to provide sub-classes of tumor types. The possibility of measuring thousands of simultaneously expressed genes represents a challenge in terms of analysis and interpretation. One useful application is the identification of genes whose expression levels are associated with human colorectal carcinoma, for which knowledge of the biological and clinical properties of malignancy is still limited [4]. This solid tumor is one of the most prevalent and well-characterized human cancers and, in spite of recent advances in diagnosis and therapeutics, is still a leading cause of death [3].

Clustering is a powerful exploratory technique for the analysis of gene expression profiles. In the past decades, a number of clustering algorithms have been proposed in this context, including Hierarchical Clustering [1,2] and Gaussian Mixture Clustering [3]. Hierarchical cluster analysis is probably the most popular and powerful method for unveiling underlying features of gene expression profiles. However, because valid statistical evaluation methods are lacking, the results are subject to interpretation by the investigator. Gaussian Mixture Clustering is another powerful approach. This parametric clustering method has been applied to gene expression linked to malignancy of human colorectal carcinoma with promising results [3]. However, it requires prior information about the number of clusters in the dataset, which is often not available in practice.

In this paper, a density-based clustering method for uncovering underlying structures of gene expression data is explored. The advantages of this method, called Iterative Local Gaussian Clustering (ILGC), include the simplicity of the technique, no need for prior information on the number of clusters, and the requirement of only one parameter, the number of nearest neighbours.
Methodology
Through density-based estimation, we try to approximate the ‘true’ density of genes. There are two main approaches to density estimation: parametric and non-parametric. The first approach was implemented by Muro et al [3] using Gaussian Mixture clustering and a Bayesian framework, with promising results. However, to avoid the need to know the number of clusters in advance, we used the non-parametric approach to determine the density of genes. The original form of the density-based approach is given in equation (1) (see supplementary material).

Owing to its simplicity, the K-nearest neighbour (KNN) method is one of the most popular non-parametric approaches [5,6,7]. In this report, we extend KNN density estimation by combining it with a Gaussian kernel function. In the proposed method, KNN is used to determine the ‘best’ local genes iteratively for Gaussian kernel density estimation. The local best is defined as the set of neighbouring genes that maximizes the Gaussian kernel function. This leads to an alternative
non-parametric clustering approach that is called iterative local Gaussian clustering (ILGC).
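To make the idea of a KNN-restricted Gaussian kernel density concrete, the following is a minimal sketch rather than the authors' formulation, since equations (1) and (2) are only given in the supplementary material. The function name `knn_gaussian_density` and the choice of the distance to the k-th neighbour as the kernel bandwidth are assumptions made for illustration.

```python
import math

def knn_gaussian_density(target, points, k):
    """Gaussian kernel density at `target` using only its k nearest points.

    Assumption of this sketch: the bandwidth h is the distance from
    `target` to its k-th nearest point, so the scale adapts to how
    sparse the neighbourhood is.
    """
    # Distances from the target to every point, smallest first.
    dists = sorted(math.dist(target, p) for p in points)
    nearest = dists[:k]
    h = nearest[-1] or 1e-12          # guard against a zero bandwidth
    d = len(target)                   # dimensionality of the data
    norm = k * (math.sqrt(2.0 * math.pi) * h) ** d
    return sum(math.exp(-0.5 * (r / h) ** 2) for r in nearest) / norm

# A point inside a tight group gets a much higher density than an isolated one.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense = knn_gaussian_density((0.05, 0.05), pts, 3)
sparse = knn_gaussian_density((5.0, 5.0), pts, 3)
```

Because the bandwidth comes from the neighbour search itself, the only value the user has to supply is k.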
Iterative local Gaussian clustering
The basic form of the Gaussian kernel function for gene clustering is given in equation (2) (see supplementary material).

There are two main rules for selecting the best local genes: the KNN rule and the Bayesian rule. In ordinary KNN density estimation, the KNN rule assigns a target gene to the cluster to which the majority of its neighbouring genes belong. On the other hand, ILGC implements a Bayesian decision rule such that the target gene is assigned to the c-th cluster if the majority of the k neighbours of the target gene maximizes the density function, Kc(x). To do this, we apply the rule iteratively using the inequality in equation (3) (see supplementary material).

Note that we do not use the scale parameter explicitly in the equation, as it is determined in the k-nearest neighbour selection process. The iterative local Gaussian clustering algorithm can be summarized as follows:
ILGC Algorithm (Database, k neighbours)
1. Set the number of clusters to N, the number of “informative genes” selected.
2. Assign each gene xi (i = 1, ..., N) with k neighbours to cluster c as in equation (4) (see supplementary material).
3. If there is no change in the cluster structure, the iterations have converged: re-index the clusters and stop. Otherwise go to step 4.
4. Re-calculate cluster membership using equation (3) (see supplementary material), then go to step 2.
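The loop described by the algorithm can be sketched as follows. This is an illustrative reading, not the authors' implementation: equations (3) and (4) are not reproduced here, so the scoring rule (sum of Gaussian kernel values per cluster among the k neighbours), the exclusion of a gene from its own neighbourhood, the in-place label updates and the name `ilgc_sketch` are all assumptions.

```python
import math

def ilgc_sketch(data, k, max_iter=100):
    """Iteratively relabel each gene with the cluster whose members,
    among its k nearest neighbours, carry the largest Gaussian kernel mass."""
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]
    # k nearest neighbours of each gene, excluding the gene itself (assumption).
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    labels = list(range(n))            # step 1: every gene is its own cluster
    for _ in range(max_iter):
        changed = False
        for i in range(n):             # step 2: assign gene i to a cluster
            h = dist[i][neighbours[i][-1]] or 1e-12   # bandwidth from the k-th neighbour
            score = {}
            for j in neighbours[i]:
                w = math.exp(-0.5 * (dist[i][j] / h) ** 2)
                score[labels[j]] = score.get(labels[j], 0.0) + w
            best = max(sorted(score), key=score.get)  # lowest label wins ties
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:                # step 3: cluster structure is stable
            break
    # Re-index the surviving clusters as 0, 1, 2, ...
    remap = {c: idx for idx, c in enumerate(sorted(set(labels)))}
    return [remap[c] for c in labels]

# Two well-separated groups of one-dimensional points.
toy = [(0.0,), (1.0,), (2.5,), (3.0,), (20.0,), (21.0,), (22.5,), (23.0,)]
labels = ilgc_sketch(toy, k=3)
```

On this toy input each group collapses into a single cluster without the number of clusters being specified, which is the property the method relies on. The 95% convergence criterion mentioned later in the paper is not modelled; the sketch stops only when no label changes.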
Data imputation
The original dataset contains a number of missing entries, which must be filled in before the ILGC algorithm can be applied. We apply the INI algorithm [6,7] to impute these missing entries. This method is based on a least-squares principle: it minimizes the sum of squared differences between the observed data entries and those reconstructed via bilinear modelling, which is akin to the singular value decomposition (SVD) of a data matrix. Details of the INI algorithm can be found elsewhere [6,7].
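The least-squares principle behind this kind of imputation can be illustrated with a deliberately simplified sketch. It is not the INI algorithm of [6,7]: it fits a single bilinear factor x_ij ≈ z_i * c_j to the observed entries by alternating least squares and uses the fitted product to fill the gaps. The function name `impute_rank_one` and the use of `None` to mark missing entries are assumptions.

```python
def impute_rank_one(matrix, n_iter=200):
    """Fill None entries using a rank-one least-squares model x_ij ~ z_i * c_j,
    fitted only on the observed entries."""
    rows, cols = len(matrix), len(matrix[0])
    z = [1.0] * rows                   # row scores
    c = [1.0] * cols                   # column loadings
    for _ in range(n_iter):
        # Update each column loading with the row scores held fixed.
        for j in range(cols):
            obs = [i for i in range(rows) if matrix[i][j] is not None]
            den = sum(z[i] ** 2 for i in obs)
            c[j] = sum(matrix[i][j] * z[i] for i in obs) / den if den else 0.0
        # Update each row score with the column loadings held fixed.
        for i in range(rows):
            obs = [j for j in range(cols) if matrix[i][j] is not None]
            den = sum(c[j] ** 2 for j in obs)
            z[i] = sum(matrix[i][j] * c[j] for j in obs) / den if den else 0.0
    # Keep observed values; replace missing ones by the model reconstruction.
    return [
        [matrix[i][j] if matrix[i][j] is not None else z[i] * c[j]
         for j in range(cols)]
        for i in range(rows)
    ]

# The rows are multiples of (1, 2, 3), so the missing entry should be close to 6.
filled = impute_rank_one([[1.0, 2.0, 3.0], [2.0, 4.0, None], [3.0, 6.0, 9.0]])
```

A practical method would use more than one factor and, as in the nearest-neighbour variants, fit the model locally around each incomplete row.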
Gene selection
Another issue addressed in our implementation of cluster analysis is that of “noisy” genes, which carry little information. We use the Correlation Ratio (CR) method, given in equation (5) of the supplementary material, to select the informative genes [3].
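As an illustration of how a correlation ratio scores a gene against a clinical grouping, the sketch below uses the textbook definition: the between-group sum of squares divided by the total sum of squares. Equation (5) in the supplementary material may differ in normalisation, so this form and the function name `correlation_ratio` are assumptions.

```python
def correlation_ratio(values, groups):
    """Share of the variance of `values` explained by the grouping
    (between-group sum of squares over total sum of squares, in [0, 1])."""
    overall = sum(values) / len(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    between = sum(
        len(vs) * (sum(vs) / len(vs) - overall) ** 2 for vs in by_group.values()
    )
    total = sum((v - overall) ** 2 for v in values)
    return between / total if total else 0.0

# A gene that separates tumor (T) from normal (N) samples perfectly scores 1,
# while a gene with identical group means scores 0.
labels = ["T", "T", "N", "N"]
informative = correlation_ratio([1.0, 1.0, 5.0, 5.0], labels)
noisy = correlation_ratio([1.0, 5.0, 1.0, 5.0], labels)
```

Genes would then be ranked by this value and the low-scoring ones discarded; any selection threshold is a further choice that the text does not specify.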
Discussion
In this work, we used the informative genes selected by Muro et al [3], which comprise 341 of the 1,536 genes, measured in 100 cancerous and 11 normal samples, together with their clinical parameters. Using ILGC with 10 nearest neighbours and a 95% convergence rate, three clusters were found, the same number as in the Gaussian Mixture Clustering results of Muro et al [3]. However, ILGC uncovers a different cluster structure from that found by the Gaussian Mixture method. The structure of the clusters can be visualized in a 2-D graph by plotting the first and second components of a principal component analysis (PCA), as shown in Figure 1. The results show two large gene clusters and one small cluster.
Figure 1
The structure of the clusters found by the ILGC algorithm. Green, blue and red represent cluster I, cluster II and
cluster III, respectively.
For the two large clusters, cluster I and cluster II, further analysis was carried out to detect any relationship to the clinical parameters of the cancer: cancerous or normal tissue, distant metastasis and lymph node metastasis. Correlation Ratio (CR) analysis was used, based on the following procedure: (a) calculate the CR value for each gene in clusters I and II; (b) sort the genes by the CR values from (a); (c) permute the sample positions for each gene, then calculate the CR of the permuted samples; and (d) plot all CR values from (b) and (c).

Figure 2 shows that clusters I and II correlate with the difference between tissues that contain tumor and normal tissues. Figure 3a and Figure 3b show that clusters I and II correlate significantly with the existence of distant metastasis. However, clusters I and II show no correlation with the existence of lymph node metastasis (Figure 4a and Figure 4b).
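Steps (a) to (d) of the procedure can be sketched as below, leaving out the plotting. The correlation ratio is again taken in its textbook form, which is an assumption, as are the function names, the number of permutations and the fixed random seed.

```python
import random

def correlation_ratio(values, groups):
    """Between-group sum of squares over total sum of squares."""
    overall = sum(values) / len(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    between = sum(
        len(vs) * (sum(vs) / len(vs) - overall) ** 2 for vs in by_group.values()
    )
    total = sum((v - overall) ** 2 for v in values)
    return between / total if total else 0.0

def permutation_cr_curves(expr, groups, n_perm=20, seed=0):
    """Return the sorted observed CR curve and one sorted curve per permutation.

    `expr` holds one list of sample values per gene; `groups` holds the
    clinical label of each sample.
    """
    rng = random.Random(seed)
    # (a) and (b): CR per gene, sorted from largest to smallest.
    observed = sorted((correlation_ratio(g, groups) for g in expr), reverse=True)
    permuted = []
    for _ in range(n_perm):
        # (c): shuffle the sample positions of every gene independently.
        curve = [correlation_ratio(rng.sample(g, len(g)), groups) for g in expr]
        permuted.append(sorted(curve, reverse=True))
    # (d): the caller plots `observed` against each curve in `permuted`.
    return observed, permuted

groups = ["T", "T", "T", "N", "N", "N"]
expr = [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0], [2.0, 3.0, 2.0, 3.0, 2.0, 3.0]]
observed, permuted = permutation_cr_curves(expr, groups)
```

If the observed curve lies clearly above the permuted curves, as described for Figures 2 and 3, the association between the cluster's genes and the clinical parameter is unlikely to be a chance effect.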
Figure 2
Clusters I and II correlate with the difference between tissues that contain tumor and normal tissues. The vertical axis
represents the CR value for the difference between cancerous and normal tissues in cluster I (a) and cluster II (b). The
horizontal axis represents the genes sorted by their CR values. The top blue line represents the clusters found by ILGC; the others represent
permuted samples.
Figure 3
The vertical axis represents the CR value for the difference between tissues with and without distant metastasis:
cluster I (a) and cluster II (b). The horizontal axis represents the genes sorted by their CR values. The top blue line represents
the clusters found by ILGC; the others represent permuted samples.
Figure 4
The vertical axis represents the CR value for the difference between tissues with and without lymph node metastasis: cluster I (a)
and cluster II (b). The horizontal axis represents the genes sorted by their CR values. No correlation with the existence of lymph node
metastasis is observed.
Because cluster III contains only a small number of genes (17), we use the difference correlation analysis technique instead. As this cluster contains TCL (tumor classifier) genes,
it appears to correlate with the existence of tumor. Figure 5 shows that when distant metastasis exists, cluster III correlates to the third colorectal clinical parameter.
Figure 5
Linkage of the clusters of expressed genes to the existence of distant metastasis in cluster III using the difference
correlation analysis technique.
Conclusion
In this paper, we explored a non-parametric density-based clustering technique called iterative local Gaussian clustering (ILGC). The advantages of ILGC include the simplicity of the technique, no requirement for prior information on the number of clusters and the use of a single parameter, the number of nearest neighbours.

The ILGC algorithm has been tested on the colorectal carcinoma database of Muro et al, 2003 [3]. The results show that the proposed method produced the same number of clusters as that found by Muro et al. In addition, the clusters found by ILGC could be linked to the malignancy of human colorectal carcinoma, specifically to the existence of tumor and of distant metastasis.

Further work is needed to compare ILGC experimentally with other existing clustering techniques, such as Hierarchical clustering, Gaussian Mixture clustering and K-Means, and to apply it to the identification of other cancers.