Literature DB >> 23226009

A fast quad-tree based two dimensional hierarchical clustering.

Priscilla Rajadurai¹, Swamynathan Sankaranarayanan.

Abstract

Recently, microarray technologies have become a robust technique in the area of genomics. An important step in the analysis of gene expression data is the identification of groups of genes disclosing analogous expression patterns. Cluster analysis partitions a given dataset into groups based on specified features. Euclidean distance is a widely used similarity measure for gene expression data that considers the amount of changes in gene expression. However, the huge number of genes and the intricacy of biological networks have highly increased the challenges of comprehending and interpreting the resulting group of data, increasing processing time. The proposed technique focuses on a QT based fast 2-dimensional hierarchical clustering algorithm to perform clustering. The construction of the closest pair data structure is an each level is an important time factor, which determines the processing time of clustering. The proposed model reduces the processing time and improves analysis of gene expression data.

Entities: Chemical Disease Gene Species

Keywords: clustering; euclidean distance; hierarchical clustering; quad tree

Year: 2012 PMID： 23226009 PMCID： PMC3511054 DOI： 10.4137/BBI.S10383

Source DB: PubMed Journal: Bioinform Biol Insights ISSN： 1177-9322

Introduction

The DNA microarray has emerged as the latest breakthrough in molecular biology; it provides researchers with the opportunity to monitor genome-wide expression systematically.1 The microarrays are used to study gene expression profiles in biological samples.7 A microarray or microchip is a chip made with glass or other solid material, with an array of tiny DNA spots placed on it. Each of the spots contains fragments of DNA or RNA molecules whose sequence is predefined and corresponds to portions of a particular gene.8 Microarrays do not provide full information about genes, but rather they denote genes indirectly through their expressions. Also, the expressions obtained can be inaccurate depending on the applied microarray technology.30 Microarrays have emerged as a standard for simultaneous evaluation of the expression level of thousands of genes.2 Clustering techniques play a significant role by discovering sets of objects with identical functions from huge quantities of data.3 A good clustering algorithm should be able to identify genes that have similar expression profiles, including time-shifted or inverted profiles, and provides phase information.4 There are many popular techniques for clustering gene expression data by elucidating different functional roles in genes that play an important role in the biological process.5 A major area of concern in the clustering analysis of gene expression data is the sensitivity and vulnerability of results to noise and overfitting, due to their excessive dependence on limited biological and medical information.3 Most clustering algorithms are distance-based whereas some involve hierarchical clustering, K-means clustering, or a self-organizing map.6,17 Hierarchical clustering techniques are extensively used in microarray data analysis, which combines all data points into a single set which are placed adjacent to each other in the feature space.11 At present, hierarchical clustering is the most often used method for grouping data.13,14 The main objective of hierarchical clustering is to obtain a best cluster that will signify a set of patterns in the background of a given distance metric. This method permits biologists to visualize global expression patterns in DNA microarray data through graphic representation.12 The hierarchical clustering method is classified into 2 distinct types: agglomerative (bottom-up) and divisive (top-down).12,15 This method is based on a similarity or distance measure of the data, such as a correlation, Euclidean, squared Euclidean, or city-block (Manhattan) distance.13 Normally, clusters are constructed using a hierarchical tree. This tree is created after calculating the distance between pairs of objects in the correlation matrix.16 A quad tree (QT) is a tree data structure in which each internal node has up to 4 children. The quad tree and its different derivatives are considered to be the backbone for the storage, retrieval and analysis of spatial data. An efficient searching technique mainly depends on the height of the tree; an arbitrary insertion of the point features will make the tree unbalanced and will increase the time of searching. From the review it is obvious that clustering algorithms used in previous research have partitioned the data, where each gene belongs to only one cluster. Some clustering algorithms that can allow each gene to only be in one cluster, including the k-means algorithm, hierarchical clustering algorithm, biclustering algorithm, fuzzy k-means algorithm and Self-Organizing Map (SOM). These methods have some disadvantages while working with microarray gene expression data because of the high complexity of biological processes. The nature of proteins and their interaction is a major reason for this. The genes that generate proteins are expected to express in more than one group of genes because proteins generally perform unique biological roles by interacting with other groups of proteins. This explains the inclusion of a gene in more than one cluster of microarray gene expression data. The QT has focused on the clustering of gene expression data with the expectation maximization (EM) algorithm, which estimates incomplete data samples. It also validates the cluster and measures the representation of objects in the clusters. The algorithm is compared with the k-means algorithm for performance validation.28 Hierarchical biclustering suffers by selecting a depth of the tree, so the use of cross-validation measures and automatic depth selection capability will be needed. Each sample chooses a single path, which leads to a feasibility problem. The pathway enrichment among genes is activated through the same edges by introducing the activation model. Whenever the sample increases, the depth of the tree also increases, leading to computation complexity.31 The large samples represented in the QT may reduce its complexity by storing a large amount of samples at each level. A 2-dimensional hierarchical clustering approach was introduced by effectively representing the existence of genes in more than one cluster. The proposed approach was used to speed up the clustering process. It uses the QT-based data structure to find the closest pair, which also reduces the processing time.27

Quad Tree (QT)

A QT is a tree-like data structure, where each inner node has exactly 4 children. The QT and its different derivatives are mainly used for the purpose of storage, recovery and analysis of spatial data.26 A QT is used to index data records in a spatial database in accordance with spatial location. Most businesses and government agencies may wish to query the database using spatial constraints; in such cases, the QT facilitates the storage and querying of spatial data. The QT can also be used to index data records in multiple spatial databases. The QT index values are used to search across the different databases by extracting components such as associations, patterns, clusters, outliers, or nearest neighbors. The point QT is constructed successively by adding data points one by one. Initially, a point search is performed to add a point. In the QT, if there is no point related to target point (the point that has to be added), then the target point is inserted into the leaf node, where the search has to end. The QTs is designed to be faster in various applications and the performance depends on the distribution of data in the domain. The height and shape of the point QT greatly relies on the insertion sequence. Until now, no significant effort has been taken to overcome this height balance problem, or to make the QT search more efficient.17

Review of Research

A handful of research studies have been presented for clustering microarray gene expression. Their data are illustrated below. Kim and colleagues10 address a wide range of problems such as categorization of disease subtypes and tumors in biological and medical research. They describe the microarray, which has emerged as the most effective and broadly used tool for this categorization. The main objective of analyzing gene expression data has been to isolate data samples or genes. Identical expression patterns and statistical techniques exist to analyze and organize these complex data in a meaningful way. They have discovered that normalization, extent of noise and clarity in the datasets will change the clustering methods that are most commonly used in the analysis of microarray data.10 The microarray technique in concurrent measurement of the expression level in thousands of messenger RNA (mRNA)s has been enabled. This has been made possible by mining the data; it is feasible to recognize the dynamics of a gene expression time series in this manner. They have decreased the dimensionality of the data set by employing Principal Component Analysis (PCA). Examination of the components has provided an approach into the underlying factors calculated in the experiments. PCA has demonstrated that it is proved from their consequences that all rhythmic content of data can be decreased to three main components.19 Hereditary inclusion body myopathy (HIBM) of adult start steadily rising distal and proximal myopathy has also been discussed.20 After examining the expression outline data sets by the overlap of three statistic methods (Student’s t-test, TNoM and Info score), it has been found that the HIBM-specific transcriptome contains 374 differentially expressed genes. With the delicate contribution of mitochondrial processes exposed in HIBM, an unexpected feature of HIBM pathophysiology has been discovered. This could be expanded to provide reasons for the slow development of this disorder, and afford some understanding of its disease mechanism. The main objective of gene expression analysis is to comprehend the processes of regulatory networks, as well as what pathways are restricted during inter-cellular and intra-cellular activities. Currently, microarray datasets are broadly used for this purpose. By employing their algorithm on a yeast speculation dataset, D’Souza and colleagues21 have demonstrated that their algorithm can detect gene networks with reasonable ease. A firm gene selection and efficient cancer prediction structure called SGS has been introduced. This structure first recognizes gene groups in which the genes have high correlation coefficient by means of a clustering algorithm. Finally, a prediction model is constructed based on shrinkage gene space, using a capable classification algorithm (such as Support Vector regression (SVM), 1-nearest neighbor (1NN), or regression). By means of the trial results obtained on real-world data, the structure has been shown to regularize highly available feature selection and prediction methods, such as Significant Analysis of Microarray (SAM), Information gain (IG) and the Lasso-type prediction model.22

A Fast Two-Dimensional Hierarchical Clustering

The recent arrival of microarray technologies has permitted biologists to simultaneously monitor the behavior of numerous genes, which produces large quantities of complex data. A large amount of gene expression data has been generated continuously using microarray experiments. It has been demonstrated that gene expression data containing vital information is very useful in medical diagnosis, therapy, and drug design. The aim of the proposed technique is to examine such data in order to obtain this essential information. Cluster analysis has played a key role in analyzing gene expression data. The main aim of clustering is to partition a set of objects into clusters, so that objects in a group are more analogous to one another than to objects in other clusters. Several clustering algorithms have been used to find co-expressed genes, but the processing time of these techniques are very high. Hence, our aim is to reduce the processing time of the clustering analysis. In order to solve the aforementioned problem, in this research, a fast, novel, semi-supervised, 2-dimensional QT-based hierarchical clustering technique is proposed. Here, the clustering elements are selected from the microarray gene expression database by means of the index matrix, and these elements are clustered using a QT-based 2-dimensional clustering technique. After the clustering process, the best ‘k’ clusters are identified using a fitness evaluation. Then, the closest index of all best ‘k’ clusters is calculated, and is used to obtain the next set of clustering elements from the database. This process is repeated ‘r’ times until the optimum cluster is found.

QT-based two dimensional clustering

In cluster analysis, groups of analogous (similar) objects are identified by maximizing inter-group similarity and minimizing intra-group similarity. Cluster analysis partitions data into significant or useful groups (clusters). It also plays a key role in evaluating the gene expression data. Clustering techniques are very useful for understanding basic biological processes. There are many popular and robust techniques for clustering gene expression data, with the aim of describing different functional roles of genes in biological processes. It also describes genes in the cluster with similar expression (co-expressed genes), which serve similar functional roles in a process. In this investigation, we used a QT-based technique to expedite the 2-dimensional hierarchical clustering. Figure 2 illustrates the overall process of the proposed system in a single iteration. The clustering elements are selected from the gene expression database using the random values of an index matrix. Each ‘q’ type gene representation of ‘Q’ elements is clustered into 4 clusters, by means of QT-based vertical dimensional clustering, and then clustered into 2 clusters using QT-based horizontal clustering.

Figure 2

Process of QT based 2-dimensional clustering.

Notes: Let X be a database that contains ‘a’ gene representation of ‘b’ clustering elements and x = {x|x} 1 < i ≤ q ; 1 < j ≤ Q be the ‘q’ type gene representation of ‘Q’ elements, selected randomly from the database X using the index Y = {Y|Y < A ∀ i, j} 1 < i ≤ m ; 1< j ≤ n. Each value in the ‘Y’, which represents the row index value of Database X, must be unique and less than the maximum number of gene representations ‘A’ in the database X.

Genes clustering: vertical dimension

The pseudo-code for QT-based vertical dimensional clustering techniques explains how the genes are clustered using QT-based vertical dimensional clustering. Initially, the value of the root node of the QT is set to ‘Null’, and 4 values are selected randomly from the first type gene expression elements and inserted as a child of the root node. In the function Rchild (n, Root, P), the ‘n’ is the randomly selected data to be inserted, ‘Root’ is the parent node and ‘P’ is the position of child in the root node. Then, the element to be inserted is selected from the first type gene expression data, and to find its insertion position in the QT, its minimum Euclidean distance to the first level nodes that is a child of the root node is calculated. Subsequently, the node is inserted as a child node of the closest node that has a minimum distance value. The validity of the parent node is evaluated while inserting a child node to check whether the weight of parent node is less than 4 or not. If the above condition is satisfied, then the element is inserted as a child node, or it will find the closest node at the subsequent levels of the QT by calculating minimum distance. While inserting a node, if the parent node value is less than the child node to be inserted, then both the values are swapped. Thus, every element in the first type gene expression values are inserted into the QT and then clustered into 4 groups. Similarly, all gene type elements are clustered into 4 groups using the QT structure. Input: Clustering elements Output: The resultant solution clusters Parameters: Root ➔ Root node P ➔ Position L ➔ Depth of QT l ➔ level Nchild ➔ Number of child node Pseudo code Set Root=NULL c = 1; For i = 1 to Q For j = 1 to q For P = 1 to 4 Set n = Random (x) Insert Rchild (n, Root, P) End For While c < L l = 1; For k = 1 to 4 Set Set s = findmin (Ek) End For If Nchild ((s)) < 4 Insert child (xji, Rlk (s), P, l); Exit While Else Movedown (l, Rlk (s)); End If End While End for End for

Genes clustering: horizontal dimension

Every gene value is clustered inner-wise into 4 clusters according to their distance, and then these clusters are again clustered using a horizontal dimension for analyzing the gene values. The distance is calculated as follows. where, xji are the clustering elements in ‘x’ and Rlk is the clustered element in the kth level of the QT. The distance between 2 elements xji and Rlk is the quantity of the root of, sum of, square of, deviation among xji and Rlk. The first level of the horizontal-wise hierarchical clustering starts with the selection of 2 elements having greatest distance and inserted into the root node. Using Euclidian distance, the closest pair node of the next-gene representation element in the kth level is found and inserted into the corresponding node if the weight of the node is less than 4. Otherwise, it will find the closest pair of elements in the subsequent levels and insert the node here. Finally, every gene type is clustered based on QT.

Fitness evaluation

Let C be the resultant cluster and the fitness of C be calculated using the following equation. where is the weight of the each clustering element and R is the defined cluster used for the semi-supervised hierarchical clustering. If an element in the consequential cluster is in the defined cluster R, at that moment the weight of the element will be 1 and will be 0 otherwise. The 2-dimensional clustering progression and fitness evaluation are processed for every row of the index ‘I’ and K = {C | 1 < i ≤ l} is the resultant cluster. From the resultant cluster set ‘K’, the best clusters having the highest fitness value are selected and the contiguous index of all clusters are computed, which are then used to get the next set of clustering elements from the database. The clustering is performed repeatedly until an optimum cluster is produced using this cluster and the gene is analyzed.

Experimental Results

The proposed technique is implemented in the working platform of MATLAB 7.11 with the system configuration Intel(R) Core(TM) i5 CPU, 650@3.20GHz, 3.19 GHz, 3.17 GB of RAM and it is evaluated using the microarray gene expression data of human acute leukemia. The 2 datasets of standard leukemia for training and testing is obtained from ALL/ALM datasets23 and the performance of the proposed technique on clustering the ground truth data of the cancer classes, namely, acute myeloid leukemia (AML) and acute lymphoblast leukemia (ALL) are demonstrated. The 2 training leukemia datasets are again partitioned and become 4 sets, such as dataset_1, dataset_2, dataset_3 and dataset_4, each having N values (20, 18, 18, 17) respectively. This high-dimensional training dataset is subjected to clustering to analyze the occurrence of microarray genes in more than one cluster. In this clustering technique, an adaptive approach is used to represent the number of clusters that are dynamically generated from the microarray gene expression dataset. The proposed technique is a multi-stage clustering technique, which performs clustering at diverse levels. Initially, the gene values in every gene type are clustered vertically into 4 groups and then these gene types are clustered horizontally for further analysis. In the inner gene clustering, initially the root node is created and then the 4 clustering elements that are selected randomly from the first type gene expression data. These are inserted into the root node. Then the insertion position of next element to be inserted is found by calculating its minimum Euclidian distance from the first level nodes. This element is then inserted into the corresponding node as a child whose weight is less than 4. If the weight of the node is greater than or equal to 4, then the closest node at the subsequent level of the QT is found by calculating the minimum distance. This process is repeated until all elements are clustered. Thus, every gene representation value is clustered into 4 clusters. The clustering output of dataset_1, dataset_2, dataset_3 and dataset_4 are shown in Figures 3–6.

Figure 3

Dataset 1: hierarchical clustering.

Figure 6

Dataset 4: hierarchical clustering.

Notes: This experiment is done using 2 dataset such as AML and ALL. The 2 training leukemia dataset are partitioned again and turned to 4 set (dataset_1, dataset_2, dataset_3 and dataset_4) each having N values (20, 18, 18, 17) respectively. The result of these 4 dataset indicates that both AML and ALL are crossing each other that means AML have ALL and vice versa.

Comparison of Existing Algorithms

The comparison of the algorithms is related to their performance, as shown in Figure 7. The performance of the SOM is less whenever the number of clusters is high. The hierarchical and SOM clustering algorithms provide better results for small data sets. The expectation of maximization and K-Means algorithms are very suitable for the large datasets. These algorithms may suffer to a degree due to noise in the datasets.29 The above algorithms provide the clustered output but are lacking in several ways. In this paper, the 2-dimensional QT-based clustering technique is proposed for grouping gene expression data, with better performance than other clustering algorithms. It also finds genes in more than one cluster using a novel technique for achieving high accuracy. In Figure 7, the X-axis represents various clustering algorithms. The Y-axis represents the performance of these algorithms with many parameters. Each algorithm was analyzed based on input data sets as well as number of clusters (ie, 8, 16, 32 and 64) taken for the observation. The hierarchical clustering algorithm has higher performance than other algorithms, but it suffers in large datasets, so the novel 2D QT based hierarchical clustering has been proposed.

Figure 7

Performance comparison of existing algorithms.29

Performance Evaluation

The performance of the proposed 2-dimensional hierarchical data clustering technique is evaluated by clustering ground truth data of cancer classes, namely, acute myeloid leukemia (AML) and acute lymphoblast leukemia (ALL) using precision, recall and F-measures. Subsequently, these values are compared with the precision, recall and F-measure values of conventional hierarchical clustering. Precision and recall values of the clusters obtained by the proposed technique are given in the Table 1 and Figure 8 illustrate the corresponding graph. Precision and recall values of the clusters obtained by the Conventional Hierarchical Clustering are given in Table 2 and Figure 9 illustrates the same. We have used the precision, recall and F-measures described by Larsen and colleagues24 and Steinbach and colleagues25 to evaluate the performance of the proposed incremental text-clustering approach. where

Table 1

Precision, recall and F-measure values of the clusters using the proposed technique.

Dataset	Cluster	Precision	Recall	F-measure
Dataset_1	C1	0.8667	0.9286	0.8966
Dataset_1	C2	0.8000	0.6667	0.7273
Dataset_2	C3	0.9091	0.7692	0.8333
Dataset_2	C4	0.5714	0.8000	0.6667
Dataset_3	C5	0.9091	0.9091	0.9091
Dataset_3	C6	0.8571	0.8571	0.8571
Dataset_4	C7	0.8000	0.8000	0.8000
Dataset_4	C8	0.7143	0.7143	0.7143

Figure 8

Precision, recall evaluation of proposed technique.

Notes:Figure 8 describes the precision and recall evaluation of the proposed technique. The X-axis represents the number of clusters whereas the Y-axis represents the evaluation metrics, ie, value of precision and recall. Figure 9 describes that the conventional method of hierarchical clustering. The precision and recall values are less when compared to the proposed technique. For example, in the conventional method, cluster 1 has the precision and recall values of 0.5 and 0.2, respectively. In the proposed technique, cluster 1 has the precision and recall values of 0.8667 and 0.9286, respectively. All clusters have a higher value than the conventional method, indicating that the proposed method is faster.

Table 2

Precision, recall and F-measure values of the clusters using conventional hierarchical clustering.

Dataset	Cluster	Precision	Recall	F-measure
Dataset_1	C1	0.5000	0.2143	0.3000
Dataset_1	C2	0.2143	0.5000	0.3000
Dataset_2	C3	0.7000	0.5385	0.6087
Dataset_2	C4	0.2500	0.4000	0.3077
Dataset_3	C5	0.8182	0.8182	0.8182
Dataset_3	C6	0.7143	0.7143	0.7143
Dataset_4	C7	0.5000	0.3000	0.3750
Dataset_4	C8	0.3636	0.5714	0.4444

Figure 9

Precision, recall evaluation of the conventional technique.

M is the number of members of gene i in cluster j, M is the number of members of cluster j, M is the number of members of gene i. The performance of 2-dimensional hierarchical clustering without QT and with QT is also evaluated by comparing its processing time. In dataset_1 the processing time of 2-dimensional hierarchical clustering without QT and with QT is (2, 0.2). Likewise, dataset_2, dataset_3 and dataset_4 have processing times of (1.2,0.1), (1.3,0.1), (1.2,0.1). Their processing time differences are 1.8, 1.1, 1.2 and 1.1 respectively. Figure 10 shows the comparison between conventional and proposed techniques based on QT. Figure 11 illustrates the comparison between 2-dimensional hierarchical clustering and fast 2-dimensional hierarchical clustering based on QT.

Figure 10

Time comparison graph.

Figure 11

Time comparison graph for 2-dimensional hierarchical clustering without QT and with QT.

Conclusion

A novel and fast 2-dimensional hierarchical clustering technique has been proposed to deal with microarray genes that are present in more than one cluster. Initially, a set of clustering elements are selected randomly from the microarray gene expression data by using the index matrix, then they are clustered. The objective is to evaluate the fitness for selecting the best ‘k’ clusters from the various clusters. The next set of clustering elements is selected by finding the closest 2 cluster indices among the best ‘k’ clusters. These clusters are combined until the best clusters are found. The resultant genes are expressed in an efficient manner by eliminating biological complexities during the clustering process. In the existing techniques, each and every gene expression database was analyzed to find the closest pair, whereas the QT-based data structure uses some specific set of gene expression databases. The time comparison graph shows the average processing time of 2-dimensional hierarchical clustering without QT, which is 1.3 times larger than the proposed method. The proposed technique is faster when compared to existing clustering techniques in terms of performance. The experimental results based on real datasets have demonstrated that the proposed technique is truly more robust and efficient than traditional hierarchical clustering. In the future, the 2D QT algorithm will be implemented in various application areas in a more efficient manner.

3 in total

1 in total

Review 1. How to identify "Material basis-Quality markers" more accurately in Chinese herbal medicines from modern chromatography-mass spectrometry data-sets: Opportunities and challenges of chemometric tools.

Authors: Min He; Yu Zhou
Journal: Chin Herb Med Date: 2020-08-06