Ashok Sharma, Robert Podolsky, Jieping Zhao, Richard A McIndoe.
Abstract
MOTIVATION: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. Algorithms are needed that are fast and can cluster extremely large datasets without degrading cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, and the second stage is a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results of the two-stage hyperplane algorithm with those of the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1).
AVAILABILITY: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2009 PMID: 19261720 PMCID: PMC2672630 DOI: 10.1093/bioinformatics/btp123
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
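The two-stage idea described in the abstract (a single-scan data reduction followed by conventional k-means) can be illustrated with a minimal Python sketch. This is not the HPCluster implementation: the first stage here is a simplified threshold-based pre-clustering standing in for BIRCH (it keeps only a count and a running sum per micro-cluster, but uses a flat list rather than a CF-tree), and every function name and parameter (`single_scan_reduce`, `weighted_kmeans`, `two_stage_cluster`, `threshold`) is hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_scan_reduce(rows, threshold):
    """Stage 1 (simplified stand-in for BIRCH): one pass over the data.

    A row joins the nearest micro-cluster whose centroid lies within
    `threshold`; otherwise it starts a new micro-cluster. Only a count
    and a running sum are kept per micro-cluster.
    """
    counts, sums, labels = [], [], []
    for row in rows:
        best, best_d = -1, threshold
        for i, (n, s) in enumerate(zip(counts, sums)):
            d = dist(row, [v / n for v in s])
            if d <= best_d:
                best, best_d = i, d
        if best < 0:
            counts.append(1)
            sums.append(list(row))
            best = len(counts) - 1
        else:
            counts[best] += 1
            sums[best] = [a + b for a, b in zip(sums[best], row)]
        labels.append(best)
    centroids = [[v / n for v in s] for n, s in zip(counts, sums)]
    return centroids, counts, labels


def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Stage 2: k-means on the micro-cluster centroids, weighted by size."""
    if k > len(points):
        raise ValueError("k exceeds the number of micro-clusters")
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(k):
            members = [(p, w) for p, w, a in zip(points, weights, assign)
                       if a == c]
            if members:
                total = sum(w for p, w in members)
                centers[c] = [sum(p[j] * w for p, w in members) / total
                              for j in range(dim)]
    return assign, centers


def two_stage_cluster(rows, threshold, k, seed=0):
    """Reduce the rows in one scan, then cluster the reduced set."""
    centroids, counts, micro = single_scan_reduce(rows, threshold)
    assign, centers = weighted_kmeans(centroids, counts, k, seed=seed)
    # Each original row inherits the label of its micro-cluster.
    return [assign[m] for m in micro], centers
```

The memory saving comes from the second stage operating on the micro-cluster centroids rather than on every gene; how aggressively the data are reduced depends on the assumed `threshold`.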
Fig. 1. Example distance plot using a random 10% of the experimental microarray dataset to calculate Dmax.
Comparison of time to complete analysis of various clustering algorithms using the centered data
|  | Simulated dataset |  |  | Experimental dataset |  |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genes | 10 000 |  |  | 10 000 |  |  | 22 283 |  |  | 44 760 |  |  |
| Clusters | 4 | 10 | 20 | 4 | 10 | 20 | 4 | 10 | 20 | 4 | 10 | 20 |
| HPC-DEN | ||||||||||||
| HPC-RIA | 0.62±0.06 | 0.92±0.09 | 0.71±0.09 | 0.58±0.10 | 0.92±0.13 | 1.27±0.14 | 2.41±0.30 | 4.22±0.44 | ||||
| R-KM | 0.56±0.07 | 1.61±0.09 | 0.86±0.08 | 2.94±0.10 | 0.59±0.04 | 2.00±0.09 | 9.02±0.49 | 4.72±0.19 | 22.68±1.39 | |||
| R-SOM | 0.80±0.01 | 0.83±0.01 | 0.93±0.03 | |||||||||
| Cluster v2 | 0.62±0.12 | 1.95±0.45 | 1.48±0.08 | 5.09±0.49 | 11.80±1.20 | 4.72±0.61 | 25.94±1.32 | 45.25±8.81 | 9.02±2.09 | 47.75±9.67 | 118.33±6.51 | |
| Cluster v3 | 1.12±0.08 | 3.06±0.18 | 8.94±0.50 | 3.70±0.42 | 9.68±0.54 | 19.28±1.61 | 8.73±0.28 | 26.90±1.88 | 57.50±4.95 | 11.70±0.56 | 51.50±2.89 | 162.33±10.60 |
HPC-DEN, HPCluster with density from data; HPC-RIA, HPCluster with random initial assignment; R-KM, R k-means; R-SOM, R self-organizing maps; Cluster v2 and Cluster v3, k-means algorithm.
All values for the programs are the average time to completion in minutes and the standard deviation.
N = 12 for each. Bold items indicate significantly lower times within each column.
Fig. 2. Accuracy of clustering algorithms using the simulated dataset. Cluster assignments for each algorithm were recorded 12 times each using both centered and uncentered data. The average ARI was calculated for each algorithm.
Fig. 3. Stability of the clustering algorithms using experimental data (22 283 genes) and searching for 10 clusters. Cluster assignments for each algorithm were recorded 12 times each. The ARI was calculated for each pairwise comparison among the 12 results from each algorithm. The median for each algorithm is shown as a horizontal bar. C = centered data, U = uncentered data.
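The accuracy and stability measures in Figs. 2 and 3 rest on the adjusted Rand index (ARI). As a rough illustration, the sketch below computes the standard ARI between two label vectors and then the pairwise ARIs across repeated runs, as the Fig. 3 caption describes. The function names are hypothetical, and the authors' own computation may differ in detail.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(labels_a, labels_b):
    """Standard ARI between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Contingency counts: items sharing a cluster in both partitions.
    sum_ij = sum(pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(pairs(c) for c in Counter(labels_a).values())
    sum_b = sum(pairs(c) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / pairs(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Both partitions are trivial (one cluster or all singletons).
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def pairwise_ari(runs):
    """ARI for every pair of runs, plus the median of those values."""
    scores = [adjusted_rand_index(a, b) for a, b in combinations(runs, 2)]
    return scores, median(scores)
```

With 12 runs per algorithm, `pairwise_ari` yields 66 pairwise values. For accuracy against a known simulated assignment, `adjusted_rand_index(true_labels, predicted_labels)` can be averaged over the runs instead.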
Fig. 4. Accuracy of the clustering algorithms over a range of cluster sizes. The centered simulated dataset (four clusters, 1000 genes, 100 arrays) was analyzed by each program using varying numbers of clusters, and the resulting ARI for each analysis was plotted.