Piotr Kraj, Ashok Sharma, Nikhil Garge, Robert Podolsky, Richard A McIndoe.
Abstract
BACKGROUND: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is to perform cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the size of the dataset being analyzed becomes very large. For example, clustering 10000 genes from an experiment containing 200 microarrays can be quite time consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers.
Year: 2008 PMID: 18416829 PMCID: PMC2375128 DOI: 10.1186/1471-2105-9-200
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
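To make the abstract's idea of distributing a clustering algorithm concrete, here is a minimal sketch of k-means in which the data are split into chunks and each chunk is processed independently before a coordinator merges the results. It is an illustration only, not the ParaKMeans implementation: the function names (`partial_stats`, `kmeans_partitioned`), the chunking scheme, and the choice of initial centroids sampled from the data are all assumptions, and the "nodes" are simulated in a single process.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def partial_stats(chunk, centroids):
    """Work for one hypothetical compute node: per-cluster sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        j = nearest(point, centroids)
        counts[j] += 1
        for d, value in enumerate(point):
            sums[j][d] += value
    return sums, counts


def kmeans_partitioned(data, k, n_nodes, max_iter=100, seed=0):
    """k-means where each iteration's assignment step is split across chunks."""
    rng = random.Random(seed)
    # Assumption: start from k points drawn at random from the data.
    centroids = [list(p) for p in rng.sample(data, k)]
    dim = len(centroids[0])
    # Assumption: round-robin split; each chunk stands in for one node's share.
    chunks = [data[i::n_nodes] for i in range(n_nodes)]
    for _ in range(max_iter):
        # In a real deployment each call would run on a separate machine.
        results = [partial_stats(chunk, centroids) for chunk in chunks]
        new_centroids = []
        for j in range(k):
            total = sum(counts[j] for _, counts in results)
            if total == 0:
                new_centroids.append(centroids[j])  # keep an empty cluster as-is
                continue
            new_centroids.append(
                [sum(sums[j][d] for sums, _ in results) / total for d in range(dim)]
            )
        if new_centroids == centroids:  # assignments no longer change
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels
```

Only the per-cluster sums and counts travel back to the coordinator, so the amount of data exchanged per iteration depends on the number of clusters and dimensions rather than on the number of genes. Calling `kmeans_partitioned(data, k=2, n_nodes=3)` on a list of equal-length numeric tuples returns the final centroids and one cluster label per row.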
Figure 1. ParaKMeans software components and deployment strategy. ParaKMeans has three software components: 1) the graphical user interface (GUI); 2) the application programming interface (API); and 3) the ParallelCluster web service. We provide the GUI in two forms, a Windows GUI and a web GUI, with the API compiled into each. The GUI is installed on the local machine, while the ParallelCluster web service is installed on one or more laboratory computers. Both the GUI and the web service are installed by double-clicking the .msi installation file and following the installation wizard's instructions.
Figure 2. ParaKMeans user interfaces. A) Screen capture of the Windows stand-alone ParaKMeans application. The interface provides information on the data and program options being used. B) Home page of the web-based ParaKMeans application, providing an overview of the current data and program options being used.
Figure 3. Detected interactions that affect the time of execution using ParaKMeans. The column charts plot the speedup (fold increase) relative to a single node configuration versus the number of compute nodes used in the analysis. The bar graphs at the bottom of each plot illustrate the number of compute nodes where one finds statistically significant increases in speed. The p values presented are for tests of differences between the number of compute nodes for a given number of genes or clusters. A) The effect of the interaction between the number of genes and number of compute nodes on the speed of execution. B) The effect of the interaction between the number of clusters and number of compute nodes on the speed of execution.
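The speedup plotted in Figure 3 is a fold increase relative to the single-node run. A minimal sketch of that calculation follows; the function name `speedup_table` is hypothetical, the timings in the usage note are invented placeholders rather than results from the paper, and the per-node efficiency column is an extra that the caption does not mention.

```python
def speedup_table(seconds_by_nodes):
    """Map node count -> (speedup vs. the 1-node run, speedup per node).

    seconds_by_nodes: dict of {number_of_nodes: wall_clock_seconds}; it must
    contain an entry for 1 node, which serves as the baseline.
    """
    baseline = seconds_by_nodes[1]
    table = {}
    for nodes, seconds in sorted(seconds_by_nodes.items()):
        speedup = baseline / seconds      # fold increase over a single node
        efficiency = speedup / nodes      # 1.0 would be perfect linear scaling
        table[nodes] = (speedup, efficiency)
    return table
```

For example, `speedup_table({1: 100.0, 2: 50.0, 4: 40.0})` uses made-up timings to show how the speedup can flatten as nodes are added even though each extra node still helps.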
Figure 4. Time comparison of Cluster and ParaKMeans. ParaKMeans was run using a single-node/RIA configuration. The calculated p values are presented above the comparisons. A) 4-cluster data. B) 20-cluster data. White bars = Cluster; black bars = PKM.
Accuracy and stability results using different clustering programs, initialization schemes and number of genes/clusters.
| 0.405 (0.404–0.405) | 0.594 (0.569–0.604) | 0.487 (0.404–0.604) | ||
| 0.519 (0.453–0.597) | 0.896 (0.896–0.896) | 0.747 (0.453–0.896) | ||
| 0.519 (0.322–0.519) | 0.770 (0.586–0.896) | 0.553 (0.322–0.896) | ||
| 0.489 (0.489–0.489) | 0.770 (0.770–0.770) | 0.629 (0.629–0.629) | ||
| 0.163 (0.124–0.183) | 0.256 (0.196–0.297) | 0.190 (0.124–0.297) | ||
| 0.231 (0.211–0.461) | 0.216 (0.208–0.233) | 0.227 (0.208–0.461) | ||
| 0.189 (0.178–0.226) | 0.210 (0.202–0.252) | 0.202 (0.178–0.252) | ||
| 0.400 (0.400–0.400) | 0.691 (0.691–0.691) | 0.545 (0.400–0.691) | ||
| 0.439 (0.436–0.442) | 0.797 (0.783–0.812) | 0.618 (0.436–0.812) | ||
| 0.514 (0.435–1.00) | 1.00 (1.00–1.00) | 1.00 (0.435–1.00) | ||
| 0.569 (0.483–1.00) | 0.769 (0.586–0.770) | 0.711 (0.483–1.00) | ||
| 1.00 (1.00–1.00) | 1.00 (1.00–1.00) | 1.00 (1.00–1.00) | ||
| 0.347 (0.321–0.393) | 0.492 (0.418–0.907) | 0.405 (0.321–0.907) | ||
| 0.738 (0.444–0.888) | 0.904 (0.682–0.994) | 0.788 (0.444–0.994) | ||
| 0.594 (0.548–0.643) | 0.652 (0.573–0.668) | 0.634 (0.548–0.668) | ||
| 1.00 (1.00–1.00) | 1.00 (1.00–1.00) | 1.00 (1.00–1.00) |
* – Cluster k-means: Single Node (N = 12), All PKM: 1 and 7 node data (N = 12)
All values are the median adjusted Rand Index with the range of values in parentheses.
Accuracy = cluster results vs. known assignments; Stability = overall agreement between cluster results.
PKM = ParaKMeans, RFD = Random From Data, RIA = Random Initial Assignment, BKM = Bisecting K Means, Cluster = Eisen Cluster program.
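The table notes above summarize each condition by the median adjusted Rand Index (ARI). As a sketch of how such values could be computed, the code below implements the standard Hubert-Arabie ARI from a contingency count and then takes medians over repeated runs. Reading "stability" as the median ARI over all pairs of runs is an assumption made for illustration; the helper names `accuracy` and `stability` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import comb
from statistics import median


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of items grouped together in both, in A only, and in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)   # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (both - expected) / (maximum - expected)


def accuracy(runs, known):
    """Median ARI of each run's labels against the known assignments."""
    return median(adjusted_rand_index(run, known) for run in runs)


def stability(runs):
    """Median ARI over all pairs of runs (an assumed reading of 'stability')."""
    return median(adjusted_rand_index(a, b) for a, b in combinations(runs, 2))
```

Because the ARI compares which items are grouped together rather than the label values themselves, two runs that number the same clusters differently still score 1.0.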
Figure 5. Stability results for ParaKMeans and Cluster using real microarray data. ParaKMeans was run using 7 nodes with the initialization scheme indicated on the x-axis. Both ParaKMeans and Cluster were run using k = 4, 10 and 20 partitions. The median ARI for each analysis is shown using the horizontal line in each plot. Initialization schemes: BKM = Bisecting K Means, RIA = Random Initial Assignment, RFD = Random From Data.