Abstract
Clustering is an important data processing tool for interpreting microarray data and inferring genomic networks. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet process (HDP). HDP clustering introduces a hierarchical structure into the statistical model, which captures the hierarchical features prevalent in biological data such as gene expression data. We develop a Gibbs sampling algorithm for HDP clustering based on the Chinese restaurant metaphor. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces unnecessary clustering fragments.
Year: 2013 PMID: 23587447 PMCID: PMC3656798 DOI: 10.1186/1687-4153-2013-5
Source DB: PubMed Journal: EURASIP J Bioinform Syst Biol ISSN: 1687-4145
Figure 1. Illustration of gene hierarchical structures in microarray data.
Figure 2. Illustration of the Chinese restaurant metaphor. Tables of the same pattern are grouped into the same cluster.
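The Chinese restaurant metaphor can be made concrete with a small simulation. The sketch below draws seating arrangements from a plain, single-level Chinese restaurant process; it is not the paper's HDP Gibbs sampler, which additionally groups tables that share a pattern into clusters and conditions on the expression data. The function name `crp_seating` and the concentration parameter `alpha` are illustrative assumptions, not names taken from the article.

```python
import random


def crp_seating(n_customers, alpha, seed=0):
    """Simulate seating under a single-level Chinese restaurant process.

    Hypothetical helper for illustration only. Customer i (0-based) joins an
    existing table k with probability size_k / (i + alpha) and opens a new
    table with probability alpha / (i + alpha).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    table_sizes = []   # number of customers at each table
    assignments = []   # table index chosen by each customer
    for _ in range(n_customers):
        # Unnormalised weights: one per existing table, plus one for a new table.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    seats, sizes = crp_seating(n_customers=20, alpha=1.0, seed=1)
    print("table per customer:", seats)
    print("table sizes:", sizes)
```

Larger values of `alpha` tend to open more tables, which is why the number of clusters does not need to be fixed in advance in models built on this prior.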
Figure 3. The synthetic network structure.
Clustering performance of LDA, SVM, MCLUST, K-means, HC, BIMC, and HDP on the AD400 data
| LDA | 0.931 | 0.553 | 10.0 |
| SVM | 0.929 | 0.493 | 11 |
| MCLUST | 0.942 | 0.583 | 10 |
| K-means | 0.895 | 0.457 | 10 |
| HC | 0.916 | 0.348 | 9 |
| BIMC | 0.935 | 0.571 | 10.0 |
| HDP | 0.947 | 0.577 | 10.0 |
Figure 4. Plots of the HDP results in 20 experiments. (a) Plot of the number of clusters in 20 experiments. (b) Plot of the Rand index in 20 experiments.
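For readers unfamiliar with the Rand index, the following minimal sketch computes it from two label vectors using the standard pair-counting definition: the fraction of item pairs on which the two partitions agree (both place the pair together, or both place it apart). This record does not state which variant (plain or adjusted) each table column reports, so the function below is an illustration of the plain index only; the name `rand_index` is an assumption.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Plain (unadjusted) Rand index between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(n), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair counts as an agreement when both partitions treat it the same way.
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total


if __name__ == "__main__":
    # Identical partitions up to relabelling give an index of 1.0.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of items, which is acceptable for a few hundred genes; a contingency-table formulation would scale better for larger data sets.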
Clustering performance of LDA, MCLUST, SVM, and HDP on the yeast galactose data
| LDA | 0.942 | 6.3 |
| SVM | 0.954 | 5 |
| MCLUST | 0.903 | 9 |
| HDP | 0.973 | 3.8 |
Clustering performance of LDA, MCLUST, K-means, HC, BIMC, and HDP on the yeast sporulation data
| LDA | 0.586 | 6.2 |
| MCLUST | 0.577 | 6 |
| K-means | 0.324 | 8 |
| HC | 0.392 | 7 |
| BIMC | 0.592 | 6.1 |
| HDP | 0.673 | 6.0 |
Clustering performance of LDA, MCLUST, K-means, HC, BIMC, and HDP on the human fibroblasts serum data
| LDA | 0.298 | 9.4 |
| MCLUST | 0.382 | 6 |
| K-means | 0.324 | 7 |
| HC | 0.313 | 5 |
| BIMC | 0.418 | 7.3 |
| HDP | 0.452 | 6.4 |
Figure 5. Plots of all HDP clusters for yeast cell cycle data. (a) Plot of Cluster 1, containing 261 genes. (b) Plot of Cluster 2, containing 86 genes. (c) Plot of Cluster 3, containing 135 genes. (d) Plot of Cluster 4, containing 144 genes. (e) Plot of Cluster 5, containing 76 genes. (f) Plot of Cluster 6, containing 25 genes. (g) Plot of Cluster 7, containing 88 genes. (h) Plot of Cluster 8, containing 60 genes. (i) Plot of Cluster 9, containing 381 genes. (j) Plot of Cluster 10, containing 259 genes.
Numbers of newly discovered genes in various functional categories by the proposed HDP clustering algorithm
| Cell cycle and DNA processing | 20 |
| Protein synthesis | 25 |
| Protein fate | 4 |
| Cell fate | 12 |
| Transcription | 8 |
| Unclassified protein | 57 |
List of newly discovered genes in various functional categories
| Cell cycle and DNA processing | YBL051c YBR136w YBL016w YDR200c YBR274w YDR217c YLR314c YJL074c YJL095w YDR052c YDL126c YCL016c YDL188c YAL040c YEL019c YER122c YLR035c YLR055c YML032c YMR078c |
| Protein synthesis | YDR091c YGL103w YBR118w YBL057c YBR101c YBR181c YDL083c YDL184c YDR012w YDR172w YGL105w YGL129c YJL041w YJL125c YJR113c YLR185w YPL037c YPL048w YLR009w YHL001w YHL015w YHR011w YHR088w YDR450w YEL034w |
| Protein fate | YAL016w YBL009w YBR044c YDL040c |
| Cell fate | YAL040c YDL006w YDL134c YIL007c YJL187c YDL029w YDL035c YCR002c YBL105c YCR089w YER114c YEL023c |
| Transcription | YAL021c YBL022c YCL051w YDR146c YIL084c YJL127c YJL164c YJL006c |