Alberto Bertoni, Giorgio Valentini.
Abstract
BACKGROUND: Cluster analysis has been widely applied to investigate structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough "a priori" biological knowledge to evaluate both the number of clusters and their validity. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters. Despite their successful application to the analysis of complex bio-molecular data, assessing the statistical significance of the discovered clustering solutions and detecting multiple structures simultaneously present in high-dimensional bio-molecular data remain major problems.
Year: 2007 PMID: 17493256 PMCID: PMC1892076 DOI: 10.1186/1471-2105-8-S2-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A two-level hierarchical structure with 2 and 6 clusters is revealed by principal components analysis (data projected onto the two components with highest variance).
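The projection mentioned in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' code: the function name `project_onto_top_components` and the toy data (six tight groups arranged in two well-separated super-groups) are hypothetical and chosen only to mimic a two-level structure with 2 and 6 clusters.

```python
import numpy as np

def project_onto_top_components(data, n_components=2):
    """Project the rows of `data` onto the principal components with highest variance."""
    centered = data - data.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value, hence by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained_variance = singular_values[:n_components] ** 2 / (len(data) - 1)
    return projected, explained_variance

# Hypothetical toy data (an assumption, not the paper's sample1 data set):
# 2 super-groups along axis 0, each split into 3 sub-groups along axis 1.
rng = np.random.default_rng(0)
centers = np.array(
    [[side * 10.0, offset, 0.0, 0.0, 0.0]
     for side in (-1, 1)
     for offset in (-3.0, 0.0, 3.0)]
)
points = np.vstack([c + 0.3 * rng.standard_normal((20, 5)) for c in centers])

projected, variance = project_onto_top_components(points)
print(projected.shape)  # (120, 2)
```

Plotting `projected[:, 0]` against `projected[:, 1]` for data of this kind would be expected to show two large groups, each splitting into three smaller ones.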
Sample1: similarity indices. Similarity indices for the synthetic sample1 data set for different k-clusterings, sorted with respect to their mean values.
| k | Mean | Variance | p-value |
| 2 | 1.0000 | 0.0000 | 1.0000 |
| 6 | 1.0000 | 0.0000 | 1.0000 |
| 7 | 0.9217 | 0.0016 | 0.0000 |
| 8 | 0.8711 | 0.0033 | 0.0000 |
| 9 | 0.8132 | 0.0042 | 0.0000 |
| 5 | 0.8090 | 0.0104 | 0.0000 |
| 3 | 0.8072 | 0.0157 | 0.0000 |
| 10 | 0.7715 | 0.0056 | 0.0000 |
| 4 | 0.7642 | 0.0158 | 0.0000 |
Figure 2. Histograms of the similarity measure distributions for different numbers of clusters.
Leukemia data set. Stability indices for different k-clusterings sorted with respect to their mean values.
| k | Mean | Variance | p-value |
| 2 | 0.8285 | 0.0077 | 1.0000 |
| 3 | 0.8060 | 0.0124 | 0.7328 |
| 4 | 0.6589 | 0.0060 | 2.3279e-06 |
| 5 | 0.6012 | 0.0073 | 9.5199e-11 |
| 6 | 0.5424 | 0.0057 | 6.3282e-15 |
| 7 | 0.5160 | 0.0062 | 0.0000 |
| 8 | 0.4865 | 0.0050 | 0.0000 |
| 9 | 0.4819 | 0.0060 | 0.0000 |
| 10 | 0.4744 | 0.0049 | 0.0000 |
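As a rough illustration of how a table like this can be read, the sketch below ranks the k-clusterings by mean stability and keeps those whose last-column value stays above a significance level. The column meanings (k, mean, variance, p-value) and the level `alpha = 0.01` are assumptions made for this example, and `significant_k` is a hypothetical helper, not part of the published method.

```python
# Rows copied from the Leukemia table above.
# Assumed column meanings: (k, mean, variance, p-value).
leukemia_rows = [
    (2, 0.8285, 0.0077, 1.0000),
    (3, 0.8060, 0.0124, 0.7328),
    (4, 0.6589, 0.0060, 2.3279e-06),
    (5, 0.6012, 0.0073, 9.5199e-11),
    (6, 0.5424, 0.0057, 6.3282e-15),
    (7, 0.5160, 0.0062, 0.0000),
    (8, 0.4865, 0.0050, 0.0000),
    (9, 0.4819, 0.0060, 0.0000),
    (10, 0.4744, 0.0049, 0.0000),
]

def significant_k(rows, alpha=0.01):
    """Return the k values whose p-value is at least `alpha`, by decreasing mean."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [k for k, _mean, _variance, p_value in ranked if p_value >= alpha]

print(significant_k(leukemia_rows))  # [2, 3]
```

Under these assumptions, only k = 2 and k = 3 are retained; every other clustering has a p-value many orders of magnitude below the chosen level.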
Lymphoma data set. Stability indices for different k-clusterings sorted with respect to their mean values.
| k | Mean | Variance | p-value |
| 2 | 0.9566 | 0.0028 | 1.0000 |
| 3 | 0.7900 | 0.0149 | 0.0000 |
| 4 | 0.6963 | 0.0128 | 0.0000 |
| 5 | 0.6387 | 0.0075 | 0.0000 |
| 6 | 0.6135 | 0.0082 | 0.0000 |
| 7 | 0.6129 | 0.0079 | 0.0000 |
| 9 | 0.5864 | 0.0063 | 0.0000 |
| 8 | 0.5792 | 0.0079 | 0.0000 |
| 10 | 0.5744 | 0.0058 | 0.0000 |
Figure 3. Leukemia data set: empirical cumulative distribution functions of the similarity measures for different numbers of clusters k.
Figure 4. Lymphoma data set: empirical cumulative distribution functions of the similarity measures for different numbers of clusters k.
Results comparison. Comparison between different methods for model order selection in gene expression data analysis.
| Data set | Class. risk (Lange et al. 2004) | Gap statistic (Tibshirani et al. 2001) | Clest (Dudoit and Fridlyand 2002) | Figure of Merit (Levine and Domany 2001) | Model Explorer (Ben-Hur et al. 2002) | MOSRAM | "True" number k |
| Leukemia | k = 3 | k = 10 | k = 3 | k = 2, 8, 19 | k = 2 | k = 2, 3 | k = 2, 3 |
| Lymphoma | k = 2 | k = 4 | k = 2 | k = 2, 9 | k = 2 | k = 2 | k = 2, (3) |