Raffaele Giancarlo, Filippo Utro.
Abstract
BACKGROUND: Inferring the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature in that area, identifying an internal validation measure that is both fast and precise has proved elusive. To partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology that provides a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up.
Year: 2011 PMID: 21235792 PMCID: PMC3035181 DOI: 10.1186/1748-7188-6-1
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
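The resampling idea behind the consensus matrix mentioned in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation and not the FC speed-up: it assumes that H is the number of resamplings and p the fraction of items drawn in each one, and it uses a hypothetical toy clusterer (`gap_clusters`, which cuts sorted one-dimensional values at the widest gaps) in place of the algorithms benchmarked in the paper. All function and parameter names are invented for the example.

```python
import random
from itertools import combinations


def gap_clusters(values, k):
    """Toy 1-D clusterer (an assumption for this sketch): cut the sorted
    values at the k-1 widest gaps and return one label per input value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    widest = sorted(
        range(1, len(order)),
        key=lambda r: values[order[r]] - values[order[r - 1]],
        reverse=True,
    )[: k - 1]
    cuts = set(widest)
    labels = [0] * len(values)
    current = 0
    for rank, idx in enumerate(order):
        if rank in cuts:
            current += 1
        labels[idx] = current
    return labels


def consensus_matrix(values, k, cluster_fn=gap_clusters, H=250, p=0.8, seed=0):
    """Fraction of resamplings in which two items share a cluster,
    counted only over resamplings that contain both items."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = min(n, max(k, round(p * n)))
    for _ in range(H):
        subset = sorted(rng.sample(range(n), size))
        labels = cluster_fn([values[i] for i in subset], k)
        for (a, ia), (b, ib) in combinations(enumerate(subset), 2):
            sampled[ia][ib] += 1
            sampled[ib][ia] += 1
            if labels[a] == labels[b]:
                together[ia][ib] += 1
                together[ib][ia] += 1
    return [
        [
            1.0 if i == j
            else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
```

In this sketch the entries measure how often two items are clustered together, so one minus an entry could serve as a dissimilarity for a downstream clustering algorithm.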
Results for Consensus with H = 250 and p = 80% on the Benchmark 1 datasets
| Precision | Timing | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Hier-A | ⑦ | ➌ | ➑ | ➌ | ➎ | - | 8.9 × 10⁵ | 1.4 × 10⁶ | 5.0 × 10⁷ | - |
| Hier-C | ➏ | ④ | ➑ | ⑥ | - | 8.1 × 10⁵ | 1.3 × 10⁶ | 4.8 × 10⁷ | - | |
| Hier-S | 2 | ➌ | ② | 10 | - | 4.3 × 10⁵ | 1.0 × 10⁵ | 4.8 × 10⁷ | - | |
| K-means-R | ➏ | ④ | ⑦ | ④ | ⑥ | - | 5.6 × 10⁵ | 1.2 × 10⁶ | 2.7 × 10⁷ | - |
| K-means-A | ⑦ | ➌ | ➑ | ➌ | ⑥ | - | 1.0 × 10⁶ | 1.8 × 10⁶ | 5.6 × 10⁷ | - |
| K-means-C | ➏ | ➌ | ➑ | ④ | ⑥ | - | 9.8 × 10⁵ | 1.7 × 10⁶ | 5.3 × 10⁷ | - |
| K-means-S | ⑦ | ⑨ | ② | ⑥ | - | 1.2 × 10⁶ | 1.2 × 10⁶ | 5.7 × 10⁷ | - | |
| NMF-R | ➏ | ④ | ⑦ | ④ | - | - | 1.1 × 10⁸ | 6.4 × 10⁷ | - | - |
| NMF-A | ⑦ | ➌ | 2 | ➌ | - | - | 3.0 × 10⁷ | 1.3 × 10⁷ | - | - |
| NMF-C | 5 | ④ | ⑦ | ④ | - | - | 3.0 × 10⁷ | 1.3 × 10⁷ | - | - |
| NMF-S | 2 | 8 | ⑨ | ② | - | - | 3.6 × 10⁷ | 1.3 × 10⁷ | - | - |
| - | - | - | - | |||||||
A summary of the results for Consensus with H = 250 and p = 80%, for all algorithms, on the Benchmark 1 datasets. Each cell displays either a precision result, i.e., the number of clusters predicted by a measure for a dataset, or a timing result, i.e., the execution time needed to obtain that prediction. For cells displaying precision, a number in a circle with a black background indicates a prediction in agreement with the number of classes in the dataset; a number in a circle with a white background indicates a prediction that differs, in absolute value, by 1 from the number of classes; a number in a square indicates a prediction that differs, in absolute value, by 2; and a number in neither a circle nor a square indicates any other prediction. When two very close predictions for k* are obtained, both are reported, separated by a dash. An entry containing only a dash indicates either that the experiment was stopped because of its high computational demand or that the method gave no useful indication. For cells displaying timing, numeric values report the time in milliseconds, while a dash indicates that the timing is not available for at least one of the following reasons: (a) the experiment was performed on a computer other than the AMD Athlon; (b) the experiment was stopped because of its high computational demand; (c) a smaller range of clustering solutions was produced for that dataset, due to its size, i.e., Leukemia with p = 66%. For this particular set of experiments, we do not report the timing results for Leukemia and Lymphoma because they are redundant.
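The precision notation in this legend amounts to bucketing the absolute difference between a predicted and a true number of classes. A minimal sketch of that rule follows; the function name and the returned strings are assumptions made for illustration, and the only concrete value used is k* = 3 for Leukemia, as stated in the Figure 1 caption.

```python
def legend_symbol(prediction, classes):
    """Map a predicted number of clusters to the legend's notation,
    based on its absolute distance from the true number of classes."""
    diff = abs(prediction - classes)
    if diff == 0:
        return "black circle"   # exact agreement
    if diff == 1:
        return "white circle"   # off by one
    if diff == 2:
        return "square"         # off by two
    return "plain number"       # any other prediction


# Hypothetical predictions checked against Leukemia's k* = 3.
for predicted in (3, 4, 5, 8):
    print(predicted, legend_symbol(predicted, classes=3))
```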
Results for Consensus with H = 250 and p = 80% on the Benchmark 2 datasets
| Precision | Timing | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Hier-A | ⑤ - | ➏ | 10 | ➌ | ➎ | ⑤ | 1.0 × 10⁷ | 3.7 × 10⁷ | 9.5 × 10⁶ |
| Hier-C | ➍ - ⑤ | ⑤ - ➏ | 10 | ➌ | ➎ | ⑤ | 1.0 × 10⁷ | 3.7 × 10⁷ | 9.2 × 10⁶ |
| Hier-S | ⑤ | 2 | 10 | ② | 2 | ⑦ | 9.8 × 10⁶ | 3.7 × 10⁷ | 9.4 × 10⁶ |
| K-means-R | ⑤ | ➏ | 10 | ➌ | ➎ | ➏ | 1.8 × 10⁷ | 1.5 × 10⁷ | 6.3 × 10⁶ |
| K-means-A | ⑤ - | ➏ | 8 | ➌ | ➎ | ⑤ | 1.4 × 10⁷ | 6.8 × 10⁷ | 1.1 × 10⁷ |
| K-means-C | ➍ - ⑤ | ⑤ - ➏ | 10 | ➌ | ➎ | ⑤ | 1.5 × 10⁷ | 6.8 × 10⁷ | 1.0 × 10⁷ |
| K-means-S | ⑤ | ➏ | 10 | ② | ➎ | ➏ | 1.6 × 10⁷ | 6.8 × 10⁷ | 1.1 × 10⁷ |
| - | - | - | |||||||
A summary of the results for Consensus with H = 250 and p = 80%, on all algorithms, except NMF, and for the datasets in Benchmark 2. The table legend is as in Table 1. NMF has been excluded since each experiment was terminated due to its high computational demand. The timing results for the artificial datasets are not reported since the experiments have been performed on a computer other than the AMD Athlon.
Results for FC with H = 250 and p = 80% on the Benchmark 1 datasets
| Precision | Timing | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Hier-A | ⑦ | ➌ | ➑ | ➌ | ➎ | 2 | 4.7 × 10⁴ | 5.2 × 10⁴ | 1.4 × 10⁶ | 3.7 × 10⁷ |
| Hier-C | ➏ | ④ | ➑ | ⑥ | 14 - ⑰ | 4.4 × 10⁴ | 6.4 × 10⁴ | 1.4 × 10⁶ | 3.7 × 10⁷ | |
| Hier-S | 2 | 8 | ➑ | ② | 10 | 2 | 5.3 × 10⁴ | 5.2 × 10⁴ | 1.4 × 10⁶ | 3.0 × 10⁷ |
| K-means-R | ➏ | ④ | ⑦ | ④ | ⑥ | 3.7 × 10⁵ | 1.2 × 10⁶ | 1.6 × 10⁷ | 1.6 × 10⁸ | |
| K-means-A | ⑦ | ➌ | ➑ | ➌ | ⑥ | 12 | 3.1 × 10⁵ | 9.3 × 10⁵ | 1.8 × 10⁷ | 2.1 × 10⁸ |
| K-means-C | ➏ | ④ | ➑ | ④ | ⑥ | 12 | 2.5 × 10⁵ | 6.5 × 10⁵ | 1.4 × 10⁷ | 2.0 × 10⁸ |
| K-means-S | ➏ | 7 | ⑨ | ② | ⑥ | 2 | 3.7 × 10⁵ | 6.9 × 10⁵ | 1.9 × 10⁷ | 2.4 × 10⁸ |
| NMF-R | ➏ | ④ | ⑦ | ④ | - | - | 1.1 × 10⁸ | 6.3 × 10⁷ | - | - |
| NMF-A | ⑦ | ➌ | ⑦ | ➌ | - | - | 3.0 × 10⁷ | 1.2 × 10⁷ | - | - |
| NMF-C | ➏ | ➌ | ➑ | ④ | - | - | 2.9 × 10⁷ | 1.2 × 10⁷ | - | - |
| NMF-S | 2 | 8 | ⑨ | ② | - | - | 3.5 × 10⁷ | 1.2 × 10⁷ | - | - |
| - | - | - | - | |||||||
A summary of the results for FC with H = 250 and p = 80%, on all algorithms and on the Benchmark 1 datasets. The table legend is as in Table 1.
Results for FC with H = 250 and p = 80% on the Benchmark 2 datasets
| Precision | Timing | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Hier-A | ⑤ - | ➏ | 10 | ➌ | ➎ | ⑤ | 4.0 × 10⁵ | 1.6 × 10⁶ | 3.4 × 10⁵ |
| Hier-C | ➍ - ⑤ | ⑤ - ➏ | 10 | ➌ | ➎ | ⑤ | 3.9 × 10⁵ | 1.4 × 10⁶ | 3.3 × 10⁵ |
| Hier-S | ⑤ | 2 | 10 | ② | 2 | ⑦ | 4.4 × 10⁵ | 1.5 × 10⁶ | 3.4 × 10⁵ |
| K-means-R | ⑤ | ➏ | 10 | ➌ | ➎ | ➏ | 1.4 × 10⁷ | 5.9 × 10⁶ | 2.0 × 10⁶ |
| K-means-A | ⑤ - | ➏ | 8 | ➌ | ➎ | ⑤ | 5.5 × 10⁶ | 3.2 × 10⁷ | 5.4 × 10⁶ |
| K-means-C | ➍ - ⑤ | ⑤ - ➏ | 10 | ➌ | ➎ | ⑤ | 6.5 × 10⁶ | 3.2 × 10⁷ | 2.1 × 10⁶ |
| K-means-S | ⑤ | ➏ | 10 | ② | ➎ | ➏ | 7.8 × 10⁶ | 4.9 × 10⁷ | 2.1 × 10⁶ |
| - | - | - | |||||||
A summary of the results for FC with H = 250 and p = 80%, on all algorithms, except NMF, and for the datasets in Benchmark 2. The table legend is as in Table 1. NMF has been excluded since each experiment was terminated due to its high computational demand. The timing results for the artificial datasets are not reported since the experiments have been performed on a computer other than the AMD Athlon.
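To make the speed-up concrete, the timings in the two Benchmark 2 tables (Consensus and FC, both with H = 250 and p = 80%) can be divided cell by cell. The sketch below copies the millisecond values from those tables. It assumes that the three timing columns refer to the same datasets, in the same order, in both tables; the column headers did not survive in this copy, so that alignment is an assumption, and the helper name `speedups` is hypothetical.

```python
# Timings in milliseconds, copied from the Benchmark 2 tables above.
# ASSUMPTION: timing columns are aligned between the two tables.
consensus_ms = {
    "Hier-A": (1.0e7, 3.7e7, 9.5e6),
    "Hier-C": (1.0e7, 3.7e7, 9.2e6),
    "Hier-S": (9.8e6, 3.7e7, 9.4e6),
    "K-means-R": (1.8e7, 1.5e7, 6.3e6),
    "K-means-A": (1.4e7, 6.8e7, 1.1e7),
    "K-means-C": (1.5e7, 6.8e7, 1.0e7),
    "K-means-S": (1.6e7, 6.8e7, 1.1e7),
}
fc_ms = {
    "Hier-A": (4.0e5, 1.6e6, 3.4e5),
    "Hier-C": (3.9e5, 1.4e6, 3.3e5),
    "Hier-S": (4.4e5, 1.5e6, 3.4e5),
    "K-means-R": (1.4e7, 5.9e6, 2.0e6),
    "K-means-A": (5.5e6, 3.2e7, 5.4e6),
    "K-means-C": (6.5e6, 3.2e7, 2.1e6),
    "K-means-S": (7.8e6, 4.9e7, 2.1e6),
}


def speedups(slow, fast):
    """Per-algorithm, per-column ratio of the slow timing to the fast one."""
    return {
        alg: tuple(s / f for s, f in zip(slow[alg], fast[alg]))
        for alg in slow
    }


for alg, ratios in speedups(consensus_ms, fc_ms).items():
    print(alg, [round(r, 1) for r in ratios])
```

Under that alignment assumption, the hierarchical algorithms show ratios of roughly 20 to 30 (about 23 to 28 for Hier-A), while the K-means variants show much smaller ratios (about 1.3 to 3.2 for K-means-R).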
Summary of results for the fastest measures on the Benchmark 1 datasets
| Precision | Timing | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| ⑤ | ➌ | ➑ | 8 | ④ | 1.7 × 10³ | 1.3 × 10³ | 5.0 × 10³ | 4.0 × 10³ | |
| ⑤ | ④ | ➑ | ➌ | ④ | 1.2 × 10³ | 8.0 × 10² | 4.1 × 10³ | 3.0 × 10³ | |
| ⑦ | ➌ | 4 | ④ | ⑥ | 2.4 × 10³ | 2.0 × 10³ | 8.3 × 10⁴ | 8.4 × 10³ | |
| ⑤ | ④ | 2 | ② | ④ | 1.2 × 10³ | 8.0 × 10² | 4.5 × 10⁴ | 3.2 × 10³ | |
| ⑦ | 8 | ➑ | ④ | ④ | 1.9 × 10⁴ | 9.4 × 10⁴ | 5.5 × 10⁵ | 2.6 × 10⁵ | |
| ➏ | ➌ | ➑ | 8 | ④ | 2.9 × 10⁴ | 1.0 × 10⁵ | 7.1 × 10⁵ | 3.6 × 10⁵ | |
| ➏ | ➌ | ⑦ | ➎ | 3.9 × 10³ | 3.7 × 10⁴ | 2.1 × 10⁵ | 7.6 × 10⁴ | ||
| ⑦ | ➌ | ⑦ | 6 | ⑥ | 1.6 × 10³ | 7.5 × 10³ | 5.1 × 10⁴ | 1.8 × 10⁴ | |
| ⑦ | ➌ | ⑦ | ④ | 1.9 × 10⁴ | 9.4 × 10⁴ | 5.5 × 10⁵ | 2.6 × 10⁵ | ||
| ⑦ | ➌ | ➑ | ➌ | ➎ | 5.9 × 10⁴ | 2.7 × 10⁴ | 7.0 × 10⁴ | 6.8 × 10⁴ | |
| ➏ | ④ | ➑ | ⑥ | 5.9 × 10⁴ | 2.7 × 10⁴ | 6.5 × 10⁴ | 6.7 × 10⁴ | ||
| - | - | - | - | ||||||
A summary of the best performing measures taken from the benchmarking of Giancarlo et al., with the addition of FC, with H = 250 and p = 80%. The table legend is as in Table 1. Consistent with that study, we report only the timing results for CNS Rat, Leukemia, NCI60 and Lymphoma, since for the Yeast and PBM datasets the experiments have been performed on a computer other than the AMD Athlon.
Summary of results for the fastest measures on the Benchmark 2 datasets
| Precision | ||||||
|---|---|---|---|---|---|---|
| ⑤ | ➏ | 9 | ➌ | ⑦ | ||
| ⑤ | ➏ | 9 | - | ➌ | ⑦ | |
| ⑦ | 7 | ➌ | 8 | 3 | ||
| ⑤ | ⑦ | 7 | ➌ | 3 | ||
| ➍ | ⑤ | 6 | ➌ | - | ⑤ | |
| ➍ | ⑦ | 4 | - | - | ||
| 7 | - | 10 | ➌ | - | ⑦ | |
| ➍ | 8 | 6 | ➌ | - | ⑤ | |
| ⑦ | 3 | 7 | ➌ | 29 | 3 | |
| 5 - | ➏ | 10 | ➌ | ➎ | ⑤ | |
| ➍ - ⑤ | ⑤ - ➏ | 10 | ➌ | ➎ | ⑤ | |
A summary of the best performing measures taken from the benchmarking of Giancarlo et al., with the addition of FC, with H = 250 and p = 80%. The table legend is as in Table 1. The timing results are not reported since the experiments have been performed on a computer other than the AMD Athlon.
Figure 1. An example of predicting the number of clusters with Consensus and FC. The experiment uses the Leukemia dataset as input and the K-means-A clustering algorithm. (i) The plot of the CDF curves as a function of k, obtained by Consensus with H = 250 and p = 80%. For clarity, only the curves for k in [2, 13] are shown. The area under the CDF increases with k, and the growth rate of the area flattens for k ≥ k* = 3. (ii) The plot of the corresponding Δ curve for k in [2, 30], where the flattening effect indicating k* is evident for k ≥ k* = 3. (iii) The plot of the CDF curves, obtained by FC with H = 250 and p = 80%, in analogy with (i). (iv) The plot of the Δ curve, obtained by FC with H = 250 and p = 80%, in analogy with (ii).
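The quantities plotted in Figure 1 can be illustrated with a short sketch. The caption does not spell out the formulas, so two assumptions are made here: the area A(k) is computed as the exact area under the empirical step CDF of the off-diagonal consensus-matrix entries on [0, 1], and Δ(k) is taken as the relative change in area between consecutive values of k, with Δ(2) = A(2). Published descriptions and implementations of Consensus Clustering differ in the discrete sum and indexing they use, so this is one plausible variant rather than the authors' exact definition. The function names and the example areas are hypothetical.

```python
def cdf_area(M):
    """Area under the empirical CDF of the entries above the diagonal
    of a symmetric matrix M whose entries lie in [0, 1]."""
    n = len(M)
    xs = sorted(M[i][j] for i in range(n) for j in range(i + 1, n))
    m = len(xs)
    area, prev = 0.0, 0.0
    for count, x in enumerate(xs):
        # Between prev and x the CDF equals count / m
        # (tied entries contribute a zero-width interval).
        area += (x - prev) * count / m
        prev = x
    area += 1.0 - prev  # the CDF equals 1 from the largest entry up to 1
    return area


def delta_curve(areas):
    """Relative change of the area from k-1 to k, with delta(2) = A(2).
    ASSUMED variant; `areas` maps each k >= 2 to its area A(k)."""
    ks = sorted(areas)
    delta = {ks[0]: areas[ks[0]]}
    for prev_k, k in zip(ks, ks[1:]):
        delta[k] = (areas[k] - areas[prev_k]) / areas[prev_k]
    return delta


# Hypothetical areas whose growth flattens after k = 3.
example_areas = {2: 0.40, 3: 0.60, 4: 0.63, 5: 0.64}
for k, d in delta_curve(example_areas).items():
    print(k, round(d, 3))
```

In this made-up example the Δ values drop sharply after k = 3, which is the kind of flattening the caption describes as indicating k*.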