Thomas J Glassen, Timo von Oertzen, Dmitry A Konovalov.
Abstract
BACKGROUND: Bayesian clustering algorithms, in particular those utilizing Dirichlet Processes (DP), return a sample of the posterior distribution of partitions of a set. In many applied cases, however, a single clustering solution is desired, so a 'best' partition must be derived from the posterior sample. Which solution should be recommended in which situation is an open research question. One candidate is the sample mean, defined as the clustering with minimal squared distance to all partitions in the posterior sample, weighted by their probability. In this article, we review an algorithm that approximates this sample mean by using the Hungarian Method to compute the distance between partitions. This algorithm leaves room for further processing acceleration.
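To make the two ingredients of the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that partitions are given as lists of cluster labels and that the partition distance is the minimum number of elements that must be moved to turn one partition into the other, which equals the set size minus the best one-to-one matching of cluster overlaps. The function names (`min_cost_assignment`, `partition_distance`, `medoid_partition`) are hypothetical. The last function is a deliberate simplification: it searches only among the sampled partitions, with each draw counted once, rather than approximating the mean over all possible partitions.

```python
def min_cost_assignment(cost):
    """Hungarian Method (potentials variant) on a square cost matrix.

    Returns the minimal total cost of a one-to-one row/column assignment.
    """
    n = len(cost)
    inf = float("inf")
    u = [0] * (n + 1)    # row potentials (1-indexed)
    v = [0] * (n + 1)    # column potentials (1-indexed)
    p = [0] * (n + 1)    # p[j] = row currently assigned to column j
    way = [0] * (n + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip the assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(cost[p[j] - 1][j - 1] for j in range(1, n + 1))


def partition_distance(a, b):
    """Minimum number of elements to move to turn partition a into b."""
    if len(a) != len(b):
        raise ValueError("partitions must label the same set of elements")
    rows = {label: i for i, label in enumerate(dict.fromkeys(a))}
    cols = {label: j for j, label in enumerate(dict.fromkeys(b))}
    k = max(len(rows), len(cols))
    # Negated cluster overlaps, zero-padded to a square matrix, so that
    # minimising the cost maximises the total matched overlap.
    cost = [[0] * k for _ in range(k)]
    for label_a, label_b in zip(a, b):
        cost[rows[label_a]][cols[label_b]] -= 1
    return len(a) + min_cost_assignment(cost)


def medoid_partition(samples):
    """Sampled partition with the smallest sum of squared distances.

    Simplification: candidates are restricted to the sample itself.
    """
    def total_squared_distance(candidate):
        return sum(partition_distance(candidate, s) ** 2 for s in samples)
    return min(samples, key=total_squared_distance)
```

For example, `partition_distance([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])` evaluates to 1, because moving the third element is enough to make the two partitions agree; label names themselves do not matter. Every candidate evaluation calls the distance routine once per sample, which is why the cost of a single distance computation dominates the overall running time.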
Keywords: Bayesian clustering; Dirichlet Process; Mean partition; Partition distance
Year: 2018 PMID: 30314432 PMCID: PMC6186144 DOI: 10.1186/s12859-018-2359-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Performance comparison of time taken to calculate partition-distance on the R1 and R10 simulation sets. Natural logarithm of time is displayed in arbitrary units against the effective partition size n. Each point is an average over 100 different distance calculations, where the error bars show one standard deviation for the R10 test
Fig. 2. The same performance comparison as in Fig. 1 but for the simulation set RM. Natural logarithm of time is displayed in arbitrary units against the number of randomly moved individuals x
Fig. 3. Average effort of 100 calculation repetitions using the old and the new mean partition algorithm as a function of the partition size N
Average calculation time (in ms) of 100 calculation repetitions for different partition sizes

| Partition size N | Time (old) [ms] | Time (new) [ms] | Factor |
|---|---|---|---|
| 150 | 44.9 (± 0.2) | 4.0 (± 0.1) | 11.2 |
| 750 | 1783.3 (± 2.1) | 42.7 (± 0.3) | 41.8 |
| 1350 | 2318.2 (± 2.0) | 33.2 (± 0.3) | 69.8 |
| 1950 | 4706.0 (± 3.2) | 47.1 (± 0.3) | 100.0 |
| 2550ᵃ | 24681.0 (± 47.5) | 181.3 (± 0.9) | 136.1 |
| 3150ᵃ | 36614.3 (± 17.0) | 222.0 (± 1.0) | 164.9 |

ᵃ The clustering samples of the DPGMM for the datasets marked with ᵃ suggested a three-cluster instead of a two-cluster solution. The values in parentheses are the distances to the upper and lower bounds of the corresponding 95% confidence interval
Fig. 4. Average effort of 100 calculation repetitions using the old and the new mean partition algorithm as a function of the average number of clusters within 100 sample partitions
Mean calculation time (in ms) of 100 calculation repetitions for different average numbers of clusters

| Average number of clusters | Time (old) [ms] | Time (new) [ms] | Factor |
|---|---|---|---|
| 2.0 | 44.9 (± 0.2) | 4.0 (± 0.1) | 11.2 |
| 4.8 | 8676.6 (± 5.9) | 417.3 (± 1.9) | 20.8 |
| 6.21 | 9722.5 (± 4.2) | 391.3 (± 1.5) | 24.8 |
| 7.49 | 31186.4 (± 26.7) | 1126.3 (± 2.3) | 27.7 |
| 9.05 | 33277.8 (± 24.8) | 1238.0 (± 2.2) | 26.9 |
| 13.16 | 80026.3 (± 209.5) | 2689.4 (± 8.2) | 29.8 |

The values in parentheses are the distances to the upper and lower bounds of the corresponding 95% confidence interval