Abstract
DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few automated algorithms for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework to create a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross-validation techniques. We find that the overall performance of the maximum difference method is comparable to or greater than that of the gap statistic in multi-cluster scenarios, and that it achieves this performance at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study related to their mixing and cross-validation potential. We particularly recommend the use of the maximum difference method in multi-cluster scenarios given its accuracy and execution times, and present it as an alternative to existing algorithms.
Keywords: Clustering; Dendrogram; Empirical; Gene Expression; Hierarchy
Year: 2016 PMID: 28408972 PMCID: PMC5373427 DOI: 10.12688/f1000research.10103.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
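The abstract names the mode and maximum difference methods but does not define them. As a rough illustration of how a cluster count can be read off a dendrogram automatically, the sketch below cuts the tree inside the widest gap between consecutive merge heights. This largest-gap rule is an assumption made for illustration and is not necessarily the paper's maximum difference method; the function name `estimate_k_by_largest_gap` and the toy heights are hypothetical.

```python
from typing import Sequence


def estimate_k_by_largest_gap(merge_heights: Sequence[float]) -> int:
    """Guess the number of clusters from dendrogram merge heights.

    Assumption (not taken from the paper): cut the tree inside the widest
    gap between two consecutive merge heights.
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges (three data points)")
    # gaps[i] is the jump between merge i and merge i + 1 (0-based).
    gaps = [hi - lo for lo, hi in zip(heights, heights[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    # n points give n - 1 merges; cutting after merge `widest` leaves
    # n - (widest + 1) = len(heights) - widest clusters.
    return len(heights) - widest


# Toy example: 6 points give 5 merges; three tight merges, then a big jump.
toy_heights = [0.4, 0.5, 0.6, 7.0, 9.0]
print(estimate_k_by_largest_gap(toy_heights))  # prints 3
```

In practice the merge heights could come from any hierarchical clustering routine, for example the third column of the linkage matrix returned by SciPy's `scipy.cluster.hierarchy.linkage`. Because the rule always cuts between two merges, it can never return a single cluster.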
Figure 1. Example dendrogram from 26 data points.
Figure 2. Sample clusters of 100 points each.
Average cluster numbers over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 1 | 3.720 | | 2.965 | 4.505 |
| 2 | | | | 2.645 |
| 3 | 3.050 | 3.050 | | 4.725 |
| 4 | 3.915 | 4.060 | | 6.280 |
Average error size (when wrong) over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 1 | 2.720 | | 1.965 | 3.505 |
| 2 | | | | 1.206 |
| 3 | 1.111 | 1.250 | | 1.917 |
| 4 | 1.688 | | | 2.413 |
Average cluster numbers over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 1 | 3.710 | | 3.035 | 4.450 |
| 2 | | | | 2.625 |
| 3 | 3.170 | | 3.060 | 6.055 |
| 4 | 3.940 | 4.140 | | 5.780 |
Average error size (when wrong) over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 1 | 2.710 | | 2.035 | 3.450 |
| 2 | | | | 1.190 |
| 3 | 1.063 | | 1.000 | 3.070 |
| 4 | 1.778 | 1.167 | | 1.945 |
Figure 3. Tracking for varying distances.
Figure 4. Clustering of the ExpressionSet.
Figure 5. Clustering of the ExpressionSet.
Average cluster numbers over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 2 | | | | 2.015 |
| 3 | | 3.050 | | 3.080 |
| 4 | | 4.060 | | 4.065 |
Average error size (when wrong) over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 2 | | | | 1.000 |
| 3 | | 1.250 | | 1.000 |
| 4 | | 1.000 | | 1.000 |
Figure 6. Clustering of the ExpressionSet with mixing.
Average cluster numbers over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 2 | | | | 2.660 |
| 3 | 3.040 | 3.050 | | 4.630 |
| 4 | 3.900 | 4.060 | | 6.180 |
Average error size (when wrong) over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 2 | | | | 1.234 |
| 3 | 1.000 | 1.250 | | 1.884 |
| 4 | 2.000 | 1.000 | | 2.307 |
Figure 7. Clustering of the ExpressionSet with “LOOCV”.
Success rate over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 1 | 0.000 | | 0.000 | 0.000 |
| 2 | | | | 0.465 |
| 3 | 0.955 | 0.960 | | 0.100 |
| 4 | 0.920 | 0.940 | | 0.055 |
Success rate over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 1 | 0.000 | | 0.000 | 0.000 |
| 2 | | | | 0.475 |
| 3 | 0.840 | | 0.920 | 0.005 |
| 4 | 0.955 | 0.880 | | 0.085 |
Success rate over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 2 | | | | 0.985 |
| 3 | | 0.960 | | 0.920 |
| 4 | | 0.940 | | 0.935 |
Success rate over n = 200 runs.
| | | | | |
|---|---|---|---|---|
| 2 | | | | 0.465 |
| 3 | 0.960 | 0.960 | | 0.135 |
| 4 | 0.940 | 0.940 | | 0.055 |
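The tables report three summary statistics per setting: the average estimated cluster number, the average error size when wrong, and the success rate over n = 200 runs. A minimal sketch of how such figures could be computed from a list of per-run estimates follows. It assumes the error size is the absolute difference between the estimate and the true cluster count, which is consistent with the one-cluster rows (for example an average of 3.720 alongside an error of 2.720 and a success rate of 0.000) but is not stated in the record. The function name `summarize_runs` and the run values are hypothetical, not data from the paper.

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def summarize_runs(estimates: Sequence[int], true_k: int) -> Dict[str, Optional[float]]:
    """Summarize repeated cluster-count estimates against a known true count."""
    if not estimates:
        raise ValueError("need at least one run")
    # Assumed definition: error size is |estimate - true_k|, wrong runs only.
    errors = [abs(k - true_k) for k in estimates if k != true_k]
    n_correct = sum(1 for k in estimates if k == true_k)
    return {
        "average_cluster_number": mean(estimates),
        "success_rate": n_correct / len(estimates),
        # Undefined when every run is correct, so report None in that case.
        "average_error_when_wrong": mean(errors) if errors else None,
    }


# Made-up estimates from five runs with three true clusters.
summary = summarize_runs([3, 3, 3, 4, 5], true_k=3)
print(summary)
```

Returning `None` when no run is wrong mirrors the fact that an average error "when wrong" has no value if the method is never wrong.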