Julio A Di Rienzo, Silvia G Valdano, Paula Fernández.
Abstract
The most commonly applied strategies for identifying genes with a common response profile are based on clustering algorithms. These methods have no explicit rule for defining the appropriate number of gene groups. The number of clusters is usually decided on heuristic criteria or by applying one of the methods proposed for assessing the number of clusters in a data set. The purpose of this paper is to compare the performance of seven of these techniques, including traditional ones and some recently proposed. All of them underestimate the true number of clusters. Within this limitation, however, the gDGC algorithm appears to be the best. It is the only one that explicitly states a rule for cutting a dendrogram on the basis of a hypothesis-testing framework, allowing the user to calibrate the sensitivity by adjusting the significance level.
Year: 2011 PMID: 22229026 PMCID: PMC3250619 DOI: 10.1155/2011/261975
Source DB: PubMed Journal: Int J Plant Genomics ISSN: 1687-5389
Figure 1. Dendrogram showing the relationships among mean vectors. The cut-off criterion obtained with the gDGC test, Q_{1−α}, is indicated with a dotted line. At the bottom of the figure, different letters identify groups whose population centroids differ statistically at significance level α.
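The idea of cutting a dendrogram at a fixed height can be illustrated with a minimal sketch. It is not the gDGC procedure: the cut-off value here is an arbitrary placeholder, whereas gDGC derives it from the significance level. The function name `average_linkage_cut` and the example mean vectors are hypothetical and serve only as illustration.

```python
from itertools import combinations
from math import dist


def average_linkage_cut(points, cutoff):
    """Merge clusters by average linkage until the closest pair lies above cutoff."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Average pairwise Euclidean distance between members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (ia, ib), d = min(
            (((a, b), avg_dist(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break  # every remaining merge would sit above the cut line
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]  # ib > ia, so index ia is unaffected
    return clusters


# Hypothetical mean expression vectors (6 genes x 3 treatments).
means = [
    (0.0, 0.1, 0.2), (0.1, 0.0, 0.1), (0.2, 0.1, 0.0),
    (5.0, 5.1, 5.2), (5.1, 5.0, 5.1), (5.2, 5.1, 5.0),
]

# Placeholder cut-off height; gDGC would supply a quantile-based value instead.
clusters = average_linkage_cut(means, cutoff=1.0)
print(len(clusters), clusters)
```

Raising the cut-off merges more groups and lowering it splits them, which is the sensitivity trade-off that a significance level would control in a testing framework.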
Summarized ANOVA table for the terms of the linear model fitted to the bias (estimated minus true number of clusters, k, in the gene-expression matrix). Results are shown for k = 2 and k = 10. Clustering algorithm: average linkage.
| Model terms | numDF | denDF | F (k = 2) | BH-Adj P value (k = 2) | F (k = 10) | BH-Adj P value (k = 10) |
|---|---|---|---|---|---|---|
| | 9 | 720 | 11.44 | <0.0001 | 159.17 | <0.0001 |
| | 1 | 80 | 12.32 | 0.0037 | 21.22 | <0.0001 |
| | 1 | 80 | 3.52 | 0.1653 | 64.34 | <0.0001 |
| | 1 | 80 | 0.01 | 0.9858 | 3.38 | 0.1243 |
| | 9 | 720 | 2.42 | 0.0510 | 2.60 | 0.0177 |
| | 9 | 720 | 1.84 | 0.1653 | 10.96 | <0.0001 |
| | 9 | 720 | 1.71 | 0.1757 | 2.21 | 0.0485 |
| | 1 | 80 | 0.01 | 0.9858 | 2.40 | 0.1824 |
| | 1 | 80 | 1.62 | 0.3812 | 0.01 | 0.9233 |
| | 1 | 80 | 0.49 | 0.8077 | 2.57 | 0.1823 |
| | 9 | 720 | 0.50 | 0.9858 | 1.48 | 0.2050 |
| | 9 | 720 | 0.66 | 0.9858 | 1.06 | 0.4865 |
| | 9 | 720 | 1.79 | 0.1653 | 2.09 | 0.0594 |
| | 1 | 80 | 0.36 | 0.8268 | 0.54 | 0.5333 |
| | 9 | 720 | 0.25 | 0.9858 | 0.79 | 0.6701 |
BH-Adj P value: refers to the adjusted P value according to Benjamini-Hochberg algorithm.
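For readers unfamiliar with the adjustment named in the footnote, the following is a minimal sketch of the standard Benjamini-Hochberg step-up computation. The function name `bh_adjust` and the raw P values are hypothetical; the raw (unadjusted) P values behind the table are not reported here, so the sketch does not reproduce the table.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Visit the P values from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = m - rank_from_top  # ascending rank of this P value (1..m)
        # Scale by m / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.60]  # hypothetical raw P values
adjusted = bh_adjust(raw)
print([round(p, 5) for p in adjusted])
```

The running minimum is the reason several terms can share one adjusted value: a smaller raw P value is never given a larger adjusted value than the next one up in the ranking.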
Estimated mean, standard error, and lower (LB) and upper boundaries (UB) of a 95% confidence interval for the bias for each method applied to the estimation of the number of clusters in the simulated datasets. True number of clusters: k = 2. Clustering algorithm: average linkage.
| Method | Mean bias | Standard error | LB (95%) | UB (95%) |
|---|---|---|---|---|
| HOPACHm | 7.35 | 1.07 | 5.25 | 9.45 |
| HOPACHc | 7.28 | 1.07 | 5.18 | 9.38 |
| CH | 0.44 | 0.10 | 0.24 | 0.64 |
| Gap | 0.35 | 0.10 | 0.15 | 0.55 |
| Silh | 0.34 | 0.10 | 0.14 | 0.54 |
| H | 0.22 | 0.10 | 0.02 | 0.42 |
| CCCm | 0.16 | 0.10 | −0.04 | 0.36 |
| CCC | 0.16 | 0.10 | −0.04 | 0.36 |
| gDGC | 0.03 | 0.10 | −0.17 | 0.23 |
| MClust | 0.01 | 0.10 | −0.19 | 0.21 |
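The interval bounds in this table appear consistent with mean ± 1.96 × standard error. That is an assumption based on the reported numbers, since the quantile used is not stated here. A minimal sketch under that assumption, with the hypothetical helper `normal_ci`:

```python
def normal_ci(mean, se, z=1.96):
    """Two-sided interval mean +/- z * se (z = 1.96 assumed for 95%)."""
    half_width = z * se
    return mean - half_width, mean + half_width


# Values taken from the HOPACHm and gDGC rows of the table above.
hopach_lb, hopach_ub = normal_ci(7.35, 1.07)
gdgc_lb, gdgc_ub = normal_ci(0.03, 0.10)

print(round(hopach_lb, 2), round(hopach_ub, 2))
print(round(gdgc_lb, 2), round(gdgc_ub, 2))
```

An interval that contains zero, as for gDGC here, is compatible with no bias in the estimated number of clusters, whereas the HOPACHm interval lies entirely above zero.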
Estimated mean, standard error, and lower (LB) and upper boundaries (UB) of a 95% confidence interval for the bias for each combination of method (M) and number of treatments (T). Means of bias are sorted descending within each level of T. True number of clusters: k = 10. Clustering algorithm: average linkage.
| Method (M) | Number of treatments (T) | Mean bias | Standard error | LB (95%) | UB (95%) |
|---|---|---|---|---|---|
| HOPACHm | 3 | 33.00 | 2.56 | 32.80 | 33.20 |
| HOPACHc | 3 | 15.43 | 2.56 | 15.23 | 15.63 |
| MClust | 3 | −4.68 | 0.21 | −6.78 | −2.58 |
| gDGC | 3 | −4.77 | 0.21 | −6.87 | −2.67 |
| Gap | 3 | −5.00 | 0.21 | −5.20 | −4.80 |
| H | 3 | −5.09 | 0.21 | −5.29 | −4.89 |
| CH | 3 | −6.14 | 0.21 | −6.34 | −5.94 |
| Silh | 3 | −6.84 | 0.21 | −7.04 | −6.64 |
| CCC | 3 | −7.23 | 0.21 | −7.43 | −7.03 |
| CCCm | 3 | −7.25 | 0.21 | −7.45 | −7.05 |
| HOPACHm | 5 | 36.82 | 2.56 | 34.72 | 38.92 |
| HOPACHc | 5 | 35.82 | 2.56 | 33.72 | 37.92 |
| gDGC | 5 | −1.41 | 0.21 | −1.61 | −1.21 |
| MClust | 5 | −2.02 | 0.21 | −2.22 | −1.82 |
| Gap | 5 | −2.84 | 0.21 | −3.04 | −2.64 |
| CH | 5 | −3.11 | 0.21 | −3.31 | −2.91 |
| H | 5 | −3.34 | 0.21 | −3.54 | −3.14 |
| Silh | 5 | −4.07 | 0.21 | −4.27 | −3.87 |
| CCCm | 5 | −6.39 | 0.21 | −6.59 | −6.19 |
| CCC | 5 | −6.39 | 0.21 | −6.59 | −6.19 |
Summarized ANOVA table for the terms of the linear model fitted to the bias (estimated minus true number of clusters, k, in the gene-expression matrix). Results are shown for k = 2 and k = 10. Clustering algorithm: complete linkage.
| Model terms | numDF | denDF | F (k = 2) | BH-Adj P value (k = 2) | F (k = 10) | BH-Adj P value (k = 10) |
|---|---|---|---|---|---|---|
| | 9 | 720 | 130.17 | <0.0001 | 265.88 | <0.0001 |
| | 1 | 80 | 4.26 | 0.0845 | 11.78 | 0.0017 |
| | 1 | 80 | 26.39 | <0.0001 | 57.35 | <0.0001 |
| | 1 | 80 | 0.54 | 0.5703 | 28.62 | <0.0001 |
| | 9 | 720 | 28.29 | <0.0001 | 10.05 | <0.0001 |
| | 9 | 720 | 19.67 | <0.0001 | 14.43 | <0.0001 |
| | 9 | 720 | 0.71 | 0.7031 | 4.15 | 0.0001 |
| | 1 | 80 | 0.3 | 0.6331 | 0.33 | 0.6046 |
| | 1 | 80 | 0.78 | 0.5063 | 0.83 | 0.4169 |
| | 1 | 80 | 2.19 | 0.2289 | 1.55 | 0.2897 |
| | 9 | 720 | 2.66 | 0.0129 | 3.78 | 0.0002 |
| | 9 | 720 | 0.82 | 0.6331 | 1.27 | 0.3096 |
| | 9 | 720 | 1.43 | 0.2508 | 1.63 | 0.1631 |
| | 1 | 80 | 2.19 | 0.2289 | 2.58 | 0.1631 |
| | 9 | 720 | 2.23 | 0.0419 | 0.48 | 0.8888 |
BH-Adj P value: refers to the adjusted P value according to Benjamini-Hochberg algorithm.
Estimated means, standard error, and lower (LB) and upper boundaries (UB) of a 95% confidence interval for the bias for each combination of method (M) and number of genes (G). The table is sorted in descending order of bias within each level of G. True number of clusters: k = 2. Clustering algorithm: complete linkage.
| Method (M) | Number of genes (G) | Mean bias | Standard error | LB (95%) | UB (95%) |
|---|---|---|---|---|---|
| HOPACHm | 100 | 14.93 | 2.08 | 10.85 | 19.01 |
| HOPACHc | 100 | 11.14 | 2.08 | 7.05 | 15.22 |
| Gap | 100 | 3.11 | 0.22 | 2.69 | 3.54 |
| H | 100 | 1.93 | 0.22 | 1.51 | 2.35 |
| CH | 100 | 0.68 | 0.22 | 0.26 | 1.10 |
| Silh | 100 | 0.45 | 0.22 | 0.03 | 0.88 |
| CCC | 100 | 0.16 | 0.22 | −0.26 | 0.58 |
| CCCm | 100 | 0.16 | 0.22 | −0.26 | 0.58 |
| gDGC | 100 | 0.05 | 0.22 | −0.38 | 0.47 |
| MClust | 100 | −0.05 | 0.22 | −0.47 | 0.38 |
| HOPACHm | 300 | 8.16 | 2.08 | 4.08 | 12.24 |
| H | 300 | 6.41 | 0.22 | 5.99 | 6.83 |
| Gap | 300 | 4.77 | 0.22 | 4.35 | 5.20 |
| HOPACHc | 300 | 3.95 | 2.08 | −0.13 | 8.04 |
| CH | 300 | 0.11 | 0.22 | −0.31 | 0.54 |
| gDGC | 300 | 0.09 | 0.22 | −0.33 | 0.51 |
| Silh | 300 | 0.07 | 0.22 | −0.35 | 0.49 |
| CCC | 300 | 0.02 | 0.22 | −0.40 | 0.45 |
| CCCm | 300 | 0.02 | 0.22 | −0.40 | 0.45 |
| MClust | 300 | −0.02 | 0.22 | −0.45 | 0.40 |
Estimated mean, standard error, and lower (LB) and upper boundaries (UB) of a 95% confidence interval for the bias of each combination of method (M) and number of treatments (T). The table is sorted in descending order of bias within each level of T. True number of clusters: k = 2. Clustering algorithm: complete linkage.
| Method (M) | Number of treatments (T) | Mean bias | Standard error | LB (95%) | UB (95%) |
|---|---|---|---|---|---|
| HOPACHm | 3 | 17.25 | 2.08 | 13.17 | 21.33 |
| HOPACHc | 3 | 9.27 | 2.08 | 5.19 | 13.35 |
| H | 3 | 5.84 | 0.22 | 5.42 | 6.26 |
| Gap | 3 | 5.43 | 0.22 | 5.01 | 5.85 |
| CH | 3 | 0.75 | 0.22 | 0.33 | 1.17 |
| Silh | 3 | 0.41 | 0.22 | −0.01 | 0.83 |
| CCC | 3 | 0.18 | 0.22 | −0.24 | 0.60 |
| CCCm | 3 | 0.18 | 0.22 | −0.24 | 0.60 |
| gDGC | 3 | 0.00 | 0.22 | −0.42 | 0.42 |
| MClust | 3 | −0.07 | 0.22 | −0.49 | 0.35 |
| HOPACHm | 5 | 5.84 | 2.08 | 1.76 | 9.92 |
| HOPACHc | 5 | 5.82 | 2.08 | 1.74 | 9.90 |
| H | 5 | 2.50 | 0.22 | 2.08 | 2.92 |
| Gap | 5 | 2.45 | 0.22 | 2.03 | 2.88 |
| gDGC | 5 | 0.14 | 0.22 | −0.29 | 0.56 |
| Silh | 5 | 0.11 | 0.22 | −0.31 | 0.54 |
| CH | 5 | 0.05 | 0.22 | −0.38 | 0.47 |
| CCCm | 5 | 0.00 | 0.22 | −0.42 | 0.42 |
| MClust | 5 | 0.00 | 0.22 | −0.42 | 0.42 |
| CCC | 5 | 0.00 | 0.22 | −0.42 | 0.42 |
Estimated mean, standard error, and lower (LB) and upper boundaries (UB) of a 95% confidence interval for the bias of each combination of method (M) and number of replicates (N). The table is sorted in descending order of bias within each level of N. True number of clusters: k = 10. Clustering algorithm: complete linkage.
| Method (M) | Number of replicates (N) | Mean bias | Standard error | LB (95%) | UB (95%) |
|---|---|---|---|---|---|
| HOPACHm | 3 | 33.00 | 2.58 | 27.94 | 38.06 |
| HOPACHc | 3 | 25.18 | 2.58 | 20.12 | 30.24 |
| Gap | 3 | −1.95 | 0.23 | −2.40 | −1.50 |
| H | 3 | −2.27 | 0.23 | −2.72 | −1.82 |
| gDGC | 3 | −3.75 | 0.23 | −4.20 | −3.30 |
| MClust | 3 | −3.95 | 0.23 | −4.40 | −3.50 |
| CH | 3 | −5.48 | 0.23 | −5.93 | −5.03 |
| Silh | 3 | −6.57 | 0.23 | −7.02 | −6.12 |
| CCCm | 3 | −7.11 | 0.23 | −7.56 | −6.66 |
| CCC | 3 | −7.23 | 0.23 | −7.68 | −6.78 |
| HOPACHm | 6 | 32.39 | 2.58 | 27.33 | 37.45 |
| HOPACHc | 6 | 27.43 | 2.58 | 22.37 | 32.49 |
| Gap | 6 | −0.89 | 0.23 | −1.34 | −0.44 |
| H | 6 | −1.36 | 0.23 | −1.81 | −0.91 |
| gDGC | 6 | −2.45 | 0.23 | −2.90 | −2.00 |
| MClust | 6 | −2.98 | 0.23 | −3.43 | −2.53 |
| CH | 6 | −3.55 | 0.23 | −4.00 | −3.10 |
| Silh | 6 | −5.25 | 0.23 | −5.70 | −4.80 |
| CCCm | 6 | −7.16 | 0.23 | −7.61 | −6.71 |
| CCC | 6 | −7.23 | 0.23 | −7.68 | −6.78 |
Estimated mean, standard error, and lower (LB) and upper boundaries (UB) of a 95% confidence interval for the bias of each combination of method (M), number of genes (G), and number of treatments (T). The table is sorted in descending order of bias within each level of G and T. True number of clusters: k = 10. Clustering algorithm: complete linkage.
| Method (M) | Number of genes (G) | Number of treatments (T) | Mean bias | Standard error | LB (95%) | UB (95%) |
|---|---|---|---|---|---|---|
| HOPACHm | 100 | 3 | 28.18 | 3.64 | 21.05 | 35.31 |
| HOPACHc | 100 | 3 | 23.14 | 3.64 | 16.01 | 30.27 |
| Gap | 100 | 3 | −2.14 | 0.32 | −2.77 | −1.51 |
| H | 100 | 3 | −3.50 | 0.32 | −4.13 | −2.87 |
| gDGC | 100 | 3 | −4.55 | 0.32 | −5.18 | −3.92 |
| CH | 100 | 3 | −4.91 | 0.32 | −5.54 | −4.28 |
| MClust | 100 | 3 | −5.09 | 0.32 | −5.72 | −4.46 |
| Silh | 100 | 3 | −6.82 | 0.32 | −7.45 | −6.19 |
| CCCm | 100 | 3 | −7.41 | 0.32 | −8.04 | −6.78 |
| CCC | 100 | 3 | −7.41 | 0.32 | −8.04 | −6.78 |
| HOPACHc | 100 | 5 | 30.05 | 3.64 | 22.92 | 37.18 |
| HOPACHm | 100 | 5 | 26.95 | 3.64 | 19.82 | 34.08 |
| Gap | 100 | 5 | −1.45 | 0.32 | −2.08 | −0.82 |
| gDGC | 100 | 5 | −2.00 | 0.32 | −2.63 | −1.37 |
| MClust | 100 | 5 | −2.45 | 0.32 | −3.08 | −1.82 |
| H | 100 | 5 | −3.05 | 0.32 | −3.68 | −2.42 |
| CH | 100 | 5 | −3.73 | 0.32 | −4.36 | −3.10 |
| Silh | 100 | 5 | −4.95 | 0.32 | −5.58 | −4.32 |
| CCCm | 100 | 5 | −7.09 | 0.32 | −7.72 | −6.46 |
| CCC | 100 | 5 | −7.36 | 0.32 | −7.99 | −6.73 |
| HOPACHm | 300 | 3 | 40.23 | 3.64 | 33.1 | 47.36 |
| HOPACHc | 300 | 3 | 16.09 | 3.64 | 8.96 | 23.22 |
| H | 300 | 3 | 0.36 | 0.32 | −0.27 | 0.99 |
| Gap | 300 | 3 | −1.18 | 0.32 | −1.81 | −0.55 |
| gDGC | 300 | 3 | −4.14 | 0.32 | −4.77 | −3.51 |
| MClust | 300 | 3 | −4.23 | 0.32 | −4.86 | −3.60 |
| CH | 300 | 3 | −6.36 | 0.32 | −6.99 | −5.73 |
| Silh | 300 | 3 | −7.41 | 0.32 | −8.04 | −6.78 |
| CCCm | 300 | 3 | −7.45 | 0.32 | −8.08 | −6.82 |
| CCC | 300 | 3 | −7.45 | 0.32 | −8.08 | −6.82 |
| HOPACHc | 300 | 5 | 35.95 | 3.64 | 28.82 | 43.08 |
| HOPACHm | 300 | 5 | 35.41 | 3.64 | 28.28 | 42.54 |
| Gap | 300 | 5 | −0.91 | 0.32 | −1.54 | −0.28 |
| H | 300 | 5 | −1.09 | 0.32 | −1.72 | −0.46 |
| gDGC | 300 | 5 | −1.73 | 0.32 | −2.36 | −1.10 |
| MClust | 300 | 5 | −2.09 | 0.32 | −2.72 | −1.46 |
| CH | 300 | 5 | −3.05 | 0.32 | −3.68 | −2.42 |
| Silh | 300 | 5 | −4.45 | 0.32 | −5.08 | −3.82 |
| CCCm | 300 | 5 | −6.59 | 0.32 | −7.22 | −5.96 |
| CCC | 300 | 5 | −6.68 | 0.32 | −7.31 | −6.05 |