Tonglin Zhang, Ge Lin.
Abstract
Generalized k-means can be combined with any similarity or dissimilarity measure for clustering. Using the well-known likelihood ratio or F-statistic as the dissimilarity measure, a generalized k-means method is proposed to group generalized linear models (GLMs) for exponential family distributions. Given the number of clusters k, the proposed method is established by the uniformly most powerful unbiased (UMPU) test statistic for the comparison between GLMs. If k is unknown, then the proposed method can be combined with the generalized information criterion (GIC) to automatically select the best k for clustering. Both AIC and BIC are investigated as special cases of GIC. Theoretical and simulation results show that the number of clusters can be correctly identified by BIC but not AIC. The proposed method is applied to the state-level daily COVID-19 data in the United States, and it identifies 6 clusters. A further study shows that the models between clusters are significantly different from each other, which confirms the result with 6 clusters.
Keywords: COVID-19; Clustering; Exponential family distributions; Generalized
Year: 2021 PMID: 33723467 PMCID: PMC7943386 DOI: 10.1016/j.csda.2021.107217
Source DB: PubMed Journal: Comput Stat Data Anal ISSN: 0167-9473 Impact factor: 1.681
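The procedure described in the abstract can be illustrated with a small sketch. It is a simplification, not the authors' algorithm: it assumes Gaussian linear models (one exponential-family case), uses each object's residual sum of squares under a cluster's pooled fit as a stand-in for the likelihood-ratio dissimilarity, and scores k with a basic BIC that counts only regression coefficients. The function names `fit`, `rss`, `generalized_kmeans`, and `select_k`, and the simulated six-line data, are illustrative assumptions.

```python
import numpy as np


def fit(X_parts, y_parts):
    """Least-squares coefficients for the stacked data of one cluster."""
    beta, *_ = np.linalg.lstsq(np.vstack(X_parts), np.concatenate(y_parts), rcond=None)
    return beta


def rss(X, y, beta):
    """Residual sum of squares of one object's data under coefficients beta."""
    resid = y - X @ beta
    return float(resid @ resid)


def generalized_kmeans(Xs, ys, k, n_iter=50, seed=0):
    """Group regression models into k clusters; returns (labels, total RSS)."""
    rng = np.random.default_rng(seed)
    m = len(Xs)
    # Start each cluster from the individual fit of a randomly chosen object.
    betas = [fit([Xs[i]], [ys[i]]) for i in rng.choice(m, size=k, replace=False)]
    labels = np.full(m, -1)
    for _ in range(n_iter):
        # Assignment step: move each object to the cluster model that fits it best.
        new = np.array([np.argmin([rss(Xs[i], ys[i], b) for b in betas]) for i in range(m)])
        if np.array_equal(new, labels):
            break
        labels = new
        # Update step: refit every non-empty cluster on its pooled data.
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size:
                betas[c] = fit([Xs[i] for i in idx], [ys[i] for i in idx])
    total = sum(rss(Xs[i], ys[i], betas[labels[i]]) for i in range(m))
    return labels, total


def select_k(Xs, ys, k_values, n_starts=5):
    """Pick k by a simplified BIC: n*log(RSS/n) + (k*p)*log(n)."""
    n = sum(len(y) for y in ys)
    p = Xs[0].shape[1]
    results = {}
    for k in k_values:
        # Keep the best of several random starts to reduce local-optimum risk.
        runs = (generalized_kmeans(Xs, ys, k, seed=s) for s in range(n_starts))
        labels, total = min(runs, key=lambda r: r[1])
        results[k] = (n * np.log(total / n) + k * p * np.log(n), labels)
    best_k = min(results, key=lambda k: results[k][0])
    return best_k, results[best_k][1]


# Simulated demo (assumed data): six regression lines drawn from two true models.
rng = np.random.default_rng(1)
true_betas = [np.array([0.0, 1.0])] * 3 + [np.array([2.0, -1.0])] * 3
Xs, ys = [], []
for beta in true_betas:
    x = rng.uniform(0, 1, size=500)
    X = np.column_stack([np.ones_like(x), x])
    Xs.append(X)
    ys.append(X @ beta + rng.normal(scale=0.1, size=500))

k_hat, labels = select_k(Xs, ys, k_values=[1, 2, 3])
print(k_hat, labels)
```

Swapping the `log(n)` penalty for a constant 2 gives the AIC analogue in this sketch, which penalizes extra clusters less heavily.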
Fig. 1. Generalized k-means clustering for six regression lines.
Percentage of numbers of clusters identified correctly (IC) based on 1000 simulation replications when data are generated from (26), for k-means (K), convex clustering (Convex), the EM algorithm, and our AIC and BIC selectors in generalized k-means.
| K | Convex | EM | AIC | BIC | K | Convex | EM | AIC | BIC | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 10 | 2 | 53.2 | 59.7 | 81.1 | 15.4 | 70.2 | 73.9 | 72.4 | 21.4 | ||
| 3 | 23.3 | 19.2 | 0.0 | 5.6 | 10.7 | 7.4 | 0.0 | 6.2 | ||||
| 20 | 2 | 85.1 | 87.2 | 75.1 | 0.5 | 92.3 | 91.7 | 75.2 | 0.3 | |||
| 3 | 6.7 | 5.6 | 0.0 | 0.0 | 1.7 | 1.5 | 0.0 | 0.0 | ||||
| 1.0 | 10 | 2 | 22.6 | 25.0 | 75.1 | 17.3 | 35.9 | 39.2 | 71.6 | 15.9 | ||
| 3 | 27.3 | 21.6 | 0.0 | 7.3 | 24.4 | 21.5 | 0.0 | 4.9 | ||||
| 20 | 2 | 52.4 | 52.7 | 74.2 | 0.5 | 69.5 | 71.5 | 72.2 | 0.3 | |||
| 3 | 15.6 | 13.0 | 0.0 | 0.3 | 15.1 | 14.0 | 0.0 | 0.0 | ||||
Percentage of clustering object errors based on 1000 simulation replications when data are generated from (26), for k-means (K), convex clustering (Convex), the EM algorithm, and our AIC and BIC selectors in generalized k-means.
| K | Convex | EM | BIC | K | Convex | EM | BIC | |||
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 10 | 2 | 48.7 | 46.9 | 10.0 | 47.4 | 46.9 | 33.6 | ||
| 3 | 47.7 | 47.1 | 14.5 | 49.5 | 49.2 | 33.1 | ||||
| 20 | 2 | 49.7 | 49.5 | 12.8 | 49.5 | 49.4 | 33.1 | |||
| 3 | 49.8 | 49.7 | 12.8 | 50.1 | 50.0 | 31.8 | ||||
| 1.0 | 10 | 2 | 48.8 | 48.5 | 13.1 | 49.0 | 48.6 | 33.6 | ||
| 3 | 45.6 | 44.5 | 14.9 | 46.7 | 45.8 | 36.4 | ||||
| 20 | 2 | 49.8 | 49.6 | 13.3 | 49.7 | 49.6 | 34.5 | |||
| 3 | 48.4 | 47.8 | 14.3 | 49.0 | 48.7 | 36.5 | ||||
Percentage of numbers of clusters identified correctly (IC) based on 1000 simulation replications when data are generated from (27), for k-means (K), convex clustering (Convex), and k-means++ (KPP) applied directly to regression coefficients with the gap statistic, and our BIC selector in generalized k-means.
| K | Convex | KPP | BIC | K | Convex | KPP | BIC | |||
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 10 | 2 | 92.6 | 98.5 | 88.2 | 96.5 | 93.9 | |||
| 3 | 72.3 | 99.2 | 90.2 | 72.4 | 99.9 | 93.1 | ||||
| 20 | 2 | 99.7 | 98.9 | 99.9 | ||||||
| 3 | 73.3 | 99.2 | 72.9 | 99.9 | ||||||
| 0.2 | 10 | 2 | 94.6 | 97.8 | 90.9 | 96.0 | 99.8 | 93.3 | ||
| 3 | 36.8 | 47.5 | 26.9 | 75.6 | 99.2 | 95.0 | ||||
| 20 | 2 | 99.7 | 99.0 | 99.9 | ||||||
| 3 | 24.6 | 24.4 | 10.6 | 77.2 | 99.8 | |||||
| 0.5 | 10 | 2 | 95.7 | 99.9 | 93.5 | 96.0 | 99.1 | 95.4 | ||
| 3 | 1.1 | 1.3 | 1.2 | 1.3 | 1.4 | 1.4 | ||||
| 20 | 2 | 99.9 | 99.7 | 99.9 | 99.8 | |||||
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||
| 1.0 | 10 | 2 | 96.2 | 93.0 | 97.7 | 95.8 | ||||
| 3 | 0.9 | 0.8 | 1.6 | 0.3 | 0.2 | 0.3 | ||||
| 20 | 2 | 99.6 | 99.6 | 99.9 | ||||||
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||
Percentage of clustering object errors based on 1000 simulation replications when data are generated from (27), for k-means (K), convex clustering (Convex), and k-means++ (KPP) applied directly to regression coefficients, and our BIC selector in generalized k-means.
| K | Convex | KPP | BIC | K | Convex | KPP | BIC | |||
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 10 | 2 | 0.4 | 0.1 | 0.6 | 0.2 | 0.3 | |||
| 3 | 1.4 | 0.2 | 1.5 | 0.1 | ||||||
| 20 | 2 | |||||||||
| 3 | 1.5 | 1.5 | ||||||||
| 0.2 | 10 | 2 | 0.3 | 0.1 | 0.5 | 0.2 | 0.3 | |||
| 3 | 14.0 | 12.4 | 16.1 | 1.4 | 0.1 | |||||
| 20 | 2 | |||||||||
| 3 | 17.5 | 17.6 | 20.6 | 1.5 | ||||||
| 0.5 | 10 | 2 | 10.8 | 10.3 | 10.4 | 0.9 | 0.6 | 0.9 | ||
| 3 | 33.1 | 32.6 | 32.2 | 26.0 | 26.8 | 25.7 | ||||
| 20 | 2 | 8.4 | 8.2 | 8.3 | 0.5 | 0.5 | 0.5 | |||
| 3 | 31.4 | 31.3 | 31.2 | 25.9 | 26.3 | 25.6 | ||||
| 1.0 | 10 | 2 | 41.1 | 40.2 | 40.0 | 22.8 | 21.7 | 20.5 | ||
| 3 | 46.8 | 45.7 | 46.4 | 38.9 | 38.1 | 38.0 | ||||
| 20 | 2 | 39.4 | 38.9 | 38.1 | 18.6 | 18.0 | 17.7 | |||
| 3 | 45.9 | 45.2 | 45.1 | 36.4 | 36.1 | 35.8 | ||||
Percentage of numbers of clusters identified correctly (IC) based on 1000 simulation replications when data are generated from a Euclidean space, for k-means (K), convex clustering (Convex), and k-means++ (KPP) based on the gap statistic.
| K | Convex | KPP | K | Convex | KPP | ||
|---|---|---|---|---|---|---|---|
| 0.001 | 2 | ||||||
| 3 | 62.1 | 89.4 | 60.2 | 91.7 | |||
| 4 | 44.9 | 80.4 | 44.5 | 85.7 | |||
| 0.002 | 2 | ||||||
| 3 | 62.0 | 88.9 | 59.6 | 93.4 | |||
| 4 | 46.2 | 85.0 | 48.9 | 86.7 | |||
| 0.005 | 2 | ||||||
| 3 | 65.3 | 89.7 | 64.5 | 93.5 | |||
| 4 | 45.3 | 83.9 | 47.8 | 85.7 | |||
| 0.01 | 2 | ||||||
| 3 | 67.0 | 89.6 | 67.1 | 90.7 | |||
| 4 | 46.9 | 83.3 | 48.4 | 84.1 | |||
Percentage of clustering object errors based on 1000 simulation replications when data are generated from a Euclidean space, for k-means (K), convex clustering (Convex), and k-means++ (KPP) based on the gap statistic.
| K | Convex | KPP | K | Convex | KPP | ||
|---|---|---|---|---|---|---|---|
| 0.001 | 2 | ||||||
| 3 | 2.4 | 2.5 | |||||
| 4 | 4.6 | 0.3 | 4.2 | 0.4 | |||
| 0.002 | 2 | ||||||
| 3 | 2.4 | 2.6 | |||||
| 4 | 4.3 | 0.4 | 4.3 | 0.4 | |||
| 0.005 | 2 | ||||||
| 3 | 2.1 | 2.2 | 0.1 | ||||
| 4 | 4.1 | 0.3 | 4.5 | 0.4 | |||
| 0.01 | 2 | ||||||
| 3 | 2.0 | 0.1 | 2.0 | 0.5 | |||
| 4 | 4.1 | 0.4 | 4.3 | 0.4 | |||
Percentage of numbers of clusters identified correctly (IC) in loglinear models based on 1000 simulation replications when data are generated from (28).
| AIC | BIC | AIC | BIC | AIC | BIC | AIC | BIC | ||
|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 10 | 1.6 | 0.3 | 1.6 | 0.4 | ||||
| 20 | 0.1 | 0.0 | 0.0 | 0.0 | |||||
| 1.0 | 10 | 2.1 | 1.2 | 1.3 | 0.6 | ||||
| 20 | 0.0 | 0.0 | 0.1 | 0.0 | |||||
Percentage of clustering object errors for the BIC selector in loglinear models, based on 1000 simulation replications with data generated from (28).
| | | 2 | 3 | 2 | 3 |
|---|---|---|---|---|---|
| 0.5 | 10 | 1.3 | 0.6 | 0.8 | 0.4 |
| | 20 | 4.4 | 2.1 | 3.0 | 1.5 |
| 1.0 | 10 | 1.1 | 0.5 | 0.7 | 0.3 |
| | 20 | 3.3 | 1.5 | 2.5 | 1.1 |
Fig. 2. Daily new cases of COVID-19 in states in the mainland United States.
Fitting results of the exponential and the Gamma models for the outbreak of COVID-19 in eleven selected countries between January 11 and May 31, 2020.
| Country | Exponential | | | Gamma | | | | Peak |
|---|---|---|---|---|---|---|---|---|
| China | 7.91 | −0.032 | 0.368 | −9.6 | 7.77 | −0.290 | 0.813 | 02/07 |
| USA | 7.23 | 0.025 | 0.582 | −64.7 | 20.56 | −0.195 | 0.939 | 04/26 |
| Canada | 4.19 | 0.026 | 0.561 | −75.1 | 22.61 | −0.215 | 0.920 | 04/26 |
| Russia | 3.67 | 0.044 | 0.840 | −135.2 | 37.90 | −0.308 | 0.993 | 05/13 |
| Spain | 6.59 | 0.013 | 0.164 | −77.0 | 24.72 | −0.283 | 0.899 | 04/08 |
| UK | 5.53 | 0.023 | 0.446 | −82.9 | 25.25 | −0.248 | 0.857 | 04/22 |
| Italy | 6.66 | 0.010 | 0.110 | −58.0 | 19.53 | −0.238 | 0.945 | 04/03 |
| France | 6.26 | 0.012 | 0.096 | −96.9 | 30.47 | −0.353 | 0.694 | 04/07 |
| Germany | 6.33 | 0.011 | 0.103 | −83.7 | 26.80 | −0.317 | 0.862 | 04/05 |
| Switzerland | 4.90 | 0.006 | 0.030 | −116.0 | 36.50 | −0.463 | 0.853 | 03/30 |
| Sweden | 3.28 | 0.026 | 0.626 | −43.75 | 13.6 | −0.123 | 0.876 | 04/30 |
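The Peak column can be sanity-checked under an assumed form of the Gamma model. The sketch below assumes (none of this is stated in the text above) that each row reads as country, three exponential-model values, four Gamma-model values, and the peak date; that the first three Gamma values are coefficients a, b, c of a log-linear mean curve log μ(t) = a + b log t + c t; and that t counts days from January 11, 2020. Under those assumptions the curve peaks at t* = −b/c. The helper name `gamma_peak_day` is hypothetical.

```python
from datetime import date, timedelta

START = date(2020, 1, 11)  # assumed time origin (first day of the fitting window)


def gamma_peak_day(b, c):
    """Maximizer of a + b*log(t) + c*t over t > 0; needs b > 0 and c < 0."""
    if b <= 0 or c >= 0:
        raise ValueError("an interior peak needs b > 0 and c < 0")
    return -b / c


# (b, c, reported peak) copied from three rows of the table above.
rows = {
    "China": (7.77, -0.290, date(2020, 2, 7)),
    "USA": (20.56, -0.195, date(2020, 4, 26)),
    "Russia": (37.90, -0.308, date(2020, 5, 13)),
}

gaps = {}
for country, (b, c, reported) in rows.items():
    # date + timedelta ignores the fractional part of the day count.
    implied = START + timedelta(days=gamma_peak_day(b, c))
    gaps[country] = abs((implied - reported).days)
    print(country, implied.isoformat(), "reported", reported.isoformat())
```

With the rounded coefficients shown, the implied dates for these three rows fall within one day of the reported peaks, which is consistent with the assumed model form but does not confirm it.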
Fig. 3. AIC and BIC for the number of clusters in generalized k-means based on (30).
Fig. 4. Six clusters identified by BIC in generalized k-means for the period from February 24 to May 31 (left) and the period from February 24 to July 31 (right).
Fig. 5. Gap statistics for the number of clusters in k-means for coefficients.
Parameter estimates in the six clusters, with one selected state (State) for each cluster, based on the Gamma model for the outbreak of COVID-19 in the United States. Standard errors are given in parentheses, and × means out of control.
| Cluster | State | Peak (02/24–05/31) | Peak (02/24–07/31) |
|---|---|---|---|
| 1 | California | 5/10 (2.27) | × |
| 2 | New York | 4/3 (0.28) | 4/12 |
| 3 | Illinois | 4/28 (0.81) | 5/11 |
| 4 | Louisiana | 4/8 (0.39) | × |
| 5 | Minnesota | 5/17 (7.3) | 6/5 |
| 6 | Florida | 4/26 (0.42) | × |