Shima Ghassempour, Federico Girosi, Anthony Maeder.
Abstract
In this paper we describe an algorithm for clustering multivariate time series whose variables take both categorical and continuous values. Time series of this type are frequent in health care, where they represent the health trajectories of individuals. The problem is challenging because categorical variables make it difficult to define a meaningful distance between trajectories. We propose an approach based on Hidden Markov Models (HMMs): we first map each trajectory into an HMM, then define a suitable distance between HMMs, and finally cluster the HMMs with a method based on a distance matrix. We test our approach on a simulated, but realistic, data set of 1,255 trajectories of individuals of age 45 and over, on a synthetic validation set with known clustering structure, and on a smaller set of 268 trajectories extracted from the longitudinal Health and Retirement Survey. The proposed method can be implemented quite simply using standard packages in R and Matlab. It may be a good candidate for solving the difficult problem of clustering multivariate time series with categorical variables using tools that do not require advanced statistical knowledge and are therefore accessible to a wide range of researchers.
Year: 2014 PMID: 24662996 PMCID: PMC3968966 DOI: 10.3390/ijerph110302741
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. The main premise of this paper is that while the distance between two trajectories is ill-defined, the distance between two probabilistic models that are likely to generate them is well-defined. The class of probabilistic models considered in this paper is Hidden Markov Models (HMMs).
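To make the premise concrete, the sketch below computes one possible distance between two discrete HMMs: a symmetrised, sampling-based log-likelihood difference. This excerpt does not state which distance the paper uses, so the choice of distance, the function names (`log_likelihood`, `sample`, `hmm_distance`) and the two toy models are all assumptions made for illustration.

```python
import math
import random

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete HMM.

    Assumes every observed symbol has non-zero probability under the model.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)              # P(obs[t] | obs[:t])
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

def sample(start, trans, emit, length, rng):
    """Draw one observation sequence of the given length from a discrete HMM."""
    state = rng.choices(range(len(start)), weights=start)[0]
    obs = []
    for _ in range(length):
        obs.append(rng.choices(range(len(emit[state])), weights=emit[state])[0])
        state = rng.choices(range(len(trans[state])), weights=trans[state])[0]
    return obs

def hmm_distance(hmm_a, hmm_b, length=500, seed=0):
    """Symmetrised per-symbol log-likelihood gap (an assumed, illustrative distance)."""
    seq_a = sample(*hmm_a, length, random.Random(seed))
    seq_b = sample(*hmm_b, length, random.Random(seed))
    d_ab = (log_likelihood(seq_a, *hmm_a) - log_likelihood(seq_a, *hmm_b)) / length
    d_ba = (log_likelihood(seq_b, *hmm_b) - log_likelihood(seq_b, *hmm_a)) / length
    return 0.5 * (d_ab + d_ba)

# Hypothetical toy models, each given as (start, transition, emission).
hmm_a = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.95, 0.05], [0.05, 0.95]])
hmm_b = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
print(hmm_distance(hmm_a, hmm_b))
```

Evaluating such a distance for every pair of fitted HMMs would give the distance matrix that a matrix-based clustering method takes as input.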
The Davies-Bouldin (DB), Silhouette and Dunn indices for the 45 and Up data, in the case of 3 hidden states. The reason for choosing 3 hidden states is found in Subsection 4.1.
| Number of clusters | DB | Silhouette | Dunn |
|---|---|---|---|
| 2-Cluster | 1.6739 | 0.2074 | 0.3679 |
| 3-Cluster | 3.0126 | 0.1458 | 0.1815 |
| 5-Cluster | 2.0816 | 0.1964 | 0.2125 |
| 6-Cluster | 2.9047 | 0.1291 | 0.1690 |
| 7-Cluster | 2.1018 | 0.1724 | 0.2176 |
| 8-Cluster | 2.0894 | 0.1895 | 0.1929 |
| 9-Cluster | 1.9710 | 0.1651 | 0.2210 |
| 10-Cluster | 1.6086 | 0.2216 | 0.2601 |
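Two of the three validity indices can be computed directly from a distance matrix and a vector of cluster labels. The following is a minimal sketch of the standard Silhouette and Dunn definitions. The function names and the four-point toy example are hypothetical, and this is not the code used to produce the table. The DB index is left out because it is usually defined in terms of cluster centroids, which a bare distance matrix does not provide. Higher Silhouette and Dunn values indicate better-separated clusters, while lower DB values are better.

```python
def silhouette(dist, labels):
    """Mean silhouette width from a precomputed distance matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                     # singleton cluster: score is 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)          # mean intra-cluster distance
        b = min(                                             # nearest other cluster
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def dunn(dist, labels):
    """Dunn index: smallest between-cluster distance over largest within-cluster distance.

    Assumes at least two clusters and at least one cluster with two or more members.
    """
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    inter = min(dist[i][j] for i, j in pairs if labels[i] != labels[j])
    intra = max(dist[i][j] for i, j in pairs if labels[i] == labels[j])
    return inter / intra

# Toy example: four points on a line forming two tight groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]
print(silhouette(dist, labels), dunn(dist, labels))
```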
Figure 2. MDS for the 45 and Up data.
Figure 3. The profile of the four clusters in the feature space for the 45 and Up data.
Interpretation of the four clusters for the 45 and Up synthetic data. Note that expressions such as “mostly”, “significant” or “some” do not refer to the size of the effect on an individual, but rather to the size of the population that experiences the effect. Therefore “Some weight gain” means that some of the people in the cluster experience weight gain. The interpretation of not-smoking behaviour is omitted because of lack of change.
| Order of disease onset | BMI before 1st disease | Weight change after 3rd disease | Smoking behaviour after 3rd disease |
|---|---|---|---|
| Heart disease, stroke and diabetes almost simultaneously | Mostly overweight/obese before 1st disease | Some weight loss after 3rd disease | No change in smoking behaviour after 3rd disease |
| Diabetes, heart disease and then stroke | Significantly overweight/obese before 1st disease | Some weight gain after 3rd disease | Mild increase in quitting smoking after 3rd disease |
| Diabetes, stroke and then heart disease | Half time normal BMI before 1st disease | Significant weight gain after 3rd disease | Mild increase in quitting after 3rd disease |
| Diabetes and then heart disease and stroke | Mostly overweight/obese before 1st disease | Significant weight loss after 3rd disease | Mild increase in quitting after 3rd disease |
The optimal number of hidden states for the 45 and Up synthetic data is 3, since it corresponds to the lowest average correlation across the feature profiles.
| Number of hidden states | Average correlation |
|---|---|
| 2 Hidden states | 0.74 |
| 4 Hidden states | 0.47 |
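The selection rule in the caption (choose the number of hidden states with the lowest average correlation across the feature profiles) can be sketched as follows. This excerpt does not define the feature profiles or say how the average is taken. The sketch therefore assumes that each cluster has one numeric profile vector and that the average runs over the Pearson correlations of all cluster pairs. The function names and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_profile_correlation(profiles):
    """Mean Pearson correlation over all pairs of cluster profiles (assumed definition)."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(p, q) for p, q in pairs) / len(pairs)

def pick_hidden_states(candidates):
    """Return the candidate number of hidden states whose profiles are least correlated."""
    return min(candidates, key=lambda k: average_profile_correlation(candidates[k]))

# Hypothetical toy input: number of hidden states -> list of cluster profiles.
candidates = {
    2: [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],   # perfectly correlated profiles
    3: [[1.0, 2.0, 3.0], [1.0, 3.0, 1.0]],   # uncorrelated profiles
}
print(pick_hidden_states(candidates))
```

The idea behind such a rule is that weakly correlated profiles correspond to clusters that are easier to tell apart.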
The optimal number of hidden states for the validation data set is 3, since it corresponds to the lowest average correlation across the feature profiles. This is indeed the number of hidden states used to generate the data.
| Number of hidden states | Average correlation |
|---|---|
| 2 Hidden states | 0.12 |
| 4 Hidden states | 0.17 |
Confusion matrix between true cluster and predicted cluster results.
| | | | |
|---|---|---|---|
| 199 | 49 | 1 | 25 |
| 5 | 441 | 2 | 0 |
| 0 | 0 | 221 | 0 |
| 0 | 0 | 0 | 312 |
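As a worked check on the confusion matrix, the overall agreement between true and predicted clusters is the sum of the diagonal divided by the total count. The short sketch below assumes that the diagonal entries correspond to matched clusters, which the extracted table does not label explicitly. The result does not depend on whether rows or columns hold the true clusters.

```python
# Counts copied from the confusion matrix above.
confusion = [
    [199, 49, 1, 25],
    [5, 441, 2, 0],
    [0, 0, 221, 0],
    [0, 0, 0, 312],
]

total = sum(sum(row) for row in confusion)                         # all trajectories
correct = sum(confusion[i][i] for i in range(len(confusion)))      # diagonal entries
accuracy = correct / total

print(total, correct, round(accuracy, 4))  # prints: 1255 1173 0.9347
```

Under that assumption, about 93.5% of the 1,255 validation trajectories fall on the diagonal, and most of the disagreement comes from the first row.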
The optimal number of hidden states for the 45 and Up data is 3, since it corresponds to the lowest average correlation across the feature profiles.
| Number of hidden states | Average correlation |
|---|---|
| 2 Hidden states | 0.59 |
| 4 Hidden states | 0.62 |
The Davies-Bouldin (DB), Silhouette and Dunn indices for the HRS data, using 3 hidden states.
| Number of clusters | DB | Silhouette | Dunn |
|---|---|---|---|
| 2-Cluster | 2.0529 | 0.4636 | 0.3893 |
| 4-Cluster | 2.1257 | 0.4412 | 0.2820 |
| 5-Cluster | 1.8561 | 0.3974 | 0.2614 |
| 6-Cluster | 2.6350 | 0.4321 | 0.1841 |
| 7-Cluster | 1.8339 | 0.4304 | 0.3497 |
| 8-Cluster | 2.6883 | 0.4506 | 0.2170 |
| 9-Cluster | 2.5222 | 0.4536 | 0.1544 |
| 10-Cluster | 2.2139 | 0.4226 | 0.2243 |
Figure 4. MDS for the HRS data.
Figure 5. The profile of the three clusters in the feature space for the HRS data.
Interpretation of the three clusters in the HRS data. Note that expressions such as “large” or “mostly” do not refer to the size of the effect on an individual, but rather to the size of the population that experiences the effect. Therefore “Large weight loss” means that a large portion of the people in the cluster experience weight loss. The interpretation of not-smoking is omitted because of lack of change.
| Order of disease onset | BMI before 1st disease | Weight change after 3rd disease | Smoking behaviour after 3rd disease |
|---|---|---|---|
| Heart disease, then stroke and then diabetes | Mostly overweight or obese before 1st disease | Weight gain after 3rd disease | Large increase in smoke quitting after 3rd disease |
| Diabetes and then heart disease and stroke almost at the same time | Mostly overweight or obese before 1st disease | Some weight loss after 3rd disease | Increase in smoke quitting after 3rd disease |
| Heart disease, then diabetes and much later stroke | Mostly overweight or obese before 1st disease | Large weight loss after 3rd disease | Increase in smoke quitting after 3rd disease |