Peter Kent, Rikke K Jensen, Alice Kongsted.
Abstract
BACKGROUND: There are various methodological approaches to identifying clinically important subgroups. One method is to identify clusters of characteristics that differentiate people in cross-sectional and/or longitudinal data using Cluster Analysis (CA) or Latent Class Analysis (LCA). There is a scarcity of head-to-head comparisons that can inform the choice of which clustering method might be suitable for particular clinical datasets and research questions. Therefore, the aim of this study was to perform a head-to-head comparison of three commonly available methods (SPSS TwoStep CA, Latent Gold LCA and SNOB LCA).
Year: 2014 PMID: 25272975 PMCID: PMC4192340 DOI: 10.1186/1471-2288-14-113
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Characteristics of real datasets
| Dataset | Data type | n | Variables |
|---|---|---|---|
| MRI1 dataset | Dichotomous, cross-sectional data | 2,060 disc levels | Disc signal intensity, loss of disc height, disc high intensity zone, location of high intensity zone, type of disc herniation, location of disc herniation, nucleus pulposus shape, annular tear anterior, annular tear posterior, annular tear right, annular tear left, location of nerve root compression, nerve root compression, anterolisthesis, retrolisthesis, top endplate defect, bottom endplate defect, Modic changes top endplate, Modic changes bottom endplate, facet joint degeneration, facet joint asymmetry, central stenosis, foraminal stenosis. |
| MRI2 dataset | Dichotomous, cross-sectional data | 3,155 disc levels | Disc signal intensity, disc height, disc high intensity zone, disc contour, type of disc herniation, disc herniation signal intensity, anterolisthesis, retrolisthesis, type of endplate changes top, type of endplate changes bottom, size of endplate changes top, size of endplate changes bottom, osteophytes top, osteophytes bottom, endplate defect top, endplate defect bottom, endplate irregularity top, endplate irregularity bottom. |
| MRI3 dataset | Dichotomous, cross-sectional data | 20,810 disc levels | Disc bulge, disc degeneration, disc herniation, disc high intensity zone, Modic changes Type 1, Modic changes Type 2, nerve root compression, Scheuermann's disease, spondylolisthesis, facet joint degeneration, osteoarthritis, central spinal stenosis, scoliosis, red flag condition (cancer, fracture, infection). |
| SMS dataset | Interval, longitudinal repeated measures data | 1,121 people | Pain intensity (0 to 10) measured once a week for 52 weeks. |
| Clinical dataset | Mixed (dichotomous, ordinal, interval), cross-sectional data | 543 people | |
Characteristics of artificial datasets
| Dataset | No of subgroups | Data type | Subgroup scoring | Subgroup n |
|---|---|---|---|---|
| A1 | 3 | Interval and dichotomous | Discrete scoring bands | 333, 333, 334 |
| A2 | 3 | Interval | Overlapping scoring bands | 333, 333, 334 |
| A3 | 6 | Interval and dichotomous | Overlapping scoring bands with two distinct subgroups on each variable | 166, 166, 166, 164, 168, 170 |
| A4 | 3 | Interval | Overlapping scoring bands plus 10 ‘noise’ variables that do not discriminate subgroups | 333, 333, 334 |
Figure 1. Dataset A1 (n = 1000) - containing 3 subgroups, whose distinguishing features do not overlap, with characteristics scored on a mixture of continuous and dichotomous variables.
Figure 2. Dataset A2 (n = 1000) - containing 3 subgroups, whose distinguishing features do overlap, with all characteristics scored on continuous variables.
Figure 3. Dataset A3 (n = 1000) - containing 6 subgroups, whose distinguishing features do overlap, with characteristics scored on a mixture of continuous and dichotomous variables.
Figure 4. Dataset A4 (n = 1000) - containing 3 subgroups, whose distinguishing features do overlap, with all characteristics scored on continuous variables. Contains 10 ‘pure noise’ non-discriminatory variables.
Figure 5. Illustration of classification overlap of subgroups.
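To make the idea of "discrete scoring bands" in dataset A1 concrete, the following is a minimal sketch of how such an artificial dataset could be generated. Only the subgroup sizes (333, 333, 334) come from the table above. The band limits, the single interval variable, the dichotomous feature pattern, and the function name `make_banded_dataset` are all hypothetical and are not taken from the study.

```python
import random


def make_banded_dataset(sizes=(333, 333, 334), seed=0):
    """Return (rows, labels) for subgroups with non-overlapping scoring bands."""
    rng = random.Random(seed)
    # Hypothetical bands on a 0-10 interval variable, one band per subgroup.
    bands = [(0.0, 3.0), (3.5, 6.5), (7.0, 10.0)]
    # Hypothetical dichotomous feature: absent in subgroup 0, present in 1 and 2.
    has_feature = [0, 1, 1]
    rows, labels = [], []
    for group, n in enumerate(sizes):
        low, high = bands[group]
        for _ in range(n):
            rows.append((rng.uniform(low, high), has_feature[group]))
            labels.append(group)
    return rows, labels


rows, labels = make_banded_dataset()
print(len(rows), [labels.count(g) for g in range(3)])
```

Because the bands do not touch, any reasonable clustering method should separate these subgroups perfectly. Widening the bands until they overlap would give a dataset closer in spirit to A2.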
Classification performance with real datasets
| | TwoStep | Latent Gold | SNOB |
|---|---|---|---|
| Number of subgroups detected | | | |
| MRI1 dataset | 2 | 7 | 10 |
| MRI2 dataset | 3 | 11 | 15 |
| MRI3 dataset | 2 | 6 | 7 |
| SMS dataset | 2 | 10 | 37 |
| Clinical dataset | Not available | 8 | 9 |
| Classification certainty | | | |
| MRI1 dataset | Not available | 91.2% (SD11.9%) | 91.5% (SD11.6%) |
| MRI2 dataset | Not available | 98.9% (SD3.9%) | 97.1% (SD6.6%) |
| MRI3 dataset | Not available | 85.7% (SD19.5%) | 91.0% (SD12.7%) |
| SMS dataset | Not available | 96.5% (SD8.8%) | 98.2% (SD4.7%) |
| Clinical dataset | Not available | 91.4% (SD12.9%) | 89.9% (SD13.5%) |
| Reproducibility | | | |
| Number of subgroups | 100% agreement | 100% agreement (with fixed seed point) | 100% agreement |
| Classification stability (reproducibility of individual disc-levels or people being classified into each subgroup) | 100% agreement | 100% agreement (with fixed seed point) | 100% agreement |
| Classification certainty | Not available | 100% agreement (with fixed seed point) | 100% agreement |
Figure 6. Classification disagreement of individuals (disc levels or patients).
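The percentage figures with SDs in the table above can be read as a summary of how confidently individuals were assigned to subgroups. As a rough sketch, and assuming that certainty is summarized as the mean and standard deviation of each individual's highest posterior membership probability (an assumption, since the exact definition is not given here), the calculation might look like this. The posterior values and the function name `certainty_summary` are hypothetical, and the population SD is an arbitrary choice.

```python
from statistics import mean, pstdev


def certainty_summary(posteriors):
    """Mean and SD of each individual's highest class-membership probability."""
    top = [max(p) for p in posteriors]
    # Population SD is used here; the source does not say which form was reported.
    return mean(top), pstdev(top)


# Hypothetical posterior probabilities: four individuals, three subgroups.
posteriors = [
    (0.98, 0.01, 0.01),
    (0.10, 0.85, 0.05),
    (0.30, 0.10, 0.60),
    (0.02, 0.01, 0.97),
]
avg, sd = certainty_summary(posteriors)
print(f"{avg:.1%} (SD {sd:.1%})")
```

A method that reports no posterior probabilities, as the table indicates for TwoStep, cannot provide this summary.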
Classification performance with artificial datasets
| Dataset | Subgroups detected (TwoStep) | Subgroups detected (Latent Gold) | Subgroups detected (SNOB) | Accuracy of classifying 1000 individuals into subgroups (TwoStep) | Accuracy (Latent Gold) | Accuracy (SNOB) |
|---|---|---|---|---|---|---|
| A1 (3 subgroups) | 3 | 3 | 3 | 100% | 100% | 100% |
| A2 (3 subgroups) | 3 | 3 | 3 | 99.9% | 99.8% | 99.9% |
| A3 (6 subgroups) | 6 | 7 | 6 | 98.7% | 100% | 98.4% |
| A4 (3 subgroups) | 3 | 3 | 3 | 99.4% | 99.2% | 99.4% |
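Accuracy figures such as those above require matching arbitrary cluster labels to the known subgroups before counting agreements. The sketch below shows one possible way to do that, by trying every one-to-one mapping between the two label sets and keeping the best. This is an illustrative assumption, not the procedure reported by the authors, and the function name `classification_accuracy` and the example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def classification_accuracy(true_labels, cluster_labels):
    """Best-case share of individuals whose cluster maps onto their true subgroup."""
    pairs = Counter(zip(true_labels, cluster_labels))
    true_ids = sorted(set(true_labels))
    cluster_ids = sorted(set(cluster_labels))
    # Map the smaller label set one-to-one into the larger one, so the
    # function also works when a method finds too many or too few clusters.
    if len(true_ids) <= len(cluster_ids):
        small, large = true_ids, cluster_ids
        key = lambda s, l: (s, l)
    else:
        small, large = cluster_ids, true_ids
        key = lambda s, l: (l, s)
    best = max(
        sum(pairs[key(s, l)] for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(true_labels)


# Hypothetical example: three true subgroups, one individual misclassified.
true = [0] * 333 + [1] * 333 + [2] * 334
clusters = ["b"] * 333 + ["c"] * 333 + ["a"] * 334
clusters[0] = "c"
acc = classification_accuracy(true, clusters)
print(f"{acc:.1%}")  # prints 99.9%
```

Exhaustive matching is only practical for a handful of subgroups, which is the case for datasets A1 to A4.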
Overall summary of three clustering techniques
| | TwoStep | Latent Gold | SNOB |
|---|---|---|---|
| Method | Distance-based, agglomerative hierarchical cluster analysis | Finite mixture modeling to probabilistically identify latent classes | Finite mixture modeling to probabilistically identify latent classes |
| Stopping rule to identify number of subgroups | Automated using either ‘Bayesian information criterion’ or ‘Akaike’s information criterion’ | Analyst choice using various criteria, including ‘Bayesian information criterion’, unexplained variance, Chi-square p-value | Automated using ‘Minimum message length’ principle |
| Suitable data types | Dichotomous and interval; ordinal data must be recoded as dichotomous or handled as if interval data | All types | All types |
| Report classification probability of individuals | No | Yes | Yes |
| Sensitivity to subgroups | Least | Middle | Most |
| Reproducibility | Very high | Very high | Very high |
| Accuracy | Very high | Very high | Very high |
| Cost | Most expensive | Less expensive | Free |
| Support | Extensive documentation, fee-based support available | Extensive documentation and some free support available | Some documentation but minimal support available |
| Interpretability of presentation of results | Results are presented numerically and graphically (charts of certainty of the subgroup structure, bar and pie charts of cluster frequencies, and charts displaying the importance of specific variables to subgroups) | Results are presented numerically and graphically (including a tri-plot displaying the relationships between subgroups) | Results are mostly numeric (although a tree diagram is produced showing the relationship between ‘mother’ and ‘daughter’ subgroups) |
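The ‘Bayesian information criterion’ named in the stopping-rule row can be illustrated with the generic formula BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of observations and L the maximized likelihood. The sketch below picks the candidate number of subgroups with the lowest BIC. It does not reproduce the internal procedure of TwoStep or Latent Gold, and the log-likelihoods, parameter counts, and function names are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Generic Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def pick_by_bic(candidates, n_obs):
    """candidates maps number of subgroups -> (log_likelihood, n_params)."""
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical fitted models with 2, 3 and 4 subgroups on 1000 individuals.
candidates = {2: (-5200.0, 9), 3: (-4900.0, 14), 4: (-4890.0, 19)}
best_k, scores = pick_by_bic(candidates, n_obs=1000)
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

In this made-up example the fourth subgroup improves the likelihood only slightly, so the penalty for its extra parameters outweighs the gain and the three-subgroup model is chosen.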