Elsie Horne, Holly Tibble, Aziz Sheikh, Athanasios Tsanas.
Abstract
BACKGROUND: In the current era of personalized medicine, there is increasing interest in understanding the heterogeneity in disease populations. Cluster analysis is a method commonly used to identify subtypes in heterogeneous disease populations. The clinical data used in such applications are typically multimodal, which can make the application of traditional cluster analysis methods challenging.
Keywords: asthma; cluster analysis; data mining; machine learning; unsupervised machine learning
Year: 2020 PMID: 32463370 PMCID: PMC7290450 DOI: 10.2196/16452
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Schematic of the typical cluster analysis steps.
21]. Related methods include factor analysis for continuous data, multiple correspondence analysis (MCA) for categorical data [22], and multiple factor analysis for mixed-type data [23].
Figure 2. Flow of studies into review.
Initial considerations across the asthma studies we have included in this review (N=63).
| Method | Values, n (%)^a |
|---|---|
| | |
| Clinical intuition and understanding | 33 (52) |
| Avoid clinical redundancy | 15 (24) |
| Previous studies | 15 (24) |
| Easily measured in clinical practice | 8 (13) |
| | |
| Complete case analysis | 22 (35) |
| Features with >x%^b missing values removed | 14 (22) |
| Imputed | 11 (17) |
| Patients with >x%^b missing values removed | 5 (8) |
| No missing data present | 2 (3) |
| Clustering methods handle missing data | 1 (2) |
^a One study may use multiple methods; some studies may use no methods.
^b x > 0.
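The missing-data strategies listed in the table above can be sketched in a few lines of Python. This is only an illustration, not the procedure of any reviewed study: the toy records, the feature names, and the 50% threshold standing in for x are all assumptions.

```python
# Toy patient records; None marks a missing value (all values are made up).
records = [
    {"fev1": 82.0, "age": 34, "ige": 120.0},
    {"fev1": None, "age": 51, "ige": None},
    {"fev1": 74.5, "age": 47, "ige": None},
    {"fev1": 91.0, "age": 29, "ige": None},
]

def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def missing_fraction_by_feature(rows):
    """Fraction of rows in which each feature is missing."""
    return {f: sum(r[f] is None for r in rows) / len(rows) for f in rows[0]}

def drop_sparse_features(rows, threshold):
    """Remove features whose missing fraction exceeds `threshold`."""
    frac = missing_fraction_by_feature(rows)
    keep = [f for f in rows[0] if frac[f] <= threshold]
    return [{f: r[f] for f in keep} for r in rows]

def drop_sparse_patients(rows, threshold):
    """Remove patients whose fraction of missing features exceeds `threshold`."""
    return [r for r in rows
            if sum(v is None for v in r.values()) / len(r) <= threshold]

cc = complete_cases(records)                            # 1 patient remains
reduced = drop_sparse_features(records, threshold=0.5)  # "ige" is dropped
kept = drop_sparse_patients(records, threshold=0.5)     # 3 patients remain
print(len(cc), list(reduced[0]), len(kept))
```

On this toy data, complete case analysis discards three of four patients, whereas removing the sparsely recorded feature first keeps more of the cohort. That trade-off is the reason the order of these steps matters.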
Figure 3. Number of patients versus final number of cluster features. The line marks a number of patients equal to 70 times the number of features.
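The reference line in Figure 3 amounts to a one-line calculation. The sketch below uses hypothetical function names and made-up study sizes; only the factor of 70 comes from the caption.

```python
def patients_on_reference_line(n_features, factor=70):
    """Number of patients on the Figure 3 reference line for n_features."""
    return factor * n_features

def on_or_above_line(n_patients, n_features, factor=70):
    """True if a study lies on or above the reference line."""
    return n_patients >= patients_on_reference_line(n_features, factor)

print(patients_on_reference_line(8))  # 560
print(on_or_above_line(300, 8))       # False
print(on_or_above_line(600, 8))       # True
```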
Breakdown of methods used by studies applying hierarchical clustering with Ward's linkage (N=23).
| Data type, dissimilarity, and scaling of continuous features | Categorical features encoded as binary? | Value, n (%) |
|---|---|---|
| | | |
| Not detailed | N/A^a | 1 (4) |
| | | |
| Scaled but method unspecified | Yes | 1 (4) |
| Scaled to lie in the interval 0 to 1 | Yes | 1 (4) |
| z-scores | Yes | 1 (4) |
| Not detailed | Yes | 3 (13) |
| | | |
| z-scores | Yes | 2 (9) |
| | | |
| Gower standardization | No | 3 (13) |
| Scaled but method unspecified | No | 1 (4) |
| | | |
| Not detailed | No | 1 (4) |
^a N/A: not applicable (irrelevant for continuous features).
^b Computing the Gower coefficient normalizes the distance between samples on each feature by dividing by that feature's range. It is therefore unnecessary to normalize continuous features before computing the Gower coefficient.
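The point made in footnote b can be checked with a small sketch of the Gower dissimilarity for mixed-type data. This is a simplified version that assumes equal feature weights and no missing values. The patient data, column meanings, and function names are hypothetical.

```python
def feature_ranges(rows, is_categorical):
    """Range (max - min) of each continuous column; None for categorical ones."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(None)
        else:
            col = [r[j] for r in rows]
            ranges.append(max(col) - min(col))
    return ranges

def gower_dissimilarity(a, b, ranges, is_categorical):
    """Mean per-feature dissimilarity between records a and b.

    Continuous features contribute |a - b| / range; categorical
    features contribute 0 if equal and 1 otherwise.
    """
    total = 0.0
    for x, y, rng, cat in zip(a, b, ranges, is_categorical):
        if cat:
            total += 0.0 if x == y else 1.0
        elif rng > 0:
            total += abs(x - y) / rng
        # a continuous feature with zero range contributes nothing
    return total / len(a)

# Columns: age (years), FEV1 (% predicted), atopy (yes/no). Values are made up.
patients = [(30, 90.0, "yes"), (50, 70.0, "no"), (40, 50.0, "yes")]
is_cat = [False, False, True]

rng = feature_ranges(patients, is_cat)
d01 = gower_dissimilarity(patients[0], patients[1], rng, is_cat)
d02 = gower_dissimilarity(patients[0], patients[2], rng, is_cat)

# Re-express age in months: the dissimilarity should not change.
in_months = [(age * 12, fev1, atopy) for age, fev1, atopy in patients]
rng_m = feature_ranges(in_months, is_cat)
d01_m = gower_dissimilarity(in_months[0], in_months[1], rng_m, is_cat)
print(round(d01, 3), round(d02, 3), round(d01_m, 3))
```

Because each continuous difference is divided by the feature range, changing the unit of a feature leaves the dissimilarity unchanged, which is why prior scaling is redundant here.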
Breakdown of methods used by studies applying SPSS TwoStep (N=7).
| Data type, dissimilarity, and scaling of continuous features | Categorical features encoded as binary? | Value, n (%) |
|---|---|---|
| | | |
| No details | N/A^a | 1 (14) |
| | | |
| Scaled to lie in the interval 0 to 1 | Yes | 1 (14) |
| z-scores | No | 1 (14) |
| No details | Yes | 2 (29) |
| | | |
| Scaled but method unspecified | No | 1 (14) |
| No details | No | 1 (14) |
^a N/A: not applicable (irrelevant for continuous features).
Feature engineering methods used in the asthma studies included in this review.
| Method | Values, n (%)^a |
|---|---|
| | |
| Logarithmic transformation | 21 (33) |
| Box-Cox transformation | 1 (2) |
| Method not explained | 1 (2) |
| | |
| Factor analysis^b | 8 (13) |
| Principal component analysis^b | 5 (8) |
| Avoid collinearity | 3 (5) |
| Avoid multicollinearity | 3 (5) |
| Supervised learning methods | 2 (3) |
| Multiple correspondence analysis | 1 (2) |
| | |
| Principal component analysis | 4 (6) |
| Factor analysis | 1 (2) |
| Multiple correspondence analysis | 1 (2) |
^a As a percentage of all 63 studies.
^b These are not typically methods of feature selection but have been used in these studies.
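The two transformations named in the table above can be illustrated with a short sketch. It assumes a fixed, hand-picked Box-Cox parameter (in practice the parameter is usually estimated from the data) and a made-up right-skewed biomarker; the function names are hypothetical.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x.

    lam = 0 gives the natural logarithm; otherwise (x**lam - 1) / lam.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed measurements (made-up values).
raw = [10.0, 20.0, 40.0, 80.0, 160.0, 640.0]
logged = [box_cox(v, 0) for v in raw]       # logarithmic transformation
root_like = [box_cox(v, 0.5) for v in raw]  # Box-Cox with lambda = 0.5

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

On this toy sample the logarithm reduces the skewness considerably, which is the usual motivation for transforming features before distance-based clustering.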
Breakdown of methods used by studies applying k-means (N=22).
| Data type, dissimilarity, and scaling of continuous features | Categorical features encoded as binary? | Value, n (%) |
|---|---|---|
| | | |
| z-scores for one feature | N/A^a | 1 (5) |
| No details | N/A | 3 (14) |
| | | |
| No details | N/A | 1 (5) |
| | | |
| Scaled but method unspecified | No | 1 (5) |
| z-scores | Yes | 6 (27) |
| z-scores for one feature | No | 1 (5) |
| No details | Yes | 1 (5) |
| | | |
| z-scores | Yes | 1 (5) |
| No details | No | 1 (5) |
| | | |
| No details | No | 3 (14) |
| | | |
| z-scores | No | 1 (5) |
^a N/A: not applicable (irrelevant for continuous features).
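One preprocessing combination in the table above, z-scores for continuous features with categorical features encoded as binary, can be sketched together with a basic k-means loop. This is a minimal illustration under stated assumptions: the data are made up, the initial centroids are chosen by hand, and the implementation is plain Lloyd's algorithm with squared Euclidean distance, not the software used in any reviewed study.

```python
import statistics

def z_scores(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def encode_binary(values, positive="yes"):
    """Encode a two-level categorical feature as 1.0 / 0.0."""
    return [1.0 if v == positive else 0.0 for v in values]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical features for six patients (made-up values).
age = [25, 30, 35, 60, 65, 70]
fev1 = [95, 90, 92, 60, 65, 58]
atopy = ["yes", "yes", "no", "no", "no", "yes"]

points = list(zip(z_scores(age), z_scores(fev1), encode_binary(atopy)))
labels, centroids = k_means(points, [points[0], points[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The sketch shows the mechanics only. Treating a 0/1 indicator as a coordinate in Euclidean space gives it a weight that depends on how the continuous features were scaled, so the combination should be read as something the reviewed studies did rather than as a recommendation.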
Postprocessing methods used in the asthma studies included in this review.
| Method | Values, n (%)^a |
|---|---|
| | |
| Dendrogram | 27 (43) |
| Hierarchical clustering with Ward linkage | 19 (30) |
| Specify a maximum number of clusters^b | 8 (13) |
| Statistic(s) | 7 (11) |
| Silhouette plot or average silhouette width | 5 (8) |
| Bayesian information criterion | 4 (6) |
| Specify a minimum size of smallest cluster^b | 4 (6) |
| Previous studies | 3 (5) |
| Unclear | 3 (5) |
| Clinical interpretation | 2 (3) |
| Scree plot | 1 (2) |
| | |
| Repeated in random subset | 3 (5) |
| Leave-one-out cross-validation | 3 (5) |
| Bootstrap methods | 3 (5) |
| Unclear methods | 2 (3) |
| Train and test set | 1 (2) |
| | |
| Repeated in selected subset | 8 (13) |
| Repeated with different methods | 6 (10) |
| Repeated with different initial configurations | 5 (8) |
| Repeated in separate cohort | 4 (6) |
| Repeated with altered features | 3 (5) |
| Repeated at different time point | 3 (5) |
| Repeated with different software | 1 (2) |
^a Studies may have used more than 1 method.
^b These methods were not included when calculating the number of methods used to choose the number of clusters.
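The average silhouette width listed in the table above can be computed directly from its definition. The sketch below is a minimal, unoptimized version assuming Euclidean distance and a score of 0 for singleton clusters; the points and labelings are made up for illustration.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a: mean distance from point i to the other members of its cluster.
    b: smallest mean distance from point i to the members of another cluster.
    """
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [euclidean(p, q)
               for j, (q, other_lab) in enumerate(zip(points, labels))
               if other_lab == lab and j != i]
        if not own:  # singleton cluster
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(euclidean(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Two well-separated pairs of points (made-up coordinates).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])  # matches the structure
bad = average_silhouette_width(points, [0, 1, 0, 1])   # cuts across it
print(good > bad)  # True
```

When this statistic is used to choose the number of clusters, the clustering is typically repeated for several candidate values and the one with the largest average silhouette width is preferred.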