| Literature DB >> 36034597 |
Tyler J Loftus, Benjamin Shickel, Jeremy A Balch, Patrick J Tighe, Kenneth L Abbott, Brian Fazzone, Erik M Anderson, Jared Rozowsky, Tezcan Ozrazgat-Baslanti, Yuanfang Ren, Scott A Berceli, William R Hogan, Philip A Efron, J Randall Moorman, Parisa Rashidi, Gilbert R Upchurch, Azra Bihorac.
Abstract
Human pathophysiology is occasionally too complex for unaided hypothetico-deductive reasoning and the isolated application of additive or linear statistical methods. Clustering algorithms use input data patterns and distributions to form groups of similar patients or diseases that share distinct properties. Although clinicians frequently perform tasks that may be enhanced by clustering, few receive formal training, and clinician-centered literature on clustering is sparse. To add value to clinical care and research, optimal clustering practices require a thorough understanding of how to process and optimize data, select features, weigh the strengths and weaknesses of different clustering methods, select the optimal clustering method, and apply clustering methods to solve problems. These concepts and our suggestions for implementing them are described in this narrative review of published literature. All clustering methods share the weakness of finding potential clusters even when natural clusters do not exist, underscoring the importance of applying data-driven techniques as well as clinical and statistical expertise to clustering analyses. When applied properly, patient and disease phenotype clustering can reveal obscured associations that can help clinicians understand disease pathophysiology, predict treatment response, and identify patients for clinical trial enrollment.
Keywords: artificial intelligence; cluster; endotype; endotyping; machine learning
Year: 2022 PMID: 36034597 PMCID: PMC9411746 DOI: 10.3389/frai.2022.842306
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. Phenotype clustering in health care applies clustering algorithms to clinical data, biomarkers, or genomic data to form unique groupings that can elucidate pathophysiology, predict treatment response, or augment clinical trial enrollment.
Figure 2. Similarity of elements in clustering algorithms is inversely proportional to distance. This is often derived by applying the Pythagorean theorem to calculate Euclidean distance. We illustrate this approach in two-dimensional space, though similar calculations apply for data points of arbitrary dimensionality.
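To make the distance calculation concrete, here is a minimal Python sketch (the function name and the sample coordinates are our own illustrative choices, not drawn from the paper); it reduces to the Pythagorean theorem in two dimensions and generalizes to any number of features.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two points of arbitrary dimensionality.

    In two dimensions this is the Pythagorean theorem:
    d = sqrt((x1 - x2)^2 + (y1 - y2)^2).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical example: two patients described by two standardized features.
print(euclidean_distance([3.0, 4.0], [0.0, 0.0]))  # 5.0 (the 3-4-5 triangle)
```

Because the distance is computed over raw feature values, features on different scales should be standardized first so that no single feature dominates the metric.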
Table 1. Summary of clustering methods.
| Clustering method | Example algorithms | Description | Strengths | Weaknesses |
|---|---|---|---|---|
| Centroid-based or partitional clustering | k-means, k-medoids | Minimize the distance between points within a cluster while maximizing the distance between cluster centroids | Simple implementation and interpretation | Number of clusters must be assigned |
| Centroid-based variation: fuzzy clustering | Fuzzy c-means, rough or soft k-means | Points are assigned to one or more clusters based on membership coefficients representing similarity to other points in each cluster | Useful for datasets and applications with substantial overlap like image segmentation or genomic clustering | Number of clusters must be assigned |
| Hierarchical clustering | DIANA, AGNES | Generate a dendrogram using distance metrics and then cut the dendrogram to group its components | No need to pre-specify the number of clusters | Cumbersome for large datasets; sensitive to outliers |
| Distribution-based clustering | Gaussian mixture models, DBCLASD | Points are assigned to clusters with similar probability distributions for metrics like mean and variance | Flexible; adapts to inherent distributions of the data, if present | Tends to overfit noisy data; complex algorithms run slowly on large datasets |
| Density-based clustering | DBSCAN, Mean shift, OPTICS | Clusters are identified as the densest regions in a data space, separated from other clusters by low-density areas | Adapts to non-linear data; obviates spatial and shape constraints on the clusters; insensitive to outliers | Performs poorly with sparse data; sensitive to hyperparameters; complex algorithms run slowly on large datasets |
| Supervised or constraint-based clustering | Random forest, gradient boosting, deep learning | Certain properties of the clustering result are defined in advance by labels or constraints | Incorporates prior knowledge of biology; can generate a perfect decision boundary on training data | Greater risk of overfitting compared with unsupervised methods |
| Spectral or graph-based clustering | STING, CLIQUE | Use a standard (e.g., Gaussian) kernel to build a similarity graph, then partition points using the eigenvectors of the graph's similarity matrix | Effective for high-dimensional spectral data that contains substantial noise and outliers | Cumbersome for large graphs; interpretation requires understanding of vector spaces and linear transformations |
DIANA, DIvisive ANAlysis; AGNES, AGglomerative NESting; DBCLASD, Distribution-Based Clustering of LArge Spatial Databases; DBSCAN, density-based spatial clustering of applications with noise; OPTICS, ordering points to identify the clustering structure; STING, STatistical INformation Grid; CLIQUE, CLustering In QUEst.
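For readers who want to experiment with the methods in the table, the following is a minimal scikit-learn sketch on synthetic two-dimensional data; the dataset, the choice of three clusters, and hyperparameter values such as eps=0.3 are illustrative assumptions of ours, not recommendations from the review.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for standardized clinical features: three patient groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)  # scale features before distance-based clustering

# Centroid-based: k-means (the number of clusters must be assigned).
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: agglomerative clustering, cutting the dendrogram into 3 groups.
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Distribution-based: Gaussian mixture model; predict_proba returns the kind of
# soft membership coefficients that fuzzy clustering uses.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_labels = gmm.predict(X)
memberships = gmm.predict_proba(X)  # one membership probability per point per cluster

# Density-based: DBSCAN infers the number of clusters from density but is
# sensitive to its eps and min_samples hyperparameters; label -1 marks outliers.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Spectral/graph-based: partitions a similarity graph via its eigenvectors.
spec_labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                                 random_state=0).fit_predict(X)

for name, labels in [("k-means", kmeans_labels), ("hierarchical", hier_labels),
                     ("GMM", gmm_labels), ("DBSCAN", db_labels),
                     ("spectral", spec_labels)]:
    print(f"{name}: {len(set(labels) - {-1})} clusters")
```

Note that only DBSCAN discovers the number of clusters on its own here; the other calls encode our assumption of three groups (for hierarchical clustering, this corresponds to choosing where to cut the dendrogram).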
All clustering methods share the weakness of finding clusters even when natural clusters do not exist.
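One common safeguard, offered here as our own illustration rather than a procedure from the review, is to pair any clustering run with an internal validity index such as the silhouette score: k-means will happily partition uniform noise, but the score is markedly lower there than on genuinely separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Uniform noise: no natural clusters exist, yet k-means still returns a partition.
noise = rng.uniform(size=(300, 2))
noise_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(noise)

# Three well-separated synthetic groups for comparison.
centers = ([0.2, 0.2], [0.8, 0.2], [0.5, 0.8])
blobs = np.vstack([rng.normal(loc=c, scale=0.05, size=(100, 2)) for c in centers])
blob_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(blobs)

# Silhouette score ranges from -1 to 1; a markedly lower score on the noise
# than on the separated groups flags that the "clusters" may be algorithmic
# artifacts rather than structure in the data.
print(f"noise silhouette: {silhouette_score(noise, noise_labels):.2f}")
print(f"blobs silhouette: {silhouette_score(blobs, blob_labels):.2f}")
```

A single index is not definitive; as the abstract notes, clinical and statistical expertise must accompany any data-driven check.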