Caitlin E Coombes (1), Xin Liu (2), Zachary B Abrams (3), Kevin R Coombes (4), Guy Brock (5)
1. The Ohio State University College of Medicine, 370 W 9th Ave, Columbus, OH 43210, USA. Electronic address: Caitlin.Coombes@osumc.edu.
2. Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Dr, Columbus, OH 43210, USA. Electronic address: liu.7302@buckeyemail.osu.edu.
3. Institute for Informatics, Washington University in St. Louis, 444 Forest Park Ave., St. Louis, MO 63108, USA. Electronic address: Zachary.Abrams@osumc.edu.
4. Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Dr, Columbus, OH 43210, USA. Electronic address: Kevin.Coombes@osumc.edu.
5. Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Dr, Columbus, OH 43210, USA. Electronic address: Guy.Brock@osumc.edu.
Abstract
INTRODUCTION: Clustering analyses in clinical contexts hold promise to improve the understanding of patient phenotype and disease course in chronic and acute clinical medicine. However, work remains to ensure that solutions are rigorous, valid, and reproducible. In this paper, we evaluate best practices for dissimilarity matrix calculation and clustering on mixed-type clinical data.

METHODS: We simulate clinical data to represent problems in clinical trials, cohort studies, and EHR data, including single-type datasets (binary, continuous, categorical) and 4 data mixtures. We test 5 single distance metrics (Jaccard, Hamming, Gower, Manhattan, Euclidean) and 3 mixed distance metrics (DAISY, Supersom, and Mercator) with 3 clustering algorithms: hierarchical clustering (HC), k-medoids, and self-organizing maps (SOM). We validate the results quantitatively and visually using the Adjusted Rand Index (ARI) and silhouette width (SW). We apply our best methods to two real-world data sets: (1) 21 features collected on 247 patients with chronic lymphocytic leukemia, and (2) 40 features collected on 6000 patients admitted to an intensive care unit.

RESULTS: HC outperformed k-medoids and SOM by ARI across data types. DAISY produced the highest mean ARI for all mixed-type mixtures except unbalanced mixtures dominated by continuous data. Compared to other methods, DAISY with HC uncovered superior, separable clusters in both real-world data sets.

DISCUSSION: Selecting an appropriate mixed-type metric allows the investigator to obtain optimal separation of patient clusters and make maximum use of their data. Superior metrics for mixed-type data handle multiple data types using multiple, type-focused distances. Better subclassification of disease opens avenues for targeted treatments, precision medicine, clinical decision support, and improved patient outcomes.
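The workflow summarized in the abstract (compute a mixed-type dissimilarity matrix, cluster hierarchically, score against known labels with ARI) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a Gower-style dissimilarity as a stand-in for a type-focused mixed metric, average linkage as one possible HC variant, and a hypothetical six-patient toy table. The function names (`gower_matrix`, `average_linkage`, `adjusted_rand_index`) and the data are assumptions introduced only for illustration.

```python
from itertools import combinations
from math import comb

def gower_matrix(rows, kinds):
    """Gower-style dissimilarity: range-scaled absolute difference for
    'num' columns, simple mismatch (0/1) for 'cat' columns, averaged."""
    n = len(rows)
    ranges = []
    for j, kind in enumerate(kinds):
        if kind == "num":
            col = [r[j] for r in rows]
            ranges.append((max(col) - min(col)) or 1.0)  # avoid divide-by-zero
        else:
            ranges.append(None)
    d = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        total = 0.0
        for j, kind in enumerate(kinds):
            if kind == "num":
                total += abs(rows[a][j] - rows[b][j]) / ranges[j]
            else:
                total += 0.0 if rows[a][j] == rows[b][j] else 1.0
        d[a][b] = d[b][a] = total / len(kinds)
    return d

def average_linkage(d, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise dissimilarity until k remain."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
            dist = sum(d[i][j] for i, j in pairs) / len(pairs)
            if best is None or dist < best[0]:
                best = (dist, x, y)
        _, x, y = best
        clusters[x] += clusters[y]
        del clusters[y]
    labels = [0] * len(d)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def adjusted_rand_index(truth, pred):
    """ARI computed from the contingency table of two labelings."""
    cells, row_tot, col_tot = {}, {}, {}
    for t, p in zip(truth, pred):
        cells[(t, p)] = cells.get((t, p), 0) + 1
        row_tot[t] = row_tot.get(t, 0) + 1
        col_tot[p] = col_tot.get(p, 0) + 1
    sum_ij = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in row_tot.values())
    sum_b = sum(comb(c, 2) for c in col_tot.values())
    expected = sum_a * sum_b / comb(len(truth), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Hypothetical toy data: (age, sex, disease stage) with known groups.
patients = [
    (34, "F", "early"), (38, "F", "early"), (41, "F", "early"),
    (67, "M", "late"), (72, "M", "late"), (70, "M", "late"),
]
truth = [0, 0, 0, 1, 1, 1]

D = gower_matrix(patients, ["num", "cat", "cat"])
labels = average_linkage(D, k=2)
ari = adjusted_rand_index(truth, labels)
print(labels, ari)
```

Silhouette width is omitted to keep the sketch short. For data of realistic size, an established implementation (for example, the `daisy` function in the R `cluster` package, or SciPy's hierarchical clustering routines) would be a more appropriate choice than this quadratic-memory, pure-Python version.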