Simulation-derived best practices for clustering clinical data.

Caitlin E Coombes, Xin Liu, Zachary B Abrams, Kevin R Coombes, Guy Brock.

Abstract

INTRODUCTION: Clustering analyses in clinical contexts hold promise to improve the understanding of patient phenotype and disease course in chronic and acute clinical medicine. However, work remains to ensure that solutions are rigorous, valid, and reproducible. In this paper, we evaluate best practices for dissimilarity matrix calculation and clustering on mixed-type, clinical data.
METHODS: We simulate clinical data to represent problems in clinical trials, cohort studies, and EHR data, including single-type datasets (binary, continuous, categorical) and 4 data mixtures. We test 5 single distance metrics (Jaccard, Hamming, Gower, Manhattan, Euclidean) and 3 mixed distance metrics (DAISY, Supersom, and Mercator) with 3 clustering algorithms: hierarchical clustering (HC), k-medoids, and self-organizing maps (SOM). We validate results quantitatively and visually using the Adjusted Rand Index (ARI) and silhouette width (SW). We apply our best methods to two real-world data sets: (1) 21 features collected on 247 patients with chronic lymphocytic leukemia, and (2) 40 features collected on 6000 patients admitted to an intensive care unit.
RESULTS: HC outperformed k-medoids and SOM by ARI across data types. DAISY produced the highest mean ARI for mixed data types for all mixtures except unbalanced mixtures dominated by continuous data. Compared to other methods, DAISY with HC uncovered superior, separable clusters in both real-world data sets.
DISCUSSION: Selecting an appropriate mixed-type metric allows the investigator to obtain optimal separation of patient clusters and get maximum use of their data. Superior metrics for mixed-type data handle multiple data types using multiple, type-focused distances. Better subclassification of disease opens avenues for targeted treatments, precision medicine, clinical decision support, and improved patient outcomes.
Copyright © 2021 Elsevier Inc. All rights reserved.
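
The workflow summarized in the abstract (a mixed-type dissimilarity, hierarchical clustering, then validation by ARI and silhouette width) can be illustrated with a small dependency-free Python sketch. This is not the authors' code. It assumes a simplified Gower-style dissimilarity (range-scaled absolute difference for numeric features, 0/1 mismatch for categorical ones), which is the kind of coefficient that `daisy` in R's `cluster` package computes for mixed data, and it assumes average linkage for the hierarchical step. The six toy patients, the feature names in the comments, and all function names are hypothetical.

```python
from collections import Counter
from math import comb

def gower_matrix(rows, kinds):
    """Mean per-feature distance; kinds[j] is "num" or "cat"."""
    n = len(rows)
    spans = [(max(r[j] for r in rows) - min(r[j] for r in rows)) or 1.0
             if kind == "num" else None for j, kind in enumerate(kinds)]
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j, kind in enumerate(kinds):
                if kind == "num":   # range-scaled absolute difference
                    total += abs(rows[a][j] - rows[b][j]) / spans[j]
                else:               # simple mismatch for categories
                    total += 0.0 if rows[a][j] == rows[b][j] else 1.0
            d[a][b] = d[b][a] = total / len(kinds)
    return d

def average_linkage(d, k):
    """Naive agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(d))]
    def link(c1, c2):
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))
    while len(clusters) > k:
        p, q = min(((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
                   key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(q)    # q > p, so index p stays valid
        clusters[p].extend(merged)
    labels = [0] * len(d)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels

def adjusted_rand_index(truth, pred):
    """ARI from pair counts in the contingency table."""
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    a = sum(comb(c, 2) for c in Counter(truth).values())
    b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = a * b / comb(len(truth), 2)
    max_index = (a + b) / 2
    if max_index == expected:       # degenerate case, e.g. one cluster each
        return 1.0
    return (same_both - expected) / (max_index - expected)

def mean_silhouette(d, labels):
    """Average silhouette width computed from a dissimilarity matrix."""
    scores = []
    for i, li in enumerate(labels):
        own = [d[i][j] for j, lj in enumerate(labels) if lj == li and j != i]
        if not own:                 # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d[i][j] for j, lj in enumerate(labels) if lj == other)
                / labels.count(other) for other in set(labels) - {li})
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy cohort: (age, sex, smoking status)
rows = [(34, "F", "never"), (38, "F", "never"), (41, "F", "former"),
        (67, "M", "current"), (71, "M", "current"), (74, "M", "former")]
kinds = ["num", "cat", "cat"]
truth = [0, 0, 0, 1, 1, 1]          # simulated "true" group membership

dist = gower_matrix(rows, kinds)
labels = average_linkage(dist, k=2)
ari = adjusted_rand_index(truth, labels)
sw = mean_silhouette(dist, labels)
print("labels:", labels, "ARI:", ari, "mean SW:", round(sw, 3))
```

On this toy cohort the two groups are well separated, so the clustering recovers the simulated labels exactly. For real data, established implementations (for example `daisy` in R, or `scipy.cluster.hierarchy` and `sklearn.metrics` in Python) are preferable to these naive loops.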

Keywords:  Clinical informatics; Clinical trial; Clustering; Electronic health record; Unsupervised machine learning

Year: 2021; PMID: 33862229; PMCID: PMC9017600; DOI: 10.1016/j.jbi.2021.103788

Source DB: PubMed; Journal: J Biomed Inform; ISSN: 1532-0464; Impact factor: 8.000

