Clustering and variable selection in the presence of mixed variable types and missing data.

C B Storlie, S M Myers, S K Katusic, A L Weaver, R G Voigt, P E Croarkin, R E Stoeckel, J D Port.

Abstract

We consider the problem of model-based clustering in the presence of many correlated variables of mixed continuous and discrete types, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. The data set contains a modest number of patients (486) and many (55) test score variables, many of which are discrete valued and/or missing. The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. In simulations of problems of this type, the proposed approach compares very favorably with other methods. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
Copyright © 2018 John Wiley & Sons, Ltd.
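
As a rough illustration of two ingredients named in the abstract, the sketch below draws mixture weights from a truncated stick-breaking representation of a Dirichlet process and produces a discrete variable by thresholding a latent continuous one. It is not the authors' model: it omits variable selection, missing data, and correlation between variables, and the function names, the truncation level, the concentration parameter `alpha`, and the cutpoints are all assumptions chosen for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction."""
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # last component absorbs the leftover mass
    return weights


def discretize(latent, cutpoints):
    """Map a latent continuous value to an ordinal category (0, 1, 2, ...)."""
    return sum(latent > c for c in cutpoints)


def simulate(n, alpha=1.0, truncation=20, seed=0):
    """Simulate n rows of (cluster, continuous value, discrete value)."""
    rng = random.Random(seed)
    weights = stick_breaking_weights(alpha, truncation, rng)
    # Hypothetical cluster means; a real model would place priors on these.
    means = [rng.gauss(0.0, 3.0) for _ in range(truncation)]
    cutpoints = [-1.0, 1.0]  # assumed thresholds giving three categories
    rows = []
    for _ in range(n):
        k = rng.choices(range(truncation), weights=weights)[0]
        continuous = rng.gauss(means[k], 1.0)
        latent = rng.gauss(means[k], 1.0)  # unobserved in practice
        rows.append((k, continuous, discretize(latent, cutpoints)))
    return weights, rows


weights, rows = simulate(486)
```

With a small `alpha`, most of the mass falls on a few components, so the number of occupied clusters is determined by the data rather than fixed in advance; the truncation level only caps it.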

Keywords:  Dirichlet process; hierarchical Bayesian modeling; missing data; mixed variable types; model-based clustering; variable selection

Year:  2018        PMID: 29774571      PMCID: PMC6240391          DOI: 10.1002/sim.7697

Source DB:  PubMed          Journal:  Stat Med        ISSN: 0277-6715            Impact factor:   2.373


  10 in total

1.  MissForest--non-parametric missing value imputation for mixed-type data.

Authors:  Daniel J Stekhoven; Peter Bühlmann
Journal:  Bioinformatics       Date:  2011-10-28       Impact factor: 6.937

2.  Variable selection for model-based high-dimensional clustering and its application to microarray data.

Authors:  Sijian Wang; Ji Zhu
Journal:  Biometrics       Date:  2007-10-26       Impact factor: 2.571

3.  Variable selection in penalized model-based clustering via regularization on grouped parameters.

Authors:  Benhuai Xie; Wei Pan; Xiaotong Shen
Journal:  Biometrics       Date:  2007-12-20       Impact factor: 2.571

4.  Variable selection for clustering with Gaussian mixture models.

Authors:  Cathy Maugis; Gilles Celeux; Marie-Laure Martin-Magniette
Journal:  Biometrics       Date:  2009-02-04       Impact factor: 2.571

5.  Multivariate probit analysis: a neglected procedure in medical statistics.

Authors:  E Lesaffre; G Molenberghs
Journal:  Stat Med       Date:  1991-09       Impact factor: 2.373

6.  Bayesian Analysis of Multivariate Nominal Measures Using Multivariate Multinomial Probit Models.

Authors:  Xiao Zhang; W John Boscardin; Thomas R Belin
Journal:  Comput Stat Data Anal       Date:  2008-03-15       Impact factor: 1.681

7.  A framework for feature selection in clustering.

Authors:  Daniela M Witten; Robert Tibshirani
Journal:  J Am Stat Assoc       Date:  2010-06-01       Impact factor: 5.033

8.  Simplex Factor Models for Multivariate Unordered Categorical Data.

Authors:  Anirban Bhattacharya; David B Dunson
Journal:  J Am Stat Assoc       Date:  2012-03-01       Impact factor: 5.033

9.  Nonparametric Bayes Conditional Distribution Modeling With Variable Selection.

Authors:  Yeonseung Chung; David B Dunson
Journal:  J Am Stat Assoc       Date:  2009-12-01       Impact factor: 5.033

10.  Variable selection in Bayesian smoothing spline ANOVA models: Application to deterministic computer codes.

Authors:  Brian J Reich; Curtis B Storlie; Howard D Bondell
Journal:  Technometrics       Date:  2009-05-01
  2 in total

1.  Determining County-Level Counterfactuals for Evaluation of Population Health Interventions: A Novel Application of K-Means Cluster Analysis.

Authors:  Kelly L Strutz; Zhehui Luo; Jennifer E Raffo; Cristian I Meghea; Peggy Vander Meulen; Lee Anne Roman
Journal:  Public Health Rep       Date:  2021-07-29       Impact factor: 3.117

2.  Phenotypes Determined by Cluster Analysis and Their Survival in the Prospective European Scleroderma Trials and Research Cohort of Patients With Systemic Sclerosis.

Authors:  Vincent Sobanski; Jonathan Giovannelli; Yannick Allanore; Gabriela Riemekasten; Paolo Airò; Serena Vettori; Franco Cozzi; Oliver Distler; Marco Matucci-Cerinic; Christopher Denton; David Launay; Eric Hachulla
Journal:  Arthritis Rheumatol       Date:  2019-08-12       Impact factor: 10.995
