
Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

Serena G Liao, Yan Lin, Dongwan D Kang, Divay Chandra, Jessica Bon, Naftali Kaminski, Frank C Sciurba, George C Tseng.   

Abstract

BACKGROUND: In modern biomedical research on complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which preclude the application of most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.
RESULTS: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept, the "imputability measure" (IM), to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package, "phenomeImpute", is made publicly available.
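As a rough illustration of imputation by subjects (the idea behind KNN-S), the sketch below fills a missing entry with the average of that variable over the k most similar subjects. It is not the phenomeImpute implementation: it assumes all variables are continuous and on comparable scales, whereas the paper's methods handle mixed data types, and the names `distance`, `knn_impute_by_subject` and `toy` are hypothetical.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing value


def distance(a: Row, b: Row) -> Optional[float]:
    """Root-mean-square difference over variables observed in both subjects."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None  # no overlap, so the two subjects cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute_by_subject(data: List[Row], k: int = 2) -> List[List[float]]:
    """Fill each missing entry with the mean over the k nearest subjects."""
    n_vars = len(data[0])
    # Column means over observed values serve as a fallback (assumption).
    col_means = []
    for j in range(n_vars):
        observed = [row[j] for row in data if row[j] is not None]
        col_means.append(sum(observed) / len(observed) if observed else 0.0)

    imputed = []
    for i, row in enumerate(data):
        new_row = []
        for j, value in enumerate(row):
            if value is not None:
                new_row.append(value)
                continue
            # Donors are other subjects that have variable j observed.
            donors = []
            for m, other in enumerate(data):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d is not None:
                    donors.append((d, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            new_row.append(sum(nearest) / len(nearest) if nearest else col_means[j])
        imputed.append(new_row)
    return imputed


# Rows are subjects, columns are variables.
toy = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [8.0, 9.0, 10.0],
]
completed = knn_impute_by_subject(toy, k=2)
print(completed[1][2])
```

In the toy matrix, the second subject borrows its third variable from the two subjects closest to it, and the distant fourth subject is ignored. Imputation by variables (KNN-V) would apply the same idea to the transposed matrix.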
CONCLUSIONS: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest (missForest) were among the top performers, although no method universally performed best. Imputing missing values with low imputability measures greatly increased imputation errors and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the authors' publication website.
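The "second layer of missingness" idea behind the STS scheme can be sketched as follows: hide a random subset of the observed entries, impute with each candidate method, score each method on the hidden entries, and keep the method with the lowest average error. This is a minimal sketch under stated assumptions, not the authors' procedure. It uses only continuous data, plain RMSE as the error measure, and two simple stand-in candidates (`mean_impute` and `nearest_subject_impute`) in place of the methods compared in the paper. All function names and the `mask_rate`, `repeats` and `seed` parameters are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Optional, Tuple

Matrix = List[List[Optional[float]]]  # None marks a missing value
Imputer = Callable[[Matrix], List[List[float]]]


def column_means(data: Matrix) -> List[float]:
    """Mean of the observed values in each column (0.0 if none observed)."""
    means = []
    for j in range(len(data[0])):
        seen = [row[j] for row in data if row[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return means


def mean_impute(data: Matrix) -> List[List[float]]:
    means = column_means(data)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]


def nearest_subject_impute(data: Matrix) -> List[List[float]]:
    """Copy each missing value from the closest subject that has it observed."""
    means = column_means(data)
    out = []
    for i, row in enumerate(data):
        new_row = []
        for j, v in enumerate(row):
            if v is not None:
                new_row.append(v)
                continue
            best_d, best_v = math.inf, means[j]  # column mean as fallback
            for m, other in enumerate(data):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared and sum(shared) / len(shared) < best_d:
                    best_d, best_v = sum(shared) / len(shared), other[j]
            new_row.append(best_v)
        out.append(new_row)
    return out


def self_training_select(data: Matrix, methods: Dict[str, Imputer],
                         mask_rate: float = 0.2, repeats: int = 20,
                         seed: int = 0) -> Tuple[str, Dict[str, float]]:
    """Pick the method with the lowest average RMSE on re-hidden entries."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(data)
                for j, v in enumerate(row) if v is not None]
    n_mask = max(1, int(mask_rate * len(observed)))
    totals = {name: 0.0 for name in methods}
    for _ in range(repeats):
        hidden = rng.sample(observed, n_mask)
        masked = [list(row) for row in data]
        for i, j in hidden:
            masked[i][j] = None  # second layer of missingness
        for name, impute in methods.items():
            filled = impute(masked)
            sq = sum((filled[i][j] - data[i][j]) ** 2 for i, j in hidden)
            totals[name] += math.sqrt(sq / n_mask)
    errors = {name: total / repeats for name, total in totals.items()}
    return min(errors, key=errors.get), errors


toy = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, None, 3.1, 4.1],
    [0.9, 1.9, 2.9, None],
    [9.0, 10.0, 11.0, 12.0],
    [9.1, 10.1, None, 12.1],
    [8.9, 9.9, 10.9, 11.9],
]
methods = {"mean": mean_impute, "nearest_subject": nearest_subject_impute}
best, errors = self_training_select(toy, methods)
print(best, errors)
```

The selected method would then be applied to the original matrix with its real missing values. Fixing the seed makes the selection reproducible, and averaging over several masks reduces the chance that one unlucky mask decides the outcome.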

Year:  2014        PMID: 25371041      PMCID: PMC4228077          DOI: 10.1186/s12859-014-0346-6

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


