
MissForest--non-parametric missing value imputation for mixed-type data.

Daniel J Stekhoven, Peter Bühlmann.

Abstract

MOTIVATION: Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used to analyze such large-scale data often depend on a complete dataset. Missing value imputation offers a solution to this problem. However, most available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately, so these methods ignore possible relations between variable types. We propose a non-parametric method that can cope with different types of variables simultaneously.
RESULTS: We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple datasets from a diverse selection of biological fields, with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets that include different types of variables. In our comparative study, missForest outperforms other imputation methods, especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
AVAILABILITY: The package missForest is freely available from http://stat.ethz.ch/CRAN/.
CONTACT: stekhoven@stat.math.ethz.ch; buhlmann@stat.math.ethz.ch
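
The iterative random-forest scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the missForest R package: it assumes scikit-learn is available, handles continuous variables only, and runs a fixed number of passes instead of a convergence check. The helper name `rf_impute`, its parameters, and the synthetic data are hypothetical and chosen only for illustration. Missing entries are first filled with column means, then each column is repeatedly re-predicted from the others by a forest trained on the rows where that column was observed. The forest's out-of-bag predictions give an error estimate without a separate test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_impute(X, n_iter=3, n_trees=50, seed=0):
    """Iteratively impute NaNs in a continuous matrix with random forests.

    Hypothetical helper: continuous columns only, fixed number of passes.
    Returns the imputed matrix and a per-column out-of-bag MSE estimate.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: fill each column with the mean of its observed values.
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)
    oob_mse = {}
    # Visit columns from fewest to most missing values.
    order = np.argsort(missing.sum(axis=0))
    for _ in range(n_iter):
        for j in order:
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            others = np.delete(X_imp, j, axis=1)
            rf = RandomForestRegressor(
                n_estimators=n_trees, oob_score=True, random_state=seed
            )
            # Train on rows where column j was observed.
            rf.fit(others[~miss_j], X_imp[~miss_j, j])
            # Out-of-bag error on the observed values of column j.
            resid = rf.oob_prediction_ - X_imp[~miss_j, j]
            oob_mse[int(j)] = float(np.mean(resid ** 2))
            # Overwrite only the entries that were originally missing.
            X_imp[miss_j, j] = rf.predict(others[miss_j])
    return X_imp, oob_mse


# Synthetic data (an assumption for this sketch) with a non-linear
# relation and an interaction between the columns.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = np.sin(2 * x0) + 0.1 * rng.normal(size=n)
x2 = x0 * x1 + 0.1 * rng.normal(size=n)
X_true = np.column_stack([x0, x1, x2])

# Artificially remove 20% of the entries completely at random.
mask = rng.random(X_true.shape) < 0.2
X_miss = X_true.copy()
X_miss[mask] = np.nan

X_imp, oob = rf_impute(X_miss)
true_mse = float(np.mean((X_imp[mask] - X_true[mask]) ** 2))
print("per-column OOB MSE:", oob)
print("MSE on the removed entries:", true_mse)
```

Because the removed entries are known here, the out-of-bag estimate can be compared with the actual error on those entries. With real data only the out-of-bag estimate is available.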

Year:  2011        PMID: 22039212     DOI: 10.1093/bioinformatics/btr597

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  635 in total

1.  Exposure to disinfection byproducts and risk of type 2 diabetes: a nested case-control study in the HUNT and Lifelines cohorts.

Authors:  Stephanie Gängler; Melanie Waldenberger; Anna Artati; Jerzy Adamski; Jurjen N van Bolhuis; Elin Pettersen Sørgjerd; Jana van Vliet-Ostaptchouk; Konstantinos C Makris
Journal:  Metabolomics       Date:  2019-04-08       Impact factor: 4.290

2.  Measuring Teen Dating Violence Perpetration: A Comparison of Cumulative and Single Assessment Procedures.

Authors:  Alison Krauss; Ernest N Jouriles; Renee McDonald; David Rosenfield
Journal:  Psychol Violence       Date:  2019-11-07

3.  Epigenome-wide DNA methylation in placentas from preterm infants: association with maternal socioeconomic status.

Authors:  Hudson P Santos; Arjun Bhattacharya; Elizabeth M Martin; Kezia Addo; Matt Psioda; Lisa Smeester; Robert M Joseph; Stephen R Hooper; Jean A Frazier; Karl C Kuban; T Michael O'Shea; Rebecca C Fry
Journal:  Epigenetics       Date:  2019-05-21       Impact factor: 4.528

4.  Phylogenetic correlates of extinction risk in mammals: species in older lineages are not at greater risk.

Authors:  Luis Darcy Verde Arregoitia; Simon P Blomberg; Diana O Fisher
Journal:  Proc Biol Sci       Date:  2013-07-03       Impact factor: 5.349

5.  Identification of Clinically Meaningful Plasma Transfusion Subgroups Using Unsupervised Random Forest Clustering.

Authors:  Che Ngufor; Matthew A Warner; Dennis H Murphree; Hongfang Liu; Rickey Carter; Curtis B Storlie; Daryl J Kor
Journal:  AMIA Annu Symp Proc       Date:  2018-04-16

6.  Sex Differences in Mortality Based on United Network for Organ Sharing Status While Awaiting Heart Transplantation.

Authors:  Eileen M Hsich; Eugene H Blackstone; Lucy Thuita; Dennis M McNamara; Joseph G Rogers; Hemant Ishwaran; Jesse D Schold
Journal:  Circ Heart Fail       Date:  2017-06       Impact factor: 8.790

7.  Postoperative bleeding risk prediction for patients undergoing colorectal surgery.

Authors:  David Chen; Naveed Afzal; Sunghwan Sohn; Elizabeth B Habermann; James M Naessens; David W Larson; Hongfang Liu
Journal:  Surgery       Date:  2018-07-20       Impact factor: 3.982

8.  Imputation Strategy for Reliable Regional MRI Morphological Measurements.

Authors:  Shaina Sta Cruz; Ivo D Dinov; Megan M Herting; Clio González-Zacarías; Hosung Kim; Arthur W Toga; Farshid Sepehrband
Journal:  Neuroinformatics       Date:  2020-01

9.  An Algorithm for Creating Virtual Controls Using Integrated and Harmonized Longitudinal Data.

Authors:  William B Hansen; Shyh-Huei Chen; Santiago Saldana; Edward H Ip
Journal:  Eval Health Prof       Date:  2018-05-03       Impact factor: 2.651

10.  HIV messaging on Twitter: an analysis of current practice and data-driven recommendations.

Authors:  Sophie Lohmann; Benjamin X White; Zhen Zuo; Man-Pui Sally Chan; Alex Morales; Bo Li; Chengxiang Zhai; Dolores Albarracín
Journal:  AIDS       Date:  2018-11-28       Impact factor: 4.177
