
Outlier detection for questionnaire data in biobanks.

Rieko Sakurai, Masao Ueki, Satoshi Makino, Atsushi Hozawa, Shinichi Kuriyama, Takako Takai-Igarashi, Kengo Kinoshita, Masayuki Yamamoto, Gen Tamiya.

Abstract

BACKGROUND: Biobanks increasingly collect, process and store omics data alongside more conventional epidemiologic information, which necessitates considerable effort in data cleaning. An efficient outlier detection method that reduces manual labour is highly desirable.
METHOD: We develop an unsupervised machine-learning method for outlier detection, namely kurPCA, that uses principal component analysis combined with kurtosis to ascertain the existence of outliers. In addition, we propose a novel regression adjustment approach to improve detection, namely the regression adjustment for data by systematic missing patterns (RAMP).
RESULT: Application to epidemiological record data in a large-scale biobank (Tohoku Medical Megabank Organization, Japan) shows that a combination of kurPCA and RAMP effectively detects known errors or inconsistent patterns.
CONCLUSIONS: The results of the simulation and the application confirm that our methods perform well. The proposed methods are useful for many practical analysis scenarios.
© The Author(s) 2019; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.
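
The METHOD paragraph describes kurPCA only at a high level: principal component analysis combined with kurtosis. The abstract gives no algorithmic detail, so the following is a minimal sketch of that general idea and not the authors' implementation. The function names, both thresholds, and the rule "flag rows with extreme scores on components whose scores are heavy-tailed" are illustrative assumptions. RAMP is not sketched here.

```python
import numpy as np


def excess_kurtosis(scores):
    """Column-wise excess kurtosis (about 0 for normally distributed data)."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return (z ** 4).mean(axis=0) - 3.0


def kurtosis_pca_flags(X, kurt_threshold=1.0, score_threshold=4.0):
    """Flag rows that are extreme on heavy-tailed principal components.

    Hypothetical illustration: both thresholds are arbitrary assumptions,
    and X is assumed to be complete with no constant columns.
    """
    X = np.asarray(X, dtype=float)
    # Standardize columns so that no variable dominates by scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA via SVD: the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt.T
    # Rescale each component's scores to unit variance.
    scores = scores / scores.std(axis=0)
    # A few extreme rows inflate the kurtosis of a component's scores.
    kurt = excess_kurtosis(scores)
    heavy = kurt > kurt_threshold
    # Flag a row if it is extreme on any heavy-tailed component.
    flags = (np.abs(scores[:, heavy]) > score_threshold).any(axis=1)
    return flags, kurt


# Synthetic demonstration: 500 records, 5 variables, 3 planted outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:3] += 10.0  # shift the first three records far from the bulk
flags, kurt = kurtosis_pca_flags(X)
print("flagged rows:", np.flatnonzero(flags))
```

Kurtosis serves here as a cheap indicator that a component's score distribution has a few extreme values. Real questionnaire data would additionally need handling of categorical items and missing values, which this sketch leaves out.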

Keywords:  Outlier detection; anomaly detection; kurtosis; principal component analysis; regression adjustment

Year:  2019        PMID: 30848787     DOI: 10.1093/ije/dyz012

Source DB:  PubMed          Journal:  Int J Epidemiol        ISSN: 0300-5771            Impact factor:   7.196


  2 in total

1.  Cross-Sectional Analysis of Impulse Indicator Saturation Method for Outlier Detection Estimated via Regularization Techniques with Application of COVID-19 Data.

Authors:  Sara Muhammadullah; Amena Urooj; Muhammad Hashim Mengal; Shahzad Ali Khan; Fereshteh Khalaj
Journal:  Comput Math Methods Med       Date:  2022-05-06       Impact factor: 2.809

2.  Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort.

Authors:  Justin H Davies; Sarah Ennis; Hang T T Phan; Florina Borca; David Cable; James Batchelor
Journal:  Sci Rep       Date:  2020-06-23       Impact factor: 4.379

