Automated feature selection of predictors in electronic medical records data.

Jessica Gronsbell, Jessica Minnier, Sheng Yu, Katherine Liao, Tianxi Cai.

Abstract

The use of Electronic Health Records (EHR) for translational research can be challenging due to the difficulty of extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold-standard labels curated via labor-intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold-standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions, and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, because gold-standard labeling is time-intensive, the training set is often too small to build a generalizable algorithm with the large number of candidate features extracted from the EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features, such as diagnosis codes and mentions of the disease in text fields, that are available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and the remaining covariates to identify the features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model.
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping, thereby harnessing the automated power of EHR data and improving efficiency.
© 2018 International Biometric Society.
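
The two-stage idea in the abstract (cluster a highly predictive surrogate without labels, then fit a sparse regression of the estimated outcome on the remaining features) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the simulated data, the single surrogate feature, the 1-D two-means clustering, the linear lasso fitted by coordinate descent, and every name and parameter below (simulate, two_means_1d, lasso_cd, select_features, lam) are assumptions made for illustration only.

```python
import random

def simulate(n=400, p=10, seed=0):
    """Hypothetical data: latent phenotype y, one surrogate, p candidate features."""
    rng = random.Random(seed)
    y = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    # Surrogate (e.g. a diagnosis-code count) that tracks the phenotype closely.
    surrogate = [2.0 * yi + rng.gauss(0, 0.5) for yi in y]
    cols = []
    for j in range(p):
        signal = 1.0 if j < 2 else 0.0  # only features 0 and 1 are informative
        cols.append([signal * yi + rng.gauss(0, 0.5) for yi in y])
    return y, surrogate, cols

def two_means_1d(values, iters=50):
    """Unsupervised 1-D two-means; label 1 marks the higher-mean cluster."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        ones = [v for v, lab in zip(values, labels) if lab == 1]
        zeros = [v for v, lab in zip(values, labels) if lab == 0]
        if not ones or not zeros:
            break
        lo, hi = sum(zeros) / len(zeros), sum(ones) / len(ones)
    return labels

def standardize(col):
    """Center a column and scale it to unit variance."""
    m = sum(col) / len(col)
    sd = (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - m) / sd for v in col]

def soft(z, lam):
    """Soft-thresholding operator used by the lasso update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(cols, y, lam, sweeps=100):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 for standardized columns."""
    n, p = len(y), len(cols)
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, starting from b = 0
    for _ in range(sweeps):
        for j in range(p):
            xj = cols[j]
            rho = sum(x * r for x, r in zip(xj, resid)) / n + beta[j]
            new = soft(rho, lam)
            if new != beta[j]:
                d = new - beta[j]
                resid = [r - d * x for r, x in zip(resid, xj)]
                beta[j] = new
    return beta

def select_features(surrogate, cols, lam=0.08):
    """Stage 1: cluster the surrogate. Stage 2: lasso on the estimated labels."""
    y_hat = two_means_1d(surrogate)
    mean = sum(y_hat) / len(y_hat)
    y_centered = [v - mean for v in y_hat]
    beta = lasso_cd([standardize(c) for c in cols], y_centered, lam)
    return [j for j, b in enumerate(beta) if b != 0.0]

y, surrogate, cols = simulate()
selected = select_features(surrogate, cols)
print("selected feature indices:", selected)
```

No gold-standard label enters select_features; the true y is simulated only so the selection can be checked afterwards. The linear lasso stands in for whichever sparse regression is preferred. The abstract's appeal to Li and Duan (1989) is what motivates fitting the surrogate-based model even though its link to the true phenotype model is not known.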

Keywords:  electronic medical records; feature selection; prediction accuracy; regularized regression; risk prediction

Year:  2019        PMID: 30353541     DOI: 10.1111/biom.12987

Source DB:  PubMed          Journal:  Biometrics        ISSN: 0006-341X            Impact factor:   2.571


