Kenney Ng (1), Amol Ghoting (2), Steven R Steinhubl (3), Walter F Stewart (4), Bradley Malin (5), Jimeng Sun (2).
1. IBM TJ Watson Research Center, Yorktown Heights, NY, United States. Electronic address: kenney.ng@us.ibm.com.
2. IBM TJ Watson Research Center, Yorktown Heights, NY, United States.
3. Scripps Translational Science Institute, La Jolla, CA, United States; Geisinger Medical Center, Danville, PA, United States.
4. Sutter Health, Concord, CA, United States.
5. Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, United States; Department of Electrical Engineering and Computer Science, School of Engineering, Vanderbilt University, Nashville, TN, United States.
Abstract
OBJECTIVE: Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: (1) cohort construction, (2) feature construction, (3) cross-validation, (4) feature selection, and (5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that can be used to simplify and expedite this process for health data.
METHODS: To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which (1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, (2) schedules the tasks in a topological ordering of the graph, and (3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported.
RESULTS: We assess the performance of PARAMO on various workloads using three datasets derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center and an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency over a standard sequential approach. In particular, PARAMO can build 800 different models on a 300,000-patient data set in 3 h when run in parallel, compared to 9 days when run sequentially.
CONCLUSION: This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors and speed up the research workflow and reuse of health information. This platform is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers.
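The three steps named in METHODS (build a dependency graph, schedule in topological order, run independent tasks in parallel) can be illustrated with a minimal sketch. This is not the PARAMO implementation: the abstract states that the platform uses Map-Reduce on a cluster, whereas the sketch below substitutes a local thread pool and Python's standard-library `graphlib` purely for illustration. The function name `run_pipeline`, the task names, and the `tasks`/`deps` dictionaries are all hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(tasks, deps, max_workers=4):
    """Run tasks in dependency order; independent tasks run concurrently.

    tasks: dict of task name -> zero-argument callable (hypothetical format)
    deps:  dict of task name -> set of task names that must finish first
    Returns a dict of task name -> result.
    """
    # (1) Build the dependency graph: node -> set of predecessors.
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies

    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # (2) get_ready() yields tasks whose predecessors are all done.
            for name in sorter.get_ready():
                # (3) Submit every ready task; ready tasks are independent.
                running[pool.submit(tasks[name])] = name
            # Block until at least one running task finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock dependent tasks
    return results


if __name__ == "__main__":
    # Hypothetical two-fold pipeline mirroring the five task types
    # listed in OBJECTIVE; each task just returns a label here.
    names = ["cohort", "features", "cv_split",
             "select_fold1", "select_fold2",
             "classify_fold1", "classify_fold2"]
    demo_tasks = {n: (lambda n=n: f"{n} done") for n in names}
    demo_deps = {
        "features": {"cohort"},
        "cv_split": {"features"},
        "select_fold1": {"cv_split"},
        "select_fold2": {"cv_split"},
        "classify_fold1": {"select_fold1"},
        "classify_fold2": {"select_fold2"},
    }
    print(run_pipeline(demo_tasks, demo_deps))
```

In this toy graph the per-fold feature selection and classification tasks do not depend on each other across folds, so they can run at the same time. That independence across folds, cohorts, and model configurations is the kind of parallelism the abstract describes exploiting.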