BACKGROUND: Electronic health record (EHR) systems have high potential to improve healthcare delivery and management. Although structured EHR data are stored in machine-readable formats, their use for decision support still poses technical challenges for researchers because the data must be preprocessed and converted into a matrix format. During our research, we observed that the clinical informatics literature does not provide guidance for researchers on how to build this matrix while avoiding potential pitfalls.
OBJECTIVES: This article aims to provide researchers with a roadmap of the main technical challenges of preprocessing structured EHR data and possible strategies to overcome them.
METHODS: Following the standard data processing stages (extracting database entries, defining features, processing data, assessing feature values, and integrating data elements), organized within an EDPAI framework, we identified the main challenges faced by researchers and reflected on how to address them, based on lessons learned from our research experience and on best practices from related literature. We highlight the main potential sources of error, present strategies to approach those challenges, and discuss the implications of these strategies.
RESULTS: Following the EDPAI framework, researchers face five key challenges: (1) gathering and integrating data, (2) identifying and handling different feature types, (3) combining features to handle redundancy and granularity, (4) addressing data missingness, and (5) handling multiple feature values. Strategies to address these challenges include cross-checking identifiers for robust data retrieval and integration; applying clinical knowledge in identifying feature types, in addressing redundancy and granularity, and in accommodating multiple feature values; and adequately investigating patterns of missingness.
CONCLUSIONS: This article contributes to the literature by providing a roadmap to inform structured EHR data preprocessing. It may alert researchers to potential pitfalls and to the implications of methodological decisions in handling structured data, so as to avoid biases and help realize the benefits of the secondary use of EHR data.
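To make three of the strategies named in the RESULTS concrete (cross-checking identifiers, accommodating multiple feature values, and exposing missingness), a minimal Python sketch follows. It is an illustration under stated assumptions, not the authors' implementation: the record layout, the field names (`patient_id`, `encounter_id`, `feature`, `value`), the toy data, and the function `build_matrix` are all hypothetical, and taking the mean of repeated values is only one possible aggregation, which in practice should be chosen with clinical knowledge.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy extracts; real EHR schemas and field names will differ.
patients = [
    {"patient_id": "P1", "encounter_id": "E1"},
    {"patient_id": "P2", "encounter_id": "E2"},
    {"patient_id": "P3", "encounter_id": "E3"},
]
labs = [
    {"patient_id": "P1", "encounter_id": "E1", "feature": "glucose", "value": 5.0},
    {"patient_id": "P1", "encounter_id": "E1", "feature": "glucose", "value": 7.0},
    {"patient_id": "P2", "encounter_id": "E2", "feature": "lactate", "value": 2.0},
    # Encounter E9 is not registered for P2, so this row fails the cross-check.
    {"patient_id": "P2", "encounter_id": "E9", "feature": "glucose", "value": 6.0},
]


def build_matrix(patients, labs, aggregate=mean):
    """Return (matrix, rejected): one row per patient, one column per feature."""
    # Cross-check identifiers: accept a lab row only if BOTH identifiers
    # match a registered patient/encounter pair.
    known = {(p["patient_id"], p["encounter_id"]) for p in patients}
    values = defaultdict(list)
    rejected = []
    for row in labs:
        if (row["patient_id"], row["encounter_id"]) not in known:
            rejected.append(row)  # keep for manual review, do not drop silently
            continue
        values[(row["patient_id"], row["feature"])].append(row["value"])

    features = sorted({feature for (_, feature) in values})
    matrix = {}
    for p in patients:
        pid = p["patient_id"]
        row = {}
        for feature in features:
            observed = values.get((pid, feature))
            # Multiple values per patient are collapsed with `aggregate`.
            row[feature] = aggregate(observed) if observed else None
            # Explicit indicator so missingness patterns can be inspected later.
            row[feature + "_missing"] = not observed
        matrix[pid] = row
    return matrix, rejected


matrix, rejected = build_matrix(patients, labs)
```

Keeping the rejected rows and the per-feature missingness indicators, rather than discarding mismatches or imputing immediately, leaves the pattern of missing data available for inspection before any modelling decision is made.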
Keywords:
Data mining; clinical decision support; data access; electronic health records and systems; integration and analysis; structured data