I2b2-etl: Python application for importing electronic health data into the informatics for integrating biology and the bedside platform.

Kavishwar B Wagholikar^1,2, Layne Ainsworth³, David Zelle⁴, Kira Chaney⁴, Michael Mendis³, Jeffery Klann^1,2, Alexander J Blood^1,4, Angela Miller³, Rupendra Chulyadyo³, Michael Oates³, William J Gordon^1,3,4, Samuel J Aronson³, Benjamin M Scirica^1,4, Shawn N Murphy^1,2.

Abstract

MOTIVATION: The i2b2 platform is used at major academic health institutions and research consortia for querying electronic health data. However, a major obstacle to wider utilization of the platform is the complexity of data loading, which entails a steep learning curve for the platform's complex data schemas. To address this problem, we have developed the i2b2-etl package, which simplifies the data loading process and will facilitate wider deployment and utilization of the platform.
RESULTS: We have implemented i2b2-etl as a Python application that imports ontology and patient data using simplified input file schemas and provides inbuilt record number de-identification and data validation. We describe a real-world deployment of i2b2-etl for a population-management initiative at MassGeneral Brigham.
AVAILABILITY AND IMPLEMENTATION: i2b2-etl is a free, open-source application implemented in Python, available under the Mozilla 2 license. The application can be downloaded as compiled Docker images. A live demo is available at https://i2b2clinical.org/demo-i2b2etl/ (username: demo, password: Etl@2021).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2022 PMID: 36053173 PMCID: PMC9563689 DOI: 10.1093/bioinformatics/btac595

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

The i2b2 platform is deployed at major academic health institutions for querying electronic health records (EHR) (Abend; Murphy). The platform provides a user-friendly interface that allows researchers with no expertise in information technology to find patient cohorts using EHR data. I2b2 has been deployed as a critical component of research networks including the National Patient-Centered Clinical Research Network (PCORNet) (Klann), Accrual for Clinical Trials (Visweswaran) and the Consortium for Clinical Characterization of COVID-19 (4CE) (Brat; Weber). The platform has been used for a wide spectrum of use cases including clinical-trial enrollment (Bucalo), population management (Wagholikar), biobanking (Castro; Mate; Segagni), clinical decision support and epidemiological analysis (Klann and Murphy, 2013; Murchison; Pfiffner; Segagni; Wagholikar, b). However, despite its impact and open-source availability, deployment of the platform is largely limited to large academic medical centers. A major obstacle to wider utilization of the i2b2 platform is the difficulty of data loading, as it requires IT staff to learn the complex internal data schema of the platform, which has a star topology (Murphy). To address this issue, we have developed the i2b2-etl package, which specifies a simple input format and abstracts away the complex processes needed to initialize the internal i2b2 schema. This will facilitate wider deployment and utilization of the platform.

2 Materials and methods

We have implemented a Python application, referred to as ‘i2b2-etl’, that imports data in Comma Separated Value (CSV) format into the i2b2 platform. The source code is openly available (https://github.com/i2b2/i2b2-etl and https://github.com/i2b2/i2b2-etl-docker), as are compiled containers on Docker Hub. The application can be downloaded as Docker images that are compatible with all common Linux distributions, Windows and Mac OS X systems. For an online demo, see https://i2b2clinical.org/demo-i2b2etl/ (username: demo, password: Etl@2021). After logging in, navigate to the ETL tab, press the delete button and then choose upload files, selecting the CSV files from the Supplementary Material. Supplementary Appendix A provides the steps to install i2b2-etl. Supplementary Appendix B describes command-line interaction. Supplementary Appendix C demonstrates the use of GitLab to supply SQL queries as inputs to i2b2-etl. As shown in Figure 1A, i2b2-etl accepts two types of CSV files: concept files and fact files, which contain the meta-data and data, respectively.
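As an illustration of the two input files, the following Python sketch writes one concept file and one fact file holding the single example row described for Figure 1A. It is only a sketch under stated assumptions: the header spellings (`mrn`, `start-date`), the ISO date format, the `lab_` file prefix and the function name `write_example_files` are hypothetical, while the column sets and the ‘_concepts.csv’ and ‘_facts.csv’ suffixes follow the description in the text.

```python
import csv
from pathlib import Path

# The paper names the columns; the exact header spellings are assumptions.
CONCEPT_COLUMNS = ["path", "code", "type"]
FACT_COLUMNS = ["mrn", "start-date", "code", "value"]


def write_example_files(directory: Path) -> tuple[Path, Path]:
    """Write one concept file and one fact file with the Figure 1A example row."""
    concept_file = directory / "lab_concepts.csv"  # name must end in '_concepts.csv'
    fact_file = directory / "lab_facts.csv"        # name must end in '_facts.csv'

    with concept_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(CONCEPT_COLUMNS)
        # Blood glucose: a laboratory test on blood that takes an integer value.
        writer.writerow(["/Lab/Blood/Blood-glucose", "GLU", "integer"])

    with fact_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(FACT_COLUMNS)
        # Patient with MRN 1 had a GLU value of 160 on 1 January 2010.
        # The ISO date format is an assumption, not taken from the paper.
        writer.writerow(["1", "2010-01-01", "GLU", "160"])

    return concept_file, fact_file
```

Calling `write_example_files(Path("."))` would produce the two files in the current directory, ready for upload through the ETL tab or the command line, provided the assumed headers match what the tool expects.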

Fig. 1.

(A) I2b2-etl accepts two types of CSV files: concept files and fact files. Concept files provide the meta-data or dictionary for the creation of an ontology hierarchy. The fact files contain patient data. For example, the first row of the fact file indicates that the patient with medical-record number 1 had a ‘GLU’ value of 160 on 1 January 2010. The concept file clarifies that GLU is blood glucose, a laboratory test performed on blood, as indicated in the path, and that it takes an integer value. I2b2-etl uses the concept files to validate the facts while performing the import. (B) I2b2-etl parses the input schemas shown on the left (blue background) and executes processes to populate the internal i2b2 tables shown on the right (yellow background). The metadata, table access and concept dimension tables are essential for displaying ontologies and for querying patient data using ontologies. These are automatically populated by the tool. The observation-fact and dimension tables are internal i2b2 tables in a star topology that contain EHR data, and the mapping tables serve to de-identify the patient record numbers. These tables (except the provider and modifier dimensions) are automatically generated by i2b2-etl. Without i2b2-etl, all these internal tables need to be populated individually, which requires an in-depth understanding of the schema, a challenge that is now resolved by i2b2-etl. (A color version of this figure appears in the online version of this article.)

Concept files provide the meta-data or dictionary for the creation of the ontology hierarchy. Concept files consist of three columns: path, code and type, and their names end in ‘_concepts.csv’. Each row in the concept file corresponds to a node in the ontology hierarchy displayed in the i2b2 user interface. The path specifies the unique location of the node in the hierarchy.
The type can be integer, float, assertion, string or large-string, and the code is the abbreviated reference to the concept. For example, in the first row of the concept file shown in Figure 1A, the path ‘/Lab/Blood/Blood-glucose’ leads to the creation of an integer node called Blood-glucose, as a child of the ‘Blood’ node. The ancestor nodes for ‘Lab’ and ‘Blood’ are automatically created.

The fact files contain patient data in four columns: medical record number (MRN), start-date, code and value. The names of fact files end in ‘_facts.csv’. Each row of the fact file provides the value of a specific observation (of a concept) for a patient referenced by the MRN, starting at a particular point in time (start date). For example, the first row in the fact file shown in Figure 1A indicates that a value of 160 for blood glucose was observed on 1 January 2010 for the patient with MRN 1.

I2b2-etl populates the tables in the internal i2b2 schema as elucidated in Figure 1B. The ontology hierarchy and SQL code snippets for ontology-based querying are auto-generated from the concept file and populated in the i2b2 metadata and concept dimension tables by the tool. The medical record numbers are converted into randomly generated integers and stored in the i2b2 patient-mapping table, which is inaccessible to end-users. End-users can only access the randomized integers as patient numbers in the user interface. I2b2-etl validates the input facts to ensure that each fact references a valid concept, has a value that conforms to its concept type and has a valid time stamp.

We deployed i2b2-etl for a cardiovascular population health program at MassGeneral Brigham in Boston (Benson; Blood; Gordon; Wagholikar). The program involved daily interaction of navigators with patients. The resulting data were recorded in a relational database. To import these data into i2b2 for easy querying, we developed SQL queries to extract the project data into the CSV formats specified above. Next, we deployed i2b2-etl as a nightly job that executed the SQL queries and loaded the resulting CSV files into an i2b2 repository. The repository was set up specifically for the project using the i2b2 docker containers (Wagholikar).
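The three behaviours described in this section (automatic creation of ancestor nodes, validation of facts against the concept file, and replacement of MRNs by random integers) can be illustrated with a short Python sketch. This is not the i2b2-etl source code. The names `ancestor_paths`, `value_matches_type`, `validate_fact` and `PatientMapping`, the `start-date` key, the ISO date format and the range of the random integers are all assumptions made for illustration.

```python
import secrets
from datetime import datetime

# Concept types listed in the paper.
ALLOWED_TYPES = {"integer", "float", "assertion", "string", "large-string"}


def ancestor_paths(path: str) -> list[str]:
    """Return every node from the root down to `path`, including `path` itself."""
    parts = [part for part in path.split("/") if part]
    return ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]


def value_matches_type(value: str, concept_type: str) -> bool:
    """Check that a fact value can be read as its concept's declared type."""
    converter = {"integer": int, "float": float}.get(concept_type)
    if converter is None:
        # assertion, string and large-string accept any text in this sketch
        return concept_type in ALLOWED_TYPES
    try:
        converter(value)
    except ValueError:
        return False
    return True


def validate_fact(fact: dict[str, str], concept_types: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the fact passed all checks."""
    errors = []
    concept_type = concept_types.get(fact["code"])
    if concept_type is None:
        errors.append(f"unknown concept code: {fact['code']}")
    elif not value_matches_type(fact["value"], concept_type):
        errors.append(f"value {fact['value']!r} is not a valid {concept_type}")
    try:
        # Assumed date format; the paper does not state the expected format.
        datetime.strptime(fact["start-date"], "%Y-%m-%d")
    except ValueError:
        errors.append(f"invalid start date: {fact['start-date']}")
    return errors


class PatientMapping:
    """In-memory stand-in for a patient-mapping table (MRN to random integer)."""

    def __init__(self) -> None:
        self._by_mrn: dict[str, int] = {}
        self._used: set[int] = set()

    def patient_num(self, mrn: str) -> int:
        """Return a stable random integer for `mrn`, creating it on first use."""
        if mrn not in self._by_mrn:
            candidate = secrets.randbelow(10**9) + 1
            while candidate in self._used:  # avoid assigning one number twice
                candidate = secrets.randbelow(10**9) + 1
            self._used.add(candidate)
            self._by_mrn[mrn] = candidate
        return self._by_mrn[mrn]
```

For the Figure 1A example, `ancestor_paths("/Lab/Blood/Blood-glucose")` yields the ‘Lab’, ‘Blood’ and ‘Blood-glucose’ nodes in that order, and `validate_fact` returns an empty list for the fact with code GLU and value 160 when the concept file declares GLU as an integer.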

3 Results

Our i2b2-etl package allows the ontology and patient data to be imported as two simple CSV files. The cardiovascular program included data for 28 483 patients. Deployment of i2b2-etl resulted in 1395 concepts and over 4.7 million facts in the i2b2 repository, requiring 18 min for execution. The resultant i2b2 repository is used by the study staff to identify sub-cohorts in the population and to evaluate the program’s progress (Scirica).

The novelty of the i2b2-etl application lies in the simplified design of the input file schema, inbuilt de-identification and data validation. The input file schema abstracts away the complexity of the data schemas internal to the i2b2 platform (see Fig. 1B). This simplified mechanism will allow the IT staff at healthcare institutions to easily transform and load their institution’s EHR data into the i2b2 platform. Without i2b2-etl, the IT staff need to thoroughly understand the complex schemas of the i2b2 platform in order to extract data that conform to them. As i2b2-etl transforms the simplified concept and fact file schemas into the platform’s schemas, IT staff need only focus on preparing SQL statements that yield the simplified input schemas. Moreover, with the integration of the ETL module with GitLab (Supplementary Appendix C), the entire ETL process can be automated: GitLab triggers the execution of SQL on the source database to extract the data as CSVs, which are then transformed into the i2b2 internal schemas and loaded into the i2b2 platform’s database. Consequently, as the transform and load steps are done by i2b2-etl, the IT staff only need to focus on the extraction step, which can be performed with minimal SQL expertise.

However, the simplified abstraction comes at the expense of functionality in the i2b2 platform. I2b2-etl does not support querying using fields in the modifier, patient and visit dimensions of the i2b2 star topology.
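The extraction step that remains with the IT staff can be sketched as a SQL query whose result columns are aliased to the fact file columns and written out as a CSV. The example below is a hypothetical illustration only: it uses Python's built-in `sqlite3` module as a stand-in for the institution's source database, and the table `lab_results`, its columns, the query `FACT_QUERY`, the header spellings and the function `export_query_to_csv` are all assumptions that do not come from the paper or from i2b2-etl.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical source table and query. A real deployment would run its own
# SQL against its own database; only the four output columns follow the paper.
FACT_QUERY = """
    SELECT patient_mrn AS mrn,
           measured_on AS "start-date",
           'GLU'       AS code,
           glucose     AS value
    FROM lab_results
    ORDER BY patient_mrn, measured_on
"""


def export_query_to_csv(connection: sqlite3.Connection, query: str, target: Path) -> int:
    """Run `query` and write its result, with a header row, to `target`.

    Returns the number of data rows written.
    """
    cursor = connection.execute(query)
    # cursor.description holds one 7-tuple per result column; item 0 is the name.
    header = [column[0] for column in cursor.description]
    rows_written = 0
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

A scheduled job, such as the nightly job described for the cardiovascular program, could call a function of this kind once per query and write each result to a file whose name ends in ‘_facts.csv’ before handing the files to i2b2-etl.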
Several alternative approaches have previously been developed for importing EHR data into i2b2. Post , b, 2016) have developed an application called Eureka that can load Excel files with custom schemas into the i2b2 database. Importing EHR data into the OMOP model has been used in several projects (Klann, 2019; Majeed; Rinner). However, the major differentiator of i2b2-etl from the alternative approaches is the simplified abstraction for the input, which allows users without advanced training in data modeling to rapidly import EHR data into the i2b2 platform, thereby improving accessibility of the EHR for secondary use cases.

Funding

This work was supported by MassGeneral Brigham and National Institutes of Health [R00-LM011575, R01-HG009174 and R01HL151643]. Conflict of Interest: none declared.

28 in total

1. Architecture of the open-source clinical research chart from Informatics for Integrating Biology and the Bedside.

Authors: Shawn N Murphy; Michael Mendis; Kristel Hackett; Rajesh Kuttan; Wensong Pan; Lori C Phillips; Vivian Gainer; David Berkowicz; John P Glaser; Isaac Kohane; Henry C Chueh
Journal: AMIA Annu Symp Proc Date: 2007-10-11

2. Web services for data warehouses: OMOP and PCORnet on i2b2.

Authors: Jeffrey G Klann; Lori C Phillips; Christopher Herrick; Matthew A H Joss; Kavishwar B Wagholikar; Shawn N Murphy
Journal: J Am Med Inform Assoc Date: 2018-10-01 Impact factor: 4.497

3. On-The-Fly Query Translation Between i2b2 and Samply in the German Biobank Node (GBN) Prototypes.

Authors: Sebastian Mate; Patric Vormstein; Dennis Kadioglu; Raphael W Majeed; Martin Lablans; Hans-Ulrich Prokosch; Holger Storf
Journal: Stud Health Technol Inform Date: 2017

4. Digital Care Transformation: Interim Report From the First 5000 Patients Enrolled in a Remote Algorithm-Based Cardiovascular Risk Management Program to Improve Lipid and Hypertension Control.

Authors: Benjamin M Scirica; Christopher P Cannon; Naomi D L Fisher; Thomas A Gaziano; David Zelle; Kira Chaney; Angela Miller; Hunter Nichols; Lina Matta; William J Gordon; Shawn Murphy; Kavi B Wagholikar; Jorge Plutzky; Calum A MacRae
Journal: Circulation Date: 2020-11-17 Impact factor: 29.690

5. Integrating Clinical Data into the i2b2 Repository.

Authors: Aaron Abend; Dan Housman; Bruce Johnson
Journal: Summit Transl Bioinform Date: 2009-03-01

6. Evolving Research Data Sharing Networks to Clinical App Sharing Networks.

Authors: Kavishwar B Wagholikar; Rahul Jain; Eliel Oliveira; Joshua Mandel; Jeffery Klann; Ricardo Colas; Prasad Patil; Kuladip Yadav; Kenneth D Mandl; Thomas Carton; Shawn N Murphy
Journal: AMIA Jt Summits Transl Sci Proc Date: 2017-07-26

7. SMART-on-FHIR implemented over i2b2.

Authors: Kavishwar B Wagholikar; Joshua C Mandel; Jeffery G Klann; Nich Wattanasin; Michael Mendis; Christopher G Chute; Kenneth D Mandl; Shawn N Murphy
Journal: J Am Med Inform Assoc Date: 2017-03-01 Impact factor: 4.497

8. C3-PRO: Connecting ResearchKit to the Health System Using i2b2 and FHIR.

Authors: Pascal B Pfiffner; Isaac Pinyol; Marc D Natter; Kenneth D Mandl
Journal: PLoS One Date: 2016-03-31 Impact factor: 3.240

9. International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium.

Authors: Gabriel A Brat; Griffin M Weber; Nils Gehlenborg; Paul Avillach; Nathan P Palmer; Luca Chiovato; James Cimino; Lemuel R Waitman; Gilbert S Omenn; Alberto Malovini; Jason H Moore; Brett K Beaulieu-Jones; Valentina Tibollo; Shawn N Murphy; Sehi L' Yi; Mark S Keller; Riccardo Bellazzi; David A Hanauer; Arnaud Serret-Larmande; Alba Gutierrez-Sacristan; John J Holmes; Douglas S Bell; Kenneth D Mandl; Robert W Follett; Jeffrey G Klann; Douglas A Murad; Luigia Scudeller; Mauro Bucalo; Katie Kirchoff; Jean Craig; Jihad Obeid; Vianney Jouhet; Romain Griffier; Sebastien Cossin; Bertrand Moal; Lav P Patel; Antonio Bellasi; Hans U Prokosch; Detlef Kraska; Piotr Sliz; Amelia L M Tan; Kee Yuan Ngiam; Alberto Zambelli; Danielle L Mowery; Emily Schiver; Batsal Devkota; Robert L Bradford; Mohamad Daniar; Christel Daniel; Vincent Benoit; Romain Bey; Nicolas Paris; Patricia Serre; Nina Orlova; Julien Dubiel; Martin Hilka; Anne Sophie Jannot; Stephane Breant; Judith Leblanc; Nicolas Griffon; Anita Burgun; Melodie Bernaux; Arnaud Sandrin; Elisa Salamanca; Sylvie Cormont; Thomas Ganslandt; Tobias Gradinger; Julien Champ; Martin Boeker; Patricia Martel; Loic Esteve; Alexandre Gramfort; Olivier Grisel; Damien Leprovost; Thomas Moreau; Gael Varoquaux; Jill-Jênn Vie; Demian Wassermann; Arthur Mensch; Charlotte Caucheteux; Christian Haverkamp; Guillaume Lemaitre; Silvano Bosari; Ian D Krantz; Andrew South; Tianxi Cai; Isaac S Kohane
Journal: NPJ Digit Med Date: 2020-08-19

10. Implementation of informatics for integrating biology and the bedside (i2b2) platform as Docker containers.

Authors: Kavishwar B Wagholikar; Pralav Dessai; Javier Sanz; Michael E Mendis; Douglas S Bell; Shawn N Murphy
Journal: BMC Med Inform Decis Mak Date: 2018-07-16 Impact factor: 2.796