Suparno Datta, Jan Philipp Sachs, Harry Freitas Da Cruz, Tom Martensen, Philipp Bode, Ariane Morassi Sasso, Benjamin S Glicksberg, Erwin Böttinger.
Abstract
OBJECTIVES: The development of clinical predictive models hinges upon the availability of comprehensive clinical data. Tapping into such resources requires considerable effort from clinicians, data scientists, and engineers. Specifically, these efforts are focused on data extraction and preprocessing steps required prior to modeling, including complex database queries. A handful of software libraries exist that can reduce this complexity by building upon data standards. However, a gap remains concerning electronic health records (EHRs) stored in star schema clinical data warehouses, an approach often adopted in practice. In this article, we introduce the FlexIBle EHR Retrieval (FIBER) tool: a Python library built on top of a star schema (i2b2) clinical data warehouse that enables flexible generation of modeling-ready cohorts as data frames.
Keywords: databases, factual; electronic health records; information storage and retrieval; software/instrumentation; workflow
Year: 2021; PMID: 34350388; PMCID: PMC8327378; DOI: 10.1093/jamiaopen/ooab048
Source DB: PubMed; Journal: JAMIA Open; ISSN: 2574-2531
Major features of cohort creation and analysis tools
| Feature | ATLAS | FIDDLE | inspectOMOP | Leaf | PLP | ROMOP | rEHR | FIBER |
|---|---|---|---|---|---|---|---|---|
| Internal standards | | | | | | | | |
| Standard of underlying database | OMOP | Specialized data set only (MIMIC) | OMOP | i2b2 + OMOP | OMOP | OMOP | CPRD | i2b2 |
| (Programming) interface | GUI | Py | Py | GUI | R | R | R | Py |
| Data handling | | | | | | | | |
| Complex cohort building | ● | ◐ | ◐ | ● | ● | ◐ | ● | ● |
| Modeling-ready dataframes (aggregated at patient level) | ◐ | ● | ◐ | ○ | ◐ | ◐ | ◐ | ● |
| Customization of graphical display and results | ◐ | ● | ● | ◐ | ◐ | ○ | ● | ● |
Abbreviations: Py: Python; GUI: graphical user interface; CPRD: Clinical Practice Research Datalink.
Legend: ●: fully supported; ◐: partially supported; ○: not mentioned.
Figure 1. FIBER architecture depicted using a Fundamental Modeling Concepts (FMC) block diagram. The architecture can be extended to omics and text data.
Available condition classes in FIBER
| Condition class | Initialization with description (text) | Initialization with clinical codes |
|---|---|---|
| Diagnosis | ✓ | ✓ |
| Procedure | ✓ | ✓ |
| Measurement (procedure) | ✓ | ✓ |
| Vital sign (measurement) | ✓ | ✓ |
| Material | ✓ | ✓ |
| Drug (material) | ✓ | ✓ |
| Encounter | ✓ | – |
| Metadata | ✓ | – |
| Laboratory value | ✓ | ✓ |
| Patient | ✓ | – |
Note: Some conditions can only be created from short descriptions, for example, LabValue("GLUCOSE"); others can also be created from a standardized clinical coding scheme such as ICD-9, for example, Procedure(code="35.0", context="ICD-9"). A condition class name in parentheses indicates the parent condition class.
Standardized codes like LOINC can be integrated into the FIBER framework but have not been applied in the current use case.
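The initialization rules in the table and note above can be illustrated with a minimal, self-contained sketch. It does not reproduce FIBER's implementation: the base class `Condition`, the `supports_codes` flag, the `matches` method, and the dictionary-shaped records are all hypothetical names introduced for illustration. Only the class names and the two example calls come from the note.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    """Hypothetical base class: a selector built from a description or a code."""
    description: Optional[str] = None
    code: Optional[str] = None
    context: Optional[str] = None  # coding scheme, e.g. "ICD-9"

    # Assumption: classes marked "–" in the table set this to False.
    supports_codes = True

    def __post_init__(self):
        if self.code is not None and not self.supports_codes:
            raise ValueError(f"{type(self).__name__} cannot be created from clinical codes")
        if self.description is None and self.code is None:
            raise ValueError("provide a description or a code")

    def matches(self, record: dict) -> bool:
        """Return True if a record (a plain dict in this sketch) satisfies the condition."""
        if self.code is not None:
            return record.get("code") == self.code and record.get("context") == self.context
        return self.description.lower() in record.get("description", "").lower()


# Parent/child relations follow the parentheses in the table.
class Procedure(Condition): pass
class Measurement(Procedure): pass
class VitalSign(Measurement): pass
class LabValue(Condition): pass


class Encounter(Condition):
    supports_codes = False  # description only, per the table


glucose = LabValue("GLUCOSE")
valve_surgery = Procedure(code="35.0", context="ICD-9")
```

In this sketch, `glucose.matches({"description": "Glucose, serum"})` is true, while `Encounter(code="35.0")` raises a `ValueError` because the class is marked as description-only.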
Figure 2. (A) Gender and (B) age distribution plots generated using FIBER for the heart surgery cohort.
Figure 3. Utility plots for cohort exploration from FIBER. (A) Encounter-timeline plot: the x-axis shows the number of encounters and the y-axis shows different time windows around the heart surgery. The number in each box indicates the fraction of patients who had that many encounters. (B) Feature counts: the number of features for some of the feature classes obtained using the get_pivoted_features function for the heart surgery cohort with varying thresholds.
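How the feature count in panel (B) can shrink as the threshold rises is sketched below in plain Python. This is an illustration, not the `get_pivoted_features` implementation: it assumes the threshold is the minimum fraction of patients in whom a feature must be observed, and the function `pivot_features`, the row layout, and the sample values are hypothetical.

```python
from collections import defaultdict


def pivot_features(rows, threshold):
    """Pivot (patient_id, feature, value) rows into one record per patient.

    Assumption: a feature is kept only if it was observed in at least
    `threshold` (a fraction between 0 and 1) of all patients.
    """
    patients = sorted({pid for pid, _, _ in rows})
    seen = defaultdict(set)
    for pid, feature, _ in rows:
        seen[feature].add(pid)
    kept = sorted(f for f, pids in seen.items()
                  if len(pids) / len(patients) >= threshold)
    # One column per kept feature; None marks a missing observation.
    table = {pid: dict.fromkeys(kept) for pid in patients}
    for pid, feature, value in rows:
        if feature in table[pid]:
            table[pid][feature] = value  # the last observation wins
    return table


# Made-up observations for three patients.
rows = [
    (1, "glucose", 5.4), (2, "glucose", 6.1), (3, "glucose", 5.0),
    (1, "creatinine", 1.1), (2, "creatinine", 0.9),
    (3, "troponin", 0.02),
]

# Number of feature columns that survive each threshold.
feature_counts = {
    t: len(next(iter(pivot_features(rows, t).values())))
    for t in (0.0, 0.5, 1.0)
}
```

With these toy rows, all three features survive a threshold of 0.0, two survive 0.5, and only the feature observed in every patient survives 1.0.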
Metrics for prediction of acute kidney injury onset in a time window of 7 and 28 days after heart surgery, comparing four different models for each prediction period
| Prediction window (days) | Model | AUC | AUPRC |
|---|---|---|---|
| 7 | Logistic regression | 0.57 | 0.10 |
| 7 | Random forest | 0.52 | 0.09 |
| 7 | LightGBM | 0.73 | 0.16 |
| 7 | XGBoost | 0.55 | 0.11 |
| 28 | Logistic regression | 0.61 | 0.18 |
| 28 | Random forest | 0.54 | 0.15 |
| 28 | LightGBM | | |
| 28 | XGBoost | 0.60 | 0.20 |
Note: The complete data were extracted with FIBER. The values in bold indicate the best performance achieved.
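For readers less familiar with the two metrics in the table, the following sketch computes the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPRC, here as average precision) from scratch. It is not the evaluation code used in the study: the function names and the toy labels and scores are illustrative, and the average-precision routine assumes there are no tied scores.

```python
def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive (no tied scores assumed)."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / hits


# Toy example: 1 = event occurred, 0 = no event.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)              # 3 of 4 positive/negative pairs are ordered correctly
auprc = average_precision(y_true, y_score)  # (1/1 + 2/3) / 2
```

A random classifier scores about 0.5 on AUC, whereas its expected AUPRC is roughly the prevalence of the positive class, so AUPRC values are best read against the outcome rate in the cohort.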
Figure 4. Performance of the FIBER library on two database architectures across three typical use cases. (A) Creation of a cohort with a diagnosis code; (B) fetching of values for a patient cohort; (C) counting lab results by type of test.
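The three use cases in the caption can be written as plain SQL against a toy star schema, which gives a sense of the queries a library like FIBER hides from the user. The sketch below uses Python's built-in `sqlite3` with a heavily simplified layout: the table names follow i2b2 conventions, but the reduced column set, the `kind` column, the concept codes, and all rows are assumptions made for illustration and do not reflect the benchmarked databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified star schema: one fact table plus a concept dimension.
CREATE TABLE concept_dimension (concept_cd TEXT PRIMARY KEY, name TEXT, kind TEXT);
CREATE TABLE observation_fact (
    patient_num INTEGER, concept_cd TEXT, start_date TEXT, nval_num REAL);

INSERT INTO concept_dimension VALUES
    ('ICD9:584.9', 'Acute kidney failure, unspecified', 'diagnosis'),
    ('LAB:GLUCOSE', 'Glucose', 'lab'),
    ('LAB:CREAT', 'Creatinine', 'lab');

INSERT INTO observation_fact VALUES
    (1, 'ICD9:584.9', '2020-01-05', NULL),
    (1, 'LAB:GLUCOSE', '2020-01-04', 5.4),
    (1, 'LAB:CREAT', '2020-01-04', 1.9),
    (2, 'LAB:GLUCOSE', '2020-02-01', 6.1),
    (3, 'ICD9:584.9', '2020-03-10', NULL),
    (3, 'LAB:CREAT', '2020-03-09', 2.3);
""")

# (A) Build a cohort from a diagnosis code.
cohort = [row[0] for row in conn.execute(
    "SELECT DISTINCT patient_num FROM observation_fact "
    "WHERE concept_cd = ? ORDER BY patient_num", ("ICD9:584.9",))]

# (B) Fetch lab values for the patients in that cohort.
placeholders = ",".join("?" * len(cohort))
values = conn.execute(
    "SELECT f.patient_num, c.name, f.nval_num "
    "FROM observation_fact f JOIN concept_dimension c USING (concept_cd) "
    f"WHERE c.kind = 'lab' AND f.patient_num IN ({placeholders}) "
    "ORDER BY f.patient_num, c.name", cohort).fetchall()

# (C) Count lab results by type of test.
lab_counts = conn.execute(
    "SELECT c.name, COUNT(*) "
    "FROM observation_fact f JOIN concept_dimension c USING (concept_cd) "
    "WHERE c.kind = 'lab' GROUP BY c.name ORDER BY c.name").fetchall()
```

Every question is answered by filtering or joining the single fact table against a dimension table, which is the defining trait of a star schema.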