Shengpu Tang, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael W Sjoding, Jenna Wiens.
Abstract
OBJECTIVE: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing is labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.
Keywords: electronic health records; machine learning; preprocessing pipeline
Year: 2020 PMID: 33040151 PMCID: PMC7727385 DOI: 10.1093/jamia/ocaa139
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Overview of FIDDLE. Given formatted input data and user-defined arguments, FIDDLE processes data in 3 stages: (1) pre-filter, (2) transform, and (3) post-filter. So long as the units are consistent, timestamps in the t column may be recorded at any level of granularity (eg, seconds, minutes, hours, days, visits). In this sample input file, we consider time in hours. A row with [1, 0.2, Heart Rate, 72] corresponds to a patient with ID = 1 with a heart rate = 72 bpm recorded at t = 0.2 h. In (1) pre-filter, FIDDLE eliminates rare variables. In (2) transform, FIDDLE transforms data into tensors containing time-invariant and time-dependent features. In (3) post-filter, FIDDLE removes redundant features and features that are likely uninformative. The output consists of binary vectors describing the features for each ID. bpm: beats per minute; FIDDLE: Flexible Data-Driven Pipeline; ID: unique identifier; KCl: potassium chloride; WBC: white blood cell.
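The pre-filter stage on the long-format input described above can be sketched as follows. This is a minimal illustration, not FIDDLE's implementation: the function name `pre_filter`, the parameter name `threshold`, and the rule "keep a variable if it is recorded for at least that fraction of IDs" are assumptions made for the example.

```python
from collections import defaultdict

# Long-format rows: (ID, t, variable_name, variable_value), as in the sample input.
rows = [
    (1, 0.2, "Heart Rate", 72),
    (1, 1.4, "Heart Rate", 75),
    (2, 0.5, "Heart Rate", 88),
    (3, 0.9, "Heart Rate", 64),
    (2, 1.5, "KCl", 20),
]

def pre_filter(rows, threshold):
    """Keep variables recorded for at least `threshold` of all IDs (assumed rule)."""
    all_ids = {row[0] for row in rows}
    ids_per_variable = defaultdict(set)
    for id_, _t, name, _value in rows:
        ids_per_variable[name].add(id_)
    kept = {
        name
        for name, ids in ids_per_variable.items()
        if len(ids) / len(all_ids) >= threshold
    }
    return [row for row in rows if row[2] in kept]

# "KCl" appears for 1 of 3 IDs, so it is dropped at threshold 0.5.
filtered = pre_filter(rows, threshold=0.5)
```

With this toy input, only the four "Heart Rate" rows remain after filtering.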
Challenges in preprocessing EHR data and FIDDLE’s solution
| Challenges | Example | Solutions in FIDDLE |
|---|---|---|
| Some data have associated timestamps, while others do not | Sex is recorded once at the time of admission and typically does not have a timestamp; administration of medications is timestamped. | Handle time-invariant and time-dependent data separately. |
| Data have heterogeneous types: categorical, numerical, hierarchical | Drug route is categorical: oral, IV; heart rate is numerical: 70 bpm; ICD-9 code is hierarchical | Different representations for each value type. Categorical: one-hot encoding. Numerical: 3 options: kept as continuous; binned into quintiles and one-hot encoded; or binned into quintiles and ordinal encoded. Hierarchical: user specifies which level(s) of the hierarchy to encode; values are converted internally to categorical values. |
| Data are sparse and irregularly sampled, and different variables can have different frequencies of recording | Vital signs, such as temperature or heart rate, may be measured multiple times per day at different intervals; laboratory tests are run infrequently (eg, once/twice every day). | Irregular sampling: resample data into time bins defined by the user input. Different recording frequencies: handle “frequent” and “non-frequent” variables differently (determined by a user-defined threshold). |
| After resampling the data according to some temporal granularity: multiple recordings within a time bin; and not every time bin will have a recording (missing values) | Multiple (potentially different) heart rate values within an hour; temperature measurements were interrupted when a patient was transferred between ICU wards. | Multiple recordings per time bin: use the most recent recording; calculate summary statistics for “frequent” variables. Missing values: imputation with carry-forward; keep track of “presence mask” and “delta time” (how long the value has been imputed). |
| High-dimensional feature space: some features are rarely recorded or nearly constant; some features are correlated or duplicated | Data extracted from the EHR typically contain hundreds, if not thousands, of variables, including medications, labs, CPT codes, etc. | Feature selection: filter out potentially uninformative features; combine duplicate features into a single feature, renaming the features where appropriate. |
Note: bpm: beats per minute; CPT: Current Procedural Terminology; EHR: electronic health record; FIDDLE: Flexible Data-Driven Pipeline; ICD-9: International Classification of Diseases, Ninth Edition; ICU: intensive care unit; IV: intravenous.
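The resampling and missing-value handling listed in the table can be illustrated for a single variable of a single ID. This is a hedged sketch rather than FIDDLE's code: the function name and the parameters `n_bins` and `dt` are hypothetical, and reporting a delta time of 0 before the first recording is an assumption of the example. Note that the Figure 2 caption indicates variables not deemed “frequent” are not imputed, so this sketch applies to the imputed case only.

```python
def resample_with_carry_forward(records, n_bins, dt):
    """Resample (t, value) records into `n_bins` bins of width `dt`.

    Returns three lists of length n_bins:
    values (carried forward), presence mask, and delta time since last recording.
    """
    latest = [None] * n_bins
    for t, value in sorted(records):
        b = int(t // dt)
        if 0 <= b < n_bins:
            latest[b] = value  # later records overwrite earlier ones: most recent wins

    values, mask, delta = [], [], []
    last_value, last_bin = None, None
    for b in range(n_bins):
        observed = latest[b] is not None
        if observed:
            last_value, last_bin = latest[b], b
        values.append(last_value)          # None until the first recording
        mask.append(1 if observed else 0)  # 1 = recorded in this bin, 0 = imputed
        # Assumption: delta is 0 before any recording exists.
        delta.append((b - last_bin) * dt if last_bin is not None else 0)
    return values, mask, delta

# Two heart rate values fall in the first hour; hours 1-3 h have none.
values, mask, delta = resample_with_carry_forward(
    [(0.2, 72), (0.7, 75), (3.1, 80)], n_bins=4, dt=1.0
)
```

Keeping the mask and delta alongside the imputed value lets a downstream model distinguish a fresh measurement from one carried forward for several bins.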
Summary of notation in user-defined arguments of FIDDLE
| Argument | Description |
|---|---|
| | A positive number specifying the time of prediction; |
| | A positive number specifying the temporal granularity (eg, hourly vs daily) at which to resample the time-dependent data. The unit of |
| | A value between 0 and 1 specifying the threshold for the pre-filter step. |
| | A value between 0 and 1 specifying the threshold for the post-filter step. |
| | A positive number specifying the threshold, in terms of the average number of measurements per time window, at which we deem a variable “frequent” (for which summary statistics will be calculated). |
| | A set of |
| discretize | A Boolean flag (default value: True) specifying whether features with numerical values are kept as raw values or discretized into binary features. |
| discretization_encoding | A string specifying how numerical values are encoded into binary features after discretization. Possible values are: “one-hot” (default) and “ordinal.” This argument is ignored and should not be used when discretize=False. |
Note: FIDDLE: Flexible Data-Driven Pipeline.
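To make the `discretize` and `discretization_encoding` options concrete, the sketch below encodes a numerical value into binary features using quintile bins. The bin edges are taken from the insulin dosage example reported for the 12-hour ARF cohort; the helper names are hypothetical, and the cumulative ("thermometer") form used for the ordinal encoding is an assumption, since the text does not spell out the exact ordinal scheme.

```python
from bisect import bisect_left

# Upper edges of Q1..Q4 for insulin dosage (units); Q5 is everything above 8.
UPPER_EDGES = [2, 3, 4, 8]

def bin_index(value, upper_edges):
    """Index of the bin containing `value`; bin k covers (edge[k-1], edge[k]]."""
    return bisect_left(upper_edges, value)

def one_hot(k, n_bins):
    """Exactly one bit set: the bin the value falls into."""
    return [1 if i == k else 0 for i in range(n_bins)]

def ordinal(k, n_bins):
    """Assumed thermometer encoding: all bits up to and including bin k are set."""
    return [1 if i <= k else 0 for i in range(n_bins)]

def encode(value, encoding="one-hot"):
    k = bin_index(value, UPPER_EDGES)
    n_bins = len(UPPER_EDGES) + 1
    return one_hot(k, n_bins) if encoding == "one-hot" else ordinal(k, n_bins)

dose_one_hot = encode(3)                      # 3 units falls in Q2 (>2, <=3)
dose_ordinal = encode(3, encoding="ordinal")
```

Under the one-hot scheme neighboring bins are unrelated features, whereas the cumulative form preserves the ordering of the bins, which can matter for linear models.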
Symbols used to describe FIDDLE’s implementation
| Symbol | Shape | Description |
|---|---|---|
| | – | The number of examples. |
| | – | The number of time bins, calculated as |
| | – | The number of input variables that are time-invariant/time-dependent after the pre-filter step. |
| | – | The dimensionalities of time-invariant/time-dependent features after the transform step and before the post-filter step. |
| | – | The final dimensionalities of time-invariant/time-dependent features. |
| | | Data tables containing values of raw time-invariant/time-dependent values after the pre-filter step. |
| | | Tensors containing the time-invariant/time-dependent features for all |
| | | Tensors containing the final time-invariant/time-dependent features for all |
Note: FIDDLE: Flexible Data-Driven Pipeline.
Figure 2. Examples of FIDDLE input and output for time-invariant and time-dependent data. In this example, each ID represents a patient (an example). Timestamps are recorded in hours. Only the subset of input/output relevant for illustration is shown. The bins for numerical variables and the categories for categorical variables are automatically determined from the entire input data table (not shown). (A) Time-invariant input data and output features for Patient 1. Patient 1 is female with an age of 55. The feature “sex = female” is dropped in the post-filter step because it is perfectly correlated with “sex = male.” (B) Time-dependent input data and output features for Patient 2. At t = 1.5 h, Patient 2 had an insulin administration of 3 units via drug push. No imputation is done for 2–4 h, since the 3 variables related to insulin are not considered “frequent,” resulting in 0s in the output features for the corresponding time bins. FIDDLE: Flexible Data-Driven Pipeline; ID: unique identifier; IV: intravenous.
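The post-filter behavior in panel A, where one of two perfectly correlated binary features is dropped, can be sketched as below. This is an illustrative approximation only: the function name, the `threshold` parameter, and the specific "nearly constant" rule are assumptions, and the sketch simply drops redundant columns rather than merging and renaming them.

```python
def post_filter(features, threshold):
    """Filter binary features given as {name: [0/1 value per ID]}.

    Drops features that are nearly constant (assumed rule: fraction of 1s
    below `threshold` or above 1 - threshold) and features that are identical
    to, or the exact complement of, a feature already kept.
    """
    kept = {}
    for name, column in features.items():
        frac_ones = sum(column) / len(column)
        if frac_ones < threshold or frac_ones > 1 - threshold:
            continue  # nearly constant, likely uninformative
        complement = [1 - v for v in column]
        if any(other == column or other == complement for other in kept.values()):
            continue  # perfectly correlated with a kept feature
        kept[name] = column
    return kept

features = {
    "sex = male": [1, 0, 0, 1],
    "sex = female": [0, 1, 1, 0],  # exact complement of "sex = male"
    "age in Q5": [0, 0, 1, 0],
    "rare flag": [0, 0, 0, 0],     # constant
}
kept = post_filter(features, threshold=0.1)
```

For non-constant binary columns, a correlation of exactly 1 or -1 is equivalent to the columns being identical or complementary, which is why a direct comparison suffices here.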
Summary of MIMIC-III tables used in our analysis
| Table name | Description | Example variables |
|---|---|---|
| | Information on unique patients | Age, Sex |
| | Information on unique hospitalizations | Admission type, Admission location |
| | Information on unique ICU stays | Care unit, Ward ID, Admission-to-ICU time |
| | Charted data, including vital signs, and other information relevant to patients’ care | Heart rate, Pain location, Daily weight |
| | Laboratory test results from the hospital database | Lactate, WBC |
| | Fluid intake administered, including dosage and route (eg, oral or intravenous) | NaCl 0.45%, Whole blood |
| | Fluid output during the ICU stay | OR urine, Stool |
| | Patients’ procedures during the ICU stay | CT scan, X-ray |
| | Microbiology specimen from hospital database | Sputum |
| | Documentation of dates and times of certain events | Last dialysis, Pregnancy due |
Note: We used all structured tables that pertain to patient health.
CT: computed tomography; ICU: intensive care unit; ID: unique identifier; OR: operating room; WBC: white blood cell.
Summary of eICU tables used in our analysis
| Table name | Description | Example variables |
|---|---|---|
| | Information on unique patients, hospitalizations, and ICU stays | Age, Sex, Hospital/ward ID |
| | Vital signs measured through bedside monitors or invasively | Temperature, End Tidal CO2 |
| | Laboratory tests | CPK, troponin - I |
| | Active medication orders, the intake of drug through infusions, and intake/output of fluids | Morphine dosage, Dialysis total |
| | Microbiology cultures taken from patients | Culture site (wound), Organism |
| | Documentation of physician/nurse assessment | Abdominal pain, Psychological status, Respiratory rate |
| | Relevant past medical history | Transplant, AIDS |
| | Results of physical exam (structured) | Blood pressure, Verbal score |
| | Respiratory care data | Airway position, Vent details |
| | Structured data documenting specific, active treatments | Thrombolytics |
Note: We used all structured tables that pertain to patient health.
AIDS: acquired immunodeficiency syndrome; CPK: creatine phosphokinase; ICU: intensive care unit; ID: unique identifier.
Figure 3. Harutyunyan et al definitions of the study cohorts. For each data set (MIMIC-III and eICU), we defined 5 prediction tasks, each with a distinct study cohort: in-hospital mortality at 48 h, ARF at 4 h, ARF at 12 h, shock at 4 h, and shock at 12 h. ARF: acute respiratory failure; ICU: intensive care unit; PEEP: positive end-expiratory pressure.
Figure 4. Dimensionality of feature vectors for each prediction task on MIMIC-III. After applying FIDDLE to the MIMIC-III study cohorts, an ICU visit is represented by time-invariant features and time-dependent features, both of which are high-dimensional. Though the number of time-invariant features is similar across tasks, the number of time-dependent features varies because more data (likely corresponding to more variables) are collected for a later prediction time. FIDDLE: Flexible Data-Driven Pipeline; ICU: intensive care unit.
Examples of time-invariant features extracted by FIDDLE on the 12-hour ARF cohort for MIMIC-III
| Time-invariant features |
|---|
| Age in Q1 (18–51) |
| Age in Q2 (52–62) |
| Age in Q3 (63–71) |
| Age in Q4 (72–80) |
| Age in Q5 (>80) |
| Sex = Female |
| ICU Location ID = 12 |
| ICU Location ID = 15 |
| ICU Location ID = 23 |
| ICU Location ID = 33 |
| ICU Location ID = 52 |
| ICU Location ID = 57 |
| Hospital admission source: clinic referral |
| Hospital admission source: transfer from hospital |
| Hospital admission source: from emergency room |
Note: ARF: acute respiratory failure; FIDDLE: Flexible Data-Driven Pipeline; ICU: intensive care unit; ID: unique identifier; Q: quintile.
Examples of time-dependent features extracted by FIDDLE on the 12-hour ARF cohort for MIMIC-III
| Time-dependent features |
|---|
| At 0–1 h, insulin dosage in Q1 (≤2 units) |
| At 0–1 h, insulin dosage in Q2 (>2 units, ≤3 units) |
| At 0–1 h, insulin dosage in Q3 (>3 units, ≤4 units) |
| At 0–1 h, insulin dosage in Q4 (>4 units, ≤8 units) |
| At 0–1 h, insulin dosage in Q5 (>8 units) |
| At 0–1 h, insulin route = intravenous |
| At 0–1 h, insulin route = drug push |
| At 1–2 h, insulin dosage in Q1 (≤2 units) |
| At 1–2 h, insulin dosage in Q2 (>2 units, ≤3 units) |
| At 1–2 h, insulin dosage in Q3 (>3 units, ≤4 units) |
| At 1–2 h, insulin dosage in Q4 (>4 units, ≤8 units) |
| At 1–2 h, insulin dosage in Q5 (>8 units) |
| At 1–2 h, insulin route = intravenous |
| At 1–2 h, insulin route = drug push |
Note: ARF: acute respiratory failure; FIDDLE: Flexible Data-Driven Pipeline; Q: quintile.
Summary of performance on MIMIC-III for all FIDDLE-based models, compared to MIMIC-Extract
| Task | In-hospital mortality, 48 h | ARF, 4 h | ARF, 12 h | Shock, 4 h | Shock, 12 h | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | AUROC | AUPR | AUROC | AUPR | AUROC | AUPR | AUROC | AUPR | AUROC | AUPR | |
|
|
|
| 0.445 (0.358–0.540) | 0.777 (0.752–0.803) | 0.604 (0.561–0.648) | 0.723 (0.683–0.759) | 0.250 (0.200–0.313) | 0.796 (0.771–0.821) | 0.505 (0.454–0.557) | 0.748 (0.712–0.784) | 0.242 (0.193–0.310) |
|
| 0.852 (0.821–0.882) |
|
|
|
|
|
|
|
|
| |
|
| 0.851 (0.820–0.879) | 0.439 (0.353–0.529) | 0.788 (0.763–0.814) | 0.633 (0.591–0.672) | 0.722 (0.684–0.758) | 0.258 (0.207–0.320) | 0.798 (0.773–0.824) | 0.520 (0.471–0.572) | 0.741 (0.704–0.778) | 0.247 (0.198–0.317) | |
|
| 0.837 (0.803–0.867) | 0.441 (0.358–0.523) | 0.796 (0.770–0.822) | 0.634 (0.590–0.675) | 0.700 (0.661–0.736) | 0.229 (0.184–0.286) | 0.801 (0.778–0.825) | 0.513 (0.463–0.562) | 0.753(0.717–0.791) | 0.248 (0.199–0.313) | |
|
|
| 0.856(0.821–0.888) | 0.444(0.357–0.545) | 0.817(0.792–0.839) | 0.657(0.614–0.696) | 0.757(0.720–0.789) | 0.291(0.236–0.354) | 0.825(0.803–0.846) |
|
| 0.274(0.227–0.338) |
|
| 0.814(0.780–0.847) | 0.357(0.279–0.448) | 0.817(0.795–0.839) | 0.652(0.608–0.690) | 0.760(0.726–0.793) | 0.317(0.255–0.382) | 0.809(0.786–0.833) | 0.516(0.467–0.566) | 0.773(0.740–0.806) | 0.288(0.231–0.355) | |
|
|
|
|
|
| 0.768(0.733–0.800) | 0.294(0.238–0.361) |
| 0.541(0.493–0.589) | 0.791(0.758–0.823) | 0.295(0.239–0.361) | |
|
| 0.868(0.835–0.897) | 0.510(0.411–0.597) |
| 0.664(0.623–0.703) |
|
| 0.824(0.803–0.845) | 0.541(0.497–0.587) |
|
| |
Note: Reported as AUROC and AUPR with 95% CIs in parentheses on the respective held-out test set for the 5 prediction tasks. For each task (column), the bolded results are the best-performing model for either MIMIC-Extract or FIDDLE.
ARF: acute respiratory failure; AUROC: area under the receiver operating characteristics curve; AUPR: area under the precision-recall curve; CI: confidence interval; CNN: convolutional neural networks; FIDDLE: Flexible Data-Driven Pipeline; LR: logistic regression; LSTM: long short-term memory networks; RF: random forest.
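As a rough illustration of how an AUROC with a 95% CI like those in the table could be obtained, the sketch below computes the AUROC from pairwise comparisons and estimates an interval with a percentile bootstrap. The text here does not state how the CIs were computed, so the bootstrap procedure, the function names, and the default of 1000 resamples are all assumptions of this example.

```python
import random

def auroc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUROC (assumed procedure)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined without both classes; resample
        stats.append(auroc(ys, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
point = auroc(y_true, y_score)  # 14 of 16 positive-negative pairs are ordered correctly
lower, upper = bootstrap_ci(y_true, y_score, n_boot=200)
```

The pairwise AUROC is quadratic in the number of examples, which is fine for a sketch but would typically be replaced by a rank-based computation on a full test set.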
Summary of performance on eICU for all FIDDLE-based models
| Task | In-hospital mortality, 48 h | ARF, 4 h | ARF, 12 h | Shock, 4 h | Shock, 12 h | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| Method | AUROC | AUPR | AUROC | AUPR | AUROC | AUPR | AUROC | AUPR | AUROC | AUPR |
|
| 0.824(0.812–0.836) | 0.401(0.374–0.428) | 0.810(0.799–0.821) | 0.269(0.246–0.293) | 0.778(0.763–0.794) | 0.201(0.178–0.225) | 0.846(0.836–0.855) | 0.338(0.314–0.360) | 0.797(0.782–0.811) | 0.187(0.168–0.210) |
|
| 0.787(0.774–0.800) | 0.340(0.314–0.366) | 0.792(0.779–0.803) | 0.236(0.217–0.258) | 0.749(0.734–0.764) | 0.166(0.149–0.187) | 0.810(0.800–0.820) | 0.279(0.258–0.298) | 0.768(0.753–0.783) | 0.152(0.136–0.171) |
|
|
| 0.433(0.404–0.461) | 0.828(0.817–0.839) | 0.276(0.252–0.300) | 0.799(0.784–0.813) | 0.212(0.190–0.236) |
| 0.351(0.327–0.374) | 0.813(0.800–0.826) |
|
|
| 0.841(0.830–0.852) |
|
|
|
|
| 0.853(0.844–0.861) |
|
| 0.199(0.178–0.223) |
Note: Reported as AUROC and AUPR with 95% CI on the respective held-out test set for the 5 prediction tasks.
ARF: acute respiratory failure; AUROC: area under the receiver operating characteristics curve; AUPR: area under the precision-recall curve; CI: confidence interval; CNN: convolutional neural networks; FIDDLE: Flexible Data-Driven Pipeline; LR: logistic regression; LSTM: long short-term memory networks; RF: random forest.
Figure 5. Model performance (with 95% CI) for prediction of ARF at t = 12 h on MIMIC-III, evaluated on the held-out test set (n = 2093). On this task, all 4 FIDDLE-based models exhibited similarly good discriminative and calibration performance. (A) ROC curves and AUROC scores. (B) PR curves and AUPR scores. (C) Calibration plots and Brier scores. ARF: acute respiratory failure; AUROC: area under the receiver operating characteristics curve; AUPR: area under the precision-recall curve; CI: confidence interval; CNN: convolutional neural networks; FIDDLE: Flexible Data-Driven Pipeline; LR: logistic regression; LSTM: long short-term memory networks; PR: precision-recall curve; RF: random forest; ROC: receiver operating characteristics curve.
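The calibration measures named in panel C can be written down briefly. The following is a minimal sketch with hypothetical function names and toy data; the equal-width binning used for the calibration curve is an assumption, as the caption does not describe how the plots were binned.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed event rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[k].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b))
        for b in bins
        if b
    ]

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]
score = brier_score(y_true, y_prob)                # (0.01 + 0.01 + 0.04 + 0.09) / 4
curve = calibration_curve(y_true, y_prob, n_bins=2)
```

A well-calibrated model has points near the diagonal (predicted probability close to observed rate), and a lower Brier score indicates better combined calibration and discrimination.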