Nina Cesare, Lawrence P O Were.
Abstract
OBJECTIVE: Electronic health records (EHR) hold promise for conducting large-scale analyses linking individual characteristics to health outcomes. However, these data often contain a large number of missing values at both the patient and visit level due to variation in data collection across facilities, providers, and clinical need. This study proposes a stepwise framework for imputing missing values within a visit-level EHR dataset that combines informative missingness and conditional imputation in a scalable manner that may be parallelized for efficiency.
Keywords: Big data; Electronic medical records; HIV; Imputation
Year: 2022 PMID: 35177096 PMCID: PMC8851714 DOI: 10.1186/s13104-022-05911-w
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Fig. 1 Visual representation of workflow, noting unit of analysis shifts
Variables imputed using value dependencies
| Dependency 1 | Dependency 2 | Variable of interest | Original variation: No | Original variation: Yes | Imputed variation: No | Imputed variation: Yes |
|---|---|---|---|---|---|---|
| On ARV | NA | Change in ARV regimen | 0 | 1375 | 386,984 | 1375 |
| On ARV | Change in ARV regimen | ARV stop: Completed T-pMTCT | 0 | 2588 | 384,488 | 2588 |
| On ARV | Change in ARV regimen | ARV stop/change due to regimen failure | 0 | 352 | 386,937 | 352 |
| On ARV | Change in ARV regimen | ARV stop/change due to toxicity | 0 | 1696 | 385,509 | 1696 |
| On ARV | Change in ARV regimen | ARV stop/change due to weight change | 0 | 10 | 386,974 | 10 |
| On ARV | Change in ARV regimen | ARV stop/change due to other reason | 0 | 2002 | 385,266 | 2002 |
| On ARV | Change in ARV regimen | ARV stop/change due to new TB | 0 | 43 | 386,942 | 43 |
| On ARV | Change in ARV regimen | ARV stop/change due to non-adherence | 0 | 358 | 386,637 | 358 |
| On ARV | Change in ARV regimen | ARV stop/change due to out of stock | 0 | 63 | 386,927 | 63 |
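The counts above ("No" rising from 0 to several hundred thousand while "Yes" stays fixed) are consistent with a rule that fills a missing value with "No" whenever the variable it depends on is "No". The record does not spell out the exact rule, so the following is only a minimal sketch under that assumption. The field names (`on_arv`, `arv_change`, `stop_toxicity`) and the helper `impute_from_dependency` are hypothetical.

```python
from typing import Dict, List, Optional

Visit = Dict[str, Optional[str]]  # None marks a missing value


def impute_from_dependency(visit: Visit, parent: str, child: str) -> Visit:
    """Return a copy of the visit in which a missing child value is set to
    "No" when its parent is "No" (assumed rule, not taken from the source)."""
    out = dict(visit)
    if out.get(child) is None and out.get(parent) == "No":
        out[child] = "No"
    return out


# Hypothetical dependency chain: On ARV -> change in regimen -> stop reason.
CHAIN = [("on_arv", "arv_change"), ("arv_change", "stop_toxicity")]


def impute_visit(visit: Visit) -> Visit:
    # Apply rules in order so a value imputed upstream can feed the next rule.
    for parent, child in CHAIN:
        visit = impute_from_dependency(visit, parent, child)
    return visit


visits: List[Visit] = [
    {"on_arv": "No", "arv_change": None, "stop_toxicity": None},
    {"on_arv": "Yes", "arv_change": "Yes", "stop_toxicity": "Yes"},
    {"on_arv": "Yes", "arv_change": None, "stop_toxicity": None},
]
imputed = [impute_visit(v) for v in visits]
```

In this sketch the first visit has both missing values filled with "No", the second is unchanged, and the third stays missing because its parent value gives no information. Each visit is processed independently, so a rule of this shape could be applied to chunks of visits in parallel.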
Fig. 2 Patient weight and diastolic BP over time for several patients among a random subset of 100 patients. Red indicates observed weight, and blue indicates imputed weight
Fig. 3 Percent reduction in missing values pre- and post-imputation among focal variables
NIHF prediction classifier performance
| | Yes | No |
|---|---|---|
| Yes | 791 | 780 |
| No | 333 | 19,608 |
| Accuracy | 0.948 | |
| Precision | 0.504 | |
| Recall | 0.704 | |
| F1 | 0.587 | |
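The summary metrics can be re-derived from the four cell counts. The sketch below assumes that rows are predicted classes and columns are actual classes, which is the reading under which the reported precision and recall match the counts; the variable names are illustrative.

```python
# Cell counts from the confusion matrix above
# (assumed layout: rows = predicted, columns = actual).
tp, fp = 791, 780      # predicted Yes: actual Yes, actual No
fn, tn = 333, 19608    # predicted No:  actual Yes, actual No

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are recovered
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1)]:
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these reproduce the reported 0.948, 0.504, 0.704, and 0.587. The gap between accuracy and precision reflects class imbalance: negatives make up most of the 21,512 cases.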
Predictive importance of key variables
| Variable | Generalized cross-validation (GCV) estimate of error |
|---|---|
| Age at first pregnancy | 100 |
| Body Mass Index (BMI) | 69.9913 |
| Viral load | 53.1357 |
| Age at first pregnancy | 29.11793 |
| Years of school | 28.06922 |
| Children under 18 months: 1 | 6.261627 |
| Travel time to clinic: 30–60 min | 5.991079 |
| WHO weight loss stage: 3 | 5.981254 |
| Place of delivery 2 | 5.827017 |
| Pregnancy outcome: unknown/not documented | 5.822529 |
| WHO weight loss stage: 2 | 5.179188 |
| Visits urban clinic | 5.064835 |
| On ARV | 5.025331 |
| Travel time to clinic: 1–2 h | 4.986507 |
| Delivery assistance | 4.309087 |
| Place of delivery 4 | 4.232822 |
| Delivery assistance 4 | 3.88988 |
| Travel time to clinic: > 2 h | 3.462557 |
| WHO weight loss stage: 4 | 2.662623 |
| Delivery assistance 5 | 1.985101 |