Brett K Beaulieu-Jones, Daniel R Lavage, John W Snyder, Jason H Moore, Sarah A Pendergrass, Christopher R Bauer.
Abstract
BACKGROUND: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results.
Keywords: clinical laboratory test results; electronic health records; imputation; missing data
Year: 2018 PMID: 29475824 PMCID: PMC5845101 DOI: 10.2196/medinform.8960
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Two general paradigms are commonly used to describe missing data. In the first, missing data are considered ignorable if the probability of observing a variable has no relation to the value of the observed variable and are considered nonignorable otherwise. The second paradigm divides missingness into 3 categories: missing completely at random (MCAR: the probability of observing a variable is not dependent on its value or other observed values), missing at random (MAR: the probability of observing a variable is not dependent on its own value after conditioning on other observed variables), and missing not at random (MNAR: the probability of observing a variable is dependent on its value, even after conditioning on other observed variables). The x-axis indicates the extent to which a given value being observed depends on the values of other observed variables. The y-axis indicates the extent to which a given value being observed depends on its own value.
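The three categories can be made concrete with a small simulation. The following is a minimal sketch, not the procedure used in the study: the function name `simulate_missingness`, the two-column layout, the median threshold, and the `rate` parameter are all assumptions chosen for illustration.

```python
import random

def simulate_missingness(rows, mechanism, rate=0.3, seed=0):
    """Mask the second value of each (x, y) pair under one missingness mechanism.

    x is always observed; y is replaced by None when it is "missing".
    The median split used for MAR and MNAR is an illustrative assumption.
    """
    rng = random.Random(seed)
    x_med = sorted(r[0] for r in rows)[len(rows) // 2]
    y_med = sorted(r[1] for r in rows)[len(rows) // 2]
    high = min(1.0, 2 * rate)  # masking probability in the affected half
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate                        # independent of x and y
        elif mechanism == "MAR":
            p = high if x > x_med else 0.0  # depends only on the observed x
        elif mechanism == "MNAR":
            p = high if y > y_med else 0.0  # depends on the masked value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((x, None if rng.random() < p else y))
    return out
```

Under this sketch, MAR masking can be fully explained by the observed column, whereas MNAR masking removes only high values of the masked column, so the remaining observed values are a biased sample.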
Figure 2. Summary of missing data across 143 clinical laboratory measures. (A) After ranking the clinical laboratory measures by the number of total results, the percentage of patients missing a result for each test was plotted (red points). At each rank, the percentage of complete cases for all tests of equal or lower rank was also plotted (blue points). Only variables with a rank ≤75 are shown. The vertical bar indicates the 28 tests that were selected for further analysis. (B) The full distribution of patient median ages is shown in blue, and the fraction of individuals in each age group who had a complete set of observations for tests 1-28 is shown in red. (C) Within the 28 laboratory tests that were selected for imputation analyses, the mean number of missing tests is depicted as a function of age. (D) Within the 28 laboratory tests that were selected for imputation, the mean number of missing tests is depicted as a function of body mass index (BMI). (E) Accuracy of a random forest predicting the presence or absence of all 143 laboratory tests. AUROC: area under the receiver operating characteristic curve. (F) Accuracy of a random forest predicting the presence or absence of the top 28 laboratory tests, by Logical Observation Identifiers Names and Codes (LOINC).
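The two quantities plotted in panel A can be sketched as follows. This is an illustrative sketch only: the function name `missingness_by_rank` and the record layout (one dict per patient, keyed by LOINC code) are assumptions, and tests are ranked here by the number of patients with a result rather than by the raw count of results.

```python
def missingness_by_rank(records, tests):
    """Rank tests by coverage and report per-test and cumulative missingness.

    records: list of dicts mapping a test code to a value; a missing test is
             either absent from the dict or stored as None (assumed layout).
    Returns a list of (rank, test, pct_missing, pct_complete_cases) tuples.
    """
    n = len(records)

    def has(rec, test):
        return rec.get(test) is not None

    counts = {t: sum(has(r, t) for r in records) for t in tests}
    ranked = sorted(tests, key=lambda t: counts[t], reverse=True)

    summary = []
    complete = [True] * n  # complete for every test ranked so far
    for rank, test in enumerate(ranked, start=1):
        complete = [c and has(r, test) for c, r in zip(complete, records)]
        pct_missing = 100.0 * (n - counts[test]) / n
        pct_complete = 100.0 * sum(complete) / n
        summary.append((rank, test, pct_missing, pct_complete))
    return summary
```

The cumulative complete-case percentage can only fall as more tests are included, which is why a cutoff rank has to trade the number of variables against the number of usable patients.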
Logical Observation Identifiers Names and Codes (LOINC) and descriptions of the most frequently ordered clinical laboratory measurements. The assays are ranked from the most common to the least.
| LOINC | Description |
|---|---|
| 718-7 | Hemoglobin [Mass/volume] in Blood |
| 4544-3 | Hematocrit [Volume Fraction] of Blood by Automated count |
| 787-2 | Erythrocyte mean corpuscular volume [Entitic volume] by Automated count |
| 786-4 | Erythrocyte mean corpuscular hemoglobin concentration [Mass/volume] by Automated count |
| 785-6 | Erythrocyte mean corpuscular hemoglobin [Entitic mass] by Automated count |
| 6690-2 | Leukocytes [#/volume] in Blood by Automated count |
| 789-8 | Erythrocytes [#/volume] in Blood by Automated count |
| 788-0 | Erythrocyte distribution width [Ratio] by Automated count |
| 32623-1 | Platelet mean volume [Entitic volume] in Blood by Automated count |
| 777-3 | Platelets [#/volume] in Blood by Automated count |
| 2345-7 | Glucose [Mass/volume] in Serum or Plasma |
| 2160-0 | Creatinine [Mass/volume] in Serum or Plasma |
| 2823-3 | Potassium [Moles/volume] in Serum or Plasma |
| 3094-0 | Urea nitrogen [Mass/volume] in Serum or Plasma |
| 2951-2 | Sodium [Moles/volume] in Serum or Plasma |
| 2075-0 | Chloride [Moles/volume] in Serum or Plasma |
| 2028-9 | Carbon dioxide, total [Moles/volume] in Serum or Plasma |
| 17861-6 | Calcium [Mass/volume] in Serum or Plasma |
| 1743-4 | Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma by With P-5'-P |
| 30239-8 | Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma by With P-5'-P |
| 1975-2 | Bilirubin.total [Mass/volume] in Serum or Plasma |
| 2885-2 | Protein [Mass/volume] in Serum or Plasma |
| 10466-1 | Anion gap 3 in Serum or Plasma |
| 751-8 | Neutrophils [#/volume] in Blood by Automated count |
| 2093-3 | Cholesterol [Mass/volume] in Serum or Plasma |
| 2571-8 | Triglyceride [Mass/volume] in Serum or Plasma |
| 2085-9 | Cholesterol in HDL[a] [Mass/volume] in Serum or Plasma |
| 13457-7 | Cholesterol in LDL[b] [Mass/volume] in Serum or Plasma by calculation |

[a] HDL: high-density lipoprotein.
[b] LDL: low-density lipoprotein.
Figure 3. Area under the receiver operating characteristic curve (AUROC) of a random forest predicting whether data will be present or missing. (A) Missing completely at random simulation. (B) Missing at random simulation. (C) Missing not at random simulation.
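The AUROC metric itself can be computed from predicted scores and missingness labels without any modeling library. The sketch below shows only the metric, using the pairwise (Mann-Whitney) definition; it does not reproduce the random forest, and the function name `auroc` is an assumption.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case (label 1,
    e.g. "value is missing") receives a higher score than a randomly chosen
    negative case, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value near 0.5 means the scores carry no information about which values are missing, as would be expected when missingness is completely at random; values near 1 mean missingness is highly predictable from the inputs.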
Figure 4. Imputation accuracy measured by root mean square error (RMSE) across simulations 1-3. (A) Missing completely at random (MCAR). (B) Missing at random (MAR). (C) Missing not at random (MNAR). FI: fancyimpute; KNN: k-nearest neighbors; MICE: Multivariate Imputation by Chained Equations; pmm: predictive mean matching; RF: random forest; SVD: singular value decomposition.
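Scoring an imputation method by RMSE requires known true values, which is why values are masked in simulation and then compared with their imputed replacements. The sketch below illustrates the scoring step with column-mean imputation as a stand-in baseline; the helper names (`rmse`, `mean_impute`, `masked_rmse`) are assumptions, and none of the methods named in the caption is reimplemented here.

```python
import math

def rmse(predicted, truth):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth))

def mean_impute(column):
    """Replace None entries with the mean of the observed entries (baseline only)."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def masked_rmse(truth, masked, imputed):
    """RMSE restricted to the positions that were masked (None in `masked`)."""
    idx = [i for i, v in enumerate(masked) if v is None]
    return rmse([imputed[i] for i in idx], [truth[i] for i in idx])

# Usage: mask two known values, impute, then score only the masked positions.
truth = [1.0, 2.0, 3.0, 4.0]
masked = [1.0, None, 3.0, None]
score = masked_rmse(truth, masked, mean_impute(masked))
```

Restricting the error to masked positions matters: including the untouched observed values would dilute the error and make every method look better than it is.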
Figure 5. Imputation root mean square error (RMSE) for a subset of 10,000 patients from simulation 4. A total of 12 imputation methods were tested (x-axis), and each color corresponds to a Logical Observation Identifiers Names and Codes (LOINC) code. The black line shows the theoretical error from random sampling. FI: fancyimpute; KNN: k-nearest neighbors; MICE: Multivariate Imputation by Chained Equations; pmm: predictive mean matching; RF: random forest; SVD: singular value decomposition.
Figure 6. Assessment of multiple imputation for each method. Using simulation 4, missing values were imputed multiple times with each method. The x-axes show the root mean square error (RMSE) between the imputed data and the observed values. The y-axes show the RMSE between multiple imputations of the same data. The axis scales vary between panels to better show the range of variation. The laboratory tests are indicated by the color of the points. The black diagonal line represents unity (y=x). Panels are ordered by each method’s mean deviation (MD) from unity, indicated in the top left corner of each panel. In the last 7 panels, the unity line is not visible because the variation between multiple imputations was close to zero. FI: fancyimpute; KNN: k-nearest neighbors; MICE: Multivariate Imputation by Chained Equations; pmm: predictive mean matching; RF: random forest; SVD: singular value decomposition.
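The two axes of each panel can be sketched for a single laboratory test as follows. This is a hedged illustration: the function name `imputation_spread` is an assumption, and averaging over imputations and over pairs of imputations is one plausible reading of the caption, not a confirmed description of how the figure was computed.

```python
import math
from itertools import combinations

def rmse(a, b):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def imputation_spread(truth, imputations):
    """Return (error, between) for repeated imputations of the same masked values.

    truth:       true values at the masked positions
    imputations: list of imputed-value lists, one per repeated imputation
    error:       mean RMSE of each imputation against the truth (x-axis)
    between:     mean RMSE over all pairs of imputations (y-axis)
    """
    if len(imputations) < 2:
        raise ValueError("need at least two imputations to measure spread")
    error = sum(rmse(imp, truth) for imp in imputations) / len(imputations)
    pairs = list(combinations(imputations, 2))
    between = sum(rmse(a, b) for a, b in pairs) / len(pairs)
    return error, between
```

A deterministic method returns the same values on every run, so `between` is zero regardless of `error`; this corresponds to the panels in which the points lie far below the unity line.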