D J Albers, N Elhadad, J Claassen, R Perotte, A Goldstein, G Hripcsak.
Abstract
We study the question of how to represent or summarize raw laboratory data taken from an electronic health record (EHR) using parametric model selection to reduce or cope with biases induced through clinical care. It has been previously demonstrated that the health care process (Hripcsak and Albers, 2012, 2013), as defined by measurement context (Hripcsak and Albers, 2013; Albers et al., 2012) and measurement patterns (Albers and Hripcsak, 2010, 2012), can influence how EHR data are distributed statistically (Kohane and Weber, 2013; Pivovarov et al., 2014). We construct an algorithm, PopKLD, which is based on information criterion model selection (Burnham and Anderson, 2002; Claeskens and Hjort, 2008) and is intended to reduce and cope with health care process biases and to produce an intuitively understandable continuous summary. The PopKLD algorithm can be automated and is designed to be applicable in high-throughput settings; for example, the output of the PopKLD algorithm can be used as input for phenotyping algorithms. Moreover, we develop the PopKLD-CAT algorithm, which transforms the continuous PopKLD summary into a categorical summary useful for applications that require categorical data, such as topic modeling. We evaluate our methodology in two ways. First, we apply the method to laboratory data collected in two different health care contexts, primary versus intensive care. We show that the PopKLD preserves known physiologic features in the data that are lost when summarizing the data using more common laboratory data summaries such as mean and standard deviation. Second, for three disease-laboratory measurement pairs, we perform a phenotyping task: we use the PopKLD and PopKLD-CAT algorithms to define high and low values of the laboratory variable that are used for defining a disease state. We then compare the relationship between the PopKLD-CAT summary disease predictions and the same predictions using empirically estimated mean and standard deviation to a gold standard generated by clinical review of patient records. We find that the PopKLD laboratory data summary is substantially better at predicting disease state. The PopKLD and PopKLD-CAT algorithms are not meant to be used as phenotyping algorithms, but we use the phenotyping task to show what information can be gained when using a more informative laboratory data summary. In the process of evaluating our method, we show that the different clinical contexts and laboratory measurements necessitate different statistical summaries. Similarly, leveraging the principle of maximum entropy, we argue that while some laboratory data only have sufficient information to estimate a mean and standard deviation, other laboratory data captured in an EHR contain substantially more information than can be captured in higher-parameter models.
Keywords: Electronic health record; Kullback-Leibler divergence; Laboratory tests; Summary statistic; phenotyping
Year: 2018 PMID: 29369797 PMCID: PMC5856130 DOI: 10.1016/j.jbi.2018.01.004
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317
Fig. 1. A graphical picture of the PopKLD algorithm for creating a statistical summary of patient laboratory data.
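The idea behind this kind of summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the candidate families (normal and log-normal only), the histogram-based estimate of the KL-divergence, the bin count, and every function name here are assumptions made for illustration. The sketch fits each candidate family to pooled population values, scores it by the KL-divergence from the empirical distribution to the fitted model, and keeps the family with the smallest divergence.

```python
import math
from statistics import NormalDist, fmean, pstdev


def fit_normal(xs):
    """Maximum-likelihood normal fit; returns (parameters, CDF)."""
    dist = NormalDist(fmean(xs), pstdev(xs))
    return {"mean": dist.mean, "sd": dist.stdev}, dist.cdf


def fit_lognormal(xs):
    """Maximum-likelihood log-normal fit (values must be positive)."""
    logs = [math.log(x) for x in xs]
    dist = NormalDist(fmean(logs), pstdev(logs))

    def cdf(x):
        return dist.cdf(math.log(x)) if x > 0 else 0.0

    return {"log_mean": dist.mean, "log_sd": dist.stdev}, cdf


# Hypothetical candidate set; the table of selected models lists many more families.
CANDIDATES = {"normal": fit_normal, "lognormal": fit_lognormal}


def kl_to_model(xs, cdf, bins=30):
    """Histogram estimate of KL(empirical || model) over equal-width bins.

    Assumes the values are not all identical (bin width must be positive).
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    kl = 0.0
    for i, count in enumerate(counts):
        if count == 0:
            continue  # empty bins contribute nothing to the sum
        p = count / len(xs)
        # Model probability mass in the bin, floored to avoid log(0).
        q = max(cdf(lo + (i + 1) * width) - cdf(lo + i * width), 1e-12)
        kl += p * math.log(p / q)
    return kl


def select_model(population_values, candidates=CANDIDATES):
    """Return the family with the smallest KL-divergence, plus all scores."""
    scores = {}
    for name, fit in candidates.items():
        _, cdf = fit(population_values)
        scores[name] = kl_to_model(population_values, cdf)
    return min(scores, key=scores.get), scores
```

Under these assumptions, a per-patient summary would be the parameters of the selected family fitted to that patient's values alone, for example `CANDIDATES[best](patient_values)[0]`.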
Fig. 2. A graphical picture of the PopKLD-CAT algorithm that translates the continuous PopKLD patient laboratory data summaries into categorical variables that can be used in situations where categorical variables are necessary, such as topic modeling.
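One plausible way to turn a continuous per-patient summary into a categorical variable is to bin it by population deciles, which matches the "1st decile" and "10th decile" cohort labels used in the clinical evaluation table. The sketch below is an assumption about that step, not the published PopKLD-CAT procedure; the function names and the nearest-rank definition of the decile edges are hypothetical.

```python
from bisect import bisect_right


def decile_edges(summaries):
    """Nine cut points that split the population's summary values into deciles."""
    ordered = sorted(summaries)
    n = len(ordered)
    return [ordered[(k * n) // 10] for k in range(1, 10)]


def to_decile(value, edges):
    """Map one patient's continuous summary to a category from 1 to 10."""
    return bisect_right(edges, value) + 1
```

For example, with one summary value per patient (such as a fitted location parameter), `to_decile(value, decile_edges(all_values))` labels patients in the lowest tenth as category 1 and those in the highest tenth as category 10.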
Clinical evaluation of the PopKLD method for selecting cohorts. For three diseases (diabetes, chronic kidney disease, and pancreatitis) and three related laboratory measurements (glucose, creatinine, and lipase), we compare the presence/absence of a disease identified by manual review with the presence/absence of a disease identified using output from the PopKLD algorithm. We want to see a positive correlation between a low KL-divergence and a high cluster purity because this implies that the model selected by the PopKLD method separated patients in ways useful for identifying phenotypes and cohorts. Generally, the PopKLD method worked well at identifying presence of a disease compared with other laboratory-data-based metrics. Most metrics worked well at identifying absence of a disease compared with presence of a disease, a result that is expected because the low outlier error indicates absence whereas high outlier errors produce false positives.
Clinical evaluation of cluster purity of PopKLD selected cohorts

| Disease state | Model-defined cohort | KL-divergence | Purity (Proportion) |
|---|---|---|---|
| Diabetes | GEV() 10th decile, shape >0 | | |
| Diabetes | GEV() 10th decile, shape <0 | | |
| Diabetes | logn() 10th decile | 4.3 | |
| Diabetes | mean and standard deviation 10th decile | – | |
| No Diabetes | GEV() 1st decile, shape >0 | | |
| No Diabetes | GEV() 1st decile, shape <0 | | |
| No Diabetes | logn() 1st decile | 4.3 | |
| No Diabetes | mean and standard deviation 1st decile | – | |
| CKD | GEV() 10th decile, shape >0 | | |
| CKD | GEV() 10th decile, shape <0 | | |
| CKD | logn() 10th decile | 2.0 | |
| CKD | mean and standard deviation 10th decile | – | |
| No CKD | GEV() 1st decile, shape >0 | | |
| No CKD | GEV() 1st decile, shape <0 | | |
| No CKD | logn() 1st decile | 2.0 | |
| No CKD | mean and standard deviation 1st decile | – | |
| Pancreatitis | GEV() 10th decile, shape >0 | | |
| Pancreatitis | GEV() 10th decile, shape <0 | | |
| Pancreatitis | logn() 10th decile | 80 | |
| Pancreatitis | mean and standard deviation 10th decile | – | |
| No Pancreatitis | GEV() 1st decile, shape >0 | | |
| No Pancreatitis | GEV() 1st decile, shape <0 | | |
| No Pancreatitis | logn() 1st decile | 80 | |
| No Pancreatitis | mean and standard deviation 1st decile | – | |
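The purity column reports a proportion. A minimal sketch of how such a proportion could be computed follows, assuming purity means the fraction of manually reviewed patients in a model-defined cohort whose gold-standard label matches the disease state the cohort is meant to capture. The function name, argument names, and this exact definition are assumptions, since the caption does not define purity formally.

```python
def cohort_purity(cohort_ids, gold_labels, expected_state):
    """Fraction of reviewed cohort members whose gold-standard label matches.

    cohort_ids: patient identifiers in the model-defined cohort.
    gold_labels: mapping from patient identifier to the manual-review label.
    expected_state: the label the cohort is supposed to capture.
    """
    reviewed = [pid for pid in cohort_ids if pid in gold_labels]
    if not reviewed:
        raise ValueError("no cohort member has a gold-standard label")
    matches = sum(1 for pid in reviewed if gold_labels[pid] == expected_state)
    return matches / len(reviewed)
```

A cohort of four reviewed patients, three of whom carry the expected label, would have a purity of 0.75 under this definition.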
Metabolic laboratory PopKLD model selection estimates, where we list the PopKLD selected models. Multiple models are listed if their KL-divergence is a minimum and agrees on two or more orders of magnitude. Note the diversity in which model is selected and how many models are good approximations across laboratory measurements. The data used were from the AIM clinic data set, with the exception of the glucose data collected in the ICU (GLU-ICU).
Summary models selected by the PopKLD for 64 laboratory features

| Lab-Context | PopKLD model | Lab | PopKLD model | Lab | PopKLD model |
|---|---|---|---|---|---|
| **Basic metabolic** | | **Whole blood** | | **Hepatobiliary** | |
| GLU-ICU | Gamma, LogNorm | HGB | Norm, Weibull, Logistic | AST | GEV |
| GLU | GEV | MCH | Logistic | ALT | GEV |
| CA | Logistic | MCHC | Logistic | AMY | GEV, LogNorm |
| CL | Logistic | HCT | Gamma | LIP | GEV |
| CREAT | GEV | RBC | Norm, Logistic | BLOOD PROTEIN | Weibull, Norm, Logistic |
| K | LogNorm | RDW | GEV | BILI TOTAL | GEV |
| MG | Logistic | MCV | Logistic | BILI RECT | Uniform, t |
| PH | GEV, Logistic, LogNorm | PLT | Logistic | ALB | Weibull, Logistic, Norm |
| BICARB | Norm | MPV | Norm, Gamma, Logistic, LogNorm, Weibull | ALK PHOS | GEV |
| BUN | GEV | WBC | GEV, LogNorm, Gamma | | |
| URIC | GEV, Gamma, LogNorm | | | | |
| CA ION | Logistic, Norm | | | | |
| HA1C | GEV | | | | |
| **Lipids** | | **Anemia** | | **Cardiac** | |
| HDL | GEV, Gamma, LogNorm | FERRITIN | GEV | CK | GEV |
| LDL | Gamma, GEV, Norm, Logistic | IRON BINDING CAP | Norm, Weibull | TROPONIN | GEV |
| TG | LogNorm, GEV | VITAMIN B12 | Rayleigh | LACTATE | GEV |
| CHOL | Gamma | IRON | GEV, Logistic | | |
| **Hormone, Inflam, Vitamin, Urine** | | **Differential** | | **Blood Gases** | |
| TSH | GEV, LogNorm | BASOS % | Uniform | BASE EXCESS ART | Rayleigh, LogNorm |
| T4 FREE | GEV, Gamma, Logistic, LogNorm | MO % | GEV, Logistic | PO2 VEN | GEV, LogNorm |
| T4 | GEV, Gamma, Logistic, LogNorm | LYMPH | GEV, Gamma | PO2 ART | GEV |
| CRP HIGH SEN | GEV | NRBC abs | t | PCO2 VEN | Logistic, Gamma |
| ESR | Logistic, GEV, Gamma | NRBC % | Norm | PCO2 ART | GEV, LogNorm |
| 25 OH VIT D | GEV, Logistic | | | PH ART | Norm, GEV, Gamma, Logistic, LogNorm, Weibull |
| PH UA | GEV, Norm, Gamma, Logistic, LogNorm, Weibull | | | PH VEN | GEV, Norm, Gamma, Logistic, LogNorm, Weibull |
| ACR | GEV | | | | |
Fig. 3. Joint distributions for: mean vs raw standard deviation (top left), mean vs truncated standard deviation (top right), location vs scale, the mean-like and variance-like parameters of the GEV (middle left), log-normal mean vs standard deviation (middle right), and “a” vs “b” of the gamma distribution for the ICU population (bottom). PopKLD selected the log-normal and gamma distributions as the best models and both reproduce known physiology well.
Fig. 4. Joint distributions for: mean vs raw standard deviation (top left), mean vs hand-truncated variance (top right), location vs scale, the mean-like and variance-like parameters of the GEV (middle left), log-normal mean vs standard deviation (middle right), and “a” vs “b” of the gamma distribution for the AIM population (bottom).