Rebecca A Hubbard1, Jing Huang1, Joanna Harton1, Arman Oganisian1, Grace Choi1, Levon Utidjian2,3, Ihuoma Eneli4, L Charles Bailey2,3, Yong Chen1.
Abstract
Phenotyping, i.e., identification of patients possessing a characteristic of interest, is a fundamental task for research conducted using electronic health records. However, challenges to this task include imperfect sensitivity and specificity of clinical codes and inconsistent availability of more detailed data such as laboratory test results. Despite these challenges, most existing electronic health records-derived phenotypes are rule-based, consisting of a series of Boolean arguments informed by expert knowledge of the disease of interest and its coding. The objective of this paper is to introduce a Bayesian latent phenotyping approach that accounts for imperfect data elements and missing not at random (MNAR) missingness patterns, and that can be used when no gold-standard data are available. We conducted simulation studies to compare alternative phenotyping methods under different patterns of missingness and applied these approaches to a cohort of 68 265 children at elevated risk for type 2 diabetes mellitus (T2DM). In simulation studies, the latent class approach had similar sensitivity to a rule-based approach (95.9% vs 91.9%) while substantially improving specificity (99.7% vs 90.8%). In the PEDSnet cohort, we found that biomarkers and clinical codes were strongly associated with latent T2DM status. The latent T2DM class was also strongly predictive of missingness in biomarkers. Glucose was missing in 83.4% of patients (odds ratio for latent T2DM status = 0.52), while hemoglobin A1c was missing in 91.2% (odds ratio for latent T2DM status = 0.03), suggesting MNAR missingness. The latent phenotype approach may substantially improve on rule-based phenotyping.
Keywords: Bayesian; electronic health records; latent class; missing data; phenotype
Year: 2018 PMID: 30252148 PMCID: PMC6519239 DOI: 10.1002/sim.7953
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
Model specification for Bayesian latent variable model for EHR‐derived phenotypes for the ith patient.
| | Latent Phenotype | Availability of Biomarkers | Biomarkers | Clinical Codes | Prescription Medications |
|---|---|---|---|---|---|
| Example | Type 2 Diabetes | Availability of glucose or HbA1c data | Glucose or HbA1c values | Diabetes ICD-9 code; Endocrinologist visits | Diabetes medication |
| Variable | | | | | |
| Model | | | | | |
| Priors | | | | | |
Abbreviations: N, normal; Bern, Bernoulli; MVN, multivariate normal; Unif, uniform; InvGamma, inverse gamma; HbA1c, hemoglobin A1c.
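To make the five components in the table concrete, a generic latent class specification of this kind might be sketched as follows. Every symbol here is an illustrative assumption and not the paper's notation: $D_i$ is the latent phenotype, $X_i$ a covariate vector, $R_{ij}$ an availability indicator for biomarker $j$, $Y_{ij}$ the biomarker value, $W_{ik}$ a clinical code, and $P_i$ a prescription indicator, with $s$ and $c$ denoting sensitivity and specificity. The components are assumed conditionally independent given $D_i$.

```latex
\begin{align*}
D_i \mid X_i &\sim \mathrm{Bern}\!\left(\operatorname{expit}(X_i^\top \beta^{D})\right) \\
R_{ij} \mid D_i, X_i &\sim \mathrm{Bern}\!\left(\operatorname{expit}(X_i^\top \beta^{R}_j + \lambda_j D_i)\right) \\
Y_{ij} \mid D_i, X_i &\sim \mathrm{N}\!\left(X_i^\top \beta^{Y}_j + \mu_j D_i,\ \sigma_j^2\right) \\
W_{ik} \mid D_i &\sim \mathrm{Bern}\!\left(s_k^{D_i}\,(1 - c_k)^{1 - D_i}\right) \\
P_i \mid D_i &\sim \mathrm{Bern}\!\left(s_P^{D_i}\,(1 - c_P)^{1 - D_i}\right)
\end{align*}
```

Under this sketch, $\mu_j$ plays the role of a mean shift in biomarker $j$ for patients with the phenotype, and $\exp(\lambda_j)$ is an odds ratio linking latent status to biomarker availability. A nonzero $\lambda_j$ is what makes the missingness depend on the unobserved phenotype.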
Distributional assumptions and parameter values used in simulation studies. Normal distributions are parameterized using mean and standard deviation.
| Variable | Distribution |
|---|---|
| Age ( | Uniform (9,18) |
| White race ( | Bernoulli (0.524) |
| T2DM ( | Bernoulli (expit( |
| BMI percentile ( | Truncated Normal (2.2 |
| T2DM code ( | Bernoulli (0.8 |
| Endocrinologist visit code ( | Bernoulli (0.5 |
| Metformin code ( | Bernoulli (0.2 |
| Glucose ( | Normal (90.6 + 42 |
| HbA1c ( | Normal (5.4 + 1.00 |
Abbreviations: BMI, body mass index; T2DM, type 2 diabetes mellitus.
Figure 1. Sensitivity and specificity of methods for identifying patients with type 2 diabetes based on 100 simulations per scenario. Biomarkers were simulated under missing at random (MAR) missingness with average missingness in biomarkers of 48% for HbA1c and 25% for glucose in Scenario 1 (Panel A) and 95% for HbA1c and 81% for glucose in Scenario 2 (Panel B). Circle = sensitivity, Triangle = specificity, Star = percent misclassified. MI, multiple imputation.
Figure 2. Sensitivity and specificity of methods for identifying patients with type 2 diabetes based on 100 simulations per scenario. Biomarkers were simulated under missing not at random (MNAR) missingness with average missingness in biomarkers of 48% for HbA1c and 25% for glucose in Scenario 3 (Panel A) and 95% for HbA1c and 81% for glucose in Scenario 4 (Panel B). Circle = sensitivity, Triangle = specificity, Star = percent misclassified. MI, multiple imputation.
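The three quantities plotted in Figures 1 and 2 can be computed from true and predicted binary labels. The following is a minimal sketch; the function name and the toy data are hypothetical and are not taken from the paper's simulations.

```python
def classification_metrics(truth, predicted):
    """Return (sensitivity, specificity, percent misclassified) for 0/1 labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    pct_misclassified = 100.0 * (fp + fn) / len(pairs)
    return sensitivity, specificity, pct_misclassified


# Hypothetical toy data: 4 true cases and 6 true non-cases.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec, pct = classification_metrics(truth, predicted)
```

In a simulation study, `truth` would be the simulated phenotype and `predicted` the classification produced by each method, with the metrics then averaged over simulated data sets.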
Figure 3. Receiver operating characteristic curves for identifying patients with type 2 diabetes from four example simulated data sets. Biomarkers were simulated under missing at random (MAR) (first row) and missing not at random (MNAR) (second row) missingness with average missingness in biomarkers of 48% for HbA1c and 25% for glucose in Scenario 3 (left column) and 95% for HbA1c and 81% for glucose in Scenario 4 (right column). AUC, area under the curve.
Mean and standard deviation (SD) for area under the curve (AUC) based on posterior probability of type 2 diabetes mellitus from latent phenotype analysis, glucose, hemoglobin A1c (HbA1c), glucose with missing imputed via multiple imputation (MI), and HbA1c with missing imputed via MI based on simulated data. Means and standard deviations were computed across 100 simulated data sets for each scenario.
| Method | Scenario 1: Low MAR Missingness | Scenario 2: High MAR Missingness | Scenario 3: Low MNAR Missingness | Scenario 4: High MNAR Missingness |
|---|---|---|---|---|
| Latent phenotype | 0.999 (5.98) | 1.000 (2.60) | 0.998 (4.66) | 0.997 (7.40) |
| Glucose | 0.761 (40.11) | 0.513 (28.64) | 0.851 (36.45) | 0.592 (31.62) |
| HbA1c | 0.571 (102.88) | 0.499 (12.14) | 0.735 (42.67) | 0.530 (19.36) |
| Glucose MI | 0.954 (14.01) | 0.920 (28.66) | 0.959 (13.20) | 0.931 (21.97) |
| HbA1c MI | 0.933 (18.12) | 0.811 (107.93) | 0.944 (15.78) | 0.883 (69.61) |
Abbreviations: MAR, missing at random; MNAR, missing not at random.
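For readers less familiar with the AUC, it can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and scores are hypothetical, and this is not necessarily how the AUCs in the table were computed.

```python
def auc(scores_cases, scores_noncases):
    """AUC via pairwise comparison of case and non-case scores (ties count 1/2)."""
    wins = 0.0
    for sc in scores_cases:
        for sn in scores_noncases:
            if sc > sn:
                wins += 1.0
            elif sc == sn:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_noncases))


# Hypothetical scores, e.g. posterior probabilities or biomarker values.
cases = [0.9, 0.8, 0.4]
noncases = [0.7, 0.3, 0.2, 0.1]
example_auc = auc(cases, noncases)
```

The score can be any quantity that ranks patients, which is why the table can compare a posterior probability with a raw or imputed biomarker value on the same scale.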
Characteristics of study population of pediatric patients at risk for type 2 diabetes mellitus (T2DM) stratified according to absence of codes for type 1 diabetes mellitus and presence of codes for T2DM, metformin prescription, or elevated hemoglobin A1c or glucose.
| Characteristic | Total | Codes or Biomarkers Suggesting T2DM: Yes | Codes or Biomarkers Suggesting T2DM: No |
|---|---|---|---|
| | N (%) | N (%) | N (%) |
| Male | 36 836 (53.96) | 2026 (40.17) | 34 810 (55.06) |
| White | 35 740 (52.35) | 2886 (57.23) | 32 854 (51.97) |
| Endocrinologist | 5338 (7.82) | 510 (63.43) | 4828 (7.16) |
| Metformin | 764 (1.12) | 675 (83.96) | 89 (0.13) |
| Insulin | 727 (1.06) | 154 (19.15) | 573 (0.85) |
| T1D codes | 632 (0.93) | 0 (0) | 632 (0.94) |
| T2D codes | 275 (0.4) | 221 (27.49) | 54 (0.08) |
| Any glucose measurement | 11 325 (16.59) | 355 (44.15) | 10 970 (16.26) |
| Any HbA1c measurement | 6031 (8.83) | 397 (49.38) | 5634 (8.35) |
| | Mean (SD) | Mean (SD) | Mean (SD) |
| Age | 11.90 (2.50) | 13.79 (2.58) | 11.87 (2.49) |
| BMI | 2.02 (0.30) | 2.27 (0.36) | 2.01 (0.30) |
| Glucose | 94.31 (32.51) | 141.39 (104.47) | 92.79 (27.44) |
| Hemoglobin A1c | 5.79 (1.25) | 6.93 (1.94) | 5.71 (1.15) |
Abbreviations: BMI, body mass index; SD, standard deviation; T1D, type 1 diabetes; T2D, type 2 diabetes.
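The stratification in this table is itself a rule-based phenotype: no type 1 diabetes code, plus at least one of a T2DM code, a metformin prescription, or an elevated biomarker. A minimal sketch of such a rule is shown below. The function name and both cutoffs are assumptions made for illustration; the table does not state which thresholds define "elevated".

```python
from typing import Optional

# Assumed cutoffs for illustration only; not stated in the source.
HBA1C_CUTOFF = 6.5      # percent
GLUCOSE_CUTOFF = 200.0  # mg/dL


def rule_based_t2dm(t1d_code: bool, t2d_code: bool, metformin: bool,
                    hba1c: Optional[float], glucose: Optional[float]) -> bool:
    """Boolean rule: no T1D code AND (T2D code OR metformin OR elevated biomarker)."""
    # A missing biomarker (None) cannot count as elevated under this rule.
    elevated_hba1c = hba1c is not None and hba1c >= HBA1C_CUTOFF
    elevated_glucose = glucose is not None and glucose >= GLUCOSE_CUTOFF
    evidence = t2d_code or metformin or elevated_hba1c or elevated_glucose
    return (not t1d_code) and evidence


# Hypothetical patient: metformin prescription, no codes, no labs on file.
flagged = rule_based_t2dm(False, False, True, None, None)
```

Note how the rule treats an unmeasured biomarker the same as a normal one. When most patients have no HbA1c or glucose on file, that choice is exactly where a Boolean rule loses information.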
Posterior means and 95% credible intervals (CI) for model parameters for analysis of pediatric T2DM in the PEDSnet sample.
| Parameter | Posterior Mean | 95% CI |
|---|---|---|
| Mean shift in glucose | 90.62 | (90.25, 91.00) |
| Mean shift in HbA1c | 3.15 | (3.06, 3.24) |
| T2DM code sensitivity | 0.17 | (0.15, 0.20) |
| T2DM code specificity | 1.00 | (1.00, 1.00) |
| Endocrinologist visit code sensitivity | 0.94 | (0.92, 0.95) |
| Endocrinologist visit code specificity | 0.93 | (0.93, 0.94) |
| Metformin code sensitivity | 0.31 | (0.28, 0.35) |
| Metformin code specificity | 0.99 | (0.99, 0.99) |
| Insulin code sensitivity | 0.66 | (0.61, 0.70) |
| Insulin code specificity | 1.00 | (1.00, 1.00) |
| OR missing glucose | 0.52 | (0.44, 0.61) |
| OR missing HbA1c | 0.03 | (0.02, 0.04) |
Abbreviations: HbA1c, hemoglobin A1c; T2DM, type 2 diabetes mellitus; OR, odds ratio.
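To see how code sensitivities and specificities translate into a probability of latent T2DM, the sketch below applies Bayes' rule using the posterior means for the endocrinologist visit and metformin codes. It rests on several simplifying assumptions: the two codes are conditionally independent given latent status, the prior prevalence is set to 3.5% (the latent phenotype prevalence reported for this sample), and biomarkers, covariates, and parameter uncertainty are ignored. It illustrates the idea and is not the fitted model.

```python
def posterior_prob(prior, codes):
    """Posterior P(T2DM | codes); codes is a list of (observed, sensitivity, specificity)."""
    like_pos = prior        # running P(codes, T2DM)
    like_neg = 1.0 - prior  # running P(codes, no T2DM)
    for observed, sens, spec in codes:
        like_pos *= sens if observed else (1.0 - sens)
        like_neg *= (1.0 - spec) if observed else spec
    return like_pos / (like_pos + like_neg)


PRIOR = 0.035            # assumed prior prevalence, for illustration
ENDO = (0.94, 0.93)      # (sensitivity, specificity) posterior means
METFORMIN = (0.31, 0.99)

# Patient with both codes versus a patient with neither.
both = posterior_prob(PRIOR, [(1, *ENDO), (1, *METFORMIN)])
neither = posterior_prob(PRIOR, [(0, *ENDO), (0, *METFORMIN)])
```

The T2DM and insulin codes are left out of the example because their specificities are reported as 1.00 after rounding, which would make any positive code force the posterior to exactly 1 in this simplified calculation.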
Prevalence of pediatric type 2 diabetes mellitus in the PEDSnet sample according to six phenotyping approaches.
| Phenotyping approach | N | Prevalence (%) |
|---|---|---|
| Latent phenotype | 2362 | 3.5 |
| Codes | 722 | 1.1 |
| Biomarkers | 209 | 0.3 |
| Codes or biomarkers | 804 | 1.2 |
| Biomarkers MI | 424 | 0.6 |
| Codes or biomarkers MI | 995 | 1.5 |
Abbreviation: MI, multiple imputation.
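The prevalence column follows directly from the counts and the cohort size of 68 265 given in the abstract. A short check, with variable names chosen here for illustration:

```python
COHORT_SIZE = 68265  # cohort size reported in the abstract

# Counts of identified patients, copied from the table above.
counts = {
    "Latent phenotype": 2362,
    "Codes": 722,
    "Biomarkers": 209,
    "Codes or biomarkers": 804,
    "Biomarkers MI": 424,
    "Codes or biomarkers MI": 995,
}

# Prevalence as a percentage of the cohort, rounded to one decimal place.
prevalence_pct = {name: round(100.0 * n / COHORT_SIZE, 1) for name, n in counts.items()}
```

The rounded values reproduce the reported percentages, so the latent phenotype identifies roughly three times as many patients as the codes-only rule.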