Shaan Khurshid, Christopher Reeder, Lia X Harrington, Pulkit Singh, Gopal Sarma, Samuel F Friedman, Paolo Di Achille, Nathaniel Diamant, Jonathan W Cunningham, Ashby C Turner, Emily S Lau, Julian S Haimovich, Mostafa A Al-Alusi, Xin Wang, Marcus D R Klarqvist, Jeffrey M Ashburner, Christian Diedrich, Mercedeh Ghadessi, Johanna Mielke, Hanna M Eilken, Alice McElhinney, Andrea Derix, Steven J Atlas, Patrick T Ellinor, Anthony A Philippakis, Christopher D Anderson, Jennifer E Ho, Puneet Batra, Steven A Lubitz.
Abstract
Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001 and 2018 (Community Care Cohort Project [C3PO], n = 520,868). We used natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95-0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO than in the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012-0.030 in C3PO vs. 0.028-0.046 in Convenience Samples; calibration error for atrial fibrillation 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research.
Year: 2022 PMID: 35396454 PMCID: PMC8993873 DOI: 10.1038/s41746-022-00590-0
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1 Overview of C3PO construction and data pipeline.
Depicted is a graphical overview of the construction of the Community Care Cohort Project (C3PO). C3PO comprises the electronic health record (EHR) data of 520,868 individuals aged 18–90 at the start of sample follow-up, selected from an ambulatory EHR database on the basis of receiving periodic primary care (i.e., ≥2 visits within 1–3 consecutive years, see text). C3PO is structured as an indexed file system containing protected health information-minimized data of various types (bottom panel). The C3PO database can readily accommodate updating of existing data, integration of new data features, and construction of composite disease phenotypes based on multiple data features.
Fig. 2 Distribution of office visits in C3PO versus Convenience Samples.
Depicted are boxplots demonstrating the distribution of office visits (a) and primary care physician (PCP) office visits (b) in the C3PO analysis samples (AF [blue] and MI/stroke [green]) versus the respective Convenience Samples (AF [red] and MI/stroke [purple]). In each boxplot, the black bar denotes the median number of office visits per individual, the box represents the interquartile range, and the whiskers represent points beyond the interquartile range. Points greater than quartile 3 plus 1.5 times the interquartile range and points smaller than quartile 1 minus 1.5 times the interquartile range are not depicted.
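The outlier rule in this caption (points beyond quartile 3 plus 1.5 times the interquartile range, or below quartile 1 minus 1.5 times the interquartile range, are hidden) can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function names and the sample visit counts are hypothetical, and the "inclusive" quantile method is an assumption, since the caption does not say how quartiles were computed.

```python
from statistics import quantiles


def tukey_fences(values):
    """Return (lower, upper) = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    # Assumption: "inclusive" quartiles; other quantile definitions shift the fences slightly.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(values):
    """Split values into those drawn on the boxplot and those left undepicted."""
    lower, upper = tukey_fences(values)
    shown = [v for v in values if lower <= v <= upper]
    hidden = [v for v in values if v < lower or v > upper]
    return shown, hidden


# Hypothetical office-visit counts for nine individuals.
visits = [1, 2, 2, 3, 3, 4, 5, 6, 40]
shown, hidden = split_outliers(visits)
print(tukey_fences(visits), hidden)
```

With these made-up counts, the single individual with 40 visits falls above the upper fence and would not be drawn.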
Baseline characteristics.
| Characteristic | C3PO1 | C3PO – MI/stroke | MI/stroke Convenience Sample | C3PO – AF | AF Convenience Sample |
|---|---|---|---|---|---|
| Mean ± SD, median (quartile 1, quartile 3), or n (%) | | | | | |
| Age (years) | 48.4 ± 17.1 | 57.0 ± 10.3 | 56.2 ± 10.4 | 60.9 ± 10.0 | 61.4 ± 10.5 |
| Women | 315,577 (60.6%) | 116,448 (58.8%) | 195,039 (57.3%) | 106,279 (60.9%) | 288,334 (57.5%) |
| White | 389,755 (74.8%) | 154,712 (78.1%) | 270,002 (79.4%) | 140,746 (79.6%) | 422,266 (84.2%) |
| Black | 38,104 (7.3%) | 13,805 (7.0%) | 21,248 (6.2%) | 11,103 (6.4%) | 22,787 (4.5%) |
| Hispanic or Latino | 33,762 (6.5%) | 9401 (4.7%) | 15,142 (4.5%) | 6804 (3.9%) | 14,115 (2.8%) |
| Asian or Pacific Islander | 21,701 (4.2%) | 7807 (3.9%) | 13,219 (3.9%) | 6003 (3.4%) | 14,329 (2.9%) |
| Mixed | 27 (0.05%) | 11 (0.06%) | 24 (0.07%) | 7 (0.04%) | 23 (0.04%) |
| Other | 18,774 (3.6%) | 5716 (2.9%) | 8937 (2.6%) | 4467 (2.6%) | 9023 (1.8%) |
| Unknown | 18,745 (3.6%) | 6732 (3.4%) | 11,654 (3.4%) | 5514 (3.2%) | 18,729 (3.7%) |
| Height (cm) | 167.4 ± 10.4 | – | – | 166.6 ± 10.4 | 167.4 ± 10.3 |
| Weight (kg) | 78.3 ± 20.3 | – | – | 79.4 ± 19.5 | 79.8 ± 19.8 |
| Systolic blood pressure (mmHg) | 123 ± 17 | 126 ± 17 | 127 ± 18 | 128 ± 17 | 130 ± 19 |
| Diastolic blood pressure (mmHg) | 75 ± 10 | – | – | 76 ± 10 | 77 ± 11 |
| Current smoker | 27,202 (5.2%) | 14,720 (7.4%) | 12,652 (3.7%) | 14,031 (8.0%) | 22,020 (4.4%) |
| Anti-hypertensive use | 147,898 (28.4%) | 77,827 (39.3%) | 119,954 (35.3%) | 78,219 (44.8%) | 173,235 (34.6%) |
| Diabetes | 58,159 (11.2%) | 29,307 (14.8%) | 43,966 (12.9%) | 27,953 (16.0%) | 52,180 (10.4%) |
| Heart failure | 12,555 (2.4%) | – | – | 3334 (1.9%) | 16,786 (3.3%) |
| Myocardial infarction | 17,937 (3.4%) | – | – | 6641 (3.8%) | 18,260 (3.6%) |
| Total cholesterol (mg/dL) | 189 ± 39 | 195 ± 39 | 194 ± 40 | – | – |
| HDL cholesterol (mg/dL) | 55 ± 18 | 57 ± 18 | 57 ± 18 | – | – |
| Follow-up, years | 7.2 (2.6, 12.9) | 7.3 (2.8, 11.9) | 7.4 (3.5, 11.8) | 6.5 (2.5, 11.1) | 5.4 (2.2, 9.8) |
1 Values shown exclude missing data.
2 Only variables relevant for each risk score (CHARGE-AF for AF, PCE for MI/stroke) are depicted.
Fig. 3 Yield of NLP-based missing data recovery.
Depicted is a summary of the yield of our deep natural language processing (NLP) based model for missing data recovery in C3PO. a–c Compare effective sample sizes with versus without NLP recovery, where error bars depict 95% confidence intervals. a The y-axis depicts the total number of individuals with a baseline height, weight, and blood pressure, and the hashed line indicates the total sample size of C3PO. b The y-axis depicts the total number of individuals with a complete Pooled Cohort Equations (PCE) score at baseline and the hashed line indicates the total number of individuals eligible for PCE analysis (i.e., within age 40–79 years, with available follow-up data, and without prevalent MI/stroke). c The y-axis depicts the total number of individuals with a complete CHARGE-AF score at baseline and the hashed line indicates the total number of individuals eligible for CHARGE-AF analysis (i.e., within age 45–94 years, with available follow-up data, and without prevalent AF). d Depicts the total number of vital sign extractions obtained using the rule-based method (light shades), BERT (medium shades), and Bio + DischargeSummaryBERT (Bio + DS BERT, dark shades).
Fig. 4 Agreement between tabular and natural language processing-extracted vital signs.
Depicted is agreement between vital signs obtained from tabular data and those obtained from our NLP model among individuals with values obtained on the same day. Panel a depicts height values, b weight values, c systolic blood pressures, and d diastolic blood pressures. For individuals with multiple eligible values, only the pair most closely preceding the start of follow-up was used. Left panels show the distribution of values obtained from tabular versus NLP sources. Middle panels show the correlation between tabular values (x-axis) and NLP values (y-axis). Right panels are Bland–Altman plots showing agreement between paired tabular and NLP values. The x-axis depicts the increasing mean of the paired values, and the y-axis depicts the difference between the paired values, where positive values denote tabular values greater than corresponding NLP values and negative values denote tabular values lower than corresponding NLP values. The colored horizontal lines depict the mean difference between sources, and the hashed horizontal lines depict 1.96 standard deviations above and below the mean. The values corresponding to the bounds and the percentage of values contained within those bounds are printed on each plot.
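The Bland–Altman quantities named in this caption (paired means, tabular-minus-NLP differences, the mean difference, bounds at 1.96 standard deviations, and the share of pairs within those bounds) can be computed as in the sketch below. The function name and the blood pressure values are hypothetical, and the use of the sample standard deviation is an assumption; the caption does not specify which estimator was used.

```python
from statistics import mean, stdev


def bland_altman(tabular, nlp):
    """Summarize agreement between paired tabular and NLP-extracted values."""
    # Positive differences mean the tabular value exceeds the NLP value.
    diffs = [t - n for t, n in zip(tabular, nlp)]
    means = [(t + n) / 2 for t, n in zip(tabular, nlp)]
    bias = mean(diffs)
    sd = stdev(diffs)  # assumption: sample (n - 1) standard deviation
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return {
        "means": means,        # x-axis of the plot
        "diffs": diffs,        # y-axis of the plot
        "bias": bias,          # colored horizontal line
        "lower": lower,        # lower hashed line
        "upper": upper,        # upper hashed line
        "fraction_within": within,
    }


# Hypothetical same-day systolic blood pressures (mmHg).
tabular_sbp = [120, 130, 125, 140, 118]
nlp_sbp = [120, 128, 126, 140, 119]
summary = bland_altman(tabular_sbp, nlp_sbp)
print(summary["bias"], summary["lower"], summary["upper"], summary["fraction_within"])
```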
Risk score performance in C3PO versus Convenience Samples.
| Model | Hazard ratio (per 1-SD increase) | C-index3 (95% CI) | GND | Recalibrated GND | ICI6 (95% CI) | Recalibrated ICI5,6 (95% CI) | Calibration slope7 (95% CI) |
|---|---|---|---|---|---|---|---|
| PCE (White women)1 | 2.51 (2.43–2.59) | 0.768 (0.760–0.775) | 487 | 1689 | 0.018 (0.017–0.020) | 0.034 (0.031–0.037) | 0.67 (0.65–0.70) |
| PCE (Black women)1 | 2.39 (2.17–2.64) | 0.724 (0.702–0.746) | 69 | 257 | 0.030 (0.023–0.036) | 0.057 (0.050–0.064) | 0.60 (0.53–0.67) |
| PCE (White men)1 | 2.17 (2.11–2.24) | 0.738 (0.730–0.746) | 361 | 618 | 0.024 (0.022–0.027) | 0.032 (0.029–0.035) | 0.70 (0.68–0.73) |
| PCE (Black men)1 | 2.04 (1.85–2.25) | 0.725 (0.698–0.751) | 21 | 183 | 0.012 (0–0.025) | 0.010 (0–0.024) | 0.88 (0.77–1.00) |
| CHARGE-AF1 | 2.56 (2.50–2.61) | 0.782 (0.777–0.787) | 1856 | 1367 | 0.028 (0.027–0.030) | 0.019 (0.018–0.021) | 0.77 (0.75–0.79) |
| PCE (White women)2 | 2.44 (2.39–2.49) | 0.770 (0.764–0.775) | 1797 | 4923 | 0.032 (0.031–0.034) | 0.047 (0.044–0.049) | 0.64 (0.62–0.65) |
| PCE (Black women)2 | 2.29 (2.13–2.46) | 0.732 (0.716–0.748) | 213 | 562 | 0.046 (0.040–0.053) | 0.074 (0.067–0.081) | 0.56 (0.51–0.61) |
| PCE (White men)2 | 2.18 (2.13–2.22) | 0.744 (0.739–0.749) | 1291 | 1493 | 0.041 (0.038–0.043) | 0.039 (0.037–0.042) | 0.70 (0.68–0.71) |
| PCE (Black men)2 | 2.01 (1.88–2.15) | 0.727 (0.705–0.749) | 36 | 13 | 0.028 (0.018–0.037) | 0.012 (0.0026–0.022) | 0.87 (0.79–0.95) |
| CHARGE-AF2 | 2.40 (2.38–2.43) | 0.781 (0.778–0.784) | 7188 | 8322 | 0.036 (0.035–0.036) | 0.028 (0.027–0.029) | 0.69 (0.68–0.70) |
1 C3PO samples. Values are n events, N total, median follow-up in years (Q1, Q3). PCE (White women): 4231, 107,998, 7.1 (2.8, 10); PCE (Black women): 617, 8450, 7.7 (2.9, 10); PCE (White men): 4928, 76,304, 6.2 (2.3, 10); PCE (Black men): 425, 5432, 6.7 (2.5, 10); CHARGE-AF: 7877, 174,644, 5.0 (2.3, 5.0).
2 Convenience Samples. Values are n events, N total, median follow-up in years (Q1, Q3). PCE (White women): 10,259, 182,349, 7.5 (3.6, 10); PCE (Black women): 1119, 12,690, 7.2 (3.2, 10); PCE (White men): 12,891, 136,629, 6.2 (2.6, 10); PCE (Black men): 843, 8558, 6.0 (2.6, 10); CHARGE-AF: 26,907, 501,272, 5.0 (2.0, 5.0).
3 C-index calculated using the inverse probability of censoring weighting method[28].
4 Greenwood-Nam-D’Agostino (GND) test, a test of calibration[30]. Lower chi-squared values suggest better calibration (across equally sized samples). Significant p-values indicate evidence of miscalibration. Corresponding p-values are all p < 0.01 except for C3PO PCE Black men (p = 0.02), C3PO PCE Black men recalibrated (p = 0.03), and Convenience Sample PCE Black men recalibrated (p = 0.17).
5 Values after recalibration to the baseline hazard of the sample (see text).
6 Integrated calibration index, a quantitative measure of the average difference between predicted event risk and observed event incidence, weighted by the empirical distribution of event risk[29]. Smaller values indicate better calibration. P-values indicate pairwise comparison of ICI with the corresponding Convenience Sample.
7 A measure of calibration applicable to models that are calibrated in the large[31,44]. A calibration slope equal to one is optimally calibrated. P-values indicate pairwise comparison of calibration slope with the corresponding Convenience Sample.
SD standard deviation, CI confidence interval.
Fig. 5 Cumulative event risk in C3PO versus Convenience Samples.
Depicted is Kaplan–Meier cumulative risk of MI/stroke (a) and AF (b) observed in C3PO (blue [left] and green [right]) versus the Convenience Samples (red [left] and purple [right]). The number of individuals remaining at risk over time is labeled below each plot. Note an initial rapid inflection in MI/stroke and AF incidence observed in the Convenience Samples but not in C3PO.
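For readers unfamiliar with the estimator behind this figure, the following is a minimal sketch of Kaplan–Meier cumulative risk (one minus the survival estimate) together with the number at risk at each event time. The function name and the toy follow-up data are hypothetical; this is not the analysis code used for the figure.

```python
def kaplan_meier_cumulative_risk(times, events):
    """Return [(time, cumulative_risk, n_at_risk)] at each distinct event time.

    times  -- follow-up duration for each individual
    events -- 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        j = i
        n_events = 0
        # Collect everyone whose follow-up ends at time t (events and censorings).
        while j < len(data) and data[j][0] == t:
            n_events += data[j][1]
            j += 1
        if n_events:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1 - n_events / n_at_risk
            curve.append((t, 1 - survival, n_at_risk))
        # Individuals with an event or censoring at t leave the risk set after t.
        n_at_risk -= j - i
        i = j
    return curve


# Hypothetical follow-up times in years; 1 = event, 0 = censored.
curve = kaplan_meier_cumulative_risk([1, 2, 2, 3, 4], [1, 0, 1, 0, 1])
for t, risk, at_risk in curve:
    print(t, round(risk, 3), at_risk)
```

Censored individuals contribute to the risk set up to their censoring time but never trigger a step in the curve, which is why the curve only changes at observed event times.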
Fig. 6 Model discrimination in C3PO and Convenience Samples.
Depicted are time-dependent receiver operating characteristic curves for the Pooled Cohort Equations (PCE, left panels) and the CHARGE-AF score (right panels) in C3PO (top panels) versus the respective Convenience Samples (bottom panels). Each plot shows the discrimination performance of each risk score for its respective prediction target (i.e., 10-year MI/stroke for the PCE, 5-year incident AF for CHARGE-AF). Since the PCE score comprises four models stratified on the basis of sex and race, the curves for each score are represented separately (see legend). The c-index calculated using the inverse probability of censoring weighting method[28] is depicted for each model.
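As an intuition aid for the c-index reported here, the sketch below computes Harrell's concordance index, a simpler estimator than the inverse probability of censoring weighting method cited in the caption. It is a swapped-in illustration only: it does not reweight for censoring and will not reproduce the published values. The function name and the toy data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the higher-risk
    individual has the earlier observed event (ties in risk count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on an individual with an observed event
        for j in range(n):
            # Comparable only if individual j was still under follow-up after i's event.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up times (years), event indicators, and predicted risks.
c = harrell_c_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.7, 0.8, 0.1],
)
print(c)
```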
Fig. 7 Model calibration in C3PO and Convenience Samples.
Depicted is model calibration performance in C3PO versus the Convenience Samples. a Depicts the calibration slope for the PCE models (x-axis, left) and CHARGE-AF (x-axis, right) in C3PO (blue, green) versus the Convenience Samples (red, purple). The y-axis depicts the calibration slope, a measure of the relationship between predicted event risk and observed event incidence, where a slope of one indicates an optimal relationship (horizontal hashed line), with corresponding 95% confidence intervals. b, c Compare calibration error in C3PO versus the Convenience Samples. Calibration error is depicted on the y-axis using the Integrated Calibration Index (ICI, see text), where lower values indicate better absolute agreement between predicted risk and observed event incidence. b Depicts ICI values using the original models, while c depicts ICI values after recalibration to the baseline hazard of each sample. In all plots, statistically significant differences between values in C3PO versus the Convenience Sample (p < 0.05) are depicted with an asterisk.
Fig. 8 Conceptual overview of C3PO analysis methods.
Depicted is a graphical overview of the potential analyses enabled by the Community Care Cohort Project (C3PO). By integrating diverse data types (e.g., diagnoses, imaging, vital signs, diagnostic test data, genetics), C3PO may enable methods such as traditional statistical modeling and deep learning to facilitate more accurate disease risk prediction models and enable deep phenotyping including disease subgroup identification.
Regular expression rule-based approach for vital sign labeling.
| Vital Sign | Context Words | Units Considered | Text Patterns | Labeled Example (format: word {LABEL}) |
|---|---|---|---|---|
| Height | “height”, “height:”, “ht”, “ht:” | ‘inches’, ‘in’, ‘feet’, ‘ft’, ‘m’, ‘meters’, ‘cm’, ‘centimeters’, ‘'’ (for feet), ‘''’ (for inches) | [number] | Ht: 63.5 {HEIGHT} |
| | | | [number] [unit] | Patient height is 63.5 {HEIGHT} inches {HEIGHT_UNIT} |
| | | | [number] [unit] [number] [unit] | Height: 5 {HEIGHT} feet {HEIGHT_UNIT} 11 {HEIGHT} inches {HEIGHT_UNIT} |
| Weight | “weight”, “weight:”, “wt”, “wt:” | ‘pounds’, ‘lbs’, ‘lb’, ‘ounces’, ‘oz’, ‘kilograms’, ‘kg’, ‘grams’, ‘g’ | [number] | Wt: 180 {WEIGHT} |
| | | | [number] [unit] | Current weight is 65.9 {WEIGHT} kg {WEIGHT_UNIT} |
| | | | [number] [unit] [number] [unit] | Patient’s weight is 170 {WEIGHT} lbs {WEIGHT_UNIT} 9 {WEIGHT} oz {WEIGHT_UNIT} |
| Blood Pressure | “pressure”, “bp”, “bp:” | – | [number]/[number] | Blood pressure is 128/70 {BP} |
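The rules in this table can be approximated with a few regular expressions. The sketch below is illustrative only and is not the authors' implementation: the pattern names, the optional "is" filler word, the 2 to 3 digit limit on blood pressure values, and the acceptance of a double-quote mark for inches are all assumptions added to make the table's examples parse.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# Units from the table; longer alternatives come first so "inches" is not cut to "in".
# Assumption: a double-quote mark is also accepted for inches.
HEIGHT_UNIT = r"(?:(?:inches|in|feet|ft|meters|m|centimeters|cm)\b|''|['\"])"
WEIGHT_UNIT = r"(?:pounds|lbs|lb|ounces|oz|kilograms|kg|grams|g)\b"


def _vital_regex(context, unit):
    # context word, optional colon, optional "is", then
    # [number], [number] [unit], or [number] [unit] [number] [unit]
    return re.compile(
        rf"\b(?:{context})\b:?\s*(?:is\s+)?({NUMBER})\s*({unit})?"
        rf"(?:\s*({NUMBER})\s*({unit}))?",
        re.IGNORECASE,
    )


HEIGHT_RE = _vital_regex("height|ht", HEIGHT_UNIT)
WEIGHT_RE = _vital_regex("weight|wt", WEIGHT_UNIT)
BP_RE = re.compile(
    r"\b(?:pressure|bp)\b:?\s*(?:is\s+)?(\d{2,3}\s*/\s*\d{2,3})", re.IGNORECASE
)


def label_vitals(text):
    """Return (token, LABEL) pairs in the table's 'word {LABEL}' spirit."""
    labels = []
    for name, regex in (("HEIGHT", HEIGHT_RE), ("WEIGHT", WEIGHT_RE)):
        for m in regex.finditer(text):
            for idx in (1, 3):  # first and second [number] [unit] pair
                if m.group(idx):
                    labels.append((m.group(idx), name))
                if m.group(idx + 1):
                    labels.append((m.group(idx + 1), f"{name}_UNIT"))
    for m in BP_RE.finditer(text):
        labels.append((m.group(1), "BP"))
    return labels


for note in (
    "Ht: 63.5",
    "Height: 5 feet 11 inches",
    "Patient's weight is 170 lbs 9 oz",
    "Blood pressure is 128/70",
):
    print(note, "->", label_vitals(note))
```

Short units such as "in", "m", and "g" are also common English tokens, so rules of this kind can mislabel ordinary words that follow a number.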