Philip Darke, Sophie Cassidy, Michael Catt, Roy Taylor, Paolo Missier, Jaume Bacardit.
Abstract
Primary care EHR data are often of clinical importance to cohort studies; however, they require careful handling. Challenges include determining the periods during which EHR data were collected. Participants are typically censored when they deregister from a medical practice, but cohort studies wish to follow participants longitudinally, including those who change practice. Using UK Biobank as an exemplar, we developed methodology to infer continuous periods of data collection and maximize follow-up in longitudinal studies. This resulted in longer follow-up for around 40% of participants with multiple registration records (mean increase of 3.8 years from the first study visit). The approach did not sacrifice phenotyping accuracy when comparing agreement between self-reported and EHR data. A diabetes mellitus case study illustrates how the algorithm supports longitudinal study design and provides further validation. We use UK Biobank data; however, the tools provided can be used for other conditions and studies with minimal alteration.
Keywords: diabetes mellitus; electronic health records; longitudinal studies; medical record linkage; phenotype
Year: 2022 PMID: 34897458 PMCID: PMC8800530 DOI: 10.1093/jamia/ocab260
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Common issues in EHR data collection illustrated with synthetic participant data. These resemble realistic participant types; for example, around 70% of UK Biobank participants have data outside of periods of practice registration. Example 1: an individual registered at birth with a practice that subsequently adopted an EHR system in the 1990s (prior records are paper-based). Example 2: an individual registered with a practice in 1999, but records are also held from a previous period of registration with another practice. Example 3: multiple periods of registration are available from different practices and/or data providers. Example 4: a combination of the above issues. The boxed areas illustrate the periods of data collection inferred by our algorithm.
Figure 2. Application of our algorithm to determine periods of complete EHR data collection. The example participant has multiple periods of registration and data outside of registration periods. The boxed areas are the inferred periods of data collection. Further details are included in the Supplementary Materials (Algorithm A1 and Supplementary Figure S1).
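As a rough illustration of the general idea of joining several registration records into continuous periods, the sketch below merges overlapping or adjacent intervals for one participant. It is not Algorithm A1 from the Supplementary Materials: the function name `merge_periods`, the `max_gap_days` parameter, and the example dates are hypothetical, and the handling of data recorded outside registration periods is deliberately left out.

```python
from datetime import date, timedelta

def merge_periods(periods, max_gap_days=0):
    """Merge (start, end) date intervals that overlap or are separated by
    at most `max_gap_days` days. Hypothetical helper, not Algorithm A1."""
    merged = []
    for start, end in sorted(periods):
        if merged and start <= merged[-1][1] + timedelta(days=max_gap_days):
            # This interval overlaps or closely follows the previous one,
            # so extend the previous period instead of starting a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged

# Synthetic registration records (illustrative dates only).
registrations = [
    (date(1999, 3, 1), date(2006, 5, 31)),
    (date(2006, 6, 1), date(2016, 8, 15)),
    (date(1990, 1, 1), date(1994, 12, 31)),
]
continuous = merge_periods(registrations, max_gap_days=1)
for start, end in continuous:
    print(start.isoformat(), "to", end.isoformat())
```

With a one-day tolerance, the two back-to-back registrations from 1999 onward collapse into a single period, while the earlier 1990 to 1994 registration stays separate because of the multi-year gap.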
Agreement between self-reported and EHR data at the first UK Biobank visit for the conditions and medications used in the QDiabetes-2018 model
Active data collection at the first UK Biobank visit was determined using either our algorithm (Algorithm A1) or GP registration records.

| Condition or medication | Algorithm A1: Sensitivity (%) | Algorithm A1: Specificity (%) | Algorithm A1: Precision (%) | GP registration records: Sensitivity (%) | GP registration records: Specificity (%) | GP registration records: Precision (%) |
|---|---|---|---|---|---|---|
| Presence of a previous diagnostic record | | | | | | |
| Diabetes | | 99.8 | 95.8 | 94.3 | 99.8 | |
| Hypertension | | 98.1 | 93.2 | 72.1 | 98.1 | |
| MI/heart attack | 70.6 | 99.9 | 94.7 | | 99.9 | |
| Angina | 59.8 | 99.4 | 76.4 | | 99.4 | |
| Stroke | | 99.6 | | 55.4 | 99.6 | 64.9 |
| Transient ischemic attack | | 99.4 | 23.7 | 55.8 | 99.4 | 23.7 |
| Bipolar disorder | | 99.7 | 41.3 | 67.1 | | |
| Schizophrenia | | 99.8 | 28.4 | 87.0 | 99.8 | |
| Polycystic ovarian syndrome | | 99.8 | 22.4 | 57.0 | 99.8 | |
| Presence of a prescription record in previous 90 days | | | | | | |
| Antihypertensives | 86.0 | 98.2 | 93.6 | | 98.2 | |
| Statins | 88.1 | 97.9 | 89.0 | | 97.9 | 89.0 |
| Corticosteroids | 49.6 | 99.3 | 45.1 | | 99.3 | |
| Atypical antipsychotics | 79.7 | 100.0 | 85.6 | | 100.0 | 85.6 |
Note: Agreement was defined as the presence of a diagnostic record prior to the visit for medical conditions, or the presence of a prescription record in the 90 days prior to the visit for current medication. Sensitivity is the proportion of self-reporting participants that have a confirmatory EHR record. Specificity is the proportion of participants that do not self-report that also do not have an EHR record. Precision is the proportion of participants with an EHR record that also self-report. Overall agreement was similar under each approach, indicating that the algorithm did not sacrifice phenotyping accuracy. Bold indicates higher value.
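The three agreement measures defined in the note can be computed directly from paired indicators. The following is a minimal sketch that treats self-report as the reference, as the note describes; the function name `agreement_metrics` and the toy data are assumptions for illustration and are not taken from the study tools. It returns proportions rather than percentages.

```python
def agreement_metrics(self_report, ehr_record):
    """Sensitivity, specificity and precision of EHR records against
    self-report. Inputs are equal-length sequences of booleans."""
    pairs = list(zip(self_report, ehr_record))
    tp = sum(1 for s, e in pairs if s and e)          # self-report, EHR record
    fn = sum(1 for s, e in pairs if s and not e)      # self-report, no EHR record
    fp = sum(1 for s, e in pairs if not s and e)      # no self-report, EHR record
    tn = sum(1 for s, e in pairs if not s and not e)  # neither

    def ratio(num, den):
        # Undefined when the denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }

# Toy data for eight hypothetical participants.
self_report = [True, True, True, False, False, False, False, False]
ehr_record = [True, True, False, True, False, False, False, False]
metrics = agreement_metrics(self_report, ehr_record)
print(metrics)
```

In this toy data, two of the three self-reporting participants have an EHR record, and four of the five who do not self-report also have no record.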
Figure 3. Example output from the longitudinal phenotyping tool for a synthetic participant. Our algorithm was used to identify periods of complete data collection (top panel). Periods of nondiabetic hyperglycemia (prediabetes), type 2 diabetes, and remission were identified. Periods of medication and biomarkers are also shown. We phenotyped periods of complete data collection to reduce the risk of inaccurately identifying the date of incidence of diabetes. Similar phenotyping approaches using linked EHR data can be used to enforce study criteria or identify more complex endpoints.
QDiabetes-2018 model performance (concordance index) for the 10-year incidence of diabetes using UK Biobank EHR data
| | Model A (demographic data, medical history, and BMI) | Model B (A plus current fasting plasma glucose result) | Model C (A plus current HbA1c result) |
|---|---|---|---|
| Male | | | |
| UK Biobank | 0.781 | 0.831 | 0.882 |
| QResearch | 0.814 | 0.866 | 0.855 |
| Female | | | |
| UK Biobank | 0.832 | 0.877 | 0.904 |
| QResearch | 0.834 | 0.889 | 0.878 |
Note: Performance on the integrated linked EHR data is broadly in line with Hippisley-Cox et al. (shown as QResearch).
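For readers unfamiliar with the concordance index, the sketch below computes Harrell's C for right-censored data, one common definition. The record does not state which estimator was used, so this choice is an assumption, and the function name `harrell_c` and the toy data are hypothetical.

```python
def harrell_c(times, events, risks):
    """Harrell's concordance index.
    times:  follow-up time for each participant
    events: 1 if the event was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected event)"""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when participant i had an observed
            # event strictly before participant j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable if comparable else float("nan")

# Toy data: four participants, the third is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = harrell_c(times, events, risks)
print(round(c_index, 3))
```

Here five pairs are comparable and four are ranked correctly by the risk score, giving a value of 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.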