Sarah DeLozier1, Sarah Bland2, Melissa McPheeters2, Quinn Wells3, Eric Farber-Eger3, Cosmin A Bejan2, Daniel Fabbri2, Trent Rosenbloom2, Dan Roden4, Kevin B Johnson2, Wei-Qi Wei2, Josh Peterson2, Lisa Bastarache2.
Abstract
From the start of the coronavirus disease 2019 (COVID-19) pandemic, researchers have looked to electronic health record (EHR) data as a way to study possible risk factors and outcomes. To ensure the validity and accuracy of research using these data, investigators need to be confident that the phenotypes they construct are reliable and accurate, reflecting the healthcare settings from which they are ascertained. We developed a COVID-19 registry at a single academic medical center and used data from March 1 to June 5, 2020 to assess differences in population-level characteristics between pandemic and non-pandemic years. Median EHR length, previously shown to impact phenotype performance in type 2 diabetes, was significantly shorter in the SARS-CoV-2 positive group relative to a 2019 influenza tested group (median 3.1 years vs 8.7; Wilcoxon rank sum P = 1.3e-52). Using three phenotyping methods of increasing complexity (billing codes alone and domain-specific algorithms provided by an EHR vendor and clinical experts), common medical comorbidities were abstracted from COVID-19 EHRs, defined by the presence of a positive laboratory test (positive predictive value 100%, recall 93%). After combining performance data across phenotyping methods, we observed significantly lower false negative rates for those records billed for a comprehensive care visit (p = 4e-11) and those with complete demographics data recorded (p = 7e-5). In an early COVID-19 cohort, we found that phenotyping performance of nine common comorbidities was influenced by median EHR length, consistent with previous studies, as well as by data density, which can be measured using portable metrics including CPT codes. Here we present those challenges and potential solutions to creating deeply phenotyped, acute COVID-19 cohorts.
Keywords: Controlled terminologies and vocabularies; Data management; Phenomics; Phenotype
Year: 2021 PMID: 33838341 PMCID: PMC8026248 DOI: 10.1016/j.jbi.2021.103777
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 8.000
Fig. 1. Chronology of coronavirus disease 2019 (COVID-19) registry data types. “Test: T0” indicates the timestamp of a positive PCR test and defines the acute phase of disease in our registry. As depicted, T0 is critical for distinguishing between risk factors (e.g., history of DVT/PE in the pre-infection phase prior to T0) and sequelae of disease (e.g., acute DVT/PE in the acute or recovery phase). RD: Research Derivative, a database of clinical information curated from the EHR at Vanderbilt University Medical Center and restructured for research; T0: chronology of incoming raw data streams ordered with respect to a SARS-CoV-2 PCR test; DOB: Date of birth; COPD: Chronic obstructive pulmonary disease; Meds: Medications; O2: Oxygen; ICU: Intensive Care Unit; PFTs: Pulmonary function tests; PTSD: Post traumatic stress disorder.
Fig. 2. Data workflow. Registry created from individuals with at least one SARS-CoV-2 positive PCR test at any of our 18 care sites across the Mid-South between March 1 and June 5, 2020. COVID-19 case definition validated on a subset of EHRs dated after billing guidelines issued April 1. Random sampling of adult inpatient and outpatient EHRs selected for phenotyping. Double arrows indicate comparison between cohorts. Dotted lines indicate processes that contributed to decision making in the methods workflow (solid lines). COVID-19: coronavirus disease 2019; PCR: polymerase chain reaction; ICD-10: International classification of diseases, Tenth revision, Clinical modification; EHR: electronic health record; ICU: intensive care unit; COPD: chronic obstructive pulmonary disease; CHF: congestive heart failure; DVT/PE: deep venous thrombosis/pulmonary embolism.
Metadata studied within a COVID-19 cohort.
| Metadata (Data type) | Category | Description | Data Reference |
|---|---|---|---|
| Median EHR length (Years) | | Difference in years between the first recorded test date (either influenza or SARS-CoV-2) and first recorded visit, any type. | Data quantity |
| Missingness (Count) | | “Unknown” demographic(s) (i.e., any incomplete age, self-reported race, gender) data element in the RD. | Data quantity |
| Data density (Categorical, institution-specific) | No Visits | Individuals with no visit(s) billed prior to the week before the first test date. | Data quality |
| | No Primary Care Visit | Non-primary care visit(s) billed before the first test date. | |
| | Medical Home | At least one primary care visit, identified by local site IDs, billed before the first test date. | |
| Data density (Binary, not institution-specific) | | Presence of a CPT code that indicates a ‘Comprehensive history’ was taken prior to or on the day of the first SARS-CoV-2 test ( | Data quality |
COVID-19: Coronavirus disease 2019; EHR: Electronic health record; RD: Research Derivative, a database of clinical information curated from the EHR at Vanderbilt University Medical Center and restructured for research; CPT: Current procedural terminology.
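The metadata definitions above can be sketched in code. The following is a minimal illustration, not the registry's implementation: the `Visit` and `Patient` layouts, their field names, and the two boolean visit flags are hypothetical stand-ins for the RD schema, local site IDs, and CPT code lookups. It also assumes that EHR length is floored at zero when no visit precedes the test, and that the "No Visits" rule is checked first so the three density categories are mutually exclusive.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Visit:
    billed_on: date
    is_primary_care: bool        # assumed to be derived from local site IDs
    comprehensive_history: bool  # assumed flag for a 'Comprehensive history' CPT code


@dataclass
class Patient:
    first_test: date
    age: Optional[int]
    race: Optional[str]
    gender: Optional[str]
    visits: List[Visit] = field(default_factory=list)


def ehr_length_years(p: Patient) -> float:
    """Years between the first recorded visit and the first test date."""
    if not p.visits:
        return 0.0
    first_visit = min(v.billed_on for v in p.visits)
    # Assumption: a first visit after the test date counts as zero length.
    return max((p.first_test - first_visit).days, 0) / 365.25


def missing_demographics(p: Patient) -> int:
    """Count of demographic elements that are absent or recorded as 'Unknown'."""
    return sum(1 for value in (p.age, p.race, p.gender)
               if value is None or value == "Unknown")


def density_category(p: Patient) -> str:
    """Institution-specific data density: one of three categories."""
    week_before = p.first_test - timedelta(days=7)
    if not any(v.billed_on < week_before for v in p.visits):
        return "No Visits"
    if any(v.is_primary_care and v.billed_on < p.first_test for v in p.visits):
        return "Medical Home"
    return "No Primary Care Visit"


def has_comprehensive_history(p: Patient) -> bool:
    """Portable data density: comprehensive-history code on or before the test date."""
    return any(v.comprehensive_history and v.billed_on <= p.first_test
               for v in p.visits)
```

The binary CPT-based metric needs no site-specific clinic identifiers, which is why the table labels it as not institution-specific.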
Comparison of COVID-19 case definitions between April 1 and May 15, 2020.
| Case definition | Positive predictive value | Recall |
|---|---|---|
| ICD-10 Only | 90.6% | 46.4% |
| Laboratory testing Only | 100% | 93.0% |
| ICD-10 or Laboratory testing | 95.4% | 100% |
Reference standard is manual review of 140 charts for patients meeting any of the criteria of the more expansive COVID case definition. ICD-10: cases assigned billing code U07.1 after April 1st, 2020; Laboratory: ever SARS-CoV-2 PCR positive. Data pulled May 15, 2020.
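For readers less familiar with the two performance measures reported for each case definition, the sketch below shows how positive predictive value and recall follow from chart-review labels. The function name and the ten example charts are hypothetical and are not study data; each chart is represented as a pair of (flagged by the case definition, confirmed on manual review).

```python
from typing import Iterable, Tuple


def ppv_and_recall(charts: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (PPV, recall) from (flagged, confirmed) pairs."""
    tp = fp = fn = 0
    for flagged, confirmed in charts:
        if flagged and confirmed:
            tp += 1          # definition and reviewer agree on a case
        elif flagged:
            fp += 1          # flagged, but not confirmed on review
        elif confirmed:
            fn += 1          # true case missed by the definition
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return ppv, recall


# Illustrative labels only: 6 true positives, 1 false positive,
# 2 false negatives, 1 true negative.
example = ([(True, True)] * 6 + [(True, False)]
           + [(False, True)] * 2 + [(False, False)])
ppv, recall = ppv_and_recall(example)
false_negative_rate = 1 - recall
```

The false negative rate is the complement of recall, which is the quantity compared across data density groups in the phenotyping analysis.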
Metadata results for the SARS-CoV-2 tested cohort (March 1 through June 5, 2020) and the influenza tested cohort (March 1 through June 5, 2019).
| Metadata | Category | SARS-CoV-2 Positive (n = 2155) | SARS-CoV-2 Negative (n = 26,540) | Influenza tested 2019 (n = 1687) |
|---|---|---|---|---|
| Missing demographic data elements (any age, race, gender reported “Unknown”) | | 717 (33%) | 4585 (17%) | 7 (0.4%) |
| Median EHR length (median years [IQR]) | | 3.10 [0.0–11.1] | 6.16 [0.5–14.5] | 8.65 [1.6–16.2] |
| Data density (institution-specific) | No Visits | 723 (34%) | 5108 (19%) | 351 (21%) |
| | No PC Visit | 726 (34%) | 9559 (36%) | 469 (28%) |
| | Medical Home | 706 (33%) | 11,873 (45%) | 867 (51%) |
| Data density (comprehensive history CPT code) | | 878 (41%) | 16,665 (63%) | 1,447 (86%) |
EHR: Electronic health record; Medical Home: At least one primary care visit before the first test date as defined using clinic location identifiers; No PC visit: Only non-primary care visit(s) before the first test date; No visits: No billing dates prior to the week before the first test date; IQR: Interquartile range; CPT: Current procedural terminology.
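As a quick consistency check on the data density rows, the percentages can be recomputed from the reported counts. The snippet below is only an arithmetic sketch using the numbers in the table; the variable and function names are illustrative.

```python
# Cohort sizes, in column order: SARS-CoV-2 positive, SARS-CoV-2 negative,
# influenza tested 2019.
TOTALS = (2155, 26540, 1687)

# Institution-specific data density counts, as reported in the table.
DENSITY = {
    "No Visits": (723, 5108, 351),
    "No PC Visit": (726, 9559, 469),
    "Medical Home": (706, 11873, 867),
}

# Counts of patients with a comprehensive history CPT code.
CPT = (878, 16665, 1447)


def percent(count: int, total: int) -> int:
    """Percentage of the cohort, rounded to a whole number."""
    return round(100 * count / total)


# The three density categories partition each cohort, so each column
# should sum to its cohort size.
column_sums = tuple(sum(col) for col in zip(*DENSITY.values()))

density_percent = {
    name: tuple(percent(c, t) for c, t in zip(counts, TOTALS))
    for name, counts in DENSITY.items()
}
cpt_percent = tuple(percent(c, t) for c, t in zip(CPT, TOTALS))
```

Because the three institution-specific categories are mutually exclusive, their counts add up to each cohort size, while the CPT-based metric is an independent binary flag.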
Fig. 3. EHR: Electronic health record; COVID+: Patients testing positive for SARS-CoV-2 during the study period; COVID-: Patients testing negative for SARS-CoV-2 during the study period; Flu 2019: Patients testing positive for influenza during the equivalent dates, March 1 through June 5, 2019. CPT: Current procedural terminology.
Fig. 4. Probability of false negative results for 9 comorbid phenotypes across 3 phenotyping algorithms (ICD-10 based and ICD-10 plus domain-specific algorithms provided by an EHR vendor and clinical experts) among an early COVID-19 population, March 1 through June 5, 2020. Medical Home: At least one primary care visit before the first test date as defined using clinic location identifiers; No PC visit: Only non-primary care visit(s) before the first test date; No visits: No billing dates prior to the week before the first test date; Comprehensive CPT: patients with comprehensive history CPT code; No CPT: patients without comprehensive history CPT code; All demos available: EHRs with complete age, self-reported race, and gender data elements; Missingness (demos): at least one missing demographic variable in the EHR.
Fig. 5. Percent of total individuals tested between March 1 and June 5, 2020 grouped by data density categorized by center-specific visit identifiers. Since March, we have seen fewer individuals with any primary care visits in our medical system. Medical Home: At least one primary care visit before the first test date as defined using clinic location identifiers; No PC visit: Only non-primary care visit(s) before the first test date; No visits: No billing dates prior to the week before the first test date.
Lessons learned¹ from early phenotyping efforts during the coronavirus disease 2019 (COVID-19) pandemic.
| Category | Subcategory | Challenge | Potential solution |
|---|---|---|---|
| Data | Data Availability (Completeness) | Longitudinal records may not be available for all patients. | Anticipate data quality issues with available data types including electronic health record (EHR) metadata. |
| | Data Management (Timeliness) | Discordant temporality of data streams (e.g., from operational to structured data). | Evaluate time from event to data pull; create automated systems to accommodate differences. |
| | Data Validation (Correctness) | Patient histories may rely on data from limited visit(s) and visit types. | Evaluate ways data is gathered and recorded in your healthcare system. |
| Authoring | Defined Cohorts | No reliable billing code available to identify cohorts. | Validate local testing practices (i.e., presence of laboratory testing). |
| | Defined Logic | Data use requires knowledge of data cleaning processes. | Build a data dictionary documenting representation of data elements (e.g., Boolean, temporal) as well as cleaning methods. |
¹ Elements of the table were adapted with permission from Rasmussen, et al. 2019.[19]