
Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records.

Casey N Ta, Michel Dumontier, George Hripcsak, Nicholas P Tatonetti, Chunhua Weng.

Abstract

Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center's Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013-2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and observed-expected frequency ratio are informative of associations between clinical concepts, useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.

Year:  2018        PMID: 30480666      PMCID: PMC6257042          DOI: 10.1038/sdata.2018.273

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

Sharing clinical data is important for reproducible biomedical research and can drive discovery. Openly available data improves the accuracy of research, enables optimal generation of knowledge, drives discoveries undetectable in individual data sets, and enhances trust in clinical research[1-3]. However, even when patient-level data are fully de-identified following the HIPAA Safe Harbor Method, patient privacy is still at risk of re-identification by using external sources of identifying information[4]. Therefore, it is important to develop methods to share clinical statistics derived from patient data without risking patient re-identification. One such method is the sharing of prevalence and co-occurrence statistics of medical events. Prevalence measures the population disease burden and can be useful to clinicians for guiding differential diagnoses, insurance providers estimating healthcare costs, pharmaceutical companies forecasting new treatment market sizes, and researchers estimating power of a clinical trial protocol[5]. Co-occurrence statistics are frequently used to determine associations between entities, such as disease-disease pairs[6], diseases and clinical findings[7], and adverse drug events[8]. The prevalence, incidence rates, and other statistics of various diseases are commonly estimated from population surveys and reported in literature; however, these reports generally focus on specific classes of diseases. Grant et al. analyzed national interviews to determine the prevalence of 7 DSM-IV personality disorders and the co-occurrence among them via odds ratios[9]. Lee et al. analyzed co-occurrences of coronary artery disease, congestive heart failure, diabetes mellitus, urinary incontinence, and injurious falls within the geriatric population from health surveys[10]. 
The American Cancer Society releases annual reports of cancer statistics in the United States collected from cancer registries, including incidence, mortality, and survival from 46 anatomical cancer sites[11]. Although these reports can accurately estimate the disease prevalence in the general population, this knowledge is difficult to consume at scale, as it requires manual literature review. Clarivate Analytics (Philadelphia, PA) offers an Incidence and Prevalence Database including over 4000 diseases and procedures, but this database is not freely available. The National Cancer Institute provides SEER*Explorer to easily explore cancer statistics, but this database is limited to cancer.

Several research groups estimated prevalence from electronic health records (EHR) or pharmacy claims databases. Wiréhn et al. reported prevalence of diabetes, hypertension, asthma, and chronic obstructive pulmonary disease estimated from hospital and primary healthcare data in administrative databases[12]. Naughton et al. reported prevalence of 22 chronic diseases in elderly patients estimated from a pharmacy claims database[13]. Violán et al. compared prevalence estimates from health surveys vs. EHR data for 27 chronic conditions[14]. Ornstein et al. estimated prevalence and multi-morbidity of 24 chronic conditions from an EHR database covering primary care practices[15]. Bhattacharya et al. reported patterns of co-occurring conditions in patients with kidney disease by applying topic modeling on SNOMED codes[16]. Researchers who have access to an Observational Medical Outcomes Partnership (OMOP) database can use the open web applications ACHILLES and ATLAS to access useful statistics and scientific analyses, including counts per concept, prevalence rates, and frequencies of records per person[17]. Finlayson et al. published a data set of occurrence and co-occurrence frequencies covering ~23,000 clinical concepts (drugs, diseases, procedures, and devices) and ~18,500,000 concept pairs detected from unstructured clinical notes from 261,397 patients[18]. Finlayson et al. demonstrated how these co-occurrence frequencies are useful in many research scenarios, including computing contingency tables used in standard statistical analysis, estimating probabilities of concepts to construct Bayesian networks, and quantifying dependencies between features to improve feature selection and model design.

To accelerate translational biomedical research, we present Columbia Open Health Data (COHD), a database of EHR prevalence and co-occurrence frequencies on conditions, drugs, procedures, and demographics (sex, race, and ethnicity) observed per patient at Columbia University Irving Medical Center (CUIMC), covering 36,578 single concepts and 32,788,901 concept pairs from 5,364,781 patients. We present a novel method of collecting EHR prevalence and co-occurrence frequencies from structured EHR data in the OMOP format and sharing these statistics via a web application-programming interface (API) (http://cohd.io). Analyzing an OMOP database and sharing the code on GitHub immediately enables any institution with clinical data in OMOP format to perform this analysis and share their results. Institutions interested in joining the Observational Health Data Sciences and Informatics (OHDSI) Research Network will find an active and open community ready to help integrate new partners. These data are also available for download from the Figshare repository (Data Citation 1).

Methods

In this article, we use the term “concept” to refer generally to clinical entities and events, such as conditions, drugs, and procedures. Concepts can vary in their level of specificity, e.g., from Type 2 diabetes mellitus without complication to Metabolic disease. We refer to clinical concepts by their concept name as defined in the OMOP Common Data Model (CDM). When concepts appear in the main body of this article, the concept name is styled in italics (e.g., Essential hypertension and Chest pain) to distinguish the formalized concepts from regular text. Similarly, we style identifier (ID) strings from the OMOP CDM in italics (e.g., Condition, Drug, and Procedure for domain identifiers). Figure 1 depicts the overall workflow for generating the COHD datasets and the COHD API. Briefly, we extracted conditions, drugs, procedures, and demographics from CUIMC’s OMOP database to calculate prevalence and co-occurrence frequencies. The lifetime dataset measured prevalence and co-occurrence from data from all available years while the 5-year dataset only used data from 2013–2017. For patient protection, we excluded concepts with counts ≤ 10 and perturbed the included counts using Poisson randomization. The resulting data are stored in a MySQL database and served to the public via the COHD RESTful web API. Details of these steps follow.
Figure 1

Columbia Open Health Data (COHD) workflow.

Overall workflow of COHD analysis and application-programming interface (API). We analyzed an Observational Medical Outcomes Partnership (OMOP) database created from Columbia University Irving Medical Center (CUIMC) and New York Presbyterian’s (NYP) clinical data warehouse. We extracted conditions, drugs, procedures, and demographics to calculate prevalence and co-occurrence frequencies. The lifetime dataset used all data while the 5-year dataset only used data from 2013–2017. For patient protection, we excluded concepts with counts ≤ 10 and perturbed the remaining counts using Poisson randomization. The resulting data are stored in a MySQL database and served publicly via the COHD Representational State Transfer (REST) API.

Data source

This study received institutional review board approval with waiver for informed consent. We analyzed data from CUIMC’s OHDSI database. The OHDSI database was derived from longitudinal electronic health records including inpatient and outpatient data spanning from 1985 to 2018. CUIMC’s clinical data warehouse (CDW) was converted to OMOP CDM v5.1 in March 2018. CUIMC and New York Presbyterian (NYP) Hospital serve New York, NY and the surrounding area. The diverse population of 8.2 M people in New York City, including 44.0% White, 25.5% Black, 12.7% Asian, 13.0% Some Other Race, and 4.0% Two or More Races[19], provides an ideal environment for generating aggregated statistics. We extracted all rows from the OMOP condition_occurrence, drug_exposure, and procedure_occurrence tables to provide patients’ observed conditions, drugs, and procedures using the condition_concept_id, drug_concept_id, and procedure_concept_id columns, respectively. We extracted patients’ sex, race, and ethnicity from the person table’s gender_concept_id, race_concept_id, and ethnicity_concept_id columns, respectively. A patient is only included in a dataset if at least one condition, drug, or procedure is observed for that patient within the dataset. Performing this analysis on an OMOP database using OMOP standard concept identifiers provides several advantages over operating on the original CDW. OMOP is a deep information model that precisely specifies the encoding and relationship between concepts and categorizes them into domains (e.g., Condition, Drug, Procedure, Gender, Race, and Ethnicity), reducing ambiguity of the meaning and use of each concept. The OMOP standard concepts are rooted in and mapped to established vocabularies, including ICD-9-CM, SNOMED-CT, RxNorm, and UMLS, providing semantic interoperability with other knowledge sources. 
Performing the analysis on an OMOP database facilitates future generalizability testing and data aggregation through collaborations with other institutions in the OHDSI Research Network[17].

EHR prevalence and co-occurrence analyses

COHD reports the EHR prevalence and co-occurrence frequencies of concepts as detected from electronic health records. We define EHR prevalence as:

$$P_C = \frac{N_C}{N}$$

where $P_C$ is the EHR prevalence of concept $C$, $N_C$ is the number of unique patients observed with concept $C$ in a given period, and $N$ is the number of unique patients observed in the database in the same period. We define co-occurrence frequency as:

$$P_{C_1,C_2} = \frac{N_{C_1,C_2}}{N}$$

where $P_{C_1,C_2}$ is the co-occurrence frequency of concepts $C_1$ and $C_2$, and $N_{C_1,C_2}$ is the number of unique patients observed with both concepts $C_1$ and $C_2$. In these analyses, $N_C$ is the number of patients where the specific concept ID $C$ is used in the OMOP tables. We distinguish EHR prevalence from prevalence in the general population as EHR prevalence is observational and influenced by medical care processes. For example, a hypothetical patient with condition $C$ always counts towards the general population prevalence of $C$, but contributes to $N_C$ only if the patient has a recorded diagnosis for $C$ in the medical records. We discuss the differences between EHR prevalence and general population prevalence further in the Usage Notes section. For the patient counts, we assume person IDs in the person table uniquely identify patients. For each concept $C$, the number of unique person IDs was counted to indicate the number of patients observed with the given concept ($N_C$). For every pair of concepts, the number of unique patient identifiers observed with both concepts was counted to indicate the paired concept count ($N_{C_1,C_2}$). We performed the above analyses on two subsets of data. First, we analyzed the entire database without restriction by date, referred to as the lifetime dataset. Following data quality analyses, we identified a 5-year range from 2013–2017, where annual clinical data were more stable. We performed the same analyses restricted to this date range and provide the results, referred to as the 5-year dataset.
To protect patients against potential re-identification risks, any concepts with $N_C$ ≤ 10 or pairs of concepts with $N_{C_1,C_2}$ ≤ 10 were excluded from the dataset. Furthermore, the true counts were randomized by replacing the actual count with a random draw from a Poisson distribution with the expected number of events ($\lambda$) set to the observed concept count ($N_C$). The Poisson is the probabilistic distribution of events occurring in a given interval if the events occur at a known rate ($\lambda$) and independently of the time since the last event[20]. Iatrogenic concept codes were removed from the data set based on a list of 2943 potentially iatrogenic ICD-9-CM, ICD-10-CM, and SNOMED-CT concept codes (e.g., ICD-9-CM code 996.82 Complication of transplanted liver). To provide a metric for assessing the temporal stability of these measurements, we calculated the mean and standard deviation of the annual prevalence and co-occurrence rates. The annual prevalence and co-occurrence rates for each dataset were calculated over the years with data for the entire year (lifetime: 1986–2017; 5-year: 2013–2017), excluding years with data for only part of the year. We randomized each year’s single concept count and co-occurrence counts as described above prior to calculating the annual mean prevalence and co-occurrence rates. We used the true counts to calculate the standard deviation. Data resulting from these analyses are available through the COHD API and downloadable from the Figshare data repository as flat-files (Data Citation 1).
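The counting and privacy steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' released code: the function names (`count_concepts`, `poisson_draw`, `protect`) and the in-memory `(person_id, concept_id)` row format are assumptions made for the example. The Poisson draw counts unit-rate arrivals in an interval of length lambda, which is a simple standard-library way to sample from a Poisson distribution; a production analysis would more likely use a numerical library.

```python
import random
from collections import Counter
from itertools import combinations

MIN_COUNT = 10  # concepts and pairs with true count <= 10 are excluded


def count_concepts(records):
    """Count unique patients per concept and per concept pair.

    records: iterable of (person_id, concept_id) rows (hypothetical format).
    Returns (single_counts, paired_counts, number_of_patients).
    """
    concepts_by_person = {}
    for person_id, concept_id in records:
        concepts_by_person.setdefault(person_id, set()).add(concept_id)

    single = Counter()
    paired = Counter()
    for concepts in concepts_by_person.values():
        ordered = sorted(concepts)
        single.update(ordered)                    # each patient counted once per concept
        paired.update(combinations(ordered, 2))   # and once per unordered pair
    return single, paired, len(concepts_by_person)


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by counting unit-rate arrivals in [0, lam]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def protect(counts, n_patients, rng, min_count=MIN_COUNT):
    """Drop rare entries, then replace each true count with a Poisson draw."""
    released = {}
    for key, true_count in counts.items():
        if true_count <= min_count:
            continue  # excluded before any perturbation
        noisy = poisson_draw(true_count, rng)
        released[key] = (noisy, noisy / n_patients)  # (count, prevalence)
    return released
```

Because the exclusion is applied to the true counts and the perturbation afterwards, a released count can fall below the threshold, or even reach zero.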

Concept association analyses

The COHD API employs three methods to provide different perspectives on quantifying associations between concepts from co-occurrence frequencies.

Chi-square

The most common form of association analysis from co-occurrence data is the standard chi-square analysis. The chi-square analysis is informative of the dependence between two concepts. However, this analysis becomes very sensitive with large population sizes, such that statistically significant results may not be scientifically significant.
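As an illustration of why the test becomes so sensitive at large population sizes, the sketch below computes a Pearson chi-square statistic on the 2x2 contingency table implied by the single and paired counts. It is an assumption that a 2x2 test of independence with one degree of freedom is the relevant form; the text does not specify the exact construction used by the API, and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(n_both, n_c1, n_c2, n_total):
    """Pearson chi-square test of independence from concept counts.

    n_both: patients with both concepts; n_c1, n_c2: patients with each
    concept; n_total: patients in the dataset. Returns (chi2, p_value).
    """
    # Observed cells: both, concept 1 only, concept 2 only, neither.
    observed = [
        n_both,
        n_c1 - n_both,
        n_c2 - n_both,
        n_total - n_c1 - n_c2 + n_both,
    ]
    p1 = n_c1 / n_total
    p2 = n_c2 / n_total
    # Expected cells if the two concepts were independent.
    expected = [
        n_total * p1 * p2,
        n_total * p1 * (1 - p2),
        n_total * (1 - p1) * p2,
        n_total * (1 - p1) * (1 - p2),
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

At fixed proportions the statistic grows linearly with the number of patients, so multiplying every count by 1,000 multiplies the statistic by 1,000 even though the strength of the association is unchanged.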

Relative frequency

The relative frequency indicates how frequently concept $C_1$ occurs among patients who have concept $C_2$. This is similar to the conditional probability of $C_1$ given $C_2$. Relative frequency is calculated as:

$$F_{C_1|C_2} = \frac{N_{C_1,C_2}}{N_{C_2}}$$

where $F_{C_1|C_2}$ is the relative frequency of concept $C_1$ among patients observed with concept $C_2$, $N_{C_1,C_2}$ is the number of unique patients observed with both concepts $C_1$ and $C_2$, and $N_{C_2}$ is the count of patients with concept $C_2$.
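A short sketch of this calculation, with a hypothetical function name and made-up counts, shows that the measure is asymmetric: the same paired count gives a different value depending on which concept is used as the base.

```python
def relative_frequency(n_both, n_base):
    """Fraction of patients with the base concept who also have the other concept.

    n_both: patients observed with both concepts.
    n_base: patients observed with the base (conditioning) concept.
    """
    if n_base <= 0:
        raise ValueError("the base concept count must be positive")
    return n_both / n_base


# Hypothetical counts: 50 patients have both concepts,
# 200 have the second concept and 1,000 have the first.
given_second = relative_frequency(50, 200)    # first concept among patients with the second
given_first = relative_frequency(50, 1000)    # second concept among patients with the first
```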

Observed-expected frequency ratio

The observed-expected frequency ratio quantifies the strength of the dependence between two concepts. The natural logarithm of observed-expected frequency ratio (log ratio for short) is calculated as:

$$L_{C_1,C_2} = \ln\left(\frac{N_{C_1,C_2}}{N_{C_1} N_{C_2} / N}\right) = \ln\left(\frac{N_{C_1,C_2} \cdot N}{N_{C_1} \cdot N_{C_2}}\right)$$

where $L_{C_1,C_2}$ is the log ratio of concepts $C_1$ and $C_2$, $N_{C_1,C_2}$ is the number of unique patients observed with both concepts $C_1$ and $C_2$, $N_{C_1}$ and $N_{C_2}$ are the counts of patients observed with concept $C_1$ and concept $C_2$, respectively, and $N$ is the number of patients in the dataset. $N_{C_1} N_{C_2} / N$ estimates the expected co-occurrence count of concepts $C_1$ and $C_2$ assuming independence between the concepts. The ratio indicates whether the pair of concepts co-occurred more or less frequently relative to the expected frequency. The natural logarithm transforms the scale such that the magnitude indicates the strength of the dependence between the concepts and the sign indicates the direction.
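The log ratio can be computed directly from the four counts named in the definition. The function name below is hypothetical, and returning negative infinity for a zero paired count is a choice made for this sketch (Poisson-randomized counts can be zero), not something the text specifies.

```python
import math


def log_obs_exp_ratio(n_both, n_c1, n_c2, n_total):
    """Natural log of observed over expected co-occurrence count.

    The expected count assumes the two concepts occur independently.
    """
    expected = n_c1 * n_c2 / n_total
    if n_both == 0:
        return float("-inf")  # sketch-only convention for an empty observed cell
    return math.log(n_both / expected)
```

With 1,000 patients, 100 with the first concept and 200 with the second, the expected paired count is 20: an observed count of 20 gives a log ratio of zero, larger observed counts give positive values, and smaller ones give negative values.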

Code availability

The code to perform EHR prevalence and co-occurrence analyses was written in Python 2.7 and was performed on an OMOP CDM V5.1 database on Microsoft SQL Server 2014 SP2. Statistical tests were performed using the SciPy Python library version 0.19.1. The code and instructions to perform the EHR prevalence and co-occurrence analysis are publicly available on GitHub with no restrictions to access, allowing other institutions to replicate our analyses on their databases (https://github.com/CaseyTa/ehr_prevalence). The code only requires minimal modifications for any institution with an OMOP CDM v4 or v5 database. The COHD API was implemented using FLASK (Python web framework) running on uWSGI (application server container) and Nginx (web server). The data is stored on a MySQL server running on an Amazon Relational Database Service instance. To promote and facilitate open data sharing, the code and instructions to deploy the server are publicly available with no restrictions to access on GitHub (https://github.com/CaseyTa/cohd).

Data Records

COHD API

The COHD API, a RESTful (Representational State Transfer) web API, provides public access to the COHD data (http://cohd.io). Table 1 lists the API endpoints and their descriptions. The API endpoints are grouped into four resources based on their functionality. The metadata resources provide COHD metadata, including the available datasets, the number of single concepts per domain in each data set, the number of paired concepts per domain, and the number of patients in each data set. The OMOP resources provide definitions of the OMOP concept IDs, a search utility to find OMOP concepts by name, endpoints that map concepts between OMOP source vocabularies and OMOP standard concepts, and endpoints that use the EMBL-EBI Ontology Xref Service (https://www.ebi.ac.uk/spot/oxo/index) to map concepts between OMOP and external ontologies. The frequencies resources provide access to the EHR prevalence and co-occurrence data, including endpoints to retrieve single concept counts and paired concept counts for specified concepts, lists of most frequent concepts by domain, and lists of most frequent concepts associated with a specified concept. The association resources provide estimates of the degree of association between concepts, including chi-square analysis, relative frequency, and observed-expected frequency ratio.
Table 1

COHD application-programming interface (API) endpoints.

The COHD API endpoints are listed along with a brief description of each endpoint. The endpoints are grouped into four resources based on function: metadata, OMOP, frequencies, and association.

| API endpoint | Description |
| --- | --- |
| /metadata/datasets | Enumerates the datasets available in COHD |
| /metadata/domainCounts | The number of concepts in each domain |
| /metadata/domainPairCounts | The number of pairs of concepts in each pair of domains |
| /metadata/patientCount | The number of patients in the dataset |
| /omop/findConceptIDs | Search for OMOP concepts by name and domain |
| /omop/concepts | Concept definitions from concept ID |
| /omop/vocabularies | List of vocabularies |
| /omop/mapFromStandardConceptID | Map from a standard concept ID to concept code(s) in an external vocabulary |
| /omop/mapToStandardConceptID | Map from a non-standard concept code to a standard OMOP concept ID |
| /omop/xrefFromOMOP | Cross-reference from OMOP standard concepts to an ontology using the Ontology Xref Service |
| /omop/xrefToOMOP | Cross-reference from an ontology to OMOP standard concepts using the Ontology Xref Service |
| /frequencies/singleConceptFreq | Clinical frequency of individual concepts |
| /frequencies/pairedConceptFreq | Clinical frequency of a pair of concepts |
| /frequencies/mostFrequencyConcepts | Most frequent concepts [optional: by domain] |
| /frequencies/associatedConceptFreq | Clinical frequencies of all pairs of concepts given a concept id |
| /frequencies/associatedConceptDomainFreq | Clinical frequencies of all pairs of concepts given a concept id |
| /association/chiSquare | Chi-square analysis of paired concepts |
| /association/obsExpRatio | Observed Count / Expected Count |
| /association/relativeFrequency | Relative frequency between pairs of concepts |

The SmartAPI page provides detailed documentation of the COHD API as well as an interactive interface that allows users to perform simple queries. The API returns data in JSON (JavaScript Object Notation) format. The use cases described below demonstrate how researchers can use the API to answer various questions. Example Python code that demonstrates how to programmatically retrieve and analyze COHD data, including these use cases, is available in a Python notebook on GitHub (https://github.com/CaseyTa/cohd/).
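A request to one of the endpoints in Table 1 might look like the following sketch. Several details here are assumptions that should be checked against the SmartAPI documentation before use: the `/api` base path, the query parameter names `dataset_id` and `q`, and the placeholder concept ID. The helper names `build_url` and `get_json` are hypothetical and are not part of COHD.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://cohd.io/api"  # assumed base path; verify against the API documentation


def build_url(endpoint, **params):
    """Compose a request URL for an endpoint path such as /metadata/datasets."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}{endpoint}?{query}" if query else f"{BASE_URL}{endpoint}"


def get_json(endpoint, **params):
    """Fetch an endpoint and decode the JSON body (requires network access)."""
    with urlopen(build_url(endpoint, **params), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Parameter names and the concept ID below are placeholders for illustration.
example_url = build_url("/frequencies/singleConceptFreq", dataset_id=1, q=192855)
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before it is sent.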

Figshare

The single concept counts and paired-concept co-occurrences for the lifetime and 5-year data sets and the concept definitions are also available to download from Figshare as flat-files (Data Citation 1). Nine tab-delimited text files comprise this data record. The concept association analyses are not included in these records since they can be computed directly from single and paired-concept counts as described in the methods. In all files, concepts are referenced by their OMOP standard concept ID, and frequencies are relative to a maximum of 1.0 (1.0 = 100%). The nine files are:

1. The single concept counts from the lifetime data set. The columns are the concept ID, count of patients with this concept, and prevalence of patients with this concept.
2. The paired concept counts from the lifetime data set. The columns are the first concept ID, second concept ID, count of patients with this pair of concepts, and prevalence of patients with this pair of concepts.
3. The single concept means and standard deviations of annual prevalence from the lifetime data set. The columns are the concept ID, mean annual prevalence of this concept, and standard deviation of the annual prevalence of this concept.
4. The paired-concept means and standard deviations of annual co-occurrence rates from the lifetime data set. The columns are the first concept ID, second concept ID, mean annual co-occurrence rate of this concept pair, and standard deviation of the annual co-occurrence rate of this concept pair.
5. The single concept counts from the 5-year data set. The columns are the concept ID, count of patients with this concept, and prevalence of patients with this concept.
6. The paired concept counts from the 5-year data set. The columns are the first concept ID, second concept ID, count of patients with this pair of concepts, and prevalence of patients with this pair of concepts.
7. The single concept means and standard deviations of annual prevalence from the 5-year data set. The columns are the concept ID, mean annual prevalence of this concept, and standard deviation of the annual prevalence of this concept.
8. The paired-concept means and standard deviations of annual co-occurrence rates from the 5-year data set. The columns are the first concept ID, second concept ID, mean annual co-occurrence rate of this concept pair, and standard deviation of the annual co-occurrence rate of this concept pair.
9. The concept definitions. The columns are the concept ID, concept name, domain, source vocabulary (the vocabulary that originally defined this concept, e.g., SNOMED, RxNorm, etc.), concept class, and concept source code (the identifier for this concept from the source vocabulary).
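Loading one of the tab-delimited single concept count files could be sketched as below. The column header names (`concept_id`, `concept_count`, `concept_prevalence`) are assumptions for illustration and may not match the headers in the published files; the function name is likewise hypothetical.

```python
import csv
import io


def read_single_counts(handle):
    """Parse a tab-delimited single concept count file.

    Returns {concept_id: (count, prevalence)}. Header names are assumed.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        int(row["concept_id"]): (
            int(row["concept_count"]),
            float(row["concept_prevalence"]),
        )
        for row in reader
    }


# Small in-memory stand-in for a downloaded file, with made-up values.
sample = io.StringIO(
    "concept_id\tconcept_count\tconcept_prevalence\n"
    "1\t120\t0.5\n"
    "2\t30\t0.125\n"
)
counts = read_single_counts(sample)
```

For a real file, the same function would be called on an open file handle, for example one created with `open(path, newline="")`.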

Technical Validation

EHR prevalence and co-occurrence descriptive statistics

We analyzed CUIMC’s OMOP database to detect the EHR prevalence and co-occurrence of conditions, drugs, procedures, and demographics. Table 2 lists the OMOP data tables used in this analysis along with the number of records accessed to compute the EHR prevalence and co-occurrence frequencies for the lifetime and 5-year datasets. The lifetime dataset contains count and EHR prevalence data on 36,578 single concepts and 32,788,901 pairs of concepts from 5,364,781 patients, including 11,952 conditions, 12,334 drugs, and 10,816 procedures (Table 3). The 5-year dataset contains data on 29,964 single concepts and 15,927,195 pairs of concepts from 1,790,431 patients, including 10,159 conditions, 10,264 drugs, and 8,270 procedures. Table 4 lists the counts of pairs of concepts by domain.
Table 2

Number of EHR records evaluated.

Number of records in the Columbia Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) v5 tables as of March 2018 used to generate the lifetime and 5-year datasets.

| OMOP CDM table name | Number of records (Lifetime) | Number of records (5-year) |
| --- | --- | --- |
| condition_occurrence | 140,300,457 | 60,057,858 |
| drug_exposure | 78,878,159 | 45,298,565 |
| procedure_occurrence | 64,383,775 | 26,030,193 |
| person | 5,364,781 | 1,790,431 |

Table 3

Number of concepts by domain for single concept counts.

This table lists the number of unique concepts in each domain for the resulting lifetime and 5-year datasets. Descriptive statistics include the minimum, mean, and maximum prevalence among the concepts in each domain.

| Domain | Lifetime Count | Lifetime Min Prevalence¹ | Lifetime Mean Prevalence¹ | Lifetime Max Prevalence¹ | 5-Year Count | 5-Year Min Prevalence¹ | 5-Year Mean Prevalence¹ | 5-Year Max Prevalence¹ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Condition | 11952 | 5.59E-07 | 5.68E-04 | 8.57E-02 | 10159 | 2.00E-06 | 7.58E-04 | 1.31E-01 |
| Device | 204 | 1.491E-06 | 1.10E-04 | 6.19E-03 | 170 | 3.00E-06 | 2.15E-04 | 9.76E-03 |
| Drug | 12334 | 5.59E-07 | 3.82E-04 | 7.10E-02 | 10264 | 2.00E-06 | 7.71E-04 | 1.10E-01 |
| Ethnicity | 2 | 5.55E-02 | 8.05E-02 | 1.05E-01 | 2 | 1.09E-01 | 1.65E-01 | 2.22E-01 |
| Gender | 4 | 2.88E-04 | 2.50E-01 | 5.57E-01 | 4 | 6.00E-06 | 2.50E-01 | 5.79E-01 |
| Measurement | 235 | 1.68E-06 | 1.71E-03 | 8.14E-02 | 188 | 3.00E-06 | 3.67E-03 | 1.28E-01 |
| Observation | 993 | 9.32E-07 | 1.50E-03 | 7.18E-01 | 870 | 3.00E-06 | 2.80E-03 | 6.64E-01 |
| Procedure | 10816 | 5.592E-07 | 5.14E-04 | 6.02E-01 | 8270 | 2.00E-06 | 7.47E-04 | 2.41E-01 |
| Race | 32 | 1.491E-06 | 7.68E-03 | 1.12E-01 | 32 | 6.00E-06 | 1.04E-02 | 2.34E-01 |
| Relationship | 6 | 2.05E-06 | 1.80E-05 | 8.60E-05 | 5 | 1.20E-05 | 5.90E-05 | 2.33E-04 |

¹ Data are the minimum, mean, and maximum prevalence across concepts within the domain, respectively. Prevalence = number of patients with the concept/the total number of patients in the dataset.

Table 4

Number of pairs of concepts by domain for paired concept counts.

This table lists the number of unique pairs of concepts in each pair of domains for the resulting lifetime and 5-year datasets. Descriptive statistics include the minimum, mean, and maximum co-occurrence prevalence among the concept pairs in each domain.

| Domain 1 | Domain 2 | Lifetime Count | Lifetime Min Prevalence¹ | Lifetime Mean Prevalence¹ | Lifetime Max Prevalence¹ | 5-Year Count | 5-Year Min Prevalence¹ | 5-Year Mean Prevalence¹ | 5-Year Max Prevalence¹ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Condition | Condition | 4558391 | 0.00E+00 | 3.00E-05 | 3.02E-02 | 1933917 | 0.00E+00 | 5.40E-05 | 5.89E-02 |
| Condition | Device | 51651 | 1.86E-07 | 1.90E-05 | 4.88E-03 | 20045 | 5.59E-07 | 3.50E-05 | 7.11E-03 |
| Condition | Drug | 8384345 | 0.00E+00 | 2.80E-05 | 3.16E-02 | 4139944 | 0.00E+00 | 5.80E-05 | 4.28E-02 |
| Condition | Ethnicity | 17135 | 3.73E-07 | 1.40E-04 | 2.01E-02 | 14297 | 1.12E-06 | 2.62E-04 | 4.21E-02 |
| Condition | Gender | 20438 | 3.73E-07 | 3.32E-04 | 4.55E-02 | 16641 | 1.68E-06 | 4.62E-04 | 6.85E-02 |
| Condition | Measurement | 282893 | 0.00E+00 | 4.40E-05 | 3.90E-02 | 168514 | 5.59E-07 | 8.50E-05 | 5.73E-02 |
| Condition | Observation | 725341 | 0.00E+00 | 4.30E-05 | 5.39E-02 | 442601 | 0.00E+00 | 9.60E-05 | 6.86E-02 |
| Condition | Procedure | 5359461 | 0.00E+00 | 3.60E-05 | 6.67E-02 | 2252358 | 0.00E+00 | 7.00E-05 | 5.49E-02 |
| Condition | Race | 29851 | 1.86E-07 | 7.90E-05 | 2.07E-02 | 21300 | 5.59E-07 | 1.67E-04 | 4.36E-02 |
| Condition | Relationship | 134 | 1.49E-06 | 5.00E-06 | 2.90E-05 | 115 | 2.79E-06 | 1.40E-05 | 8.50E-05 |
| Device | Device | 588 | 9.32E-07 | 1.40E-05 | 2.30E-03 | 383 | 1.12E-06 | 3.00E-05 | 1.03E-03 |
| Device | Drug | 45589 | 0.00E+00 | 2.00E-05 | 2.29E-03 | 21921 | 1.68E-06 | 3.90E-05 | 4.51E-03 |
| Device | Ethnicity | 263 | 7.46E-07 | 4.10E-05 | 1.92E-03 | 224 | 2.79E-06 | 8.70E-05 | 3.45E-03 |
| Device | Gender | 343 | 9.32E-07 | 6.60E-05 | 3.39E-03 | 284 | 3.91E-06 | 1.27E-04 | 5.46E-03 |
| Device | Measurement | 1881 | 5.59E-07 | 3.00E-05 | 3.80E-03 | 929 | 1.68E-06 | 7.10E-05 | 8.18E-03 |
| Device | Observation | 5672 | 3.73E-07 | 2.60E-05 | 3.99E-03 | 3465 | 1.12E-06 | 6.30E-05 | 7.32E-03 |
| Device | Procedure | 35631 | 3.73E-07 | 2.20E-05 | 5.74E-03 | 14781 | 1.12E-06 | 4.20E-05 | 9.63E-03 |
| Device | Race | 316 | 7.46E-07 | 3.40E-05 | 2.04E-03 | 252 | 1.68E-06 | 8.20E-05 | 3.59E-03 |
| Drug | Drug | 4465396 | 0.00E+00 | 3.00E-05 | 4.23E-02 | 2441654 | 0.00E+00 | 6.90E-05 | 5.74E-02 |
| Drug | Ethnicity | 16984 | 1.86E-07 | 1.25E-04 | 1.66E-02 | 14079 | 1.12E-06 | 2.88E-04 | 3.11E-02 |
| Drug | Gender | 20558 | 3.73E-07 | 2.29E-04 | 4.27E-02 | 16929 | 1.12E-06 | 4.67E-04 | 6.83E-02 |
| Drug | Measurement | 278313 | 1.86E-07 | 4.20E-05 | 4.14E-02 | 178208 | 5.59E-07 | 9.40E-05 | 5.98E-02 |
| Drug | Observation | 712097 | 0.00E+00 | 4.10E-05 | 4.51E-02 | 474720 | 0.00E+00 | 1.04E-04 | 5.92E-02 |
| Drug | Procedure | 5223373 | 0.00E+00 | 3.40E-05 | 5.88E-02 | 2520639 | 0.00E+00 | 7.40E-05 | 5.31E-02 |
| Drug | Race | 26680 | 3.73E-07 | 7.40E-05 | 1.81E-02 | 21492 | 1.12E-06 | 1.83E-04 | 3.52E-02 |
| Drug | Relationship | 132 | 9.32E-07 | 4.00E-06 | 2.10E-05 | 104 | 3.91E-06 | 1.30E-05 | 6.10E-05 |
| Ethnicity | Gender | 8 | 2.24E-06 | 2.02E-02 | 5.92E-02 | 5 | 1.90E-05 | 6.62E-02 | 1.27E-01 |
| Ethnicity | Measurement | 382 | 1.12E-06 | 3.88E-04 | 1.74E-02 | 320 | 5.59E-06 | 9.54E-04 | 3.24E-02 |
| Ethnicity | Observation | 1305 | 1.12E-06 | 3.19E-04 | 2.84E-02 | 1169 | 2.79E-06 | 8.82E-04 | 8.44E-02 |
| Ethnicity | Procedure | 13121 | 3.73E-07 | 1.54E-04 | 7.32E-02 | 9751 | 1.12E-06 | 3.18E-04 | 6.60E-02 |
| Ethnicity | Race | 44 | 2.24E-06 | 2.87E-03 | 6.66E-02 | 43 | 3.91E-06 | 5.88E-03 | 1.47E-01 |
| Ethnicity | Relationship | 3 | 2.42E-06 | 8.00E-06 | 1.90E-05 | 3 | 9.49E-06 | 2.40E-05 | 5.20E-05 |
| Gender | Measurement | 407 | 1.12E-06 | 9.88E-04 | 4.30E-02 | 328 | 3.35E-06 | 2.10E-03 | 6.85E-02 |
| Gender | Observation | 1613 | 7.46E-07 | 9.23E-04 | 4.05E-01 | 1400 | 2.23E-06 | 1.74E-03 | 3.85E-01 |
| Gender | Procedure | 17049 | 5.59E-07 | 3.25E-04 | 3.47E-01 | 12159 | 5.59E-07 | 5.07E-04 | 1.43E-01 |
| Gender | Race | 58 | 1.68E-06 | 4.23E-03 | 6.17E-02 | 51 | 3.91E-06 | 6.52E-03 | 1.31E-01 |
| Gender | Relationship | 8 | 7.46E-07 | 1.20E-05 | 5.50E-05 | 8 | 4.47E-06 | 3.80E-05 | 1.93E-04 |
| Measurement | Measurement | 5421 | 3.73E-07 | 1.06E-04 | 1.18E-02 | 3922 | 2.23E-06 | 2.51E-04 | 3.52E-02 |
| Measurement | Observation | 24619 | 3.73E-07 | 8.00E-05 | 5.40E-02 | 17901 | 1.68E-06 | 1.98E-04 | 7.50E-02 |
| Measurement | Procedure | 174390 | 1.86E-07 | 6.00E-05 | 6.45E-02 | 97220 | 5.59E-07 | 1.28E-04 | 6.46E-02 |
| Measurement | Race | 901 | 9.32E-07 | 1.60E-04 | 1.90E-02 | 770 | 2.23E-06 | 3.95E-04 | 3.61E-02 |
| Measurement | Relationship | 15 | 1.49E-06 | 9.00E-06 | 5.40E-05 | 12 | 3.35E-06 | 3.20E-05 | 1.51E-04 |
| Observation | Observation | 34883 | 3.73E-07 | 7.90E-05 | 3.81E-02 | 26658 | 1.12E-06 | 2.25E-04 | 1.00E-01 |
| Observation | Procedure | 466832 | 1.86E-07 | 5.40E-05 | 4.53E-01 | 275586 | 5.59E-07 | 1.24E-04 | 1.84E-01 |
| Observation | Race | 2383 | 7.46E-07 | 1.61E-04 | 2.74E-02 | 2123 | 1.68E-06 | 4.55E-04 | 8.09E-02 |
| Observation | Relationship | 57 | 1.30E-06 | 1.30E-05 | 6.90E-05 | 55 | 3.35E-06 | 4.00E-05 | 1.98E-04 |
| Procedure | Procedure | 1761248 | 0.00E+00 | 4.30E-05 | 1.36E-01 | 743843 | 0.00E+00 | 8.70E-05 | 8.70E-02 |
| Procedure | Race | 20606 | 5.59E-07 | 9.40E-05 | 7.81E-02 | 14001 | 1.68E-06 | 2.13E-04 | 6.68E-02 |
| Procedure | Relationship | 89 | 1.12E-06 | 8.00E-06 | 7.60E-05 | 68 | 3.91E-06 | 2.70E-05 | 2.18E-04 |
| Race | Relationship | 3 | 2.24E-06 | 1.10E-05 | 2.90E-05 | 3 | 6.70E-06 | 3.60E-05 | 8.80E-05 |

¹ Data are the minimum, mean, and maximum prevalence across concepts within the pair of domains, respectively. Prevalence = number of patients with the concept/the total number of patients in the dataset.

The EHR prevalence and co-occurrence analyses included all concepts from the OMOP condition_occurrence, drug_exposure, and procedure_occurrence tables (Table 2). These tables should only contain concepts from the domains Condition, Drug, and Procedure, respectively. However, due to errors in the extract, transform, load (ETL) process that converts CUIMC’s clinical data warehouse into the OMOP data tables, these tables also contained concepts from other domains, including Device, Measurement, Observation, and Relationship (Tables 3 and 4). Since these extraneous data do not affect the counts of concepts of interest (conditions, drugs, procedures, and demographics), we did not exclude them from the analysis. However, counts for concepts from the Device, Measurement, Observation, and Relationship domains may have limited accuracy since the relevant device_exposure, measurement, and observation OMOP tables were not included in the analysis.

Data quality

To assess temporal stability of the clinical occurrence measurements, we calculated the nonrandomized counts of all concepts on a yearly basis, looking for consistency from year to year. We evaluated the yearly counts on an individual concept level for demographic data. For each of the conditions, drugs, and procedures domains, we reviewed the total counts across concepts in each domain and the count-per-capita to identify data quality issues. Figure 2 shows the total counts across all concepts (blue) and counts per capita (orange) of a) condition occurrences, b) drug exposures, c) procedure occurrences, and d) people per year. The total counts and count per capita of conditions, drugs, and people increase over time, with conditions and people beginning to increase in 1985, and drugs beginning in 2001 and increasing steadily. Speaking to an expert, we learned that the ancillary system for drugs came online in 2001 and consequently had no data for prior years. The total counts for conditions and people grow relatively steadily throughout the years, but both sets of counts spike in 2014 while the count-per-capita for conditions remains in line with its neighboring years. These data suggest that in 2014, the CUIMC database contained condition data on additional patients who were not included in other years. Counts of procedures begin in 1987 but exhibit unstable behavior prior to 2005. The total counts of procedures grew from 1987 to 1990, steadily dropped until 1995, suddenly increased in 1996-1997, dropped again until 2001, rapidly grew until 2005, and steadily grew thereafter. Counts across all domains drop in 2018 since the database only contains data for a quarter of the year.
Figure 2

Annual total counts and counts per capita per domain.

Total counts (blue) and counts per capita (orange) of a) condition occurrences, b) drug exposures, c) procedure occurrences, and d) people per year.

Figure 3 shows the yearly EHR prevalence of individual concepts related to a) sex, b) ethnicity, and c) race. Sex remained stable throughout the years, yielding lifetime EHR prevalence of 55.7% FEMALE, 44.0% MALE, 0.04% AMBIGUOUS, and 0.03% UNKNOWN. However, ethnicity and race data fluctuated throughout the years, with lifetime EHR prevalence of Unknown ethnicity and Unknown race accounting for 83.9% and 82.8% of the patient population, respectively. Although sex data appear reliable, missing data are a major issue for ethnicity and race. We will report an in-depth analysis of the data quality issues in a separate article.
Figure 3

Annual demographics prevalence rates.

EHR prevalence per year of a) sex, b) ethnicity, and c) race. For visual clarity, panel c excludes races with EHR prevalence < 0.001.

Due to these issues, we decided to compute and provide the 5-year statistics restricted to events occurring between 2013 and 2017 in addition to the original lifetime dataset. The domain total counts within this time range are not completely stable, but they do not exhibit major fluctuations as seen prior to 2005. We make both datasets available for public consumption since certain use cases, such as detecting concept associations, may benefit from the larger sample sizes while tolerating these instabilities. The mean and standard deviation of single concept prevalence and concept-pair co-occurrence rates calculated annually are available to assess the temporal variance of each concept and concept-pair. The mean and standard deviation of annual prevalence and co-occurrence rates should only be compared to each other to assess stability of the concept over the time period of the dataset. They should not be compared to their 5-year and lifetime counterparts since 5-year and lifetime prevalence are expected to be larger than 1-year prevalence.

Poisson randomization

The true (nonrandomized) patient counts were perturbed using Poisson randomization to add a layer of protection for patient privacy. To determine the impact of the Poisson randomization on interpretation of the counts, we performed a chi-squared analysis on the counts of all single concepts and all pairs of concepts from the lifetime dataset. We compared the nonrandomized and randomized counts and calculated the frequency of statistically significant results (α = 0.05), that is, the number of concepts (or pairs of concepts) where the randomized count is significantly different from the true count divided by the total number of concepts (or pairs of concepts). This indicates how often the randomization process causes the reported counts to be significantly different from the true counts. The randomized counts were significantly different (P < 0.05) in 4.76% (1,740/36,578) of single-concept counts and 4.93% (1,617,257/32,788,901) of paired-concept counts. This matches our expectation that, at α = 0.05, 5% of tests will reject the null hypothesis when the samples are randomly drawn from the same population.

To understand how the Poisson randomization affected counts at different magnitudes, we evaluated the absolute percentage difference between randomized and nonrandomized single concept counts in the lifetime dataset. Figure 4 shows a scatter plot of the absolute percentage difference vs. the true counts. The maximum absolute percentage difference decreases as the true concept count increases. This trend shows that the Poisson randomization has the largest relative effect on concepts with small counts (count < 100) and small relative effect on concepts with very large counts. This is a desirable characteristic for the randomization process since rare concepts with low prevalence have greater potential risk for patient re-identification than common concepts with high prevalence. This also indicates that investigators using COHD data should be more conservative in their analyses when the count is small.
Figure 4

Effect of Poisson randomization.

Absolute percentage difference between Poisson randomized and true counts vs true counts for single concept counts in the lifetime dataset.
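
The shrinking relative effect of Poisson randomization at larger counts can be illustrated with a small simulation. This is only a sketch: it assumes each randomized count is drawn from a Poisson distribution whose mean equals the true count, and the sampler, function names, seed, and example counts are illustrative choices rather than the authors' code.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method.

    Adequate for the moderate means used here; exp(-lam) underflows
    for very large lam, so this is not a general-purpose sampler.
    """
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def mean_abs_pct_diff(true_count, n_draws, rng):
    """Average absolute percentage difference between randomized and true counts."""
    total = 0.0
    for _ in range(n_draws):
        randomized = poisson_sample(true_count, rng)
        total += abs(randomized - true_count) / true_count
    return 100.0 * total / n_draws


rng = random.Random(0)  # fixed seed so the sketch is repeatable
results = {count: mean_abs_pct_diff(count, 500, rng) for count in (20, 100, 500)}
for count, pct in results.items():
    print(f"true count {count:>4}: mean absolute difference {pct:.1f}%")
```

Because a Poisson variable with mean λ has standard deviation √λ, the relative perturbation scales roughly as 1/√λ, which is consistent with the trend described for Figure 4.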

Usage Notes

Use case analyses

To demonstrate the utility of COHD, we discuss two general use cases for using COHD as a knowledge base or knowledge discovery platform. All results below use the 5-year dataset. To facilitate using COHD, we provide a Python notebook in our GitHub repository (https://github.com/CaseyTa/cohd) that demonstrates how to make the API calls that perform these analyses and more.

Use case 1: knowledge base

The simplest use case for COHD is to look up the EHR prevalence of a single concept or the co-occurrence frequency of a pair of concepts. COHD references concepts by their OMOP standard concept ID. Users interested in finding out the EHR prevalence for a particular concept can search for a concept’s OMOP standard concept ID using the COHD API to perform a string similarity search. For example, an investigator interested in hypertension and hyperlipidemia will find the concept IDs 320128 (Essential hypertension) and 432867 (Hyperlipidemia), respectively. The investigator can then retrieve the EHR prevalence for Essential hypertension (13.1% of patients) and Hyperlipidemia (8.1%), and the co-occurrence frequency between the two (5.9%). To facilitate exploration, the COHD API also provides lists of results ordered by EHR prevalence and optionally filtered by domain. For example, an investigator interested in finding out the most commonly observed conditions will find the top 10 conditions listed in Table 5. Similarly, investigators can retrieve the concepts that most commonly co-occur with a concept of interest. Table 6 lists the 10 drugs that most frequently co-occur among patients with Atrial fibrillation, a condition in which the heart beats irregularly, often causing poor blood flow. However, analysis of the raw co-occurrence frequency between pairs of concepts may not provide much valuable insight to investigators because the most prevalent individual concepts dominate this list. For example, the drug that most commonly co-occurs with Atrial fibrillation is Acetaminophen 325 MG Oral Tablet [Tylenol], a general pain reliever, co-occurring with Atrial fibrillation in 1.18% of patients. Despite not being strongly associated, Acetaminophen 325 MG Oral Tablet [Tylenol] likely co-occurs frequently with Atrial fibrillation because it has high individual prevalence (11.0%).
Table 5

Most prevalent conditions.

The top 10 conditions with the highest EHR prevalence are listed.

| concept | count (1) | EHR prevalence (2) |
| --- | --- | --- |
| Essential hypertension | 233790 | 0.130577 |
| Chest pain | 152005 | 0.084899 |
| Hyperlipidemia | 145367 | 0.081191 |
| Abdominal pain | 124820 | 0.069715 |
| Dyspnea | 103718 | 0.057929 |
| Inflamed seborrheic keratosis | 96902 | 0.054122 |
| Cough | 94453 | 0.052754 |
| Neoplasm of uncertain behavior of skin | 85041 | 0.047498 |
| Coronary arteriosclerosis in native artery | 80244 | 0.044818 |
| Electrocardiogram abnormal | 75524 | 0.042182 |

(1) Data are the number of patients with the specified condition from the 5-year dataset.
(2) Data are the EHR prevalence of the specified condition from the 5-year dataset. EHR prevalence = count/patient population; patient population is 1,790,431.

Table 6

Top drugs co-occurring with atrial fibrillation.

The top ten drugs with the highest co-occurrence count with atrial fibrillation are listed.

| concept | count (1) | EHR prevalence (2) |
| --- | --- | --- |
| Acetaminophen 325 MG Oral Tablet [Tylenol] | 21165 | 0.011821 |
| 0.5 ML pneumococcal capsular polysaccharide type 1 vaccine 0.05 MG/ML/pneumococcal capsular polysaccharide type 10 A vaccine 0.05 MG/ML/pneumococcal capsular polysaccharide type 11 A vaccine 0.05 MG/ML/pneumococcal capsular polysaccharide type 12 F vac (3) | 18340 | 0.010243 |
| Docusate Sodium 100 MG Oral Capsule | 17130 | 0.009568 |
| 1000 ML Sodium Chloride 9 MG/ML Injection | 16804 | 0.009385 |
| sennosides, USP 8.6 MG Oral Tablet | 15877 | 0.008868 |
| Aspirin 81 MG Delayed Release Oral Tablet | 15554 | 0.008687 |
| heparin sodium, porcine 5000 UNT/ML Injectable Solution | 15215 | 0.008498 |
| Aspirin 81 MG Oral Tablet | 13716 | 0.007661 |
| 2 ML Ondansetron 2 MG/ML Injection | 13526 | 0.007555 |
| Acetaminophen 325 MG/Oxycodone Hydrochloride 5 MG Oral Tablet | 13041 | 0.007284 |

(1) Data are the number of patients with the specified condition from the 5-year dataset.
(2) Data are the EHR prevalence of the specified condition from the 5-year dataset. EHR prevalence = count/patient population; patient population is 1,790,431.
(3) The name of this concept was truncated due to the 255 character limit for concept names in the OMOP CDM.
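
The prevalence columns in Tables 5 and 6 follow directly from the footnote definition (EHR prevalence = count/patient population). A minimal sketch of that arithmetic, using counts copied from the tables; the function and variable names are illustrative and are not part of the COHD API.

```python
# Patient population of the 5-year dataset, as stated in the table footnotes.
PATIENT_POPULATION = 1_790_431


def ehr_prevalence(count, population=PATIENT_POPULATION):
    """EHR prevalence = count / patient population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population


# Counts taken from Table 5 (single concept) and Table 6 (concept pair).
hypertension_prevalence = ehr_prevalence(233_790)   # Essential hypertension
tylenol_afib_frequency = ehr_prevalence(21_165)     # Tylenol co-occurring with atrial fibrillation

print(f"Essential hypertension: {hypertension_prevalence:.1%}")
print(f"Tylenol with atrial fibrillation: {tylenol_afib_frequency:.2%}")
```

The same division applies to single-concept counts and to paired-concept counts, which is why a highly prevalent drug can top a raw co-occurrence list without being specifically associated with the condition.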

Use case 2: concept associations

Instead of analyzing raw co-occurrence frequencies, sorting concepts by relative frequency (equation (3)) can yield better insight for knowledge discovery. Applying relative frequency analysis to the same question of investigating the most common drugs associated with Atrial fibrillation produces the results in Table 7. The relative frequency informs investigators that among patients who take the drug named in each row, this proportion of patients experience atrial fibrillation. Note that the relative frequency can exceed the upper limit of 1.0 due to the Poisson randomization. In these data, nearly all patients who take these drugs have experienced atrial fibrillation. Several of these drugs treat arrhythmia, including dronedarone 400 MG Oral Tablet, Flecainide Acetate 50 MG Oral Tablet [Tambocor], and Sotalol Hydrochloride 160 MG Oral Tablet [Betapace]. Other drugs included in this list are often prescribed to treat associated conditions. For example, patients with atrial fibrillation are at risk of developing blood clots and subsequently having a stroke; thus, physicians commonly prescribe anticoagulants like dabigatran etexilate 75 MG Oral Capsule and Warfarin to reduce the risk of stroke[21].
Table 7

Drugs with the highest relative frequency among patients with atrial fibrillation.

The top ten drugs associated with atrial fibrillation via relative frequency analysis are listed.

| concept | paired concept count (1) | base concept count (2) | relative frequency (3) |
| --- | --- | --- | --- |
| 50 ML idarucizumab 50 MG/ML Injection [Praxbind] | 15 | 11 | 1.364 |
| dronedarone 400 MG Oral Tablet | 45 | 37 | 1.216 |
| dabigatran etexilate 75 MG Oral Capsule | 39 | 34 | 1.147 |
| 5 ML Dopamine Hydrochloride 80 MG/ML Injection | 17 | 15 | 1.133 |
| Flecainide Acetate 50 MG Oral Tablet [Tambocor] | 16 | 15 | 1.067 |
| darbepoetin alfa 0.025 MG/ML Injection | 24 | 23 | 1.043 |
| ovine digoxin immune fab 40 MG Injection [DigiFab] | 31 | 31 | 1.000 |
| Diltiazem Hydrochloride 90 MG Oral Tablet [Cardizem] | 58 | 58 | 1.000 |
| Warfarin | 51 | 53 | 0.962 |
| Sotalol Hydrochloride 160 MG Oral Tablet [Betapace] | 42 | 44 | 0.955 |

(1) Data are the number of patients exposed to the specified drug and atrial fibrillation from the 5-year dataset.
(2) Data are the number of patients exposed to the specified drug in the 5-year dataset.
(3) Data are the ratio [paired concept count]/[base concept count] (equation (3)). Relative frequency can exceed the upper limit of 1.0 due to Poisson randomization.

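
The ranking in Table 7 can be reproduced from the two count columns using the ratio given in the table footnote ([paired concept count]/[base concept count]). The sketch below uses a few rows copied from Table 7; the function name and data layout are illustrative assumptions, not the authors' code.

```python
def relative_frequency(paired_count, base_count):
    """Ratio of the paired concept count to the base concept count."""
    if base_count <= 0:
        raise ValueError("base_count must be positive")
    return paired_count / base_count


# (drug, paired count with atrial fibrillation, base count of the drug), from Table 7.
rows = [
    ("Warfarin", 51, 53),
    ("dronedarone 400 MG Oral Tablet", 45, 37),
    ("Diltiazem Hydrochloride 90 MG Oral Tablet [Cardizem]", 58, 58),
    ("50 ML idarucizumab 50 MG/ML Injection [Praxbind]", 15, 11),
]

# Sort from the highest to the lowest relative frequency.
ranked = sorted(rows, key=lambda row: relative_frequency(row[1], row[2]), reverse=True)

for drug, paired, base in ranked:
    print(f"{relative_frequency(paired, base):.3f}  {drug}")
```

Values above 1.0, such as 15/11 for idarucizumab, arise because the paired and base counts are randomized independently, so small counts deserve extra caution.
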
Similarly, investigators could query COHD for conditions related to a drug of interest to find potential primary uses, off-label uses, adverse side effects, and co-occurring conditions. Table 8 shows results for this type of analysis for Albuterol 0.83 MG/ML Inhalant Solution, a bronchodilator commonly used in patients with asthma, bronchitis, and other lung diseases. Among these results, Acute bronchitis due to rhinovirus and Acute severe exacerbation of mild persistent asthma are the most common conditions (by paired concept count) associated with albuterol treatment and are straightforward applications for its use[22,23]. Acute pulmonary insufficiency following thoracic surgery and Lung disease with systemic lupus erythematosus are less common but still known uses for albuterol[24,25]. We could not find a documented connection between albuterol and conditions like Gastrostomy hemorrhage and Leakage of cardiac device in literature.
Table 8

Conditions with the highest relative frequency among patients taking albuterol.

The top 10 conditions associated with albuterol via relative frequency analysis are listed.

| concept | paired concept count (1) | base concept count (2) | relative frequency (3) |
| --- | --- | --- | --- |
| Gastrostomy hemorrhage | 16 | 13 | 1.23076923 |
| Zygomycosis | 21 | 18 | 1.166666666 |
| Lung disease with systemic lupus erythematosus | 18 | 16 | 1.125 |
| Acute severe exacerbation of intrinsic asthma | 37 | 33 | 1.121212121 |
| Acute bronchitis due to rhinovirus | 100 | 90 | 1.111111111 |
| Acute pulmonary insufficiency following thoracic surgery | 41 | 37 | 1.108108108 |
| Injury of retroperitoneum without open wound into abdominal cavity | 13 | 12 | 1.083333333 |
| Leakage of cardiac device | 18 | 17 | 1.058823529 |
| Tracheostomy hemorrhage | 19 | 18 | 1.055555555 |
| Acute severe exacerbation of mild persistent asthma | 191 | 183 | 1.043715846 |

(1) Data are the number of patients observed with the specified condition and albuterol from the 5-year dataset.
(2) Data are the number of patients observed with the specified condition in the 5-year dataset.
(3) Data are the ratio [paired concept count]/[base concept count] (equation (3)). Relative frequency can exceed the upper limit of 1.0 due to Poisson randomization.

Additionally, the observed-expected frequency ratio (equation (4)) indicates whether a pair of concepts occur together more or less frequently than expected. Table 9 shows a sample of pairs of concepts throughout the range of association strengths. The sample pairs of concepts were automatically chosen by selecting the log ratios closest to whole values (e.g., 1, 2, 3, etc.) throughout the range. Pairs of concepts with highly positive log ratios include strongly associated concepts, such as the procedures Extirpation of Matter from Right External Auditory Canal, Via Natural or Artificial Opening and Extirpation of Matter from Left External Auditory Canal, Via Natural or Artificial Opening with a log ratio of 12.00. Pairs of concepts with log ratios near zero have weak or no association, such as Dexamethasone phosphate 10 MG/ML Injectable Solution, a corticosteroid used for many conditions, and Computed tomography, cervical spine; without contrast material, an imaging procedure. Pairs of concepts with highly negative log ratios include negatively associated concepts, such as the sex MALE and the procedure Gynecologic examination with a log ratio of −7.12. Such highly negative associations may alert to the presence of erroneous data in the EHR database when concepts that should never co-occur are recorded for the same patient.
Table 9

Sample associated concept pairs via observed-expected frequency ratio analysis.

The sample pairs of concepts were automatically chosen by selecting concept pairs with log ratios closest to whole values (e.g., 1, 2, 3, etc.) throughout the range of log ratios.

| concept 1 | concept 2 | count (1) | log ratio (2) |
| --- | --- | --- | --- |
| Extirpation of Matter from Right External Auditory Canal, Via Natural or Artificial Opening | Extirpation of Matter from Left External Auditory Canal, Via Natural or Artificial Opening | 19 | 12.00 |
| Unilateral repair of femoral hernia with graft or prosthesis | Obstructed femoral hernia | 18 | 11.00 |
| Treprostinil 5 MG/ML Injectable Solution [Remodulin] | Treprostinil 10 MG/ML Injectable Solution | 20 | 10.00 |
| Construction of tracheoesophageal fistula and subsequent insertion of an alaryngeal speech prosthesis (eg, voice button, Blom-Singer prosthesis) | Cervical lymphadenectomy (modified radical neck dissection) | 18 | 9.00 |
| Repair of pulmonary venous stenosis | Total anomalous pulmonary venous return | 15 | 8.00 |
| Desmopressin Acetate 0.01 MG/ACTUAT Nasal Spray | Panhypopituitarism | 57 | 7.00 |
| Arthrodesis, posterior technique, craniocervical (occiput-C2) | Short-latency somatosensory evoked potential study, stimulation of any/all peripheral nerves or skin sites, recording from the central nervous system; in upper and lower limbs | 15 | 6.00 |
| Insertion of Infusion Device into Right Atrium, Percutaneous Approach | Neutropenia | 60 | 5.00 |
| Other partial resection of small intestine | Post-traumatic wound infection | 9 | 4.00 |
| carvedilol 3.125 MG Oral Tablet [Coreg] | BCR/ABL1 (t(9;22)) (eg, chronic myelogenous leukemia) translocation analysis; minor breakpoint, qualitative or quantitative | 18 | 3.00 |
| Fine needle aspiration; with imaging guidance | B-cell lymphoma | 10 | 2.00 |
| Electrocardiogram, routine ECG with at least 12 leads; interpretation and report only | Contusion | 1238 | 1.00 |
| Dexamethasone phosphate 10 MG/ML Injectable Solution | Computed tomography, cervical spine; without contrast material | 22 | 0.00 |
| Arthropathy of knee joint | Hemorrhage in early pregnancy, antepartum | 11 | -1.00 |
| Postmature infancy | Human papilloma virus screening | 29 | -2.00 |
| Infectious agent detection by nucleic acid (DNA or RNA); Neisseria gonorrhoeae, amplified probe technique | Technetium tc-99m sestamibi, diagnostic, per study dose | 17 | -3.00 |
| Level IV - Surgical pathology, gross and microscopic examination Abortion - spontaneous/missed Artery, biopsy Bone marrow, biopsy Bone exostosis Brain/meninges, other than for tumor resection Breast, biopsy, not requiring microscopic evaluation of surgica | Hearing Screening Assessment | 39 | -4.00 |
| Benign prostatic hyperplasia | FEMALE | 75 | -4.98 |
| Lower urinary tract symptoms | FEMALE | 12 | -6.01 |
| Gynecologic examination | MALE | 56 | -7.12 |

(1) Data are the number of patients with both concepts.
(2) Data are calculated as log(observed paired concept count/expected paired concept count) (equation (4)).

The typical caution applies when evaluating association findings: association does not imply causality or a direct association between concepts. For example, Benign prostatic hyperplasia and Gynecologic examination have a log ratio of −4.46. However, the concepts are not directly associated, as Benign prostatic hyperplasia is a condition occurring in men, and Gynecologic examination is a procedure performed on women.
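
The log ratio in Table 9 can be sketched as follows. Equation (4) is not reproduced in this section, so the sketch makes two labeled assumptions: the expected paired count is taken to be the count under independence (count 1 × count 2 / patient population), and the logarithm is taken to be natural. The paired count is approximated from the rounded 5.9% co-occurrence frequency quoted in Use case 1, so the result is illustrative only.

```python
import math


def log_observed_expected(pair_count, count_1, count_2, population):
    """log(observed / expected) paired count.

    Assumption: expected count is count_1 * count_2 / population (independence),
    and the log is natural; the source only states log(observed/expected).
    """
    expected = count_1 * count_2 / population
    return math.log(pair_count / expected)


POPULATION = 1_790_431          # 5-year dataset patient population
hypertension_count = 233_790    # Essential hypertension (Table 5)
hyperlipidemia_count = 145_367  # Hyperlipidemia (Table 5)

# Approximate paired count derived from the rounded 5.9% co-occurrence frequency.
approx_pair_count = round(0.059 * POPULATION)

log_ratio = log_observed_expected(
    approx_pair_count, hypertension_count, hyperlipidemia_count, POPULATION
)
print(f"log ratio (hypertension, hyperlipidemia): {log_ratio:.2f}")
```

A positive value means the pair is observed more often than independence would predict, a value near zero means the pair co-occurs about as often as expected, and a negative value means it co-occurs less often.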

EHR prevalence

Data consumers must understand the nuances of EHR prevalence and co-occurrence statistics when interpreting these data for their purposes. These datasets provide information about concept prevalence as recorded in the CDW. The process by which structured clinical data is entered into the CDW varies by setting. For inpatient data, after the patient is discharged, coders review patient records, including structured and narrative data, and enter diagnosis codes based on the whole record. For outpatient data, physicians, office managers, or coders enter diagnosis codes around the time of the visit. In recent years, especially after the Department of Health and Human Services’ “Meaningful Use” requirements took effect in 2010[26], these processes ensure that most billing diagnosis codes, drug orders, and procedures are recorded in the CDW. Accordingly, COHD provides the 5-year dataset which utilizes data from 2013–2017, when structured data were reliably recorded in the CDW.

Another factor to consider when analyzing structured clinical data is the use of different clinical coding standards across time and practices. For example, the CUIMC system transitioned from using ICD-9-CM to ICD-10-CM in 2015, undergoing a formal process involving extensive training of physicians and coders. Similarly, different practices may use different coding systems, such as ICD-10-PCS, CPT4, or HCPCS to code the same concepts. The OMOP CDM alleviates most of the burden of harmonizing multiple coding systems by mapping codes from these various source vocabularies to a set of OMOP standard concepts, leveraging existing mappings in the National Library of Medicine’s Unified Medical Language System and filling gaps in the mappings as needed. For example, the concept for atrial fibrillation in ICD-9-CM (427.31), ICD-10-CM (I48.91), and SNOMED-CT (49436004) all map to OMOP standard concept ID 313217, rooted on the SNOMED-CT code. Issues occasionally persist due to incongruent mappings. For example, in the CUIMC CDW, the most common ICD-9-CM codes used for patients with type-2 diabetes mellitus were 250 and 250.02, which mapped to OMOP concept ID 201826 (Type 2 diabetes mellitus). After the transition to ICD-10-CM, E11.9 was used predominantly, which maps to the more specific OMOP concept ID 4193704 (Type 2 diabetes mellitus without complication).

It is tempting to interpret EHR prevalence in these datasets as lifetime prevalence and 5-year period prevalence in the general population. Although these data represent concept prevalence as recorded in the CDW, they may not accurately reflect prevalence in the general population. For example, the 5-year EHR prevalence of essential hypertension and hyperlipidemia are 13.1% and 8.1%, respectively. In contrast, the Centers for Disease Control and Prevention (CDC) reported 29.1% prevalence of hypertension in adults (age 18+) in 2011–2012[27] and 33.5% hyperlipidemia in adults (age 20+) in 2005–2008[28].

Several sources of bias contribute to the differences between EHR prevalence and general population prevalence. First, clinical databases contain a base population biased towards people with higher levels of existing conditions, thus biasing the measurements to overestimate the prevalence relative to the general population. Second, the nature of medical services leads to biased detection and recording of concepts. Clinicians routinely perform certain procedures, including monitoring blood pressure, thus conditions such as essential hypertension are easily detected. However, conditions not routinely examined in medical settings are under-detected. For example, dental health is not evaluated or treated in clinical settings in the United States, resulting in Dental caries being underreported in COHD: 0.26% EHR prevalence compared to an estimate that 92% of adults aged 20–64 have had dental caries[29]. Third, clinical codes may be absent in the EHR for a variety of reasons. Health providers may not log diagnosis codes for observed conditions when those conditions do not require treatment, diagnostic testing, or specialty referral. Similarly, patients may not seek medical treatment for their conditions or they may not mention symptoms and conditions if those symptoms seem inconsequential or unrelated to the primary concern of their visit[12,14]. Fourth, our analysis only captures information from structured EHR data and does not take advantage of the available unstructured data, e.g., progress notes, discharge summaries, radiology reports, etc. Unstructured EHR data contain a wealth of clinically significant details complementary to structured EHR data, including symptoms, diagnostic reasoning, and treatment plans. The performance of clinical applications can benefit by combining structured and unstructured data[18,30,31].

Despite these challenges, COHD is still meaningful as a resource of clinical statistics. Although EHR prevalence may not accurately estimate general population prevalence, it may more accurately reflect healthcare consumption rates. Since the general population prevalence includes individuals who do not seek treatment for their conditions, it overestimates the number of people consuming healthcare services. In this regard, EHR prevalence may be a more useful statistic to healthcare administrators who must optimally allocate hospital resources to various services to efficiently meet demands and insurance providers estimating the costs of covering health services.

COHD may also have a strong impact as a resource for discovering and quantifying associations between clinical concepts, as demonstrated in Use Case 2. This analysis is akin to early knowledge extraction research mining journal publications or clinical narratives to detect relationships from co-occurring biomedical concepts in text[32-34]. These data have several potential real world use cases. Associations between drugs and conditions can be useful towards pharmacovigilance efforts, detecting adverse side effects from drugs when unknown positive associations occur. Conversely, when unknown negative associations occur between a concept and a condition, this association may indicate a beneficial effect of the concept on the condition. Such associations may help uncover new applications of existing drugs (drug repurposing), protective effects of genetic conditions, etc. Reasoning algorithms can consume clinical association data to make or support inferences and generate hypotheses for clinical research. The National Center for Advancing Translational Sciences (NCATS) Biomedical Data Translator (https://ncats.nih.gov/translator) is already using COHD to integrate clinical association data into knowledge graphs for reasoning and question answering.

Limitations

There are several limitations to this work in addition to those mentioned above. We assume that each distinct person ID identifies a unique patient; however, patients may have duplicate registrations in the CDW[35].

The co-occurrence analysis does not account for temporal relations between concepts. Without restricting co-occurrences by time, we do not know whether one concept precedes the other, how closely in time they occur, or the frequency of a given concept occurring across time within a patient (i.e., chronic vs. acute). Within the lifetime dataset, there is no limit to the window of association between concepts; a hypothetical patient with a condition occurring in 1985 and a drug exposure in 2015 will still increment the co-occurrence count for the condition-drug pair. In a future study, we will restrict co-occurrence counts by various temporal relations to find associations occurring over different time windows.

The analysis for prevalence and co-occurrence rates did not take into account the hierarchical relationships between concepts, i.e., that higher-level concepts subsume lower-level concepts. For example, the count for Ibuprofen (concept ID 1177480) only includes entries from the CDW where the specific concept ID 1177480 was used, but does not include entries using other descendant concepts, such as Ibuprofen 600 MG Oral Tablet (concept ID 19019073). In a future study, we will leverage the OMOP concept_ancestor table to include descendant concepts in each concept’s count. This approach will also mitigate some of the issues with coding variations across time and practices as different concepts with minor semantic differences can be aggregated into higher-level concepts. Using a related approach, Hripcsak et al. evaluated nine patient phenotype cohorts, comparing gold standard cohorts queried on source ICD-9-CM and ICD-10-CM data against OMOP mapped SNOMED-CT concept sets, and observed a maximum error rate of 0.26%[36].

To protect patient privacy, we removed all concepts with counts ≤ 10 and randomized the statistics for all concepts. These procedures remove rare concepts from the database and perturb the counts, eliminating COHD’s utility for very rare concepts and limiting confidence in statistics for infrequent concepts.

This analysis only includes a single hospital system: CUIMC and NYP. Performing the same analysis across multiple institutions will have several benefits. Aggregating this analysis across multiple sites can diversify the population, improve accuracy, increase power and sensitivity to rare conditions, and validate results by comparing across sites. We invite other institutions to join us in publicly releasing clinical statistics. We hope that this paper and the provided code serve as catalysts to encourage other institutions to share their data to the common benefit of the translational research community. In future studies, we plan to collaborate with other institutions to compare EHR prevalence and co-occurrence frequencies across sites to assess agreement, generalizability, variance, and concept coverage across sites. We also plan to conduct a deeper investigation into the concept associations to validate the association detection and evaluate their utility for reasoning tasks and hypothesis generation.
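
The descendant roll-up described above as future work could look roughly like the sketch below. It is not the authors' implementation: the patient identifiers and the (ancestor, descendant) rows are hypothetical stand-ins for patient-level data and for the OMOP concept_ancestor table, and the sketch assumes that table also lists each concept as its own ancestor. Only the two Ibuprofen concept IDs come from the text.

```python
from collections import defaultdict

# Hypothetical (ancestor_concept_id, descendant_concept_id) rows in the style of
# the OMOP concept_ancestor table, including self-referencing rows (assumption).
CONCEPT_ANCESTOR_ROWS = [
    (1177480, 1177480),    # Ibuprofen is its own ancestor
    (1177480, 19019073),   # Ibuprofen subsumes Ibuprofen 600 MG Oral Tablet
    (19019073, 19019073),
]

# Hypothetical patient-level data: concept ID -> set of patient IDs.
PATIENTS_BY_CONCEPT = {
    1177480: {"p1", "p2"},
    19019073: {"p2", "p3", "p4"},
}


def rolled_up_patient_counts(ancestor_rows, patients_by_concept):
    """Count distinct patients per concept, including all descendant concepts.

    Patient sets are unioned rather than counts summed, so a patient recorded
    under both an ancestor and a descendant is counted only once.
    """
    patients = defaultdict(set)
    for ancestor_id, descendant_id in ancestor_rows:
        patients[ancestor_id] |= patients_by_concept.get(descendant_id, set())
    return {concept_id: len(ids) for concept_id, ids in patients.items()}


counts = rolled_up_patient_counts(CONCEPT_ANCESTOR_ROWS, PATIENTS_BY_CONCEPT)
print(counts)
```

Taking the union of patient sets matters here: simply adding the published per-concept counts would double count patients who have both the general and the specific concept recorded.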

Additional information

How to cite this article: Ta, C. N. et al. Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci. Data. 5:180273 doi: 10.1038/sdata.2018.273 (2018). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
  32 in total

1.  Automated knowledge extraction from MEDLINE citations.

Authors:  E A Mendonça; J J Cimino
Journal:  Proc AMIA Symp       Date:  2000

2.  Cancer statistics, 2018.

Authors:  Rebecca L Siegel; Kimberly D Miller; Ahmedin Jemal
Journal:  CA Cancer J Clin       Date:  2018-01-04       Impact factor: 508.702

3.  Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics.

Authors:  Hui Cao; Marianthi Markatou; Genevieve B Melton; Michael F Chiang; George Hripcsak
Journal:  AMIA Annu Symp Proc       Date:  2005

4.  Prevalence of chronic disease in the elderly based on a national pharmacy claims database.

Authors:  Corina Naughton; Kathleen Bennett; John Feely
Journal:  Age Ageing       Date:  2006-11       Impact factor: 10.668

5.  Hypertension among adults in the United States: National Health and Nutrition Examination Survey, 2011-2012.

Authors:  Tatiana Nwankwo; Sung Sug Yoon; Vicki Burt; Quiping Gu
Journal:  NCHS Data Brief       Date:  2013-10

6.  Matching identifiers in electronic health records: implications for duplicate records and patient safety.

Authors:  Allison B McCoy; Adam Wright; Michael G Kahn; Jason S Shapiro; Elmer Victor Bernstam; Dean F Sittig
Journal:  BMJ Qual Saf       Date:  2013-01-29       Impact factor: 7.035

Review 7.  Anticholinergics in the treatment of children and adults with acute asthma: a systematic review with meta-analysis.

Authors:  G J Rodrigo; J A Castro-Rodriguez
Journal:  Thorax       Date:  2005-07-29       Impact factor: 9.139

8.  Early detection of heart failure with varying prediction windows by structured and unstructured data in electronic health records.

Authors:  Yajuan Wang; Kenney Ng; Roy J Byrd; Jianying Hu; Shahram Ebadollahi; Zahra Daar; Christopher deFilippi; Steven R Steinhubl; Walter F Stewart
Journal:  Conf Proc IEEE Eng Med Biol Soc       Date:  2015

9.  Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers.

Authors:  George Hripcsak; Jon D Duke; Nigam H Shah; Christian G Reich; Vojtech Huser; Martijn J Schuemie; Marc A Suchard; Rae Woong Park; Ian Chi Kei Wong; Peter R Rijnbeek; Johan van der Lei; Nicole Pratt; G Niklas Norén; Yu-Chuan Li; Paul E Stang; David Madigan; Patrick B Ryan
Journal:  Stud Health Technol Inform       Date:  2015

10.  Effect of vocabulary mapping for conditions on phenotype cohorts.

Authors:  George Hripcsak; Matthew E Levine; Ning Shang; Patrick B Ryan
Journal:  J Am Med Inform Assoc       Date:  2018-12-01       Impact factor: 4.497


10.  Machine learning liver-injuring drug interactions with non-steroidal anti-inflammatory drugs (NSAIDs) from a retrospective electronic health record (EHR) cohort.

Authors:  Arghya Datta; Noah R Flynn; Dustyn A Barnette; Keith F Woeltje; Grover P Miller; S Joshua Swamidass
Journal:  PLoS Comput Biol       Date:  2021-07-06       Impact factor: 4.475

