Samuel G Finlayson, Paea LePendu, Nigam H Shah.
Abstract
Electronic health records (EHRs) represent a rich and relatively untapped resource for characterizing the true nature of clinical practice and for quantifying the degree of inter-relatedness of medical entities such as drugs, diseases, procedures and devices. We provide a unique set of co-occurrence matrices, quantifying the pairwise mentions of 3 million terms mapped onto 1 million clinical concepts, calculated from the raw text of 20 million clinical notes spanning 19 years of data. Co-frequencies were computed by means of a parallelized annotation, hashing, and counting pipeline that was applied over clinical notes from Stanford Hospitals and Clinics. The co-occurrence matrix quantifies the relatedness among medical concepts, which can serve as the basis for many statistical tests and can be used to directly compute Bayesian conditional probabilities, association rules, and a range of test statistics such as relative risks and odds ratios. This dataset can be leveraged to quantitatively assess comorbidity, drug-drug, and drug-disease patterns for a range of clinical, epidemiological, and financial applications.
Year: 2014 PMID: 25977789 PMCID: PMC4322575 DOI: 10.1038/sdata.2014.32
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Workflow architecture. The workflow starts with (a) patient notes, which are grouped into bins based on their nearness in time. Within each patient timeline bin, clinical terms are recognized from the notes and recorded in (b) the clinical concept occurrence matrix, which is scanned to (c) count the frequency and pairwise co-frequency of concepts. These counts can be used to calculate (d) contingency tables and Bayesian probability estimates. For example, the concept X has a frequency of f(X) and is pairwise co-frequent with concept Y exactly f(X,Y) times.
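The binning and counting steps in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' parallelized pipeline: the function names `bin_notes` and `count_frequencies`, the `(patient, day, concepts)` note format, and the rule that bins are fixed-width windows anchored at each patient's first note are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def bin_notes(notes, width_days):
    """Group (patient, day, concepts) notes into per-patient time bins.

    Assumption: bins are fixed-width windows anchored at each patient's
    first note. width_days=None mimics the infinite window, so each
    patient ends up with exactly one bin.
    """
    first_day = {}
    for patient, day, _ in notes:
        first_day[patient] = min(day, first_day.get(patient, day))

    bins = defaultdict(set)
    for patient, day, concepts in notes:
        if width_days is None:
            index = 0
        else:
            index = (day - first_day[patient]) // width_days
        # A concept is recorded at most once per bin.
        bins[(patient, index)].update(concepts)
    return bins


def count_frequencies(bins):
    """Return f(X) and f(X,Y): the number of bins containing X, and X with Y."""
    freq = Counter()
    cofreq = Counter()
    for concepts in bins.values():
        freq.update(concepts)
        # Sorting gives each unordered pair one canonical key.
        cofreq.update(combinations(sorted(concepts), 2))
    return freq, cofreq


# Toy data, invented for illustration only.
notes = [
    ("p1", 0, {"diabetes", "metformin"}),
    ("p1", 3, {"metformin", "nausea"}),
    ("p1", 40, {"diabetes"}),
    ("p2", 10, {"diabetes", "nausea"}),
]
freq, cofreq = count_frequencies(bin_notes(notes, width_days=7))
print(freq["diabetes"], cofreq[("diabetes", "nausea")])  # prints: 3 2
```

Because each bin stores a set of concepts, repeated mentions inside one bin count once, so f(X) is a count of bins rather than of raw mentions. Whether the published counts follow that convention should be checked against the dataset documentation.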
Summary statistics for each bin width.
| | | | | | | | ∞-day |
|---|---|---|---|---|---|---|---|
| Number of bins | 7,334,261 | 5,571,972 | 3,969,069 | 2,716,892 | 2,014,460 | 1,417,462 | 261,397 |
| | 1.99 | 2.62 | 3.67 | 5.36 | 7.24 | 10.28 | 55.76 |
| | 169.48 | 200.70 | 246.12 | 305.91 | 363.36 | 447.20 | 1,417.36 |
| | 41.60 | 48.84 | 59.12 | 72.42 | 85.28 | 104.38 | 332.37 |

As bin width increases towards infinity, the total number of bins decreases, while the number of terms and concepts contained within each bin increases. The number of bins in the ∞-day window (261,397) is equal to the total number of patients.
Figure 2. Mappings among terms and concepts. The figure explains the mappings that can be used to decode the frequency files stored in records 1 and 2. We use a subset of terms related to ‘hydrocephalus’ to demonstrate the mapping of terms (File 1) to concepts and UMLS CUIs. Terms map onto concepts in a many-to-many fashion (File 3). Concepts map onto CUIs in a one-to-one fashion (File 2b) and have an associated string for human readability (File 2a).
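The mapping structure described in Figure 2 can be illustrated with small in-memory tables. The sketch below is hypothetical throughout: the IDs, strings, CUI placeholders, and the `decode_term` helper are invented for the example, and the real files' column layouts are not given in the caption.

```python
from collections import defaultdict

# Hypothetical stand-ins for the mapping files; all values are made up.
term_strings = {1: "hydrocephalus", 2: "nph"}      # File 1: term ID -> term text
term_concept_pairs = [(1, 10), (2, 10), (2, 11)]   # File 3: many-to-many pairs
concept_strings = {                                # File 2a: concept ID -> label
    10: "hydrocephalus",
    11: "normal pressure hydrocephalus",
}
concept_cuis = {10: "CUI_A", 11: "CUI_B"}          # File 2b: one-to-one (placeholders)

# Index the many-to-many pairs so lookups by term ID are direct.
concepts_by_term = defaultdict(list)
for term_id, concept_id in term_concept_pairs:
    concepts_by_term[term_id].append(concept_id)


def decode_term(term_id):
    """Return (concept ID, CUI, readable label) for every concept of a term."""
    return [
        (cid, concept_cuis[cid], concept_strings[cid])
        for cid in concepts_by_term.get(term_id, [])
    ]


print(decode_term(2))
```

Keeping the term-to-concept relation as a list of pairs preserves the many-to-many shape, while plain dictionaries suffice for the one-to-one concept lookups.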
Figure 3. Filling out the 2-by-2 contingency table.
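As a hedged sketch of how such a table might be filled from the published counts, the Python below derives the four cells from f(X), f(Y), f(X,Y), and the total number of bins N for the chosen bin width. The cell layout (a, b, c, d) is the conventional one and is an assumption here, since the figure itself is not reproduced; the class and function names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    a: int  # bins with both X and Y
    b: int  # bins with X but not Y
    c: int  # bins with Y but not X
    d: int  # bins with neither


def contingency(f_x, f_y, f_xy, n_bins):
    """Fill the 2-by-2 table from marginal and joint bin counts."""
    table = Contingency(
        a=f_xy,
        b=f_x - f_xy,
        c=f_y - f_xy,
        d=n_bins - f_x - f_y + f_xy,  # inclusion-exclusion
    )
    if min(table.a, table.b, table.c, table.d) < 0:
        raise ValueError("counts are mutually inconsistent")
    return table


def conditional_probability(t):
    """Estimate P(Y | X) as the share of X bins that also contain Y."""
    return t.a / (t.a + t.b)


def relative_risk(t):
    """Risk of Y in bins with X divided by risk of Y in bins without X."""
    return (t.a / (t.a + t.b)) / (t.c / (t.c + t.d))


def odds_ratio(t):
    """Cross-product ratio of the table."""
    return (t.a * t.d) / (t.b * t.c)


# Invented counts, for illustration only.
t = contingency(f_x=40, f_y=50, f_xy=20, n_bins=1000)
print(conditional_probability(t), relative_risk(t), odds_ratio(t))
# prints: 0.5 16.0 31.0
```

The ratio functions raise `ZeroDivisionError` when a needed cell is zero. For sparse concept pairs, a continuity correction (for example, adding 0.5 to each cell) is a common workaround, though it is not something the source describes.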