Honghan Wu, Giulia Toti, Katherine I Morley, Zina M Ibrahim, Amos Folarin, Richard Jackson, Ismail Kartoglu, Asha Agrawal, Clive Stringer, Darren Gale, Genevieve Gorrell, Angus Roberts, Matthew Broadbent, Robert Stewart, Richard J B Dobson.
Abstract
Objective: Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs.
Year: 2018 PMID: 29361077 PMCID: PMC6019046 DOI: 10.1093/jamia/ocx160
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. (A) SemEHR data model: entities (patient, clinical note, concept, and concept mentions) and their associations. (B) SemEHR generates 2 longitudinal views for each patient: concept mentions grouped in typed and dated documents (upper part), and concept mentions grouped in structured (discharge) summaries (lower part).
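The data model and the first longitudinal view in Figure 1 can be pictured with a few plain record types. The sketch below is illustrative only: the class names, field names, and placeholder concept identifiers (`CUI_A`, `CUI_B`) are assumptions made for this example, not SemEHR's actual schema, and the sample notes are invented. A patient is represented simply by an identifier.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClinicalNote:
    note_id: str
    patient_id: str
    doc_type: str       # e.g. "discharge summary"
    written_on: date


@dataclass(frozen=True)
class ConceptMention:
    note_id: str
    cui: str            # concept identifier (placeholder values below)
    text: str           # the phrase as it appears in the note


def longitudinal_view(notes, mentions):
    """Group concept mentions by dated, typed document, oldest first."""
    by_note = defaultdict(list)
    for mention in mentions:
        by_note[mention.note_id].append(mention.cui)
    ordered = sorted(notes, key=lambda note: note.written_on)
    return [(n.written_on, n.doc_type, by_note[n.note_id]) for n in ordered]


# Invented sample data for one hypothetical patient.
notes = [
    ClinicalNote("n2", "p1", "discharge summary", date(2015, 6, 1)),
    ClinicalNote("n1", "p1", "assessment", date(2015, 3, 9)),
]
mentions = [
    ConceptMention("n1", "CUI_A", "liver damage"),
    ConceptMention("n2", "CUI_B", "hepatitis"),
    ConceptMention("n2", "CUI_A", "liver damage"),
]
view = longitudinal_view(notes, mentions)
```

Each element of `view` is a `(date, document type, concept list)` triple, which is one simple way to lay out the "typed and dated documents" view described in the caption.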
Figure 2. The architecture of SemEHR is composed of 3 subsystems: (1) the producing subsystem (upper part of the figure), creation of SemEHR semantic index by harmonizing, natural language processing, and indexing EHR data; (2) the continuous learning subsystem, addressing study-specific requirements and supporting fine-tuning for separate studies; and (3) the consuming subsystem (lower part), supporting tailored care, patient recruitment, and clinical research by semantic searching and study-based continuous learning.
Figure 3. Screenshots of key functionalities provided by the consuming subsystem. (A) Identifying query concepts (UMLS CUIs): facilities to ensure the correct and complete concepts are used in the query to derive accurate clinical findings. (A1) Concept search for matching a user search term to one or more ontology (UMLS) concepts; logical reasoning is implemented to enable the automated inclusion of semantically related concepts (eg, hepatocellular damage is liver damage). (A2) Concept validation component for checking and approving the automated inferred concepts based on the aim and criteria of the clinical study (eg, only retain alcohol-related liver conditions for addiction analytics). (B) Selecting and summarizing cohort (the full text in the screenshot has been deliberately rewritten to avoid leaking sensitive patient data). A summary table is generated for a user query where each row summarizes the numbers of total mentions and contextualized mentions for one patient. (C) Patient timeline: longitudinal document view (upper), structured medical profile view (based on FHIR discharge summary format), and the view of latest vital signs and other measurements.
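The concept search (A1) and validation (A2) steps in Figure 3 can be sketched as an expansion over "is-a" links followed by a study-specific filter. This is a minimal illustration, not SemEHR's reasoning implementation: the `IS_A` edges are a hand-written toy hierarchy using plain labels instead of real UMLS CUIs, and `expand` and `validate` are hypothetical helper names.

```python
from collections import deque

# Toy "is-a" hierarchy (child -> parents); invented for illustration.
IS_A = {
    "hepatocellular damage": ["liver damage"],
    "alcoholic liver damage": ["liver damage"],
    "liver damage": ["liver disorder"],
}


def expand(concept, is_a):
    """Return the concept plus everything that is (transitively) a kind of it."""
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen = {concept}
    queue = deque([concept])
    while queue:  # breadth-first walk down the hierarchy
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def validate(candidates, keep):
    """Retain only the inferred concepts that the study criteria approve."""
    return {c for c in candidates if keep(c)}


expanded = expand("liver damage", IS_A)
approved = validate(expanded, keep=lambda label: "alcohol" in label)
```

Here `expanded` contains the query concept and its two narrower concepts, while `approved` keeps only the alcohol-related one, mirroring the addiction-analytics example in the caption. In practice the approval step is a manual review rather than a string test.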
Given a disease (identified by one or more UMLS concepts, ie, search concepts), SemEHR can generate a summary table for a cohort of patients, which, for each patient, gives the number of positive mentions of the search concepts within all of his/her EHR documents. Using this number as the only feature, we classify whether a patient suffers from a disease or not.
| Precision | Recall | F-measure | Class (200) | Precision | Recall | F-measure | Class (1000) |
|---|---|---|---|---|---|---|---|
| 0.857 | 0.522 | 0.649 | Hepatitis C positive (23) | 0.985 | 0.855 | 0.915 | HIV positive (76) |
| 0.941 | 0.989 | 0.964 | Hepatitis C unknown (177) | 0.988 | 0.999 | 0.994 | HIV unknown (924) |
| 0.931 | 0.935 | 0.928 | Weighted avg. | 0.988 | 0.988 | 0.988 | Weighted avg. |
(a) Two hundred CRIS patients evaluated for hepatitis C; classification model: naive Bayes; test method: 10-fold cross-validation; search concepts: C0019196, C2148557, C0220847. This shows the results of a 200-patient cohort for hepatitis C infection.
(b) One thousand CRIS patients evaluated for HIV; classification model: decision table; test method: 10-fold cross-validation; search concepts: C0019699, C0920550. This shows the results of a 1000-patient cohort for HIV.
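The summary figures for the HIV cohort can be recomputed from the per-class values. The sketch below assumes that the third value in each row is the F-measure (the harmonic mean of precision and recall) and that the weighted average weights each class by its size; the source does not state either definition, but both are consistent with the printed numbers. The function names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of per-class values, weighted by class size."""
    return sum(v * n for v, n in zip(values, supports)) / sum(supports)


# (class label, class size, precision, recall) taken from the HIV columns.
hiv_rows = [
    ("HIV positive", 76, 0.985, 0.855),
    ("HIV unknown", 924, 0.988, 0.999),
]

supports = [size for _, size, _, _ in hiv_rows]
avg_precision = weighted_average([p for _, _, p, _ in hiv_rows], supports)
avg_recall = weighted_average([r for _, _, _, r in hiv_rows], supports)
f_positive = f_measure(0.985, 0.855)

print(round(avg_precision, 3), round(avg_recall, 3), round(f_positive, 3))
```

Rounded to 3 decimals, these reproduce the weighted-average precision and recall (0.988 each) and the third value reported for the HIV positive class (0.915). Because the inputs are already rounded, recomputing other cells this way can differ from the printed value in the last digit.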
The performance of SemEHR laboratory measurement extraction on MIMIC-III data: 11 measurements are studied (first column); 100 patients were randomly selected for this study
| Laboratory measurements (UMLS label) | MIMIC-III label | # Correct (structured data comparison) | # Incorrect (structured data comparison) | # Actually correct (manually verified) | # Total extracted measurements | Accuracy (structured data comparison) (%) | Accuracy (manually verified) (%) |
|---|---|---|---|---|---|---|---|
| Hematocrit | Hematocrit | 38 | 5 | 4 | 43 | 88.37 | 97.67 |
| Platelets | Platelet count | 1 | 1 | 1 | 2 | 50.00 | 100.00 |
| Sodium | Sodium | 15 | 0 | 0 | 15 | 100.00 | 100.00 |
| Mean corpuscular hemoglobin concentration | Mean corpuscular hemoglobin concentration | 35 | 1 | 0 | 36 | 97.22 | 97.22 |
| Alanine aminotransferase | Alanine aminotransferase | 19 | 3 | 2 | 22 | 86.36 | 95.45 |
| Red blood cell distribution width | Red blood cell distribution width | 35 | 1 | 0 | 36 | 97.22 | 97.22 |
| Serum aspartate aminotransferase | Aspartate aminotransferase | 20 | 2 | 1 | 22 | 90.91 | 95.45 |
| Chloride | Chloride | 15 | 0 | 0 | 15 | 100.00 | 100.00 |
| Blood urea | Urea nitrogen | 3 | 0 | 0 | 3 | 100.00 | 100.00 |
| Leukocytes | White blood cells | 34 | 5 | 4 | 39 | 87.18 | 97.44 |
| Glucose | Glucose | 18 | 3 | 0 | 21 | 85.71 | 85.71 |
| Average accuracy | | | | | | 89.36 | 96.93 |
The extracted results were assessed in 2 steps: (1) comparing them with the structured data (querying the lab events table in MIMIC-III; accuracy reported in the 7th column), and (2) manually checking the items left unmatched in the first step (accuracy reported in the last column).
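The two accuracy columns and their averages can be recomputed from the counts in the table. This sketch assumes that the structured-data accuracy is correct / total, that the manually verified accuracy is (correct + actually correct) / total, and that the average row is the unweighted mean over the 11 measurements; these definitions are inferred from the numbers rather than stated in the source.

```python
# (measurement, # correct, # actually correct on manual check, # total extracted)
rows = [
    ("Hematocrit", 38, 4, 43),
    ("Platelets", 1, 1, 2),
    ("Sodium", 15, 0, 15),
    ("Mean corpuscular hemoglobin concentration", 35, 0, 36),
    ("Alanine aminotransferase", 19, 2, 22),
    ("Red blood cell distribution width", 35, 0, 36),
    ("Serum aspartate aminotransferase", 20, 1, 22),
    ("Chloride", 15, 0, 15),
    ("Blood urea", 3, 0, 3),
    ("Leukocytes", 34, 4, 39),
    ("Glucose", 18, 0, 21),
]


def accuracies(correct, actually_correct, total):
    """Return (structured-comparison %, manually-verified %) for one row."""
    structured = 100 * correct / total
    verified = 100 * (correct + actually_correct) / total
    return structured, verified


per_row = [accuracies(c, a, t) for _, c, a, t in rows]
avg_structured = sum(s for s, _ in per_row) / len(per_row)
avg_verified = sum(v for _, v in per_row) / len(per_row)

print(round(avg_structured, 2), round(avg_verified, 2))
```

Under these assumptions the script reproduces the reported averages of 89.36 and 96.93, and the first row gives 88.37 and 97.67 for hematocrit.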
The number of semantic entities extracted from 5 sections of the SemEHR medical profiles of the 100 randomly selected MIMIC-III patients; the content of these sections is usually not recorded in structured EHRs
| Admission medications | | Family history | | Social history | | History of past illness | | Hospital discharge instructions | |
|---|---|---|---|---|---|---|---|---|---|
| # Total annotations | | | | | | | | | |
| 1475 | | 156 | | 445 | | 1575 | | 1162 | |
| Top 5 semantic types by frequency | | | | | | | | | |
| Temporal concept | 442 | Finding | 42 | Finding | 132 | Disease or syndrome | 337 | Clinical attribute | 359 |
| Pharmacologic substance | 393 | Disease or syndrome | 33 | Temporal concept | 86 | Finding | 189 | Temporal concept | 158 |
| Finding | 194 | Neoplastic process | 28 | Pharmacologic substance | 58 | Temporal concept | 182 | Health care–related organization | 133 |
| Clinical drug | 121 | Pharmacologic substance | 14 | Clinical attribute | 30 | Therapeutic or preventive procedure | 180 | Health care activity | 126 |
| Health care–related organization | 51 | Clinical attribute | 9 | Individual behavior | 28 | Body part, organ, or organ component | 96 | Finding | 79 |