Steven R Chamberlin1, Steven D Bedrick1,2, Aaron M Cohen1, Yanshan Wang3, Andrew Wen3, Sijia Liu3, Hongfang Liu3, William R Hersh1.
Abstract
OBJECTIVE: Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well understood. The objective of this research was to assess patient-level information retrieval methods using electronic health records for different types of cohort definition retrieval.
Keywords: electronic health record; information retrieval; patient cohort discovery; structured queries
Year: 2020; PMID: 33215074; PMCID: PMC7660955; DOI: 10.1093/jamiaopen/ooaa026
Source DB: PubMed; Journal: JAMIA Open; ISSN: 2574-2531
A sample of the 56 topics with number, source, summary, and pool size, as described in the text
| Topic | Source | Summary | Pool size | Definitely relevant | % | Possibly relevant | % | Not relevant | % |
|---|---|---|---|---|---|---|---|---|---|
| 2 | OHSU | Adults with IBD who have not had GI surgery | 684 | 63 | 9.2 | 4 | 0.6 | 617 | 90.2 |
| 7 | OHSU | Hereditary hemorrhagic telangiectasia | 695 | 15 | 2.2 | 0 | 0.0 | 680 | 97.8 |
| 9 | OHSU | Children with focal epilepsy with partial seizures | 687 | 31 | 4.5 | 13 | 1.9 | 643 | 93.6 |
| 17 | OHSU | RA on MTX w/o biologic DMARD | 704 | 20 | 2.8 | 0 | 0.0 | 684 | 97.2 |
| 32 | PheKB | ACE inhibitor-induced cough | 700 | 40 | 5.7 | 0 | 0.0 | 660 | 94.3 |
| 33 | PheKB | Children with ADHD on CNS stimulant | 732 | 112 | 15.3 | 0 | 0.0 | 620 | 84.7 |
| 42 | NQF | Elderly patients with dementia on antipsychotic medication | 731 | 24 | 3.3 | 0 | 0.0 | 707 | 96.7 |
| 44 | NQF | COPD with potentially avoidable complication | 680 | 38 | 5.6 | 0 | 0.0 | 642 | 94.4 |
| 48 | REP | Stroke after first MI | 698 | 5 | 0.7 | 0 | 0.0 | 693 | 99.3 |
| 52 | REP | Cataract surgery and prior SSRI use | 737 | 23 | 3.1 | 13 | 1.8 | 701 | 95.1 |
Also shown are the number and percentage of patients judged definitely relevant, possibly relevant, and not relevant in the initial relevance assessment process. ACE, angiotensin-converting enzyme; ADHD, attention deficit hyperactivity disorder; CNS, central nervous system; COPD, chronic obstructive pulmonary disease; DMARD, disease-modifying anti-rheumatic drug; GI, gastrointestinal; IBD, inflammatory bowel disease; MI, myocardial infarction; MTX, methotrexate; NQF, National Quality Forum; OHSU, Oregon Health & Science University; PheKB, Phenotype KnowledgeBase; RA, rheumatoid arthritis; REP, Rochester Epidemiology Project; SSRI, selective serotonin reuptake inhibitor.
Figure 1. B-Pref distributions for topics within each run. Box ends represent the upper and lower quartile values, and whiskers extend 1.5 times the interquartile range. Data points beyond the end of the whiskers are values for individual topics outside the whiskers. The parameter settings are ordered hierarchically: first by topic representation (A–C), then text subset (all, notes), then aggregation method (max, sum), and finally retrieval model (BM25, divergence from randomness [DFR], LMDirichlet, and term frequency-inverse document frequency [TFIDF]).
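The four parameter dimensions named in the Figure 1 caption multiply out to 3 × 2 × 2 × 4 = 48 runs, which agrees with the "48 different parameters" mentioned in later captions. A minimal sketch of that enumeration follows; the variable names are illustrative and are not taken from the study's code.

```python
from itertools import product

# Labels come from the Figure 1 caption; variable names are hypothetical.
topic_representations = ["A", "B", "C"]
text_subsets = ["all", "notes"]
aggregation_methods = ["max", "sum"]
retrieval_models = ["BM25", "DFR", "LMDirichlet", "TFIDF"]

# product() varies the last factor fastest, so runs are grouped first by
# topic representation, then text subset, then aggregation, then model,
# mirroring the hierarchical ordering described in the caption.
runs = list(product(topic_representations, text_subsets,
                    aggregation_methods, retrieval_models))

print(len(runs))  # 48
print(runs[0])    # ('A', 'all', 'max', 'BM25')
```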
Figure 2. B-Pref distributions for parameter combinations within each topic. Box ends represent the upper and lower quartile values, and whiskers extend 1.5 times the interquartile range. Data points beyond the end of the whiskers are values for parameter combinations outside the whiskers. Boxplots are ordered by median B-Pref values.
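The captions do not define B-Pref. As a minimal sketch, assuming one common formulation (each retrieved relevant item is penalized by the number of judged non-relevant items ranked above it, and unjudged items are ignored), the measure could be computed as below. The function name, the patient ids, and the choice of variant are assumptions; the source does not say which variant was used.

```python
def bpref(ranking, relevant, nonrelevant):
    """B-Pref for one topic (assumed variant, not confirmed by the source).

    ranking: list of patient ids, best first.
    relevant / nonrelevant: sets of judged ids; anything else is unjudged.
    """
    R = len(relevant)
    N = len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    nonrel_seen = 0  # judged non-relevant items seen so far in the ranking
    total = 0.0
    for pid in ranking:
        if pid in nonrelevant:
            nonrel_seen += 1
        elif pid in relevant:
            if denom == 0:
                total += 1.0  # no judged non-relevant items to be ranked below
            else:
                total += 1.0 - min(nonrel_seen, R) / denom
        # unjudged ids contribute nothing and incur no penalty
    return total / R


# Hypothetical ranking: "u1" is unjudged and therefore ignored.
example = bpref(["p1", "n1", "p2", "u1", "n2", "p3"],
                relevant={"p1", "p2", "p3"},
                nonrelevant={"n1", "n2"})
print(example)  # 0.5
```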
Structured Boolean query for topic 7: adults 18–100 years old who have a diagnosis of hereditary hemorrhagic telangiectasia, which is also called Osler-Weber-Rendu syndrome
| (demographics.BIRTH_DATE: Range[1913-01-01, 1995-12-31]) |
| AND |
| ( |
| encounter_diagnoses.DX_ICD = 448.0 |
| OR |
| hospital_encounters.ADMITTING_DX_ICD_CODE = 448.0 |
| OR |
| hospital_encounters.BILL_DISCHARGE_DX_ICD_CODE = 448.0 |
| OR |
| hospital_encounters.BILL_DX2_ICD_CODE = 448.0 |
| OR |
| hospital_encounters.BILL_DX3_ICD_CODE = 448.0 |
| OR |
| hospital_encounters.BILL_DX4_ICD_CODE = 448.0 |
| OR |
| hospital_encounters.ENCOUNTER_DIAGNOSIS = 448.0 |
| OR |
| problem_list.DX_ICD = 448.0 |
| OR |
| notes.NOTE_TEXT contains “Hereditary hemorrhagic telangiectasia” |
| OR |
| notes.NOTE_TEXT contains “Osler-Weber-Rendu” |
| ) |
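The logic of the topic 7 query is a birth-date range ANDed with a disjunction of ICD-code and note-text matches. A minimal sketch of that logic in Python follows. It assumes a hypothetical patient record represented as a dictionary keyed by the field names in the table; case-insensitive phrase matching is also an assumption, since the table does not say how "contains" treats case.

```python
from datetime import date

# Field names follow the query table; the dict-based record layout is hypothetical.
ICD_FIELDS = [
    "encounter_diagnoses.DX_ICD",
    "hospital_encounters.ADMITTING_DX_ICD_CODE",
    "hospital_encounters.BILL_DISCHARGE_DX_ICD_CODE",
    "hospital_encounters.BILL_DX2_ICD_CODE",
    "hospital_encounters.BILL_DX3_ICD_CODE",
    "hospital_encounters.BILL_DX4_ICD_CODE",
    "hospital_encounters.ENCOUNTER_DIAGNOSIS",
    "problem_list.DX_ICD",
]
PHRASES = ["hereditary hemorrhagic telangiectasia", "osler-weber-rendu"]


def matches_topic_7(patient):
    """Return True if the patient satisfies the structured query for topic 7."""
    birth = patient["demographics.BIRTH_DATE"]
    in_range = date(1913, 1, 1) <= birth <= date(1995, 12, 31)
    # Any diagnosis field carrying code 448.0 satisfies the OR group.
    has_code = any("448.0" in patient.get(field, []) for field in ICD_FIELDS)
    # Assumption: phrase matching ignores case.
    notes = patient.get("notes.NOTE_TEXT", [])
    has_phrase = any(p in note.lower() for note in notes for p in PHRASES)
    return in_range and (has_code or has_phrase)


by_code = {"demographics.BIRTH_DATE": date(1960, 5, 1),
           "problem_list.DX_ICD": ["448.0"]}
by_note = {"demographics.BIRTH_DATE": date(1980, 2, 3),
           "notes.NOTE_TEXT": ["History of Osler-Weber-Rendu syndrome."]}
too_young = {"demographics.BIRTH_DATE": date(2005, 1, 1),
             "problem_list.DX_ICD": ["448.0"]}

print(matches_topic_7(by_code), matches_topic_7(by_note), matches_topic_7(too_young))
```

A match on a note phrase alone is enough to retrieve a patient, so a query of this shape can return patients whose notes mention the condition without a coded diagnosis.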
Figure 3. Recall of relevant patients from word-based query pools by structured queries, ordered by recall for each topic.
Figure 4. Precision for structured queries (red line) and word-based judged pools (blue line), ordered by structured query precision for each topic.
Figure 5. Recall distributions for 10 selected topics based on combined full structured-query relevance judged pools. Red triangles are the values for the structured queries, while the box-and-whisker plots show the distributions for word-based queries across the original 48 parameter combinations.
Ten topics with additional relevance judgments for results from structured Boolean queries
| Topic | Structured query patients retrieved | Word-based query relevant | Structured query added relevant | Structured query relevant and retrieved | Recall for structured query | Precision for structured query | Structured query relevant missed |
|---|---|---|---|---|---|---|---|
| 2 | 750 | 67 | 490 | 438 | 0.89 | 0.58 | 52 |
| 7 | 50 | 15 | 24 | 24 | 1.00 | 0.48 | 0 |
| 9 | 357 | 44 | 190 | 173 | 0.91 | 0.48 | 17 |
| 17 | 110 | 20 | 112 | 109 | 0.97 | 0.99 | 3 |
| 32 | 390 | 40 | 368 | 353 | 0.96 | 0.91 | 15 |
| 33 | 1092 | 112 | 983 | 982 | 1.00 | 0.90 | 1 |
| 42 | 347 | 24 | 347 | 344 | 0.99 | 0.99 | 3 |
| 44 | 378 | 38 | 266 | 264 | 0.99 | 0.70 | 2 |
| 48 | 68 | 5 | 37 | 32 | 0.86 | 0.47 | 5 |
| 52 | 133 | 36 | 157 | 133 | 0.85 | 1.00 | 12 |
The structured queries retrieved additional patients who were judged for relevance, allowing calculation of recall and precision for these queries as well as determination of numbers found by the word-based queries but missed by the structured queries.
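The tabulated recall and precision values are consistent with dividing the "relevant and retrieved" count by the "added relevant" count and by the number of patients retrieved, respectively. This reading is inferred from the numbers and is not stated in the caption. A minimal sketch that reproduces two rows under that assumption (the function name is hypothetical):

```python
def recall_precision(retrieved, relevant_retrieved, relevant_total):
    """Recall and precision rounded to two decimals, as in the table."""
    recall = relevant_retrieved / relevant_total
    precision = relevant_retrieved / retrieved
    return round(recall, 2), round(precision, 2)


# Topic 2 row: 750 retrieved, 438 relevant and retrieved, 490 relevant in total.
topic_2 = recall_precision(750, 438, 490)
# Topic 7 row: 50 retrieved, 24 relevant and retrieved, 24 relevant in total.
topic_7 = recall_precision(50, 24, 24)

print(topic_2)  # (0.89, 0.58)
print(topic_7)  # (1.0, 0.48)
```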
Figure 6. Precision distributions for 10 selected topics based on combined full structured-query relevance judged pools. Red triangles are the values for the structured queries, while the box-and-whisker plots show the distributions for word-based queries across the original 48 parameter combinations.