
The Data Gap in the EHR for Clinical Research Eligibility Screening.

Alex Butler1, Wei Wei1, Chi Yuan1, Tian Kang1, Yuqi Si2, Chunhua Weng1.   

Abstract

Much effort has been devoted to leveraging EHR data for matching patients to clinical trials. However, EHRs may not contain all important data elements for clinical research eligibility screening. To better design research-friendly EHRs, an important step is to identify data elements frequently used for eligibility screening but not yet available in EHRs. This study fills this knowledge gap. Using the Alzheimer's disease domain as an example, we performed text mining on the eligibility criteria text in ClinicalTrials.gov to identify frequently used eligibility criteria concepts. We compared them to the EHR data elements of a cohort of Alzheimer's disease patients to assess the data gap, using the OMOP Common Data Model to standardize the representations of both criteria concepts and EHR data elements. We identified the most common SNOMED CT concepts used in Alzheimer's disease trials and found that 40% of common eligibility criteria concepts were not even defined in the concept space of the EHR dataset for a cohort of Alzheimer's disease patients, indicating that a significant data gap may impede EHR-based eligibility screening. The results of this study can be useful for designing targeted research data collection forms to help fill the data gap in the EHR.


Year:  2018        PMID: 29888090      PMCID: PMC5961795     

Source DB:  PubMed          Journal:  AMIA Jt Summits Transl Sci Proc


Introduction

Randomized clinical trials (RCTs) are the well-regarded gold standard for generating high-quality medical evidence[1]. The success of RCTs depends on successful enrollment[1,2], which remains the No. 1 barrier to RCTs. According to recent statistics, only 2-4% of adult patients with cancer participate in RCTs, and this number has remained unchanged since 1994[2,3]. Inefficient or unrepresentative participant recruitment can cause study delays, increase costs, weaken the statistical power of analysis, and ultimately lead to failed clinical trials[4]. A major bottleneck in RCT recruitment is eligibility screening[2]. Conventional methods for eligibility screening involve laborious manual review of the syntactic rules and semantic concepts in eligibility criteria and clinical data sources[5,6]. This process is not only time-consuming but also expensive: the cost of eligibility screening is usually not compensated through contracts supporting clinical trials, and the expense can reach $336.48 per participant[2]. Much effort[4,7,8] has been made in the biomedical informatics research community to advance automated identification of eligible patients. In the meantime, Electronic Health Record (EHR) data have been recognized as an important clinical data source and have been adopted in multiple automated identification methods[7-10]. EHR-based automated approaches have been reported to reduce workload by up to 90%[7] and to almost reach the theoretical maximum area under the ROC curve[8]. A concern with EHR-based eligibility screening is that EHRs may not contain all of the data frequently used for eligibility screening, since EHRs are designed for patient care rather than clinical research.
Our previous study of cancer trial eligibility criteria showed that many eligibility criteria used in cancer trials are not present in EHR data, so clinical research coordinators creatively devised a list of “major eligibility criteria” to optimize the efficiency of patient screening[11]. A recent study by Köpcke et al. showed that on average 55% of the data elements required by eligibility criteria are present in the EHR. However, their study has three major limitations: (1) only numeric and structured EHR data elements, such as checkboxes and dropdown menus, were included in the analyses, so EHR narratives were excluded; (2) EHR data from the five participating hospitals were not harmonized using any common data model, resulting in unaccounted overlaps or inconsistencies among available EHR data elements across sites; (3) the whole process was manual: patient characteristics (i.e., clinical entities) were manually identified from free-text eligibility criteria and assigned semantic categories, which were in turn manually mapped to EHR data elements, making the method not scalable. The study presented here shares the goal of Köpcke’s study but contributes a novel, scalable, data-driven approach that leverages public clinical trial summary text and publicly available synthetic clinical data. Next, we describe our methodology, our results, and their implications.

Methods

To overcome the limitations of Köpcke’s study, we extracted common data elements from free-text eligibility criteria[12,13] for Alzheimer’s disease (AD) and represented both EHR data elements and eligibility criteria concepts using the Observational Medical Outcomes Partnership (OMOP)[14] Common Data Model (CDM) supported by the Observational Health Data Sciences and Informatics (OHDSI)[15] consortium. The OMOP CDM has been adopted by active scientific consortiums such as OHDSI[15] and eMERGE[16], and has included about 1.26 billion patients as of October 2017. The OMOP CDM-standardized EHR ensures the semantic interoperability of EHR data from multiple participating sites. The sheer number of patients will allow large sample sizes and likely lead to more generalizable study results. Free-text eligibility criteria were automatically processed using Eligibility Criteria Information Extraction (EliIE)[12], an open-source information extraction system for structuring eligibility criteria according to the OMOP CDM, and the extracted information (e.g., clinical entities) was then stored in a relational database[13]. The fully automated eligibility criteria processing techniques make our method highly scalable and improve the efficiency of large-scale studies. As a first step to illustrate the methodology, we used eligibility criteria from 1,587 clinical trials for AD and a de-identified EHR dataset, Synthesized Public Use File (SynPUF) 1%, to study the data gap. The publicly available SynPUF 1% dataset, which includes de-identified structured EHR data points for over 116,350 patients, served as the clinical data source. We mapped clinical entities in eligibility criteria to The Systematized NOMenclature of MEDicine - Clinical Terms (SNOMED CT)[17] terms (hereafter referred to as “variables”), merged relevant variables, and created a list of unique common variables.
SNOMED CT was chosen as the target clinical terminology in this analysis because it has been preferred as the encoding terminology for clinical concepts by researchers on various other projects.[18] We picked a subjective threshold of “being present in at least 15 trials” to select common variables, visualized the relations among the variables and their parents, and analyzed the prevalence of the variables in an EHR dataset. For the purposes of this analysis, we focused on the 19,570 patients in the SynPUF 1% dataset who had a previous diagnosis of Alzheimer’s Disease (hereafter referred to as “the EHR dataset”) as the clinical data source. OHDSI ATLAS, a web-based open-source software tool for scientific analyses of observational data available at http://www.ohdsi.org/web/atlas, was adopted to identify qualified patient records from the EHR. The details of the workflow are provided below (Figure 2).
Figure 2.

The eight-step workflow of this study.

Automated Eligibility Criteria Extraction from AD Trials and Concept Standardization

Free-text eligibility criteria were downloaded from ClinicalTrials.gov, reformatted using the previously published open-source EliIE[12] system, and stored in a public relational database (https://github.com/Yuqi92/DBMS EC)[13]. All the eligibility criteria of 1,587 AD trials (collected until September 2016) were represented using the OMOP CDM v5.0 model, which allows focusing on four classes of entities. A total of 9,261 unique clinical entities were identified[13] from all of the eligibility criteria. For analysis, corresponding modifiers (e.g., qualifier, measurement) and inclusion/exclusion status were attached to each entity.

Manual Curation of Clinical Entities

A manual review of unique clinical entities was performed by a medical student (AB). Modifications were made to produce a simplified list of clinical entities (e.g., AD was used to refer to Alzheimer’s disease). To identify the relevant entities, all entities were sorted alphabetically, so word-similar entity comparison was possible as has been done algorithmically by Varghese and Dugas[19]. All reasons for modification were captured and can serve as evidence in the future for eligibility criteria terminology guidelines.

UMLS Concept Recognition

The clinical entities in the simplified list were mapped to the Unified Medical Language System (UMLS) Metathesaurus[20], which was chosen because it is the largest thesaurus in the biomedical domain.[21] This mapping was performed via MetaMap[22], a widely adopted NLP system developed by the National Library of Medicine. MetaMap was chosen over other NLP systems because of its widespread adoption, easy learning curve, and batch request functionality, which allowed large blocks of text to be analyzed simultaneously. For clarity, all phrases contained in the original entity list will be referred to as “entities” and all terms found in the UMLS Metathesaurus will be referred to as “concepts.” MetaMap was configured with the following query options: --JSONf 2 (formatted JSON output), -g (allow concept gaps), -z (term processing), -Q 4 (composite phrases), -y (use word sense disambiguation), and -E (indicate citation end; required for the batch scheduler). Figure 3 illustrates this concept recognition process. When multiple phrases in a query contained one or more concepts, the phrase with the highest MetaMap score was retrieved. When multiple phrases containing one or more concepts were returned with identical MetaMap scores, the phrase with the lowest level of clinical specificity was chosen so as not to exclude any concepts. Review of the simplified entity list found numerous multi-term entities, so single-term retrieval was not performed.
Figure 3.

The process of deriving SNOMED CT terms from clinical trial eligibility criteria. 42,131 clinical entities were extracted from the eligibility criteria of 1,587 clinical trials. A simplified list of 4,260 clinical entities was generated following manual review and filtration, and this list was mapped first to 3,294 UMLS concepts, and then to 1,991 SNOMED CT variables, of which 304 variables occur in more than 1% of all trials (i.e., 15 trials).
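The selection rule described above (highest MetaMap score, ties broken in favor of the least clinically specific phrase) can be sketched as follows. This is a minimal illustration, not the authors' code: the `Candidate` class, the positive `score` field, and the numeric `specificity` rank are all hypothetical, since the text does not say how clinical specificity was measured.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    phrase: str        # phrase returned for the query
    score: int         # MetaMap score, assumed here to be a positive magnitude
    specificity: int   # hypothetical rank: lower means less clinically specific

def pick_candidate(candidates):
    """Return the highest-scoring candidate; break ties by lowest specificity."""
    if not candidates:
        return None
    # Sort key: score descending, then specificity ascending.
    return min(candidates, key=lambda c: (-c.score, c.specificity))
```

Encoding both criteria in a single sort key keeps the tie-break explicit and easy to audit during manual review.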

UMLS CUI Manual Review and Revision

A number of data quality issues were identified during concept extraction. A total of 3,610 manual edits to the list of clinical entities were tracked electronically and fell into six main types: typos, plurals, trimming, other formatting reasons, simplification, and multi-term phrases (Table 1). The identified UMLS concepts and associated Concept Unique Identifiers (CUIs) were likewise manually reviewed and corrected by a medical student (AB). The corrections were performed for two primary reasons: (1) simple corrections, which are applied when the CUI of a concept is replaced by a more appropriate CUI, and (2) type corrections, which are applied when the CUI of a concept is replaced by a CUI of a more appropriate type according to UMLS coding.
Table 1.

Manual revision of clinical entities.

Type of Revision | Example | Times
Formatting; typo | delerium -> delirium | 207
Formatting; plural | cancers -> cancer | 253
Formatting; removal of non-informative words | heart rate measurement -> heart rate | 364
Formatting; removal of abbreviations | absolute neutrophil count (ANC) -> absolute neutrophil count | 1,768
Simplification | asthmatic conditions -> asthma | 573
Breaking down long phrases to logically-connected single phrases | basal or squamous cell carcinoma -> basal cell carcinoma or squamous cell carcinoma | 445
Total | | 3,610

Mapping to SNOMED CT

For every UMLS concept, its corresponding term in SNOMED CT was identified. Due to the design of UMLS Metathesaurus as a hub for numerous terminologies, the SNOMED CT variables associated with the UMLS concepts were used when such variables were possible. In the case that no SNOMED CT variable was found, a manual search of the SNOMED CT terminology was conducted to identify the closest available match (Figure 3). Manual modifications were also performed for SNOMED CT types which were inappropriate for use in eligibility screening. For example, “alanine aminotransferase (substance)” was changed to “alanine aminotransferase measurement (procedure).”
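The two-stage lookup described above (a direct UMLS-to-SNOMED CT link where one exists, otherwise a manually curated match) could be sketched as below. The function name and the two dictionaries are hypothetical stand-ins for the Metathesaurus links and the manual search results. The only real pair used is the one quoted in the Results (C0007117 -> 275265005); the other identifiers are placeholders.

```python
def map_to_snomed(cuis, umls_to_snomed, manual_matches):
    """Map each UMLS CUI to a SNOMED CT ID, recording how the match was found."""
    mapped, unmapped = {}, []
    for cui in cuis:
        if cui in umls_to_snomed:
            # Direct link available through the UMLS Metathesaurus.
            mapped[cui] = (umls_to_snomed[cui], "direct")
        elif cui in manual_matches:
            # Closest available match found by manual search.
            mapped[cui] = (manual_matches[cui], "manual")
        else:
            unmapped.append(cui)
    return mapped, unmapped

direct_links = {"C0007117": "275265005"}            # pair quoted in the Results
manual_links = {"CUI_PLACEHOLDER": "ID_PLACEHOLDER"}  # hypothetical manual match
mapped, unmapped = map_to_snomed(
    ["C0007117", "CUI_PLACEHOLDER", "CUI_UNKNOWN"], direct_links, manual_links
)
```

Recording the provenance ("direct" or "manual") alongside each mapping makes it straightforward to count how many variables needed manual intervention.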

Establishing a “Master List”

Trial occurrences were tracked for each clinical entity and carried through to mapped SNOMED CT variables to calculate an overall trial frequency. SNOMED CT variables chosen for the “master list” were found in at least 1% of all trials, meaning they were used as an eligibility criterion in at least 15 trials.
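As a rough sketch of this filtering step, the snippet below carries trial occurrences from entities through to their mapped variables and keeps variables that appear in at least 15 trials. It assumes that trial frequency means the number of distinct trials per variable (a union across all entities mapped to it); the function name, argument names, and toy identifiers are hypothetical.

```python
from collections import defaultdict

def build_master_list(entity_trials, entity_to_variable, min_trials=15):
    """Return {variable: distinct trial count} for variables meeting the threshold."""
    trials_by_variable = defaultdict(set)
    for entity, trials in entity_trials.items():
        variable = entity_to_variable.get(entity)
        if variable is not None:
            # Several entities may map to one variable; count each trial once.
            trials_by_variable[variable].update(trials)
    return {
        variable: len(trials)
        for variable, trials in trials_by_variable.items()
        if len(trials) >= min_trials
    }

# Toy example with a lowered threshold; identifiers are placeholders.
entity_trials = {
    "AD": {"TRIAL_1", "TRIAL_2"},
    "alzheimer disease": {"TRIAL_2", "TRIAL_3"},
    "asthma": {"TRIAL_1"},
}
entity_to_variable = {"AD": "VAR_A", "alzheimer disease": "VAR_A", "asthma": "VAR_B"}
master = build_master_list(entity_trials, entity_to_variable, min_trials=2)
```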

Visualization of Selected SNOMED CT Variables

Since SNOMED CT maintains a hierarchical structure, the parents of all variables present in the “master list” were captured. All of the “master list” variables, their parent variables, and the “is-a” hierarchical relations were stored in JSON files and visualized using a modified d3.js package. The trial frequency for each variable was also obtained and stored within the corresponding JSON file. Of note, every parent of a “master list” variable was considered to have the same trial frequency as its child.
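A JSON structure of this kind might be assembled as in the sketch below. The node/link layout and the function name are assumptions, not the authors' file format. The text does not say how a parent with several “master list” children is handled, so the sketch assumes such a parent takes the largest count among its children, and that a parent already on the list keeps its own count. The concept IDs and trial counts come from Tables 2 and 4.

```python
import json

def build_hierarchy(parents_of, trial_count):
    """Build a node/link graph of master-list variables and their direct parents."""
    nodes = dict(trial_count)
    links = []
    for child, parents in parents_of.items():
        for parent in parents:
            links.append({"source": child, "target": parent, "relation": "is-a"})
            if parent not in trial_count:
                # Assumption: a parent inherits the largest count among its children.
                nodes[parent] = max(nodes.get(parent, 0), trial_count[child])
    return {
        "nodes": [{"id": cid, "trials": n} for cid, n in sorted(nodes.items())],
        "links": links,
    }

# Alzheimer's disease (26929004) and Presenile dementia (12348006) with their parents.
trial_count = {"26929004": 972, "12348006": 599}
parents_of = {"26929004": ["52448006", "279982005"], "12348006": ["52448006"]}
graph = build_hierarchy(parents_of, trial_count)
graph_json = json.dumps(graph, indent=2)
```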

Assessment of SNOMED CT Variable Coverage in the EHR Dataset

The SNOMED CT ID associated with each SNOMED CT variable in the “master list” was queried in ATLAS and the record count (RC) and descendant record count (DRC) were returned. RC indicates the number of times a specific variable is found in the EHR dataset, and DRC indicates the number of times a specific variable and its descendants are found in the dataset. SNOMED CT variables were further classified into five sets, plus a sixth group for variables not relevant to eligibility screening:
(1) categorical variables (e.g., the presence of Parkinson’s Disease) that are available in the EHR;
(2) continuous variables (e.g., age) that are available in the EHR;
(3) variables not found in the EHR that can be derived from an existing EHR variable, such as “chronological age”, which can be derived from “date of birth”;
(4) variables not available in the EHR for which the data could be collected from a patient without medical training, such as questions in the Mini-Mental Status Exam;
(5) variables not available in the EHR for which the information could not be provided by a patient without medical training, such as “General Metabolic Function”;
(6) variables not found in the EHR and not relevant for eligibility screening, such as “Psychiatric”.
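The distinction between RC and DRC suggests a simple first-pass triage before any manual classification: a variable may be recorded exactly (RC > 0), recorded only through more specific descendants (RC = 0, DRC > 0), or absent (both zero). The sketch below illustrates that triage only. The function name and the sample counts are hypothetical placeholders, not values from the study, and the assignment of variables to the sets above still required manual judgment.

```python
def summarize_coverage(counts):
    """Split variables by how they appear in the EHR dataset.

    counts maps a variable name to its (RC, DRC) pair from the concept search.
    """
    summary = {"exact": [], "descendants_only": [], "absent": []}
    for name, (rc, drc) in counts.items():
        if rc > 0:
            summary["exact"].append(name)
        elif drc > 0:
            summary["descendants_only"].append(name)
        else:
            summary["absent"].append(name)
    return summary

# Illustrative placeholder counts, not values from the study.
example = {"variable_a": (120, 450), "variable_b": (0, 37), "variable_c": (0, 0)}
summary = summarize_coverage(example)
```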

Results

The 42,131 entities identified in clinical trial eligibility criteria contained 9,261 unique entities, 1,930 of which corresponded to medication information and were not included in this analysis. Manual review of the remaining 7,331 unique non-medication entities simplified the list to 4,260 entities. To reach this simplified list, 3,610 manual changes were made: 2,591 for formatting reasons (e.g. AD, AD Disease -> Alzheimer’s Disease), 574 for simplification reasons (e.g. asthmatic conditions, adult asthma -> asthma), and 445 for ‘multi-term’ entities (e.g. basal or squamous cell carcinoma -> basal cell carcinoma or squamous cell carcinoma). The 4,260 unique clinical entities were mapped to UMLS concepts via MetaMap, resulting in 4,026 unique MetaMap term sets (e.g. basal cell carcinoma or squamous cell carcinoma is a single ‘term set’ because the phrase was extracted from one eligibility criterion, but basal cell carcinoma and squamous cell carcinoma are each handled as a separate UMLS concept). A total of 111 manual searches were performed, including 66 searches for multi-term clinical entities, one for a typo in the entity, and 44 for inaccurate MetaMap mappings as assessed by the medical student (AB). After sorting, the final UMLS concept list was composed of 3,294 unique concepts. Of note, manual review showed that many of the lab tests used for eligibility assessment were of UMLS type “Amino Acid, Peptide, or Protein”, so all concepts of this type were re-queried, searching only for concepts of type “Laboratory Procedure” or “Laboratory or Test Result”. Direct matching to SNOMED CT using the UMLS Metathesaurus returned 1,991 unique SNOMED CT variables (e.g. basal cell carcinoma [UMLS code C0007117] is directly linked to epithelioma basal cell [SNOMED CT code 275265005] within the databases).
A further 56 variables were added manually by direct query in the SNOMED CT Browser because no direct UMLS to SNOMED CT connection existed. During the manual review, it was also observed that some UMLS concepts with no direct SNOMED CT equivalent could be covered by a SNOMED CT variable returned for another concept, so the trial count and additional information from both concepts were attached to the single SNOMED CT variable. Filtering the entire list for variables identified in at least 15 trials generated a “master list” containing 318 UMLS concepts and 304 SNOMED CT variables (14 concepts had no correlated SNOMED CT variable). The UMLS concepts in the “master list” were found in 1,491 of the 1,512 queried trials, i.e., a trial coverage of 98.6%.

Visualization of The Common Eligibility Criteria SNOMED CT Variables and their hierarchical relations

The highly prevalent eligibility criteria concepts in AD trials are listed in Table 2. Since there exist hierarchical relations among these concepts, an online visualization was also generated for these concepts. Each node in the visualization is a common eligibility criteria concept in AD trials followed by its prevalence. For example, “mental disorder” is a node with prevalence of 82.21% because it is used by 82.21% of AD trials for patient screening. The visualization of “master list” concepts and their super classes can be observed at http://htmlpreview.github.io/?https://github.com/Butler925/Alz viz/blob/master/index git.htm.
Table 2.

The most commonly adopted eligibility criteria variables and their prevalence in AD trials (the last column with column header as “#” indicates the number of parent concepts)

SNOMED-CT Concept Representation for Commonly Adopted Eligibility Variables | SNOMED_ID | Prevalence | Type | Level | Parent_SNOMED ID | #
Clinical finding | 404684003 | 97.09% | finding | 1 | 138875005 | 1
Disease | 64572001 | 94.25% | disorder | 2 | 404684003 | 1
Mental disorder | 74732009 | 82.21% | disorder | 3 | 64572001 | 1
Disorder of brain | 81308009 | 79.50% | disorder | 3 | 64572001 | 1
Organic mental disorder | 111479008 | 74.74% | disorder | 4 | 74732009, 81308009 | 2
Dementia | 52448006 | 74.60% | disorder | 5 | 111479008 | 1
Cerebral degeneration presenting primarily with dementia | 279982005 | 64.62% | disorder | 3 | 64572001 | 1
Clinical history and observation findings | 250171008 | 64.55% | finding | 2 | 404684003 | 1
Alzheimer’s disease | 26929004 | 64.29% | disorder | 6 | 52448006, 279982005 | 2
Staging and scales | 254291000 | 60.65% | staging scale | 1 | 138875005 | 1
Assessment scales | 273249006 | 60.65% | assessment scale | 2 | 254291000 | 1
Procedure | 71388002 | 58.33% | procedure | 1 | 138875005 | 1
Observable entity | 363787002 | 51.19% | observable entity | 1 | 138875005 | 1
Mini-mental state examination | 273617000 | 46.63% | assessment scale | 3 | 273249006 | 1
Qualifier value | 362981000 | 45.24% | qualifier value | 1 | 138875005 | 1
General finding of observation of patient | 118222006 | 41.14% | finding | 3 | 250171008 | 1
Presenile dementia | 12348006 | 39.62% | disorder | 6 | 52448006 | 1
Disorder of cardiovascular system | 49601007 | 39.55% | disorder | 3 | 64572001 | 1
Psychological finding | 116367006 | 38.96% | finding | 3 | 250171008 | 1
Mental state, behavior and/or psychosocial function finding | 384821006 | 38.96% | finding | 4 | 116367006 | 1
Disorder of nervous system | 118940003 | 35.78% | disorder | 3 | 64572001 | 1
Current chronological age | 424144002 | 34.06% | observable entity | 3 | 105727008 | 1
Age AND/OR growth period | 105727008 | 34.06% | observable entity | 2 | 363787002 | 1
Disorder of blood vessel | 27550009 | 33.33% | disorder | 4 | 49601007 | 1
Evaluation procedure | 386053000 | 33.33% | procedure | 2 | 71388002 | 1
Disorder of body system | 362965005 | 32.41% | disorder | 3 | 64572001 | 1
Cerebrovascular disease | 62914000 | 32.28% | disorder | 5 | 27550009 | 1
Magnetic resonance imaging | 113091000 | 31.88% | procedure | 2 | 71388002 | 1
Disorder by body site | 123946008 | 30.16% | disorder | 3 | 64572001 | 1
Procedure by method | 128927009 | 25.79% | procedure | 2 | 71388002 | 1
Mood disorder | 46206005 | 25.73% | disorder | 4 | 74732009 | 1
Substance abuse | 66214007 | 25.66% | disorder | 3 | 64572001 | 1
Descriptor | 272099008 | 24.80% | qualifier value | 2 | 362981000 | 1
Cerebrovascular accident | 230690007 | 24.54% | disorder | 6 | 62914000 | 1
Global assessment of functioning - 1993 Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition adaptation | 284061009 | 23.94% | assessment scale | 3 | 273249006 | 1
Systemic disease | 56019007 | 23.35% | finding | 4 | 118222006 | 1
General body state finding | 82832008 | 22.49% | finding | 4 | 118222006 | 1
Impaired cognition | 386806002 | 21.43% | finding | 5 | 384821006 | 1
System disorder of the nervous system | 230226000 | 21.16% | disorder | 4 | 118940003 | 1
Movement disorder | 60342002 | 21.16% | disorder | 5 | 230226000 | 1
Extrapyramidal disease | 76349003 | 21.16% | disorder | 6 | 60342002 | 1
Disorder of head | 118934005 | 21.03% | disorder | 4 | 123946008 | 1
Depressive disorder | 35489007 | 21.03% | disorder | 5 | 46206005 | 1

SNOMED CT Variable Assessment

Overall, the “master list” contained 21 SNOMED CT semantic types and 13 of the 19 highest-level SNOMED CT variable types. The prevalence of these semantic types in AD trials is shown in Table 3, and the top 20 variables are shown in Table 4. Of note, the majority of the variables in Table 4 are specific, except for the variable “Disease”, which is very vague. Less vague but still non-specific examples are “Systemic disease” and “History of clinical finding in subject”.
Table 3.

The counts of trials containing each SNOMED CT semantic type.

SNOMED-CT Semantic Type | Trial Count | Prevalence in Trials
Disorder | 1425 | 94.25%
Finding | 1072 | 70.90%
Assessment scale | 917 | 60.65%
Staging scale | 917 | 60.65%
Procedure | 882 | 58.33%
Observable entity | 774 | 51.19%
Qualifier value | 684 | 45.24%
Situation | 250 | 16.53%
Physical object | 231 | 15.28%
Attribute | 163 | 10.78%
Linkage concept | 163 | 10.78%
Body structure | 154 | 10.19%
Metadata | 105 | 6.94%
Morphologic abnormality | 125 | 8.27%
Mother | 56 | 3.70%
Substance | 21 | 1.39%
Regime/therapy | 33 | 2.18%
Environment | 19 | 1.26%
Environment/location | 19 | 1.26%
Event | 17 | 1.12%
Organism | 15 | 0.99%
Table 4.

The top 20 common SNOMED CT terms in AD trials and their prevalence in EHR dataset.

SNOMED CT Term | SNOMED-CT ID | Trial Count | Prevalence in Trials | Count of uses in EHR data for AD patients
Alzheimer’s disease | 26929004 | 972 | 64.29% | 30,262
Mini-mental state examination | 273617000 | 705 | 46.63% | 0
Presenile dementia | 12348006 | 599 | 39.62% | 7,089
Disease | 64572001 | 555 | 36.71% | 12,029,900
Current chronological age | 424144002 | 515 | 34.06% | 0
Mental disorder | 74732009 | 499 | 33.00% | 505,870
Magnetic resonance imaging | 113091000 | 482 | 31.88% | 63,171
Cerebrovascular accident | 230690007 | 371 | 24.54% | 4
Global assessment of functioning - 1993 Diagnostic and Statistical Manual of Mental Disorders-ver.4th | 284061009 | 361 | 23.88% | 0
Systemic disease | 56019007 | 353 | 23.35% | 0
Disorder of nervous system | 118940003 | 335 | 22.16% | 780,478
Substance abuse | 66214007 | 279 | 18.45% | 9,466
Parkinson’s disease | 49049000 | 275 | 18.19% | 0
Impaired cognition | 386806002 | 260 | 17.20% | 13,375
Seizure disorder | 128613002 | 240 | 15.87% | 28,586
Hypersensitivity reaction | 421961002 | 218 | 14.42% | 4,686
Schizophrenic disorders | 191526005 | 216 | 14.29% | 40,777
History of clinical finding in subject | 417662000 | 207 | 13.69% | 189,543
Risk identification: childbearing family | 386414004 | 205 | 13.56% | 0
Clinical dementia rating scale | 273367002 | 204 | 13.49% | 0

The Data Gap

Table 5 shows the counts of SNOMED CT variables from the “master list” for each of the five categories. About 60% of the variables from the “master list” were found in, or could be derived from, the EHR dataset. The remaining 40% were not available in the EHR, but data for some of them (59 variables, or 19% of the “master list”) could be provided by patients without a clinician’s assessment. Determining whether patients could answer criteria that have no data in the EHR relied largely on health literacy and access to their medical records. Criteria that are considered symptoms or based on clinical discretion (e.g. amyloid deposition, neurological deficit, psychotic symptom) are unanswerable by patients. Further, specific lab test results (e.g. Cobalamin deficiency, laboratory test abnormal) are also considered unanswerable by patients, as they may not have the health literacy to address these criteria. Criteria considered answerable by patients fall into three categories: (1) discrete diagnosis (e.g. Parkinson’s Disease, Multiple Sclerosis, Carcinoma of Prostate), (2) answerable with an online test (e.g. visual acuity, auditory acuity, memory function), and (3) answerable with structured questions (e.g. Clinical Dementia Rating Scale, Hachinski Ischemia Score, Geriatric Depression Scale). The “master list” with EHR record counts, descendant record counts, and a characterization of how a patient can address each criterion is at https://docs.google.com/spreadsheets/dAR6 xc iEq34YUWuJLzT26J 1kskEGIGmoQCOgrUJiB3w/edit?usp=sharing.
Table 5.

The count of SNOMED CT variables from the “master list” in the five categories.

Category Description | Example | Count | Group Total
In EHR, categorical variables | Presenile Dementia | 132 | 181 (60%)
In EHR, continuous variables | Laboratory Test | 40 |
Not in EHR, can be derived | Chronological Age | 9 |
Not in EHR, answerable by patient | Questions from Mini-Mental Status Exam | 59 | 123 (40%)
Not in EHR, not answerable by patient | General Metabolic Function | 34 |
Not applicable | Psychiatric | 30 |
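The percentages quoted for Table 5 can be checked with a few lines of arithmetic. The sketch below uses the counts from the table; the short category keys are labels introduced here for illustration. Following the table's grouping, the derivable variables are counted with the in-EHR group.

```python
# Counts per category, taken from Table 5.
counts = {
    "in_ehr_categorical": 132,
    "in_ehr_continuous": 40,
    "derivable": 9,
    "patient_answerable": 59,
    "not_patient_answerable": 34,
    "not_applicable": 30,
}

total = sum(counts.values())
available = counts["in_ehr_categorical"] + counts["in_ehr_continuous"] + counts["derivable"]
missing = total - available

# Shares of the master list, as percentages rounded to one decimal place.
pct_available = round(100 * available / total, 1)
pct_missing = round(100 * missing / total, 1)
pct_patient = round(100 * counts["patient_answerable"] / total, 1)
```

With these counts, the available group is 181 of 304 variables and the missing group is 123 of 304, which round to the 60% and 40% reported. The patient-answerable variables alone account for 59 of 304, matching the 19% figure used in the Discussion.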

Discussion

The EHR data gap for eligibility screening

Table 4 shows that multiple variables used frequently for eligibility screening were not present in the EHR, including answers to mini-mental state exam questions, global assessment of functioning, systemic disease, risk identification: childbearing family status, and the clinical dementia rating scale. Rating scales used frequently by researchers are usually not available in EHR datasets but constitute important eligibility criteria concepts for AD trials. Our study showed that 60.65% of AD trials include assessment scales and 1.79% of AD trials include symptom ratings, whose corresponding data are not available in EHRs. Overall, forty percent of the “master list” SNOMED CT variables could not be found in the corresponding structured EHR dataset for patients with AD. The resulting 60% coverage is comparable with the 55% coverage of patients’ characteristics reported by Köpcke et al. Together, the two studies suggest that fully automated EHR-based eligibility screening may still be impossible with the current schema due to the significant data gap, even when both eligibility criteria and EHR data are well represented using a common data model. An improved model may include patient-reported data in areas where criteria are not available in the EHR to allow for comprehensive eligibility criteria coverage.

Patient self-reported data as a new data source

An interesting finding is that 19% of the “master list” SNOMED CT variables did not exist in the EHR but could be answered by patients. This finding suggests that involving patients in the eligibility screening process may help recruit more eligible patients. Success stories include one by Williams et al.[23], who developed and implemented a computer-assisted interview system in an urban rheumatology clinic, and another by Goncalves et al.[24], who showed that patient-facing web forms could capture structured data. However, different opinions also exist. For example, one study by Wuerdeman et al.[25] concluded that patient-reported data are likely not as complete or accurate as the information provided by a provider. Other barriers have also been reported, such as technological fluency, privacy concerns, and lack of technology infrastructure[26,27]. Further, Alzheimer’s Disease affects a patient’s cognition and often presents in the elderly, which could reduce the reliability of patient-reported information, so it is important that patient-facing tools include family members and other stakeholders.

Reusable variables

Since the 304 UMLS concepts from “master list” variables were found in 98.6% of all the Alzheimer’s disease clinical trials, the clinical entities associated with these concepts could be adopted as common data elements (CDEs)[28] and may help reduce the workload of future Alzheimer’s disease clinical trials by avoiding the need to assess all of the 9,261 unique clinical entities. There is currently no established CDE for Alzheimer’s Disease, so the results of this study could serve as an important first step.

Major Eligibility Criteria

A similar effort to determine the most relevant eligibility criteria used an interview-style approach[11]. Paulson & Weng highlighted the importance of identifying major criteria in creating an optimal clinical trials recruitment tool. Giving equal weight to each eligibility criterion does a disservice by requiring excessive resources for a diminishing return in screening power, so focusing on the most frequent or most important criteria, which allow for more robust eligibility screening, provides a strong advantage.

Limitations

This study has multiple limitations. First, only Alzheimer’s Disease clinical trials and SNOMED CT variables were included in this study, which may bias the coverage estimation. If more diseases and all terminologies from the OMOP CDM were included, the assessment of the information gap between EHR and eligibility criteria would be more accurate. Second, we identified a few discrepancies in our SynPUF dataset which may have impacted our results. For example, Parkinson’s Disease as referenced in SNOMED CT returned no record counts in patient records, yet Parkinson’s Disease was found in our dataset when searched outside of SNOMED CT. It is possible that there is a coding issue with our dataset, but the more likely scenario is that Parkinson’s Disease is primarily coded using a different terminology. Future analyses of data source heterogeneity should also be conducted in an attempt to simplify and centralize how these data are referenced. Third, variables such as Cerebrovascular accident require semantic inference and cannot be aligned literally, because EHR data may contain specific incidents of Cerebrovascular accident rather than this generic concept. Our current simple approach for aligning concepts in criteria and EHR data was unable to find such counterparts in the EHR dataset. One implication of this finding is that we need more sophisticated methods for concept matching that are based on semantic alignment between terms, not just on term matching. Alternative NLP systems to MetaMap, including MedLEE and cTAKES among others, have shown improved identification of clinical terms and may be used in the future to improve on the results reported here.[29] Lastly, one of the most significant limitations of this study is the intensive manual review necessary to produce these results and its impact on scalability.
As evidenced by the 3,610 manual changes made to the original term list, in addition to subsequent type modifications and proofreading, there is a high level of heterogeneity in the clinical terminology found in clinical trial eligibility criteria. This heterogeneity increases the workload associated with performing analyses like this one and reduces confidence in the ultimate results. Further, it reduces the scalability of the methods used here. However, tracking these manual changes does provide some insight into how to address this heterogeneity. Two of the three most common causes for manual modification, formatting and multiple terms, could be easily addressed by using standard term sets or CDEs as mentioned previously. Standardized lists of terms to be used in Alzheimer’s Disease eligibility criteria would avoid variation in terms due to formatting discrepancies and would allow for simple handling of multiple-term concepts (e.g. basal cell carcinoma and squamous cell carcinoma could each be identified if both terms existed in a standard list). Manual modifications due to simplification were performed primarily for the simplicity of this analysis, so future studies addressing term heterogeneity should also focus on this reason for modification.

Conclusions

We found that 40% of the most commonly used criteria variables in Alzheimer’s disease trials are not available in the EHR concept space of patients with Alzheimer’s disease. This result suggests that EHR-based eligibility screening may not achieve perfect performance because of this information gap. To overcome this challenge, one possible solution is to ask patients for the missing information during recruitment when EHR data are used to screen for trial-eligible patients.
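As a minimal sketch of how a gap of this kind can be expressed, the fraction of criteria concepts absent from the EHR concept set is a simple set difference. The function name `concept_gap` and the concept lists below are hypothetical toy values, not data or results from this study.

```python
def concept_gap(criteria_concepts, ehr_concepts):
    """Return (missing concepts, fraction of criteria concepts absent from the EHR set)."""
    criteria = set(criteria_concepts)
    missing = criteria - set(ehr_concepts)
    # Guard against an empty criteria list to avoid division by zero.
    fraction = len(missing) / len(criteria) if criteria else 0.0
    return missing, fraction


# Toy inputs (assumptions for illustration only).
toy_criteria = ["Dementia", "Stroke", "Depression", "Seizure"]
toy_ehr = ["Dementia", "Depression", "Seizure", "Hypertension"]

missing, fraction = concept_gap(toy_criteria, toy_ehr)
print(missing, fraction)
```

The missing set is also the natural starting point for a list of questions to ask patients during recruitment.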
