Automated clinical trial eligibility prescreening: increasing the efficiency of patient identification for clinical trials in the emergency department.

Yizhao Ni¹, Stephanie Kennebeck², Judith W Dexheimer³, Constance M McAneney², Huaxiu Tang¹, Todd Lingren¹, Qi Li¹, Haijun Zhai¹, Imre Solti⁴.

Abstract

OBJECTIVES: (1) To develop an automated eligibility screening (ES) approach for clinical trials in an urban tertiary care pediatric emergency department (ED); (2) to assess the effectiveness of natural language processing (NLP), information extraction (IE), and machine learning (ML) techniques on real-world clinical data and trials. DATA AND METHODS: We collected eligibility criteria for 13 randomly selected, disease-specific clinical trials actively enrolling patients between January 1, 2010 and August 31, 2012. In parallel, we retrospectively selected data fields including demographics, laboratory data, and clinical notes from the electronic health record (EHR) to represent profiles of all 202795 patients visiting the ED during the same period. Leveraging NLP, IE, and ML technologies, the automated ES algorithms identified patients whose profiles matched the trial criteria to reduce the pool of candidates for staff screening. The performance was validated on both a physician-generated gold standard of trial-patient matches and a reference standard of historical trial-patient enrollment decisions, where workload, mean average precision (MAP), and recall were assessed.
RESULTS: Compared with the case without automation, the workload with automated ES was reduced by 92% on the gold standard set, with a MAP of 62.9%. The automated ES achieved a 450% increase in trial screening efficiency. The findings on the gold standard set were confirmed by large-scale evaluation on the reference set of trial-patient matches.
DISCUSSION AND CONCLUSION: By exploiting the text of trial criteria and the content of EHRs, we demonstrated that NLP-, IE-, and ML-based automated ES could successfully identify patients for clinical trials.

Keywords: Automated Clinical Trial Eligibility Screening; Information Extraction; Machine Learning; Natural Language Processing

Year: 2014 PMID: 25030032 PMCID: PMC4433376 DOI: 10.1136/amiajnl-2014-002887

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

OBJECTIVE

This study investigates use of state-of-the-art natural language processing (NLP), information extraction (IE), and machine learning (ML) technologies for automated clinical trial eligibility screening (ES). Our specific aims are to (1) develop an automated ES approach for clinical trials enrolling in the emergency department (ED) at an urban tertiary care pediatric hospital and (2) assess the effectiveness of NLP, IE, and ML techniques on real-world clinical data and trials. The overall objective is to develop a high-sensitivity automated ES approach to identify patients who meet eligibility characteristics of a trial to reduce the pool of potential candidates for staff screening. To assist the readers, a complete list of acronyms used in the paper is presented in the online supplementary appendix table A1.

BACKGROUND AND SIGNIFICANCE

Clinical trials are critical to the progress of medical science; however, awareness and access to clinical trials pose significant challenges to patients and physicians alike. Several reports have described the initial benefit of leveraging electronic health record (EHR) information to enhance trial recruitment.1–3 However, in most circumstances, ES is still conducted manually. Manual screening typically requires a lengthy review of patient records, a cumbersome process that creates a significant financial burden for an institution.4 In a busy clinical care center, the task of screening patients for clinical trials without bias is labor-intensive.5,6 For pharmaceutical companies, the clinical trial phase is the most expensive component of drug development, and any improvement in the efficiency of the recruitment process would be highly consequential.7 For these reasons, identifying eligible participants automatically on the basis of EHR information promises great benefits for translational science. In recent years, EHR-based eligibility screening for clinical trials has become a very active area for research and development; as such, several automated/semiautomated systems have been developed.8–19 These ES systems either (1) manually design specific triggers for a clinical trial (eg, age, gender, and diagnosis) to identify eligible patients8,9,17,18 or (2) automatically match patterns between clinical trial description and EHR content to identify eligible patient cohorts.12–14,16,19 However, trial-specific triggers normally lack generalizability to new clinical trials. 
A recent study also demonstrated that alert fatigue affects physicians’ responsiveness, possibly because of the low accuracy of the triggers.20 For automated trial–patient pattern matching, several methods have been proposed to standardize trial criteria.21–27 These methods enable the creation of computable patterns from trial description (and patient EHRs) and effectively advance the development of automated ES systems.12–14,16,19 The annual Text REtrieval Conference (TREC) recently included a medical record track dedicated to ES, where participants attempted to rank patients for a clinical query based on the content of physician notes.28–36 Despite these efforts, many barriers remain.37,38 First, although automated ES systems should ideally be evaluated on real-world data, this goal is hindered by the lack of access to production EHRs.37 Only a handful of studies provided evaluations on real-world trial–patient matching, and most of them focused on one specific clinical trial.12–14,16,19,39 Even the TREC medical record track had to use synthetic clinical queries because of the lack of available real-world trial–patient matches. Second, not all automated ES algorithms proposed in the literature improve performance (eg, the term expansion algorithm proposed in the TREC track reports only worsens the performance).29,31,35 Finally, few studies explicitly report trial screening efficiency with and without automated ES; additional study is required to fill this gap in our knowledge. To address these barriers and evaluation gaps, we customized state-of-the-art NLP, IE, and ML technologies and developed an automated ES approach. 
Utilizing a physician-generated gold standard of trial–patient matches and a reference standard of historical trial–patient enrollment decisions on a diverse set of clinical trials, we will contribute to the body of knowledge of automated ES by (1) evaluating a state-of-the-art automated ES approach on real-world clinical data and trials, (2) further assessing the ES algorithms proposed in the TREC literature, and (3) comparing trial screening efficiency both with and without automated ES.

DATA AND METHODS

We focused on clinical trials for pediatric patients who visited the ED at Cincinnati Children's Hospital Medical Center between January 1, 2010 and August 31, 2012. The study was approved by the institutional review board. In current practice, enrollment decisions in the ED are made on a per patient visit basis. A clinical research coordinator matches current patients with the actively enrolling trials open on the patients’ date of visit based on the information collected during the visit (eg, demographics and diagnosis). Therefore, in this study, we also treated each patient visit (referred to as an ‘encounter’) as the unit of analysis and made an eligibility prediction for each encounter.

Gold standard trial–patient matches

To create a gold standard set of trial–encounter matches for evaluation, we randomly sampled 5 days from the study period and collected all 1475 encounters and 13 disease-specific clinical trials (inclusion/exclusion criteria included one or more diseases) enrolling on those days. Owing to labor limitation, we further narrowed down the population by randomly selecting 600 encounters from the 1475 samples. The resulting 13 trials and the 600 encounters formed the dataset for the gold standard. Two board-certified, pediatric emergency medicine physicians each with more than 10 years’ experience independently reviewed all charts for each encounter and the criteria for each trial enrolling on the encounter date and made an eligibility decision for every trial–encounter pair. Differences between the physicians’ decisions were resolved during adjudication sessions. Inter-annotator agreement between the two physicians was calculated using the F-value to define the agreement in gold standard.40
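The F-value agreement between two annotators can be illustrated with a minimal Python sketch. It assumes the positive-class (eligible) F-value, in which one annotator's decisions are scored against the other's; the exact variant used in the cited reference is not stated here, and the function name is hypothetical.

```python
def f_value_agreement(decisions_a, decisions_b):
    """F-value agreement over the same ordered trial-encounter pairs.

    Each list holds 1 (eligible) or 0 (not eligible). The measure is
    symmetric, so swapping the two annotators gives the same value.
    """
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both annotators must rate the same pairs")
    pairs = list(zip(decisions_a, decisions_b))
    both = sum(1 for a, b in pairs if a and b)
    only_a = sum(1 for a, b in pairs if a and not b)
    only_b = sum(1 for a, b in pairs if b and not a)
    denominator = 2 * both + only_a + only_b
    # With no positive decision from either annotator, agreement is trivially perfect.
    return 2 * both / denominator if denominator else 1.0
```

Because matches are rare (a few percent of pairs), an F-value over the eligible class avoids the inflation that raw percentage agreement would show from the many shared "not eligible" decisions.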

Historical trial–patient enrollment decisions

We collected all 239547 encounters in the ED during the study period. Of these, 36752 encounters between 00:00 and 8:00 and during holidays were excluded because no clinical trial staff were available in those time frames, leaving a population of 202795 encounters. The 13 trials used in the gold standard and the 202795 encounters then formed a reference set for large-scale evaluation, in which a set of historical trial–patient enrollment decisions were used as trial–patient matches. The enrollment decisions include all patients who were approached and whose eligibility was confirmed in person (the patients could opt out of enrollment). The decisions do not constitute a gold standard because some eligible patients might not have been approached if the clinical research coordinators were busy enrolling other patients. However, the historical set includes all patients found eligible by the coordinators, irrespective of whether they later declined enrollment. Consequently, the set forms a useful reference standard for evaluating how well ES algorithms replicate eligibility decisions in a clinical practice setting.

Clinical trial description and patient EHR data

We collected the description of the 13 clinical trials as used by the research coordinators during manual screening, including title, purpose, and inclusion/exclusion criteria. An example clinical trial description is shown in figure 1, and a description of the trials is presented in online supplementary table A2.

Figure 1:

An example clinical trial description (trial 9 in online supplementary table A2).

On the basis of the prestudy communication with the ED physicians, we extracted 15 EHR data fields that were commonly reviewed by clinical research coordinators during ES to represent the patients’ profiles. The data fields were categorized into two groups: (1) structured fields, such as demographics and laboratory data; (2) unstructured text-based fields, such as diagnosis and clinical notes. A description of the data fields is presented in table 1. The structured fields were used to build logical constraint filters (LCFs), while the unstructured fields were used in NLP-based matching components. Not every encounter had all unstructured fields present, and the descriptive statistics of these fields are shown in figure 2.

Table 1:

Structured and unstructured data fields extracted from patients’ electronic health records

Data field	Data field description	Data field class
Age	Patient's age	Demographics (S)
Gender	Patient's gender	Demographics (S)
Language	Patient's spoken language	Demographics (S)
Acuity	Acuity of the patient's chief complaint (from 1 to 5; 1 indicates an urgent complaint and 5 a non-urgent complaint)	Encounter information (S)
Guardian presence	Whether the patient is escorted by his/her legal guardian	Encounter information (S)
Pregnancy, Yes/No	Whether the patient is pregnant	Encounter information (S)
Vital signs	Patient's first vital sign measurements (eg, temperature) in the ED	Encounter information (S)
GCS*	Patient's Glasgow Coma Scale	Encounter information (S)
Chief complaint	Patient's chief complaint documented during the encounter	Encounter information (US)
Diagnosis	Patient's diagnosis documented during the encounter	Encounter information (US)
ED clinical notes	Clinical notes written in the ED during the encounter	Encounter information (US)
Medical history	Patient's medical history documented during the encounter. Includes patient's historical diagnoses (eg, diagnosis name, diagnosis date, and brief comment about the diagnosis) prior to this encounter	History information (US)
Surgical history	Patient's surgical history documented during the encounter. Includes surgery (eg, surgery name and date of surgery) to the patient prior to this encounter	History information (US)
Family history	Relevant medical histories of patient's family members documented during the encounter. Include the problems of the family members provided by the patient	History information (US)
Medication history	Patient's medication history documented during the encounter. Includes all medications used by the patient prior to this encounter	History information (US)

‘S’ in ‘Data field class’ indicates a structured field and ‘US’ an unstructured text-based field.

*A default score of 15 was generated for GCS if the chief complaint was not related to head trauma.

ED, emergency department.

Figure 2:

Numbers of encounters covered by the unstructured data fields and numbers of entries for the fields.

Automated ES approach

We customized and implemented state-of-the-art NLP, IE, and ML algorithms to build the ES framework (figure 3). Given a clinical trial and the encounter candidates, the approach applied LCFs to exclude ineligible encounters based on structured data fields derived from the trial criteria (step 1 in figure 3). The unstructured data fields of the prefiltered encounters were then processed, from which the medical terms were extracted and stored in the encounter pattern vectors (step 2). The same process was applied to the trial criteria to construct the trial pattern vector (step 3); the vector was also extended with informative patterns extracted from EHRs of previously eligible patients to capture hyponyms relevant to the trial criteria (step 4). Finally, IE algorithms matched the trial vector with the encounter vectors and returned a ranked list of potentially eligible encounters (step 5).

Figure 3:

Architecture of the proposed automated eligibility screening approach.

Logical constraint filters

Some patient characteristics (for example, age and gender) have proven beneficial for screening in earlier studies.29–31 Hence, we manually extracted the criteria on these characteristics from the trial description and applied LCFs on the structured data fields (table 1) to exclude ineligible encounters.
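A minimal Python sketch of how such filters might be expressed is given below. The `Encounter` fields, the helper names, and the example age range are hypothetical illustrations, not the criteria or data model used in the study.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Encounter:
    # Hypothetical subset of the structured fields listed in table 1.
    encounter_id: str
    age_years: float
    gender: str
    pregnant: bool = False


Filter = Callable[[Encounter], bool]


def age_between(lowest: float, highest: float) -> Filter:
    """Keep encounters whose age lies in the inclusive range."""
    return lambda e: lowest <= e.age_years <= highest


def not_pregnant() -> Filter:
    """Keep encounters that do not trigger a pregnancy exclusion."""
    return lambda e: not e.pregnant


def apply_lcfs(encounters: List[Encounter], filters: List[Filter]) -> List[Encounter]:
    """Return only the encounters that satisfy every logical constraint."""
    return [e for e in encounters if all(f(e) for f in filters)]
```

For example, a trial restricted to non-pregnant patients aged 12 to 18 years would call `apply_lcfs(encounters, [age_between(12, 18), not_pregnant()])` and pass the survivors to the text-matching steps.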

Text processing and medical term/assertion identification

The text processing used NLP algorithms to extract informative textual patterns from patients’ unstructured data fields. The process first combined the documents based on field types. For example, if two physician notes and two medical history entries were written during an encounter, the clinical narratives were concatenated by field type to generate two documents (one physician note and one medical history). The content was then segmented into sentences using the Stanford sentence parser, and all duplicate sentences (eg, copy-pasted and ‘templated’ narratives) within the same field-type-based document were removed.41 Non-informative tokens, such as stop words, were also removed in this process. All remaining words from the documents were stored as bag-of-words in the encounter pattern vectors.

Medical information hidden within clinical narratives has been increasingly recognized as a critical component in describing a patient's profile.29,30,35 Building on our experience with Mayo Clinic's clinical Text Analysis and Knowledge Extraction System (cTAKES), we adapted it to extract text-derived, term-level medical information from the unstructured data fields.42 cTAKES assigned concept unique identifiers (CUIs) from the Unified Medical Language System (UMLS), the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) codes, and the clinical drug codes of RxNorm to text strings.43–45 Our customized cTAKES implementation is described in our earlier publications.46,47 None of the trial descriptions or the patient data used in the present study were included in the earlier training of our customized cTAKES model. To handle negated expressions, we implemented a negation detector based on the NegEx algorithm.48 For example, the phrase ‘Negative for abdominal pain’ was converted into ‘NEG_C0694551’ in the assertion detection component. Negated text and medical terms were converted in this way before being added to the encounter vectors. For the trial eligibility description, the same text and medical term processing was applied to the inclusion and exclusion criteria to extract term-level patterns for the trial vectors. All terms extracted from the exclusion criteria were converted into negated format.
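The word-level part of this pipeline (duplicate sentence removal, stop word removal, and negation marking) can be sketched in Python as follows. This is a simplified stand-in under stated assumptions: the regular-expression sentence splitter replaces the Stanford parser, the trigger list is a tiny hypothetical subset rather than the NegEx lexicon, negation scope simply runs to the end of the sentence, and concept mapping by cTAKES is not reproduced.

```python
import re

# Hypothetical, deliberately small lists for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "is", "was", "has", "had", "with"}
NEGATION_TRIGGERS = [("negative", "for"), ("denies",), ("denied",), ("no",)]


def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens_with_negation(sentence):
    """Lowercase tokens; words after a trigger get a NEG_ prefix."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    tokens, negated, i = [], False, 0
    while i < len(words):
        trigger = next(
            (t for t in NEGATION_TRIGGERS if tuple(words[i:i + len(t)]) == t), None
        )
        if trigger:
            negated = True          # scope: rest of the sentence (simplification)
            i += len(trigger)
            continue
        if words[i] not in STOP_WORDS:
            tokens.append(("NEG_" if negated else "") + words[i])
        i += 1
    return tokens


def bag_of_words(document):
    """Drop repeated sentences, then collect the remaining tokens."""
    seen, bag = set(), []
    for sentence in split_sentences(document):
        key = sentence.lower()
        if key in seen:
            continue                # copy-pasted or templated sentence
        seen.add(key)
        bag.extend(tokens_with_negation(sentence))
    return bag
```

In the study the negated unit is a concept identifier (as in ‘NEG_C0694551’); the sketch applies the same prefixing idea to plain words only.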

Supervised term expansion

Term expansion is the process of expanding a query with additional terms, mainly hyponyms of query words, to improve the match between the query and its candidates. Some of the top-performing approaches in the TREC medical record track applied this technique to ES by expanding the trial vector with all possible hyponyms from the UMLS hierarchy.29–31,35 For instance, besides using the word ‘cancer’ from the eligibility criteria, the algorithms looked up all words related to ‘cancer’ in the UMLS hierarchy (eg, ‘neuroblastoma’ and ‘glioma’) and added them to the trial vector. This unsupervised expansion was detrimental to screening performance because it introduced many irrelevant terms.29,31,35 To address this problem, we developed a more principled term expansion component using supervised learning techniques: we used the most informative patterns retrieved from the EHRs of previously eligible patients for a trial to find the hyponyms (eg, football) relevant to the trial criteria (eg, sports-related trauma). To our knowledge, this is the first introduction of supervised learning to ES for enhancing trial–patient matching; we refer to it as supervised term expansion (STE). Mathematically, the text/medical terms from the encounter vectors of previously eligible patients were weighted by term frequency–inverse document frequency (TF-IDF) feature selection, and the top K terms (K = 100) were added to the trial vector.49 The population of ‘previously eligible patients’ is described in the Experiments section below.
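The idea behind STE can be illustrated with a short Python sketch. The weighting below (term count among previously eligible encounters multiplied by the log inverse document frequency over a background collection) is an assumed form of TF-IDF; the text does not give the exact formula, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter


def supervised_term_expansion(trial_terms, eligible_vectors, background_vectors, k=100):
    """Append the k highest-weighted terms of previously eligible encounters.

    eligible_vectors: term lists of encounters previously found eligible.
    background_vectors: term lists of a larger encounter collection, assumed
    to contain the eligible encounters, used for document frequencies.
    """
    n_docs = len(background_vectors)
    doc_freq = Counter(t for vec in background_vectors for t in set(vec))
    term_freq = Counter(t for vec in eligible_vectors for t in vec)
    scores = {
        t: tf * math.log(n_docs / doc_freq[t])
        for t, tf in term_freq.items()
        if doc_freq[t] > 0
    }
    # Highest weight first; ties broken alphabetically for reproducibility.
    top_terms = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return list(trial_terms) + [t for t in top_terms if t not in trial_terms]
```

Terms that are frequent among eligible patients but rare overall (such as ‘football’ for a sports trauma trial) receive high weights, whereas terms common to most encounters are pushed down, which is what distinguishes this from adding every hyponym in a hierarchy.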

IE algorithms

The encounter vectors were used to represent patients’ profiles and were stored in a Lucene information retrieval database.50 The same processes, along with STE, were applied to build the pattern vector for a trial. The IE algorithms then matched the patterns of the trial against those of the prefiltered encounters and ranked the candidates by TF-IDF similarity.51 Finally, the ranked list of encounters was generated to facilitate staff screening.
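As a rough illustration of TF-IDF-based ranking, the Python sketch below scores each encounter by the cosine similarity between TF-IDF weighted term vectors. It is a simplified stand-in and not Lucene's scoring function; the smoothed IDF form and all names are assumptions.

```python
import math
from collections import Counter


def rank_encounters(trial_vector, encounter_vectors):
    """Return encounter ids ordered from most to least similar to the trial.

    trial_vector: list of terms for the trial.
    encounter_vectors: dict mapping encounter id to its list of terms.
    """
    n = len(encounter_vectors)
    doc_freq = Counter(t for terms in encounter_vectors.values() for t in set(terms))

    def weights(terms):
        # Smoothed IDF so terms unseen in the collection still get a finite weight.
        counts = Counter(terms)
        return {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query = weights(trial_vector)
    scored = [(cosine(query, weights(terms)), eid)
              for eid, terms in encounter_vectors.items()]
    return [eid for _, eid in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Because negated terms carry a distinct token (for example `NEG_fever`), an encounter documenting the absence of a finding does not match a criterion that requires it.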

Experiments

Evaluation metrics

We adopted three evaluation metrics to measure performance. (1) To assess the overall quality of the ES output, we applied the mean average precision (MAP) commonly calculated in information retrieval:

$$\mathrm{MAP} = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{N_m}\sum_{n} P(n)\,\delta(n)$$

where M is the number of trials, N_m is the number of eligible encounters for trial m, n denotes the n-th encounter in the output, P(n) is the precision at cut-off n, and δ(n) is an indicator function equaling 1 if the n-th encounter is eligible for trial m and 0 otherwise.52 (2) To compare the screening efficiencies of different ES approaches, we used the ‘workload’ metric, defined as the number of encounters that must be reviewed from the output to identify all eligible patients.39 The workload equals the number of predicted eligible encounters (ie, true positive + false positive) when recall = 100%. (3) To assess the recall at different algorithm cut-offs, we thresholded the ES output with 10–100% cut-offs and plotted the recall curve. These evaluations were applied to both the gold standard and the reference standard experiments.
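The three metrics can be made concrete with a minimal Python sketch that operates on a ranked list of 0/1 eligibility labels for one trial. The function names are hypothetical, and rounding the cut-off up to a whole encounter is an assumption.

```python
import math


def average_precision(ranked_labels):
    """Average of the precision values at each rank holding an eligible encounter."""
    n_eligible = sum(ranked_labels)
    if n_eligible == 0:
        return 0.0
    hits, total = 0, 0.0
    for n, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / n      # P(n) counted only where delta(n) = 1
    return total / n_eligible


def mean_average_precision(per_trial_labels):
    """Mean of the per-trial average precision over M trials."""
    return sum(average_precision(l) for l in per_trial_labels) / len(per_trial_labels)


def workload(ranked_labels):
    """Number of encounters reviewed, from the top, to reach 100% recall."""
    positions = [n for n, label in enumerate(ranked_labels, start=1) if label]
    return positions[-1] if positions else 0


def recall_at_cutoff(ranked_labels, fraction):
    """Recall when only the top fraction of the output is reviewed."""
    n_eligible = sum(ranked_labels)
    if n_eligible == 0:
        return 0.0
    top = math.ceil(fraction * len(ranked_labels))
    return sum(ranked_labels[:top]) / n_eligible
```

For a ranked output `[1, 0, 1, 0]`, the average precision is (1/1 + 2/3)/2, the workload is 3 encounters, and reviewing the top quarter of the list recovers half of the eligible encounters.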

Comparison of ES approaches

The baseline approach (denoted by BASELINE) simulated the screening process without automated ES. It was implemented by randomly shuffling the encounter list for a clinical trial. We then compared its performance with three variants of the ES approach that cumulatively integrated the proposed components: (1) LCF: the approach used the LCF component to exclude ineligible encounters and randomly shuffled the prefiltered encounters for a trial; (2) LCF + NLP: the approach specified in figure 3 without the STE component; (3) LCF + NLP + STE: the STE component was also included. To fill in the gap in the TREC literature, we additionally validated the contribution of different pattern sets on the LCF + NLP approach: we tested the four pattern sets (Text, UMLS CUI, SNOMED CT, and RxNorm) individually and in combination and assessed the MAP performance respectively. The best combination of the pattern sets was used in LCF + NLP + STE. In all experiments no manual customizations were made to our ES algorithms (eg, adding additional rules to the negation detector) to over-fit the current datasets. The STE component was always trained on the data that were never part of the test set in each experiment.

Evaluation scenarios

We first performed twofold cross-validation on the gold standard set to evaluate the ES approaches. For each fold we used 300 encounters as candidates and evaluated the ES outputs for each trial against the gold standard eligibility decisions. The eligible patients in the other 300 encounters were regarded as ‘previously eligible patients’ to train the STE component. To assess the performance of STE with different sizes of training samples, we also used 1–100% of the eligible patients from the reference standard set to train the component (figure 4B). To assure the integrity of the evaluation, all patients in the gold standard were removed from the training data, providing 3864 ‘previously eligible patients’. In the case of 1%/2%/5% of the training data, the experiments were repeated 100/50/20 times on each fold to enable the use of all training samples. The results were then averaged over the experiments as the performance of that fold. For the rest of the portions, the experiments were repeated 10 times. For all experiments, the statistical significance of the performance difference was assessed using the paired t test. Because of the number of different tests conducted, we also applied the Bonferroni correction to the p values to account for the increased possibility of type I error.53
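The significance testing can be sketched with the Python standard library as below: a paired t statistic over per-trial scores of two approaches, and a Bonferroni adjustment of a family of p values. Converting the statistic to a p value requires the t distribution and is left out; the number of tests in the family is an input, since the text does not state how many comparisons were corrected for.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences (degrees of freedom = n - 1).

    Returns None when every difference is identical, because the statistic
    is undefined in that case.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)
    if spread == 0:
        return None
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

The correction is conservative: a raw p value of 0.04 among three tests becomes 0.12 and no longer passes a 0.05 threshold.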

Figure 4:

Average workload and mean average precision (MAP) performance of the eligibility screening (ES) approaches on the gold standard set (A) and the performance of LCF + NLP + STE with different sizes of training samples (B). Statistical significance tests (paired t test) of the performance difference between LCF + NLP + STE and the other ES approaches are also presented. LCF, logical constraint filter; NLP, natural language processing; STE, supervised term expansion.

To conduct the evaluation on the reference standard set, we simulated the current practice and assessed the ES approaches on a day-by-day basis; that is, given an open trial and all encounters on day X, we ran the ES algorithms on the encounters for this trial and evaluated the outputs against the historical decisions on day X. The performance was averaged over all open days of the trial as performance for this trial. In this scenario, the patients found eligible for a trial up to day X were used to train the STE component. Hence, on day 1, the STE was not used because no previously eligible patients were available, while, on day 2, all patients found eligible on day 1 were used to train the STE, and so forth.
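The day-by-day protocol amounts to a loop in which the STE training pool grows only with patients from earlier days. The Python sketch below shows that control flow for one trial; `rank_fn` and `metric_fn` are hypothetical callables standing in for the ES algorithm and an evaluation metric.

```python
def evaluate_day_by_day(days, rank_fn, metric_fn):
    """Average a per-day metric for one trial.

    days: chronological list of (encounter_ids, eligible_ids) tuples.
    rank_fn(encounter_ids, previously_eligible) -> ranked encounter ids.
    metric_fn(ranked_labels) -> score for that day.
    """
    previously_eligible = []       # empty on day 1, so STE has nothing to learn from
    daily_scores = []
    for encounter_ids, eligible_ids in days:
        ranked = rank_fn(encounter_ids, previously_eligible)
        labels = [1 if e in eligible_ids else 0 for e in ranked]
        daily_scores.append(metric_fn(labels))
        # Only after scoring day X do its eligible patients join the pool.
        previously_eligible.extend(e for e in encounter_ids if e in eligible_ids)
    return sum(daily_scores) / len(daily_scores)
```

Updating the pool after scoring keeps each day's test encounters out of the data used to expand the trial vector for that day.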

RESULTS

Descriptive statistics of evaluation data

For the gold standard set, the physicians reviewed 3061 trial–encounter pairs and found 75 matches (2.45% average eligibility rate). The numbers of eligible candidates for the trials are presented in online supplementary table A2. The overall inter-annotator agreement was 96.81%, indicating good agreement on the eligibility decision. Among the 202795 encounters, patients in 4177 encounters were found eligible for any of the 13 trials in historical enrollment decisions, providing 4210 trial–encounter matches in the reference standard set (see online supplementary table A2) (average eligibility rate 1.25%).

Gold standard experiments

Figure 4A presents the average workload and MAP performance of the ES approaches over all trials in the twofold cross-validation experiment. Without automated prescreening (BASELINE), a clinical research coordinator would have to screen on average 98 encounters per trial to identify all eligible patients in the gold standard set. With the automated approach LCF + NLP + STE, the workload was reduced dramatically by more than 90% to eight screened encounters per trial. A similar trend was observed when MAP between different approaches was compared, where the improvements of LCF + NLP + STE over BASELINE and LCF were statistically significant. In the cross-validation experiment, LCF + NLP + STE did not significantly outperform LCF + NLP because of insufficient training data (figure 4A). However, we observed consistent improvement in its performance when more training data from the reference standard were used (figure 4B). In the case of 10% training data (387 samples), it outperformed LCF + NLP statistically significantly on both evaluations (p = 1.00E-9 on workload and p = 4.86E-2 on MAP). Figure 5 presents the recall curves at different algorithm cut-offs. LCF + NLP + STE (trained on eligible patients in the alternative fold of the gold standard) achieved 90% recall when thresholding the top 22% of its output as eligible candidates, suggesting that the screening efficiency was improved by about 450% while missing only 10% of eligible patients.
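The headline figures in this paragraph follow from simple ratios, reproduced in the Python sketch below. Reading the "about 450%" efficiency figure as the ratio of all encounters to the 22% that must be reviewed is an assumption about how it was derived.

```python
def workload_reduction(baseline_encounters, automated_encounters):
    """Fraction of screening effort saved relative to the baseline."""
    return 1.0 - automated_encounters / baseline_encounters


def efficiency_gain(fraction_reviewed):
    """How many times fewer encounters are reviewed at a given cut-off."""
    return 1.0 / fraction_reviewed


# 98 encounters per trial without automation versus 8 with LCF + NLP + STE.
print(f"workload reduction: {workload_reduction(98, 8):.1%}")
# Reviewing only the top 22% of the ranked output (at 90% recall).
print(f"efficiency gain: {efficiency_gain(0.22):.2f}x")
```

The first ratio is consistent with the reduction of more than 90% reported above, and the second gives a factor of roughly 4.5.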

Figure 5:

Recall performance of the eligibility screening approaches at different cut-offs of algorithm outputs. LCF, logical constraint filter; NLP, natural language processing; STE, supervised term expansion.

Finally, we assessed the contribution of the four pattern sets in table 2. The LCF + NLP approach with all patterns achieved the best performance (combination 15), followed closely by LCF + NLP using Text, SNOMED CT, and UMLS CUI (combination 14). The improvements of the best pattern combination were statistically significant over the variants using Text, SNOMED CT, and RxNorm individually (combinations 4/3/1), or any combination of Text and SNOMED CT with RxNorm (combination 6/8). It is worth noting that the LCF + NLP approach with UMLS CUI (combination 2), or any combination of Text, SNOMED CT, and UMLS CUI (eg, combination 9/10/13) also achieved high performances, which were close to that of the best combination.

Table 2:

Combination	Text	SNOMED CT	UMLS CUI	RxNorm	MAP	p Value
1	×	×	×	√	0.296	1.61E-4*
2	×	×	√	×	0.559	0.354
3	×	√	×	×	0.502	7.30E-3*
4	√	×	×	×	0.527	4.10E-2*
5	×	×	√	√	0.553	0.322
6	×	√	×	√	0.503	1.48E-2*
7	×	√	√	×	0.554	9.10E-2
8	√	×	×	√	0.527	3.45E-2*
9	√	×	√	×	0.562	0.160
10	√	√	×	×	0.565	0.260
11	×	√	√	√	0.548	6.38E-2
12	√	×	√	√	0.562	0.161
13	√	√	×	√	0.565	0.285
14	√	√	√	×	0.583	8.91E-2
15	√	√	√	√	0.584	N/A

√, pattern set used; × , otherwise.

Bold number indicates the best result.

N/A indicates that the performances between the two ES approaches are identical and no p value is returned.

*The performance difference between the two ES approaches is statistically significant at the 0.05 level.

CUI, concept unique identifier; ES, eligibility screening; LCF, logical constraint filter; MAP, mean average precision; NLP, natural language processing.

Average MAP of the LCF + NLP approach using different combinations of pattern sets; statistical significance tests (paired t test) of the performance difference between the best pattern combination and the others are also presented.

Reference standard experiments

Figure 6 illustrates the evaluation on the reference standard set, where we observed identical trends of performance for the four ES approaches. Again, LCF + NLP + STE achieved the best performance, and its improvements over the other approaches were statistically significant.

Figure 6:

Average workload and mean average precision (MAP) performance of the eligibility screening approaches on the reference standard set (A) and the recall performance of the approaches at different algorithm cut-offs (B). Statistical significance tests (paired t test) of the performance difference between LCF + NLP + STE and the other approaches are also presented. LCF, logical constraint filter; NLP, natural language processing; STE, supervised term expansion.

DISCUSSION

In the gold standard experiments, the LCF approach showed good capability in excluding ineligible patients (workload reduction 49%, 50 vs 98 screened encounters). However, without the information from clinical narratives, it was unable to match descriptive criteria (eg, diagnosis) with patients’ profiles. By applying the NLP and IE algorithms, LCF + NLP further improved the performance (workload reduction 86%, 14 vs 98 screened encounters). This result verifies the effectiveness of the NLP and IE techniques and confirms the findings of some reports of the TREC medical record track on real-world data.29–31,33,35 For LCF + NLP + STE, we observed consistent improvement in performance when the STE training data increased (figure 4B). When a training size similar to the test data was used, the approach achieved better performance than LCF + NLP (figure 4A, workload reduction 43%, 8 vs 14 encounters). Given sufficient training samples (figure 4B), LCF + NLP + STE outperformed LCF + NLP statistically significantly. This promising result showed the great potential of STE in boosting the performance of automated ES. One representative example was observed on trial 9, where the inclusion criterion ‘sports related blunt trauma’ was ambiguous and found hardly any matches in patients’ clinical notes. By exploring the EHRs of previously eligible patients, STE additionally picked up sport-related terms (eg, football and soccer) for the trial vector and greatly improved the trial–patient matching (evidenced by a 66.7% workload reduction over LCF + NLP on this trial, 2 vs 6 screened encounters). The identical trends for the four approaches observed in the reference standard experiments confirmed the above findings and validated the scalability of our ES algorithms. By investigating the contribution of different pattern sets (table 2), we found that no single pattern covered a complete list of information, and the UMLS CUI was shown to be more informative than the others. 
The Text set was less informative, and combining it with UMLS CUI slightly improved the performance (combination 9 vs 2). A similar trend was observed on the SNOMED CT set, which did not contribute much additional information when UMLS CUI was used (combination 7 vs 2). Since drug-related information in the trial criteria was sparse (see online supplementary table A2), the RxNorm set contributed little information for trial–patient matching. Consequently, combining RxNorm with the other patterns barely influenced the results in our case. These observations suggest that, when designing the ES algorithms, one should customize pattern sets on the basis of trial requirements (eg, whether it contains drug information). Adding more patterns will not always increase the screening performance.

Error analysis, limitations, and future work

We performed error analysis for LCF + NLP + STE by reviewing the charts for all false positives made in the workload evaluation on the twofold cross-validation experiment. The LCF + NLP + STE approach made 88 errors, which are grouped into six categories in table 3. About 44% of the errors were ascribed to confusion between similar signs and symptoms (cause 1, eg, recommending a patient with ‘RUQ abdominal pain’ to a trial for ‘RLQ abdominal pain’) and to the omission of exclusions implied in the clinical narratives, such as time-related criteria (cause 4, eg, omitting the clue ‘pain started four days ago’, which indicates that the symptom had lasted for more than 72 h). These errors arise because our ES approach uses ‘bag-of-words’ patterns, which limits its ability to find semantic relations between consecutive words. To alleviate this problem, we will extend the pattern set to ‘bag-of-phrases’ in our future work and apply advanced NLP algorithms to analyze the semantic and temporal relations within the context.
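The RUQ/RLQ confusion can be illustrated with a toy comparison of word-level and phrase-level matching. This is a minimal sketch under simplifying assumptions: the tokenizer, both scoring functions, and the example note are hypothetical, and the study's actual matching uses trial and patient vectors built from several pattern sets, not these functions.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words_score(criterion, note):
    """Fraction of distinct criterion words that appear anywhere in the note."""
    crit = set(tokens(criterion))
    if not crit:
        return 0.0
    return len(crit & set(tokens(note))) / len(crit)

def phrase_score(criterion, note):
    """1.0 if the criterion occurs as one contiguous phrase in the note, else 0.0."""
    crit, words = tokens(criterion), tokens(note)
    n = len(crit)
    if n == 0:
        return 0.0
    found = any(words[i:i + n] == crit for i in range(len(words) - n + 1))
    return 1.0 if found else 0.0

criterion = "RLQ abdominal pain"
note = "Patient presents with RUQ abdominal pain since this morning."  # invented example

bow = bag_of_words_score(criterion, note)   # two of three criterion words match
phrase = phrase_score(criterion, note)      # the full phrase does not occur
print(f"bag-of-words: {bow:.2f}, phrase: {phrase:.2f}")
```

The word-level score stays high because ‘abdominal’ and ‘pain’ match regardless of the quadrant, whereas the phrase-level score rejects the note. Exact phrase matching is stricter than a practical system would want, since it misses paraphrases, which is one reason the semantic and temporal analysis mentioned above would still be needed.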

Table 3: False positive errors made by the LCF + NLP + STE approach

Cause of false positive errors identified by the chart review	Error (%)
1. The ES approach matched similar signs and symptoms (eg, RLQ and RUQ abdominal pain) but omitted the other criteria	30.68
2. The ES approach matched the correct diagnosis but could not identify ineligible patients because the exclusions did not exist in the collected EHR data fields (eg, less than 32 weeks’ gestational age)	17.04
3. The ES approach omitted the negation expression of the signs and symptoms (eg, Mom denied patient had diarrhea) and hence caused wrong patient recommendation	14.78
4. The ES approach matched the correct diagnosis but omitted some inclusions/exclusions implied in the clinical narratives (eg, symptoms >72 h)	13.63
5. The ES approach matched the terms expanded by the STE component (eg, football, soccer and skating) but omitted the primary criteria (eg, diagnosis)	2.27
6. Wrong diagnosis, other reasons	21.59

EHR, electronic health record; ES, eligibility screening; LCF, logical constraint filter; NLP, natural language processing; RLQ, right lower quadrant; RUQ, right upper quadrant; STE, supervised term expansion.
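As a consistency check on table 3, the percentages can be converted back to whole-number error counts using the 88 false positives reported in the error analysis. The counts below are inferred by rounding and are not reported in the source; the variable names are illustrative.

```python
TOTAL_ERRORS = 88  # false positives reviewed, as stated in the error analysis

# Percentages copied from table 3, keyed by cause number.
percent_by_cause = {1: 30.68, 2: 17.04, 3: 14.78, 4: 13.63, 5: 2.27, 6: 21.59}

# Inferred counts: nearest whole number of errors for each percentage.
count_by_cause = {
    cause: round(pct * TOTAL_ERRORS / 100) for cause, pct in percent_by_cause.items()
}

# Combined share of causes 1 and 4, the two attributed to bag-of-words patterns.
share_1_and_4 = percent_by_cause[1] + percent_by_cause[4]

print(count_by_cause, sum(count_by_cause.values()), round(share_1_and_4, 2))
```

The inferred counts sum to 88, and causes 1 and 4 together account for 44.31% of the errors, matching the ‘about 44%’ stated in the text.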

Another set of errors was caused by inclusion/exclusion criteria that were missing from the EHR data fields (cause 2, eg, we did not collect the field of gestational age, a criterion used in trials 1 and 13 in this study). The approach will be more powerful if we integrate more EHR fields into the LCF component (eg, additional demographics and laboratory data). Since we did not manually customize the ES algorithms to over-fit the current data, the mistakes made by the components (eg, negation detector and STE) were propagated and caused errors in patient recommendation (causes 3 and 5). We will tune these components on our current data (eg, introducing additional rules in the negation detector) to improve their accuracy in a future study.

One limitation of the study is that its evaluation is restricted to retrospective data. In the future, we will evaluate the practicality of automated ES in a randomized controlled prospective test environment. To verify the generalizability of the ES algorithms, we plan to test our approach on a more diversified patient population (eg, adult patients), multiple institutions, and clinical data in different formats (eg, clinical record formats used in different vendors’ EHR products).
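The negation errors (cause 3) concern expressions such as ‘Mom denied patient had diarrhea’. The sketch below shows the general shape of a rule-based negation check: a trigger word followed by the term within a fixed window. The trigger list, window size, and function name are assumptions for illustration; this is not the negation detector used in the study.

```python
import re

# Hypothetical trigger list for illustration; real detectors use much larger lexicons.
NEGATION_TRIGGERS = {"no", "not", "denied", "denies", "without"}
WINDOW = 5  # tokens after a trigger that are treated as negated (an assumption)

def is_negated(term, sentence):
    """Return True if the single-word `term` appears within WINDOW tokens after a trigger."""
    words = re.findall(r"[a-z]+", sentence.lower())
    term = term.lower()
    for i, word in enumerate(words):
        if word in NEGATION_TRIGGERS and term in words[i + 1:i + 1 + WINDOW]:
            return True
    return False

print(is_negated("diarrhea", "Mom denied patient had diarrhea"))
print(is_negated("diarrhea", "Patient had diarrhea for two days"))
```

Adding a rule here means adding a trigger or adjusting the window. A fixed window is crude: it can over-negate terms that follow an unrelated trigger, which is why such rules need tuning against annotated notes.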

CONCLUSION

By leveraging NLP, IE, and ML technologies on both the eligibility criteria and the patient EHRs, we demonstrated that NLP-, IE-, and ML-based automated ES could successfully identify patients for disease-specific clinical trials. In a physician-generated, gold-standard-based evaluation on real-world clinical data and trials, the approach achieved more than 90% workload reduction in patient cohort identification and showed the potential for a 450% increase in trial screening efficiency. This work also verified the effectiveness of the NLP, IE, and ML algorithms and UMLS components on a real-world dataset. Large-scale evaluation on the historical trial–patient enrollment decisions confirmed the findings and validated the scalability of the proposed algorithms. Consequently, we hypothesize that the automated ES approach, when rolled out for production, has the potential to significantly reduce the time and effort required to execute clinical research, particularly as important new initiatives greatly expand the number of, and access to, potential clinical trials for patients.

35 in total

1. Enrolling patients into clinical trials faster using RealTime Recuiting.

Authors: A J Butte; D A Weinstein; I S Kohane
Journal: Proc AMIA Symp Date: 2000

2. Efficacy and cost-effectiveness of an automated screening algorithm in an inpatient clinical trial.

Authors: Catherine C Beauharnais; Mary E Larkin; Adrian H Zai; Emily C Boykin; Jennifer Luttrell; Deborah J Wexler
Journal: Clin Trials Date: 2012-02-03 Impact factor: 2.486

3. Analysis of eligibility criteria representation in industry-standard clinical trial protocols.

Authors: Sanmitra Bhattacharya; Michael N Cantor
Journal: J Biomed Inform Date: 2013-06-12 Impact factor: 6.317

4. EliXR: an approach to eligibility criteria extraction and representation.

Authors: Chunhua Weng; Xiaoying Wu; Zhihui Luo; Mary Regina Boland; Dimitri Theodoratos; Stephen B Johnson
Journal: J Am Med Inform Assoc Date: 2011-07-31 Impact factor: 4.497

5. Dynamic categorization of clinical research eligibility criteria by hierarchical clustering.

Authors: Zhihui Luo; Meliha Yetisgen-Yildiz; Chunhua Weng
Journal: J Biomed Inform Date: 2011-06-12 Impact factor: 6.317

6. Competing for patients: an ethical framework for recruiting patients with brain tumors into clinical trials.

Authors: George M Ibrahim; Caroline Chung; Mark Bernstein
Journal: J Neurooncol Date: 2011-02-12 Impact factor: 4.130

7. Automated determination of metastases in unstructured radiology reports for eligibility screening in oncology clinical trials.

Authors: Valentina I Petkov; Lynne T Penberthy; Bassam A Dahman; Andrew Poklepovic; Chris W Gillam; James H McDermott
Journal: Exp Biol Med (Maywood) Date: 2013-10-09

8. Evaluating alert fatigue over time to EHR-based clinical trial alerts: findings from a randomized controlled study.

Authors: Peter J Embi; Anthony C Leonard
Journal: J Am Med Inform Assoc Date: 2012-04-25 Impact factor: 4.497

9. ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials.

Authors: Ioannis Korkontzelos; Tingting Mu; Sophia Ananiadou
Journal: BMC Med Inform Decis Mak Date: 2012-04-30 Impact factor: 2.796

10. A sequence labeling approach to link medications and their attributes in clinical notes and clinical trial announcements for information extraction.

Authors: Qi Li; Haijun Zhai; Louise Deleger; Todd Lingren; Megan Kaiser; Laura Stoutenborough; Imre Solti
Journal: J Am Med Inform Assoc Date: 2012-12-25 Impact factor: 4.497
