Using clinical Natural Language Processing for health outcomes research: Overview and actionable suggestions for future advances.

Sumithra Velupillai¹, Hanna Suominen², Maria Liakata³, Angus Roberts⁴, Anoop D Shah⁵, Katherine Morley⁶, David Osborn⁷, Joseph Hayes⁸, Robert Stewart⁹, Johnny Downs¹⁰, Wendy Chapman¹¹, Rina Dutta¹².

Abstract

The importance of incorporating Natural Language Processing (NLP) methods in clinical informatics research has been increasingly recognized over the past years, and has led to transformative advances. Typically, clinical NLP systems are developed and evaluated on word, sentence, or document level annotations that model specific attributes and features, such as document content (e.g., patient status, or report type), document section types (e.g., current medications, past medical history, or discharge summary), named entities and concepts (e.g., diagnoses, symptoms, or treatments) or semantic attributes (e.g., negation, severity, or temporality). From a clinical perspective, on the other hand, research studies are typically modelled and evaluated on a patient- or population-level, such as predicting how a patient group might respond to specific treatments or patient monitoring over time. While some NLP tasks consider predictions at the individual or group user level, these tasks still constitute a minority. Owing to the discrepancy between scientific objectives of each field, and because of differences in methodological evaluation priorities, there is no clear alignment between these evaluation approaches. Here we provide a broad summary and outline of the challenging issues involved in defining appropriate intrinsic and extrinsic evaluation methods for NLP research that is to be used for clinical outcomes research, and vice versa. A particular focus is placed on mental health research, an area still relatively understudied by the clinical NLP research community, but where NLP methods are of notable relevance. Recent advances in clinical NLP method development have been significant, but we propose more emphasis needs to be placed on rigorous evaluation for the field to advance further. To enable this, we provide actionable suggestions, including a minimal protocol that could be used when reporting clinical NLP method development and its evaluation.

Keywords: Clinical informatics; Epidemiology; Evaluation; Information extraction; Mental Health Informatics; Natural Language Processing; Public Health; Text analytics

Year: 2018 PMID: 30368002 PMCID: PMC6986921 DOI: 10.1016/j.jbi.2018.10.005

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317

Introduction

Appropriate utilization of large data sources such as Electronic Health Record (eHealth records or EHR) databases could have a dramatic impact on health care research and delivery. Owing to the large amount of free text documentation now available in EHRs, there has been a concomitant increase in research to advance Natural Language Processing (NLP) methods and applications for the clinical domain [1,2]. The field has matured considerably in recent years, addressing many of the challenges identified by Chapman et al. [3], and meeting the recommendations by Friedman et al. [4]. For example, these include recommendations to address the key challenges of limited collaboration and the lack of shared resources and evaluation approaches for crucial tasks, such as de-identification, recognition and classification of medical concepts, semantic modifiers, and temporal information. These challenges have been addressed by the organization of several shared tasks. These include the Informatics for Integrating Biology and the Bedside (i2b2) challenges [5-9], the Conference and Labs of the Evaluation Forum (CLEF) eHealth challenges [10-13], and the Semantic Evaluation (SemEval) challenges [14-16]. These efforts have provided a valuable platform for international NLP method development. Furthermore, the development of open-source NLP software specifically tailored to clinical text has led to increased adoptability.
Such NLP software includes the clinical Text Analysis Knowledge Extraction System (cTAKES)¹ and the Clinical Language Annotation, Modeling, and Processing Toolkit (CLAMP),² information extraction and retrieval infrastructure solutions such as SemEHR [17], as well as general purpose tools such as the General Architecture for Text Engineering (GATE)³ and Stanford CoreNLP.⁴ New initiatives, such as the Health Natural Language Processing (hNLP) Center,⁵ also aim to facilitate the sharing of resources, which would enable further progress through availability, transparency, and reproducibility of NLP methodologies. In recent years, the field of mental health has seen a rapid increase in the use of NLP strategies and methods, mainly because most clinical documentation is in free text, but also because of the increasing availability of other types of documents providing behavioural, emotional, and cognitive indicators as well as cues on how patients are coping with different conditions and treatments. Such text sources include social media and online fora [18-21] as well as doctor-patient interactions [22-24] and online therapy [25], to mention a few examples. However, although there have been a few shared tasks related to mental health [26-28], the field is still narrower than that of biomedical or general clinical NLP. The maturity of NLP method development and state-of-the-art results have led to an increase in successful deployments of NLP solutions for complex clinical outcomes research. However, the methods used to evaluate and appraise NLP approaches are somewhat different from methods used in clinical research studies, although the latter often rely on the former for data preparation and extraction. There is a need to clarify these differences and to develop novel approaches and methods to bridge this gap. This paper stems from the findings of an international one-day workshop in 2017 (see online Supplement).
The objective of the workshop was to explore these evaluation issues by outlining ongoing research efforts in these fields, and it brought together researchers and clinicians working in the areas of NLP, informatics, mental health, and epidemiology. The workshop highlighted the need to provide an overview of requirements, opportunities, and challenges of using NLP in clinical outcomes research (particularly in the context of mental health). Our aim is to provide a broad outline of current state-of-the-art knowledge, and to make recommendations on directions going forward in this field, with a focus on considerations related to intrinsic and extrinsic evaluation issues.

Evaluation paradigms

All empirical research studies need to be evaluated (or validated) in order to allow for scientific assessment of a study. In clinical outcomes research, studies are usually designed as clinical trials, cohort studies or case-control studies, with the aim of assessing whether a risk factor or intervention has a significant association with an outcome of interest. NLP method development, on the other hand, aims to produce computational solutions to a given problem. Studies of diagnostic tools are most similar to NLP method development: they test whether a history item, examination finding or test result is associated with a subsequent diagnosis. The most basic underlying construction for quantitative validation in both fields is a 2 × 2 contingency table (or confusion matrix), where the number of correctly and incorrectly assigned values for a given binary outcome or classification label is compared with a gold (reference) standard, i.e. the set of ‘true’ or correct values. This table can then be used to calculate performance metrics such as precision (Positive Predictive Value), recall (sensitivity), accuracy, F-score, and specificity. In clinical studies, the same kind of table can be used to calculate measures of association, such as risk ratio and odds ratio. There are other evaluation measures that can be used when the outcome is more complex, e.g., continuous or ranked (for NLP, see e.g., [29], for clinical prediction models, see e.g., [30]). Validation or evaluation in clinical outcomes research, whether the study is a trial, cohort or case-control study, relies on statistical measurements of effect, and can be performed internally (measured on the original study sample) or externally (measured on a different sample) [31]. Typically, a number of predictors (variables) interact in these models; multivariable models are therefore common, and it is important to account for biases to ensure model validity.
Because the goal of NLP method development is to produce computational solutions to specific problems, evaluation criteria can be intrinsic (evaluating an NLP system in terms of directly measuring its performance on attaining its immediate objective) and extrinsic (evaluating an NLP system in terms of its usefulness in fulfilling an over-arching goal where the NLP system is perhaps part of a more complex process or pipeline) [32-34,29,35]. The goal of clinical research studies, on the other hand, typically relates to assessing the effect of a treatment or intervention. Clinical NLP method development has mainly focused on internal, intrinsic evaluation metrics. Typically, these methods have been developed and evaluated on word, sentence or document level annotations that model specific attributes and features, such as document content (e.g., patient status, or report type), document section types (e.g., current medications, past medical history, or discharge summary), named entities and concepts (e.g., diagnoses, symptoms, or treatments), or semantic attributes (e.g., negation, severity, or temporality). Although the intrinsic evaluation metrics are important and valuable, especially when comparing different NLP methods for the same task, they are not necessarily of value or particularly informative when the task is applied on a higher-level problem (e.g., patient level) or on new data. For instance, the current state-of-the-art in medical concept classification is > 80% F-score [7], which is close to human agreement on the same task; however, if such a system were to be deployed in clinical practice, any error rate above 0%, such as the misclassification of a drug or a history of severe allergy, might be seen as unacceptable. True negatives are rarely taken into consideration in NLP evaluation, often because this is intractable in text analysis [36]. Yet, specificity (the true negative rate, i.e. the proportion of gold standard negatives that the new assessment correctly identifies as negative) is often a key factor in clinical research, particularly in medical screening but also in categorisation of exposures (e.g. case status) and outcomes. Thus when using outputs from NLP approaches in clinical research studies, it is not always clear how best to incorporate and interpret NLP performance metrics.
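The metrics named in this section all derive from the same four cell counts, which the following minimal Python sketch makes explicit. The class and function names and all counts are illustrative assumptions, not taken from any cited study, and the sketch does not guard against empty cells (division by zero).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Cell counts from comparing system output with a gold standard."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def precision(self) -> float:  # positive predictive value
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:  # sensitivity
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:  # true negative rate
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def f1(self) -> float:  # harmonic mean of precision and recall
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


def risk_ratio(a: int, b: int, c: int, d: int) -> float:
    """a, b: exposed with/without outcome; c, d: unexposed with/without."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Same cell layout as risk_ratio."""
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only.
cm = ConfusionMatrix(tp=40, fp=10, fn=20, tn=130)
print(cm.precision(), cm.recall(), cm.specificity(), cm.accuracy(), cm.f1())
print(risk_ratio(30, 70, 10, 90), odds_ratio(30, 70, 10, 90))
```

Note that precision, recall and F-score never use the true negative count, whereas specificity depends on it; this is one concrete reason why an NLP evaluation that omits true negatives cannot be translated directly into the measures a screening study needs.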

Opportunities and challenges from a clinical perspective

The opportunities and potential of NLP are hugely exciting for health research generally, and for mental health research specifically. As clinical informatics resources become larger and more comprehensive as a result of text-derived meta-data, determining outcomes, prognoses, effectiveness, and harm comes within closer reach, requiring fewer resources than would be needed to conduct primary research studies. A variety of data sources are amenable to clinical research, such as social media, wearable device data, and audio and video recordings of team discussions and interactions. However, EHR-derived data potentially offer the most immediate value, given the time and patient numbers over which data have already been collected, their comprehensive and now-established use across healthcare, and given the depth of clinical information potentially available in real world services. Compared to primary research cohorts, the coverage is far greater, findings are substantially more generalisable, and external validation of models becomes possible [37].

Clinical NLP applied on mental health records

A key issue with mental health EHR data is that the most salient information for research and clinical practice tends to be entered in text fields rather than pre-structured data, with up to 70% of the record documented in free text [38]. For instance, the Clinical Record Interactive Search (CRIS) system from the South London and Maudsley mental health trust (SLaM) contained almost 30 million event notes and correspondence letters, and more than 322,000 patients,⁶ yielding an average of 90 documents per patient (even more if additional text content were included, e.g., free-text entries in risk assessment notes). This is partly because the most important features of mental health care do not lend themselves to structured fields. Such features include the salience of the self-reported experience (i.e. mental health symptoms), determining treatment initiation and outcome evaluation, as well as the complex circumstances influencing presentations and prognoses (e.g., social support networks, recent or past stressful experiences, psychoactive substance use). Moreover, written information can be more accurate and reliable, and allows for expressiveness, which better reflects the complexity of clinical practice [39,40]. While there have been calls for increased structuring of health records, these seem to be mainly driven by convenience issues for researchers or administrators (i.e. ease of access to pre-structured data) rather than the preferences of the clinical staff actually entering data [39]. Most clinical researchers and clinicians are accustomed to research methods involving highly scrutinised de novo data collection with standardised instruments (such as the Beck Depression Inventory (BDI) or the Positive and Negative Syndrome Scale (PANSS)). These have established psychometric properties for the concepts they measure, such as symptom severity in patients with schizophrenia (e.g., positive symptoms such as delusions, hallucinations).
Using NLP methods to derive and identify such concepts from EHRs holds great promise, but requires careful methodological design. If NLP algorithms are to be fit for purpose, they need to demonstrate accurate and reliable measurement, which in turn needs to be communicated in a language that is understandable across disciplines, especially considering the differences in language used by the clinical and NLP academic communities. Because of the importance of information accuracy in medical practice, including the validity and reliability of tests and instruments, translating NLP system outputs to an interpretable measure is key. This way the clinical community can easily understand the basis for the underlying NLP model, allowing for the potential translation of NLP-derived observational findings into clinical interventions. Moreover, ensuring that an NLP approach is appropriately designed for a specific clinical problem is essential. For example, timely detection of the risk of suicidal behaviour in patients using NLP approaches on EHR data is clinically important, but challenging: not only because of the various ways this can be documented in text, but also because of the complexity of the clinical construct. Suicidal behaviour is relatively rare, and current tools for assessing suicide risk are inadequate and suffer from low PPV/precision [41]. Data-driven methods hold promise as a solution to develop more accurate predictive models, but they need to be carefully designed. In one case study, < 3% of EHR documents for 200 patients had any suicide-related information documented in free text, whilst at a patient level, 22% of the patients had at least one document with written suicide-related information [42]. Thus, for this type of use-case, it is important for method development to ensure an appropriate sample (document or patient), and to provide interpretable NLP output results.
Other key challenges in applying NLP on mental health records include moving beyond simple named-entity recognition towards ascertaining novel and more complex entities such as markers of socioeconomic status or life experiences, as well as unpicking temporality in order to reconstruct disorder and treatment pathways. In addition, there are the more computational challenges of moving beyond single-site applications to wider multi-site provision of NLP resources, as well as evaluating translation for international use. Other types of textual data such as patient-generated text (e.g., online forums, questionnaires, and feedback forms) involve additional challenges, e.g., the ability to adapt models for clinical constructs such as mood recognition from a wider population to individuals, as well as the ability to calculate mood scores over time, based on a number of linguistic features and often in the face of sparse or missing data.

Using NLP for large-sample clinical research

The capacity of NLP approaches to extract additional, non-structured information is particularly important for large-sample research studies, which are often focused on identifying as many predictors (and potential confounders) of an outcome as possible [43]. These may include factors at both the ‘macro-environment’ level, such as family/social circumstances, and the individual patient level, such as tobacco use [44]. This capacity is especially important for mental health research, given there may be reticence to code stigmatised conditions, such as illicit drug use, when they are not the primary reason for seeking treatment [45]. Also, structured codes cannot accommodate diagnostic uncertainty, and do not permit the recording of clinically-relevant information that supports a diagnosis, e.g., sleep or mood, but is not the specific condition for which a patient receives treatment [46]. Thus NLP approaches both enable the improvement of case identification from health records [46,47] and can provide a much richer set of data than could be achieved by the use of structured data alone. However, this increase in the depth of data provided by NLP can come at a cost to study reproducibility and research transparency. An EHR-based study requires a clear specification of how the data recorded for each patient were collected and processed prior to analysis. In the context of EHR research this is often referred to as developing ‘phenotypes’, with the intention that the algorithms developed can be reused by others [48-50]. Incorporation of NLP output data in phenotype algorithms may make it more difficult for researchers using different EHR data to replicate results. For instance, if the underlying data used to develop an NLP solution to extract a phenotype such as atrial fibrillation are specific to the EHR system, geographical area and other factors, the NLP algorithm may produce different results if applied on new data for the same task.
Even if NLP methods are shared, their application may be hampered if similar source documents are not available. This issue would be compounded if multiple phenotypes are used to build the epidemiological data set. One practical solution is to adopt some of the measures suggested for clinically-focused observational research, such as the publication of study protocols and/or cohort descriptions [51].

NLP in clinical practice — towards extrinsic evaluation

Nuances of human language mean that no NLP algorithm is completely accurate, even for a seemingly straightforward task such as negation detection [52]. An error rate can be accommodated statistically in research, but to support decisions about individual patient care, results of NLP must be verified by a clinician before being used to make recommendations about patient management. Such verification might be better accepted by the users if the system provides probabilistic outputs rather than binary decisions. The difficulties in safely incorporating these uncertainties may have contributed to the gap between research applications of NLP and its use in clinical settings [2]. When algorithms are used in clinical decision support, it is important to display the information that is used to make the recommendation, and for clinicians to be aware of potential weaknesses of the algorithm. Clinical decision systems are more useful if they provide recommendations within the clinical workflow at the time and location of decision making [53]. Within EHR systems, NLP may be used to improve the user interface, such as the ease of finding information in a patient’s record. Real-time NLP can potentially assist clinicians to enter structured observations, evaluations or instructions from free text by, for example, automatically transforming a paragraph into a diagnostic code or suggested treatment. The accuracy of such algorithms may be tested by calculating the proportion of suggested structured entries that the clinician verifies as being correct. Clinical NLP systems have not, as of yet, been developed with clinical experts in mind, and have rarely been evaluated according to extrinsic evaluation criteria. As NLP systems become more mature, usability studies will also be a necessary step in NLP method development, to ensure that clinicians’ and other non-NLP users’ input can be taken into consideration. 
For instance, in the 2014 i2b2 Track 3 - Software Usability Assessment, it was shown that current clinical NLP software is hard to adopt [54]. Tools such as Turf (EHR Usability Toolkit)⁷ could be made common practice when developing NLP solutions for clinical research problems. NLP could also become more integrated into EHR systems in the future, where evaluation metrics that focus on the time a documentation task takes, documentation quality, and other aspects also need to be considered [55]. The ideal evaluation would be a randomized trial in clinical practice, comparing usability and data quality between user interfaces incorporating different NLP algorithms.
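One of the simpler extrinsic measures mentioned in this section, the proportion of suggested structured entries that a clinician verifies as correct, can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the `Suggestion` record, the placeholder codes, the confidence values and the 0.5 threshold are assumptions, not part of any system described in the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Suggestion:
    code: str          # structured entry proposed by the NLP system (placeholder)
    confidence: float  # probabilistic output of the system, between 0 and 1
    verified: bool     # True if the clinician confirmed the suggestion


def acceptance_rate(suggestions: Iterable[Suggestion]) -> float:
    """Proportion of suggestions that the clinician verified as correct."""
    items = list(suggestions)
    if not items:
        raise ValueError("no suggestions to evaluate")
    return sum(s.verified for s in items) / len(items)


def acceptance_by_band(
    suggestions: Iterable[Suggestion], threshold: float = 0.5
) -> Dict[str, Optional[float]]:
    """Acceptance rate for suggestions at/above and below a confidence threshold."""
    items = list(suggestions)
    high = [s for s in items if s.confidence >= threshold]
    low = [s for s in items if s.confidence < threshold]
    return {
        "high": acceptance_rate(high) if high else None,
        "low": acceptance_rate(low) if low else None,
    }


# Hypothetical verification log, for illustration only.
log = [
    Suggestion("CODE-A", 0.9, True),
    Suggestion("CODE-B", 0.8, True),
    Suggestion("CODE-C", 0.4, False),
    Suggestion("CODE-D", 0.3, True),
]
print(acceptance_rate(log))
print(acceptance_by_band(log))
```

Splitting the rate by confidence band is one way to check whether probabilistic outputs are informative to the verifying clinician: if high-confidence suggestions are not accepted more often than low-confidence ones, displaying the probability adds little.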

Opportunities and challenges from a Natural Language Processing perspective

The two major approaches to NLP, as described by Friedman et al. [4], namely symbolic methods (based on linguistic structure and world knowledge) and statistical methods (based on frequency of occurrence and distribution patterns), are still dominant in clinical NLP development. Advances in machine learning algorithms, such as neural networks, have influenced NLP applications, and there are, of course, further developments to be expected. However, many of the developments, particularly in neural network models, assume large, labeled datasets, and these are not readily available for clinical use-cases that require analysis of EHR text content. Another challenge is data availability: ethical regulations and privacy concerns need to be addressed if authentic EHR data are to be used for research, but there are also alternative methods that can be used to create novel resources (Section 4.1). Furthermore, evaluation of NLP systems is still typically performed with standard statistical metrics based on intrinsic criteria, not necessarily optimal for the clinical research problem at hand. To address such issues, it is important to identify which level of analysis is appropriate, and model the problem accordingly (Section 4.2). Enriching informatics approaches with novel data sources, using evaluation metrics that capture novel aspects such as model interpretability or time sensitivity, and developing NLP solutions with the clinical end users in mind (Section 4.3) could lead to considerable advances in this field.

Methods for developing shareable data

Risks for compromised privacy are particularly evident in analyzing text from health records (i.e. the inability to fully convince the relevant authorities that all explicit and implicit privacy-sensitive information has been de-identified) and in big data health research more generally (i.e. unforeseen possibilities of inferring an individual’s identity after record linkages from multiple de-identified sources). The same ethical and legal policies that protect privacy complicate the data storage, use, and exchange from one study to another, and the constraints for these data exchanges differ between jurisdictions and countries [56]. As a timely solution to these data exchange problems, synthetic clinical data has been developed. For example, a set of 301 patient cases which includes recorded spoken handover and annotated verbatim transcriptions based on synthetic patient profiles, has been released and used in shared tasks in 2015 and 2016 [12,13,57]. Similarly, synthetic clinical documents have been used in 2013 and 2014 in shared tasks on clinical NLP in Japanese [58]. Synthetic data has been successful in tasks such as dialogue generation [59] and is a promising direction at least as a complement for method development where access to data is challenging.

Intrinsic evaluation and representation levels

When considering the combination of NLP methods and clinical outcomes research, differences in granularity are a challenge. NLP methods are usually developed to identify and classify instances of some clinically relevant phenomenon at a sub-document or document level. For example, NLP methods for the extraction of a patient’s smoking status (e.g., current smoker, past smoker or non-smoker) will typically consider individual phrases that discuss smoking, of which there may be several in a single document [60]. Even in cases where an NLP method is used to classify a whole document (e.g., assigning tumor classifications to whole histopathology reports [61]), there may be several documents for an individual patient. Typically, in evaluating clinical NLP methods, a gold standard corpus with instance annotations is developed, and used to measure whether or not an NLP approach correctly identifies and classifies these instances. If a gold standard corpus contains multiple annotations and documents for one patient, and the NLP system correctly classifies these, the evaluation score will be higher. For clinical research, on the other hand, only one of these instances may be relevant and correct. In the extreme, a small number of patients with a high number of irrelevant instances could bias the NLP evaluation relative to the clinical research question. For instance, a gold standard corpus annotated on a mention level for positive suicide-related information (patient is suicidal) or negated (patient denies suicidal thoughts) was used to develop an NLP system [62] which had an overall accuracy of 91.9%.
However, when this system was applied for a clinical research project to identify suicidal patients, implementing a document- and, more crucially, a patient-level classification based on such instance-level annotations required non-trivial assumptions, because the documents could contain several positive and negative mentions, and each patient could have several EHR documents [42]. There is thus often a gap between instance level and patient level evaluations. In order to resolve the differences in granularity between the NLP and clinical outcomes evaluations, this gap needs to be bridged. Typically, some post-processing will be required, in order to filter the instances found by the NLP method, before their use in clinical outcomes research. For example, post-processing might merge instances, or might remove those that are irrelevant. For some use-cases, this post-processing procedure can be based on currently available evidence, such as case identification for certain diseases (e.g., asthma status [63]). The gap as described does not always exist, and post-processing is not always appropriate. There are cases in which an NLP method might be used to process all of the relevant text associated with a single patient, over time, in order to directly predict a single outcome; for example, all past text (and other information) could be used in order to assign a diagnosis code [64]. Moreover, for some clinical use-cases, patient-level annotations, e.g., from manual chart review of sets of notes to assign one patient-level clinical label, might be more efficient for developing gold standards and subsequent NLP solutions. Evaluation metrics for such use-cases could be developed to measure the degree to which the NLP system correctly classifies groups compared to manual review. For specific use-cases, this approach might even be more appropriate than focusing on mention- and document-level annotations.
Looking further, NLP methodology that addresses clinical objectives by not only finding relevant instances, but actually summarising all relevant information over time, is desirable; however, this is a non-trivial aspiration, particularly considering methods for evaluation and conveyance of such summaries.
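The gap between instance-level and patient-level evaluation can be made concrete with a small Python sketch. The aggregation rule used here (a patient is labelled positive if any of their mentions is positive) is only one possible post-processing assumption, chosen for illustration; the data, names and labels are hypothetical and do not reproduce the cited systems.

```python
from typing import Dict, List, NamedTuple, Sequence


class Mention(NamedTuple):
    patient_id: str
    gold: str       # "positive" or "negated" in the gold standard
    predicted: str  # label assigned by the NLP system


def mention_accuracy(mentions: Sequence[Mention]) -> float:
    """Instance-level accuracy: every mention counts equally."""
    return sum(m.gold == m.predicted for m in mentions) / len(mentions)


def patient_label(labels: Sequence[str]) -> str:
    """Assumed rule: any positive mention makes the patient positive."""
    return "positive" if "positive" in labels else "negated"


def patient_accuracy(mentions: Sequence[Mention]) -> float:
    """Patient-level accuracy after aggregating mentions per patient."""
    gold: Dict[str, List[str]] = {}
    pred: Dict[str, List[str]] = {}
    for m in mentions:
        gold.setdefault(m.patient_id, []).append(m.gold)
        pred.setdefault(m.patient_id, []).append(m.predicted)
    correct = sum(patient_label(gold[p]) == patient_label(pred[p]) for p in gold)
    return correct / len(gold)


# Hypothetical corpus: one patient contributes most of the mentions.
mentions = [Mention("P1", "negated", "negated")] * 8 + [
    Mention("P2", "positive", "negated"),   # the only error
    Mention("P3", "positive", "positive"),
]
print(mention_accuracy(mentions), patient_accuracy(mentions))
```

In this toy corpus the mention-level accuracy is 0.9 while the patient-level accuracy is 2/3, because the eight correctly classified mentions all belong to one patient and the single error changes the label of another. A different aggregation rule (for example, using only the most recent mention) would give a different patient-level result, which is why the rule needs to be reported alongside the metric.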

Beyond electronic health record data

Work on using computational language analysis on speech transcripts to study communication disturbances in patients with schizophrenia [65] or to predict onset of psychosis [66,67] has shown promising results. Further, the availability of large datasets has led to advances in the field of psycholinguistics [68]. The increasing online availability of patient-related texts, including social media posts and themed fora, especially around long term conditions, has also led to an increase in NLP applications for mental health and the health domain in general. For example, recent NLP work classifies users into patient groups based on social media posts over time [21,69,70], prioritises posts for potential interventions based on topics, sentiment and the overall conversation thread [27] or identifies temporal expressions and relations within clinical texts [15,16]. Whilst this is encouraging in terms of the interaction between NLP and the health domain, these tasks are still primarily evaluated using classic NLP system performance metrics such as accuracy, recall, and F-score. Recent NLP community efforts have initiated new evaluation levels and metrics, e.g., prediction of current and future psychological health based on childhood essays from longitudinal cohort data as in the 2018 Computational Linguistics and Clinical Psychology Workshop (CLPsych) shared task;⁸ however, the direct clinical applicability of such approaches is yet to be shown. Work by Tsakalidis et al. [71] is the first to use both language and heterogeneous mobile phone data to predict mental well-being scores of individual users over time calibrated against psychological scales. Results were promising, yet their evaluation strategy involved training and testing on data from all users, a scenario often encountered in studies involving mood score predictions using mobile phone data [72-74].
However, a more realistic evaluation scenario would involve either: (a) intra-user predictions over time, that is, for the same user calculating a mood score or other health indicator given previous data in a sequence, at consecutive time intervals or (b) predictions of some indicator, over time, for an unseen user, given a model created on the basis of other users [75]. For these tasks to have potential clinical utility, new evaluation metrics would need to be introduced that focus on a number of other aspects, such as:

Time-sensitive and timely prediction: longitudinal prediction of an indicator such as a score from a psychological scale or other health indicator. These would need to be predicted over time as monitoring points rather than as predictions that are independent of time, as is the case in current standard classification approaches.

Personalised models: intra-user models are very useful for personalised health monitoring. However, such personalised models would require large amounts of longitudinal data for individual users, which are not often available.

Model interpretability: a model should provide confidence scores for its predictions and provide evidence for the prediction. Current model evaluation focusses on performance without any regard to interpretability.
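The two evaluation scenarios (a) and (b) described above differ only in how records are split into training and test data. The following Python sketch shows one possible way to generate such splits; the `Record` type, the function names and the toy data are hypothetical, and the sketch deliberately leaves out the model and the scoring metric.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple, Sequence, Tuple


class Record(NamedTuple):
    user: str
    t: int        # time index of the observation
    score: float  # e.g. a mood score on some psychological scale


def intra_user_splits(
    records: Sequence[Record], min_history: int = 1
) -> Iterator[Tuple[List[Record], Record]]:
    """Scenario (a): predict each user's next record from their own past only."""
    by_user: Dict[str, List[Record]] = defaultdict(list)
    for r in records:
        by_user[r.user].append(r)
    for user_records in by_user.values():
        ordered = sorted(user_records, key=lambda r: r.t)
        for i in range(min_history, len(ordered)):
            yield ordered[:i], ordered[i]  # history strictly precedes target


def leave_one_user_out(
    records: Sequence[Record],
) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Scenario (b): train on all other users, test on one unseen user."""
    for held_out in sorted({r.user for r in records}):
        train = [r for r in records if r.user != held_out]
        test = [r for r in records if r.user == held_out]
        yield held_out, train, test


# Hypothetical data, deliberately out of time order.
records = [
    Record("u1", 2, 3.0), Record("u2", 0, 5.0), Record("u1", 0, 4.0),
    Record("u1", 1, 3.5), Record("u2", 1, 4.5),
]
for history, target in intra_user_splits(records):
    print(target.user, [h.t for h in history], "->", target.t)
for held_out, train, test in leave_one_user_out(records):
    print(held_out, len(train), len(test))
```

By contrast, a random split over all records lets the same user, and even later time points of the same user, appear in both training and test data, which is the situation the text describes as less realistic.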

Actionable guidance and directions for the future

NLP method development for the clinical domain has reached mature stages and has become an important part of advancing data-driven health care research. In parallel, the clinical community is increasingly seeing the value and necessity of incorporating NLP in clinical outcomes studies, particularly in domains such as mental health, where narrative data holds key information. However, for clinical NLP method development to advance further globally and, for example, become an integral part of clinical outcomes research, or have a natural place in clinical practice, there are still challenges ahead. Based on the discussions during the workshop, the main challenges include data availability, evaluation workbenches and reporting standards. We summarize these below and provide actionable suggestions to enable progress in this area.

Data availability

The lack of sufficiently large sets of shareable data is still a problem in the clinical NLP domain. We encourage the increased development of alternative data sources such as synthetic clinical notes [57,58], which alleviate the complexities involved in governance structures. However, in parallel, initiatives to make authentic data available to the research community through alternative governance models, like the MIMIC-III database [76], are also encouraged. Greater connection between NLP researchers, primary data collectors, and study participants is required. Further studies in alternative patient consent models (e.g., interactive e-consent [77]) could lead to larger availability of real-world data, which in turn could lead to substantial advances in NLP development and evaluation. Moving beyond EHR data, there is also valuable information in accessible online data sources such as social media (e.g., PatientsLikeMe), which are of particular relevance to the mental health domain, and which could also be combined with EHR data [78]. Efforts to engage users in donating their public social media and sensor data for research, such as OurDataHelps, are interesting avenues that could prove very valuable for NLP method development. Furthermore, in addition to written documentation, there is promise in the use of speech technologies, specifically for information entry at the bedside [57,79-83].

Evaluation workbenches

Current clinical NLP methods are typically developed for specific use-cases and evaluated intrinsically on limited datasets. When such methods are used off-the-shelf on new use-cases and datasets, their performance is unknown. For clinical NLP method development to become more integral in clinical outcomes research, there is a need to develop evaluation workbenches that can be used by clinicians to better understand the underlying parts of an NLP system and its impact on outcomes. Work in the general NLP domain could be inspirational for such development, for instance integrating methods to analyse the effect of NLP pipeline steps in downstream tasks (extrinsic evaluation), such as the effect of dependency parsing approaches [84]. Alternatively, methods that enable analysis of areas where an existing NLP solution might need calibration when applied to a new problem, e.g., by posterior calibration [85], are an interesting avenue of progress. If clinical NLP systems are developed for non-NLP experts, to be used in subsequent clinical outcomes research, the NLP systems need to be easy to use. Facilitating the integration of domain knowledge in NLP system development can be done by providing support for formalized knowledge representations that can be used in subsequent NLP method development [86].
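As an illustration of what a calibration check in such a workbench might look like, the sketch below computes a simple binned calibration error for a binary classifier: predicted probabilities are grouped into equal-width bins, and the mean predicted probability in each bin is compared with the observed fraction of positive labels. This is a generic diagnostic and not the specific procedure of [85]; the function name and the toy inputs are assumptions for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error for predicted positive-class probabilities.

    probs  : predicted probabilities in [0, 1]
    labels : gold labels, 1 for positive and 0 for negative
    Returns the bin-size-weighted mean absolute gap between the mean
    predicted probability and the observed positive rate per bin.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - positive_rate)
    return ece


# Hypothetical outputs of an existing system applied to a new dataset.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
well_calibrated = expected_calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
print(overconfident, well_calibrated)
```

A large value on a new dataset would suggest that the system's confidence scores should not be taken at face value there, even if its ranking of cases remains useful.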

Reporting standards

Most importantly, ensuring transparency and reproducibility of clinical NLP methods is key to advancing the field. In the clinical research community, the issue of lack of scientific evidence for a majority of reported clinical studies has been raised [87]. Several aspects need to be addressed to make published research findings scientifically valid, among them replication culture and reproducibility practices [88]. This is also true for clinical NLP method development. We propose a minimal protocol inspired by [32] (see Figs. 1 and 2) that outlines the key details of any clinical NLP study; by reporting on these, others can easily identify whether or not a published approach could be applicable and useful in a new study and, for example, whether or not adaptations might be necessary. This could encourage further development of a comprehensive guidance framework for NLP, similar to what has been proposed for the reporting of observational studies in epidemiology in the STROBE statement [89], and other initiatives (e.g., [90,91,51]).

Fig. 1

Example of a suggested structured protocol with essential details for documenting NLP approaches and performed evaluations. The example includes different levels of evaluation (intrinsic and extrinsic) that could be outlined with details about the task, metrics, results, and error analysis/comments.

Fig. 2

A minimal protocol example of details to report on the development of a clinical NLP approach for a specific problem, that would enable more transparency and ensure reproducibility.

The key details that are needed are:

Data: What type of source data was used? How was it sampled? What is the size (in terms of sentences, words)? How was the data obtained? Is it available to other researchers?

NLP approach: What was the objective or task? At which type of textual unit is analysis performed (document, sentence, entities, word)? Is there a gold/reference standard? If so, how was the gold/reference standard generated? If it was manual, what was the inter-rater/annotator agreement? Have the guidelines and definitions been made publicly available?

Model development: What type of approach was taken? Parameter settings? Prerequisites?

Evaluation: Was the model evaluated intrinsically or extrinsically? Which metrics were used? What types of errors were common? If the evaluation was extrinsic, what assumptions were made? For example, if an NLP output provides counts of mentions for a condition, what threshold for determining whether or not someone can be considered a case was chosen, and why? How was the conversion from mention to case level done? Conversely, if an NLP system provides an output for a patient-level label based on a set of information sources (e.g., documents), what method was applied to identify and analyse disagreements?
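The mention-to-case question raised under Evaluation can be illustrated with a small sketch. It assumes, purely for illustration, that the NLP output is a list of (patient identifier, affirmed flag) pairs, one per mention, and that a patient counts as a case once the number of affirmed mentions reaches a chosen threshold. Both the data layout and the default threshold of 2 are hypothetical; the point is that the threshold is an explicit, reportable parameter.

```python
from collections import Counter


def mentions_to_cases(mentions, threshold=2):
    """Convert mention-level NLP output into patient-level case labels.

    mentions  : list of (patient_id, affirmed) pairs, one per mention;
                `affirmed` is False for negated or otherwise excluded mentions
    threshold : minimum number of affirmed mentions for a patient to be
                labelled a case (a study-specific choice that should be reported)
    Returns a dict mapping each patient_id to True (case) or False.
    """
    affirmed_counts = Counter(pid for pid, affirmed in mentions if affirmed)
    patients = sorted({pid for pid, _ in mentions})
    return {pid: affirmed_counts[pid] >= threshold for pid in patients}


# Hypothetical mention-level output for three patients.
mentions = [
    ("p1", True), ("p1", True), ("p1", False),
    ("p2", True),
    ("p3", False),
]

for threshold in (1, 2):
    print(threshold, mentions_to_cases(mentions, threshold=threshold))
```

Running the conversion at several thresholds, as in the loop above, is one simple way to document how sensitive the resulting case definition is to that choice.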

Conclusions

We have sought to provide a broad outline of the current state-of-the-art, opportunities, challenges, and needs in the use of NLP for health outcomes research, with a particular focus on evaluation methods. We have outlined methodological aspects from a clinical as well as an NLP perspective and identified three main challenges: data availability, evaluation workbenches and reporting standards. Based on these, we provide actionable guidance for each identified challenge. We propose a minimal structured protocol that could be used when reporting clinical NLP method development and its evaluation, to enable transparency and reproducibility. We envision further advances particularly in methods for data access, in evaluation methods that move beyond current intrinsic metrics and closer to clinical practice and utility, and in transparent and reproducible method development.

Supplementary Material

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.jbi.2018.10.005.
