
Advancing the State of the Art in Clinical Natural Language Processing through Shared Tasks.

Michele Filannino, Özlem Uzuner.

Abstract

OBJECTIVES: To review the latest scientific challenges organized in clinical Natural Language Processing (NLP) by highlighting the tasks, the most effective methodologies used, the data, and the sharing strategies.
METHODS: We harvested the literature by using Google Scholar and PubMed Central to retrieve all shared tasks organized since 2015 on clinical NLP problems on English data.
RESULTS: We surveyed 17 shared tasks. We grouped the data into four types (synthetic, drug labels, social data, and clinical data) which are correlated with size and sensitivity. We found named entity recognition and classification to be the most common tasks. Most of the methods used to tackle the shared tasks have been data-driven. There is homogeneity in the methods used to tackle the named entity recognition tasks, while more diverse solutions are investigated for relation extraction, multi-class classification, and information retrieval problems.
CONCLUSIONS: There is a clear trend in using data-driven methods to tackle problems in clinical NLP. The availability of more and varied data from different institutions will undoubtedly lead to bigger advances in the field, for the benefit of healthcare as a whole. Georg Thieme Verlag KG Stuttgart.

Year:  2018        PMID: 30157522      PMCID: PMC6115235          DOI: 10.1055/s-0038-1667079

Source DB:  PubMed          Journal:  Yearb Med Inform        ISSN: 0943-4747


1 Introduction

Recent years have seen an increase in the number of scientific challenges, also called shared tasks, organized for the advancement of Natural Language Processing (NLP) in clinical data 1 . Shared tasks promote work specific to a “challenge question” posed to the research community and aim to evaluate the state of the art. Without the unifying framework of a shared task, even though the NLP community might work on the same general problem, the nuances of the problems will vary to the point where the approaches are not comparable. For this reason, shared task organizers often provide a data set, annotated with gold standard annotations, for system development and tuning. The evaluation of the systems on the challenge question takes place on a held-out data set. This setup provides a way of comparing systems head-to-head on the same data and task, and helps identify the state of the art.

The shared task data may remain available for research beyond the challenge time frame, providing a common benchmark for assessing the quality of future attempts 2 . Shared tasks are also a great resource for training future generations: they advance the research and engage students. The ready availability of datasets, evaluation scripts, and commentary provides an ideal environment that serves as a catalyst and motivator. In addition to the above benefits, in the clinical domain, because of the scarcity of data and their poor availability, shared tasks make it possible for the global research community to tackle problems that would otherwise be inaccessible to them 3 .

However, attaining these benefits requires overcoming some obstacles 4 :

Availability of data: Clinical data, i.e., data that contains clinical information, can take many forms. Although the term is most often used synonymously with electronic health records (EHRs), clinical data can also include social media data as well as information from resources such as drug labels. Each of these forms of data comes with its own challenges for access and use.

De-identification: The Health Insurance Portability and Accountability Act (HIPAA) 5 defines requirements for safeguarding patient health data, indicating elements of private health information (PHI) that need to be protected. De-identification, i.e., removal of PHI from records, provides one way of addressing this concern. However, there is a downside to de-identification: the process alters the contents of the original records and, as a result, some useful information may be lost. On the other hand, HIPAA-compliant de-identification may not be adequate in some cases, e.g., professions are not covered by HIPAA and may remain in the records even though, when rare enough, they could uniquely identify patients. This makes de-identification a challenging process that needs to strike a delicate balance between de-identifying the data, so that it can be shared, and preserving the medical content of the data, so that it can be useful for downstream medical research. As a result, de-identification often requires a manual review of the data - an expensive and time-consuming process that ultimately limits the size of the shared data.

Annotation: Often the bigger cost in shared task organization comes from gold standard generation for the clinically-relevant task that is posed to the community 6 . Gold standard generation requires input from experts who are well-versed in the tasks studied. Experts tend to be medical professionals with high hourly rates - another parameter that needs to be balanced against the volume of desired data.

In this paper, we review the latest scientific challenges organized to tackle NLP problems on clinical data. We highlight the tasks and the most effective methodologies used to tackle these tasks, along with the data used.

2 Methods

This review focuses on shared tasks using clinical data to tackle NLP problems. The relevant studies have been identified by querying Google Scholar and PubMed with “((shared task) OR challenge) AND (clinical OR health OR EHR) AND (NLP)”. The returned articles were limited to those describing clinical NLP shared tasks which were published since 2015. This resulted in a total of 17 shared tasks. Four challenges took place in 2015, six in 2016 and seven in 2017. Sixteen shared tasks are complete and published. One is completed but still in the process of publishing the outcomes. For a survey of shared tasks held before 2015, one can refer to Velupillai et al. 7 . For a survey in the broader field of biomedical text mining, one can refer to Huang et al. 8 .

3 Shared Tasks

Recent clinical NLP shared tasks have utilized social media data (e.g., Twitter, forum posts), journal articles (e.g., MEDLINE/PubMed), as well as electronic health records (e.g., pathology reports, nursing admission notes, psychiatric evaluation records) and other health-related documents such as drug labels. Collectively, these shared tasks posed questions on a variety of data, including both de-identified real data and synthetic records. Table 1 summarizes the key characteristics of each of the shared tasks. We present the shared tasks according to the type of data they use.
Table 1

Clinical NLP Challenges, the tasks they posed, and the number of participating teams, since 2015, ordered by data sensitivity.

Category | Year | Challenge name | Task description | Data type | Data source | Academia teams | Industry teams | Joint teams | Total teams
Synthetic | 2015 | TREC Clinical Decision Support (CDS) 9 | Patient-centered information retrieval | Medical case narratives | Synthetic, PubMed | 33 | 3 | 0 | 36
Synthetic | 2017 | TREC Precision Medicine 11 | Track 1: Patient-centered literature article retrieval; Track 2: Patient-centered clinical trials retrieval | Semi-structured cases | Synthetic, PubMed, ClinicalTrials.gov | 27 | 5 | 0 | 32
Synthetic | 2016 | CLEF eHealth 12 | Information extraction | Nursing handover notes | NICTA synthetic nursing handover notes | 4 | 0 | 0 | 4
Prescription drug labels | 2017 | Text Analysis Conference (TAC) Adverse Drug Reaction Extraction from Drug Labels (ADR) 18 | Track 1: ADR mentions and modifiers extraction; Track 2: Relation extraction; Track 3: Positive ADR filtering; Track 4: Positive ADR normalization | Drug labels | Drugs-Library.com | 6 | 3 | 1 | 10
Online social data | 2015 | CLPsych: Depression and PTSD on Twitter 22 | Binary classification of depression and PTSD users | Social media | Twitter | 3 | 0 | 0 | 3
Online social data | 2016 | Social Media Mining (SMM) 24 | Track 1: ADR classification; Track 2: Information extraction; Track 3: Concept normalization | Social media | Twitter | 9 | 2 | 0 | 11
Online social data | 2017 | Social Media Mining for Health Applications (SMM4HA) 29 | Track 1: ADR classification; Track 2: Classification of medication intake; Track 3: Concept normalization | Social media | Twitter | 12 | 1 | 0 | 13
Online social data | 2016 | CLPsych: Triaging content in online peer-support forums 33 | Classification of mental health severity in 4 levels | Forum | ReachOut | 13 | 1 | 1 | 15
Online social data | 2017 | CLPsych: Triaging content in online peer-support forums 35 | Classification of mental health severity in 4 levels | Forum | ReachOut | 12 | 2 | 1 | 15
Online social data | 2017 | NTCIR-13 MedWeb 36 | 8-class classification of diseases and symptoms | Multilingual social media | Twitter | 7 | 1 | 1 | 9
Clinical data | 2015 | Analysis of Clinical Text (ACT) 39 | Track 1: Disorder NER and normalization; Track 2a: Template slot filling (given gold spans); Track 2b: Disorder recognition and template slot filling (end-to-end) | Clinical notes | ShARe corpus (MIMIC) | 18 | 3 | 0 | 21
Clinical data | 2016 | TREC Clinical Decision Support (CDS) 43 | Patient-centered IR | Nursing admission notes | MIMIC, PubMed | 21 | 5 | 0 | 26
Clinical data | 2017 | Medication and Adverse Drug Events (MADE1.0) | Track 1: Medication, ADE, sign and symptom identification; Track 2: Relation extraction | Clinical notes | UMass Memorial Medical Center | - | - | - | -
Clinical data | 2015 | Clinical TempEval 45 | Track 1: Time expression extraction; Track 2: Event extraction; Track 3: Relation extraction (wrt DCT), Relation extraction (wrt narrative containers) | Pathology reports | Mayo Clinic | 3 | 0 | 0 | 3
Clinical data | 2016 | Clinical TempEval 46 | Track 1: Time expression extraction; Track 2: Event extraction; Track 3: Relation extraction (wrt DCT), Relation extraction (wrt narrative containers) | Pathology reports | Mayo Clinic | 11 | 3 | 0 | 14
Clinical data | 2017 | Clinical TempEval 48 | Track 1: Time expression extraction (cross-domain); Track 2: Event extraction (cross-domain); Track 3: Relation extraction (wrt DCT), Relation extraction (wrt narrative containers) | Pathology reports, Clinical notes | Mayo Clinic | 9 | 2 | 0 | 11
Clinical data | 2016 | Centers for Excellence in Genomics N-GRID (CEGS-NGRID) 51 | Track 1a: De-identification (cross-domain); Track 1b: De-identification; Track 2: Psychiatric Symptom Severity Prediction | Psychiatric evaluation records | Partners Healthcare and Harvard Medical School | 23 | 5 | 3 | 31

3.1 Synthetic Data

Synthetic data can serve as a placeholder for real data and allow researchers to sidestep the privacy issues related to real data. The downside is that generating synthetic data comes at a cost, and the generated data must capture the characteristics of real data so that the solutions developed remain valid on real data.

The 2015 Text REtrieval Conference (TREC) Clinical Decision Support (CDS) shared task aimed at evaluating biomedical retrieval systems 9 . The organizers provided a set of 30 synthetic case narratives (called topics), consisting of a short textual description, a summary, and a diagnosis. They asked the participants to develop systems for retrieving the most relevant scientific articles within a collection of 733,138 articles 1 from PubMed Central (PMC) 2 . Thirty-six teams participated in this task, 33 from academia, three from industry. The top performing system achieved an inferred normalized discounted cumulative gain (infNDCG) of 38.21% 10 by combining several Information Retrieval (IR) models (BM25, PL2, BB2).

The 2017 TREC Precision Medicine (PM) shared task 11 utilized 30 semi-structured synthetic topics (e.g., disease, genetic variants, demographic information, and other factors) and evaluated IR systems for their ability to match topics with: 1) 26,759,399 abstracts from MEDLINE; and 2) 241,006 clinical trial descriptions from ClinicalTrials.gov 3 . Thirty-two teams participated in this task, 27 from academia, five from industry. The top performing system achieved a precision at 10 (P@10) of 63.10% and 44.29% for Tracks 1 and 2, respectively. This system combined a query expansion module with a heuristic scoring method for abstracts and trials.

The Conference and Labs of the Evaluation Forum (CLEF) eHealth 2016 4 shared task 12 used the National Information and Communications Technology Australia (NICTA) Synthetic Nursing Handover Data 15 . This data set consisted of 300 notes that were authored by a registered nurse 5 . Each note consisted of a patient profile and a free-form text paragraph. One of the tasks posed to the participants on this data set was to automatically pre-populate handover forms with relevant text snippets (slot filling) 16 . Three teams participated in this task, all of them from academia. The top performing system scored 38.2% (F1-score) and relied on a Conditional Random Field (CRF) model that used a set of features extracted from Stanford CoreNLP, Unified Medical Language System (UMLS) 17 , WordNet, regular expression patterns, and Latent Dirichlet Allocation (LDA) clusters 18 . A wrapper algorithm evaluated several different subsets of these features and ultimately selected the best one.
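The retrieval scores quoted above for the TREC tasks (P@10 and infNDCG) are rank-based measures. As a rough illustration, the sketch below computes precision at k and plain NDCG for a single ranked list. It is a simplification under stated assumptions: infNDCG is an inferred variant estimated from sampled relevance judgments, which this sketch does not reproduce, and the ideal ranking here is taken over the retrieved list only. The function names and the toy relevance values are hypothetical.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int = 10) -> float:
    """Fraction of the top-k ranked documents judged relevant (relevance > 0)."""
    top = relevance[:k]
    return sum(1 for r in top if r > 0) / k


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Plain NDCG; assumes the ideal ranking is a reordering of the same list."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Toy example: graded judgments for the top 10 documents of one topic.
judged = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print(precision_at_k(judged, 10))
print(ndcg_at_k(judged, 10))
```

In a shared task, such per-topic scores would then be averaged over all topics (30 in the TREC tasks described above).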

3.2 Real Data

Prescription Drug Labels

Prescription drug labels published by the Food and Drug Administration (FDA) contain information about uses of medications, indications, and side effects. They are meant for public use and are free of any privacy concern 6 . This makes them a good target for studying medication-related problems, such as identifying adverse drug reactions (ADRs), comparing ADRs presented in labels from different manufacturers for the same drug, and performing pharmacovigilance by identifying new ADRs not currently included in labels.

The 2017 Text Analysis Conference (TAC) ADR Extraction from Drug Labels 19 studied FDA drug labels. The organizers shared a dataset of 2,309 unannotated drug labels, 200 of which were manually annotated with ADR spans, relations, and concept identifiers (IDs) 7 . TAC proposed four tasks: 1) ADR mentions and modifiers span extraction; 2) extraction of relations between ADRs and their corollaries; 3) filtering of positive ADRs; and 4) positive ADR normalization 20 . Ten teams took part in this task, six from academia, three from industry, and one joint team. The same system ranked first on all tasks, where it achieved an F1-score of 82.48%, 49.00%, 82.19% (macro F1), and 85.33% (macro F1), respectively. This system used two distinct bi-directional Long Short-Term Memory (LSTM)-CRF models with some post-processing rules to tackle the first two tasks. A learning-to-rank approach using RankSVM (support vector machine) on the top 10 normalization candidates tackled Tasks 3 and 4.
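The F1-scores reported for mention extraction are computed over predicted spans rather than over individual tokens. The sketch below shows one common way to score spans, assuming a strict criterion in which a prediction counts only if its offsets and label match a gold annotation exactly. The text does not state the matching criterion used by the TAC evaluation, so this is an assumption; the `Span` type and the example offsets are hypothetical.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, label); this layout is an assumption.
Span = Tuple[int, int, str]


def strict_span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match span scoring."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold_spans = {(0, 5, "ADR"), (10, 18, "ADR")}
# The second prediction is off by one character, so it does not count as a match.
system_spans = {(0, 5, "ADR"), (10, 17, "ADR"), (20, 25, "ADR")}
print(strict_span_f1(gold_spans, system_spans))
```

Relaxed variants that accept overlapping spans would score the off-by-one prediction as correct, which is one reason scores from different challenges are not directly comparable.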

Online Social Data

Among the information shared in social media are personal views, experiences, and even health information 21 . However, social media data are not free of privacy and ethics concerns 22 . Access to most social media data requires registration and consent to the governing rules, which can prevent secondary uses and limit the maximum amount of data to be collected. If social media data are not de-identified, then they cannot be shared among institutions and must be (re-)obtained directly from their source; for example, Twitter data are often “distributed” in the form of tweet IDs, user IDs, and download scripts. Since 2015, there have been six clinical NLP shared tasks that used social media data. The data for four of them have been manually de-identified (or anonymized) and require a data use agreement (DUA) to be signed. Some are available for download beyond the challenges’ timeframes.

The 2015 Computational Linguistics and Clinical Psychology Workshop (CLPsych) used Twitter data for classifying users based on depression and post-traumatic stress disorder (PTSD) 23 . The organizers collected, anonymized, and annotated tweets of the form “I have just been diagnosed with X”, with “X” being depression or PTSD. The resulting dataset included 7.857 million tweets from 477 depression patients, 396 PTSD patients, and 1,746 control users. The data were distributed according to Twitter terms of service, along with a privacy agreement that required protective measures for downloaded copies. The data are available for download 8 and require Institutional Review Board (IRB) approval and signing of the privacy policy. Four teams participated in this task, three from academia, one from industry. The best performing system achieved an average precision above 80% 24 and was based on a Support Vector Machine (SVM) with linear kernel and baseline lexical features with term frequency-inverse document frequency (TF-IDF) weighting.

The 2016 Social Media Mining shared task (SMM) 25 studied tweets for identifying ADRs. A data set of 10,822 anonymized tweets 26 was annotated by two pharmacology experts and was made available to the participants 9 . The shared task consisted of three tracks: 1) classification of tweets as ADR- and non-ADR-related; 2) ADR span extraction from tweets; and 3) linking ADRs to their UMLS 17 concepts. Eleven teams took part in this task but only six are reported in the overview: four from academia, two from industry. In the first track, the best performing system achieved an F1-score of 41.95% 27 by using an ensemble of Random Forest models with unigram, bigram, and trigram features. Track 2 was tackled as a Named Entity Recognition (NER) task by all the participants; the most effective machine learning (ML) model was a CRF, which achieved a 61.10% F1-score 28 on a subset of the entire corpus (2,131 annotated tweets). The organizers did not receive submissions for Track 3.

Track 1 was re-proposed at the 2017 workshop 29 along with two new tasks: classification of medication intake types, and normalization of clinical concepts to the Medical Dictionary for Regulatory Activities (MedDRA) 20 . The 2017 workshop also extended the 2016 dataset to 15,717 tweets for training and 9,961 for testing. For classification of ADR-related tweets, the top performing system achieved an F1-score of 43.5% with an SVM model trained on textual features and domain-specific word embeddings 30 . For classification of medication intake, the top performing system achieved an F1-score of 69.3% and used convolutional neural networks (CNNs) on word embeddings 31 . Finally, the top performing system for concept normalization scored an F1-score of 88.5% and used an ensemble of linear and deep learning models 32 .

The 2016 CLPsych shared task 33 used 65,024 posts from the online forum of ReachOut, an Australian non-profit that supports young people. A total of 1,227 posts were manually prioritized by three independent judges on a 4-point scale according to how urgently they needed a response from a moderator (i.e., paraprofessional support). The remaining posts were left un-annotated to allow experiments with semi-supervised and unsupervised techniques. Fifteen teams took part in this task: 13 from academia, one from industry, and one joint team. The top performing system achieved a macro-averaged F1-score of 42% by using an ensemble of classifiers working on different granularities of text 34 . The task was repeated in 2017 with an expanded dataset (157,963 posts, of which 1,588 were annotated 10 ) and attracted a similar number of teams 35 . The best performing team obtained a macro-averaged F1-score of 46.7%. The data from 2016 and 2017 are available for download on request.

Finally, the 2017 NII Testbeds and Community for Information access and Research'13 (NTCIR-13) MedWeb shared task 11 used a dataset of 2,560 tweets in Japanese, English, and Chinese 36 . The organizers manually de-identified the data and shared them with the participants under a DUA. Participants were asked to label the data with eight diseases/symptoms: influenza, diarrhea, hay fever, cough/sore throat, headache, fever, runny nose, and cold. Four academic teams took part in the English subtask by submitting 12 systems. The best system 37 achieved an exact match accuracy of 88% by using an ensemble of hierarchical attention networks (HAN) and deep character-level convolutional neural networks (CNNs). At the time of writing, only the training data was available for download 12 .
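Several of the classification results above are reported as macro-averaged F1-scores, which give every class the same weight regardless of how many examples it has. The following sketch, assuming single-label classification and averaging over the classes that occur in the gold labels (conventions differ on this point), shows how a system can have high accuracy and still a low macro-averaged F1 when it misses a rare class. The label names are hypothetical.

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over the classes present in the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scores = []
    for label in set(gold):
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never matched.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# A system that always predicts the majority class: 75% accuracy, low macro F1.
gold_labels = ["routine", "routine", "routine", "urgent"]
system_labels = ["routine", "routine", "routine", "routine"]
print(macro_f1(gold_labels, system_labels))
```

This sensitivity to rare classes is relevant for triage-style tasks, where the most urgent category is typically the least frequent.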

Clinical Notes

Clinical notes constitute the most sensitive set of data for shared tasks. They are governed by HIPAA and access to these data can require human subjects training, as well as DUAs even when they are de-identified. Medical Information Mart for Intensive Care (MIMIC) is the most frequently used source of de-identified clinical notes. It contains health data of over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 38 . Since 2015, two shared tasks have utilized MIMIC as their data set. Other shared tasks have used de-identified and annotated data from their own home institutions. Unless noted otherwise, these data are distributed with DUAs.

The 2015 Analysis of Clinical Text (ACT) shared task 39 utilized the ShARe dataset 40 consisting of 531 manually annotated discharge summaries, electrocardiograms, echo, and radiology reports from MIMIC-II. ACT focused on two tasks: 1) detection and normalization of disorder mentions; and 2) template slot filling. Twenty-one teams took part in this task, 18 from academia, three from industry. These teams tackled the first task as a sequence labelling problem, using CRFs in combination with word embeddings and ad-hoc sentence clustering. The second task was proposed in two settings according to whether the participants used the gold or the predicted spans for the disorder mentions (Track 2.a and 2.b respectively). The best performing system for the first task scored 75.7% (strict F1-score) 41 . On the second task, the same system scored first in both settings: 88.6% (weighted accuracy score) on Track 2.a and 80.8% (F1 * weighted accuracy score) on Track 2.b 42 . This system tackled the tasks by using a combination of CRFs and a binary SVM, both based on part-of-speech tags and syntactic features.

The 2016 TREC Clinical Decision Support shared task 43 studied patient-centered IR. The organizers provided a set of 1.25 million scientific articles from PubMed Central (PMC), and 30 nursing admission notes from MIMIC-III (called topics). With permission from the MIMIC team, the notes were made publicly available without the need for a DUA. Even though the notes were already de-identified, the de-identification process was manually carried out a second time for maximum privacy protection. For consistency with the previous challenges in the series 9 44 , only the notes’ history of present illness sections were provided to the participants 13 . Participants were asked to retrieve articles relevant for answering questions on diagnoses, tests, and treatments. Twenty-six teams took part in the challenge: 21 from academia, and five from industry. The top performing system achieved a precision at 10 of 40.33%, which is higher than the best score achieved in 2015 (see above). Despite this, the average results were lower than in 2015. The organizers ascribed the result to the difference between real Intensive Care Unit (ICU) notes and synthetic general practice notes.

NLP Challenges for Detecting Medication and Adverse Drug Events from Electronic Health Records (MADE1.0) 14 utilized 1,092 medical records from 21 cancer patients in the UMass Memorial Medical Center to propose three tasks: 1) clinical named entity recognition; 2) relation identification; and 3) end-to-end systems to conduct the first two tasks together. This shared task has been completed, but the overview paper has not yet been published.

The Clinical TempEval challenges 45 46 , hosted at the Semantic Evaluation series (SemEval), used 600 de-identified clinical notes and pathology reports from cancer patients at the Mayo Clinic that are manually annotated with temporal expressions, medical events, and temporal relations 15 . In 2015, three academic teams participated in this shared task. In 2016, 14 teams participated, three of which were from industry. The 2016 results were better than those of the previous year, with the top performing system in time expression extraction achieving an F1-score of 79.5% 47 using linear and structural SVMs on morphological, syntactic, discourse, and word representation features. The same system also ranked first in the event (F1-score 90.3%) and temporal relations tasks (F1 75.6% for the relations with respect to document creation times (DCTs), and F1 47.9% for the ones among narrative containers).

Clinical TempEval 2017 48 studied a domain adaptation problem, from colon cancer to brain cancer pathology reports and clinical notes. The corpus for this task contained 1,216 notes from each of the two types of cancer patients at the Mayo Clinic 16 . The notes were manually de-identified and annotated by experts 6 17 . Eleven teams took part in this shared task: nine from academia, and two from industry. The best performing system achieved F1-scores of 57% for time expression spans (using an ensemble of CRFs, rules and decision trees) 49 , 72% for event spans, 59% for temporal relations with respect to the DCT, and 32% for those among narrative containers 50 . The system used neural networks with character and word embeddings combined with SVMs. Those results were approximately 20% lower than the ones registered by systems trained and tested on the same domain 45 46 .

Finally, the CEGS-NGRID Shared Tasks and Workshop on Challenges in NLP for Clinical Data made available a corpus of 1,000 manually de-identified psychiatric evaluation records from Partners Healthcare 51 . The organizers extended the HIPAA definition of PHI for better privacy protection. They proposed two tasks: 1) de-identification 52 , and 2) symptom severity prediction 53 . De-identification was studied in two subtasks: a) benchmarking pre-existing de-identification systems 54 55 on psychiatric records 18 (called “sight unseen”); and b) regular de-identification. Overall, 31 teams took part, 23 from academia, five from industry, and three jointly from industry/academia. The same system scored the highest in both subtasks of Task 1, with F1-scores of 79.85% 52 and 91.43% 55 , respectively. The system used a combination of CRFs, BI-LSTMs, and rules. The result suggests that “out-of-the-box solutions provide a good start at building models that can be tuned to the new data”. In the second task, symptom severity prediction, the systems were scored using the Inverse Normalized Mean Absolute Error Macro-averaged (INMAE^M), which weights a prediction's error according to its ordinal distance from the correct class. The top performing system used an ensemble of machine learning classifiers based on morphological, syntactic, and structural features and achieved an INMAE^M score of 86.3%, which is close to the level of accuracy recorded by the least experienced of the annotators.

The information presented in this section highlights how the data varies in its sensitivity to privacy, which inversely correlates with the available data size. Tasks range from NER to relation extraction, multi-class classification problems and information retrieval, with these last ones being the most successful in terms of both attracting participation and system performance. The use of CRF and BI-LSTM models is common to almost all the top performing NER systems. More diverse methods are used for the relation extraction and multi-class classification problems.
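The severity prediction measure mentioned above penalizes a prediction more the further it lies from the correct ordinal class. The sketch below is one plausible formalization, not necessarily the organizers' exact definition: classes are assumed to be integers from 0 to `num_classes - 1` (a 4-level scale is assumed here), the mean absolute error is computed per gold class, each class error is divided by the largest error possible for that class, the normalized errors are macro-averaged, and the result is inverted onto a 0-100 scale. The function name and the default of four classes are hypothetical.

```python
from collections import defaultdict
from typing import Sequence


def inverse_normalized_macro_mae(
    gold: Sequence[int], predicted: Sequence[int], num_classes: int = 4
) -> float:
    """Score in [0, 100]; 100 means every prediction matches its gold class."""
    if num_classes < 2:
        raise ValueError("need at least two ordinal classes")
    if len(gold) != len(predicted) or len(gold) == 0:
        raise ValueError("gold and predicted must be non-empty and equally long")

    # Group absolute errors by gold class so that each class weighs the same.
    errors_by_class = defaultdict(list)
    for g, p in zip(gold, predicted):
        errors_by_class[g].append(abs(p - g))

    normalized_errors = []
    for cls, errors in errors_by_class.items():
        # Largest possible distance from this class to either end of the scale.
        worst_case = max(cls, num_classes - 1 - cls)
        normalized_errors.append((sum(errors) / len(errors)) / worst_case)

    macro_error = sum(normalized_errors) / len(normalized_errors)
    return 100.0 * (1.0 - macro_error)


# One near miss on the lowest class, everything else correct.
print(inverse_normalized_macro_mae([0, 0, 3], [0, 1, 3]))
```

Under this reading, confusing adjacent severity levels costs far less than confusing the two ends of the scale, which plain accuracy would treat identically.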

4 Discussion

The discussed shared tasks offer interesting insights related to the availability of data, the advances in the state-of-the-art techniques, the role of privacy, and the importance of data size in supporting the methodological advances.

4.1 Data Availability

The concerns of availability of data, privacy, and cost of annotation ultimately shape the landscape of the field and give direction to the state of the art. Attempts to bypass concerns about data availability and privacy with synthetic data shift the cost of de-identification to the cost of generating synthetic records, and run the risk of producing a synthetic set that does not represent real data well. Efforts to use social media data to understand users' perspectives on their health problems face the same kind of privacy concerns as the notoriously sensitive EHR data. They additionally run into constraints related to long-term access to data: either the data do not remain available after the challenge or they need to be re-obtained from the social media site itself. When the data must be re-obtained, the fate of the data set is left in the hands of the social media users, and data can be lost if users delete their messages or accounts.

4.2 Observing Advances in the State of the Art

Shared tasks continue to grow both in their numbers and in the participation they attract. Especially for the tasks that are organized regularly, the consistency in the tasks and growing datasets continue to attract growing numbers of participants. Some tasks such as de-identification and NER tend to recur because of their high practical value. Table 2 shows the performances of the systems participating in the most recent shared tasks. It shows that tasks such as clinical named entity recognition (medications, times, events, PHI) are well understood with system performances above 70% (see TAC ADR 2017, CLPsych 2015, ACT 2015, and CEGS-NGRID 2016), while tasks such as relation extraction with performances below 50% need more attention (see the Clinical TempEval series). Clinical information retrieval tasks, with a performance around 50% (see the TREC series), show the need for further research. Finally, multi-class classification tasks (see the CLPsych series) show a performance below 50%, which can be partly justified by the lack of annotated data.
Table 2

List of shared tasks with data source, data size, sub-tasks descriptions, and best-performance score (metrics differ per challenge). The table also contains information about data availability after the challenge, whether the data have been de-identified, and whether they require a DUA to be signed.

| Category | Year | Challenge name | Task description | Data type | Data source | Data size | De-identification / anonymization | DUA | Currently available? | Best performance | Measure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Synthetic | 2015 | TREC Clinical Decision Support (CDS) 9 | Patient-centered information retrieval | Medical case narratives | Synthetic, PubMed | 30 topics, 730K articles | no | no | yes | 38.21% | infNDCG |
| Synthetic | 2017 | TREC Precision Medicine 11 | Track 1: Patient-centered literature article retrieval; Track 2: Patient-centered clinical trials retrieval | Semi-structured cases | Synthetic, PubMed, ClinicalTrials.gov | 30 topics, 27M abstracts, 241K trials | no | no | yes | 63.10%; 44.29% | P@10; P@10 |
| Synthetic | 2016 | CLEF eHealth 12 | Information extraction | Nursing handover notes | NICTA synthetic nursing handover notes | 300 notes | no | no | yes | 38.20% | F1 (macro avg.) |
| Prescription drug labels | 2017 | Text Analysis Conference (TAC) Adverse Drug Reaction Extraction from Drug Labels (ADR) 18 | Track 1: ADR mentions and modifiers extraction; Track 2: Relation extraction; Track 3: Positive ADR filtering; Track 4: Positive ADR normalization | Drug labels | Drugs-Library.com | 2309 labels | no | no | yes | 82.48%; 49.00%; 82.19%; 85.33% | F1; F1; F1 (macro avg.); F1 (macro avg.) |
| Online social data | 2015 | CLPsych: Depression and PTSD on Twitter 22 | Binary classification of depression and PTSD users | Social media | Twitter | 7.8M tweets | yes | yes | yes | 80.00% | Avg. Precision |
| Online social data | 2016 | Social Media Mining (SMM) 24 | Track 1: ADR classification; Track 2: Information extraction; Track 3: Concept normalization | Social media | Twitter | 10,882 tweets | no | no | yes | 41.95%; 61.10%; - | F1; F1 |
| Online social data | 2017 | Social Media Mining for Health Applications (SMM4HA) 29 | Track 1: ADR classification; Track 2: Classification of medication intake; Track 3: Concept normalization | Social media | Twitter | 15,777 tweets | no | no | yes | 43.50%; 69.30%; 88.50% | F1; F1 (micro avg.); Accuracy |
| Online social data | 2016 | CLPsych: Triaging content in online peer-support forums 33 | Classification of mental health severity in 4 levels | Forum | ReachOut | 65,024 (1,227 annotated) | yes | yes | yes, on request | 42.00% | F1 (macro avg.) |
| Online social data | 2017 | CLPsych: Triaging content in online peer-support forums 35 | Classification of mental health severity in 4 levels | Forum | ReachOut | 157,963 posts (1,588 annotated) | yes | yes | yes, on request | 46.70% | F1 (macro avg.) |
| Online social data | 2017 | NTCIR-13 MedWeb 36 | 8-class classification of diseases and symptoms | Multilingual social media | Twitter | 2560 tweets | yes | yes | yes, on request | - | |
| Clinical data | 2015 | Analysis of Clinical Text (ACT) 39 | Track 1: Disorder NER and normalization; Track 2a: Template slot filling (given gold spans); Track 2b: Disorder recognition and template slot filling (end-to-end) | Clinical notes | ShARe corpus (MIMIC) | 531 summaries | yes | yes | yes | 75.70%; 88.60%; 80.80% | F1 (strict); F1 * weighted acc.; F1 * weighted acc. |
| Clinical data | 2016 | TREC Clinical Decision Support (CDS) 43 | Patient-centered IR | Nursing admission notes | MIMIC, PubMed | 30 notes, 1.25M abstracts | | | | 40.33% | P@10 |
| Clinical data | 2017 | Medication and Adverse Drug Events (MADE1.0) | Track 1: Medication, ADE, sign and symptom identification; Track 2: Relation extraction | Clinical notes | UMass Memorial Medical Center | 1092 records | yes | yes | no | - | |
| Clinical data | 2015 | Clinical TempEval 45 | Track 1: Time expression extraction; Track 2: Event extraction; Track 3: Relation extraction (wrt DCT); Relation extraction (wrt narrative containers) | Pathology reports | Mayo Clinic | 600 notes | yes | yes | yes, on request | 72.50%; 87.50%; 70.20%; 12.30% | F1; F1; F1; F1 |
| Clinical data | 2016 | Clinical TempEval 46 | Track 1: Time expression extraction; Track 2: Event extraction; Track 3: Relation extraction (wrt DCT); Relation extraction (wrt narrative containers) | Pathology reports | Mayo Clinic | 600 notes | yes | yes | yes, on request | 79.50%; 90.30%; 75.60%; 47.90% | F1; F1; F1; F1 |
| Clinical data | 2017 | Clinical TempEval 48 | Track 1: Time expression extraction (cross-domain); Track 2: Event extraction (cross-domain); Track 3: Relation extraction (wrt DCT); Relation extraction (wrt narrative containers) | Pathology reports, Clinical notes | Mayo Clinic | 1216 notes | yes | yes | yes, on request | 57.00%; 72.00%; 59.00%; 32.50% | F1; F1; F1; F1 |
| Clinical data | 2016 | Centers for Excellence in Genomics N-GRID (CEGS-NGRID) 51 | Track 1a: De-identification (cross-domain); Track 1b: De-identification; Track 2: Psychiatric Symptom Severity Prediction | Psychiatric evaluation records | Partners Healthcare and Harvard Medical School | 1000 records | yes | yes | yes, on request | 79.85%; 91.43%; 86.30% | F1; F1; INMAE^M |
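Several challenges in the table report a macro-averaged F1, while one reports a micro-averaged F1. The two can diverge sharply when classes are imbalanced. The following is a minimal sketch of both averages for single-label classification, using only the Python standard library. The function names and the toy labels are illustrative assumptions, not taken from any of the challenges' official evaluation scripts.

```python
from collections import Counter


def per_class_counts(gold, pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[g] += 1  # the gold class gains a false negative
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    tp, fp, fn = per_class_counts(gold, pred)
    labels = sorted(set(gold) | set(pred))
    return sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)


def micro_f1(gold, pred):
    """F1 over pooled counts: every instance counts equally."""
    tp, fp, fn = per_class_counts(gold, pred)
    return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))


# Made-up example: 8 "neg" and 2 "pos" instances, and a system that
# always predicts the majority class.
gold = ["neg"] * 8 + ["pos"] * 2
pred = ["neg"] * 10

print(f"macro F1 = {macro_f1(gold, pred):.3f}")
print(f"micro F1 = {micro_f1(gold, pred):.3f}")
```

In this toy case the majority-class system scores 0.800 on micro-averaged F1 but only about 0.444 on macro-averaged F1, because the ignored minority class contributes a per-class F1 of zero. This illustrates why scores in the table are only comparable within the same measure.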

4.3 Balancing Access, Privacy, and Corporate Confidentiality

Interestingly, academic institutions have so far dominated shared task participation. Few of the shared tasks reviewed in this paper had significant participation from industry (e.g., the TREC series and CEGS N-GRID). Industry bridges the gap between pure research and technology 56 . However, stringent rules governing the use of data, together with a hesitation to share methods openly for fear of losing intellectual property, result in decreased participation. Attracting more companies to shared tasks would help diversify the methods and contributions, reduce the gap between academia and industry, and shorten the time it takes for methods to be adopted by industry. The DUAs that participants must sign before accessing data vary in complexity. Some DUAs pose very strict requirements, e.g., storing the data on machines that are not connected to the Internet for the entire duration of the challenge. Limiting the terms of DUAs to requirements that match the sensitivity level of the data could open up more data sets to more parties and encourage broader participation.

4.4 Larger Datasets Support Methodological Advances

The approaches used to tackle problems in clinical NLP are almost entirely data-driven. Named entity recognition tasks, such as medication or ADR extraction, are commonly solved using CRFs or deep learning approaches (Bi-LSTMs), often with word embeddings, although n-gram features are still used. Classification and relation extraction tasks are tackled using ensembles, often as a way of coping with the imbalanced nature of the classes. Because the performance of data-driven methods depends on the amount and variety of training data, this makes a compelling argument for adopting bigger datasets. Although they increase the cost of design and annotation, richer data sets have the benefit of increasing the external validity of the developed solutions.
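One common way in which an ensemble can cope with class imbalance is to train each member on a balanced undersample of the data and combine the members by majority vote. The sketch below shows only these two steps and leaves the base classifiers out. It is an illustration under stated assumptions: the function names `balanced_subsets` and `majority_vote` are hypothetical, and the paper does not say that any participating system used this particular scheme.

```python
import random
from collections import Counter


def balanced_subsets(labels, n_members, seed=0):
    """Return one sorted list of example indices per ensemble member.

    Each list contains the same number of examples from every class,
    namely the size of the smallest class, sampled without replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    smallest = min(len(idx) for idx in by_class.values())

    subsets = []
    for _ in range(n_members):
        subset = []
        for idx in by_class.values():
            subset.extend(rng.sample(idx, smallest))
        subsets.append(sorted(subset))
    return subsets


def majority_vote(member_predictions):
    """Combine predictions (one list per member) by hard voting.

    Ties go to the label that appears first among the tied votes.
    """
    combined = []
    for votes in zip(*member_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Made-up labels: 8 majority-class and 2 minority-class examples.
labels = ["neg"] * 8 + ["pos"] * 2
for subset in balanced_subsets(labels, n_members=3):
    print(subset)  # 4 indices each: 2 "neg" examples plus both "pos" examples

# Made-up predictions of three members on two test instances.
print(majority_vote([["pos", "neg"], ["pos", "pos"], ["neg", "neg"]]))
```

Each member sees every minority example but a different slice of the majority class, so the ensemble as a whole still uses most of the data while no single member is dominated by the majority class.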

5 Conclusions

In this paper we reviewed the latest scientific challenges organized in clinical NLP, by highlighting the tasks, the most effective methodologies used, the data, and the sharing strategies. We surveyed 17 shared tasks, grouped by the type of data used (synthetic, drug labels, social data, and clinical data). We found that the type of data is correlated with its size and sensitivity. Recognition and classification of named entities are the most common tasks, usually tackled by data-driven approaches. We hope that the growing number of success stories in shared task organization will encourage more institutions to share data. More and varied data from different institutions will undoubtedly lead to bigger advances in the field, for the benefit of healthcare as a whole.