
Medical Information Extraction in the Age of Deep Learning.

Abstract

OBJECTIVES: We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basic IE tasks, named entity recognition and relation extraction, for two selected semantic classes, diseases and drugs (or medications), and relations between them.
METHODS: For the time period from 2017 to early 2020, we searched for relevant publications from three major scientific communities: medicine and medical informatics, natural language processing, as well as neural networks and artificial intelligence.
RESULTS: In the past decade, the field of Natural Language Processing (NLP) has undergone a profound methodological shift from symbolic to distributed representations based on the paradigm of Deep Learning (DL). Meanwhile, this trend is, although with some delay, also reflected in the medical NLP community. In the reporting period, overwhelming experimental evidence has been gathered, as illustrated in this survey for medical IE, that DL-based approaches outperform non-DL ones by often large margins. Still, small-sized and access-limited corpora create intrinsic problems for data-greedy DL as do special linguistic phenomena of medical sublanguages that have to be overcome by adaptive learning strategies.
CONCLUSIONS: The paradigm shift from (feature-engineered) ML to DNNs changes the fundamental methodological rules of the game for medical NLP. This change is by no means restricted to medical IE but should also deeply influence other areas of medical informatics, either NLP- or non-NLP-based. Georg Thieme Verlag KG Stuttgart.

Year: 2020 PMID: 32823318 PMCID: PMC7442512 DOI: 10.1055/s-0040-1702001

Source DB: PubMed Journal: Yearb Med Inform ISSN: 0943-4747

1 Introduction

The past decade has seen a truly revolutionary paradigm shift for Natural Language Processing (NLP), as a result of which Deep Learning (DL) (for a technical introduction, cf. 1; for comprehensive surveys, cf. 2 and 3) became the dominating mind-set of researchers and developers in this field (for surveys, cf. 4 5). Yet, DL is by no means a new computational paradigm. Rather, it can be seen as the most recent offspring of neural computation in the evolution of computer science (cf. the historical background provided by Schmidhuber 6). But unlike previous attempts, it now turns out to be extremely robust and effective for adequately dealing with the contents of unstructured visual 7, audio/speech 8, and textual data 9. The success of Deep Neural Networks (DNNs) has many roots. Perhaps the most important methodological reason is that, with DNNs, manual feature selection or (semi-)automated feature engineering is abandoned. This time-consuming tuning step was at the same time mandatory and highly influential on the performance of earlier generations of ML systems in NLP based on Markov Models (MMs), Conditional Random Fields (CRFs), Support Vector Machines (SVMs), etc. In a DL system, however, the relevant features (and their relative contribution to a classification decision) are automatically computed as a result of thousands of iterative training cycles. The ultimate reason for the success of DNNs is a pragmatic criterion, though: system performance. Compared with results in biomedical Information Extraction (IE) obtained in previous years with standard ML methods, DL approaches profoundly changed the rules of the game. In a landslide manner, for the same task and domain, performance figures jumped to unprecedented levels, and DL systems consistently outperformed non-DL state-of-the-art (SOTA) systems by large margins for different tasks.
Section 3 provides ample evidence for this claim and features the new SOTA results with a deeper look at IE, a major application class of medical NLP (for alternative surveys, cf. 10 11 12). Despite the specialized hardware now at our disposal, training DNNs still requires tremendous computational resources and processing time. Luckily, for general NLP, large collections of language models (so-called embeddings) have already been trained on huge corpora (comprising hundreds of millions of Web-scraped documents, including newspaper and Wikipedia articles) so that these pre-compiled model resources can be readily reused when dealing with general-purpose language. But medical (and biological) language mirrors special-purpose language characteristics and comprises a large variety of sublanguages of its own. This becomes obvious in Section 3 where we deal with scholarly scientific writing (with documents typically taken from PubMed). Here, differences from general language are mostly due to the use of a highly specialized technical vocabulary (covered by numerous terminologies, such as MeSH, SNOMED-CT, or ICD). Even more challenging are clinical notes and reports (with documents often taken from the MIMIC (Medical Information Mart for Intensive Care) clinical database) which typically exhibit syntactically ill-formed, telegraphic language with lots of acronyms and abbreviations as an additional layer of complexity (cf. the seminal descriptive work distinguishing both these sublanguage types by Friedman et al. 13). Newman-Griffis and Fosler-Lussier 14 investigated different sublanguage patterns for the many varieties of clinical reports (pathology reports, discharge summaries, nurse and Intensive Care Unit notes, etc.), while Nunez and Carenini 15 discussed the portability of embeddings across various fields of medicine reflecting characteristic sublanguage use patterns.
These constraints have motivated the medical NLP community to adapt embeddings originally trained on general language to the medical language. Table 1 lists those medically informed embeddings, many of which are the basis for the IE applications discussed in Section 3.

Table 1

An Overview of Common Embeddings—Biomedical Language Models

Our survey emphasizes the fundamental methodological paradigm shift of current NLP research from symbolic to distributed representations as the basis of DL. It thus complements earlier contributions to the International Medical Informatics Association (IMIA) Yearbook of Medical Informatics which focused exclusively on the role of social media documents 23, had a balanced view on the relevance of both Electronic Health Records (EHRs) and social media posts 24, or dealt with the importance of shared tasks for the progress in medical NLP 25. The last two Yearbook surveys of the NLP section most closely related to medical IE were published in 2015 26 and 2008 27. The survey by Velupillai et al. 28 dealt with opportunities and challenges of medical NLP for health outcomes research, with particular emphasis on evaluation criteria and protocols. We also refer readers to alternative surveys of DL as applied to medical and clinical tasks. Wu et al. 29 reviewed the literature on DL for a broader view of clinical NLP, whereas Xiao et al. 30 and Shickel et al. 31 performed systematic reviews on the applications of DL to several kinds of EHR data, not only text. Miotto et al. 32 and Esteva et al. 33 further extended that scope to include clinical imaging and genomic data beyond the scope of classical EHRs. From an even broader perspective of the huge amounts of biomedical data, Ching et al. 34 examined various applications of DL to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discussed the unique challenges that biomedical data pose for DL methods. In the same vein, Rajkomar et al. 35 used the entire EHR, including clinical free-text notes, for clinical predictive modeling based on DL (targeted, e.g., at the prediction of in-hospital mortality or patients' final discharge diagnoses). They also demonstrated that DL methods outperformed traditional statistical prediction models.

2 Design and Goals of this Survey

In this survey, we concentrated on publications within the time window from 2017 to early 2020 and screened the contributions from three major scientific communities involved in medical IE. Medicine and medical informatics are covered by PubMed. Natural language processing is covered by the ACL Anthology, the digital library of the Association for Computational Linguistics. Neural networks are covered by the major conference series of the neural network community (Neural Information Processing Systems (NIPS/NeurIPS)), whereas the artificial intelligence community is covered by the Association for the Advancement of Artificial Intelligence (AAAI) Digital Library, which keeps the records from the AAAI and IJCAI conferences. We also included health-related publications from the digital libraries of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE). When necessary, we also referred to e-preprint archives such as arXiv.org, since they have become a new, increasingly important distribution channel for the most recent research results in computer science (yet, in that state, typically without peer review) and thus foreshadow future directions of research. We searched these literature repositories with a free-text query that can be approximated as follows:

(information extraction OR text mining OR named entity recognition OR relation extraction) AND (deep learning OR neural network) AND (medic* OR clinic* OR health)

For this setting, we found approximately 1,000 unique publications, screened them for relevance, and, finally, included roughly 100 in this survey.
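
To make the boolean structure of the approximate query concrete, the following minimal Python sketch applies it to a title or abstract string. It is an illustration only: the survey does not describe how each repository interpreted the query, so the matching rule used here (every term is treated as a case-insensitive prefix match at a word start) and all function names are assumptions.

```python
import re

# Term groups taken from the approximate query quoted in the text.
TASK_TERMS = ["information extraction", "text mining",
              "named entity recognition", "relation extraction"]
METHOD_TERMS = ["deep learning", "neural network"]
DOMAIN_TERMS = ["medic*", "clinic*", "health"]


def term_to_regex(term):
    """Compile a query term into a case-insensitive regex.

    Assumption: every term is matched as a prefix at a word start, so
    "medic*" matches "medical" and "neural network" also matches
    "neural networks". Real search engines may behave differently.
    """
    prefix = term.rstrip("*")
    return re.compile(r"\b" + re.escape(prefix), re.IGNORECASE)


def matches_any(text, terms):
    """True if at least one term of an OR-group occurs in the text."""
    return any(term_to_regex(t).search(text) for t in terms)


def matches_query(text):
    """True if the text satisfies (task) AND (method) AND (domain)."""
    return (matches_any(text, TASK_TERMS)
            and matches_any(text, METHOD_TERMS)
            and matches_any(text, DOMAIN_TERMS))
```

A hypothetical title such as "Deep learning for named entity recognition in clinical notes" satisfies all three groups, whereas a title that mentions relation extraction on medical abstracts but no neural method is rejected by the second group.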

3 Deep Neural Networks for Medical Information Extraction

In this section, we introduce applications of DNNs to medical NLP for two different tasks, Named Entity Recognition (NER) and Relation Extraction (REX). Our discussion focuses on studies dealing with English as the reference language, since the vast majority of benchmark and reference data sets are in English. After a brief description of each task, we summarize the current SOTA in tables which generalize often subtle distinctions in experimental design and workflows. Our main goal is to show the diversity of major benchmark datasets, DL approaches, and embeddings being used. For these tables, we extracted all symbolic (e.g., corpus or DL approach) and numerical information (e.g., about annotation metadata, performance scores) directly from the cited papers. The assessment of different systems for the same task is centered around their performance on gold data in evaluation experiments. We refrain from highlighting minor differences in the reported scores because of different datasets being used for evaluation, changing volumes of metadata, and sometimes even the genres they contain. Hence, from a strict methodological perspective, the reported results have to be interpreted with utmost caution for two main reasons 37. First, the choice of pre-processing steps (tokenization, inclusion/exclusion of punctuation marks, stop word removal, morphological normalization/lemmatization/stemming, n-gram variability, entity blinding strategies) and, second, the calibration of training methods (split bias, pooling techniques, hyperparameter selection (dropout rate, window size, etc.)) have a strong impact on the way a chosen embedding type and DL model finally performs, even within the same experimental setting. However, the data we report give valuable comparative information on the SOTA, though with fuzzy edges.
This situation might be remedied by a recently proposed common evaluation framework for biomedical NLP, the BLUE (Biomedical Language Understanding Evaluation) benchmark 22, which consists of five different biomedical NLP tasks (including NER and REX) with ten corpora (including BC5CDR, DDI, and i2b2 that also occur in the tables below), or the one proposed by Chauhan et al. 37, enabling a more lucid comparison of various training methodologies, pre-processing, modeling techniques, and evaluation metrics. For the tables provided in the next subsections, we used the F1 score as the main ordering criterion for the cited studies (from highest to lowest). We usually had to select among a large variety of experimental conditions (with different scores). The final choices we made were guided by the criterion of favoring comparability among all studies. This means that higher (and lower) outcomes may have been reported in the cited studies for varying experimental conditions. Still, the top-ranked system(s) in each of the following tables defines the current SOTA for a particular application.
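
Since the F1 score serves as the ordering criterion throughout, a short worked example may help. The sketch below computes precision, recall, and F1 over sets of entity annotations. It assumes exact matching of span boundaries and types; the cited studies may use other matching criteria (for instance relaxed or partial matching), and the example annotations are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores under exact matching (an assumption).

    gold, predicted: iterables of hashable annotations, here tuples of
    (start_token, end_token_exclusive, entity_type).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy annotations: three gold entities, four predictions.
gold = {(0, 2, "Disease"), (5, 6, "Drug"), (9, 10, "Disease")}
predicted = {(0, 2, "Disease"), (5, 6, "Disease"),
             (9, 10, "Disease"), (12, 13, "Drug")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```

Two of the four predictions are correct and two of the three gold entities are found, so F1 = 2 * 0.5 * (2/3) / (0.5 + 2/3) = 4/7. Note that the prediction with the right span but the wrong type counts as both a false positive and a false negative under exact matching.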

3.1 Named Entity Recognition

The task of Named Entity Recognition (NER) is to identify crucial medical named entities (i.e., spans of concrete mentions of semantic types such as diseases or drugs and their attributes) in running text. For a recent survey of DL-based approaches and architectures underlying NER as a generic NLP application, see 38.
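
NER is commonly cast as token-level sequence labeling, from which entity spans are then recovered. The sketch below illustrates this with the widely used BIO tagging scheme. The scheme is an assumption for illustration, since the text above does not prescribe a particular encoding, and the example sentence is invented (only the mention "Diabetes II" is borrowed from the survey's own examples).

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end_exclusive, type) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the types agree.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Tolerate an I- tag without a preceding B- by opening a span.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ["Patient", "with", "Diabetes", "II", "was", "given", "metformin", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "O", "O", "B-Drug", "O"]

for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
# Disease Diabetes II
# Drug metformin
```

Spans in this form can be compared directly against gold annotations when computing entity-level scores.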

3.1.1 Diseases

A primary target of NER in the medical field is the automatic identification of diseases in scientific articles and clinical reports. For instance, textual occurrences of disease mentions (e.g., “Diabetes II” or “cerebral inflammation”) are mapped to a common semantic type, Disease. The crucial role of recognizing diseases in medical discourse is also emphasized by a number of surveys dealing with the recognition of special diseases. For instance, Sheikhalishahi et al. 40 discussed NLP methods targeted at chronic diseases and found that shallow ML and rule-based approaches (as opposed to more sophisticated DL-based ones) prevail. Koleck et al. 41 summarized the use of NLP to analyze symptom information documented in EHR free-text narratives as an indication of diseases and, similar to the previous survey, found little coverage of DL methods in this application area. Savova et al. 42 reviewed the current state of clinical NLP with respect to oncology and cancer phenotyping from EHRs. Datta et al. 43 focused on an even more specialized use case: the lexical representation required for the extraction of cancer information from EHR notes in a frame-semantic format. The research summarized in Table 2 is strictly focused on Disease recognition and, for reasons of comparability, based on the use of shared data sets and metadata (gold annotations). Two benchmarks are prominently featured, BC5CDR 44 and NCBI 45. BC5CDR is a corpus made of 1,500 PubMed articles, with 4,409 annotated chemicals, 5,818 diseases, and 3,116 chemical-disease interactions, created for the BioCreative V Chemical and Disease Mention Recognition Task 44. As an alternative, the NCBI Disease Dataset 45 consists of a collection of 793 PubMed abstracts annotated with 6,892 disease mentions which are mapped to 790 unique disease concepts (thus, this corpus can also be used for grounding experiments).

Table 2

Medical Named Entity Recognition: Diseases. Benchmark Datasets from BC5CDR 44 and NCBI 45 .

The current top performance for Disease recognition comes close to 90% F1. Lee et al. 20 use a Transformer model with in-domain training (BioBERT), but (attention-based) BiLSTMs also perform strongly, in the range of 88–89% F1 score. For the choice of embeddings being used, self-trained ones might be a better choice than pre-trained ones, e.g., those provided by bio.nlplab.org 16. The incorporation of (large) dictionaries does not provide a competitive advantage in the experiments reported here. Though multi-task learning and transfer learning seem reasonable choices (39 46 and 47, respectively) to combat the sparsity of datasets, they generally do not boost systems to the top ranks. Interesting, though, are differences for the same approach on different evaluation data sets. For the second-best system by Sachan et al. 47, F1 scores differ for BC5CDR and NCBI by 2.0 (for the third-best 46 by 2.7) percentage points, whereas for the best non-DL approach by Lou et al. 48, this difference amounts to a remarkable 4.1 percentage points. This hints at a strong dependence of the results of the same system set-up on the specific corpus on which these results were obtained and, thus, limits generalizability. On the other hand, corpora obviously cannot be blamed for intrinsic analytical hardness since cross-rankings occur: the system by Lee et al. 20 gets the overall highest F1 score for NCBI but underperforms for BC5CDR, whereas for the tagger used by Sachan et al. 47 the ranking is reversed, as their system performs better on BC5CDR than on NCBI (differences are in the range of 2 percentage points). The most stable system in this respect is the one by Zhao et al. 39. Finally, the distance between the best- and second-best-performing DL systems (20 and 47, respectively) and their best non-DL counterpart 48 amounts to 7.6 percentage points (for NCBI) and 3.1 percentage points (for BC5CDR), respectively.

3.1.2 Medication

The second major medical named entity type we discuss here is related to medication information. NER complexity is increased for this task since it is split into several subtasks, including the recognition of drug names (Drug), frequency (Dr-Freq) and (manner or) route of drug administration (Dr-Route), dosage (Dr-Dose), duration of administration (Dr-Dur), and adverse drug events (Dr-ADE). These subtypes are highly relevant in the context of medication information and are backed up by an international standard, the HL7 Fast Healthcare Interoperability Resources (FHIR). Tables 3 and 4 provide an overview of the SOTA on this topic.

Table 3

Medical Named Entity Recognition: Drugs. Benchmark Datasets: n2c2 56 ; i2b2 2009 57 ; MADE 1.0 59 ; DDI 60 .

Table 4

Medical Named Entity Recognition: Medication Attributes. Benchmark Datasets: n2c2 56 ; i2b2 2009 57 ; MADE 1.0 59 ; DDI 60 .

For medication information, four gold standards had a great impact on the field in the past years. The most recent one came out of the 2018 n2c2 Shared Task on Adverse Drug Events and Medication Extraction in Electronic Health Records 56, a successor of the 2009 i2b2 Medication Challenge 57, now with a focus on Adverse Drug Events (ADEs). It includes 505 discharge summaries (303 in the training set and 202 in the test set), which originate from the MIMIC-III clinical care database 58. The corpus contains nine types of clinical concepts (including drug name), eight attributes (reason, ADE, frequency, strength, duration, route, form, and dosage; from these we chose five for comparison), and 83,869 concept annotations. Relations between drugs and the eight attributes were also annotated and summed up to 59,810 relation annotations (see Section 3.2.1). The third corpus, MADE 1.0 59, formed the basis for the 2018 Challenge for Extracting Medication, Indication, and Adverse Drug Events (ADEs) from Electronic Health Record (EHR) Notes and consists of 1,092 de-identified EHR notes from 21 cancer patients. Each note was annotated with medication information (drug name, dosage, route, frequency, duration), ADEs, indication (symptom as reason for drug administration), other signs and symptoms, severity (of disease/symptom), and relations among those entities, resulting in 79,000 mention annotations. Finally, the DDI corpus 60, originally developed for the Drug-Drug Interaction (DDI) Extraction 2013 Challenge 61, is composed of 792 texts selected from the (semi-structured) DrugBank database and another 233 (unstructured) MEDLINE abstracts, summing up to 1,025 documents. This fine-grained corpus has been annotated with a total of 18,502 pharmacological substances and 5,028 drug-drug interactions.
Hence, the medication NER task not only comes with a higher entity type complexity but also with text genres different from the disease recognition task: while the former puts emphasis on clinical reports, the latter focuses on scholarly writing. Except for route and ADE, all top scores for NER were achieved on the n2c2 corpus. For drug names, the current SOTA, established by Wei et al. 62, exceeds 95% F1. As to the subtypes, their system also compares favorably to alternative architectures, with F1 margins ranging from 8.6 percentage points (for duration) down to 1.0 (for drug name). For route, the distance to the best system is marginal (around 1 percentage point), whereas for ADE it is huge (more than 10 percentage points, a strong outlier). Overall, frequency, route, and dosage recognition reach outstanding F1 scores in the range of 95 to 97%, while for duration information top F1 scores drop remarkably, by at least 10 to 20 percentage points. Still, the recognition of ADEs seems to be the hardest task, with the best system by Wunnava et al. 67 peaking at around 64% F1 on MADE 1.0 data (here the top-performing system by Wei et al. 62 plummets to 53% F1). Interestingly, ADEs are verbally the least constrained type of natural language utterance compared with all the other entity types considered here. In terms of DL methodology, BiLSTM-CRFs are the dominating approach. Yet, the type of embeddings used by different DL systems varies a lot, ranging from pre-trained Word2vec embeddings and those self-trained on MIMIC-III (for the top performers) to GloVe embeddings pre-trained on CommonCrawl, Wikipedia, EHR notes, and PubMed.
There seems to be no generalizable winner for either choice of embeddings given the current state of evaluations, but self-training on medical raw data, such as MIMIC-III or challenge data sets, or, more advisably, using the now available BioSentVec 18 and BlueBERT 22 embeddings pre-trained on MIMIC-III, might be advantageous. Studies in which the same system configuration was tested on different corpora are still lacking so that corpus effects are unknown (unlike for diseases; see Table 2). Yet, there is one interesting though not so surprising observation: Unanue et al. 65 explored the two slices of the DDI corpus, with a span of F1 scores of more than 16 percentage points. This obviously reflects the influence of a priori (lack of) structure: DrugBank data is considerably more structured than MEDLINE free texts and, thus, the former gets much higher scores than the latter. Comparing DL approaches vs. non-DL ones (a CRF architecture) on the same corpus (MADE 1.0), we found that for the core entity type (Drug), the recognition performance differs by almost 3 percentage points, for frequency, route, and dose marginally by less than 1, yet for duration and ADE it amounts to roughly 5 and 12 percentage points, respectively, consistently in favor of Deep Neural Networks (DNNs).

3.2 Relation Extraction

Once named entities have been identified, a follow-up question emerges: does some sort of semantic relation hold among these entities? We surveyed this Relation Extraction (REX) task with reference to results that have been achieved for information related to medication attributes and drug-drug interaction.

3.2.1 Medication-Attribute Relations

In Section 3.1.2, we already dealt with single named entity types typically associated with medication information, namely drug names and administration frequency, duration, dosage, route, and ADE, yet in isolated form only. In this subsection, we are concerned with making the close associative ties between Drugs and typical conceptual attributes, such as Frequency, Duration, Dosage, Route, ADE, and Reason (for prescription), explicit. Hence, the recognition of the respective named entity types (Drugs, Dr-Freq, Dr-Dur, Dr-Dose, Dr-Route, Dr-ADE, and Dr-Reason) turns out to be a good starting point for solving this REX task. Not surprisingly, the benchmarks for this task are a subset of the ones in Tables 3 and 4 depicting the results for medication-related NER. Table 5 provides an overview of the experimental results for finding medication-attribute relations in medical, in effect, clinical, documents.
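
One common way to build on recognized entities for this kind of task is to enumerate candidate pairs of a drug mention and an attribute mention and to pass each pair to a relation classifier. The sketch below shows only the candidate generation step. It is an assumption made for illustration, not the pipeline of any cited system: the distance threshold, the function name, and the toy entities are hypothetical, while the entity type labels follow the ones named above.

```python
from itertools import product

# Attribute entity types as named in the text.
ATTRIBUTE_TYPES = {"Dr-Freq", "Dr-Dur", "Dr-Dose",
                   "Dr-Route", "Dr-ADE", "Dr-Reason"}


def candidate_pairs(entities, max_token_distance=10):
    """Pair each Drug span with each nearby attribute span.

    entities: list of (start_token, end_token_exclusive, entity_type).
    max_token_distance: hypothetical cut-off on the token gap between
    the two spans; candidates further apart are discarded.
    """
    drugs = [e for e in entities if e[2] == "Drug"]
    attributes = [e for e in entities if e[2] in ATTRIBUTE_TYPES]
    pairs = []
    for drug, attribute in product(drugs, attributes):
        # Gap between two non-overlapping spans: start of the later
        # span minus end of the earlier span.
        gap = max(drug[0], attribute[0]) - min(drug[1], attribute[1])
        if gap <= max_token_distance:
            pairs.append((drug, attribute))
    return pairs


# Invented toy entities: two drug mentions with nearby attributes.
entities = [(0, 1, "Drug"), (1, 3, "Dr-Dose"), (3, 4, "Dr-Route"),
            (20, 21, "Drug"), (22, 23, "Dr-Freq")]

for drug, attribute in candidate_pairs(entities, max_token_distance=5):
    print(drug, "->", attribute)
```

With a cut-off of 5 tokens, the first drug is paired with its dose and route and the second drug with its frequency, while the three cross pairs are dropped. A classifier would then decide for each remaining pair whether the relation actually holds.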

Table 5

Medical Relation Extraction: Medication-Attribute Relations (including ADEs). Benchmark Datasets: n2c2 56 ; MADE 1.0 59 .

The overall results from medication-focused NER are mostly confirmed for the REX task. The n2c2 corpus is the reference dataset for top performance. The group who achieved top F1 scores for the medication NER problem also performed best for the medication-attribute REX task 62, with extraordinary figures for Frequency, Route, and Dosage relations (in the upper 98% F1 range), a superior one for the Duration relation (93% F1), and good ones on the (hard to deal with) Adverse and Reason relations (85% F1). Still, the distances to the second-best system for the same corpus (n2c2) are not so pronounced in most cases, amounting to about 1 percentage point (for Frequency, Route, Dosage, and Duration), yet increasing to 3 (for Adverse) and 7 (for Reason) percentage points. For the MADE 1.0 corpus, a similar picture emerges. From a lower offset (typically around 3 F1 percentage points compared with n2c2), differences between the best and second-best systems were on the order of a (negligible) 1 percentage point for Frequency, Route, and Dosage, yet increased to roughly 3, 5, and 7 percentage points for Reason, Duration, and Adverse events, respectively. Yet, in 4 out of 6 cases (Frequency, Dosage, Duration, and Adverse events) non-DL systems (CRFs, SVMs) outperformed their DL counterparts, with small margins (again in the range of a negligible 1 percentage point) for Frequency and Dosage, yet with higher ones for Duration and Adverse events (5 and 7 percentage points, respectively). In cases where the DL approach ranked higher than a non-DL one, differences ranged between 1 and 3 percentage points (for Route and Reason, respectively). Thus, the MADE 1.0 corpus constitutes a benchmark where well-engineered standard ML classifiers can still play a competitive role. However, we did not find this pattern of partial supremacy of non-DL approaches for the n2c2 benchmark.
The top performers for the medication attribute REX task 62 employed a joint learning approach based on CNN-RNN (thus diverging from the most successful architectures for medication NER; see Tables 3 and 4) and rule-based post-processing that outperformed a simple CNN-RNN. Summarizing, the CNN-RNN approach seems more favorable than an (attention-based) BiLSTM, with preferences for self-trained in-domain embeddings.

3.2.2 Drug-Drug Interaction

The second type of medication-focused relation we consider here is drug-drug interaction as featured in the DDI challenge (for surveys on the impact of DL on recent research on drug-drug interactions, cf. 82 83; for a survey on drug-drug interaction combining progress in data and text mining from EHRs, scientific papers, and clinical reports but lacking in-depth coverage of DL methods, cf. 84; for the NLP-focused recognition of ADEs, also lacking awareness of DL contributions to this topic, cf. 85). Four main types of relations between drugs are considered: pharmacokinetic Mechanism, drug Effect, recommendation or Advice regarding a drug interaction, and Interaction between drugs without providing any additional information. Overall, the DDI corpus on which these evaluations were run is divided into 730 documents taken from DrugBank and 175 abstracts from MEDLINE and contains 4,999 relation annotations (4,020 train, 979 test). Recognition rates for these relations (cf. Table 6) are considerably lower than for the medication-related attributes when linked to drugs (cf. Table 5). The best systems peak at 85% F1 score for Advice (a distance of more than 13 percentage points to the top recognition results for medication attributes), slip to 78% and 77% for Mechanism and Effect, respectively, and plummet to 59% for Interaction. Differences between the first- and second-ranked systems are typically small, yet become larger on subsequent ranks (roughly 3 to 4 percentage points relative to the top-ranked system). As with medication attributes, drug-drug interactions can also be recognized in a competitive way by CNN-RNN architectures, but attention-based LSTMs also perform considerably well. Again, self-trained embeddings using in-domain corpora seem to be advantageous for this relation class.
Reflecting the drop in performance, one may conclude that drug-drug interactions constitute a markedly harder task than the conceptually much closer medication-attribute relations.

Table 6

Medical Relation Extraction: Drug-Drug Interaction. Benchmark Dataset: DDI 60 .

Finally, Table 6 most drastically supports our claim that DL approaches outperform non-DL ones. The difference between both approaches amounts to 5 percentage points for Mechanism, 7 for Effect and Interaction, and 8 for Advice.

4 Conclusions

We have presented various forms of empirical evidence that (with one exception only) Deep Learning-based neural networks outperformed non-DL, feature-engineered approaches for several information extraction tasks. However, despite their success, Deep Neural Networks and their embedding models have their shortcomings as well. One of the most problematic issues is their dependence on huge amounts of training data: SOTA embedding models are currently trained on hundreds of billions of tokens 89. This magnitude of data volume is out of reach for any training effort in the medical/clinical domain 90. Also, embeddings are very vulnerable to malicious attacks or adversarial examples: small changes at the input level may result in severe misclassification 5. Another well-known problem relates to the instability of word embeddings. Word embeddings depend on their random initialization and the processing order of the underlying examples, and therefore they do not necessarily converge on exactly the same embeddings even after several thousands of training iterations 91 92. Finally, although DL is celebrated for not requiring manual feature engineering, the effects of proper hyperparameter tuning on DNNs 93 remain an issue for DL 94. Apart from these intrinsic problems, Kalyan and Sangeetha 95 and Khattak et al. 96 refer to extrinsic drawbacks of neural networks, such as opaque encodings (resulting in lacking interpretability) or limited transferability of large models (hindering knowledge distillation for smaller models). Still, the sparsity of corpora and special linguistic phenomena of the medical (clinical) sublanguage(s) create intrinsic problems for data-greedy DL approaches that have to be overcome by special learning strategies for neural systems, such as transfer learning or domain adaptation. Research on adapting general language models to medical language constraints is only at its beginning. Yet, there is no simple solution to this problem.
Wang et al. 97 evaluated Word2vec embeddings trained on private clinical notes, PMC, Wikipedia, and the Google News corpus both qualitatively and quantitatively and showed that the ones trained on Electronic Health Record data performed better on most of the tested scenarios. However, they also found that word embeddings trained on biomedical domain corpora do not necessarily have better performance than those trained on general domain corpora for any downstream biomedical NLP task (other experimental evidence of the effects of in- and out-of-domain corpora and further parameters, such as corpus size, on word embedding performance is reported by Lai et al. 98). While this survey focused on the application domain of medical IE to demonstrate the outstanding role of DL for medical Natural Language Processing, one might be tempted to generalize this trend to other applications as well. There is, indeed, plenty of evidence in the literature that other application fields, such as question answering (and the closely related area of machine reading), summarization, machine translation, and speech processing, reveal the same pattern. However, for text categorization (in the sense of mapping free text to some pre-defined medical category system, such as ICD, SNOMED, or MeSH) this preference is less obvious, since traditional Machine Learning or rule-based models still play an important role here and, more often than for the IE application scenario, show competitive performance against DL approaches. Whether this exception will persist or will be swept away by future research remains an open issue.

59 in total

Review 1. Deep learning in neural networks: an overview.

Authors: Jürgen Schmidhuber
Journal: Neural Netw Date: 2014-10-13

2. Adverse Drug Events Detection in Clinical Notes by Jointly Modeling Entities and Relations Using Neural Networks.

Authors: Bharath Dandala; Venkata Joopudi; Murthy Devarakonda
Journal: Drug Saf Date: 2019-01 Impact factor: 5.606

3. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review.

Authors: Theresa A Koleck; Caitlin Dreisbach; Philip E Bourne; Suzanne Bakken
Journal: J Am Med Inform Assoc Date: 2019-04-01 Impact factor: 4.497

Review 4. Clinical information extraction applications: A literature review.

Authors: Yanshan Wang; Liwei Wang; Majid Rastegar-Mojarad; Sungrim Moon; Feichen Shen; Naveed Afzal; Sijia Liu; Yuqun Zeng; Saeed Mehrabi; Sunghwan Sohn; Hongfang Liu
Journal: J Biomed Inform Date: 2017-11-21 Impact factor: 6.317

Review 5. A guide to deep learning in healthcare.

Authors: Andre Esteva; Alexandre Robicquet; Bharath Ramsundar; Volodymyr Kuleshov; Mark DePristo; Katherine Chou; Claire Cui; Greg Corrado; Sebastian Thrun; Jeff Dean
Journal: Nat Med Date: 2019-01-07 Impact factor: 53.440

6. BioCreative V CDR task corpus: a resource for chemical disease relation extraction.

Authors: Jiao Li; Yueping Sun; Robin J Johnson; Daniela Sciaky; Chih-Hsuan Wei; Robert Leaman; Allan Peter Davis; Carolyn J Mattingly; Thomas C Wiegers; Zhiyong Lu
Journal: Database (Oxford) Date: 2016-05-09 Impact factor: 3.451

7. BioWordVec, improving biomedical word embeddings with subword information and MeSH.

Authors: Yijia Zhang; Qingyu Chen; Zhihao Yang; Hongfei Lin; Zhiyong Lu
Journal: Sci Data Date: 2019-05-10 Impact factor: 6.444

8. Recent Advances in Using Natural Language Processing to Address Public Health Research Questions Using Social Media and Consumer-Generated Data.

Authors: Mike Conway; Mengke Hu; Wendy W Chapman
Journal: Yearb Med Inform Date: 2019-08-16

9. Drug drug interaction extraction from the literature using a recursive neural network.

Authors: Sangrak Lim; Kyubum Lee; Jaewoo Kang
Journal: PLoS One Date: 2018-01-26 Impact factor: 3.240

10. An attention-based effective neural model for drug-drug interactions extraction.

Authors: Wei Zheng; Hongfei Lin; Ling Luo; Zhehuan Zhao; Zhengguang Li; Yijia Zhang; Zhihao Yang; Jian Wang
Journal: BMC Bioinformatics Date: 2017-10-10 Impact factor: 3.169
