
Cross-lingual Unified Medical Language System entity linking in online health communities.

Yonatan Bitton¹, Raphael Cohen¹, Tamar Schifter², Eitan Bachmat¹, Michael Elhadad¹, Noémie Elhadad³.

Abstract

OBJECTIVE: In Hebrew online health communities, participants commonly write medical terms that appear as transliterated forms of a source term in English. Such transliterations introduce high variability in text and challenge text-analytics methods. To reduce their variability, medical terms must be normalized, such as linking them to Unified Medical Language System (UMLS) concepts. We present a method to identify both transliterated and translated Hebrew medical terms and link them with UMLS entities.
MATERIALS AND METHODS: We investigate the effect of linking terms in Camoni, a popular Israeli online health community in Hebrew. Our method, MDTEL (Medical Deep Transliteration Entity Linking), includes (1) an attention-based recurrent neural network encoder-decoder to transliterate words and map UMLS terms from English to Hebrew, (2) an unsupervised method for creating a transliteration dataset in any language without manually labeled data, and (3) an efficient way to identify and link medical entities in the Hebrew corpus to UMLS concepts, by producing a high-recall list of candidate medical terms in the corpus and then filtering the candidates to relevant medical terms.
RESULTS: We carry out experiments on 3 disease-specific communities: diabetes, multiple sclerosis, and depression. MDTEL tagging and normalizing on Camoni posts achieved 99% accuracy, 92% recall, and 87% precision. When tagging and normalizing terms in queries from the Camoni search logs, UMLS-normalized queries improved search results in 46% of the cases.
CONCLUSIONS: Cross-lingual UMLS entity linking from Hebrew is possible and improves search performance across communities. Annotated datasets, annotation guidelines, and code are made available online (https://github.com/yonatanbitton/mdtel).


Keywords: UMLS; natural language processing; online health communities


Year: 2020 PMID: 32910823 PMCID: PMC7566404 DOI: 10.1093/jamia/ocaa150

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

INTRODUCTION

Online health communities (OHCs) have become increasingly relevant as a source of real-world evidence for researchers and informational support for patients and their caregivers. The content of OHC posts is challenging to process automatically, but it is a critical step to enable downstream applications to advance biomedical knowledge, to support community moderators, and to support patients in finding and making sense of the rich information contained in OHC posts. Named-entity recognition has been investigated in English OHCs and social media. In languages other than English, a set of additional challenges compounds natural language processing of health texts, including the fact that most languages do not have a version of the Unified Medical Language System (UMLS) available to them to normalize concepts. In this article, we focus on the challenge of cross-lingual entity linking (XEL), which consists of mapping entity mentions in text of a source language to entities in a knowledge base, specifically UMLS, in a target language. We ground our work on a popular Hebrew-speaking OHC, Camoni. A salient aspect of Hebrew health texts is the heavy use of transliteration of medical terms. Transliteration consists of writing a word from a source language (eg, English) in a different target script (eg, Hebrew alphabet) using the closest corresponding letters. Transliterations are thus phonetically similar to the source word. We performed an empirical analysis of the way medical terms are mentioned in the Camoni communities. We find that most of the medical terms are transliterated and, furthermore, that the transliterations include unreliable abbreviations and misspellings. Different people transliterate the same word in different forms, which creates variability.
As discussed in Goldberg and Elhadad, when foreign words are borrowed in languages with similar alphabets and sound systems, it is easy to map them back to the source terms, as the words are written in similar manners (usually, only small suffix variants are introduced). Words borrowed into languages with different writing and sound systems (eg, English words in Japanese, Hebrew, and Arabic texts) are more challenging. In addition, automatic translation does not cover most of the observed mentions of medical terms. In our test, automatic Hebrew-English translation in the medical domain fails to translate over 38% of the medical terms (Supplementary Appendix S1). The high prevalence of transliteration of medical terms leads to high variability, which is a challenge for information retrieval and data mining. Camoni, like many OHCs, relies on the Google Custom Search application programming interface (API) as its search engine. Many user queries mention medical terms, which are very likely to include noisy transliterations. For example, a Hebrew query with a spelling variant such as “How do I know I have fibromyalegia?” does not return any results when “fibromyalgia” is transliterated. In this work, we empirically assess the prevalence of this issue, find that in the case of Hebrew medical terms it is related to the variability introduced by transliteration, and introduce a method based on UMLS normalization of medical terms in the patient-authored content. Our key contributions are:
(1) An unsupervised method for creating a transliteration dataset. The dataset is used for training a state-of-the-art transliteration model.
(2) An efficient method to match the identified transliterated entities to their UMLS concept unique identifier (CUI).
(3) A contextual relevance model developed to detect medical term mentions in each disease-specific community.
Our method is applicable to other languages with different scripts and we document steps to reproduce both intrinsic and extrinsic evaluation methods for XEL applied to UMLS.

Background and Significance

Entity linking is the task of linking spans in text to concepts in a knowledge base such as the UMLS. It is particularly useful in the clinical and health domains, where documents are written in free text and frequently refer to biomedical concepts. XEL is needed when a document is in a source language different from the language of the labels used in the target knowledge base. About 71% of the concept names in UMLS are labeled in English. Other languages occur much less often: for Spanish, ∼10.2%, and for French, ∼2.85%; for Hebrew, only 485 terms occur in the UMLS. When the source and target languages operate over different alphabets and sound systems, both translation and transliteration of terms must be handled. Several CLEF eHealth challenges (2015-2019) have focused on named-entity recognition in English and French in biomedical articles, with applications to multilingual information extraction from health reports. The 2017-2018 tasks explore automatic assignment of International Classification of Diseases–Tenth Revision codes to health-related documents in English, French, Hungarian, Italian, and German. The 2015-2016 tasks focus on information retrieval in biomedical domains. Xu et al introduce a model to identify cross-lingual candidates for concept normalization using a character-based neural translation model trained on a multilingual biomedical terminology. They use UMLS data in Spanish, French, Dutch, and German. Our approach is similar and extends this work on the following dimensions: (1) we focus on OHCs with noisy text as opposed to scientific articles; (2) we study the case of transliteration in Hebrew, which uses a different alphabet and sound system than English; and (3) we introduce an unsupervised method to create a transliteration dataset, which can be used across languages with varying scripts and without existing resources.
A tagger that detects entities in Arabic medical documents tags terms with semantic classes such as medical problem, test, and treatment, a task introduced in the 2010 i2b2 challenge. Data from Wikipedia, DBpedia, and SNOMED-CT (Systematized Nomenclature of Medicine Clinical Terms) are leveraged, and a binary classifier for each semantic category is trained. Our work differs, as we perform a coarser form of detection for medical terms in general, without classifying their type, but aim for high recall in an unsupervised manner. In our intrinsic evaluation, we reproduce a similar task for Hebrew for the UMLS semantic groups of Disorders and Chemicals and Drugs. Previous work has addressed the task of detecting transliterations and mapping them back to source terms. The NEWS 2018 Shared Task on Transliteration of Named Entities focused on identifying transliterations in multiple languages in the news domain. In our work, we build on a neural machine translation model for named-entity transliteration developed for this task. The model uses a deep attentional recurrent neural network (RNN) encoder-decoder model. This architecture was ranked first in several tracks at the NEWS 2018 Shared Task when applied to Hebrew.

MATERIALS AND METHODS

Our approach to cross-lingual UMLS entity linking is called Medical Deep Transliteration Entity Linking (MDTEL). Given a community post document in Hebrew (w1, w2, …, wn), we want to tag the spans (s1, …, sk) that contain medical terms and link them with a UMLS CUI (Figure 1). Figure 2 shows the overall structure of the processing pipeline.

Figure 1.

This forum post contains 26 words and 6 spans that link to 5 different unique identifiers of Unified Medical Language System medical terms. Notice that a span can contain more than 1 word (like the term “multiple sclerosis”), and a single Unified Medical Language System concept unique identifier can be referenced from several places in the same post.

Figure 2.

Cross-lingual entity linking processing pipeline: offline, a filtered subset of Unified Medical Language System (UMLS) is transliterated and translated into Hebrew, producing pairs; a post from the online health community is passed to the high-recall matcher, which searches for matches by intersecting n-grams of the post's text with the Hebrew medical terms in the pairs, producing a list of candidate concept unique identifiers (CUIs). The contextual relevance model uses language model features, the UMLS Relatedness package, and additional features to filter relevant medical terms in the context of the post. STR: String.

The pipeline consists of the following components:
(1) In an offline process, we prepare a forward transliteration model to transliterate the UMLS into Hebrew. In addition to the transliterated UMLS terms, we include translations for key words that are related to the OHC main topic, selected from a manually collected medical glossary. The result of this module is a list of (UMLS term, Hebrew term) pairs. That is, given a term such as “diabetes,” we generate multiple candidate Hebrew terms such as a highly likely transliteration “di-a-bi-tis” and its translation “sa-ke-re-t.”
(2) A matcher that matches spans from the post to the transliterated or translated UMLS Hebrew terms, producing a high-recall list of candidate matches.
(3) A contextual relevance model that filters the high-recall list by detecting matches that are not used in a medical sense.
We perform 2 evaluations to assess cross-lingual entity linking performance: (1) in an intrinsic test, we measure precision, recall, and F1 score of the XEL model on a manually annotated dataset of Hebrew documents with marked UMLS concepts for Disorders and Chemicals and Drugs; and (2) in an extrinsic evaluation, we quantify the impact of UMLS term normalization on Hebrew posts from the OHC on information retrieval quality.

Data

Camoni corpus

The Camoni communities have about 20 000 registered members and 100 000 unique visitors per month. Camoni is organized around 39 disease-specific communities (Supplementary Table S1.2). We extract text from 3 communities (diabetes, sclerosis, and depression), for a total of 55 000 posts and 2.5 million tokens (Supplementary Table S1.3).

Gold-standard annotation of UMLS terms

To test XEL performance, we constructed an annotated dataset in which all mentions of specific UMLS terms are annotated. Annotation guidelines (https://github.com/yonatanbitton/mdtel) accounted for linguistic properties of Hebrew text to handle compound nominal expressions, aggregated prepositions, and conjunctions. We focused this annotation on the most common semantic groups: Disorders and Chemicals and Drugs. The guidelines were refined through 3 rounds of testing with 3 annotators on 50 posts and analysis of the disagreements. Annotation was carried out in the Doccano online annotation tool (https://doccano.herokuapp.com/) (Figure 3). Two fifth-year medical students annotated documents and prepared a set of posts from the 3 OHCs. A total of 802 forum posts were annotated, with 4106 term mentions overall and 1700 unique terms (Supplementary Table S2.1). The Cohen’s kappa interannotator agreement was 0.76, 0.76, and 0.71 for the diabetes, sclerosis, and depression communities, respectively, indicating high agreement. For the intrinsic evaluation, given the annotated mentions, we executed a pretrained model of our method and aligned the Hebrew mentions with UMLS CUIs. One of the annotators then validated the linking as accurate or not.

Figure 3.

Doccano online annotation tool with the Hebrew Unified Medical Language System schema.

General medical term annotation

In addition to the fine-grained gold dataset of UMLS term mentions, we prepared a coarse annotation dataset that tags any mention of a “medical term.” We collected 100 posts from each community and manually tagged relevant medical terms. Statistics of this dataset are shown in Supplementary Table S5.1. In the processing pipeline, we use this small dataset to train a binary relevance classifier for text mentions conditioned on the community.

Building a transliteration model

We first build a transliteration model capable of mapping English medical terms to multiple likely Hebrew transliterations (transliteration is a noisy process even when people do it). We learn a character-based transducer that maps English strings to corresponding Hebrew strings. The model is a neural model with an RNN encoder-decoder architecture. To train such a model, a large dataset of (En, He) term pairs is required. We introduce an unsupervised method to synthesize such a dataset, instead of creating it through manual annotation. We act under the hypothesis that medical terms are usually transliterated. We initialize the data from a dictionary of English medical terms and add some words from medical glossaries we collected from trusted Web sources. We then use Google Translate to translate those words into Hebrew. In some cases, the terms are mapped to transliterations, and in others to Hebrew word translations. This step yielded 106 573 pairs. Because we assume most of the mappings are transliterations, we build on the generalization capability of the RNN architecture to filter out the cases of translations in this collected data.
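The data collection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: `build_pairs` and the `translate` callable are hypothetical names, the translation service is replaced by a small in-memory stand-in, and the rule that drops outputs returned unchanged is an added assumption. The later filtering of translations by the RNN model is not shown.

```python
from typing import Callable, Iterable, List, Tuple


def build_pairs(english_terms: Iterable[str],
                translate: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Pair each distinct English term with the output of a translation function."""
    pairs = []
    seen = set()
    for term in english_terms:
        source = term.strip().lower()
        if not source or source in seen:
            continue  # skip blanks and duplicates
        seen.add(source)
        target = translate(source).strip()
        # Assumption: drop empty outputs and outputs echoed back unchanged.
        if target and target.lower() != source:
            pairs.append((source, target))
    return pairs


# Stand-in for a machine translation call (hypothetical; no real API is used).
_FAKE_MT = {"insulin": "אינסולין", "diabetes": "סוכרת", "mri": "mri"}

demo_pairs = build_pairs(["Insulin", "diabetes", "insulin", "MRI", ""],
                         lambda s: _FAKE_MT.get(s, ""))
```

In this toy run, "insulin" maps to a transliteration and "diabetes" to a Hebrew translation, which mirrors the mix of outputs the paragraph describes.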

Mapping UMLS terms to Hebrew terms

To produce the list of UMLS terms, we apply the following steps:
Filter UMLS terms by type: For each evaluation experiment, we filter the UMLS in different ways. In the intrinsic evaluation, we focus on the disorders and drugs semantic groups. In the extrinsic evaluation, we include concepts expected in a typical consumer health corpus and keep concepts with TTY in adjectives, drug names, supplementary concepts, common names, entry, hierarchical terms, brand names, finding names, clinical synopsis, ingredients, scientific names, active substance, main heading, and chemicals. The terms filtered out are mostly abbreviations or acronyms. We collected 2.8 million medical terms; 640 000 of these are single words, and 2.16 million are multiword terms.
Transliterate: We construct a transliteration model and apply it to the selected UMLS terms. We collect the top 3 transliterations predicted for each UMLS term. This process results in a list of 8.4 million Hebrew terms mapped to their source UMLS CUI for general medical terms.
Translate: For some medical terms, a Hebrew translation is more likely to be used than a transliteration (eg, the Hebrew translation of “diabetes” is more common than its transliteration). To identify those terms, we collect a list of pairs (x, y), where y is the Hebrew translation of the English word x in case a transliteration is infrequent. The identification of these terms is described in Supplementary Appendix S2. This process produced 14 766 Hebrew translated (and nontransliterated) terms mapped to their source UMLS CUI.
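As a rough sketch of the transliterate and translate steps, the snippet below builds a lookup from Hebrew surface forms to CUIs, keeping the top 3 transliteration candidates per term (2.8 million terms × 3 candidates gives the 8.4 million forms reported). The function `build_term_index`, the stub `fake_transliterate`, and the placeholder identifiers `CUI_1` and `CUI_2` are all hypothetical; a real run would call the trained model and use actual CUIs.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


def build_term_index(umls_terms: Iterable[Tuple[str, str]],
                     transliterate: Callable[[str], List[str]],
                     translations: Optional[Dict[str, str]] = None,
                     top_k: int = 3) -> Dict[str, Set[str]]:
    """Map each Hebrew surface form to the set of CUIs it may refer to."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for cui, english in umls_terms:
        # Keep only the top_k ranked transliteration candidates.
        for hebrew in transliterate(english)[:top_k]:
            index[hebrew].add(cui)
        # Add the translation when one is known for this term.
        if translations and english in translations:
            index[translations[english]].add(cui)
    return dict(index)


def fake_transliterate(term: str) -> List[str]:
    """Stub standing in for the trained model: returns 5 ranked dummy candidates."""
    return [f"{term}~t{i}" for i in range(1, 6)]


demo_index = build_term_index(
    [("CUI_1", "diabetes"), ("CUI_2", "insulin")],  # placeholder identifiers
    fake_transliterate,
    translations={"diabetes": "sa-ke-re-t"},
)
```

Storing a set of CUIs per surface form allows the same Hebrew string to remain ambiguous between concepts until a later filtering step.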

UMLS XEL

We now describe how to exploit the dataset of medical term pairs (transliterated and translated) and perform UMLS entity linking on the OHC documents. We proceed in 2 steps: (1) produce candidate matches with high recall and (2) filter candidates using a machine learning model that was trained to detect relevant terms in the context of a post. Our entity-linking method has 2 contributions:
(1) UMLS entity linking is conditioned on the topic of the document. We use a different relevance classifier for each disease-specific community.
(2) Relevance classification is contextual: the same term may be tagged as a medical mention in some contexts, but not in others.

High-recall matching

To generate a high-recall list of candidates, we scan all posts in each community, and search all matches for any term y from the pairs dataset (either transliterations or translations). We match both the observed tokenized form of the Hebrew text in the post and its lemmatized text. Lemmatization is particularly important in Hebrew because most function words (articles, prepositions, and conjunctions) appear in Hebrew in a form agglutinated to the next word. We use the YAP model for Hebrew lemmatization. We use a fuzzy string matching algorithm to match between the UMLS transliterated terms and the post tokens in a parallel manner. Because many UMLS terms are multiword expressions, we match n-grams (1,2,3) in the text. In the full algorithm, filter_model refers to the relevance model described in Figure 4.

Figure 4.

Overall algorithm for entity linking—combining high-recall n-grams matching and contextual filtering.
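The high-recall matching step can be illustrated with a small sketch built on Python's standard `difflib`. Several choices here are assumptions rather than details from the paper: the similarity measure (`SequenceMatcher.ratio`), the 0.8 threshold, the restriction to comparing n-grams with terms of the same word count, and the placeholder CUIs. Lemmatization (done with YAP in the paper) and the contextual filter are omitted, and romanized tokens stand in for Hebrew text.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterator, List, Set, Tuple


def ngrams(tokens: List[str], max_n: int = 3) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, text) for every 1- to max_n-gram of the token list."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])


def high_recall_candidates(tokens: List[str],
                           term_to_cuis: Dict[str, Set[str]],
                           threshold: float = 0.8) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched_term, cui) for every fuzzy n-gram match."""
    candidates = []
    for start, end, gram in ngrams(tokens):
        for term, cuis in term_to_cuis.items():
            # Simplification: compare only against terms with the same word count.
            if len(term.split()) != end - start:
                continue
            if SequenceMatcher(None, gram, term).ratio() >= threshold:
                for cui in sorted(cuis):
                    candidates.append((start, end, term, cui))
    return candidates


demo_tokens = "yesh li fibromyalegia ve multiple sclerosis".split()
demo_terms = {"fibromyalgia": {"CUI_FIBRO"}, "multiple sclerosis": {"CUI_MS"}}
demo_candidates = high_recall_candidates(demo_tokens, demo_terms)
```

The nested loop is quadratic in the vocabulary size, so a vocabulary of millions of forms would need an index (for example, blocking on character n-grams) rather than this exhaustive comparison.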

Contextual relevance ranking

The relevance classifier takes the whole post as a context to determine whether a term mention is used in a medical sense, because some words may be considered medical in some contexts and not medical in other contexts. We expect that in a nonmedical corpus like Wikipedia, medical terms will have low probability, unless they are common phrases. We train a Hebrew language model on a Hebrew Wikipedia corpus with 3.8 million sentences, which were split 80% train and 20% validation. We trained multiple neural language models (Supplementary Appendix S3). Using this model, we calculate p(match | context) and combine a UMLS similarity feature between the candidate term and the name of the community (eg, “diabetes”) and count-based features. We train binary classifiers on the gold-standard and general datasets we prepared. This classifier is eventually applied on the list of high-recall candidates produced earlier and provides the final identified terms in the original Hebrew post, each linked to the corresponding UMLS concept.

RESULTS

Intrinsic evaluation

In the intrinsic evaluation experiment, we train a model to recognize mentions of drugs and disorders in a post. The settings of this experiment are the following: When preparing the list of pairs, we filter the UMLS to only include drugs and disorders. We train the contextual filter model for each class (disorders and drugs) using a random forest classifier on a split of the gold-standard dataset to filter the spans returned by the high-recall matcher. We use a 75%-25% train-test split of the dataset. Similar to previous work, we measure entity-level metrics (exact match of the span) and token-level metrics (using BIO encoding of spans) and report precision, recall, and F1 score. Results are presented in Table 1 (see also Supplementary Appendix Tables S4.2-S4.5). F1 score for exact entity recognition is 0.75 overall, with the less stringent token-level measure reaching 0.78 (accounting for partial matches and overlaps). Precision is generally better than recall for all metrics. The low recall stems from target terms that occur in the posts but are missing from the list of pairs we generated, not from the contextual filter.
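To make the difference between the two measures concrete, the sketch below scores a toy prediction at the entity level (exact span match) and at the token level (per-token BIO tag match). The token-level scoring rule used here is one plausible reading of "BIO encoding of spans" and is an assumption, as are the function names and the toy spans; span types are ignored.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def entity_level(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """A predicted span counts only if its boundaries match a gold span exactly."""
    gold_set, pred_set = set(gold), set(pred)
    return prf(len(gold_set & pred_set), len(pred_set - gold_set), len(gold_set - pred_set))


def to_bio(spans: Iterable[Span], length: int) -> List[str]:
    """Encode spans as B (begin), I (inside), O (outside) tags."""
    tags = ["O"] * length
    for start, end in sorted(spans):
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def token_level(gold: Iterable[Span], pred: Iterable[Span], length: int) -> Tuple[float, float, float]:
    """Compare BIO tags token by token, so partially correct spans earn credit."""
    pairs = list(zip(to_bio(gold, length), to_bio(pred, length)))
    tp = sum(1 for g, p in pairs if g != "O" and g == p)
    fp = sum(1 for g, p in pairs if p != "O" and g != p)
    fn = sum(1 for g, p in pairs if g != "O" and g != p)
    return prf(tp, fp, fn)


gold_spans = [(2, 3), (4, 6)]
pred_spans = [(2, 3), (4, 5)]  # second span is cut one token short
entity_scores = entity_level(gold_spans, pred_spans)
token_scores = token_level(gold_spans, pred_spans, length=6)
```

In this example the truncated span scores nothing at the entity level (F1 of 0.5 overall) but recovers partial credit at the token level (F1 of 0.8), which is why the token-level figure is the less stringent of the two.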

Table 1.

Intrinsic evaluation: Entity-level recognition (exact span) performance on gold-standard dataset

Community	Accuracy	F1 score	Precision	Recall	Support
Diabetes	0.97	0.73	0.71	0.75	314
Sclerosis	0.98	0.76	0.82	0.71	306
Depression	0.99	0.75	0.77	0.73	262
Weighted average	0.98	0.75	0.77	0.73	—

Table 2.

MDTEL UMLS entity linking performance on test data

Community	F1 score	Precision	Recall	ROC AUC	Accuracy: Filter model	Accuracy: Full algorithm	High-recall candidates filtered out
Diabetes	87.6	82.0	94.1	82.1	84.2	98.7	31.7%
Sclerosis	93.8	92.6	95.0	94.1	91.2	98.8	50.9%
Depression	87.5	87.5	87.5	97.9	85.9	99.1	53.8%

AUC: area under the receiver-operating characteristic curve; MDTEL: Medical Deep Transliteration Entity Linking; ROC: receiver-operating characteristic; UMLS: Unified Medical Language System.

Supplementary Appendix Table S4.6 indicates the proportion of the high-recall candidates filtered out by the contextual filters for each community and each type of term. On average, about half of the candidates are filtered out based on context. In ablation tests, we confirm that F1 score drops significantly (from 0.75 to 0.64) when using the full UMLS instead of selected semantic groups (diseases and drugs). Without the contextual filter, F1 score drops from 0.75 to 0.57. The link to UMLS CUI was accurate in 96% of the cases. Reported F1 performance of English UMLS linkers is in the range of 0.72-0.82 (cTAKES [clinical Text Analysis and Knowledge Extraction System]). Our method reaches 0.75 F1 with a small training dataset (about 1000 mentions) of social media text, suffering mainly from low recall. Supplementary Appendix S4 provides detailed error analysis.

Extrinsic evaluation

In this evaluation, we identify all mentions of medical terms, without fine-grained classification into semantic groups, according to the following setup: When preparing the list of pairs, we filter the UMLS to include types expected in OHCs. We train the contextual filter model using the same random forest classifier with a 75%-25% train-test split. Table 2 summarizes our results. F1 score, precision, recall, and filter model accuracy are measured on the high-recall list of candidate terms. Performance over all tokens in the posts is reported in the “Accuracy: Full algorithm” column. With accuracy ranging from 98.7% to 99.1% and F1 score between 87.5 and 93.8, the performance of MDTEL on noisy Hebrew text is similar to that obtained by specialized UMLS linkers in English such as cTAKES. About half of the high-recall candidates are filtered out by the contextual relevance classifier, demonstrating the importance of context in identifying term mentions. Increased performance compared with the intrinsic evaluation is explained by (1) a larger number of instances for training and (2) coarser decision making: we do not attempt to identify the semantic type of the mention, only the fact that a medical term is mentioned.

Information retrieval improvement in OHCs with UMLS term normalization

Once term mentions are identified, we normalize them and assess the impact of this normalization on an information retrieval task. For example, given the search query submitted to Camoni (“How do you know I have fibromyalegia?”), we link the transliterated “fibromyalegia” with high similarity to the transliterated term “fibromyalgia” to generate the new query, “How do you know I have fibromyalgia?” We collect search queries submitted to Camoni from the previous 3 years. We focused on queries that occurred <20 times and are more likely to include infrequent transliteration forms. About 6500 queries are collected for each community.
Google Search Baseline: We collect the current search results for each query without any modification on the full Camoni site (using Google Custom Search API).
Google Spelling Suggestion Baseline: We use Google Spelling Suggestions API to propose a fixed query and collect search results for the fixed query.
UMLS Normalization: We linked the text of the query to UMLS and performed an alteration for each case where the similarity to a linked term is high (>0.8) but not perfect (<1.0). A query may include several medical terms, in which case we generate 1 fixed query for each normalized term, with only 1 medical term altered.
Get results of the normalized queries: For each search query, we collect search results using Google Custom Search API. For each original query, we select the normalized query returning the most search results.
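The query alteration rule (similarity above 0.8 but below 1.0, one term altered per variant, keep the variant with the most results) can be sketched as below. This is an illustration under stated assumptions: `difflib` similarity stands in for the UMLS linking score, only single-word terms are handled, and `count_results` is a hypothetical callable standing in for a search API call.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Optional


def normalized_queries(query_tokens: List[str],
                       vocabulary: Iterable[str],
                       low: float = 0.8,
                       high: float = 1.0) -> List[str]:
    """Generate one altered query per token that nearly matches a known term."""
    terms = list(vocabulary)
    variants = []
    for i, token in enumerate(query_tokens):
        best_term, best_score = None, 0.0
        for term in terms:
            score = SequenceMatcher(None, token, term).ratio()
            if score > best_score:
                best_term, best_score = term, score
        # Alter only near matches: high similarity, but not already identical.
        if best_term is not None and low < best_score < high:
            altered = list(query_tokens)
            altered[i] = best_term
            variants.append(" ".join(altered))
    return variants


def pick_best(variants: List[str],
              count_results: Callable[[str], int]) -> Optional[str]:
    """Select the variant whose search returns the most results."""
    return max(variants, key=count_results) if variants else None


demo_query = "how do i know i have fibromyalegia".split()
demo_variants = normalized_queries(demo_query, ["fibromyalgia", "diabetes"])
# Hypothetical result counts standing in for search API responses.
demo_best = pick_best(demo_variants,
                      {"how do i know i have fibromyalgia": 12}.get)
```

Excluding exact matches (similarity of 1.0) means correctly spelled queries pass through unchanged, so the method only issues extra searches when there is something to fix.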

Analysis of experimental results

We measure the percentage of queries where the normalized queries got more results than the original query. Table 3 compares the recall improvements introduced by UMLS normalization and Google Spelling Suggestion.

Table 3.

Quantitative information retrieval improvement using MDTEL

Community	Queries	Queries improved Google Spelling Suggestion	Queries improved MDTEL UMLS Linking
Diabetes	6581	22.7%	45.2%
Sclerosis	6325	22.6%	46.5%
Depression	7302	22.2%	47.3%

MDTEL: Medical Deep Transliteration Entity Linking; UMLS: Unified Medical Language System.

MDTEL improves recall for about twice as many queries as the data-driven Google correction method. This difference is explained by the following factors: we focus on medical terms, we consider several alterations for each query, and our process exploits pretrained data that are not observed frequently in search queries and for which Google’s approach is therefore not likely to find corrections. We perform manual analysis of the search results with MDTEL UMLS normalization to verify that in the cases with increased search results, MDTEL also improves the relevance of the search results. Given the original and the fixed queries Q and Q′, and the corresponding search results Answers(Q) and Answers(Q′), we asked medical students to annotate the following questions: Does Q′ have the same intended meaning as Q? Are Answers(Q′) documents that match the meaning of Q? In this experiment, we presented as Answers(Q′) the titles of the top 10 search results received in the website as ranked by Google API: article or post titles, without their content. To account for the cases where the original query had very few answers, we sample 100 cases in which UMLS normalization increased the search count from <5 (there are 1314, 1272, and 1491 such queries for the diabetes, sclerosis, and depression communities, respectively). We ask the question: “Do you find a relevant answer to Q in Answers(Q′)?” We do not show Q′ to the annotator to avoid introducing bias. Full results are shown in Supplementary Appendix S5. A total of 94% of the answers are positive, indicating that UMLS normalization did not alter the intended meaning of the query (Supplementary Table S5.2). In a second evaluation, we account for cases where the original query yielded many results (>5).
We sampled 100 examples of Q, Answers(Q), Answers(Q′), showed them to the annotator as Q, Answers(A), Answers(B), and asked: “Which answers are better?” To avoid bias, the annotator is not told which column corresponds to MDTEL. Full results are shown in Supplementary Table S5.3. In 96% of cases, UMLS normalization either improved or preserved the quality of the search results, with about 36% of the cases improved. Finally, we ask the annotator to directly assess whether the replacement performed by MDTEL respects the intended meaning of the query. The replacement was found problematic in only 4 cases of 300. Supplementary Appendix S5 provides detailed error analysis.

DISCUSSION

User text normalization at indexing time

In all experiments reported previously, we limit ourselves to normalization of the query at search time. We expect that patient-authored posts also contain misspellings of medical terms in Hebrew. Normalizing the posts through UMLS linking at the time they are posted would increase search efficacy and ease text analytics. Practically, this process is more complex to implement than query alteration because it must be verified by the author of the post or through postediting curation. The potential improvement in search efficacy is assessed in the Supplementary Appendix.

Generalizing to other languages

The MDTEL method is generalizable to other languages with a different writing script than English. It relies on several components; we list language-specific requirements and annotation efforts for each step:
Transliteration model: To train the transliteration model, the only resource needed is an English medical terms list, which can be taken from our reference. It is useful to add domain-specific terms, and to include translated terms in cases in which the translation is more common than the transliteration. The generation of the synthetic dataset to train the transliteration model requires access to an automatic translation API (eg, Google Translate).
High recall matcher: This component uses the expanded UMLS MRCONSO table to build the list of pairs. The matcher itself requires a morphological analyzer (we use YAP for Hebrew) to lemmatize the text of the post.
Contextual relevance model: This is a machine learning model, and it needs a labeled dataset in order to learn. We experimented with several dataset sizes, and for all experiments, using 500 posts as the labeled dataset (25% of those posts as a test set) provided usable performance. The guidelines for annotation must be adapted to each language (eg, addressing morphological and syntactic constructs such as compound words).
Our method relies on the fact that transliterations for medical terms are frequent. This must be validated for each language. A simple test can verify this: take the English medical terms, translate them automatically, and measure how often the translations are not words in the target language. In Hebrew, we found 37% (28 308 of 77 698) of the resulting translations are not Hebrew words.
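The validation test in the last step can be sketched as a simple lexicon lookup. How the authors decided that a translation is "not a word" is not specified here, so the lexicon-membership rule below is an assumption; the function name, the toy lexicon, and the romanized outputs are hypothetical.

```python
from typing import Iterable, Set


def share_not_in_lexicon(translations: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of translated terms containing at least one out-of-lexicon token."""
    items = [t for t in translations if t.strip()]
    if not items:
        return 0.0
    missing = sum(
        1 for t in items
        if any(token not in lexicon for token in t.split())
    )
    return missing / len(items)


# Toy target-language lexicon and machine translation outputs (romanized placeholders).
demo_lexicon = {"sa-ke-re-t", "lev", "ke-ev", "rosh"}
demo_outputs = ["sa-ke-re-t", "in-su-lin", "ke-ev rosh", "fi-bro-mi-al-gia"]
demo_share = share_not_in_lexicon(demo_outputs, demo_lexicon)
```

A high share suggests that the translation system is mostly emitting transliterations, which is the precondition the method depends on.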

CONCLUSION

We present MDTEL, a method for UMLS XEL in OHCs, and assess its value in named-entity recognition and information retrieval tasks in a popular Israeli community. UMLS XEL is a challenging task in non-English languages, especially for languages with different writing and sound systems such as Chinese, Japanese, Hebrew, and Arabic. We observe that most medical term mentions in patient-authored text are formed by transliteration. Our method exploits advances in neural methods to generate transliterations from source text. We enhance a deep attentional RNN encoder-decoder transliteration model by synthesizing a set of term pairs starting from the UMLS vocabulary in an unsupervised manner. We also note that the classification of an n-gram as a UMLS medical concept mention depends on its context, both the general topic of the document and the surrounding tokens. We take this context dependence into account and build a relevance classifier for each community in the OHC that accounts for contextual neural language model features when detecting medical term mentions. On a domain of noisy Hebrew text characterized by very high variability (1.43 average forms for each medical term), MDTEL achieves performance similar to that achieved on English text by leading UMLS linkers on a vocabulary of over 8 million distinct forms. We demonstrate the efficacy of UMLS linking in patient-authored text by analyzing the improvements introduced in the OHC search engine by applying MDTEL on search queries. Both quantitative and manual performance analyses demonstrate the high value of UMLS terms in improving search. Cross-lingual UMLS linking also enables text analytics: we measured the prevalence of medical terms in patient-authored text (about 5%) and the distribution of UMLS contexts in forums, which are key features enabling sense-making and trend detection for community moderators.

FUNDING

This work was supported by National Institute of General Medical Sciences grant number R01 GM114355 (to NE).

AUTHOR CONTRIBUTIONS

YB, RC, EB, ME, and NE worked on study design and algorithms. YB implemented and carried out the experiments. TS participated in designing the evaluation study. All coauthors participated in the writing of the manuscript.

SUPPLEMENTARY APPENDIX

Supplementary Appendix is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST

The authors have no competing interests to declare.
