Yonatan Bitton1, Raphael Cohen1, Tamar Schifter2, Eitan Bachmat1, Michael Elhadad1, Noémie Elhadad3.
Abstract
OBJECTIVE: In Hebrew online health communities, participants commonly write medical terms as transliterations of an English source term. These transliterations introduce high variability into the text and challenge text-analytics methods. To reduce this variability, medical terms must be normalized, for example by linking them to Unified Medical Language System (UMLS) concepts. We present a method to identify both transliterated and translated Hebrew medical terms and link them with UMLS entities.
Keywords: UMLS; natural language processing; online health communities
Year: 2020 PMID: 32910823 PMCID: PMC7566404 DOI: 10.1093/jamia/ocaa150
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. This forum post contains 26 words and 6 spans that link to 5 unique identifiers of Unified Medical Language System medical terms. Note that a span can contain more than 1 word (like the term “multiple sclerosis”), and that a single Unified Medical Language System concept unique identifier can be referenced from several places in the same post.
Figure 2. Cross-lingual entity linking processing pipeline: offline, a filtered subset of Unified Medical Language System (UMLS) is transliterated and translated into Hebrew, producing pairs
Figure 3. Doccano online annotation tool with the Hebrew Unified Medical Language System schema.
Figure 4. Overall algorithm for entity linking: combining high-recall n-grams matching and contextual filtering.
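The two stages named in the Figure 4 caption, high-recall n-gram matching followed by contextual filtering, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy lexicon, the placeholder identifiers (`CUI_MS`, `CUI_LAST`), and above all the cue-word filter, which merely stands in for the contextual filter and is not the authors' model. English tokens are used in place of Hebrew for readability.

```python
from typing import Callable, Dict, List, Tuple

# (start token index, end token index (exclusive), concept identifier)
Span = Tuple[int, int, str]


def ngram_candidates(tokens: List[str], lexicon: Dict[str, str],
                     max_n: int = 3) -> List[Span]:
    """High-recall step: every n-gram found in the lexicon is a candidate."""
    spans: List[Span] = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in lexicon:
                spans.append((i, i + n, lexicon[key]))
    return spans


def link_entities(tokens: List[str], lexicon: Dict[str, str],
                  keep: Callable[[List[str], Span], bool],
                  max_n: int = 3) -> List[Span]:
    """Filtering step: keep only the candidates accepted by `keep`."""
    return [s for s in ngram_candidates(tokens, lexicon, max_n)
            if keep(tokens, s)]


def has_cue(tokens: List[str], span: Span,
            cues=frozenset({"diagnosed", "doctor"}), window: int = 3) -> bool:
    """Toy stand-in for a contextual filter (hypothetical rule):
    accept a span only if a cue word occurs just before it."""
    start = span[0]
    return any(t.lower() in cues for t in tokens[max(0, start - window):start])


# Hypothetical lexicon with placeholder identifiers, not real UMLS CUIs.
lexicon = {"multiple sclerosis": "CUI_MS", "last": "CUI_LAST"}
tokens = "i was diagnosed with multiple sclerosis last year".split()

print(ngram_candidates(tokens, lexicon))        # both candidates
print(link_entities(tokens, lexicon, has_cue))  # only the accepted span
```

The matching step deliberately over-generates (the everyday word "last" is proposed as a candidate), and the filter then discards candidates that the surrounding context does not support. This mirrors the division of labor the caption describes, though the filter in the paper is a separate model rather than a cue-word rule.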
Intrinsic evaluation: Entity-level recognition (exact span) performance on gold-standard dataset
| Community | Accuracy | F1 score | Precision | Recall | Support |
|---|---|---|---|---|---|
| Diabetes | 0.97 | 0.73 | 0.71 | 0.75 | 314 |
| Sclerosis | 0.98 | 0.76 | 0.82 | 0.71 | 306 |
| Depression | 0.99 | 0.75 | 0.77 | 0.73 | 262 |
| Weighted average | 0.98 | 0.75 | 0.77 | 0.73 | — |
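As a consistency check on the table above, the F1 column can be recomputed from the precision and recall columns, and the weighted-average row from the per-community values. The sketch below assumes that "weighted average" means weighting each community by its support; the source does not state this explicitly, although the reported figures are consistent with it. The helper names are illustrative only.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def weighted_average(values, supports):
    """Average of `values`, each weighted by its support count."""
    return sum(v * n for v, n in zip(values, supports)) / sum(supports)


# (precision, recall, support) per community, copied from the table above.
rows = {
    "Diabetes": (0.71, 0.75, 314),
    "Sclerosis": (0.82, 0.71, 306),
    "Depression": (0.77, 0.73, 262),
}

supports = [n for _, _, n in rows.values()]
f1_scores = [f1(p, r) for p, r, _ in rows.values()]

# Assumption: the "Weighted average" row is support-weighted.
summary = {
    "precision": weighted_average([p for p, _, _ in rows.values()], supports),
    "recall": weighted_average([r for _, r, _ in rows.values()], supports),
    "f1": weighted_average(f1_scores, supports),
}

for name, score in zip(rows, f1_scores):
    print(f"{name}: F1 = {score:.2f}")
print({k: round(v, 2) for k, v in summary.items()})
```

Rounded to two decimals, the recomputed values agree with the reported F1 scores (0.73, 0.76, 0.75) and with the weighted-average row (precision 0.77, recall 0.73, F1 0.75).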
MDTEL UMLS entity linking performance on test data
| Community | F1 score | Precision | Recall | ROC AUC | Accuracy: Filter model | Accuracy: Full algorithm | High-recall candidates filtered out |
|---|---|---|---|---|---|---|---|
| Diabetes | 87.6 | 82.0 | 94.1 | 82.1 | 84.2 | 98.7 | 31.7% |
| Sclerosis | 93.8 | 92.6 | 95.0 | 94.1 | 91.2 | 98.8 | 50.9% |
| Depression | 87.5 | 87.5 | 87.5 | 97.9 | 85.9 | 99.1 | 53.8% |
AUC: area under the receiver-operating characteristic curve; MDTEL: Medical Deep Transliteration Entity Linking; ROC: receiver-operating characteristic; UMLS: Unified Medical Language System.
Quantitative information retrieval improvement using MDTEL
| Community | Queries | Queries improved by Google spelling suggestion | Queries improved by MDTEL UMLS linking |
|---|---|---|---|
| Diabetes | 6581 | 22.7% | 45.2% |
| Sclerosis | 6325 | 22.6% | 46.5% |
| Depression | 7302 | 22.2% | 47.3% |
MDTEL: Medical Deep Transliteration Entity Linking; UMLS: Unified Medical Language System.