Magnus Ahltorp1, Maria Skeppstedt2, Shiho Kitajima3, Aron Henriksson4, Rafal Rzepka3, Kenji Araki3.
Abstract
BACKGROUND: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, because large biomedical corpora are scarce in most languages. Medical vocabularies are, however, also essential for text mining from corpora written in languages other than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthography as well as text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs.
Keywords: Agglomerative hierarchical clustering; Distributional semantics; Japanese language processing; Medical vocabulary expansion; Random indexing
Year: 2016 PMID: 27671202 PMCID: PMC5037651 DOI: 10.1186/s13326-016-0093-x
Source DB: PubMed Journal: J Biomed Semantics
Vocabulary size
| | Medical finding (units) | Pharmaceutical drug (units) | Body part (units) |
|---|---|---|---|
| All terms in used vocabularies | 77350 | 27912 | 2960 |
| More than 50 occurrences in the segmented corpus as a semantic unit in the context of at least one other semantic unit | 753 | 276 | 214 |
Fig. 1 Recall for retrieving semantic units belonging to the three investigated semantic categories
Best results for top n. For window sizes 1+1 and 8+8
| Window size 1+1 | Best strategy | Average | 2.5 % percentile | 97.5 % percentile | Variance |
|---|---|---|---|---|---|
| Medical Finding | summed similarity | | 13.9 % | 19.2 % | 0.0002 |
| Pharmaceutical Drug | summed similarity | | 20.1 % | 29.4 % | 0.0006 |
| Body Part | cluster level 83 | 16.2 % | 7.5 % | 21.9 % | 0.0011 |
| Window size 8+8 | Best strategy | Average | 2.5 % percentile | 97.5 % percentile | Variance |
| Medical Finding | cluster level 12 | 12.1 % | 9.1 % | 14.9 % | 0.0002 |
| Pharmaceutical Drug | cluster level 1 | 11.1 % | 7.3 % | 14.7 % | 0.0004 |
| Body Part | cluster level 2 | | 18.4 % | 28.1 % | 0.0006 |
The best results are shown in bold face
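The summary columns in the table above (average, 2.5 % and 97.5 % percentiles, variance) can be computed from a list of per-run recall values. The following is a minimal sketch, not the authors' evaluation code: the function names, the linear-interpolation percentile, and the use of population variance are all assumptions made for illustration. Recall is held as a fraction (0.14 rather than 14 %), which appears consistent with the magnitude of the reported variances.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile q (0-100) of a pre-sorted list, using linear interpolation.

    Assumption: the source does not say which percentile definition was used.
    """
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def summarise_recall(recalls):
    """Summarise per-run recall values (fractions in [0, 1])."""
    ordered = sorted(recalls)
    return {
        "average": statistics.fmean(ordered),
        "p2_5": percentile(ordered, 2.5),
        "p97_5": percentile(ordered, 97.5),
        # Population variance; sample variance would be an equally valid guess.
        "variance": statistics.pvariance(ordered),
    }


if __name__ == "__main__":
    # Hypothetical recall values from five evaluation runs.
    print(summarise_recall([0.10, 0.12, 0.14, 0.16, 0.18]))
```

With many runs the choice of percentile definition matters little; with only a handful of runs, as in the toy call above, the interpolated percentiles stay close to the minimum and maximum.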
Best results for top 10n. For window sizes 1+1 and 8+8
| Window size 1+1 | Best strategy | Average | 2.5 % percentile | 97.5 % percentile | Variance |
|---|---|---|---|---|---|
| Medical Finding | cluster level 100 | | 55.3 % | 60.8 % | 0.0002 |
| Pharmaceutical Drug | cluster level 14 | | 56.2 % | 77.4 % | 0.0029 |
| Body Part | cluster level 73 | 36.6 % | 20.6 % | 46.5 % | 0.0036 |
| Window size 8+8 | Best strategy | Average | 2.5 % percentile | 97.5 % percentile | Variance |
| Medical Finding | cluster level 20 | 44.9 % | 40.5 % | 49.7 % | 0.0006 |
| Pharmaceutical Drug | cluster level 2 | 32.2 % | 26.6 % | 37.9 % | 0.0010 |
| Body Part | cluster level 83 | | 38.6 % | 51.8 % | 0.0011 |
The best results are shown in bold face
Fig. 2 How often a term is found when it is used as a reference standard term. The first stack shows the number of terms that are correctly retrieved between 0 % and 5 % of the times they are used in the reference standard, the second stack shows the number of terms retrieved between 5 % and 10 % of the times, and so on. The statistics are shown for the top 10n candidate terms (using cluster level 100 and the fully lemmatised and stop-word-filtered corpus for Medical Finding and Pharmaceutical Drug, and cluster level 34 with the corpus retaining more information for Body Part)
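The binning behind the Fig. 2 caption can be sketched as follows. This is an illustrative sketch only: the function name, the two input dictionaries, and the choice to treat each bin's lower bound as inclusive (so a term found exactly 5 % of the time lands in the 5 % to 10 % stack) are assumptions, since the caption does not specify how boundary values are handled.

```python
def retrieval_rate_histogram(times_used, times_found, bin_width=5):
    """Count terms per retrieval-rate bin of `bin_width` percentage points.

    times_used:  term -> number of runs in which the term was a reference
                 standard term (hypothetical input format)
    times_found: term -> number of those runs in which it was retrieved
    Returns a list of counts; index 0 covers 0-5 %, index 1 covers 5-10 %, ...
    """
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for term, used in times_used.items():
        if used == 0:
            continue  # never used as a reference term, so no rate exists
        found = times_found.get(term, 0)
        # Integer arithmetic avoids floating-point trouble at bin boundaries.
        index = (100 * found) // (bin_width * used)
        # A term found 100 % of the time belongs in the last bin.
        counts[min(index, n_bins - 1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical terms "a", "b", "c" with made-up counts.
    used = {"a": 20, "b": 20, "c": 10}
    found = {"a": 0, "b": 20, "c": 1}
    print(retrieval_rate_histogram(used, found))
```

The first and last entries of the returned list correspond to the two extreme groups of terms: those almost never found and those almost always found.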
Fig. 3 Frequency in the TOBYO corpus of the terms in two opposite groups of evaluation data: terms that were found in less than 5 % of the cases in which they were used as a reference standard term, and terms that were found in more than 95 % of those cases