Sanghoun Song, Hyung Joon Joo, Yunjin Yum, Jeong Moon Lee, Moon Joung Jang, Yoojoong Kim, Jong-Ho Kim, Seongtae Kim, Unsub Shin.
Abstract
BACKGROUND: Medical terms require special expertise and are becoming increasingly complex, which makes it difficult to apply natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, very few reference standards exist for languages other than English. In addition, because the existing reference standards were developed a long time ago, an updated standard is needed to reflect recent findings in the medical sciences.
Keywords: Korean; fastText; medical word pair; relatedness; similarity; word embedding
Year: 2021 PMID: 34185005 PMCID: PMC8277378 DOI: 10.2196/29667
Source DB: PubMed Journal: JMIR Med Inform
Examples from the practice session for human validation. The practice word pairs included general as well as medical terms. Term 1 and Term 2 were presented to the participants, whereas the anticipated similarity and relatedness categories were kept hidden.
| Term 1 | Term 2 | Anticipated category |
| 책방 (bookstore) | 서점 (bookshop) | Similarity: high |
| 학교 (school) | 경찰서 (police station) | Similarity: middle |
| 까치 (magpie) | 중국어 (the Chinese language) | Similarity: low |
| 친구 (friend) | 사람 (human) | Relatedness: high |
| 겨울 (winter) | 난로 (heater) | Relatedness: middle |
| 핸드폰 (cell phone) | 미술 (art) | Relatedness: low |
| 심혈관질환 (cardiovascular disease) | 관상동맥질환 (coronary artery disease) | Similarity: high |
| 암성통증 (cancer pain) | 월경통 (menstrual pain) | Similarity: middle |
| 좌골신경통 (sciatica) | 간성혼수 (hepatic coma) | Similarity: low |
| 심근경색 (myocardial infarction) | 흉통 (chest pain) | Relatedness: high |
| 세티리진 (cetirizine) | 구강건조 (dry mouth) | Relatedness: middle |
| 백반증 (vitiligo) | 라미 (lamisil) | Relatedness: low |
Hyperparameters of Word2Vec and FastText.
| Parameter name | Specified argument |
| Dimension size | 300 |
| Window size | 5 |
| Negative sampling ratio | 10% |
| Minimum frequency | 10 |
| Workers | 3 |
| Batch words | 10,000 |
| Alpha | 0.25% |
| Epochs | 20 |
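The hyperparameters in the table can be written down as a single configuration. The sketch below assumes the models were trained with gensim (the source does not name the library) and uses gensim 4.x keyword names. Reading "10%" as `negative=10` and "0.25%" as `alpha=0.0025` is an interpretation of the table, not a confirmed value, and `build_models` is a hypothetical helper name.

```python
# Hyperparameters from the table, keyed by gensim 4.x keyword names.
# ASSUMPTION: the source does not say which library was used; gensim is a guess.
HYPERPARAMS = {
    "vector_size": 300,     # Dimension size
    "window": 5,            # Window size
    "negative": 10,         # ASSUMPTION: the table's "10%" read as 10 negative samples
    "min_count": 10,        # Minimum frequency
    "workers": 3,           # Workers
    "batch_words": 10_000,  # Batch words
    "alpha": 0.0025,        # ASSUMPTION: the table's "0.25%" read literally as 0.0025
    "epochs": 20,           # Epochs
}


def build_models(sentences):
    """Train Word2Vec and FastText on tokenized sentences (hypothetical helper).

    `sentences` is a list of token lists. gensim is imported lazily so that
    the configuration above can be inspected without the library installed.
    """
    from gensim.models import FastText, Word2Vec

    return (
        Word2Vec(sentences=sentences, **HYPERPARAMS),
        FastText(sentences=sentences, **HYPERPARAMS),
    )
```

Sharing one dictionary between both constructors keeps the two models comparable, which matches the table's single set of values for Word2Vec and FastText.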
Figure 1. Score distribution plots of the participants (n=16 attending physicians from a tertiary hospital). The score distributions of participants 1, 5, and 13 were extremely skewed, and these participants were therefore excluded from further analyses.
Figure 2. Scatter plot of the correlation between the similarity and relatedness tasks.
Interrater agreement (intraclass correlation coefficient) for word pair sets grouped by semantic domain type.
| Task | Word pairs of the same domain | Word pairs of different domains |
| | | |
| Original word pair set | 0.49 (a) | 0.41 (b) |
| Final word pair set after modification | 0.49 (c) | 0.42 (d) |
| | | |
| Original word pair set | 0.52 (a) | 0.57 (b) |
| Final word pair set after modification | 0.51 (e) | 0.57 (f) |
(a) n=409.
(b) n=198.
(c) n=408.
(d) n=196.
(e) n=407.
(f) n=192.
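The table does not state which form of the intraclass correlation coefficient was used. As a minimal sketch, assuming the two-way random-effects, single-rater form usually written ICC(2,1), the coefficient can be computed from a matrix with one row per word pair and one column per rater. The function name `icc2_1` and the toy ratings are illustrative only and are not data from the study.

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows (one per rated item, e.g. a word pair),
    each row holding one score per rater. ASSUMPTION: this ICC form is a
    choice made for illustration; the source does not specify it.
    """
    n = len(scores)     # number of rated items
    k = len(scores[0])  # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares for items, raters, and residual error.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy ratings: 6 items scored by 4 raters (made-up values).
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
toy_icc = icc2_1(toy_ratings)
```

Because ICC(2,1) penalizes systematic differences between raters, it is lower than a consistency-only coefficient whenever some raters score uniformly higher than others, as in the toy data.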
Figure 3. Correlation between cosine distance from FastText and the human evaluations from 13 attending physicians.
Figure 4. Correlation between the reference scores for the original University of Minnesota Semantic Relatedness Set (UMNSRS) word pair sets (English version) and the scores from 12 health information managers for the Korean translation version.
Figure 5. Correlation between the cosine distance of FastText embedding models and the human evaluations by 12 health information managers of the Korean version of the University of Minnesota Semantic Relatedness Set (UMNSRS) word pair sets.
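Figures 3 and 5 compare embedding-based scores with human ratings. A minimal sketch of such a comparison follows, assuming cosine similarity between word vectors (cosine distance is 1 minus this value) and a Spearman rank correlation; the source does not say which correlation coefficient was used. All vectors, pair names, and ratings are made-up toy values, not data from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy embeddings and human ratings (hypothetical values).
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_pairs = [("a", "b"), ("a", "c"), ("b", "c")]
toy_human = [3.8, 1.2, 1.5]

model_scores = [
    cosine_similarity(toy_vectors[p], toy_vectors[q]) for p, q in toy_pairs
]
rho = spearman(model_scores, toy_human)
```

A rank-based coefficient is a common choice for this kind of evaluation because human rating scales and cosine values are not on comparable numeric scales; only the ordering of the word pairs is compared.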