| Literature DB >> 35974113 |
Yoojoong Kim1, Jong-Ho Kim2,3, Hyung Joon Joo2,3,4, Sanghoun Song5, Jeong Moon Lee2, Moon Joung Jang2, Yun Jin Yum2,6, Seongtae Kim7, Unsub Shin7, Young-Min Kim8.
Abstract
With advances in deep learning and natural language processing (NLP), the analysis of medical texts is becoming increasingly important. Nevertheless, despite this importance, no research on Korean medical-specific language models had been conducted. Korean medical text is difficult to analyze because of the agglutinative characteristics of the language and the complex terminology of the medical domain. To address this problem, we collected a Korean medical corpus and used it to train language models. In this paper, we present a Korean medical language model based on deep learning NLP. The model was pre-trained for the medical context using the BERT pre-training framework on top of a state-of-the-art Korean language model. The pre-trained model showed accuracy increases of 0.147 and 0.148 on the masked language model (MLM) with next sentence prediction (NSP) task. In the intrinsic evaluation, the NSP accuracy improved by 0.258, a remarkable enhancement. In addition, the extrinsic evaluation on Korean medical semantic textual similarity data showed a 0.046 increase in the Pearson correlation, and the evaluation on Korean medical named entity recognition showed a 0.053 increase in the F1-score.
Year: 2022 PMID: 35974113 PMCID: PMC9381714 DOI: 10.1038/s41598-022-17806-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
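The MLM objective used in pre-training corrupts a fraction of the input tokens and trains the model to recover the originals. A minimal sketch of BERT-style masking, assuming the standard 15% selection rate and 80/10/10 replacement rule; the token IDs and vocabulary size here are hypothetical, not KM-BERT's actual values:

```python
import random

MASK_ID = 4          # hypothetical [MASK] token id
VOCAB_SIZE = 30000   # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted_ids, labels); labels are -100 at unmasked positions."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            labels.append(tid)            # model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tid)                        # 10%: keep the original
        else:
            corrupted.append(tid)
            labels.append(-100)           # position ignored by the MLM loss
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged forces the model to build contextual representations for every position, not only for `[MASK]` slots.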
Figure 1 Pre-training process for KM-BERT. A pair of sentences joined with special tokens is used as input. (A) MLM task. (B) NSP task.
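The input shown in Figure 1 packs two sentences into one sequence with special tokens. A minimal sketch of the packing, assuming BERT's `[CLS]`/`[SEP]` convention and 0/1 segment IDs; tokenization itself is omitted:

```python
def pack_pair(tokens_a, tokens_b):
    """Join two tokenized sentences into one BERT input with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 covers B + final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids
```

The NSP head reads the `[CLS]` position of this packed sequence to classify whether sentence B actually follows sentence A.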
Description of the datasets used in the evaluations.
| Task | Evaluation | Data description |
|---|---|---|
| MLM and NSP | Pre-training | Test set of the collected corpus |
| MLM | Intrinsic | External Korean medical text |
| NSP | Intrinsic | External Korean medical text |
| MedSTS | Extrinsic | MedSTS dataset translated from English to Korean |
| NER | Extrinsic | Korean medical NER dataset |
Figure 2 Pre-training results of KM-BERT and KM-BERT-vocab on the MLM and NSP tasks over epochs. The dashed line (KR-BERT) and the dot-dashed line (M-BERT) denote the performance of the final pre-trained models. (A) MLM accuracy. (B) NSP accuracy. (C) MLM loss. (D) NSP loss.
Figure 3 Distribution of MLM accuracy over 100 repetitions for each language model on each corpus type. (A) Medical textbook. (B) Health information news. (C) Medical research article.
Figure 4 Distribution of the predicted next-sentence probability for the NSP task. (A–C) Sentence pairs with the true next-sentence relationship from medical textbooks, health information news, and medical research articles, respectively. (D–F) Random sentence pairs, with no next-sentence relationship, for the corpus types corresponding to (A–C).
NSP accuracy on the intrinsic evaluation dataset. For each corpus type, accuracy is reported separately for sentence pairs with the true next-sentence relationship (next) and for random sentence pairs (random).

| Model | Medical textbook (next) | Medical textbook (random) | Health information news (next) | Health information news (random) | Medical research article (next) | Medical research article (random) | Overall |
|---|---|---|---|---|---|---|---|
| KM-BERT | 0.997 | 0.871 | 0.997 | 0.932 | 1 | 0.935 | 0.955 |
| KM-BERT-vocab | 1 | 0.850 | 1 | 0.922 | 0.997 | 0.925 | 0.949 |
| KR-BERT | 1 | 0.415 | 0.993 | 0.408 | 0.997 | 0.371 | 0.697 |
| M-BERT | 0.912 | 0.752 | 0.956 | 0.599 | 0.861 | 0.810 | 0.815 |
Extrinsic evaluation results on MedSTS.
| Model | Pearson correlation | Spearman correlation |
|---|---|---|
| KM-BERT | 0.869 | 0.860 |
| KM-BERT-vocab | 0.851 | 0.834 |
| KR-BERT | 0.823 | 0.811 |
| M-BERT | 0.842 | 0.830 |
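The MedSTS scores above are correlations between each model's predicted similarities and the gold similarity ratings. A self-contained sketch of both metrics in pure Python (Spearman computed as Pearson over average ranks, handling ties):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson over ranks, with ties given average ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1        # average 1-based rank of the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    return pearson(ranks(x), ranks(y))
```

Pearson rewards linear agreement with the gold scores, while Spearman rewards only the ordering, which is why the two columns above can differ slightly.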
Examples of the MedSTS dataset containing the true MedSTS similarity and similarities measured by KM-BERT and KR-BERT.
| Sentence 1 | Sentence 2 | MedSTS | KM-BERT | KR-BERT |
|---|---|---|---|---|
| Qsymia 3.75–23 mg capsule multiphasic release 24 h 1 capsule by mouth one time daily | Aleve 220 mg tablet 2 tablets by mouth one time daily as needed | 0 | 1 | 3 |
| Patient requires extensive assistance in the following activities: toileting, transfer to/from bed/chair, mobility | Patient requires limited assistance in the following activities: bathing, dressing, toileting | 2.75 | 4 | 3 |
Sentences were translated from the original English MedSTS sentences into Korean.
Extrinsic evaluation results on Korean medical NER.
| Model | F1 |
|---|---|
| KM-BERT | 0.866 |
| KM-BERT-vocab | 0.847 |
| KR-BERT | 0.847 |
| M-BERT | 0.813 |
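NER is conventionally scored with entity-level F1, where a prediction counts as correct only if both the span and the entity type match exactly. A minimal sketch under the assumption that entities are represented as `(start, end, type)` tuples; the specific tag scheme of the Korean medical NER dataset is not shown here:

```python
def entity_f1(gold, pred):
    """Entity-level F1 over (start, end, type) tuples with exact-match scoring."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)                      # exact span-and-type matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Exact-match scoring is strict: a predicted entity with the right type but a boundary off by one token contributes to neither precision nor recall.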