Ying Xiong, Zhongmin Wang, Dehuan Jiang, Xiaolong Wang, Qingcai Chen, Hua Xu, Jun Yan, Buzhou Tang.
Abstract
BACKGROUND: Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. They are usually preliminary steps for many Chinese natural language processing (NLP) tasks. Many studies have addressed CWS and POS tagging in various domains; however, few have addressed them in the clinical domain, because the granularity of words is not easy to determine.
Keywords: Clinical named entity recognition; Fine-grained Chinese word segmentation; Part-of-speech tagging
Year: 2019 PMID: 30961602 PMCID: PMC6454584 DOI: 10.1186/s12911-019-0770-7
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Workflow of our study
Statistics of our fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text
| Dataset | Notes | Sentences | Words |
|---|---|---|---|
| Training | 1440 | 6867 | 158,035 |
| Validation | 180 | 813 | 19,290 |
| Test | 180 | 857 | 21,472 |
| Total | 1800 | 8537 | 198,797 |
Statistics of entities in the different categories of CCKS
| Dataset | Body | Disease | Symptom | Test | Treatment | Total |
|---|---|---|---|---|---|---|
| Train | 10,719 | 722 | 7831 | 9546 | 1048 | 29,866 |
| Test | 3021 | 553 | 2311 | 3143 | 465 | 9493 |
Examples of CWS and POS tagging representations using the “BMES” tagging schema
| Task | Text | BMES tags |
|---|---|---|
| Word segmentation | 左眼视力下降数年 | 左/S 眼/S 视/B 力/E 下/B 降/E 数/S 年/S |
| Word segmentation and POS tagging | 左眼视力下降数年 | 左/S-JJ 眼/S-NN 视/B-NN 力/E-NN 下/B-VV 降/E-VV 数/S-CD 年/S-M |
| Word segmentation and POS tagging | 查血常规 | 查/S-VV 血/B-NN 常/M-NN 规/E-NN |
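The BMES representation in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `words_to_bmes`, `bmes_to_words`, and `format_tags` are hypothetical, and the sketch assumes the input is already segmented into words, with one POS label per word when POS tags are used.

```python
from typing import List, Optional, Tuple


def words_to_bmes(words: List[str],
                  pos: Optional[List[str]] = None) -> List[Tuple[str, str]]:
    """Turn segmented words (and optional POS labels) into per-character tags.

    S = single-character word, B = begin, M = middle, E = end.
    With POS labels, each tag becomes "<position>-<POS>", e.g. "B-NN".
    """
    if pos is not None and len(pos) != len(words):
        raise ValueError("pos must contain one label per word")
    tagged = []
    for i, word in enumerate(words):
        if len(word) == 1:
            positions = ["S"]
        else:
            positions = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        suffix = f"-{pos[i]}" if pos is not None else ""
        tagged.extend((ch, p + suffix) for ch, p in zip(word, positions))
    return tagged


def bmes_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Inverse mapping: close a word at every S or E tag."""
    words, current = [], ""
    for ch, tag in tagged:
        current += ch
        if tag[0] in ("S", "E"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without S or E
        words.append(current)
    return words


def format_tags(tagged: List[Tuple[str, str]]) -> str:
    """Render tags in the "char/TAG" style used in the table."""
    return " ".join(f"{ch}/{tag}" for ch, tag in tagged)


words = ["左", "眼", "视力", "下降", "数", "年"]
pos = ["JJ", "NN", "NN", "VV", "CD", "M"]
print(format_tags(words_to_bmes(words)))
print(format_tags(words_to_bmes(words, pos)))
print(format_tags(words_to_bmes(["查", "血常规"], ["VV", "NN"])))
```

Because every word ends in either an S or an E tag, decoding a predicted tag sequence back into words only needs to split after those two tags, which is what `bmes_to_words` does.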
Performance of CRF and BiLSTM-CRF on CWS and POS tagging for clinical text on our corpus
| Task | Subtask | Method | P(%) | R(%) | F(%) |
|---|---|---|---|---|---|
| CWS | CWS | CRF | 96.75 | 97.14 | 96.94 |
| CWS | CWS | BiLSTM-CRF | 96.56 | 96.66 | 96.61 |
| CWS&POS tagging | CWS | CRF | 97.18 | 96.73 | 96.95 |
| CWS&POS tagging | CWS | BiLSTM-CRF | 96.86 | 96.76 | 96.81 |
| CWS&POS tagging | POS tagging | CRF | 95.34 | 94.89 | 95.11 |
| CWS&POS tagging | POS tagging | BiLSTM-CRF | 94.81 | 94.72 | 94.77 |
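Precision, recall, and F-score for segmentation are commonly computed over word spans, counting a predicted word as correct only when both of its boundaries match the gold segmentation. The sketch below assumes that convention and a balanced F-score (the harmonic mean of P and R); the record does not state the exact scoring procedure used in the study. The helper names `word_spans` and `prf` and the predicted segmentation in the example are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to its set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold: List[List[str]],
        pred: List[List[str]]) -> Tuple[float, float, float]:
    """Corpus-level precision, recall, and balanced F over exact word spans."""
    n_gold = n_pred = n_correct = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_spans = word_spans(gold_words)
        pred_spans = word_spans(pred_words)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        n_correct += len(gold_spans & pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Gold segmentation from the BMES example; the prediction is made up and
# wrongly merges 左 and 眼 into a single word.
gold = [["左", "眼", "视力", "下降", "数", "年"]]
pred = [["左眼", "视力", "下降", "数", "年"]]
p, r, f = prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

For joint CWS and POS tagging, the same idea extends by adding the POS label to each span tuple, so that a word counts as correct only if its boundaries and its tag both match.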
Effect of fine-grained CWS and POS tagging on NER for clinical text
| System | P(%) | R(%) | F(%) |
|---|---|---|---|
| Baseline | 88.83 | 88.74 | 88.79 |
| + CWS (CRF) | 89.28 | 88.79 | 89.13 |
| + CWS&POS (CRF) | 90.90 | 88.20 | 89.53 |
| + CWS (BiLSTM-CRF) | 89.31 | 88.75 | 89.03 |
| + CWS&POS (BiLSTM-CRF) | 89.65 | 88.63 | 89.14 |