Buzhou Tang, Hongxin Cao, Yonghui Wu, Min Jiang, Hua Xu.
Abstract
BACKGROUND: Named entity recognition (NER) is an important task in clinical natural language processing (NLP) research. Machine learning (ML) based NER methods have shown good performance in recognizing entities in clinical text. Algorithms and features are two important factors that largely affect the performance of ML-based NER systems. Conditional Random Fields (CRFs), a sequential labelling algorithm, and Support Vector Machines (SVMs), which are based on large margin theory, are two typical machine learning algorithms that have been widely applied to clinical NER tasks. For features, syntactic and semantic information of context words has often been used in clinical NER systems. However, two approaches have not been extensively investigated for clinical text processing: Structural Support Vector Machines (SSVMs), an algorithm that combines the advantages of both CRFs and SVMs, and word representation features, which capture word-level back-off information learned from a large unlabelled corpus by unsupervised algorithms. Therefore, the primary goal of this study is to evaluate the use of SSVMs and word representation features in clinical NER tasks.
Year: 2013 PMID: 23566040 PMCID: PMC3618243 DOI: 10.1186/1472-6947-13-S1-S1
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Counts of different types of entities in training and test data sets used in this study.
| Data set | Problem | Treatment | Test | All concepts (N = 72,846) |
|---|---|---|---|---|
| Training (349 notes) | 11,968 | 8,500 | 7,369 | 27,837 |
| Test (477 notes) | 18,550 | 13,560 | 12,899 | 45,009 |
Figure 1. Examples of two different tag representations: BIO vs. BIESO.
Figure 2. A hierarchical structure fragment of 42 words.
Figure 3. A fragment of the semantic thesaurus of three words.
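The two tag representations named in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' code. It assumes the conventional reading of the tag letters (B = beginning, I = inside, E = end, S = single-token entity, O = outside), a hypothetical helper `encode_tags`, hypothetical `(start, end, type)` span tuples with an exclusive end index, and an invented example sentence.

```python
from typing import List, Tuple

# Hypothetical span format: (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def encode_tags(num_tokens: int, spans: List[Span], scheme: str = "BIO") -> List[str]:
    """Turn non-overlapping entity spans into per-token BIO or BIESO tags."""
    if scheme not in ("BIO", "BIESO"):
        raise ValueError("scheme must be 'BIO' or 'BIESO'")
    tags = ["O"] * num_tokens  # every token starts outside any entity
    for start, end, label in spans:
        if scheme == "BIESO" and end - start == 1:
            tags[start] = f"S-{label}"  # single-token entity
            continue
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens
        if scheme == "BIESO":
            tags[end - 1] = f"E-{label}"  # last token of a multi-token entity
    return tags


# Invented example sentence, not taken from the study's data.
tokens = ["severe", "chest", "pain", "treated", "with", "aspirin"]
spans = [(0, 3, "problem"), (5, 6, "treatment")]

print(encode_tags(len(tokens), spans, "BIO"))
# ['B-problem', 'I-problem', 'I-problem', 'O', 'O', 'B-treatment']
print(encode_tags(len(tokens), spans, "BIESO"))
# ['B-problem', 'I-problem', 'E-problem', 'O', 'O', 'S-treatment']
```

BIESO marks entity ends and single-token entities explicitly, so the tagger sees more label types per entity category than under BIO.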
Performance of SSVM- and CRF-based NER systems when different features and tag representations were used.
| Tags | Features | SSVMs - F(R/P)(%) | CRFs - F(R/P)(%) |
|---|---|---|---|
| BIO | Base | 84.89 (83.39/86.44) | 84.62 (82.35/87.01) |
| BIO | Base + Clustering | 85.22 (84.05/86.43) | 85.16 (82.94/87.50) |
| BIO | Base + Distributional | 85.19 (84.00/86.42) | 85.12 (82.80/87.58) |
| BIO | Base + Clustering + Distributional | | |
| BIESO | Base | 85.42 (83.60/87.31) | 85.04 (82.31/87.97) |
| BIESO | Base + Clustering | 85.74 (84.15/87.40) | 85.59 (83.16/88.16) |
| BIESO | Base + Distributional | 85.74 (84.16/87.38) | 85.35 (82.82/88.05) |
| BIESO | Base + Clustering + Distributional | | |
Results by entity type for the best-performing SSVM and CRF clinical entity recognition systems.
| Algorithm | Category | Exact matching: Recall (%) | Exact matching: Precision (%) | Exact matching: F-measure (%) | Inexact matching: Recall (%) | Inexact matching: Precision (%) | Inexact matching: F-measure (%) |
|---|---|---|---|---|---|---|---|
| SSVMs | Overall | 84.31 | 87.38 | | 91.78 | 93.03 | |
| SSVMs | Problem | 86.75 | 88.50 | 87.61 | 93.53 | 95.29 | 94.40 |
| SSVMs | Treatment | 85.72 | 89.27 | 87.46 | 91.45 | 95.17 | 93.27 |
| SSVMs | Test | 85.13 | 89.84 | 87.42 | 90.26 | 95.50 | 92.81 |
| CRFs | Overall | 83.30 | 88.20 | | 90.52 | 93.96 | |
| CRFs | Problem | 85.73 | 89.02 | 87.34 | 92.46 | 96.12 | 94.25 |
| CRFs | Treatment | 84.14 | 89.88 | 86.92 | 89.99 | 96.03 | 92.92 |
| CRFs | Test | 84.07 | 90.74 | 87.28 | 88.94 | 95.96 | 92.32 |
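The F-measure values in this table are consistent with the balanced harmonic mean of recall and precision, so the Overall F-measure cells, which are empty in this copy, can be approximated from the recall and precision that did survive. The sketch below assumes that balanced (F1) definition and uses a hypothetical helper `f_measure`. Because the inputs are already rounded to two decimals, recomputed values may differ from the published ones in the last digit.

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure (harmonic mean) of recall and precision, in the same units."""
    if recall + precision == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * recall * precision / (recall + precision)


# Recall/precision pairs (in %) copied from the Overall rows of the table.
overall = {
    ("SSVMs", "exact"): (84.31, 87.38),
    ("SSVMs", "inexact"): (91.78, 93.03),
    ("CRFs", "exact"): (83.30, 88.20),
    ("CRFs", "inexact"): (90.52, 93.96),
}

for (algorithm, matching), (recall, precision) in overall.items():
    # Recomputed from rounded inputs, so treat the result as approximate.
    print(f"{algorithm:5s} {matching:7s} F = {f_measure(recall, precision):.2f}")
```

For the SSVMs exact-matching row this gives roughly 85.8, in line with the figure reported for "Our system" in the comparison with other systems.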
Comparison between our system and other state-of-the-art systems.
| Systems | Algorithm | Exact matching F-measure (%) |
|---|---|---|
| Our system | SSVMs | 85.8 |
| deBruijn et al | Semi-Markov | 85.2 |
| Jiang et al | CRFs | 83.9 |
| Kang et al | CRFs | 82.1 |
| Gurulingappa et al [37] | CRFs | 81.8 |
| Patrick et al [38] | CRFs | 81.3 |