Shengyu Liu, Buzhou Tang, Qingcai Chen, Xiaolong Wang, Xiaoming Fan.
Abstract
Drug name recognition (DNR) is a critical step in drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features, such as part-of-speech, word shape, and dictionary features. Current machine learning-based methods usually rely on singleton features, possibly because combining singleton features into conjunction features causes a feature explosion and introduces a large number of noisy features. However, a singleton feature captures only one linguistic characteristic of a word, which is not sufficient for DNR when multiple characteristics should be considered together. In this study, we explore feature conjunction and feature selection for DNR, which have not been reported before. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Chi-square, mutual information, and information gain are then used to mine effective features. Experimental results show that feature conjunction and feature selection improve the performance of the DNR system with a moderate number of features, and that our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge.
Year: 2015 PMID: 25861377 PMCID: PMC4377447 DOI: 10.1155/2015/913489
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Singleton feature templates.
| Number | Feature template |
|---|---|
| | Word feature |
| | POS |
| | Chunk |
| | Orthographical feature |
| | DrugBank |
| | FDA |
| | Jochem |
| | Word embeddings feature |
| | Prefix of length of 3 |
| | Prefix of length of 4 |
| | Prefix of length of 5 |
| | Suffix of length of 3 |
| | Suffix of length of 4 |
| | Suffix of length of 5 |
| | Word shape (generalized word class) |
| | Word shape (brief word class) |
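The affix and word shape templates in the table above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the shape mapping (uppercase to `A`, lowercase to `a`, digit to `0`, anything else to `-`, with the brief class collapsing repeated symbols) is a common convention assumed here; this record does not state the exact mapping the authors used.

```python
import re


def affixes(word, lengths=(3, 4, 5)):
    """Prefix and suffix features for each length the word can supply."""
    feats = {}
    for n in lengths:
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    return feats


def generalized_shape(word):
    """Map every character to a class symbol (assumed convention)."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append("-")
    return "".join(out)


def brief_shape(word):
    """Collapse runs of the same class symbol into a single symbol."""
    return re.sub(r"(.)\1+", r"\1", generalized_shape(word))


print(affixes("apigenin"))
print(generalized_shape("Aspirin-81"), brief_shape("Aspirin-81"))
```

Words shorter than a given affix length simply produce no feature for that length, which is one of several reasonable choices.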
Figure 1: Conjunction features for “apigenin.”
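The idea of a conjunction feature can be sketched in Python as follows. The abstract says singleton features are combined in two ways, but this record does not describe them, so the pairwise scheme below is only one plausible illustration. The function name `conjoin`, the `&` separator, and the POS tag `NN` are assumptions made for the example.

```python
from itertools import combinations


def conjoin(singletons, order=2):
    """Combine singleton features of one token into conjunction features.

    `singletons` maps a feature name to its value for a single token.
    Every combination of `order` distinct features becomes one new feature.
    """
    feats = []
    for combo in combinations(sorted(singletons), order):
        name = "&".join(combo)
        value = "&".join(singletons[k] for k in combo)
        feats.append(f"{name}={value}")
    return feats


# Hypothetical singleton features for the token "apigenin".
token = {"word": "apigenin", "pos": "NN", "suffix3": "nin"}
for feat in conjoin(token):
    print(feat)
```

With n singleton features per token, pairwise conjunction alone yields n(n-1)/2 new features per token, which illustrates why the feature count grows so quickly and why feature selection becomes useful.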
Statistics of the DDIExtraction 2013 corpus.
| | DrugBank Training | DrugBank Test | DrugBank Total | MEDLINE Training | MEDLINE Test | MEDLINE Total |
|---|---|---|---|---|---|---|
| Documents | 572 | 54 | 626 | 142 | 58 | 200 |
| Sentences | 5675 | 145 | 5820 | 1301 | 520 | 1821 |
| Drug | 8197 | 180 | 8377 | 1228 | 171 | 1399 |
| Group | 3206 | 65 | 3271 | 193 | 90 | 283 |
| Brand | 1423 | 53 | 1476 | 14 | 6 | 20 |
| No-human | 103 | 5 | 108 | 401 | 115 | 516 |
Experimental results of the DNR systems on the DDIExtraction 2013 corpus under strict matching criterion (%).
| Feature | Feature number | Precision | Recall | F-score |
|---|---|---|---|---|
| | 43935 | 84.75 | 72.89 | 78.37 |
| | 294782 | 86.64 | 71.87 | 78.57 |
| | 117913 | 88.37 | 72.01 | 79.36 |
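The three percentages reported for each system are consistent with precision, recall, and their harmonic mean (the balanced F-score). A short Python check, using values that appear in the tables of this record, makes the relationship explicit. The function name `f_score` is chosen here for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs taken from the results tables (in %).
for p, r in [(84.75, 72.89), (86.64, 71.87), (88.37, 72.01)]:
    print(p, r, round(f_score(p, r), 2))
```

For the pair 88.37 and 72.01, the result rounds to 79.36, which matches the strict F-score listed for "Our system" in the detailed comparison with WBI.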
Comparisons between our system and all systems in the DDIExtraction 2013 challenge (%).
| Method | Precision (strict) | Recall (strict) | F-score (strict) |
|---|---|---|---|
| Our system | 88.37 | 72.01 | 79.36 |
| WBI | 73.40 | 69.80 | 71.50 |
| NLM_LHC | 73.20 | 67.90 | 70.40 |
| LASIGE | 69.60 | 62.10 | 65.60 |
| UTurku | 73.70 | 57.90 | 64.80 |
| UC3M | 51.70 | 54.20 | 52.90 |
| UMCC_DLSI-(DDI) | 19.50 | 46.50 | 27.50 |
Detailed comparisons between WBI and our system (%).
| | WBI Precision | WBI Recall | WBI F-score | Our system Precision | Our system Recall | Our system F-score | Δ F-score |
|---|---|---|---|---|---|---|---|
| Strict | 73.40 | 69.80 | 71.50 | 88.37 | 72.01 | 79.36 | +7.86 |
| Exact | 85.50 | 81.30 | 83.30 | 93.38 | 76.09 | 83.85 | +0.55 |
| Type | 76.70 | 73.00 | 74.80 | 91.41 | 74.49 | 82.09 | +7.29 |
| Partial | 87.70 | 83.50 | 85.60 | 95.08 | 77.48 | 85.38 | −0.22 |
| Drug (strict) | 73.60 | 85.20 | 79.00 | 93.35 | 88.03 | 90.61 | +11.61 |
| Brand (strict) | 81.00 | 86.40 | 83.60 | 100.0 | 94.92 | 97.39 | +13.79 |
| Group (strict) | 79.20 | 76.10 | 77.60 | 90.15 | 76.77 | 82.92 | +5.32 |
| No-human (strict) | 31.40 | 9.10 | 14.10 | 90.91 | 8.26 | 15.14 | +1.04 |
Figure 2: Performance curves of DNR systems with different percentages of all features.
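Keeping only a percentage of the ranked features, as the figure varies, can be sketched in Python using the Chi-square statistic named in the abstract. This is a simplified illustration that assumes a binary feature and a binary class described by a 2x2 contingency table; the actual DNR label set and the authors' exact scoring procedure are not given in this record. The names `chi_square` and `top_percent` are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score from a 2x2 contingency table.

    a: feature present, class positive    b: feature present, class negative
    c: feature absent,  class positive    d: feature absent,  class negative
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_percent(scores, percent):
    """Return the highest-scoring `percent` of feature names (at least one)."""
    k = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical counts for three features over 20 tokens.
scores = {
    "suffix3=nin": chi_square(10, 0, 0, 10),
    "pos=NN": chi_square(7, 4, 3, 6),
    "shape=a": chi_square(5, 5, 5, 5),
}
print(top_percent(scores, 50))
```

Mutual information or information gain could be swapped in by replacing the scoring function while leaving the ranking step unchanged.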
Experimental results of the DNR systems using different features on the DDIExtraction 2013 corpus under strict matching criterion (%).
| Feature | Precision | Recall | F-score |
|---|---|---|---|
| | 81.04 | 63.56 | 71.24 |
| | 78.41 | 67.78 | 72.71 |
| | 86.05 | 69.24 | 76.74 |
| | 84.69 | 66.91 | 74.76 |
| | 83.84 | 71.87 | 77.39 |
| | 82.70 | 69.68 | 75.63 |
| | 85.56 | 69.97 | 76.98 |
| | 84.75 | 72.89 | 78.37 |