Haodi Li, Buzhou Tang, Qingcai Chen, Kai Chen, Xiaolong Wang, Baohua Wang, Zhe Wang.
Abstract
In this article, we propose an end-to-end system for the BioCreative V challenge tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction, where DNER comprises disease mention recognition (DMR) and disease normalization (DN). Evaluation on the challenge corpus showed that our system achieved the highest F1-scores of 86.93% on DMR, 84.11% on DN and 43.04% on CID relation extraction, respectively. The F1-score on DMR is higher than our previous result reported by the challenge organizers (86.76%), which was the highest F1-score of the challenge.

Database URL: http://database.oxfordjournals.org/content/2016/baw077
Year: 2016 PMID: 27270713 PMCID: PMC4911788 DOI: 10.1093/database/baw077
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Example of annotated records.
Statistics of the dataset for the CDR task of BioCreative V.
| Datasets | # DOC | # chemical mentions | # chemical IDs | # disease mentions | # disease IDs | # CID relations |
|---|---|---|---|---|---|---|
| T&D | 1000 | 10550 | 2973 | 8426 | 3829 | 2050 |
| Test | 500 | 5385 | 1435 | 4424 | 1988 | 1066 |
Features used in two individual sequence labeling modules: CRFs and SSVMs.
| Feature | Description |
|---|---|
| Bag-of-words | Unigrams: |
| | Bigrams: |
| | Trigrams: |
| Part-of-speech (POS) tags | Unigrams: |
| | Bigrams: |
| | Trigrams: |
| Combinations of tokens and POS tags | |
| Sentence information | Length of the current sentence; whether the current sentence contains any unmatched bracket. |
| Affixes | Prefixes and suffixes of lengths 1 to 5. |
| Orthographical features | Whether the current word is an uppercase word, contains a digit, has uppercase characters inside, etc. |
| Word shapes | Each character, or each run of consecutive characters, of the same type in the current word is replaced by ‘A’ (uppercase), ‘a’ (lowercase), ‘#’ (digit) or ‘-’ (other). |
| Section information | Which section the current word belongs to: title or abstract. |
| Word representation features [5] | Brown clustering ( |
| Dictionary features | Chemical dictionary: CTD, DrugBank, MeSH, Pharmacogenetics Knowledge Base (PharmGKB) ( |
| | Disease dictionary: CTD, MeSH, UMLS, disease ontology ( |
| Frequency features | Whether the frequency of the current word is higher than a given threshold (4 in our system) and its inverse document frequency is lower than another threshold (0.1 in our system). |
| Character N-grams | Character N-grams (N = 1, 2, …, 4) within the current word. |
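
Three of the token-level features in the table above (affixes, word shapes and character N-grams) are simple enough to sketch in a few lines. The snippet below is only an illustration of how such features could be computed from the descriptions given; the function names and the `collapse` flag are hypothetical and are not taken from the authors' implementation.

```python
import re


def word_shape(word: str, collapse: bool = False) -> str:
    """Map uppercase -> 'A', lowercase -> 'a', digit -> '#', other -> '-'.

    With collapse=True, runs of the same shape character are merged,
    which corresponds to the "consecutive characters" variant.
    """
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("#")
        else:
            shape.append("-")
    full = "".join(shape)
    if collapse:
        # Replace any run of a repeated shape character with a single one.
        return re.sub(r"(.)\1+", r"\1", full)
    return full


def affixes(word: str, max_len: int = 5):
    """Return (prefixes, suffixes) of lengths 1..max_len, capped at len(word)."""
    n = min(max_len, len(word))
    prefixes = [word[:k] for k in range(1, n + 1)]
    suffixes = [word[-k:] for k in range(1, n + 1)]
    return prefixes, suffixes


def char_ngrams(word: str, max_n: int = 4):
    """Return all character N-grams of the word for N = 1..max_n."""
    return [
        word[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(word) - n + 1)
    ]


if __name__ == "__main__":
    print(word_shape("IL-2"), word_shape("Neuropathy", collapse=True))
    print(affixes("tumor"))
    print(char_ngrams("abc", max_n=2))
```

In a CRF or SSVM pipeline, each returned string would typically be turned into a named binary feature for the current token; how the original system encoded them is not stated here.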
Figure 2. Workflow of our normalization module.
Figure 3. An example of DN for the disease mention (DM) "axonal neuropathy" using dictionary look-up.
Results of our system on CMR and DMR (%).
| Method | Chemical P | Chemical R | Chemical F1 | Disease P | Disease R | Disease F1 |
|---|---|---|---|---|---|---|
| | NA | NA | NA | 81.62 | 78.91 | 80.24 |
| | 93.08 | 84.53 | 88.60 | NA | NA | NA |
| | 94.25 | 90.44 | 92.30 | 88.37 | 85.23 | 86.78 |
| | 94.58 | 91.35 | 92.93 | 87.74 | 86.05 | 86.88 |
| | 95.05 | 90.96 | 92.96 | 88.68 | 85.23 | 86.93 |
Results of our system on CN and DN (%).
| Method | Chemical P | Chemical R | Chemical F1 | Disease P | Disease R | Disease F1 |
|---|---|---|---|---|---|---|
| | NA | NA | NA | 42.71 | 67.46 | 52.30 |
| | 95.02 | 81.11 | 87.52 | NA | NA | NA |
| | NA | NA | NA | 81.15 | 80.13 | 80.64 |
| | NA | NA | NA | 81.25 | 81.33 | 81.29 |
| | 87.15 | 90.73 | 88.90 | 77.89 | 80.43 | 79.14 |
| | 87.95 | 91.43 | 89.68 | 78.62 | 82.14 | 80.34 |
| | 87.83 | 90.03 | 88.92 | 78.36 | 78.11 | 78.24 |
| | 93.48 | 90.94 | 92.19 | 88.64 | 80.03 | 84.11 |
Results of our system on the CID relation subtask (%).
| Method | P | R | F1 |
|---|---|---|---|
| | 16.43 | 76.45 | 27.05 |
| | 18.51 | 77.65 | 29.89 |
| | 53.82 | 34.33 | 41.92 |
| | 54.61 | 34.99 | 42.65 |
| | 55.83 | 34.15 | 42.37 |
| | 57.93 | 34.24 | 43.04 |
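
The F1 columns in the result tables follow from the standard definition of the F1-score as the harmonic mean of precision (P) and recall (R). As a quick sanity check, the sketch below recomputes F1 for the first and last rows of the CID table; the helper name `f1_score` is illustrative only. Because the published P and R values are themselves rounded to two decimals, a recomputed F1 for other rows may differ from the reported value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (P, R) pairs in percent, copied from the first and last CID rows.
    for p, r in [(16.43, 76.45), (57.93, 34.24)]:
        print(f"P={p:.2f} R={r:.2f} F1={f1_score(p, r):.2f}")
```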