Gamal Crichton, Sampo Pyysalo, Billy Chiu, Anna Korhonen.
Abstract
BACKGROUND: Named Entity Recognition (NER) is a key task in biomedical text mining. Accurate NER systems require task-specific, manually annotated datasets, which are expensive to develop and thus limited in size. Since such datasets contain related but different information, an interesting question is whether they can be used together to improve NER performance. To investigate this, we develop supervised, multi-task, convolutional neural network models and apply them to a large number of varied existing biomedical named entity datasets. Additionally, we investigate the effect of dataset size on performance in both single- and multi-task settings.
Keywords: Biomedical text mining; Convolutional neural networks; Multi-task learning; Named entity recognition
Year: 2017 PMID: 28810903 PMCID: PMC5558737 DOI: 10.1186/s12859-017-1776-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The datasets and details of their annotations
| Dataset | Contents | Entity counts |
|---|---|---|
| AnatEM | Anatomy NE | 13,701 |
| BC2GM | Gene/Protein NE | 24,583 |
| BC4CHEMD | Chemical NE | 84,310 |
| BC5CDR | Chemical, Disease NEs | Chemical: 15,935; Disease: 12,852 |
| BioNLP09 | Gene/Protein NE | 14,963 |
| BioNLP11EPI | Gene/Protein NE | 15,811 |
| BioNLP11ID | 4 NEs | Gene/Protein: 6551; Organism: 3471; Chemical: 973; Regulon-operon: 87 |
| BioNLP13CG | 16 NEs | Gene/Protein: 7908; Cell: 3492; Cancer: 2582; Chemical: 2270; Organism: 1715; Multi-tissue structure: 857; Tissue: 587; Cellular component: 569; Organ: 421; Organism substance: 283; Pathological formation: 228; Amino acid: 135; Immaterial anatomical entity: 102; Organism subdivision: 98; Anatomical system: 41; Developing anatomical structure: 35 |
| BioNLP13GE | Gene/Protein NE | 12,057 |
| BioNLP13PC | 4 NEs | Gene/Protein: 10,891; Chemical: 2487; Complex: 1502; Cellular component: 1013 |
| CRAFT | 6 NEs | SO: 18,974; Gene/Protein: 16,064; Taxonomy: 6868; Chemical: 6053; CL: 5495; GO-CC: 4180 |
| Ex-PTM | Gene/Protein NE | 4698 |
| JNLPBA | 5 NEs | Gene/Protein: 35,336; DNA: 10,589; Cell Type: 8639; Cell Line: 4330; RNA: 1069 |
| Linnaeus | Species NE | 4263 |
| NCBI-Disease | Disease NE | 6881 |
| GENIA-PoS | PoS-Tagging | N/A |
Fig. 1 Single-task convolutional model
Fig. 2 Multi-output multi-task convolutional model
Fig. 3 Multi-task dependent convolutional model
Best positive effects
| Dataset | STM | Best MO-MTM | Best dataset |
|---|---|---|---|
| AnatEM | 81.55 | | NCBI-Disease |
| BC2GM | **72.63** | 72.21 | Ex-PTM |
| BC4CHEMD | **82.95** | 80.31 | BioNLP13GE |
| BC5CDR | 83.66 | | BioNLP11EPI |
| BioNLP09 | 83.90 | | BioNLP13GE |
| BioNLP11EPI | 77.72 | | BioNLP09 |
| BioNLP11ID | 81.50 | | BioNLP13GE |
| BioNLP13CG | 76.74 | | BioNLP13PC |
| BioNLP13GE | 73.28 | | BioNLP11EPI |
| BioNLP13PC | 80.61 | | Ex-PTM |
| CRAFT | **79.55** | 78.48 | BioNLP13GE |
| Ex-PTM | 68.56 | | BioNLP11EPI |
| JNLPBA | **69.60** | 68.92 | BioNLP13GE |
| Linnaeus | **83.98** | 83.63 | NCBI-Disease |
| NCBI-Disease | 80.26 | | Ex-PTM |
| Average | 78.43 | 78.81 | N/A |
Datasets in rightmost column are the auxiliary ones. (Bold: best scores, *: statistically significant)
Chemical group
| Dataset | STM | MO-MTM |
|---|---|---|
| BC4CHEMD | | 82.51 |
| BC5CDR | 87.02 | |
| BioNLP11ID | | 63.74 |
| BioNLP13CG | 66.40 | |
| BioNLP13PC | 74.53 | |
| CRAFT | | 74.83 |
| Average | 76.43 | 77.49 |
(Bold: best scores, *: statistically significant)
Species group
| Dataset | STM | MO-MTM |
|---|---|---|
| BioNLP11ID | 74.14 | |
| BioNLP13CG | 82.75 | |
| CRAFT | | 97.44 |
| Linnaeus | | 83.54 |
| Average | 84.65 | 86.13 |
(Bold: best scores, *: statistically significant)
Cellular component group
| Dataset | STM | MO-MTM |
|---|---|---|
| BioNLP13CG | 72.79 | |
| BioNLP13PC | 83.23 | |
| CRAFT | 61.04 | |
| Average | 72.35 | 74.18 |
(Bold: best scores, *: statistically significant)
Disease group
| Dataset | STM | MO-MTM |
|---|---|---|
| BC5CDR | | 80.39 |
| NCBI-Disease | 80.26 | |
| Average | 80.36 | 80.42 |
(Bold: best scores, *: statistically significant)
Cell group
| Dataset | STM | MO-MTM |
|---|---|---|
| BioNLP13CG | | 82.83 |
| CRAFT | | 86.89* |
| Average | 85.66 | 84.86 |
(Bold: best scores, *: statistically significant)
Gene/protein group
| Dataset | STM | MO-MTM |
|---|---|---|
| BC2GM | 72.63 | |
| BioNLP09 | 83.90 | |
| BioNLP11EPI | 77.72 | |
| BioNLP11ID | 86.20 | |
| BioNLP13CG | 83.40 | |
| BioNLP13GE | 73.28 | |
| BioNLP13PC | 83.21 | |
| CRAFT | 72.85 | |
| Ex-PTM | 68.56 | |
| JNLPBA | 69.60 | |
| Average | 77.14 | 79.43 |
(Bold: best scores, *: statistically significant)
Single task and multi-task f-scores on NER tasks
| Dataset | Baseline | STM | MO-MTM | D-MTM |
|---|---|---|---|---|
| AnatEM | 81.79 | 81.55 | 81.83 | |
| BC2GM | 70.31 | 72.63 | | 72.87 |
| BC4CHEMD | 81.08 | 82.95 | 82.37 | |
| BC5CDR | 83.11 | 83.66 | | 83.83 |
| BioNLP09 | 81.84 | 83.90 | | 84.10 |
| BioNLP11EPI | 74.98 | 77.72 | | 78.03* |
| BioNLP11ID | 81.44 | 81.50 | 80.58* | |
| BioNLP13CG | 75.23 | 76.74 | | 77.52* |
| BioNLP13GE | 72.49 | 73.28 | | 74.00* |
| BioNLP13PC | 79.35 | 80.61 | | 81.50* |
| CRAFT | 78.76 | 79.55 | 79.10 | |
| Ex-PTM | 65.75 | 68.56 | | 69.67* |
| JNLPBA | 67.43 | 69.60 | | 69.44 |
| Linnaeus | 79.01 | 83.98 | 81.57 | |
| NCBI-Disease | 79.09 | 80.26 | 79.02 | |
| Average | 76.78 | 78.43 | 79.26 | 78.79 |
(Bold: best scores, *: statistically significant compared to single-task model)
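The "Average" row appears to be an unweighted mean over the 15 datasets. As a quick sanity check, the sketch below recomputes it for the Baseline and STM columns, whose values are all present in the table. The function name `macro_average` is introduced here for illustration, and the assumption that the paper averages without weighting is ours.

```python
# F-scores copied from the Baseline and STM columns of the table above.
baseline = {
    "AnatEM": 81.79, "BC2GM": 70.31, "BC4CHEMD": 81.08, "BC5CDR": 83.11,
    "BioNLP09": 81.84, "BioNLP11EPI": 74.98, "BioNLP11ID": 81.44,
    "BioNLP13CG": 75.23, "BioNLP13GE": 72.49, "BioNLP13PC": 79.35,
    "CRAFT": 78.76, "Ex-PTM": 65.75, "JNLPBA": 67.43, "Linnaeus": 79.01,
    "NCBI-Disease": 79.09,
}
stm = {
    "AnatEM": 81.55, "BC2GM": 72.63, "BC4CHEMD": 82.95, "BC5CDR": 83.66,
    "BioNLP09": 83.90, "BioNLP11EPI": 77.72, "BioNLP11ID": 81.50,
    "BioNLP13CG": 76.74, "BioNLP13GE": 73.28, "BioNLP13PC": 80.61,
    "CRAFT": 79.55, "Ex-PTM": 68.56, "JNLPBA": 69.60, "Linnaeus": 83.98,
    "NCBI-Disease": 80.26,
}


def macro_average(scores):
    """Unweighted mean over datasets, rounded to two decimals (assumed convention)."""
    return round(sum(scores.values()) / len(scores), 2)


baseline_average = macro_average(baseline)
stm_average = macro_average(stm)
print(baseline_average, stm_average)
```

Both results agree with the reported averages (76.78 and 78.43), which supports reading the row as a simple mean across datasets.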
Effect of dataset size reduction on single-task and multi-task performance
| Dataset | STM (1.0) | STM (0.5) | MO-MTM (0.5) | STM (0.25) | MO-MTM (0.25) | STM (0.1) | MO-MTM (0.1) |
|---|---|---|---|---|---|---|---|
| AnatEM | **81.55** | | 78.35* | 74.82* | | | 63.15 |
| BC2GM | **72.63** | 70.27* | | | 67.14* | 63.07* | |
| BC4CHEMD | **82.95** | | 79.22*+ | | 76.26* | 71.94* | |
| BC5CDR | **83.66** | 81.15* | | 79.09* | | 74.47* | |
| BioNLP09 | **83.90** | 81.89* | | | 79.58* | 75.12* | |
| BioNLP11EPI | **77.72** | 74.00* | | 70.89* | | 67.63* | |
| BioNLP11ID | **81.50** | 76.65 | | 70.60* | | 68.19* | |
| BioNLP13CG | **76.74** | 70.58* | | 65.08* | | 51.61* | |
| BioNLP13GE | 73.28 | 73.32 | | 67.43 | | 52.66* | |
| BioNLP13PC | **80.61** | 75.39* | | 70.03* | | 57.62* | |
| CRAFT | **79.55** | 75.25* | | 72.19* | | 60.91* | |
| Ex-PTM | 68.56 | 62.81 | | 53.30* | | 47.01* | |
| JNLPBA | 69.60 | 68.34 | | 66.63* | | 62.80* | |
| Linnaeus | 83.98 | 80.08* | | 69.53* | | 39.44 | |
| NCBI-Disease | **80.26** | 76.51 | | 71.88* | | | 62.89* |
| Average | 78.43 | 75.01 | 78.24 | 70.41 | 75.47 | 61.73 | 68.64 |
(Bold: best scores for dataset, Italic: better score for each setting, *: statistically significant compared to full single-task model, +: statistically significant compared to corresponding single-task model)
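The "Average" row of this table is enough to see how the multi-task advantage changes as training data shrinks. The following sketch computes the average MO-MTM gain over STM at each size setting, plus the average STM loss relative to the full-size model. It is a small arithmetic illustration using only the reported averages; the variable and function names are ours.

```python
# Average F-scores from the "Average" row, keyed by dataset-size setting.
averages = {
    0.5: {"STM": 75.01, "MO-MTM": 78.24},
    0.25: {"STM": 70.41, "MO-MTM": 75.47},
    0.1: {"STM": 61.73, "MO-MTM": 68.64},
}
FULL_STM_AVERAGE = 78.43  # STM average at size setting 1.0


def multitask_gain(avg):
    """Average MO-MTM minus STM F-score at each size setting."""
    return {size: round(s["MO-MTM"] - s["STM"], 2) for size, s in avg.items()}


def stm_loss(avg, full):
    """Average STM F-score lost relative to the full-size STM."""
    return {size: round(full - s["STM"], 2) for size, s in avg.items()}


gains = multitask_gain(averages)
losses = stm_loss(averages, FULL_STM_AVERAGE)
for size in sorted(averages, reverse=True):
    print(f"size {size}: MO-MTM gain {gains[size]:.2f}, STM loss {losses[size]:.2f}")
```

On these averages the multi-task gain grows as the size setting decreases, while the single-task model loses progressively more relative to its full-size score.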
Comparison to benchmark results
| Corpus | Benchmark | Ours | Matching criteria |
|---|---|---|---|
| AnatEM | 91.61 | 88.55 | Right boundary match |
| BC2GM | 87.21 | 84.41 | Alternative boundaries |
| BC4CHEMD | 87.39 | 82.32 | Exact |
| BC5CDR | 86.76 | 83.87 | Exact |
| JNLPBA | 72.55 | 68.95 | Exact |
| Linnaeus | 95.68* | 79.33 | Exact |
| NCBI-Disease | 82.9 | 77.82 | Exact |
(*: see text for caveat regarding comparability)
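The "Matching criteria" column matters because the same predictions can score quite differently under different criteria. The sketch below contrasts exact matching with right-boundary matching on a toy example. The definitions used are common conventions assumed here rather than taken from the paper: exact match requires identical start, end, and type, while right-boundary match requires only the same end offset and type. The `Span` type and all function names are hypothetical, and the "Alternative boundaries" criterion used for BC2GM is not covered.

```python
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token of the mention
    end: int    # index one past the last token of the mention
    label: str  # entity type


def exact_match(pred: Span, gold: Span) -> bool:
    """Both boundaries and the entity type must agree."""
    return pred == gold


def right_boundary_match(pred: Span, gold: Span) -> bool:
    """Only the right boundary and the entity type must agree (assumed definition)."""
    return pred.end == gold.end and pred.label == gold.label


def f_score(predicted: List[Span], gold: List[Span],
            match: Callable[[Span, Span], bool]) -> float:
    """F1 under a matching criterion; each gold span is matched at most once."""
    unmatched = list(gold)
    true_positives = 0
    for pred in predicted:
        for candidate in unmatched:
            if match(pred, candidate):
                unmatched.remove(candidate)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the first prediction misses the leading token of the gold mention.
gold_spans = [Span(0, 3, "Anatomy"), Span(5, 7, "Anatomy")]
pred_spans = [Span(1, 3, "Anatomy"), Span(5, 7, "Anatomy")]

exact_f = f_score(pred_spans, gold_spans, exact_match)
right_f = f_score(pred_spans, gold_spans, right_boundary_match)
print(exact_f, right_f)
```

In this toy case the left-boundary error is penalised under exact matching (F1 of 0.5) but not under right-boundary matching (F1 of 1.0), so scores in the table are only comparable within the same criterion.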