Marco Basaldella, Lenz Furrer, Carlo Tasso, Fabio Rinaldi.
Abstract
BACKGROUND: This article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles.
Keywords: Machine learning; Named entity recognition; Natural language processing; Text mining
Year: 2017 PMID: 29122011 PMCID: PMC5679148 DOI: 10.1186/s13326-017-0157-6
Source DB: PubMed Journal: J Biomed Semantics
Feature sets: features used by the NN and CRF (see the “Features” section for details)
| Feature | Neural network | Conditional random fields |
|---|---|---|
| Implementation | | |
| Software | R [ | CRFSuite [ |
| Model parameters | 1 hidden layer of size 2×( | Training algorithm: averaged perceptron, default epsilon, 2-word window |
| Input | n-grams selected by OGER | Single tokens |
| Features | | |
| Candidate character count | Count | — |
| Candidate is all uppercase | Label yes/no | Label yes/no |
| Candidate is all lowercase | Label yes/no | Label yes/no |
| Candidate contains Greek letter name (e.g. “alpha”, | Label yes/no | Label yes/no |
| Candidate contains dashes (‘-’) | Count | Label yes/no |
| Candidate contains numbers | Count | Label yes/no |
| Candidate ends with a number | Label yes/no | Label yes/no |
| Candidate contains capital letter not in first position | Label yes/no | Label yes/no |
| Candidate contains lowercase characters | Count | Label yes/no |
| Candidate contains uppercase characters | Count | Label yes/no |
| Candidate contains spaces | Count | Label yes/no |
| Candidate contains symbols | Count | Label yes/no |
| 2-3 character affixes appearing in an ontology in [ | Normalized frequency | Label yes/no |
| Candidate is a symbol | — | Label yes/no |
| Candidate’s part-of-speech | — | Yes, using [ |
| Candidate’s stem | — | Yes, using [ |
| Candidate pre-selected by OGER | — | Yes (see the “ |
| Total features | 36 | About 2.8 million |
| Tagging speed | 1286 tokens/sec | 632 tokens/sec |
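Most of the surface features listed above are simple string tests. The following is a hypothetical sketch (not the authors' actual code, and the Greek-letter list is an assumed stand-in) of how such features might be computed in Python:

```python
def surface_features(candidate: str) -> dict:
    """Compute surface features for a candidate term, mirroring the
    feature set described in the table above (counts for the NN,
    which would be binarized into yes/no labels for the CRF)."""
    # Assumed (truncated in the source): spelled-out Greek letter names
    greek = ("alpha", "beta", "gamma", "delta")
    return {
        "char_count": len(candidate),
        "all_uppercase": candidate.isupper(),
        "all_lowercase": candidate.islower(),
        "contains_greek": any(g in candidate.lower() for g in greek),
        "dash_count": candidate.count("-"),
        "digit_count": sum(c.isdigit() for c in candidate),
        "ends_with_number": candidate[-1:].isdigit(),
        "inner_capital": any(c.isupper() for c in candidate[1:]),
        "lowercase_count": sum(c.islower() for c in candidate),
        "uppercase_count": sum(c.isupper() for c in candidate),
        "space_count": candidate.count(" "),
        "symbol_count": sum(not c.isalnum() and not c.isspace()
                            for c in candidate),
    }

print(surface_features("IL-2"))
```

Note that `str.isupper()` returns `True` for "IL-2" because it ignores uncased characters, matching the "candidate is all uppercase" feature's intent.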
Comparison of the NER performance obtained in this paper with the previous version of the system [11]
| System | Precision | Recall | F1 |
|---|---|---|---|
| OGER 2016 | 0.34 | 0.55 | 0.42 |
| OGER+Distiller 2016 | 0.85 | 0.37 | 0.51 |
| OGER | 0.59 | | 0.62 |
| OGER+Distiller NN | | 0.60 | |
| OGER+Distiller CRF | 0.69 | 0.49 | 0.58 |
| OGER+Distiller Mixed | 0.87 | 0.63 | 0.73 |
| Distiller CRF | 0.71 | 0.47 | 0.58 |
The best values were highlighted in boldface in the original article; values lost during extraction appear as blank cells
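The F1 column in these tables is the harmonic mean of precision and recall. A quick check against the OGER+Distiller Mixed row (small rounding differences against other published figures are possible, since the table reports values rounded to two decimals):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# OGER+Distiller Mixed row: P = 0.87, R = 0.63
print(round(f1_score(0.87, 0.63), 2))  # → 0.73
```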
Per-entity-type breakdown of the precision scores obtained by the different pipelines
| Entity type | OG (strict) | OG+NN (strict) | OG+CRF (strict) | OG (average) | OG+NN (average) | OG+CRF (average) |
|---|---|---|---|---|---|---|
| All | 0.59 | | 0.69 | 0.61 | | 0.80 |
| Chemicals | 0.44 | | 0.48 | 0.45 | | 0.50 |
| Cells | 0.88 | 0.88 | | 0.93 | 0.94 | |
| Biological processes/molecular functions | 0.39 | | 0.68 | 0.45 | | 0.73 |
| Cellular components | 0.51 | | 0.87 | 0.52 | | 0.90 |
| Organisms | 0.29 | | 0.82 | 0.29 | | 0.83 |
| Proteins | 0.49 | | 0.74 | 0.50 | | 0.80 |
| Sequences | 0.46 | | 0.23 | 0.48 | | 0.27 |
The best values were highlighted in boldface in the original article; values lost during extraction appear as blank cells
Per-entity-type breakdown of the recall scores obtained by the different pipelines
| Entity type | OG (strict) | OG+NN (strict) | OG+CRF (strict) | OG (average) | OG+NN (average) | OG+CRF (average) |
|---|---|---|---|---|---|---|
| All | | 0.60 | 0.50 | | 0.61 | 0.58 |
| Chemicals | | 0.68 | 0.26 | | 0.68 | 0.27 |
| Cells | 0.77 | 0.67 | | 0.77 | 0.71 | |
| Biological processes/molecular functions | 0.25 | 0.22 | | 0.29 | 0.25 | |
| Cellular components | 0.60 | 0.56 | | 0.61 | 0.58 | |
| Organisms | | 0.91 | 0.91 | | 0.91 | |
| Proteins | | 0.75 | 0.66 | | 0.75 | 0.72 |
| Sequences | | 0.64 | 0.08 | | 0.65 | 0.09 |
The best values were highlighted in boldface in the original article; values lost during extraction appear as blank cells
Per-entity-type breakdown of the F1 scores obtained by the different pipelines
| Entity type | OG (strict) | OG+NN (strict) | OG+CRF (strict) | OG (average) | OG+NN (average) | OG+CRF (average) |
|---|---|---|---|---|---|---|
| All | 0.62 | | 0.58 | 0.65 | | 0.67 |
| Chemicals | 0.55 | | 0.34 | 0.56 | | 0.35 |
| Cells | 0.80 | 0.76 | | 0.84 | 0.81 | |
| Biological processes/molecular functions | 0.30 | 0.35 | | 0.35 | 0.39 | |
| Cellular components | 0.55 | 0.70 | | 0.56 | 0.71 | |
| Organisms | 0.44 | | 0.87 | 0.45 | | 0.88 |
| Proteins | 0.62 | | 0.70 | 0.63 | | 0.76 |
| Sequences | 0.54 | | 0.12 | 0.57 | | 0.13 |
The best values were highlighted in boldface in the original article; values lost during extraction appear as blank cells
Performance of the presented systems in a CR evaluation, compared to results reported in [24]
| System | Precision | Recall | F1 |
|---|---|---|---|
| OGER | 0.32 | | 0.40 |
| OGER+Distiller NN | | 0.49 | |
| OGER+Distiller CRF | 0.49 | 0.29 | 0.37 |
| MMTx | 0.43 | 0.40 | 0.42 |
| MGrep | 0.48 | 0.12 | 0.19 |
| Concept Mapper | 0.48 | 0.34 | 0.40 |
| cTakes Dictionary Lookup | | 0.43 | 0.47 |
| cTakes Fast Lookup | 0.41 | 0.40 | 0.41 |
| NOBLE Coder | 0.44 | 0.43 | 0.43 |
Please note that, as stated in the “Concept recognition” section, the systems described in [24] are evaluated on the whole corpus, while we use 20 documents for testing and the remainder for training. The best values were highlighted in boldface in the original article; values lost during extraction appear as blank cells
Per-entity-type breakdown of Precision, Recall, and F1 obtained by the different pipelines in the CR evaluation
| Entity type | OG (P) | OG+NN (P) | OG+CRF (P) | OG (R) | OG+NN (R) | OG+CRF (R) | OG (F1) | OG+NN (F1) | OG+CRF (F1) |
|---|---|---|---|---|---|---|---|---|---|
| All | 0.32 | | 0.49 | | 0.49 | 0.29 | 0.40 | | 0.37 |
| Chemicals | 0.28 | 0.59 | | | 0.57 | 0.19 | 0.39 | | 0.32 |
| Cells | 0.88 | 0.87 | | | 0.66 | 0.68 | 0.79 | 0.75 | |
| Biological processes/molecular functions | 0.35 | 0.72 | | | 0.17 | 0.05 | 0.25 | | 0.10 |
| Cellular comp. | 0.49 | 0.87 | | | 0.56 | 0.52 | 0.54 | | 0.65 |
| Organisms | 0.16 | | 0.47 | | 0.70 | 0.67 | 0.26 | | 0.55 |
| Proteins | 0.45 | 0.84 | | | 0.74 | 0.64 | 0.59 | | 0.75 |
| Sequences | 0.27 | | 0.37 | | 0.51 | 0.06 | 0.36 | | 0.10 |
The best values were highlighted in boldface in the original article; values lost during extraction appear as blank cells