Lenz Furrer, Anna Jancso, Nicola Colic, Fabio Rinaldi.
Abstract
BACKGROUND: We present a text-mining tool for recognizing biomedical entities in scientific literature. OGER++ is a hybrid system for named entity recognition and concept recognition (linking), which combines a dictionary-based annotator with a corpus-based disambiguation component. The annotator uses an efficient look-up strategy combined with a normalization method for matching spelling variants. The disambiguation classifier is implemented as a feed-forward neural network which acts as a postfilter to the previous step.
Keywords: Concept recognition; Machine learning; Named entity recognition; Natural language processing
Year: 2019 PMID: 30666476 PMCID: PMC6689863 DOI: 10.1186/s13321-018-0326-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1. Term indexing using two hash tables. The examples illustrate how dictionary entries are indexed (left) and how the look-up is performed (right)
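The figure itself is not reproduced here, so the following is only a minimal sketch of how a two-table term index could work. The division of labour between the tables (one keyed by the first token, one keyed by the full normalized token sequence), the `normalize` function, the `TermIndex` class and the `ID:1` identifier are all assumptions made for illustration, not the OGER++ implementation.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and split into alphanumeric tokens (an assumed normalization)."""
    return tuple(re.findall(r"[a-z0-9]+", text.lower()))


class TermIndex:
    """Hypothetical dictionary index built from two hash tables."""

    def __init__(self):
        # Table 1: first token -> lengths (in tokens) of terms starting with it.
        self.lengths_by_first = defaultdict(set)
        # Table 2: full normalized token sequence -> concept identifiers.
        self.concepts_by_term = defaultdict(set)

    def add(self, term, concept_id):
        tokens = normalize(term)
        if not tokens:
            return
        self.lengths_by_first[tokens[0]].add(len(tokens))
        self.concepts_by_term[tokens].add(concept_id)

    def lookup(self, text):
        """Return (start_token, end_token, concept_ids) for every match."""
        tokens = normalize(text)
        hits = []
        for start, first in enumerate(tokens):
            # Only try the term lengths known to begin with this token.
            for n in sorted(self.lengths_by_first.get(first, ())):
                candidate = tokens[start:start + n]
                if len(candidate) == n and candidate in self.concepts_by_term:
                    ids = sorted(self.concepts_by_term[candidate])
                    hits.append((start, start + n, ids))
        return hits


index = TermIndex()
index.add("IL-2", "ID:1")           # hypothetical dictionary entries
index.add("interleukin 2", "ID:1")
print(index.lookup("Levels of Il 2 and Interleukin-2 rose"))
```

Because both the dictionary terms and the input text pass through the same `normalize` step, spelling variants such as "IL-2" and "Il 2" map to the same key. Offsets in this sketch are token positions, not character positions.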
Fig. 2. Example illustrating the disambiguation procedure. The corpus-based postfilter accepts, rejects, or reclassifies annotations from the upstream concept-recognition module
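The three outcomes named in the caption can be sketched as a small filtering function. This is an illustration under stated assumptions: the `NIL` label, the annotation dictionaries, and the rule that a differing prediction overwrites the dictionary type are hypothetical, and the classifier is replaced by a fixed stub.

```python
NIL = "NIL"  # hypothetical label meaning "not an entity"


def postfilter(annotations, predict_label):
    """Accept, reject, or reclassify dictionary annotations.

    annotations:   list of dicts with at least a 'type' key
    predict_label: callable returning an entity type or NIL for an annotation
    """
    kept = []
    for ann in annotations:
        label = predict_label(ann)
        if label == NIL:
            continue  # reject: the classifier sees no entity here
        if label == ann["type"]:
            kept.append(dict(ann, decision="accept"))
        else:
            # reclassify: keep the span but adopt the predicted type
            kept.append(dict(ann, type=label, decision="reclassify"))
    return kept


def stub_predict(ann):
    # Stand-in for the neural classifier: a fixed table of made-up predictions.
    return {"HeLa": "cell line", "lead": NIL, "CAT": "gene/protein"}[ann["text"]]


candidates = [
    {"text": "HeLa", "type": "cell line"},
    {"text": "lead", "type": "chemical"},
    {"text": "CAT", "type": "organism"},
]
filtered = postfilter(candidates, stub_predict)
for ann in filtered:
    print(ann["text"], ann["type"], ann["decision"])
```

Keeping the filter separate from the predictor means the dictionary annotator can run with or without disambiguation, which matches the "no disambiguation" configuration reported in the timing table.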
Fig. 3. Percentage of terms occurring in Hunspell
Fig. 4. Architecture of the NN
Fig. 5. System architecture of the OGER++ server
Average processing time analysis for different document formats and sizes
| Size | Format | Documents | doc/s | kiB/s (micro) | ann/s (micro) | kiB/s (macro) | ann/s (macro) | ann/doc |
|---|---|---|---|---|---|---|---|---|
| Abstracts | txt | 1000 | 9.73 | 8.27 | 462.96 | 11.75 | 239.70 | 47.56 |
| Abstracts | xml | 1000 | 9.45 | 57.26 | 449.43 | 222.44 | 241.55 | 47.56 |
| Full-text | txt | 529 | 0.89 | 16.97 | 866.89 | 18.95 | 621.95 | 979.09 |
| Full-text | xml | 529 | 0.88 | 47.44 | 862.01 | 64.16 | 620.00 | 979.37 |
| Full-text (no disambiguation) | txt | 529 | 17.82 | 341.64 | 28072.22 | 350.24 | 18569.08 | 1575.27 |
For kiB/s and ann/s, micro- and macro-average are given separately
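The record does not spell out how the two averages are computed. Under the usual reading, the micro-average divides the pooled total by the pooled time, while the macro-average is the mean of the per-document rates. The sketch below follows that assumption; the function name and the sample numbers are invented for illustration.

```python
def micro_macro_rate(amounts, seconds):
    """Return (micro, macro) throughput for per-document measurements.

    amounts: per-document quantity (e.g. kiB or annotation count)
    seconds: per-document processing time
    """
    # Micro: pool everything, then divide once.
    micro = sum(amounts) / sum(seconds)
    # Macro: compute a rate per document, then average the rates.
    macro = sum(a / s for a, s in zip(amounts, seconds)) / len(amounts)
    return micro, macro


# Two made-up documents: a small one (1 kiB in 1 s) and a large one (9 kiB in 3 s).
print(micro_macro_rate([1.0, 9.0], [1.0, 3.0]))
```

Large documents dominate the micro-average, whereas every document counts equally in the macro-average, so the two figures diverge when document sizes vary widely.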
Fig. 6Name overlap among different entity types. The figures in each row denote the percentage of names with this type that are also annotated with the type of the respective column. For example, of all mentions annotated as cell line, close to 39% also have a gene/protein annotation, while only 9% of the gene-annotated mentions also have an annotation as cell line
Evaluation at the level of NER
| Entity type | Precision (OG) | Precision (OG + Dist) | Precision (OG + Joint) | Recall (OG) | Recall (OG + Dist) | Recall (OG + Joint) | F1 (OG) | F1 (OG + Dist) | F1 (OG + Joint) |
|---|---|---|---|---|---|---|---|---|---|
| All | 0.44 | 0.88 | 0.800 | 0.62 | 0.58 | 0.645 | 0.51 | 0.70 | 0.714 |
| Chemicals | 0.44 | 0.89 | 0.870 | 0.73 | 0.68 | 0.726 | 0.55 | 0.77 | 0.792 |
| Cells | 0.88 | 0.88 | 0.738 | 0.77 | 0.67 | 0.748 | 0.80 | 0.76 | 0.743 |
| BPMFs | 0.39 | 0.78 | 0.628 | 0.25 | 0.22 | 0.349 | 0.30 | 0.35 | 0.449 |
| Cellular components | 0.51 | 0.91 | 0.867 | 0.60 | 0.56 | 0.658 | 0.55 | 0.70 | 0.748 |
| Organisms | 0.29 | 0.98 | 0.977 | 0.92 | 0.91 | 0.920 | 0.44 | 0.94 | 0.948 |
| Proteins | 0.49 | 0.86 | 0.778 | 0.84 | 0.75 | 0.812 | 0.62 | 0.80 | 0.795 |
| Sequences | 0.46 | 0.89 | 0.833 | 0.67 | 0.64 | 0.670 | 0.54 | 0.75 | 0.743 |
Evaluation at the level of concept recognition
| Entity type | Precision (OG) | Precision (OG + Dist) | Precision (OG + Joint) | Recall (OG) | Recall (OG + Dist) | Recall (OG + Joint) | F1 (OG) | F1 (OG + Dist) | F1 (OG + Joint) |
|---|---|---|---|---|---|---|---|---|---|
| All | 0.32 | 0.51 | 0.650 | 0.52 | 0.49 | 0.503 | 0.40 | 0.50 | 0.567 |
| Chemicals | 0.28 | 0.59 | 0.601 | 0.61 | 0.57 | 0.568 | 0.39 | 0.58 | 0.584 |
| Cells | 0.88 | 0.87 | 0.878 | 0.72 | 0.66 | 0.713 | 0.79 | 0.75 | 0.787 |
| BPMFs | 0.35 | 0.72 | 0.634 | 0.19 | 0.17 | 0.178 | 0.25 | 0.27 | 0.278 |
| Cellular components | 0.49 | 0.87 | 0.930 | 0.59 | 0.56 | 0.581 | 0.54 | 0.68 | 0.716 |
| Organisms | 0.16 | 0.49 | 0.486 | 0.71 | 0.70 | 0.709 | 0.26 | 0.58 | 0.577 |
| Proteins | 0.45 | 0.84 | 0.788 | 0.83 | 0.74 | 0.799 | 0.59 | 0.79 | 0.794 |
| Sequences | 0.27 | 0.59 | 0.561 | 0.53 | 0.51 | 0.516 | 0.36 | 0.54 | 0.537 |
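The F1 columns in both evaluation tables can be spot-checked against the precision and recall columns, assuming F1 is the usual harmonic mean of the two. The helper name `f1` is chosen here for illustration; the input values are the OG + Joint figures from the "All" rows.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "All" rows, OG + Joint configuration
print(round(f1(0.800, 0.645), 3))  # NER level
print(round(f1(0.650, 0.503), 3))  # concept-recognition level
```

Because the tabulated precision and recall are themselves rounded, a recomputed F1 may differ from the printed value in the last digit for some rows.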