Hong-Jie Dai, Onkar Singh, Jitendra Jonnagaddala, Emily Chia-Yu Su.
Abstract
In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were employed: one to recognize gene mentions and one to normalize those mentions to their Entrez Gene IDs, using a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery.
Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus
Year: 2016 PMID: 27465130 PMCID: PMC4962763 DOI: 10.1093/database/baw111
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Fig. 1. Workflow of the developed modules.
Examples of prefixed species
| Symbol | Taxonomy ID | Full name |
|---|---|---|
| H | 9606 | human, |
| Zm | 381124 | |
| hum, hsa | 9606 | human, |
| Ath | 3701 | |
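The prefix table above can be read as a lookup from a gene-name prefix to a Taxonomy ID. The following is a minimal sketch of such a lookup, not the authors' implementation: the function name `prefixed_species`, the example mentions, and the matching heuristic (a prefix counts only when followed by a hyphen or an uppercase letter, with longer prefixes tried first) are all assumptions made for illustration.

```python
import re

# Prefix-to-Taxonomy-ID pairs copied from the table above.
PREFIX_TO_TAXID = {
    "H": 9606,
    "Zm": 381124,
    "hum": 9606,
    "hsa": 9606,
    "Ath": 3701,
}


def prefixed_species(gene_mention):
    """Return (prefix, taxonomy_id) if the mention starts with a known prefix.

    Assumed heuristic (not taken from the paper): the prefix must be
    followed by a hyphen or an uppercase letter, and longer prefixes
    are tried before shorter ones. Returns None when nothing matches.
    """
    for prefix in sorted(PREFIX_TO_TAXID, key=len, reverse=True):
        if re.match(re.escape(prefix) + r"(-|[A-Z])", gene_mention):
            return prefix, PREFIX_TO_TAXID[prefix]
    return None


# Illustrative mentions only; they are not drawn from the ISN corpus.
for mention in ["hsa-miR-21", "ZmPIN1", "BRCA1"]:
    print(mention, prefixed_species(mention))
```

A one-letter prefix such as `H` matches any mention that begins with two capital letters under this heuristic, which suggests why prefix recognition would need context beyond a plain dictionary lookup.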
Fig. 2. Annotations for the article PMID: 9603950.
Fig. 3. Annotations for the article PMID: 11210186.
Performance of the species recognizer module on the ISN corpus
| Configuration | Recognition P | Recognition R | Recognition F | Normalization P | Normalization R | Normalization F |
|---|---|---|---|---|---|---|
| 1. SR | 0.968 | 0.869 | 0.916 | 0.968 | 0.862 | 0.912 |
| 2. NLP+SR | 0.962 | 0.874 | 0.916 | 0.962 | 0.868 | 0.912 |
| 3. NLP+G/P+SR | ||||||
| 4. NLP+G/P+SR-Sentence | 0.970 | 0.949 | 0.960 | 0.968 | 0.940 | 0.953 |
| LINNAEUS | 0.970 | 0.811 | 0.884 | 0.946 | 0.785 | 0.858 |
| SPECIES | 0.932 | 0.839 | 0.883 | 0.932 | 0.832 | 0.880 |
| 1. SR | 0.963 | 0.820 | 0.886 | 0.957 | 0.814 | 0.880 |
| 2. NLP+SR | 0.966 | 0.823 | 0.889 | | 0.817 | 0.883 |
| 3. NLP+G/P+SR | 0.965 | 0.960 | ||||
| 4. NLP+G/P+SR-Sentence | | 0.917 | 0.941 | 0.952 | 0.900 | 0.925 |
| LINNAEUS | 0.951 | 0.764 | 0.847 | 0.918 | 0.734 | 0.816 |
| SPECIES | 0.921 | 0.8 | 0.856 | 0.919 | 0.795 | 0.852 |
The best PRF-scores are highlighted in bold. P, precision; R, recall; F, F-measure; NLP, natural language processing; G/P, gene/protein recognizer module; SR, species recognizer module.
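The F column can be checked against P and R. The sketch below assumes that F is the balanced F-measure (the harmonic mean of precision and recall); the source only says "F-measure", so this is an assumption, although it reproduces the first "1. SR" row. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the first "1. SR" row of the table above.
print(round(f_measure(0.968, 0.869), 3))  # recognition
print(round(f_measure(0.968, 0.862), 3))  # normalization
```

Because the table reports P and R already rounded to three decimals, a recomputed F can differ from the published value in the last digit for some rows.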
Comparison of the species recognition performance of the developed module with LINNAEUS and SPECIES on the Linnaeus and Species-800 corpora
| Corpus | Tool | Normalization P | Normalization R | Normalization F |
|---|---|---|---|---|
| Linnaeus | LINNAEUS | 0.887 | 0.818 | 0.851 |
| Linnaeus | SPECIES | | | |
| Linnaeus | Our module | 0.892 | 0.728 | 0.802 |
| Species-800 | LINNAEUS | | | |
| Species-800 | SPECIES | 0.839 | 0.726 | 0.778 |
| Species-800 | Our module | 0.775 | 0.748 | 0.761 |
P, precision; R, recall; F, F-measure.
Comparison of the species recognizer tools’ performances on the DECA corpus
| Tool | Normalization P | Normalization R | Normalization F |
|---|---|---|---|
| LINNAEUS | 0.668 | 0.521 | 0.585 |
| LINNAEUS+ | 0.733 | 0.614 | 0.668 |
| SPECIES | 0.742 | 0.633 | 0.683 |
| Our module | | | |
LINNAEUS: Run with LINNAEUS’s default species matcher and post-processor, which recognizes species terms from the 10 000 most frequently occurring species in MEDLINE.
LINNAEUS+: Run with entity type dictionary packs downloaded from http://linnaeus.sourceforge.net/. The packs contain updated dictionary files and support normalization of genus names and post-processing instructions.
Download links for the developed BioC-compatible modules
| Resource description | Download link |
|---|---|
| BioC-C# Implementation | |
| Species Recognizer | |
| Gene/Protein Recognizer/Normalizer | |
| Instance-level Species Normalization Corpus | |