Chih-Hsuan Wei, Hung-Yu Kao, Zhiyong Lu.
Abstract
The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available, and these tools are typically restricted to detecting either mention-level gene names or document-level gene identifiers, but not both. In this work, we report GNormPlus: an end-to-end, open-source system that handles both gene mention and identifier detection. To support the development of GNormPlus, we created a new corpus of 694 PubMed articles containing manual annotations not only for gene names and their identifiers, but also for closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator.
Year: 2015 PMID: 26380306 PMCID: PMC4561873 DOI: 10.1155/2015/918710
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1: A screenshot of gene, gene family, and protein domain annotation of PMID: 10828014 in PubTator.
Figure 2: Relations between gene, gene family, and protein domains in PMID: 10828014.
Statistics of our gene corpus.
| Data set | Articles | Gene mentions (gene/family/domain) | Gene identifiers |
|---|---|---|---|
| BioCreative II GN training set | 281 | 3,019/1,115/278 | 758 |
| BioCreative II GN test set | 262 | 3,233/1,252/361 | 928 |
| NLM Citation GIA test collection | 151 | 1,205/160/17 | 310 |
| Total | 694 | 7,457/2,527/656 | 1,996 |
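
The Total row is the column-wise sum of the three data sets. As a quick consistency check, the following minimal Python sketch recomputes it from the values in the table above; the dictionary layout and the `parse_mentions` helper are illustrative assumptions, not part of GNormPlus.

```python
# Rows copied from the corpus statistics table:
# (articles, "gene/family/domain" mention counts, gene identifiers)
corpus = {
    "BioCreative II GN training set": (281, "3,019/1,115/278", 758),
    "BioCreative II GN test set": (262, "3,233/1,252/361", 928),
    "NLM Citation GIA test collection": (151, "1,205/160/17", 310),
}


def parse_mentions(cell: str) -> list[int]:
    """Split a 'gene/family/domain' cell into integers, dropping thousands separators."""
    return [int(part.replace(",", "")) for part in cell.split("/")]


articles = sum(row[0] for row in corpus.values())
# zip(*...) transposes the per-data-set triples so each mention type is summed separately.
mentions = [sum(col) for col in zip(*(parse_mentions(row[1]) for row in corpus.values()))]
identifiers = sum(row[2] for row in corpus.values())

print("Articles:", articles)
print("Mentions (gene/family/domain):", "/".join(f"{n:,}" for n in mentions))
print("Gene identifiers:", f"{identifiers:,}")
```
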
Figure 3: The overview of our integration method (GNormPlus).
Evaluation of human gene normalization.
| Methods | Precision | Recall | F1-score | System availability |
|---|---|---|---|---|
| Our approach (GNormPlus) | 87.1% | 86.4% | 86.7% | Open source |
| GenNorm | 78.9% | 81.4% | 80.1% | GenNorm is open source but AIIA-GMT is no longer available |
| GNAT | 90.7% | 82.4% | 86.4% | Open source |
| GeNO | 87.8% | 85.0% | 86.4% | N/A |
| Hu et al., 2012 | 83.5% | 82.5% | 83.0% | N/A |
| Li et al., 2013 | 88.1% | 92.3% | 90.1% | N/A |
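
The F1-score used in this evaluation is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The minimal Python sketch below recomputes it from the precision and recall values in the table above; the function name `f1_score` and the `reported` dictionary are illustrative assumptions. Because the published precision and recall are already rounded, a recomputed value can differ from the published one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) copied from the human gene normalization table.
reported = {
    "GNormPlus": (87.1, 86.4),
    "GenNorm": (78.9, 81.4),
    "GNAT": (90.7, 82.4),
    "GeNO": (87.8, 85.0),
    "Hu et al., 2012": (83.5, 82.5),
    "Li et al., 2013": (88.1, 92.3),
}

for method, (p, r) in reported.items():
    print(f"{method}: F1 = {f1_score(p, r):.2f}%")
```
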
Evaluation of cross-species gene normalization.
| Methods | TAP-5 | TAP-10 | TAP-20 | F1-score | System availability |
|---|---|---|---|---|---|
| Our approach (GNormPlus) | 33.3% | 36.7% | 36.7% | 50.1% | Open source |
| GenNorm | 32.8% | 35.5% | 35.5% | 46.9% | GenNorm is open source but AIIA-GMT is no longer available |
| GeneTuKit | 29.7% | 31.4% | 32.5% | - | Open source |
| Kuo et al. | 21.4% | 25.1% | 25.1% | 30.6% | N/A |
| Tsai et al. | 19.0% | 22.9% | 23.9% | - | N/A |
Comparison of different mention recognition training corpora.
| Gene mention type scheme | Precision | Recall | F1-score |
|---|---|---|---|
| Gene/family/domain | 87.1% | 86.4% | 86.7% |
| Single gene type only | 78.4% | 79.2% | 78.8% |
Frequency of false negative (FN) and false positive (FP) errors of GNormPlus.
| Error type | FN | FP | Total | Percentage |
|---|---|---|---|---|
| Gene mention (GM) recognition | | | | |
| Gene/family/domain mention type confusion | 38 | 18 | 56 | 27.1% |
| Wrong boundary or missed gene mention | 18 | 18 | 36 | 17.4% |
| Not a gene mention | 0 | 15 | 15 | 7.3% |
| Gene normalization (GN) | | | | |
| Wrong gene identifier due to ambiguity | 19 | 18 | 37 | 17.9% |
| Insufficiency of the gene name dictionary | 19 | 0 | 19 | 9.2% |
| Not annotated in the gold standard | 0 | 17 | 17 | 8.2% |
| Nonhuman genes found | 0 | 11 | 11 | 5.3% |
| Others | 13 | 3 | 16 | 7.7% |
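
The Percentage column gives each error category's share of all errors (FN plus FP across every category). The minimal Python sketch below recomputes those shares from the counts in the table above; the `errors` dictionary and variable names are illustrative assumptions. Recomputed shares should agree with the table to within about 0.1 percentage point, since the published figures are rounded.

```python
# (FN, FP) counts copied from the error analysis table.
errors = {
    "Gene/family/domain mention type confusion": (38, 18),
    "Wrong boundary or missed gene mention": (18, 18),
    "Not a gene mention": (0, 15),
    "Wrong gene identifier due to ambiguity": (19, 18),
    "Insufficiency of the gene name dictionary": (19, 0),
    "Not annotated in the gold standard": (0, 17),
    "Nonhuman genes found": (0, 11),
    "Others": (13, 3),
}

total_fn = sum(fn for fn, _ in errors.values())
total_fp = sum(fp for _, fp in errors.values())
total = total_fn + total_fp

# Share of each category among all errors, in percent.
shares = {name: 100 * (fn + fp) / total for name, (fn, fp) in errors.items()}

print(f"FN = {total_fn}, FP = {total_fp}, all errors = {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```
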