Lishuang Li, Shanshan Liu, Lihua Li, Wenting Fan, Degen Huang, Huiwei Zhou.
Abstract
Gene/protein recognition and normalization is an important preliminary step for many biological text mining tasks. In this paper, we present a multistage gene normalization system that consists of four major subtasks: pre-processing, dictionary matching, ambiguity resolution and filtering. For the first subtask, we apply the gene mention tagger developed in our earlier work, which achieves an F-score of 88.42% on the BioCreative II GM test set. In the dictionary matching stage, exact matching and approximate matching between gene names and the EntrezGene lexicon are combined. For the ambiguity resolution subtask, we propose a semantic similarity disambiguation method based on Munkres' Assignment Algorithm. In the last step, a Wikipedia-based filter removes false positives. Experimental results show that the presented system achieves an F-score of 90.1%, outperforming most state-of-the-art systems.
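The dictionary matching stage can be illustrated with a minimal sketch that tries an exact lookup first and falls back to approximate matching. Everything specific here is an assumption made for illustration: the toy lexicon and its placeholder identifiers, the `normalize` and `match` helpers, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure. The abstract does not say which approximate matching method the system uses.

```python
from difflib import SequenceMatcher

# Toy lexicon: normalized name -> identifiers.
# The IDs are placeholders, not real EntrezGene identifiers.
LEXICON = {
    "brca1": {"ID_1"},
    "tumor protein p53": {"ID_2"},
    "p53": {"ID_2", "ID_3"},  # ambiguous name, left for a disambiguation step
}

def normalize(name):
    """Lowercase, turn hyphens into spaces, collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())

def match(mention, threshold=0.85):
    """Return (candidate_ids, how), where how is 'exact', 'approximate' or 'none'."""
    key = normalize(mention)
    if key in LEXICON:
        return set(LEXICON[key]), "exact"
    # Fall back to the most similar lexicon entry (hypothetical threshold).
    best_name, best_score = None, 0.0
    for name in LEXICON:
        score = SequenceMatcher(None, key, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return set(LEXICON[best_name]), "approximate"
    return set(), "none"

print(match("BRCA1"))
print(match("tumour protein p53"))
print(match("xyz kinase"))
```

A mention such as "p53" returns more than one identifier, which is the situation the ambiguity resolution subtask is meant to handle.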
Year: 2013 PMID: 24349160 PMCID: PMC3861319 DOI: 10.1371/journal.pone.0081956
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Architecture of the gene name normalization system.
Figure 2. Similarity computing measure based on JaroWinkler distance.
Figure 3. An example to determine the unique identifier based on the semantic similarity.
Figure 4. An example of the maximum matching in the bipartite graph.
The maximum matching between the extended semantic information and the context is marked with solid lines. Edges with weight 0 are omitted, and all weights are rounded in this figure.
Figure 5. Semantic similarity calculation based on Munkres' Assignment Algorithm.
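The assignment problem behind the bipartite matching captions can be illustrated with a small sketch: each term of a candidate's extended semantic information is paired with at most one context term so that the total edge weight is maximal. This is a sketch under stated assumptions, not the authors' implementation. The weights are made up, the function name `best_assignment` is hypothetical, and exhaustive search over permutations is used in place of Munkres' algorithm, which reaches the same optimum in polynomial time.

```python
from itertools import permutations

def best_assignment(weights):
    """Return (total, pairs) for a maximum-weight one-to-one matching.

    weights[i][j] is the similarity between candidate term i and context
    term j. Exhaustive search, suitable for tiny matrices only; Munkres'
    algorithm would be the practical choice for larger ones.
    """
    rows, cols = len(weights), len(weights[0])
    if rows > cols:
        # Transpose so the shorter side is enumerated, then flip pairs back.
        transposed = [list(col) for col in zip(*weights)]
        total, pairs = best_assignment(transposed)
        return total, sorted((i, j) for j, i in pairs)
    best_total, best_pairs = float("-inf"), []
    for chosen in permutations(range(cols), rows):
        total = sum(weights[i][j] for i, j in enumerate(chosen))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(chosen))
    return best_total, best_pairs

# Hypothetical weights: 2 candidate terms (rows) x 2 context terms (columns).
weights = [
    [0.9, 0.8],
    [0.8, 0.1],
]
total, pairs = best_assignment(weights)
# Greedily taking the 0.9 edge first would force the 0.1 edge (total 1.0);
# the optimal assignment uses both 0.8 edges instead.
print(total, pairs)
```

The example is chosen so that a greedy pick of the heaviest edge gives a worse total than the optimal assignment, which is the reason for solving the matching globally.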
Results based on different disambiguation algorithms.
| Method | F-score | Precision | Recall | TP | FP | FN |
| JaroWinkler distance | 89.4% | 87.3% | 91.5% | 717 | 104 | 67 |
| Munkres' Assignment Algorithm | 90.1% | 88.1% | 92.1% | 723 | 98 | 62 |
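The scores in this table follow from the raw counts by the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F-score as their harmonic mean. A minimal sketch of that arithmetic is given here; the function name `prf` is illustrative. Recomputing from the counts reproduces the reported precision and recall, while the F-score can differ from the reported value in the last digit, possibly because of rounding.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score (harmonic mean) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Counts taken from the Munkres' Assignment Algorithm row.
p, r, f = prf(tp=723, fp=98, fn=62)
print(f"precision={p:.1%} recall={r:.1%} f_score={f:.1%}")
```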
Results using the combination of different steps.
| Method | F-score | Precision | Recall | TP | FP | FN |
| Prepro.+Exact | 88.3% | 88.2% | 88.4% | 694 | 93 | 91 |
| Prepro.+Exact+Appro. | 83.2% | 75.5% | 92.6% | 726 | 236 | 58 |
| Prepro.+Exact+Appro.+Filter | 90.1% | 88.1% | 92.1% | 723 | 98 | 62 |
Prepro. is short for pre-processing; Exact stands for exact string matching; Appro. stands for approximate string matching; Filter is the Wikipedia-based filter.
Comparison with other systems.
| System | Precision | Recall | F-score | TP | FP | FN |
| Our system | 88.1% | 92.1% | 90.1% | 723 | 98 | 62 |
| Wermter et al., 2009 | 87.8% | 85.0% | 86.4% | 668 | 76 | 118 |
| Hakenberg et al., 2008 | 90.7% | 82.4% | 86.4% | 647 | 66 | 138 |
| Hu et al., 2012 | 83.5% | 82.5% | 83.0% | 648 | 128 | 137 |
Figure 6Causes for FN errors.
Figure 7Causes for FP errors.