Yuncui Hu, Yanpeng Li, Hongfei Lin, Zhihao Yang, Liangxi Cheng.
Abstract
The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which uses co-occurrence information from a large number of unlabeled MEDLINE abstracts to enhance the feature representation of gene named entities. In the entity mapping stage, we combine exact and approximate matching to link gene names in context to the EntrezGene database. For gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for NER. The system can adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 83.5%, recall: 82.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state of the art.
Year: 2012 PMID: 22984434 PMCID: PMC3440407 DOI: 10.1371/journal.pone.0043558
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Architecture of the gene name normalization system.
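The entity-mapping stage described in the abstract combines exact and approximate matching against EntrezGene. A minimal sketch of such a two-step lookup is given here. The source does not specify the approximate-matching algorithm, so the use of `difflib` string similarity, the `cutoff` value, the `normalize` helper, and the toy lexicon with placeholder identifiers are all assumptions made for illustration.

```python
import difflib
import re


def normalize(name: str) -> str:
    """Lowercase a name and collapse punctuation/whitespace to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def map_mention(mention, lexicon, cutoff=0.85):
    """Map a mention to candidate identifiers.

    `lexicon` maps already-normalized names to lists of identifiers.
    Exact lookup is tried first; approximate lookup is the fallback.
    """
    key = normalize(mention)
    if key in lexicon:
        return lexicon[key], "exact"
    # Fallback: closest lexicon name by difflib similarity (an assumption;
    # the original system may use a different approximate matcher).
    close = difflib.get_close_matches(key, list(lexicon), n=1, cutoff=cutoff)
    if close:
        return lexicon[close[0]], "approximate"
    return [], "unmapped"


# Toy lexicon: names taken from the dictionary excerpt, identifiers hypothetical.
LEXICON = {
    "gata 3 transcription factor": ["ID_1"],
    "lp chain": ["ID_2", "ID_3"],  # ambiguous name: needs disambiguation later
}

print(map_mention("GATA-3 transcription factor", LEXICON))
print(map_mention("gata 3 transcription factors", LEXICON))
print(map_mention("unrelated phrase", LEXICON))
```

A name that returns more than one identifier, such as the toy `lp chain` entry, is the case the disambiguation stage would then have to resolve.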
Manually compiled stop list of common biological terms.
| activate | family | like |
| anti | families | mRNA |
| antibody | gene | negative |
| cDNA | genes | promoter |
| complex | linked | promoters |
| domain | homolog | receptor |
| domains | homology | subfamily |
| dominant | human | subunit |
| enzymes | humans | superfamily |
Figure 2. Excerpt from the abstract with PubMed ID “10588946” and partial annotation information for genes named “ORP-1” in Entrez Gene.
Example of the dictionary entries and confidence scores.
| Dictionary entries | Confidence scores |
| collagen induced platelet cd62p | −0.291009 |
| derived fibrinogen | −0.931995 |
| dna binding inhibitory | −1.637170 |
| gata 3 transcription factor | 0.997864 |
| lp chain | 0.348884 |
| mitochondrial ribosomal protein mrp1 | 0.644872 |
| nf kb activating kinase nak | 1.031284 |
| protein hsp 60 | −0.498391 |
| peptide hormone factors | −1.072469 |
| rel like protein binding motifs | 0.174024 |
| transforming growth factor beta family member | −0.146254 |
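The abstract states that filtering uses these dictionary confidence scores and that the system can trade precision against recall. One plausible reading is a simple score threshold, sketched here. The entries and scores are copied from the table; the function name `filter_mentions`, the threshold values, and the choice to drop mentions that are missing from the dictionary are assumptions, not details confirmed by the source.

```python
# Entries and confidence scores copied from the dictionary excerpt above.
DICTIONARY_SCORES = {
    "collagen induced platelet cd62p": -0.291009,
    "derived fibrinogen": -0.931995,
    "dna binding inhibitory": -1.637170,
    "gata 3 transcription factor": 0.997864,
    "lp chain": 0.348884,
    "mitochondrial ribosomal protein mrp1": 0.644872,
    "nf kb activating kinase nak": 1.031284,
    "protein hsp 60": -0.498391,
    "peptide hormone factors": -1.072469,
    "rel like protein binding motifs": 0.174024,
    "transforming growth factor beta family member": -0.146254,
}


def filter_mentions(mentions, scores, threshold=0.0):
    """Keep mentions whose confidence score is at least `threshold`.

    Mentions absent from `scores` are dropped (an assumption of this sketch).
    """
    return [m for m in mentions if m in scores and scores[m] >= threshold]


# Sweep a few hypothetical thresholds to see how many entries survive.
for threshold in (-1.0, 0.0, 0.5):
    kept = filter_mentions(DICTIONARY_SCORES, DICTIONARY_SCORES, threshold)
    print(f"threshold={threshold:+.1f}: {len(kept)} of {len(DICTIONARY_SCORES)} kept")
```

Raising the threshold keeps fewer, higher-confidence entries, which would be expected to favor precision over recall.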
Statistics of the BioCreative II GN dataset.
| Corpus | Training Set | Test Set |
| Abstracts | 281 | 262 |
| Annotations | 998 | 1130 |
| Entity mentions in the gold standard | 640 | 785 |
Results of different mapping and disambiguation strategies.
| Method | F-score | Precision | Recall | TP | FP | FN |
| Exact + Entrez | 0.793 | 0.818 | 0.769 | 604 | 134 | 181 |
| Exact + Approximate + Entrez | 0.809(+2%) | 0.817(−0.1%) | 0.801(+4.2%) | 629 | 141 | 156 |
| Exact + Approximate + Entrez + External | 0.830(+4.7%) | 0.835(+2.1%) | 0.825(+7.3%) | 648 | 128 | 137 |
The first column lists the mapping and disambiguation approaches. ‘Exact’ is short for ‘exact string matching’ and ‘Approximate’ for ‘approximate string matching’; ‘Entrez’ means that only the information in the EntrezGene database is used for disambiguation, and ‘External’ means that external resources are also used.
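The precision, recall and F-score columns follow directly from the TP, FP and FN counts. The short sketch below recomputes them from the counts in the table using the standard definitions; the function name `prf` is an illustrative choice.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) counts copied from the table above.
ROWS = {
    "Exact + Entrez": (604, 134, 181),
    "Exact + Approximate + Entrez": (629, 141, 156),
    "Exact + Approximate + Entrez + External": (648, 128, 137),
}

for method, counts in ROWS.items():
    p, r, f = prf(*counts)
    print(f"{method}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

In every row TP + FN equals 785, the number of gold-standard entity mentions in the test set.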
Performance of different filtering methods.
| Method | F-score | Precision | Recall |
| Unfiltered | 0.668 | 0.536 | 0.887 |
| List of family names and cell lines | 0.733(+9.7%) | 0.628(+17.2%) | 0.880(−0.8%) |
| Semantic similarity | 0.740(+10.8%) | 0.667(+24.4%) | 0.831(−6.3%) |
| Machine learning | 0.741(+10.9%) | 0.665(+24.1%) | 0.837(−5.6%) |
| List + Semantic similarity + Machine learning | 0.830(+24.3%) | 0.835(+55.8%) | 0.825(−7%) |
‘Unfiltered’ – the method without the filtering step; ‘List of family names and cell lines’ – the filtering method using protein family names and cell lines extracted from Wikipedia; ‘Semantic similarity’ – the filtering method based on the cosine measure calculated in the disambiguation step; ‘Machine learning’ – the filtering method based on the confidence scores obtained by machine learning in named entity recognition.
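The cosine measure mentioned in this note is computed during disambiguation and reused for filtering. As a rough illustration, the sketch below scores each candidate identifier by the cosine similarity between bag-of-words vectors of the mention context and of a candidate description, and rejects the mention when no candidate exceeds a minimum similarity. The source derives its semantic vectors from the Gene Ontology and MEDLINE abstracts; the plain word counts, the `min_sim` parameter, and the placeholder identifiers and texts used here are assumptions.

```python
import math
from collections import Counter


def bag(text: str) -> Counter:
    """Very simple bag-of-words vector: lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(context: str, candidates: dict, min_sim: float = 0.0):
    """Return the candidate id most similar to the context.

    Returns None when no candidate scores above `min_sim`, which doubles
    as a similarity-based filter.
    """
    ctx = bag(context)
    best_id, best_sim = None, min_sim
    for gene_id, description in candidates.items():
        sim = cosine(ctx, bag(description))
        if sim > best_sim:
            best_id, best_sim = gene_id, sim
    return best_id


# Hypothetical context and candidate descriptions, for illustration only.
CONTEXT = "kinase signalling in liver cells"
CANDIDATES = {
    "GENE_A": "liver kinase signalling protein",
    "GENE_B": "retina photoreceptor structural protein",
}

print(disambiguate(CONTEXT, CANDIDATES))
print(disambiguate(CONTEXT, CANDIDATES, min_sim=0.9))
```

With a high `min_sim` the second call returns `None`, which corresponds to filtering out a mention whose best candidate is still a poor semantic match.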
Figure 3. Relationship between the performance and threshold selection in filtering.
Comparison with systems in the GN task of the BioCreative II challenge.
| Methods or authors | Precision | Recall | F-score |
| (Joachim et al., 2009) | 87.8% | 85.0% | 86.4% |
| Our System | 83.5% | 82.5% | 83.0% |
| (Hakenberg et al., 2007) | 78.9% | 83.3% | 81.0% |
| (Fundel and Zimmer, 2007) | 79.2% | 81.5% | 80.4% |
| (Schuemie et al., 2007) | 75% | 76% | 75.5% |
| (Neves et al., 2008) | 55.0% | 83.31% | 66.26% |
Causes of false-negative (FN) errors in the normalization task.
| FN cause | Frequency | Proportion |
| Mentions not identified | 56 | 40.9% |
| Lexicon deficiencies | 31 | 22.6% |
| Incorrect disambiguation | 28 | 20.4% |
| Erroneously filtered | 22 | 16.1% |
Causes of false-positive (FP) errors in the normalization task.
| FP cause | Frequency | Proportion |
| Spurious gene names | 95 | 74.2% |
| Noise caused by searching | 7 | 5.5% |
| Incorrect disambiguation | 26 | 20.3% |