Abstract
BACKGROUND: A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier to every identified gene name alias in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we examine how one special feature of biological articles, namely that the authors of the documents are known, can be exploited for the GSD task through graph-based semi-supervised methods.
Year: 2008 PMID: 18230174 PMCID: PMC2262057 DOI: 10.1186/1471-2105-9-69
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Results obtained using the path-length-based method. Column 1 lists the maximal path distance allowed for each given experiment. The results are presented in a Precision – Coverage format.
| Distance Limit | Human | Mouse | Fly | Yeast |
| 1 | 100%–44.35% | 99.88%–97.59% | 99.84%–92.19% | 100%–99.26% |
| 2 | 100%–49.19% | 98.67%–99.32% | 94.58%–97.72% | 100%–99.26% |
| 3 | 85.29%–82.26% | 98.64%–99.51% | 94.44%–98.10% | 100%–99.26% |
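The path-length-based method can be illustrated with a minimal sketch. It assumes (this is not spelled out in the text above) that nodes are articles, that an edge joins two articles sharing an author, and that a test article receives the gene label of the nearest labelled article within the distance limit, with no decision when the nearest labels disagree. The names `nearest_labels` and `disambiguate` are hypothetical.

```python
from collections import deque

def nearest_labels(graph, labels, start, max_dist):
    """Collect the gene labels found at the smallest path distance
    (at most max_dist edges) from the article `start`.

    graph  : dict mapping article id -> iterable of neighbouring article ids
             (assumed: neighbours share at least one author)
    labels : dict mapping labelled article id -> gene identifier
    """
    seen = {start}
    frontier = deque([(start, 0)])
    found, found_dist = set(), None
    while frontier:
        node, dist = frontier.popleft()
        # BFS visits nodes in order of distance, so stop once we pass
        # the distance at which the first labelled article was found.
        if found_dist is not None and dist > found_dist:
            break
        if node != start and node in labels:
            found.add(labels[node])
            found_dist = dist
            continue
        if dist == max_dist:
            continue  # do not look beyond the distance limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return found

def disambiguate(graph, labels, start, max_dist):
    """Return a gene identifier if the nearest labels agree, else None."""
    found = nearest_labels(graph, labels, start, max_dist)
    return next(iter(found)) if len(found) == 1 else None

# Toy graph: t - a - c and t - b; only article c is labelled.
toy_graph = {"t": ["a", "b"], "a": ["t", "c"], "b": ["t"], "c": ["a"]}
toy_labels = {"c": "G1"}
```

Under these assumptions, raising the distance limit can only let more test articles reach a labelled article, which is consistent with coverage growing with the limit in the table above, while longer paths give weaker evidence.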
Figure 1. Precision-coverage curves on the human GSD dataset. The three curves represent different weighting strategies, and the points on each curve correspond to different levels of filtering of the inverse co-author graph. Authors with more than 100, 50 or 20 MedLine publications were ignored, yielding three points in the precision-coverage space, while the fourth point of each curve shows the case without any filtering.
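The filtering described in the Figure 1 caption can be sketched as follows. This is an illustrative construction only: it assumes the inverse co-author graph links two articles whenever they share a retained author, and the function name `build_coauthor_graph`, its parameters, and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_coauthor_graph(doc_authors, pub_counts, max_pubs=None):
    """Build an article-to-article graph from shared authors.

    doc_authors : dict mapping article id -> list of author names
    pub_counts  : dict mapping author name -> number of publications
    max_pubs    : authors with more than this many publications are
                  ignored; None disables the filter
    """
    docs_by_author = defaultdict(set)
    for doc, authors in doc_authors.items():
        for author in authors:
            if max_pubs is None or pub_counts.get(author, 0) <= max_pubs:
                docs_by_author[author].add(doc)

    graph = {doc: set() for doc in doc_authors}
    for docs in docs_by_author.values():
        # Every pair of articles written by the same author is linked.
        for d1, d2 in combinations(sorted(docs), 2):
            graph[d1].add(d2)
            graph[d2].add(d1)
    return graph

# Toy data: author A is prolific (150 papers), author B is not.
toy_docs = {"d1": ["A", "B"], "d2": ["A"], "d3": ["B"]}
toy_counts = {"A": 150, "B": 10}
unfiltered = build_coauthor_graph(toy_docs, toy_counts)
filtered = build_coauthor_graph(toy_docs, toy_counts, max_pubs=100)
```

In the toy data, dropping the prolific author removes the edge between `d1` and `d2`, which shows how stricter filtering thins the graph and can change both precision and coverage.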
Results obtained using the automatic labelled set expanding heuristic. Column 1 refers to the maximal distance allowed in the path finding phase. The results are presented in a Precision – Coverage format.
| Distance Limit | Human | Mouse | Fly | Yeast |
| 0 | 93.3%–12.11% | 96.28%–8.57% | 100%–7.06% | 83.70%–10.23% |
| 1 | 92.56%–32.82% | 91.41%–18.82% | 96.56%–10.78% | 69.75%–18.79% |
| 2 | 91.53%–37.88% | 91.31%–20.07% | 96.56%–10.78% | 69.75%–18.79% |
Results obtained using the combined co-author-based methods
| Method | Human | Mouse | Fly | Yeast |
| With max precision | 100%–52.42% | 99.76%–97.80% | 99.59%–92.42% | 100%–99.25% |
| With max coverage | 84.76%–84.67% | 99.48%–98.74% | 97.94%–95.68% | 100%–99.25% |
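The Precision – Coverage pairs in these tables can be read with the help of a small sketch. It assumes the usual definitions, which the text above does not state explicitly: coverage is the share of test cases on which a method makes a decision, and precision is the share of those decisions that are correct. The function name `precision_coverage` is hypothetical.

```python
def precision_coverage(predictions, gold):
    """Compute (precision, coverage) for a method that may abstain.

    predictions : dict mapping test case id -> predicted gene id, or None
                  when the method makes no decision
    gold        : dict mapping test case id -> correct gene id
    """
    decided = {case: label for case, label in predictions.items()
               if label is not None}
    coverage = len(decided) / len(gold) if gold else 0.0
    correct = sum(1 for case, label in decided.items()
                  if gold.get(case) == label)
    precision = correct / len(decided) if decided else 0.0
    return precision, coverage

# Toy example: four test cases, one abstention, one wrong decision.
toy_predictions = {"a": "G1", "b": None, "c": "G2", "d": "G3"}
toy_gold = {"a": "G1", "b": "G2", "c": "G2", "d": "G4"}
toy_precision, toy_coverage = precision_coverage(toy_predictions, toy_gold)
```

With these definitions, a method tuned for maximum precision abstains more often, while one tuned for maximum coverage accepts weaker evidence.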
Overview of systems that aimed at full coverage. The most frequent sense was used as the baseline method. For the sake of comparability, the second row presents the results of Xu et al. using MeSH codes. The results of a C4.5 decision tree using the MeSH features are presented in the third row. The systems in the last two rows first apply the combined co-author-graph-based heuristics and, when these cannot decide, fall back on the supervised prediction of the cosine similarity metric or the decision tree.
| Method | Human | Mouse | Fly | Yeast |
| Baseline | 59.3%–99.1% | 79% | 66.7% | 65.5% |
| Xu et al [14, 15] MeSH | 86.3%–94.4% | 90.7%–99.4% | 69.4%–99.7% | 78.9%–98.4% |
| Decision tree | 84.68%–100% | 90.90%–99.84% | 72.53%–99.85% | 74.49%–100% |
| Co-author heuristics + similarity | 91.87%–99.19% | 98.54%–99.75% | 97.20%–100% | 94.15%–99.70% |
| Co-author heuristics + decision tree | 94.35%–100% | 98.85%–99.91% | 96.05%–99.85% | 99.63%–100% |
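The two-stage systems in the last two rows can be sketched as a simple cascade. This is a minimal illustration, assuming each stage is a callable that returns a gene identifier or `None` when it cannot decide; the names `cascade_predict`, `coauthor_predict` and `fallback_predict` are hypothetical stand-ins for the co-author heuristics and the supervised model (cosine similarity or decision tree).

```python
def cascade_predict(test_case, coauthor_predict, fallback_predict):
    """Try the co-author heuristic first; use the supervised model
    only when the heuristic abstains.

    Returns (label, stage), where stage records which component decided.
    """
    label = coauthor_predict(test_case)
    if label is not None:
        return label, "co-author"
    return fallback_predict(test_case), "fallback"

# Toy stand-ins: the heuristic only knows one article,
# the fallback always answers with a fixed sense.
toy_heuristic = {"doc1": "G1"}.get
toy_fallback = lambda case: "G2"

decided_by_graph = cascade_predict("doc1", toy_heuristic, toy_fallback)
decided_by_model = cascade_predict("doc2", toy_heuristic, toy_fallback)
```

The ordering reflects the description above: the high-precision co-author heuristics decide first, and the supervised component fills in the remaining cases to approach full coverage.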
The characteristics of the evaluation sets used
| Organism | # of test cases | Avg # of senses | Avg size of train set | Avg # of synonyms available |
| Human | 124 | 2.35 | 122.09 | 12.36 |
| Mouse | 7844 | 2.33 | 263.0 | 5.36 |
| Fly | 1320 | 2.79 | 35.69 | 9.51 |
| Yeast | 269 | 2.08 | 11 | 2.32 |