Bob J A Schijvenaars, Barend Mons, Marc Weeber, Martijn J Schuemie, Erik M van Mulligen, Hester M Wain, Jan A Kors.
Abstract
BACKGROUND: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.
Year: 2005 PMID: 15958172 PMCID: PMC1183190 DOI: 10.1186/1471-2105-6-149
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Number of non-gene meanings for gene symbols. Dots indicate the number of human gene symbols (on the vertical axis) and, for each of these symbols, the number of corresponding long forms with a non-gene meaning (horizontal axis). It should be noted that spelling variations may yield different long forms for the same non-gene meaning. To reduce the effect of these variations, long forms were stemmed.
Figure 2. Performance of the disambiguation algorithm. Total accuracies of the disambiguation algorithm were determined on the test set of 52,529 Medline abstracts for reference fingerprints derived from different reference descriptions: OMIM annotations or a varying number of Medline abstracts. When two or more abstracts were used, the fingerprints of the individual abstracts were averaged to yield the final reference fingerprint.
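The averaging step mentioned in the Figure 2 caption could look roughly like the sketch below. It assumes a fingerprint is a sparse mapping from a concept identifier to a weight and that a concept missing from an abstract counts as weight 0; neither detail is stated in the caption. The function name `average_fingerprints` and the concept identifiers are hypothetical.

```python
from collections import defaultdict


def average_fingerprints(fingerprints):
    """Average sparse concept fingerprints (dicts of concept id -> weight).

    Assumption: a concept absent from one abstract's fingerprint
    contributes a weight of 0 to the average.
    """
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    totals = defaultdict(float)
    for fingerprint in fingerprints:
        for concept, weight in fingerprint.items():
            totals[concept] += weight
    n = len(fingerprints)
    return {concept: total / n for concept, total in totals.items()}


# Hypothetical fingerprints of two abstracts for the same symbol sense.
abstract_fps = [
    {"C001": 1.0, "C002": 0.5},
    {"C001": 0.5, "C003": 1.0},
]
reference_fp = average_fingerprints(abstract_fps)
```

With a single abstract the function returns that abstract's fingerprint unchanged, which matches the one-abstract case in the figure.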
Disambiguation of gene vs. non-gene senses. Table entries show the number of abstracts in the test set with gene symbols that were correctly or incorrectly classified by the disambiguation algorithm as having a gene or non-gene sense. Rows give the reference sense and columns give the sense assigned by the algorithm. The percentages indicate the correctly and incorrectly classified symbols relative to the row totals. Reference fingerprints per gene symbol sense were derived from a combination of five Medline abstracts, none of which were part of the test set.
| Reference | Algorithm: Gene | Algorithm: Non-gene |
| Gene | 24243 (93.5%) | 1666 (6.5%) |
| Non-gene | 1197 (4.5%) | 25323 (95.5%) |
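As a quick check on how the table is read, the sketch below recomputes row-relative percentages and an overall accuracy from the four counts. The counts are taken from the table; the function names and the dictionary layout are illustrative assumptions. Recomputed percentages may differ from the printed ones in the last digit because of rounding.

```python
# Counts from the table, keyed by (reference sense, algorithm sense).
counts = {
    ("gene", "gene"): 24243,
    ("gene", "non-gene"): 1666,
    ("non-gene", "gene"): 1197,
    ("non-gene", "non-gene"): 25323,
}


def row_percentages(counts, reference):
    """Percentage of each algorithm outcome within one reference row."""
    row = {pred: n for (ref, pred), n in counts.items() if ref == reference}
    total = sum(row.values())
    return {pred: 100.0 * n / total for pred, n in row.items()}


def overall_accuracy(counts):
    """Fraction of all abstracts where the algorithm matches the reference."""
    correct = sum(n for (ref, pred), n in counts.items() if ref == pred)
    return correct / sum(counts.values())


for reference in ("gene", "non-gene"):
    print(reference, row_percentages(counts, reference))
print("overall accuracy:", overall_accuracy(counts))
```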
Figure 3. Steps involved in the construction of test and reference fingerprints. Two sets of abstracts containing symbols with known gene or non-gene meaning were constructed. One set consisted of abstracts with short-form/long-form combinations culled from Medline, the other set consisted of abstracts that were mentioned in OMIM annotations of genes. The two sets were merged by selecting symbols that occurred in both sets and had at least six abstracts for each of their gene senses. The OMIM annotations for the genes in the merged set were stored separately. A reference set was generated by randomly selecting five abstracts per gene sense from the merged set; the remaining abstracts were used for testing. All abstracts in the test and reference set as well as the OMIM annotations were indexed using the combined gene thesaurus, and the resulting "concept fingerprints" were used for reference fingerprint construction and testing of the disambiguation algorithm.
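The reference/test split described in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes the merged set is available as a nested mapping from symbol to sense to a list of unique abstract identifiers, and it applies the six-abstract minimum to every listed sense of a symbol. The function name, parameter names, symbols, and identifiers are all hypothetical.

```python
import random


def split_reference_test(merged, n_reference=5, min_abstracts=6, seed=0):
    """Split abstracts per symbol sense into reference and test sets.

    merged: {symbol: {sense: [abstract ids]}} (assumed layout, unique ids).
    A symbol is kept only if every listed sense has at least
    min_abstracts abstracts (assumption about how the filter is applied).
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    reference, test = {}, {}
    for symbol, senses in merged.items():
        if any(len(ids) < min_abstracts for ids in senses.values()):
            continue  # too few abstracts for at least one sense
        reference[symbol], test[symbol] = {}, {}
        for sense, ids in senses.items():
            chosen = set(rng.sample(ids, n_reference))
            reference[symbol][sense] = [i for i in ids if i in chosen]
            test[symbol][sense] = [i for i in ids if i not in chosen]
    return reference, test


# Hypothetical merged set with made-up symbols and abstract ids.
merged = {
    "SYM1": {"gene": list(range(1, 9)), "non-gene": list(range(101, 107))},
    "SYM2": {"gene": [201, 202, 203]},  # below the minimum, dropped
}
reference, test = split_reference_test(merged)
```

Requiring at least six abstracts guarantees that, after five are drawn for the reference set, at least one abstract per sense remains for testing.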