Roney S Coimbra, Dana E Vanderwall, Guilherme C Oliveira.
Abstract
BACKGROUND: Retrieving pertinent information from the biological scientific literature requires cutting-edge text mining methods that can recognize the meaning of the highly ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts, and this may hold even when the aliases are not associated with the same subset of documents. This gene-specific vocabulary defines a unique fingerprint that can be used to identify ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies, based exclusively on the content of their associated literature. The method addresses the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples.
Year: 2010 PMID: 21210969 PMCID: PMC3045796 DOI: 10.1186/1471-2164-11-S5-S3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Dataset description
| | Initial dataset | Dataset with PubMed abstracts | Dataset fulfilling the algorithm’s requirements* | Final dataset (ambiguous aliases excluded) |
|---|---|---|---|---|
| EntrezGene official symbols | 100 | 73 | 68** | 68 |
| Aliases | 425 | 256 | 223 | 165 |
| Abstracts in text corpus | - | 13355 | 12088 | 9005 |
| Unique PubMed IDs in text corpus | - | 11022 | 10312 | 7523 |
| Redundancy in text corpus (%) | - | 21 | 16.6 | 19.7 |
* The algorithm requires the official gene symbol, and at least one alias and one internal control to produce text corpora of PubMed abstracts. Additionally, the algorithm requires an informative group-specific vocabulary to pass the filters for ubiquitous terms.
** Five official gene symbols, namely DERL3, KCNA7, KCNJ14, MED18, and TBRV4-2, did not fulfil the algorithm’s requirements because their aliases retrieved no PubMed abstracts.
Figure 1. Stringency thresholds and vocabulary size. C = thresholds. * = p < 0.05.
Figure 2. Vocabulary fingerprint for FADS1 and its aliases. Schematic description of a group-specific informative vocabulary automatically extracted from a text corpus of PubMed abstracts. In this example, two “synonyms” (green arrows) and one “ambiguous” alias (red arrow) of the official gene symbol FADS1 (which encodes the enzyme fatty acid desaturase 1; blue arrow) are distinguished by the algorithm when the baseline cut-off was set at c = 0.05. The internal control is the unrelated official gene symbol CLEC2B (black arrow). The Jaccard distances to FADS1 are: 1) D5D = 0.937; 2) fatty acid desaturase 1 = 0.944; 3) TU12 = 1; 4) CLEC2B = 1. Yellow boxes = words from the group-specific informative vocabulary that occur in the text corpora of a given gene symbol or alias.
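The Jaccard distances quoted in the Figure 2 caption can be illustrated with a minimal Python sketch. It assumes each gene symbol or alias is represented as a set of vocabulary words, and that an alias is flagged as ambiguous when it is no closer to the official symbol than the unrelated internal control is. That decision rule, the function names, and the toy word sets are assumptions made for illustration; they are not taken from the paper and do not reproduce its values.

```python
from typing import AbstractSet


def jaccard_distance(a: AbstractSet[str], b: AbstractSet[str]) -> float:
    """Return 1 - |A & B| / |A | B|; defined here as 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def is_ambiguous(alias: AbstractSet[str],
                 official: AbstractSet[str],
                 control: AbstractSet[str]) -> bool:
    """Assumed rule: an alias is ambiguous if it is at least as far from
    the official symbol as the unrelated internal control is."""
    return jaccard_distance(official, alias) >= jaccard_distance(official, control)


# Hypothetical toy vocabularies (not the FADS1 data from the figure).
official_vocab = {"desaturase", "fatty", "arachidonic", "linoleic"}
synonym_vocab = {"desaturase", "arachidonic", "delta"}
ambiguous_vocab = {"tumor", "antigen"}
control_vocab = {"lectin", "receptor"}

for name, vocab in [("synonym", synonym_vocab), ("ambiguous", ambiguous_vocab)]:
    distance = jaccard_distance(official_vocab, vocab)
    flag = is_ambiguous(vocab, official_vocab, control_vocab)
    print(f"{name}: distance={distance:.3f}, ambiguous={flag}")
```

A distance of 1 means the two vocabularies share no words, which is the situation the caption reports for TU12 and for the control CLEC2B.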