Jeyakumar Natarajan, Jawahar Ganapathy.
Abstract
Gene function annotation remains a key challenge in modern biology, especially for high-throughput techniques such as gene expression experiments. Vital information about genes is available electronically from the biomedical literature in the form of full texts and abstracts. In addition, various publicly available databases (such as GenBank, Gene Ontology and Entrez) provide access to gene-related information at different levels of biological organization, granularity and data format. This information is being used to assess and interpret the results of high-throughput experiments. To improve keyword extraction for annotational clustering and other types of analysis, we have developed a novel text mining approach based on keywords identified at the level of gene annotation sentences (in particular, sentences characterizing biological function) instead of entire abstracts. Further, to improve the expressiveness and usefulness of gene annotation terms, we investigated the combination of sentence-level keywords with terms from the Medical Subject Headings (MeSH) and Gene Ontology (GO) resources. We find that sentence-level keywords combined with MeSH terms outperform the typical 'baseline' set-up (term frequencies at the level of abstracts) by a significant margin, whereas the addition of GO terms improves matters only marginally. We validated our approach on a manually annotated corpus of 200 abstracts, generated from 2 cancer categories with 10 genes per category. We applied the method to three sets of differentially expressed genes obtained from pediatric brain tumor samples. This analysis suggests novel interpretations of the discovered gene expression patterns.
Keywords: functional clustering; microarray data analysis; text mining
Year: 2007 PMID: 18305827 PMCID: PMC2241933 DOI: 10.6026/97320630002185
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Figure 1. Experimental design of gene clustering with sentence-level, MeSH and GO keywords.
Figure 2. Characterization of 19 genes differentially expressed in response to EGF. (a) All 19 genes against the discovered biological function/process terms. (b) Detailed view of a manually selected cluster sharing common features (9 genes and 12 function/process terms).
Figure 3. Characterization of 30 genes differentially expressed in response to S1P. (a) All 30 genes against the discovered biological function/process terms. (b) Detailed view of a manually selected cluster sharing common features (19 genes and 17 function/process terms).
Figure 4. Characterization of 30 genes differentially expressed in response to both EGF and S1P. (a) All 30 genes against the discovered biological function/process terms. (b) Detailed view of a manually selected cluster sharing common features (21 genes and 18 function/process terms).
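The contrast drawn in the abstract between abstract-level term frequencies (the baseline) and sentence-level keywords can be illustrated with a minimal Python sketch. This is not the authors' implementation: the cue-word list `FUNCTION_CUES`, the stopword list, the tokenizer, and all function names are hypothetical choices made for illustration, since the abstract does not state how function-describing sentences are selected or how MeSH terms are merged.

```python
import re
from collections import Counter

# Hypothetical cue words that flag a sentence as describing gene function.
# The actual selection criteria are not given in the abstract.
FUNCTION_CUES = {"regulates", "activates", "inhibits", "encodes", "binds", "mediates"}

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "that",
             "was", "were", "with", "for", "by", "on", "as"}


def tokenize(text):
    """Lowercase the text and keep alphabetic-initial tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z][a-z0-9-]+", text.lower())
            if t not in STOPWORDS]


def split_sentences(abstract):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def abstract_level_terms(abstract):
    """Baseline: term frequencies counted over the entire abstract."""
    return Counter(tokenize(abstract))


def sentence_level_terms(abstract):
    """Term frequencies counted only over sentences containing a function cue word."""
    counts = Counter()
    for sentence in split_sentences(abstract):
        tokens = tokenize(sentence)
        if FUNCTION_CUES & set(tokens):
            counts.update(tokens)
    return counts


def combine_with_vocabulary(sentence_terms, vocabulary_terms):
    """Add controlled-vocabulary terms (e.g. MeSH headings) to the keyword counts."""
    combined = Counter(sentence_terms)
    for term in vocabulary_terms:
        combined[term.lower()] += 1
    return combined


if __name__ == "__main__":
    example = ("EGFR was studied in 40 glioma samples. "
               "EGFR activates cell proliferation signalling. "
               "Samples were collected over two years.")
    print(abstract_level_terms(example).most_common(3))
    print(sentence_level_terms(example).most_common(3))
```

In this toy example only the second sentence contributes sentence-level keywords, so incidental terms such as "samples" or "collected" drop out, whereas the baseline counts them alongside the functional terms. The resulting per-gene keyword counts could then serve as feature vectors for clustering genes against terms, in the spirit of Figures 2 to 4.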