
Integrating information from existing databases for enhanced function annotation of genes, genomes and networks.

Abstract

Uncovering functional associations for genes and gene products remains one of the most significant challenges in biology. Classical approaches, such as homology detection, are mainly suited to predicting the approximate molecular function of a protein and should be used together with other methods. Several studies have emerged that employ knowledge-based procedures to extract functional data for genes from a variety of biological sources. However, data derived from a single biological resource often provide only a limited perspective on functional associations, largely because of systematic bias in the underlying data. The post-genomic era has witnessed the emergence of knowledge-based studies that aim to decipher functional associations by combining several biological evidence types. These are expected to provide better insights into the functional aspects of diverse genes, genomes and networks.

Keywords: evidence types; functional associations; homology; knowledge-based; post-genomic

Year: 2007 PMID: 21670790 PMCID: PMC2255068 DOI: 10.6026/97320630002132

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Description

One of the most significant challenges in biology is to assign reliable functional associations to genes and gene products. Many proteins (and domains) in eukaryotic genomes belong to multi-member families; they participate in many cellular processes and are located in different parts of the cell. In recognition of this, the Gene Ontology (GO) consortium describes the function of a protein by annotating its molecular function, biological process and cellular component. [1] In the post-genomic era, there have been attempts at function annotation that draw data from varied sources. Biological data from a single type of source, though useful, are often limited in the extent to which they can help uncover functional associations, because of a systematic bias towards specific genes, gene families and pathways, because false positives are incorporated during data acquisition, or both. With the focus shifting from genes and proteins to biological systems, integrating information from multiple data types is seen as a more robust and accurate means of unraveling functional associations. [2]

Several attempts have been made to obtain biological data from multiple data types and to implement statistical frameworks for their integration. These studies employ data sources such as global gene expression patterns [3], yeast two-hybrid data [4], genomic characteristics [5], genome-wide RNAi screens [6] and literature information. Approaches that combine information from multiple data types with information from peer-reviewed scientific literature are particularly successful in providing functional associations for genes and gene products. [7] Most of these frameworks output profiles or clusters of genes based on their similarities within a particular data source, and interpreting them depends largely on expert knowledge. For instance, microarray technology allows the simultaneous study of the expression patterns of thousands of genes under specific conditions.
Expression data are analysed (using approaches such as clustering) to identify sets of genes with similar expression patterns, which are assumed to function in similar physiological processes. Clustering partitions genes so that those with similar expression patterns fall into the same groups, called clusters. Since gene clusters tend to be enriched in specific functional categories, identifying such clusters may be used to assign putative functional associations to the uncharacterised genes within them. Approaches such as hierarchical clustering, k-means, self-organizing maps (SOM) and principal component analysis (PCA) have been employed to identify sets of co-expressed genes, and tools are available for visualising these clusters. Different datasets often provide overlapping or complementary information because of the hierarchy in the definition of gene function [8]; integrating knowledge from various data types thus provides a uniform view of functional associations and is most useful when coupled with expert knowledge. A few such attempts have proved highly successful in annotating prokaryotic genomes [9], and there have been a few attempts in eukaryotic genomes as well. [3]

In a recent study, knowledge from structure-function analysis of 3-D structures and sequences, gene expression profiling, text mining, protein-protein interactions and knowledge-based computational tools (Figure 1) was used extensively to manually assign any of the three GO [1] categories to the putative members of the trypsin-like serine protease (SP) family encoded in the genome of Drosophila melanogaster. [10] Through this approach, functional information was obtained for 190 gene products containing serine protease-like domains. For 30 of these 190 gene products, the approach provides significant functional information, allowing a putative function to be assigned with high confidence.
Of these, ten are supported by literature curation and four by FlyBase annotations, while the annotations for 16 gene products are derived entirely from analysis of the large-scale datasets employed in the study (http://caps.ncbs.res.in/download/Bioinformation/). Many Drosophila SPs and SPHs (which are likely to be proteolytically inactive due to mutations in the residues of the serine protease catalytic triad) were found to be involved in development and immune response, which would explain the diversity observed for this gene family in Drosophila. The approach also helps uncover putative functional associations between genes involved in different metabolic pathways. For example, Drosophila SP CG3066 is a monophenol monooxygenase activator involved in activating melanization, chiefly in response to fungal infection, and is believed to take part in a possible cross-talk between melanization and the Toll pathway. Time-series expression data suggest that expression of CG3066 is correlated with that of Easter and Snake, members of the Toll signaling pathway in Drosophila. Studies also suggest high similarity between the putative active sites of CG3066 and Easter. Thus, CG3066 may act in association with components of the Toll pathway in early embryogenesis.
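
The co-expression reasoning used above (genes whose expression profiles rise and fall together are candidates for a shared process) can be illustrated with a minimal Python sketch. It is not the procedure used in the cited study: the gene names, the expression values, the 0.9 threshold and the helper names `pearson` and `coexpressed_partners` are all invented for illustration, and Pearson correlation is only one of several possible similarity measures.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / norm


def coexpressed_partners(query, profiles, threshold=0.9):
    """Genes whose correlation with `query` is >= threshold, best first."""
    hits = []
    for gene, profile in profiles.items():
        if gene == query:
            continue
        r = pearson(profiles[query], profile)
        if r >= threshold:
            hits.append((gene, r))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented expression values at five time points (illustrative only).
profiles = {
    "gene_query": [1.0, 2.0, 4.0, 3.0, 1.0],
    "gene_a": [2.0, 4.0, 8.0, 6.0, 2.0],  # same shape, twice the amplitude
    "gene_b": [1.1, 2.1, 3.9, 3.2, 0.9],  # similar shape with small deviations
    "gene_c": [4.0, 3.0, 1.0, 2.0, 4.0],  # opposite trend
}

partners = coexpressed_partners("gene_query", profiles)
for gene, r in partners:
    print(f"{gene}\t{r:.3f}")
```

In this toy data set, `gene_a` and `gene_b` pass the threshold while the anti-correlated `gene_c` does not. A correlation of this kind only suggests a putative association; as the text notes, it is most useful when combined with other evidence such as active-site similarity and expert knowledge.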

Figure 1

A schematic representation of biological data sources and evidence types employed for the enhanced function annotation of Drosophila SPs

Thus, integration of biological data from diverse sources provides an effective means for large-scale function annotation of genes, multi-member gene families and networks. The evolution of such tools is likely to gain further momentum as enormous amounts of high-throughput experimental data from diverse sources become available in the near future.
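
As a rough illustration of what integrating evidence types can mean numerically, the sketch below combines per-source confidence scores with a noisy-OR rule, which treats the sources as independent. This is an assumption made for the example, not the statistical framework of any study discussed here; the source names, the scores and the function name `combine_evidence` are hypothetical.

```python
def combine_evidence(scores):
    """Noisy-OR combination of per-source confidence scores in [0, 1].

    Assumes the evidence sources are independent: the combined score is
    the probability that at least one source is correct.
    """
    remaining_doubt = 1.0
    for source, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {source!r} must lie in [0, 1]")
        remaining_doubt *= 1.0 - score
    return 1.0 - remaining_doubt


# Hypothetical confidence scores for one gene-function association.
evidence = {
    "coexpression": 0.6,
    "text_mining": 0.5,
    "protein_interaction": 0.2,
}

combined = combine_evidence(evidence)
print(f"combined confidence: {combined:.2f}")
```

Under the independence assumption, several weak but concordant sources yield a higher combined confidence than any single source. Real data sources are rarely independent and often share systematic biases, so a practical framework would need to calibrate or down-weight correlated evidence.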

10 in total

1. A Bayesian networks approach for predicting protein-protein interactions from genomic data.

Authors: Ronald Jansen; Haiyuan Yu; Dov Greenbaum; Yuval Kluger; Nevan J Krogan; Sambath Chung; Andrew Emili; Michael Snyder; Jack F Greenblatt; Mark Gerstein
Journal: Science Date: 2003-10-17 Impact factor: 47.728

2. Re-analysis of data and its integration. (Review)

Authors: Lars Juhl Jensen; Lars M Steinmetz
Journal: FEBS Lett Date: 2005-03-21 Impact factor: 4.124

3. A framework of integrating gene relations from heterogeneous data sources: an experiment on Arabidopsis thaliana.

Authors: Jiexun Li; Xin Li; Hua Su; Hsinchun Chen; David W Galbraith
Journal: Bioinformatics Date: 2006-07-04 Impact factor: 6.937

4. Literature mining for the biologist: from information retrieval to biological discovery. (Review)

Authors: Lars Juhl Jensen; Jasmin Saric; Peer Bork
Journal: Nat Rev Genet Date: 2006-02 Impact factor: 53.242

5. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.

Authors: P Uetz; L Giot; G Cagney; T A Mansfield; R S Judson; J R Knight; D Lockshon; V Narayan; M Srinivasan; P Pochart; A Qureshi-Emili; Y Li; B Godwin; D Conover; T Kalbfleisch; G Vijayadamodar; M Yang; M Johnston; S Fields; J M Rothberg
Journal: Nature Date: 2000-02-10 Impact factor: 49.962

6. A combined algorithm for genome-wide prediction of protein function.

Authors: E M Marcotte; M Pellegrini; M J Thompson; T O Yeates; D Eisenberg
Journal: Nature Date: 1999-11-04 Impact factor: 49.962

7. Genome-wide screening for gene function using RNAi in mammalian cells. (Review)

Authors: Lara M Cullen; Greg M Arndt
Journal: Immunol Cell Biol Date: 2005-06 Impact factor: 5.126

8. Gene Ontology: looking backwards and forwards.

Authors: Suzanna E Lewis
Journal: Genome Biol Date: 2004-12-15 Impact factor: 13.583

9. Enhanced function annotations for Drosophila serine proteases: a case study for systematic annotation of multi-member gene families.

Authors: Parantu K Shah; Lokesh P Tripathi; Lars Juhl Jensen; Murad Gahnim; Christopher Mason; Eileen E Furlong; Veronica Rodrigues; Kevin P White; Peer Bork; R Sowdhamini
Journal: Gene Date: 2007-10-15 Impact factor: 3.688

10. STRING: known and predicted protein-protein associations, integrated and transferred across organisms.

Authors: Christian von Mering; Lars J Jensen; Berend Snel; Sean D Hooper; Markus Krupp; Mathilde Foglierini; Nelly Jouffre; Martijn A Huynen; Peer Bork
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
