VIRGO: computational prediction of gene functions.

Naveed Massjouni, Corban G Rivera, T M Murali.

Abstract

Dramatic advances in sequencing technology and sophisticated experimental assays that interrogate the cell, combined with the public availability of the resulting data, herald the era of systems biology. However, the biological functions of more than 40% of the genes in sequenced genomes are unknown, posing a fundamental barrier to progress in systems biology. The large scale and diversity of available data require the development of techniques that can automatically utilize these datasets to make quantified and robust predictions of gene function that can be experimentally verified. We present a service called the VIRtual Gene Ontology (VIRGO) that (i) constructs a functional linkage network (FLN) from gene expression and molecular interaction data, (ii) labels genes in the FLN with their functional annotations in the Gene Ontology and (iii) systematically propagates these labels across the FLN in order to precisely predict the functions of unlabelled genes. VIRGO assigns confidence estimates to predicted functions so that a biologist can prioritize predictions for further experimental study. For each prediction, VIRGO also provides an informative 'propagation diagram' that traces the flow of information in the FLN that led to the prediction. VIRGO is available at http://whipple.cs.vt.edu:8080/virgo.

Year: 2006 PMID: 16845022 PMCID: PMC1538839 DOI: 10.1093/nar/gkl225

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

MOTIVATION

More than 250 complete genome sequences are now available, including those of 35 eukaryotes (1). Increasingly sophisticated high-throughput biological experiments provide a wide range of functional genomic information about cell state. These advances, combined with the public availability of these datasets, herald the era of systems biology (2,3). However, a fundamental roadblock to progress in systems biology is the poor state of knowledge about the biological functions of the genes in sequenced genomes (4,5). Using sequence similarity to predict gene function provides annotations only for about 40% of eukaryotic genes (6). Some of these annotations may also be incorrect, as they are transmitted from one genome to another via weak chains of inference (7). Genes of unknown function might support important cellular functions. Discovering the functions of these genes will provide critical insights into the biology of many organisms. In addition, discovering these functions will improve our ability to annotate genomes sequenced in the future. The large scale and diversity of available functional genomic data require the development of novel computational tools that can automatically integrate these data in order to compute quantified and testable predictions of the functions of poorly understood genes. In this paper, we provide a powerful interface called ‘the VIRtual Gene Ontology’ (VIRGO) that enables a biologist to integrate gene expression data collected in the laboratory with molecular interaction networks, construct a functional linkage network (FLN) from these datasets, label the genes in the FLN with functional annotations from the Gene Ontology (GO) (8) and systematically propagate these labels across the FLN in order to predict the functions of unlabelled genes. The biologist can query VIRGO for predictions of interest and prioritize them using confidence values assigned by VIRGO. 
VIRGO also provides informative ‘propagation diagrams’ that trace the flow of information in the FLN. These diagrams may assist the biologist in ascertaining the rationale behind a prediction. A number of powerful methods have been published for predicting gene function by integrating different types of functional genomic data (9–16). As far as we are aware, VIRGO is the first web server that makes such a prediction engine widely available.

THE VIRGO SYSTEM

Figure 1 displays the VIRGO system. We describe its main components below.

Figure 1

The VIRGO system. Solid arrows indicate a biologist's interaction with VIRGO. Dotted arrows indicate flow of information and computation within VIRGO. Dashed lines indicate generation of biological hypotheses and experimental data that we hope VIRGO will inspire.

Functional Linkage Networks

A promising basis for predicting gene function identifies functional associations of genes of unknown function with genes of known function. Diverse sources of biological data contain evidence for such associations. For instance, two genes may have the same function if their protein products interact (17,18) or if they have very similar patterns of gene expression (19,20). An FLN (21–24) is a powerful framework for representing and analysing such relationships. An FLN is a graph in which each node corresponds to a gene; the node is labelled by the set of functions that annotate the gene. An edge in an FLN connects two genes if some experimental or computational procedure suggests that these genes might share the same function. Each edge in the FLN has a real-valued weight; the sign of the weight indicates whether the connected genes share or do not share the function, while the magnitude of the weight reflects our confidence in the edge. A number of on-line databases (24–34) have assembled large collections of functional links between genes by curating the literature or by combining multiple experimental and computational procedures. Other authors have proposed techniques for constructing FLNs that integrate multiple sources of data (22,35) or FLNs that are based on gene expression data analysed across multiple species (19,36). Although these databases and algorithms are highly valuable sources of functional associations, many of them focus on constructing FLNs and do not address the question of using FLNs for automatically predicting gene functions.
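The FLN described above can be sketched as a small data structure. This is only an illustrative sketch, not VIRGO's or GAIN's implementation: the class name, method names, gene names and edge weights are all hypothetical, and the only properties taken from the text are that nodes are genes labelled with sets of functions and that edges carry signed, real-valued weights.

```python
from collections import defaultdict


class FunctionalLinkageNetwork:
    """Toy FLN: genes are nodes, weighted edges are functional links.

    Hypothetical illustration only; not the data model used by VIRGO.
    """

    def __init__(self):
        self.labels = defaultdict(set)    # gene -> set of GO term IDs
        self.weights = defaultdict(dict)  # gene -> {neighbour: weight}

    def annotate(self, gene, go_term):
        """Label a gene with one function."""
        self.labels[gene].add(go_term)

    def link(self, gene_a, gene_b, weight):
        """Add an undirected edge.

        The sign says whether the genes are believed to share a function;
        the magnitude is the confidence in the edge.
        """
        self.weights[gene_a][gene_b] = weight
        self.weights[gene_b][gene_a] = weight

    def unlabelled(self):
        """Genes that appear in the network but have no annotation."""
        return sorted(g for g in self.weights if not self.labels.get(g))


# Made-up genes and weights, purely for illustration.
fln = FunctionalLinkageNetwork()
fln.annotate("geneA", "GO:0003723")  # GO:0003723 is 'RNA binding'
fln.link("geneA", "geneB", 0.9)      # strong evidence of shared function
fln.link("geneB", "geneC", -0.4)     # weak evidence of differing function
print(fln.unlabelled())              # ['geneB', 'geneC']
```

The unlabelled genes returned at the end are the ones a propagation method would try to annotate from their labelled neighbours.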

The GAIN algorithm

VIRGO uses the ‘Gene Annotation using Integrated Networks’ (GAIN) (21) algorithm as its function prediction engine. GAIN automatically and robustly suggests putative functions by systematically propagating functional annotations across the FLN while exploiting the constraints imposed by the topology of the FLN. In earlier work (21), we evaluated GAIN by integrating a protein–protein interaction network for S. cerevisiae based on the GRID database (30) and a gene expression dataset with 300 conditions (gene knockouts and chemical treatments) (37). The protein–protein interaction network provided the edges of the FLN. We assigned each edge in the FLN a weight equal to the absolute value of the Pearson's correlation coefficient of the expression profiles of the genes incident on the edge. We used the GO functional annotations for S. cerevisiae as of December 1, 2002. We considered those GO functions that annotated at most 10% of the genes in the FLN and for which GAIN achieved at least 75% precision and recall on average on leave-one-out cross validation. We restricted our attention to those predicted gene-function pairs where the function belonged to this set of 485 functions. Since GAIN may predict multiple functions for a gene, one predicted function may be an ancestor of another predicted function in the GO directed acyclic graph (DAG). Therefore, if GAIN predicted a gene as having two functions where one function is an ancestor of the other, we discarded the ancestor as a prediction. These steps yielded 207 predicted gene-function pairs spanning 130 distinct genes, 98 distinct functions and all three GO categories. We compared these 207 predictions to the GO annotations for S. cerevisiae as of March 24, 2006. We computed the distance in the GO DAG between each function predicted by GAIN for a gene (based on the 2002 dataset) and the correct annotation in the same GO category as the predicted function (if one existed in the 2006 dataset). 
For each gene, we selected the predicted function (in each GO category) that achieved the smallest distance to a true annotation for that gene. We calculated that 11 predictions are correct, 12 predicted functions are either parents or children of the true function in the GO DAG, 36 predicted functions are at a distance of two in the GO DAG from the true function, and 3 predicted functions are at a distance of three from the true function. These 62 predictions span 52 genes (GAIN predicted functions in multiple GO categories for some genes). The 78 genes involved in the remaining predictions continue to have no biologically validated functions in the same GO category as the predicted function. A table listing all the comparisons we performed is available in the Supplementary Data. The validated predictions include nucleolus, chromatin remodeling complex, snoRNA binding, RNA binding and vesicle-mediated transport. These results demonstrate GAIN's ability to make accurate predictions of gene function.
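Two of the steps in this evaluation are simple enough to sketch: weighting an edge by the absolute Pearson correlation of two expression profiles, and discarding a predicted function when a more specific predicted function descends from it in the GO DAG. The sketch below is a hedged illustration and not GAIN's code. The function names, the `parents` mapping (term to its direct parents) and the toy terms are assumptions, and the correlation assumes profiles of equal length that are not constant.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def edge_weight(profile_a, profile_b):
    """Edge weight as described in the text: |Pearson correlation|."""
    return abs(pearson(profile_a, profile_b))


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def drop_ancestors(predicted, parents):
    """Keep only the most specific predictions for one gene."""
    predicted = set(predicted)
    redundant = set()
    for term in predicted:
        redundant |= ancestors(term, parents) & predicted
    return predicted - redundant


# Toy DAG with hypothetical term names: child -> mid -> root.
toy_parents = {"child": ["mid"], "mid": ["root"]}
print(drop_ancestors({"child", "root"}, toy_parents))  # {'child'}
print(round(edge_weight([1, 2, 3], [3, 2, 1]), 6))     # 1.0
```

Taking the absolute value means that strongly anti-correlated profiles receive the same weight as strongly correlated ones, which matches the weighting stated in the text.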

The VIRGO pipeline

VIRGO is implemented in Java 1.4 and uses the Apache Jakarta Tomcat web server. The VIRGO database uses a PostgreSQL backend. GAIN is implemented in C++. A typical session for a biologist with VIRGO involves the following steps:

1. The biologist collects a gene expression dataset in the laboratory and uploads the data to VIRGO. At this stage, the biologist has the option of telling VIRGO to make the resulting predictions public, i.e. available to all users of VIRGO. VIRGO's default policy is to keep the predictions private.
2. VIRGO invokes GAIN to integrate the gene expression dataset with molecular interactions to construct an FLN.
3. GAIN processes the FLN in two separate steps. The first step uses the FLN and existing annotations in GO to compute new predictions of gene function. In the optional second step, the biologist can measure GAIN's performance using leave-one-out cross validation. At the end of each step, VIRGO parses GAIN's output files, stores the results in VIRGO's database and informs the biologist by email that the step has completed.
4. The biologist queries VIRGO to find high-quality predictions using propagation diagrams, confidence estimates, and other statistics as aids.

Figure 2 displays a typical propagation diagram.
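The optional cross-validation step can be illustrated with a minimal sketch. GAIN's propagation procedure is not described in this section, so the predictor below is a deliberately simple stand-in (a weighted vote among labelled neighbours) and should not be read as GAIN. All names, the toy network and the assumption of non-negative edge weights are hypothetical.

```python
def predict(gene, function, weights, labels):
    """Stand-in predictor (NOT GAIN): weighted vote of labelled neighbours.

    Assumes non-negative edge weights. A neighbour annotated with
    `function` votes for it; a neighbour with other annotations votes
    against it; unlabelled neighbours do not vote.
    """
    score = 0.0
    for neighbour, weight in weights.get(gene, {}).items():
        if neighbour in labels:
            score += weight if function in labels[neighbour] else -weight
    return score > 0


def leave_one_out(function, weights, labels):
    """Hide each labelled gene in turn, predict it, and tally the results."""
    tp = fp = fn = 0
    for gene in labels:
        # Remove the held-out gene's annotations before predicting it.
        held_out = {g: f for g, f in labels.items() if g != gene}
        predicted = predict(gene, function, weights, held_out)
        actual = function in labels[gene]
        tp += predicted and actual
        fp += predicted and not actual
        fn += actual and not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy network: a-b strongly linked, a-c weakly linked.
toy_weights = {"a": {"b": 0.9, "c": 0.2}, "b": {"a": 0.9}, "c": {"a": 0.2}}
toy_labels = {"a": {"F"}, "b": {"F"}, "c": {"G"}}
precision, recall = leave_one_out("F", toy_weights, toy_labels)
print(precision, recall)
```

In this toy case genes `a` and `b` are recovered correctly and gene `c` is a false positive, so recall is 1.0 and precision is 2/3. Thresholds on such per-function precision and recall are what allow a set of reliably predictable functions to be selected.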

Figure 2

This propagation diagram supports the prediction that gene YNL016W (PUB1) is annotated with the molecular function ‘RNA binding’ (GO:0003723). Red rectangles denote genes annotated with this function. Blue diamonds represent genes annotated with a different function. Octagons represent genes that either have no known function or are annotated with a function that is an ancestor of ‘RNA binding.’ Of these, the red octagon is the gene of interest. Other blue octagons represent genes that are also predicted to have this function. Red edges are incident on annotated nodes and help to visualize the flow of information in this network. The propagation diagram generated by VIRGO also displays edge weights, which we do not show in this picture.

In the long run, we hope that a biologist will be able to use VIRGO to develop new hypotheses and perform new experiments which will yield further datasets for analysis by VIRGO.

Supported organisms and datasets

Currently, VIRGO supports analysis for S. cerevisiae and H. sapiens. We chose these two organisms since they have large and diverse collections of protein–protein interaction datasets and gene expression measurements. We periodically download these interactions from the respective websites and functional annotations from the GO website. We use the GRID dataset (30) for S. cerevisiae. For H. sapiens, we obtained 31610 interactions between 7393 human proteins from the IDSERVE database (29). We also included 3270 human interactions derived using large-scale yeast two-hybrid experiments from Stelzl et al. (38), and 6726 human protein–protein interactions (PPIs) from Rual et al. (39). Overall, this human protein interaction network (PIN) contains 6274 proteins and 34087 interactions and represents interactions from a diverse variety of sources.
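Combining interaction sets from several sources into one network can be sketched as follows. This is a minimal illustration under stated assumptions, not VIRGO's import code: the function name and protein identifiers are hypothetical, interactions are treated as undirected, and self-interactions are dropped. It shows why an interaction reported by more than one source is counted only once in the merged network.

```python
def merge_interactions(*sources):
    """Merge interaction lists into one undirected, de-duplicated network.

    Each source is an iterable of (protein_a, protein_b) pairs.
    Returns (set of proteins, set of frozenset edges).
    """
    edges = set()
    for source in sources:
        for a, b in source:
            if a != b:  # assumption: ignore self-interactions
                edges.add(frozenset((a, b)))  # (a, b) and (b, a) coincide
    proteins = set().union(*edges) if edges else set()
    return proteins, edges


# Hypothetical identifiers; the second source repeats one interaction
# from the first in the opposite orientation.
source_1 = [("P1", "P2"), ("P2", "P3")]
source_2 = [("P2", "P1"), ("P3", "P4")]
proteins, edges = merge_interactions(source_1, source_2)
print(len(proteins), len(edges))  # 4 3
```

Storing each edge as a `frozenset` makes the orientation of a reported pair irrelevant, so the merged edge count can be smaller than the sum of the per-source counts.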

CONCLUSIONS AND FUTURE WORK

We have developed VIRGO, a web server for automated prediction of gene functions. A biologist can use VIRGO to obtain predictions for a system of interest by analysing relevant gene expression data integrated with molecular interactions. VIRGO provides useful auxiliary information to the biologist to assess the quality of the predictions and to prioritize them for further analysis. It is easy to extend VIRGO to other organisms for which gene expression data and functional annotations exist. We anticipate adding support for D. melanogaster, C. elegans, and P. falciparum in the near future. VIRGO will also support functional predictions in organisms for which there are no publicly available datasets of molecular interactions.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

REFERENCES (39 in total)

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Functional discovery via a compendium of expression profiles.

Authors: T R Hughes; M J Marton; A R Jones; C J Roberts; R Stoughton; C D Armour; H A Bennett; E Coffey; H Dai; Y D He; M J Kidd; A M King; M R Meyer; D Slade; P Y Lum; S B Stepaniants; D D Shoemaker; D Gachotte; K Chakraburtty; J Simon; M Bard; S H Friend
Journal: Cell Date: 2000-07-07 Impact factor: 41.582

3. Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons.

Authors: Alvaro Mateos; Joaquín Dopazo; Ronald Jansen; Yuhai Tu; Mark Gerstein; Gustavo Stolovitzky
Journal: Genome Res Date: 2002-11 Impact factor: 9.043

4. Building with a scaffold: emerging strategies for high- to low-level cellular modeling.

Authors: Trey Ideker; Douglas Lauffenburger
Journal: Trends Biotechnol Date: 2003-06 Impact factor: 19.536

5. Bioverse: Functional, structural and contextual annotation of proteins and proteomes.

Authors: Jason McDermott; Ram Samudrala
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

6. A combined algorithm for genome-wide prediction of protein function.

Authors: E M Marcotte; M Pellegrini; M J Thompson; T O Yeates; D Eisenberg
Journal: Nature Date: 1999-11-04 Impact factor: 49.962

7. Global protein function prediction from protein-protein interaction networks.

Authors: Alexei Vazquez; Alessandro Flammini; Amos Maritan; Alessandro Vespignani
Journal: Nat Biotechnol Date: 2003-05-12 Impact factor: 54.908

8. Mapping Gene Ontology to proteins based on protein-protein interaction data.

Authors: Minghua Deng; Zhidong Tu; Fengzhu Sun; Ting Chen
Journal: Bioinformatics Date: 2004-01-29 Impact factor: 6.937

9. The GRID: the General Repository for Interaction Datasets.

Authors: Bobby-Joe Breitkreutz; Chris Stark; Mike Tyers
Journal: Genome Biol Date: 2003-02-27 Impact factor: 13.583

10. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae).

Authors: Olga G Troyanskaya; Kara Dolinski; Art B Owen; Russ B Altman; David Botstein
Journal: Proc Natl Acad Sci U S A Date: 2003-06-25 Impact factor: 12.779
