
ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data.

Alexey V Antonov, Thorsten Schmidt, Yu Wang, Hans W Mewes.

Abstract

ProfCom is a web-based tool for the functional interpretation of a list of genes that experiments have identified as related. What makes ProfCom unique is its ability to profile enrichments not only of available Gene Ontology (GO) terms but also of 'complex functions'. A 'complex function' is constructed as a Boolean combination of available GO terms. The complex functions inferred by ProfCom are more specific than single terms and describe the functional role of genes more accurately. ProfCom provides a user-friendly, dialog-driven web page for submission, is available for several model organisms and supports most available gene identifiers. In addition, the web service interface allows the submission of any kind of annotation data. ProfCom is freely available at http://webclu.bio.wzw.tum.de/profcom/.


Year: 2008 PMID: 18460543 PMCID: PMC2447768 DOI: 10.1093/nar/gkn239

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Relating experimental data to biological knowledge is necessary to cope with the data avalanches emerging from recent developments in high-throughput technologies. Automatic functional profiling has become the de facto approach for the secondary analysis of high-throughput data. A number of tools employing available gene functional annotations as well as pathway databases have been developed (1–18). The advantages and limitations of most of these tools are reviewed in ref. (19). An important limitation of the standard functional profiling methodology is its inability to overcome the limits of the employed annotation vocabularies. Do current annotation vocabularies cover all possible biological functions? Can they cover them in the future? The space of possible biological functions is almost infinite. However, to cover it one does not need an infinite number of functional terms. Consider a very direct analogy. Human language contains a limited number of words, but through grammar rules these words can be combined into an almost infinite number of sentences, which allow the expression of almost any idea. In our previous paper (20), we proposed to construct new functional terms (referred to as ‘complex functions’). A ‘complex function’ is constructed as a combination of available terms. The three Boolean operations (‘AND’, ‘OR’, ‘NOT’) play the role of grammar rules, and the resulting space of ‘complex functions’ covers an almost infinite number of possible biological functions. The present article describes ProfCom, a web tool for functional profiling based on the concept described previously (20). ProfCom supports automatic analyses for several model organisms and also provides a web service interface, which allows the submission of any kind of annotation data. For each organism, ProfCom provides analysis of different annotations, including Gene Ontology (GO) (21), FunCat (22) and InterPro Motifs (23).
ProfCom currently offers automatic analyses for Homo sapiens, Mus musculus, Rattus norvegicus, Caenorhabditis elegans, Drosophila melanogaster and Saccharomyces cerevisiae. In addition, any organism and annotation can be analyzed by ProfCom using the Web service interface.

MATERIALS AND METHODS

Statistical analysis and ProfCom profiling engine

A standard tool for automatic functional profiling accepts a query list of genes (referred to as set A, usually the set of genes experimentally identified to be related to the studied biological phenomenon) and a reference set (referred to as set B, usually the set of all genes from the analyzed organism). Then, for each attribute f from the set F (f is usually a functional term from the employed annotation vocabulary F, i.e. GO, FunCat, etc.), the number a of genes in set A and the number b of genes in set B that have been annotated with f are counted. In the next step, the null hypothesis H0 (membership of a gene in set A is independent of having attribute f) is tested. Hypergeometric, binomial or χ2-tests are usually employed to find over/under represented attributes (19). Unlike most currently available web tools for functional profiling, ProfCom implements a different profiling paradigm. Along with standard profiling of functional terms f (referred to as ‘base’ categories) from annotation vocabularies, it also searches for enrichment related to ‘complex functions’, which are defined as any Boolean combination of ‘base’ categories (for example, a new ‘complex function’ w may define the set of genes that belong simultaneously to the ‘base’ categories f1 and f2). We consider intersection, union and difference operations. For example, the intersection of two categories f1 and f2 is formally defined as the ‘complex function’ w = f1 ∩ f2. In other words, w corresponds to the set of genes that belong to both categories f1 and f2. The union of two categories f1 and f2 is formally defined as w = f1 ∪ f2. In this case, w corresponds to the set of genes that belong to category f1, to category f2 or to both. The difference between two categories f1 and f2 is formally defined as w = f1 \ f2; the ‘complex function’ w corresponds to the set of genes from category f1 excluding those that simultaneously belong to category f2.
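As a minimal sketch of the procedure described above (not ProfCom's actual implementation), the following Python fragment builds ‘complex functions’ from two base categories with set operations and scores each one with a one-sided hypergeometric test. The gene names, the category contents and the helper names `hypergeom_tail` and `enrichment` are invented for illustration only.

```python
from math import comb


def hypergeom_tail(a, n_a, b, n_b):
    """P(X >= a) when n_a genes are drawn without replacement from n_b genes,
    b of which carry the attribute (over-representation test)."""
    total = comb(n_b, n_a)
    tail = sum(
        comb(b, k) * comb(n_b - b, n_a - k)
        for k in range(a, min(n_a, b) + 1)
    )
    return tail / total


def enrichment(query, reference, members):
    """Return (a, b, p-value) for one category; query is assumed to be a subset of reference."""
    members = members & reference
    a = len(query & members)   # annotated genes in set A
    b = len(members)           # annotated genes in set B
    return a, b, hypergeom_tail(a, len(query), b, len(reference))


# Hypothetical toy data: set B has 20 genes, set A has 4.
reference = {f"g{i}" for i in range(1, 21)}
query = {"g1", "g2", "g3", "g4"}
f1 = {f"g{i}" for i in range(1, 9)}    # base category f1: g1..g8
f2 = {f"g{i}" for i in range(5, 11)}   # base category f2: g5..g10

complex_functions = {
    "f1": f1,
    "f1 AND f2": f1 & f2,   # intersection
    "f1 OR f2": f1 | f2,    # union
    "f1 NOT f2": f1 - f2,   # difference
}

results = {
    name: enrichment(query, reference, genes)
    for name, genes in complex_functions.items()
}

for name, (a, b, p) in results.items():
    print(f"{name:10s} a={a} b={b} p={p:.6f}")
```

In this toy case the difference ‘f1 NOT f2’ covers the same query genes as f1 with fewer genes in the reference set, so its P-value is smaller, which is the effect the ‘complex functions’ are meant to capture.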
Each ‘complex function’ is characterized by the number of base categories required to construct it. We will refer to this characteristic as degree. For example, the base categories can be defined as ‘complex functions’ of the first degree, while the category w = f1 ∩ f2 is a ‘complex function’ of the second degree (intersection). Consideration of all possible ‘complex functions’ leads to combinatorial complexity. Analyzing enrichments for all possible combinations of degree higher than 2 is computationally infeasible. For this reason, a search algorithm should be used. ProfCom employs an algorithm based on greedy heuristics (20). Greedy heuristics do not guarantee finding the optimal solution in every case but significantly reduce the computational complexity. To adjust P-values for multiple testing, ProfCom uses a Monte–Carlo simulation approach. The estimated P-value corresponds exactly to the definition of an experiment-wise Westfall and Young P-value (3,20,24). More details on the search algorithm and P-value adjustment can be found in Supplementary Materials.
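The exact search algorithm and adjustment procedure are given only in the Supplementary Materials, so the following Python sketch should be read as one plausible reading of the paragraph above rather than as ProfCom's method. It assumes a greedy search that extends the best function found so far by one base category at a time, and a Monte–Carlo adjustment that reruns the same search on random gene sets of the same size as set A. The function names, the `(hits + 1) / (n_sim + 1)` estimator and the toy data are all assumptions.

```python
import random
from math import comb

OPS = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "NOT": lambda x, y: x - y,
}


def p_enrich(query, reference, members):
    """One-sided hypergeometric P-value for over-representation."""
    members = members & reference
    a, n_a, b, n_b = len(query & members), len(query), len(members), len(reference)
    tail = sum(comb(b, k) * comb(n_b - b, n_a - k) for k in range(a, min(n_a, b) + 1))
    return tail / comb(n_b, n_a)


def greedy_search(query, reference, base, max_degree=3):
    """Greedily grow a complex function; return (label, gene set, P-value)."""
    start = min(base, key=lambda name: p_enrich(query, reference, base[name]))
    label, genes, used = start, base[start], {start}
    best_p = p_enrich(query, reference, genes)
    for _ in range(max_degree - 1):
        candidates = [
            (p_enrich(query, reference, op(genes, base[name])),
             f"({label} {op_name} {name})", op(genes, base[name]), name)
            for name in base if name not in used
            for op_name, op in OPS.items()
        ]
        if not candidates:
            break
        p, new_label, new_genes, name = min(candidates, key=lambda c: c[0])
        if p >= best_p:          # no improvement: stop extending
            break
        label, genes, best_p = new_label, new_genes, p
        used.add(name)
    return label, genes, best_p


def adjusted_p(query, reference, base, max_degree=3, n_sim=200, seed=0):
    """Experiment-wise P-value: how often a random set does at least as well."""
    rng = random.Random(seed)
    observed = greedy_search(query, reference, base, max_degree)[2]
    pool = sorted(reference)
    hits = sum(
        greedy_search(set(rng.sample(pool, len(query))), reference, base, max_degree)[2] <= observed
        for _ in range(n_sim)
    )
    return (hits + 1) / (n_sim + 1)


# Hypothetical toy data: 30 reference genes, 5 query genes, 3 base categories.
reference = {f"g{i}" for i in range(1, 31)}
query = {f"g{i}" for i in range(1, 6)}
base = {
    "f1": {f"g{i}" for i in range(1, 11)},
    "f2": {f"g{i}" for i in range(6, 16)},
    "f3": {"g3", "g4", "g20", "g21", "g22"},
}

best_label, best_genes, best_p = greedy_search(query, reference, base)
adj = adjusted_p(query, reference, base)
print(best_label, best_p, adj)
```

Because the whole search is repeated on every random set, the adjusted value accounts for all combinations the heuristic was allowed to try, not only for the one finally reported.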

Automatically supported annotations and gene Ids

As input ProfCom accepts several types of gene or protein identifiers. For example, for the human genome ProfCom supports identifiers from ‘Entrez Gene’ (25), ‘UniProt/Swiss-Prot’, ‘Gene Symbol’ (25,26), ‘UniGene’ (25), ‘Ensembl’ (27), ‘RefSeq Protein ID’, ‘RefSeq Transcript ID’ (28) and ‘Affymetrix probe codes’ (29). Additionally, a mixture of several identifier types is possible. In the first step, user-supplied gene Ids are mapped to ‘Entrez Gene’ identifiers. For this purpose, files from NCBI and Affymetrix websites are used. Detailed information on data sources used by ProfCom is in Table 1.

Table 1.

Types of gene identifiers recognized by ProfCom and data sources used for Id mapping

Type of Ids	File used
‘Gene Symbol’, ‘Ensembl’, ‘LocusTag’	ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
‘RefSeq Protein ID’, ‘RefSeq Transcript ID’	ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
‘UniProt/Swiss-Prot’	ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_refseq_uniprotkb_collab.gz
‘UniGene’	ftp://ftp.ncbi.nlm.nih.go/gene/DAT/gene2unigene
‘Affymetrix probe codes’	http://www.affymetrix.com/Annotation files

The user gets full information on the mapping of the supplied gene Ids. This information is presented in four output tables, shown online along with the ProfCom results. Output Table 1 reports full mapping details of recognized gene Ids. It includes the information source used as well as any multiple mappings of the user-supplied Ids to ‘Entrez Gene’ Ids. Output Table 2 reports unrecognized gene Ids. Output Table 3 reports the final mapping (one-to-one mapping), which is used in subsequent analyses. ProfCom implements simple heuristics to resolve multiple mapping issues. If it is possible to map a particular gene Id to several ‘Entrez Gene’ Ids, the Id which has the most abundant annotation is selected. However, if the user finds this mapping to be incorrect (output Table 3), he/she can simply resubmit the data, substituting those ambiguous gene Ids with the ‘Entrez Gene’ Ids considered to be correct. On the other hand, if several supplied gene Ids are mapped to the same ‘Entrez Gene’ Id, then they are considered as belonging to one gene and the Ids are reported joined by a semicolon (‘;’). Output Table 4 reports all such cases.
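The mapping heuristics described above can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not ProfCom's code: the lookup table `id_to_entrez`, the per-gene annotation counts and all identifiers in the example are hypothetical, and ‘most abundant annotation’ is assumed here to mean the largest number of annotation terms.

```python
from collections import defaultdict


def resolve_mapping(user_ids, id_to_entrez, annotation_count):
    """Map user-supplied Ids to one Entrez Gene Id each.

    Returns (one_to_one, unrecognized, merged):
      one_to_one   - user Id -> chosen Entrez Gene Id
      unrecognized - user Ids with no known mapping
      merged       - Entrez Gene Id -> user Ids joined by ';'
    """
    one_to_one = {}
    unrecognized = []
    for uid in user_ids:
        candidates = id_to_entrez.get(uid, [])
        if not candidates:
            unrecognized.append(uid)
            continue
        # Ambiguous Id: keep the Entrez Gene Id with the most annotation terms.
        one_to_one[uid] = max(candidates, key=lambda e: annotation_count.get(e, 0))

    # Several user Ids pointing at the same gene are treated as one gene.
    grouped = defaultdict(list)
    for uid, entrez in one_to_one.items():
        grouped[entrez].append(uid)
    merged = {entrez: ";".join(ids) for entrez, ids in grouped.items()}
    return one_to_one, unrecognized, merged


# Hypothetical identifiers and counts, for illustration only.
id_to_entrez = {
    "SYM_A": ["1001", "1002"],   # ambiguous symbol
    "ENS_A": ["1001"],
    "SYM_B": ["2001"],
}
annotation_count = {"1001": 12, "1002": 3, "2001": 5}
user_ids = ["SYM_A", "ENS_A", "SYM_B", "UNKNOWN_1"]

one_to_one, unrecognized, merged = resolve_mapping(user_ids, id_to_entrez, annotation_count)
print(one_to_one)
print(unrecognized)
print(merged)
```

Submitting ‘Entrez Gene’ Ids directly bypasses the ambiguous branch entirely, since each Id then has exactly one candidate.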

Table 2

Data file used by ProfCom to automatically retrieve annotations

Annotation	File used
Gene Ontology	ftp://ftp.ncbi.nlm.nih.go/gene/DAT/gene2unigene
InterPro Motifs	ftp://ftp.ebi.ac.uk/pub/databases/interpro/protein2ipr.dat
FunCat	http://mips.gsf.de/

We would like to point out that protein and gene identifiers can be highly ambiguous (30), with multiple synonymous variants. For this reason, the quality of the retrieved annotation can differ between types of identifiers. Several powerful resources for mapping different types of gene Ids exist (http://beta.uniprot.org/). To avoid multiple mapping issues, we recommend submitting ‘Entrez Gene’ identifiers to ProfCom. ProfCom automatically supports several annotations. Currently, they include GO (21), FunCat (22) and InterPro Motifs (23). Detailed information on the data sources used to retrieve each annotation is presented in Table 2. The ProfCom web interface allows the user to use all annotations simultaneously or to combine them. In addition to the interactive web submissions, custom annotation data can be analyzed using the ProfCom Web service. This allows the use of ProfCom for almost any problem domain, e.g. different annotation types or organisms. Furthermore, web services enable one to run ProfCom analyses in pipelines or automated workflows from most systems. This ensures fast and convenient usage for a broad range of use cases, from quick hypothesis evaluation to detailed high-quality annotations.

Implementation

ProfCom runs on a standard Apache/Tomcat web server. The actual profiling algorithm is implemented in Java and C for platform independence and high performance. The computation is distributed across Linux workstations using a Sun Grid Engine, which ensures scalability. A ProfCom analysis starts with a user-friendly, dialog-driven web form. In the first step, the model organism is chosen and the list of gene or protein names of interest is uploaded. Optionally, the reference set of genes can be uploaded. By default, the set of all annotated genes (‘Entrez Gene’ Ids) from the chosen organism is used as the reference set. Depending on the chosen organism, the ProfCom web page automatically shows all available annotations.

Illustration of ProfCom model inference process

Here, we present one example of an analysis of real data by ProfCom to illustrate its novelty and utility in comparison to existing related tools. More examples can be found in Supplementary Materials, where we bring together several independent studies that performed gene expression analyses to identify over/under expressed genes in different cancer types. We collected the set of differentially expressed genes originally identified in each study (we refer to each of these sets as set A, and the set of all human genes is referred to as set B). In ref. (31), microarray experiments were done to compare gene expression in 50 ovarian cancer specimens, including all four histotypes, to gene expression in five pools of normal ovarian surface epithelial cells. Data were analyzed to determine whether changes in gene expression correlated with different histotypes, grade or stage. Several sets of genes that show the greatest ability to differentiate between the considered cancer subtypes were originally identified. For example, 47 selected genes were 2-fold differentially expressed in mucinous ovarian cancers compared to other histotypes and to normal ovarian surface epithelial cells. Standard functional profiling reveals several significantly overrepresented GO terms. It is widely known that the processes of Ca++ homeostasis are often disordered in many cancer types (32). Therefore, the presence of the GO term ‘calcium-ion binding’ among the top enriched GO terms is of particular interest. Eight genes (MRC1, EFHD2, PLS1, ANXA10, LDLR, MMP1, S100P, THBS2) from set A are related by this term (Figure 1). On the other hand, there are 894 genes in the whole human genome classified as ‘calcium-ion binding’. Using conventional GO term vocabularies, the standard profiling procedure is not able to supply evidence that would discriminate these eight genes (from all 894 human ‘calcium-ion binding’ genes) and, thus, to clarify the molecular mechanism involved.

Figure 1.

ProfCom output table ‘Top enriched categories of degree 1’ for the considered example.

The complex function ‘calcium-ion binding EXCLUDING integral to membrane EXCLUDING hydrolase activity’ inferred by ProfCom (Figure 2) relates all ‘calcium-ion binding’ genes from set A and is more specific than the single GO term, i.e. only 533 genes (compared to 894) in the human genome are classified by this complex function. It is not only better from a statistical viewpoint (equal selectivity with ∼1-fold increase in specificity), but also supplies valuable biological information which can be helpful for drawing biological conclusions about the molecular mechanisms involved in the considered cancer type.
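To make the statistical comparison concrete, the counts quoted in this example (8 of 47 query genes, 894 versus 533 annotated genes in the genome) can be plugged into a hypergeometric test. The size of the reference set is not stated in the text, so the value used below is a placeholder assumption and the resulting P-values are illustrative only; the ratio of the two fold enrichments, by contrast, does not depend on that assumption.

```python
from math import comb

QUERY_SIZE = 47           # genes in set A (from the text)
HITS = 8                  # 'calcium-ion binding' genes in set A (from the text)
SINGLE_TERM = 894         # genes annotated with the single GO term (from the text)
COMPLEX_FUNCTION = 533    # genes covered by the complex function (from the text)
ASSUMED_REFERENCE_SIZE = 20000   # ASSUMPTION: not given in the text


def tail_p(a, n_a, b, n_b):
    """One-sided hypergeometric P(X >= a)."""
    tail = sum(comb(b, k) * comb(n_b - b, n_a - k) for k in range(a, min(n_a, b) + 1))
    return tail / comb(n_b, n_a)


def fold_enrichment(a, n_a, b, n_b):
    """Observed fraction in set A divided by the background fraction in set B."""
    return (a / n_a) / (b / n_b)


fold_single = fold_enrichment(HITS, QUERY_SIZE, SINGLE_TERM, ASSUMED_REFERENCE_SIZE)
fold_complex = fold_enrichment(HITS, QUERY_SIZE, COMPLEX_FUNCTION, ASSUMED_REFERENCE_SIZE)

# The reference size cancels out here: gain == 894 / 533.
gain = fold_complex / fold_single

p_single = tail_p(HITS, QUERY_SIZE, SINGLE_TERM, ASSUMED_REFERENCE_SIZE)
p_complex = tail_p(HITS, QUERY_SIZE, COMPLEX_FUNCTION, ASSUMED_REFERENCE_SIZE)

print(f"gain in fold enrichment: {gain:.2f}")
print(f"p (single term): {p_single:.3g}, p (complex function): {p_complex:.3g}")
```

With the same eight query genes covered, the smaller background set of the complex function gives a smaller P-value for any choice of reference size.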

Figure 2.

ProfCom output table ‘Top enriched categories of degree 3’ for the considered example.

CONCLUSION

Automatic functional profiling has become the de facto approach for the secondary analysis of high-throughput data. A number of tools employing available gene functional annotations have been developed. However, most of these tools are limited by the available annotation vocabularies and may fail to provide a full interpretation of biological relationships in a set of genes involved in complex biological phenomena. Here, we present ProfCom, a web-based tool that implements a new profiling paradigm for the interpretation of functional relations between genes. The ProfCom profiling engine employs three logical operations (‘AND’, ‘OR’, ‘NOT’) to provide complex functions that classify the biological role of a gene group more specifically. As has been demonstrated, in many cases complex functions provide a better understanding of the molecular mechanisms involved in the phenomena under study. On the other hand, in some cases, related GO terms can form many redundant complex functions and may complicate the manual analysis of the ProfCom results. This may be considered a potential disadvantage. One potential way to resolve the redundancy problem in the future is the inclusion of methodologies that group related sets of annotations before the analysis (18,33,34). ProfCom provides technical support to the user that corresponds to the best currently available standards in the field. It has a dialog-driven web page for submission that covers several of the most widely used model organisms. In addition, the web service interface allows one to submit any kind of annotation data and is not limited to a particular organism or problem domain. This property significantly simplifies the procedure of data analysis and widens the spectrum of gene sets that can be analyzed. These features make ProfCom an attractive practical tool for biologists interpreting new experimental data.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

33 in total

1. The InterPro database, an integrated documentation resource for protein families, domains and functional sites.

Authors: R Apweiler; T K Attwood; A Bairoch; A Bateman; E Birney; M Biswas; P Bucher; L Cerutti; F Corpet; M D Croning; R Durbin; L Falquet; W Fleischmann; J Gouzy; H Hermjakob; N Hulo; I Jonassen; D Kahn; A Kanapin; Y Karavidopoulou; R Lopez; B Marx; N J Mulder; T M Oinn; M Pagni; F Servant; C J Sigrist; E M Zdobnov
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Profiling gene expression using onto-express.

Authors: Purvesh Khatri; Sorin Draghici; G Charles Ostermeier; Stephen A Krawetz
Journal: Genomics Date: 2002-02 Impact factor: 5.736

3. Characterizing gene sets with FuncAssociate.

Authors: Gabriel F Berriz; Oliver D King; Barbara Bryant; Chris Sander; Frederick P Roth
Journal: Bioinformatics Date: 2003-12-12 Impact factor: 6.937

4. NetAffx: Affymetrix probesets and annotations.

Authors: Guoying Liu; Ann E Loraine; Ron Shigeta; Melissa Cline; Jill Cheng; Venu Valmeekam; Shaw Sun; David Kulp; Michael A Siani-Rose
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining.

Authors: Marco Masseroli; Dario Martucci; Francesco Pinciroli
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

6. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes.

Authors: Fátima Al-Shahrour; Ramón Díaz-Uriarte; Joaquín Dopazo
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

7. A systems biology approach for pathway level analysis.

Authors: Sorin Draghici; Purvesh Khatri; Adi Laurentiu Tarca; Kashyap Amin; Arina Done; Calin Voichita; Constantin Georgescu; Roberto Romero
Journal: Genome Res Date: 2007-09-04 Impact factor: 9.043

8. MIPS: analysis and annotation of proteins from whole genomes.

Authors: H W Mewes; C Amid; R Arnold; D Frishman; U Güldener; G Mannhaupt; M Münsterkötter; P Pagel; N Strack; V Stümpflen; J Warfsmann; A Ruepp
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

9. GoMiner: a resource for biological interpretation of genomic and proteomic data.

Authors: Barry R Zeeberg; Weimin Feng; Geoffrey Wang; May D Wang; Anthony T Fojo; Margot Sunshine; Sudarshan Narasimhan; David W Kane; William C Reinhold; Samir Lababidi; Kimberly J Bussey; Joseph Riss; J Carl Barrett; John N Weinstein
Journal: Genome Biol Date: 2003-03-25 Impact factor: 13.583

10. PathExpress: a web-based tool to identify relevant pathways in gene expression data.

Authors: Nicolas Goffard; Georg Weiller
Journal: Nucleic Acids Res Date: 2007-06-22 Impact factor: 16.971
