
KOBAS server: a web-based platform for automated annotation and pathway identification.

Jianmin Wu, Xizeng Mao, Tao Cai, Jingchu Luo, Liping Wei.

Abstract

There is an increasing need to automatically annotate a set of genes or proteins (from genome sequencing, DNA microarray analysis or protein 2D gel experiments) using controlled vocabularies and identify the pathways involved, especially the statistically enriched pathways. We have previously demonstrated the KEGG Orthology (KO) as an effective alternative controlled vocabulary and developed a standalone KO-Based Annotation System (KOBAS). Here we report a KOBAS server with a friendly web-based user interface and enhanced functionalities. The server accepts input as nucleotide or amino acid sequences or as sequence identifiers from popular databases, and annotates the input with KO terms and KEGG pathways either by BLAST sequence similarity or by direct ID mapping to genes with known annotations. The server can then identify both frequent and statistically enriched pathways, offering a choice of four statistical tests and the option of multiple testing correction. The server also has a 'User Space' in which frequent users may store and manage their data and results online. We demonstrate the usability of the server by finding statistically enriched pathways in a set of upregulated genes in Alzheimer's Disease (AD) hippocampal cornu ammonis 1 (CA1). The KOBAS server can be accessed at http://kobas.cbi.pku.edu.cn.

Year: 2006 PMID: 16845106 PMCID: PMC1538915 DOI: 10.1093/nar/gkl167

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Automated analysis of large sets of genes and proteins requires that they be annotated with a common controlled vocabulary. Gene Ontology (GO) (1), which comprises over 19 000 terms in molecular function, biological process and cellular component, has been one of the most widely used controlled vocabularies. GO has been used to annotate whole genomes and find enriched functional categories in upregulated or clustered genes in microarray experiments. A variety of web-based tools have been developed for GO-based analysis, including GOtcha (2), GoFigure (3), FatiGO (4), GFINDer (5), GOstat (6), NetAffx (7), GOToolBox (8) and Onto-Tools (9). However, a weakness of GO is that its terms do not correspond directly to pathways. Knowing the pathways involved in a set of genes or proteins, especially the statistically enriched pathways, could offer more biological insights and generate more directly testable hypotheses. Towards this goal, we have previously studied the KEGG Orthology (KO) (10,11), part of the KEGG suite of resources (12), as an alternative controlled vocabulary (13). We demonstrated that KO is effective in automated annotation of sets of sequences based on similarity to sequences with known KO annotations in the KEGG GENES database. Moreover, because KO is directly linked to KEGG pathways, it enables pathway identification. We developed and tested a KO-Based Annotation System (KOBAS) to find both the most frequent and the most significantly enriched pathways in a set of genes or proteins (13). KOBAS can be used to analyze whole genomes and results from DNA microarray and protein 2D gel experiments. For instance, Shi et al. (14) used KOBAS to find that ethylene biosynthesis is the most enriched pathway in a set of cotton fiber-specific genes from the cotton transcriptome profiling data; they then validated the finding by physiological and biochemical experiments. The standalone KOBAS package has been downloaded and used by scientists worldwide.
However, given that KOBAS comprises underlying SQLite relational databases, the R statistics package, Python scripts and other necessary programs, we have received comments citing difficulties in installation and updating. Therefore, to benefit the entire scientific community, we here report a KOBAS web server with a user-friendly interface. In addition to the existing functionalities of standalone KOBAS, the web server has the following new features: (i) it allows the input of not only gene or protein sequences but also sequence identifiers (IDs); (ii) it offers three more statistical tests (binomial, χ² and Fisher's exact tests) in addition to the original hypergeometric distribution test to find significantly enriched pathways; (iii) the underlying databases are updated regularly, a process that is transparent to users; (iv) it has an improved HTML output format; and (v) it allows users to store and manage input data and output results online. In addition to KOBAS, several tools and servers have recently been developed to identify enriched pathways in microarray data; these tools include ArrayXPath (15), PathwayExplorer (16), VAMPIRE (17) and Pathway-Express (9). However, a common limitation of all these other tools and servers is that they can only take IDs as input to map directly to pathway databases. This greatly limits their usefulness because many important organisms have not been annotated or are poorly annotated in pathway databases (e.g. cotton). Direct ID mapping would fail for sequences in these organisms as well as for any newly sequenced genome, EST or cDNA sequences. Even for well-annotated genomes, usually not all transcripts of the same gene are annotated in the pathway databases. During preparation of this manuscript, a KEGG Automatic Annotation Server (KAAS) came online that can also annotate a set of sequences with KO terms. However, it does not have any other functionalities, such as the important statistical testing of significance of the identified pathways or input by ID. Assigning statistical significance to the pathways is one of the critical features of KOBAS and has been shown to lead to validated hypotheses. To date, KOBAS is the only server that has integrated all the aforementioned functionalities.

ANALYSIS TOOLS

The KOBAS server divides the analysis into two steps to provide more flexibility. The first step annotates a set of genes or proteins (as IDs or sequences) with KO terms. The second step identifies the frequent or statistically significant pathways. Users can also manage their data and results online.

KO annotation

If a user inputs a list of IDs from popular sequence databases, KOBAS will map these IDs directly to genes with known KO annotations using the cross-links we parsed from the KEGG GENES database. If there is a match, the KO terms of the KEGG gene are assigned to the query gene or protein. The list of acceptable sequence databases for each organism is available on the web site in the FAQ section of Documentation. For example, for human, GI numbers of GenBank and IDs of the NCBI Gene database, UniProt, GDB and OMIM are acceptable; for Saccharomyces cerevisiae, GI numbers of GenBank and IDs of the NCBI Gene database, UniProt, SGD and MIPS are acceptable. If there is a match, the output contains four columns: query sequence identifier, KO term, KO term definition and the KEGG gene to which the query is mapped. If a user inputs a set of nucleotide or amino acid sequences in FASTA format (by uploading a file or pasting directly into the input window), KOBAS assigns the KO terms for each sequence based on sequence similarity with entries in KEGG GENES using BLAST (18). We chose BLAST E-value ≤ 10⁻⁵ and rank ≤ 5 as the default cutoffs, meaning that a new sequence is assigned the KO term(s) of the first BLAST hit that (i) has BLAST E-value ≤ 10⁻⁵, (ii) has known KO assignments and (iii) has fewer than five other hits with a lower E-value that do not have KO assignments (13). Users have the option to adjust the cutoffs to increase sensitivity or specificity. A lower E-value or rank cutoff returns more reliable mapping results but may leave more sequences unannotated, whereas a higher E-value or rank cutoff annotates more sequences but may introduce some false positives. The choice of the cutoff criteria and the tradeoff between sensitivity and specificity was previously studied (13). Figure 1 shows the output of KO annotation when the input is a set of FASTA sequences. Each row corresponds to a query DNA or protein and lists the sequence identifier extracted from the input, the assigned KO term (hyperlinked to detailed description in KEGG), definition of the KO term, the rank, E-value, score and percent identity of the BLAST hit, and the gene ID of the hit in the KEGG GENES database. If one sequence is annotated with multiple KO terms, each KO annotation is presented in a separate row.
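The default cutoff rule described above can be illustrated with a minimal Python sketch. It is not taken from the KOBAS source code: the `BlastHit` record and the `assign_ko` function are hypothetical names, the BLAST search is assumed to have been run already, and ties in E-value are simply kept in input order.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class BlastHit(NamedTuple):
    """One BLAST hit against KEGG GENES (hypothetical record layout)."""
    gene_id: str
    evalue: float
    ko_terms: Tuple[str, ...]  # empty tuple if the hit has no KO assignment


def assign_ko(
    hits: Sequence[BlastHit],
    evalue_cutoff: float = 1e-5,
    rank_cutoff: int = 5,
) -> Optional[Tuple[int, BlastHit]]:
    """Return (rank, hit) for the first hit that carries KO terms, or None.

    Hits are ranked by increasing E-value. The search stops as soon as the
    rank or the E-value exceeds its cutoff, because every later hit would
    exceed it as well.
    """
    ranked = sorted(hits, key=lambda h: h.evalue)
    for rank, hit in enumerate(ranked, start=1):
        if rank > rank_cutoff or hit.evalue > evalue_cutoff:
            return None
        if hit.ko_terms:
            return rank, hit
    return None
```

With the defaults, a query whose best hit lacks a KO assignment but whose second hit has one is annotated at rank 2. Setting `rank_cutoff=1` would leave the same query unannotated, which is the stricter, more specific end of the tradeoff.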

Figure 1

Screenshot of the output of KO annotation when the input is FASTA sequences. Twenty-one of the 36 upregulated genes in AD CA1 were assigned KO terms based on sequence similarity. Each row corresponds to a query DNA or protein input by the user. The first column contains sequence identifiers extracted from the input. The second column contains the assigned KO terms hyperlinked to detailed descriptions in KEGG. The third column contains KO term definitions. The fourth to seventh columns show the rank, E-value, score and identity of the BLAST hit. The last column contains the gene ID of the hit hyperlinked to the KEGG GENES database. Users can choose to view the results in HTML or text format, edit the text format online and download results to local disks. Users can also select the program for further analysis using the annotation results as input directly.

Because BLAST is computationally intensive, it may not be possible to return results to users immediately if the input is large. The KOBAS server displays a URL if the job cannot be finished within one minute so that the user can access the results later. Alternatively, if the user supplies an e-mail address, the results will be emailed automatically upon completion of the job.

Pathway identification

After a set of genes or proteins is annotated with KO, the user can choose to identify the frequently occurring or the statistically enriched pathways in the set. The input is the output of the previous step, ‘KO Annotation’. Since the third level in the KO hierarchy corresponds to KEGG pathways, we can trace the KO terms of a gene back through the KO hierarchy to its associated pathways. The frequently occurring pathways can be easily identified by tallying the number of genes or proteins associated with each pathway and ranking the numbers. The output lists the name of each of the pathways and the number and percentage of the query genes or proteins that are involved in each pathway. However, as some pathways are naturally large and would involve more genes or proteins in any set just by chance, it is important to identify the statistically significantly enriched pathways compared with a background distribution. The user can use a whole genome as the background distribution by selecting from the list of genomes annotated with KO, or they can use any set of genes or proteins (e.g. the entire probe set on a microarray) annotated with KO as background by uploading or pasting the annotations. Next, the user may choose from four statistical tests (binomial, χ², Fisher's exact and hypergeometric distribution tests) and decide whether to perform multiple testing correction using FDR. The output shows the statistically enriched pathways, listing the pathway name, the number and percentage of the query genes or proteins that are involved in each pathway, the number and percentage of the background genes or proteins that are involved in each pathway, the right-tailed p-value and the FDR-corrected q-value (if applicable) (Figure 2). The pathways are sorted by increasing p-value, or q-value (if applicable), from most significant to least. Each pathway name is linked to a page of detailed information including all the genes/proteins involved and hyperlinks to the KEGG pathway maps, with the relevant KO terms highlighted.
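As an illustration of the enrichment step, the following Python sketch computes a right-tailed hypergeometric p-value for one pathway and then converts a list of p-values to q-values. It is a sketch under stated assumptions, not the server's implementation: the function names are hypothetical, and the Benjamini-Hochberg procedure is assumed for the FDR correction because the text does not name the specific method.

```python
from math import comb
from typing import List, Sequence


def hypergeom_right_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for a hypergeometric variable.

    N: background genes annotated with KO
    K: background genes in the pathway
    n: query genes annotated with KO (assumed to be a subset of the background)
    k: query genes in the pathway
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return q-values in the same order as the input p-values.

    Assumption: Benjamini-Hochberg step-up adjustment; the FDR method
    actually used by the server is not specified in the text.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues
```

For a toy background of 10 genes with 4 in a pathway, a query of 3 genes with 2 in the pathway gives `hypergeom_right_tail(2, 3, 4, 10)`, which equals (36 + 4) / 120 = 1/3. Sorting pathways by the resulting q-values reproduces the ordering described for the output table.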

Figure 2

Screenshot of the list of statistically enriched pathways identified in the upregulated genes in AD CA1, sorted by increasing q-value. The first column shows the name of the pathway. The second column lists the number and percentage of input genes or proteins involved in the pathway (top) and the number and percentage of background genes or proteins involved in the pathway (bottom). The third and fourth columns list the p-value and q-value of the statistical significance, respectively.

Online data management

For the convenience of frequent users of KOBAS, the server provides online data management functionalities. Registration is free and open to all. A registered user can save input files and output files in a private ‘User Space’ on the server. The User Space supports a tree-like structure to organize directories and files as shown in Figure 3. The saved input files and intermediate output files can be easily selected for repeated analysis using different parameters. All user input and output are strictly confidential. For guest users, the files are kept on the server for 7 days, whereas for registered users, the files are kept for 6 months.

Figure 3

Screenshot of User Space. Users can organize their data and results in a tree-like structure. Users can upload files from their local disk to the KOBAS server and use them later as input. The output of any analysis will be automatically stored in the User Space for further analysis.

The user can view the analysis history and monitor the job status online, including information on the program, input and output, start time and elapsed time and status of each job. A submitted job has five possible statuses: submitted, running, in queue, finished and failed. A job is put in queue if 10 other jobs are already running on the server. When a job finishes, the output files are automatically saved in the User Space. If a job fails because of invalid input or other reasons, the detailed error message is logged.
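The job life cycle described above can be sketched as a small state machine. This is only an illustration in Python (the server itself is written in Java); the `JobScheduler` class is hypothetical, and promoting waiting jobs in first-in, first-out order is an assumption that the text does not state.

```python
from collections import deque
from enum import Enum


class JobStatus(Enum):
    SUBMITTED = "submitted"
    IN_QUEUE = "in queue"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"


class JobScheduler:
    """Admit at most `max_running` jobs at once; queue the rest."""

    def __init__(self, max_running: int = 10) -> None:
        self.max_running = max_running
        self.running = set()
        self.waiting = deque()
        self.status = {}

    def submit(self, job_id: str) -> JobStatus:
        self.status[job_id] = JobStatus.SUBMITTED
        if len(self.running) < self.max_running:
            self._start(job_id)
        else:
            self.waiting.append(job_id)
            self.status[job_id] = JobStatus.IN_QUEUE
        return self.status[job_id]

    def _start(self, job_id: str) -> None:
        self.running.add(job_id)
        self.status[job_id] = JobStatus.RUNNING

    def complete(self, job_id: str, ok: bool = True) -> None:
        """Mark a running job as finished or failed, then promote one waiting job."""
        self.running.remove(job_id)
        self.status[job_id] = JobStatus.FINISHED if ok else JobStatus.FAILED
        if self.waiting:  # assumption: first in, first out
            self._start(self.waiting.popleft())
```

With the default `max_running=10`, the eleventh submitted job is placed in the queue and starts only when one of the running jobs finishes or fails.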

IMPLEMENTATION

The KOBAS web server was developed using the platform-independent Java language. Apache Tomcat was used as a container for Java Servlet and JSP. User account information, uploaded input files and analysis results are stored in a MySQL database. The KOBAS server runs on a Linux machine (four 2.20 GHz Intel Xeon processors and 8 GB RAM). For very large jobs, the user can download the standalone version of KOBAS to run locally. The server web site includes a step-by-step tutorial (with screenshots) for general users as well as detailed technical documentation and an online browser of KOBAS source code for software developers.

EXAMPLE APPLICATION

One of our other research projects involves the genes and pathways involved in Alzheimer's Disease (AD). Colangelo et al. (19) analyzed the expression profiles of 12 633 genes in the AD hippocampal cornu ammonis 1 (CA1) versus healthy controls using DNA microarrays. Using their dataset, we identified 36 genes as upregulated in AD CA1 with standard criteria (P ≤ 0.05, 2-fold change). We then used ‘KO Annotation’ on the KOBAS server to annotate 21 of the 36 upregulated genes with KO terms using default cutoffs and ‘Pathway Identification’ to find statistically enriched pathways using the χ² test (Figures 1 and 2). A literature review showed that the top five pathways all have been associated with AD; these include apoptosis (caspase activation, a key step in apoptosis, leads to the proteolytic cleavage of tau) (20), mitogen-activated protein kinase (MAPK) signaling (implicated in the hyperphosphorylation of tau, a major component of the neurofibrillary tangles) (21), Toll-like receptor signaling (activating signal transduction pathways that stimulate immune function) (22), cytokines (promoting and sustaining inflammatory responses, a central feature of AD) (23) and cytokine–cytokine receptor interactions (associated with MAPK expression) (21).

DISCUSSION

Although there are several online servers for pathway analysis, KOBAS provides the most comprehensive set of functionalities, including input by both IDs and sequences, finding both frequent and statistically enriched pathways, four choices of statistical tests, online management of data and analysis, both web-based and standalone versions of the program, and both a step-by-step tutorial for novice users and detailed technical documentation for bioinformaticians. The power of KOBAS is limited by the number of input genes or proteins that can be assigned KO terms, which in turn is limited by the number of genes and proteins that have known KO annotations. Our previous experience indicates that typically 30–50% of gene or protein sequences in a newly sequenced genome can be assigned KO terms by BLAST similarity. This percentage is slightly lower than that for GO-based annotations, as more genes and proteins have known GO annotations than have KO annotations. However, this gap will decrease as more KO annotations become available. The implementation of four statistical tests offers more flexibility to suit different analysis needs. The hypergeometric test requires that input annotations be a subset of the background annotations. For the χ² test, when the statistic becomes unreliable (expected frequencies <5), KOBAS will automatically switch to Fisher's exact test. The binomial test is faster when the number of sequences is large. Currently the KOBAS server allows 10 jobs to run concurrently and puts the other jobs in queue. KO annotation by sequence similarity is limited to 500 sequences per job. There is no limit for input by IDs or pathway identification. We are currently developing a distributed computing version of KOBAS on a cluster, which will enable the server to handle more jobs at a higher computational rate.
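The switch from the χ² test to Fisher's exact test can be illustrated with a short Python sketch that computes the expected frequencies of a 2x2 table. The function names and the table layout are hypothetical, and treating the χ² test as unreliable when any single expected cell falls below 5 is an assumption about how the rule is applied.

```python
from typing import List, Sequence


def expected_counts(table: Sequence[Sequence[float]]) -> List[List[float]]:
    """Expected cell counts under independence for a 2x2 contingency table.

    Assumed layout: rows are (query, background) and columns are
    (in pathway, not in pathway).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(table: Sequence[Sequence[float]]) -> float:
    """Pearson's chi-square statistic, without continuity correction."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


def choose_test(table: Sequence[Sequence[float]], min_expected: float = 5.0) -> str:
    """Return 'fisher' if any expected count is below the threshold, else 'chi-square'."""
    smallest = min(min(row) for row in expected_counts(table))
    return "fisher" if smallest < min_expected else "chi-square"
```

For a small query set against a genome-sized background, the expected number of query genes in any one pathway is usually well below 5, so under this reading of the rule most pathways in such an analysis would be tested with Fisher's exact test.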

23 in total

1. GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining.

Authors: Marco Masseroli; Dario Martucci; Francesco Pinciroli
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

2. ArrayXPath: mapping and visualizing microarray gene-expression data with integrated biological pathway resources using Scalable Vector Graphics.

Authors: Hee-Joon Chung; Mingoo Kim; Chan Hee Park; Jihoon Kim; Ju Han Kim
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

3. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary.

Authors: Xizeng Mao; Tao Cai; John G Olyarchuk; Liping Wei
Journal: Bioinformatics Date: 2005-04-07 Impact factor: 6.937

4. A database for post-genome analysis.

Authors: M Kanehisa
Journal: Trends Genet Date: 1997-09 Impact factor: 11.639

Review 5. Mechanisms of cell signaling and inflammation in Alzheimer's disease.

Authors: Gilbert J Ho; Roulla Drego; Edwin Hakimian; Eliezer Masliah
Journal: Curr Drug Targets Inflamm Allergy Date: 2005-04

Review 6. The role of caspase cleavage of tau in Alzheimer disease neuropathology.

Authors: Carl W Cotman; Wayne W Poon; Robert A Rissman; Mathew Blurton-Jones
Journal: J Neuropathol Exp Neurol Date: 2005-02 Impact factor: 3.685

7. VAMPIRE microarray suite: a web-based platform for the interpretation of gene expression data.

Authors: Albert Hsiao; Trey Ideker; Jerrold M Olefsky; Shankar Subramaniam
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

8. PathwayExplorer: web service for visualizing high-throughput expression data on biological pathways.

Authors: Bernhard Mlecnik; Marcel Scheideler; Hubert Hackl; Jürgen Hartler; Fatima Sanchez-Cabo; Zlatko Trajanoski
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

9. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes.

Authors: David M A Martin; Matthew Berriman; Geoffrey J Barton
Journal: BMC Bioinformatics Date: 2004-11-18 Impact factor: 3.169

10. GOToolBox: functional analysis of gene datasets based on Gene Ontology.

Authors: David Martin; Christine Brun; Elisabeth Remy; Pierre Mouren; Denis Thieffry; Bernard Jacq
Journal: Genome Biol Date: 2004-11-26 Impact factor: 13.583
