
KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases.

Chen Xie, Xizeng Mao, Jiaju Huang, Yang Ding, Jianmin Wu, Shan Dong, Lei Kong, Ge Gao, Chuan-Yun Li, Liping Wei.

Abstract

High-throughput experimental technologies often identify dozens to hundreds of genes related to, or changed in, a biological or pathological process. From these genes one wants to identify biological pathways that may be involved and diseases that may be implicated. Here, we report a web server, KOBAS 2.0, which annotates an input set of genes with putative pathways and disease relationships based on mapping to genes with known annotations. It allows for both ID mapping and cross-species sequence similarity mapping. It then performs statistical tests to identify statistically significantly enriched pathways and diseases. KOBAS 2.0 incorporates knowledge across 1327 species from 5 pathway databases (KEGG PATHWAY, PID, BioCyc, Reactome and Panther) and 5 human disease databases (OMIM, KEGG DISEASE, FunDO, GAD and NHGRI GWAS Catalog). KOBAS 2.0 can be accessed at http://kobas.cbi.pku.edu.cn.

Year: 2011 PMID: 21715386 PMCID: PMC3125809 DOI: 10.1093/nar/gkr483

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

High-throughput experimental technologies such as next generation sequencing, microarray profiling and proteomics profiling are widely used in current biological research and often identify dozens to hundreds of genes related to a biological or pathological process. Given such a set of genes, one wants to ask which metabolic and signaling pathways may be involved and which diseases may be implicated. As the number of genes is often large, it is desirable to have a computational tool to provide initial answers to these questions. However, ab initio prediction of pathways and diseases is challenging. One feasible approach is to use existing databases of known metabolic and signaling pathways and databases of known disease-associated genes as the starting point for annotation of a new set of genes. We have previously reported a standalone software and a web server KOBAS 1.0 (1,2) that annotates an input set of genes or proteins by mapping to genes with known pathways in the KEGG PATHWAY database (3). KOBAS 1.0 was the first software to identify statistically significantly enriched pathways using a hypergeometric test. It has been successfully used in pathway analysis in plants, animals and bacteria [for instance, (4–6)]. During the past decade, many other functional enrichment analysis tools have become available. Most of them focus on identification of enriched functional categories based on Gene Ontology (GO) (7), such as FuncAssociate (8), Ontologizer (9), BiNGO (10), FatiGO (11), GOToolBox (11) and GFinder (12). Although tremendously useful, functional categories are not as informative and intuitive as metabolic and signaling pathways and human diseases. A growing number of tools have been developed for pathway and disease identification, including, but not limited to, MAPPFinder (13), EASE (14), DAVID (15,16), ArrayXPath (17), WebGestalt (18), FuncCluster (19), PageMan (20), GENECODIS (21,22), GeneTrail (23), g:Profiler (24), FunNet (25) and PaLS (26). 
Except for DAVID, all of these tools integrate only a limited number of pathway and disease databases (for a comparison, see Supplementary Table S1). Furthermore, none of these tools supports sequence similarity mapping, an important feature that allows the user to take advantage of data from other species. It is therefore necessary to develop a web server that incorporates comprehensive pathway and disease databases and supports both ID mapping and sequence similarity mapping. Here, we report a significantly expanded new version, KOBAS 2.0, which incorporates 5 pathway databases [KEGG PATHWAY, PID (27), BioCyc (28), Reactome (29,30) and Panther (31)] and 5 human disease databases [OMIM (http://www.ncbi.nlm.nih.gov/omim/), KEGG DISEASE (32), FunDO (33,34), GAD (35) and NHGRI GWAS Catalog (NHGRI) (36)]. As in version 1.0, KOBAS 2.0 supports not only ID mapping but also sequence similarity mapping. KOBAS 2.0 consists of a standalone command-line program written in Python, which runs on most Linux systems, and a user-friendly web server developed in Java. Both the command-line program and the web server are freely available at http://kobas.cbi.pku.edu.cn. The KOBAS 2.0 workflow is summarized in Figure 1 and detailed below.

Figure 1.

KOBAS 2.0 workflow. The types of input can be ID, FASTA sequence, or tabular BLAST output. KOBAS 2.0 has two programs ‘annotate’ and ‘identify’. The first program annotates input genes with pathways and diseases by ID mapping or sequence similarity mapping. The second program identifies statistically significantly enriched pathways and diseases.

MATERIALS AND METHODS

KOBAS 2.0 parses 10 pathway and disease databases and stores the data in a SQL relational database

Table 1 summarizes information about the pathway and disease databases that KOBAS 2.0 incorporates. Specifically, KEGG PATHWAY (3) and Reactome (29,30) are general pathway databases, whereas PID (27) and Panther (31) focus on signaling pathways and BioCyc (28) focuses on metabolic pathways. PID has only human data, whereas the others are multispecies databases. OMIM (http://www.ncbi.nlm.nih.gov/omim/) contains information on all known Mendelian disorders and genes. KEGG DISEASE (32) collects knowledge on genetic and environmental factors of diseases. FunDO (33,34) is generated from GeneRIF using Disease Ontology Lite, a condensed version of Disease Ontology. GAD (35) and NHGRI GWAS Catalog (36) both collect data from genetic association studies: GAD includes data from both candidate gene studies and GWAS, whereas NHGRI GWAS Catalog is a catalog of GWAS only.

Table 1.

Pathway and disease databases supported by KOBAS 2.0

Database name	Data content	File format	Number of species	Number of pathways or diseases in human	Number of genes mapped to KEGG GENES/all genes in human	URL
KEGG PATHWAY	Pathway	Text	1327	220	5595/5595	http://www.genome.jp/kegg/pathway.html
PID Curated	Pathway	XML	1	192	2782/3315	http://pid.nci.nih.gov/
PID BioCarta	Pathway	XML	1	254	1907/2391	http://pid.nci.nih.gov/
PID Reactome	Pathway	XML	1	996	3783/4405	http://pid.nci.nih.gov/
BioCyc	Pathway	Text and Table	6	277	1087/1120	http://biocyc.org/
Reactome	Pathway	Table	22	68	4366/4534	http://www.reactome.org/ReactomeGWT/entrypoint.html
Panther	Pathway	Table	43	154	2170/2207	http://www.pantherdb.org/
OMIM	Disease	Table	1	4990	3792/3792	http://www.ncbi.nlm.nih.gov/omim
KEGG DISEASE	Disease	Text	1	323	798/798	http://www.genome.jp/kegg/disease/
FunDO	Disease	Table	1	561	3888/4029	http://django.nubic.northwestern.edu/fundo/
GAD	Disease	Table	1	3770	3164/3238	http://geneticassociationdb.nih.gov/
NHGRI	Disease	Table	1	369	1975/2191	http://www.genome.gov/gwastudies/

The numbers in this table are summarized from the KOBAS 2.0 backend database as updated on November 23rd, 2010. All analyses using KOBAS 2.0 in this article are based on this data version.

KOBAS 2.0 downloads the raw data files from each database. As shown in Table 1, the file formats include plain text, XML and table. We have written parsers for all the data files. For each pathway or disease database, we retrieve the gene-term mapping by parsing the raw data files. We retrieve the gene annotation and gene-ID relations from KEGG GENES and BioMart (37). To integrate across different databases, we mapped the genes in all databases to KEGG GENES and KEGG ORTHOLOGY (KO). The gene-pathway and gene-disease data are stored in our backend SQL relational database. The FASTA protein sequence files were preprocessed for BLAST. The KOBAS 2.0 backend data are updated every 3 months.
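The gene-term storage described above could be laid out roughly as follows. This is only a minimal sketch using Python's built-in sqlite3 module: the table and column names (gene, term, gene_term) and the helper functions are hypothetical and are not taken from the actual KOBAS 2.0 backend, and the single example row is included purely for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical gene-term layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (
            kegg_gene_id TEXT PRIMARY KEY,   -- e.g. 'hsa:7157'
            ko_id        TEXT                -- KO term, NULL if none assigned
        );
        CREATE TABLE term (
            term_id   TEXT,
            source_db TEXT,                  -- e.g. 'KEGG PATHWAY', 'OMIM'
            kind      TEXT CHECK (kind IN ('pathway', 'disease')),
            name      TEXT,
            PRIMARY KEY (term_id, source_db)
        );
        CREATE TABLE gene_term (
            kegg_gene_id TEXT REFERENCES gene(kegg_gene_id),
            term_id      TEXT,
            source_db    TEXT,
            PRIMARY KEY (kegg_gene_id, term_id, source_db)
        );
        """
    )
    # One illustrative row per table (KO left unset in this sketch).
    conn.execute("INSERT INTO gene VALUES (?, ?)", ("hsa:7157", None))
    conn.execute(
        "INSERT INTO term VALUES (?, ?, ?, ?)",
        ("hsa04115", "KEGG PATHWAY", "pathway", "p53 signaling pathway"),
    )
    conn.execute(
        "INSERT INTO gene_term VALUES (?, ?, ?)",
        ("hsa:7157", "hsa04115", "KEGG PATHWAY"),
    )
    conn.commit()
    return conn


def terms_for_gene(conn: sqlite3.Connection, kegg_gene_id: str) -> list:
    """Return (source_db, term_id, name) rows linked to one KEGG GENES ID."""
    rows = conn.execute(
        """
        SELECT t.source_db, t.term_id, t.name
        FROM gene_term AS gt
        JOIN term AS t
          ON t.term_id = gt.term_id AND t.source_db = gt.source_db
        WHERE gt.kegg_gene_id = ?
        ORDER BY t.source_db, t.term_id
        """,
        (kegg_gene_id,),
    )
    return rows.fetchall()
```

Keying terms on (term_id, source_db) is one way to keep identifiers from different source databases from colliding; whether KOBAS 2.0 does this is not stated in the text.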

KOBAS 2.0 annotates input genes with pathways and diseases and identifies enriched pathways and diseases

KOBAS 2.0 has two consecutive programs, ‘annotate’ and ‘identify’, as in KOBAS 1.0 (1,2). The first program, ‘annotate’, annotates each input gene with putative pathways and diseases by mapping the gene to genes in KEGG GENES or terms in KO, which are linked to pathway and disease terms in the backend databases. For ID mapping, input IDs are mapped directly to genes using the cross-links we parsed from KEGG GENES. Then, if necessary, IDs are mapped to KO terms. For sequence similarity mapping, each input sequence is BLASTed against all sequences in KEGG GENES. The default cutoffs are BLAST E-value <10^−5 and rank ≤5. That is, an input sequence is assigned the KO term(s) of the first BLAST hit that (i) has known KO assignments; (ii) has BLAST E-value <10^−5; and (iii) has fewer than five other hits with a lower E-value that do not have KO assignments (1). A new option in KOBAS 2.0 lets users map against genes in a user-specified species instead of all genes, by BLASTing against only the sequences of that species. To reduce possible false positives due to multidomain proteins, we added a new option that allows users to set a cutoff on BLAST subject coverage. Another new option allows users to restrict sequence mapping to orthologs as defined by Ensembl Compara (38). The second program, ‘identify’, identifies statistically significantly enriched pathways and diseases by comparing the results from the first program against the background (usually the genes from the whole genome, or all probe sets on a microarray). Users can define their own background distribution in KOBAS 2.0 (for example, the result of running ‘annotate’ on all probe sets on a microarray). If users do not upload a background file, KOBAS 2.0 uses the genes from the whole genome as the default background distribution. We consider only pathways and diseases to which at least two input genes are mapped.
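The default E-value and rank cutoffs for sequence similarity mapping can be read as a simple selection rule over the BLAST hits of one query. The sketch below is one possible reading of that rule, not the KOBAS 2.0 source code: the function name assign_ko, the (subject, evalue) hit tuples and the ko_of lookup dictionary are all assumptions made for illustration, and ties in E-value are simply kept in input order.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Hit = Tuple[str, float]  # (subject gene ID, BLAST E-value)


def assign_ko(
    hits: Iterable[Hit],
    ko_of: Dict[str, List[str]],
    evalue_cutoff: float = 1e-5,
    rank_cutoff: int = 5,
) -> Optional[Tuple[str, List[str]]]:
    """Pick the first hit with KO terms that passes both cutoffs.

    Hits are ranked by increasing E-value. Scanning stops as soon as a hit
    fails the E-value cutoff (all later hits are worse) or the rank exceeds
    rank_cutoff (five hits without KO terms have already been passed).
    """
    ranked = sorted(hits, key=lambda hit: hit[1])
    for rank, (subject, evalue) in enumerate(ranked, start=1):
        if rank > rank_cutoff or evalue >= evalue_cutoff:
            return None
        kos = ko_of.get(subject)
        if kos:  # first hit with known KO assignments
            return subject, kos
    return None
```

Because the scan returns at the first hit that has KO terms, every hit ranked above it lacks KO terms, so "rank ≤5" and "fewer than five better hits without KO assignments" describe the same condition in this sketch.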
Users can choose one of four statistical tests (binomial test, chi-square test, Fisher's exact test or hypergeometric test) and can then apply FDR correction. The purpose of FDR correction is to reduce Type-1 errors. When a large number of pathway and disease terms are considered, multiple hypothesis tests are performed, which leads to a high overall Type-1 error rate even for a relatively stringent P-value cutoff. KOBAS 1.0 supports the FDR correction method QVALUE (39). In KOBAS 2.0, we add two more popular FDR correction methods: Benjamini-Hochberg (40) and Benjamini-Yekutieli (41).
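To make the testing step concrete, the following sketch computes an upper-tail hypergeometric P-value for each term and applies the Benjamini-Hochberg adjustment, using only the Python standard library. It is an illustration under stated assumptions rather than the KOBAS 2.0 implementation: the function names, the argument names (m, n, M, N) and the term_counts dictionary are hypothetical, and the minimum of two input genes per term is taken from the description above.

```python
from math import comb
from typing import Dict, List, Tuple


def hypergeom_pvalue(m: int, n: int, M: int, N: int) -> float:
    """P(X >= m) for a hypergeometric draw.

    m: input genes mapped to the term
    n: input genes mapped to any term
    M: background genes mapped to the term
    N: background genes mapped to any term
    """
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, min(n, M) + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted P-values in the same order as the input."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    adjusted = [0.0] * k
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(k, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * k / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(
    term_counts: Dict[str, Tuple[int, int]],
    n: int,
    N: int,
    min_input_genes: int = 2,
) -> List[Tuple[str, float, float]]:
    """Test each term given (m, M) counts; sort by corrected P-value."""
    kept = [(t, mM) for t, mM in term_counts.items() if mM[0] >= min_input_genes]
    pvals = [hypergeom_pvalue(m, n, M, N) for _, (m, M) in kept]
    qvals = benjamini_hochberg(pvals)
    rows = [(t, p, q) for (t, _), p, q in zip(kept, pvals, qvals)]
    return sorted(rows, key=lambda row: row[2])
```

The exact integer arithmetic in math.comb avoids rounding problems for small counts; for genome-scale backgrounds a library routine such as a survival function from a statistics package would likely be faster.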

INPUT AND OUTPUT

Input

The input to ‘annotate’ can be a list of IDs, a FASTA sequence file or a tabular BLAST output. KOBAS 2.0 currently accepts three kinds of IDs: Entrez Gene ID, UniProtKB AC and GI. FASTA sequences can be protein or nucleotide sequences. Because BLAST is computationally intensive, the number of sequences that can be run on the online web server is limited to 500 per run. A new feature in KOBAS 2.0 is that users who want to annotate more sequences online can run BLAST locally and upload the tabular BLAST output as the input to KOBAS 2.0. Alternatively, they can run the standalone version of KOBAS 2.0, which has no limit. Users who only want the pathway and disease annotations of their genes need to run only ‘annotate’. Users who want to find enriched pathways and diseases can feed the output of ‘annotate’ directly into ‘identify’ as input.
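For users who run BLAST locally, the tabular output can be inspected before upload with a few lines of Python. The sketch below assumes the standard 12-column BLAST tabular layout (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the text does not state which columns KOBAS 2.0 itself reads, and the BlastHit type and parse_blast_tabular function are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def parse_blast_tabular(lines: Iterable[str]) -> Dict[str, List[BlastHit]]:
    """Group tabular BLAST hits by query ID, preserving file order."""
    hits: Dict[str, List[BlastHit]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError("expected 12 tab-separated columns: %r" % line)
        hit = BlastHit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        hits[hit.query].append(hit)
    return dict(hits)
```

A typical use would be `with open(path) as handle: hits = parse_blast_tabular(handle)`, followed by a count of distinct queries to check the result against the 500-sequence limit of the web server.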

Output

An example of the output of ‘annotate’ is shown in Figure 2. Each row corresponds to one input gene. The first column contains the input gene IDs. The second and third columns contain the mapped KEGG GENES IDs, hyperlinked to detailed descriptions in KEGG, and the mapped KEGG GENES names. A user can click on ‘details’ next to the input gene ID to see details about the query and related pathways and diseases.

Figure 2.

Screenshot of the output of ‘annotate’. 371 upregulated probe sets in CA are assigned to KEGG human genes by sequence similarity mapping. Users can view the result in table format (by default) or raw format (which can be downloaded to local disks). Users can also directly use the result as the input of ‘identify’ to do further analysis.

An example of the output of ‘identify’ is shown in Figure 3. KOBAS 2.0 separates the results for pathways and diseases into two tables. In the pathway identification result, the first three columns show the pathway name, pathway database and pathway ID, hyperlinked to the detailed description in the corresponding database. The fourth column lists two numbers for the input: the first is the number of input genes mapped to the particular pathway and the second is the total number of input genes mapped to any pathway in the pathway database. Users can click on the first number in the fourth column to see the list of input genes mapped to the particular pathway. The fifth column lists two numbers for the background: the first is the number of background genes mapped to the particular pathway and the second is the total number of background genes mapped to any pathway in the pathway database. The last two columns list the P-value and corrected P-value of the statistical test. In the disease identification result, the seven columns show the disease name, disease database, disease ID, numbers for the input, numbers for the background, P-value and corrected P-value, in the same way as in the pathway identification result. KOBAS 2.0 merges redundant pathway and disease terms from different databases.

Figure 3.

Screenshot of the output of ‘identify’. Statistically significantly enriched pathways and diseases identified for the 371 upregulated probe sets in CA are sorted by increasing corrected P-value. Only those with corrected P ≤ 0.05 are shown. As with the output of ‘annotate’, users can view the result in table format (by default) or raw format (which can be downloaded to local disks).

BENEFIT OF CROSS-SPECIES SEQUENCE SIMILARITY MAPPING OVER ID MAPPING

Other existing pathway analysis tools accept only gene IDs as input and use only ID mapping to annotate their pathways. A benefit of KOBAS 2.0 is that it can use sequence similarity mapping to annotate input genes from species that are not yet well-represented in existing pathway databases. It can also map the genes from other species to human diseases to predict whether these genes may be good candidates to study any human diseases, an important question in the model organism research. To illustrate, we analyzed the microarray expression profiles in rhesus monkeys in two major hippocampal subdivisions critical for memory/cognitive function: cornu ammonis (CA) and dentate gyrus (DG) using data from Blalock et al. (42). We reanalyzed their raw data on six samples from CA and six samples from DG of young rhesus monkeys and identified 371 upregulated probe sets in CA using standard protocol [gcrma and limma through R and Bioconductor (43)]. We then used both DAVID (15,16) and KOBAS 2.0 to annotate these probe sets and identify enriched pathways and diseases by using the entire probe sets on the chip as background. DAVID can perform only ID mapping to rhesus genes in its two pathway databases (KEGG PATHWAY and Panther) and as a result, identified no statistically significantly enriched pathways or diseases (with default options and corrected P ≤ 0.05). On the other hand, KOBAS 2.0 supports sequence similarity mapping by BLAST to annotate the rhesus gene set and can thus take full advantage of the abundant data on human pathways and diseases. We used ‘annotate’ to map sequences of upregulated probe sets in CA as well as the entire probe sets to KEGG human genes with default cutoffs and then used ‘identify’ to perform hypergeometric test and Benjamini-Hochberg FDR correction to find significantly enriched pathways and diseases by using the two results of ‘annotate’ as input and background, respectively. 
Figure 3 shows significantly enriched pathways and diseases identified by KOBAS 2.0. The results are consistent with known functional differences between the two regions. For example, the ‘respiratory electron transport, ATP synthesis by chemiosmotic coupling and heat production by uncoupling proteins’ pathway and the ‘glutaricaciduria, type IIB’ and ‘Glutaric academia’ diseases are consistent with the earlier finding that the CA region showed greater expression than DG for genes associated with mitochondrial activity (42), while ‘no2-dependent il-12 pathway in nk cells’, ‘il12 and stat4 dependent signaling pathway in th1 development’ and ‘autoimmune disease’ are consistent with the earlier finding that the CA region showed greater expression than DG for genes associated with inflammatory responses (42). We also compared KOBAS 2.0 with popular GO enrichment analysis tools, FuncAssociate 2.0 (8), Ontologizer 2.0 (9), BiNGO (10) and EASE (14), using the same data set. Because these other tools can take only IDs as input, we first mapped the rhesus probe sets to human genes using sequence similarity. We then ran the four GO enrichment analysis tools, the results of which are shown in Supplementary Table S2. The list of enriched pathways identified by KOBAS 2.0 is more specific and informative than the lists of functional categories identified by the GO enrichment analysis tools, and offers more insight into the biological processes.

CONCLUSIONS

KOBAS 2.0 has an expanded reservoir of underlying pathway databases and statistical tests, and the addition of disease databases. In future research, we aim to improve the graphical representation of the output pathways. We will continue to update KOBAS 2.0 with new pathway and disease data.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

‘National Outstanding Young Investigator’ from Natural Science Foundation of China (31025014); Johnson and Johnson (scholarship); China Ministry of Science and Technology 863 Hi-Tech Research and Development Programs (2007AA02Z165) and 973 Basic Research Program (2011CBA01102, 2007CB946904). Funding for open access charge: 973 Basic Research Program (2011CBA01102).

Conflict of interest statement. None declared.

40 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.

Authors: Lucia A Hindorff; Praveen Sethupathy; Heather A Junkins; Erin M Ramos; Jayashri P Mehta; Francis S Collins; Teri A Manolio
Journal: Proc Natl Acad Sci U S A Date: 2009-05-27 Impact factor: 11.205

3. Ontologizer 2.0--a multifunctional tool for GO term enrichment analysis and data exploration.

Authors: Sebastian Bauer; Steffen Grossmann; Martin Vingron; Peter N Robinson
Journal: Bioinformatics Date: 2008-05-29 Impact factor: 6.937

4. Identifying biological themes within lists of genes with EASE.

Authors: Douglas A Hosack; Glynn Dennis; Brad T Sherman; H Clifford Lane; Richard A Lempicki
Journal: Genome Biol Date: 2003-09-11 Impact factor: 13.583

5. Ensembl Genomes: extending Ensembl across the taxonomic space.

Authors: P J Kersey; D Lawson; E Birney; P S Derwent; M Haimel; J Herrero; S Keenan; A Kerhornou; G Koscielny; A Kähäri; R J Kinsella; E Kulesha; U Maheswari; K Megy; M Nuhn; G Proctor; D Staines; F Valentin; A J Vilella; A Yates
Journal: Nucleic Acids Res Date: 2009-11-01 Impact factor: 16.971

6. Reactome: a database of reactions, pathways and biological processes.

Authors: David Croft; Gavin O'Kelly; Guanming Wu; Robin Haw; Marc Gillespie; Lisa Matthews; Michael Caudy; Phani Garapati; Gopal Gopinath; Bijay Jassal; Steven Jupe; Irina Kalatskaya; Shahana Mahajan; Bruce May; Nelson Ndegwa; Esther Schmidt; Veronica Shamovsky; Christina Yung; Ewan Birney; Henning Hermjakob; Peter D'Eustachio; Lincoln Stein
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

7. Transcriptome profiling, molecular biological, and physiological studies reveal a major role for ethylene in cotton fiber cell elongation.

Authors: Yong-Hui Shi; Sheng-Wei Zhu; Xi-Zeng Mao; Jian-Xun Feng; Yong-Mei Qin; Liang Zhang; Jing Cheng; Li-Ping Wei; Zhi-Yong Wang; Yu-Xian Zhu
Journal: Plant Cell Date: 2006-02-03 Impact factor: 11.277

8. Reactome knowledgebase of human biological pathways and processes.

Authors: Lisa Matthews; Gopal Gopinath; Marc Gillespie; Michael Caudy; David Croft; Bernard de Bono; Phani Garapati; Jill Hemish; Henning Hermjakob; Bijay Jassal; Alex Kanapin; Suzanna Lewis; Shahana Mahajan; Bruce May; Esther Schmidt; Imre Vastrik; Guanming Wu; Ewan Birney; Lincoln Stein; Peter D'Eustachio
Journal: Nucleic Acids Res Date: 2008-11-03 Impact factor: 16.971

9. GeneCodis: interpreting gene lists through enrichment analysis and integration of diverse biological information.

Authors: Ruben Nogales-Cadenas; Pedro Carmona-Saez; Miguel Vazquez; Cesar Vicente; Xiaoyuan Yang; Francisco Tirado; Jose María Carazo; Alberto Pascual-Montano
Journal: Nucleic Acids Res Date: 2009-05-22 Impact factor: 16.971

10. PID: the Pathway Interaction Database.

Authors: Carl F Schaefer; Kira Anthony; Shiva Krupa; Jeffrey Buchoff; Matthew Day; Timo Hannay; Kenneth H Buetow
Journal: Nucleic Acids Res Date: 2008-10-02 Impact factor: 16.971
