ADGO 2.0: interpreting microarray data and list of genes using composite annotations.

Sang-Mun Chi, Jin Kim, Seon-Young Kim, Dougu Nam.

Abstract

ADGO 2.0 is a web-based tool that provides composite interpretations for microarray data comparing two sample groups as well as lists of genes from diverse sources of biological information. Some other tools also incorporate composite annotations solely for interpreting lists of genes but usually provide highly redundant information. This new version has the following additional features: first, it provides multiple gene set analysis methods for microarray inputs as well as enrichment analyses for lists of genes. Second, it screens redundant composite annotations when generating and prioritizing them. Third, it incorporates union and subtracted sets as well as intersection sets. Lastly, users can upload their own gene sets (e.g. predicted miRNA targets) to generate and analyze new composite sets. The first two features are unique to ADGO 2.0. Using our tool, we demonstrate analyses of a microarray dataset and a list of genes for T-cell differentiation. The new ADGO is available at http://www.btool.org/ADGO2.

Year: 2011 PMID: 21624890 PMCID: PMC3125784 DOI: 10.1093/nar/gkr392

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

High-throughput omics experiments often produce lists of genes, and their biological interpretations have been of substantial interest. Typical approaches examine the extent of the overlap between a list of genes and predefined annotated gene sets using hypergeometric distribution, chi-square or Fisher’s exact test, which may be dubbed collectively as gene list analysis (GLA) (1). For microarrays, each gene has its own score (e.g. two sample t-statistic or fold-change value) and an alternative approach, called gene set analysis (GSA), is applicable without selecting a list of genes (2). In many cases, the ‘interpretation’ of large-scale data indicates investigating the enrichment of pre-set knowledge within the given data. Accordingly, such enrichment analyses are widespread over omics research regardless of the data analyzed (microarray, mass spectrometry, ChIP-chip or next-generation sequencing). In addition, a number of algorithms and tools have been developed in this context (1–3). In both approaches (GSA and GLA), the predefined gene sets play key roles in biological interpretations. Such gene sets are usually derived from biological databases such as Gene Ontology (4) or KEGG (5), where they share a common biological annotation for pathways, functions, cellular localizations or targets of a common transcription factor (TF), for instance. One important problem with most existing methods is that they handle only gene sets with unary annotations, thus limiting the discriminating power of the method employed. For example, suppose we want to examine whether a given list of genes is enriched with the putative targets of some TF. Because most gene sets that share a common TF binding site are dominated with false positive targets, this simple approach may not be very successful when used to uncover the relevant TFs. 
However, if we take intersections between the putative TF target sets and the gene sets of Gene Ontology, some of them may define biologically more relevant gene sets, which then may be enriched with the gene list. With this rationale, composite annotation gene sets were introduced for GSA (6) and GLA (7), respectively. Thereafter, several software tools were developed for GLA based on composite annotations (8–10). ADGO (6) and ProfCom (9) use Boolean set operations (intersection, union and subtraction) to generate composite gene sets, and GENECODIS (11) and COFECO (10) employ an association rule-mining algorithm to extract co-occurring annotations. In any case, the composite interpreters usually display quite a long and redundant list of significant gene sets, many of which largely overlap each other. Therefore, removing redundancy and abstracting the results are important issues when utilizing composite annotations. Here, we suggest three criteria for filtering composite gene sets for GSA and GLA: (i) If a composite set overlaps some single set beyond a threshold, the composite set should be removed a priori; in other words, if the members of the two sets are very similar to each other, the single annotation should have priority. (ii) A significant intersection or union set should be screened out if any of the single sets used to generate it is also significant. (iii) A significant subtracted set should be screened out if the single set that contains the subtracted set is also significant. Taking these considerations into account, we constructed web-based software called ADGO 2.0 to provide composite interpretations for both microarrays and lists of genes. The previous version of ADGO was designed to illustrate the idea of using composite annotations for GSA and provided a single GSA method (6). The current version was completely rebuilt to support automatic updating of (composite) gene sets, and was extended in terms of both coverage and methods.

MATERIALS AND METHODS

Supported analyses

ADGO 2.0 currently supports analyses of eight popular organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster and Escherichia coli), and provides from four to seven kinds of annotations (GO terms for biological processes, cellular components and molecular functions; KEGG pathways; chromosome; cis-regulatory motifs; and OMIM) depending on the species selected. The user can choose one of four methods [Z-statistic, gene permutation, sample permutation and Gene Set Enrichment Analysis (GSEA)] for GSA and one of two methods (Fisher’s exact test and hypergeometric distribution) for GLA. Only the methods applicable to the format of the input data are displayed.

Construction of annotation gene set databases

For all of the gene sets from the seven annotation categories included in ADGO 2.0, we applied three types of Boolean set operations (intersection, union and subtraction) to each pair of gene sets across different categories. The 10–20% rule was applied for intersections and subtractions (6). In other words, a pair of single gene sets was required to have at least 10 genes in common and the two subtracted sets were required to contain at least 20% of the genes in each single set. Union operations were applied if a pair of gene sets had five or more elements in common. The ‘subtraction’ of two annotation sets A and B is denoted by A − B, which is the intersection of A and the complement of B. To ensure that a generated composite set is a genuinely novel set, it was compared with each single set and discarded if its overlap with any single set exceeded a threshold (‘Filtering Composite Sets’). The overlap (%) is computed as the size of the intersection between the composite set and a single set divided by the size of the union of these two sets. With a 60% threshold, ∼27% of all composite sets are screened out for the three categories of Gene Ontology. All of these single and composite gene sets are prepared on the server in advance and retrieved according to the user’s choice of gene set categories. One important feature of ADGO 2.0 is that users can upload and analyze their own annotation gene sets. If the user chooses some of the built-in gene sets and uploads user gene set data, the server generates ad hoc composite sets and shows the computation results. For this reason, analyzing the user’s gene sets takes considerably more time.
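The construction above can be illustrated with a minimal Python sketch. It is not the ADGO 2.0 server code: the function names (`overlap`, `build_composites`), the composite naming scheme (`AND`, `OR`, `NOT`) and the reading of the 10–20% rule (each subtracted set must hold at least 20% of the set it was subtracted from) are assumptions made for illustration. The thresholds are the ones stated in the text.

```python
from itertools import product


def overlap(x, y):
    """Overlap as described in the text: |x ∩ y| / |x ∪ y|."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def build_composites(cat_a, cat_b, min_common=10, min_frac=0.20,
                     min_common_union=5, max_overlap=0.60):
    """Generate composite sets from every pair of sets across two categories.

    cat_a, cat_b: dicts mapping an annotation name to a set of gene IDs.
    Returns a dict mapping a (hypothetical) composite name to its gene set.
    """
    singles = list(cat_a.values()) + list(cat_b.values())
    composites = {}
    for (name_a, a), (name_b, b) in product(cat_a.items(), cat_b.items()):
        common, a_only, b_only = a & b, a - b, b - a
        candidates = {}
        # 10-20% rule (assumed reading) for intersections and subtractions.
        if (len(common) >= min_common
                and len(a_only) >= min_frac * len(a)
                and len(b_only) >= min_frac * len(b)):
            candidates[f"{name_a} AND {name_b}"] = common
            candidates[f"{name_a} NOT {name_b}"] = a_only
            candidates[f"{name_b} NOT {name_a}"] = b_only
        # Unions only need five or more shared genes.
        if len(common) >= min_common_union:
            candidates[f"{name_a} OR {name_b}"] = a | b
        # Discard composites that are too similar to any single set.
        for name, genes in candidates.items():
            if all(overlap(genes, s) <= max_overlap for s in singles):
                composites[name] = genes
    return composites


if __name__ == "__main__":
    # Synthetic gene IDs; two sets of 40 genes sharing 10 genes.
    kegg = {"K": {f"g{i}" for i in range(0, 40)}}
    chrom = {"C": {f"g{i}" for i in range(30, 70)}}
    for name, genes in build_composites(kegg, chrom).items():
        print(name, len(genes))
```

In this toy input both subtracted sets overlap their parent set by 75%, so they are discarded by the 60% filter, while the intersection and the union are retained.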

Processing methods

If the user uploads microarray data or a list of genes, the server detects the file format and displays relevant analysis methods and other options. For a microarray input, four gene set analysis methods are available. Among them, ‘Z-test’ (12) and ‘Gene permutation’ (13) are gene randomization methods, and ‘Sample permutation’ (13) and ‘GSEA’ (14) are sample randomization methods. We used the average t-value as the set score in the gene and sample permutation methods. The Z-test is a parametric method and is the fastest. GSEA is the most widely used but usually takes more computing time. This becomes problematic when analyzing composite annotations, as the number of gene sets to be handled grows quadratically with the number of single annotations. For this reason, we reimplemented the algorithm in C++. We fixed the power of the gene score at p = 1 to deploy the weighted Kolmogorov–Smirnov gene set statistic. For a gene list input, ‘Fisher’s exact test’ and ‘Hypergeometric distribution’ are provided as analysis methods. For both GSA and GLA, we provide two types of filtering methods for significant composite sets: the ‘Strong Type’ and the ‘Weak Type’. With the strong-type option, a significant composite set is screened out if a single set involved in generating the composite set is also significant. With the weak-type option, a significant composite set is displayed if it has a smaller P-value than those of the individual single sets. Therefore, the weak-type option yields more significant composite sets.
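The two filtering options can be sketched as follows. This is an illustrative reading of the description above, not the server implementation: the function name `filter_composites`, the `parents` mapping and the use of a single P-value cutoff `alpha` for significance are assumptions (the tool itself reports Q-value cutoffs). For a subtracted set, the caller may list only the single set that contains it as its parent, following the third criterion in the Introduction.

```python
def filter_composites(p, parents, alpha=0.01, mode="strong"):
    """Return the names of gene sets to display, most significant first.

    p:       dict mapping every set name (single or composite) to its P-value.
    parents: dict mapping each composite name to the single sets it is
             checked against; names absent from `parents` are single sets.
    """
    shown = []
    for name, pval in p.items():
        if pval > alpha:
            continue  # not significant, never displayed
        singles = parents.get(name)
        if singles is None:
            shown.append(name)  # significant single sets are always kept
            continue
        parent_p = [p[s] for s in singles]
        if mode == "strong":
            # Screen out the composite if any parent is itself significant.
            keep = all(pp > alpha for pp in parent_p)
        elif mode == "weak":
            # Keep the composite if it beats every parent's P-value.
            keep = all(pval < pp for pp in parent_p)
        else:
            raise ValueError("mode must be 'strong' or 'weak'")
        if keep:
            shown.append(name)
    return sorted(shown, key=p.get)


if __name__ == "__main__":
    # Made-up P-values for illustration only.
    p = {"A": 0.005, "B": 0.2, "A AND B": 0.0001,
         "C": 0.3, "D": 0.4, "C AND D": 0.002}
    parents = {"A AND B": ("A", "B"), "C AND D": ("C", "D")}
    print(filter_composites(p, parents, mode="strong"))
    print(filter_composites(p, parents, mode="weak"))
```

With these made-up values, the strong-type filter drops ‘A AND B’ because its parent ‘A’ is already significant, whereas the weak-type filter keeps it because its P-value is smaller than those of both parents.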

Input data types and options

For microarray input, the user can upload microarray data with two sample groups. The first column should contain the gene IDs (with a header) and the sample data values should follow in the next columns. ADGO 2.0 accepts both single and dual channel gene IDs for microarray input. For a single channel input, the probe IDs for Affymetrix, Illumina and Agilent chips are supported. For a dual channel input, five types of gene IDs (gene symbol, Ensembl, Entrez, RefSeq and UniProt) as well as the systematic names for Saccharomyces cerevisiae are supported. The first sample group data should appear in the first k columns of data values and the second group data should follow in the next l columns. k and l should be specified in the ‘Sample Size’ option. ADGO 2.0 then computes the two-sample t-statistic or average fold-change value for each gene and proceeds. There is another option for microarray input: if the user wants to use gene scores other than the two-sample t-statistic or fold-change values, he/she can directly upload the gene score data (a single value for each gene). For a gene list input, the same dual channel IDs are supported. We constructed the reference (background) genes for GLA by merging all the genes contained in each annotation set. For any type of input data, the user can also paste the data into the ‘Paste data panel’ without uploading a data file. More detailed information is available from our web site (http://www.btool.org/ADGO2).
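As a rough illustration of the per-gene scoring step, the sketch below reads a tab-delimited table laid out as described above and computes a two-sample t-statistic from the first k and the next l value columns. Several details are assumptions, since the text does not specify them: the presence of a header row, the tab delimiter, the unequal-variance (Welch) form of the statistic, the sign convention (first group minus second group) and the function name `welch_t_scores`.

```python
import csv
import io
from math import sqrt
from statistics import mean, variance


def welch_t_scores(text, k, l):
    """Map each gene ID to a two-sample t-statistic (group 1 minus group 2).

    text: tab-delimited table; column 1 holds gene IDs, the next k columns
          hold group 1 values and the following l columns hold group 2.
    k, l: group sizes (each must be at least 2 to estimate a variance).
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row (assumed to be present)
    scores = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        gene_id, values = row[0], [float(c) for c in row[1:]]
        g1, g2 = values[:k], values[k:k + l]
        se = sqrt(variance(g1) / k + variance(g2) / l)
        scores[gene_id] = (mean(g1) - mean(g2)) / se if se > 0 else 0.0
    return scores


if __name__ == "__main__":
    table = "ID\ts1\ts2\ts3\ts4\nGENE1\t1\t2\t3\t4\nGENE2\t5\t6\t5\t6\n"
    print(welch_t_scores(table, k=2, l=2))
```

The resulting dictionary has the same shape as the directly uploaded gene score data (a single value for each gene), so either could feed the same downstream gene set analysis.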

Outputs

When the user executes the analysis, the server shows a list of significant gene sets (single and composite) along with gene set names, gene set IDs, members of each gene set, P-values, False Discovery Rate (FDR) Q-values and Bonferroni-corrected P-values (Figure 1). The members of each gene set are listed in descending order of their association strength. The gene set list is sorted according to the FDR Q-values. Composite annotations increase the number of annotation sets to be analyzed, which changes the analysis results to some extent. However, the FDR Q-values reflect the increased number of gene sets and provide the adjusted significance threshold. The computation results are also downloadable as a text file.
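To make the effect of the increased number of gene sets concrete, the following sketch computes both corrections from a set of raw P-values. The text does not state which FDR procedure ADGO 2.0 uses; the Benjamini–Hochberg step-up procedure is assumed here, and the function name `adjust_pvalues` is hypothetical.

```python
def adjust_pvalues(pvals):
    """Return (q_values, bonferroni), each a dict keyed by gene set name.

    pvals: dict mapping a gene set name to its raw P-value.
    Q-values follow the Benjamini-Hochberg procedure (an assumption).
    """
    m = len(pvals)
    bonferroni = {name: min(1.0, p * m) for name, p in pvals.items()}

    # Walk from the largest P-value down, enforcing monotone Q-values.
    ranked = sorted(pvals, key=pvals.get, reverse=True)
    q_values = {}
    running_min = 1.0
    for i, name in enumerate(ranked):
        rank = m - i  # rank 1 is the smallest P-value
        running_min = min(running_min, pvals[name] * m / rank)
        q_values[name] = running_min
    return q_values, bonferroni


if __name__ == "__main__":
    # Made-up P-values for illustration only.
    raw = {"set1": 0.01, "set2": 0.04, "set3": 0.03, "set4": 0.005}
    q, bonf = adjust_pvalues(raw)
    for name in sorted(raw, key=q.get):
        print(name, raw[name], q[name], bonf[name])
```

Both adjusted values scale with the number of sets tested m, so adding composite sets to the analysis raises the bar that every set, single or composite, must clear.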

Figure 1.

Z-test results for the T-cell differentiation data set. See the text for an explanation and detailed options. If the user clicks ‘view’, the members of each significant annotation set as well as their scores are shown.

Analysis example

We present an example of gene expression data analysis by ADGO 2.0 to demonstrate its utility. Schones et al. (15) compared the gene expression patterns between resting and activated T cells to understand the molecular changes that occur during T-cell differentiation (GEO accession number: GSE10437). We computed the fold changes of each gene in the microarray data to test a GSA. We chose the ‘Z-test’ method (Q-value cutoff: 0.01) and two annotation categories: KEGG and Chromosome. We checked three types of composite sets: ‘Single’ + ‘Intersection’ + ‘Subtraction’. Many KEGG pathways related to immune responses including ‘Autoimmune thyroid disease’, ‘Antigen processing and presentation’ and ‘Graft-versus-host disease’ were not significant by themselves when we used only KEGG categories. However, they were significantly induced if we excluded the genes in the chromosome 6p21.3 set. Interestingly, the same immune-related gene sets were significantly downregulated when intersected with the 6p21.3 set. This suggests that some part of the chromosome region 6p21.3 is locked during T-cell differentiation while other immune-related genes outside this region are activated. Indeed, this intersection set contained many HLA (human leukocyte antigen) genes that were mostly downregulated [i.e. HLA-DPB1 (−3.19), HLA-DRA (−1.97), HLA-DMA (−1.92), HLA-DRB1 (−1.71), HLA-E (−1.66), HLA-DMB (−1.54)], significantly affecting the overall patterns of immune-related gene sets. See Figure 1 for the list of significant gene sets. This example illustrates how bi-directional expression patterns within a gene set can be described precisely using composite annotations. We then selected the 200 most induced genes to test a GLA. Using Fisher’s exact test and the three Gene Ontology categories, we interpreted the list. 
In the first trial, we chose the ‘Strong Type’ option, and we obtained a list of gene sets associated with the input list, many of which, as a single set, had strong relevance with T-cell differentiation. Examples include the cytokine-related processes ‘JAK-STAT cascade’, ‘regulation of tyrosine phosphorylation’, ‘immunoglobulin production’, ‘regulation of T cell activation’ and ‘T-cell proliferation’ (15,16). Additionally, some subtracted sets associated with ‘ribosome’ were detected. We then chose the ‘Weak Type’ to investigate more specific patterns within each single set. Figure 2 shows the computation results. Several intersected and subtracted gene sets appeared on top ranks. Most ‘tyrosine phosphorylation’ gene sets showed stronger patterns when intersected with ‘growth factor activity’. For example, ‘regulation of peptidyl-tyrosine phosphorylation’ had a Q-value in the strong-type analysis, but the intersection of ‘regulation of peptidyl-tyrosine phosphorylation’ and ‘growth factor activity’ had a much better Q-value of in the weak-type analysis. The former single set originally contained 83 genes in total and actually included eight members from the input list, while the latter intersection set had a much smaller number of genes (25 in total), but included seven members from the input list. This feature makes the composite sets more precise descriptors of the enrichment patterns.
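The gene counts quoted above (8 of 83 genes for the single set versus 7 of 25 genes for the intersection set, with an input list of 200 genes) can be turned into a small worked example of the hypergeometric tail probability, which coincides with the one-sided Fisher’s exact test. The size of the reference (background) gene set is not given in the text, so the value of 15 000 used below is purely hypothetical, as is the function name `hypergeom_tail`; the resulting numbers illustrate the trend only and are not the Q-values reported by ADGO 2.0.

```python
from math import comb


def hypergeom_tail(background, set_size, list_size, hits):
    """P(X >= hits) when drawing `list_size` genes from `background` genes,
    of which `set_size` belong to the annotation set."""
    upper = min(set_size, list_size)
    favorable = sum(
        comb(set_size, i) * comb(background - set_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favorable / comb(background, list_size)


if __name__ == "__main__":
    BACKGROUND = 15000  # hypothetical reference size, not from the paper
    LIST_SIZE = 200     # the 200 most induced genes

    # Single set: 83 genes, 8 of them in the input list.
    p_single = hypergeom_tail(BACKGROUND, 83, LIST_SIZE, 8)
    # Intersection set: 25 genes, 7 of them in the input list.
    p_inter = hypergeom_tail(BACKGROUND, 25, LIST_SIZE, 7)

    print(f"single set:       P = {p_single:.3g}")
    print(f"intersection set: P = {p_inter:.3g}")
```

Under this assumed background, the intersection set yields the smaller tail probability although it contains one fewer list member, because seven hits in 25 genes is a far denser enrichment than eight hits in 83.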

Figure 2.

Enriched gene sets for a list of upregulated genes in T-cell differentiation. The weak-type filtering criterion is applied for significant composite sets.

DISCUSSION AND CONCLUSION

ADGO 2.0 is currently a unique tool that supports GSA methods based on composite annotations. We added GLA methods to this new version. It provides several widely used GSA methods including fast GSEA (14). Several other tools [e.g. GENECODIS (11), ProfCom (9) and COFECO (10)] provide analyses via composite annotations only for GLA. GENECODIS and COFECO employ the same type of algorithm and focus on the intersection of two or more annotation gene sets, while ProfCom generates more general types of composite sets using Boolean set operations (up to five single sets). The shortcomings of these tools are that they display either all the significant gene sets without filtering redundant information (GENECODIS and COFECO) or a partial list of gene sets identified by a greedy search algorithm (ProfCom). ADGO 2.0 generates composite gene sets based on Boolean operations of two overlapping single sets and screens out composite sets that carry redundant information for both GSA and GLA. We could also consider composite sets generated from three or more single sets, as ProfCom does, but this would increase the computational complexity prohibitively and make it quite complicated to establish legitimate rules (e.g. inclusion and exclusion rules) for screening out the redundant information. Using our tool, we demonstrated how to incorporate and interpret significant composite annotations when analyzing microarray data and lists of genes. Note that many significant composite terms are hard to interpret clearly. In most cases, we may not find evidence from the literature because complex biological patterns have rarely been explored so far. Therefore, our tool may be used for exploratory research. If a composite pattern is observed repeatedly across many data sets, it may be validated experimentally. One interesting direction for future work with our tool is to investigate the regulatory interactions between regulators (TF or miRNA) and pathways using gene expression profiles. 
Because sequence-based predictions of the targets of TF or miRNA inevitably include abundant false positives, taking the intersections of the putative target genes with other gene sets may be useful for exploring specific patterns in regulatory networks.

FUNDING

Basic Science Research Program through a National Research Foundation (NRF) grant funded by the Korean government (MEST) (No. 2011-0015107); National R & D Program for Cancer Control of the Ministry for Health and Welfare, Republic of Korea (1020360) (to S.-Y.K.). Funding for open access charge: NRF grant funded by Korean government (No. 2011-0015107).

Conflict of interest statement. None declared.

16 in total

1. ADGO: analysis of differentially expressed gene sets using composite GO annotation.

Authors: Dougu Nam; Sang-Bae Kim; Seon-Kyu Kim; Sungjin Yang; Seon-Young Kim; In-Sun Chu
Journal: Bioinformatics Date: 2006-07-12 Impact factor: 6.937

Review 2. Gene-set approach for expression pattern analysis.

Authors: Dougu Nam; Seon-Young Kim
Journal: Brief Bioinform Date: 2008-01-17 Impact factor: 11.622

3. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

Review 4. Cytokine signaling modules in inflammatory responses.

Authors: John J O'Shea; Peter J Murray
Journal: Immunity Date: 2008-04 Impact factor: 31.745

5. Dynamic regulation of nucleosome positioning in the human genome.

Authors: Dustin E Schones; Kairong Cui; Suresh Cuddapah; Tae-Young Roh; Artem Barski; Zhibin Wang; Gang Wei; Keji Zhao
Journal: Cell Date: 2008-03-07 Impact factor: 41.582

6. GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists.

Authors: Pedro Carmona-Saez; Monica Chagoyen; Francisco Tirado; Jose M Carazo; Alberto Pascual-Montano
Journal: Genome Biol Date: 2007 Impact factor: 13.583

7. COFECO: composite function annotation enriched by protein complex data.

Authors: Choong-Hyun Sun; Min-Sung Kim; Youngwoong Han; Gwan-Su Yi
Journal: Nucleic Acids Res Date: 2009-05-08 Impact factor: 16.971

8. KEGG for linking genomes to life and the environment.

Authors: Minoru Kanehisa; Michihiro Araki; Susumu Goto; Masahiro Hattori; Mika Hirakawa; Masumi Itoh; Toshiaki Katayama; Shuichi Kawashima; Shujiro Okuda; Toshiaki Tokimatsu; Yoshihiro Yamanishi
Journal: Nucleic Acids Res Date: 2007-12-12 Impact factor: 16.971

9. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists.

Authors: Da Wei Huang; Brad T Sherman; Richard A Lempicki
Journal: Nucleic Acids Res Date: 2008-11-25 Impact factor: 16.971

10. ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data.

Authors: Alexey V Antonov; Thorsten Schmidt; Yu Wang; Hans W Mewes
Journal: Nucleic Acids Res Date: 2008-05-06 Impact factor: 16.971
