Brian D Bennett1,2, Pierre R Bushel3,2. 1. Integrative Bioinformatics Group, National Institute of Environmental Health Sciences, Research Triangle Park, 27709 NC USA. 2. Microarray and Genome Informatics Group, National Institute of Environmental Health Sciences, 111 T.W. Alexander Drive, Research Triangle Park, 27709 NC USA. 3. Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, 27709 NC USA.
Abstract
BACKGROUND: Over-representation analysis (ORA) detects enrichment of genes within biological categories. Gene Ontology (GO) domains are commonly used for gene/gene-product annotation. When ORA is employed, often times there are hundreds of statistically significant GO terms per gene set. Comparing enriched categories between a large number of analyses and identifying the term within the GO hierarchy with the most connections is challenging. Furthermore, ascertaining biological themes representative of the samples can be highly subjective from the interpretation of the enriched categories. RESULTS: We developed goSTAG for utilizing GO Subtrees to Tag and Annotate Genes that are part of a set. Given gene lists from microarray, RNA sequencing (RNA-Seq) or other genomic high-throughput technologies, goSTAG performs GO enrichment analysis and clusters the GO terms based on the p-values from the significance tests. GO subtrees are constructed for each cluster, and the term that has the most paths to the root within the subtree is used to tag and annotate the cluster as the biological theme. We tested goSTAG on a microarray gene expression data set of samples acquired from the bone marrow of rats exposed to cancer therapeutic drugs to determine whether the combination or the order of administration influenced bone marrow toxicity at the level of gene expression. Several clusters were labeled with GO biological processes (BPs) from the subtrees that are indicative of some of the prominent pathways modulated in bone marrow from animals treated with an oxaliplatin/topotecan combination. In particular, negative regulation of MAP kinase activity was the biological theme exclusively in the cluster associated with enrichment at 6 h after treatment with oxaliplatin followed by control. However, nucleoside triphosphate catabolic process was the GO BP labeled exclusively at 6 h after treatment with topotecan followed by control. CONCLUSIONS: goSTAG converts gene lists from genomic analyses into biological themes by enriching biological categories and constructing GO subtrees from over-represented terms in the clusters. The terms with the most paths to the root in the subtree are used to represent the biological themes. goSTAG is developed in R as a Bioconductor package and is available at https://bioconductor.org/packages/goSTAG.
BACKGROUND: Over-representation analysis (ORA) detects enrichment of genes within biological categories. Gene Ontology (GO) domains are commonly used for gene/gene-product annotation. When ORA is employed, often times there are hundreds of statistically significant GO terms per gene set. Comparing enriched categories between a large number of analyses and identifying the term within the GO hierarchy with the most connections is challenging. Furthermore, ascertaining biological themes representative of the samples can be highly subjective from the interpretation of the enriched categories. RESULTS: We developed goSTAG for utilizing GO Subtrees to Tag and Annotate Genes that are part of a set. Given gene lists from microarray, RNA sequencing (RNA-Seq) or other genomic high-throughput technologies, goSTAG performs GO enrichment analysis and clusters the GO terms based on the p-values from the significance tests. GO subtrees are constructed for each cluster, and the term that has the most paths to the root within the subtree is used to tag and annotate the cluster as the biological theme. We tested goSTAG on a microarray gene expression data set of samples acquired from the bone marrow of rats exposed to cancer therapeutic drugs to determine whether the combination or the order of administration influenced bone marrow toxicity at the level of gene expression. Several clusters were labeled with GO biological processes (BPs) from the subtrees that are indicative of some of the prominent pathways modulated in bone marrow from animals treated with an oxaliplatin/topotecan combination. In particular, negative regulation of MAP kinase activity was the biological theme exclusively in the cluster associated with enrichment at 6 h after treatment with oxaliplatin followed by control. However, nucleoside triphosphate catabolic process was the GO BP labeled exclusively at 6 h after treatment with topotecan followed by control. CONCLUSIONS: goSTAG converts gene lists from genomic analyses into biological themes by enriching biological categories and constructing GO subtrees from over-represented terms in the clusters. The terms with the most paths to the root in the subtree are used to represent the biological themes. goSTAG is developed in R as a Bioconductor package and is available at https://bioconductor.org/packages/goSTAG.
Gene lists derived from the results of genomic analyses are rich in biological information [1, 2]. For instance, differentially expressed genes (DEGs) from a microarray or RNA-Seq analysis are related functionally in terms of their response to a treatment or condition [3]. Gene lists can vary in size, up to several thousand genes, depending on the robustness of the perturbations or how widely different the conditions are biologically [4]. Having a way to associate biological relatedness between hundreds or thousands of genes systematically is impractical by manually curating the annotation and function of each gene.Over-representation analysis (ORA) of genes was developed to identify biological themes [5]. Given a Gene Ontology (GO) [6, 7] and an annotation of genes that indicate the categories each one fits into, significance of the over-representation of the genes within the ontological categories is determined by a Fisher’s exact test or modeling according to a hypergeometric distribution [8]. Comparing a small number of enriched biological categories for a few samples is manageable using Venn diagrams or other means of assessing overlaps. However, with hundreds of enriched categories and many samples, the comparisons are laborious. Furthermore, if there are enriched categories that are shared between samples, trying to represent a common theme across them is highly subjective. We developed a tool called goSTAG to use GO Subtrees to Tag and Annotate Genes within a set. goSTAG visualizes the similarities between over-representations by clustering the p-values from the statistical tests and labels clusters with the GO term that has the most paths to the root within the subtree generated from all the GO terms in the cluster.
Implementation
The goSTAG package contains seven functions:loadGeneLists: loads sets of gene symbols for ORA that are in gene matrix transposed (GMT) format or text files in a directoryloadGOTerms: provides the assignment of genes to GO termsperformGOEnrichment: performs the ORA of the genes enriched within the GO categories and computes p-values for the significance based on a hypergeometric distributionperformHierarchicalClustering: clusters the enrichment matrixgroupClusters: partitions clusters of GO terms according to a distance/dissimilarity threshold of where to cut the dendorgramannotateClusters: creates subtrees from the GO terms in the clusters and labels the clusters according to the GO terms with the most paths back to the rootplotHeatmap: generates a figure within the active graphic device illustrating the results of the clustering with the annotated labels and a heat map with colors representative of the extent of enrichmentSee the goSTAG vignette for details of the functions, arguments, default settings and for optional user-defined analysis parameters.The workflow for goSTAG proceeds as follows: First, gene lists are loaded from analyses performed within or outside of R. For convenience, a function is provided for loading gene lists generated outside of R. Then, GO terms are loaded from the biomRt package. Users can specify a particular species (human, mouse, or rat) and a GO subontology (molecular function [MF], biological process [BP], or cellular component [CC]). GO terms that have less than the predefined number of genes associated with them are removed. Next, GO enrichment is performed and p-values are calculated. Enriched GO terms are filtered by p-value or a method for multiple comparisons such as false discovery rate (FDR) [9], with only the union of all significant GO terms remaining. An enrichment matrix is assembled from the –log10 p-values for these remaining GO terms. goSTAG performs hierarchical clustering on the matrix using a choice of distance/dissimilarity measures, grouping algorithms and matrix dimension. Based on clusters with a minimum number of GO terms, goSTAG builds a GO subtree for each cluster. The structure of the GO parent/child relationships is obtained from the GO.db package. The GO term with the largest number of paths to the root of the subtree is selected as the representative GO term for that cluster. Finally, goSTAG creates a figure in the active graphic device of R that contains a heatmap representation of the enrichment and the hierarchical clustering dendrogram, with clusters containing at least the predefined number of GO terms labeled with the name of its representative GO term.Usage example:gene_lists < − loadGeneLists ("gene_lists.gmt")go_terms < − loadGOTerms ()enrichment_matrix < − performGOEnrichment (gene_lists, go_terms)hclust_results < − performHierarchicalClustering (enrichment_matrix)clusters < − groupClusters (hclust_results)cluster_labels < − annotateClusters (clusters)plotHeatmap (enrichment_matrix, hclust_results, clusters, cluster_labels)
Results
To demonstrate the utility of goSTAG, we analyzed the DEGs from gene expression analysis (Affymetrix GeneChip Rat Genome 230 2.0 arrays) of samples acquired from the bone marrow of rats exposed to cancer therapeutic drugs (topotecan in combination with oxaliplatin) for 1, 6, or 24 h in order to determine whether the combination or the order of administration influenced bone marrow toxicity at the level of gene expression. Details of the analysis are as previously described [10]. The data are available in the Gene Expression Omnibus (GEO) [11, 12] under accession number GSE63902. The DEG lists (Additional file 1), along with the GO terms from Bioconductor GO.db package v3.4.0 and GO gene associations based on biomaRt package v2.31.4, were fed into goSTAG using default parameters except for the rat species, the distance threshold set at < 0.3 and the minimum number of GO terms in a cluster set at > = 15. The defaults include only considering BP GO terms and requiring at least 5 genes within a GO category. There were 762 BPs significant from the union of all the lists. As shown in Fig. 1, the more red the intensity of the heat map, the more significant the enrichment of the GO BPs. Fifteen clusters of GO BPs are labeled with the term with the largest number of paths to the root in each. Negative regulation of MAP kinase activity (GO:0043407) was the GO BP labeled exclusively in the cluster associated with enrichment at 6 h after treatment with oxaliplatin followed by control. However, nucleoside triphosphate catabolic process (GO:0009143) was the GO BP labeled exclusively in the cluster associated with enrichment at 6 h after treatment with topotecan followed by control.
Fig. 1
Heat map of GO BPs clustered and labeled with the terms with the most paths to the root. The data used is the –log10 p-values from the ORA of the DEG lists. To: topotecan, Ox: oxaliplatin, Ctrl: control. The x-axis is the samples, and the y-axis is the 762 GO BPs. The more red the intensity, the more significant the enrichment
Heat map of GO BPs clustered and labeled with the terms with the most paths to the root. The data used is the –log10 p-values from the ORA of the DEG lists. To: topotecan, Ox: oxaliplatin, Ctrl: control. The x-axis is the samples, and the y-axis is the 762 GO BPs. The more red the intensity, the more significant the enrichment
Conclusions
goSTAG performs ORA on gene lists from genomic analyses, clusters the enriched biological categories and constructs GO subtrees from over-represented terms in the clusters revealing biological themes representative of the underlying biology. Using goSTAG on microarray gene expression data from the bone marrow of rats exposed to a combination of cancer therapeutics, we were able to elucidate biological themes that were in common or differed according to the treatment conditions. goSTAG is developed in R (open source) as an easy to use Bioconductor package and is publicly available at https://bioconductor.org/packages/goSTAG.
Availability and requirements
Project Name: goSTAGProject Home Page: The R Bioconductor package goSTAG is open source and available at https://bioconductor.org/packages/goSTAGOperating System: Platform independentProgramming Language: R version ≥ 3.4.0License: GPL-3
Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock Journal: Nat Genet Date: 2000-05 Impact factor: 38.330
Authors: Charles Wang; Binsheng Gong; Pierre R Bushel; Jean Thierry-Mieg; Danielle Thierry-Mieg; Joshua Xu; Hong Fang; Huixiao Hong; Jie Shen; Zhenqiang Su; Joe Meehan; Xiaojin Li; Lu Yang; Haiqing Li; Paweł P Łabaj; David P Kreil; Dalila Megherbi; Stan Gaj; Florian Caiment; Joost van Delft; Jos Kleinjans; Andreas Scherer; Viswanath Devanarayan; Jian Wang; Yong Yang; Hui-Rong Qian; Lee J Lancashire; Marina Bessarabova; Yuri Nikolsky; Cesare Furlanello; Marco Chierici; Davide Albanese; Giuseppe Jurman; Samantha Riccadonna; Michele Filosi; Roberto Visintainer; Ke K Zhang; Jianying Li; Jui-Hua Hsieh; Daniel L Svoboda; James C Fuscoe; Youping Deng; Leming Shi; Richard S Paules; Scott S Auerbach; Weida Tong Journal: Nat Biotechnol Date: 2014-08-24 Impact factor: 54.908
Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205
Authors: Douglas A Hosack; Glynn Dennis; Brad T Sherman; H Clifford Lane; Richard A Lempicki Journal: Genome Biol Date: 2003-09-11 Impact factor: 13.583
Authors: Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971
Authors: Myrtle Davis; Jianying Li; Elaine Knight; Sandy R Eldridge; Kellye K Daniels; Pierre R Bushel Journal: Front Genet Date: 2015-02-12 Impact factor: 4.599
Authors: Cristina Sisu; Paul Muir; Adam Frankish; Ian Fiddes; Mark Diekhans; David Thybert; Duncan T Odom; Paul Flicek; Thomas M Keane; Tim Hubbard; Jennifer Harrow; Mark Gerstein Journal: Nat Commun Date: 2020-07-29 Impact factor: 14.919
Authors: Alexander Jueterbock; Bernardo Duarte; James Coyer; Jeanine L Olsen; Martina Elisabeth Luise Kopp; Irina Smolina; Sophie Arnaud-Haond; Zi-Min Hu; Galice Hoarau Journal: Front Plant Sci Date: 2021-12-02 Impact factor: 5.753
Authors: Maria Schörnig; Xiangchun Ju; Luise Fast; Sebastian Ebert; Anne Weigert; Sabina Kanton; Theresa Schaffer; Nael Nadif Kasri; Barbara Treutlein; Benjamin Marco Peter; Wulf Hevers; Elena Taverna Journal: Elife Date: 2021-01-20 Impact factor: 8.140