Literature DB >> 21798041

GO Trimming: Systematically reducing redundancy in large Gene Ontology datasets.

Stuart G Jantzen1, Ben Jg Sutherland, David R Minkley, Ben F Koop.   

Abstract

BACKGROUND: The increased accessibility of gene expression tools has enabled a wide variety of experiments utilizing transcriptomic analyses. As these tools increase in prevalence, the need for improved standardization in processing and presentation of data increases, as does the need to guard against interpretation bias. Gene Ontology (GO) analysis is a powerful method of interpreting and summarizing biological functions. However, while there are many tools available to investigate GO enrichment, there remains a need for methods that directly remove redundant terms from enriched GO lists that often provide little, if any, additional information.
FINDINGS: Here we present a simple yet novel method called GO Trimming that utilizes an algorithm designed to reduce redundancy in lists of enriched GO categories. Depending on the needs of the user, this method can be performed with variable stringency. In the example presented here, an initial list of 90 terms was reduced to 54, eliminating 36 largely redundant terms. We also compare this method to existing methods and find that GO Trimming, while simple, performs well to eliminate redundant terms in a large dataset throughout the depth of the GO hierarchy.
CONCLUSIONS: The GO Trimming method provides an alternative to other procedures, some of which involve removing large numbers of terms prior to enrichment analysis. This method should free up the researcher from analyzing overly large, redundant lists, and instead enable the concise presentation of manageable, informative GO lists. The implementation of this tool is freely available at: http://lucy.ceh.uvic.ca/go_trimming/cbr_go_trimming.py.

Entities:  

Year:  2011        PMID: 21798041      PMCID: PMC3160396          DOI: 10.1186/1756-0500-4-267

Source DB:  PubMed          Journal:  BMC Res Notes        ISSN: 1756-0500


Background

Transcriptomic experiments conducted using high-density microarrays or RNA-seq often compare two or more states and can generate differentially expressed gene lists comprising hundreds or thousands of genes. These datasets generally require further analysis to identify reliable patterns in expression profiles, such as developmental changes at certain time points, and variable biological processes, such as metabolic pathways. Further analyses such as Gene Ontology (GO) enrichment, pathway enrichment, or clustering methods [1,2] can aid in both the discovery and summarization of important large-scale expression patterns. GO vocabularies are structured as directed acyclic graphs with a clearly defined hierarchical structure. However, this hierarchy contains an added complexity by allowing terms to have multiple parents, or ascendants [3]. An ascendant and a descendant exist in a defined parent-child relationship and constitute a path through the GO hierarchy, connected by zero or more intermediate GO terms. A gene annotated with any term is also annotated with every term that is an ascendant, or parent term, of the more specific term; each GO category will contain all of the genes from each of its children categories. As a gene will be annotated by a term and every ancestor of this term, terms at a variety of depths in the hierarchy will appear in an enriched GO list, given that the GO tool being used recognizes all levels of annotation for an input gene. In most cases, multiple terms from the same hierarchical path will appear in a significant GO list. These multiple categories of differing specificities are not necessarily problematic. On the contrary, they allow for several levels of interpretation, ranging from specific terms that encompass few genes, to higher-level categories that may describe large-scale effects on the system being studied. A current area of study to improve GO analysis focuses on the issue of interdependence between terms in the GO hierarchy, the problem being that many tools used to investigate GO enrichment search for enrichment on a term-for-term basis and do not account for correlations among terms along a path in the hierarchy [4]. Due to the detailed structure and incremental specificity of the GO database, as well as the correlation among enriched terms in a path, there will often be instances in which multiple categories from the same path appear in a list and differ only slightly, or not at all in gene content. When this occurs, often the parent term provides no additional information to the researcher or reader, especially when the terms themselves differ only in a small qualifier. An incorrect assumption can be made by the researcher that the multiplicity of similar categories increases the importance of the function or process to which they relate, whereas it is more likely that one group of genes is causing the inclusion of multiple terms. Accordingly, it makes analysis and presentation clearer when closely related terms containing the same genes are removed. However, this term removal can introduce another issue if the subset of terms to be presented is selected mainly due to the specific interest of the researcher. Reducing the size of GO lists in this manner typically uses arbitrary criteria and can be misleading, as the selected subset may not adequately reflect the entire dataset. Several tools and databases have been developed for the purpose of reducing the inclusion of terms of varying specificity. One type of method proposed to address this issue involves reducing the size of the input database. The Gene Ontology Consortium has produced the GO Slim database, which is a subset comprising more general GO terms [5]. Alternatively, the GO Fat database, developed as part of the Annotation Tool of the DAVID suite of bioinformatics resources, is a subset comprising more specific terms [6]. Both of these methods function by limiting information prior to the enrichment test and therefore do not fully utilize the complete GO database. An alternative method to reduce the resulting amount of enriched terms is through the use of multiple test corrections (MTC) during GO enrichment tests. These methods not only reduce the output dataset, but can also reduce false positives. Goeman and Mansmann [7] present a method that uses the structure of the GO hierarchy to perform MTC in a top-down, bottom-up, or bi-directional 'focus level' manner (working in both directions from a user-defined level of the hierarchy), depending on the desired objective of the researcher. Other methods use the full GO database for enrichment analysis and remove terms or modify relationships between terms during the enrichment test to address the interdependency issue [4,8]. Alexa et al. [8] present iterative bottom-up algorithms that either remove genes from parent categories when a child in the same path is significantly enriched (elim), or reduce the weight of genes in categories that have more significant neighbours in its path (weight). Grossman et al. [4] present the parent-child intersection/union algorithms to reduce the inheritance problem by investigating enrichment in the context of parent-child relationships; a term is significant only if the enrichment is due to its own enriched gene set, rather than due to genes inherited from other categories. Alterovitz et al. [9] have provided a method to investigate categories across a specified information level of the GO hierarchy by generating a numerical value of the "information content" of each GO term and thus could be used to reduce the size of the output dataset. Clustering GO terms with similar content has also been implemented in both the GOstat and DAVID tools [10,11]. Clustering GO terms using these tools increases an understanding of commonalities between terms due to containment of similar sets of genes. However, these methods do not select or remove terms from the total list of enriched terms, and therefore the large dataset remains. Additionally, these clusters do not use the GO hierarchy or follow the parent-child path, but rather cluster based on gene content alone. We have developed a simple, systematic method called GO Trimming for removal of redundancy from a GO category list after enrichment scores are given to terms, and is independent of any statistical package or analysis method. This method consists of an algorithm that is executed in two phases. We present an example of this process performed on a sample GO dataset, and highlight the categories that would be removed by the GO Trimming process according to different levels of stringency. Additionally, we compare GO Trimming to several of the aforementioned approaches performed on a second published dataset.

Algorithm

The GO Trimming algorithm is fully described by the flowchart in Figure 1 and is outlined here. GO Trimming consists of two phases, each requiring one pass through the list of significant GO terms. In the first phase, terms are connected to all other terms that share a common path by labelling with common identifiers (Figure 1A). In the second phase, terms are removed from the list based on levels of redundancy between terms found in a given path (Figure 1B).
Figure 1

GO Trimming algorithm flowchart. A) Phase 1 of GO Trimming: identification of parent-child relationships in GO hierarchy. B) Phase 2 of GO Trimming: strict and soft trimming using 0% and 40% uniqueness thresholds. Green boxes represent start and endpoints for the algorithm. Blue parallelograms represent input and output steps. Red rectangles represent an action required by the user and yellow diamonds represent questions that determine the flow of the algorithm. Input for the algorithm is the query list (list of enriched GO terms) and the GO tree (hierarchy of all GO terms). Output is a list of GO terms with soft and strict trimmed terms removed.

GO Trimming algorithm flowchart. A) Phase 1 of GO Trimming: identification of parent-child relationships in GO hierarchy. B) Phase 2 of GO Trimming: strict and soft trimming using 0% and 40% uniqueness thresholds. Green boxes represent start and endpoints for the algorithm. Blue parallelograms represent input and output steps. Red rectangles represent an action required by the user and yellow diamonds represent questions that determine the flow of the algorithm. Input for the algorithm is the query list (list of enriched GO terms) and the GO tree (hierarchy of all GO terms). Output is a list of GO terms with soft and strict trimmed terms removed. The list of significantly enriched GO terms is retrieved by the researcher (Figure 1A) from any tool that performs statistical testing on GO categories (e.g. GeneSpring GX (Agilent, Santa Clara, CA); GOstat [10]). In addition, each term must have the following associated information: the number of differentially regulated genes annotated with the term (labelled "Diff. Genes" in Table 1) and the total number of genes annotated with the GO term (e.g. all genes on a microarray annotated with the term; labelled "Total Genes" in Table 1). Since these are the numbers used to test for enrichment, these values should be available from the same tool used to generate the list of significant terms. The hierarchical relationships of these categories are then obtained from the GO tree whether from within a software package like GeneSpring GX, or from the web platform AmiGO [12] or a downloadable database [13]. GO categories are considered query terms and are sorted in ascending order of "Total Genes". This order of query terms is maintained throughout both phases. Care should be taken after sorting to ensure that in cases of equal totals for a pair of terms in a parent-child relationship, parents are ranked after children in the list to avoid missed identification.
Table 1

Sample Gene Ontology terms and GO Trimming results

IDGO AccessionGO Termp-valueDiff. GenesTotal Genes
1GO:0031532actin cytoskeleton reorganization0.00338512
2GO:0009620response to fungus3.04E-091012
3GO:0006037cell wall chitin metabolic process3.04E-091012
3GO:0044036cell wall macromolecule metabolic process3.04E-091012
3GO:0010383cell wall polysaccharide metabolic process3.04E-091012
4GO:0003796lysozyme activity3.04E-091012
GO:0001101response to acid3.04E-091012
5GO:0046578regulation of Ras protein signal transduction6.77E-04613
5GO:0051056regulation of small GTPase mediated signal transduction6.77E-04613
6GO:0008061chitin binding1.21E-081013
7GO:0006032chitin catabolic process1.21E-081013
3GO:0006030chitin metabolic process1.21E-081013
8GO:0004568chitinase activity1.21E-081013
9GO:0003746translation elongation factor activity0.00727514
3GO:0071554cell wall organization or biogenesis3.86E-081014
GO:0008643carbohydrate transport0.0226518
10GO:0005885Arp2/3 protein complex0.0226518
9GO:0006414translational elongation0.00485618
11GO:0045165cell fate commitment0.0284519
GO:0035091phosphoinositide binding0.00652619
7GO:0006026aminoglycan catabolic process2.29E-061019
12GO:0030126COPI vesicle coat0.0140622
12GO:0030663COPI coated vesicle membrane0.0140622
7GO:0000272polysaccharide catabolic process1.36E-061122
13GO:0003007heart morphogenesis0.0175623
12GO:0030660Golgi-associated vesicle membrane0.0215624
12GO:0030137COPI-coated vesicle0.0215624
14GO:0005859muscle myosin complex0.0215624
14GO:0016460myosin II complex0.0215624
15GO:0001725stress fiber0.00546724
15GO:0032432actin filament bundle0.00546724
15GO:0042641actomyosin0.00697725
3, 7GO:0006022aminoglycan metabolic process1.04E-051126
16GO:0030239myofibril assembly0.0134728
16GO:0031032actomyosin structure organization0.00441829
17GO:0016363nuclear matrix0.00134930
17GO:0034399nuclear periphery0.00174931
18GO:0007586digestion4.24E-061333
19GO:0001726ruffle0.0146835
20GO:0005529sugar binding6.58E-061439
3, 7GO:0005976polysaccharide metabolic process2.20E-041240
GO:0005938cell cortex0.0407842
16GO:0010927cellular component assembly involved in morphogenesis0.0174943
21GO:0022627cytosolic small ribosomal subunit1.20E-051546
22GO:0004322ferroxidase activity4.18E-071746
22GO:0016724oxidoreductase activity, oxidizing metal ions, oxygen as acceptor4.18E-071746
6GO:0030247polysaccharide binding7.26E-051447
6GO:0001871pattern binding7.26E-051447
22GO:0016722oxidoreductase activity, oxidizing metal ions1.18E-061749
23GO:0022625cytosolic large ribosomal subunit0.0428950
7GO:0016052carbohydrate catabolic process6.10E-051552
4, 8GO:0004553hydrolase activity, hydrolyzing O-glycosyl compounds5.05E-051657
21GO:0015935small ribosomal subunit1.58E-051758
24GO:0005200structural constituent of cytoskeleton0.02101159
23GO:0015934large ribosomal subunit0.03281163
19GO:0031252cell leading edge0.03641164
4, 8GO:0016798hydrolase activity, acting on glycosyl bonds2.74E-041665
6, 20GO:0030246carbohydrate binding1.64E-041875
2GO:0051707response to other organism0.004511576
21, 23GO:0022626cytosolic ribosome9.78E-062498
25GO:0016791phosphatase activity0.031417111
16GO:0032989cellular component morphogenesis0.016018111
25GO:0016311dephosphorylation0.036517113
26GO:0003735structural constituent of ribosome3.52E-0628118
21, 23GO:0033279ribosomal subunit5.92E-0628121
1, 16GO:0030036actin cytoskeleton organization0.0059721124
21, 23GO:0044445cytosolic part3.27E-0527125
7GO:0009057macromolecule catabolic process0.013121133
25GO:0042578phosphoric ester hydrolase activity0.045720142
GO:0006955immune response0.048620143
1, 16GO:0030029actin filament-based process0.018922146
27GO:0005615extracellular space0.0050724146
21, 23GO:0005840ribosome8.61E-0529146
19GO:0042995cell projection0.0077524151
10, 14, 15GO:0015629actin cytoskeleton0.028224169
1, 16GO:0007010cytoskeleton organization0.011529198
3, 7GO:0005975carbohydrate metabolic process0.012131216
9GO:0006412translation4.07E-0438229
27GO:0044421extracellular region part0.013533235
21, 23GO:0030529ribonucleoprotein complex0.0010143281
24, 26GO:0005198structural molecule activity2.31E-0553316
21, 23GO:0005829cytosol0.0027547330
27GO:0005576extracellular region0.0084146341
9GO:0003723RNA binding5.39E-0453356
9GO:0003676nucleic acid binding0.0086673588
13, 28GO:0007275multicellular organismal development0.020389764
10, 14, 15, 21, 23GO:0043232intracellular non-membrane-bounded organelle0.018695819
10, 14, 15, 21, 23GO:0043228non-membrane-bounded organelle0.018695819
11, 13, 16, 28GO:0032502developmental process0.0307105934
18, 28GO:0032501multicellular organismal process0.02731191067

List of experimentally-derived Gene Ontology terms significantly enriched (p-value ≤ 0.05; no MTC) by the differential gene list. Columns show identifiers for hierarchical connections, GO accession numbers, GO term descriptors, enrichment p-values, numbers of genes in differentially regulated gene list annotated with GO term, and total numbers of genes on microarray annotated with GO term. Terms in plain text are retained after GO Trimming with all uniqueness threshold values. Bolded terms are removed with a 0% threshold. Bolded and italicized terms are additionally removed when 40% is used as uniqueness threshold for GO Trimming, and bolded, italicized and underlined terms are additionally removed with a 50% threshold.

Sample Gene Ontology terms and GO Trimming results List of experimentally-derived Gene Ontology terms significantly enriched (p-value ≤ 0.05; no MTC) by the differential gene list. Columns show identifiers for hierarchical connections, GO accession numbers, GO term descriptors, enrichment p-values, numbers of genes in differentially regulated gene list annotated with GO term, and total numbers of genes on microarray annotated with GO term. Terms in plain text are retained after GO Trimming with all uniqueness threshold values. Bolded terms are removed with a 0% threshold. Bolded and italicized terms are additionally removed when 40% is used as uniqueness threshold for GO Trimming, and bolded, italicized and underlined terms are additionally removed with a 50% threshold. For any set of genes annotated with a common GO term, it can be said that this GO category contains this set of genes. For each term in the list of GO categories, all ancestors of the term (i.e. having parent-child relationships with the term) are examined, again starting with the term containing the fewest genes and moving towards broader terms. All parent-child paths are labelled with unique identifiers to mark the hierarchical connections between terms. Two terms may have a common parent, so each of these two children will be labelled with different identifiers, and the parent will be labelled with both identifiers. A term may have no parents present in the GO list, and therefore will have no identifier. Once the entire list has been processed in the first phase, two types of trimming can be applied (Figure 1B). The first more strict approach removes terms that are entirely redundant. The second approach, soft trimming, uses more relaxed stringency, and terms that are largely redundant can be removed. A uniqueness threshold was designed to filter terms based on the respective gene sets contained by the parent and child categories. A value of 0% is used for the strict approach and a value of 40% is used for the soft trimming approach. If the parent category contains the same set of genes as the child category (i.e. parent contains 0% unique genes), it is deemed fully redundant and removed from the strict list (and soft list). With respect to the soft trimming threshold, if the parent term contains 40% or fewer additional genes than the child term (e.g. the child term contains ten genes; the parent term contains these ten plus an additional four), the parent term is removed from the soft trimmed list. In both of these examples, the more specific child category is retained. If a category is involved in multiple paths in the list and so has more than one identifier, it can only be removed if a descendant shares all identifiers (i.e. has the same ID set) and indicates that the parent category is redundant. When both soft and strict trimming are performed concurrently, soft trimmed terms should not be removed from the list until the end of the process as they may still be used in the strict trimming approach. Note that some IDs may be processed multiple times through the course of the list with different untrimmed terms as the child. This allows for broader terms to be trimmed based on the representation of intermediate terms. During the second phase, it is still important to check for ancestry before trimming, since the structure of the GO hierarchy (i.e. a term can have multiple parents) may allow two terms to have the same ID set, yet not be in a parent-child relationship. As a final note, the 40% value is somewhat arbitrary, and can be raised or lowered, but provides a cut-off that can be used to eliminate terms that seem to be generally unworthy of separate discussion from related terms. Once trimming is complete, all terms that were soft or strict trimmed can be fully removed from the GO list and the reduced list can be presented as fully processed.

Testing

We present here an example of the use of this method in removing redundant terms from an enriched Gene Ontology term list. This sample dataset was taken from a recent experiment exploring the transcriptional effects of sea lice (Lepeophtheirus salmonis) on pink salmon (Oncorhynchus gorbuscha) [14]. Although the parameters for the data presented here are slightly different from those for the GO lists shown in this earlier experiment, the biological question and the majority of the information remains the same. An initial list of 90 GO terms (Table 1) were enriched from an input list of 3388 differentially regulated entities. GeneSpring GX 11.0 was used to determine GO enrichment and the hierarchical relationships between enriched terms; the GO database used was from November 04, 2010. Transcriptomic analysis can typically result in many enriched Gene Ontology terms, and in an attempt to reduce the number of terms enriched by chance, and to present a more manageable example dataset, only GO terms containing 5 or more differentially expressed genes were retained. This pre-filtering was done simply by imposing a threshold on the "Diff. Genes" value in a spreadsheet. This is an independent step from GO Trimming and should not have a substantial influence on the procedure. The DAVID tool [11] for finding enrichment offers a gene count threshold as well (default is 2) for the reason that terms with very few genes are less trustworthy as real trends. Through the GO Trimming process with 0%, 40%, and 50% uniqueness thresholds, the list was substantially reduced (Table 1), with 19 of 90 terms identified as completely redundant (0% threshold; bolded text). With the use of a 40% soft trimming threshold, another 15 terms were found to be largely redundant (bolded and italicized text). To show the relative flexibility of the threshold value, we also performed the procedure with a threshold of 50% (bolded, italicized and underlined text). With this reduced stringency, only two additional terms were removed from the list when compared with the list after the use of the 40% threshold. Note that several of the most specific and the most general terms are retained, and many of intermediate specificity are discarded. Looking at the list trimmed using the conservative approach (0% threshold), 19 of 90 terms were shown to be redundant. For example, "oxidoreductase activity, oxidizing metal ions" and "oxidoreductase activity, oxidizing metal ions, oxygen as acceptor" are removed and the term "ferroxidase activity" is retained. With slightly less stringency, we can remove a number of terms that offer little additional information to the analysis. For example, "polysaccharide catabolic process" adds only one gene to the set of those annotated with the term "chitin catabolic process". Highly similar terms that remain in the list are often a result of being from different GO domains, such as Biological Process and Molecular Function, the top-level categories of "chitin catabolic process" and "chitinase activity", respectively. Also "sister" terms that appear quite similar but are not in a parent-child relationship (e.g. "cell wall chitin metabolic process" and "chitin catabolic process") cannot be eliminated because they are from different hierarchical paths, and therefore may refer to distinct processes or functions. In addition to this sample dataset, we performed a comparison of GO Trimming with other methods that attempt to ease interpretation or take into account the interdependencies in the hierarchy during enrichment testing. For this comparison, zebrafish (Danio rerio) was deemed to be a suitable organism of study because each of the methods we wished to compare provided the ability to use a ZFIN identifier [15] in annotating genes with GO terms. Accordingly, we found a microarray experiment studying hypoxia in D. rerio [16] that resulted in a large number of differentially expressed genes using a well annotated array [17]. Of 1520 significantly differentially regulated entities, 1017 had official gene symbols which were used to link to ZFIN IDs in the GO Consortium's ZFIN annotation file (May 27, 2011) [18]. 617 genes had a corresponding ZFIN ID with GO annotation. This set of 617 ZFIN IDs was the sample dataset or the list of differentially regulated genes. In the same manner, of 42990 entities on the whole array, 24888 had a gene symbol, of which 12674 had a linked ZFIN ID associated with GO annotation. This was the population or total set of genes. It is apparent that level of annotation and choice of statistical test have a large influence on the results of enrichment testing. For example, GOstat uses a χ2 test and does not permit custom annotation files [10]. Therefore, the default ZFIN annotation database was used, which appeared to have annotation for all but 17 terms in the sample list. This resulted in a large difference in significant terms when compared to the traditional term-for-term method employed in the Ontologizer (175 vs. 236 terms; p-value ≤ 0.1) [19]. Additionally, GOstat did not include an option for a 0.05 p-value cut-off. The DAVID tool also used its own associations with ZFIN IDs, resulting in a different list of enriched terms. Due to the differences in statistical tests and annotation, we restricted the formal comparison to those tests which could be performed using the Ontologizer [19], including elim, weight, and the parent-child methods [4,8]. The traditional term-for-term method was also included, and the output of this method was used as the input list for the GO Trimming process. A p-value cut-off of 0.05 was employed, and no MTC was used, as the impact of MTC may differ between methods. In this sample dataset, no gene count threshold was used. The significantly enriched terms (p-value ≤ 0.05) resulting from each method are presented in Additional file 1. Any term enriched through one or more methods is listed in the table, and the enrichment is represented by a p-value. In summary, the term-for-term method resulted in the largest number of enriched terms (147), followed by elim (137), weight (86), GO Trimming (80), parent-child union (78), and parent-child intersection (46). The term-for-term output included all but 24 terms; the elim, parent-child intersection and parent-child union methods resulted in the inclusion of some additional terms. We used the D. rerio dataset on two other methods: GOstat clustering [10] and DAVID clustering [11] (data not shown). Instead of reducing the number of terms produced by enrichment testing, the significant terms are clustered into groups that aim to improve interpretation of the results.

Discussion

GO Trimming was designed with the idea of reducing redundancy while fully utilizing the size and detail of the GO database. We believe this method is versatile and can be tailored to the needs of researchers while still being systematic by nature so that it can be easily integrated into an analysis workflow. The first sample dataset (Table 1) provides a good example of what GO Trimming does and does not do. In this example, there is no real biological information lost to the researcher through the trimming process. Nor in fact, is there any biological information added that was not already present. The p-value of individual terms is not adjusted. Neither does GO Trimming serve to intentionally eliminate false positives. In fact this process is independent of MTC, as MTC can be applied during the enrichment testing, and GO Trimming may be performed on the output list. After trimming, the information in Table 1 becomes more focused and balanced, making interpretation easier. Redundant terms no longer overwhelm the list as in the cases of IDs 3 and 12. Terms in a unique path, such as "immune response", a biologically important term, do not become lost in long lists [20]. Once redundant terms are removed, such as the terms related to polysaccharide, actin, and oxidoreductase functions, it becomes easier to consider and present the entire list, and terms such as "ruffle", which was important in the context of the experiment [14], can come to the forefront. Additionally, it becomes more feasible to present the list in the manuscript, instead of picking out only a select few to discuss. Ultimately it is up to the researcher how to interpret and discuss results, however GO Trimming provides a way to assist in this process. Although the GO Trimming output list is easier to manage and interpret than an untrimmed list, it is important to use a non-destructive workflow where the full list of terms is retained for potential further exploration of specific results. We encourage researchers to append the full GO category list to published articles as supplemental documents, but for general table and text presentation, the trimmed list should be used. Furthermore, working with GO lists in this way can familiarize the researcher to the general patterns and functions present in the data. Adding identifiers to parent-child relationships in the GO lists not only assists with the trimming process, but also connects terms with related functions and properties, allowing for ease in locating reoccurring themes. For example in the sample dataset above (Table 1), "actin cytoskeleton reorganization" and "myofibril assembly" share the parent "actin cytoskeleton organization and biogenesis". Terms with multiple identifiers represent a synthesis of information, as they represent the union of multiple paths in the GO hierarchy. Alternatively, terms without identifiers are those that have no parents or children present in the list. This in itself provides some information about the term and the associated genes. Increased understanding of the connections between terms will allow for increased comprehension of the processes under investigation. Understanding these connections between terms is a similar benefit to that offered by GOstat or DAVID clustering [10,11]. These clustering methods cannot be directly compared to GO Trimming, primarily due to the structure of the output. Instead of reducing terms in the output lists, terms are organized into categories that share information. One benefit of GO Trimming is the more manageable presentation. With clustering, either the researcher can present all terms, which can result in very sizeable lists, or the researcher can select a representative from each cluster to present. If a representative is selected, such as the most significant term, a main function or process may be preserved, but other valuable information could be lost. With respect to GOstat specifically, the clustering method is highly inclusive while creating clusters. Any term containing a subset of genes annotated to another term will be clustered together. This does not take into account the GO hierarchy, which in some cases may be beneficial, in that closely related terms under difference roots (e.g. "biological process" and "molecular function") can be clustered together. However, this can also result in more disparate terms being grouped into a cluster, simply by containing common genes. Regarding the clustered output of DAVID, non-significant terms appear to be included, which could be removed after clustering. Also, clusters can consist entirely of terms that are essentially redundant. Overall, these clustering methods can aid in interpretation of results, however the problem of redundancy is not addressed, the GO hierarchy is not taken into account, and clusters can be too inclusive. The comparison between GO Trimming and methods employed by the Ontologizer [19] provides insight into benefits and drawbacks of each method. There are a few trends identifiable based on specificity of enriched terms (Additional file 1). It is apparent that elim and weight methods produce more specific enriched terms and fewer general terms. This is not necessarily negative, since specific terms are arguably more interesting and informative to a researcher, although it can be informative to examine higher level terms. One major benefit of Gene Ontology is the ability to identify functions and processes at different depths [21]. Furthermore, there appear to be many redundant terms enriched using the elim and weight methods through the Ontologizer (e.g. many parents of "negative regulation of neutrophil chemotaxis"). The parent-child union method seems to result in a much lower level of redundancy and provides a lighter but still informative set of terms. The method behind it appears to be strong, in checking for enrichment of a term in the context of its parent(s), but it too results in some redundancy in the list (e.g. "branching morphogenesis of a tube", "morphogenesis of branching epithelium", "morphogenesis of a branching structure"; Additional file 1). The parent-child intersection method seems to result in much fewer terms being enriched, and may be too stringent, resulting in information being lost. These methods are more directed towards decorrelating terms from each other so as to minimize the effect of genes being inherited through the hierarchy and to reduce false positives [4]. Compared to these other methods, GO Trimming is highly effective at reducing redundancy at both specific and general levels. For example, regarding the parents of "negative regulation of neutrophil chemotaxis", GO Trimming removes 10-12 closely related terms, many of which are included in the results of the other methods (with the exception of parent-child intersection). At the general level, many parent terms of "ATP binding" are removed by GO Trimming. While it does not address the issue of false positives, GO Trimming specifically targets and reduces redundancy without losing information, which may occur through more stringent methods.

Conclusion

We have a developed a novel and important method for systematically reducing redundancy in Gene Ontology datasets. The simplicity of this method allows for ease of incorporation into a typical transcriptomic workflow, while still using the full structure of the GO hierarchy. It focuses on improving interpretation and presentation, and compares well against other GO enrichment methods that take into consideration interdependencies in the GO hierarchy. With the exception of the stringent parent-child intersection method, the resulting list of terms contains the least redundancy, offering a cleaner, more focused representation of the dataset. With this method, researchers are able to analyze and present terms in a way that will provide the most information about the genes and systems being studied.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SGJ and BJGS developed the algorithm and drafted the manuscript. DRM implemented the algorithm. BFK provided discussion of ideas and assisted in preparing the manuscript. All authors read and approved the final manuscript.

Additional file 1

Supplementary Table 1. Comparison of GO Trimming and enrichment methods on . Using the Ontologizer tool, a number of methods produce statistically enriched GO terms (p-value ≤ 0.05; no MTC) from a set of 617 differentially regulated genes. The union of GO terms enriched by one or more methods is presented, along with the p-values each method produced for enriched terms, sorted by "Total Genes" from specific terms to general terms. GO Trimming was performed on the output of the traditional term-for-term method using a 40% soft trimming threshold. P-values from the term-for-term method are presented for those terms retained by the GO Trimming method. Click here for file
  15 in total

1.  Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors:  M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal:  Nat Genet       Date:  2000-05       Impact factor: 38.330

2.  DAVID: Database for Annotation, Visualization, and Integrated Discovery.

Authors:  Glynn Dennis; Brad T Sherman; Douglas A Hosack; Jun Yang; Wei Gao; H Clifford Lane; Richard A Lempicki
Journal:  Genome Biol       Date:  2003-04-03       Impact factor: 13.583

3.  GOstat: find statistically overrepresented Gene Ontologies within a group of genes.

Authors:  Tim Beissbarth; Terence P Speed
Journal:  Bioinformatics       Date:  2004-02-12       Impact factor: 6.937

4.  Ontological analysis of gene expression data: current tools, limitations, and open problems.

Authors:  Purvesh Khatri; Sorin Drăghici
Journal:  Bioinformatics       Date:  2005-06-30       Impact factor: 6.937

5.  Improved scoring of functional groups from gene expression data by decorrelating GO graph structure.

Authors:  Adrian Alexa; Jörg Rahnenführer; Thomas Lengauer
Journal:  Bioinformatics       Date:  2006-04-10       Impact factor: 6.937

6.  Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis.

Authors:  Steffen Grossmann; Sebastian Bauer; Peter N Robinson; Martin Vingron
Journal:  Bioinformatics       Date:  2007-09-11       Impact factor: 6.937

7.  Hypoxia alters gene expression in the gonads of zebrafish (Danio rerio).

Authors:  Dalma Martinovic; Daniel L Villeneuve; Michael D Kahl; Lindsey S Blake; Jeffrey D Brodin; Gerald T Ankley
Journal:  Aquat Toxicol       Date:  2008-09-19       Impact factor: 4.964

8.  GO PaD: the Gene Ontology Partition Database.

Authors:  Gil Alterovitz; Michael Xiang; Mamta Mohan; Marco F Ramoni
Journal:  Nucleic Acids Res       Date:  2006-11-10       Impact factor: 16.971

9.  DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists.

Authors:  Da Wei Huang; Brad T Sherman; Qina Tan; Joseph Kir; David Liu; David Bryant; Yongjian Guo; Robert Stephens; Michael W Baseler; H Clifford Lane; Richard A Lempicki
Journal:  Nucleic Acids Res       Date:  2007-06-18       Impact factor: 16.971

10.  Generation of Gene Ontology benchmark datasets with various types of positive signal.

Authors:  Petri Törönen; Petri Pehkonen; Liisa Holm
Journal:  BMC Bioinformatics       Date:  2009-10-07       Impact factor: 3.169

View more
  37 in total

1.  Mural lymphatic endothelial cells regulate meningeal angiogenesis in the zebrafish.

Authors:  Neil I Bower; Katarzyna Koltowska; Cathy Pichol-Thievend; Isaac Virshup; Scott Paterson; Anne K Lagendijk; Weili Wang; Benjamin W Lindsey; Stephen J Bent; Sungmin Baek; Maria Rondon-Galeano; Daniel G Hurley; Naoki Mochizuki; Cas Simons; Mathias Francois; Christine A Wells; Jan Kaslin; Benjamin M Hogan
Journal:  Nat Neurosci       Date:  2017-05-01       Impact factor: 24.884

2.  Assessing identity, redundancy and confounds in Gene Ontology annotations over time.

Authors:  Jesse Gillis; Paul Pavlidis
Journal:  Bioinformatics       Date:  2013-01-06       Impact factor: 6.937

3.  A proteomic portrait of dinoflagellate chromatin reveals abundant RNA-binding proteins.

Authors:  Mathieu Beauchemin; David Morse
Journal:  Chromosoma       Date:  2017-08-29       Impact factor: 4.316

4.  Using predictive specificity to determine when gene set analysis is biologically meaningful.

Authors:  Sara Ballouz; Paul Pavlidis; Jesse Gillis
Journal:  Nucleic Acids Res       Date:  2017-02-28       Impact factor: 16.971

Review 5.  Pathway analysis for genome-wide genetic variation data: Analytic principles, latest developments, and new opportunities.

Authors:  Micah Silberstein; Nicholas Nesbit; Jacquelyn Cai; Phil H Lee
Journal:  J Genet Genomics       Date:  2021-02-26       Impact factor: 4.275

6.  Drosophila MOV10 regulates the termination of midgut regeneration.

Authors:  Masahiko Takemura; Nanako Bowden; Yi-Si Lu; Eriko Nakato; Michael B O'Connor; Hiroshi Nakato
Journal:  Genetics       Date:  2021-05-17       Impact factor: 4.562

7.  Early transcriptional changes in the reef-building coral Acropora aspera in response to thermal and nutrient stress.

Authors:  Nedeljka Rosic; Paulina Kaniewska; Chon-Kit Kenneth Chan; Edmund Yew Siang Ling; David Edwards; Sophie Dove; Ove Hoegh-Guldberg
Journal:  BMC Genomics       Date:  2014-12-02       Impact factor: 3.969

8.  The central nervous system transcriptome of the weakly electric brown ghost knifefish (Apteronotus leptorhynchus): de novo assembly, annotation, and proteomics validation.

Authors:  Joseph P Salisbury; Ruxandra F Sîrbulescu; Benjamin M Moran; Jared R Auclair; Günther K H Zupanc; Jeffrey N Agar
Journal:  BMC Genomics       Date:  2015-03-11       Impact factor: 3.969

9.  Regulation of the ESC transcriptome by nuclear long noncoding RNAs.

Authors:  Jan H Bergmann; Jingjing Li; Mélanie A Eckersley-Maslin; Frank Rigo; Susan M Freier; David L Spector
Journal:  Genome Res       Date:  2015-06-05       Impact factor: 9.043

10.  Small-scale gene duplications played a major role in the recent evolution of wheat chromosome 3B.

Authors:  Natasha M Glover; Josquin Daron; Lise Pingault; Klaas Vandepoele; Etienne Paux; Catherine Feuillet; Frédéric Choulet
Journal:  Genome Biol       Date:  2015-09-09       Impact factor: 13.583

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.