Bart H J van den Berg, Chamali Thanthiriwatte, Prashanti Manda, Susan M Bridges.
Abstract
The widespread availability of microarray technology has driven functional genomics to the forefront as scientists seek to draw meaningful biological conclusions from their microarray results. Gene annotation enrichment analysis is a functional analysis technique that has gained widespread attention and for which many tools have been developed. Unfortunately, most of these tools have limited support for agricultural species. Here, we evaluate and compare four publicly available computational tools (Onto-Express, EasyGO, GOstat, and DAVID) that support analysis of gene expression datasets in agricultural species. We use AgBase as the functional annotation reference for agricultural species. The selected tools were evaluated based on i) available features, usage and accessibility, ii) implemented statistical computational methods, and iii) annotation and enrichment performance analysis. Annotation was assessed using a randomly selected test gene annotation set and an experimental differentially expressed gene set, both from chicken. The experimental set was also used to evaluate identification of enriched functional groups.
Comparison of the tools shows that they produce different sets of annotations for the two datasets and different functional groups for the experimental dataset. While DAVID, GOstat and Onto-Express annotate comparable numbers of genes, DAVID provides by far the most annotations per gene. However, many of DAVID's annotations appear to be redundant or are at very high levels in the GO hierarchy. The GOSlim distribution of annotations shows that GOstat, Onto-Express and EasyGO provide GO distributions similar to those found in AgBase, while annotations from DAVID show a different GOSlim distribution, again probably due to duplication and many non-specific terms. No consistent trends were found in the results of GO term over/under-representation analysis applied to the experimental data using the different tools.
While GOstat, DAVID and Onto-Express could retrieve some significantly enriched terms, EasyGO did not show any significantly enriched terms. There was little agreement about the enriched terms identified by the tools.
Year: 2009 PMID: 19811693 PMCID: PMC3226198 DOI: 10.1186/1471-2105-10-S11-S9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Standard set of tool parameters
| Parameter | Value |
|---|---|
| Maximum p value | 0.10 |
| Maximum GO depth | 5 |
| False discovery correction | FDR |
| Statistical method | Onto-Express & EasyGO: Hypergeometric |
A consistent set of tool parameters was used where possible to make the results more comparable. Note that no single statistical method was available in all tools.
Identifier mapping for experimental data set
| Identifier | FHCRC whole array | FHCRC differentially expressed |
|---|---|---|
| Probe ID | 15227 | 53 |
| Entrez Gene ID | 9277 | 33 |
| UniProtKB accession no. | 8838 | 33 |
A variety of gene identifiers are accepted as input by the evaluated tools. Entrez Gene IDs and UniProtKB accession numbers corresponding to each Array ID were retrieved to make the data sets compatible with each tool. Not all EST sequences on the microarrays have corresponding identifiers in all databases.
Statistical tests implemented in evaluated tools
| Tool | Chi-Square | Hypergeometric | Fisher's Exact | Binomial |
|---|---|---|---|---|
| Onto-Express | √ | √ | √ | √ |
| EasyGO | √ | √ | √ | |
| GOstat | √ | √ | ||
| DAVID | | | √ * | |
The subset of tools selected provides a wide variety of statistical tests for the significance of gene annotation enrichment analysis.
* Modified Fisher's exact test known as EASE.
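As a rough illustration of what the hypergeometric option computes, the sketch below gives the upper-tail probability of seeing at least `k` genes annotated to a GO term in a list of `n` selected genes, when `K` of the `N` genes on the array carry that term. The one-sided Fisher's exact test for over-representation yields the same tail probability. The function name and the term counts `K = 200` and `k = 5` are hypothetical; only `N = 9277` and `n = 33` come from the Entrez Gene ID counts in the identifier mapping table. This is not a description of how any of the evaluated tools is implemented.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k: selected genes annotated to the term
    n: selected (e.g. differentially expressed) genes
    K: genes on the array annotated to the term
    N: all genes on the array
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probability of every outcome at least as extreme as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# N and n are taken from the identifier mapping table;
# K and k are hypothetical counts chosen only for illustration.
p_value = hypergeom_enrichment_p(k=5, n=33, K=200, N=9277)
print(f"p = {p_value:.3g}")
```

A term would then be reported as enriched if its p-value, after multiple testing correction, falls at or below the chosen maximum p-value (0.10 in the standard parameter set).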
Multiple testing correction methods implemented in evaluated tools
| Tool | Benjamini FDR | Yekutieli FDR | Holm p-value | Bonferroni | Sidak |
|---|---|---|---|---|---|
| Onto-Express | √ | √ | √ | √ | |
| EasyGO | √ | ||||
| GOstat | √ | √ | √ | ||
| DAVID | √ | √ | √ |
Multiple testing correction is used to correct for the occurrence of false positive identifications by adjusting p-values derived from multiple statistical tests.
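The adjustment can be sketched for two of the listed methods, Bonferroni and Benjamini FDR (taken here to mean the Benjamini-Hochberg step-up procedure, which is an assumption about the tools' labels). The function names and the four p-values are hypothetical; the 0.10 cutoff is the maximum p-value from the standard parameter set.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
significant_bh = [p <= 0.10 for p in bh]
print(bonf, bh, significant_bh)
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, so the same raw p-values can pass the cutoff under FDR correction and fail it under Bonferroni. This is one reason results from tools with different correction options are hard to compare directly.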
Annotation performance
| Data set | Tool | # Genes input | # Genes recognized | # Genes annotated | # Annotations retrieved |
|---|---|---|---|---|---|
| Test Set | Onto-Express | 60 | 60 | 56 | 313 |
| Test Set | EasyGO | 60 | 56 | 45 | 339 |
| Test Set | GOstat | 60 | 60 | 56 | 303 |
| Test Set | DAVID | 60 | 60 | 58 | 1662 |
| Test Set | AgBase | 60 | 60 | 49 | 474 |
| Experimental Set | Onto-Express | 31 | 29 | 24 | 328 |
| Experimental Set | EasyGO | 31 | 31 | 21 | 104 |
| Experimental Set | GOstat | 31 | 31 | 25 | 227 |
| Experimental Set | DAVID | 31 | 26 | 26 | 615 |
| Experimental Set | AgBase | 31 | 27 | 22 | 136 |
For each tool, the number of gene identifiers used as input, the number of genes recognized, the number of genes for which some GO annotation was retrieved, and the total number of annotations for all genes are given for both the Test Set and the Experimental Set.
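The "annotations per gene" comparison follows directly from the last two columns. The short sketch below divides annotations retrieved by genes annotated for the rows with 60 input genes (assumed here to be the Test Set); the variable names are illustrative only.

```python
# (genes annotated, annotations retrieved) from the rows with 60 input genes.
annotation_counts = {
    "Onto-Express": (56, 313),
    "EasyGO": (45, 339),
    "GOstat": (56, 303),
    "DAVID": (58, 1662),
    "AgBase": (49, 474),
}

# Mean number of GO annotations per annotated gene for each tool.
per_gene = {
    tool: annotations / genes
    for tool, (genes, annotations) in annotation_counts.items()
}

for tool, value in sorted(per_gene.items(), key=lambda item: item[1]):
    print(f"{tool:12s} {value:6.2f}")
```

On these figures DAVID returns roughly 29 annotations per annotated gene, against about 5 to 10 for the other tools and AgBase.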
Figure 1. Comparison of GOSlim distributions for the Test Set. The distribution of the Gene Ontology annotations in the Test Set in different GOSlim categories was computed for the three GO ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC) using GOSlimViewer at AgBase. AgBase serves as a baseline of retrieved annotations.
Gene annotation enrichment analysis
| Tool (Experimental Set) | BP | MF | CC |
|---|---|---|---|
| Onto-Express | 81 | 19 | 6 |
| EasyGO | 1 | 1 | 1 |
| GOstat | 0 | 5 | -1* |
| DAVID | 33 | 38 | 8 |
The numbers of GO terms in the Experimental Set found to be enriched for each ontology (Biological Process = BP, Molecular Function = MF, Cellular Component = CC) are given for the parameters listed in Table 1.
* Under-represented GO term.
Figure 2. Comparison of GOSlim distributions for the Experimental Set. The distribution of the Gene Ontology annotations in the Experimental Set in different GOSlim categories was computed for the three GO ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC) using GOSlimViewer at AgBase. AgBase serves as a baseline of retrieved annotations.