Mikhail Jiline, Stan Matwin, Marcel Turcotte.
Abstract
MOTIVATION: Annotation Enrichment Analysis (AEA) is a widely used approach for processing data generated by high-throughput genomic and proteomic experiments such as gene expression microarrays. The analysis uncovers and summarizes discriminating background information (e.g. GO annotations) for sets of genes identified by experiments (e.g. a set of differentially expressed genes, a cluster). Human experts then use the discovered information to find biological interpretations of the experiments. However, AEA isolates and tests for overrepresentation only individual annotation terms or groups of similar terms, and is limited in its ability to uncover complex phenomena involving relationships between multiple annotation terms from various knowledge bases. Also, AEA assumes that annotations describe the whole object of interest, which makes it difficult to apply to sets of compound objects (e.g. sets of protein-protein interactions) and to sets of objects having an internal structure (e.g. protein complexes).
Year: 2011 PMID: 21743060 PMCID: PMC3157920 DOI: 10.1093/bioinformatics/btr337
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. AEA approach. The Study Set is a set of genes identified by an experiment (such as differentially expressed genes or one of the clusters). The Universe Set is the set of all genes that participated in the experiment, or some other reference set of genes against which the study set is compared. The Annotation Database is a source of annotations attached to genes. The result of the analysis is a set of annotations that are over- or underrepresented in the Study Set compared with the Universe Set.
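The comparison described in Fig. 1 can be illustrated with a one-sided hypergeometric test, a common choice for over-representation analysis. The record above does not state which test is used, so the following is a minimal sketch under that assumption; the function name `enrichment_p_value` and the gene identifiers are hypothetical.

```python
from math import comb


def enrichment_p_value(study, universe, annotated):
    """One-sided hypergeometric p-value P(X >= k) for over-representation.

    study     -- genes identified by the experiment (Study Set)
    universe  -- all genes considered (Universe Set)
    annotated -- genes carrying the annotation term under test
    """
    universe = set(universe)
    study = set(study) & universe          # ignore genes outside the universe
    annotated = set(annotated) & universe
    N, K, n = len(universe), len(annotated), len(study)
    k = len(study & annotated)             # annotated genes observed in the study set
    # Sum hypergeometric probabilities for k, k+1, ..., min(K, n) annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical toy data: 10 genes, 3 annotated, study set of 3 containing 2 annotated.
universe = [f"g{i}" for i in range(1, 11)]
annotated = {"g1", "g2", "g3"}
study = {"g1", "g2", "g4"}
p = enrichment_p_value(study, universe, annotated)
print(f"p = {p:.4f}")
```

For this toy input the tail probability is 22/120, so the term would not be reported as enriched at conventional thresholds. In practice one such p-value is computed per annotation term and the results are corrected for multiple testing.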
Fig. 2. Bag-of-annotations data model.
Typical types of background knowledge
| Type | Description | Source |
|---|---|---|
| Annotations | Annotations are associative relations between objects of interest in a study set and objects in annotation databases. This is the part of knowledge typically covered by the bag-of-annotations representation. For logic-based representation, annotation relations may also include attributes characterizing confidence in the annotation (for non-curated data sources), source of the annotation, etc. | Content of biological databases |
| Structured background knowledge | Structured background knowledge reflects relations between annotations themselves. Typically, it contains the definition of ontology or a map of annotation terms. | Meta information about a biological database |
| Expert knowledge | Expert knowledge contains higher level relations about annotation terms and their organization that are not directly expressed by the structured knowledge. For example, for ontology analysis it is customary to add notions such as parent, child, sibling; for a graph, it is neighbor, clique and node distance. | Experts in bioinformatics and biology, published research based on data from biological databases. |
| Other knowledge | Other knowledge may include information describing phenotypes tested, environmental impact, experimental setup, etc. | Experiment description |
Logic-based representation of GO annotations
| Type | Formula | Comments |
|---|---|---|
| Annotations | go_annotation(aah1,go_0005634,c). | The formula states that gene AAH1 is annotated with GO category GO:0005634 from the component ontology. |
| Structured background knowledge | go_is_a(go_0044424,go_0044464). go_part_of(go_0044424,go_0005622). | The formulas define relations between GO categories. The whole GO directed acyclic graph can be represented in this way. |
| Expert knowledge | go_anc(A,P) :- go_is_a(A,P). go_anc(A,P) :- go_is_a(A,X), go_anc(X,P). go_sibling(A,B) :- go_is_a(A,P), go_is_a(B,P). | The formulas define useful relations on a graph such as ancestor and sibling. |
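The ancestor and sibling rules in the table above can be mirrored procedurally. The sketch below is a Python analogue, not the authors' implementation: `ancestors` follows `is_a` edges transitively, as the recursive `go_anc` rule does, and `siblings` returns terms sharing a direct parent. Unlike the `go_sibling` rule as written, the sketch excludes the term itself, which is an assumption. Only the edge from `go_0044424` to `go_0044464` comes from the table; the terms `hyp_child` and `hyp_other` are hypothetical.

```python
def ancestors(term, is_a):
    """All ancestors of `term`, following is_a edges (child -> set of parents)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:      # visit each ancestor once
                seen.add(parent)
                stack.append(parent)
    return seen


def siblings(term, is_a):
    """Terms other than `term` that share at least one direct parent with it."""
    parents = is_a.get(term, set())
    return {other for other, ps in is_a.items() if other != term and ps & parents}


# go_0044424 -> go_0044464 is taken from the table; the other edges are hypothetical.
is_a = {
    "go_0044424": {"go_0044464"},
    "hyp_child": {"go_0044424"},
    "hyp_other": {"go_0044464"},
}

print(sorted(ancestors("hyp_child", is_a)))
print(sorted(siblings("go_0044424", is_a)))
```

The iterative traversal avoids deep recursion on large ontologies; a logic-programming system would instead derive the same relations from the rules directly.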
Fig. 4. An example of a synthesized annotation concept. Preservation of the internal structure of the object ‘genetic interaction’ allows reasoning on and inferring statements dealing with the substructure of the object under analysis. In particular, in this experiment we were able to utilize quantifiers such as both…, any…, one… to better understand the relationship between GO categories and the set of interacting genes.
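The quantifiers mentioned in the Fig. 4 caption can be illustrated on a single interaction, treated as a pair of genes. This is a minimal sketch: the function names are hypothetical, "one" is assumed to mean exactly one gene of the pair, and the partner gene `hyp_gene` with its annotation is invented for illustration. Only the annotation of `aah1` with `go_0005634` comes from the record.

```python
def both_annotated(pair, category, annotations):
    """True if both genes of the interaction carry the GO category."""
    return all(category in annotations.get(gene, set()) for gene in pair)


def any_annotated(pair, category, annotations):
    """True if at least one gene of the interaction carries the GO category."""
    return any(category in annotations.get(gene, set()) for gene in pair)


def one_annotated(pair, category, annotations):
    """True if exactly one gene of the interaction carries the GO category (assumed meaning)."""
    return sum(category in annotations.get(gene, set()) for gene in pair) == 1


# aah1 / go_0005634 appears in the record; hyp_gene and go_hyp are hypothetical.
annotations = {
    "aah1": {"go_0005634"},
    "hyp_gene": {"go_hyp"},
}
interaction = ("aah1", "hyp_gene")

print(both_annotated(interaction, "go_0005634", annotations))
print(any_annotated(interaction, "go_0005634", annotations))
print(one_annotated(interaction, "go_0005634", annotations))
```

A bag-of-annotations model would merge the annotations of both genes into one set and could only express the "any" case, which is why keeping the pair structure matters here.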
Fig. 3. An example of a synthesized annotation concept. chromosome_num predicate describes the relation between a gene and a chromosome, chromosome_loc tests the location of the gene on a chromosome against a learned model (the location is specified in base pairs, the predicate parameters are populated with the mean and variance of two normal distributions modeling the study and universe sets), go_category specifies the relation between a gene and a GO term.
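The caption of Fig. 3 says that chromosome_loc is parameterized by the mean and variance of two normal distributions but does not give the decision rule. The sketch below assumes one plausible rule: the predicate holds when the location is more likely under the study-set model than under the universe-set model. The rule, the function signature and all numeric values are assumptions for illustration only.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2.0 * variance)) / sqrt(2.0 * pi * variance)


def chromosome_loc(position_bp, study_mean, study_var, universe_mean, universe_var):
    """Assumed rule: true if the position fits the study model better than the universe model."""
    study_density = normal_pdf(position_bp, study_mean, study_var)
    universe_density = normal_pdf(position_bp, universe_mean, universe_var)
    return study_density > universe_density


# Hypothetical parameters: study genes cluster tightly near 1 Mbp,
# universe genes are spread widely around 5 Mbp.
params = dict(study_mean=1_000_000, study_var=1e10,
              universe_mean=5_000_000, universe_var=4e12)

print(chromosome_loc(1_000_000, **params))
print(chromosome_loc(5_000_000, **params))
```

With these parameters a gene at 1 Mbp satisfies the predicate and a gene at 5 Mbp does not, so the predicate effectively selects a chromosomal region over-populated by study-set genes.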
Microarray datasets
| Dataset name | Dataset description |
|---|---|
| Bioconductor ALL | Data of T- and B-cell acute lymphocytic leukemia from the Ritz Laboratory at the DFCI |
| GSEA gender | Transcriptional profiles from male and female lymphoblastoid cell lines |
| GSEA p53 | Transcriptional profiles from p53+ and p53 mutant cancer cell lines |
| GSEA diabetes | Transcriptional profiles of smooth muscle biopsies of diabetic and normal individuals |
| GSEA leukemia | Transcriptional profiles from leukemias: ALL and AML |
| GSEA lung cancer | Transcriptional profiles from lung cancer outcome datasets |
Annotation databases for microarray datasets
| Name | Description |
|---|---|
| GO | Gene ontology, released October 2009; gene ontology annotation for human, released October 2009 |
| GCM (gene to chromosome mapping) | Gene to chromosome mapping (chromosome, chromosome band, and start/end base pairs) from Ensembl 56 database |
| GO + GCM | Combination of the two annotation sources above |
Quantitative performance evaluation of AEA and ACSEA on gene expression microarray datasets
| Dataset | Annotations | QvAvr1 | QvAvr5 | QvAvr10 | QvAvr25 | ||||
|---|---|---|---|---|---|---|---|---|---|
| AEA | ACSEA | AEA | ACSEA | AEA | ACSEA | AEA | ACSEA | ||
| ALL | GO | 6.33e-02 | 1.45e-01 | 3.41e-01 | 7.36e-01 | ||||
| GCM | 4.55e-01 | 8.91e-01 | 9.45e-01 | 9.78e-01 | |||||
| GO + GCM | 1.30e-01 | 2.33e-01 | 5.88e-01 | 8.35e-01 | |||||
| GSEA Gender | GO | 8.25e-04 | 2.29e-02 | 2.76e-01 | 7.11e-01 | ||||
| GCM | 1.25e-04 | 2.22e-01 | 6.11e-01 | 8.44e-01 | |||||
| GO + GCM | 8.12e-04 | 6.38e-03 | 7.08e-02 | 6.05e-01 | |||||
| GSEA p53 | GO | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | ||||
| GCM | 6.23e-04 | 6.84e-01 | 8.42e-01 | 9.37e-01 | |||||
| GO + GCM | 1.76e-02 | 8.04e-01 | 9.02e-01 | 9.61e-01 | |||||
| GSEA Diabetes | GO | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | ||||
| GCM | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | |||||
| GO + GCM | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | |||||
| GSEA Leukemia | GO | 2.48e-01 | 5.51e-01 | 7.76e-01 | 9.10e-01 | ||||
| GCM | 6.49e-01 | 9.30e-01 | 9.65e-01 | 9.86e-01 | |||||
| GO + GCM | 3.25e-01 | 6.81e-01 | 8.41e-01 | 9.36e-01 | |||||
| GSEA Lung Cancer | GO | 2.67e-01 | 8.53e-01 | 9.26e-01 | 9.71e-01 | ||||
| GCM | 1.79e-05 | 5.41e-01 | 7.70e-01 | 9.08e-01 | |||||
| GO + GCM | 5.29e-04 | 6.55e-01 | 8.28e-01 | 9.31e-01 | |||||
Lower is better (better values are highlighted). The differences in performance are statistically significant at the 95% (n = 1) and 99% (n = 5, 10, 25) confidence levels.
Annotation databases for genetic interaction screens
| Name | Description |
|---|---|
| GO | GO annotations. Bioconductor GO.db package, version 2.2.11. Bioconductor org.Sc.sgd.db package, version 2.2.12 |
| GCM (Gene to chromosome mapping) | Gene to chromosome mapping. Bioconductor org.Sc.sgd.db package, version 2.2.12 |
| GO + GCM | Combination of the two annotation sources above |
Quantitative performance evaluation of AEA and ACSEA on genetic interaction screens
| Dataset | Annotations | QvAvr1 | QvAvr5 | QvAvr10 | QvAvr15 | ||||
|---|---|---|---|---|---|---|---|---|---|
| AEA | ACSEA | AEA | ACSEA | AEA | ACSEA | AEA | ACSEA | ||
| PAM | GO | 8.27e-09 | 5.70e-06 | 1.20e-04 | 2.75e-04 | ||||
| GCM | 6.08e-06 | 2.88e-01 | 6.44e-01 | 7.63e-01 | |||||
| GO + GCM | 8.31e-09 | 3.20e-06 | 9.20e-05 | 2.28e-04 | |||||
| K-means | GO | 7.36e-06 | 1.42e-06 | 3.50e-04 | 1.66e-03 | 2.68e-03 | |||
| GCM | 1.12e-02 | 1.66e-02 | 3.81e-01 | 6.79e-01 | 7.86e-01 | ||||
| GO + GCM | 2.59e-07 | 3.41e-07 | 2.12e-06 | 2.14e-05 | |||||
Lower is better (better values are highlighted). The differences in performance are statistically significant at the 99% confidence level (n = 5, 10, 15).