David J Osumi-Sutherland, Enrico Ponta, Melanie Courtot, Helen Parkinson, Laura Badi.
Abstract
BACKGROUND: The Gene Ontology (GO) consists of over 40,000 terms for biological processes, cell components and gene product activities linked into a graph structure by over 90,000 relationships. It has been used to annotate the functions and cellular locations of several million gene products. The graph structure is used by a variety of tools to group annotated genes into sets whose products share function or location. These gene sets are widely used to interpret the results of genomics experiments by assessing which sets are significantly over- or under-represented in results lists. F. Hoffmann-La Roche Ltd. has developed a bespoke, manually maintained controlled vocabulary (RCV) for use in over-representation analysis. Many terms in this vocabulary group GO terms in novel ways that cannot easily be derived using the graph structure of the GO. For example, some RCV terms group GO terms by the cell, chemical or tissue type they refer to. Recent improvements in the content and formal structure of the GO make it possible to use logical queries in Web Ontology Language (OWL) to automatically map these cross-cutting classifications to sets of GO terms. We used this approach to automate mapping between RCV and GO, largely replacing the increasingly unsustainable manual mapping process. We then tested the utility of the resulting groupings for over-representation analysis.
Keywords: EL; GO; OWL; enrichment; gene ontology; gene set enrichment analysis; ontology mapping; over-representation analysis
Year: 2018 PMID: 29444698 PMCID: PMC5813370 DOI: 10.1186/s13326-018-0175-z
Source DB: PubMed Journal: J Biomed Semantics
Results table for RCV cannabinoid
| GO name | GO ID | manual | auto | checked | black listed | is obsolete |
|---|---|---|---|---|---|---|
| regulation of endocannabinoid signaling pathway | GO 2000124 | 1 | 1 | 1 | 0 | 0 |
| cannabinoid signaling pathway | GO 0038171 | 1 | 1 | 1 | 0 | 0 |
| endocannabinoid signaling pathway | GO 0071926 | 1 | 0 | 1 | 0 | 0 |
| cannabinoid receptor activity | GO 0004949 | 0 | 1 | 1 | 0 | 0 |
| cannabinoid biosynthetic process | GO 1901696 | 0 | 1 | 1 | 0 | 0 |
The table shows the mapping of the RCV term “cannabinoid” to a set of GO terms, comparing manual mapping (manual column) with automated mapping (auto column). The automated mapping results from an OWL query for processes in which a cannabinoid participates, or that regulate a process in which a cannabinoid participates. The automated mapping found three additional GO terms compared to the manual mapping. In this case, no manually mapped terms were obsolete in GO and all automated mappings were approved.
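The manual versus automated comparison in this table amounts to simple set arithmetic over GO IDs. The following is a minimal sketch, not the authors' pipeline: it uses only the rows reproduced in the table above, and the names `ROWS` and `compare_mappings` are hypothetical, introduced purely for illustration.

```python
# Rows copied from the table above: (GO name, GO ID, manual flag, auto flag).
ROWS = [
    ("regulation of endocannabinoid signaling pathway", "GO 2000124", 1, 1),
    ("cannabinoid signaling pathway", "GO 0038171", 1, 1),
    ("endocannabinoid signaling pathway", "GO 0071926", 1, 0),
    ("cannabinoid receptor activity", "GO 0004949", 0, 1),
    ("cannabinoid biosynthetic process", "GO 1901696", 0, 1),
]


def compare_mappings(rows):
    """Split GO IDs into shared, manual-only and auto-only sets.

    Hypothetical helper: each row is (name, go_id, manual, auto),
    where the flags are 1 if the mapping method produced the term.
    """
    manual = {go_id for _, go_id, is_manual, _ in rows if is_manual}
    auto = {go_id for _, go_id, _, is_auto in rows if is_auto}
    return {
        "shared": manual & auto,        # found by both methods
        "manual_only": manual - auto,   # missed by the OWL query
        "auto_only": auto - manual,     # newly found by the OWL query
    }


result = compare_mappings(ROWS)
for label, go_ids in result.items():
    print(label, sorted(go_ids))
```

The per-term counts of `manual_only` and `auto_only` are the quantities whose distributions across RCV terms are summarised in Fig. 1.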
Ontology metrics: Counts of OWL entity and axiom types in the ontology used for mapping
| entity/axiom type | Count |
|---|---|
| Logical axioms | 142,894 |
| Classes | 53,799 |
| Object properties | 153 |
| SubClassOf axioms | 113,104 |
| EquivalentClasses | 29,386 |
| DisjointClasses | 148 |
| GCI | 6,910 |
| SubObjectPropertyOf | 164 |
| InverseObjectProperties | 28 |
| TransitiveObjectProperty | 16 |
| ReflexiveObjectProperty | 1 |
| SubPropertyChainOf | 46 |
Fig. 1 Summary of mapping results. a. Distribution of manual mappings not found by automated mapping. X axis = number of manual-only mappings. Y axis = number of RCV terms. Over 80% of mappings are completely automated or require fewer than 10 manual mappings. b. Distribution of automated mappings not present in the original manual mapping. X axis = number of auto-only mappings. Y axis = number of RCV terms. Many new mappings were uncovered by automation.
Fig. 2 Use of RCV-derived gene sets to identify immune cell types. RCV-derived gene sets (Y-axis); immune cell type transcriptomes (X-axis); over-representation is indicated in red; under-representation in blue. Cell-type transcriptomes are clustered based on similarity of enrichment profile across gene sets.
Fig. 3 Comparison of RCV-derived gene sets and tissue-derived gene sets for identification of immune-cell-rich tissues. Over-representation of RCV-derived gene sets (Y-axis) in tissue-type transcriptomes (X-axis) is indicated in red, under-representation in blue. Tissue-type transcriptomes are clustered based on similarity of enrichment profile across gene sets (X-axis) and gene sets are clustered by similarity of enrichment profile across tissues (Y-axis). Only the immune-rich tissue cluster of gene sets is shown in this figure. For the full enrichment analysis please see Additional file 1.
Fig. 4 Comparison of RCV-derived gene sets and tissue-derived gene sets for identification of brain-derived tissues. Over-representation of RCV-derived gene sets (Y-axis) in tissue-type transcriptomes (X-axis) is indicated in red, under-representation in blue. Tissue-type transcriptomes are clustered based on similarity of enrichment profile across gene sets (X-axis) and gene sets are clustered by similarity of enrichment profile across tissues (Y-axis). Only the brain tissue cluster of gene sets is shown in this figure. For the full enrichment analysis please see Additional file 1.
Overlap between cell-specific gene sets derived from RCV and cell expression data is low
| Gene sets | Jaccard Index |
|---|---|
| B cells rcv vs Lymphocyte B FOLL ts | 0.064 |
| NK cells rcv vs Lymphocytes NK ts | 0.000 |
| T cells rcv vs Lymphocytes T various ts (a) | 0.025 (a) |
| T helper rcv vs Lymphocytes T H ts | 0.032 |
| dendritic cell rcv vs Dendritic cells ts | 0.000 |
| granulocyte rcv vs Granulocyte INFL ts | 0.082 |
| lymphocyte rcv vs Lymphocytes NK ts | 0.071 |
| macrophage rcv vs Macrophage PB ts | 0.033 |
| mast cell rcv vs Mast cell PB ts | 0.045 |
Column 1 lists the two gene sets compared. Column 2 lists the Jaccard similarity coefficient comparing the two gene sets (0 = no overlap, 1 = full overlap). (a) In the case of T cells, the average of a range of T-cell expression datasets is shown.
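For reference, the Jaccard similarity coefficient of two gene sets is the size of their intersection divided by the size of their union. The sketch below shows the calculation under stated assumptions: the function name `jaccard_index` is hypothetical, the gene identifiers are placeholders rather than data from the study, and returning 0.0 for two empty sets is a convention chosen here, not something the source specifies.

```python
def jaccard_index(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of gene identifiers.

    By convention in this sketch, two empty sets give 0.0
    (the ratio is otherwise undefined).
    """
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Placeholder identifiers, for illustration only (not data from the study).
rcv_genes = {"gene1", "gene2", "gene3", "gene4"}
ts_genes = {"gene1", "gene3", "gene5", "gene6", "gene7"}

# Two shared genes out of seven distinct genes overall.
print(round(jaccard_index(rcv_genes, ts_genes), 3))
```

Values close to 0, as in the table above, indicate that an RCV-derived gene set and the corresponding expression-derived gene set share very few genes.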