Hai Fang.
Abstract
I introduce an open-source R package, 'dcGOR', that makes it easy for the bioinformatics community to analyse ontologies and protein domain annotations, particularly those in the dcGO database. dcGO is a comprehensive resource of protein domain annotations using a panel of ontologies, including Gene Ontology. Although increasing in popularity, this database needs statistical and graphical support to reach its full potential. Moreover, there are no bioinformatics tools specifically designed for domain ontology analysis. As an add-on package built in the R software environment, dcGOR offers a basic infrastructure with great flexibility and functionality. It implements new data structures to represent domains, ontologies, annotations, and all analytical outputs. For each ontology, it provides various mining facilities, including: (i) domain-based enrichment analysis and visualisation; (ii) construction of a domain (semantic similarity) network according to ontology annotations; and (iii) significance analysis for estimating a contact (statistical significance) network. To reduce runtime, most analyses support high-performance parallel computing. Taking a list of protein domains of interest as input, the package can carry out in-depth analyses of functional, phenotypic and disease relevance, and of network-level understanding. More importantly, dcGOR is designed to allow users to import and analyse their own ontologies and annotations on domains (taken from SCOP, Pfam and InterPro) and on RNAs (from Rfam). The package is freely available at CRAN for easy installation, and also at GitHub for version control. The dedicated website with reproducible demos can be found at http://supfam.org/dcGOR.
Year: 2014 PMID: 25356683 PMCID: PMC4214615 DOI: 10.1371/journal.pcbi.1003929
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
A summary of ontologies, infrastructures and functions included in dcGOR.
| Description |
| --- |
| |
| Knowledge on functions; annotate domains from SCOP, Pfam, InterPro and RNA families from Rfam |
| Knowledge on human diseases; annotate SCOP domains only |
| Knowledge on human phenotypes; annotate SCOP domains only |
| Knowledge on mouse phenotypes; annotate SCOP domains only |
| Knowledge on enzyme activities; annotate SCOP domains only |
| Knowledge on functions and others; annotate SCOP domains only |
| Knowledge on pathways; annotate SCOP domains only |
| |
| S4 class for representing data information (e.g. domains) |
| S4 class for representing ontologies |
| S4 class for representing domain-centric annotations |
| S4 class for storing enrichment outputs |
| S4 class for storing domain networks |
| S4 class for storing RWR-based contact outputs |
| S4 class for storing contact networks |
| |
| Create an object of S4 class ‘InfoDataframe’ from an input file |
| Create an object of S4 class ‘Onto’ from input files |
| Create an object of S4 class ‘Anno’ from input files |
| |
| Enrichment analysis; return an object of S4 class ‘Eoutput’ |
| Enrichment output visualisation |
| Semantic similarity calculation; return an object of S4 class ‘Dnetwork’ |
| Random walk with restart; return an object of S4 class ‘Coutput’ |
| Annotation propagation according to true-path rule |
| Conversion between different graph classes |
| Loading RData into the current environment |
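
The "annotation propagation according to true-path rule" entry in the table can be illustrated with a small, language-agnostic sketch. The Python below is not dcGOR code and does not use its API; the function name `propagate_annotations`, the toy ontology and the domain identifiers are all hypothetical. It only assumes the usual reading of the true-path rule: a domain annotated to a term is also implicitly annotated to every ancestor of that term.

```python
from collections import defaultdict


def propagate_annotations(parents, direct):
    """Apply the true-path rule on a DAG.

    parents: dict mapping each term to a list of its parent terms.
    direct:  dict mapping a term to the set of domains directly annotated to it.
    Returns a dict mapping each term to its direct plus inherited domains.
    """
    propagated = defaultdict(set)
    for term, domains in direct.items():
        # Walk upwards from the term, visiting each ancestor only once.
        stack, seen = [term], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            propagated[node].update(domains)
            stack.extend(parents.get(node, ()))
    return dict(propagated)


# Hypothetical toy ontology: term C has two parents (A and B), both under root.
parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"]}
direct = {"C": {"d1"}, "B": {"d2"}}
result = propagate_annotations(parents, direct)
for term in sorted(result):
    print(term, sorted(result[term]))
```

Propagating before any enrichment or similarity calculation means that general terms collect the domains of all their descendants, which is why ancestral terms appear alongside the significant ones in the figures.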
Figure 1. Domain-based enrichment analysis using GOBP terms.
Only the 5 most significant terms/nodes (outlined in black; explained in the bottom-right panel) are visualised along with their ancestral terms. Nodes are coloured according to adjusted p-values.
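
To make the adjusted p-values in Figure 1 more concrete, here is a minimal sketch of a domain-based enrichment test. It assumes a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment; the record above does not say which test or adjustment the figure uses, so both are assumptions, and the function names and counts are hypothetical rather than part of dcGOR.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) for a hypergeometric draw.

    N: domains in the background; K: background domains annotated to the term;
    n: domains in the input list; k: input domains annotated to the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts for three terms, each as (k, n, K, N).
counts = [(4, 5, 10, 100), (1, 5, 40, 100), (2, 5, 20, 100)]
raw = [hypergeom_pvalue(*c) for c in counts]
adjusted = benjamini_hochberg(raw)
for c, p, q in zip(counts, raw, adjusted):
    print(c, f"p={p:.3g}", f"adj={q:.3g}")
```

In this toy input the first term is the most enriched because 4 of the 5 input domains carry a term that only 10 of 100 background domains have.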
Figure 2. In-depth analysis for network-level understanding.
(A) Heatmap visualisation of the semantic similarity between pairs of domains according to their annotations by Disease Ontology (DO). (B) Network representation of the pairwise domain semantic similarity. It is a weighted and undirected network, with edge thickness indicating semantic similarity between a pair of domains/nodes. Nodes are labelled by both numeric id and textual description. (C) A table listing GOMF terms and their annotated domains (used as domain seeds for random walk with restart, RWR). Notably, terms used here are only those with at least 3 annotatable domains that are also in the domain network (see Figure 2B). (D) Contact (statistical significance) network between GOMF terms in Figure 2C, as estimated by RWR on the domain network in Figure 2B. Only those significant contacts/edges (adjusted p-values < 0.1) are shown, with thickness indicating the contact strength (z-score).
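
The random walk with restart (RWR) step behind Figure 2D can be sketched as follows. This is a generic textbook formulation, not dcGOR's implementation: it iterates p ← (1 − r)·W·p + r·p0 on a column-normalised weighted network until the vector stops changing. The function name, the restart probability of 0.75, the tolerance and the three-node network are illustrative assumptions, and the significance step that turns affinities into z-scores and adjusted p-values is not shown.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.75, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probabilities of an RWR from the seeds.

    adjacency: symmetric matrix (list of lists) of non-negative edge weights.
    seeds:     indices of the seed nodes; the walker restarts at them uniformly.
    """
    n = len(adjacency)
    # Column-normalise the weights so each column holds transition probabilities.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical 3-node path network (0 - 1 - 2) weighted by semantic similarity.
network = [
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.4],
    [0.0, 0.4, 0.0],
]
affinity = random_walk_with_restart(network, seeds=[0])
print([round(x, 4) for x in affinity])
```

Nodes closer to the seeds, or connected to them by heavier edges, end up with higher affinity; comparing the affinity profiles of two seed sets is one way a contact strength between two terms could be derived.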
Figure 3. Enrichment analysis of promiscuous Pfam domains using GOBP terms (left) and GOMF terms (right).
Only the most significant terms/nodes (adjusted p-values < 0.05; outlined in black) are visualised along with their ancestral terms. Nodes are coloured according to adjusted p-values.
Figure 4. Heatmap visualisation of the GO overall semantic similarity between pairs of promiscuous Pfam domains.
Domains are ordered according to hierarchical clustering by the package ‘supraHex’.