Hirokazu Chiba, Hiroyo Nishide, Ikuo Uchiyama.
Abstract
Recently, various types of biological data, including genomic sequences, have been rapidly accumulating. To discover biological knowledge from such growing heterogeneous data, a flexible framework for data integration is necessary. Ortholog information is a central resource for interlinking corresponding genes among different organisms, and the Semantic Web provides a key technology for the flexible integration of heterogeneous data. We have constructed an ortholog database using Semantic Web technology, aiming at the integration of numerous genomic data and various types of biological information. To formalize the structure of the ortholog information in the Semantic Web, we have constructed the Ortholog Ontology (OrthO). While OrthO is a compact ontology for general use, it is designed to be extended to describe database-specific concepts. On the basis of OrthO, we described the ortholog information from our Microbial Genome Database for Comparative Analysis (MBGD) in the form of Resource Description Framework (RDF) and made it available through a SPARQL endpoint, which accepts arbitrary queries specified by users. In this framework based on OrthO, the biological data of different organisms can be integrated using the ortholog information as a hub. In addition, the ortholog information from different data sources can be compared using OrthO as a shared ontology. Here we show some examples demonstrating that the ortholog information described in RDF can be used to link various biological data such as taxonomy information and Gene Ontology. Thus, the ortholog database using Semantic Web technology can contribute to biological knowledge discovery through integrative data analysis.
Year: 2015 PMID: 25875762 PMCID: PMC4395280 DOI: 10.1371/journal.pone.0122802
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. RDF model of ortholog information based on OrthO.
(A) Hierarchical structure of classes and properties in OrthO. OrthO includes 12 classes (owl:Class) and 20 properties (15 of owl:ObjectProperty and 5 of owl:DatatypeProperty). (B) Schematic representation of RDF graph structure of ortholog information described using OrthO. The elliptical nodes represent instances of classes. The directed edges represent properties. The dotted lines represent possible links to other resources.
List of ontologies available at the MBGD SPARQL endpoint.
| Ontology | Prefix |
|---|---|
| Ortholog Ontology (OrthO) | orth: |
| An ontology for MBGD | mbgd: |
| An ontology for GO annotation | goa: |
| The RDF Concepts Vocabulary (RDF) | rdf: |
| The RDF Schema vocabulary (RDFS) | rdfs: |
| The OWL 2 Schema vocabulary (OWL 2) | owl: |
| Dublin Core Metadata Element Set, Version 1.1 | dc: |
| DCMI Metadata Terms | dct: |
| Vocabulary of Interlinked Datasets (VoID) | void: |
| SKOS Vocabulary | skos: |
| Provenance, Authoring and Versioning (PAV) | pav: |
| Ontological Gene Orthology (OGO) | ogo: |
| FALDO: Feature Annotation Location Description Ontology | faldo: |
| UniProt core ontology | up: |
| RDF representation of taxonomy | tax: |
| RDF representation of GO | go: |
This table shows a selected list of the ontologies available at the MBGD SPARQL endpoint, together with the prefix used for each ontology in this study. For the full list and additional details of the available ontologies, see the documentation page of MBGD SPARQL Search (http://mbgd.genome.ad.jp/sparql).
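Because the figure captions note that prefix declarations are omitted from the displayed queries, a small helper that prepends them can be handy. The following is a minimal sketch, not part of the original work: the W3C, Dublin Core, and SKOS namespace URIs are the standard published ones, while the `orth:` URI and the helper names `prefix_header` and `with_prefixes` are assumptions that should be checked against the MBGD SPARQL documentation.

```python
import re

# Standard namespace URIs, plus one ASSUMED entry for OrthO.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "dct": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "orth": "http://purl.jp/bio/11/orth#",  # assumption, verify before use
}


def prefix_header(prefixes, used):
    """Return SPARQL PREFIX lines for the given prefix names, sorted by name."""
    return "\n".join(f"PREFIX {n}: <{prefixes[n]}>" for n in sorted(used))


def with_prefixes(query_body, prefixes=PREFIXES):
    """Prepend declarations only for prefixes that the query body mentions."""
    used = [
        n for n in prefixes
        # Match "name:" only when it is not the tail of a longer token.
        if re.search(rf"(?<![\w-]){re.escape(n)}:", query_body)
    ]
    return prefix_header(prefixes, used) + "\n" + query_body


if __name__ == "__main__":
    print(with_prefixes("SELECT ?g WHERE { ?g a orth:OrthologGroup ; rdfs:label ?l }"))
```

Declaring only the prefixes that a query uses keeps the submitted text short and makes a missing namespace obvious when a query is reused elsewhere.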
Fig 2. An example orthology relation and its RDF representation.
(A) A schematic illustration of an orthology relation from the OrthoXML documentation (http://orthoxml.org/0.3/orthoxml_doc_v0.3.html#trees). Here, each node is assigned a URI and a class that are required for RDF representation. The filled circles representing speciation events are assigned the orth:OrthologGroup class. (B) RDF representation (Turtle format) of the example shown in A.
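The mapping from an orthology tree to RDF statements described in this caption can be sketched in a few lines. This is an illustrative sketch only: the caption names just the `orth:OrthologGroup` class for speciation nodes, so `orth:ParalogGroup`, `orth:Gene`, the `orth:member` property, and the `ex:` identifiers are assumptions made for the example, and `@prefix` declarations are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    uri: str    # hypothetical identifier, e.g. "ex:g1"
    kind: str   # "speciation", "duplication", or "gene"
    children: list = field(default_factory=list)


# Only orth:OrthologGroup comes from the caption; the rest are assumed names.
CLASS_OF = {
    "speciation": "orth:OrthologGroup",
    "duplication": "orth:ParalogGroup",
    "gene": "orth:Gene",
}


def to_turtle(node):
    """Return one Turtle statement per tree node, in depth-first order."""
    statement = f"{node.uri} a {CLASS_OF[node.kind]}"
    if node.children:
        members = " , ".join(child.uri for child in node.children)
        statement += f" ;\n    orth:member {members}"
    statements = [statement + " ."]
    for child in node.children:
        statements.extend(to_turtle(child))
    return statements


if __name__ == "__main__":
    tree = Node("ex:g1", "speciation", [
        Node("ex:geneA", "gene"),
        Node("ex:g2", "duplication", [Node("ex:geneB", "gene"), Node("ex:geneC", "gene")]),
    ])
    print("\n\n".join(to_turtle(tree)))
```

Giving every internal node its own URI, as the caption describes, is what lets nested groups be referenced as members of their parent group.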
List of the datasets available at the MBGD SPARQL endpoint.
| Dataset | |
|---|---|
| MBGD ortholog groups | 76,155,196 |
| MBGD genes | 686,902,009 |
| MBGD organisms | 31,397 |
| MBGD chromosomes and plasmids | 6,796,757 |
| Cross-references from MBGD to UniProt | 8,012,666 |
| eggNOG COG | 42,787,220 |
| eggNOG NOG | 21,469,150 |
| eggNOG proteins | 24,572,358 |
| eggNOG organisms | 1,144 |
| UniProt-GOA | 274,338,183 |
For detailed information on the datasets, including their derivation, see the documentation page of MBGD SPARQL Search (http://mbgd.genome.ad.jp/sparql).
Fig 3. The portal page of MBGD SPARQL Search.
Fig 4. Retrieval of ortholog information of a specific protein.
(A) Schematic diagram of the RDF graph structure related to the query in B. The elliptical nodes represent resources. Specifically, the shaded elliptical nodes where classes are shown in italics represent the instances of the classes. In the unshaded elliptical node, the URI of the resource is directly shown. (B) SPARQL query to get GO annotation of an ortholog group. The prefix declarations are omitted for readability; the full description of the SPARQL query is included in S1 Dataset. (C) Search results of the query shown in B.
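Queries such as the one in panel B can also be submitted programmatically. The sketch below follows the general SPARQL 1.1 Protocol (a `query` parameter on a GET request) and the SPARQL JSON results layout; it only builds the request and parses a response, without contacting any server. The `ENDPOINT` value is a placeholder and not the real MBGD endpoint URL, and the function names are chosen for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://example.org/sparql"  # placeholder, NOT the MBGD endpoint


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows(result_json):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(result_json)
    names = doc["head"]["vars"]
    return [
        # Unbound variables are simply absent from a binding.
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in doc["results"]["bindings"]
    ]


if __name__ == "__main__":
    request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    print(request.full_url)
    # Sending would be: urllib.request.urlopen(request).read().decode("utf-8")
```

Separating request construction from result parsing makes both parts easy to check offline before pointing the code at a live endpoint.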
Fig 5. Comparison of ortholog information from different data sources.
(A) Schematic diagram of the RDF graph structure related to the query in B. The elliptical nodes represent instances of classes. The rectangular nodes represent literals (integers in this example). (B) SPARQL query to compare orthologs between MBGD and eggNOG. The first line enables the inference based on sub-class and sub-property relations (see Methods). (C) Search results of the query shown in B.
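Once group memberships from two sources have been retrieved, the comparison itself reduces to set overlap. The following sketch is not the authors' SPARQL query; it is a client-side illustration that assumes members from both sources have already been mapped to shared identifiers (for example via cross-references), and all group and member names in it are made up.

```python
def best_matches(groups_a, groups_b):
    """For each group in A, report the B group sharing the most members.

    Both arguments map a group ID to a set of member IDs. Each value of the
    result is (best B group or None, shared count, size in A, size in B).
    """
    # Index source B by member so each lookup is a dictionary access.
    owners = {}
    for group_id, members in groups_b.items():
        for member in members:
            owners.setdefault(member, set()).add(group_id)

    result = {}
    for group_id, members in groups_a.items():
        shared = {}
        for member in members:
            for other in owners.get(member, ()):
                shared[other] = shared.get(other, 0) + 1
        if shared:
            # Sorting first makes ties resolve to the smallest group ID.
            best = max(sorted(shared), key=shared.get)
            result[group_id] = (best, shared[best], len(members), len(groups_b[best]))
        else:
            result[group_id] = (None, 0, len(members), 0)
    return result


if __name__ == "__main__":
    source_a = {"A1": {"p1", "p2", "p3"}, "A2": {"p9"}}
    source_b = {"B1": {"p1", "p2"}, "B2": {"p3", "p4"}}
    for group, match in best_matches(source_a, source_b).items():
        print(group, match)
```

Reporting the two group sizes next to the shared count shows at a glance whether one source splits or merges a group relative to the other.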
Fig 6. Retrieval of phylogenetic patterns of orthologs related to a specific function.
(A) Schematic diagram of the RDF graph structure related to the queries in B and C. (B) SPARQL query to get MBGD clusters including members related to the GO term GO:0009288 (bacterial-type flagellum). (C) SPARQL query to obtain organisms that contain members of an ortholog group. (D) Search results of the query shown in B. (E) The results obtained from the queries shown in B and C, visualized using R (the R source code is included in S1 Dataset). The number of target organisms in each phylum is shown in parentheses. After the output was obtained from R, the phyla containing gram-positive bacteria (+) and the genes functioning in the flagellar export system (*) were marked, and the blue line was added to indicate clusters with a relatively wide organismal distribution (present in at least 16 phyla).
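The tabulation behind a phylogenetic pattern like the one in panel E can be sketched as follows. The original analysis was done in R; this Python version is only one plausible way to summarize the query output, assuming it has been reduced to (cluster, organism) pairs plus an organism-to-phylum mapping. The function names and the toy data are hypothetical; the threshold of 16 phyla mentioned in the caption would be passed as `min_phyla`.

```python
from collections import defaultdict


def phylum_profile(pairs, phylum_of):
    """Map each cluster to {phylum: fraction of that phylum's organisms with it}."""
    # Count the organisms known for each phylum.
    organisms_in = defaultdict(set)
    for organism, phylum in phylum_of.items():
        organisms_in[phylum].add(organism)

    # Collect, per cluster and phylum, the organisms that carry a member.
    present = defaultdict(lambda: defaultdict(set))
    for cluster, organism in pairs:
        present[cluster][phylum_of[organism]].add(organism)

    return {
        cluster: {p: len(orgs) / len(organisms_in[p]) for p, orgs in by_phylum.items()}
        for cluster, by_phylum in present.items()
    }


def widely_distributed(profile, min_phyla):
    """Return clusters found in at least `min_phyla` phyla, sorted by ID."""
    return sorted(c for c, by_phylum in profile.items() if len(by_phylum) >= min_phyla)


if __name__ == "__main__":
    phylum_of = {"o1": "P1", "o2": "P1", "o3": "P2"}
    pairs = [("c1", "o1"), ("c1", "o3"), ("c2", "o2"), ("c1", "o1")]
    profile = phylum_profile(pairs, phylum_of)
    print(profile)
    print(widely_distributed(profile, min_phyla=2))
```

Using sets for the per-phylum organisms means duplicate rows, which are common when a genome has several members in one cluster, do not inflate the fractions.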
Time required for executing SPARQL queries.
| Query | Time |
|---|---|
| | 1.2 |
| | 3.3 |
| | 1.6 |
| | 19.1 |