Alison Callahan, José Cruz-Toledo, Michel Dumontier.
Abstract
BACKGROUND: A key activity for life scientists in this post "-omics" age involves searching for and integrating biological data from a multitude of independent databases. However, our ability to find relevant data is hampered by non-standard web and database interfaces backed by an enormous variety of data formats. This heterogeneity presents an overwhelming barrier to the discovery and reuse of resources which have been developed at great public expense. To address this issue, the open-source Bio2RDF project promotes a simple convention to integrate diverse biological data using Semantic Web technologies. However, querying Bio2RDF remains difficult due to the lack of uniformity in the representation of Bio2RDF datasets.
Year: 2013 PMID: 23735196 PMCID: PMC3632999 DOI: 10.1186/2041-1480-4-S1-S1
Source DB: PubMed Journal: J Biomed Semantics
Bio2RDF datasets currently available
| Dataset | Namespace | # of triples | # of unique subjects | # of unique predicates | # of unique objects |
|---|---|---|---|---|---|
| | affymetrix | 44469611 | 1370219 | 79 | 13097194 |
| | biomodels | 589753 | 87671 | 38 | 209005 |
| | ctd | 141845167 | 12840989 | 27 | 13347992 |
| | drugbank | 1121468 | 172084 | 75 | 526976 |
| | ncbigene | 394026267 | 12543449 | 60 | 121538103 |
| | goa | 80028873 | 4710165 | 28 | 19924391 |
| | hgnc | 836060 | 37320 | 63 | 519628 |
| | homologene | 1281881 | 43605 | 17 | 1011783 |
| | interpro | 999031 | 23794 | 34 | 211346 |
| | iproclass | 211365460 | 11680053 | 29 | 97484111 |
| | irefindex | 31042135 | 1933717 | 32 | 4276466 |
| | mesh | 4172230 | 232573 | 60 | 1405919 |
| | ncbo | 15384622 | 4425342 | 191 | 7668644 |
| | ndc | 17814216 | 301654 | 30 | 650650 |
| | omim | 1848729 | 205821 | 61 | 1305149 |
| | pharmgkb | 37949275 | 5157921 | 43 | 10852303 |
| | sabiork | 2618288 | 393157 | 41 | 797554 |
| | sgd | 5551009 | 725694 | 62 | 1175694 |
| | taxon | 17814216 | 965020 | 33 | 2467675 |
These Bio2RDF datasets are currently available for SPARQL querying and download at http://bio2rdf.org. The total number of triples, unique subjects, unique predicates and unique objects is listed along with the Bio2RDF namespace for each dataset.
* Datasets new to the Bio2RDF network
† InterPro contains 13 domain resources, iRefIndex contains 13 interaction resources, and NCBO contains 107 OBO ontologies.
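The four counts reported per dataset can be illustrated with a minimal Python sketch. It assumes triples are plain (subject, predicate, object) string tuples and that repeated triples are counted once; the `ex:` identifiers are hypothetical toy data, and nothing here is claimed to be the procedure the authors used to produce the table.

```python
from typing import Iterable, NamedTuple, Tuple

Triple = Tuple[str, str, str]


class DatasetStats(NamedTuple):
    triples: int
    unique_subjects: int
    unique_predicates: int
    unique_objects: int


def dataset_stats(triples: Iterable[Triple]) -> DatasetStats:
    """Count distinct triples and the distinct terms in each position."""
    distinct = set(triples)  # assumption: duplicate triples count once
    subjects = {s for s, _, _ in distinct}
    predicates = {p for _, p, _ in distinct}
    objects = {o for _, _, o in distinct}
    return DatasetStats(len(distinct), len(subjects), len(predicates), len(objects))


# Hypothetical toy triples; the last one repeats the first.
toy_triples = [
    ("ex:gene1", "ex:label", "gene one"),
    ("ex:gene1", "ex:encodes", "ex:protein1"),
    ("ex:gene2", "ex:label", "gene two"),
    ("ex:gene2", "ex:encodes", "ex:protein1"),
    ("ex:gene1", "ex:label", "gene one"),
]

stats = dataset_stats(toy_triples)
print(stats)
```

On a real dataset of hundreds of millions of triples the same counts would normally be obtained with aggregate queries against the endpoint rather than by holding the triples in memory.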
Figure 1. Bio2RDF datasets showing namespace connectivity. Circles represent Bio2RDF datasets and the links between them represent a relation between one dataset and another based on IRI namespaces. Datasets with many links may be considered ‘hubs’ that serve to connect multiple datasets in the Bio2RDF network. A: All Bio2RDF datasets and their namespace connectivity. B: Detail of namespace connectivity for the PharmGKB dataset.
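Namespace connectivity of this kind can be sketched in Python as follows. The sketch assumes that Bio2RDF IRIs follow the pattern `http://bio2rdf.org/namespace:identifier` (as in the SGD protein identifiers listed in this record) and that a link exists whenever the subject and object of a triple fall in different namespaces. The predicate IRI and the `EXAMPLE` identifiers are hypothetical, and the figure itself may have been built differently.

```python
import re
from collections import defaultdict

# Assumed IRI shape: http://bio2rdf.org/<namespace>:<identifier>
BIO2RDF_IRI = re.compile(r"^http://bio2rdf\.org/([^:/]+):(.+)$")


def namespace_of(term):
    """Return the Bio2RDF namespace of an IRI, or None for anything else."""
    match = BIO2RDF_IRI.match(term)
    return match.group(1) if match else None


def namespace_links(triples):
    """Count triples whose subject and object lie in different namespaces."""
    links = defaultdict(int)
    for subject, _predicate, obj in triples:
        src, dst = namespace_of(subject), namespace_of(obj)
        if src and dst and src != dst:
            links[(src, dst)] += 1
    return dict(links)


def hub_degree(links):
    """Number of distinct namespaces each namespace is linked with."""
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    return {ns: len(others) for ns, others in neighbours.items()}


# Hypothetical triples; only the sgd IRIs appear in this record.
P = "http://example.org/related-to"
toy_triples = [
    ("http://bio2rdf.org/sgd:S000004688gp", P, "http://bio2rdf.org/go:EXAMPLE1"),
    ("http://bio2rdf.org/sgd:S000000349gp", P, "http://bio2rdf.org/go:EXAMPLE2"),
    ("http://bio2rdf.org/ctd:EXAMPLE3", P, "http://bio2rdf.org/go:EXAMPLE1"),
    ("http://bio2rdf.org/sgd:S000004688gp", P, "a plain literal"),
    ("http://bio2rdf.org/sgd:S000004688gp", P, "http://bio2rdf.org/sgd:S000000349gp"),
]

links = namespace_links(toy_triples)
degrees = hub_degree(links)
print(links)
print(degrees)
```

In this toy data the `go` namespace is linked with two other namespaces, which is the sense in which a heavily linked dataset acts as a hub.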
Top 5 results of a federated SPARQL query to search for SGD genes related to the GO function with label “zinc ion binding”
| Gene identifier | Gene label | Description | Protein identifier |
|---|---|---|---|
| | YMR083W | Mitochondrial alcohol dehydrogenase isozyme III | http://bio2rdf.org/sgd:S000004688gp |
| | YBR145W | Alcohol dehydrogenase isoenzyme V | http://bio2rdf.org/sgd:S000000349gp |
| | YDR216W | Carbon source-responsive zinc-finger transcription factor | http://bio2rdf.org/sgd:S000002624gp |
| | YER017C | Component of the mitochondrial inner membrane m-AAA protease | http://bio2rdf.org/sgd:S000000819gp |
| | YIL044C | ADP-ribosylation factor (ARF) GTPase activating protein (GAP) effector | http://bio2rdf.org/sgd:S000001306gp |
SGD gene identifier, label, partial description and corresponding protein identifier for each query result
Bio2RDF vocabulary and SIO mapping metrics
| Dataset | # of classes | # of object properties | # of class exact mappings | # of class intermediate mappings |
|---|---|---|---|---|
| | 1 | 15 | 0 | 1 |
| | 0 | 2 | 0 | 0 |
| | 4 | 3 | 3 | 1 |
| | 15 | 58 | 6 | 9 |
| | 1 | 2 | 0 | 1 |
| | 1 | 30 | 0 | 1 |
| | 1 | 5 | 0 | 1 |
| | 7 | 21 | 2 | 5 |
| | 5 | 6 | 0 | 5 |
| | 3 | 46 | 0 | 3 |
| | 7 | 47 | 3 | 4 |
| | 4 | 16 | 1 | 3 |
| | 18 | 89 | 1 | 17 |
| | 11 | 16 | 0 | 11 |
| | 16 | 8 | 6 | 10 |
| | 0 | 3 | 0 | 0 |
| | 42 | 40 | 7 | 33 |
The number of classes and object properties mapped to SIO for each Bio2RDF dataset vocabulary ontology, as well as details on quality of class mappings. Exact matches indicate that the Bio2RDF vocabulary class is a direct subclass of its mapped parent SIO class. Intermediate matches are those that are not exact matches in the SIO class hierarchy but for which there was not a more precise parent class candidate.
Figure 2. Graphical representation of results of the SPARQL query for CTD chemicals that participate in the same molecular process as SGD proteins. The SGD gene S000004901 encodes a gene product that is typed as ‘protein’ in the SGD dataset. This protein is annotated to be a participant in the GO process “RNA splicing”. Chemicals from the CTD dataset that are also participants in this GO process are retrieved. This query is possible because of cross-dataset links between CTD and SGD that result from mapping their Bio2RDF vocabularies to SIO: the SGD type ‘protein’ was mapped as a subclass of the SIO ‘protein’ class, while the CTD type ‘chemical’ was mapped as a subclass of the SIO ‘chemical entity’ class. The SGD dataset uses the SIO relation ‘encodes’ to relate genes and gene products, and both the SGD and CTD datasets use the SIO relation ‘is participant in’ to link entities to GO processes. Square boxes are SIO OWL classes. Rounded boxes are RDF resources that are part of Bio2RDF linked datasets.
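The reasoning behind this query can be sketched in Python over a small in-memory set of triples. Everything in the toy graph is a placeholder: the prefixed names stand in for real IRIs, the gene product and chemical identifiers are hypothetical, and only direct subclass mappings are followed. It illustrates the join described in the caption and is not the SPARQL query the authors ran.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"
ENCODES = "sio:encodes"
PARTICIPANT_IN = "sio:is-participant-in"

# Hypothetical toy graph; all names are placeholders, not real IRIs.
toy_graph = {
    # Vocabulary classes mapped to SIO, as the caption describes.
    ("sgd_vocabulary:Protein", SUBCLASS_OF, "sio:protein"),
    ("ctd_vocabulary:Chemical", SUBCLASS_OF, "sio:chemical-entity"),
    # SGD side: a gene encodes a protein that participates in a GO process.
    ("sgd:S000004901", ENCODES, "sgd:example-gene-product"),
    ("sgd:example-gene-product", TYPE, "sgd_vocabulary:Protein"),
    ("sgd:example-gene-product", PARTICIPANT_IN, "go:rna-splicing"),
    # CTD side: two chemicals, only one in the same process.
    ("ctd:chemical-A", TYPE, "ctd_vocabulary:Chemical"),
    ("ctd:chemical-A", PARTICIPANT_IN, "go:rna-splicing"),
    ("ctd:chemical-B", TYPE, "ctd_vocabulary:Chemical"),
    ("ctd:chemical-B", PARTICIPANT_IN, "go:other-process"),
}


def instances_of(graph, sio_class):
    """Resources typed with sio_class or with one of its direct subclasses."""
    classes = {s for s, p, o in graph if p == SUBCLASS_OF and o == sio_class}
    classes.add(sio_class)
    return {s for s, p, o in graph if p == TYPE and o in classes}


def chemicals_sharing_process(graph, gene):
    """Sorted (chemical, process) pairs for processes the gene's proteins join."""
    encoded = {o for s, p, o in graph if s == gene and p == ENCODES}
    proteins = encoded & instances_of(graph, "sio:protein")
    processes = {o for s, p, o in graph if s in proteins and p == PARTICIPANT_IN}
    chemicals = instances_of(graph, "sio:chemical-entity")
    return sorted(
        (s, o)
        for s, p, o in graph
        if s in chemicals and p == PARTICIPANT_IN and o in processes
    )


result = chemicals_sharing_process(toy_graph, "sgd:S000004901")
print(result)
```

The join only works because both datasets type their resources through classes mapped to SIO and share the ‘is participant in’ relation; if the two `rdfs:subClassOf` triples are removed, the function returns an empty list.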