Kevin M Livingston, Michael Bada, William A Baumgartner, Lawrence E Hunter.
Abstract
BACKGROUND: The ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Despite ongoing work moving toward shared data formats and linked identifiers, significant problems persist in semantic data integration in order to establish shared identity and shared meaning across heterogeneous biomedical data sources.
Year: 2015 PMID: 25903923 PMCID: PMC4448321 DOI: 10.1186/s12859-015-0559-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. KaBOB construction. Depicts the incremental construction of KaBOB. Labeled arrows represent processes that flow from inputs to outputs. Construction starts with downloading files, proceeds through translating them into RDF, and then iteratively queries the data and produces more RDF. Steps marked with ** involve multiple sets of rules that are run, and whose output is loaded, in sequence.
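The flow described for Figure 1 can be pictured as a small parse-then-apply-rules loop. The sketch below is an illustration only and is not KaBOB's actual code: triples are plain Python tuples rather than real RDF, and the record fields, the `ex:` names, and the functions `record_to_triples`, `rule_denotes`, and `build` are all hypothetical stand-ins for the file parsers and rule sets.

```python
# Illustrative only: triples are (subject, predicate, object) tuples.
# Every identifier below (ex:..., record fields, function names) is hypothetical.

def record_to_triples(record_id, fields):
    """Stand-in for a file parser: one source record yields one triple per field."""
    return {(record_id, f"ex:{name}", value) for name, value in fields.items()}

def rule_denotes(triples):
    """Stand-in for a rule: query the existing triples and emit new ones."""
    return {(obj, "iao:denotes", f"ex:concept/{obj}")
            for (_, pred, obj) in triples if pred == "ex:geneId"}

def build(records, rule_sets):
    store = set()
    for record_id, fields in records.items():
        store |= record_to_triples(record_id, fields)
    # Rule sets run in sequence; each one sees the output of the earlier ones.
    for rules in rule_sets:
        for rule in rules:
            store |= rule(store)
    return store

records = {"ex:record1": {"geneId": "ex:GENE_1", "goTerm": "ex:GO_1"}}
store = build(records, [[rule_denotes]])
```

Loading each rule set's output before running the next is what lets later rules query triples that earlier rules produced.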
Figure 2. Example ICE records and corresponding BIO concepts. Depicts an excerpt of the knowledge representation in KaBOB. Ovals depict instances and rectangles depict classes. Single-line arrows represent triples; each points from its subject to its object and is labeled with its property. The iao:denotes links that cross from the ICE side to the BIO side are emphasized with dashed arrows. Double arrows are shorthand for an owl:Restriction on the given property with some values from the object value. The figure depicts two GO annotation records that are converted to biomedical concepts using the same rule (rule not depicted). Sets of gene identifiers that denote their corresponding gene concepts are also depicted. On the BIO side, the relations between genes, proteins, and gene or gene product aggregate classes are shown. Other than the records and their field values, which are generated by the file parsers, all links are the output of applying rules.
Size of KaBOB
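| Subset | Imported OBOs: triples | Imported OBOs: size on disk | ICE records: triples | ICE records: size on disk | Generated: triples | Generated: size on disk | Total: triples | Total: size on disk |
|---|---|---|---|---|---|---|---|---|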
| human only | 13,830,676 | 1.5 | 144,489,737 | 2.0 | 7,615,547 | 0.2 | 165,935,960 | 3.6 |
| human +7 major model organisms | 13,830,676 | 1.5 | 369,027,022 | 4.9 | 34,968,305 | 0.7 | 417,826,003 | 7.1 |
| all organisms | 13,830,676 | 1.5 | 9,584,033,541 | 126 | n/a | n/a | n/a | n/a |
Lists the sizes of the various collections of RDF generated in the KaBOB build process, given as number of triples and size on disk. The first three major columns cover the imported OBOs, the ICE records (output of the file parsers), and the generated triples (output of the rules and ID merging). The fourth column is the sum of the first three. The rows represent subsets of the KaBOB data based on the organisms included: human only, human plus seven major model organisms (listed in the paper), and all organisms combined. Due to the scale of the data in the final subset, its figures are currently incomplete.
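The statement that the fourth column is the sum of the first three can be checked directly against the triple counts in the two complete rows. The snippet below is only a quick arithmetic check with the counts copied from the table; the variable names are ours.

```python
# Triple counts copied from the table: (imported OBOs, ICE records, generated, total).
rows = {
    "human only": (13_830_676, 144_489_737, 7_615_547, 165_935_960),
    "human +7 major model organisms": (13_830_676, 369_027_022, 34_968_305, 417_826_003),
}

# True for a row when the first three counts add up to the reported total.
checks = {name: obos + ice + generated == total
          for name, (obos, ice, generated, total) in rows.items()}
```

The all-organisms row is left out because its generated and total counts are reported as n/a.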
Number of entities / ID sets
| Subset | # ID sets | # triples | Size on disk |
|---|---|---|---|
| human only | 336,472 | 952,807 | 14 MB |
| human +7 major model organisms | 1,513,932 | 3,644,255 | 56 MB |
Lists the number of entities, or ID sets, in each subset of KaBOB. Each ID set is the collection of identifiers from multiple data sources that are intended to denote the same biomedical concept. The number of ID sets, the number of triples, and the size on disk are reported.
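One way to picture how an ID set forms is to group identifiers that are linked, directly or transitively, by cross-references between sources. The sketch below uses a union-find structure for this; that is our choice for illustration, since this page does not say how KaBOB merges identifiers. The identifiers (`srcA:1` and so on) and the function `build_id_sets` are hypothetical.

```python
# Hypothetical identifiers and equivalence pairs, for illustration only.
# Union-find is an assumed technique here, not KaBOB's documented method.

def build_id_sets(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    # Each pair asserts that two identifiers denote the same concept.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect identifiers under their shared root.
    groups = {}
    for identifier in parent:
        groups.setdefault(find(identifier), set()).add(identifier)
    return sorted(groups.values(), key=lambda s: sorted(s))

pairs = [("srcA:1", "srcB:x"), ("srcB:x", "srcC:7"), ("srcA:2", "srcB:y")]
id_sets = build_id_sets(pairs)
```

Here `srcA:1` and `srcC:7` end up in the same set even though no pair links them directly, which is the transitive behavior an ID set needs.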