Ali Hasnain, Qaiser Mehmood, Syeda Sana E Zainab, Muhammad Saleem, Claude Warren, Durre Zehra, Stefan Decker, Dietrich Rebholz-Schuhmann.
Abstract
BACKGROUND: Biomedical data, e.g. from knowledge bases and ontologies, is increasingly made available following open linked data principles, at best as RDF triple data. This is a necessary step towards unified access to biological data sets, but it still requires solutions that query multiple endpoints for their heterogeneous data in order to retrieve all the meaningful information. Suggested solutions are based on query federation approaches, which require the submission of SPARQL queries to endpoints. Due to the size and complexity of the available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. Last but not least, the reliability of data resources in terms of access and quality has to be monitored over time. Our solution (BioFed) federates data over 130 SPARQL endpoints in life sciences and tailors query submission according to the provenance information. BioFed has been evaluated against the state-of-the-art solution FedX and forms an important benchmark for the life science domain.
Keywords: Life sciences dataset; Linked open data; SPARQL query federation
Year: 2017 PMID: 28298238 PMCID: PMC5353896 DOI: 10.1186/s13326-017-0118-0
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 BioFed architecture. ARDI comes from previous work by Hasnain et al. [4, 16]
Hardware statistics
| Endpoint name | Operating system | CPU (GHz) | RAM | Hard disk |
|---|---|---|---|---|
| Chebi | Windows 7 Professional Service Pack 1, 64-bit | 2.90, i7 | 8 GB | 148 GB |
| LinkedTCGA | Windows 7 Professional Service Pack 1, 64-bit | 2.90, i7 | 8 GB | 148 GB |
| Sider | Windows 7 Professional Service Pack 1, 64-bit | 2.90, i7 | 8 GB | 148 GB |
| Dailymed | Windows 7 Professional Service Pack 1, 64-bit | 2.26, 2 Duo | 4 GB | 148 GB |
| Medicare | Windows 7 Professional Service Pack 1, 64-bit | 2.26, 2 Duo | 4 GB | 148 GB |
| LinkedCT | Windows 7 Professional Service Pack 1, 64-bit | 2.53, i5 | 4 GB | 297 GB |
| Diseasome | Windows 7 Professional Service Pack 1, 64-bit | 2.53, i5 | 4 GB | 297 GB |
| Affymetrix | Ubuntu 14.04 LTS, 64-bit | 1.80, i5 | 8 GB | 256 GB |
| Drugbank | Ubuntu 14.04 LTS, 64-bit | 1.80, i5 | 8 GB | 256 GB |
| Kegg | Ubuntu 14.04 LTS, 64-bit | 2.53, i5 | 8 GB | 320 GB |
Fig. 2 Dataset connectivity. Overview of how some of the life science datasets used in the experimental setup are connected through classes and properties
Dataset statistics
| Dataset | Triples | Subjects | Predicates | Objects | Classes | Structuredness |
|---|---|---|---|---|---|---|
| Chebi | 4772706 | 50477 | 28 | 772138 | 1 | 0.340 |
| DrugBank | 517023 | 19693 | 119 | 276142 | 8 | 0.726 |
| Kegg | 1090830 | 34260 | 21 | 939258 | 4 | 0.919 |
| Affymetrix | 44207146 | 1421763 | 105 | 13240270 | 3 | 0.506 |
| Dailymed | 162972 | 10015 | 28 | 67782 | 6 | 0.663 |
| Diseasome | 72445 | 8152 | 19 | 27704 | 4 | 0.543 |
| Sider | 101542 | 2674 | 11 | 29410 | 4 | 0.924 |
| Medicare | 44500 | 6825 | 6 | 23308 | 3 | 0.843 |
| LinkedCT | 9804652 | 981880 | 90 | 3808369 | 13 | 0.840 |
| Linked TCGA-A | 35329868 | 5782962 | 383 | 8329393 | 23 | 0.998 |
| Total | 96103684 | 8318701 | 810 | 27513774 | 69 | - |
Comparison of the queries in terms of the number of basic graph patterns (#BGPs), the number of triple patterns (#TP), total vertices (TVs), join vertices (JVs), the ratio of join vertices to total vertices (R), and the mean join vertex degree (D) per query
| Query | QueryType | #BGPs | #TP | TVs | JVs | R | D | SPARQL Clauses |
|---|---|---|---|---|---|---|---|---|
| SQ1 | Simple | 2 | 4 | 10 | 2 | 0.20 | 2.0 | UNION |
| SQ2 | Simple | 1 | 7 | 15 | 4 | 0.266 | 2.5 | X |
| SQ3 | Simple | 1 | 6 | 13 | 4 | 0.307 | 2.250 | X |
| SQ4 | Simple | 1 | 5 | 11 | 3 | 0.272 | 2.333 | X |
| SQ5 | Simple | 2 | 5 | 12 | 3 | 0.250 | 2.0 | OPTIONAL |
| SQ6 | Simple | 1 | 3 | 7 | 2 | 0.285 | 2.0 | X |
| SQ7 | Simple | 1 | 4 | 9 | 3 | 0.333 | 2.0 | X |
| SQ8 | Simple | 1 | 3 | 7 | 2 | 0.285 | 2.0 | X |
| SQ9 | Simple | 1 | 8 | 14 | 2 | 0.117 | 4.5 | DISTINCT |
| SQ10 | Simple | 1 | 8 | 17 | 2 | 0.117 | 4.5 | DISTINCT |
| CQ1 | Complex | 2 | 8 | 18 | 4 | 0.222 | 2.5 | DISTINCT, OPTIONAL, FILTER |
| CQ2 | Complex | 2 | 8 | 19 | 4 | 0.210 | 2.25 | OPTIONAL, FILTER |
| CQ3 | Complex | 1 | 10 | 19 | 4 | 0.210 | 3.75 | DISTINCT, FILTER, REGEX |
| CQ4 | Complex | 1 | 6 | 13 | 4 | 0.307 | 2.25 | X |
| CQ5 | Complex | 2 | 10 | 22 | 3 | 0.136 | 3.666 | OPTIONAL |
| CQ6 | Complex | 2 | 12 | 24 | 6 | 0.25 | 3.0 | OPTIONAL |
| CQ7 | Complex | 1 | 8 | 17 | 4 | 0.235 | 2.75 | X |
| CQ8 | Complex | 1 | 6 | 13 | 2 | 0.153 | 3.5 | X |
| CQ9 | Complex | 1 | 9 | 19 | 5 | 0.263 | 2.6 | FILTER |
| CQ10 | Complex | 2 | 9 | 20 | 3 | 0.15 | 3.333 | OPTIONAL |
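
The ratio column R in the table above is defined by the caption as join vertices divided by total vertices. As a minimal sketch, the snippet below recomputes R for a few rows copied from the table and compares it with the reported value. The function names and the tolerance of 0.001 are assumptions made for illustration (the table appears to truncate R to three decimals rather than round it); the sketch does not compute D, since the per-vertex degrees are not listed here.

```python
# A few rows copied from the query table: (query, total vertices, join vertices, reported R).
ROWS = [
    ("SQ1", 10, 2, 0.20),
    ("SQ2", 15, 4, 0.266),
    ("CQ6", 24, 6, 0.25),
    ("CQ9", 19, 5, 0.263),
]


def join_vertex_ratio(join_vertices: int, total_vertices: int) -> float:
    """Return R = JVs / TVs, the share of query vertices that take part in a join."""
    if total_vertices <= 0:
        raise ValueError("total_vertices must be positive")
    return join_vertices / total_vertices


def check_rows(rows, tolerance: float = 0.001) -> dict:
    """Map each query name to True if the recomputed R is within `tolerance` of the reported R."""
    return {
        name: abs(join_vertex_ratio(jvs, tvs) - reported) <= tolerance
        for name, tvs, jvs, reported in rows
    }


if __name__ == "__main__":
    for name, ok in check_rows(ROWS).items():
        print(name, "consistent" if ok else "differs from table")
```

The same check can be extended to the remaining rows by adding them to `ROWS`.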
Comparison of the source selection in terms of the number of ASK requests (#AR), the total number of triple pattern-wise sources selected (#TP), the source selection time (SST) in msec, and the total number of results retrieved (#R) per query. T/A = Total/Avg., where Total applies to #AR and #TP, and Avg. applies to SST
| Query | FedX (cold) #AR | FedX (cold) #TP | FedX (cold) SST | FedX (cold) #R | BioFed #AR | BioFed #TP | BioFed SST | BioFed #R |
|---|---|---|---|---|---|---|---|---|
| SQ1 | 40 | 4 | 3374 | 5146 | 0 | 4 | 1061 | 5146 |
| SQ2 | 70 | 7 | 3513 | 3 | 20 | 7 | 386 | 3 |
| SQ3 | 60 | 8 | 3194 | 393 | 10 | 8 | 280 | 403 |
| SQ4 | 50 | 7 | 3234 | 6 | 20 | 7 | 6255 | 28 |
| SQ5 | 50 | 6 | 3289 | 1620 | 0 | 6 | 849 | 1620 |
| SQ6 | 30 | 3 | 3281 | 8120 | 0 | 3 | 47 | 8120 |
| SQ7 | 40 | 19 | 4088 | 27 | 0 | 19 | 3804 | 27 |
| SQ8 | 30 | 2 | 3587 | 0 | 0 | 2 | 165 | 0 |
| SQ9 | 80 | 11 | 3218 | - | 10 | 11 | 297 | - |
| SQ10 | 80 | 11 | 3234 | - | 10 | 11 | 268 | - |
| T/A | 530 | 78 | 3401 | - | 70 | 78 | 1341 | - |
| CQ1 | 80 | 9 | 3354 | - | 10 | 9 | 249 | - |
| CQ2 | 80 | 9 | 3242 | 4 | 10 | 9 | 2238 | 4 |
| CQ3 | 100 | 28 | 3148 | 7 | 20 | 28 | 1743 | - |
| CQ4 | 60 | 12 | 3136 | 133986 | 0 | 12 | 1967 | 134025 |
| CQ5 | 100 | 16 | 3751 | 2940 | 10 | 16 | 1122 | 2940 |
| CQ6 | 120 | 18 | 4675 | 4781 | 10 | 18 | 694 | 4781 |
| CQ7 | 80 | 8 | 3283 | 372 | 0 | 8 | 713 | 372 |
| CQ8 | 60 | 6 | 9621 | 21 | 20 | 6 | 560 | 21 |
| CQ9 | 90 | 9 | 7112 | - | 10 | 9 | 195 | - |
| CQ10 | 90 | 15 | 93852 | 22888 | 0 | 15 | 1345 | 63948 |
| T/A | 860 | 104 | 135174 | - | 90 | 104 | 1082 | - |
| Net T/A | 1390 | 182 | 69287 | - | 160 | 182 | 1211 | - |
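
The FedX (cold) #AR column is consistent with one ASK request per triple pattern per endpoint: SQ1, for example, has 4 triple patterns and 40 ASK requests over the 10 endpoints of the experimental setup. The sketch below illustrates this kind of ASK-based source selection against in-memory stand-ins for endpoints. It is not the FedX or BioFed implementation; the `FakeEndpoint` class, the `ex:` identifiers, and the sample triples are hypothetical, and how BioFed reduces the number of ASK requests is not modelled.

```python
from typing import Dict, Iterable, List, Tuple

# (subject, predicate, object); a term starting with "?" is a variable.
TriplePattern = Tuple[str, str, str]


class FakeEndpoint:
    """Hypothetical stand-in for a SPARQL endpoint that keeps its triples in memory."""

    def __init__(self, name: str, triples: Iterable[TriplePattern]):
        self.name = name
        self.triples = set(triples)
        self.ask_count = 0  # number of ASK probes received

    def ask(self, pattern: TriplePattern) -> bool:
        """Mimic `ASK { s p o }`: True if at least one stored triple matches.

        Simplification: a variable repeated inside one pattern is not required
        to bind to the same value.
        """
        self.ask_count += 1
        return any(
            all(term.startswith("?") or term == value for term, value in zip(pattern, triple))
            for triple in self.triples
        )


def select_sources(
    patterns: List[TriplePattern], endpoints: List[FakeEndpoint]
) -> Dict[TriplePattern, List[str]]:
    """Probe every endpoint once per triple pattern and keep those that answer True."""
    return {
        pattern: [ep.name for ep in endpoints if ep.ask(pattern)]
        for pattern in patterns
    }


# Made-up data for illustration only.
endpoints = [
    FakeEndpoint("drugbank", [("ex:d1", "ex:name", "Aspirin")]),
    FakeEndpoint("kegg", [("ex:d1", "ex:pathway", "ex:pw1")]),
]
patterns = [("?drug", "ex:name", "?name"), ("?drug", "ex:pathway", "?pw")]

selected = select_sources(patterns, endpoints)
total_asks = sum(ep.ask_count for ep in endpoints)  # len(patterns) * len(endpoints)
```

With 2 triple patterns and 2 endpoints, this probing scheme issues 4 ASK requests in total, which mirrors the triple patterns times endpoints relationship seen in the FedX (cold) column.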
Result set completeness and correctness: Table 5 below shows the completeness and correctness of the results
| System | SQ3 (393) | SQ4 (28) | CQ3 (7) | CQ4 (133986) | CQ10 (22888) |
|---|---|---|---|---|---|
| FedX | 393 | 6 | 7 | 133986 | 22888 |
| BioFed | 403 | 28 | - | 134025 | 63948 |
The values in brackets give the actual result size. The symbol - means that the query either did not return the complete results or ran without terminating
Fig. 3 Query execution time for simple category queries. Comparison of the execution times of simple queries run on FedX and BioFed
Fig. 4 Query execution time for complex category queries. Comparison of the execution times of complex queries run on FedX and BioFed