SPARQL-enabled identifier conversion with Identifiers.org.

Sarala M Wimalaratne¹, Jerven Bolleman¹, Nick Juty¹, Toshiaki Katayama¹, Michel Dumontier¹, Nicole Redaschi¹, Nicolas Le Novère², Henning Hermjakob¹, Camille Laibe¹.

Abstract

MOTIVATION: On the semantic web, and in the life sciences in particular, data is often distributed across multiple resources. Each of these sources is likely to use its own Internationalized Resource Identifier (IRI) for what is conceptually the same resource or database record. The lack of correspondence between identifiers is a barrier to executing federated SPARQL queries across life science data.
RESULTS: We introduce a novel SPARQL-based service to enable on-the-fly integration of life science data. This service uses the identifier patterns defined in the Identifiers.org Registry to generate a plurality of identifier variants, which can then be used to match source identifiers with target identifiers. We demonstrate the utility of this identifier integration approach by answering queries across major producers of life science Linked Data.
AVAILABILITY AND IMPLEMENTATION: The SPARQL-based identifier conversion service is available without restriction at http://identifiers.org/services/sparql.

Year: 2015 PMID: 25638809 PMCID: PMC4443684 DOI: 10.1093/bioinformatics/btv064

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Semantic Web technologies such as the Resource Description Framework (RDF; http://www.w3.org/TR/rdf-primer/) and SPARQL (http://www.w3.org/TR/rdf-sparql-query/) offer a powerful paradigm for publishing and exploring life science data through standardization of format and data access. For example, the open source Bio2RDF (Callahan ) project converts dozens of public biological databases and datasets from legacy formats into RDF, and provides a mechanism to explore these as Linked Data. Recently, established bioinformatics organizations such as DBCLS (http://togows.dbcls.jp/), NCBI (https://pubchem.ncbi.nlm.nih.gov/rdf/), neXtProt (Chichester ) and the EMBL-EBI in collaboration with the UniProt consortium (Jupp ) have made some datasets available in RDF, thereby significantly extending the Linked Open Data network. All of these efforts use HTTP-based Internationalized Resource Identifiers (IRIs) to identify and link data items. This facilitates querying across network-linked resources, but in the absence of a universal identifier system, mappings are needed between all the different identifiers in use. Identifiers.org (Juty ) provides resolvable, persistent IRIs that identify individual records (based on the existing entity identifiers assigned directly by the data providers). Although some Linked Data providers such as Bio2RDF and the EBI now make their data available with Identifiers.org IRIs (or mappings to them), this practice is not widely implemented. The resulting identifier mismatch makes it difficult to query multiple datasets simultaneously. String manipulation, which SPARQL supports, can be used to work around the mismatch, but it requires users to know in advance which IRI scheme each resource uses, making it a cumbersome and inefficient solution. To address the issue of identifier heterogeneity, we have developed a SPARQL-based service that generates on-the-fly identifier mappings for registered IRI patterns.
Here, we describe our novel method and demonstrate its functionality through service-enabled federated SPARQL queries. This system offers an automatic way to link and query over a rapidly growing number of semantic web friendly life science datasets.

2 Methods

We implemented a SPARQL-based service that generates a set of variant identifiers from a provided identifier. This service, implemented using the OpenRDF Sesame SPARQL engine (http://www.openrdf.org/), takes an incoming query pattern of the form <source IRI> owl:sameAs ?targetIRI and generates a set of triples, each consisting of the given subject, the owl:sameAs predicate and one generated target IRI. To do so, the service proceeds in three steps: it queries the curated Identifiers.org Registry to determine the data collection the source IRI belongs to, obtains the alternative IRI patterns registered for that collection, and finally generates and returns the alternative IRIs.
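The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the service's actual implementation (which runs inside the Sesame SPARQL engine): the `REGISTRY` dictionary is a small hand-written stand-in for the Identifiers.org Registry, the `{id}` placeholder syntax and the function names are hypothetical, and the choice to omit the source IRI from its own results is an assumption of the sketch.

```python
import re
from typing import Dict, List, Optional, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical stand-in for the Identifiers.org Registry: each data collection
# maps to its IRI patterns, with "{id}" marking the provider-assigned identifier.
REGISTRY: Dict[str, List[str]] = {
    "uniprot": [
        "http://identifiers.org/uniprot/{id}",
        "http://purl.uniprot.org/uniprot/{id}",
        "http://bio2rdf.org/uniprot:{id}",
    ],
    "interpro": [
        "http://identifiers.org/interpro/{id}",
        "http://bio2rdf.org/interpro:{id}",
    ],
}


def _pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn an IRI pattern into a regex that captures the local identifier."""
    prefix, suffix = pattern.split("{id}")
    return re.compile(
        "^" + re.escape(prefix) + r"(?P<id>[^/#?\s]+)" + re.escape(suffix) + "$"
    )


def find_collection(source_iri: str) -> Optional[Tuple[str, str]]:
    """Step 1: determine the originating collection and the local identifier."""
    for collection, patterns in REGISTRY.items():
        for pattern in patterns:
            match = _pattern_to_regex(pattern).match(source_iri)
            if match:
                return collection, match.group("id")
    return None


def same_as_triples(source_iri: str) -> List[Tuple[str, str, str]]:
    """Steps 2 and 3: fetch the collection's patterns and emit sameAs triples."""
    found = find_collection(source_iri)
    if found is None:
        return []  # unregistered IRI scheme: nothing to generate
    collection, local_id = found
    targets = [p.replace("{id}", local_id) for p in REGISTRY[collection]]
    # Assumption of this sketch: the source IRI itself is left out of the result.
    return [(source_iri, OWL_SAME_AS, t) for t in targets if t != source_iri]


if __name__ == "__main__":
    for triple in same_as_triples("http://purl.uniprot.org/uniprot/P04637"):
        print(triple)
```

Because the variants are derived from patterns rather than stored as explicit mappings, adding one pattern to the registry makes every identifier in that collection convertible at once.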

3 Results

The Identifiers.org Registry contains 531 data collections and over 1300 IRI patterns. The service can be used to find alternative but equivalent IRIs, or to check whether two IRIs identify the same concept. For supported data collections, this service eliminates the need to know the set of valid IRI patterns in advance and the need to devise elaborate string manipulation operations in a federated SPARQL query. The query example below illustrates how the service can be used to query across datasets with different IRI schemes. In this example, we run a federated query to find human proteins from UniProt and their domains from InterPro Bio2RDF that are used in a model's components (of type SBML species) from the BioModels Linked Dataset (Wimalaratne ). This query can be executed using the BioModels SPARQL endpoint (http://www.ebi.ac.uk/rdf/services/biomodels/sparql) and takes around 20 s. The service bridges the gap between the Identifiers.org-specified, Bio2RDF-specified and UniProt-specified identifiers. Further examples are available at http://identifiers.org/documentation.

    PREFIX rdfs: <>
    PREFIX owl: <>
    PREFIX dcterms: <>
    PREFIX sbmlrdf: <>
    PREFIX bqbio: <>
    PREFIX biomodel: <>
    PREFIX up: <>
    PREFIX taxon: <>
    PREFIX database: <>

    SELECT DISTINCT ?protein ?protein_domain ?domain_label
    WHERE {
      # query for species annotations in model BIOMD0000000372
      biomodel:BIOMD0000000372 sbmlrdf:species ?s .
      ?s sbmlrdf:name ?species .
      ?s bqbio:isVersionOf ?protein_term .

      # query for other IRIs for a given species annotation IRI
      SERVICE <> {
        ?protein_term owl:sameAs ?protein .
      }

      # query for human proteins and their matches to domains
      # in the InterPro database
      SERVICE <> {
        ?protein a up:Protein ;
                 up:organism taxon:9606 ;
                 rdfs:seeAlso ?protein_domain .
        ?protein_domain up:database database:InterPro .
      }

      # query for other IRIs for a given protein domain IRI
      SERVICE <> {
        ?protein_domain owl:sameAs ?uris .
      }

      # query for protein domain labels
      SERVICE <> {
        ?uris dcterms:title ?domain_label .
      }
    }
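Outside a federated query, the two uses mentioned above (listing equivalent IRIs, and checking whether two IRIs identify the same concept) could in principle be issued directly against the service. The following Python sketch only builds the query text and a request URL; it does not contact any server. The endpoint is the one given in the abstract and may no longer be live, the function names are hypothetical, and phrasing the equivalence check as an ASK query is an assumption rather than something the paper specifies. The `query` URL parameter follows the SPARQL 1.1 Protocol for GET requests.

```python
from urllib.parse import urlencode

# Endpoint as published in the paper; it may no longer be available.
ENDPOINT = "http://identifiers.org/services/sparql"
OWL_PREFIX = "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"


def _check_iri(iri: str) -> str:
    """Reject characters that cannot appear inside a SPARQL <...> IRI reference."""
    if any(ch in '<>"{}|^`\\' or ord(ch) <= 0x20 for ch in iri):
        raise ValueError(f"not a valid IRI reference: {iri!r}")
    return iri


def same_as_query(source_iri: str) -> str:
    """SELECT query asking for every IRI equivalent to source_iri."""
    return (
        OWL_PREFIX
        + "SELECT ?targetIRI WHERE {\n"
        + f"  <{_check_iri(source_iri)}> owl:sameAs ?targetIRI .\n"
        + "}"
    )


def same_concept_query(iri_a: str, iri_b: str) -> str:
    """ASK query for equivalence of two IRIs (assumed form, see lead-in)."""
    return (
        OWL_PREFIX
        + f"ASK {{ <{_check_iri(iri_a)}> owl:sameAs <{_check_iri(iri_b)}> }}"
    )


def request_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL 1.1 Protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = same_as_query("http://purl.uniprot.org/uniprot/P04637")
    print(query)
    print(request_url(ENDPOINT, query))
```

Validating the IRI before interpolating it keeps a malformed or hostile input from altering the structure of the generated query.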

4 Discussion

Leveraging the wealth of biomedical big data for discovery requires simple and effective approaches to the challenge of working with heterogeneous, overlapping and diverse data. Of particular concern is the assignment of different identifiers to identical resources as well as to conceptually identical resources. Identifier integration is the subject of much research, which focuses on integrating either conceptually identical objects or their relations (van Iersel ; Wein ; Chambers ). In contrast, our work focuses on the problem of having multiple identifiers for the same database object, which is an emerging issue among semantic web data providers. Our solution is rapid and scalable, and it will provide new identifier-based mappings as additional IRI patterns are added to the Identifiers.org Registry.

5 Conclusion

This IRI conversion service, provided by Identifiers.org as a SPARQL service, will enable users to focus on asking meaningful questions across biological datasets of interest rather than figuring out how to generate the right identifiers.

References

1. Improvements in the Protein Identifier Cross-Reference service.

Authors: Samuel P Wein; Richard G Côté; Marine Dumousseau; Florian Reisinger; Henning Hermjakob; Juan A Vizcaíno
Journal: Nucleic Acids Res Date: 2012-04-27 Impact factor: 16.971

2. Identifiers.org and MIRIAM Registry: community resources to provide persistent identification.

Authors: Nick Juty; Nicolas Le Novère; Camille Laibe
Journal: Nucleic Acids Res Date: 2011-12-02 Impact factor: 16.971

3. BioModels linked dataset.

Authors: Sarala M Wimalaratne; Pierre Grenon; Henning Hermjakob; Nicolas Le Novère; Camille Laibe
Journal: BMC Syst Biol Date: 2014-08-15

4. The EBI RDF platform: linked open data for the life sciences.

Authors: Simon Jupp; James Malone; Jerven Bolleman; Marco Brandizi; Mark Davies; Leyla Garcia; Anna Gaulton; Sebastien Gehant; Camille Laibe; Nicole Redaschi; Sarala M Wimalaratne; Maria Martin; Nicolas Le Novère; Helen Parkinson; Ewan Birney; Andrew M Jenkinson
Journal: Bioinformatics Date: 2014-01-11 Impact factor: 6.937

5. The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services.

Authors: Martijn P van Iersel; Alexander R Pico; Thomas Kelder; Jianjiong Gao; Isaac Ho; Kristina Hanspers; Bruce R Conklin; Chris T Evelo
Journal: BMC Bioinformatics Date: 2010-01-04 Impact factor: 3.169

6. UniChem: a unified chemical structure cross-referencing and identifier tracking system.

Authors: Jon Chambers; Mark Davies; Anna Gaulton; Anne Hersey; Sameer Velankar; Robert Petryszak; Janna Hastings; Louisa Bellis; Shaun McGlinchey; John P Overington
Journal: J Cheminform Date: 2013-01-14 Impact factor: 5.514
