The EBI RDF platform: linked open data for the life sciences.

Simon Jupp, James Malone, Jerven Bolleman, Marco Brandizi, Mark Davies, Leyla Garcia, Anna Gaulton, Sebastien Gehant, Camille Laibe, Nicole Redaschi, Sarala M Wimalaratne, Maria Martin, Nicolas Le Novère, Helen Parkinson, Ewan Birney, Andrew M Jenkinson.

Abstract

MOTIVATION: Resource description framework (RDF) is an emerging technology for describing, publishing and linking life science data. As a major provider of bioinformatics data and services, the European Bioinformatics Institute (EBI) is committed to making data readily accessible to the community in ways that meet existing demand. The EBI RDF platform has been developed to meet an increasing demand to coordinate RDF activities across the institute and provides a new entry point to querying and exploring integrated resources available at the EBI.

Year:  2014        PMID: 24413672      PMCID: PMC3998127          DOI: 10.1093/bioinformatics/btt765

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

The European Bioinformatics Institute (EBI) is the largest bioinformatics resource provider in Europe. Our databases are accessible via dedicated interfaces, web services, data download and (in a few cases) direct database access. Modern research in the life sciences necessitates an understanding of data at many different levels: multi-omics, from cells to biological systems, across many different species and under many different experimental conditions. The biology underpinning these research questions is intrinsically connected, yet data are often collected and stored in technology- or domain-specific repositories. The Semantic Web community is already investing in technology that enables data to be readily integrated (Belleau; Katayama; Marshall). One approach used in that community is to represent data with the W3C’s Resource Description Framework (RDF) model. RDF provides a common mechanism for describing data, and SPARQL provides a standard language for querying it. To better serve complex research questions across resources, and to meet an increased demand on the EBI to produce RDF, we have developed an RDF platform. The aim of the platform is to let users ask questions across multiple connected resources that share common identifiers, a common format (RDF) and a common query interface (SPARQL). The platform complements existing data access modes such as our Web site and RESTful web services, but additionally contains explicit links between the different data resources. This enables a single query to be asked across multiple distributed datasets and across a range of biological domains. The approach has been applied to the following EBI resources: Gene Expression Atlas (Kapushesky), ChEMBL (Gaulton), BioModels (Li), Reactome (Matthews) and BioSamples (Gostev). It also includes a collaboration with the UniProt Consortium to deliver UniProt RDF (Redaschi and UniProt Consortium, 2009).
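To make concrete why shared identifiers matter, the following is a minimal sketch, not the platform's implementation. It models two datasets as lists of subject-predicate-object triples and answers one question by joining them on a URI they have in common. Every URI and predicate name here (`http://example.org/...`, `ex:encodes`, `ex:hasTarget`, `ex:differentiallyExpressedIn`) is invented for illustration and does not come from any EBI dataset.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical URIs and predicates, invented for illustration only.
PROTEIN = "http://example.org/protein/P1"
CONDITION = "http://example.org/condition/C1"

# Dataset A: which gene is differentially expressed where, and what it encodes.
expression_data: List[Triple] = [
    ("http://example.org/gene/G1", "ex:differentiallyExpressedIn", CONDITION),
    ("http://example.org/gene/G1", "ex:encodes", PROTEIN),
]

# Dataset B: which compound targets which protein. It reuses the same protein URI.
target_data: List[Triple] = [
    ("http://example.org/compound/M1", "ex:hasTarget", PROTEIN),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def compounds_for_condition(graph: List[Triple], condition: str) -> List[str]:
    """Join three patterns on shared URIs: condition -> gene -> protein -> compound."""
    found = set()
    for gene, _, _ in match(graph, p="ex:differentiallyExpressedIn", o=condition):
        for _, _, protein in match(graph, s=gene, p="ex:encodes"):
            for compound, _, _ in match(graph, p="ex:hasTarget", o=protein):
                found.add(compound)
    return sorted(found)


# Merging the datasets is plain concatenation because both use the same protein URI.
merged = expression_data + target_data
```

Calling `compounds_for_condition(merged, CONDITION)` returns the one compound URI, whereas the same call on `expression_data` alone returns an empty list. A SPARQL engine performs this kind of pattern join declaratively and at scale.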

2 METHODS

The RDF platform represents a coordinated effort to bring together RDF resources from multiple services and databases at the EBI. Development began by collecting requirements from both a scientific and a technical perspective. The scientific requirements were gathered as a series of use cases and competency questions collected from research scientists and users of EBI services. In particular, we looked for questions that require data to be integrated from multiple resources and that are not trivial to answer with our existing infrastructure because of the disparate nature of the data. These questions were used to identify points of integration between resources. The scientific use cases informed the technical requirements for the infrastructure, in terms of both software and hardware, needed to deliver a stable and scalable platform. Given that RDF technology is still maturing, there are open questions about how to deliver such a platform at this scale; our current infrastructure was chosen after evaluating various technologies, which will be the subject of another paper. Data from UniProt, ChEMBL, Reactome and BioModels represent curated knowledge ranging from protein sequence and function, through bio-active molecules and their targets, to biochemical pathways and computational models of molecular interactions. The Gene Expression Atlas database provides differential gene expression data from a variety of samples that are highly annotated and curated using the Experimental Factor Ontology (EFO) (Malone). Generating linked RDF for these resources provides a new entry point for exploring the data, such as putting gene expression in the context of protein function, pathways and drug targets. An outline of how resources are connected is shown in Figure 1.
Fig. 1.

Connections between services (boxes) and ontologies (circles). The graph illustrates how the data are linked within the RDF platform, enabling queries to span all data. Asterisk: ENSEMBL to UniProt (gray line) mappings are included via Expression Atlas.

The graph-based nature of the RDF data model provides a natural fit for explicitly publishing how data are connected. In RDF, resources are identified using uniform resource identifiers (URIs), which provide a web-based global identification system. Guidelines for minting new URIs for EBI resources were established using the new rdf.ebi.ac.uk domain (details can be found at http://www.ebi.ac.uk/rdf/documentation/uris-ebi-data). Canonical URIs are used when existing databases, such as UniProt, already provide stable URIs. In cases where no canonical URIs are provided by external resources, the Identifiers.org registry of scientific identifiers (Juty) was used to provide a referencing URI. As part of the URI strategy, every effort has been made to ensure that all EBI RDF datasets only use URIs that can be dereferenced over HTTP, with content negotiation serving human-orientated HTML views alongside machine-processable versions in various RDF syntaxes. Using common URI schemes assists data integration with RDF. In addition, ontologies provide a mechanism to semantically describe the data, and the OWL ontology language can be serialized in RDF. The EBI makes extensive use of ontologies to annotate data; however, the richness of these annotations is rarely available in native RDF for exploitation by external applications. The EBI RDF platform adopts a range of common vocabularies and ontologies to annotate data. The ontologies used span common biomedical terminologies such as the Gene Ontology, Chemical Entities of Biological Interest, UBERON, Cell Type Ontology, Biological Pathways Exchange, EFO and more. Additionally, we adopted metadata standards for describing datasets and provenance, such as Dublin Core, Data Catalog Vocabulary and Vocabulary of Interlinked Datasets.
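The URI strategy above can be illustrated with a short sketch of how a client might ask for different representations of the same resource. The code only constructs requests and does not send them. The `http://identifiers.org/{namespace}/{accession}` pattern and the example accession `P12345` are assumptions made for illustration, not details taken from this paper, and the helper names are hypothetical.

```python
from urllib.request import Request

# Standard media types for an HTML view and two common RDF syntaxes.
MEDIA_TYPES = {
    "html": "text/html",
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
}


def identifiers_org_uri(namespace: str, accession: str) -> str:
    """Build a referencing URI. The URI pattern is an assumption for illustration."""
    return f"http://identifiers.org/{namespace}/{accession}"


def negotiated_request(uri: str, fmt: str) -> Request:
    """Prepare (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": MEDIA_TYPES[fmt]})


# Same URI, two representations: one for people, one for machines.
uri = identifiers_org_uri("uniprot", "P12345")  # hypothetical example accession
for_browser = negotiated_request(uri, "html")
for_machine = negotiated_request(uri, "turtle")
```

Passing either prepared request to `urllib.request.urlopen` would perform the actual dereferencing. Whether a given server honours each media type depends on that server.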

3 RESULTS

Complete dumps of the RDF data are available via FTP downloads. These are published in line with existing production and release cycles, ensuring the most up-to-date data are readily available. We also use triple store technology to index the RDF files and make them available for querying and exploration via SPARQL endpoints and our linked data browser. The underlying infrastructure at the EBI is built on open source triple store technology provided by OpenLink (http://www.openlinksw.com/), whereas the UniProt data are served by the SIB’s Vital-IT HPC platform using technology from OntoText (http://www.ontotext.com/). We developed LODEStar (http://www.ebi.ac.uk/fgpt/sw/lodestar/) as a generic SPARQL endpoint and linked data browser to provide a consistent interface and some enhanced functionality for querying and browsing EBI-based datasets. In addition to providing access to the underlying data, an equally important component of the platform is the Web site at http://www.ebi.ac.uk/rdf, which provides an entry point to discover all RDF resources being served by the EBI. This site includes documentation on how to find the datasets and provides examples of how to query the data using the SPARQL endpoints (http://www.ebi.ac.uk/rdf/example-sparql-queries). We also provide examples showing developers how they can use the SPARQL API programmatically from common programming environments such as Perl, Java and R.
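As a rough illustration of programmatic access, the sketch below builds a SPARQL query request and parses a response in the standard SPARQL JSON results format, using only the Python standard library. It is not one of the platform's published examples. The endpoint URL is a placeholder that must be replaced with a real endpoint address taken from the platform documentation, and the query is a generic one that assumes nothing about any dataset's schema.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder only: substitute a real SPARQL endpoint URL from the documentation.
ENDPOINT = "https://example.org/sparql"

# Generic query that works against any RDF dataset.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""


def build_request(endpoint: str, query: str) -> Request:
    """Prepare a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_bindings(payload: str) -> List[Dict[str, str]]:
    """Flatten a SPARQL JSON results document into one dict per result row."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


def run_query(endpoint: str, query: str) -> List[Dict[str, str]]:
    """Send the query over HTTP and return the parsed rows (needs network access)."""
    with urlopen(build_request(endpoint, query), timeout=30) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

With a real endpoint in place, `run_query(ENDPOINT, QUERY)` would return up to five rows, each mapping `s`, `p` and `o` to string values. Keeping request construction and response parsing in separate functions lets both be checked without network access.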

4 CONCLUSION

The EBI RDF platform allows explicit links to be made between datasets using shared semantics from standard ontologies and vocabularies, facilitating a greater degree of data integration. SPARQL provides a standard query language for querying RDF data. Data that have been annotated using ontologies, such as EFO and the Gene Ontology, enable integration with other community datasets and provide the semantics to perform rich queries. Publishing these datasets as RDF along with their ontologies provides both the syntactic and semantic integration of data long promised by Semantic Web technologies. As the trend toward publishing life science data in RDF grows, we anticipate a rise in the number of applications consuming such data. This is already evident in efforts such as the Open PHACTS platform (http://www.openphacts.org) and the AtlasRDF-R package (https://github.com/jamesmalone/AtlasRDF-R). Our aim is for the EBI RDF platform to enable such applications to be built, by releasing production-quality services with semantically described RDF so that pertinent biomedical use cases can be addressed.

REFERENCES

1.  Bio2RDF: towards a mashup to build bioinformatics knowledge systems.

Authors:  François Belleau; Marc-Alexandre Nolin; Nicole Tourigny; Philippe Rigault; Jean Morissette
Journal:  J Biomed Inform       Date:  2008-03-21       Impact factor: 6.317

2.  Modeling sample variables with an Experimental Factor Ontology.

Authors:  James Malone; Ele Holloway; Tomasz Adamusiak; Misha Kapushesky; Jie Zheng; Nikolay Kolesnikov; Anna Zhukova; Alvis Brazma; Helen Parkinson
Journal:  Bioinformatics       Date:  2010-03-03       Impact factor: 6.937

3.  BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models.

Authors:  Chen Li; Marco Donizelli; Nicolas Rodriguez; Harish Dharuri; Lukas Endler; Vijayalakshmi Chelliah; Lu Li; Enuo He; Arnaud Henry; Melanie I Stefan; Jacky L Snoep; Michael Hucka; Nicolas Le Novère; Camille Laibe
Journal:  BMC Syst Biol       Date:  2010-06-29

4.  The 3rd DBCLS BioHackathon: improving life science data integration with Semantic Web technologies.

Authors:  Toshiaki Katayama; Mark D Wilkinson; Gos Micklem; Shuichi Kawashima; Atsuko Yamaguchi; Mitsuteru Nakao; Yasunori Yamamoto; Shinobu Okamoto; Kenta Oouchida; Hong-Woo Chun; Jan Aerts; Hammad Afzal; Erick Antezana; Kazuharu Arakawa; Bruno Aranda; Francois Belleau; Jerven Bolleman; Raoul Jp Bonnal; Brad Chapman; Peter Ja Cock; Tore Eriksson; Paul Mk Gordon; Naohisa Goto; Kazuhiro Hayashi; Heiko Horn; Ryosuke Ishiwata; Eli Kaminuma; Arek Kasprzyk; Hideya Kawaji; Nobuhiro Kido; Young Joo Kim; Akira R Kinjo; Fumikazu Konishi; Kyung-Hoon Kwon; Alberto Labarga; Anna-Lena Lamprecht; Yu Lin; Pierre Lindenbaum; Luke McCarthy; Hideyuki Morita; Katsuhiko Murakami; Koji Nagao; Kozo Nishida; Kunihiro Nishimura; Tatsuya Nishizawa; Soichi Ogishima; Keiichiro Ono; Kazuki Oshita; Keun-Joon Park; Pjotr Prins; Taro L Saito; Matthias Samwald; Venkata P Satagopam; Yasumasa Shigemoto; Richard Smith; Andrea Splendiani; Hideaki Sugawara; James Taylor; Rutger A Vos; David Withers; Chisato Yamasaki; Christian M Zmasek; Shoko Kawamoto; Kosaku Okubo; Kiyoshi Asai; Toshihisa Takagi
Journal:  J Biomed Semantics       Date:  2013-02-11

5.  Gene Expression Atlas update--a value-added database of microarray and sequencing-based functional genomics experiments.

Authors:  Misha Kapushesky; Tomasz Adamusiak; Tony Burdett; Aedin Culhane; Anna Farne; Alexey Filippov; Ele Holloway; Andrey Klebanov; Nataliya Kryvych; Natalja Kurbatova; Pavel Kurnosov; James Malone; Olga Melnichuk; Robert Petryszak; Nikolay Pultsin; Gabriella Rustici; Andrew Tikhonov; Ravensara S Travillian; Eleanor Williams; Andrey Zorin; Helen Parkinson; Alvis Brazma
Journal:  Nucleic Acids Res       Date:  2011-11-07       Impact factor: 16.971

6.  The BioSample Database (BioSD) at the European Bioinformatics Institute.

Authors:  Mikhail Gostev; Adam Faulconbridge; Marco Brandizi; Julio Fernandez-Banet; Ugis Sarkans; Alvis Brazma; Helen Parkinson
Journal:  Nucleic Acids Res       Date:  2011-11-16       Impact factor: 16.971

7.  Identifiers.org and MIRIAM Registry: community resources to provide persistent identification.

Authors:  Nick Juty; Nicolas Le Novère; Camille Laibe
Journal:  Nucleic Acids Res       Date:  2011-12-02       Impact factor: 16.971

8.  ChEMBL: a large-scale bioactivity database for drug discovery.

Authors:  Anna Gaulton; Louisa J Bellis; A Patricia Bento; Jon Chambers; Mark Davies; Anne Hersey; Yvonne Light; Shaun McGlinchey; David Michalovich; Bissan Al-Lazikani; John P Overington
Journal:  Nucleic Acids Res       Date:  2011-09-23       Impact factor: 16.971

9.  Reactome knowledgebase of human biological pathways and processes.

Authors:  Lisa Matthews; Gopal Gopinath; Marc Gillespie; Michael Caudy; David Croft; Bernard de Bono; Phani Garapati; Jill Hemish; Henning Hermjakob; Bijay Jassal; Alex Kanapin; Suzanna Lewis; Shahana Mahajan; Bruce May; Esther Schmidt; Imre Vastrik; Guanming Wu; Ewan Birney; Lincoln Stein; Peter D'Eustachio
Journal:  Nucleic Acids Res       Date:  2008-11-03       Impact factor: 16.971
