
ELIXIR biovalidator for semantic validation of life science metadata.

Isuru Liyanage¹, Tony Burdett¹, Bert Droesbeke²,³, Karoly Erdos¹, Rolando Fernandez¹, Alasdair Gray⁴, Muhammad Haseeb¹, Flavia Penim¹, Cyril Pommier⁵,⁶, Philippe Rocca-Serra⁷, Mélanie Courtot¹,⁸, Frederik Coppens²,³.

Abstract

SUMMARY: To advance biomedical research, increasingly large amounts of complex data need to be discovered and integrated. This requires syntactic and semantic validation to ensure a shared understanding of relevant entities. This article describes the ELIXIR biovalidator, which extends the syntactic validation of the widely used Ajv library with ontology-based validation of JSON documents.
AVAILABILITY AND IMPLEMENTATION: Source code: https://github.com/elixir-europe/biovalidator, Release: v1.9.1, License: Apache License 2.0, Deployed at: https://www.ebi.ac.uk/biosamples/schema/validator/validate.

Year: 2022 PMID: 35380605 PMCID: PMC9154242 DOI: 10.1093/bioinformatics/btac195

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

Today’s genomics data ecosystem has been described as a ‘Tower of Babel’, due to an ever-increasing amount of data generated, using different technologies, in a widening number of domains, hosted in a constantly growing number of databases. This massive diversification makes data science an extremely labour-intensive and thus costly undertaking. Data FAIRification (Wilkinson et al.) aims to address these challenges by promoting adherence to a set of principles that facilitate data reuse and interoperability. Validation of metadata describing biomedical entities is a crucial part of this process. However, validation rules are often hard-coded in specific resources and not shared efficiently. Moreover, checklists such as those used by archives (Harrison et al.) can still lead to various interpretations and diverging implementations, resulting in data heterogeneity that prevents efficient reuse. Therefore, alongside clear documentation of best practices, real-world implementations of tools enforcing shared validation processes are needed.
JavaScript Object Notation (JSON) is an IETF standard specifying a lightweight data interchange format. JSON Schema is a vocabulary for specifying the structure of a JSON document. Both JSON and JSON Schema are extensively used for data exchange, APIs and standard definitions. Whilst JSON Schema provides a comprehensive vocabulary to validate the structure and syntax of a JSON document, it contributes little to checking the semantics of the content. In life sciences, compliance with metadata schemas often requires assessing whether a value adheres to specified ontologies, e.g. checking that the value of a ‘disease’ attribute is a subclass of a disease ontology term. To ensure high-quality metadata, such strict validation checks are required, specifically via queries based on the ontology structure itself.
To address this, we have extended the JSON Schema vocabulary with custom keywords that describe how a particular property constrained to an ontology term identifier should be validated. This paper describes how we deployed the ELIXIR biovalidator and applied it to plant-related use cases to enhance the FAIRness of the data collected and submitted to public archives.
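The gap between structural and semantic checks described above can be illustrated with a minimal sketch. It is not the biovalidator's code: the schema fragment, the helper check_syntax and both sample documents are hypothetical, and only a handful of JSON Schema keywords are emulated with the Python standard library. The point is that a well-formed identifier passes a pattern check whether or not it names a real ontology class.

```python
import re

# Hypothetical schema fragment: a purely syntactic constraint on the identifier.
SCHEMA = {
    "type": "object",
    "required": ["disease_ontology_id"],
    "properties": {
        "disease_ontology_id": {"type": "string", "pattern": r"^MONDO:\d{7}$"}
    },
}


def check_syntax(document, schema):
    """Return error messages for the few JSON Schema keywords emulated here."""
    errors = []
    for name in schema.get("required", []):
        if name not in document:
            errors.append(f"missing required property: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in document:
            continue
        value = document[name]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected a string")
            continue
        pattern = rules.get("pattern")
        if pattern and not re.search(pattern, value):
            errors.append(f"{name}: does not match {pattern}")
    return errors


# An arbitrarily chosen identifier: the pattern check cannot tell whether
# it names an existing class, let alone a disease.
well_formed = {"disease_ontology_id": "MONDO:9999999"}
malformed = {"disease_ontology_id": "mondo-1"}

syntax_ok = check_syntax(well_formed, SCHEMA)
syntax_bad = check_syntax(malformed, SCHEMA)
print(syntax_ok)
print(syntax_bad)
```

Only the malformed value is reported. Deciding whether the well-formed identifier is meaningful requires a lookup against the ontology itself, which is what the extended keywords are meant to provide.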

2 Implementation

We have developed the ELIXIR biovalidator, a tool for validating life sciences metadata, encoded as JSON documents, against declarative metadata standards encoded as JSON Schema. The ELIXIR biovalidator is based on the widely used Ajv JSON Schema validator (Poberezkin, 2021). Through the addition of validation rules for user-defined keywords, we have augmented the validator with ontology-based constraints, such as isValidTerm, which checks whether a given ontology term exists in the EMBL-EBI Ontology Lookup Service (OLS) (Jupp et al.). At the time of writing, the ELIXIR biovalidator supports four extended keywords for ontology and taxonomy validation (elixir-europe, 2021). These four keywords enable different modalities of ontology-based validation against any class in the OLS. For example, the keyword graph_restriction, used with a parent term ID and an ontology ID, allows us to express that a JSON property such as disease_ontology_id can only take terms from the Phenotype and Trait Ontology (PATO) or the Monarch Disease Ontology (MONDO). Furthermore, each term must be a subclass of the disease classes PATO:0000461 or MONDO:0000001.
The ELIXIR biovalidator can run as a service or as a one-time script to validate a given JSON document against a schema (elixir-europe, 2021). When it is run as a service, users can validate through the web interface or through an API, which is better suited to batch validation. A Docker image is available for testing in a local environment. The biovalidator is currently deployed in the data ingest system of the Human Cell Atlas project as well as in EMBL-EBI BioSamples (Courtot et al.), where it was used to ensure compliance of over 18 million samples with multiple checklists, such as MIxS and MIAPPE [Minimal Information about a Plant Phenotyping Experiment (Papoutsoglou et al.)] for genomic and plant metadata, respectively.
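As a rough illustration of the kind of constraint a keyword such as graph_restriction expresses, the sketch below checks a term against an allowed set of ontologies and parent classes. It makes several assumptions. The biovalidator resolves such checks through the OLS, whereas this example uses a small in-memory graph. The "toy_" terms are invented, and only the two parent class identifiers come from the text. The key names in RESTRICTION and the function names are illustrative and may not match the actual keyword definition.

```python
from collections import deque

# Toy ontology graph: term -> direct parent terms. Only the two parent class
# identifiers come from the text; every "toy_" term is invented.
TOY_PARENTS = {
    "MONDO:0000001": [],
    "PATO:0000461": [],
    "MONDO:toy_1": ["MONDO:0000001"],
    "MONDO:toy_2": ["MONDO:toy_1"],
    "PATO:toy_3": ["PATO:0000461"],
    "PATO:toy_4": [],  # in an allowed ontology, but under no allowed class
}

# Hypothetical restriction; the key names are illustrative only.
RESTRICTION = {
    "ontologies": {"mondo", "pato"},
    "classes": {"MONDO:0000001", "PATO:0000461"},
}


def ancestors(term, parents):
    """Collect all strict ancestors of `term` by breadth-first search."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def satisfies_restriction(term, restriction, parents):
    """True if `term` is known, from an allowed ontology, and below an allowed class."""
    if term not in parents:
        return False  # unknown term
    ontology = term.split(":", 1)[0].lower()
    if ontology not in restriction["ontologies"]:
        return False
    # Strict descendants only: in this sketch a parent class does not match itself.
    return bool(ancestors(term, parents) & restriction["classes"])


def validate_property(document, prop, restriction, parents):
    """Return a list of error messages for one ontology-constrained property."""
    value = document.get(prop)
    if not isinstance(value, str):
        return [f"{prop}: missing or not a string"]
    if not satisfies_restriction(value, restriction, parents):
        return [f"{prop}: {value} violates the ontology restriction"]
    return []


accepted = validate_property(
    {"disease_ontology_id": "MONDO:toy_2"}, "disease_ontology_id", RESTRICTION, TOY_PARENTS
)
rejected = validate_property(
    {"disease_ontology_id": "PATO:toy_4"}, "disease_ontology_id", RESTRICTION, TOY_PARENTS
)
print(accepted)
print(rejected)
```

Whether a parent class should match itself, and which relations besides subclass are followed, are design choices that this sketch fixes arbitrarily.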

3 Validation of plant metadata

Plant research institutes across the globe have developed databases and tools to manage and store plant phenotyping data, tailored to their specific use cases. MIAPPE is an open, community-driven metadata standard that adequately describes plant phenotyping experiments. The Breeding API [BrAPI (Selby et al.)] was developed synergistically with MIAPPE to provide a common, programmatic interface that ensures interoperability of databases and tools through a common metadata representation; BrAPI is therefore a web service API implementation of MIAPPE. This standardized API enables the development of scripts that work on all BrAPI-enabled plant phenotyping databases. One such script, BrAPI2Biosamples, can be used to export JSON objects using the MIAPPE nomenclature (Supplementary material). The ELIXIR biovalidator can validate these objects of (user-provided) metadata, supporting high-quality FAIR data for plant phenotyping. The ontology validation ensures the semantic validity of any ontology terms present in MIAPPE-compliant data. This also facilitates the submission of MIAPPE-compliant data to BioSamples, as the same validator is used by BioSamples for validating sample metadata either before or at submission time (Fig. 1). The development of an independent module allowed for the integration of the ELIXIR biovalidator into the BrAPI ecosystem. In the future, we will also implement the validation in data management platforms such as FAIRDOM/SEEK (Wolstencroft et al.) and in the ISA (Johnson et al.) model and its JSON Schema definition.

Fig. 1.

Data validation for the plant use case. A data submitter uses an institutional data repository as a broker to submit BioSamples metadata through the API, which is validated against the MIAPPE JSON Schema. This metadata from the plant phenotyping databases is exposed through the Breeding API (BrAPI) and formatted into JSON objects using the BrAPI2Biosamples script. These objects can be validated using the ELIXIR biovalidator against a MIAPPE JSON Schema checklist.

4 Conclusion

The ELIXIR biovalidator makes it possible to verify compliance of both the structure and the content of JSON documents by extending the existing JSON Schema syntax. The biovalidator is capable of validating ontology terms embedded in JSON documents against requirements. Enabling this quality control for community standards is crucial to develop semantic interoperability in a distributed ecosystem of FAIR digital objects, as envisioned in the European Open Science Cloud Interoperability Framework (Corcho et al.). In the future, we plan to further extend the biovalidator with support for identifier cross-reference checking by integrating it with Identifiers.org (Juty et al.). This will enable the biovalidator to check the validity of accessions present in the JSON data.
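To make the planned accession check more concrete, here is a minimal sketch of what a purely local, syntactic version of it might look like. It assumes accessions are written as compact identifiers of the form prefix:accession, the form used by Identifiers.org. The registry contents, the prefix "toydb" and the function name are hypothetical, and nothing here reflects how the biovalidator will actually implement the feature.

```python
import re

# Hypothetical local registry: prefix -> accession pattern. A real integration
# would obtain this information from Identifiers.org instead.
TOY_REGISTRY = {
    "toydb": re.compile(r"[A-Z]{3}\d{4}"),
}


def check_compact_identifier(identifier, registry):
    """Return None if the identifier looks valid, otherwise an error message."""
    prefix, separator, accession = identifier.partition(":")
    if not separator or not prefix or not accession:
        return "not of the form prefix:accession"
    pattern = registry.get(prefix.lower())
    if pattern is None:
        return f"unknown prefix: {prefix}"
    if not pattern.fullmatch(accession):
        return f"accession {accession!r} does not match the pattern for {prefix}"
    return None


for candidate in ["toydb:ABC1234", "toydb:abc", "other:1", "ABC1234"]:
    print(candidate, "->", check_compact_identifier(candidate, TOY_REGISTRY))
```

A pattern match only shows that an accession is well formed. Confirming that it refers to an existing record would still require resolving it against the target resource.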

8 in total

1. Identifiers.org and MIRIAM Registry: community resources to provide persistent identification.

Authors: Nick Juty; Nicolas Le Novère; Camille Laibe
Journal: Nucleic Acids Res Date: 2011-12-02 Impact factor: 16.971

2. SEEK: a systems biology data and model management platform.

Authors: Katherine Wolstencroft; Stuart Owen; Olga Krebs; Quyen Nguyen; Natalie J Stanford; Martin Golebiewski; Andreas Weidemann; Meik Bittkowski; Lihua An; David Shockley; Jacky L Snoep; Wolfgang Mueller; Carole Goble
Journal: BMC Syst Biol Date: 2015-07-11

3. BrAPI-an application programming interface for plant breeding applications.

Authors: Peter Selby; Rafael Abbeloos; Jan Erik Backlund; Martin Basterrechea Salido; Guillaume Bauchet; Omar E Benites-Alfaro; Clay Birkett; Viana C Calaminos; Pierre Carceller; Guillaume Cornut; Bruno Vasques Costa; Jeremy D Edwards; Richard Finkers; Star Yanxin Gao; Mehmood Ghaffar; Philip Glaser; Valentin Guignon; Puthick Hok; Andrzej Kilian; Patrick König; Jack Elendil B Lagare; Matthias Lange; Marie-Angélique Laporte; Pierre Larmande; David S LeBauer; David A Lyon; David S Marshall; Dave Matthews; Iain Milne; Naymesh Mistry; Nicolas Morales; Lukas A Mueller; Pascal Neveu; Evangelia Papoutsoglou; Brian Pearce; Ivan Perez-Masias; Cyril Pommier; Ricardo H Ramírez-González; Abhishek Rathore; Angel Manica Raquel; Sebastian Raubach; Trevor Rife; Kelly Robbins; Mathieu Rouard; Chaitanya Sarma; Uwe Scholz; Guilhem Sempéré; Paul D Shaw; Reinhard Simon; Nahuel Soldevilla; Gordon Stephen; Qi Sun; Clarysabel Tovar; Grzegorz Uszynski; Maikel Verouden
Journal: Bioinformatics Date: 2019-10-15 Impact factor: 6.937

4. BioSamples database: FAIRer samples metadata to accelerate research data management.

Authors: Mélanie Courtot; Dipayan Gupta; Isuru Liyanage; Fuqi Xu; Tony Burdett
Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971

5. The FAIR Guiding Principles for scientific data management and stewardship.

Authors: Mark D Wilkinson; Michel Dumontier; IJsbrand Jan Aalbersberg; Gabrielle Appleton; Myles Axton; Arie Baak; Niklas Blomberg; Jan-Willem Boiten; Luiz Bonino da Silva Santos; Philip E Bourne; Jildau Bouwman; Anthony J Brookes; Tim Clark; Mercè Crosas; Ingrid Dillo; Olivier Dumon; Scott Edmunds; Chris T Evelo; Richard Finkers; Alejandra Gonzalez-Beltran; Alasdair J G Gray; Paul Groth; Carole Goble; Jeffrey S Grethe; Jaap Heringa; Peter A C 't Hoen; Rob Hooft; Tobias Kuhn; Ruben Kok; Joost Kok; Scott J Lusher; Maryann E Martone; Albert Mons; Abel L Packer; Bengt Persson; Philippe Rocca-Serra; Marco Roos; Rene van Schaik; Susanna-Assunta Sansone; Erik Schultes; Thierry Sengstag; Ted Slater; George Strawn; Morris A Swertz; Mark Thompson; Johan van der Lei; Erik van Mulligen; Jan Velterop; Andra Waagmeester; Peter Wittenburg; Katherine Wolstencroft; Jun Zhao; Barend Mons
Journal: Sci Data Date: 2016-03-15 Impact factor: 6.444

6. Enabling reusability of plant phenomic datasets with MIAPPE 1.1.

Authors: Evangelia A Papoutsoglou; Daniel Faria; Daniel Arend; Elizabeth Arnaud; Ioannis N Athanasiadis; Inês Chaves; Frederik Coppens; Guillaume Cornut; Bruno V Costa; Hanna Ćwiek-Kupczyńska; Bert Droesbeke; Richard Finkers; Kristina Gruden; Astrid Junker; Graham J King; Paweł Krajewski; Matthias Lange; Marie-Angélique Laporte; Célia Michotey; Markus Oppermann; Richard Ostler; Hendrik Poorter; Ricardo Ramírez-González; Živa Ramšak; Jochen C Reif; Philippe Rocca-Serra; Susanna-Assunta Sansone; Uwe Scholz; François Tardieu; Cristobal Uauy; Björn Usadel; Richard G F Visser; Stephan Weise; Paul J Kersey; Célia M Miguel; Anne-Françoise Adam-Blondon; Cyril Pommier
Journal: New Phytol Date: 2020-04-25 Impact factor: 10.323

7. The European Nucleotide Archive in 2020.

Authors: Peter W Harrison; Alisha Ahamed; Raheela Aslam; Blaise T F Alako; Josephine Burgin; Nicola Buso; Mélanie Courtot; Jun Fan; Dipayan Gupta; Muhammad Haseeb; Sam Holt; Talal Ibrahim; Eugene Ivanov; Suran Jayathilaka; Vishnukumar Balavenkataraman Kadhirvelu; Manish Kumar; Rodrigo Lopez; Simon Kay; Rasko Leinonen; Xin Liu; Colman O'Cathail; Amir Pakseresht; Youngmi Park; Stephane Pesant; Nadim Rahman; Jeena Rajan; Alexey Sokolov; Senthilnathan Vijayaraja; Zahra Waheed; Ahmad Zyoud; Tony Burdett; Guy Cochrane
Journal: Nucleic Acids Res Date: 2020-11-11 Impact factor: 16.971

8. ISA API: An open platform for interoperable life science experimental metadata.

Authors: David Johnson; Dominique Batista; Keeva Cochrane; Robert P Davey; Anthony Etuk; Alejandra Gonzalez-Beltran; Kenneth Haug; Massimiliano Izzo; Martin Larralde; Thomas N Lawson; Alice Minotto; Pablo Moreno; Venkata Chandrasekhar Nainala; Claire O'Donovan; Luca Pireddu; Pierrick Roger; Felix Shaw; Christoph Steinbeck; Ralf J M Weber; Susanna-Assunta Sansone; Philippe Rocca-Serra
Journal: Gigascience Date: 2021-09-16 Impact factor: 6.524
