Michel Dumontier, Alasdair J G Gray, M Scott Marshall, Vladimir Alexiev, Peter Ansell, Gary Bader, Joachim Baran, Jerven T Bolleman, Alison Callahan, José Cruz-Toledo, Pascale Gaudet, Erich A Gombocz, Alejandra N Gonzalez-Beltran, Paul Groth, Melissa Haendel, Maori Ito, Simon Jupp, Nick Juty, Toshiaki Katayama, Norio Kobayashi, Kalpana Krishnaswami, Camille Laibe, Nicolas Le Novère, Simon Lin, James Malone, Michael Miller, Christopher J Mungall, Laurens Rietveld, Sarala M Wimalaratne, Atsuko Yamaguchi.
Abstract
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high-quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine-readable descriptions of versioned datasets.
Keywords: Data profiling; Dataset descriptions; FAIR data; Metadata; Provenance
Year: 2016 PMID: 27602295 PMCID: PMC4991880 DOI: 10.7717/peerj.2331
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Three-component model for dataset description.
Summary of the resources in the ChEMBL example dataset description.
| Resource | Description | Number of triples |
|---|---|---|
| | Summary level description of the ChEMBL dataset | 23 |
| | Version level description corresponding to version 17 of the ChEMBL dataset | 42 |
| | Distribution level description corresponding to an SQL dump of the ChEMBL17 database | 48 |
| | Distribution level description corresponding to an RDF release of the ChEMBL17 database in the Turtle serialisation | 107 |
Figure 2. Example Summary Level description for the ChEMBL database.
The full example is available in the Supplemental Information.
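As a rough illustration of how the summary, version, and distribution levels of a dataset description relate to one another, the following minimal Python sketch builds a handful of triples for an imaginary dataset and counts them per resource. It is not the ChEMBL description from the Supplemental Information: the `example.org` identifiers, the `lit` helper, and the function names are hypothetical, and the choice of Dublin Core, DCAT, and PAV terms at each level is an assumption made for this sketch.

```python
from collections import Counter

# Namespaces of existing vocabularies; which terms to use at each level
# is an assumption of this sketch, not taken from the text above.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT = "http://purl.org/dc/terms/"
DCTYPES = "http://purl.org/dc/dcmitype/"
DCAT = "http://www.w3.org/ns/dcat#"
PAV = "http://purl.org/pav/"

# Hypothetical identifiers for an imaginary dataset.
BASE = "http://example.org/dataset/"
SUMMARY = BASE + "demo"
VERSION = BASE + "demo-v1"
DIST_SQL = BASE + "demo-v1-sql"
DIST_TTL = BASE + "demo-v1-ttl"


def lit(text):
    """Mark a value as a plain string literal rather than an IRI."""
    return ("literal", text)


TRIPLES = [
    # Summary level: version-independent description of the dataset.
    (SUMMARY, RDF_TYPE, DCTYPES + "Dataset"),
    (SUMMARY, DCT + "title", lit("Demo dataset")),
    (SUMMARY, PAV + "hasCurrentVersion", VERSION),
    # Version level: one specific release, linked back to the summary.
    (VERSION, RDF_TYPE, DCTYPES + "Dataset"),
    (VERSION, DCT + "isVersionOf", SUMMARY),
    (VERSION, PAV + "version", lit("1")),
    (VERSION, DCAT + "distribution", DIST_SQL),
    (VERSION, DCAT + "distribution", DIST_TTL),
    # Distribution level: concrete files of that release in a given format.
    (DIST_SQL, RDF_TYPE, DCAT + "Distribution"),
    (DIST_SQL, DCT + "format", lit("application/sql")),
    (DIST_TTL, RDF_TYPE, DCAT + "Distribution"),
    (DIST_TTL, DCT + "format", lit("text/turtle")),
]


def triples_per_resource(triples):
    """Count how many triples have each resource as their subject."""
    return Counter(subject for subject, _, _ in triples)


def to_ntriples(triples):
    """Serialise the triples as N-Triples lines (plain string literals only)."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, tuple):
            escaped = o[1].replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        else:
            obj = "<" + o + ">"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)
```

Calling `triples_per_resource(TRIPLES)` gives a per-resource tally analogous to the "Number of triples" column in the table above, and `print(to_ntriples(TRIPLES))` writes the sketch out in a form that standard RDF tools can load.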