Yifan Chen, E A Huerta, Javier Duarte, Philip Harris, Daniel S Katz, Mark S Neubauer, Daniel Diaz, Farouk Mokhtar, Raghav Kansal, Sang Eon Park, Volodymyr V Kindratenko, Zhizhen Zhao, Roger Rusack.
Abstract
To enable the reusability of massive scientific datasets by humans and machines, researchers aim to adhere to the principles of findability, accessibility, interoperability, and reusability (FAIR) for data and artificial intelligence (AI) models. This article provides a domain-agnostic, step-by-step assessment guide to evaluate whether or not a given dataset meets these principles. We demonstrate how to use this guide to evaluate the FAIRness of an open simulated dataset produced by the CMS Collaboration at the CERN Large Hadron Collider. This dataset consists of Higgs boson decays and quark and gluon background, and is available through the CERN Open Data Portal. We use additional available tools to assess the FAIRness of this dataset, and incorporate feedback from members of the FAIR community to validate our results. This article is accompanied by a Jupyter notebook to visualize and explore this dataset. This study marks the first in a planned series of articles that will guide scientists in the creation of FAIR AI models and datasets in high energy particle physics.
Year: 2022 PMID: 35165298 PMCID: PMC8844008 DOI: 10.1038/s41597-021-01109-0
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Findable and Accessible principle assessment checks for the CMS H(bb̄) Open Dataset.
| Metric | Evaluation |
|---|---|
| F1. (Meta)data are assigned globally unique and persistent identifiers. | |
| F2. Data are described with rich metadata. | |
| F3. Metadata clearly and explicitly include the identifier of the data they describe. | |
| F4. (Meta)data are registered or indexed in a searchable resource. | |
| A1. (Meta)data are retrievable by their identifier using a standardized communications protocol. | |
| A1.1. The protocol is open, free and universally implementable. | |
| A1.2. The protocol allows for an authentication and authorization where necessary. | |
| A2. Metadata should be accessible even when the data is no longer available. | |
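As an illustration of what checks F1 and A1 involve in practice, the following minimal sketch tests whether an identifier has the usual DOI shape and builds the HTTPS URL that the public DOI resolver (`https://doi.org/`) would serve it from. The helper names `looks_like_doi` and `doi_to_url` are hypothetical, the regular expression is only a syntactic heuristic, and the example uses this article's own DOI rather than the dataset's identifier, which is not given here. No network request is made, so the sketch does not show that the identifier actually resolves.

```python
import re
from urllib.parse import quote

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Assumption: this is a syntactic heuristic, not proof that the DOI resolves.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(identifier: str) -> bool:
    """Return True if the string has the usual DOI shape (F1-style check)."""
    return bool(DOI_PATTERN.match(identifier.strip()))


def doi_to_url(identifier: str) -> str:
    """Build the HTTPS resolver URL for a DOI (A1-style check, no request sent)."""
    doi = identifier.strip()
    if not looks_like_doi(doi):
        raise ValueError(f"not a DOI-shaped identifier: {identifier!r}")
    # Keep "/" unescaped so the prefix/suffix separator survives.
    return "https://doi.org/" + quote(doi, safe="/")


# This article's DOI, taken from the record above.
article_doi = "10.1038/s41597-021-01109-0"
print(looks_like_doi(article_doi))
print(doi_to_url(article_doi))
```

A complete assessment would additionally dereference the URL and inspect the returned metadata, which is outside the scope of this sketch.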
Interoperable and Reusable principle assessment checks for the CMS H(bb̄) Open Dataset.
| Metric | Evaluation |
|---|---|
| I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. | |
| I2. (Meta)data use vocabularies that follow FAIR principles. | |
| I3. (Meta)data include qualified references to other (meta)data. | |
| R1.1. (Meta)data are released with a clear and accessible data usage license. | |
| R1.2. (Meta)data are associated with detailed provenance. | |
| R1.3. (Meta)data meet domain-relevant community standards. | |
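One possible way to make a checklist like this machine-readable is sketched below. The `Check` class and `summarize` function are hypothetical names introduced for illustration only. The metric identifiers come from the table, but every evaluation is deliberately left as `None` (unassessed) because the Evaluation column is empty in this record, and no outcome is implied.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Check:
    metric_id: str          # e.g. "I1" or "R1.2"
    passed: Optional[bool]  # None means "not yet assessed"


def summarize(checks):
    """Count pass / fail / unassessed checks per FAIR principle letter."""
    summary = {}
    for check in checks:
        principle = check.metric_id[0]  # "F", "A", "I" or "R"
        if check.passed is None:
            status = "unassessed"
        else:
            status = "pass" if check.passed else "fail"
        summary.setdefault(principle, Counter())[status] += 1
    return summary


# Metric identifiers from the table; evaluations are placeholders, not results.
metric_ids = ["I1", "I2", "I3", "R1.1", "R1.2", "R1.3"]
checks = [Check(metric_id, None) for metric_id in metric_ids]

report = summarize(checks)
for principle, counts in sorted(report.items()):
    print(principle, dict(counts))
```

Keeping "unassessed" distinct from "fail" avoids reporting a metric as unmet simply because nobody has evaluated it yet.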
Fig. 1 The distribution of labels is shown for a representative file in the training dataset.
Fig. 2 Illustration of a jet with two secondary vertices (SVs) from the decay of two b hadrons resulting in charged-particle tracks (including a low-energy, or soft, lepton) that are displaced with respect to the primary collision vertex (PV), and hence with a large impact parameter (IP) value.
Fig. 3 Distributions of some salient jet features for one file in the training dataset: (a) the soft-drop jet mass; (b) the number of particle candidates; (c) the number of secondary vertices; and (d) the number of tracks.