Daniel S Falster, Richard G FitzJohn, Matthew W Pennell, William K Cornwell.
Abstract
The sharing and re-use of data has become a cornerstone of modern science. Multiple platforms now allow easy publication of datasets. So far, however, platforms for data sharing offer limited functions for distributing and interacting with evolving datasets: those that continue to grow with time as more records are added, errors are fixed, and new data structures are created. In this article, we describe a workflow for maintaining and distributing successive versions of an evolving dataset, allowing users to retrieve and load different versions directly into the R platform. Our workflow utilizes tools and platforms used for development and distribution of successive versions of an open source software program, including version control, GitHub, and semantic versioning, and applies these to the analogous process of developing successive versions of an open source dataset. Moreover, we argue that this model allows for individual research groups to achieve a dynamic and versioned model of data delivery at no cost.
Keywords: data sharing; semantic versioning; version control
Year: 2019 PMID: 31042286 PMCID: PMC6506717 DOI: 10.1093/gigascience/giz035
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Overview of technologies used to maintain, store, and distribute versions of an evolving dataset as described in this article
| Technology | Description |
|---|---|
| Git | Open source version control system used for tracking progressive changes in a set of text files, typically computer code but also data |
| GitHub | A commercial web platform [...] |
| R | Widely used and open source language for data processing and statistical analysis |
| datastorr | A package in [...] |
| Semantic versioning | The process of assigning unique version numbers in a particular format to successive versions of a digital product |
API: application programming interface.
Example datasets currently delivered using the datastorr package for R
| Repository | Dataset description |
|---|---|
| traitecoevo/taxonlookup | Taxonomy of world’s land plants |
| traitecoevo/growthform | Growth form of world’s land plants |
| traitecoevo/baad.data | Size dimensions of plants for many species from across the world |
| ecohealthalliance/cites | Trade details from Convention on International Trade in Endangered Species (CITES) |
| madams1/nbadata | Statistics from the National Basketball Association (NBA) seasons 1996-97 to 2016-17 |
| madams1/floridainmates | Statistics on Florida state’s inmate population, from Florida Department of Corrections |
| traitecoevo/fungaltraits | Traits of world’s fungi species |
Figure 1. Overview of the workflow, different parties, and technologies involved in maintaining and distributing versions of an evolving dataset via datastorr. Core features of our approach are shown with black boxes and arrows. Optional extensions are shown in grey (see Discussion for details).
Goals and requirements of different parties involved in creating and using an evolving dataset
| Party | Goal | Requirements |
|---|---|---|
| Developers | Create and distribute versions of an evolving dataset | Low technical overhead, low initial and ongoing cost and maintenance, easy workflow for releasing new versions, enable user feedback in error checking and contributions, long-term preservation |
| Contributors | Contribute to future versions of an evolving dataset | Add new data, report errors in existing data |
| Users (all) | Gain easy access to all versions of an evolving dataset | Access metadata and background information, access to all versions of a dataset, ability to provide feedback and contribute, long-term stability |
| Users (programmatic) | As above via machine access | Programmatic access to all versions of an evolving dataset, reproduce products using specific versions of an evolving dataset, easy installation |
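The programmatic-user requirement in the table above (reach any version, and reproduce a result against a specific one) can be illustrated with a small sketch. It is written in Python purely for illustration and is not the datastorr API: `resolve_version`, `version_key`, and the example version labels are all hypothetical, and the sketch assumes versions are plain `MAJOR.MINOR.PATCH` strings.

```python
def version_key(label):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints so versions sort numerically."""
    return tuple(int(part) for part in label.split("."))


def resolve_version(available, requested=None):
    """Pick which dataset version to load (hypothetical helper).

    available: list of version labels published for the dataset.
    requested: a specific label to pin, or None to take the newest release.
    """
    if not available:
        raise ValueError("no versions have been released")
    if requested is None:
        # Default behaviour: newest release by numeric, not alphabetical, order.
        return max(available, key=version_key)
    if requested not in available:
        raise KeyError(f"version {requested!r} has not been released")
    # Pinned behaviour: return exactly what the analysis asked for.
    return requested


if __name__ == "__main__":
    releases = ["1.0.0", "1.2.0", "1.10.0"]  # made-up labels for illustration
    print(resolve_version(releases))           # newest release
    print(resolve_version(releases, "1.2.0"))  # pinned for reproducibility
```

Comparing the labels as integer tuples matters here: a plain string comparison would rank `1.2.0` above `1.10.0`.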
Figure 2. Semantic versioning allows dataset developers to communicate to users the types of changes that have occurred between successive versions of an evolving dataset, using a tri-digit label where increments in a number indicate major, minor, and patch-level changes, respectively. See text for further details.
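As a minimal sketch of how the tri-digit label described in the caption can be read by a program, the Python snippet below reports which position was incremented between two versions. The function names are hypothetical, and the sketch only identifies the position that changed; what counts as a major, minor, or patch-level change to a dataset is defined in the article text and is not encoded here.

```python
def parse_version(label):
    """Split a 'MAJOR.MINOR.PATCH' label into a tuple of three ints."""
    parts = label.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a tri-digit version label: {label!r}")
    return tuple(int(p) for p in parts)


def change_type(old, new):
    """Name the level of change between two successive versions (hypothetical helper)."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError("new version must be greater than old version")
    if new_v[0] != old_v[0]:
        return "major"  # first number incremented
    if new_v[1] != old_v[1]:
        return "minor"  # second number incremented
    return "patch"      # only the third number incremented


if __name__ == "__main__":
    for old, new in [("1.2.3", "2.0.0"), ("1.2.3", "1.3.0"), ("1.2.3", "1.2.4")]:
        print(old, "->", new, ":", change_type(old, new))
```

A user who has pinned an analysis to one version could use a check like this to decide whether a newer release is likely to need attention before upgrading.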