Edmund M Hart, Pauline Barmby, David LeBauer, François Michonneau, Sarah Mount, Patrick Mulrooney, Timothée Poisot, Kara H Woo, Naupaka B Zimmerman, Jeffrey W Hollister.
Year: 2016 PMID: 27764088 PMCID: PMC5072699 DOI: 10.1371/journal.pcbi.1005097
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Example of an untidy dataset (A) and its tidy equivalent (B).
Dataset A is untidy for several reasons: it mixes observational units (species, location of observations, and measurements about individuals); units of measurement are inconsistent and recorded alongside the observed values; some columns hold more than one variable (both latitude and longitude for the coordinates, and genus and species for the species names); and several formats are used in the same column for dates and geographic coordinates. Dataset B is a tidy version of dataset A that reduces the amount of information duplicated in each row, which limits the chance of introducing mistakes in the data. Because species are kept in a separate table, they can be identified uniquely using the Taxonomic Serial Number (TSN) from the Integrated Taxonomic Information System (ITIS), and information about their classification is easy to add. The separate table also allows researchers to edit the taxonomic information independently from the table that holds the measurements about the individuals. Unique values for each observational unit facilitate the programmatic combination of information using “join” operations. In this example, if the study focuses on the size measurements of the individuals (weight and length), information about “where,” “when,” and “what” animals were measured can be considered metadata. Using the tidy format makes this distinction clearer.
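The “join” operation described in the caption can be illustrated with a minimal sketch using Python's standard `sqlite3` module. It is not taken from the figure: the table names, column names, units (`weight_g`, `length_mm`), TSN values, and species names below are all hypothetical placeholders, chosen only to show one species table keyed by TSN and one measurements table that refers to it.

```python
import sqlite3


def build_tidy_db():
    """Create an in-memory database with two tidy tables (placeholder data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE species (
            tsn     INTEGER PRIMARY KEY,  -- placeholder IDs, not real ITIS TSNs
            genus   TEXT NOT NULL,
            species TEXT NOT NULL
        );
        CREATE TABLE measurements (
            individual_id INTEGER PRIMARY KEY,
            tsn           INTEGER NOT NULL REFERENCES species(tsn),
            date          TEXT NOT NULL,  -- one format only (ISO 8601)
            latitude      REAL NOT NULL,  -- one variable per column
            longitude     REAL NOT NULL,
            weight_g      REAL NOT NULL,  -- unit lives in the column name
            length_mm     REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO species VALUES (?, ?, ?)",
        [(1, "GenusA", "alpha"), (2, "GenusB", "beta")],
    )
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, 1, "2016-01-15", 10.5, -20.25, 12.0, 80.0),
            (2, 1, "2016-01-16", 10.5, -20.25, 14.0, 85.0),
            (3, 2, "2016-01-16", 11.0, -21.0, 30.0, 120.0),
        ],
    )
    conn.commit()
    return conn


def mean_weight_by_species(conn):
    """Join measurements to species on TSN and average weight per species."""
    return conn.execute(
        """
        SELECT s.genus || ' ' || s.species AS name, AVG(m.weight_g)
        FROM measurements AS m
        JOIN species AS s ON s.tsn = m.tsn
        GROUP BY s.tsn
        ORDER BY s.tsn
        """
    ).fetchall()
```

Because each species name is stored once, a taxonomic correction is a single-row update in the `species` table, and every later join picks it up without touching the measurement rows.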