| Literature DB >> 26556502 |
Dominique G. Roche, Loeske E. B. Kruuk, Robert Lanfear, Sandra A. Binning.
Abstract
Policies that mandate public data archiving (PDA) successfully increase accessibility to data underlying scientific publications. However, is the data quality sufficient to allow reuse and reanalysis? We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong PDA policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse. We suggest that cultural shifts facilitating clearer benefits to authors are necessary to achieve high-quality PDA and highlight key guidelines to help authors increase their data's reuse potential and compliance with journal data policies.
Year: 2015 PMID: 26556502 PMCID: PMC4640582 DOI: 10.1371/journal.pbio.1002295
Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029
Table 1. Journal and publication year of 100 reviewed studies with associated data publicly archived in the digital repository Dryad (http://datadryad.org/).
At the time of data deposition in the repository, journals had either a “strong” PDA policy or adhered to the Joint Data Archiving Policy (JDAP), both of which require that data necessary to replicate a study’s results be archived in a public repository. Datasets were examined to assess completeness and reusability.
| Journal | Policy | Studies (2012) | Studies (2013) |
|---|---|---|---|
|  | strong | 2 | 10 |
|  | JDAP | 16 | 13 |
|  | JDAP | 3 | 2 |
|  | JDAP | 17 | 10 |
|  | strong | 1 | 0 |
|  | strong | 2 | 3 |
|  | JDAP | 9 | 12 |
Fig 1. How complete and reusable are publicly archived data in ecology and evolution?
The expectation of PDA that exists in genetics and molecular biology is rapidly permeating throughout ecology and evolution. With the advent of data archiving policies and integrated data repositories, journals and funders now have effective means of mandating PDA. However, the quality of publicly archived data associated with experimental and observational (nonmolecular) studies in ecology and evolution is highly variable. Illustration by Ainsley Seago.
Table 2. Data completeness and reusability assessment.
Scoring system and criteria used to assess data completeness and reusability of 100 studies with data archived in the public repository Dryad.
| Data Completeness | | |
|---|---|---|
| Score | Description | Criteria |
| 5 | Exemplary | All the data necessary to reproduce the analyses and results (in practice) are archived. There is informative metadata with a legend detailing column headers, abbreviations, and units. |
| 4 | Good | All the data necessary to reproduce the analyses and results (in practice) are archived. The metadata are limited or absent, but column headings, abbreviations, and units can be understood from reading the paper. |
| 3 | Small omission | Most of the data necessary to repeat the analyses are archived except for a small amount (e.g., for a supporting or exploratory analysis). The metadata are informative OR the archived data can be interpreted from reading the paper. |
| 2 | Large omission | The main analyses in the paper cannot be redone because essential data are missing AND/OR insufficient metadata or information in the paper precludes interpreting the data AND/OR the authors archived summary statistics (e.g., means), but not the raw data used in the analyses. |
| 1 | Poor | The data are not archived OR the wrong data are archived OR insufficient information is provided in the metadata or paper for the data to be intelligible. |
| Data Reusability | | |
| Score | Description | Criteria |
| 5 | Exemplary | The data are archived in a nonproprietary, human- and machine-readable file format that facilitates data aggregation and can be processed with both free and proprietary software (e.g., csv, text). |
| 4 | Good | The data are archived in a format that is designed to be machine readable with proprietary software (e.g., Excel), and the metadata are highly informative (such that column headings, abbreviations, and units can be understood in isolation from the original paper). OR The data are archived in a nonproprietary, human- and machine-readable file format, and the metadata are sufficiently informative to be understood when combined with information from the associated paper. Raw dataᵃ are presented (perhaps in combination with processed data such as means). |
| 3 | Average | The data are archived in a format that is designed to be machine readable with proprietary software (e.g., Excel). The metadata are sufficiently informative to be understood when combined with information from the associated paper. Raw dataᵃ are presented (perhaps in combination with processed data such as means). |
| 2 | Poor | The data are archived in a human- but not machine-readable format. The metadata are highly informative OR sufficiently informative to be understood with information from the associated paper. Raw dataᵃ are presented (perhaps in combination with processed data such as means). |
| 1 | Very poor | The metadata are insufficient for the data to be intelligible even when combined with information from the associated paper AND/OR processed but not raw data are presented. |
N.B. Reusability was assessed for archived data independently of completeness. One point was subtracted when data were included as supplementary material on the journal website, except when the reusability score was 1, to avoid zero values (see S1 Text).
ᵃ Raw data were considered unprocessed data (e.g., trait values used in a principal component analysis rather than principal component scores, or the values underlying means presented in figures). Studies that did not archive duplicate or triplicate measurements to account for measurement error were not considered to be missing raw data.
Fig 2. Completeness and reusability scores.
Frequency distribution of public data archiving (PDA) scores for (A) completeness and (B) reusability across 100 studies in 2012 (light blue bars) and 2013 (dark blue bars). A score of 5 indicates exemplary archiving, and a score of 1 indicates poor archiving (see Table 2). Studies with completeness scores of 3 or lower (left of the red dashed line in panel A) do not comply with their journal's PDA policy. Studies to the left of the red dashed line in panel B have a reusability score between "average" (score of 3) and "very poor" (score of 1).
Fig 3. The relationship between the reusability and completeness of archived datasets (R = 0.59, p < 0.001).
Empty circles are individual data points (offset to avoid overlap).
Table 3. Key recommendations to improve PDA practices.
References listed provide specific details and more extensive discussion on these topics.
| Recommendation | Description | Ref. |
|---|---|---|
| 1. Be mindful of PDA | Plan for PDA before data collection so that data are well managed and prepared for deposition when a manuscript is submitted or published. | |
| 2. Make your data discoverable | Avoid archiving data as supplementary material. Use an established repositoryᵃ (e.g., figshare, Dryad, Knowledge Network for Biocomplexity (KNB), Zenodo). | |
| 3. Provide detailed metadata | Provide information about the data, including a description of column headings, abbreviations, units of measurement, and what figures and/or analyses the data correspond to. Other metadata can include how the data were collected and suggestions for how to best reuse them. | |
| 4. Use descriptive file names | Give data files names that are concise but indicative of their content. Avoid blank spaces. | |
| 5. Archive unprocessed data | As much as possible, share the data in their raw form. Provide both the raw and processed data used in the analyses. | |
| 6. Use standard file formats | Use file formats that are compatible with many different types of software (e.g., csv rather than Excel files). | |
| 7. Facilitate data aggregation | Use existing standards whenever possible and deposit data in appropriate public databases (e.g., occurrence data in the Global Biodiversity Information Facility (GBIF), sequences in GenBank). Archive different types of data as distinct documents (not as multiple sheets in one document). Use standard table formats (columns for a variable type and rows for single observations), short variable names without spaces, and meaningful values for missing data (e.g., the abbreviation NA for "not applicable"). Avoid nested headers, merged cells, colour coding, footnotes, etc. | |
| 8. Perform quality control | Check the format (e.g., numeric versus string) and units of values in a table. Ask a colleague to review the data and metadata for completeness and clarity. | |
| 9. Choose a publishing license | Use well-established licenses (e.g., Creative Commons licensesᵇ). | |
| 10. Decide on an embargo | By default, data repositories release archived datasets immediately or upon publication of the associated paper. Some journals and repositories allow a one-year no-questions-asked embargoᶜ. | |
ᵃ See Table 1 in [32,33] for further details and examples of recognized data repositories. Some repositories are free (e.g., figshare), and others have a data publishing charge [60]. Depending on the publishing journal, charges may be covered (http://datadryad.org/pages/integratedJournals).
ᵇ http://creativecommons.org/
ᶜ E.g., Dryad allows a one-year no-questions-asked embargo, but figshare offers no embargo option.
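Recommendations 6–8 above can be made concrete with a short sketch. The snippet below is a minimal illustration (not taken from the paper; the dataset, column names, and helper functions are hypothetical) of writing a tidy, nonproprietary csv — one column per variable, one row per observation, short headers without spaces, and "NA" for missing values — followed by a basic quality-control check that a column's values are numeric.

```python
import csv
import io

# Hypothetical example records: one row per observation, one column per
# variable, short header names without spaces.
rows = [
    {"site": "A", "year": 2012, "body_mass_g": 41.3},
    {"site": "B", "year": 2013, "body_mass_g": None},  # missing value -> "NA"
]

def write_tidy_csv(records, fieldnames):
    """Serialize records to csv text, encoding missing values as 'NA'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        writer.writerow({k: ("NA" if rec[k] is None else rec[k]) for k in fieldnames})
    return buf.getvalue()

def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def check_numeric(csv_text, column):
    """Quality control (recommendation 8): every non-NA value in `column`
    must parse as a number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return all(row[column] == "NA" or _is_number(row[column]) for row in reader)

csv_text = write_tidy_csv(rows, ["site", "year", "body_mass_g"])
print(csv_text)
print(check_numeric(csv_text, "body_mass_g"))
```

A file written this way opens identically in free and proprietary software, which is the property the reusability scores in Table 2 reward.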