Giovanni Colavizza, Iain Hrynaszkiewicz, Isla Staden, Kirstie Whitaker, Barbara McGillivray.
Abstract
Efforts to make research results open and reproducible are increasingly reflected in journal policies that encourage or mandate authors to provide data availability statements (DAS). As a consequence, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and whether there is added value in providing such links. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their data availability statements according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements become very common: in 2018, 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had data availability statements. Data availability statements containing a link to data in a repository (rather than stating that data are available on request or included as supporting information files) are a fraction of the total. In 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided DAS containing a link to data in a repository. We also find an association between articles that include statements that link to data in a repository and up to 25.36% (± 1.07%) higher citation impact on average, using a citation prediction model. We discuss the potential implications of these results for authors (researchers) and journal publishers who make the effort of sharing their data in repositories. All our data and code are made available in order to reproduce and extend our results.
Year: 2020 PMID: 32320428 PMCID: PMC7176083 DOI: 10.1371/journal.pone.0230416
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Data extraction and processing steps.
We first downloaded the PubMed open access collection (1) and created a database of all articles that had a known identifier and contained at least one reference (2; N = 1,969,175). Next, we identified and disambiguated the authors of these papers (3; S = 4,253,172) and calculated citations for each author and each publication from within the collection (4). We used these citation counts to calculate a within-collection H-index for each author. Our analysis focuses only on PLOS and BMC publications, as these publishers introduced mandated DAS, so we filtered the database for these articles and extracted the DAS from each publication (5). We annotated a training dataset by labelling each of these statements with one of four categories (6) and used those labels to train a natural language processing classifier (7). Using this classifier, we then categorised the remaining DAS in the database (8). Finally, we exported this categorised dataset of M = 531,889 publications to a CSV file (9) and archived it (see Data and code availability section below).
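The within-collection H-index from step (4) can be illustrated with a minimal sketch. It assumes the standard definition (the largest h such that an author has h papers with at least h citations each). The function name `h_index` and the sample counts are hypothetical and are not taken from the authors' code.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical author with five papers, counting only within-collection citations.
example_counts = [10, 4, 3, 1, 0]
print(h_index(example_counts))  # 3
```

Because only citations from within the collection are counted, values computed this way would generally be lower than an H-index reported by a full citation database.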
Categories of DAS identified in our coding approach.
| Category | Definition | Example |
|---|---|---|
| 0 | Not available | |
| 1 | Data available on request or similar | |
| 2 | Data available with the paper and its supplementary files | |
| 3 | Data available in a repository | |
Classification report by DAS category.
| Category | Precision | Recall | F1-score | Specificity | Support |
|---|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1 | 4 |
| 1 | 1.00 | 1.00 | 1.00 | 1 | 20 |
| 2 | 0.98 | 1.00 | 0.99 | 0.97 | 45 |
| 3 | 1.00 | 0.86 | 0.92 | 1 | 7 |
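The four metrics in the report above follow from a confusion matrix. The sketch below shows the arithmetic. The matrix itself is an assumption: it was chosen only because it reproduces the supports and rounded scores in the report (one category 3 statement predicted as category 2), and it is not taken from the authors' evaluation.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, F1 and specificity for each class.

    confusion[i][j] counts items whose true class is i and predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    report = {}
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        specificity = tn / (tn + fp) if tn + fp else 0.0
        report[k] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "specificity": specificity,
            "support": tp + fn,
        }
    return report


# Assumed confusion matrix (rows: true category, columns: predicted category).
assumed_confusion = [
    [4, 0, 0, 0],
    [0, 20, 0, 0],
    [0, 0, 45, 0],
    [0, 0, 1, 6],
]

for category, scores in per_class_metrics(assumed_confusion).items():
    rounded = {name: round(value, 2) for name, value in scores.items()}
    print(category, rounded)
```

With only 7 test items in category 3, a single misclassification moves recall from 1.00 to 0.86, so the per-category scores for small classes should be read with that in mind.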
Fig 2. Data availability statements over time.
All the histograms above show the number of publications from specific subsets of the dataset and classify them into four categories: No DAS (0), Category 1 (data available on request), Category 2 (data contained within the article and supplementary materials), and Category 3 (a link to archived data in a public repository). The vertical solid line shows the date that the publisher introduced a mandated DAS policy. A dashed line indicates the date an encouraged policy was introduced. The groups of articles are as follows. A: all BMC articles, B: all PLOS articles, C: all BMC Series articles, D: PLOS One articles, E: PLOS articles not published in PLOS One, F: articles from the BMC Genomics journal (selected to illustrate a journal that had high uptake of an encouraged policy), G: articles from the Trials journal (published by BMC, selected to illustrate a journal that has a very high percentage of data that can only be made available by request to the authors), H: articles from the Parasites and Vectors journal (selected to illustrate a journal that has an even distribution of the three DAS categories). Articles are binned by publication year.
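The yearly binning behind these histograms can be sketched as follows. The record layout (a `(year, category)` pair per article), the function names, and the sample records are hypothetical; the archived dataset may use different column names.

```python
from collections import Counter, defaultdict


def counts_by_year(records):
    """Count publications per publication year and DAS category."""
    table = defaultdict(Counter)
    for year, category in records:
        table[year][category] += 1
    return table


def category_share(table, year, category):
    """Fraction of the publications in a given year that fall in a category."""
    total = sum(table[year].values())
    return table[year][category] / total if total else 0.0


# Hypothetical records: (publication year, DAS category 0 to 3).
records = [(2017, 0), (2017, 2), (2018, 2), (2018, 3), (2018, 3), (2018, 1)]
table = counts_by_year(records)
print(category_share(table, 2018, 3))  # 0.5
```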
Summary of variables used in the regression models.
| Variable | Description | Possible transformations |
|---|---|---|
| | Number of citations received within a certain number of years. | |
| n_authors | Number of authors. | |
| n_references_tot | Total number of references. | |
| p_year | Publication year. | |
| p_month | Publication month. | |
| h_index_mean | Mean H-index of authors at publication time. | |
| h_index_median | Median H-index of authors at publication time. | |
| das_category | DAS category (0 to 3; see the table of DAS categories above). | |
| is_plos | If PLOS (1) or not (0). | |
| das_encouraged | If published under an encouraged DAS mandate (1) or not (0). | |
| das_required | If published under a required DAS mandate (1) or not (0). | |
| journal_field | Dummy variable, from Science-Metrix. | |
Descriptive statistics for (non-transformed) model variables over the whole dataset under analysis.
| Variable/Statistic | Minimum | Median | Mean | Maximum |
|---|---|---|---|---|
| | 0 | 0 | 0.68 | 166 |
| | 0 | 0 | 1.13 | 483 |
| | 0 | 1 | 1.9 | 1732 |
| | 0 | 1 | 2.84 | 2233 |
| | 1 | 6 | 6.68 | 2442 |
| | 1 | 39 | 41.94 | 1097 |
| | 1997 | 2014 | 2013 | 2018 |
| | 1 | 7 | 5.43 | 12 |
| | 0 | 1 | 1.17 | 28 |
| | 0 | 1.2 | 1.56 | 28 |
Correlations among a set of variables.
The values in the top-right half of the table, above the diagonal, are Spearman’s correlation coefficients; the values in the bottom-left half, below the diagonal, are Pearson’s correlation coefficients. All variables are transformed as in the description of the model.
| Variable | | | | | | | |
|---|---|---|---|---|---|---|---|
| | | 0.16 | 0.14 | -0.02 | 0.25 | 0.2 | 0.22 |
| | 0.16 | | 0.16 | -0.01 | 0.2 | 0.06 | 0.11 |
| | 0.14 | 0.16 | | -0.02 | 0.39 | 0.32 | 0.1 |
| | -0.01 | -0.01 | -0.03 | | 0.01 | 0.01 | 0.02 |
| | 0.25 | 0.18 | 0.41 | 0.02 | | 0.85 | 0.12 |
| | 0.19 | 0 | 0.28 | 0.02 | 0.82 | | 0.08 |
| | 0.24 | 0.15 | 0.13 | 0.02 | 0.14 | 0.07 | |
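A table that holds Spearman coefficients above the diagonal and Pearson coefficients below it can be assembled as in the following sketch. It uses only the standard library, computes Spearman as the Pearson correlation of average ranks, and assumes no column is constant. The function names and the sample columns are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mixed_correlation_table(columns):
    """Spearman above the diagonal, Pearson below it, None on the diagonal."""
    names = list(columns)
    table = [[None] * len(names) for _ in names]
    for r, a in enumerate(names):
        for c, b in enumerate(names):
            if c > r:
                table[r][c] = spearman(columns[a], columns[b])
            elif c < r:
                table[r][c] = pearson(columns[a], columns[b])
    return names, table


# Hypothetical columns: y is a monotone but non-linear function of x.
columns = {"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]}
names, table = mixed_correlation_table(columns)
print(names)
print(table)
```

In the toy data the rank-based coefficient reaches 1 while the linear one stays below it, which is why the two halves of such a table can differ for skewed variables like citation counts.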
OLS and robust LS estimates for the citation prediction model under discussion.
Coefficient standard errors are given in parentheses.
| | (1) | (2) |
|---|---|---|
| n_authors | 0.107 | 0.103 |
| | (0.002) | (0.002) |
| n_references_tot | 0.197 | 0.189 |
| | (0.002) | (0.002) |
| p_year | 0.011 | 0.011 |
| | (0.0005) | (0.0005) |
| p_month | −0.011 | −0.010 |
| | (0.0005) | (0.0004) |
| h_index_mean | 0.218 | 0.204 |
| | (0.004) | (0.004) |
| h_index_median | 0.007 | 0.008 |
| | (0.001) | (0.001) |
| C(das_category)1 | 0.085 | 0.072 |
| | (0.024) | (0.023) |
| C(das_category)2 | 0.059 | 0.057 |
| | (0.019) | (0.018) |
| C(das_category)3 | 0.252 | 0.271 |
| | (0.012) | (0.012) |
| C(journal_field)Agriculture, Fisheries & Forestry | −0.066 | −0.051 |
| | (0.011) | (0.011) |
| C(journal_field)Biology | −0.009 | 0.007 |
| | (0.009) | (0.009) |
| C(journal_field)Biomedical Research | −0.027 | −0.012 |
| | (0.005) | (0.005) |
| C(journal_field)Chemistry | −0.242 | −0.214 |
| | (0.015) | (0.014) |
| C(journal_field)Clinical Medicine | −0.033 | −0.021 |
| | (0.004) | (0.004) |
| C(journal_field)Enabling & Strategic Technologies | 0.047 | 0.054 |
| | (0.005) | (0.005) |
| C(journal_field)Engineering | −0.205 | −0.177 |
| | (0.019) | (0.019) |
| C(journal_field)General Science & Technology | −0.388 | −0.370 |
| | (0.006) | (0.006) |
| C(journal_field)Information & Communication Technologies | 0.007 | 0.025 |
| | (0.013) | (0.013) |
| C(journal_field)Philosophy & Theology | −0.011 | 0.012 |
| | (0.026) | (0.026) |
| C(journal_field)Psychology & Cognitive Sciences | −0.160 | −0.135 |
| | (0.021) | (0.021) |
| C(journal_field)Public Health & Health Services | 0.042 | 0.057 |
| | (0.006) | (0.006) |
| das_requiredTrue | 0.073 | 0.070 |
| | (0.005) | (0.004) |
| das_encouragedTrue | −0.052 | −0.048 |
| | (0.004) | (0.004) |
| is_plosTrue | 0.211 | 0.213 |
| | (0.004) | (0.004) |
| C(das_category)1:is_plosTrue | −0.077 | −0.066 |
| | (0.025) | (0.025) |
| C(das_category)2:is_plosTrue | −0.040 | −0.038 |
| | (0.019) | (0.019) |
| C(das_category)3:is_plosTrue | −0.163 | −0.192 |
| | (0.014) | (0.014) |
| Constant | −22.228 | −23.297 |
| | (0.967) | (0.950) |
| Observations | 367,836 | 367,836 |
| R² | 0.144 | |
| Adjusted R² | 0.144 | |
| Residual Std. Error (df = 367808) | 0.593 | 0.665 |
| F Statistic | 2,285.393 | |
*p<0.1; **p<0.05; ***p<0.01
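Because the model includes interaction terms between DAS category and `is_plos`, the coefficient attached to a category for a PLOS article is the sum of the category term and its interaction term. The sketch below does that addition with the column (1) estimates from the table above. The dictionary and function names are illustrative. It leaves out the `is_plosTrue` term, which applies to every PLOS article regardless of category, and it does not compute standard errors for the sums.

```python
# Column (1) estimates copied from the regression table.
main_effect = {1: 0.085, 2: 0.059, 3: 0.252}
plos_interaction = {1: -0.077, 2: -0.040, 3: -0.163}


def category_effect(category, is_plos):
    """Coefficient contribution of a DAS category relative to category 0."""
    effect = main_effect[category]
    if is_plos:
        effect += plos_interaction[category]
    return effect


for category in (1, 2, 3):
    non_plos = category_effect(category, is_plos=False)
    plos = category_effect(category, is_plos=True)
    print(f"category {category}: non-PLOS {non_plos:.3f}, PLOS {plos:.3f}")
```

On these estimates, category 3 keeps the largest coefficient in both groups, although the negative interaction term makes it smaller for PLOS articles than for BMC articles.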