Heather A Piwowar, Todd J Vision.
Abstract
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.
Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was clearest for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.
Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
Keywords: Bibliometrics; Data archiving; Data repositories; Data reuse; Gene expression microarray; Incentives; Information science; Open data
Year: 2013 PMID: 24109559 PMCID: PMC3792178 DOI: 10.7717/peerj.175
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
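The abstract reports the citation benefit as a percentage (9%, with a 95% confidence interval of 5% to 13%). As a minimal sketch, assuming the regression outcome was the natural log of the citation count (the record only says citations were log transformed, so the exact model form is an assumption), a coefficient on a binary "data available" indicator maps to a percent difference as follows. The function names are hypothetical.

```python
import math


def percent_change(beta: float) -> float:
    """Percent difference in the outcome implied by a coefficient `beta`
    on a binary predictor when the outcome is natural-log transformed."""
    return (math.exp(beta) - 1.0) * 100.0


def coefficient_for(percent: float) -> float:
    """Inverse mapping: the log-scale coefficient that corresponds to a
    given percent difference."""
    return math.log(1.0 + percent / 100.0)


# Back out the log-scale values that would correspond to the reported
# estimate and interval (illustrative only; the fitted coefficients are
# not given in this record).
beta_point = coefficient_for(9.0)
beta_low = coefficient_for(5.0)
beta_high = coefficient_for(13.0)

print(f"point: beta={beta_point:.4f} -> {percent_change(beta_point):.1f}%")
print(f"lower: beta={beta_low:.4f} -> {percent_change(beta_low):.1f}%")
print(f"upper: beta={beta_high:.4f} -> {percent_change(beta_high):.1f}%")
```

For small coefficients the percent difference is close to 100 times the coefficient, which is why a log-scale estimate near 0.09 reads as roughly a 9% benefit.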
Univariate correlations between article attributes and number of citations.
Citations were log transformed and count variables were square root transformed. Pearson correlations were used for numeric variables and polyserial correlations for binary and categorical variables.
| Attribute | Variable name | Correlation |
|---|---|---|
| How many citations did the study receive? | nCitedBy.log | 1.00 |
| What was the impact factor of the journal that published the study? | journal.impact.factor.tr | 0.45 |
| How many citations had been made from PMC to the last author’s previous papers? | last.author.num.prev.pmc.cites.tr | 0.30 |
| How many articles did the journal publish in 2008? | journal.num.articles.2008.tr | 0.25 |
| How many years had elapsed since the last author published his/her first paper? | last.author.year.first.pub.ago.tr | 0.24 |
| What was the mean citation score of the corresponding author’s institution? | institution.mean.norm.citation.score | 0.24 |
| How many citations had been made from PMC to the first author’s previous papers? | first.author.num.prev.pmc.cites.tr | 0.24 |
| How many of the journal’s studies were identified as having created microarray data? | journal.microarray.creating.count.tr | 0.23 |
| How many years had elapsed since the first author published his/her first paper? | first.author.year.first.pub.ago.tr | 0.22 |
| Was the corresponding author’s address in the USA? | country.usa | 0.18 |
| How many authors did the study have? | num.authors.tr | 0.17 |
| Was the study published in a journal considered a core clinical journal by MEDLINE? | pubmed.is.core.clinical.journal | 0.17 |
| How many previous papers had the last author published? | last.author.num.prev.pubs.tr | 0.15 |
| Did the study involve human subjects? | pubmed.is.humans | 0.08 |
| Was the study funded by the NIH? | pubmed.is.funded.nih | 0.07 |
| Was the study funded by an R-grant from the NIH? | has.R.funding | 0.07 |
| Did the study involve plants? | pubmed.is.plants | 0.07 |
| How many previous papers had the first author published? | first.author.num.prev.pubs.tr | 0.06 |
| Did the study involve cancer? | pubmed.is.cancer | 0.06 |
| How many cumulative years of NIH funding did the study receive? | nih.cumulative.years.tr | 0.03 |
| Was the corresponding author’s address in the UK? | country.uk | 0.03 |
| How many NIH grants did the study receive? | num.grants.via.nih.tr | 0.02 |
| What was the sum of the annual grants received from the NIH? | nih.sum.avg.dollars.tr | 0.01 |
| Did the study involve bacteria? | pubmed.is.bacteria | 0.01 |
| Was an associated dataset found in GEO or ArrayExpress? | dataset.in.geo.or.ae | 0.01 |
| How many of the last author’s previous papers were identified as creating gene expression microarray data? | last.author.num.prev.microarray.creations.tr | 0.01 |
| Did the study use cultured cells? | pubmed.is.cultured.cells | −0.01 |
| How many of the first author’s previous papers were identified as creating gene expression microarray data? | first.author.num.prev.microarray.creations.tr | −0.01 |
| Was this study listed as one that had reused data from GEO? | pubmed.is.geo.reuse | −0.01 |
| Was the corresponding author’s institution a government institution? | institution.is.govnt | −0.01 |
| Was the corresponding author’s address in Australia? | country.australia | −0.02 |
| Did the study receive intramural NIH funding? | pubmed.is.funded.nih.intramural | −0.03 |
| Was the corresponding author’s address in Canada? | country.canada | −0.05 |
| What is the rank of the corresponding author’s institution? | institution.rank | −0.06 |
| Was the last author female? | last.author.female | −0.07 |
| Was the first author female? | first.author.female | −0.08 |
| Was the corresponding author’s address in Japan? | country.japan | −0.10 |
| Did the study involve animals? | pubmed.is.animals | −0.11 |
| Was the corresponding author’s address in China? | country.china | −0.19 |
| Was the corresponding author’s address in Korea? | country.korea | −0.26 |
| Was the journal that published the study considered an open access journal? | pubmed.is.open.access | −0.30 |
| What year was the study published? | pubmed.year.published | −0.58 |
| What date was the study published? | pubmed.date.in.pubmed | −0.59 |
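The note above the table says citations were log transformed and count variables were square root transformed before correlation. A minimal sketch of that preprocessing with a Pearson correlation is shown below. The data are made up, `log1p` (log of 1 + n) is an assumption chosen so that zero-citation papers stay defined, and the polyserial correlations used for binary and categorical variables are not covered here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for five studies (not taken from the paper).
citations = [0, 3, 10, 25, 60]
num_authors = [1, 2, 4, 6, 9]

# Log-transform the citation counts; square-root-transform the count variable.
log_citations = [math.log1p(c) for c in citations]
sqrt_authors = [math.sqrt(a) for a in num_authors]

r = pearson(log_citations, sqrt_authors)
print(f"correlation on transformed scale: {r:.2f}")
```

Both transformations compress the long right tail typical of count data, so a few very highly cited papers do not dominate the correlation.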
Proportion of sample published in most common journals.
| Rank | Journal | Proportion |
|---|---|---|
| 1 | Cancer Res | 0.04 |
| 2 | Proc Natl Acad Sci USA | 0.04 |
| 3 | J Biol Chem | 0.04 |
| 4 | BMC Genomics | 0.03 |
| 5 | Physiol Genomics | 0.03 |
| 6 | PLoS One | 0.02 |
| 7 | J Bacteriol | 0.02 |
| 8 | J Immunol | 0.02 |
| 9 | Blood | 0.02 |
| 10 | Clin Cancer Res | 0.02 |
| 11 | Plant Physiol | 0.02 |
| 12 | Mol Cell Biol | 0.01 |
Proportion of sample published each year.
| 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|---|---|---|
| 0.02 | 0.05 | 0.08 | 0.11 | 0.13 | 0.12 | 0.17 | 0.18 | 0.15 |
Figure 1. Citation density for papers with and without publicly available microarray data, by year of study publication.
Figure 2. Increased citation count for studies with publicly available data, by year of publication.
Estimates are from the multivariate analysis; lines indicate 95% confidence intervals.
Figure 3. Cumulative number of datasets deposited in GEO each year, and cumulative number of third-party reuse papers published that directly attribute GEO data published each year (log scale).
Figure 4. Number of papers mentioning GEO accession numbers.
Each panel represents reuse of a particular year of dataset submissions, with number of mentions on the y axis, years since the initial publication on the x axis, and a line for reuses by the data collection team and a line for third-party investigators.
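Reuse in this study was identified through mentions of GEO or ArrayExpress accession numbers in the full text of papers. The sketch below shows one way such mentions could be extracted. The regular expressions are assumptions based on the common accession formats (`GSE` or `GDS` followed by digits for GEO, and `E-XXXX-` followed by digits for ArrayExpress); they are not the authors' actual search procedure.

```python
import re

# Assumed accession formats (hypothetical patterns, not from the paper):
#   GEO series/datasets: GSE1234, GDS567
#   ArrayExpress:        E-MEXP-567, E-GEOD-1234
GEO_PATTERN = re.compile(r"\bG(?:SE|DS)\d+\b")
ARRAYEXPRESS_PATTERN = re.compile(r"\bE-[A-Z]{4}-\d+\b")


def find_accessions(full_text: str) -> set:
    """Return the distinct accession numbers mentioned in a paper's text."""
    found = set(GEO_PATTERN.findall(full_text))
    found.update(ARRAYEXPRESS_PATTERN.findall(full_text))
    return found


example_text = (
    "Expression profiles from GSE1234 and E-MEXP-567 were reanalyzed; "
    "GSE1234 was normalized as described previously."
)
print(sorted(find_accessions(example_text)))
```

Collecting the matches in a set counts each dataset once per paper, however many times the paper mentions it.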
Figure 5. Cumulative number of third-party reuse papers, by date of reuse paper publication.
Separate lines are displayed for different dataset submission years.
Figure 6. Scatterplot of year of publication of third-party reuse paper (with jitter) vs number of GEO datasets mentioned in the paper (log scale).
The line connects the mean number of datasets attributed in reuse papers vs publication year.
Figure 7. Proportion of data reused by third-party papers vs year of data submission.
These estimates are a lower bound: they only considered reuse by papers in PubMed Central, and only when reuse was attributed through direct mention of a GEO accession number.
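The proportion described for Figure 7 can be illustrated with a short calculation: for each submission year, divide the number of datasets mentioned by at least one third-party paper by the number of datasets submitted that year. This is a sketch with made-up identifiers and a hypothetical function name, not the authors' code.

```python
def proportion_reused(datasets_by_year, reused_ids):
    """Map each submission year to the fraction of its datasets that were
    reused at least once, given the set of reused dataset identifiers."""
    proportions = {}
    for year, ids in sorted(datasets_by_year.items()):
        submitted = set(ids)
        if not submitted:
            continue  # no submissions that year, so no proportion to report
        proportions[year] = len(submitted & reused_ids) / len(submitted)
    return proportions


# Hypothetical submissions and reuse mentions (not taken from the paper).
datasets_by_year = {
    2003: ["GSE1", "GSE2", "GSE3", "GSE4"],
    2004: ["GSE5", "GSE6"],
}
reused_ids = {"GSE2", "GSE5", "GSE6"}

proportions = proportion_reused(datasets_by_year, reused_ids)
print(proportions)
```

Because only attributed mentions in a subset of the literature are counted, any proportion computed this way understates true reuse, which matches the lower-bound caveat in the caption.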
Figure 8. Proportion of data submissions that contributed to data reuse papers, by year of reuse paper publication and dataset submission.
Each panel includes a cohort of data reuse papers published in a given year. The lines indicate the proportion of datasets that were mentioned, in aggregate, by the data reuse papers, by the year of dataset publication. The proportion is relative to the total number of datasets submitted in a given year.