Libby Hemphill, Amy Pienta, Sara Lafia, Dharma Akmon, David A Bleckley.
Abstract
Despite large public investments in facilitating the secondary use of data, there is little information about the specific factors that predict data's reuse. Using data download logs from the Inter-university Consortium for Political and Social Research (ICPSR), this study examines how data properties, curation decisions, and repository funding models relate to data reuse. We find that datasets deposited by institutions, subject to many curatorial tasks, and whose access and preservation are funded externally, are used more often. Our findings confirm that investments in data collection, curation, and preservation are associated with more data reuse.
Year: 2022 PMID: 36246529 PMCID: PMC9542848 DOI: 10.1002/asi.24646
Source DB: PubMed Journal: J Assoc Inf Sci Technol ISSN: 2330-1635 Impact factor: 3.275
Number of studies released and data users by year
| Release year | Studies | Data users |
|---|---|---|
| 2017 | 73 | 15,493 |
| 2018 | 120 | 19,389 |
| 2019 | 58 | 7,526 |
| 2020 | 97 | 7,463 |
| 2021 | 32 | 354 |
| Total | 380 | 50,225 |
FIGURE 1 Frequency distribution of total data users by study
Variables and their definitions
| Variable type | Variable | Definition |
|---|---|---|
| Data attributes | Series | 1 = Study is part of a recurring serial collection with new data archived over time (e.g., repeated cross‐sectional studies, longitudinal studies); 0 = Study is not part of a series |
| | Institutional PI | 1 = At least one of the study's principal investigators or depositors is an institution (e.g., U.S. Bureau of the Census); 0 = All of the study's principal investigators are individuals |
| | Number of variables | Number of variables in the study, indicating the size of the study (note: qualitative studies have zero variables; our sample includes 35 qualitative studies) |
| Curatorial decisions | Number of subject terms | Number of metadata subject terms assigned by staff (including terms supplied by the data contributor) to the study, indicating scope |
| | Curation level | Level of curation for the study, indicating the set of curation activities performed in preparing the study, where 3 indicates the most activities and 1 the fewest. Rarely, data and documentation are released in the format provided by the data producer, and these studies are called “fast release” (FR). Level 3, the highest level of curation, serves as the reference group in our regression models. |
| | SSVD | 1 = Variable‐level metadata, including variable name, label, and value labels, are indexed for search in ICPSR's social science variable database; 0 = Variables are not indexed for search |
| | Question text | 1 = Question text from data collection instruments or other source documentation manually generated for all variables; 0 = No question text available for search |
| | SDA | 1 = Study data have been processed, compiled, and made available for online analysis; 0 = Not available for online analysis |
| Archive funding model | External funder | 1 = Study was released by an externally‐sponsored, topical archive (e.g., National Archive of Criminal Justice Data) rather than the member‐sponsored archive (i.e., General Archive or Resource Center for Minority Data); 0 = Study was deposited in the ICPSR membership archive |
| Control variable | | Number of days the study has been available (from study release to data pull date) |
| Dependent variable | Total data users | Number of unique users who downloaded quantitative data files, specifically, from the study between January 2017 and April 2021 |
Descriptive statistics for data attributes, curatorial decisions, funding models, and data use
| | Overall (N = 380) |
|---|---|
| Series | |
| No | 138 (36.3%) |
| Yes | 242 (63.7%) |
| Number of variables | |
| Mean (SD) | 1328.158 (3395.758) |
| Range | 0.000–34094.000 |
| Institutional PI | |
| No | 212 (55.8%) |
| Yes | 168 (44.2%) |
| Curation level | |
| Level 1 | 82 (21.6%) |
| Fast release | 11 (2.9%) |
| Level 2 | 133 (35.0%) |
| Level 3 | 154 (40.5%) |
| Number of subject terms | |
| Mean (SD) | 12.053 (7.654) |
| Range | 2.000–48.000 |
| SSVD | |
| No | 21 (5.5%) |
| Yes | 359 (94.5%) |
| Question text | |
| No | 185 (48.7%) |
| Yes | 195 (51.3%) |
| SDA | |
| No | 211 (55.5%) |
| Yes | 169 (44.5%) |
| External funder | |
| No | 150 (39.5%) |
| Yes | 230 (60.5%) |
| Total data users | |
| Mean (SD) | 132.171 (207.820) |
| Range | 0.000–1790.000 |
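The summary statistics for total data users hint at why a plain Poisson count model would fit poorly here. The short sketch below compares the variance with the mean using only the two figures reported in the table. Reading the Theta row in the regression results as the dispersion parameter of a negative binomial model is an assumption on our part; the tables do not name the model.

```python
# Summary statistics for "Total data users", copied from the table above.
mean_users = 132.171
sd_users = 207.820

# Variance implied by the reported standard deviation.
variance = sd_users ** 2

# A Poisson model assumes variance == mean, so this ratio would be close to 1.
# A ratio far above 1 indicates overdispersion.
dispersion_ratio = variance / mean_users

print(f"variance        = {variance:.1f}")
print(f"variance / mean = {dispersion_ratio:.1f}")
```

A ratio this far above 1 is the usual motivation for a count model with an extra dispersion parameter, although the raw ratio ignores any covariates and is only a rough diagnostic.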
Regression results
| Dependent variable: total_data_users | |
|---|---|
| Series (yes) | 0.891 |
| Number of variables | 1.000** |
| Institutional PI (yes) | 1.322** |
| Curation level (fast release) | 0.345** |
| Curation level (Level 1) | 1.154 |
| Curation level (Level 2) | 0.617** |
| Number of subject terms | 1.031*** |
| SSVD (yes) | 0.777 |
| Question text (yes) | 1.342 |
| SDA (yes) | 0.750* |
| External funder (yes) | 4.273*** |
| Curation level (fast release): External funder (yes) | 0.744 |
| Curation level (Level 1): External funder (yes) | 0.606* |
| Curation level (Level 2): External funder (yes) | 0.967 |
| Constant | 0.060*** |
| Observations | 380 |
| Log likelihood | −2,063.611 |
| Theta | 0.959*** (0.064) |
| Akaike inf. crit. | 4,157.222 |
*p < .1; **p < .05; ***p < .01.
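Because the model includes interaction terms, the external-funder coefficient on its own describes only the reference curation level (Level 3). The sketch below combines main effects and interactions for each curation level. It assumes the reported values are exponentiated coefficients (multiplicative rate ratios), which the table does not state explicitly; the function names are illustrative only.

```python
# Values copied from the regression table above. Treating them as
# multiplicative rate ratios is an assumption, not something the table states.
external_funder = 4.273

# Main effects of curation level; Level 3 is the reference group (ratio 1).
curation_main = {
    "Level 3": 1.0,
    "Level 2": 0.617,
    "Level 1": 1.154,
    "Fast release": 0.345,
}

# Interaction of each curation level with external funding.
interaction = {
    "Level 3": 1.0,
    "Level 2": 0.967,
    "Level 1": 0.606,
    "Fast release": 0.744,
}


def funder_ratio(level):
    """Ratio of externally funded to membership-archive studies within one level."""
    return external_funder * interaction[level]


def combined_ratio(level):
    """Ratio relative to the reference cell (Level 3, membership archive)."""
    return curation_main[level] * external_funder * interaction[level]


for level in curation_main:
    print(f"{level:12s} funder ratio {funder_ratio(level):.3f}  "
          f"combined {combined_ratio(level):.3f}")
```

Under this reading, the funder ratio at Level 1 is 4.273 × 0.606, noticeably smaller than at Level 3. These are point estimates only; the sketch says nothing about their uncertainty.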
FIGURE 2 Marginal effects of curation level and external funder on data download numbers
| Dependent variable: total_data_users | (1) | (2) | (3) | (4) | (5) | (6) | (7) |
|---|---|---|---|---|---|---|---|
| Series (yes) | 1.283 | | | 0.836 | | 0.895 | 0.891 |
| Number of variables | 1.000*** | | | 1.000*** | | 1.000** | 1.000** |
| Institutional PI (yes) | 1.537*** | | | 1.277* | | 1.331** | 1.322** |
| Curation level (fast release) | | 0.152*** | | 0.145*** | 0.301*** | 0.284*** | 0.345** |
| Curation level (Level 1) | | 0.438*** | | 0.419*** | 0.846 | 0.850 | 1.154 |
| Curation level (Level 2) | | 0.527*** | | 0.518*** | 0.606*** | 0.593*** | 0.617** |
| Number of subject terms | | 1.019** | | 1.017** | 1.032*** | 1.029*** | 1.031*** |
| SSVD (yes) | | 0.607* | | 0.664 | 0.685 | 0.751 | 0.777 |
| Question text (yes) | | 1.007 | | 1.031 | 1.273 | 1.282 | 1.342* |
| SDA (yes) | | 0.515*** | | 0.538*** | 0.699** | 0.747** | 0.750* |
| External funder (yes) | | | 4.246*** | | 3.783*** | 3.660*** | 4.273*** |
| Curation level (fast release): External funder (yes) | | | | | | | 0.744 |
| Curation level (Level 1): External funder (yes) | | | | | | | 0.606* |
| Curation level (Level 2): External funder (yes) | | | | | | | 0.967 |
| Constant | 0.120*** | 0.472** | 0.068*** | 0.401*** | 0.087*** | 0.073*** | 0.060*** |
| Observations | 380 | 380 | 380 | 380 | 380 | 380 | 380 |
| Log likelihood | −2,137.064 | −2,115.939 | −2,095.893 | −2,110.307 | −2,069.210 | −2,065.205 | −2,063.611 |
| Theta | 0.703*** (0.045) | 0.768*** (0.049) | 0.833*** (0.054) | 0.786*** (0.051) | 0.937*** (0.062) | 0.954*** (0.063) | 0.959*** (0.064) |
| Akaike inf. crit. | 4,282.127 | 4,247.877 | 4,195.787 | 4,242.614 | 4,156.420 | 4,154.410 | 4,157.222 |

*p < .1; **p < .05; ***p < .01.
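The log likelihood and Akaike information criterion rows are linked by the standard definition AIC = 2k − 2·logLik, where k is the number of estimated parameters. The sketch below uses that identity to recover the implied k for each of the seven models from the reported values. Which parameters the fitting software counts in k (for example, whether Theta is included) is not stated in the table, so the interpretation of k is an assumption.

```python
# (log likelihood, AIC) for models (1) to (7), copied from the table above.
models = {
    1: (-2137.064, 4282.127),
    2: (-2115.939, 4247.877),
    3: (-2095.893, 4195.787),
    4: (-2110.307, 4242.614),
    5: (-2069.210, 4156.420),
    6: (-2065.205, 4154.410),
    7: (-2063.611, 4157.222),
}


def implied_parameters(log_lik, aic):
    """Solve AIC = 2k - 2*logLik for k and round away reporting error."""
    return round((aic + 2 * log_lik) / 2)


implied = {m: implied_parameters(ll, aic) for m, (ll, aic) in models.items()}

# The model with the lowest AIC offers the best fit/complexity trade-off.
best_model = min(models, key=lambda m: models[m][1])

for m, k in implied.items():
    print(f"model ({m}): implied k = {k}")
print(f"lowest AIC: model ({best_model})")
```

Model (7) has the highest log likelihood, but its three extra interaction terms cost more in the AIC penalty than they gain in fit, so model (6) has the lowest AIC.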