Kevin B Read, Jerry R Sheehan, Michael F Huerta, Lou S Knecht, James G Mork, Betsy L Humphreys.
Abstract
OBJECTIVE: This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are "invisible" or not deposited in a known repository.
Year: 2015 PMID: 26207759 PMCID: PMC4514623 DOI: 10.1371/journal.pone.0132735
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Diagram of the process used to identify NIH-funded datasets via the published journal literature (including results).
PubMed searches identifying articles with funding support from the NIH.
| a) | 2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) | 119,415 |
| b) | 2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) |
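As an illustration of how search (a) could be re-run programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint that asks PubMed only for a hit count. The function name `build_count_url` and the constant names are hypothetical, chosen for this example. The query string is copied from the table, and a count retrieved today may differ from the 119,415 reported here because PubMed is updated continuously.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search (a) from the table, copied verbatim.
NIH_FUNDED_2011 = (
    "2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] "
    "OR Research Support, N.I.H., Intramural [pt])"
)


def build_count_url(term: str) -> str:
    """Return an ESearch URL requesting only the number of PubMed hits for `term`."""
    params = {
        "db": "pubmed",       # search the PubMed database
        "term": term,         # the query, URL-encoded by urlencode
        "rettype": "count",   # return the hit count rather than a list of IDs
        "retmode": "json",    # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the URL only; no network request is made in this sketch.
    print(build_count_url(NIH_FUNDED_2011))
```

The sketch deliberately stops at constructing the URL, so it can be inspected or pasted into a browser before any request is sent.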
Removal of articles that were not considered “research”.
| 2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb] NOT (GDB [si] OR GENBANK [si] OR OMIM [si] OR PDB [si] OR PIR [si] OR RefSeq [si] OR SWISSPROT [si] OR ClinicalTrials.gov [si] OR ISRCTN [si] OR GEO [si] OR PubChem-Substance [si] OR PubChem-Compound [si] OR PubChem-BioAssay [si]) NOT molecular sequence data [mh:noexp] AND pubmed pmc all[sb] |
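The long query above combines the NIH funding filter with a MEDLINE subset filter, an exclusion of articles citing a repository in the secondary source ID ([si]) field, an exclusion of the Molecular Sequence Data MeSH heading, and a PMC availability filter. A minimal sketch of assembling such a query from its parts follows; the names `BASE`, `SI_REPOSITORIES`, and `build_query` are hypothetical, and the repository list is taken directly from the query text.

```python
# NIH funding filter for 2011, copied from the query above.
BASE = (
    "2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] "
    "OR Research Support, N.I.H., Intramural [pt])"
)

# Repositories named in the [si] exclusion clause of the query above.
SI_REPOSITORIES = [
    "GDB", "GENBANK", "OMIM", "PDB", "PIR", "RefSeq", "SWISSPROT",
    "ClinicalTrials.gov", "ISRCTN", "GEO",
    "PubChem-Substance", "PubChem-Compound", "PubChem-BioAssay",
]


def build_query(base: str, repositories: list[str]) -> str:
    """Assemble the combined PubMed query from the funding filter and exclusions."""
    # Join the repository names into one OR-separated [si] clause.
    si_clause = " OR ".join(f"{name} [si]" for name in repositories)
    return (
        f"{base} AND medline [sb] "
        f"NOT ({si_clause}) "
        "NOT molecular sequence data [mh:noexp] "
        "AND pubmed pmc all[sb]"
    )


if __name__ == "__main__":
    print(build_query(BASE, SI_REPOSITORIES))
```

Keeping the repository names in a list makes it easier to see which deposits are excluded and to adjust the list without editing the query string by hand.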
PubMed searches identifying when datasets were deposited in certain repositories (SI dataset).
| 2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb] |
PubMed search identifying articles with the “Molecular Sequence Data” MeSH Heading (MSD dataset).
| 2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb] |
PubMed search identifying articles in PMC.
| 2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb] |
Fig 2. Example of the PubMed Central Acknowledgments where the authors have indicated the deposit of data in a specific repository; PMCID: PMC4085032.
Breakdown for subtraction of articles that mention the deposit of data.
| Procedure taken | Articles identified | Articles remaining |
|---|---|---|
| 1a. NIH-funded articles for 2011 in PubMed | | 119,415 |
| 1b. NIH-funded articles for 2011 indexed for MEDLINE | 6,326 | 113,089 |
| 2. Articles with repository in [SI] field | 3,528 | 109,561 |
| 3. Articles with Molecular Sequence Data MeSH Heading | 3,460 | 106,101 |
| 4. PubMed cited articles not available in PMC | 24,497 | 81,604 |
| 5. Non-research articles | 9,694 | 71,910 |
| 6. Articles with repository in PMC Acknowledgements | 230 | 71,680 |
| 7. Additional articles not available in PMC | 198 | 71,482 |
| 8. Articles with repository in full-text XML (of 10,418 searched) | 1,825 | 69,657 |
| Total remaining articles used for subsequent analysis | | 69,657 |
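The breakdown is a running subtraction: each step removes the articles it identifies from the pool left by the previous step. The sketch below reproduces that arithmetic from the figures in the table; the names `START`, `STEPS`, and `running_totals` are hypothetical and used only for illustration.

```python
# Starting pool: NIH-funded articles for 2011 in PubMed (step 1a).
START = 119_415

# (procedure, articles identified and removed at that step), from the table.
STEPS = [
    ("1b. NIH-funded articles for 2011 indexed for MEDLINE", 6_326),
    ("2. Articles with repository in [SI] field", 3_528),
    ("3. Articles with Molecular Sequence Data MeSH Heading", 3_460),
    ("4. PubMed cited articles not available in PMC", 24_497),
    ("5. Non-research articles", 9_694),
    ("6. Articles with repository in PMC Acknowledgements", 230),
    ("7. Additional articles not available in PMC", 198),
    ("8. Articles with repository in full-text XML", 1_825),
]


def running_totals(start: int, steps: list[tuple[str, int]]) -> list[tuple[str, int, int]]:
    """Return (procedure, identified, remaining) after each subtraction step."""
    rows = []
    remaining = start
    for label, identified in steps:
        remaining -= identified
        rows.append((label, identified, remaining))
    return rows


if __name__ == "__main__":
    for label, identified, remaining in running_totals(START, STEPS):
        print(f"{label}: -{identified:,} -> {remaining:,}")
```

The final value of the "remaining" column is the pool of articles carried into the subsequent analysis.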
Fig 3. Questions for annotating datasets contained in research articles.
Fig 4. Repositories identified from the PubMed SI field and PMC Acknowledgements where datasets were deposited.
Fig 5. Keywords identified from full-text XML data mining.
Estimated Number of Articles with a Dataset Stored in a Known Repository.
| Procedure taken | Articles Examined | Articles identified | % of Examined Articles |
|---|---|---|---|
| [SI] field | 113,089 | 3,528 | 3.1% |
| Molecular Sequence Data MeSH Heading | 109,561 | 3,460 | 3.2% |
| PMC Acknowledgements | 71,910 | 230 | 0.3% |
| Full-text XML | 10,418 | 598 | 5.7% |
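Each percentage in this table is the number of articles identified divided by the number of articles examined at that step. The sketch below recomputes the column from the table's own counts; `ROWS` and `percent_identified` are hypothetical names introduced for this example.

```python
# (procedure, articles examined, articles identified), from the table.
ROWS = [
    ("[SI] field", 113_089, 3_528),
    ("Molecular Sequence Data MeSH Heading", 109_561, 3_460),
    ("PMC Acknowledgements", 71_910, 230),
    ("Full-text XML", 10_418, 598),
]


def percent_identified(examined: int, identified: int) -> float:
    """Return identified articles as a percentage of examined articles, to one decimal."""
    return round(100 * identified / examined, 1)


if __name__ == "__main__":
    for label, examined, identified in ROWS:
        print(f"{label}: {percent_identified(examined, identified)}%")
```

Note that the denominators differ from row to row, so the percentages are not directly additive across procedures.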
Summary of Analysis of Invisible Datasets.
| Measure | Finding |
|---|---|
| Average number of datasets per article | All articles reviewed: 3.4 per article |
| | Articles with two reviews: 2.9 per article |
| Type of subject | Human subjects: 28.3% |
| | Non-human animal subjects: 26.1% |
| New vs. existing data | New datasets: 87% |
| | Existing datasets: 13% |