David Bousfield, Johanna McEntyre, Sameer Velankar, George Papadatos, Alex Bateman, Guy Cochrane, Jee-Hyub Kim, Florian Graef, Vid Vartak, Blaise Alako, Niklas Blomberg.
Abstract
Data from open access biomolecular data resources, such as the European Nucleotide Archive and the Protein Data Bank, are extensively reused within life science research for comparative studies, method development and to derive new scientific insights. Indicators that estimate the extent and utility of such secondary use of research data need to reflect this complex and highly variable data usage. By linking open access scientific literature, via Europe PubMed Central, to the metadata in biological data resources, we separate data citations associated with a deposition statement from citations that capture the subsequent, long-term reuse of data in academia and industry. We extend this analysis to begin to investigate citations of biomolecular resources in patent documents. We find citations in more than 8,000 patents from 2014, demonstrating substantial use and an important role for data resources in defining biological concepts in granted patents to both academic and industrial innovators. Taken together, our results indicate that the citation patterns in biomedical literature and patents vary, not only due to citation practice but also according to the data resource cited. The results guard against the use of simple metrics such as citation counts and show that indicators of data use must not only take into account citations within the biomedical literature but also include reuse of data in industry and other parts of society by including patents and other scientific and technical documents such as guidelines, reports and grant applications.
Keywords: Bibliometrics; Data archiving; Data citations; Data repositories; Data reuse; Open data; Patent analysis; Research impact
Year: 2016 PMID: 27092246 PMCID: PMC4821287 DOI: 10.12688/f1000research.7911.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. The prevalence of the top 5 four-character IPC categories for the data set as a whole, for patents containing a data citation, and for patents with a data citation to UniProt or ENA.
Note that individual patents can have several IPC annotations; these percentages are based on summing all instances, i.e. “one code, one vote”. For example, 17% of the IPC codes annotating UniProt-positive patents were A61K. Key to coding: A61K preparations for medical, dental, or toilet purposes; C12N micro-organisms or enzymes; A61B diagnosis, surgery, intervention; C07K peptides; G01N investigating or analysing materials by determining their chemical or physical properties; C12P fermentation or enzyme-using processes to synthesise a desired chemical compound; C12Q measuring or testing processes involving enzymes or micro-organisms; C07D heterocyclic compounds. Note the absence of A61B from the more biological data sets, compared to the presence of C07K.
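The “one code, one vote” counting described in the caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the patent identifiers and their IPC annotations are hypothetical, and `ipc_share` is a name introduced here only for the example.

```python
from collections import Counter


def ipc_share(patents):
    """Return each IPC code's share of all code instances ("one code, one vote").

    `patents` maps a patent identifier to the list of four-character IPC
    codes annotating it. Every annotation counts once, so a patent with
    three codes contributes three votes.
    """
    counts = Counter(code for codes in patents.values() for code in codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Hypothetical patents, used only to illustrate the counting rule.
example = {
    "patent-1": ["A61K", "C07K"],
    "patent-2": ["A61K", "C12N", "C12Q"],
    "patent-3": ["C12N"],
}
shares = ipc_share(example)
```

Because the denominator is the number of code instances rather than the number of patents, the shares sum to 1 even though individual patents carry several codes.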
Annual total accessions mined in Europe PMC full-text content published between 2012 and 2014, e.g. 7016 articles in 2014 contained 37,767 references to ENA accessions.
Acc/Art is average accession references per article.
| | Total accessions mined: | | | % Total | Articles: | Acc/Art |
|---|---|---|---|---|---|---|
| Repository | 2012 | 2013 | 2014 | 2014 | 2014 | |
| ENA | 35897 | 33177 | 37767 | 42.6% | 7016 | 5.4 |
| | 21198 | 22047 | 19461 | 21.9% | 5913 | 3.3 |
| | 20528 | 21636 | 19252 | 21.7% | 3638 | 5.3 |
| | 2308 | 2766 | 3925 | 4.4% | 846 | 4.6 |
| | 1867 | 2051 | 2847 | 3.2% | 819 | 3.5 |
| | 445 | 914 | 1511 | 1.7% | 1215 | 1.2 |
| | 1148 | 1028 | 1484 | 1.7% | 451 | 3.3 |
| | 896 | 1063 | 1190 | 1.3% | 420 | 2.8 |
| | 534 | 569 | 612 | 0.7% | 419 | 1.5 |
| | 204 | 293 | 389 | 0.4% | 116 | 3.4 |
| | 139 | 224 | 269 | 0.3% | 67 | 4.0 |
| Total | 85164 | 85768 | 88707 | 100% | 20920 | 4.2 |
PDB accession citations by annual publication cohort.
The rows show the year in which a PDB data entry was first made public. The columns denote the year in which a citation of that data accession was recorded. Thus each row displays the time-series of citations for the cohort of data entries published during a given year. Reasons why there are observations below the diagonal are discussed in the text. Mature cohorts (release years 2005–2011) were cited on average 0.21 times per accession per year.
| PDB release year | NEW | SUBSEQUENT CITATION OF ACCESSION IN EPMC | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 |
| 2005 | 4,165 | 79 | 396 | 675 | 1,093 | 1,189 | 1,243 | 1,273 | 1,362 | 1,199 | 1,036 |
| 2006 | 4,930 | 6 | 93 | 485 | 998 | 1,211 | 1,290 | 1,274 | 1,253 | 1,300 | 1,074 |
| 2007 | 5,113 | 0 | 6 | 141 | 768 | 1,181 | 1,292 | 1,309 | 1,302 | 1,337 | 1,208 |
| 2008 | 5,255 | 2 | 1 | 14 | 179 | 998 | 1,308 | 1,335 | 1,468 | 1,308 | 1,320 |
| 2009 | 5,478 | 0 | 1 | 0 | 6 | 206 | 1,024 | 1,355 | 1,408 | 1,402 | 1,270 |
| 2010 | 5,792 | 1 | 1 | 0 | 2 | 8 | 254 | 1,123 | 1,495 | 1,483 | 1,432 |
| 2011 | 5,854 | 0 | 1 | 1 | 4 | 3 | 5 | 284 | 1,055 | 1,420 | 1,294 |
| 2012 | 6,309 | 1 | 2 | 2 | 0 | 1 | 4 | 12 | 331 | 1,232 | 1,488 |
| 2013 | 5,798 | 0 | 2 | 2 | 1 | 5 | 3 | 0 | 6 | 339 | 1,145 |
| 2014 | 2,152 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 176 |
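The per-accession citation rate quoted in the caption can be approximated from the table itself. The sketch below is one plausible reading, not the authors' documented method. It assumes that the ten data rows correspond, in order, to release years 2005 to 2014 (consistent with the diagonal pattern), that only citations from the release year onwards are counted, and that the rate is pooled across the mature cohorts. The names `PDB`, `YEARS` and `pooled_rate` are introduced here for illustration.

```python
# Mature cohorts from the PDB accession table above.
# Key: assumed release year. Value: (new accessions, citations in 2005..2014).
YEARS = list(range(2005, 2015))
PDB = {
    2005: (4165, [79, 396, 675, 1093, 1189, 1243, 1273, 1362, 1199, 1036]),
    2006: (4930, [6, 93, 485, 998, 1211, 1290, 1274, 1253, 1300, 1074]),
    2007: (5113, [0, 6, 141, 768, 1181, 1292, 1309, 1302, 1337, 1208]),
    2008: (5255, [2, 1, 14, 179, 998, 1308, 1335, 1468, 1308, 1320]),
    2009: (5478, [0, 1, 0, 6, 206, 1024, 1355, 1408, 1402, 1270]),
    2010: (5792, [1, 1, 0, 2, 8, 254, 1123, 1495, 1483, 1432]),
    2011: (5854, [0, 1, 1, 4, 3, 5, 284, 1055, 1420, 1294]),
}


def pooled_rate(table, cohorts, last_year=2014):
    """Citations per accession per year, pooled over the given cohorts.

    Assumption: a cohort released in year y is exposed for the years
    y..last_year, and citations recorded before y are ignored.
    """
    citations = 0
    accession_years = 0
    for year in cohorts:
        new, per_year = table[year]
        citations += sum(c for y, c in zip(YEARS, per_year) if y >= year)
        accession_years += new * (last_year - year + 1)
    return citations / accession_years


rate = pooled_rate(PDB, range(2005, 2012))
print(f"{rate:.2f}")  # 0.21
```

Under these assumptions the pooled figure agrees with the 0.21 citations per accession per year given in the caption. Other definitions, such as averaging the per-cohort rates, give slightly different values.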
PDB source article citations by annual publication cohort.
Same format as per Table 3. Notice sustained levels of citation over time. Mature cohorts (publication year 2005–2011) were cited on average 6.73 times per source article per year.
| SOURCE publication year | NEW | SUBSEQUENT CITATION OF SOURCE REFERENCE | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 |
| 2005 | 2,232 | 3,033 | 11,275 | 14,382 | 15,918 | 15,891 | 16,548 | 16,068 | 14,830 | 16,290 | 12,019 |
| 2006 | 2,421 | 10 | 3,233 | 13,330 | 16,464 | 17,489 | 17,422 | 17,163 | 15,939 | 17,293 | 12,635 |
| 2007 | 2,566 | 6 | 2 | 4,066 | 17,634 | 21,476 | 22,020 | 22,555 | 19,965 | 22,003 | 15,839 |
| 2008 | 2,596 | 9 | 14 | 49 | 3,840 | 17,567 | 21,813 | 21,366 | 20,129 | 21,749 | 16,014 |
| 2009 | 2,645 | 3 | 0 | 3 | 10 | 4,813 | 19,942 | 23,826 | 22,667 | 24,648 | 17,583 |
| 2010 | 2,787 | 3 | 0 | 2 | 7 | 0 | 5,204 | 22,206 | 25,781 | 27,645 | 20,553 |
| 2011 | 2,782 | 0 | 0 | 1 | 2 | 0 | 4 | 5,441 | 23,172 | 30,283 | 23,358 |
| 2012 | 2,895 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 6,951 | 32,529 | 28,939 |
| 2013 | 2,665 | 6 | 0 | 0 | 0 | 2 | 2 | 5 | 3 | 8,408 | 26,067 |
| 2014 | 957 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 39 | 4,659 |
Data citations mined from a 2014 SureChEMBL patent cohort.
Compare the averages with those in Table 1. Acc/Pat is the average number of accessions per patent per repository.
| Repository | Accessions | %Total | Patents | Acc/Pat |
|---|---|---|---|---|
| | 34,634 | 30% | 1,002 | 34.6 |
| | 33,097 | 28% | 4,074 | 8.1 |
| | 26,206 | 22% | 322 | 81.4 |
| | 14,127 | 12% | 1,387 | 10.2 |
| | 3,612 | 3% | 1,093 | 3.3 |
| | 1,877 | 2% | 97 | 19.4 |
| | 1,769 | 2% | 254 | 7.0 |
| | 1,158 | 1% | 115 | 10.1 |
| | 601 | 1% | 46 | 13.1 |
| | 30 | 0% | 19 | 1.6 |
| Total | 117,111 | | 8,409 | 19 |
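When comparing the averages with those for articles, it helps to be explicit about how an overall accessions-per-document figure is formed. The sketch below contrasts two common choices using the numbers in the patent table. It is an illustration under assumptions, not the authors' stated method. Repository names are not given in the rows above, so rows are identified by position, and the function names are introduced here for the example.

```python
# (accessions, patents) for each repository row of the patent table, in order.
# Repository names are not available here, so rows are positional.
ROWS = [
    (34634, 1002), (33097, 4074), (26206, 322), (14127, 1387), (3612, 1093),
    (1877, 97), (1769, 254), (1158, 115), (601, 46), (30, 19),
]


def pooled_ratio(rows):
    """All accessions divided by all documents (weights large rows more)."""
    return sum(a for a, _ in rows) / sum(d for _, d in rows)


def mean_of_ratios(rows):
    """Unweighted mean of the per-repository accessions-per-document values."""
    return sum(a / d for a, d in rows) / len(rows)


pooled = pooled_ratio(ROWS)        # about 13.9
unweighted = mean_of_ratios(ROWS)  # about 18.9

# For comparison, the 2014 article totals (88707 accessions, 20920 articles)
# give a pooled ratio of about 4.2.
articles_pooled = 88707 / 20920
```

The 19 shown in the total row is close to the unweighted mean of the Acc/Pat column, whereas the 4.2 in the article totals matches the pooled ratio, so the two overall figures may not be directly comparable. Either way, the patent cohort cites several times more accessions per document than the article cohort.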