Şenay Kafkas, Jee-Hyub Kim, Xingjun Pi, Johanna R McEntyre.
Abstract
BACKGROUND: In this study, we present an analysis of data citation practices in full text research articles and their corresponding supplementary data files, made available in the Open Access set of articles from Europe PubMed Central. Our aim is to investigate whether supplementary data files should be considered as a source of information for integrating the literature with biomolecular databases.
Keywords: Accession number; Molecular biology databases; Supplementary data; Text mining
Year: 2015 PMID: 25789152 PMCID: PMC4363206 DOI: 10.1186/2041-1480-6-1
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. Distribution of supplementary data by file formats. This figure shows the distribution of supplementary files linked to the Europe PMC open access full text articles by file format. The “text convertible” category covers formats that can be converted to text, such as pdf, xml, html and xsl.
Extraction patterns and contextual cues for databases
| Database | Patterns | Contextual cues |
|---|---|---|
| ENA | [A-Z][0-9]{5}; [A-Z]{2}[0-9]{6}; [A-Z]{3}[0-9]{5}; [A-Z]{4}[0-9]{8,10}; [A-Z]{5}[0-9]{7} | genbank, gen, ddbj, embl |
| UniProt | [A-N,R-Z][0-9][A-Z][A-Z, 0-9][A-Z, 0-9][0-9]; [O,P,Q][0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9][0-9] | swissprot, sprot, uniprot |
| PDBe | [0-9][A-Z, 0-9]{3} | pdb |
| InterPro | IPR[0-9]{6} | interpro |
| Pfam | PF(AM)?[0-9]{5} | hmm, family, pfam |
| ArrayExpress | E-[A-Z]{4}-[0-9]+ | arrayexpress |
| OMIM | [0-9]{6} | omim |
| Ensembl | ENS[A-Z]*G[0-9]{11}+ | ensembl |
| RefSeq | (AC\|AP\|NC\|NG\|NM\|NP\|NR\|NT\|NW\|NZ\|XM\|XP\|XR\|YP\|ZP\|NS)_([A-Z]{4})*[0-9]{6,9}(?:[.][0-9]+)? | refseq |
| RefSNP | RS[0-9]{5,9} | snp |
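The way a pattern and its contextual cues can work together is sketched below in Python. This is a minimal illustration, not the Whatizit ANA implementation: the table does not say how cues are applied, so the fixed character window (`WINDOW`) and the function name `find_accessions` are assumptions, and only three of the ten databases are included. The patterns are transcribed from the table with ASCII hyphens, word boundaries are added, and the comma and space inside `[A-Z, 0-9]` are treated as separators rather than literal characters.

```python
import re

# Patterns and cues transcribed from the table above (three databases only).
# Assumptions: \b word boundaries are added, and "[A-Z, 0-9]" is read as
# "[A-Z0-9]" (the comma and space are taken to be separators).
PATTERNS = {
    "InterPro": (re.compile(r"\bIPR[0-9]{6}\b"), ("interpro",)),
    "Pfam": (re.compile(r"\bPF(?:AM)?[0-9]{5}\b"), ("hmm", "family", "pfam")),
    "PDBe": (re.compile(r"\b[0-9][A-Z0-9]{3}\b"), ("pdb",)),
}

WINDOW = 50  # hypothetical: characters of context inspected on each side


def find_accessions(text):
    """Return (database, accession) pairs whose nearby text contains a cue."""
    hits = []
    for database, (pattern, cues) in PATTERNS.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - WINDOW)
            context = text[start:match.end() + WINDOW].lower()
            # Keep the candidate only if a database cue appears near it.
            if any(cue in context for cue in cues):
                hits.append((database, match.group()))
    return hits


sentence = "Domains were annotated with InterPro (IPR000719) and Pfam (PF00069)."
print(find_accessions(sentence))
print(find_accessions("We counted 1234 cells."))
```

The second call shows why cues matter for short patterns: `1234` fits the PDBe pattern, but with no `pdb` cue nearby it is not reported.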
Performance assessment results of the Whatizit ANA module
| Database | Evaluation | #TP (New) | #TP (Old) | #FP (New) | #FP (Old) | #FN (New) | #FN (Old) | Precision % (New) | Precision % (Old) | Recall % (New) | Recall % (Old) | F-score % (New) | F-score % (Old) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Automatic | 276 | 267 | 10 | 7 | 170 | 181 | 96.50 | 97.45 | 61.88 | 59.60 | 75.41 | 73.96 |
| | Manual | 286 | 274 | 0 | 0 | 170 | 181 | 100 | 100 | 62.72 | 60.22 | 77.10 | 75.17 |
| | Automatic | 574 | 569 | 28 | 8 | 39 | 39 | 95.35 | 98.61 | 93.64 | 93.59 | 94.49 | 96.03 |
| | Manual | 601 | 577 | 1 | 0 | 39 | 39 | 99.83 | 100 | 93.91 | 93.67 | 96.78 | 96.73 |
| | Automatic | 568 | 529 | 32 | 30 | 12 | 50 | 94.67 | 94.63 | 97.93 | 91.36 | 96.27 | 92.97 |
| | Manual | 620 | 559 | 0 | 0 | 12 | 50 | 100 | 100 | 98.10 | 91.79 | 99.04 | 95.72 |
FP: False Positive, FN: False Negative, Old: Old Whatizit-ANA settings, New: New Whatizit-ANA settings.
Manual and automatic evaluation: In the automatic evaluation, we estimated the performance of the tool by assuming that the publisher-supplied accession numbers in the articles are a gold standard for annotation. However, when we manually analysed the false positive annotations produced by our pipeline, we realised that the accession numbers provided in the articles (the annotations assumed to be the gold standard in the automatic evaluation) might not always be complete or correct. Annotations made by our tool that were not already annotated in the article were therefore deemed false positives by the automatic evaluation, but could be reassigned as true positives on manual inspection.
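The scores in the table follow from the counts by the standard definitions of precision, recall and F-score. The Python sketch below recomputes them for the first data row under the new settings and then shows the effect of reassigning false positives after manual inspection. The function names are illustrative, and the assumption that all 10 flagged annotations were confirmed by hand is inferred from the counts in the row that follows, not stated in the text.

```python
def precision_recall_f(tp, fp, fn):
    """Return precision, recall and F-score as percentages."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def reassign_false_positives(tp, fp, confirmed):
    """Move manually confirmed annotations from the FP count to the TP count."""
    return tp + confirmed, fp - confirmed


# First data row, new settings: 276 TP, 10 FP, 170 FN.
auto = precision_recall_f(276, 10, 170)

# Assumption: all 10 annotations flagged as false positives are confirmed as
# correct on manual inspection, giving 286 TP and 0 FP.
tp, fp = reassign_false_positives(276, 10, confirmed=10)
manual = precision_recall_f(tp, fp, 170)

print([round(value, 2) for value in auto])
print([round(value, 2) for value in manual])
```

The first call reproduces the 96.50, 61.88 and 75.41 reported for that row. False negatives are unchanged by the reassignment, so recall rises only because the true positive count grows.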
Figure 2. Distribution of database citations in the OA-ePMC articles. This figure shows the distribution of database citations in the Europe PMC open access full text articles.
Distribution of database citations in article body and supplementary data by databases in the OA-ePMC set
| Database | Supplementary data | Article body | Ratio (supplementary data / article body) | Shared citations (% of supplementary data citations) |
|---|---|---|---|---|
| Ensembl | 1,292,198 | 1,152 | 1,121.70 | 23 (0.002%) |
| RefSeq | 2,540,260 | 2,864 | 886.96 | 178 (0.007%) |
| InterPro | 564,956 | 639 | 884.13 | 77 (0.014%) |
| UniProt | 2,972,519 | 9,387 | 316.66 | 540 (0.018%) |
| Pfam | 924,624 | 2,968 | 311.53 | 435 (0.047%) |
| RefSNP | 2,443,679 | 31,061 | 78.67 | 3,849 (0.16%) |
| ENA | 3,390,319 | 125,534 | 27.01 | 4,167 (0.12%) |
| PDBe | 197,850 | 44,269 | 4.47 | 2,805 (1.42%) |
| ArrayExpress | 2,377 | 1,565 | 1.52 | 53 (2.23%) |
| OMIM | 2,400 | 2,779 | 0.86 | 19 (0.80%) |
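The table does not spell out how its last two columns are derived. The reported figures are consistent with the ratio being supplementary data citations divided by article body citations, and with the percentage being shared citations as a share of supplementary data citations. The Python sketch below checks that reading on the Ensembl and PDBe rows; the function name is illustrative.

```python
def citation_summary(supplementary, article_body, shared):
    """Return (ratio, shared percentage) for one database row.

    Assumed definitions, inferred from the reported figures:
    ratio = supplementary / article_body
    shared percentage = 100 * shared / supplementary
    """
    ratio = supplementary / article_body
    shared_pct = 100.0 * shared / supplementary
    return ratio, shared_pct


# Ensembl row: 1,292,198 supplementary, 1,152 article body, 23 shared.
ratio, shared_pct = citation_summary(1_292_198, 1_152, 23)
print(f"{ratio:,.2f}", f"{shared_pct:.3f}%")

# PDBe row: 197,850 supplementary, 44,269 article body, 2,805 shared.
pdbe_ratio, pdbe_shared_pct = citation_summary(197_850, 44_269, 2_805)
print(f"{pdbe_ratio:,.2f}", f"{pdbe_shared_pct:.2f}%")
```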
Figure 3. Distribution of the average number of database citations over years. Articles with supplementary data (left axis), supplementary data (right axis). This figure shows the distribution of the average number of database citations over the years in supplementary data and in the bodies of articles that have supplementary data.
Figure 4. Average number of database citations in article bodies, including and excluding ENA. This figure shows the average number of database citations in article bodies with ENA citations included and excluded.
Distribution of database citations in the supplementary data of the top 5% articles by databases
| Database | Total number of articles containing database citations in their supplementary data | % of database citations in the supplementary data of the top 5% articles |
|---|---|---|
| ENA | 2,458 | 88.78% |
| PDBe | 1,274 | 86.36% |
| RefSNP | 1,167 | 95.05% |
| UniProt | 1,059 | 83.87% |
| RefSeq | 721 | 63.39% |
| Pfam | 617 | 70.15% |
| InterPro | 499 | 72.46% |
| Ensembl | 377 | 67.62% |
| ArrayExpress | 66 | 88.35% |
| OMIM | 57 | 63.79% |