Joachim Baran, Martin Gerner, Maximilian Haeussler, Goran Nenadic, Casey M Bergman.
Abstract
BACKGROUND: The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions.
Year: 2011 PMID: 21980353 PMCID: PMC3183000 DOI: 10.1371/journal.pone.0024716
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. An example MartScript for automatically importing data and creating a pubmed2ensembl BioMart.
MartScript commands shown describe how an external data source is imported and added to the BioMart build process, how the SQL commands for the actual BioMart creation are generated and executed, and how meta-data that describe filter and attribute names are imported and customized into pubmed2ensembl. The example script is augmented with comments that explain the semantics of the commands.
Figure 2. Screenshot of the extended MartView interface implementing NCBI eUtilities search and other visualization improvements.
The central search box (outlined in red) provides new functionality to BioMart that allows integrated queries over text in PubMed/PubMed Central and genomic data in Ensembl. Results can also be displayed in a “list” format in a single row (red arrow), and publications linked to genes via gene name recognition show an excerpt of the underlying textual evidence as a pop-up window (blue arrow).
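As a rough illustration of the kind of text query that an NCBI eUtilities search involves, the sketch below builds an ESearch request URL for PubMed or PubMed Central. It is not the pubmed2ensembl implementation: the helper name `build_esearch_url`, the example search term, and the choice of JSON output are assumptions made for illustration, and no network request is sent.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Return an ESearch URL for a free-text query (hypothetical helper).

    db is "pubmed" for PubMed or "pmc" for PubMed Central.
    """
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example term is illustrative only; the URL is built but not fetched.
url = build_esearch_url("hox genes drosophila")
print(url)
```

The PMIDs returned by such a query could then be joined against gene-PMID pairs to reach genomic data, which is the kind of integrated query the search box is described as providing.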
Total number of PMIDs, gene IDs, and gene ID-PMID pairs for all data sources in the pubmed2ensembl database.
| Source | PMIDs | Gene IDs | Gene ID-PMID pairs |
| Entrez | 469,872 | 102,415 | 1,652,017 |
| MEDLINE | 1,867,773 | 36,310 | 3,439,750 |
| PMC | 102,405 | 25,870 | 635,885 |
| EMBL BLAST | 69,764 | 64,335 | 129,530 |
| EMBL XREF | 28,982 | 82,940 | 170,722 |
| text2genome | 9,128 | 11,560 | 24,188 |
| Total (non-redundant) | 2,093,066 | 148,019 | 5,459,005 |
Note that the PMC and text2genome data sources only include documents in the open access (OA) subset of PMC.
Figure 3. Relationship between the number of publications and genes summed across all data sources in the pubmed2ensembl database plotted by species.
The ten species with largest total numbers of publications and genes linked are labeled with their common names.
Number of overlapping gene-PMID pairs found in pairwise combinations of data sources in the pubmed2ensembl database.
| | MEDLINE | PMC | EMBL BLAST | EMBL XREF | text2genome |
| Entrez | 365,970 | 16,876 | 61,245 | 94,358 | 4,234 |
| MEDLINE | - | 43,022 | 28,785 | 15,559 | 4,004 |
| PMC | - | - | 1,195 | 516 | 9,779 |
| EMBL BLAST | - | - | - | 39,203 | 1,434 |
| EMBL XREF | - | - | - | - | 489 |
Degree of overlap among gene-PMID pairs in different pairwise combinations of data sources in the pubmed2ensembl database.
| | Entrez | MEDLINE | PMC | EMBL BLAST | EMBL XREF | text2genome |
| Entrez | - | 22.2% | 1.0% | 3.7% | 5.7% | 0.3% |
| MEDLINE | 10.6% | - | 1.3% | 0.8% | 0.5% | 0.1% |
| PMC | 2.7% | 6.8% | - | 0.2% | 0.1% | 1.5% |
| EMBL BLAST | 47.3% | 22.2% | 0.9% | - | 30.3% | 1.1% |
| EMBL XREF | 55.3% | 9.1% | 0.3% | 23.0% | - | 0.3% |
| text2genome | 17.5% | 16.6% | 40.4% | 5.9% | 2.0% | - |
Numbers reflect the percentage of overlapping gene-PMID pairs for a column-row combination, relative to the total number of gene-PMID pairs from the data source named in the left-most column. For example, gene-PMID pairs found in both Entrez Gene and MEDLINE comprise 22.2% of the total number of pairs from the Entrez source but only 10.6% of the total number of pairs from the MEDLINE source.
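The Entrez and MEDLINE example can be reproduced from the counts reported in the tables above. The following is a minimal sketch; the function and variable names are illustrative and are not part of pubmed2ensembl.

```python
# Gene ID-PMID pair totals per source, copied from the totals table.
TOTAL_PAIRS = {"Entrez": 1_652_017, "MEDLINE": 3_439_750}

# Pairs shared by Entrez and MEDLINE, copied from the pairwise overlap table.
SHARED_ENTREZ_MEDLINE = 365_970


def overlap_percent(shared: int, row_total: int) -> float:
    """Shared pairs as a percentage of the row source's total pairs."""
    return round(100.0 * shared / row_total, 1)


# The same shared count gives different percentages depending on
# which source supplies the denominator.
entrez_view = overlap_percent(SHARED_ENTREZ_MEDLINE, TOTAL_PAIRS["Entrez"])
medline_view = overlap_percent(SHARED_ENTREZ_MEDLINE, TOTAL_PAIRS["MEDLINE"])
print(entrez_view, medline_view)  # 22.2 10.6
```

Because the denominator changes with the row, the percentage matrix is not symmetric even though the underlying counts are.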
Precision and recall of pubmed2ensembl data sources relative to the BioCreAtIvE I and II Gene Normalization data sets.
| Species | Source | PMIDs | gene-PMID pairs | TP | FP | FN | Precision | Recall |
| human | Entrez | 529 | 1,393 | 831 | 562 | 585 | 0.597 | 0.587 |
| | MEDLINE | 431 | 1,059 | 898 | 161 | 518 | 0.848 | 0.634 |
| | PMC | 5 | 45 | 13 | 1,046 | 1,403 | 0.289 | 0.009 |
| | EMBL BLAST | 348 | 517 | 385 | 132 | 1,031 | 0.745 | 0.272 |
| | EMBL XREF | 279 | 406 | 300 | 106 | 1,116 | 0.739 | 0.212 |
| | text2genome | 2 | 2 | 2 | 0 | 1,414 | 1.000 | 0.001 |
| | Total | 531 | 1,416 | - | - | - | - | - |
| mouse | Entrez | 469 | 7,248 | 983 | 6,265 | 311 | 0.136 | 0.760 |
| | MEDLINE | 317 | 456 | 428 | 28 | 866 | 0.939 | 0.331 |
| | PMC | 4 | 8 | 7 | 449 | 1,287 | 0.875 | 0.005 |
| | EMBL BLAST | 98 | 148 | 110 | 38 | 1,184 | 0.743 | 0.085 |
| | EMBL XREF | 46 | 57 | 53 | 4 | 1,241 | 0.930 | 0.041 |
| | text2genome | 1 | 1 | 1 | 0 | 1,293 | 1.000 | 0.001 |
| | Total | 481 | 1,294 | - | - | - | - | - |
| fruitfly | Entrez | 308 | 1,511 | 1,442 | 69 | 124 | 0.954 | 0.921 |
| | MEDLINE | 45 | 58 | 53 | 5 | 1,513 | 0.914 | 0.034 |
| | PMC | 2 | 6 | 2 | 56 | 1,564 | 0.333 | 0.001 |
| | EMBL BLAST | 67 | 84 | 74 | 10 | 1,492 | 0.881 | 0.047 |
| | EMBL XREF | 0 | 0 | 0 | 0 | 1,566 | N/A | 0.000 |
| | text2genome | 2 | 4 | 4 | 0 | 1,562 | 1.000 | 0.003 |
| | Total | 314 | 1,566 | - | - | - | - | - |
| yeast | Entrez | 116 | 393 | 300 | 93 | 750 | 0.763 | 0.286 |
| | MEDLINE | 230 | 393 | 384 | 9 | 666 | 0.977 | 0.366 |
| | PMC | 14 | 47 | 23 | 370 | 1,027 | 0.489 | 0.022 |
| | EMBL BLAST | 42 | 52 | 42 | 10 | 1,008 | 0.808 | 0.040 |
| | EMBL XREF | 0 | 0 | 0 | 0 | 1,050 | N/A | 0.000 |
| | text2genome | 1 | 2 | 1 | 1 | 1,049 | 0.500 | 0.001 |
| | Total | 345 | 1,050 | - | - | - | - | - |
Gene-PMID pairs found in both a pubmed2ensembl data source and a BioCreAtIvE I or II Gene Normalization data set were considered true positives (TP). If a gene-PMID pair was present in a pubmed2ensembl data source but not in a BioCreAtIvE data set, it was considered a false positive (FP). Conversely, if a gene-PMID pair was not present in a pubmed2ensembl data source but was present in a BioCreAtIvE data set, it was considered a false negative (FN). Precision was calculated as TP/(TP+FP); Recall was calculated as TP/(TP+FN). Rows labeled Total indicate the total number of PMIDs and gene-PMID pairs in the BioCreAtIvE evaluation data set for each species. Values that would require division by zero are labeled “N/A”.
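The definitions above translate directly into set operations on gene-PMID pairs. The following sketch shows one way to compute them; the function names and the toy gene identifiers are hypothetical, and this is not the authors' evaluation code.

```python
from typing import Optional, Set, Tuple

Pair = Tuple[str, int]  # (gene ID, PMID)


def count_pairs(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for a data source against a gold standard."""
    tp = len(predicted & gold)   # in both the source and the gold standard
    fp = len(predicted - gold)   # in the source only
    fn = len(gold - predicted)   # in the gold standard only
    return tp, fp, fn


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[Optional[float], Optional[float]]:
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN); None stands for N/A."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Toy data with made-up identifiers, for illustration only.
predicted = {("geneA", 1), ("geneB", 1), ("geneC", 2)}
gold = {("geneA", 1), ("geneC", 2), ("geneD", 3)}
toy_counts = count_pairs(predicted, gold)

# Counts copied from the human Entrez row of the table.
p, r = precision_recall(831, 562, 585)
print(toy_counts, round(p, 3), round(r, 3))  # (2, 1, 1) 0.597 0.587
```

A source with no pairs for a species, such as EMBL XREF for fruitfly (TP = 0, FP = 0), yields an undefined precision, which corresponds to the “N/A” entries.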