Christopher Southan, Péter Várkonyi, Sorel Muresan.
Abstract
BACKGROUND: Since 2004, public cheminformatic databases and their collective functionality for exploring relationships between compounds, protein sequences, literature and assay data have advanced dramatically. In parallel, commercial sources that extract and curate such relationships from journals and patents have also been expanding. This work updates a previous comparative study of databases chosen because of their bioactive content, availability of downloads and facility to select informative subsets.
Year: 2009 PMID: 20298516 PMCID: PMC3225862 DOI: 10.1186/1758-2946-1-10
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Database update numbers.
| Dataset | Oct 2006 | Oct 2008 | Change 2006–2008 (absolute) | Change 2006–2008 (%) | Filtration |
|---|---|---|---|---|---|
| | 1488288 | 2054151 | 565863 | 38% | -8% |
| | 542858 | 658198 | 115340 | 21% | -8% |
| | 1034548 | 1484218 | 449670 | 43% | -7% |
| | 1933 | 3675 | 1742 | 90% | -4% |
| | n/a | 8864 | n/a | n/a | -1% |
| | n/a | 5228 | n/a | n/a | -11% |
| | n/a | 389 | n/a | n/a | -6% |
| | n/a | 4901 | n/a | n/a | -11% |
| | 128120 | 180856 | 52736 | 41% | -18% |
| | 7268193 | 14965539 | 7697346 | 106% | -23% |
| | 3318 | 4652 | 1334 | 40% | -2% |
| | 5626 | 5706 | n/a | n/a | -8% |
| | 35671 | 7472 | n/a | n/a | -3% |
| | 6070 | 5311 | n/a | n/a | -63% |
| | n/a | 233284 | n/a | n/a | -1% |
| | n/a | 24203 | n/a | n/a | -4% |
| | n/a | 7428 | n/a | n/a | -31% |
| | 3723 | 4545 | 822 | 22% | -7% |
| | 1018 | 1341 | 323 | 32% | -3% |
| | 1737 | 2999 | 1262 | 73% | -6% |
| | 131831 | 144383 | 12552 | 10% | -26% |
| | 159867 | 176600 | 16733 | 10% | -4% |
| | 1118 | 1435 | 317 | 28% | -5% |
The Filtration column shows the reduction in compound figures after filtration in this work (2008). For data sources that were also included in our previous publication, the changes (2006–2008) are shown in absolute numbers and as a percentage. The reasons for the decrease in two of these are explained in the methods section. New sources, for which the 2006–2008 changes are not applicable, are labelled n/a.
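The absolute and percentage changes can be reproduced from the two count columns. The following is a minimal sketch, not the authors' procedure: the function name `change` is hypothetical, and rounding to a whole percent is an assumption that happens to match the tabulated values for the rows used here.

```python
def change(count_2006, count_2008):
    """Return (absolute change, percentage change relative to 2006).

    Rounding to a whole percent is an assumption made for this sketch.
    """
    delta = count_2008 - count_2006
    pct = round(100 * delta / count_2006)
    return delta, pct


# (Oct 2006, Oct 2008) pairs copied from three rows of the table above.
rows = [(1488288, 2054151), (542858, 658198), (7268193, 14965539)]

for old, new in rows:
    delta, pct = change(old, new)
    print(f"{old} -> {new}: {delta:+d} ({pct}%)")
```

Rows marked n/a in the 2006 column have no baseline, so no change can be computed for them.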
Compounds-per-protein and per-document.
| Database or subset | Documents | Protein ID type | Total proteins | Human proteins | Cpds-per-protein | Cpds-per-document |
|---|---|---|---|---|---|---|
| GVKBIO | 87747 | Entrez Gene | 3292 | 1468 | 604 | 22 |
| GVKBIO journals | 51810 | Entrez Gene | 2660 | 1146 | 239 | 12 |
| GVKBIO patents | 35937 | Entrez Gene | 1765 | 952 | 815 | 40 |
| GVKBIO DD | 26825 | Entrez Gene | 733 | 339 | 5 | 0.14 |
| GVKBIO CCD | 27286 | Entrez Gene | 1224 | 610 | 7 | 0.32 |
| WOMBAT | 10205 | Swiss-Prot | 1979 | 1095 | 91 | 18 |
| DrugBank | n/a | Swiss-Prot | 1625 | 1356 | 3 | n/a |
| PubChem actives | n/a | RefSeq | 72 | n/a | 104 | n/a |
| PubChem PDB | n/a | RefSeq | 818 | n/a | 14 | n/a |
| BindingDB | 1142 | Swiss-Prot | 297 | 97 | 112 | 19 |
| MDDR | 137754 | n/a | n/a | n/a | n/a | 1.4 |
| DNP | 7765 | n/a | n/a | n/a | n/a | 18 |
Column three is the type of protein identifier used for the count of proteins from all species (column four) and of human proteins (column five). In columns six and seven, the filtered compound totals are taken from Additional file 1. The compound ratios are calculated with respect to total proteins and documents. For cells labelled n/a, the information was either not applicable or not available. For reference, we have included a compounds-per-protein calculation for the PubChem actives subset even though there are no document-protein links analogous to the other sources.
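The two ratios are simple quotients of the filtered compound total over the protein and document counts. The sketch below illustrates the arithmetic only: the function name `ratio` is hypothetical, and the input numbers are toy values, not figures from Additional file 1 or from the table.

```python
def ratio(compounds, units):
    """Compounds per unit (protein or document); None when the count is n/a."""
    if not units:  # None or 0 stands in for an n/a cell
        return None
    return compounds / units


# Toy values chosen for illustration only (assumed, not from the paper).
filtered_compounds = 1000
total_proteins = 10
documents = 50

print(ratio(filtered_compounds, total_proteins))  # compounds-per-protein
print(ratio(filtered_compounds, documents))       # compounds-per-document
print(ratio(filtered_compounds, None))            # n/a document count
```

Returning `None` for a missing denominator mirrors the n/a cells in the table instead of raising a division error.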
Figure 1. Compounds extracted from Journals. A Venn diagram showing the content overlap and differences between the databases or subsets that contain compound structures extracted from medicinal chemistry journals. The data source name and compound totals are given outside each of the three circles.
Figure 2. Comparison of GVKBIO, WOMBAT and PubChem. A Venn diagram showing the content overlap and differences between GVKBIO, WOMBAT and PubChem. The older 2006 versions are shown in (A) and 2008 from this publication in (B). The data source name and compound totals are given outside each of the three circles.
Figure 3. Comparison of drug databases. A Venn diagram showing the content overlap and differences between GVKBIO DD, MDDR launched and DrugBank approved. The 2006 versions are shown in (A) and 2008 from this publication in (B). The data source name and compound totals are given outside each of the three circles.