Saber A Akhondi, Sorel Muresan, Antony J Williams, Jan A Kors.
Abstract
BACKGROUND: A wide range of chemical compound databases are currently available for pharmaceutical research. To retrieve compound information, including structures, researchers can query these chemical databases using non-systematic identifiers. These are source-dependent identifiers (e.g., brand names, generic names), which are usually assigned to the compound at the point of registration. The correctness of non-systematic identifiers (i.e., whether an identifier matches the associated structure) can only be assessed manually, which is cumbersome, but it is possible to automatically check their ambiguity (i.e., whether an identifier matches more than one structure). In this study we have quantified the ambiguity of non-systematic identifiers within and between eight widely used chemical databases. We also studied the effect of chemical structure standardization on reducing the ambiguity of non-systematic identifiers.
Keywords: Chemical databases; Chemical name ambiguity; Molecular structure; Non-systematic chemical identifiers; Quality control
Year: 2015 PMID: 26579214 PMCID: PMC4646925 DOI: 10.1186/s13321-015-0102-6
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Number of compounds and non-systematic identifiers in different chemical databases
| Database | Compounds | Identifiers | Identifiers/compound |
|---|---|---|---|
| PubChem | 4,232,875 | 15,211,133 | 3.6 |
| ChemSpider | 6,646,902 | 10,063,709 | 1.5 |
| ChemSpider-V | 654,052 | 850,601 | 1.3 |
| HMDB | 37,761 | 308,733 | 8.2 |
| NPC | 14,814 | 131,290 | 8.9 |
| TTD | 2977 | 105,407 | 35.4 |
| ChEBI | 15,633 | 41,956 | 2.7 |
| ChEMBL | 21,398 | 28,011 | 1.3 |
| DrugBank | 3769 | 26,780 | 7.1 |
Ambiguity of non-systematic identifiers and the average number of compounds per ambiguous identifier, within databases
| Database | Unique identifiers | Ambiguous identifiers | Ambiguity (%) | Compounds/ambiguous identifier |
|---|---|---|---|---|
| HMDB | 173,455 | 26,430 | 15.2 | 6.1 |
| TTD | 100,570 | 4607 | 4.6 | 2.1 |
| ChEMBL | 26,910 | 1050 | 3.9 | 2.1 |
| NPC | 112,717 | 3455 | 3.1 | 2.1 |
| ChemSpider | 9,691,277 | 245,541 | 2.5 | 2.5 |
| ChEBI | 41,023 | 827 | 2.0 | 2.1 |
| PubChem | 14,937,728 | 201,621 | 1.3 | 2.4 |
| ChemSpider-V | 842,128 | 5401 | 0.6 | 2.3 |
| DrugBank | 26,759 | 20 | 0.1 | 2.1 |
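The columns of the table above can be illustrated with a minimal sketch. It assumes each database is available as `(identifier, structure_key)` pairs, where `structure_key` is any string that identifies a structure; the function name `ambiguity_stats` and the toy records are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def ambiguity_stats(records):
    """Summarize identifier ambiguity for one database.

    records: iterable of (identifier, structure_key) pairs.
    An identifier counts as ambiguous when it maps to more than
    one distinct structure key.
    """
    structures = defaultdict(set)
    for identifier, structure_key in records:
        structures[identifier].add(structure_key)

    unique = len(structures)
    ambiguous = [keys for keys in structures.values() if len(keys) > 1]
    n_ambiguous = len(ambiguous)
    return {
        "unique_identifiers": unique,
        "ambiguous_identifiers": n_ambiguous,
        "ambiguity_pct": 100.0 * n_ambiguous / unique if unique else 0.0,
        "compounds_per_ambiguous": (
            sum(len(keys) for keys in ambiguous) / n_ambiguous
            if n_ambiguous
            else 0.0
        ),
    }


# Toy data (hypothetical): "vitamin E" points at three structures.
records = [
    ("aspirin", "S1"),
    ("aspirin", "S1"),
    ("vitamin E", "S2"),
    ("vitamin E", "S3"),
    ("vitamin E", "S4"),
    ("caffeine", "S5"),
    ("glucose", "S6"),
]
stats = ambiguity_stats(records)
```

With these toy records, one of the four unique identifiers is ambiguous (25%), and it maps to three compounds.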
Number of shared non-systematic identifiers between databases
| Database | ChEBI | ChEMBL | ChemSpider | ChemSpider-V | DrugBank | HMDB | NPC | PubChem |
|---|---|---|---|---|---|---|---|---|
| ChEMBL | 1886 | | | | | | | |
| ChemSpider | 28,281 | 23,584 | | | | | | |
| ChemSpider-V | 5081 | 4303 | | | | | | |
| DrugBank | 2981 | 4108 | 19,222 | 6985 | | | | |
| HMDB | 4529 | 2325 | 27,608 | 11,774 | 5515 | | | |
| NPC | 5516 | 6858 | 62,527 | 18,709 | 22,377 | 6815 | | |
| PubChem | 24,331 | 25,607 | 2,275,338 | 99,334 | 24,929 | 35,905 | 68,280 | |
| TTD | 4854 | 5019 | 50,182 | 8305 | 17,232 | 6256 | 23,669 | 98,853 |
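A pairwise comparison of two databases might be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: each database is a dict from identifier to a set of structure keys, and a shared identifier is counted as ambiguous when the pooled structures from both databases contain more than one distinct key. The name `shared_ambiguity` and the toy data are hypothetical.

```python
def shared_ambiguity(db_a, db_b):
    """Compare two databases given as {identifier: set_of_structure_keys}.

    Returns (number of shared identifiers,
             number of shared identifiers that are ambiguous,
             ambiguity of the shared identifiers in percent).
    Assumption: a shared identifier is ambiguous when the union of its
    structures across both databases has more than one member.
    """
    shared = db_a.keys() & db_b.keys()
    ambiguous = {name for name in shared if len(db_a[name] | db_b[name]) > 1}
    pct = 100.0 * len(ambiguous) / len(shared) if shared else 0.0
    return len(shared), len(ambiguous), pct


# Toy data (hypothetical): "caffeine" resolves to different structures.
db_a = {"aspirin": {"S1"}, "caffeine": {"S5"}, "taxol": {"S7"}}
db_b = {"aspirin": {"S1"}, "caffeine": {"S9"}, "glucose": {"S6"}}
n_shared, n_ambiguous, pct = shared_ambiguity(db_a, db_b)
```

Here two identifiers are shared, and one of them ("caffeine") is ambiguous across the pair, giving 50%.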
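Effect of standardization on the ambiguity of non-systematic identifiers (in %) within databases. FICTS denotes sensitivity to fragments (F), isotopes (I), charges (C), tautomers (T), and stereochemistry (S); a lowercase u in place of a letter means that feature is ignored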
| Database | FICTS | uICTS | FuCTS | FIuTS | FICuS | FICTu |
|---|---|---|---|---|---|---|
| HMDB | 15.2 | 15.2 | 15.2 | 15.2 | 15.2 | 15.2 |
| TTD | 4.6 | 1.8 | 2.1 | 2.0 | 2.1 | 2.1 |
| ChEMBL | 3.9 | 2.0 | 3.8 | 3.9 | 3.9 | 3.4 |
| NPC | 3.1 | 2.7 | 2.7 | 2.7 | 2.7 | 2.7 |
| ChemSpider | 2.5 | 2.3 | 2.5 | 2.5 | 2.2 | 1.9 |
| ChEBI | 2.0 | 1.8 | 1.9 | 1.4 | 1.8 | 1.6 |
| PubChem | 1.4 | 1.2 | 1.3 | 1.3 | 0.6 | 0.6 |
| ChemSpider-V | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.3 |
| DrugBank | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
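The effect of ignoring one structural feature can be sketched by recomputing ambiguity after mapping every structure through a standardization function. The example below is a toy under stated assumptions: structures are SMILES strings, and `drop_stereo_marks` is a hypothetical string-level stand-in that only strips tetrahedral `@` marks. It is not a real standardizer; an actual pipeline would rely on a cheminformatics toolkit.

```python
from collections import defaultdict


def ambiguity_pct(records, key_fn=lambda structure: structure):
    """Percentage of identifiers that map to more than one structure
    after each structure has been passed through key_fn."""
    groups = defaultdict(set)
    for identifier, structure in records:
        groups[identifier].add(key_fn(structure))
    if not groups:
        return 0.0
    ambiguous = sum(1 for keys in groups.values() if len(keys) > 1)
    return 100.0 * ambiguous / len(groups)


def drop_stereo_marks(smiles):
    # Toy stand-in (assumption): remove '@' chirality marks so that
    # enantiomers collapse to the same string.
    return smiles.replace("@", "")


# Toy data (hypothetical): two enantiomers share the name "alanine",
# and "acetate" is attached to both a charged and a neutral form.
records = [
    ("alanine", "C[C@H](N)C(=O)O"),
    ("alanine", "C[C@@H](N)C(=O)O"),
    ("ethanol", "CCO"),
    ("acetate", "CC(=O)[O-]"),
    ("acetate", "CC(=O)O"),
]
before = ambiguity_pct(records)
after = ambiguity_pct(records, drop_stereo_marks)
```

In this toy set, ignoring stereo marks resolves the "alanine" ambiguity but leaves "acetate" ambiguous, since that difference is one of charge.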
Effect of standardization on the ambiguity of non-systematic identifiers (in %) across databases
| Database | Standardization | ChEBI | ChEMBL | ChemSpider | ChemSpider-V | DrugBank | HMDB | NPC | PubChem |
|---|---|---|---|---|---|---|---|---|---|
| ChEMBL | FICTS | 39.5 | | | | | | | |
| ChEMBL | uICTS | 22.0 | | | | | | | |
| ChEMBL | FICTu | 32.6 | | | | | | | |
| ChemSpider | FICTS | 30.9 | 29.9 | | | | | | |
| ChemSpider | uICTS | 28.4 | 25.0 | | | | | | |
| ChemSpider | FICTu | 19.5 | 17.8 | | | | | | |
| ChemSpider-V | FICTS | 39.9 | 43.6 | | | | | | |
| ChemSpider-V | uICTS | 36.5 | 34.1 | | | | | | |
| ChemSpider-V | FICTu | 26.1 | 27.3 | | | | | | |
| DrugBank | FICTS | 28.7 | 39.6 | 50.7 | 45.2 | | | | |
| DrugBank | uICTS | 15.5 | 22.6 | 41.4 | 37.2 | | | | |
| DrugBank | FICTu | 23.3 | 32.6 | 35.9 | 33.4 | | | | |
| HMDB | FICTS | 49.6 | 48.4 | 57.3 | 43.9 | 30.7 | | | |
| HMDB | uICTS | 47.4 | 36.1 | 54.4 | 42.4 | 30.4 | | | |
| HMDB | FICTu | 32.3 | 33.0 | 34.4 | 23.3 | 16.1 | | | |
| NPC | FICTS | 40.7 | 46.4 | 60.2 | 48.6 | 21.9 | 44.4 | | |
| NPC | uICTS | 31.2 | 31.1 | 45.9 | 37.3 | 21.3 | 43.5 | | |
| NPC | FICTu | 26.8 | 36.2 | 45.1 | 31.6 | 13.5 | 21.2 | | |
| PubChem | FICTS | 36.9 | 33.1 | 17.7 | 41.6 | 46.8 | 43.3 | 49.8 | |
| PubChem | uICTS | 32.9 | 25.2 | 16.1 | 37.1 | 37.6 | 40.9 | 38.4 | |
| PubChem | FICTu | 24.1 | 24.0 | 9.0 | 25.4 | 34.6 | 26.7 | 35.1 | |
| TTD | FICTS | 27.7 | 36.9 | 32.3 | 40.3 | 18.2 | 43.0 | 22.4 | 25.4 |
| TTD | uICTS | 20.9 | 24.6 | 27.8 | 32.7 | 16.8 | 41.1 | 20.6 | 21.6 |
| TTD | FICTu | 15.2 | 26.0 | 17.8 | 23.0 | 10.1 | 22.0 | 9.2 | 13.8 |