Saber A Akhondi, Jan A Kors, Sorel Muresan.
Abstract
BACKGROUND: The correctness of structures and associated metadata in public and commercial chemical databases greatly affects drug discovery research activities such as quantitative structure-property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although these representations are interchangeable for many cheminformatics purposes, there have been no studies on the inconsistencies among them that arise from differing approaches to data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.
Year: 2012 PMID: 23237381 PMCID: PMC3539895 DOI: 10.1186/1758-2946-4-35
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Number of structures (MOLs) and systematic identifier counts for databases in this study
| Database | MOL | InChI | SMILES | IUPAC name |
|---|---|---|---|---|
| DrugBank | 6506 | 6391 | 6504 | 6489 |
| ChEBI | 21367 | 19076 | 19725 | 18798 |
| HMDB | 8534 | 8534 | 8534 | 7727 |
| PubChem | 5069294 | 5069293 | 5069294 | 4769031 |
| NPC | 8024 | 0 | 8018 | 0 |
Figure 1. Chemical representations of anastrozole.
Figure 2. Comparison of MOL representation with systematic identifiers.
Successful conversion (in %) of MOL files and systematic identifiers to InChI(ca)
| Database | MOL | InChI | SMILES | IUPAC name |
|---|---|---|---|---|
| DrugBank | 98.9 | 100 | 99.1 | 93.6 |
| ChEBI | 90.6 | 100 | 96.8 | 69.8 |
| HMDB | 100 | 99.9 | 100 | 38.1 |
| PubChem | 100 | 100 | 100 | 92.6 |
| NPC | 99.7 | - | 100 | - |
Consistency of MOLs and systematic identifiers (in % agreement) within databases
| Database | MOL–InChI | MOL–SMILES | MOL–IUPAC |
|---|---|---|---|
| DrugBank | 98.2 | 98.5 | 90.0 |
| ChEBI | 96.5 | 96.5 | 75.3 |
| HMDB | 89.3 | 37.2 | 55.7 |
| PubChem | 97.7 | 97.8 | 87.2 |
| NPC | - | 93.4 | - |
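A percentage agreement of this kind can be sketched in a few lines. The sketch below assumes each record has already been reduced to two InChI strings, one derived from the MOL file and one derived from a systematic identifier, with `None` marking a failed conversion. The function name `percent_agreement`, the record layout, and the decision to skip records where either conversion failed are illustrative assumptions, not the authors' pipeline, and the toy records are not data from the study.

```python
from typing import Iterable, Optional, Tuple

InchiPair = Tuple[Optional[str], Optional[str]]


def percent_agreement(pairs: Iterable[InchiPair]) -> Optional[float]:
    """Return the % of comparable pairs whose two InChI strings are identical.

    Assumption: a pair is comparable only if both conversions succeeded
    (neither value is None). Returns None when nothing is comparable.
    """
    compared = 0
    matching = 0
    for mol_inchi, identifier_inchi in pairs:
        if mol_inchi is None or identifier_inchi is None:
            continue  # skip records where a conversion failed
        compared += 1
        if mol_inchi == identifier_inchi:
            matching += 1
    if compared == 0:
        return None
    return 100.0 * matching / compared


# Toy records (hypothetical, for illustration only).
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
WATER = "InChI=1S/H2O/h1H2"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

toy_pairs = [
    (ETHANOL, ETHANOL),   # consistent
    (ETHANOL, METHANOL),  # inconsistent
    (WATER, None),        # identifier could not be converted, skipped
    (BENZENE, BENZENE),   # consistent
]

agreement = percent_agreement(toy_pairs)
```

Exact string comparison is the simplest possible criterion; whether failed conversions should count as disagreements instead of being skipped is a design choice that changes the denominator and therefore the reported percentage.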
Effect of different standardisation rules on the consistency between MOL files and systematic identifiers (in % agreement)
| DrugBank | MOL–InChI | 98.2 | 99.0 | 99.0 | 99.0 | 99.4 | 99.8 |
| | MOL–SMILES | 98.5 | 98.6 | 98.6 | 98.6 | 99.5 | 99.7 |
| | MOL–IUPAC | 90.0 | 90.1 | 90.0 | 90.1 | 93.5 | 96.2 |
| ChEBI | MOL–InChI | 96.5 | 98.9 | 98.5 | 98.4 | 99.2 | 99.6 |
| | MOL–SMILES | 96.5 | 96.6 | 96.6 | 96.6 | 99.6 | 99.8 |
| | MOL–IUPAC | 75.3 | 75.6 | 75.4 | 77.1 | 79.7 | 91.9 |
| HMDB | MOL–InChI | 89.3 | 89.8 | 89.7 | 90.3 | 89.9 | 98.5 |
| | MOL–SMILES | 37.2 | 37.3 | 37.2 | 38.0 | 43.1 | 98.3 |
| | MOL–IUPAC | 55.7 | 55.8 | 55.8 | 57.5 | 58.8 | 84.8 |
| PubChem | MOL–InChI | 97.7 | 97.9 | 97.9 | 97.9 | 99.3 | 99.9 |
| | MOL–SMILES | 97.8 | 97.9 | 97.9 | 97.8 | 99.2 | 99.9 |
| | MOL–IUPAC | 87.2 | 87.7 | 87.5 | 87.2 | 93.7 | 97.2 |
| NPC | MOL–SMILES | 93.4 | 93.5 | 93.4 | 93.4 | 98.0 | 99.8 |
Agreement between MOL files of compounds that have a cross-reference in one database (row) to another database (column)
| Database | DrugBank | ChEBI | HMDB | PubChem | NPC |
|---|---|---|---|---|---|
| DrugBank | - | 72.1% (1666) | - | 93.7% (4723) | - |
| ChEBI | 54.3% (1288) | - | 45.6% (114) | - | - |
| HMDB | - | 64.0% (1433) | - | 76.0% (2217) | - |
| PubChem | - | - | - | - | - |
| NPC | 76.7% (1320) | - | - | 25.8% (9557) | - |
The number of cross-references is given in parentheses.
Agreement between MOL files of compounds that have a cross-reference in one database (row) to another database (column) after stereochemistry standardisation
| Database | DrugBank | ChEBI | HMDB | PubChem | NPC |
|---|---|---|---|---|---|
| DrugBank | - | 91.4% | - | 95.6% | - |
| ChEBI | 68.6% | - | 93.0% | - | - |
| HMDB | - | 82.0% | - | 89.8% | - |
| PubChem | - | - | - | - | - |
| NPC | 93.4% | - | - | 47.6% | - |
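To make the effect of stereochemistry standardisation easier to read off, the two cross-reference tables can be compared cell by cell. The sketch below is only a worked subtraction over the percentages listed above. It assumes the columns follow the same database order as the rows (DrugBank, ChEBI, HMDB, PubChem, NPC), which the "-" diagonal suggests but the tables do not state; the names `before`, `after`, and `agreement_gain` are illustrative.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (row database, column database); column order is assumed

# % agreement between MOL files before stereochemistry standardisation.
before: Dict[Pair, float] = {
    ("DrugBank", "ChEBI"): 72.1, ("DrugBank", "PubChem"): 93.7,
    ("ChEBI", "DrugBank"): 54.3, ("ChEBI", "HMDB"): 45.6,
    ("HMDB", "ChEBI"): 64.0, ("HMDB", "PubChem"): 76.0,
    ("NPC", "DrugBank"): 76.7, ("NPC", "PubChem"): 25.8,
}

# % agreement after stereochemistry standardisation.
after: Dict[Pair, float] = {
    ("DrugBank", "ChEBI"): 91.4, ("DrugBank", "PubChem"): 95.6,
    ("ChEBI", "DrugBank"): 68.6, ("ChEBI", "HMDB"): 93.0,
    ("HMDB", "ChEBI"): 82.0, ("HMDB", "PubChem"): 89.8,
    ("NPC", "DrugBank"): 93.4, ("NPC", "PubChem"): 47.6,
}


def agreement_gain(b: Dict[Pair, float], a: Dict[Pair, float]) -> Dict[Pair, float]:
    """Gain in percentage points for every pair present in both tables."""
    return {pair: round(a[pair] - b[pair], 1) for pair in b if pair in a}


gains = agreement_gain(before, after)
largest_pair = max(gains, key=gains.get)  # pair with the biggest improvement
```

Under the assumed column order, every listed pair improves, and the differences are percentage points, not relative changes.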