A Patrícia Bento, Anne Hersey, Eloy Félix, Greg Landrum, Anna Gaulton, Francis Atkinson, Louisa J Bellis, Marleen De Veij, Andrew R Leach.
Abstract
BACKGROUND: The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. To maintain the quality of the final database, and to make it easy to compare and integrate data on the same compound from different sources, the chemical structures in the database must be appropriately standardised.
Keywords: ChEMBL; Chemistry; Curation; Open source; RDKit; Standardisation
Year: 2020 PMID: 33431044 PMCID: PMC7458899 DOI: 10.1186/s13321-020-00456-1
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Penalty scores and annotations output by the Checker module
| Penalty score | Penalty explanation |
|---|---|
| 7 | Error-9986 (Cannot process aromatic bonds); Illegal input; InChI: Unknown element(s) |
| 6 | All atoms have zero coordinates; InChI: Accepted unusual valence(s); InChI: Empty structure; Molecule has 3D coordinates; Molecule has a radical that is not found in the known list; Molecule has six (or more) atoms with exactly the same coordinates; Number of atoms less than 1; Polymer information in mol file; V3000 mol file |
| 5 | InChI_RDKit/Mol stereo mismatch; Mol/Inchi/RDKit stereo mismatch; RDKit_Mol/InChI stereo mismatch; Molecule has a bond with an illegal stereo flag; Molecule has a bond with an illegal type; Molecule has a crossed bond in a ring; Molecule has two (or more) atoms with exactly the same coordinates |
| 2 | InChI_Mol/RDKit stereo mismatch; Molecule has a stereo bond in a ring; Molecule has an atom with multiple stereo bonds; Molecule has a stereo bond to a stereocenter; Molecule has the 3D flag set for a 2D conformer; Other InChI Warnings |
A score of 7 is the most serious penalty and a score of 2 the least serious.
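The table above is effectively a lookup from Checker message to penalty score. As a minimal sketch (not the Checker module's actual code), the lookup can be written as a dictionary and used to rank the issues reported for one structure. The names `PENALTY_SCORES`, `score_issues` and `max_penalty` are hypothetical, only a subset of the messages is included, and returning 0 for a structure with no issues is an assumption of this sketch.

```python
# Hypothetical subset of the penalty table above: message -> penalty score.
PENALTY_SCORES = {
    "Illegal input": 7,
    "InChI: Unknown element(s)": 7,
    "V3000 mol file": 6,
    "Molecule has 3D coordinates": 6,
    "Molecule has a crossed bond in a ring": 5,
    "Molecule has a stereo bond in a ring": 2,
}


def score_issues(messages):
    """Return (score, message) pairs for the given messages, most serious first."""
    scored = [(PENALTY_SCORES[m], m) for m in messages]
    return sorted(scored, reverse=True)


def max_penalty(messages):
    """Return the highest penalty score, or 0 (assumed) when there are no issues."""
    scored = score_issues(messages)
    return scored[0][0] if scored else 0


issues = ["Molecule has a stereo bond in a ring", "V3000 mol file"]
print(score_issues(issues))
print(max_penalty(issues))
```

Keeping the scores in data rather than in branching logic makes it straightforward to adjust how severely a given message is penalised.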
Fig. 1 Examples of the multicomponent forms of paroxetine and amphetamine and how they have been aggregated by use of the GetParent component
Total number of times each penalty was output by the Checker when it was run on the ChEMBL Literature set, the SureChEMBL set and the PubChem set
| Penalty score | Penalty explanation | SureChEMBL | ChEMBL Literature | PubChem |
|---|---|---|---|---|
| 7 | Error-9986 (Cannot process aromatic bonds) | 4 | 0 | 0 |
| 7 | Illegal input | 0 | 1 | 0 |
| 7 | InChI: Unknown element(s) | 3 | 0 | 1355 |
| 6 | All atoms have zero coordinates | 0 | 0 | 12 |
| 6 | InChI: Accepted unusual valence(s) | 73 | 1 | 2155 |
| 6 | InChI: Empty structure | 0 | 1 | 5824 |
| 6 | Molecule has 3D coordinates | 0 | 1 | 1024 |
| 6 | Molecule has a radical that is not found in the known list | 187 | 1 | 252 |
| 6 | Molecule has six (or more) atoms with exactly the same coordinates | 3 | 0 | 206 |
| 6 | Number of atoms less than 1 | 0 | 1 | 5824 |
| 6 | Polymer information in mol file | 2 | 0 | 0 |
| 6 | V3000 mol file | 594 | 0 | 0 |
| 5 | InChI_RDKit/Mol stereo mismatch | 588 | 152 | 339 |
| 5 | Mol/Inchi/RDKit stereo mismatch | 0 | 0 | 28 |
| 5 | RDKit_Mol/InChI stereo mismatch | 23 | 22 | 1479 |
| 5 | Molecule has a bond with an illegal stereo flag | 1054 | 0 | 0 |
| 5 | Molecule has a bond with an illegal type | 6 | 0 | 0 |
| 5 | Molecule has a crossed bond in a ring | 34 | 36 | 134 |
| 5 | Molecule has two (or more) atoms with exactly the same coordinates | 4 | 5 | 2367 |
| 2 | InChI_Mol/RDKit stereo mismatch | 0 | 55 | 307 |
| 2 | Molecule has a stereo bond in a ring | 2359 | 5763 | 7061 |
| 2 | Molecule has an atom with multiple stereo bonds | 1493 | 52 | 3660 |
| 2 | Molecule has a stereo bond to a stereocenter | 331 | 27 | 983 |
| 2 | Molecule has the 3D flag set for a 2D conformer | 0 | 0 | 5 |
| 2 | Other InChI Warnings | 20188 | 34052 | 170678 |
| No errors | | 15015 | 111137 | 177815 |
Note that the number of penalty scores output is not the same as the number of compounds, because some compounds return more than one penalty score
Percentages of the compounds in each of the SureChEMBL, ChEMBL and PubChem sets returning each value as their maximum penalty score
| Penalty score | SureChEMBL | ChEMBL Literature | PubChem |
|---|---|---|---|
| 7 | 0.01 | 0.00 | 0.45 |
| 6 | 1.62 | 0.00 | 3.14 |
| 5 | 2.72 | 0.15 | 1.00 |
| 2 (non InChI) | 6.92 | 3.90 | 3.12 |
| 2 (InChI) | 28.77 | 20.35 | 32.59 |
| No errors | 59.95 | 75.60 | 59.70 |
The highest (most serious) resulting score is the one recorded for each compound
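To illustrate how a table of this kind can be derived, the sketch below takes the list of penalty scores returned for each compound, keeps only the maximum, and reports the share of compounds at each maximum. It is an illustration on toy data, not the authors' analysis code. The function name `max_score_distribution` is hypothetical, a compound with no penalties is assumed to be recorded as 0, and the split of score 2 into InChI and non-InChI warnings used in the table is not reproduced.

```python
from collections import Counter


def max_score_distribution(scores_per_compound):
    """Map each maximum penalty score to the percentage of compounds having it.

    `scores_per_compound` holds one list of penalty scores per compound;
    an empty list means the compound returned no penalties (recorded as 0).
    """
    if not scores_per_compound:
        return {}
    maxima = Counter(max(scores, default=0) for scores in scores_per_compound)
    total = len(scores_per_compound)
    return {score: round(100 * n / total, 2) for score, n in maxima.items()}


# Toy data: five compounds, some with several penalties, two with none.
toy = [[2, 5], [2], [], [], [6, 2, 2]]
print(max_score_distribution(toy))
```

Because each compound contributes exactly one value, the percentages in each column of such a table sum to 100 (up to rounding), unlike the raw penalty counts.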
Checker penalty scores on the current version of ChEMBL (ChEMBL 26)
| Penalty score | Penalty explanation | No of compounds |
|---|---|---|
| 6 | InChI: Accepted unusual valence(s) | 10 |
| 6 | Molecule has a radical that is not found in the known list | 9 |
| 6 | Molecule has six (or more) atoms with exactly the same coordinates | 50 |
| 5 | InChI_RDKit/Mol stereo mismatch | 810 |
| 5 | Mol/Inchi/RDKit stereo mismatch | 6 |
| 5 | RDKit_Mol/InChI stereo mismatch | 771 |
| 5 | Molecule has a crossed bond in a ring | 632 |
| 5 | Molecule has two (or more) atoms with exactly the same coordinates | 259 |
Compounds where the exclude flag is set are excluded from this analysis
Fig. 2 Examples of standardisations that have been applied to a set of compounds. The compound structure before and after standardisation is shown. a Fix hypervalent nitro groups, b remove explicit H atoms, c convert covalently drawn alkali metals connected to O or N to ionic forms, d standardise sulphoxides to the charge-separated form, e normalise (straighten) allene bonds
Summary of the number of compounds whose InChIKeys changed following standardisation for the SureChEMBL, ChEMBL Literature and PubChem deposited sets
| InChIKey layer change | SureChEMBL | ChEMBL Literature | PubChem |
|---|---|---|---|
| Connectivity | 15 | 13 | 67 |
| Connectivity and Protonation | 5 | 1 | 33 |
| Protonation | 67 | 297 | 4358 |
| Stereochemistry | 11 | 0 | 16 |
| Stereochemistry and Protonation | 0 | 0 | 4 |
| Total no of changed InChIKeys after standardisation | 98 | 311 | 4478 |
| Total no of compounds | 520174 | 147008 | 297864 |
| % changed InChIKeys | 0.19 | 0.21 | 1.50 |
The table also includes the number of compounds in each dataset and the percentage of each set with changed InChIKeys
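One way to assign a changed InChIKey to the categories in this table is to compare the blocks of the key before and after standardisation. A standard InChIKey has a 14-character first block derived from the molecular skeleton, a second block whose first eight characters hash the remaining layers (including stereochemistry), and a final character for protonation. The sketch below rests on that layout and is not the authors' code: the function name `classify_inchikey_change` is hypothetical, a change in the second block is labelled "Stereochemistry" even though that block also covers isotopes, and a first-block change is assumed to take precedence over a second-block change. The keys in the example are synthetic placeholders, not real compounds.

```python
def classify_inchikey_change(before, after):
    """Label which parts of a standard InChIKey differ between two keys."""
    skeleton_b, rest_b, proton_b = before.split("-")
    skeleton_a, rest_a, proton_a = after.split("-")

    changed = []
    if skeleton_b != skeleton_a:
        # First block (14 characters): molecular skeleton / connectivity.
        changed.append("Connectivity")
    elif rest_b[:8] != rest_a[:8]:
        # Second block hash: stereochemistry (assumption: isotopes ignored).
        changed.append("Stereochemistry")
    if proton_b != proton_a:
        # Final character: protonation state.
        changed.append("Protonation")
    return " and ".join(changed) if changed else "Unchanged"


# Synthetic keys in InChIKey layout, for illustration only.
original = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
print(classify_inchikey_change(original, "AAAAAAAAAAAAAA-BBBBBBBBSA-M"))
print(classify_inchikey_change(original, "CCCCCCCCCCCCCC-BBBBBBBBSA-M"))
```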
Fig. 3 Examples of compounds from the ChEMBL literature set where the InChIKey changed on standardisation due to the rebalancing of the charge on the compound
Fig. 4 Examples of approved drugs standardised by the ChEMBL RDKit Standardizer and the PubChem standardiser
Fig. 5 The composition and number of the compounds containing more than one component in ChEMBL 26 as identified by the GetParent module. The numbers in brackets refer to the number of compounds in each grouping that contain isotopes
Fig. 6 Examples of applying the GetParent module to some representative ChEMBL compounds containing varying combinations of salts, isotopes and solvents. The “Child” is the compound before and “Parent” the compound after the process has been applied