Daniela Dolciami, Eloy Villasclaras-Fernandez, Christos Kannas, Mirco Meniconi, Bissan Al-Lazikani, Albert A Antolin.
Abstract
BACKGROUND: Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it can bring a wealth of important knowledge related to compounds into one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates, etc.), which highlights the need to organise related compounds in hierarchies that alert the user to important bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.
Keywords: Canonicalization; Compound hierarchy; FDA-approved drugs; KNIME; Standardization; Tautomerism; canSAR
Year: 2022 PMID: 35643512 PMCID: PMC9148294 DOI: 10.1186/s13321-022-00606-7
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 8.489
Fig. 1 Scheme of the canSARchem chemical registration and standardization pipeline. Input chemical structures are first validated in the Checking step, where SDFs are parsed, molecules with empty mol blocks are removed and valid structures proceed to a sanitisation step. Standardized compounds are then generated and submitted to the canonicalization step. Salt stripping with neutralization yields the unsalted canonical representatives, which are finally stripped of stereochemistry and isotopes to give the abstract compounds. In the valganciclovir example, salt and stereochemistry stripping are key steps that enable data integration on the basis of chemical structure: the abstract form is the same for the two input structures
Fig. 2 Implemented KNIME workflow for compound standardization and registration. A Structure import and creation of variables for writing the output files. B Structure checker. C Standardization. D Output generation for standardized compounds. E Canonical representative generation. F Salt stripping and generation of the unsalted canonical representative output file(s). G Stripping of stereochemistry, isomerism and isotopes, generating the abstract compounds. H Node to execute in order to run the pipeline
Description of canSAR hierarchy and steps to generate each output
| 3,157,884 unique 2D structures in canSAR | canSAR Hierarchy | Input file | Description | Steps |
|---|---|---|---|---|
| 2,668,609 Standard Forms and Canonical Representatives | Standardized Compound (SC) | Original source DBs in SDF format | Standard form | 1. Checker and 2. Standardizer: SDF parsing and filtering of empty molblocks; RDKit sanitization; structure standardization through RDKit Standardizer |
| | Canonical Representative (CR) | SC Output | Canonical Representative | 3. Generation of Canonical Representative: generation of at most 30 canonical tautomers; prevent canonicalization in the presence of chiral center; time-out for canonicalization set at 250 ms |
| 2,304,805 Unsalted Canonical Representatives | Unsalted Canonical Representative (UCR) | CR Output | Free base canonical tautomer | 4. Salt strip: strip inorganic and organic counterions; strip solvents and fragments; strip shorter SMILES string; keep first fragment with two identical SMILES strings; neutralization |
| 2,162,736 Abstract Representation | Abstract Representation (AR) | UCR Output | Abstract representation (canonical compound stripped of salts and stereochemistry) | 5. Generation of abstract structure to get parent compounds: strip stereochemistry; strip cis/trans and E/Z isomerism; strip isotopes |
The total number of unique 2D structures in canSAR is reported for each hierarchy level together with the input file and the steps carried out to generate the corresponding output files
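The two fragment-selection rules listed under the salt strip step ("Strip shorter SMILES string" and "Keep first fragment with two identical SMILES strings") can be illustrated with a minimal sketch. It assumes the rules mean "keep the fragment with the longest SMILES string, and on a tie keep the first one". The function name `keep_main_fragment` is hypothetical, and the sketch works on raw strings only, whereas the actual pipeline operates on parsed structures and also applies counterion and solvent dictionaries.

```python
def keep_main_fragment(smiles: str) -> str:
    """Keep the fragment with the longest SMILES string; ties keep the first.

    Assumption: this is one reading of the table's rules, not the
    pipeline's actual implementation.
    """
    # A dot separates disconnected fragments in a SMILES string.
    fragments = smiles.split(".")
    # max() returns the first maximal element, so equal-length
    # fragments resolve to the earliest one.
    return max(fragments, key=len)


# Sodium acetate: the counterion is the shorter string and is dropped.
print(keep_main_fragment("CC(=O)[O-].[Na+]"))
# Two identical fragments: the first one is kept.
print(keep_main_fragment("CCO.CCO"))
```

Neutralization of the remaining fragment (here the acetate anion) is a separate step in the table and is not covered by this sketch.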
Fig. 3 Example of the advantages of the canSAR hierarchy. Different tautomers, salt forms and double bond isomers of ganciclovir are submitted to canSARchem. Generating the canonical representatives, unsalted canonical representatives and abstract compound makes it possible to group and consolidate all 5 input entries in the same family, so the user can quickly identify all the alternative forms that exist in canSAR. This, in turn, makes the user aware of which data were measured and registered against which compound form. Circles at the top represent the different data types and the number of bioactivity data points associated with each of them. As can be observed, the biochemical assays of ganciclovir against metabotropic glutamate receptors from BindingDB (associated with GA2, in blue) could be missed if the user only explored the tautomer associated with the ‘GANCICLOVIR’ name tag
Fig. 4 Example of the advantages of the canSAR hierarchy (II). Different salt forms and stereoisomers of BMS-863233. The salt strip and clear stereochemistry steps make it possible to generate the compound family. It is important to highlight that the racemate BMS-863233 is associated with large-scale selectivity profiling data whilst the pure enantiomer OSX is associated with structural and ADME data. Circles at the bottom represent the different data types and the number of bioactivity data points associated with each of them
Fig. 5 Hierarchy of sunitinib. Example of the limitations of the canSARchem hierarchy. Different tautomers, salt forms and double bond isomers of sunitinib are submitted to canSARchem. Generating the canonical representative, unsalted canonical representative and abstract compound makes it possible to group and consolidate all 6 input entries in the same family. However, the 3H-pyrrole of canSAR289623 and canSAR775524 (in blue) has very different properties from the pyrrole of the other compounds in the family, and a high energy barrier of interconversion that is unlikely to enable tautomerism between these compounds in solution. Therefore, it is debatable whether these compounds belong to the same hierarchy. Circles at the top represent the different data types and the number of bioactivity data points associated with each of them
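The grouping described in this caption can be sketched as a small data structure: entries that share the same abstract compound form one family, and the data points attached to each form stay visible inside it. This is a minimal illustration only. The `Entry` class, the `abstract_key` field and all identifiers and counts below are hypothetical placeholders, not values from canSAR.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: str      # hypothetical registry identifier
    abstract_key: str  # hypothetical key of the abstract compound
    data_points: dict  # data type -> number of bioactivity data points


def build_families(entries):
    """Group entries that share the same abstract compound."""
    families = defaultdict(list)
    for entry in entries:
        families[entry.abstract_key].append(entry)
    return dict(families)


def family_summary(members):
    """Sum the data points per data type across all forms in a family."""
    totals = defaultdict(int)
    for member in members:
        for data_type, count in member.data_points.items():
            totals[data_type] += count
    return dict(totals)


# Placeholder entries: the counts are invented for illustration.
entries = [
    Entry("form-1", "AR-1", {"cellular": 4}),
    Entry("form-2", "AR-1", {"biochemical": 2}),
    Entry("form-3", "AR-2", {"structural": 1}),
]
families = build_families(entries)
for key, members in families.items():
    print(key, [m.entry_id for m in members], family_summary(members))
```

Keeping the per-form counts next to the family total is the point of the design: a user who lands on one form can still see that data exist for a sibling form.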
Number of compounds from external sources processed through the canSAR pipeline
| | | ChEMBL27 | BindingDB | PDB |
|---|---|---|---|---|
| # Structures in source db | | 1,941,411 | 898,561 | 35,258 |
| 1. Checker | SDF parsing errors | 0 | 3 | 0 |
| | RDKit sanitization errors | 1 | 70,385 | 0 |
| | Empty molblock | 0 | 6 | 0 |
| 2. Standardizer | Total standardized structures | 1,941,410 | 828,167 | 35,076 |
| 3. Canonical Tautomer generation | Modified structures | 179,276 (9.23%) | 19,721 (6.14%) | 2911 (8.3%) |
| | Total new structures | 101,642 | 1207 | 2862 |
| 4. Salt strip | Structures stripped of salts, fragments, solvents | 101,833 (5.2%) | 1264 (0.39%) | 266 (0.75%) |
| 5. Abstract structure generation | Generated abstract structures | 682,321 (35%) | 118,074 (36.78%) | 19,691 (55.85%) |
Compounds modified in each pipeline step are reported for every external database
Fig. 6 Examples of structures rejected for aromaticity (a and b) or valence (c and d) errors. a BindingDB MonomerID 60884 and b BindingDB MonomerID 185783 are incorrectly represented as aromatic; the incorrect portions of the molecules are highlighted in red. In c BindingDB MonomerID 142162 and d BindingDB MonomerID 289106, the valence of the red nitrogen atom is wrong
Comparison of ChEMBL, PubChem and canSAR chemical structure standardization pipelines
| | canSAR pipeline | ChEMBL pipeline [ | PubChem pipeline [ | |
|---|---|---|---|---|
| Pipeline availability | Freely available as a KNIME workflow (ChemAxon components require a license) | Open source, publicly available on GitHub and as a Conda package | Open source web-based and programmatic interface | |
| Chemical Structure Curation | 1. Checker | |||
| 2. Standardizer | ||||
| 2.1 Aromaticity standardization | Kekulization | Kekulization | Kekulization | |
| 2.2 Atom valence | ||||
| 2.3 Radicals | ||||
| 2.5 Hydrogens treatment | Remove explicit Hs | Remove explicit Hs | Convert implicit Hs into explicit | |
| 2.6 Metal bonds disconnection | ||||
| 2.7 Apply normalization rules | Conversion of functional group into the preferred form | |||
| 2.8 Verify stereochemistry | ||||
| 3. Generation of Canonical Tautomers | ||||
| 4. Salt Strip | ||||
| 5. Abstract structure | ||||
| 5.1 Tautomers | ||||
| 5.2 Stereoisomers | ||||
| 5.3 E/Z and cis/trans isomers | ||||
| 5.4 Isotopes | ||||
| 5.5 Salts, solvents | ||||
| Uniqueness identifier | NonStandard InChI | Standard InChI and corresponding hashed InChIKey | De-aromatized isomeric canonical SMILES | |
| Actions on compounds which fail to be processed | Compounds are corrected where possible. Compounds that fail sanitisation or standardization are not loaded into the DB | Different actions are taken depending on the perceived errors. Compounds with fatal errors are not uploaded into the ChEMBL database (penalty score of 7), are uploaded without a Molfile (penalty score of 6), or are uploaded but prioritized for manual curation (penalty score of 5 or 2) | Structures rejected from the standardization pipeline are not pushed into the Compounds database | |
| Employed tools/libraries | RDKit, MolVS and ChemAxon | RDKit and MolVS | OpenEye Scientific Software, Inc. C++ | |
| Parent compounds usage | Canonical tautomers as well as compounds stripped of salts and chirality are used to create a compound hierarchy displayed in the structure synopsis | Compounds stripped of salts and isotopes are used to get parents shown as alternative forms of compounds | Canonical tautomers, isolated and neutralized covalent units are used to generate related, parent and component compounds | |
| Mapping bioactivity data | Bioactivity and structural data on the same compound form are aggregated through chemical structure. Bioactivity data are mapped against the form on which they were measured | Bioactivity data on the same compound form are aggregated through chemical structure. Bioactivity data are mapped against the form on which they were measured | Bioactivity data are linked to unstandardized compounds (substances) | |
Differences and similarities between the steps of compound standardization are highlighted
a Not performed through the released pipeline. For compounds registered in the DB by the PubChem team, parent structures can be retrieved
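The ChEMBL row on failed compounds describes a mapping from penalty score to action, which can be written out as a small lookup. This is a sketch built only from the scores named in the table (7, 6, 5 and 2). The function name `chembl_action` is hypothetical, and the behaviour for any other score is not stated in the table, so the sketch reports it as unspecified rather than guessing.

```python
def chembl_action(penalty_score: int) -> str:
    """Map a ChEMBL penalty score to the action named in the comparison table."""
    if penalty_score == 7:
        return "not uploaded"
    if penalty_score == 6:
        return "uploaded without Molfile"
    if penalty_score in (5, 2):
        return "uploaded, prioritized for manual curation"
    # Scores not mentioned in the table are deliberately left undefined.
    return "not specified in the comparison table"


for score in (7, 6, 5, 2, 0):
    print(score, "->", chembl_action(score))
```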
Standardized and abstract structure generated through canSAR, ChEMBL and PubChem pipelines
Examples show how the three examined pipelines deal with specific chemical issues: 1. Valence of nitro group. 2. Metal bond disconnection. 3. Sulphoxide standardization. 4. Tautomer generation and salt strip. 5 and 6. Double bond isomerism and tautomer generation. The compounds analysed were chosen from those shown in the comparison tables in Bento’s paper [8]
Detection of problematic structures (I)
| | Sure_ChEMBL (SI1) | PubChem (SI2) | ChEMBL literature (SI3) |
|---|---|---|---|
| Structures # | 52,074 | 297,864 | 147,008 |
| ChEMBL pipeline errors (not uploaded structures) | 849 (1.6%) | 10,692 (3.59%) | 0 |
| ChEMBL uploaded structures | 51,225 (98.37%) | 287,172 (96.41%) | 100% |
| canSAR pipeline errors (rejected structures) | 114 (0.22%) | 7431 (2.5%) | 3 (0.002%) |
| SDF parsing errors | 0 | 0 | 0 |
| Sanitization errors | 110 | 1540 | 2 |
| Standardization errors | 4 | 67 | 0 |
| Empty molblock | 0 | 5824 | 1 |
| canSAR accepted structures | 51,960 (99.78%) | 290,433 (97.5%) | 147,005 (99.99%) |
Comparison with the ChEMBL Checker on the Supplementary files available in the ChEMBL chemical standardization pipeline paper [8]. The canSAR pipeline is overall more inclusive, with a lower percentage of rejected structures
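The rejection percentages in this table follow directly from the counts, and a short sketch makes the comparison explicit. The counts are copied from the table. The helper name `rejection_rate` and the dictionary layout are illustrative choices.

```python
def rejection_rate(rejected: int, total: int) -> float:
    """Percentage of input structures rejected by a pipeline."""
    return 100.0 * rejected / total


# Counts taken from the table above (structures, ChEMBL rejects, canSAR rejects).
datasets = {
    "Sure_ChEMBL (SI1)": {"total": 52074, "ChEMBL": 849, "canSAR": 114},
    "PubChem (SI2)": {"total": 297864, "ChEMBL": 10692, "canSAR": 7431},
    "ChEMBL literature (SI3)": {"total": 147008, "ChEMBL": 0, "canSAR": 3},
}

for name, row in datasets.items():
    chembl = rejection_rate(row["ChEMBL"], row["total"])
    cansar = rejection_rate(row["canSAR"], row["total"])
    print(f"{name}: ChEMBL {chembl:.3f}% rejected, canSAR {cansar:.3f}% rejected")
```

Note that the third file is the exception to the overall trend: there the table lists 3 canSAR rejections against none for ChEMBL.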
Detection of problematic structures (II)
| | | PubChem |
|---|---|---|
| Structures # | | 375,397 |
| PubChem pipeline errors (rejected compounds) | | 375,397 (100%) |
| Errors found by PubChem Checker | Invalid isotope specifications | 141 |
| | Valence check | 364,946 (97.22%) |
| | Identical charges on adjacent atoms or invalid valence after valence bond canonicalization | 10,243 |
| | Exceeds the limit of 999 explicit atoms | 65 |
| canSAR pipeline errors (rejected compounds) | | 285,552 (76.07%) |
| | SDF parsing errors | 0 |
| | Sanitization errors | 270,131 (71.96%) |
| | Standardization errors | 2954 (0.78%) |
| | Empty molblock | 12,467 (3.37%) |
| canSAR accepted structures | | 89,845 (23.93%) |
Comparison with the PubChem Checker on the Supplementary files available from the PubChem chemical standardization pipeline paper [9]. Additional file 4 was used. The canSAR pipeline is overall more inclusive, with a lower percentage of rejected structures and a superior performance in correcting erroneous structures before importing them into the database
Example of compounds rejected by PubChem pipeline
| Entry | Input molecule | canSAR standardization (FICTS) | PubChem errors | ChEMBL standardization | ChEMBL errors |
|---|---|---|---|---|---|
| 1 | | | Detect illegal valence for element “S” | | 6, [‘InChI: accepted unusual valence(s)’], (2, ‘InChI: metal was disconnected’) |
| 2 | | | Detect illegal valence for element “S” | | 2, ‘InChI: Metal was disconnected’ |
| 3 | | | Unable to fix pentavalent nitroso group | | No issues detected |
| 4 | | | No issues detected | | No issues detected |
Examples of compounds rejected by the PubChem pipeline. All these compounds are valid according to the canSAR pipeline, which proves to be more inclusive. Of note, the inclusiveness of canSAR seems to be higher in edge cases with organometallic complexes, where the valence is not easily perceived
FDA-approved drugs enrichment
| | All canSAR compounds | FDA-approved drugs | Chi-squared test |
|---|---|---|---|
| Total number of unique valid structures | 1,988,409 | 2450 | – |
| Compounds registered as non canonical forms | 192,933 (9.7%) | 429 (17.51%) | χ2 = 8,558,547,933 p-value < 0.0001 |
| Chiral compounds | 890,562 (44.78%) | 1031 (42.01%) | Not enriched |
FDA-approved drugs were compared to all compounds registered in canSAR for canonicalization and chirality enrichment. A significant p-value for compounds modified during canonicalization has been found through the Chi-squared test
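As an illustration of how an enrichment test of this kind works, the sketch below computes a Pearson chi-squared statistic for a 2x2 contingency table built from the counts in the table. It rests on several assumptions: the two groups are treated as independent columns, no continuity correction is applied, and the test has one degree of freedom. The exact contingency table and settings used by the authors are not stated, so the result is illustrative and need not reproduce the statistic printed in the table. The function names are hypothetical.

```python
import math


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Closed form for a 2x2 table, without continuity correction.
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1dof(chi2: float) -> float:
    """Upper-tail p-value of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Counts from the table: non-canonical vs. remaining compounds in each group.
all_noncanonical, all_total = 192_933, 1_988_409
fda_noncanonical, fda_total = 429, 2_450

chi2 = chi_squared_2x2(
    all_noncanonical, all_total - all_noncanonical,
    fda_noncanonical, fda_total - fda_noncanonical,
)
print(f"chi-squared = {chi2:.1f}, p-value = {p_value_1dof(chi2):.3g}")
```

For production use, a statistics library routine would be preferable to this hand-written formula, since it also handles corrections and larger tables.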