Karen Karapetyan, Colin Batchelor, David Sharpe, Valery Tkachenko, Antony J Williams.
Abstract
BACKGROUND: Hundreds of online databases now host millions of chemical compounds and associated data. Because many different cheminformatics software tools can be used to produce these data, because the platforms differ subtly from one another, and because software users are sometimes naive, chemical structure representations online can suffer from a myriad of issues. To help facilitate the validation and standardization of chemical structure datasets from various sources, we have delivered a freely available internet-based platform to the community for processing chemical compound datasets.
Keywords: Chemistry; Validation; cvsp
Year: 2015 PMID: 26155308 PMCID: PMC4494041 DOI: 10.1186/s13321-015-0072-8
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 A depiction of how a chemical structure can change between InChI generation and InChI conversion. The original structure on the left was the hypothetical structure input to the InChI algorithm to generate the InChI string shown at the top of the figure. Converting the InChI string back to a visual form of the structure using Accelrys Draw resulted in changes including disconnection of the metal, changes to the bonds and a change in the ionization state of the halogen
Fig. 2 The upload screen for CVSP – the various supported file formats are listed. Compressed file formats (ZIP and gzip) are supported
Fig. 3 The SDF upload screen for CVSP
Fig. 4 The filtering and download user interface screen for processing of the results set from CVSP
Fig. 5 XML rules under user Profiles
Fig. 6 Non-unique dearomatization
Fig. 7 Direction of stereo bonds makes no sense
Fig. 8 Non-stereo center is marked with a stereo bond
Fig. 9 Stereo center has both up and down bonds
Fig. 10 Two wedged or two hashed bonds at the same center
Fig. 11 Same atom is both the destination and the origin of stereo bonds
Rankings (smallest numbers indicating most acidic) and SMARTS strings to identify acid and base substructures for competitive ionization of molecules based on FDA 2007
| Group | Acid SMARTS | Conjugate Base SMARTS | Rank |
|---|---|---|---|
| OSO3H | OS(=O)(=O)[O;H] | OS(=O)(=O)[O-] | 10 |
| SO3H | [!O]S(=O)(=O)[O;H] | [!O]S(=O)(=O)[O-] | 20 |
| OSO2H | O[S;D3](=O)[O;H] | O[S;D3](=O)[O-] | 30 |
| SO2H | [!O][S;D3](=O)[O;H] | [!O][S;D3](=O)[O-] | 40 |
| OPO3H2 | OP(=O)([O;H])[O;H] | OP(=O)([O;H])[O-] | 50 |
| PO3H2 | [!O]P(=O)([O;H])[O;H] | [!O]P(=O)([O;H])[O-] | 60 |
| CO2H | C(=O)[O;H] | C(=O)[O-] | 70 |
| Arom-SH | c[S;H] | c[S-] | 80 |
| OPO3H- | OP(=O)([O;H])[O-] | OP(=O)([O-])[O-] | 90 |
| PO3H | [!O]P(=O)([O;H])[O-] | [!O]P(=O)([O-])[O-] | 100 |
| Phthalimide | O=C2c1ccccc1C(=O)[N;H]2 | O=C2c1ccccc1C(=O)[N-]2 | 110 |
| CO3H | C(=O)O[O;H] | C(=O)O[O-] | 120 |
| α-carbon to NO2 group | O=N(O)[C;H] | O=N(O)[C-] | 130 |
| SO2NH2 | S(=O)(=O)[NH2] | S(=O)(=O)[NH-] | 140 |
| OB(OH)2 | OB([OH])[OH] | OB([OH])[O-] | 150 |
| B(OH)2 | [!O]B([OH])[OH] | [!O]B([OH])[O-] | 160 |
| Arom-OH | c[OH] | c[O-] | 170 |
| SH aliphatic | C[SH] | C[S-] | 180 |
| OBO2H | OB([OH])[O-] | OB([O-])[O-] | 190 |
| BO2H | [!O]B([OH])[O-] | [!O]B([O-])[O-] | 200 |
| Cyclopentadiene | [CH2]1C=CC=C1 | [C-]1C=CC=C1 | 210 |
| Amide | C(=O)[NH2] | C(=O)[N;H;-] | 220 |
| Imidazole | c1cnc[n]1 | c1cnc[n-]1 | 230 |
| Aliphatic OH | [CX4][OH] | [CX4][O-] | 240 |
| H at α-carbon to carboxyl | O=C[CH] | O=C[C-] | 250 |
| H at α-carbon to acetyl | OC(=O)[CH] | OC(=O)[C-] | 260 |
| H at sp carbon | C#[CH] | C#[C-] | 270 |
| H at α-carbon of sulfone group | CS(=O)(=O)C[CH] | CS(=O)(=O)C[C-] | 280 |
| H at α-carbon of sulfoxide | C[S;D3](=O)C[CH] | C[S;D3](=O)C[C-] | 290 |
| Amine | [CX4][NH2] | [CX4][N;H;-] | 300 |
| Benzyl | c[C;D4;H] | c[C;D3;-] | 310 |
| H at sp2 carbon | [CX3;H] | [CX3;-] | 320 |
| H at sp3 carbon | [CX4;H] | [CX3-] | 330 |
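As a rough illustration of how the ranks in this table could drive a competitive-ionization check, the sketch below flags a molecule whose ionized groups are not the lowest-ranked (most acidic) ones present. It is not CVSP's implementation: the function name `strongest_acids_ionized_first`, the dictionary `ACID_RANK` and the representation of a molecule as a list of group names are assumptions made for illustration, and only a few rows of the table are included. Matching the SMARTS patterns against real structures would require a cheminformatics toolkit and is left out.

```python
from typing import List

# A few ranks copied from the table above (smaller rank = more acidic).
ACID_RANK = {
    "OSO3H": 10,
    "SO3H": 20,
    "CO2H": 70,
    "Arom-OH": 170,
    "Aliphatic OH": 240,
    "Amine": 300,
}


def strongest_acids_ionized_first(present: List[str], ionized: List[str]) -> bool:
    """Return True if the ionized groups are the most acidic ones present.

    `present` lists every acidic group found in the molecule (ionized or not);
    `ionized` lists the subset that is deprotonated. Both are hypothetical
    inputs; a real pipeline would derive them from SMARTS matches.
    """
    # With k ionized groups, the k lowest ranks present should be the ionized ones.
    expected = sorted(ACID_RANK[g] for g in present)[: len(ionized)]
    actual = sorted(ACID_RANK[g] for g in ionized)
    return actual == expected
```

For a molecule carrying both a carboxylic acid and a phenol, ionizing only the phenol would fail this check, which corresponds to the situation the table is meant to detect.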
Fig. 12 Partially ionized molecule
Fig. 13 Haworth projection of monosaccharide
Example SMARTS abbreviations and how they are interpreted by the code
| Abbreviation | Interpretation |
|---|---|
| {NM} | Non-metals less carbon (here He, B, N, O, F, Ne, Si, P, S, Cl, Ar, Ge, As, Se, Br, Kr, Sb, Te, I, Xe, Po, At) |
| {M} | Metals (everything else) |
| {Pn} | Pnictogens (here P, As, Sb) |
| {Hal} | Halogens (here F, Cl, Br, I) |
| {M_V6} | Metals with maximum valency 6 (Cr, Mo, W, Mn, Pt) |
| {TM} | Transition metals |
| {TM^Hg} | Transition metals apart from mercury (needed for FDA rules) |
| {M_+1} | Metals with a charge of +1. |
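One way to read the table above is as a macro expansion applied before a pattern reaches a SMARTS parser. The sketch below assumes that each abbreviation expands to a comma-separated element list placed inside a bracket atom. The paper does not state how CVSP performs the expansion, so `ABBREVIATIONS` and `expand_abbreviations` are hypothetical names, and only abbreviations whose members are listed explicitly in the table are covered.

```python
import re

# Element lists copied from the table; the other abbreviations are omitted
# because the table does not enumerate their members.
ABBREVIATIONS = {
    "Pn": ["P", "As", "Sb"],
    "Hal": ["F", "Cl", "Br", "I"],
    "M_V6": ["Cr", "Mo", "W", "Mn", "Pt"],
}

_TOKEN = re.compile(r"\{([^{}]+)\}")


def expand_abbreviations(pattern: str) -> str:
    """Replace each {NAME} token with a comma-separated element list."""

    def _replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in ABBREVIATIONS:
            raise KeyError(f"unknown abbreviation: {{{name}}}")
        return ",".join(ABBREVIATIONS[name])

    return _TOKEN.sub(_replace, pattern)
```

Under this assumption, `expand_abbreviations("[{Hal}]C")` yields `[F,Cl,Br,I]C`, an ordinary SMARTS atom list that any SMARTS-aware toolkit could then match.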
Fig. 14 Depiction of the ring-walking algorithm. If we start at the topmost node on the ring and proceed clockwise recording whether there is a left turn (L) or a right turn (R) at every node, then we obtain a six-character “signature”. This indicates whether the hexagon is homotropous (all turns are the same), a chair, a boat or a yet more exotic shape
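The ring walk described in this caption can be sketched with a 2D cross product: at each node, the sign of the cross product of the incoming and outgoing bond vectors tells a left turn from a right turn. The code below is a minimal sketch under stated assumptions, not the CVSP code: it assumes the ring atoms are supplied as (x, y) coordinates with the y axis pointing up, already ordered clockwise from the topmost node, and the names `turn_signature` and `is_homotropous` are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def turn_signature(ring: List[Point]) -> str:
    """Return one 'L' or 'R' per node for a ring walked in the given order."""
    n = len(ring)
    turns = []
    for i in range(n):
        prev_pt, here, next_pt = ring[i - 1], ring[i], ring[(i + 1) % n]
        # Incoming and outgoing bond vectors at this node.
        ax, ay = here[0] - prev_pt[0], here[1] - prev_pt[1]
        bx, by = next_pt[0] - here[0], next_pt[1] - here[1]
        cross = ax * by - ay * bx
        if cross == 0:
            raise ValueError(f"bonds at node {i} are collinear; no turn defined")
        # With y pointing up, a positive cross product is a left turn.
        turns.append("L" if cross > 0 else "R")
    return "".join(turns)


def is_homotropous(signature: str) -> bool:
    """All turns are the same, as defined in the caption."""
    return len(set(signature)) == 1
```

A regular hexagon walked clockwise gives six right turns and is therefore homotropous; pushing one vertex toward the ring centre introduces a single left turn at that node.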
Selected hexagon signatures, names (if any) and graphical depictions
| Signatures | Name (if any) | Example | Signatures | Name (if any) | Example |
|---|---|---|---|---|---|
| LRLLRR, RLRRLL | Boat | | LLLRRR | Twist–boat | |
| LRRRRR, RLLLLL | Half chair | | LRLRLR | - | |
| LLLLLL, RRRRRR | Homotropous | | LRLRRR, RLRLLL | - | |
| LRRLRR, RLLRLL | Chair | | LRLLRR, RLRRLL | - | |
A comparison of the DrugBank and ChEMBL datasets
| Issue | DrugBank | ChEMBL | Examples |
|---|---|---|---|
| Errors | | | |
| Query bonds | 2 | 0 | DB00115 |
| Stereocenters: stereotypes of non-opposite bonds match | 1 | 292 | DB08128, CHEMBL1183153, CHEMBL1971333 |
| Stereocenters: stereotypes of opposite bonds mismatch | 2 | 2542 | DB00877, CHEMBL1237110 |
| Stereocenters: one bond up, one down | 1 | 182 | DB01590, CHEMBL552998, CHEMBL1237113 |
| Stereocenters: implicit hydrogen near stereocenter | 1 | 1 | DB00910, CHEMBL2314995 |
| Non-unique dearomatization | 57 | 0 | DB01705 |
| Unknown atom symbol (“A”, “*” - polymers) | 3 | 0 | DB01344 |
| Bad Valence (Indigo) | 1 | 0 | DB01747 |
| InChI generation failed | 4 | 2 | DB03846, CHEMBL1770360 |
| Warnings | | | |
| InChI does not match structure | 36 | N/A | DB00162 |
| Name does not match structure | 24 | N/A | DB08346 |
| SMILES does not match structure | 48 | N/A | DB00520 |
| Contains only multiple instances of same molecule | 0 | 25 | CHEMBL607305 |
| Not a neutral system | 314 | 14337 | DB00118, CHEMBL13045 |
| Angle between bonds too small | 2 | 164 | DB00362, CHEMBL59973 |
| Free carbon monoxide | 0 | 5 | CHEMBL108869 |
| Unusual valence | 49 | 119 | DB01703, DB03492, CHEMBL2028143, CHEMBL2028140 |
| Relative stereo (wedge or hash bonds but no chiral flag in molfile) | 1183 | 151203 | DB00140, CHEMBL1801886 |
| More than one radical atom | 2 | 4 | DB04119, CHEMBL606910 |
| Information | | | |
| Contains enol function | 64 | 11898 | DB00554, CHEMBL62289 |
| Stereobond in ring | 4 | 943 | DB00877, CHEMBL1864961, CHEMBL1864961 |
| Contain unknown stereobond | 32 | 23451 | DB00162, CHEMBL1866933 |
| Contain metal-nitrogen bond | 25 | 60 | DB02003, CHEMBL611725 |
| Contain partially undefined stereo | 24 | 26862 | DB00462, CHEMBL63248 |
| Strongest acid not ionized first | 3 | 164 | DB04798, CHEMBL8056 |
| Contains L-pyranose | 185 | 5887 | DB00199, CHEMBL66563 |
| Contains metal-oxygen bond | 32 | | DB00526, CHEMBL611725 |