Miguel Quirós, Saulius Gražulis, Saulė Girdzijauskaitė, Andrius Merkys, Antanas Vaitkus.
Abstract
Computer descriptions of chemical molecular connectivity are necessary for searching chemical databases and for predicting chemical properties from molecular structure. In this article, the ongoing work to describe the chemical connectivity of entries contained in the Crystallography Open Database (COD) in SMILES format is reported. This collection of SMILES is publicly available for chemical (substructure) search or for any other purpose on an open-access basis, as is the COD itself. The conventions that have been followed for the representation of compounds that do not fit into the valence bond theory are outlined for the most frequently found cases. The procedure for getting the SMILES out of the CIF files starts with checking whether the atoms in the asymmetric unit are a chemically acceptable image of the compound. When they are not (molecule on a symmetry element, disorder, polymeric species, etc.), the previously published cif_molecule program is used to obtain such an image in many cases. The program package Open Babel is then applied to get SMILES strings from the CIF files (either those taken directly from the COD or those produced by cif_molecule when applicable). The results are then checked and/or fixed by a human editor, in a computer-aided task that at present still consumes a great deal of human time. Even if the procedure still needs to be improved to make it more automatic (and hence faster), it has already yielded more than 160,000 curated chemical structures. The purpose of this article is to announce the existence of this work to the chemical community and to spread the use of its results.
Keywords: Crystal structure database; Crystallography Open Database; Molecular structure; Open access to scientific data; SMILES; Substructure search
Year: 2018; PMID: 29777317; PMCID: PMC5959826; DOI: 10.1186/s13321-018-0279-6
Source DB: PubMed; Journal: J Cheminform; ISSN: 1758-2946; Impact factor: 5.514
Example of aromatic or non-aromatic choices made for the representation of several compounds with arguable aromaticity
| Compound | SMILES |
|---|---|
| Pyrrole | c1ccc[nH]1 |
| Thiophene | c1cccs1 |
| Cyclopentadiene | C1=CC=CC1 |
| Cyclopentadienyl anion | [cH-]1cccc1 |
| Cyclopentadienone | c1(=O)cccc1 |
| 2-pyridone | c1(=O)[nH]cccc1 |
| Uracil | C1(=O)NC(=O)NC=C1 |
| Pyridine oxide | c1ccccn1=O |
| Quinone | O=C1C=CC(=O)C=C1 |
| Anthraquinone | c12ccccc1C(=O)c1ccccc1C2(=O) |
| 9-Methylene-fluorene | c12ccccc1C(=C)c1ccccc12 |
| Imidazolidene metal carbene | [Ni]=C1N(C)C=CN1C |
| C60 fullerene | c12c3c4c5c1c1c6c7c2c2c8c3c3c9c4c4c%10c5c5c1c1c6c6c%11c7c2c2c7c8c3c3c8c9c4c4c9c%10c5c5c1c1c6c6c%11c2c2c7c3c3c8c4c4c9c5c1c1c6c2c3c41 |
| A tetramethyl derivative of C60 | C12(C)C3C4(C)c5c1c1c6c7c2c2C8(C)C=3C3(C)c9c4c4c%10c5c5c1c1c6c6c%11c7c2c2c7c8c3c3c8c9c4c4c9c%10c5c5c1c1c6c6c%11c2c2c7c3c3c8c4c4c9c5c1c1c6c2c3c41 |
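
The strings in this table rely on ring-closure labels, including the two-digit `%10` and `%11` labels in the fullerene entries, and on bracket atoms such as `[nH]`. The following is a minimal Python sketch, not part of the published procedure, of how such strings could be sanity-checked without a chemistry toolkit: it verifies that every ring-closure label is opened and closed, and counts heavy atoms per element. The function names are hypothetical, and the tokenizer covers only the subset of SMILES syntax that appears in these tables.

```python
import re
from collections import Counter

# Subset of SMILES tokens: bracket atoms, organic-subset atoms,
# ring-closure labels (one digit or %nn), bonds, branches and dots.
TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d\d|\d|[-=#:/\\().]"
)


def tokenize(smiles: str) -> list:
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES string")
    return tokens


def ring_closures_balanced(smiles: str) -> bool:
    """True when every ring-closure label is opened and later closed."""
    open_labels = set()
    for tok in tokenize(smiles):
        if tok.isdigit() or tok.startswith("%"):
            label = int(tok.lstrip("%"))
            # A label toggles: first occurrence opens, second closes.
            open_labels ^= {label}
    return not open_labels


def heavy_atom_counts(smiles: str) -> Counter:
    """Count atoms by element; H counts written inside brackets are ignored."""
    counts = Counter()
    for tok in tokenize(smiles):
        if tok.startswith("["):
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})", tok).group(1)
        elif tok[0].isalpha():
            symbol = tok
        else:
            continue
        # Aromatic (lower-case) symbols are folded onto the element symbol.
        counts[symbol.capitalize()] += 1
    return counts
```

For example, `ring_closures_balanced("c1ccc[nH]1")` accepts the pyrrole string, while a string with an unclosed ring label is rejected. The check is purely syntactic: it does not detect a label that is reused while still open, nor does it validate valences or aromaticity.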
Fig. 1 Structural formulae and SMILES strings of some compounds from the COD. a Caffeine as an example of a compound with arguable aromaticity. b A coordination compound including neutral and anionic donor atoms. c An example of a borane cage compound. d An organometallic with η-6 and carbonyl ligands. e A pure enantiomer of a chiral compound. f Both enantiomers of a compound present in a racemic crystal.
Example of the representations chosen for different kinds of coordination compounds
| Compound | SMILES |
|---|---|
| An ethylenediamine complex | [Ni]1[NH2]CC[NH2]1 |
| A phosphane complex | [Au][P](c1ccccc1)(c1ccccc1)c1ccccc1 |
| A water complex | [Zn]([OH2])([OH2])([OH2])([OH2])([OH2])[OH2] |
| A phenolate (anionic) complex | [Co]Oc1ccccc1 |
| Dichloridobis(pyridine)copper(II) | [Cu]([n]1ccccc1)([n]1ccccc1)(Cl)Cl |
| An imidazole complex (neutral) | [Mn][n]1c[nH]cc1 |
| An imidazolate complex (anionic) | [Mn]n1cncc1 |
| Bidentate acetate moieties | [Cd]12([O]=C(O1)C)[O]=C(O2)C |
| Acetylacetonate complex | [Gd]1[O]=C(C)C=C(C)O1 |
| Imino-enolate or amido-ketone complex (non-equivalent resonance forms) | [Cr]1[N](c2ccccc2)=C(C)C=C(C)O1 or [Cr]1N(c2ccccc2)C(C)=CC(C)=[O]1 |
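
In the strings above, metal atoms and many donor atoms are written as bracket atoms, with any hydrogens stated explicitly (for example `[NH2]` and `[OH2]` versus `[O]` and `[n]`). As a small illustration, the Python sketch below lists the bracket atoms of a string together with their explicit hydrogen counts. The `bracket_atoms` helper is a hypothetical name, and the regular expression handles only the element symbol, optional chirality marks and the hydrogen count, which is an assumption sufficient for the entries in this table.

```python
import re

# Element symbol and optional explicit hydrogen count inside a bracket atom.
BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})@{0,2}(?:H(\d*))?")


def bracket_atoms(smiles: str) -> list:
    """Return (element, explicit H count) for each bracket atom, in order."""
    atoms = []
    for match in BRACKET_ATOM.finditer(smiles):
        symbol, h_digits = match.group(1), match.group(2)
        if h_digits is None:
            h_count = 0              # no H written, e.g. [Ni] or [O]
        elif h_digits == "":
            h_count = 1              # bare H, e.g. [nH]
        else:
            h_count = int(h_digits)  # e.g. [NH2], [OH2]
        atoms.append((symbol, h_count))
    return atoms
```

Applied to the ethylenediamine complex `[Ni]1[NH2]CC[NH2]1`, this lists the nickel atom with no hydrogens and two nitrogen donors carrying two hydrogens each; atoms written outside brackets (the two carbons) are not reported.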
Example of the representations chosen for some organometallic compounds
| Compound | SMILES |
|---|---|
| Tetracarbonyl nickel | [Ni](C#[O])(C#[O])(C#[O])C#[O] |
| A compound with bridging carbonyls | [Co]1(C#[O])(C#[O])(C#[O])C(=O)[Co](C#[O])(C#[O])(C#[O])C1=O |
| An alkyl derivative | [Pb](CC)(CC)(CC)CC |
| A | [Rh]123[CH]4=[CH]1CC[CH]2=[CH]3CC4 |
| A | [Ti]123[CH2]=[CH]1[CH]2=[CH2]3 |
| A compound with | [Pd]1234(C[CH]1=[CH2]2)C[CH]3=[CH2]4 |
| Ferrocene | [Fe]12345678([cH]9[cH]1[cH]2[cH]3[cH]49)[cH]1[cH]5[cH]6[cH]7[cH]81 |
| A | [Zr]1234[cH]5[cH]1[cH]2[c]13[c]45cccc1 |
| A | [Ru]12345[c]6(C(C)C)[cH]1[cH]2[c]3(C)[cH]4[cH]56 |
Fig. 2 Scheme displaying the steps involved in obtaining the SMILES strings from the CIF files in the COD
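
The steps in this scheme follow the procedure outlined in the abstract: decide whether the asymmetric unit is already a chemically acceptable image of the compound, run cif_molecule when it is not, convert the CIF to SMILES with Open Babel, and pass the result to a human editor. A minimal Python sketch of that decision flow is given below. The `EntryFlags` class, its field names, the `plan_steps` helper, the `_molecule.cif` and `.smi` file names, and the `1000000.cif` example are all hypothetical names introduced for illustration. The sketch only lists the steps in order and does not invoke any program.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass
class EntryFlags:
    """Hypothetical per-entry flags (not fields defined by the COD)."""
    on_symmetry_element: bool = False
    disordered: bool = False
    polymeric: bool = False


def needs_cif_molecule(flags: EntryFlags) -> bool:
    # The asymmetric unit is not a chemically acceptable image
    # when any of the situations named in the abstract applies.
    return flags.on_symmetry_element or flags.disordered or flags.polymeric


def plan_steps(cif_path: str, flags: EntryFlags) -> List[Tuple[str, str]]:
    """Return (step, input file name) pairs in the order they are applied."""
    steps: List[Tuple[str, str]] = []
    source = Path(cif_path)
    if needs_cif_molecule(flags):
        steps.append(("cif_molecule", source.name))
        # Assumed name for the file written by the cif_molecule step.
        source = source.with_name(source.stem + "_molecule.cif")
    steps.append(("Open Babel", source.name))
    # Assumed name for the SMILES file handed to the human editor.
    steps.append(("human curation", source.with_suffix(".smi").name))
    return steps
```

For instance, `plan_steps("1000000.cif", EntryFlags(disordered=True))` yields the cif_molecule step first, whereas an entry with no flags set goes straight to the Open Babel conversion. Human curation is always the last step, reflecting the statement that every result is checked by an editor.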
Comparison of curated and OPSIN-derived SMILES
| Count | % | Type |
|---|---|---|
| 19,475 | 64.68 | Identical |
| 1648 | 5.47 | Missing description of configuration around double bonds in OPSIN |
| 17 | 0.06 | Different number of explicit H |
| 602 | 2.00 | Missing chirality information in OPSIN |
| 49 | 0.16 | Missing chirality information in this work |
| 1130 | 3.75 | Different representation of racemates |
| 2474 | 8.22 | Different representation of nitro groups |
| 33 | 0.11 | Different representation of other groups |
| 667 | 2.22 | Different charge settings |
| 18 | 0.06 | Different aromaticity settings |
| 302 | 1.00 | Different bond orders |
| 66 | 0.22 | Different representation of ionic compounds |
| 25 | 0.08 | Missing O moieties in OPSIN |
| 94 | 0.31 | Different connectivity |
| 74 | 0.25 | Different number of rings |
| 954 | 3.17 | Different number of moieties |
| 229 | 0.76 | Different configuration around double bonds |
| 233 | 0.77 | Different configurations at chiral centres |
| 50 | 0.17 | Missing moieties in OPSIN |
| 17 | 0.06 | Missing moieties in this work |
| 166 | 0.55 | Missing C atoms in OPSIN |
| 87 | 0.29 | Missing C atoms in this work |
| 190 | 0.63 | Different stoichiometry |
| 342 | 1.14 | Different chemical composition |
| 1167 | 3.88 | Reason different from those listed above |
| 30,109 | 100.00 | Total |
The discrepancies are listed in the table in increasing order of severity. Any entry displaying more than one discrepancy is included only in the category of its most serious discrepancy (i.e., the one closest to the bottom of the table).
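
The counting rule described above can be sketched in a few lines of Python. The counts and the total are copied from the table; the function names and the four-category excerpt in `SEVERITY_ORDER` are illustrative choices, not part of the published work. The sketch checks that the category counts add up to the total, recomputes a percentage, and picks the reported category for an entry with several discrepancies.

```python
# Counts from the comparison table, in table order (top row first).
COUNTS = [19475, 1648, 17, 602, 49, 1130, 2474, 33, 667, 18, 302, 66, 25,
          94, 74, 954, 229, 233, 50, 17, 166, 87, 190, 342, 1167]
TOTAL = 30109

# A few table categories, kept in the table's order of increasing severity.
SEVERITY_ORDER = [
    "Different number of explicit H",
    "Different charge settings",
    "Different connectivity",
    "Different chemical composition",
]


def percentage(count: int, total: int = TOTAL) -> float:
    """Share of the total, rounded to two decimals as in the table."""
    return round(100.0 * count / total, 2)


def reported_category(reasons: list) -> str:
    """Pick the most serious reason, i.e. the one lowest in the table."""
    if not reasons:
        return "Identical"
    return max(reasons, key=SEVERITY_ORDER.index)
```

With these definitions, `sum(COUNTS)` equals `TOTAL`, and an entry that differs both in charge settings and in connectivity is reported only under "Different connectivity".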