Alex M Clark, Antony J Williams, Sean Ekins.
Abstract
The current rise in the use of open lab notebook techniques means that an increasing number of scientists make chemical information freely and openly available to the entire community, as a series of micropublications released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiary of open data is in fact chemical algorithms, which are capable of absorbing vast quantities of data and using it to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine readable formats.
Graphical Abstract: Lab notebook entries must target both visualisation by scientists and use by machine learning algorithms.
Keywords: Cheminformatics; File formats; Machine learning; Open lab notebooks; Public data
Year: 2015 PMID: 25798198 PMCID: PMC4369291 DOI: 10.1186/s13321-015-0057-7
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Vendor links for PubChem records of aspirin (a) and cholesterol (b). Links shown in red were unavailable or broken when verified (October 2014).
Figure 2. Two ChemSpider search results: deuterated ammonium bromide, (a) and (b), and aminophylline, (c) and (d). Examples (a) and (c) show the result as accessed by the ChemSpider Mobile app using the public API, while examples (b) and (d) show the web browser result page.
Figure 3. A selection of compounds (b) based on a common scaffold (a). Canonical SMILES strings are shown in (c), and their re-depicted structures are shown in (d).
Atom and bond properties, and currently reserved extensions, used by the molecule format

| Group | Description |
|---|---|
| Atom property | An arbitrary string, which typically matches one of the symbols from the periodic table. If not an element, and there is no inline abbreviation for the atom, then the overall representation does not encode a molecule, but rather a template or query. |
| Atom property | 2D layout positions, in quasi-Angstrom units, with the idealised bond length being 1.5. |
| Atom property | Formal atomic charge for the chemical species: must be an integer. |
| Atom property | Number of unpaired electrons: a whole number. This is used to help calculate the valence, and is primarily relevant only for main block elements. |
| Atom property | By default, implicit hydrogen atoms are calculated automatically for C, N, O, P and S, and zero for all other elements. Non-default values allow the number of extra hydrogens to be specified explicitly, as 0 or more. |
| Atom property | An arbitrary list of strings associated with the atom, some of which have prefixes that are reserved (see below). |
| Bond property | The two connecting atoms for the bond. |
| Bond property | Bond order: a whole number, which is typically one of 0, 1, 2, 3, 4 or 5. Values of 4 and 5 are extremely rare, while values of 0 are used extensively for bonding arrangements that do not follow the simple Lewis octet rule. |
| Bond property | Flat by default, but can also be inclined or declined (so-called wedge bonds) or non-stereospecific (usually drawn as squiggly lines). |
| Bond property | An arbitrary list of strings associated with the bond, some of which have prefixes that are reserved (see below). |
| Reserved atom extension | Optional third dimension: the existence of z-coordinates implies that the molecule is not a flat 2D depiction but rather a 3D conformation. |
| Reserved atom extension | Specific isotope enrichment, where the default value of 0 implies a natural isotope distribution. |
| Reserved atom extension | Integer mapping number associated with the atom. This can be used for any purpose, but is often used for correlating atoms in a series or a reaction. |
| Reserved atom extension | Query properties used to specify how to match a variety of atom types. |
| Reserved atom extension | Inline abbreviation, containing a terminal substructure fragment that defines the entire molecular species that the placeholder atom represents. Can be recursive, i.e. the abbreviation can contain its own abbreviations. |
| Reserved bond extension | Query properties used to specify how to match a variety of bond types. |
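The properties described in the table above can be pictured as a small data model. The following Python sketch is only an illustration under stated assumptions: the class and field names (`Atom`, `Bond`, `Molecule`, `map_num`, `stereo` and so on), the zero-based atom indices, and the short `KNOWN_ELEMENTS` list are all hypothetical, and none of this is the actual file format or any existing library's API. It shows how the documented defaults (charge 0, isotope 0, flat bonds) and two of the stated rules (z-coordinates imply 3D; a non-element symbol without an inline abbreviation implies a template or query) might be expressed.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative subset of periodic-table symbols (assumption: a real
# implementation would list every element).
KNOWN_ELEMENTS = {"H", "C", "N", "O", "P", "S", "Si", "Cl", "Br", "Fe", "Sn"}

# Per the table: implicit hydrogens are calculated automatically only for these.
AUTO_HYDROGEN_ELEMENTS = {"C", "N", "O", "P", "S"}


@dataclass
class Atom:
    element: str                        # arbitrary string, usually an element symbol
    x: float                            # 2D layout position (ideal bond length 1.5)
    y: float
    charge: int = 0                     # formal charge, must be an integer
    unpaired: int = 0                   # number of unpaired electrons
    hydrogens: Optional[int] = None     # None = default rule; else explicit count >= 0
    extra: List[str] = field(default_factory=list)
    # Reserved extensions (field names are hypothetical)
    z: Optional[float] = None           # presence implies a 3D conformation
    isotope: int = 0                    # 0 = natural isotope distribution
    map_num: Optional[int] = None       # optional atom mapping number
    abbreviation: Optional["Molecule"] = None  # inline, possibly nested, fragment


@dataclass
class Bond:
    atom_from: int                      # index into Molecule.atoms (0-based here)
    atom_to: int
    order: int = 1                      # typically 0..5
    stereo: str = "flat"                # "flat", "inclined", "declined" or "unknown"
    extra: List[str] = field(default_factory=list)


@dataclass
class Molecule:
    atoms: List[Atom] = field(default_factory=list)
    bonds: List[Bond] = field(default_factory=list)

    def is_3d(self) -> bool:
        """True if any atom carries a z-coordinate."""
        return any(a.z is not None for a in self.atoms)

    def is_template_or_query(self) -> bool:
        """True if some atom is neither an element nor an inline abbreviation."""
        for a in self.atoms:
            if a.abbreviation is not None:
                # Abbreviations can be recursive, so check the fragment too.
                if a.abbreviation.is_template_or_query():
                    return True
            elif a.element not in KNOWN_ELEMENTS:
                return True
        return False


def uses_automatic_hydrogens(atom: Atom) -> bool:
    """True if the implicit hydrogen count would be calculated automatically."""
    return atom.hydrogens is None and atom.element in AUTO_HYDROGEN_ELEMENTS


# Usage: a silicon atom bonded to an "Et" placeholder whose fragment is stored inline.
ethyl = Molecule(atoms=[Atom("C", 0.0, 0.0), Atom("C", 1.5, 0.0)], bonds=[Bond(0, 1)])
silane = Molecule(
    atoms=[Atom("Si", 0.0, 0.0), Atom("Et", 1.5, 0.0, abbreviation=ethyl)],
    bonds=[Bond(0, 1)],
)
```

Because the abbreviation travels with the placeholder atom, an algorithm reading `silane` can still reach every real atom, whereas a plain-text label such as "Et" on its own would leave the structure ambiguous. How the attachment point of the fragment is encoded is not modelled in this sketch.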
Figure 4. Bromobenzene, drawn in full (a) and with an abbreviation (b). The Molfile with the plain text abbreviation is shown in (c), while the SketchEl representation with the abbreviation encoded inline is shown in (d).
Figure 5. Three different representations of triethylsilane, using different degrees of abbreviation.
Figure 6. Some of the datasheet aspects currently in use: (a) aspect (displayed by the Green Lab Notebook app), (b) aspect (displayed by the SAR Table app), (c) aspect (displayed by the Mobile Molecular DataSheet app) and (d) aspect (displayed by the Green Lab Notebook app).
Figure 7. Using the open source editor to view and modify a datasheet that has an embedded aspect, which can be done safely and conveniently even though the application does not implement the aspect.
Figure 8. (a) Original drawing of ferrocene carboxylic acid using a limited alphabet of bond types (CASRN 1271-42-7); (b) modified structure after automated processing (PubChem ID 11986122).
Figure 9. Two descriptions of organic compounds that are unlikely to be understood by cheminformatics algorithms: (a) plain text annotation of chiral centers; (b) mixture of compounds with varied connectivity.
Figure 10. Tin(II) chloride, (a) drawn naively; (b) interpreted incorrectly; (c) redundantly over-specified.
Figure 11. Two representations of cyclopentadienyldicarbonyliron dimer: (a) diagram style preferred by chemists; (b) a more fundamental representation that does not mislead cheminformatics algorithms.
Figure 12. Sharing chemical data using the service, which stores the raw datasheet with any applicable aspects. The default (a) view is an HTML5 page, using resizable vector graphics, which can be downloaded in a variety of informatics or customised graphics formats (b).
Figure 13. The Open Drug Discovery Teams app showing some of the covered topics (a) and a detail view of some of the content obtained relating to the Ebola virus, in particular several structures of FDA approved drugs (b).