
InChI, the IUPAC International Chemical Identifier.

Stephen R Heller¹, Alan McNaught², Igor Pletnev³, Stephen Stein¹, Dmitrii Tchekhovskoi¹.

Abstract

This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.


Keywords: Chemical identifier; Chemical structure linear notation; IUPAC standard; InChI; InChIKey

Year: 2015 PMID: 26136848 PMCID: PMC4486400 DOI: 10.1186/s13321-015-0068-4

Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514

Introduction

InChI is the International Chemical Identifier developed under the auspices of IUPAC, the International Union of Pure and Applied Chemistry [1], with principal contributions from NIST (the U.S. National Institute of Standards and Technology [2]) and the InChI Trust [3]. This paper documents the design, layout and algorithms of InChI. It is intended to provide a reasonably detailed description without being overlong for a journal article. For a briefer introduction, which also provides more detail on historical and organizational matters, the reader is referred to a recent paper by the same authors [4]. For a more technical description, one may consult the InChI Technical Manual [5] and the free source codes of the InChI software, which are available from the InChI Trust [6].

The paper is organized as follows. First, we discuss the general concepts associated with chemical identifiers. Then we outline the design goals of InChI and our general approach, focussing on the InChI model of chemical structure and the hierarchical layered structure of the Identifier; the concept of Standard InChI is introduced. This is followed by a detailed description of each of the possible major InChI layers, accounting for molecular connectivity, charge, stereochemistry, isotopic enrichment, position of hydrogen atoms and bonding in metal compounds, and the sublayers associated with these layers. We then describe the workflow of InChI generation (normalization, canonicalization, and serialization stages), as well as generation of the compact hashed code derived from InChI (InChIKey); the related algorithms and implementation details are briefly discussed. Finally, we provide information about InChI Software, licensing, known problems/limitations, and future prospects for InChI.

Background

A chemical identifier is a text label that denotes a chemical substanceᵃ. These labels are of the utmost importance as they provide a convenient means of comparing and distinguishing chemicals in a variety of applications, from the design of new materials to legal and regulatory issues. The main requirement for an identifier is that the label must be unambiguous: the same label must always refer to the same substance, and no other substance may have this label. Two different substances must have different labels. Note that an identifier may not be strictly unique in the sense that the same substance may be, on a case-by-case basis, denoted by several distinct synonymous labels (provided that the lists of synonyms for different substances do not overlap). An obvious example is given by IUPAC chemical nomenclature, which allows one to produce and use different names for a single compound; nevertheless, all these names unambiguously identify the compound. Though not necessary, strict uniqueness, that is, always assigning a single label to a particular substance, is highly convenient and very desirable.

The concept of “chemical identifier” relies heavily on the concepts of chemical substance and chemical identity. The IUPAC Compendium of Chemical Terminology, the “Gold Book”, defines “chemical substance” as “Matter of constant composition best characterized by the entities (molecules, formula units, atoms) it is composed of. Physical properties such as density, refractive index, electric conductivity, melting point etc. characterize the chemical substance” [7]. As this definition implies, the identity of a chemical substance is determined by its constituent units and properties. It is noteworthy that even at this highly general level, this consideration is somewhat restrictive.
For example, the concept of “chemical substance” is not applicable to material that is not of constant composition (e.g., oil). This consideration is also somewhat counter-intuitive, for example, as concerns aggregate states, polymorphs, etc. Thus, most chemists would agree that “water” is a chemical substance that may appear as steam, ice and liquid water, and that all three should have the same chemical identifier, despite the fact that each may be isolated as a different state of matter, placed in a test tube or stored in a bottle. In other words, the “identifying power” of a chemical identifier is inherently limited.

The oldest known chemical identifiers are words of natural languages describing common chemicals with terms like “water”, “iron” or “table salt”; they are trivial names, in modern nomenclature. Notably, trivial names exemplify the principle that a chemical identifier is not necessarily related to molecular structure. The identity of chemical substances denoted by trivial names was historically determined by a set of characteristic physical and chemical properties, long before exact structures were resolved, or even before the very concept of molecular structure evolved in the second half of the 19th century. Of course, today’s trivial names are associated with chemical structures (yet the structures may not be fully defined, as is common for natural products).

A trivial name is an example of a registry-lookup chemical identifier: it provides a unique label for the named substance but the label itself says nothing (or little) about the characteristic properties and structure. Such data are stored in electronic or printed registries (handbooks) that uniquely associate the label with the properties/structure. Retrieving reference data requires a registry lookup.
More recent examples of registry-lookup identifiers are those associated with large printed or electronic collections of chemical structures and properties: Beilstein and Gmelin Registry numbers [8], Chemical Abstracts Service (CAS) Registry numbers by the American Chemical Society [9], EC numbers from the European Community Inventory [10], CID and SID numbers assigned by PubChem [11], and identifiers assigned by ChEMBL [12], ChemSpider [13], etc.

Note that all the above registry-lookup identifiers are also authority-assigned identifiers, that is, they are produced by assignment made by some authority. Typically, the authorities are the maintainers of chemical substance collections; e.g., CAS numbers are assigned by Chemical Abstracts Service, and one needs to refer to CAS for a particular structure’s identifier. (Trivial chemical names provide an interesting exception: these were/are assigned by the chemical community and their associated registry is a decentralized compendium of handbooks and nomenclature rules. In these cases, there is no algorithm for direct conversion of molecular structures to these identifying labels.)

Despite the widespread use of registry-lookup authority-assigned chemical identifiers, these types of identifier have a number of substantial drawbacks. For example, even the largest registry cannot include all known chemical substances. Furthermore, no registry can include a substance that has not previously existed and for which a hypothesized structure is drawn or computed. Finally, some authorities may impose restricted access and/or require payment for assigning labels and even for lookup in their registries.
Structure-based chemical identifiers are the alternative to authority-assigned ones. These are derived from molecular structural formulae, either drawn in print form or presented in digital form. When the algorithm for an identifier’s derivation and/or a related utility tool becomes publicly available, anyone has the ability to produce the identifier for a given structure. (Note that structure-based identifiers still may require registry lookup to recover the structure from the label.)

The earliest examples of structure-based identifiers are the systematic names of classical chemical nomenclatures established either by IUPAC or by CAS. However, the nomenclature rules developed by these authorities are not easy to learn and practice, even for professional chemists. Misinterpretation may result in ambiguous naming. Finally, and most importantly, systematic chemical names are not well suited for digital representations and the internet: they tend to be too long and contain non-alphanumeric characters (i.e., other than Latin letters and digits). As an example, Figure 1 gives the IUPAC systematic chemical name for the marine toxin palytoxin [14]; this is compared with the much shorter InChIKey (discussed later).

Figure 1

Structure, IUPAC name and InChIKey for palytoxin [14].

Since the second half of the 20th century, chemical structure linear notations have become well established as an alternative to classical nomenclature. These textual labels are derived using specific algorithms from molecular structural formulae. They serve as handy textual substitutes for structural formulae, being much more convenient in database and internet applications. Examples are the pioneering Wiswesser line-formula notation, WLN [15]; the widely used SMILES [16,17]; and the Sybyl line notation, SLN [18,19]. An excellent review of these and other notations is given in [20].

Typically, these systems encode the chemical structure, which is originally expressed (drawn or computer stored) in the paradigm of classical chemical structure theory [20]; most typically, the source representation is provided in a file using a connection table (CTFile) format (e.g. MOL and SDF, from MDL [21]).

This “classical model of chemical structure” assumes that a molecule is composed of atoms that are connected by bonds. Atoms are characterized by their chemical element, isotopic mass, integer formal charge, radical state and connection to other atoms. It is assumed that elements (more strictly, atoms of a particular element in a particular charge/radical state) have typical valence states, characterized by the number of bonds to neighbors. Typically, if the explicitly expressed number of connections is less than the characteristic valence, the necessary number of connections to implicit (not shown) hydrogen atoms is assumed. Atoms do have coordinates, but typically they are x,y-coordinates for visually pleasant drawings of structural formulae, which have no relation to x,y,z-coordinates of atomic nuclei in physical space.
However, these coordinates may be useful to represent stereo configurations of double bonds, as well as other stereogenic elements. Bonds may be of single or multiple order. In some cases, “resonance” or “aromatic” bonds (of “one and a half” order) are also included. In all cases, bonds are pair-wise; no bond may involve three or more atoms. This model is quite different from the modern quantum chemistry description. In spite of this, it performs surprisingly well in rationalizing chemical facts, and forms a solid basis (mathematically, it is an undirected multigraph with colored nodes) for nearly all structure-based chemical identifiers. One, however, should be cautious, bearing in mind the limitations of the model.

A line notation structure-based identifier is, essentially, the connection table (with associated additional data) unfolded to a single line. This unfolding requires the use of a pre-defined order of numbering of atoms in the molecule. However, atoms may always be renumbered, and renumbering may (more often than not) change the identifier. In other words, uniqueness of an identifier requires a method for assigning unique, canonical, numbers to the atoms.

Unfortunately, although the formats of most of the above mentioned line notations are publicly available (though not necessarily as detailed and formal as desirable), the related algorithms and software are not always available. Even in the cases where the algorithms are described, as is the case for the most widely employed system, SMILES [16,17], the original implementation and software for the algorithms remain proprietary. Moreover, for SMILES, the canonicalization algorithm was published over 25 years ago, but in incomplete form (without the stereochemistry-related part).
To compensate for this, “other commercial and open-source software developed their own algorithms for generating canonical SMILES all of which differed from each other and none of which are published” (O’Boyle [22]); however, the lack of a single commonly-adopted standard became a problem in itself.

Another problem is that most of the identifiers perceive the structural formulae "just as drawn". This means, for example, that mesomeric structures, which undoubtedly represent the same substance, may surprisingly produce different labels. Also, tautomers, which are most often presumed to be associated with the same substance (unless otherwise explicitly intended), are often labeled with different identifiers. Stereo isomers and isotopically enriched forms of the same parent compound present an additional source of ambiguity and inconsistent labelling. These factors lead to the undesirable and widespread result where the same substance has different labels, and vice versa. This also causes problems in cross-referring various forms of the same or "nearly the same" substance (tautomers, stereo isomers, etc.). A typical pattern of cheminformatics work includes a) correction of drawing issues by normalizing to a preferred state and b) re-drawing molecules in intentionally different ways, dependent on context (e.g., including or omitting stereo wedges). The lack of universally-recognized standards for these correction and re-drawing transformations results in a drastic decrease in interoperability.

To address the lack of a non-proprietary, strictly-unique standard chemical identifier, the InChI project was initiated in 2000 by two authorities well-known for establishing standards, IUPAC and NIST.
Since the InChI project was established, four major InChI software releases have appeared and each has introduced significant new features. The history of the development of InChI is documented in earlier reports and accounts [23-27]. For a general overview, we refer the reader to a recent paper [4].
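The dependence of a line notation on atom numbering, and the resulting need for canonical numbers, can be illustrated with a toy example. The Python sketch below is not the InChI canonicalization algorithm: the `serialize`, `renumber` and `canonical` helpers, the string format, and the brute-force search over all numberings are assumptions made purely for illustration, and the search is practical only for a handful of atoms.

```python
from itertools import permutations

def serialize(atoms, bonds):
    """Unfold a connection table into one line, using the atom order as given."""
    ordered = sorted(tuple(sorted(bond)) for bond in bonds)
    return "".join(atoms) + "|" + ",".join(f"{i}-{j}" for i, j in ordered)

def renumber(atoms, bonds, order):
    """Relabel the same graph; order[new_index] gives the old atom index."""
    new_of_old = {old: new for new, old in enumerate(order)}
    new_atoms = [atoms[old] for old in order]
    new_bonds = [(new_of_old[i], new_of_old[j]) for i, j in bonds]
    return new_atoms, new_bonds

def canonical(atoms, bonds):
    """Toy canonical form: the smallest string over all possible numberings."""
    return min(
        serialize(*renumber(atoms, bonds, order))
        for order in permutations(range(len(atoms)))
    )

# Two numberings of the same C-C-O skeleton (hydrogens omitted).
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])

print(serialize(*ethanol_a) == serialize(*ethanol_b))  # False: labels depend on numbering
print(canonical(*ethanol_a) == canonical(*ethanol_b))  # True: canonical labels agree
```

Under these assumptions the isomeric C-O-C skeleton receives a different canonical string, so the toy label remains unambiguous while becoming independent of the input numbering.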

Design and layout

This section provides information on InChI design goals and the general approach chosen to meet them: a method of constructing the Identifier which reflects the various features of chemical structure in a hierarchical, layered manner. The concept of Standard InChI, which is specifically designed for inter-operability by selecting the most appropriate layers, is introduced. Then the major InChI layers (Main, Charge, Stereo, Isotopic, FixedH and Reconnected) and their associated sublayers are described in detail.

InChI design goals

InChI is a non-proprietary, Open Source, chemical identifier intended to be an IUPAC approved and endorsed structure standard representation. The following features were considered as critically important in designing the International Chemical Identifierᵇ.

▪ Structure-based approach. Anybody anywhere should be able to produce InChI from just the structural formula of a chemical substance.
▪ Strict uniqueness of identifier. The same label always means the same substance, and the same substance always receives the same label (under the same labelling conditions). This is achieved through a well-defined procedure of obtaining canonical numbering of atoms.
▪ Non-proprietary, Open Source, free and open approach.
  o Free access to developed computer programs. No payment is assumed under any circumstances.
  o Open access to the source code. Everybody is free to read and use the source code.
▪ Applicability to the entire domain of “classic organic chemistry” and, to a significant extent, to inorganic compounds, bearing in mind the eventual goal to extend InChI to cover all of chemistry.
▪ Ability to generate the same InChI for structures drawn under (reasonably) different styles and conventions, specifically those represented by mesomers.
▪ Hierarchical approach allowing encoding of molecular structure with different levels of “granularity”, dependent on algorithms and software switches. In particular, the ability to include/exclude stereochemical, isotopic and tautomeric information was considered necessary.
▪ Ability to produce an identifier with some “default” switches, targeted to a fixed level of granularity and ensuring interoperability in large databases.

The current InChI (InChI identifier version 1, InChI software version 1.04) implements these features in full. Normalization may modify the input chemical structure by applying a consistent chemical model with the intent to make structures of the same compound drawn under (reasonably) different styles and conventions close if not identical, which is essential for generating the same InChI. Canonicalization of chemical structure upon generating InChI ensures strict uniqueness of the identifier. The layered structure of InChI allows targeting for specific applications (e.g., adding the ability to distinguish tautomers). A Standard InChI is specifically created for inter-operability. All of the development occurred under the open-source paradigm.

The following features were considered important but not critical.

▪ Ability to exactly restore the original chemical structure based solely on the InChI identifier string.
▪ Compact form.
▪ Ability to deal with coordination and organometallic compounds, including those containing haptic bonds.

The current implementation preserves these features to a significant extent. As measured by the extent of correct InChI -> Structure -> InChI conversion of a ~39 million structures collection derived from PubChem Compound, the current software correctly restores the structure in ~99.95% of casesᶜ.

As a response to a requirement for a more compact identifier, a shorter hash-based InChI derivative, denoted InChIKey, was introduced.
This came about from a discussion with a search engine company who explained that without a shortened compact version of InChI, no search engine would be able to properly search for a lengthy InChI string.

Organometallics, inorganics and other classes of compounds still present a significant challenge. This will be addressed in future versions of InChI.

The following was considered as having low importance:

▪ Ability to be read/parsed by humans and manually edited.

InChI model of chemical structure

InChI is based on the “classical model of chemical structure” with some significant modifications and additions. The following principles constitute the basis of the InChI approach, or the “InChI model of chemical structure”.

A molecule is composed of atoms. Atoms are either skeletal (non-hydrogen atoms, as well as bridging hydrogen, as in diborane) or terminal hydrogen atoms (further called simply “hydrogens”). Skeletal atoms are pair-wise connected by bonds and are characterized by chemical element, integer formal charge, radical state, isotopic mass, associated implicit hydrogens, and bonds to other skeletal atoms. Hydrogens may be either connected to skeletal atoms or shared by a group of skeletal atoms (such groups may also share negative charge).

All bonds are simple links (connections). That is, they have no "double", "triple" or other attributes. Bonds are formed pair-wise; thus, no bond may involve three or more atoms [except for hydrogen(s) shared by a group of skeletal atoms].

A molecule is coordinateless. However, the identifier represents the configuration of stereogenic elements, as it is captured from source structural data amplified with either 2-D or 3-D coordinates.

Core parent structure

The most important aspect of InChI is its hierarchical, layered nature. At the center of the InChI approach is the concept of core parent structure, which is a common archetype for the source structure and many related structures: (a) tautomeric, (b) stereo isomeric, (c) isotopically substituted, and (d) protonated/deprotonated forms. Additionally, (e) all the bonds to metal atoms are broken (although the bonding pattern is saved). The core parent structure has no precise tautomeric state: tautomeric "mobile" hydrogens are assigned to groups of skeletal atoms. It has no associated stereochemistry and no isotopic enrichment. Its protolytic centers are neutral, as the core parent is derived from the source structure by adding/removing the appropriate number of protons (to be more precise, the structure as a whole is neutralized).

InChI describes the source structure as the derivative of its parent core with explicitly added features (items a-e above). The exact description requires specifying all of the five items, if they are applicable. Any other (incomplete) combination may also be used. InChIs that have been generated with tautomerism excluded will be the same for the source structure and all of its (recognized) tautomers. Omitting stereo configurations means one will produce the same InChI for the source structure, as well as for all of its stereo isomers, etc.

This model allows one to tune the identifier's resolving power, its "granularity". This is illustrated by Figure 2 (this Figure and related discussion provide just a brief introduction to InChI layers; a more detailed description of the layers can be found later in this paper).

Figure 2

InChI layered representation of the monoanion of 32P-labelled adenosine triphosphate. Left – input structure; right –core parent structure used by InChI, with canonical atomic numbers.

In the InChI string, the core parent structure is encoded by a string composed of several layers. Each layer is a character sequence starting with '/' (forward slash) and followed by a letter denoting the identity of the layer.

Layers represent: empirical formula, the very first layer after the prefix "InChI=1/", in this example “C10H16N5O13P3”; skeletal connections '/c'; hydrogens layer '/h' (indicating positions of immoveable and sharing of moveable Hs); charge layer '/q'; protonation/deprotonation '/p'.

In Figure 2 the parent core structure is derived from the source structure by adding one proton (i.e., one proton must be removed to go back from parent to source, "/p-1"). It has two groups of skeletal atoms sharing one hydrogen each, "(H,21,22)(H,23,24)", and two other groups sharing two hydrogens each, "(H2,11,12,13)(H2,18,19,20)". Here “H2” denotes two hydrogens and “18, 19, 20” denotes the numbered (non-hydrogen) atoms.

Additional features are represented by the layers appearing further to the right.

The stereochemistry layer (“/t4-,6-,7-,10-/m1/s1/” in Figure 2) includes the sublayer for tetrahedral centers '/t' complemented by two indicator stereo layers '/m1' and '/s1'. A double-bond stereochemical layer, '/b', may also be present for other structures. [Note that the stereochemical layers may be omitted if the stereochemistry is not known or does not need to be specified.]

The next layer is the isotopic '/i' layer.
Here "/i29+1" signifies that atom number 29 consists of the isotope with mass increased by unity with respect to the natural value. Note that the isotopic layer may optionally include its own ‘/s’ stereo sublayer, as adding isotopic substitution to the core parent structure may change the stereogenic elements and their configurations.

The next layer is the "FixedH" '/f' layer, which lists the exact position of tautomeric hydrogens (“/fC10H15N5O13P3/h18,21,23H,11H2/q-1”, Figure 2). Note that specifying the exact position of tautomeric hydrogens may change the ionization pattern considered by the InChI algorithm. Consequently, the FixedH layer may contain its own formula and charge sublayers (“/fC10H15N5O13P3” and “/q-1”). Note also that this layer may optionally include its own ‘/s’ stereo sublayer, as adding exact positions of tautomeric hydrogens to the core parent structure may change the set of stereogenic elements and their configurations.

InChI may be produced not only for a single structure, but also for a combination of components not bound to each other (this may be thought of as representing equimolar mixtures). In InChI, this is termed “disconnected structures.” In this case, each layer includes information about all of the components separated by ';' except for the chemical formula, which is dot-separated. This feature enables InChI to provide representations for metallic complexes, adducts, etc. where the bonding may be unknown, ill-defined, or diffuse, including the cases where three or more atoms may be involved in a “bond.”
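The layer syntax described above (segments separated by '/', each introduced by a lowercase letter, with the formula as the only unlettered segment) lends itself to a simple split. The Python sketch below is an illustration only, not part of the InChI software: the function names `split_layers` and `mobile_h_groups` and the `"formula"` key are hypothetical, and the regular expression handles only the shared-hydrogen group form quoted in the text, such as "(H2,11,12,13)".

```python
import re

def split_layers(inchi):
    """Split an InChI string into its prefix and an ordered list of (key, body) layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    prefix, *segments = inchi.split("/")
    layers = []
    for segment in segments:
        if not segment:
            continue  # tolerate a trailing slash
        if segment[0].islower():
            # Lettered layer such as c, h, q, p, t, m, s, i or f.
            layers.append((segment[0], segment[1:]))
        else:
            # The formula is the only layer without a leading letter.
            layers.append(("formula", segment))
    return prefix, layers

# Matches groups of skeletal atoms sharing hydrogens, e.g. "(H,21,22)".
MOBILE_H = re.compile(r"\(H(\d*)((?:,\d+)+)\)")

def mobile_h_groups(text):
    """Return (hydrogen_count, [atom numbers]) for each shared-hydrogen group."""
    groups = []
    for count, atoms in MOBILE_H.findall(text):
        numbers = [int(a) for a in atoms.strip(",").split(",")]
        groups.append((int(count or 1), numbers))
    return groups

print(split_layers("InChI=1S/ClH/h1H/p-1"))
print(mobile_h_groups("(H,21,22)(H,23,24)(H2,11,12,13)(H2,18,19,20)"))
```

A list of pairs is returned rather than a dictionary because sublayers can repeat: after '/f', for instance, a second '/h' and '/q' may appear.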

Standard and non-standard InChI

The layered structure of InChI allows one to represent a molecular structure with a desired level of detail. Accordingly, InChI software may generate different InChI strings for the same molecule, depending on the choice of a multitude of options (e.g., distinguishing or not distinguishing tautomers). This flexibility, however, may be considered a drawback with respect to standardization/interoperability. In 2009, the ‘standard’ InChI, which is always produced with fixed options, was introduced in response to these concerns.

The standard InChI was defined to ensure interoperability/compatibility between large databases/web searching and to facilitate information exchange. Its layered structure conforms to the following requirements.

Standard InChI distinguishes between chemical substances at the level of ‘connectivity’, ‘stereochemistry’, and ‘isotopic composition’. Connectivity is defined here as tautomer-invariant valence-bond connectivity with different tautomers having the same connectivity/hydrogen layer. The Standard InChI representation for organometallics does not include bonds to the metal. Stereochemistry is defined here as a configuration of stereogenic atoms and bonds where only absolute stereo or no stereo is allowed, and unknown stereo designations are treated as undefined. Isotopic composition is defined here as the mass number of isotopic atoms (when specified).

For InChI version 1, the standard InChI is designated by the prefix “InChI=1S/” (that is, the letter ‘S’ immediately follows the Identifier version number, ‘1’). The non-standard InChI is designated by the prefix: “InChI=1/” (that is, the letter ‘S’ is omitted).
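The prefix rule stated above is enough to tell the two flavours of version 1 identifiers apart. The following Python sketch assumes only the two prefixes quoted in the text; the function name `inchi_flavour` and its return labels are hypothetical.

```python
def inchi_flavour(inchi):
    """Classify a version 1 InChI string by its prefix."""
    if inchi.startswith("InChI=1S/"):
        return "standard"
    if inchi.startswith("InChI=1/"):
        return "non-standard"
    return "unknown"

print(inchi_flavour("InChI=1S/ClH/h1H/p-1"))  # standard
```

Anything that matches neither prefix is reported as unknown rather than guessed.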

InChI valence schema

Additional details of the “InChI model of chemical structure” are concerned with accounting for the specific properties of elements, namely, the valence schema.

In particular, in many situations InChI treats metal and non-metal atoms differently. The following elements are considered as non-metals: H, He, B, C, N, O, F, Ne, Si, P, S, Cl, Ar, Ge, As, Se, Br, Kr, Te, I, Xe, At, Rn. All the others are metals.

For all elements, InChI recognizes typical (standard) valence states. These standard valences are summarized in Table 1 (omitting noble gas elements for which valence is zero). Implicit hydrogen atoms are added to hypovalent non-metal atoms in order to reach the nearest higher standard valence, as indicated in Table 1 (however, no addition is made to reach the pentavalent state of neutral nitrogen and the tetravalent state of neutral sulfur atoms). Also, implicit hydrogen atoms are added to the following metal atoms: Li, Be, Na, Mg, Al, K, Ca, Ga, Rb, Sr, In, Sn, Sb, Cs, Ba, Tl, Pb, Bi, Po, Fr, Ra.

Table 1

Standard valences used by InChI

Element	Charge −2	Charge −1	Charge 0	Charge +1	Charge +2
H	-	-	1	-	-
Li	-	-	1	-	-
Be	-	-	2	1	-
B	3	4	3	2	1
C	2	3	4	3	2
N	1	2	3, 5	4	3
O	-	1	2	3, 5	4
F	-	-	1	2	3, 5
Na	-	-	1	-	-
Mg	-	-	2	1	-
Al	3, 5	4	3	2	1
Si	2	3, 5	4	3	2
P	1, 3, 5, 7	2, 4, 6	3, 5	4	3
S	-	1, 3, 5, 7	2, 4, 6	3, 5	4
Cl	-	-	1, 3, 5, 7	2, 4, 6	3, 5
K	-	-	1	-	-
Ca	-	-	2	1	-
Sc	-	-	3	-	-
Ti	-	-	3, 4	-	-
V	-	-	2, 3, 4, 5	-	-
Cr	-	-	2, 3, 6	-	-
Mn	-	-	2, 3, 4, 6	-	-
Fe	-	-	2, 3, 4, 6	-	-
Co	-	-	2, 3	-	-
Ni	-	-	2, 3	-	-
Cu	-	-	1, 2	-	-
Zn	-	-	2	-	-
Ga	3, 5	4	3	-	1
Ge	2, 4, 6	3, 5	4	3	-
As	1, 3, 5, 7	2, 4, 6	3, 5	4	3
Se	-	1, 3, 5, 7	2, 4, 6	3, 5	4
Br	-	-	1, 3, 5, 7	2, 4, 6	3, 5
Rb	-	-	1	-	-
Sr	-	-	2	1	-
Y	-	-	3	-	-
Zr	-	-	4	-	-
Nb	-	-	3, 5	-	-
Mo	-	-	3, 4, 5, 6	-	-
Tc	-	-	7	-	-
Ru	-	-	2, 3, 4, 6	-	-
Rh	-	-	2, 3, 4	-	-
Pd	-	-	2, 4	-	-
Ag	-	-	1	-	-
Cd	-	-	2	-	-
In	3, 5	2, 4	3	-	1
Sn	2, 4, 6	3, 5	2, 4	3	-
Sb	1, 3, 5, 7	2, 4, 6	3, 5	2, 4	3
Te	-	1, 3, 5, 7	2, 4, 6	3, 5	2, 4
I	-	-	1, 3, 5, 7	2, 4, 6	3, 5
Cs	-	-	1	-	-
Ba	-	-	2	1	-
La	-	-	3	-	-
Ce	-	-	3, 4	-	-
Pr	-	-	3, 4	-	-
Nd	-	-	3	-	-
Pm	-	-	3	-	-
Sm	-	-	2, 3	-	-
Eu	-	-	2, 3	-	-
Gd	-	-	3	-	-
Tb	-	-	3, 4	-	-
Dy	-	-	3	-	-
Ho	-	-	3	-	-
Er	-	-	3	-	-
Tm	-	-	2, 3	-	-
Yb	-	-	2, 3	-	-
Lu	-	-	3	-	-
Hf	-	-	4	-	-
Ta	-	-	5	-	-
W	-	-	3, 4, 5, 6	-	-
Re	-	-	2, 4, 6, 7	-	-
Os	-	-	2, 3, 4, 6	-	-
Ir	-	-	2, 3, 4, 6	-	-
Pt	-	-	2, 4	-	-
Au	-	-	1, 3	-	-
Hg	-	-	1, 2	-	-
Tl	3, 5	2, 4	1, 3	-	-
Pb	2, 4, 6	3, 5	2, 4	3	-
Bi	1, 3, 5, 7	2, 4, 6	3, 5	2, 4	3
Po	-	1, 3, 5, 7	2, 4, 6	3, 5	2, 4
At	-	-	1, 3, 5, 7	2, 4, 6	3, 5
Fr	-	-	1	-	-
Ra	-	-	2	1	-
Ac	-	-	3	-	-
Th	-	-	3, 4	-	-
Pa	-	-	3, 4, 5	-	-
U	-	-	3, 4, 5, 6	-	-
Np	-	-	3, 4, 5, 6	-	-
Pu	-	-	3, 4, 5, 6	-	-
Am	-	-	3, 4, 5, 6	-	-
Cm	-	-	3	-	-
Bk	-	-	3, 4	-	-
Cf	-	-	3	-	-
Es	-	-	3	-	-
Fm	-	-	3	-	-
Md	-	-	3	-	-
No	-	-	2	-	-
Lr	-	-	3	-	-
Rf	-	-	4	-	-
Db	-	-	5	-	-
Sg	-	-	6	-	-
Bh	-	-	7	-	-
Hs	-	-	1	-	-
Mt	-	-	1	-	-
Ds	-	-	1	-	-
Rg	-	-	1	-	-
Cn	-	-	1	-	-

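
The rule for adding implicit hydrogens can be illustrated with a small sketch. It is not the InChI implementation: the dictionary below copies only a handful of (element, charge) entries from Table 1, the names STANDARD_VALENCES, NO_FILL_TARGETS and implicit_hydrogens are hypothetical, and the exceptions for neutral nitrogen and sulfur are read literally (if the nearest higher standard valence is the excluded one, nothing is added).

```python
# A few entries copied from Table 1: (element, charge) -> standard valences.
STANDARD_VALENCES = {
    ("C", -1): (3,), ("C", 0): (4,), ("C", 1): (3,),
    ("N", -1): (2,), ("N", 0): (3, 5), ("N", 1): (4,),
    ("O", -1): (1,), ("O", 0): (2,), ("O", 1): (3, 5),
    ("S", 0): (2, 4, 6),
    ("Cl", 0): (1, 3, 5, 7),
}

# Valence states that are never reached by adding implicit hydrogens.
NO_FILL_TARGETS = {("N", 0): {5}, ("S", 0): {4}}


def implicit_hydrogens(element: str, charge: int, bond_order_sum: int) -> int:
    """Number of H atoms needed to reach the nearest higher standard valence."""
    valences = STANDARD_VALENCES.get((element, charge))
    if valences is None:
        return 0  # element/charge not covered by this reduced table
    for valence in valences:  # listed in ascending order
        if valence >= bond_order_sum:
            if valence in NO_FILL_TARGETS.get((element, charge), set()):
                return 0
            return valence - bond_order_sum
    return 0  # already above the highest standard valence
```

For example, a bare neutral carbon atom receives four hydrogens, whereas a neutral nitrogen atom with a bond order sum of four receives none, because the pentavalent state is excluded.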

Layout of InChI layers

Main layer: representing core parent structure

Empirical formula sublayer: representing composition

The chemical formula is represented according to Hill convention, that is, beginning with carbon atoms, then hydrogens, then all other elements in alphabetical order. This is the only layer prefixed with a single slash, ‘/’, without a following character. Note that this formula may be different from the one seen for the source chemical structure, as it refers to the core parent structure. If the source structure depicts charged species, the InChI algorithm may protonate or deprotonate it to create a neutral parent (to ensure that the same basic layers will be generated for neutral and ionized forms). For example, InChI for Cl−, “InChI=1S/ClH/h1H/p-1”, has formula sublayer “ClH” (for the proton, InChI has no chemical formula sublayer: “InChI=1S/p+1”). For the anion of adenosine triphosphate, ATP, InChI has the formula layer “C10H16N5O13P3”.
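
As an illustration of the Hill ordering, the following sketch builds a formula string from element counts. The function name hill_formula is our own; the branch for carbon-free compounds (all elements, including hydrogen, in alphabetical order) is the usual statement of the Hill convention and is consistent with the “ClH” example above.

```python
def hill_formula(counts: dict[str, int]) -> str:
    """Format positive element counts as a Hill-order formula string."""
    elements = sorted(counts)
    if "C" in counts:
        # Carbon first, then hydrogen, then everything else alphabetically.
        rest = [e for e in elements if e not in ("C", "H")]
        ordered = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        ordered = elements
    return "".join(
        element + (str(counts[element]) if counts[element] > 1 else "")
        for element in ordered
    )
```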

Skeletal connections layer

This layer, prefixed with ‘/c’, represents connections between skeletal atoms by listing the canonical numbers in the chain of connected atoms (branches are given in parentheses). Note that the canonical atomic numbers, which are used throughout the InChI, are always given in the formula’s element order. For example, “/C10H16N5O13P3” (the beginning of InChI for adenosine triphosphate) implies that atoms numbered 1–10 are carbons, 11–15 are nitrogens, 16–28 are oxygens, and 29–31 are phosphorus. Hydrogen atoms are not explicitly numbered.
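
The correspondence between the formula and the canonical numbers can be reproduced with a few lines of code. This is a sketch under the assumption of a single-component formula; the helper name canonical_number_ranges is hypothetical.

```python
import re


def canonical_number_ranges(formula: str) -> dict[str, tuple[int, int]]:
    """Map each non-hydrogen element to its (first, last) canonical number."""
    ranges = {}
    next_number = 1
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element == "H":
            continue  # hydrogen atoms are not explicitly numbered
        n_atoms = int(count) if count else 1
        ranges[element] = (next_number, next_number + n_atoms - 1)
        next_number += n_atoms
    return ranges
```

Applied to “C10H16N5O13P3”, the function yields the ranges quoted in the text for adenosine triphosphate.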

Hydrogens layer

This layer, prefixed with ‘/h’, lists the bonds between the atoms in the structure, partitioned into as many as three sublayers. The first sublayer represents all bonds other than those to non-bridging H-atoms, the second sublayer represents bonds of all immobile H-atoms, and the third sublayer provides locations of any mobile H-atoms. This last sublayer represents H-atoms that can be found at more than one location in a compound due to various types of tautomerism. This sublayer identifies the groups of atoms that share one or more mobile hydrogen atoms. In addition to hydrogen atoms, mobile H groups may contain mobile negative charges. These charges are included in the charge layer.

Charge layer

This layer provides information about net charge and is composed of two sublayers.

Charge sublayer

This sublayer ‘/q’ is simply the net charge of the core parent structure.

Protonation/deprotonation sublayer

This sublayer ‘/p’ indicates the net number of protons removed from or added to the source structure while deriving its core parent.
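
Reading the protonation sublayer from an InChI string may be sketched as follows. The function name net_protons is our own, and the sketch simply takes the first layer beginning with a lower-case ‘p’; it makes no attempt to handle a reconnected layer separately.

```python
def net_protons(inchi: str) -> int:
    """Return the signed integer of the first '/p' sublayer, or 0 if absent."""
    for layer in inchi.split("/")[1:]:  # skip the 'InChI=1S' prefix
        if layer.startswith("p"):
            return int(layer[1:])  # int() accepts '+1' as well as '-1'
    return 0
```

For the chloride example quoted in the formula sublayer description, “InChI=1S/ClH/h1H/p-1”, the function returns -1; for the proton, “InChI=1S/p+1”, it returns 1.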

Mesomerism

Mesomerism is the concept related to the situation in which the molecular structure cannot be unambiguously represented by a single classical structural formula; rather, two (or more) mesomeric structures must be drawn and considered to contribute to the overall picture. As the IUPAC Gold Book states, “mesomerism” is “Essentially synonymous with resonance. The term is particularly associated with the picture of π-electrons as less localized in an actual molecule than in a Lewis formula. The term is intended to imply that the correct representation of a structure is intermediate between two or more Lewis formulae” [28]. In other words, mesomers are considered as imaginary objects (or even drawing artifacts) that cannot be distinguished by a simple chemical identifier. Mesomerism is effectively eliminated in InChI. Mesomers have the same InChIs (this is true for all possible InChI layouts of layers). Actually, this is very natural. Mesomeric structures of a molecular entity have the same basic connectivity but differ in bond orders, and possibly in having atomic charges on different atoms. InChI does not use bond orders and does not place charges on particular atoms; the placement of hydrogen atoms in a mesomeric system, which would be important for InChI, is always the same. This is illustrated by Figure 3, which shows mesomers of formamide and nitromethane as well as the associated InChIs. A more complex example, Methylene Blue, is presented in Figure 4. Again, all InChIs for the mesomers are the same (note that their Symyx NEMA keys differ, as is shown in [20]).

Figure 3

Both Standard (upper 2 lines under each drawing) and FixedH (lower 2 lines) versions of InChI and InChIKey for mesomers are the same: formamide (left) and nitromethane (right).

Figure 4

Both Standard (upper 2 lines under each drawing) and FixedH (lower 2 lines) versions of InChI and InChIKey for mesomers are the same, as exemplified by Methylene Blue.

Note that all the above discussion about mesomerism and mesomers is equally applicable to aromaticity and resonance structures.

FixedH layer

This layer, prefixed with ‘/f’, serves as the exact specification of tautomers. When potentially mobile H atoms are detected and the user specifies that they should be immobile (tautomerism not allowed), this layer binds these H atoms to the atoms specified in the input structure. In the case where this causes a change in earlier layers, appropriate changes are added to this layer (earlier layers are not affected). Tautomers have the same Standard but different FixedH InChIs (and InChIKeys). Figure 5 shows this for the example of an isoguanosine derivative (this example was also used in [20], where it was noted that the tautomers have different Symyx NEMA keys but the same InChIKeys; however, only Standard InChIKeys were quoted).

Figure 5

Standard versions (upper 2 lines under each drawing) of InChI and InChIKey for tautomers are the same while FixedH versions (lower 2 lines) differ, as exemplified by an isoguanosine derivative.


Stereochemistry layer

Overview of stereochemistry layer with its sublayers

The stereochemical layer contains sublayers representing double bond stereochemistry and tetrahedral stereochemistry (including allenes). The values in this layer depend on the contents of preceding layers. For example, the value produced for the stereo layer will depend on whether it was derived from a main layer or Fixed-H layer or whether it belongs to an isotopic layer. Therefore, this type of layer may be present at several locations in an Identifier. Two distinct classes of stereochemistry are represented, sp2 (double bond or Z/E) and sp3 (tetrahedral). The double bond sublayer ‘/b’ precedes the tetrahedral sublayer ‘/t’. These sublayers do not affect each other if the substituents involved are constitutionally different. Otherwise, the content of each sublayer may influence the other. For example, if the two stereo-enabled (chiral) ligands at the same end of a double bond are constitutionally identical, the double bond stereo depends on the tetrahedral stereo configurations of these two ligands.

Double bond sp2 (Z/E) stereo layer ‘/b’

Expression of a stereo configuration is easily done in two-dimensional drawings. When double bonds are rigid, stereoisomerism is readily represented without ambiguity. However, in some cases in alternating bond systems, non-rigid bonds may be formally drawn as double bonds. Bonds in these systems, when discovered by InChI algorithms, are not assigned stereo labels. InChI does not generate sp2 stereoisomerism information in small rings (fewer than 8 atoms).

Tetrahedral stereo layer ‘/t’

Tetrahedral (typically, sp3) stereochemistry is readily represented using conventional wedge/hatch (out/in) bonds commonly employed in 2-D drawings. Relative tetrahedral stereochemistry is represented first, optionally followed by a tag to indicate absolute stereochemistry. In general, the InChI algorithm marks the configuration of a stereogenic center or bond as either ‘+’ or ‘-’, as shown in Figure 6. These marks have no relation to R,S or E,Z configurations (actually, they are based on considering canonical atomic numbers of substituents at the stereogenic center). When a stereo center configuration is not known, an ‘unknown’ descriptor may be specified (which will appear in the stereo layer). If a possible stereocenter is found, but no stereo information is provided, it will be represented in a stereo layer by a not-given (‘undefined’) flag.

Figure 6

Stereoisomers of menthol with associated InChI/Keys.

In the current InChI software v. 1.04 (2011), a question mark (‘?’) is used, by default, for both ‘undefined’ and ‘unknown’ flags. However, in a non-standard InChI generated with option ‘SLUUD’ turned on, the symbol ‘u’ is used to indicate explicitly entered ‘unknown’ stereo (while ‘?’ is retained for ‘undefined’).
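
The marks discussed above can be read from the content of a ‘/t’ sublayer with a short helper. The sketch below assumes a single-component sublayer written as comma-separated items, each a canonical atom number followed by one mark; the function name and the sample string are illustrative only and are not taken from Figure 6.

```python
VALID_MARKS = {"+", "-", "?", "u"}


def parse_tetrahedral_layer(content: str) -> dict[int, str]:
    """Parse e.g. '8-,9+,10?' (the text after '/t') into {atom number: mark}."""
    marks = {}
    for item in content.split(","):
        number, mark = item[:-1], item[-1]
        if mark not in VALID_MARKS:
            raise ValueError(f"unexpected stereo mark in {item!r}")
        marks[int(number)] = mark
    return marks


# Illustrative input: two defined centers and one undefined/unknown center.
sample = parse_tetrahedral_layer("8-,9+,10?")
```

With the default settings a ‘?’ entry cannot be traced back to either ‘undefined’ or ‘unknown’; a ‘u’ entry is expected only in non-standard strings produced with ‘SLUUD’.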

Isotopic layer

The Isotopic layer (signified with the prefix ‘/i’) identifies different isotopically labeled atoms. Exchangeable isotopic hydrogen atoms (deuterium and tritium) are listed separately. The layer also contains any changes in stereochemistry caused by the presence of isotopes.

Reconnected layer: coordination compounds and organometallics

To avoid many ambiguities that typically arise when representing metal-containing compounds, the InChI algorithm breaks bonds to metal(s), that is, it “disconnects” these compounds. More details of the disconnection procedure may be found in Section “Normalization of input structure”, sub-Section “Breaking bonds to metal atoms”. The original metal bonding scheme is preserved in the so-called “reconnected layer”, which is optionally included in the Identifier. This layer is signified with the prefix ‘/r’ and simply contains all the layers appearing in the case where the InChI string is generated without breaking bonds to metal atoms. That is, ‘/r’ is followed by the formula and all the other subsequent layers, whichever are applicable.
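
Since every layer except the formula starts with a slash followed by a one-letter prefix, an InChI string can be split into its layers with very little code. The sketch below is a simplification under our own assumptions: the name split_layers is hypothetical, the formula is recognized as a first layer starting with an upper-case letter or a digit, and the result is an ordered list because the same prefix (for example ‘t’) may occur at several locations.

```python
def split_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into ordered (prefix, content) pairs."""
    head, *layers = inchi.split("/")
    result = [("version", head.removeprefix("InChI="))]
    for position, layer in enumerate(layers):
        first = layer[:1]
        if position == 0 and (first.isupper() or first.isdigit()):
            result.append(("formula", layer))  # the only unprefixed layer
        else:
            result.append((first, layer[1:]))
    return result
```

The layers following ‘/r’ are returned in the same flat list; separating the disconnected and reconnected parts would require tracking the position of the ‘r’ entry.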

InChIKey

InChIKey is a compact chemical identifier derived from InChI. The InChIKey is always exactly 27 characters long. Consequently, it is a much more convenient identifier for searching the internet and indexing databases (see Figure 1). Indeed, based on conversations with search engine developers, it is a practical requirement for the InChIKey to consist of upper case letters only and to have a length that all search engines will accept without truncation or modification. A disadvantage of the InChIKey is that one loses the ability to algorithmically restore a structure from a textual label: InChIKey is a structure-based registry-lookup identifier, see Background section. Finding the structure corresponding to a given InChIKey requires searching on the Web or querying dedicated resolvers (e.g. those of ChemSpider [29] and NCI [30]; both are free to use). Of course, for specific targeted databases a lookup service may be added by developers/maintainers (as is implemented, e.g., in the UniChem [31] database Web services [32]). InChIKey is an encoded version of the hash codes calculated from a source InChI string, elaborated with convenience “flag symbols”. Hashing is a one-way mathematical transformation typically used to calculate a compact fixed-length digital representation of a much longer string of arbitrary length. As the hash function maps input values (strings) to a strongly compacted space, getting the same hash code for two different inputs (a collision) is unavoidable. Of course, a collision means loss of the identifier’s uniqueness. However, the use of appropriate hashing details typically allows one to successfully utilize hash codes in various identification tasks.
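
The idea of mapping a string of arbitrary length to a short fixed-length code of capital letters can be demonstrated with a toy example. The sketch below is deliberately not the InChIKey procedure: SHA-256 from the Python standard library is used only as a convenient hash function, the letter encoding is a plain base-26 expansion chosen for brevity, and the name toy_letter_hash is our own.

```python
import hashlib
import string


def toy_letter_hash(text: str, length: int = 14) -> str:
    """Map any string to `length` upper-case letters (illustration only)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    number = int.from_bytes(digest, "big")
    letters = []
    for _ in range(length):
        number, remainder = divmod(number, 26)
        letters.append(string.ascii_uppercase[remainder])
    return "".join(letters)
```

Because only 26^14 different codes exist while the number of possible input strings is unlimited, two different inputs must occasionally receive the same code, which is the collision discussed above.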
By design, a goal of InChIKey is to partially preserve the hierarchical layered structure of the parent InChI. The first block of 14 (out of a total of 27) characters of an InChIKey encodes core molecular constitution, as described by the formula, connectivity, hydrogen positions and charge sublayers of the InChI main layer. The other structural features complementing the core data (namely, exact positions of mobile hydrogens, stereochemistry, isotopic substitution and metal ligation, whichever are applicable) are encoded by the second block of InChIKey. The possible protonation or deprotonation of the core molecular entity (described by the protonation sublayer of the InChI main layer) is encoded in the very last InChIKey flag character, see below. As a result, the first InChIKey block is always the same for the same molecular skeleton. All isotopic substitutions, changes in stereoconfiguration, tautomeric state and coordination bonding are reflected in the second block. InChIKey inherits the Standard or non-standard nature of the parent InChI (signified by a dedicated flag character). This inherited nature influences the “resolving power” of the identifier. For example, Standard InChIKey (produced from Standard InChI) does not account for tautomerism. In addition, it may also indicate only absolute stereo. It also does not account for the bonds of the original structures to metal atoms, if they were present and disconnected on Standard InChI generation. Shown below is the current format of InChIKey (please note that this is different from the initial format, which appeared in 2007, Software version 1.02-beta release).
AAAAAAAAAAAAAA-BBBBBBBBFV-P

All the symbols except the delimiter (a dash, that is, a minus sign) are uppercase English letters representing a “base-26” encoding. The overall length of InChIKey is fixed at 27 characters, including separators (dashes). As mentioned previously, lower case letters would be useless as web search engines do not differentiate between upper and lower case for searching. Here are the five distinct components:

AAAAAAAAAAAAAA: the first block, 14 characters encoding core molecular constitution.
BBBBBBBB: the second block, 8 characters encoding advanced structural features, whichever are applicable (stereochemistry, isotopic substitution, exact position of mobile hydrogens, metal ligation data).
F: flag character, either ‘S’ for a Standard InChI parent or ‘N’ for non-standard.
V: version character; currently ‘A’, which means 1.
P: protonation/deprotonation flag. ‘N’ means no proton-related ionization (“Neutral”); other letters indicate that protons were inserted or removed.

This layout is exemplified in Figure 7.
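
The fixed layout makes an InChIKey straightforward to validate and to take apart. The following sketch is our own (the names InChIKeyParts and parse_inchikey are hypothetical); it checks only the 27-character pattern described above and does not verify that the key was actually derived from an InChI. The sample key is, to our knowledge, the Standard InChIKey of caffeine, the molecule used in Figure 7.

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    first_block: str       # 14 letters: core molecular constitution
    second_block: str      # 8 letters: stereo, isotopes, fixed H, metal data
    is_standard: bool      # flag character 'S' (True) or 'N' (False)
    version_char: str      # currently 'A', meaning version 1
    protonation_flag: str  # 'N' means no proton-related ionization


_KEY_RE = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its five components or raise ValueError."""
    match = _KEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    first, second, flag, version, protonation = match.groups()
    return InChIKeyParts(first, second, flag == "S", version, protonation)


parts = parse_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N")
```

Here parts.is_standard is True and parts.protonation_flag is ‘N’.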

Figure 7

InChIKey layout explained (using caffeine as an example).

Note that different protonation states of the same compound will have Standard InChIKeys that differ only by a single character, the protonation flag (unless both states have number of inserted/removed protons > 12). Moreover, since neutral and zwitterionic states of the same molecule have the same zero number of inserted/removed protons, they will also have the same Standard InChIKeys. However, non-standard InChIKeys generated from non-Standard InChIs (including FixedH sublayer) will allow one to distinguish between the states. This is exemplified by InChIKeys for various ionization states of L-lysine, Figure 8.
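
A comparison that ignores the protonation flag follows directly from this property. The sketch below uses a function name of our own choosing, and the second key in the usage line is constructed purely for illustration by changing the last character of the first; it is not a key taken from Figure 8.

```python
def same_up_to_protonation(key_a: str, key_b: str) -> bool:
    """True if two InChIKeys differ at most in the final protonation flag."""
    for key in (key_a, key_b):
        if len(key) != 27:
            raise ValueError(f"InChIKey must have 27 characters: {key!r}")
    return key_a[:-1] == key_b[:-1]


# Illustrative keys: the second differs from the first only in the last letter.
match = same_up_to_protonation(
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "RYYVLZVUVIJVGH-UHFFFAOYSA-O"
)
```

As stated above, the comparison is not informative when more than 12 protons were inserted or removed in both states.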

Figure 8

Standard (upper line under each drawing) and FixedH (lower line) InChIKeys for the various ionization states of L-lysine.


Overview of implementation

General workflow

The general workflow of derivation of InChI from structural data is illustrated by Figure 9. There are three major steps in the workflow: (a) normalization of input structure, that is, converting the supplied structural data into internal data structures conforming to the InChI chemical model; (b) canonicalization of atomic numbering, which accounts for atomic equivalence/inequivalence relations appearing under this model; and (c) serialization, that is, generating the final sequence of symbols, an InChI string. There is an optional fourth step: (d) hashing of the InChI string and producing a compact InChIKey.

Figure 9

General workflow of InChI/Key generation.


Input data

The input data for InChI generation is the structure represented in the “classical chemical structure” paradigm as atomic and bonding data, with optional addition of “0D” stereochemical data. The data may be supplied either as a molfile or SD file (input of the inchi-1 executable, see InChI Software User’s Guide [33]) or as C data structures (as described in the header file “inchi_api.h”, see InChI source code). Each atom is described by a number of properties: its chemical element name; x, y, z coordinates (all or any of them may be zero); list of bonded atoms; either the exact number of implicit hydrogen atoms (with separate indication for protium, deuterium, and tritium, if applicable) or a flag signifying that implicit hydrogens should be added; isotopic mass; radical state; formal integer charge. Each bond is described by its type and stereochemistry indicator, if applicable. A bond type may be single, double, or triple. “Resonant” or “aromatic” is not allowed; a chemical structure described with aromatic bonds should be explicitly converted to a representation with alternating single and double bonds, prior to serving as InChI input. For end user convenience, as aromatic bonds may occur widely in molfiles and SD files (in explicit violation of the file format specification [21]), they are typically tolerated by the inchi-1 executable, which itself performs a conversion; however, success is not guaranteed. The stereochemistry indicator is in wedge convention (preferably, in one-wedge style to avoid ambiguity) [34]; it indicates the wedge direction (“up”/“down”/“either”), as well as the orientation of the wedge narrow point (towards the atom or out of it). The configurations of stereogenic double bonds are expressed via atomic coordinates.
If all the coordinates are zero, “0D” stereochemical data may be added to specify configuration (applicable to input for InChI Library, API, procedures). InChI options are the switches that modify the default behavior of InChI algorithms/software; they are described in a separate section.
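
The atom and bond properties listed above may be pictured as simple records. The following sketch is purely illustrative: the class and field names are our own and do not mirror the C structures declared in “inchi_api.h”, and 0D stereo data are omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Atom:
    element: str                      # chemical element name, e.g. "C"
    x: float = 0.0                    # coordinates; all may be zero
    y: float = 0.0
    z: float = 0.0
    neighbors: list[int] = field(default_factory=list)  # bonded atom indices
    implicit_h: Optional[int] = None  # None: let implicit H be added
    isotopic_mass: int = 0            # 0: natural isotopic composition
    radical: int = 0                  # radical state (0: none)
    charge: int = 0                   # formal integer charge


@dataclass
class Bond:
    begin: int                        # index of the first atom
    end: int                          # index of the second atom
    order: int                        # 1 = single, 2 = double, 3 = triple
    stereo: str = "none"              # e.g. "up", "down", "either"

    def __post_init__(self) -> None:
        # Aromatic or resonant bond types are not acceptable input.
        if self.order not in (1, 2, 3):
            raise ValueError("bond order must be 1, 2 or 3")
```

Rejecting any bond order other than 1, 2 or 3 in the constructor reflects the requirement that aromatic bonds be converted to alternating single and double bonds before the structure is passed on.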

Normalization of input structure

The first step of InChI production is normalization: converting the input structural data into data structures conforming to InChI representation rules, organization principles and model of chemical structure. If applicable, normalization starts from preprocessing, correcting the input structural formula according to several hard-coded “good drawing rules”, intended to ease further treatment. Some of these corrections correspond to mesomeric forms of functional groups as intentionally drawn by chemists (e.g., for nitro groups), while the others serve to correct strange drawing artifacts related to computer origin (which occur surprisingly often in large databases). In particular, normalization serves to exclude the issues concerning alternating bonds, resonance and aromaticity. The next step of normalization includes breaking bonds to metal ions, as InChI’s core parent structure is always metal-disconnected, to avoid numerous issues with different bonding models for metallated compounds. The next step is to find protons necessary for dealing with variable protonation, again to ensure elucidation of the (de)protonation-independent core parent structure. The final normalization step includes the discovery of conventional tautomeric patterns and ‘resonances’ that may occur due to bond alternation or positive charge migration along paths of alternating bonds.
The normalization and the stereochemical perception stages rely heavily on testing whether a bond order can be changed due to the presence of an alternating bond circuit, as well as the possibility of a hydrogen atom, charge, or radical center to migrate along a path of alternating bonds. This testing is based on a matching algorithm described in detail in ref. [35]. For the fixed H layer, only moving positive charges along paths of alternating bonds is allowed.

Correcting input structural formula

This includes the following transformations, whichever are applicable to the original structure (note that this step is still performed “within” the classical structure model and extensively operates with bond orders and atomic charges).

Moving charge from hydrogen to heavy atom

[Scheme not reproduced]

Converting charge-separated patterns to neutral

[Scheme and example not reproduced]

Decreasing charge separation by increasing valence

[Scheme not reproduced]

Moving negative charge from central atoms in oxoanions

[Schemes and examples not reproduced]

Moving positive charge to create imine nitrogen

[Scheme not reproduced]

Annihilating adjacent opposite charges going to higher valence state

The full set of rules for annihilation of adjacent charges is documented in the InChI Technical Manual [5]. One particularly important rule concerns the drawing of nitro and similar groups [scheme not reproduced].

Breaking bonds to metal atoms

To avoid many ambiguities that typically arise when representing metal-containing compounds, the InChI algorithm always breaks bonds to metal(s), that is, it disconnects these compounds. However, this is implemented in a different manner for “simple salts” and for coordination/organometallic compounds.

Disconnecting simple salts

Simple salts, for InChI, are compounds of type M-X or Y-M-X formed by a metal atom M and “acids” HX, HY. Acids here are substances of three kinds [list not reproduced]. In “salts” drawn connected, metals are connected to the acid by single bonds only and do not have H-atoms connected to them. The metal valence should be the lowest valence known to InChI or, for some metals, the second lowest. Positively charged metals should have the lowest valence known to InChI (see Table 1). Upon disconnection, atom X (X = Hal or O) of the acid receives a single negative charge; the charge of the metal is incremented. Substances drawn as H4N-X are disconnected to NH3 and HX. Note that compounds formed by many inorganic acids do not fit the above salt definition. For example, sodium nitrate is treated as a coordination compound (and so may be reconnected on user request). Several examples of salt disconnection were shown in the original scheme [not reproduced].
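
The charge bookkeeping of a single salt disconnection can be sketched as follows. This is an illustration only: the data layout (parallel lists of element symbols and charges, bonds as a set of index pairs) and the function name are our assumptions, and the preconditions on metal valence and on the kind of acid are not checked.

```python
HALOGENS = {"F", "Cl", "Br", "I", "At"}


def disconnect_simple_salt(elements, charges, bonds, metal, x):
    """Break the M-X bond: X receives a charge of -1, the metal +1.

    elements, charges: lists indexed by atom; bonds: set of frozensets.
    Returns new (charges, bonds); the inputs are left unchanged.
    """
    if elements[x] not in HALOGENS | {"O"}:
        raise ValueError("X must be a halogen or oxygen")
    bond = frozenset((metal, x))
    if bond not in bonds:
        raise ValueError("the two atoms are not bonded")
    new_charges = list(charges)
    new_charges[x] -= 1
    new_charges[metal] += 1
    return new_charges, set(bonds) - {bond}


# Na-Cl drawn with a covalent bond becomes Na+ and Cl-.
salt_charges, salt_bonds = disconnect_simple_salt(
    ["Na", "Cl"], [0, 0], {frozenset((0, 1))}, metal=0, x=1
)
```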

Disconnecting other metal-containing compounds

In an effort to deal with the various different conventions used for drawing organometallic compounds, all metal atoms are disconnected in the main layer. In the process, the charges for disconnected halogens, O, S, Se, Te, N, P, As, and B are adjusted if possible by transferring charge to the metal atom. The InChI algorithm may be instructed, by a software switch, to add to the identifier a “reconnected” layer that contains all bonds given in the input structures, including those to metal. Note that a disconnected “salt” (previous section) cannot be reconnected this way. At this point, the rules for annihilating adjacent opposite charges going to a higher valence state (see above) are applied a second time, to the disconnected structure.

Eliminating radicals and converting aromatic bonds to alternating single and double

This is the first step out of several that may change bonds in the structure in a systematic order along alternating bond paths. Before attempting this change, the algorithm detects bonds (highlighted in red in the original scheme, not reproduced here) and marks them as fixed. The order of these bonds will not be allowed to change. The original also illustrates the elimination of radicals with a scheme [not reproduced]. The conversion of aromatic bonds to alternating single and double bonds is done through radical cancellation [example scheme not reproduced].
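
To give a feeling for what assigning alternating single and double bonds involves, the sketch below treats it as a search for a perfect matching over the aromatic bonds. This is a swapped-in textbook technique, not the radical cancellation procedure used by InChI: it assumes that every atom on an aromatic bond needs exactly one double bond (which is not true for, e.g., pyrrole-type nitrogen), it uses exhaustive backtracking suitable only for small graphs, and the name kekulize is our own.

```python
def kekulize(n_atoms, aromatic_bonds):
    """Return the set of bonds to draw as double, or None if impossible."""
    neighbors = {i: set() for i in range(n_atoms)}
    for a, b in aromatic_bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    atoms = sorted(i for i in neighbors if neighbors[i])
    partner = {}  # atom -> atom at the other end of its double bond

    def solve():
        free = next((a for a in atoms if a not in partner), None)
        if free is None:
            return True  # every aromatic atom has a double bond
        for other in sorted(neighbors[free]):
            if other not in partner:
                partner[free], partner[other] = other, free
                if solve():
                    return True
                del partner[free], partner[other]  # backtrack
        return False

    if not solve():
        return None
    return {frozenset(pair) for pair in partner.items()}


# Six-membered ring with all bonds marked aromatic (benzene-like).
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
double_bonds = kekulize(6, ring)
```

For the six-membered ring the function returns three alternating double bonds; for a five-membered all-aromatic ring it returns None, since an odd number of atoms cannot be paired.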

Finding [de]protonation pattern which leads to neutral core parent structure

This step occurs for ionized structures. It converts various (de)protonation forms to the same parent neutral structure, memorizing associated changes in the protonation layer. The necessary condition for this step is the presence, in the input structure, of charges +1 or -1 located on non-metal atoms that have standard valences (see Table 1). The total charge on these atoms is counted and used later. Charges on atoms that are adjacent to other charged atoms are not counted. Non-ring bonds altered during variable protonation processing are marked as non-stereogenic. The so-called aggressive (‘hard’) proton removal or addition procedure is described below.

Remove protons from charged heteroatoms

This step removes protons from protonated atoms and places them in a separate proton (charge) layer. If the structure contains atom Y′Hₘ+ (m ≥ 1, Y′ is N, P, O, S, Se, or Te), then it is replaced with Y′Hₘ₋₁. This is a “simple removal” of a proton. Since some protonated atoms are, in effect, concealed by alternating bond conventions, a separate effort is made to find and disconnect these protons. This “hard removal” involves changing bonds and removing H from formally uncharged atoms. It may be illustrated as follows. If there exist atoms =N+ or ≡N+ and -NHₘ (m ≥ 1, at least one neighbor of N must be Y or Sb) or =Y-QH (Y = C, N, P, As, S, Se, Te, Cl, Br; Q = O, S, Se, Te), then an attempt is made to find a fragment containing an alternating path (a, b,… are other atoms) and remove a proton:
HₘN−b=c−d=N+ → HₘN+=b−c=d−N → Hₘ₋₁N=b−c=d−N + H+
or
HQ−Y=a−b=c−d=N+< → HQ+=Y−a=b−c=d−N< → Q=Y−a=b−c=d−N< + H+
More aggressive transformations are also possible [example of such a “hard” proton removal not reproduced]. During this process:
(a) positive charges may be moved between N+, N− and N (except N in -N=Q);
(b) negative charges may be moved between N+, N−, N, and Q, Z in -Y=Q, =Y=Q, =Y-QX, ≡Y-QX, -C-ZX, -Q'-QX, ≡N+-QH, =N+=Q, -N−-QH, where Q is O, S, Se, or Te; Z is S, Se, or Te; X is H or -; Y ≠ C ≠ N may carry ±1 charge; N in -N=Q is excluded; or
(c) atoms H may be moved between atoms described in (b).
The neutralization of positive and negative charges may occur. A simple exchange of atom H and a negative charge between two atoms without changing bonds is not allowed.
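
The “simple removal” rule is straightforward to express in code. The sketch below covers only that rule (a heteroatom carrying charge +1 and at least one hydrogen loses one proton); the “hard” removal along alternating paths is not attempted. Atoms are represented as plain dictionaries, a layout chosen here for illustration, and the function name is hypothetical.

```python
PROTON_DONORS = {"N", "P", "O", "S", "Se", "Te"}


def simple_proton_removal(atoms):
    """Apply the simple removal rule to every protonated heteroatom.

    atoms: list of dicts with keys 'element', 'charge' and 'num_h'.
    Returns (new_atoms, number_of_protons_removed).
    """
    new_atoms = []
    removed = 0
    for atom in atoms:
        atom = dict(atom)  # work on a copy
        if (atom["element"] in PROTON_DONORS
                and atom["charge"] == 1
                and atom["num_h"] >= 1):
            atom["num_h"] -= 1
            atom["charge"] = 0
            removed += 1
        new_atoms.append(atom)
    return new_atoms, removed


# Ammonium, NH4+, is reduced to NH3 with one proton set aside.
neutral, n_removed = simple_proton_removal(
    [{"element": "N", "charge": 1, "num_h": 4}]
)
```

The count of removed protons is the quantity that would be recorded in the protonation sublayer.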

Remove protons from neutral heteroatoms

If the total charge is positive and the structure has fragments =C-QH, -Q-QH, C-ZH, or =N-QH, then hydrogen atoms are removed from the fragments and replaced with negative charges until either no more hydrogens are available or the charge has been reduced to zero. This is a “simple removal” of a proton [example scheme not reproduced]. If the total charge is still positive, then a “hard proton removal” procedure similar to the previously described one is executed. During this process:
(d) positive charges may be moved between atoms described in 1 (a);
(e) negative charges may be moved between atoms described in 1 (b);
(f) atoms to receive H if the procedure succeeds: Q in -C=Q, =C=Q, =N+=Q, and -N=Q; and
(g) atoms H may be moved between atoms described in 1 (b) except atoms described in (f) above.
If the procedure succeeds, it moves H from atoms described in (g) to atom Q described in (f). After that the H is removed from that Q as a proton, leaving negatively charged O−, thus reducing the positive charge.

Add protons to reduce negative charge

If the total charge is negative or has become negative due to positive charge removal and the structure has fragments =C-Q−, -Q-Q−, C-Z−, or =N-Q−, then protons are added to the fragments, replacing negative charges with atoms H, until the total charge is reduced to a minimum or to zero. This is a “simple addition” of a proton.

If the total charge is still negative then a “hard proton addition” procedure similar to the previously described one is executed. During this process:
(h) positive charges may be moved between atoms described in (a);
(i) atoms to receive negative charge if the procedure succeeds are atoms described in (f);
(j) negative charges may be moved between atoms described in (b) except atoms described in (i) above;
(k) atoms H may be moved between atoms described in (b).

If the procedure succeeds it moves negative charge from atoms described in (j) to atom Q described in (i). After that this negative charge is replaced with atom H, which is equivalent to a proton addition, thus reducing the negative charge.

Analyzing mobile hydrogens and charge

Neutral or singly negatively charged tautomeric atoms and corresponding changeable bonds are detected and marked. Atoms that may exchange hydrogen atoms or negative charges are considered to belong to a “mobile H group”. If positive charges may be moved from an atom as described in the next section, "Moveable positive charge detection", this atom is also considered as possibly tautomeric. Mobile H groups that contain only negative charges are excluded from InChI.

The existence of a ‘protonated’ site is sometimes not readily apparent in a structural drawing. The normalization algorithm is designed to resolve complications that arise from ambiguities introduced at the previous step during “hard” or incomplete “simple” removal or addition of protons, and in the case of charged atoms resembling results of heterolytic dissociation. An example of such ambiguity is shown below: [structure diagram not reproduced]

Simple tautomerism detection

The main layer must be the same for any arrangement of mobile hydrogen atoms. This is achieved by the logical removal of mobile H-atoms and the tagging of H-donor and H-receptor atoms. To identify these H-atoms we have adopted the straightforward varieties of H-transfer tautomerism listed in Table 2 (see also ref. [36]).

Table 2

Tautomerism patterns detected by InChI

M = Q − ZH ↔ MH − Q = Z or M = Q − Z⁻ ↔ M⁻ −Q = Z		M, Z = N^III, O^II, S^II, Se^II, Te^II (Roman superscripts designate chemical valence)
		Q = C, N, S, P, Sb, As, Se, Te, Br, Cl, I
		H = hydrogen, deuterium, or tritium
The “=” bond may be a double bond, a bond in the alternating single/double bond ring, or a “tautomeric” bond (shown in blue)
The H atom below can be replaced with a negative charge
[structure diagrams of the remaining tautomeric patterns not reproduced]

InChI for guanine (optional fixed H layer included) is

InChI=1/C5H5N5O/c6-5-9-3-2(4(11)10-5)7-1-8-3/h1H,(H4,6,7,8,9,10,11)/f/h8,10H,6H2

The layers' meaning is:
/h1H,(H4,6,7,8,9,10,11): atom number 1 has one H; 4 atoms H are shared by atoms 6, 7, 8, 9, 10, and 11
/f/h8,10H,6H2: atom 6 has 2H, atom 8 has 1H, atom 10 has 1H.
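The layer breakdown given for guanine can be reproduced mechanically. The following Python sketch splits an InChI string on "/" and pulls the simple mobile H groups out of an /h layer. The function names `split_inchi_layers` and `mobile_h_groups` are hypothetical helpers written for this illustration; they are not part of the InChI software, and the regular expression covers only plain groups of the form (Hn,a,b,…), not groups that also carry negative charges.

```python
import re

def split_inchi_layers(inchi):
    """Return (version, formula, layers); layers is a list of (prefix, body) pairs."""
    head = "InChI="
    if not inchi.startswith(head):
        raise ValueError("not an InChI string")
    version, formula, *segments = inchi[len(head):].split("/")
    # Every remaining segment begins with a one-letter layer prefix (c, h, f, ...).
    return version, formula, [(seg[:1], seg[1:]) for seg in segments]

def mobile_h_groups(h_body):
    """Extract simple mobile H groups '(Hn,a,b,...)' as (n_hydrogens, [atoms])."""
    groups = []
    for count, atoms in re.findall(r"\(H(\d*),([\d,]+)\)", h_body):
        # A missing count, as in '(H,7,8)', means a single shared hydrogen.
        groups.append((int(count or 1), [int(a) for a in atoms.split(",")]))
    return groups

guanine = ("InChI=1/C5H5N5O/c6-5-9-3-2(4(11)10-5)7-1-8-3"
           "/h1H,(H4,6,7,8,9,10,11)/f/h8,10H,6H2")
version, formula, layers = split_inchi_layers(guanine)
for prefix, body in layers:
    print(prefix, body)
print(mobile_h_groups(layers[1][1]))  # [(4, [6, 7, 8, 9, 10, 11])]
```

The empty body of the /f segment marks the start of the fixed-H part; the /h segment that follows it belongs to that part rather than to the main layer.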

Moveable positive charge detection

Positive charges located on N-atoms are considered moveable along alternating bonds between these atoms. This also applies to phosphorus atoms. Atoms that may exchange positive charges are assigned to a “mobile charge group”. The interference between mobile H and mobile charges may occur. Hypothetical structures (a), (b), and (c) below serve as an illustration. [structure diagrams not reproduced]

Structure (b) was obtained from structure (a) by formally moving the positive charge from left to right along an alternating bond path. This allows the discovery in structure (b) of a tautomeric pattern (highlighted in blue). Bonds that may be changed by moving positive charges are highlighted in green. Structure (c) shows another tautomeric form obtained from structure (b). Note that structure (c) does not allow movement of a positive charge back from right to left. These three structures generate the same standard InChI:

InChI=1S/C6H13N3O/c1-8(2)5-6(10)7-9(3)4/h5H,1-4H3/p+1

but the InChI possessing a FixedH layer for structure (c) differs from those of structures (a) and (b):

(a,b) InChI=1/C6H13N3O/c1-8(2)5-6(10)7-9(3)4/h5H,1-4H3/p+1/fC6H14N3O/h10H/q+1
(c) InChI=1/C6H13N3O/c1-8(2)5-6(10)7-9(3)4/h5H,1-4H3/p+1/fC6H14N3O/h7H/q+1

For the purpose of detecting stereogenic bonds, the algorithm must also provide a means for testing whether a bond order is changeable. InChI assumes that a changeable bond cannot support Z/E stereoisomerism. This is accomplished by introducing fictitious bonds and atoms (used only for internal processing) that represent a mobile H group (red H below) and charge group (red plus below). In the mobile H group fictitious double bonds (red) point to the atom-donors of H or negative charge; in the mobile positive charge group fictitious single bonds point to positively charged atoms. [structure diagram not reproduced]

After the discovery of a new mobile group it is added to the structure. This results in the discovery of changeable bonds. In the case of structure (a), adding a charge group allows one to discover the changeable bond N-C (shown in blue) and, as a result, discover the mobile H group. These processing steps correct for common ambiguities in input information for conjugated systems where Z/E stereochemistry is implied by the drawing, but was not really intended.

Additional normalization

As mentioned above, complications arise from ambiguities introduced at “hard” or incomplete “simple” removal or addition of protons and in the case of charged atoms resembling results of heterolytic dissociation. Since there could be more than one possible set of added/removed proton locations or more than one alternating path for “hard” addition or removal, ambiguities may be introduced. These ambiguities are specifically addressed and in most cases fixed (for the details, see InChI Technical Manual [5]).

Perception of isotopic data

The isotopic structural layer is the most straightforward to compute. In the example below, the isotopic layer is ‘/i1+1,4+1D’. It contains the canonical atom number followed by the isotopic shift (13 – 12 = +1) followed by isotopic hydrogen (D), if present. [structure diagram not reproduced]

The only complexity arises for isotopically labeled hydrogen atoms that can undergo tautomerism. In the mobile H group these hydrogen atoms are treated as non-isotopic; the number of these mobile isotopic hydrogen atoms is appended to the “exchangeable isotopic hydrogen atoms” part of the isotopic layer. The same is done to isotopic hydrogen atoms that may be subject to heterolytic bond dissociation in aqueous solution (for example, D in R-SD). Note that it is possible that two isotopic layers appear, one of which is applied to the main layer with mobile H and another to the main layer without mobile H or to the Fixed-H layer, e.g., “InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/i/hD2/f/h2-3H2/i2D2”.
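As a small illustration of the layout just described (canonical atom number, then isotopic shift, then isotopic hydrogen), the sketch below parses the body of a simple isotopic layer. `parse_isotopic_layer` is a hypothetical helper, not an InChI library call, and it assumes only comma-separated entries of that basic shape; the exchangeable-hydrogen part (such as ‘/hD2’) and any more elaborate entries are deliberately rejected rather than guessed at.

```python
import re

# atom number, optional signed shift, optional D or T with an optional count
_ENTRY = re.compile(r"^(\d+)([+-]\d+)?([DT]\d*)?$")

def parse_isotopic_layer(body):
    """Parse a simple isotopic layer body such as '1+1,4+1D'."""
    result = []
    for entry in body.split(","):
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unsupported isotopic entry: {entry!r}")
        atom = int(match.group(1))
        shift = int(match.group(2)) if match.group(2) else 0
        hydrogens = match.group(3) or ""
        result.append((atom, shift, hydrogens))
    return result

print(parse_isotopic_layer("1+1,4+1D"))  # [(1, 1, ''), (4, 1, 'D')]
print(parse_isotopic_layer("2D2"))       # [(2, 0, 'D2')]
```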

Perception of stereochemical features

The InChI algorithm supports perception of stereochemistry for both 2D (x,y-coordinates of atoms are given; planar depiction) and 3D (x,y,z-coordinates of atoms are given) cases.

For perception of stereo configurations in two-dimensional drawings, the InChI algorithm supports two different systems of wedged and hatched bond interpretations. By default, the convention “narrow end of wedge points to stereocenter” is used. It suggests that the bond affects the stereochemistry of only one atom. Another - “perspective” - system is invoked by selecting the “narrow end of wedge points to stereocenter is OFF”, “/NEWPSOFF”, option. Here a wedged or hatched bond affects the stereochemistry of the two atoms it connects. Both systems assume that the narrow end of the bond is in the plane of the drawing.

In the 3D case, the parity is directly calculated from the atomic coordinates, 2-dimensional Up and Down wedged and hatched bond symbols being ignored. However, “Either” (wavy) bonds in the 3-dimensional case still provide “unknown” stereochemistry.

It is straightforward to calculate stereodescriptors in cases where neighbors to a stereogenic element are not constitutionally identical: the parities are calculated from canonical numbers and geometry. Tetrahedral parity is ‘+’ if the canonical numbers of neighbors increase clockwise when observed from a hydrogen atom or an atom that has the smallest canonical number; parity of a double bond is ‘-’ if neighbors with greater canonical numbers are located on the same side of the bond.

When constitutionally identical neighbors are present, several equivalent canonical numberings are possible. To resolve this ambiguity, the algorithm finds a numbering that minimizes a specific internal representation of the stereo layer. In this case, it is desirable to determine whether a possibly stereogenic element is in fact stereogenic. To determine this, the following heuristic approach is used. A pair of constitutionally identical neighbors (termed right and left neighbors) of a possibly stereogenic element is selected. These two neighbors and adjacent atoms are mapped on their constitutionally equivalent counterparts. After the mapping is complete, the canonical numbers are switched between left and right (this leaves the non-stereochemical part of the identifier unchanged). Stereochemical layers corresponding to these two canonical numberings are compared. If the only change that occurs is to the stereogenic element in question and there are not more than two such constitutionally identical stereogenic elements, then these elements are not marked as stereogenic.

Double bond stereochemistry

When using input originating from drawings, the perception of formal double bonds capable of supporting Z/E isomerism (Table 3) relies on atomic coordinates.

Table 3

Double bonds treated as possibly stereogenic (only one of two atoms connected by a possibly stereogenic double bond is shown)

In alternating single/double bond cyclic systems, bond-finding algorithms determine whether a formal double bond can exist between each of the two attached atoms. If such a bond can be drawn between sp2 hybridized atoms, and the remainder of the π-electron structure can be completed with alternating bonds, that bond is presumed to be a double bond, hence stereogenic (can support Z/E isomerism).

Replacement of adjacent charges with incremented bond orders produces structures with two double bonds connected to a nitrogen atom. In reality, one or both double bonds are in place of a single bond or a bond/charge resonance. The rules for stereogenic bond recognition are summarized in Table 4. Recognized stereogenic bonds are drawn in blue.

Table 4

Detection of stereogenic bonds in =N= fragments

Input fragment(s)	Normalized fragment	Interpreted for stereogenic bond detection as




		No stereo



		No stereo

The InChI supports a ‘not-known’ descriptor for marking double bonds where the Z/E isomer is not certain. That is, the stereolayers would be different for (Z)-but-2-ene, (E)-but-2-ene and but-2-ene.

Tetrahedral stereochemistry

Stereochemical descriptors will be processed for tetrahedral atoms such as C, Si and Ge. InChI recognizes the centers listed in Table 5 as capable of supporting sp3 stereochemistry.

Table 5

Tetrahedral centers treated as possibly stereogenic

An atom or positive ion N, P, As, S, or Se is not treated as stereogenic if it has (a) a terminal H atom neighbor or (b) at least two terminal neighbors, −XHₙ and −XHₘ (n + m > 0), connected by any kind of bond, where X is O, S, Se, Te, or N. Phosphines and arsines are always treated as stereogenic even with H atom neighbors.

The parity of a stereogenic atom is calculated as a volume of an oriented tetrahedron. A wide end of a wedge bond is lifted at an angle of 45° to the plane; a wide end of a hatched bond is lowered at 45° from the plane. Before the volume is calculated all bonds are reduced to the same length. A warning is issued if the central atom is outside of the tetrahedron.

When a complete stereo-description is provided, it is straightforward to derive the InChI for a stereoisomer. Problems may arise for representation of structures that contain inexact stereochemical information. In these cases stereochemical layers of InChI for different input representations of the same substance will match only if they contain precisely the same sets of inexact information. Moreover, stereochemical layers for inexact structures will not match stereochemical layers for a fully described stereoisomer.

Nevertheless, significant interest was expressed for including partial stereochemical information in the InChI. For this purpose, absolute and unknown stereochemical descriptors can be employed, as shown below (left structure is absolute, the C-BrC2H stereocenter in the right structure is unknown): [structure diagrams not reproduced]

Representing relative and absolute stereochemistry of the whole structure is illustrated for tartaric acid (it is known that the structure is described by either structure 1 or 2): [structure diagrams not reproduced]

The identifiers for these structures (case of absolute stereochemistry) are

1. InChI=1S/C4H6O6/c5-1(3(7)8)2(6)4(9)10/h1-2,5-6H,(H,7,8)(H,9,10)/t1-,2-/m1/s1
2. InChI=1S/C4H6O6/c5-1(3(7)8)2(6)4(9)10/h1-2,5-6H,(H,7,8)(H,9,10)/t1-,2-/m0/s1

InChI considers both enantiomers and selects the one that has the “smaller” identifier. /m0 signifies that the selected one has exactly the same stereo arrangement as the input structure; /m1 means that the selected one has the inverse arrangement. /s1 means absolute stereochemistry was requested.

To identify relative stereochemistry the /m segment of the identifier is dropped. As a result the identifiers (case of relative stereochemistry) are the same:

1. InChI=1/C4H6O6/c5-1(3(7)8)2(6)4(9)10/h1-2,5-6H,(H,7,8)(H,9,10)/t1-,2-/s2
2. InChI=1/C4H6O6/c5-1(3(7)8)2(6)4(9)10/h1-2,5-6H,(H,7,8)(H,9,10)/t1-,2-/s2

/s2 means relative stereochemistry.

The molfile structure format supports a special feature, the Chirality Flag. If this flag is set, the tetrahedral stereo is absolute, otherwise relative. The InChI option “Include stereo from chiral flag” (/SUCF command line option) makes InChI calculate tetrahedral stereo according to the Chiral Flag. If the Chiral Flag is set, the “Include stereo from chiral flag” option is used, and InChI finds that the tetrahedral stereo descriptor does not change upon inversion of the structure, the warning "Not chiral" is issued.

Allenes belong to the tetrahedral layer. However, to indicate stereochemistry of allenes in the input molfile a special effort may be required. Namely, the two bonds at the same end of the allene system should be indicated by wedge as stereogenic (and having opposite Up/Down marks). This is a limitation of current InChI software. Cumulenes are treated as double bonds. Table 6 lists the rules used to recognize allenes and cumulenes:

Table 6

Cumulenes treated as possibly stereogenic

Terminal atoms

Middle atoms

Only cumulenes that have 3 double bonds and allenes that have 2 double bonds are treated as possibly stereogenic. Canonicalization of allene and cumulene stereochemistry is performed together with the double bond stereochemistry.

Some limitations of the InChI algorithm of stereo recognition are considered in the InChI Technical Manual [5].
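The oriented-tetrahedron calculation described in this section can be sketched in a few lines of Python. This is an illustrative reading of the text, not the code of the InChI software: the helper names are hypothetical, the four neighbors are assumed to be supplied in ascending canonical-number order, and the sign convention (positive triple product gives ‘+’) is chosen here so that it agrees with the stated rule that the parity is ‘+’ when the remaining neighbors appear clockwise from the lowest-numbered one. The warning for a central atom lying outside the tetrahedron is not implemented.

```python
import math

def unit_bond(center, neighbor, wedge=0):
    """Unit vector along a bond, so that all bonds have the same length.

    2D input (x, y): wedge=+1 lifts and wedge=-1 lowers the wide end by 45 degrees.
    3D input (x, y, z): coordinates are used directly and wedge marks are ignored.
    """
    dx, dy = neighbor[0] - center[0], neighbor[1] - center[1]
    if len(neighbor) == 3:
        dz = neighbor[2] - center[2]
    else:
        dz = wedge * math.hypot(dx, dy)  # tan(45 deg) = 1
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    return (dx / norm, dy / norm, dz / norm)

def signed_volume(p0, p1, p2, p3):
    """Six times the signed volume of the tetrahedron p0 p1 p2 p3 (triple product)."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    c = [p3[i] - p0[i] for i in range(3)]
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            + a[1] * (b[2] * c[0] - b[0] * c[2])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def tetrahedral_parity(vectors):
    """Parity from four bond vectors ordered by ascending canonical number."""
    volume = signed_volume(*vectors)
    if abs(volume) < 1e-9:
        raise ValueError("flat or degenerate arrangement")
    return "+" if volume > 0 else "-"

# 3D example: lowest-numbered neighbor on +z, the other three clockwise seen from it.
center = (0.0, 0.0, 0.0)
neighbors = [(0.0, 0.0, 1.0),
             (0.0, 1.0, -1 / 3),
             (math.sqrt(3) / 2, -0.5, -1 / 3),
             (-math.sqrt(3) / 2, -0.5, -1 / 3)]
vectors = [unit_bond(center, n) for n in neighbors]
print(tetrahedral_parity(vectors))  # +
```

Swapping any two neighbors in the list changes the sign of the volume and therefore flips the parity, which is the behaviour expected of a tetrahedral descriptor.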

Canonicalization

Establishing canonical numbers of chemical graph nodes (atoms) is a problem well-known to chemists, mathematicians and chemoinformaticians and well-discussed in the literature.

For canonicalization that does not involve stereochemistry, the InChI approach is based upon the algorithm by McKay [37,38] (see also explanations in [39]). This algorithm, modified to accommodate the layered structure of InChI, was implemented in the InChI software.

The stereochemical canonicalization is based on an exhaustive mapping of non-stereochemical canonical numbering on the structure using previously found constitutional equivalence of the atoms. It is an iterative process aimed at establishing the smallest internal representation of the stereochemical layer while keeping other previously found layers unchanged. To avoid combinatorial explosion in the case of highly symmetrical structures, the algorithm uses two approaches: (1) elimination of non-stereogenic elements and (2) a backtrack method that prunes the search tree [40].

The canonicalization is performed in stages; each stage adds one more layer to ‘minimize’ while keeping previously found layers unchanged. Figure 10 shows the canonicalization flowchart. As can be seen, the first layer of the Identifier is actually a hydrogenless chemical formula and skeletal connections.

Figure 10

Canonicalization order flowchart.

Notes. Each set of canonical numberings is a subset of the previous one located up the tree. Δ(fixed H) = (number of fixed H on an atom) – (number of H in “mobile-H” structure on the same atom). Names in parentheses, e.g. (Ct_NoH), are names of data structures in the code.

Below is a very brief and simplified description (leaving aside mobile H treatment and technical details; almost all numerical examples below refer to 2-chlorobutane as illustrated by Figure 11).

Figure 11

Establishment of the canonical atomic numbers for 2-chlorobutane (steps a-e are explained in the text).

Step A: hydrogenless constitution

1. The skeletal atoms are labelled with numerical "colors" in the following order of precedence: first, the ordering number of the chemical element in the sequence carbon, other atoms in alphabetic order, bridging hydrogen (in the case of C4H9Cl all C will be given color 1, Cl will be given 2; Figure 11a); second, the number of connections (number of bonds). In 2-chlorobutane CH3CH2CH(Cl)CH3 these are (in brackets): C[1]C[2]C[3](Cl[1])C[1]. The resultant "ordered lists of colors", presented in order of the atoms in the semi-structural formula CH3CH2CH(Cl)CH3, are: C: (1, 1); C: (1, 2); C: (1, 3); Cl: (2, 1); C: (1, 1) (Figure 11b).

2. Atoms are assigned new colors according to lexicographical comparison of the "color lists", in ascending order [for example, (1,1) < (1,2) < (2,1); (1, 2) < (1, 2, 1)] (see Figure 11c):
C: 1, 1 => 2
C: 1, 2 => 3
C: 1, 3 => 4
Cl: 2, 1 => 5
C: 1, 1 => 2
Notice that each color is equal to the number of atoms that have this or a smaller color.

3. Atoms are assigned new "ordered lists of colors": the first in the list is the color of the atom, the rest are the colors of the other atoms connected to this atom, sorted in ascending order (Figure 11d):
C: 2, 3
C: 3, 2, 4
C: 4, 2, 3, 5
Cl: 5, 4
C: 2, 4

4. Atoms are assigned new colors according to lexicographical comparison of the "color lists", in ascending order (Figure 11e):
C: 2, 3 => 1
C: 3, 2, 4 => 3
C: 4, 2, 3, 5 => 4
Cl: 5, 4 => 5
C: 2, 4 => 2

5. Steps 3–4 are repeated until all new colors are different or no more changes occur (for 2-chlorobutane the colors - canonical numbers - have already been found, see Figure 11e). The resultant colors produce a so-called equitable partition, in a way which is conceptually almost the same as the intermediate result of the SMILES-2 algorithm [17].

6. If some of the colors are still identical, then the smallest is picked up and reduced to the previous color + 1. For example, if the colors are (this example does not refer to 2-chlorobutane) 1, 2, 5, 5, 5, 7, 7, then the smallest duplicated color is 5 and the previous color is 2. The color of one of the atoms colored 5 will be reduced from 5 to 2 + 1 = 3.

7. Repeat steps 3–6 until all colors become different (this is almost the same as obtaining the final result of the SMILES-2 algorithm) and save the "connection table". To make the reading easier, the process of obtaining this table (actually, a list of numbers) is split into 3 steps.
The connection table is made out of segments, ordered in ascending order of the color of the first atom in a segment. The number of segments is the number of atoms. Each segment starts with the color of an atom and is followed by a colon and a sorted list of the colors of the atoms connected to it:
1:3; 2:4; 3:1,4; 4:2,3,5; 5:4;
Since this connection table contains each connection 2 times (for example, the bond between the atoms of color 1 and 3 is in the segments "1:3" and "3:1"), it is rewritten by excluding colors that are greater than the first color in the segment:
1; 2; 3:1; 4:2,3; 5:4;
The delimiters are now redundant because the members of each segment are always smaller than the first member of the segment. This is the final connection table to be saved and used later:
1, 2, 3, 1, 4, 2, 3, 5, 4

8. There could be a great deal of arbitrariness in choosing the atom whose color was to be reduced at step 6 (in the example, 3 atoms have color 5; each of them could be chosen). Therefore, repeat step 7 for all possible sequences of choosing the atoms whose color is reduced. Lexicographically compare each obtained connection table to the previously saved one and keep the smallest one together with the assignment of the colors to the atoms. These colors are the canonical numbers for the hydrogenless structure. If two connection tables are identical, then atoms that have the same colors in the two connection tables belong to the same equivalence class; this information is saved and used. The equivalence class is the smallest color in the equivalence group. (This approach may be found in, e.g. [41]. However, the algorithm by McKay implemented in InChI allows one to avoid a combinatorial explosion in typical chemical structures, obtain equivalence classes, and even the order of the permutation group and its generators.)

9. At this time, a canonical numbering (colors) for a hydrogenless structure and the canonical equivalence classes (= the smallest color in each set of equivalent atoms) are obtained. Make new colors out of the canonical equivalence classes and repeat steps 3–8 if these colors are different from the colors previously used at Step 3. Obtain the new minimal connection table. Use these classes as initial colors in the next steps. (If the equivalence classes are, for example, 1, 1, 1, 4, 4, 5, 5, 5 then the corresponding colors are 3, 3, 3, 5, 5, 8, 8, 8.)
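The initial coloring, the refinement loop, the tie-breaking rule and the reduced connection table of Step A can be followed on 2-chlorobutane with a short Python sketch. It is a simplified illustration written for this text, with hypothetical function names; it is not the modified McKay algorithm used in the InChI software, it does not search over alternative tie-breaking choices, and it does not compute equivalence classes. Atoms are indexed in the order of the formula CH3CH2CH(Cl)CH3.

```python
def rank_colors(keys):
    """New color of an atom = number of atoms whose key is <= its own key."""
    return [sum(1 for k in keys if k <= key) for key in keys]

def refine(colors, neighbors):
    """Rebuild color lists and re-rank them until the colors stop changing."""
    while True:
        keys = [(colors[i], *sorted(colors[j] for j in neighbors[i]))
                for i in range(len(colors))]
        new_colors = rank_colors(keys)
        if new_colors == colors:
            return colors
        colors = new_colors

def tie_break_target(colors):
    """Return (smallest duplicated color, reduced value), or None if all differ."""
    duplicated = [c for c in set(colors) if colors.count(c) > 1]
    if not duplicated:
        return None
    dup = min(duplicated)
    previous = max((c for c in colors if c < dup), default=0)
    return dup, previous + 1

def connection_table(colors, neighbors):
    """Flat connection table; assumes all colors are already different."""
    table = []
    for i in sorted(range(len(colors)), key=lambda i: colors[i]):
        table.append(colors[i])
        # keep only neighbors with a smaller color, so each bond appears once
        table.extend(sorted(colors[j] for j in neighbors[i] if colors[j] < colors[i]))
    return table

elements = ["C", "C", "C", "Cl", "C"]
bonds = [(0, 1), (1, 2), (2, 3), (2, 4)]
neighbors = [[] for _ in elements]
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)

# carbon first, then the other elements alphabetically
order = sorted(set(elements), key=lambda e: (e != "C", e))
initial_keys = [(order.index(e) + 1, len(neighbors[i])) for i, e in enumerate(elements)]
first_colors = rank_colors(initial_keys)
colors = refine(first_colors, neighbors)

print(first_colors)                          # [2, 3, 4, 5, 2]
print(colors)                                # [1, 3, 4, 5, 2]
print(connection_table(colors, neighbors))   # [1, 2, 3, 1, 4, 2, 3, 5, 4]
print(tie_break_target([1, 2, 5, 5, 5, 7, 7]))  # (5, 3)
```

Python compares tuples lexicographically and treats a shorter tuple as smaller than a longer one with the same prefix, which matches the ordering of "color lists" used in the text, for example (1, 2) < (1, 2, 1).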

Step B. Add hydrogen atoms to the structure

Use the equivalence classes previously obtained at Step A.9 and use the previously obtained minimal connection table for the comparison. Run Steps A.3-8 with the following difference: each time the connection tables are compared at Step A.8, in the case of identical connection tables also compare the list of terminal atoms H in the following form:

number_of_H(1), number_of_H(2), …, number_of_H(n)

where number_of_H(c) is the number of terminal atoms H attached to the atom that has color c; n = number of atoms.

Save the minimal list of the terminal atoms found this way together with the assignment of the colors to the atoms. Also obtain the equivalence classes as was done earlier.

At this time, the canonical colors (numbering) of the non-isotopic non-tautomeric structure are obtained.

Step C. Add isotopic composition to the structure

If the structure is isotopic, then add one more list to compare when the connection tables and the lists of terminal atoms H are the same:

iso_weight(1), iso_weight(2), …, iso_weight(n)

where iso_weight(c) is the "isotopic weight" of the atom to which the color c was assigned. For each atom the isotopic weight is calculated according to the formula: [formula not reproduced]

where
nH1 = number of terminal atoms of protium attached to the atom
nH2 = number of terminal atoms of deuterium attached to the atom
nH3 = number of terminal atoms of tritium attached to the atom
shift = [(integral) mass of the isotopic atom] - [rounded average atomic mass]

Note: hydrogen H is treated differently from its isotope protium: H has "natural" isotopic composition while protium is treated as an isotopic atom. In the case of an atom that is not isotopic the shift is 0 by definition. If the atom is isotopic and its mass number is greater than or equal to the rounded average atomic mass (that is, the shift is not negative) then the shift is incremented, to avoid shift = 0 for isotopes. If the formula produces iso_weight equal to 0 (the atom and the attached H are not isotopic) then iso_weight(c) is set equal to a very large number that exceeds any iso_weight. This forces isotopic atoms to assume the least possible canonical numbers.

In the case of mobile H the steps are somewhat different, namely:

(m-a) Add a list of only those H that are not mobile (similar to B.1 above) and minimize both the connection table (it will be the same) and the list.

(m-b) Add mobile groups as pseudoatoms connected by directed edges (it means that these pseudoatoms are not included in the connection table segments of the real atoms) to the atoms where the mobile H and possibly negative charges may reside and canonicalize this structure. Numbers of H and (−) in the groups are in one more list to minimize. The result is the Mobile H group canonical numbering and the corresponding equivalence classes, including equivalence classes of the mobile H (and possibly negative charge) groups. Mobile H groups that have only negative charges are not included in this process.

(m-c) Add the isotopic list (similar to C.1 above) to the number of lists to be minimized. Do not include in it the exchangeable isotopic atoms H. The result of the minimization is the Mobile H canonical numbering and equivalence classes for the isotopic structure.

(f-a) For the fixed mobile H (FixedH option) start with the results of (m-a) and add a list of the fixed positions of the mobile H (colors of the atoms where these H reside) and numbers of these atoms H. The result of the minimization is the Fixed-H canonical numbering and equivalence classes.

(f-b) Add the isotopic list (similar to C.1 above). The minimization result is the Fixed-H canonical numbering and equivalence classes for the isotopic Fixed-H structure.

Repeat Step B, adding the list of isotopic weights to those already minimized.

At this point the modified algorithm [37] is finished. It should be pointed out that, for the sake of simplicity, avoiding dependence on the hardware or operating system, and the possibility to reproduce the results "by hand", the efficiency of the original McKay algorithm has been reduced. The greatest impact is due to abandoning hashing for the connection table comparison and introducing lists to be minimized in addition to the connection table. Also, the implemented algorithm for calculating the equitable partition from the given colors is less efficient than the one suggested in ref. [37].

Step D. Stereochemistry

1. For the found canonical colors (numbers), calculate double bond >X=Y< and cumulene >W=X=Y=Z< parities. Namely, for each atom at the ends of the double bond or cumulene, find the atom connected to it by a single bond that has the larger canonical number. If the two atoms found are in "cis" positions then the parity is (−), otherwise the parity is (+). Save the parities list c1[1], c2[1], p[1], c1[2], c2[2], p[2], …, c1[n1], c2[n1], p[n1] arranged in ascending order of the (c1[i], c2[i]) pairs, where n1 = number of possibly stereogenic double bonds and cumulenes; c1[i] > c2[i] are the colors of the atoms at the ends of a double bond or cumulene; p[i] is the parity ("u" > "?" > "+" > "-"). The precedence order is determined as follows. Let a1 > a2 and b1 > b2 be the colors of the atoms for two double bonds, (a1,a2) and (b1,b2). We consider that (a1,a2) > (b1,b2) if and only if (i) a1 > b1 or (ii) a1 = b1 and a2 > b2.

2. For each allene >X=Y=Z<, consider a tetrahedron that has as its apices the four atoms connected by single bonds to the allene atoms X and Z. Look at the other apices from the apex that has the smallest canonical number and consider the canonical numbers of these three other apices arranged in ascending order. If this order is clockwise then the parity is (+), otherwise it is (−). Save the parities list c[1], p[1], c[2], p[2], …, c[n2], p[n2] arranged in ascending order of c[i], where n2 = number of possibly stereogenic allenes; c[i] are the colors of atoms Y; p[i] are the parities.

3. For each possibly stereogenic atom, consider a tetrahedron that has as its apices the four atoms connected to this possibly stereogenic atom. If, looking at the other apices from the apex that has the smallest canonical number, the canonical numbers of these three other apices are arranged in ascending order clockwise, then the parity is (+), otherwise it is (−). Save the parities list c[1], p[1], c[2], p[2], …, c[n3], p[n3] arranged in ascending order of c[i], where n3 = number of possibly stereogenic atoms; c[i] are the colors of the atoms; p[i] are the parities.

Note. Terminal hydrogen atoms do not have colors (canonical numbers). In parity calculations, hydrogen atoms are assumed to have colors less than the smallest color of other atoms, that is, less than 1. The values of their colors c are assumed to be: c[H] < c[protium] < c[deuterium] < c[tritium] < 1. In the special case of all four hydrogen atoms connected to the same atom, the atom is not stereogenic. In the case of a tetrahedral atom that has only 3 bonds (for example, >S=O or >N-), the direction of the lone electron pair is used as one more bond; c[lone pair] < c[H].

4. Repeat steps 1–3 for all other mappings of the canonical numbers on the atoms that produce the same results as in Step B or C, and find the mapping(s) that produce the lexicographically smallest result in this order of the lists: D.1, D.2, D.3.

5. To each result apply a heuristic to detect possibly stereogenic elements that in reality are not stereogenic; if such elements have been found then remove their parities and repeat D.1-4.

6. Repeat steps 1–4 for the spatially inverted structure. Accept the one that has the smaller stereo (the D.1 stereo should be the same, except in cases of constitutionally identical neighbors differing in tetrahedral stereochemistry). Set the "inverted" flag if the inverted stereo was selected.

This procedure is applied to all connected components of the whole structure.
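The clockwise/counterclockwise rule for a tetrahedral center can be sketched with a signed-volume test. The example below assumes that exact 3D coordinates are available for four explicit neighbors, each already carrying its canonical number. The function name `tetrahedral_parity` and the input layout are hypothetical, and the sketch does not cover implicit hydrogens, lone pairs, 2D wedge drawings, or the "u"/"?" parities.

```python
def tetrahedral_parity(neighbors):
    """neighbors: four (canonical_number, (x, y, z)) tuples.

    Returns '+' if, viewed from the apex with the smallest canonical number,
    the other three apices taken in ascending order run clockwise, else '-'.
    """
    ordered = sorted(neighbors, key=lambda item: item[0])
    (_, a), (_, b), (_, c), (_, d) = ordered
    # Vectors from the viewing apex to the other three apices
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    # Signed volume u . (v x w): positive means clockwise as seen from apex a
    cross = (
        v[1] * w[2] - v[2] * w[1],
        v[2] * w[0] - v[0] * w[2],
        v[0] * w[1] - v[1] * w[0],
    )
    volume = sum(u[i] * cross[i] for i in range(3))
    if volume == 0:
        raise ValueError("apices are coplanar; parity is undefined")
    return "+" if volume > 0 else "-"

# Apex 1 sits above the plane of apices 2, 3, 4 (hypothetical coordinates).
original = [(1, (0, 0, 1)), (2, (1, 0, 0)), (3, (0, 1, 0)), (4, (-1, -1, 0))]
# Spatial inversion: mirror every coordinate through the origin.
inverted = [(n, (-x, -y, -z)) for n, (x, y, z) in original]
print(tetrahedral_parity(original), tetrahedral_parity(inverted))
```

Inverting the structure flips the sign of the volume and therefore the parity, which is why the procedure compares the stereo lists of the original and the inverted structure.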

Serialization

The sequence of InChI layers and sublayers is strictly determined; generating the corresponding sequence of characters is mostly a technical issue that does not need specific comment. Some complication does arise from the fact that InChI may represent a substance composed of several connected components not bound to each other; a definite order of these components must be established. For this purpose, the components are sorted in descending order (the "greater" component appears first in the InChI string) using a comparison function that takes into account numerous details of the component data.

For convenience, the InChI Software may complement an InChI string with an Auxiliary Information (“AuxInfo”) line that provides some explanation of the InChI content and its relation to the input parent structure. In particular, AuxInfo contains the mapping of canonical numbers onto original atom numbers, information on the detected constitutional equivalence of atoms and mobile H groups, the stereo of the inverted structure, and reversibility information that is sufficient to recalculate the Identifier and, in the case of input from a Molfile, to reconstruct the Molfile (except for the order of the bonds). The exact AuxInfo layout may be found in the InChI Technical Manual [5].

Generation of InChIKey

An InChIKey is generated from the corresponding InChI string. The first step is pre-processing. The very beginning of the source string, either “InChI=1/” or “InChI=1S/”, is removed, while the Standard/non-standard nature of the identifier is recorded in the flag character of the InChIKey. The string obtained is split into three parts. Part #1 is the leading substring that comprises the formula, connectivity (prefixed by ‘/c’), and hydrogens (‘/h’) sublayers. Part #2 is the protonation layer, prefixed by ‘/p’. Part #3 is all the rest of the string.

After the pre-processing, hash codes of parts #1 and #3 are calculated separately; they are then encoded using a base-26 scheme into the first and second blocks of the InChIKey. The content of part #2 is not hashed; instead, it is used to select an appropriate character for the protonation/deprotonation flag.
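The pre-processing step can be sketched as simple string handling. The function name `split_inchi_for_key` is hypothetical, and the sketch assumes a well-formed version 1 string with a non-empty formula; special cases such as the reconnected layer are not handled, so it should not be taken as the reference implementation.

```python
def split_inchi_for_key(inchi):
    """Return (is_standard, part1, part2, part3) for an InChI string."""
    if inchi.startswith("InChI=1S/"):
        is_standard, body = True, inchi[len("InChI=1S/"):]
    elif inchi.startswith("InChI=1/"):
        is_standard, body = False, inchi[len("InChI=1/"):]
    else:
        raise ValueError("not an InChI version 1 string")

    layers = body.split("/")
    # Part 1: formula plus the immediately following /c and /h sublayers
    part1 = [layers[0]]
    index = 1
    while index < len(layers) and layers[index][:1] in ("c", "h"):
        part1.append(layers[index])
        index += 1
    # Part 2: the protonation layer; Part 3: everything else
    part2, part3 = "", []
    for layer in layers[index:]:
        if layer.startswith("p") and not part2:
            part2 = layer
        else:
            part3.append(layer)
    return is_standard, "/".join(part1), part2, "/".join(part3)

# Standard InChI of L-alanine: no /p layer, stereo layers go to part 3.
alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
print(split_inchi_for_key(alanine))
```

Only parts #1 and #3 would then be passed to the hash function, while part #2 and the Standard flag select single characters of the key.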

Encoding

The hash code is a sequence of bits. It is represented, in InChIKey, by uppercase English letters (base-26 encoding). This choice is intentional. Of course, the hash code may be expressed by letters, digits, their mix, or even with bare 0’s and 1’s. However, a particular representation may influence utility for applications like publishing or internet search. Web search engines may tend to break the text "on the border" between letters and non-letters, trying to detect "words", since the words of human languages do not contain digits or punctuation marks. Though the exact behavior may vary, in general it is more robust to proceed with nothing but letters. Using only letters increases the chances that a search engine will consider an InChIKey as a single "word" (or phrase) and index it as such. The robust approach also assumes the use of only upper-case letters, to avoid possible confusion.
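As an illustration of letters-only encoding, the sketch below writes an integer as a fixed number of uppercase letters using plain positional base-26. This is an assumption made for the example: the InChI Software uses its own encoding scheme, and the helper name `encode_base26` is hypothetical. The final line checks that 14 and 8 letters, the lengths of the hash portions of the first and second InChIKey blocks, are enough to hold 65 and 37 bits.

```python
import string

def encode_base26(value, n_letters):
    """Encode a non-negative integer as exactly n_letters uppercase letters."""
    if value < 0 or value >= 26 ** n_letters:
        raise ValueError("value does not fit in the requested number of letters")
    letters = []
    for _ in range(n_letters):
        value, digit = divmod(value, 26)
        letters.append(string.ascii_uppercase[digit])
    # Digits were produced least-significant first, so reverse them
    return "".join(reversed(letters))

print(encode_base26(0, 3), encode_base26(27, 3))
# Capacity check: letters available versus bits to be stored
print(26 ** 14 >= 2 ** 65, 26 ** 8 >= 2 ** 37)
```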

Hash codes

InChIKey hash codes are calculated using the SHA-256 cryptographic hash function of the SHA-2 family [42]. Internally, the full 256-bit codes are calculated; they are then truncated to ensure the compactness of the InChIKey. (Note that truncation of the hash code is explicitly allowed by the SHA-2 description.) The hash code going to the first block (representing the molecular skeleton, or connectivity) is truncated to 65 bits, and the hash code going to the second block (stereo/protonation/isotopic substitution isomers) to 37 bits. A cryptographic hash function is used in order to increase the chances that collision resistance will be as close to the theoretical limit as possible. However, due to the very essence of hash functions, collisions (the same InChIKey for different InChIs/structures) are unavoidable in very large collections.
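A minimal sketch of the truncation, using Python's standard `hashlib`, is given below. Which bits of the digest are retained, and in what order, is not specified in this section; taking the leading bits is an assumption made purely for illustration, and `truncated_sha256` is a hypothetical helper name.

```python
import hashlib

def truncated_sha256(text, n_bits):
    """Return the leading n_bits of SHA-256(text) as an integer (assumed layout)."""
    digest = hashlib.sha256(text.encode("ascii")).digest()
    as_int = int.from_bytes(digest, "big")
    return as_int >> (256 - n_bits)

# Part #1 of the Standard InChI of ethanol, hashed to 65 bits.
first_block_bits = truncated_sha256("C2H6O/c1-2-3/h3H,2H2,1H3", 65)
# An empty part #3 (no stereo or isotopic layers), hashed to 37 bits.
second_block_bits = truncated_sha256("", 37)
print(first_block_bits.bit_length() <= 65, second_block_bits.bit_length() <= 37)
```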

Collision resistance

As was mentioned earlier, a single InChIKey may occasionally map to two or more InChI strings, due to hash code collisions. This is unavoidable even for a perfect hash function, so the only viable approach is to set and keep a level of collision resistance regarded as sufficient for typical applications. When InChIKey was designed, this level was set at the size characteristic of the largest available real-world molecular databases, ≈(50–100) × 10⁶ molecular skeletons; for stereoisomers/isotopomers/tautomers, the practical goal was to avoid collisions up to several thousand isomers for a given skeleton; some margin of safety was also presumed.

Another design goal was to keep InChIKey reasonably short. Balancing and testing determined the current choice of the length of the hash codes in the first and the second blocks as 65 and 37 bits, respectively.

The estimated level of collision resistance was published when InChIKey was introduced in 2007, and the statement that accompanied this release was: “A theoretical – optimistic – estimate of collision resistance (i.e., the minimal size of a database at which a single collision is expected, that is, an event of the two hashes of two different InChI strings being the same) is 6.1 × 10⁹ molecular skeletons × 3.7 × 10⁵ stereo/isotopomers per skeleton ≈ 2.2 × 10¹⁵. To exemplify: the probability of a single first block collision in a database of 1 billion compounds is 1.3%. In other words, a single first block collision is expected in 1 out of 100/1.3 = 75 databases of 10⁹ compounds each. For 10⁸ (100 million) compounds in a database this probability is 0.014%.”

Since 2007, two cases of InChIKey collisions have been reported, which prompted us to investigate whether the initial estimates are valid. This work is described in a dedicated paper in the Journal of Cheminformatics [43], to which the reader is referred for detail. We only quote the conclusion: “the observed statistical characteristics of InChIKey collision resistance are in good agreement with theoretical expectations… the current design and implementation seem to meet their goals”.
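The quoted figures are consistent with the usual birthday-bound approximation for an ideal hash of 65 and 37 bits. The sketch below assumes that this is how the estimates were obtained (the section does not state the formula), and the function names are hypothetical.

```python
import math

def collision_threshold(n_bits):
    """Collection size at which about one collision is expected: sqrt(2**n_bits)."""
    return math.sqrt(2.0 ** n_bits)

def collision_probability(n_items, n_bits):
    """Birthday-bound approximation 1 - exp(-n^2 / 2^(n_bits + 1))."""
    return -math.expm1(-(n_items ** 2) / 2.0 ** (n_bits + 1))

skeletons = collision_threshold(65)    # first block, about 6.1e9
isomers = collision_threshold(37)      # second block, about 3.7e5
print(f"{skeletons:.2g} x {isomers:.2g} = {skeletons * isomers:.2g}")
print(f"{collision_probability(1e9, 65):.2%}")   # close to the quoted 1.3%
print(f"{collision_probability(1e8, 65):.4%}")   # close to the quoted 0.014%
```

The approximation assumes that hash values are uniformly distributed, which is the "optimistic" part of the quoted estimate.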

Options available for InChI generation and behavior of InChI algorithms

Note that all switches that modify the InChI act by appending layers, not by altering the core InChI.

Structure perception options

These are drawing style/edit flags that affect the interpretation of the input structure. It is assumed that the user may deliberately use these options to take into account specific features of structure collections. As a result, the perception options may be used for generating Standard InChI without the loss of its "standardness". The full list of perception options is as follows: DoNotAddH; SNon; NEWPSOFF.

DoNotAddH

By default, InChI Software assumes that the input structure may contain "implied" hydrogen atoms and adds hydrogen atoms to eligible atoms to satisfy standard valences. Sometimes, this may produce undesirable results. The DoNotAddH option instructs the software to skip the addition of hydrogen atoms.

SNon

This option means that input stereo information (whatever it is and by whatever means it is represented) is completely ignored. That is, an InChI generated with the SNon option intentionally lacks stereo layer(s). Note that SNon is a "perception option"; therefore, it may be used in the generation of Standard InChI without the loss of its "standardness".

NEWPSOFF

By default, when InChI Software analyzes the effect of a wedged bond on the stereo configuration of a tetrahedral stereogenic atom, it assumes that the stereo configuration is affected by only those wedged bonds that have the narrow end pointing to the stereogenic atom in question. To use the alternative definition, where a wedged bond affects the stereo configurations of both atoms it connects, one may use the option NEWPSOFF ("Narrow End of Wedge Points to Stereo is OFF"). Note that NEWPSOFF is a "perception option"; therefore, it may be used in the generation of Standard InChI without the loss of its "standardness".

Stereo interpretation options

These are several options that modify the interpretation of input stereochemical data. In principle, they would be considered related to structure perception. However, as the Standard InChI, by definition, requires the use of absolute stereo (or no stereo at all), these 'stereo interpretation' options assume generation of non-standard InChI. The stereo interpretation options are: SRel; SRac; SUCF.

SRel assumes that the compound is a single enantiomer but its absolute configuration is not known.

SRac assumes that the compound is a 1:1 mixture of enantiomers.

One more stereo interpretation option, SUCF, only applies to Molfiles in which the CHIRAL flag is set. By default this is set to 0/off. The combinations are:

SUCF on, CHIRAL 1 => absolute stereo (default for InChI Software)
SUCF on, CHIRAL 0 => relative stereo
SUCF off, CHIRAL ignored => absolute stereo (the InChI Software default)

Note that many drawing programs do not allow the user to specify the chiral flag, so the information is very variable. It is more likely that the maintainer of a collection will know whether some or all of the compounds are of known chirality.

Note that any of the above options makes a non-standard InChI. Even if the compound is not chiral and, as a result, the InChI does not have /m and /s segments, any of these options makes the InChI non-standard.
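The three SUCF/CHIRAL combinations amount to a small decision rule, sketched below. The function name `stereo_interpretation` and its boolean/integer arguments are hypothetical and only mirror the combinations listed in this section; they are not parameters of the InChI API.

```python
def stereo_interpretation(sucf_on, chiral_flag):
    """Return 'absolute' or 'relative' for a Molfile with the given CHIRAL flag."""
    if not sucf_on:
        # Without SUCF the CHIRAL flag is ignored and absolute stereo is assumed
        return "absolute"
    return "absolute" if chiral_flag == 1 else "relative"

for sucf_on in (True, False):
    for chiral_flag in (1, 0):
        label = "on" if sucf_on else "off"
        print(f"SUCF {label}, CHIRAL {chiral_flag}:",
              stereo_interpretation(sucf_on, chiral_flag))
```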

InChI creation options

The 'InChI creation' options affect what the InChI algorithm does, not just the structure perception. They modify the defaults specified for Standard InChI and significantly affect the result (e.g., additional InChI layers may appear). Using any of the creation options (SUU; SLUUD; RecMet; FixedH; KET; 15T) makes the resulting identifier non-standard.

SUU. By default, InChI Software does not include an unknown/undefined stereocenter in the Identifier unless at least one defined stereo feature is present in the input structure. The SUU ("always Show Unknown or Undefined stereo") option is intended to alter this behavior. Using SUU results in the inclusion of unknown/undefined stereo in all cases. Note that SUU is an 'InChI creation' option; therefore, it makes a non-standard InChI even if there are no unknown or undefined stereo elements in the structure.

The RecMet option appends the metal-reconnected layer (/r) to the InChI, adding to the identifier the ability to distinguish metal-bonding isomers. This is an 'InChI creation' option; therefore, it makes a non-standard InChI.

The FixedH option appends an additional fixed hydrogen layer (/f) to the InChI, adding to the identifier the ability to distinguish tautomers. Note that FixedH is an 'InChI creation' option; therefore, it makes a non-standard InChI.

Software

The reference implementation of the InChI algorithm is “InChI Software”, the software package distributed (and periodically updated) by the InChI Trust with the approval of IUPAC. The package contains both stand-alone executables and the API (Application Programming Interface) library for InChI generation.

The current InChI Software version 1.04 (the major version of the software is always the version number of the identifier, e.g., 1 for now) includes:

inchi-1: a 'command line' InChI generator, available in 32- and 64-bit versions for MS Windows and Linux;
libinchi: the InChI API library, available in 32- and 64-bit versions for MS Windows (dll) and Linux (.so library);
winchi-1.exe: a graphical Windows application, which provides annotated InChI and AuxInfo together with a graphic representation of the original and the normalized, canonicalized chemical structure annotated with InChI-related information.

The inchi-1 executable has a normative role, i.e., it acts as the final arbiter: by definition, the reference InChI for any molecule is the InChI generated with inchi-1.

A distribution package of InChI Software also includes source code for all programs, examples of calling the InChI library, sample molfiles and SDfiles, etc. The source code (all written in pure C) is the ultimate resource for InChI algorithms in maximum detail.

Usage of InChI Software is documented in the User Guide [33] and the API Reference [44] (intended for developers who use the InChI library).

InChI Software allows one to produce both Standard and non-standard InChIs, as well as their hashed representations, InChIKeys. By default, the Standard versions are produced. This behavior is modified through the use of special options ('options' are command-line switches for an executable; they are mirrored by the input parameters of InChI API procedures). If at least one of the options may result in a non-standard InChI, non-standard identifiers are produced.
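The rule that decides between Standard and non-standard output can be summarized in a short sketch. The option names and the two prefixes are those given in this paper; the helper functions `is_standard` and `inchi_prefix`, the exact-case matching, and the set-based bookkeeping are assumptions made for illustration and are not part of the InChI API.

```python
PERCEPTION_OPTIONS = {"DoNotAddH", "SNon", "NEWPSOFF"}
STEREO_INTERPRETATION_OPTIONS = {"SRel", "SRac", "SUCF"}
CREATION_OPTIONS = {"SUU", "SLUUD", "RecMet", "FixedH", "KET", "15T"}

def is_standard(options):
    """True if the given option names still yield a Standard InChI."""
    chosen = set(options)
    known = PERCEPTION_OPTIONS | STEREO_INTERPRETATION_OPTIONS | CREATION_OPTIONS
    unknown = chosen - known
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    # Any stereo interpretation or creation option makes the result non-standard
    return not (chosen & (STEREO_INTERPRETATION_OPTIONS | CREATION_OPTIONS))

def inchi_prefix(options):
    """Prefix expected at the start of the generated identifier."""
    return "InChI=1S/" if is_standard(options) else "InChI=1/"

print(inchi_prefix(["SNon", "DoNotAddH"]))   # perception options only
print(inchi_prefix(["SNon", "FixedH"]))      # one creation option is enough
```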

Licensing and use of InChI Software

InChI Software is currently distributed under the IUPAC/InChI-Trust InChI Licence No. 1.0 [45]. Everybody may read the executable/library code and examine the details of InChI algorithms and their implementation. Everybody may also freely use the executable or call InChI Software API procedures from within other software.

However, a necessary note is that InChI, by intention, is assumed to have only a single software implementation: the reference implementation provided by IUPAC and the InChI Trust (as concerns both the stand-alone executable inchi-1 and the API library, libinchi). Modification of the reference source code is not prohibited, see [45]. However, such modification invalidates the status of the identifier produced by the resulting software as the standard, “IUPAC International Chemical Identifier, InChI”. This means that everybody may modify and use InChI Software source code in other projects (e.g., for the canonicalization of chemical structures in a non-InChI context). However, no “de novo”/“alternative”/“independent” implementation of InChI is expected.

This approach serves to ensure the standard character of InChI and to avoid the common disaster of conflicting forks/implementations of formally the same “standard”.

Known problems and limitations

The lack of coverage and the limitations for many areas of chemical structures have been noted above. While many of the primary limitations are now being addressed by the various IUPAC Division VIII working groups, some of the most often cited comments are as follows:

Standard InChI only distinguishes some types of stereochemistry (e.g., cis/trans-platinum structures have the same InChI).
InChI currently does not handle mixtures well (e.g., stoichiometry, positional isomers, variable bonding situations, polymers).
InChI is not a file format (the conversion structure -> InChI -> structure can, in a few cases, provide undesirable results).
InChIKey, the hashed InChI, is limited, in very few cases to date, in terms of the variations it can support (i.e., collisions of multiple InChIs to one InChIKey).
InChI does not yet work for large drug molecules (e.g., antibodies with hundreds of amino acids).
Standard InChI does not handle all tautomers well (1–5, 1–7, 1–9, … hydrogen shifts, etc.); this is now being addressed by a new working group.
Standard InChI does not honor bonds to metals.
InChI is difficult for humans to read.

Future prospects

As noted in this manuscript, there are many areas of chemistry that need to be and will be addressed. The reader is encouraged to visit the IUPAC [1] and InChI Trust [3] web sites to learn about current and future plans for the expansion and extension of InChI.

Conclusions

InChI is the International Chemical Identifier developed under the auspices of IUPAC with principal contributions from NIST and the InChI Trust. It is a non-proprietary, Open Source chemical identifier possessing the following principal features:

structure-based approach;
strict uniqueness of the identifier;
applicability to the entire domain of “classic organic chemistry” and, to a significant extent, to inorganic compounds;
ability to generate the same InChI for structures drawn under (reasonably) different styles;
hierarchical, layered approach allowing the molecular structure to be encoded with different levels of “granularity”/different sets of layers (a Standard InChI is specifically created for inter-operability);
InChI is complemented by its counterpart compact (hashed, fixed-length) representation, the InChIKey.

To date, InChI and InChIKey have proved to be useful tools for linking various pieces of chemical information.

Endnotes

a. As one of the anonymous referees has correctly pointed out, there is a difference between “substance” and “chemical substance”. For short, we use in this paper the term “substance” always to mean “chemical substance”, in the sense of the IUPAC Gold Book definition, see below.

b. The term “InChI” is used in the following text not only for the “International Chemical Identifier” itself but also for “InChI algorithms and software” (whence statements like “InChI removes hydrogen…”); the exact meaning is evident from the context.

c. The result was obtained using InChI Software v. 1.03 and StdInChI; round-trip conversion experiments are documented in the InChI v. 1.03 Software Release Notes, available on the InChI Trust Download page [6] as a part of the v. 1.03 documentation.

d. The idea of hashing InChI was suggested by Simon Quellen Field at a Google seminar on InChI in 2006.

e. Please note that the original version of this brief description (previously posted by one of the authors (DT; http://sourceforge.net/p/inchi/mailman/message/1619786/) to the inchi-discuss mailing list and then republished for the wider community by Apodaca in his “Depth-First” blog, http://depth-first.com/articles/2006/08/12/inchi-canonicalization-algorithm/) did contain some minor typos.


