
BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules.

Tzyy-Shyang Lin¹, Connor W Coley¹, Hidenobu Mochigase¹, Haley K Beech¹, Wencong Wang¹, Zi Wang², Eliot Woods³, Stephen L Craig², Jeremiah A Johnson¹, Julia A Kalow³, Klavs F Jensen¹, Bradley D Olsen¹.

Abstract

Having a compact yet robust structurally based identifier or representation system is a key enabling factor for efficient sharing and dissemination of research results within the chemistry community, and such systems lay down the essential foundations for future informatics and data-driven research. While substantial advances have been made for small molecules, the polymer community has struggled in coming up with an efficient representation system. This is because, unlike other disciplines in chemistry, the basic premise that each distinct chemical species corresponds to a well-defined chemical structure does not hold for polymers. Polymers are intrinsically stochastic molecules that are often ensembles with a distribution of chemical structures. This difficulty limits the applicability of all deterministic representations developed for small molecules. In this work, a new representation system that is capable of handling the stochastic nature of polymers is proposed. The new system is based on the popular "simplified molecular-input line-entry system" (SMILES), and it aims to provide representations that can be used as indexing identifiers for entries in polymer databases. As a pilot test, the entries of the standard data set of the glass transition temperature of linear polymers (Bicerano, 2002) were converted into the new BigSMILES language. Furthermore, it is hoped that the proposed system will provide a more effective language for communication within the polymer community and increase cohesion between the researchers within the community.


Year: 2019; PMID: 31572779; PMCID: PMC6764162; DOI: 10.1021/acscentsci.9b00476

Source DB: PubMed; Journal: ACS Cent Sci; ISSN: 2374-7943; Impact factor: 14.553

Introduction

Line notations that encode the connectivity of a molecule into a line of text are a very popular choice for storing chemical structures owing to their memory compactness, their simultaneous human readability and machine-friendliness, and their compatibility with most software and input systems.[1] In synergy with the advances in machine learning and data mining algorithms, a good line notation can enable data-driven research and materials discovery.[2−4] For small molecules, many line notations have been developed, including the simplified molecular-input line-entry system (SMILES),[5,6] the SYBYL line notation (SLN),[7] the Wiswesser line notation (WLN),[8] ROSDAL,[9] the modular chemical descriptor language (MCDL),[10] or more recently the international chemical identifier (InChI).[11] Among them, SMILES is the most popular linear notation, and it is generally considered the most human-readable variant, with by far the widest software support.[1] In practice, SMILES provides a simple set of representations that are suitable as labels for chemical data and as a memory compact identifier for data exchange between researchers. Moreover, SMILES and its extensions serve as descriptive codes that allow rapid generation of graphical objects that could be searched for chemical structures with tools such as Open Babel.[12] Furthermore, as a text-based system, SMILES is also a natural fit to many text-based machine learning algorithms. When combined with string kernels, SMILES strings can be used with kernelized learning methods such as the support vector machine.[13] These superior characteristics have made SMILES a perfect tool for translating chemistry knowledge into a machine-friendly form, and it has been successfully applied for small molecule property prediction[14−16] and computer-aided synthesis planning.[2,17,18] However, polymers have resisted description by these structural languages. 
This is because most structural languages such as SMILES have been designed to describe molecules or chemical fragments that are well-defined atomistic graphs. Since polymers are stochastic molecules, they do not have unique SMILES representations. As discussed by Audus and de Pablo in a recent viewpoint,[19] this lack of a unified naming or identifier convention for polymer materials is one of the major hurdles slowing down the development of the polymer informatics field. While pioneering efforts on polymer informatics such as the Polymer Genome project[20] have demonstrated the usefulness of SMILES extensions in polymer informatics, the fast development of new chemistry and the rapid development of materials informatics and data-driven research make the need for a universally applicable naming convention for polymers ever more urgent.[20−28] Recently, several notable schemes have been proposed as potential solutions to this issue: the hierarchical editing language for macromolecules (HELM) developed by the Pistoia Alliance,[29] the International Union of Pure and Applied Chemistry (IUPAC) international chemical identifier (InChI),[11] and the CurlySMILES language,[30] an extension of SMILES that aims to provide support for polymers, composite materials, and crystals. However, while HELM is useful in describing macromolecules and biopolymers with well-defined structures, it is not designed to capture the stochastic nature of polymers. On the other hand, InChI is not specifically designed for polymers and does not support branched polymers, and CurlySMILES primarily focuses on polymers where structural features, such as the head–tail configuration, are already well-defined. Moreover, CurlySMILES requires the introduction of many new parameters to accompany its annotation syntax, which significantly increases the complexity of the language and reduces its readability. Finally, CurlySMILES does not support the encoding of randomly branched polymers.
As such, the need for a flexible structurally based identifier system that supports a wide variety of polymeric structures remains pressing. Here, a new structurally based construct is proposed as an addition to the highly successful SMILES representation that can treat the stochastic nature of polymer materials. Since polymers are high-molar-mass molecules, this construct is named BigSMILES. In BigSMILES, polymeric fragments are represented by a list of repeating units enclosed by curly brackets. The chemical structures of the repeating units are encoded using normal SMILES syntax, but with additional bonding descriptors that specify how different repeating units are connected to form polymers. As depicted in Figure 1, this simple design of syntax enables the encoding of macromolecules over a wide range of different chemistries, including homopolymers, random copolymers, and block copolymers, and a variety of molecular connectivities, ranging from linear polymers to ring polymers to even branched polymers. Except for a handful of additional rules and operators, BigSMILES follows the same syntax as the original SMILES. This means that, as in SMILES, BigSMILES representations are compact, self-contained text strings. Furthermore, a multitude of polymer structures that are more complicated than the examples schematically illustrated in Figure 1 can be constructed through the composition of three new basic operators and original SMILES symbols. This is demonstrated in detail along with the discussion of BigSMILES syntax in the next section.

Figure 1

Schematic illustration of the syntax of BigSMILES and some of the structures that can be encoded using BigSMILES. Polymeric fragments are represented by a list of repeating units enclosed within curly brackets. Repeating units are composed of SMILES strings (represented by the red circles and blue squares in the left panel) with additional descriptors (black structures on the left panel) that give the details of the connectivity pattern between repeating units.

Syntax

The major extension of BigSMILES to SMILES is the introduction of an additional stochastic object that represents a fragment of a molecule that is intrinsically stochastic in its structure. Unlike small molecules, for which each string corresponds to a single chemical structure, references to the BigSMILES string refer to a group of molecules that have a distribution of structural features and properties. In analogy with statistical physics, this ensemble of polymer molecules consists of many molecular states, where each molecular state is an arrangement of atoms into a molecule that could possibly be realized. Each molecular state has some probability of occurrence, which is determined by the specific rules of chemistry governing the system as the molecules were formed. The rules determining the probability of observing a given molecule in a polymer may be extremely complex, involving changes in the probability of forming a given molecular configuration as a function of both time and space within a reactive system. While the exact quantification of structural features and properties can be difficult, the monomer and mechanism of polymerization used restrict the set of possible chemical structures present in the ensemble based on the generally known rules of connectivity. Exploiting this, a stochastic object is defined as a machine-friendly representation of the molecular ensemble, without specifying the probability of occurrence of any individual molecular state. The stochastic objects resemble the widely used structural formula representation that is commonly used to describe macromolecules. The object is identified by a pair of curly brackets around a comma-delimited list of repeating units of the polymer: The curly bracket is used to avoid conflict with other notation in the existing SMILES syntax. 
In this representation, each repeating unit within the object essentially resembles the repeating units that are bracketed within the parentheses in a structural formula. The comparison between a BigSMILES stochastic object and a corresponding structural formula is illustrated in Figure 2. In BigSMILES, the entire object, which is bracketed by the curly brackets, symbolizes a piece of a molecular fragment that has a random structure. Since BigSMILES is an extension of the SMILES language, the SMILES syntax for specifying chiral centers (namely, the use of “@” and “@@” symbols), aromatic atoms (the use of lowercase letters), electric charges, ring closures, and many other features[5] is retained to provide means for encoding detailed molecular structures. Section SVI (p S17) and Section SVII (p S18) in the SI contain more details on the treatment of tacticity and polyelectrolytes.
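The comma-delimited list inside the curly brackets can be split mechanically. The following is a minimal sketch, not a reference parser: the function name `split_repeat_units` and the example string (an ethylene/vinyl acetate object written with bare “$” descriptors) are illustrative assumptions, and the only rule applied is that commas inside nested curly brackets belong to the inner object.

```python
def split_repeat_units(stochastic_object: str) -> list[str]:
    """Split '{unit1,unit2,...}' into its top-level repeat unit strings.

    Hypothetical helper: commas inside nested curly brackets are kept
    with the nested object instead of being treated as separators.
    """
    text = stochastic_object.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("a stochastic object must be wrapped in curly brackets")

    units: list[str] = []
    current: list[str] = []
    depth = 0  # nesting depth relative to the outer brackets
    for ch in text[1:-1]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced curly brackets")
        if ch == "," and depth == 0:
            units.append("".join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced curly brackets")
    units.append("".join(current))
    return units


# Illustrative (assumed) string for an ethylene / vinyl acetate copolymer.
print(split_repeat_units("{$CC$,$CC(OC(C)=O)$}"))
```

Each returned unit is an ordinary SMILES string plus bonding descriptors, so it can be inspected further without knowing anything about the rest of the object.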

Figure 2

Comparison between the structural formula (top) of poly(ethylene-vinyl acetate) (EVA) and the BigSMILES representation (bottom) of EVA. The representations of ethylene monomer shaded in orange and vinyl acetate monomer shaded in green in the structural formula are very similar to the machine-friendly BigSMILES stochastic object representation. Note that the BigSMILES representation exists as both a simplified representation, in which “$” are omitted, and a full representation, where bonding sites to other repeating units are explicitly indicated. While it is conventional to draw the structural formula with the repeating units in their canonical orientation, this does not imply that all the repeating units are oriented in this specific configuration within the polymer chain; the orientation may be head-to-tail or head-to-head in many cases depending on the nature of the polymerization. Here, the nature of vinyl polymerization is captured by the BigSMILES representation by allowing the units to take both orientations.

Bonding Descriptor Syntax

In simple linear polymer segments, the repeat units may be written in a way such that the strings for each repeat unit may be directly concatenated together in any order or orientation to form a representation of a polymer molecule. Figure includes a representation of one such hypothetical polymer segment; however, in many cases (such as polymers synthesized via ring-opening polymerization with repeating units always in a specific orientation, or if there exist multiple sets of orthogonal reactions that prohibit the formation of a certain connection between some repeating units), more complex bonding patterns can arise. To differentiate and clearly specify different connectivity patterns between repeating units, two types of bonding descriptors are introduced. The first type of connection is AA type bonding, where connections can occur between any two bonding moieties within a group of possible moieties. This is commonly found in the bonds formed from chain polymerization of vinyl monomers, where each polymerized vinyl carbon can in principle connect to any other polymerized vinyl carbon found in other repeat units (allowing for head-to-head, tail-to-tail, and head-to-tail addition). Section SI (p S2) in the Supporting Information gives a more detailed discussion of this feature. For this type of connectivity, the “$” notation is used. For example, for a linear polymer segment formed from vinyl monomers ethylene and 1-butene, the stochastic object reads As illustrated by the example, in general, there are multiple equally valid representations for each repeating unit, and the bonding site to other repeating units (the position of symbol “$”) can be placed at any position in the repeating unit. Furthermore, there can be more than two such sites per repeat unit, which becomes useful when the notation is generalized to represent branched polymers. 
For example, Figure S2a (p S11, Section SIV in the SI) gives a representation for branched polyethylene using branching units with three or more connection sites to the other repeating units. It should be emphasized that the list in the BigSMILES stochastic object is defined based on repeating units rather than monomers. Therefore, for monomers such as isoprene, which may have up to four isomerization states upon polymerization,[31] each isomerization state is treated as a distinct repeating unit in the stochastic object, as illustrated in Figure 3b.

Figure 3

Examples to illustrate the syntax of BigSMILES for polymers synthesized via different chemistries. (a) Vinyl copolymer poly(ethylene-co-1-butene) formed from chain polymerization, (b) four distinct isomerization states for polyisoprene (1,2-addition isomer is retained for completeness despite the fact that its amount can be negligibly small in natural rubber), (c) step polymerized nylon-6,6, (d) step polymerized poly(alanine-co-glycine), and (e) poly(ethylene glycol) methacrylate formed from polymerization of epoxides.

If multiple orthogonal sets of AA type connections exist within the same molecule, the symbol “$” can be appended with a positive integer n into “$n” to distinguish between different sets of connections. By default, “$” represents a single bond connection; however, if the repeating units are connected by other bonds, the bond type or bond order can be specified by using the SMILES bond order representation, with “$=n” for double bonds, “$#n” for triple bonds, and “$\n” and “$/n” for explicitly specifying the cis–trans isomerization states of single bonds directly adjacent to double bonds, respectively. Note that integer IDs n should serve as unique identifiers for the different sets of bonds and therefore not be reused within the same stochastic object. Since the scope of this identifier is only within the stochastic object (between the curly brackets), identifiers within different stochastic objects are distinct even if the IDs appear to be identical. Furthermore, while explicitly stating the additional bond order in every bonding descriptor enhances clarity, the first occurrence of the bond of a particular ID is treated as the definition for the connectivity pattern associated with the ID, and the details of the bond order can be omitted for simplicity in later occurrences.
If there is only one group of bonds within the stochastic object, the integer ID can be dropped for simplicity if no additional descriptor (such as the bond order) is needed for the bond. In the special case where there are just two connective sites per repeat unit, and only one type of AA bond of bond order one exists, if the repeat unit is written such that these two sites are at the termini of the repeat unit, the symbol “$” may be omitted altogether, as in the case illustrated in Figure 3a,b. This provides a substantial simplification for a very wide range of common polymers and is referred to as the “simplified representation.” In the $ representation of AA bonding, any bond indicated by “$n” can be joined to any other bond “$n”, and the repeating unit in the polymeric structure need not connect in the orientation specified in the repeat unit list. Therefore, structures with repeat units in the flipped orientation are implicitly included. For instance, this bonding descriptor can be used for representing vinyl polymers, for which both the head-to-head and the head-to-tail configurations need to be included so the overall BigSMILES representations capture the full ensemble of the possible configurations of the polymer (Figure S1, p S2). Including both configurations is especially important in describing polymers such as poly(vinyl alcohol) or fluorinated vinyl polymers, for which there are known to be a significant number of head-to-head oriented pairs along the chain.[32] However, it is emphasized that while the bonding descriptor specifies the ensemble of possible configurations, it does not provide information on the relative weights for each of the configurations. For the second type of bonding, AB type bonding, a bonding moiety cannot connect directly to other moieties within the same group but can only connect to moieties in another conjugate group. This is commonly seen in monomers polymerized with condensation reactions.
For example, in a polyamide, the amide bonds between monomers are always between an acid moiety and an amine moiety but never between two acid or two amine moieties. In this case, angle brackets “<” and “>” are used to indicate the bonds, where bonds must form between conjugate pairs of brackets. For example, the polymeric segment of nylon-6,6, as shown in Figure 3c, may be represented in BigSMILES as {<C(=O)CCCCC(=O)<,>NCCCCCCN>}. As the asymmetric bonding descriptor represents bonds and connectivity resulting from the reaction of a pair of conjugate end groups, such as polymers synthesized from the polycondensation reaction of a pair of end groups, conjugate symbols are selected for each of the two bonds. For instance, in the nylon-6,6 example, all the amine ends are denoted by the symbol “>”, whereas the carboxyl ends are denoted by the opposite symbol “<”; similarly, the amine ends on the polypeptide in Figure 3d share the symbol “>”, and the carboxyl ends are denoted by the opposite symbol “<”. Similar to the “$” symbol, if multiple groups of AB type bonds exist, or higher bond order is needed, the notation can be extended to “<bn” and “>bn”, where b is either “–”, “=”, “#”, “\”, or “/” depending on the bond order or bond type, and n is a positive integer. Again, for single bonds, where b is “–”, b can be omitted for simplicity. Practical examples on the usage of the bonding descriptors and common errors in encoding BigSMILES strings are, respectively, provided in Section SIX (pp S45–S53) and Section SII (pp S6–S9) of the Supporting Information.
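The pairing rules for the two descriptor types can be condensed into a small compatibility check. This is a sketch under stated assumptions rather than an implementation of the full grammar: descriptors are taken as bare tokens such as `$`, `$=1`, `<` or `>2` (symbol, optional bond character, optional integer ID), an ASCII `-` stands in for the single-bond symbol, and the names `Descriptor`, `parse_descriptor`, and `can_bond` are hypothetical. Because the first occurrence of an ID defines the bond order and later occurrences may omit it, the check deliberately ignores the bond character.

```python
import re
from typing import NamedTuple, Optional


class Descriptor(NamedTuple):
    kind: str             # "$" (AA type) or "<" / ">" (AB type)
    bond: str             # "-", "=", "#", "\\" or "/"; "-" when omitted
    ident: Optional[int]  # integer ID, or None when no ID is written


# Assumed token layout: symbol, optional bond character, optional digits.
_TOKEN = re.compile(r"([$<>])([-=#\\/]?)(\d*)")


def parse_descriptor(token: str) -> Descriptor:
    match = _TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"not a bonding descriptor: {token!r}")
    kind, bond, ident = match.groups()
    return Descriptor(kind, bond or "-", int(ident) if ident else None)


def can_bond(a: Descriptor, b: Descriptor) -> bool:
    """AA sites pair with any site of the same set; AB sites need a conjugate."""
    if a.ident != b.ident:
        return False            # different IDs are orthogonal sets
    if a.kind == "$":
        return b.kind == "$"    # "$" joins "$"
    return {a.kind, b.kind} == {"<", ">"}  # "<" joins only ">"


print(can_bond(parse_descriptor("<"), parse_descriptor(">")))    # True
print(can_bond(parse_descriptor("<"), parse_descriptor("<")))    # False
print(can_bond(parse_descriptor("$1"), parse_descriptor("$2")))  # False
```

In the nylon-6,6 case this reproduces the stated chemistry: two carboxyl sites (both “<”) or two amine sites (both “>”) never pair, while a carboxyl site and an amine site do.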

Fragment Name Definition Notation

In BigSMILES syntax, repeating units are represented by an extended version of SMILES strings. While this design ensures that BigSMILES strings are standalone and self-descriptive, in some cases it might be more beneficial to have some portions of the BigSMILES representations be replaced by more abstract but compact proxies, for example, the names of repeating units. This is especially helpful when the structure is complex, and the resulting BigSMILES representation becomes long. To facilitate understanding, a definition of molecular fragments that associate user-defined names with partial BigSMILES strings is allowed in BigSMILES using the following syntax: The definition of repeat unit names is placed at the end of the entire BigSMILES string, with each definition of fragment enclosed by curly brackets and delimited by periods. When fragments are used within the original BigSMILES object, the fragment name should be enclosed in square brackets to avoid potential confusion of # with triple bonds. Note that the fragments should conform to the BigSMILES syntax and produce a syntactically valid BigSMILES string when embedded within the original BigSMILES object through a substring replacement. In addition, while having complete (fully bracketed on both sides) BigSMILES stochastic objects within the fragment definition is allowed, no bonding descriptors (except for those within fully bracketed stochastic objects) should be included within the fragment definition, so that all occurrences of the bonding to other repeating units appear explicitly in the BigSMILES stochastic objects. In many cases, the fragment definition notation can significantly increase the readability of the BigSMILES strings; two such examples are illustrated in Figure 4a,b. Fragment notation also provides a way of introducing monomer libraries to improve the readability of BigSMILES. An initial library illustrating a wide variety of examples is provided in the Supporting Information (Section SI, p S3, Table S1).

Figure 4

Examples to illustrate useful features of BigSMILES syntax. First, pendant groups (a) or arms (b) can be replaced by user-defined names to improve readability. (c, d) Second, direct concatenation of BigSMILES stochastic objects provides simple representation for block copolymers. Finally, nesting of stochastic object becomes useful in representing copolymers with oligomer chain extenders (e) or polymer grafts (f).

Concatenation and Nesting of Stochastic Objects

The BigSMILES stochastic object defined earlier represents polymeric fragments. In principle, as in the SMILES language, the adjacent strings outside the stochastic object concatenated to the string within the stochastic object form a continuous chemical structure. However, to ensure chemical validity, how the termini connect to exterior strings should be specified using leading and trailing bonding descriptors within the curly brackets. The additional bonding descriptors indicate how the exterior atoms are connected to the fragment. Therefore, they should be conjugates to the specific desired terminal; i.e., the additional descriptor should be “>nb” if the desired terminal bond type is “<nb”, and vice versa. For example, if both ends of a polyamide are to terminate at the carboxylate group, an additional “>” would need to be added to both ends of the stochastic object to indicate that the terminal bonds are connected to the carbon on the carboxylate group rather than the nitrogen on the amine group. It should be noted that this concatenation syntax only allows up to two connections to the exterior. In some cases, because of the nature of the repeating unit, by specifying the ending bonding orientation, the bond type at the beginning of the object is also determined. This is common in polycondensation of AB type monomers or ring opening polymerization, where the connectivity on one end completely determines the connectivity pattern on the other end. For example, if the end groups OH were to be positioned on the left of the stochastic object representing a glycine alanine copolymer, only the C-to-N orientation of the polyamide makes sense given the placement of the end group.
In these cases where at least one end of the polymer is capped by external groups, if all repeating units within the stochastic objects have only a pair of conjugate connective sites belonging to the same AB bond group, and all repeating units are written so that the sites are placed at the termini with the same orientation, then “<” and “>” at the termini of the repeating units may be omitted to simplify the representation. With this simplification, both the PEG example in Figure 3e and the previous glycine alanine copolymer can be written more compactly. Although the N-terminus of the polymer seems uncapped, there is an implicit hydrogen that terminates the polymer. Collectively, these simplifications for AB bonding are also referred to as the “simplified representation.” The SMILES feature that allows string concatenation to represent a continuous chemical structure enables blocks of polymeric structure in a copolymer to be written as the direct concatenation of several stochastic objects. For example, a polyethylene-block-polystyrene structure shown in Figure 4c can be easily encoded by concatenating the two polymer segments. Similarly, this representation can be generalized to represent multiblock copolymers, such as the triblock poly(ethylene glycol)-block-poly(propylene glycol)-block-poly(ethylene glycol) (PEG-b-PPG-b-PEG) illustrated in Figure 4d. Note that, in this triblock copolymer example, the syntax is greatly simplified with the omission of the terminal “<” and “>” in the repeating units. In the BigSMILES syntax, it is possible to nest multiple levels of stochastic objects within a stochastic object to create more complex structures. To illustrate the syntax of nesting, consider synthesis of a polyurethane through polycondensation of 1,3-propanediol, ethylene glycol oligomers, and toluene diisocyanate (TDI), as illustrated in Figure 4e.
The ethylene glycol oligomers are encoded as one stochastic object, and this can be nested in a second stochastic object representing the overall polyurethane polymer. This example can be easily generalized to describe polymers resulting from the polycondensation of more than two types of oligomers or repeating units. Another scenario that demonstrates the convenience of nesting is the representation of graft polymers. Consider polyisobutene-graft-poly(methyl methacrylate) (PIB-g-PMMA) synthesized by grafting from the linear copolymer of poly[isobutene-co-(m-bromomethylstyrene)], illustrated in Figure 4f, as an example. The graft polymer can be represented either with the polymer graft nested directly within the backbone or, by separately defining the polymer graft with the syntax provided in the previous section, with the graft referenced by name. When possible, readability and ease of comprehension will usually benefit from encoding a polymer in a non-nested way.
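Concatenation and nesting both come down to bracket depth: blocks are stochastic objects that sit side by side at the top level, while nested objects open a new pair of curly brackets inside an enclosing one. A minimal sketch of that bookkeeping follows. The function name `top_level_objects` is hypothetical, the example strings are illustrative simplified-representation strings rather than strings taken from the figures, and trailing fragment definitions (which also use curly brackets) are not treated specially.

```python
def top_level_objects(bigsmiles: str) -> list[tuple[str, int]]:
    """Return (substring, nesting depth) for each top-level stochastic object.

    A depth of 1 means a plain object; 2 or more means that at least one
    stochastic object is nested inside it.
    """
    objects: list[tuple[str, int]] = []
    depth = 0
    start = 0
    max_depth = 0
    for i, ch in enumerate(bigsmiles):
        if ch == "{":
            if depth == 0:
                start, max_depth = i, 0
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            if depth == 0:
                raise ValueError("closing bracket without an opening bracket")
            depth -= 1
            if depth == 0:
                objects.append((bigsmiles[start:i + 1], max_depth))
    if depth != 0:
        raise ValueError("unclosed curly bracket")
    return objects


# Two blocks written side by side (illustrative diblock string).
print(top_level_objects("{CC}{CC(c1ccccc1)}"))
# One object containing a nested object (schematic placeholder units).
print(top_level_objects("O{A,{B}}C"))
```

A result with more than one entry indicates a block-type concatenation, and a depth above 1 flags nesting, which the text suggests avoiding when a non-nested encoding is available.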

Branched Polymers and Polymer Networks

Up to now, all examples have been focused on linear polymer segments, where each repeat unit has two attachment points corresponding to the start and end of its SMILES string. However, the stochastic object can also be generalized to represent randomly branching polymers. For example, consider a low-density polyethylene (LDPE) molecule with long chain branching (Figure S2a, p S11, Section SIV). In its BigSMILES representation, unlike the repeating units discussed up to this point, the second and third repeating units each have functionality larger than two (and therefore the “$” symbols cannot be omitted). Therefore, they serve as branching points, and the entire stochastic object represents a randomly branching structure, which resembles the structure of LDPE. Note that while linear segments of the LDPE molecule can have an odd number of carbons because of branching, the overall linear backbone of LDPE must have an even number of carbons. Hence, the repeating units in this case consist of molecular fragments with two carbons. In practical cases, the fraction of the last repeat unit should be very small compared to the other two repeating units, and this unit is retained in the list here for completeness. Other branched polymers or polymer networks can also be encoded similarly. In Section SIII (pp S10 and S11) of the Supporting Information, more examples, including hyperbranched polymers, end-linked polymer networks, and vulcanized networks, are given; additional discussion on noncovalent or dynamic networks can be found in Section SIV, pp S12 and S13 of the SI.
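Whether a repeating unit can act as a branching point follows directly from how many bonding descriptors it carries. The sketch below counts them. It assumes bare “$”, “<”, and “>” descriptors that appear nowhere else in the unit string, it treats a unit with no descriptors as the simplified representation with two implied sites, and the three two-carbon units in the example are illustrative guesses at an LDPE-like list, not the strings from the Supporting Information.

```python
import re


def functionality(unit: str) -> int:
    """Number of connection sites on a repeat unit.

    Assumption: every "$", "<" or ">" in the string is a bonding
    descriptor. A unit without descriptors is read as the simplified
    representation, which implies one site at each terminus.
    """
    sites = len(re.findall(r"[$<>]", unit))
    return sites if sites else 2


def branching_units(units: list[str]) -> list[str]:
    """Repeat units with more than two connection sites."""
    return [unit for unit in units if functionality(unit) > 2]


# Hypothetical LDPE-like list: linear, one branch, and two branches.
units = ["$CC$", "$CC($)$", "$CC($)($)$"]
print([functionality(unit) for unit in units])  # [2, 3, 4]
print(branching_units(units))
```

A list for which `branching_units` returns anything describes a randomly branched ensemble, which is also the case where the “$” symbols cannot be dropped.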

End Groups

In BigSMILES, there are two valid ways of specifying end groups. The first way is to explicitly append the end groups around the polymeric fragment represented by the stochastic object; this method allows specification of a deterministic end group. This was used in the previous section to specify the structure for a methacrylate terminated PEG, as illustrated in Figure 3e. The other way is to append the list of possible end groups as a comma-delimited list to the end of the list of repeating units, separated by a semicolon. The end groups are represented as if they are also repeating units, with the same bonding descriptors “$nb”, “<nb”, “>nb” as repeating units that indicate the allowed connectivity patterns between repeating units and the end groups. However, the nature of end groups dictates that they should have only one possible bond to another repeating unit, to terminate the structure. For example, in the nylon-6,6 case, two different end groups are possible, and the carboxylic and amine end groups are included within the list of repeating units. Note that, in this example, hydrogen atoms are explicitly written for clarity. When end groups are specified using this representation, it means that all the unconnected bonds on the molecular fragment generated using the list of repeating units with two or more connections to other repeating units are capped with the specified end groups. This representation can be especially useful when there are multiple possible end groups. For instance, the variability of the end groups on the two ends of nylon-6,6 synthesized from polycondensation of adipic acid and hexamethylenediamine is implicitly considered by using this representation. The effectiveness of the latter representation is especially demonstrated by the following example. Consider linear polystyrene synthesized from AIBN initiated radical polymerization.
It could have three different end groups depending on the route of termination. The possible terminal structures are illustrated in Figure 5a. In this example, the SMILES string leading to the random fragment is synonymous with specifying that the fragment already has one of its two ends capped by an initiator, indicated by the leading [$]. Therefore, it leaves only one unconnected bond on the fragment. The other end group can be one of the three end groups trailing the ethylene monomer in the list. The first possible case is that the other end group is also the initiator, which corresponds to the second entry on the list (first one in the end group list). This happens when termination by coupling takes place. The styrene repeating unit within the end group is written in a reversed orientation to emphasize the preferred configuration in polymerization. On the other hand, when termination from disproportionation occurs, the end of the polymer can be capped by either of the two groups at the end of the list.
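The semicolon form of end group specification can be read with the same top-level splitting used for repeat units. The sketch below separates the two lists. The names `split_units_and_end_groups` and `_split_top_level` are hypothetical, and the nylon-6,6 string in the example is an assumption derived only from the conjugate-pair rule (a “<” end group caps an amine site, a “>” end group caps a carboxyl site); it is not copied from the original figures.

```python
def _split_top_level(text: str, sep: str) -> list[str]:
    """Split on `sep`, ignoring separators inside nested curly brackets."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def split_units_and_end_groups(obj: str) -> tuple[list[str], list[str]]:
    """Return (repeat units, end groups) for '{units;end groups}'."""
    text = obj.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("a stochastic object must be wrapped in curly brackets")
    sections = _split_top_level(text[1:-1], ";")
    if len(sections) > 2:
        raise ValueError("at most one semicolon is expected at the top level")
    units = _split_top_level(sections[0], ",")
    end_groups = _split_top_level(sections[1], ",") if len(sections) == 2 else []
    return units, end_groups


# Assumed nylon-6,6 string with explicit hydrogens on the end groups.
units, ends = split_units_and_end_groups(
    "{<C(=O)CCCCC(=O)<,>NCCCCCCN>;<[H],>O[H]}"
)
print(units)  # ['<C(=O)CCCCC(=O)<', '>NCCCCCCN>']
print(ends)   # ['<[H]', '>O[H]']
```

Keeping the two lists apart makes it straightforward to check the rule stated above, namely that every end group carries exactly one bonding descriptor.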

Figure 5

(a) Illustration of possible termination products for free radical polymerization of polystyrene. (b) Polystyrene ring polymer synthesized from azide–alkyne click chemistry. Since the rings and cycles within the repeating units are independent of the macrocycle (which leads to the formation of the ring polymer), the ring closure integer identifier within the stochastic object is independent of the identifiers outside of the object even if the numbers were the same. (c) Ring polymer synthesized from ring expansion metathesis polymerization (REMP).

When randomly branched polymers are considered, the representation that includes end groups into the list of repeating units has large advantages. Consider the hyperbranched polymer example in Figure S2b (p S11, Section SIV); if the end group for the #B moiety is #E, then the end groups of the hyperbranched polymer can be easily specified using this representation. Note that, in this case, it is impossible to explicitly append end group #E to the polymer fragment, because different members of the ensemble of molecules represented by the stochastic fragment have different numbers of unclosed bonds.

Macrocycles

For macrocycles that are well-defined, such as the cycle structures in ring polymers, the macrocycles are encoded using the usual syntax for describing cycles in SMILES. To illustrate this, consider a polystyrene ring synthesized through azide–alkyne click chemistry, as illustrated in Figure 5b. In this case, since rings and cycles within repeating units do not extend beyond a single repeating unit, the macrocycle associated with the ring polymer can be treated with the usual SMILES ring closure syntax. The integer identifier used within the repeating units for ring closure is considered to be independent of any ring closure ID used in other parts of the BigSMILES string. Therefore, in the BigSMILES representation for this polymer, the ring closure for the macrocycle is selected to be between the sulfur atom and its neighboring atom, while the ring closures denoted by 2 and 3 describe the ring in the phenyl group and the ring with nitrogen atoms. It should be emphasized that, similar to the ID used in bonding descriptors, the scope for ring indices within a stochastic object is local to the object and independent of other ring indices not within the stochastic object. In principle, other well-defined, nonstochastic cycles can be encoded in a similar manner. For example, a ring polymer synthesized with ring expansion metathesis polymerization (REMP), developed by Grubbs and co-workers,[33] can also be encoded with similar syntax, as illustrated in Figure 5c. On the other hand, randomly formed cycles, such as the random loops in polymer networks, cannot be explicitly enumerated because each cycle requires indexing in the SMILES language.
While the examples shown in Figure S2b–d (p S11, Section SIV in the SI) do not explicitly present the possibility of macrocycle formation, the rules of connectivity implicitly allow it, and enumeration of molecular states represented by the BigSMILES structure according to algorithms for generating gel connectivity, such as the algorithms adopted by Stepto and co-workers,[34] Eichinger and co-workers,[35] or Olsen and co-workers,[36,37] will include the formation of these cyclic structures. Examples of BigSMILES strings for such structures are included in Section SIV (pp S10–S11) of the Supporting Information.
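The local scoping of ring indices described above can be illustrated with a short sketch. It is not part of the BigSMILES specification or of any released software: the function names are hypothetical, the input is a toy string rather than one taken from the paper, and the sketch assumes that stochastic objects are delimited by curly braces, are not nested, and that every square-bracketed token (bracket atoms and bonding descriptors such as [$1]) can be skipped when looking for ring-closure digits.

```python
from collections import defaultdict


def ring_indices_by_scope(bigsmiles: str) -> dict:
    """Collect ring-closure indices separately for each scope.

    Scope 0 is everything outside curly braces; scopes 1, 2, ... are the
    stochastic objects in order of appearance (nesting is NOT handled).
    Text inside square brackets is skipped so that bonding-descriptor IDs
    and isotope or charge digits are not mistaken for ring closures.
    """
    scopes = defaultdict(list)
    current = 0     # scope currently being read
    n_objects = 0   # number of stochastic objects opened so far
    i = 0
    while i < len(bigsmiles):
        ch = bigsmiles[i]
        if ch == "[":
            # Jump past the matching closing bracket.
            i = bigsmiles.index("]", i) + 1
            continue
        if ch == "{":
            n_objects += 1
            current = n_objects
        elif ch == "}":
            current = 0
        elif ch == "%":
            # SMILES two-digit ring closure, e.g. %12
            scopes[current].append(int(bigsmiles[i + 1:i + 3]))
            i += 3
            continue
        elif ch.isdigit():
            scopes[current].append(int(ch))
        i += 1
    return dict(scopes)


def is_balanced(indices) -> bool:
    """Every ring index opened within a scope is also closed within it."""
    return all(indices.count(k) % 2 == 0 for k in set(indices))


# Toy string (an assumption, not from the paper): index 1 is used both
# for the outer macrocycle and for the phenyl ring inside the object.
toy = "C1CC{[$]CC(c1ccccc1)[$]}CC1"
scopes = ring_indices_by_scope(toy)
print(scopes)
print(all(is_balanced(v) for v in scopes.values()))
```

Because the two scopes are checked independently, reusing the same integer inside and outside the stochastic object does not create a conflict, which is the behavior the text describes.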

Ladder Polymers and Repeating Units with Multiatom Connections

The syntax up to this point assumes that neighboring repeating units are always connected through a single pair of atoms. However, for some materials, such as ladder polymers, this condition does not hold. To represent ladder polymers or other polymers with multiple connections between a single monomer pair, the bonding descriptors are nested. The outer layer (everything except the part bracketed by “[...]”) encodes the connectivity between the repeating units with the same syntax as detailed in previous sections. Atoms on a repeating unit that connect to the same neighboring repeating unit are indicated by an identical outer-layer bond type, bond order b, and bond ID n. For detailed examples of the use of nested bonding descriptors, please refer to p S14, Section SV in the Supporting Information.

Discussion

BigSMILES provides a well-defined, compact, and machine-friendly extension to the SMILES language that allows stochastic polymer structures to be represented. In this stochastic sense, a polymeric material is actually a set of molecules, which may be conceptualized as an ensemble of different chemical states (defined by the bonding pattern of atoms), each with a probability of occurrence within the set of molecules that represents the material. BigSMILES allows this ensemble of chemical states to be represented in a compact form; however, it does not provide information on the probability of observing any given chemical state. This is conceptually similar to the chemical structure of a polymer, which does not specify, for example, the molar mass distribution. In principle, the probability of observing each molecular configuration within the ensemble can be quantified by measuring physicochemical properties, such as the molar mass distribution, tacticity, or monomer reactivity ratios and feed ratios. However, developing an identifier notation based on a fixed set of property descriptors is challenging in practice. In most practical settings, only a few of the chemical structural features and properties of the macromolecules are characterized experimentally. Furthermore, the literature lacks consensus on how to treat this problem: researchers typically do not measure the same data using consistent methods for each polymer, and the data required to fully define molecular probabilities are usually missing. In some cases, measurements may not even be possible. This means that any form of encoding that relies on describing the macromolecules using a predefined set of properties will not meet the needs of the macromolecular community, nor will it be universally accepted. There are also substantial issues with data uncertainty and disagreements about evidence that have the potential to cause controversy.
Therefore, to make the representation general and universally applicable, a syntax is developed that clearly separates the definition of the ensemble of molecular states accessible in a polymerization, a relatively noncontroversial topic, from the probability of achieving a given molecular state, a topic around which there is much greater debate and uncertainty of measurement. This is analogous to defining an ensemble of states in statistical mechanics without assigning the Boltzmann weights. While both are important for property calculation, separating the two tasks makes it possible to provide concrete molecular identifiers. Alternatively, the demarcation of stochastic objects with curly brackets could allow additional specifications to be included in the list of elements beyond repeat units and end groups, providing a place to specify certain additional chemical properties. In the current form, a single polymer can be represented by multiple distinct yet equally valid BigSMILES representations. For practical purposes, canonicalization of BigSMILES to provide a unique representation for each distinct polymer would be essential for the application of BigSMILES to polymer informatics. Software packages to accompany BigSMILES are also of prime importance, because they would serve both as a standard representation generator and as a tool that could help eliminate human errors. Development of both the canonicalization scheme and the supporting software is currently in progress and will be reported in the near future. In its current form, BigSMILES can still be used as a structural identifier in applications such as data entry identifiers in polymer databases. To demonstrate its general applicability, the entries of a well-known data set[38] of glass transition data are converted into BigSMILES representations (cf. Section SVIII, pp S19–S44, in the Supporting Information).
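The separation between the ensemble of states and the probabilities of those states can be made concrete with a small sketch. It is an illustration only, not the authors' method: the repeat-unit labels "A" and "B", the chain-length cutoff, and the independent-placement weighting model are all assumptions introduced here. The first function plays the role of the structural identifier (it defines which sequences exist), while the second adds weights from separately supplied data.

```python
from itertools import product


def enumerate_states(units, max_len):
    """Ensemble only: every linear sequence of 1..max_len repeat units."""
    return [seq
            for n in range(1, max_len + 1)
            for seq in product(units, repeat=n)]


def assign_weights(states, unit_prob):
    """Hypothetical weighting step, kept separate from the ensemble.

    Assumes each unit is placed independently with probability
    unit_prob[unit]; weights are normalized over the enumerated states.
    """
    raw = {}
    for seq in states:
        w = 1.0
        for unit in seq:
            w *= unit_prob[unit]
        raw[seq] = w
    total = sum(raw.values())
    return {seq: w / total for seq, w in raw.items()}


# The same ensemble can carry different weights, depending on data
# (e.g. feed ratios) that the structural identifier does not encode.
states = enumerate_states(["A", "B"], max_len=2)
equal_feed = assign_weights(states, {"A": 0.5, "B": 0.5})
a_rich_feed = assign_weights(states, {"A": 0.9, "B": 0.1})
print(len(states))
print(equal_feed[("A", "A")], a_rich_feed[("A", "A")])
```

Both weight tables share exactly the same keys, which mirrors the point made in the text: the identifier fixes the set of accessible states, and the probabilities are a separate layer of information.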
In addition to being used as identifiers, BigSMILES representations are designed so that they can also serve as the basis of a chemical fingerprint generator. By considering pairs or triplets of repeating units and higher-order structures, chemical motifs with different levels of complexity and detail can be easily generated from the representation. These motifs can be used in cheminformatics applications to construct feature vectors that are fed into supervised learning models for property prediction. Furthermore, the generated motifs can also be used in chemical fragment search or chemical structure search. Finally, the structure of BigSMILES representations also allows generic chemical pattern searches. Queries such as “find polymers that are linear”, “find linear copolymers that have two components”, or “find branched polymers with trifunctional junctions” can be straightforwardly processed with regular expressions or other pattern matching languages. These features, including the generator for chemical fingerprints, chemical fragment search, and generic structural feature queries, will be implemented and demonstrated in future work. This capability enables access to and searching of polymer materials at multiple levels of abstraction, which we believe will be highly convenient for the community.
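A query such as “find linear copolymers that have two components” might be sketched with regular expressions as follows. This is a minimal illustration under stated assumptions, not the implementation promised for future work: it assumes that a stochastic object is delimited by curly braces, that repeat units are separated by commas with any end groups following a semicolon, and that bonding descriptors take the simple forms [$], [<], or [>] with an optional integer ID. The function names and the toy strings are hypothetical.

```python
import re

# Assumed surface syntax (see the lead-in): {unit,unit,...;endgroup,...}
STOCHASTIC_OBJECT = re.compile(r"\{([^{}]*)\}")
BONDING_DESCRIPTOR = re.compile(r"\[[$<>]\d*\]")


def repeat_units(object_body: str) -> list:
    """Repeat units are the comma-separated entries before any ';'."""
    units_part = object_body.split(";", 1)[0]
    return [u for u in units_part.split(",") if u]


def is_linear_two_component(bigsmiles: str) -> bool:
    """True for one stochastic object holding exactly two repeat units,
    each carrying exactly two bonding descriptors (i.e. bifunctional)."""
    objects = STOCHASTIC_OBJECT.findall(bigsmiles)
    if len(objects) != 1:
        return False
    units = repeat_units(objects[0])
    return len(units) == 2 and all(
        len(BONDING_DESCRIPTOR.findall(u)) == 2 for u in units
    )


# Toy inputs written for this sketch, not taken from the paper.
candidates = [
    "{[$]CC[$],[$]CC(C)[$]}",      # two bifunctional units
    "{[$]CC[$]}",                  # one unit only
    "{[$]CC([$])C[$],[$]CC[$]}",   # contains a trifunctional unit
]
for s in candidates:
    print(s, is_linear_two_component(s))
```

Counting bonding descriptors per repeat unit is the design choice that lets the same helper be adapted to the other queries mentioned in the text, for example by looking for a unit with three descriptors to flag trifunctional junctions.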

Summary

In this work, a new text-based structural representation system designed to accommodate the stochastic nature of polymers is proposed. By adding a novel stochastic object to the widely used simplified molecular-input line-entry system, the features of SMILES can now be applied to polymers through BigSMILES. As the new representation system adds only a few elementary rules to the original syntax of SMILES while maintaining full compatibility with SMILES, most of the advantages of SMILES, including memory compactness, machine friendliness, and wide applicability, are retained in BigSMILES. Therefore, BigSMILES representations are excellent candidates for indexing identifiers in a polymer database system, as well as structural descriptors that could be used to search for polymer materials. Furthermore, as the chemical spaces represented by the BigSMILES strings can be straightforwardly probed with iterative generation of molecular fragments of varying sizes, BigSMILES representations can be readily used to automatically extract chemical subgraphs and generate molecular fingerprints. This feature can provide a convenient foundation for the generation of data sets that could be used along with machine learning models to fuel data-driven research. Ultimately, BigSMILES benefits the polymer community and increases cohesion between studies by providing a common language that is more effective and suitable for polymers.
