Literature DB >> 34559562

Exploring and mapping chemical space with molecular assembly trees.

Yu Liu¹, Cole Mathis¹, Michał Dariusz Bajczyk¹, Stuart M Marshall¹, Liam Wilbraham¹, Leroy Cronin¹.

Abstract

The rule-based search of chemical space can generate an almost infinite number of molecules, but exploration of known molecules as a function of the minimum number of steps needed to build up the target graphs promises to uncover new motifs and transformations. Assembly theory is an approach to compare the intrinsic complexity and properties of molecules by the minimum number of steps needed to build up the target graphs. Here, we apply this approach to prebiotic chemistry, gene sequences, plasticizers, and opiates. This allows us to explore molecules connected to the assembly tree, rather than the entire space of molecules possible. Last, by developing a reassembly method, based on assembly trees, we found that in the case of the opiates, a new set of drug candidates could be generated that would not be accessible via conventional fragment-based drug design, thereby demonstrating how this approach might find application in drug discovery.

Entities: Chemical

Year: 2021 PMID： 34559562 PMCID： PMC8462901 DOI： 10.1126/sciadv.abj2465

Source DB: PubMed Journal: Sci Adv ISSN： 2375-2548 Impact factor: 14.136

INTRODUCTION

Chemical space is populated by a vast range of compounds, which can be characterized by their molecular composition, formula, graph representation, and reactivity (, ). The generation of molecules via their graphs can be enumerated to give an unimaginably vast number of at least 1060 small organic molecules (, , ), but this is unrealistic, as many of these molecules might be unstable or inaccessible synthetically. However, when exploring biochemistry, only a few hundred different types of “unique” small molecules are needed by the simplest living organisms (, ), indicating that the chemical space relevant to biology on Earth is a tiny fraction of chemical space that is possible (, ). For example, many of the related structures between these known compounds are undiscovered and unknown (e.g., opiates and cannabinoids are found to occur in related clusters of structures but no cross between these structure types are known). The issue therefore arises about how the space of molecules can be effectively searched (, , ) and what constrains molecules to be both thermodynamically possible and biologically accessible (, –) because although many molecules are physically possible, the number of molecules accessible by the current machinery of biology is smaller (, –). One way to explore the universe of molecules is to construct a chemical space as a dataset, e.g., GDB-17 database (, ), DrugBank (), and PubChem (), and then navigate the dataset using molecular descriptors (–). Searching these databases is inefficient because it requires exhaustively enumerating [or Bayesian optimization ()] and screening molecules for desirable properties. Similarly, it is possible to iteratively generate chemical subspaces, followed by filtering unwanted structures, until desired molecules are obtained. This can be done using a genetic algorithm (, ), extrapolation techniques (, ), or even using human intuition (). However, given the relative sizes of possible chemical space, compared to the number of interesting molecular structures, it is not clear how comprehensive any of these approaches will be. Recently, machine learning and statistical techniques have been introduced into the navigation of chemical space (, ). For example, hundreds of thousands of existing chemical structures were used to train a deep neural network so that each molecule can be assigned to a set of discrete coordinates in the continuous latent space of the neural network. Navigating within this discrete chemical space corresponds to navigating in the continuous latent space, which is much easier computationally (). Alternatively, a convolutional neural network can be trained directly on graph representations of molecules to infer their molecular features, and these can have a better predictive performance over the existing hand-crafted fingerprints in some applications (). These approaches offer improvement over raw enumeration and filtering because they compress the search space. However, while neural networks might make the navigation of chemical space more efficient, the space is obscured and important contingent information is not accessible (, ). One important question is how the current structure of observable chemistry relates to the space of biology. This is interesting because evolution has selected the machinery of biochemistry over the past 4 billion years on Earth (, , , ). By constructing the assembly tree from molecular structures, we will be able to not only use a molecular-based route to explore the information used to assemble the molecules found in biology but also infer which molecules are more likely to have shared pathways (i.e., infer the presence of new biological pathways). This is another important window not only on how the process of evolution by random selection leads to both conservation and reuse of biochemical pathways but also in the generation of novelty.

RESULTS AND DISCUSSION

Establishing the theory of assembly spaces of molecules

Assembly theory (, ) quantifies the constraints required to produce a molecule by measuring the minimum number of steps to produce the molecular graph thereof. Here, we apply this approach to explore the structure of chemical space and suggest a way to generate new compounds from the assembly space (see Fig. 1). In the adenine example, we chose the four chemical bonds that make molecule, namely, C─C, C═C, C─N, and C═N, as the basic building blocks. We call these building blocks and the molecular structures that will be produced therefrom the assembly building blocks and call the set of all assembly building blocks the assembly pool. In an assembly pool, any type of assembly building block is assumed to have infinite instances. One assembly step is precisely defined as the three sequential operations: (i) take two assembly building blocks from the assembly pool, (ii) join the two together in a way based on the particular system in question (in this context, that is, to superimpose certain atoms from the two building blocks to make a larger molecular structure), and (iii) add the new composite building block to the assembly pool. Once a sequence of the assembly steps can successfully construct the target molecule, this is defined as an assembly pathway for this molecule. Figure 1A shows one of many assembly pathways of adenine, which has seven assembly steps after which adenine is constructed and appears in the assembly pool and gives an assembly index of 7. Last, the assembly space of a molecule refers to all the assembly building blocks included in the assembly pathways. The assembly index of the shortest pathway to construct a molecule is referred to as the molecular assembly number (MA) of the molecule. The assembly pathway shown in Fig. 1A is the shortest one, calculated out by the Monte Carlo (MC) algorithm that we have developed (which is guaranteed to give the shortest assembly pathway when the computing time is sufficiently long; see section S3 for details), and adenine’s MA is thus 7.

Fig. 1.

Representations of an assembly pathway, by taking adenine as an example.

Representations of an assembly pathway, by taking adenine as an example.

(A) One of the many assembly pathways of adenine (it turns out to be the shortest one, according to our MC algorithm, explained later). The assembly pool (shown inside the dashed boxes) evolves with each assembly step. The colors denote which two assembly building blocks are used to make the new one (note that the color schemes are independent for each step). (B) The key-step representation of the assembly pathway. (C) The joint process for each key assembly step, which is used to work out the multiset representation. (D) The multiset representation of this assembly pathway. Strictly speaking, it should be written as {[1]1, [2]1} where the superscript “1” is the multiplicity of this assembly building block, that is, after canceling out, it appears once on the left-hand side of (C), but for simplicity, we only explicitly write down the multiplicity when it is not 1. In assembly pathways, the order of some steps can be switched without changing the length of the pathway, which leads to a combinatorically large number of trivial pathways that all have the same number of steps. For instance, as the pathway shown in Fig. 1A starts from building block [3] to make building block [6], we can either add a C─C bond to [3], then a C═C, and lastly add building block [2], just as the figure shows, or we can add [2] first, then a C═C, and, last, a C─C, which leads to another pathway. Taking Fig. 1A as an example, building block [1] must be made before [2] and [3] because [1] is used to make [2] and [3]; by the same logic, building block [2] must be made before [3] and [6] because [2] is used to make them. We can use these properties to represent assembly pathways without ambiguity by focusing on the steps in which order matters. We call those special building blocks that define the hierarchical relationships among the chemical structures the key assembly building blocks (they are also the assembly building blocks that are used more than once in the pathway) and their corresponding steps as the key assembly steps. Therefore, we can represent a pathway in terms of key steps, which eliminates all the trivial information. For example, Fig. 1B is the key-step representation of the pathway shown in Fig. 1A, and the key building blocks are [1] and [2]. The number of key building blocks can be elucidated from the key-step representation. We can explicitly write down the joint process for each key step and then remove the building blocks that appear on both sides (Fig. 1C). The building blocks left over constitute the target molecule nonrepetitively. Specifically, the target molecule adenine can be made from these chemical bonds and structures nonrepetitively and by the least number of assembly steps. The information of the basic building blocks is trivial and can thus be omitted. So, we lastly obtained the multiset representation of this assembly pathway, as shown in Fig. 1D (see section S2). The multiset representation can be readily determined from the key-step representation without ambiguity and vice versa. The latter emphasizes the hierarchical relationship between the building blocks, while the former emphasizes the information of constituents and provides a compact summary of the assembly space associated with this pathway. An assembly pathway of a molecule does not necessarily correspond to a realistic sequence of chemical reactions that produce this molecule. Instead, the shortest assembly pathway bounds the likelihood of the molecule forming probabilistically (which means that if the shortest assembly pathways of two molecules overlap, then they are likely to have shared synthetic pathways). No matter which methods or synthetic approaches are used, there will be no shorter way than this ideal one, which makes it an intrinsic property of a molecule.

Molecular assembly trees

The concept of assembly pathways and spaces can be naturally applied to two molecules, which allows us to look at the shortest assembly pathways that construct both simultaneously. In general, the shared shortest assembly pathway of A and B is not the union of the individual shortest assembly pathways of A and B. As an example, consider adenine and another nucleobase thymine (Fig. 2A). The shortest pathway of adenine alone is indicated by the blue dashed arrows on the left whose MA is 7 (the same pathway as in Fig. 1).

Fig. 2.

Two exemplified molecular assembly trees.

Two exemplified molecular assembly trees.

(A) The assembly space of adenine and thymine. The shortest assembly pathway for adenine alone is indicated by the blue dashed arrows, while the shortest assembly pathway for thymine alone is indicated by the red dashed arrows. The shortest assembly pathway to make adenine and thymine altogether is the one indicated by the black dashed arrows. (B) A molecular assembly tree for A, G, T, U, and C, which can also be written as {[2, 10, 11, 12, 13]}, whose index is calculated to be 16. Note that, in both (A) and (B), the colors are just used to help the reader recognize the building blocks, and the color schemes are independent; we also omitted the arrows starting from the basic building blocks for a better visualization. The shortest pathway of thymine alone is indicated by the red dashed arrows on the right, which can be written in multiset representation as {[7]}, and its MA is calculated to be 6. However, the shortest assembly pathway to make adenine and thymine altogether is indicated by the black dashed arrows in the middle, which does not overlap with either of the shortest pathways. It is {[8]2, [9]} in multiset representation (where the superscript “2” is the multiplicity of [8]), and its pathway index is calculated to be 12, which is smaller than 7 + 6, the sum of the two individual MAs. Molecular assembly theory can be extended further to three or more molecules, which allows us to look at the shortest assembly pathways that construct a group of molecules. The multimolecular assembly spaces tend to have a tree-like structure where different branches lead to different molecules (see Fig. 2A), but the number of key building blocks is still relatively small. Therefore, we refer to the shortest assembly pathways to make a group of molecules altogether as the molecular assembly tree (assembly tree for short) thereof and refer to its index as assembly tree MA. As an example, we built an assembly tree for the five nucleobases: adenine (A), guanine (G), thymine (T), uracil (U), and cytosine (C). We first extended the MC algorithm that we have developed for a single molecule to a group of molecules (see details in section S4). We then use this extended algorithm to compute the assembly tree and then visualize the tree manually, as shown in Fig. 2B (see section S4.3). The MA is 16, but it takes 43 steps to build the molecules bond by bond, and the minimum number of steps to construct them separately using an assembly process is 33. This relatively low MA reflects the fact that they share lots of common structures (even the common structures share quite a few common substructures), resulting in the hierarchy shown by the assembly tree, which represents a highly related subset of chemical space.

Biomolecules

For this study, we picked a dozen vital biomolecules to construct their assembly tree, including the five nucleobases (A, G, T, C, and U), pyruvate [a key intermediate in metabolic pathways across various organisms ()], and citrate [an intermediate of the vital metabolic pathway, the Krebs cycle, used by all aerobic organisms to release energy (, )]. In addition, we included d-ribose [the carbohydrate that serves as the backbone of RNA among various other functions (, )], nicotinamide adenine dinucleotide [NAD+; a vital cofactor that carries electrons from one reaction to another ()], adenosine diphosphate (ADP), adenosine triphosphate (ATP), and a symbolic RNA molecule. As we see in Fig. 3, there are lots of structures that are shared. We can imagine that as we include more biomolecules in the tree (e.g., various proteins and RNA and DNA sequences), more structures will be shared, and the tree will grow deeper and deeper but without growing much wider (i.e., having more hierarchical layers but not many nonrelated key blocks being added to the tree). This potentially “narrow” assembly tree is an indication that all of the vital biomolecules involved in extant life on Earth is not arbitrary but a consequence of millions of years of evolution.

Fig. 3.

The assembly tree of a dozen vital biomolecules, including the five nucleobases (A, G, T, C, and U), pyruvate, citrate, d-ribose, NAD+, ADP, ATP, and a symbolic RNA molecule.

In origin of life studies, one central problem is that most biomolecules are complex and thus seem very unlikely to emerge de novo from prebiotic chemistry, while life requires them altogether to function properly. An important clue that we gain from this study is that lots of chemical substructures are shared among these vital biomolecules. Hence, it can be argued that the set of processes producing these molecules together, as a set, could have been smaller than if they were produced de novo individually because they could have shared common pathways. The closer the biomolecules are in the assembly tree (i.e., they are more related), the easier it is to access these compounds. This is arguably why biomolecules exploited by extant life appear close in the assembly tree; otherwise, they would be too complex to emerge individually. For example, the fact that the five nucleobases AGTCU are closely related in the assembly tree indicates that it is not arbitrary nor a result of a frozen event that they serve as the fundamental units of the genetic code across all life on Earth and that the molecules may have been incorporated into protobiological systems because of their structural relatedness. We tested this idea by building assembly trees for alternated nucleobases (see section S5 for details). We found that MAs of alternated trees are always much higher (ranging from 31 to 38) than the one in reality, which is 16, as mentioned in the previous section. This analysis only indicates that the extant set of nucleobases were selected because of their relatedness, but why nature selects this particular set is another question since there could be many such sets. This intriguing and important question requires a much wider exploration of chemical space, yet we believe our methods provide a useful tool for framing this question.

Gene sequences

So far, we have focused on constructing assembly trees for molecules, but our theory can equally be applied to gene sequences. We take one hypothetical gene sequence X (60 bases; Fig. 4) as an example to illustrate how we can make use of the compressed information carried by X’s assembly tree to reconstruct itself with less efforts. The naïve method is to add one base at one time, and then, 59 steps in total are needed to construct X, but notice that some segments are repeated multiple times. If it is possible to produce these repetitive segments beforehand, then it is possible to take them directly and combine them with other segments in a precise way to obtain X, and this requires less than 59 steps [as producing repetitive segments is relatively easy (), e.g., polymerase chain reaction, the reconstruction of the original sequence takes less efforts than the naïve method]. As the assembly tree can filter all of the repetitive and redundant information (i.e., record the information of the sequence in the most compressed way), the information of this “precise way” is completely stored in the assembly tree. Then, we build X’s assembly tree, as shown in Fig. 4 (in this example, we only deal with one sequence X, and its assembly tree reduces to its shortest assembly pathway). Note that in this gene sequence case, we use nucleobases as the basic building blocks rather than chemical bonds as in the molecule cases.

Fig. 4.

The assembly tree of one hypothetical gene sequence X (as, here, we only deal with one sequence X, it reduces to its shortest assembly pathway in the key-step representation).

Here, we use nucleobases as the basic building blocks rather than chemical bonds in the molecule cases. Thus, we explicitly draw those nucleobases at the bottom for clarification.

The assembly tree of one hypothetical gene sequence X (as, here, we only deal with one sequence X, it reduces to its shortest assembly pathway in the key-step representation).

Here, we use nucleobases as the basic building blocks rather than chemical bonds in the molecule cases. Thus, we explicitly draw those nucleobases at the bottom for clarification. Before reconstructing, we introduce a way to compress the information hold in the assembly tree into a single string, which would be very useful in future studies, as the assembly tree of genomes would be huge and directly storing the tree structure is neither convenient nor efficient (note that while preserving the assembly tree information is our priority, we are not intended to defeat any sophisticated data compression technique per se). The assembly tree in Fig. 4 can be rewritten as CG_TTG_A1_GAC1G_3CTC4_1T5CAG42342543, denoted as Y. Now, we can reconstruct X based on Y. The first step is to construct CG by simply combining the individual bases C and G (one step) and construct TTG by combining the individual bases T, T, and G (two steps). Second, construct A1, where “1” stands for the first segment in Y, which is CG. Thus, we only need one step to obtain ACG since CG has been constructed before. Third, construct GAC1G, where we can reuse “1” again, and we thus need four steps. Fourth, construct 3CTC4, where we can reuse “3” (A1, the third segment in Y) and “4” (GAC1G, the fourth segment in Y), and we thus need four steps. Last, we can construct the original sequence X based on the last part of Y, namely, “1T5CAG42342543” where the integer stands for the corresponding segment that has been obtained before and can thus be reused. Thus, we need 13 steps here. In total, we need 1 + 2 + 1 + 4 + 4 + 13 = 25 steps, which is much less than the naïve 59 steps. Last, to quantify the increase of information from the original sequence X (60 letters) to the lossless and compressed version Y (36 letters), we can use Shannon entropy (, ), a widely used quantity to describe the information of a string. Shannon entropy H of a string X (with n letters) is defined as H(X) = − ∑(x) · log2p(x), where p(x) is the probability that the letter x appears in this sequence X, which is set to be equal to the times x appears in X divided by n, and the sum goes through every distinct letter. Therefore, we obtained that Shannon entropy of the original sequence H(X) is 1.851, while H(Y) is 3.251, increased by 1.76 times. We do think that the application of assembly theory to gene assembly has potential application to not only building new routes to engineer synthetic genomes by taking a series of genes, finding the common parts, and then finding the minimal route to assemble these parts to access all of the desired genes. Not only could this be used to efficiently build new function, but it also has promise to explore how evolution has reused genetic motifs beyond the current modular understanding and perhaps find more complex and conserved routes for the propagation of genetic information across different genes.

Plasticizers

Plasticizers are added to polymers and formulations to make them more plastic, to decrease viscosity and friction, and to increase flexibility (), but they can leach into the environment. This is a problem since these compounds have been shown to be toxic (). One big issue is that, in general, the evaluation of specific effects and prevalence of plasticizer molecules in the environment is hard, as so many different types are in use and many degradation pathways exist (). This means that potentially vast numbers of molecules related to the plasticizer parent are present in the environment, posing similar or even greater health risks than the parent. However, by exploring the assembly tree of plasticizer molecules (see Fig. 5), it might be possible to map the molecules that are potentially contaminated and even identify unknown or unexpected molecules.

Fig. 5.

The assembly tree of 10 commonly used plasticizers including BBP, DEHP, DEHA, and others.

For a clearer visualization, all plasticizers are made dimmer than other parts of the tree. The most central structures are highlighted green.

The assembly tree of 10 commonly used plasticizers including BBP, DEHP, DEHA, and others.

For a clearer visualization, all plasticizers are made dimmer than other parts of the tree. The most central structures are highlighted green. Here, we built the assembly tree of 10 commonly used plasticizers including BBP (benzyl butyl phthalate), DEHP (di-2-ethylhexyl phthalate), DEHA [bis(2-ethylhexyl) adipate], and others. As seen from Fig. 5, these 10 seemingly distinct molecules share lots of common structures (i.e., key assembly building blocks), which are also highly related. These 10 relatively large plasticizers are constructed by only a dozen of these key building blocks. In particular, the four central structures highlighted green directly or indirectly connect to most of them (except for tricresyl phosphate) and are highly related themselves. We can thus imagine that if we can detect and identify these central/characteristic chemical structures and fragments in an environment [e.g., using the mass spectrometry technique for assembly theory ()], then we should be able to track down the parent molecules and substantially narrow down the list of suspected contaminants. Nevertheless, it should be noticed that we only introduced the concept here. The development of a reliable detection method would require a massive tree inclusive of all relevant compounds both in terms of possible pollutants and products expected in the absence of pollution based on the environmental conditions.

Opiates

The search of opiate-based chemical space is an important test case since the family of compounds is highly distinctive with well-defined modular parts. One idea could be to use assembly trees to explore a set of potentially biochemically accessible new structures, as well as deduce what contingent information is present therein, and use them for finding new molecular targets. To do this, we computed the assembly tree of 10 compounds in the family of opiates (Fig. 6): Some of them are found in the opium plant (morphine, codeine, thebaine, and papaverine); some are synthetic opioids (fentanyl, remifentanil, methadone, pethidine, and diamorphine, also known as heroin), and the last one is salvinorin A, which is a κ-opioid receptor agonist () found in the Salvia divinorum plant but might not be properly considered as an opioid. The assembly tree captures some known features of the chemical space associated with these compounds: (i) Morphine, codeine, thebaine, and diamorphine are clustered in one place, with a major structure shared. (ii) Fentanyl and remifentanil are close in the tree, as they share a large substructure. (iii) In contrast, methadone, pethidine, and papaverine are not as closely related, sharing just a relatively small benzyl moiety. (iv) Salvinorin A is distinct from others, as it only connects with other compounds via very small structures.

Fig. 6.

The assembly tree of nine compounds in the family of opiates and one κ-opioid receptor agonist (salvinorin A).

The assembly tree of nine compounds in the family of opiates and one κ-opioid receptor agonist (salvinorin A).

Some of these opiates are natural (morphine, codeine, thebaine, and papaverine), while others are synthetic (fentanyl, remifentanil, methadone, pethidine, and diamorphine, also known as heroin). For a clearer visualization, all opioids are made dimmer than other parts of the tree. By computing the assembly tree of these compounds, we obtained not only the shortest but also other longer assembly pathways (another output of our algorithm; see section S4). The key building blocks included in all these pathways, constituting the assembly space of these compounds, are highly related and encode their structural information. Given that the molecules connected by these graphs are real and functionally interesting, we think it could be fruitful to see if navigating the trajectories defined by these pathways could lead to the discovery of new potential drug candidates. The idea to explore the space of natural products by fragmenting them is not new and has been tried before (–). For instance, the fragment-based drug discovery (FBDD) is a de novo generation strategy that uses fragments of known bioactive compounds to obtain new drug-like molecules (, ). The idea behind FBDD is that functionally comparable molecules share structural similarities; therefore, selecting fragments from molecules would propagate their properties, such as biological activity, to the newly generated compounds (–). Compared to the size of the comparable chemical space, the assembly space of these compounds is substantially smaller. We used the MC method to compute the pathways, so the size of the assembly space changes along with the number of MC steps. We found that the size of the assembly space is approximately 1500, which was consistent even as we increase the number of MC steps up to the number that is far larger than the number needed to have very short pathways found. In this case, the smallest MA is found to be 105, while the total number of bonds in these 10 compounds is 268 (see details in section S6 for how the size of the assembly space changes with the number of MC steps). This small number (~1500) is not because our approach or our program cannot find more unique structures, but the assembly space itself is intrinsically small since it only includes structures that are shared by at least two compounds, instead of any feasible structure or fragment. For a computationally tractable comparison, we used MOLGEN 5.0 () to calculate the total number of structures possible using a total of 10 carbon, nitrogen, oxygen, or sulfur species, which amounts to over 109 unique possibilities and is much larger than the assembly space here. As morphine has 21 nonhydrogen atoms, the number of possible structures will be much larger than 109. Next, we used the assembly trees for the purpose of de novo molecule generation by reconnecting the elements from the assembly pool. Critically, the products of these “reassembly” are closely related to the parent compounds, both structurally and in terms of functional properties. This means that the reassembly process locally explores the chemical space of the input compounds. To do this, we have implemented a method known as the Reassembler (see section S7 for details). Briefly, it connects the assembly pool elements through the same pattern in which they were disconnected from their parent compound(s). This is effectively the reverse process used to generate the tree in the first place. To prove that generated molecules retain similarity to the parent compound(s), we have generated the assembly pool of known natural opiates [in contrast to the 10 compounds in Fig. 6, here, we used codeine, morphine, noscapine, oripavine, papaverine, and thebaine (Fig. 7A) to avoid bias]. We used this assembly pool to generate 1000 opiate-like compounds, and an example set of these molecules is shown in Fig. 7B.

Fig. 7.

Comparison between natural opiates and opiate-like molecules generated using Reassembler.

Comparison between natural opiates and opiate-like molecules generated using Reassembler.

(A) shows the six opiates used to generate the assembly pools, and (B) shows six new opiate-like molecules generated from those assembly pools. See section S7.3 for more detailed information on more new compounds. For comparison purposes, we also generated 1000 random compounds only from the bonds present in the opiates (C─C, C═C, C─O, C═O, C─N, and C═N). For consistency, we have limited the molecular weight of all generated molecules and the unsaturation levels in the same range as the parent opiates, i.e., 281 to 368 Da and 9 to 12 degrees of unsaturation. To ensure that the generated molecules were at least chemically plausible, they were passed through two filters: The first filter uses SMARTS (SMILES arbitrary target specification) patterns that are commonly used to detect forbidden structures/structural motifs (), while the other is based on RDKit conformation optimizer (see section S7) (). If any of the forbidden structures were present or no conformation could be found at all, then the molecule was rejected, and another was generated in its place. As shown in Fig. 8A, molecules generated from assembly pools showed significantly higher similarity to opiates than the random compounds (we also compared our result with the molecules generated from arbitrary substructural fragments; see section S8 for details, where our result is still better). Furthermore, they also exhibited similar levels of drug-likeness to the opiates, measured using the “quantitative estimate of drug-likeness (QED)”, as opposed to random molecules which were significantly less drug-like (see Fig. 8B), showing that properties of parent molecules were retained just as well as the structural similarities. While two of the used parent opiates, morphine and oripavine, may look almost identical to the human eye, the seemingly small differences (i.e., morphine cyclohexene ring versus oripavine cyclohexadiene ring with extra methylation on its hydroxyl group) determine significant differences in their properties. Morphine is a common analgesic, while oripavine is not clinically useful because of its toxicity and low therapeutic index. This suggests that, in the chemical space, the distance between these molecules might be greater than the apparent structural similarities alone.

Fig. 8.

The comparison of 1000 molecule sets generated from opiate assembly pool (blue) and generated from individual bonds (red).

The comparison of 1000 molecule sets generated from opiate assembly pool (blue) and generated from individual bonds (red).

(A) According to the Tanimoto similarity measure, products of assembly pools were significantly more similar to the parent molecules (opiates) than randomly generated products. (B) QED shows that the assembly products, unlike their random counterparts, showed similar level of drug-likeness to opiates (denoted by gray dotted line). (C) On the basis of logP estimation, assembly products usually had higher logP than opiates (denoted by gray dotted line), while random molecules usually had lower logP. Therefore, a more discrete measure is needed to capture their overall similarity level. In the example of assembly-opitate-1 (AP1) (see Fig. 7B), it has a discrete structure comprising a skeleton that combines the assembly spaces of the opiate-based space surrounding the known opiates, such as morphine or codeine. Thus, it is easy to see the structural similarities between AP1 and morphine as shown by the Tanimoto similarity score ca. 0.24. Furthermore, the QED is 0.72, while the QED of morphine is 0.70, which is notably close. In addition, the logP of AP1 is 2.42, while morphine logP is 1.20. Nevertheless, the logP of morphine’s close relative, oripavine, is 2.12, which is close to our hypothetical AP1. Thus, our hybrid AP1 seems to occupy a position in chemical space intermediate between morphine and oripavine. The compound has similarity to most opiates, including codeine, thebaine, and even noscapine, thereby occupying the intermediate space between all the opiates. This is promising since these molecules could be reasonably considered as novel synthetic targets to be made (their constrained skeletons are themselves a hard target) and appear to be the first in a line of artificial natural products. A key question is whether it is cost effective to generate practical synthetic approaches to such molecules and to constrain the search of the assembly space around molecules that might be themselves easily made. Our results demonstrate how the assembly theory can be used to generate compressed representations of chemical space while retaining the relevant chemical and structural information. This means that it is possible to show how assembly spaces can be extended to multiple compounds, and we introduced a notation to effectively represent the key features of such assembly spaces. By developing an MC algorithm to calculate the shortest assembly pathways of a single molecule, we also show how it is possible to generate the assembly tree of any collection of molecules. We demonstrated this methodology in four distinct use cases, prebiotic chemistry, genetics, environmental chemistry, and drug discovery. The assembly tree of biomolecules shows that those vital molecules used in biology represent a compressed subset of the possible compounds, suggesting that they were subjected to evolutionary optimization. By analyzing the assembly space of plasticizers, it was possible to identify structural motifs common to many different pollutants. These motifs can be used as general-purpose signals to identify entire classes of pollutants in complex environmental samples. The analysis of the opioids provides a map to future drug development, by extracting the hierarchical relationships between compounds and identifying key components of possible drug candidates. The analysis of gene sequences demonstrates how assembly spaces can provide a lossless compression of sequences that retains the repeated motifs, demonstrating how complex gene sequences could be reconstructed from minimal genetic inputs. These four different case studies demonstrate the wide applicability of assembly trees as a tool in a diverse set of disciplines. Last, we developed a scheme to take a target set of molecules as templates for silico discovery and explore the enumeration of these targets using random and assembly tree–constrained approaches (Figs. 7 and 8). The evaluation of the properties provides a useful test of how assembly trees can capture information encoded with a chemical network constructed using external constraints from either biochemistry or technology. We hope that by exploring these trees, with information from other sources about what molecules are synthetically feasible, it will be possible to develop new routes to structure expansion that encode the transformations of chemical synthesis.

MATERIALS AND METHODS

The MC algorithms described in this manuscript were implemented in C++ and can be compiled using Visual Studio 2019. This code relied on the InChI standard libraries found at www.inchi-trust.org/. The Reassembler approach to generate new compounds from assembly pools was implemented in Python (with RDKit). Details on the implementation and instructions on how to use the software can be found in the Supplementary Materials.

48 in total

1. Ribose biosynthesis and evidence for an alternative first step in the common aromatic amino acid pathway in Methanococcus maripaludis.

Authors: D L Tumbula; Q Teng; M G Bartlett; W B Whitman
Journal: J Bacteriol Date: 1997-10 Impact factor: 3.490

Review 2. Navigating chemical space for biology and medicine.

Authors: Christopher Lipinski; Andrew Hopkins
Journal: Nature Date: 2004-12-16 Impact factor: 49.962

Review 3. Chemical Space: Big Data Challenge for Molecular Diversity.

Authors: Mahendra Awale; Ricardo Visini; Daniel Probst; Josep Arús-Pous; Jean-Louis Reymond
Journal: Chimia (Aarau) Date: 2017-10-25 Impact factor: 1.509

4. Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds.

Authors: Aaron M Virshup; Julia Contreras-García; Peter Wipf; Weitao Yang; David N Beratan
Journal: J Am Chem Soc Date: 2013-05-02 Impact factor: 15.419

Review 5. Privileged scaffolds for library design and drug discovery.

Authors: Matthew E Welsch; Scott A Snyder; Brent R Stockwell
Journal: Curr Opin Chem Biol Date: 2010-03-18 Impact factor: 8.822

6. One Among Millions: The Chemical Space of Nucleic Acid-Like Molecules.

Authors: Henderson James Cleaves; Christopher Butch; Pieter Buys Burger; Jay Goodwin; Markus Meringer
Journal: J Chem Inf Model Date: 2019-09-23 Impact factor: 4.956

Review 7. Principle and design of pseudo-natural products.

Authors: George Karageorgis; Daniel J Foley; Luca Laraia; Herbert Waldmann
Journal: Nat Chem Date: 2020-02-03 Impact factor: 24.427

8. Strategy to discover diverse optimal molecules in the small molecule universe.

Authors: Chetan Rupakheti; Aaron Virshup; Weitao Yang; David N Beratan
Journal: J Chem Inf Model Date: 2015-02-19 Impact factor: 4.956

9. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules.

Authors: Rafael Gómez-Bombarelli; Jennifer N Wei; David Duvenaud; José Miguel Hernández-Lobato; Benjamín Sánchez-Lengeling; Dennis Sheberla; Jorge Aguilera-Iparraguirre; Timothy D Hirzel; Ryan P Adams; Alán Aspuru-Guzik
Journal: ACS Cent Sci Date: 2018-01-12 Impact factor: 14.553

10. PubChem: a public information system for analyzing bioactivities of small molecules.

Authors: Yanli Wang; Jewen Xiao; Tugba O Suzek; Jian Zhang; Jiyao Wang; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2009-06-04 Impact factor: 16.971

1 in total

1. Ladderpath Approach: How Tinkering and Reuse Increase Complexity and Information.

Authors: Yu Liu; Zengru Di; Philip Gerlee
Journal: Entropy (Basel) Date: 2022-08-05 Impact factor: 2.738

1 in total