
Routescore: Punching the Ticket to More Efficient Materials Development.

Martin Seifrid^1,2, Riley J Hickman^1,2, Andrés Aguilar-Granda^1,2, Cyrille Lavigne^2, Jenya Vestfrid^1,2, Tony C Wu^1,2, Théophile Gaudin^2,3, Emily J Hopkins^1, Alán Aspuru-Guzik^1,2,4,5.

Abstract

Self-driving laboratories, in the form of automated experimentation platforms guided by machine learning algorithms, have emerged as a potential solution to the need for accelerated science. While new tools for automated analysis and characterization are being developed at a steady rate, automated synthesis remains the bottleneck in the chemical space accessible to self-driving laboratories. Combining automated and manual synthesis efforts immediately and significantly expands the explorable chemical space. To effectively direct the different capabilities of automated (higher throughput and less labor) and manual synthesis (greater chemical versatility), we describe a protocol, the RouteScore, that quantifies the cost of combined synthetic routes. In this work, the RouteScore is used to determine the most efficient synthetic route to a well-known pharmaceutical (structure-oriented optimization) and to simulate a self-driving laboratory that finds the most easily synthesizable organic laser molecule with specific photophysical properties from a space of ∼3500 possible molecules (property-oriented optimization). These two examples demonstrate the power and flexibility of our approach in mixed synthetic planning and optimization and especially in downselecting promising candidates from a large chemical space via an a priori estimation of the synthetic costs.


Year: 2022 PMID: 35106378 PMCID: PMC8796309 DOI: 10.1021/acscentsci.1c01002

Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553

Introduction

Molecular design and discovery is a universal challenge across the chemical sciences, which requires exploring a vast chemical space.[1−3] Self-driving laboratories, also known as materials acceleration platforms (MAPs), have the potential to make faster, more efficient progress by “closing” the chemical discovery loop: integrating property prediction, synthesis, analysis, characterization, and experiment planning.[4−6] One of the key challenges in building self-driving laboratories is developing a platform capable of autonomously performing all experiments from synthesis to characterization. Automated synthesis platforms (ASPs) are therefore an integral element of MAPs: synthesis is the engine that drives the exploration of chemical space. At the moment, ASPs are only capable of performing a very limited set of reactions in comparison to human chemists.[7−12] As a result, the chemical space accessible to MAPs is limited by the reactions the ASP can perform, as well as the price and availability of the starting material library, since high-throughput experiments often require more material than manual synthesis. Consequently, molecules incorporating starting materials that are unavailable or cost-prohibitive cannot be explored, even though computations may predict them to have highly desirable properties. To this end, we envision a combined synthetic strategy including both manual and automated synthesis (Figure 1), where human chemists synthesize the molecules inaccessible to the ASP while taking advantage of its increased throughput to more rapidly travel through chemical space.

Figure 1

Subway map of chemical space (a) depicting travel through chemical space from (b) 1,4-dibromobenzene and dimethyldichlorosilane (circle) to the target molecule (hexagon) using both manual reactions (blue) and automated iterative Suzuki–Miyaura cross-coupling reactions (pink).

The combined automated and manual synthetic approach to traversing chemical space can be likened to a subway system in a large city. In this “chemical metropolis,” the cost of the starting materials is analogous to rental or housing prices: the closer you are to your target, the more expensive the starting materials. In this analogy, the subway lines, fast and efficient with limited stops, are the reactions carried out by the ASP. Manual reactions, slow and costly but much more versatile, are walking to or from the subway station. Finally, the “fare” for traveling through chemical space is the monetary, material, and time costs of carrying out the syntheses.

Quantifying the difficulty of synthesizing a target molecule is a very important challenge in both synthetic chemistry and cheminformatics. Commonly, synthetic accessibility is quantified on the basis of a variety of structural features of the target molecule, including the number of rings and stereocenters, the complexity of the target molecule’s graph representation, and similarity to the starting materials.[13−15] Other approaches consider these factors, as well as more practical considerations, such as the probability of finding a similar molecule or substructure in a database of purchasable starting materials or the costs of starting materials.[16,17] However, some of these metrics rely on weights for each factor that are assigned on the basis of fitting to expert opinion. In addition to the significant human labor required to determine the weights, this restricts the metric to evaluating only molecules similar to those that were scored by experts.
Machine learning (ML) based approaches for calculating synthetic accessibility have recently been shown to accurately estimate the complexity of a target molecule and synthetic route.[18−21] However, these also face similar limitations in terms of transferability and training. Currently, no synthetic accessibility metrics exist for combined manual and automated synthetic routes. In this work, we present a new method to evaluate the cost of synthetic routes. The RouteScore requires no pretraining or fitting and is based on objective inputs and weights such as cost of labor and materials and human or robot time. Although it is designed with the “subway map” approach of combined manual and automated synthesis in mind, the RouteScore is equally adaptable to fully automated or fully manual synthesis. Furthermore, it can be used in both a priori synthetic route planning and in an a posteriori evaluation of syntheses. While the examples presented in this paper deal primarily with research-scale synthesis, the RouteScore framework could be adapted to process-scale synthesis by adding considerations important to process chemists in the “monetary cost” component of the equation. First, we describe how the RouteScore can be used to determine the most efficient synthetic route from many (structure-oriented) by comparing 10 different syntheses of a molecule with many known routes, modafinil. Then, we show how, by traveling through the chemical subway map, multiobjective optimization using the RouteScore as one of the objectives can be used to determine promising candidate molecules for organic laser molecules (property-oriented).

Results and Discussion

Cost of a Synthetic Route

To select the best synthetic route, we calculate the route’s cost per amount of target molecule produced. We primarily consider three factors in determining the cost of a reaction: time, money, and mass efficiency. The last two are rarely considered in academic settings.[17] Here, we define the mass cost as the total mass of reactants and reagents required for a reaction. This factor rewards reactions that efficiently build up the target molecular structure without creating additional waste (e.g., protecting groups, leaving groups, etc.), similarly to previously developed metrics for synthetic efficiency.[22−24] The monetary cost is defined as the sum of the cost of human and robotic labor and the total cost of the reactants and reagents used in the reaction. This is the factor that will be most variable among laboratories, institutions, and countries due to differences in labor and material costs. For the purpose of clarity, we have included a full breakdown of our calculations of the labor costs in Tables S1 and S2. The monetary cost of starting materials synthesized in a previous step along the route is not factored into the monetary cost of a subsequent reaction so as to avoid double-counting. A “step” is defined as a reaction that requires setting up and later cleaning labware; that is to say, one-pot multistep reactions, although they involve multiple chemical transformations, only count for a single step in the RouteScore, as these are generally more efficient. In the case of an a priori estimation of the RouteScore, the yield should be assumed to be 1. However, estimates of a reaction’s yield could also be provided by forward reaction prediction algorithms.[25−28]

We define the total time cost (TTC) of combined human and robotic syntheses as follows:

\[ \mathrm{TTC} = \sqrt{(C_H t_H)^2 + (C_M t_M)^2} \]

where tH and tM are the human and machine times, weighted by the corresponding hourly labor costs CH and CM. The surface of all possible time costs is a cone with minimum of 0 at tH = 0 and tM = 0 (Figure 2). This results in a linear increase in the TTC for any combination of tH and tM.
In the case where the hourly costs of human (CH) and machine (CM) labor are different, the surface is an elliptical cone where the semimajor and semiminor axes correspond to the ratio of CH and CM. We generally expect human time to be more expensive. This means that, for equal increases in tH and tM, an increase in tH results in an increase in the TTC that is proportional to CH/CM. Therefore, the RouteScore will disincentivize reactions that require large tH. Only taking into account tH could lead the RouteScore to favor reactions that require very large tM, which is also undesirable.
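As a minimal sketch of the time cost, the elliptical-cone description can be read as a cost-weighted Euclidean norm, TTC = sqrt((CH·tH)² + (CM·tM)²). That functional form is an assumption inferred from the description above, and the function name and the hourly weights used below are hypothetical placeholders rather than the values in Tables S1 and S2.

```python
import math


def total_time_cost(t_h: float, t_m: float, c_h: float = 1.0, c_m: float = 1.0) -> float:
    """Cost-weighted Euclidean norm of human time t_h and machine time t_m (hours).

    Assumed form: sqrt((c_h * t_h)**2 + (c_m * t_m)**2).
    With c_h == c_m the surface is a circular cone; otherwise it is elliptical.
    """
    return math.hypot(c_h * t_h, c_m * t_m)


# Equal weights: scaling both times by a factor scales the TTC by the same factor.
ttc_small = total_time_cost(3.0, 4.0)   # 5.0
ttc_large = total_time_cost(6.0, 8.0)   # 10.0

# Hypothetical weights with human time three times as expensive as machine time:
# one extra human hour raises the TTC more than one extra machine hour.
base = total_time_cost(2.0, 2.0, c_h=3.0, c_m=1.0)
more_human = total_time_cost(3.0, 2.0, c_h=3.0, c_m=1.0)
more_machine = total_time_cost(2.0, 3.0, c_h=3.0, c_m=1.0)
```

Because both times enter the norm, a route cannot lower its time cost simply by shifting an unbounded amount of work onto the machine, which matches the reasoning given above.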

Figure 2

Plot of the TTC as a function of human and robot time.

On the basis of these considerations, we define the cost of a reaction step along the synthetic route (StepScore) to be

\[ \mathrm{StepScore} = \mathrm{TTC} \times \left( \sum t_{H,M} C_{H,M} + \sum_i n_i C_i \right) \times \sum_i n_i \mathrm{MW}_i \]

where n is the molar quantity of a given reactant or reagent, C is its cost, and MW is its molecular weight. Since a synthetic step can require either only manual labor or both manual and automated labor, the terms tH,M and CH,M refer to the human (tH and CH) or machine (tM and CM) costs or both. For purely manual synthesis, CH, CM, and tM can be dropped, giving TTC = tH. When the TTC value for automated synthesis is determined, it is also important to account for the human time required for maintenance of the robotic chemist. The StepScore equation is sufficiently flexible for additional factors to be included in the calculations if needed. The purchasing cost of workup and purification materials (e.g., solvents, silica for chromatography, etc.) can easily be added to the cost in the same way as for the reactants in the ∑nC component of the equation. One challenge is that it can be very difficult to predict purification costs a priori because they are heavily dependent on the physical properties of each molecule. However, initial efforts to quantify the separability of major and minor products in a reaction have been demonstrated in the literature.[29] In the case of process chemistry, the removal and disposal of solvents, as well as the energy and waste disposal costs associated with workup and purification, are significant factors. These costs may be difficult to calculate independently since waste disposal and energy costs are often consolidated in the form of waste disposal contracts or building energy usage. If a detailed breakdown is available, then these costs could be included in the monetary cost component of the StepScore. Otherwise, these costs can be included in the operator costs (CH or CM) in the form of average hourly costs.
An example of included operational costs is provided in Tables S1 and S2. In the case of a priori cost estimation such as the examples herein, workup and purification costs are not included because synthesis procedures reported in the literature often do not include sufficient information to calculate the cost of purification, such as exact volumes of solvent for column chromatography or recrystallization. To make syntheses at different scales comparable, the sum of all StepScores is normalized by the quantity of target material produced (nTarget). The RouteScore, with units of h·$·g·(mol of target molecule)−1, can therefore be expressed with this equation:

\[ \mathrm{RouteScore} = \frac{\sum \mathrm{StepScore}}{n_{\mathrm{Target}}} \]
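The calculation described in this section can be sketched in a few lines. This is an illustrative reading of the text, not the authors' implementation: the StepScore is taken as the product of the time cost, the monetary cost (labor plus ∑nC), and the mass cost (∑n·MW), and the RouteScore as the sum of StepScores divided by nTarget. The class and function names, the hourly rates, and the reagent values are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Reagent:
    moles: float         # n, in mol
    cost_per_mol: float  # C, in $/mol
    mol_weight: float    # MW, in g/mol


def step_score(reagents, t_h, t_m=0.0, c_h=50.0, c_m=15.0):
    """Assumed form: time cost x monetary cost x mass cost for one step.

    c_h and c_m are placeholder hourly rates ($/h), not values from the paper.
    """
    # Purely manual step: the text states that the time cost reduces to t_h.
    time_cost = t_h if t_m == 0 else math.hypot(c_h * t_h, c_m * t_m)
    money = c_h * t_h + c_m * t_m + sum(r.moles * r.cost_per_mol for r in reagents)
    mass = sum(r.moles * r.mol_weight for r in reagents)
    return time_cost * money * mass


def route_score(step_scores, n_target):
    """Sum of StepScores normalized by the moles of target produced."""
    return sum(step_scores) / n_target


# One hypothetical manual step taking 2 h of human time.
reagents = [Reagent(0.010, 100.0, 150.0), Reagent(0.012, 50.0, 80.0)]
single_step = step_score(reagents, t_h=2.0)

# A priori estimate: yield assumed to be 1, so 0.010 mol of target is produced.
score = route_score([single_step], n_target=0.010)
log_score = math.log10(score)
```

For this made-up step the monetary cost is 2 h × $50/h + $1.00 + $0.60 = $101.60 and the mass cost is 2.46 g, so the StepScore is 2 × 101.6 × 2.46 ≈ 499.9 and the RouteScore is about 5.0 × 10^4.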

Synthetic Route Optimization for a Well-Studied Drug Molecule

It can be difficult to quantify the efficiency of a diverse set of synthetic routes. To demonstrate the usefulness of the RouteScore for addressing this challenge, we selected a drug, modafinil, which has many known synthetic routes (Table S3).[30−37] For each route (Figure 3), we determined the required human time on the basis of our own estimates (see the Supporting Information for details) and calculated the RouteScore. The synthetic routes vary from a patented industrial-scale preparation[36] to a milligram-scale synthesis performed to screen modafinil’s anti-inflammatory activity.[32]

Figure 3

Ten routes to synthesize modafinil, with the route number shown in boldface and log(RouteScore) in gray underneath. The rings of each molecule are colored on the basis of how often that molecule appears in the 10 routes. Dashed arrows represent one-pot multistep reactions, which we treated as a single step.

We find that the scale of the synthetic route and the number of steps do not correlate strongly with the RouteScore (Figure 4). Routes 1 and 3 both start from diphenylmethanol, take three steps, and have similar overall yields (65–66%, Figure S3). The main difference between the two routes is in the procedure. Repeated drying and purification by recrystallization are labor-intensive (Table S4) and often involve loss of 5–10% of the product. Route 1 requires numerous recrystallizations, which raises the total labor time from 4 h (route 3) to 6.5 h (route 1). Unlike the other routes, route 5 is carried out as a one-pot multistep synthesis in bespoke 3D-printed reactionware, which is intended to minimize the human labor required to carry out syntheses. However, this route requires many small operations (e.g., preparing syringes and transferring solutions from one reactor module to another) which add up to 6 h of labor time. As a result, route 6, which is almost identical with route 5, requires slightly less time (5.5 h). The routes that include formation of the 2-(benzhydrylthio)acetyl chloride intermediate (routes 4, 5, 6, and 10) are much less efficient, likely due to the extra precautions and labor required to use reagents such as thionyl chloride and oxalyl chloride. Route 7 takes four steps, but ends up being less costly (log(RS) = 6.136) than routes 5 and 6 despite substantial labor costs (9.25 h) and a mediocre overall yield (34%) because the monetary cost of each step is quite low (on average $123 per step).
Finally, routes 2 and 3, which use Nafion as a catalyst, are the most efficient because they require very little labor, are cheap to carry out, and efficiently utilize the catalyst and starting materials to build up the target molecule (Figure S3). Notably, route 3 is less costly than route 2 despite requiring more labor and having one more step because it uses a much cheaper method of introducing the thioether and amide groups. The 2-mercaptoacetamide reactant costs $2303 CAD/mol, while methyl thioglycolate only costs $29 CAD/mol and the amide can easily be synthesized from ammonia ($83 CAD/mol) at the last step. The effectiveness of this strategy is supported by a similar approach in the patented industrial synthesis (route 9).[36] Although route 9 is carried out on an industrial scale, it is the least efficient (log(RS) = 7.700) because it suffers from a below-average overall yield of 23% (Figure S3) and requires a significant amount of human labor (9.5 h). Since the RouteScore has identified this industrial-scale synthesis as being inefficient, its quantitative information can be used to translate the advantages of other syntheses of modafinil to a more efficient method to potentially produce large quantities of the target molecule.

Figure 4

Results of evaluating the 10 modafinil synthetic routes using the RouteScore. The horizontal axes correspond to the total human time required to perform each synthesis and the overall yield of the route. The vertical axis corresponds to log(RouteScore). Each point is colored on the basis of the number of synthetic steps and labeled by its route number.

Multiobjective Optimization of Organic Laser Molecules

To demonstrate the usefulness of the RouteScore approach for searching chemical space, we performed an in silico optimization of optoelectronic properties of potential organic laser molecules.[38] The use of organic laser molecules in the solid state could be a very interesting technology for portable devices and is a logical extension of organic light-emitting diode technology. The initial set of molecules are those that can be synthesized by two steps of automated iterative Suzuki–Miyaura cross-coupling (iSMC) reactions[7,39] (Figure 5a) from three groups of building blocks, A, B, and C (Figure S4), to form A–B–C–B–A pentamers. The terms “building block” and “fragment” are sometimes used interchangeably in settings that include both computational and synthetic material design, which can cause confusion. Here, the term “building block” refers to a molecule, which has reactive functional groups, that is used as a reactant in the synthetic route. On the other hand, “fragment” refers to a structural template that is used in computational screening. We randomly picked 10 A blocks, 11 B blocks, and 18 C blocks from a list of aromatic compounds, resulting in a space of 1980 symmetric pentamers that could be synthesized in an automated fashion. Most of the blocks are commercially available; however, three are not (blue in Figure S4c). In our model system, those were prepared by manual synthesis (Figure S5) using procedures in the literature.[40−42] We estimated the human time required for each synthesis on the basis of prior experience (Table S5). Due to compatible functional groups, certain pentamers can be expanded with postautomation manual synthetic steps involving either nucleophilic aromatic substitution[43] (SNAr) by a carbazole (Figure 5b) or a Buchwald–Hartwig amination[44] (BHA) with 2-bromopyrazine, via a tert-butoxycarbonyl (Boc) deprotection step (Figure 5c).

Figure 5

Three syntheses used in our example: iterative Suzuki–Miyaura cross-coupling (a), nucleophilic aromatic substitution (b), and Buchwald–Hartwig amination (c). The following reagents were used for each general type of reaction: (i) XPhos Pd G2, K3PO4; (ii) Cs2CO3; (iii) K2CO3; (iv) Pd2(dba)3, DavePhos, NaO-t-Bu. Structures of the organic reagents are provided in Figure S6.

The manually synthesized C blocks along with the SNAr and BHA reactions allow us to explore how adding manual synthetic steps into an otherwise automated synthetic exploration of chemical space affects the RouteScore. There are 198 pentamers only subjected to manual synthetic modification via Buchwald–Hartwig amination and 1231 pentamers only modified by manual SNAr reactions. Finally, there are 49 pentamers that undergo both SNAr and BHA. For these, we compare the cost of performing either the SNAr or BHA reactions first. Using only three general types of reactions and 41 total building blocks, we are able to access a chemical space of 3458 molecules. As expected, we find that the most efficient synthetic routes do not involve any manual synthetic steps after the automated pentamer synthesis (Figure 6). The relative cost for the manual synthesis of starting materials depends strongly on the particular intermediates and the reactions being carried out. In the iSMC set, synthetic routes with the three manually synthesized C blocks are ∼186 times more costly on average in comparison to pentamers synthesized exclusively from commercially available starting materials. Candidate molecules can also be synthesized using SNAr or BHA reactions. We find that for the set of 49 molecules that undergo both the SNAr and BHA reactions, it is less efficient to perform the BHA as the second step than as the first step.
The difference in RouteScore between the SNAr-followed-by-BHA (S–B) and BHA-followed-by-SNAr (B–S) routes is due to the difference in mass of required starting materials for the Boc-deprotection and BHA reactions (Figures S7 and S8). In essence, the S–B routes have a higher RouteScore because larger-molecular-weight groups are added earlier in the synthetic route than is the case for the B–S routes. Therefore, the mass of starting material required to produce the same quantity (moles) of the target molecule is greater for S–B routes than for B–S routes. As a result, the RouteScore of S–B routes is ∼4% greater than that of B–S routes.
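The size of this search space can be checked with a short enumeration. The sketch below uses hypothetical placeholder labels (A1, B1, C1, and so on) in place of the real building blocks in Figure S4; only the counts are taken from the text.

```python
from itertools import product

# Placeholder labels standing in for the building blocks of Figure S4.
blocks_a = [f"A{i}" for i in range(1, 11)]  # 10 A blocks
blocks_b = [f"B{i}" for i in range(1, 12)]  # 11 B blocks
blocks_c = [f"C{i}" for i in range(1, 19)]  # 18 C blocks

# Symmetric A-B-C-B-A pentamers: one choice each of A, B, and C.
pentamers = [
    "-".join((a, b, c, b, a)) for a, b, c in product(blocks_a, blocks_b, blocks_c)
]

# Counts of manually post-functionalized molecules, as reported in the text.
n_bha_only, n_snar_only, n_both = 198, 1231, 49

total_molecules = len(pentamers) + n_bha_only + n_snar_only + n_both

# 39 iSMC blocks plus two further reactants; assuming these are the carbazole
# and 2-bromopyrazine gives the 41 building blocks quoted in the text.
total_building_blocks = len(blocks_a) + len(blocks_b) + len(blocks_c) + 2
```

The 49 molecules that undergo both reactions are counted once, even though each has two candidate routes (S–B and B–S) with different RouteScores.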

Figure 6

Violin plots of log(RouteScore) based on the type of synthesis used in the route. The numbers next to each violin correspond to the number of molecules in each set. The abbreviations are as follows: iSMC auto, molecules synthesized only by automated iSMC; iSMC man, molecules synthesized only by automated iSMC and manual building block synthesis; SNAr, molecules involving postfunctionalization with only SNAr reactions; BHA, molecules involving postfunctionalization with only BHA reactions; B–S, molecules where BHA reactions were performed before SNAr; S–B, molecules where SNAr reactions were performed before BHA. The white dot represents the median value, and the black box indicates the interquartile range.

We compared the performance of the RouteScore to simply calculating a “naïve” score from the cost of the chemicals used in the synthesis and found no significant correlation (Figure S9) except for the iSMC auto molecules, where labor is a negligible factor because of automation. We also compared both the naïve score (Figure S11) and RouteScore (Figure S12) to those of the SAscore,[15] SCscore,[18] SYBA,[20] and RAscore[21] and found very little correlation. This is likely because each of these scores seeks to quantify synthetic accessibility or molecular complexity in different ways. In particular, RouteScore is more focused on the specific route, rather than giving a single score to a molecular structure. For example, the aforementioned scores would not be able to differentiate between S–B vs B–S routes discussed above, not to mention modafinil.

One of the primary goals of MAPs is achieving an efficient inverse design of functional molecules.[5,45] Rather than enumeration of large combinatorial spaces of molecules with potentially costly property measurements, the inverse design paradigm seeks to discover molecules starting from a desired property or set of properties.
Selecting molecules that satisfy multiple predefined targets simultaneously (e.g., strong emission in a particular wavelength range and low synthetic cost) is a critical but challenging decision-making process, especially when the property measurements are time- or resource-intensive. In this section, we simulate a MAP for the inverse design of organic laser molecules. The computationally predicted properties of laser molecules are optimized using a multiobjective, categorical variable approach. As objectives for the recently reported deep categorical Bayesian optimizer Gryffin,[47] we chose the RouteScore and three figures of merit that are important for developing new organic laser molecules.[38,46] The four targeted figures of merit in descending order of importance are (i) maximal fluorescence within a particular spectral range (400–460 nm in this case), (ii) minimal RouteScore, (iii) minimal spectral overlap between fluorescence and absorption spectra, and (iv) maximal fluorescence rate. First, maximizing fluorescence within a particular spectral range is necessary to develop a laser of a desired color, arguably the most critical property of any laser device. The RouteScore is chosen as the second most important figure of merit to reflect the necessity of finding organic laser molecules that can be synthesized in a cheap and efficient manner. Third, minimizing the spectral overlap corresponds to reducing losses from the self-absorption of emitted light, the inner filter effect.[48] Finally, maximizing the fluorescence rate should improve the quantum efficiency of the laser. The RouteScore is calculated as described above, while the other three figures of merit are derived from the results of high-throughput quantum chemical calculations (see the Supporting Information for details). There are three categorical variables, corresponding to the A, B, and C fragments (Figure S1), with 14, 13, and 19 options, respectively.
This space corresponds to 3458 unique molecules. We use the scalarizing function Chimera[49] to simultaneously optimize the four objectives. Chimera attempts to optimize each objective in order of importance to bring its value within a desired threshold, as described in ref (49). We set absolute tolerances such that roughly 1% of the entire molecular space (34 out of 3458 molecules) satisfies all 4 tolerances simultaneously (Figure 7a). We execute 50 independently seeded optimization runs, each evaluating properties for 500 molecules. Nearing 500 evaluations, we observe asymptotic behavior of the optimizer for each target property. Optimization traces for the four target properties are presented as blue traces in Figure 7b–e.

Figure 7

Molecular space for the multiobjective optimization represented in four dimensions (a). The gray points do not satisfy the optimization thresholds. The red, purple, orange, and blue points correspond to the molecules with the best peak score, RouteScore, spectral overlap, and fluorescence rate, respectively (Figure S13). The peak scores of the full molecular space are shown in Figure S14. At each iteration of the multiobjective optimizations using Gryffin and Chimera, we plot the four properties that correspond to the measurement with the best merit: peak score (b), RouteScore (c), spectral overlap (d) and fluorescence rate (e). The shaded areas around the curves correspond to the bootstrapped 95% confidence interval. The gray shaded area indicates regions in which tolerances are not satisfied. The dashed lines correspond to the absolute tolerance that must be satisfied for the peak score (>0.67), RouteScore (<10^5 h·$·g·(mol of target molecule)−1), spectral overlap (<0.2), and fluorescence rate (>0.16 ns−1). All four objectives are optimized simultaneously in the blue traces, while the RouteScore is excluded from the set of objectives in the maroon traces.

In this work, we compute the four objective values for all 3458 molecules in our search space before commencing the optimization experiments. As such, we can apply the scalarizing function to the entire data set a priori and rank the candidate molecules on the basis of the merit returned by Chimera. The 34 satisfactory molecules ordered by the merit-based function constructed from the 4-objective hierarchy and absolute tolerances are shown in Figure S15, with their objective values being given in Table S6. Optimizations of the merit-based function should then converge upon the top-ranked molecule. Here, we calculate the merit for all of the molecules in the search space to evaluate the performance of Gryffin and Chimera.
In the case of inverse design using Chimera with experimental data, this would not be feasible because we would not be able to find the global extrema without performing experiments for all 3458 molecules. Instead, an experimental optimization campaign could be carried out by selecting the top-ranked molecule after a predetermined number of iterations or by selecting the first molecule that is found to satisfy all four objectives. Additionally, it may not be evident how to set the threshold of the RouteScore for an unknown search space. We recommend two possible approaches: (i) setting the tolerance on the basis of the RouteScores of comparable known molecules or routes or (ii) setting the tolerance as RouteScore ≤ 0, which would lead to Chimera constantly trying to optimize the RouteScore objective. Gryffin rapidly identifies molecules with fluorescence spectra overlapping significantly with our target region (peak score >0.67). After the first objective is achieved, the RouteScore is decreased until its tolerance is satisfied after roughly 80 evaluations while the primary objective remains satisfied. The rapid decrease in the blue trace in Figure 7c at around 80 iterations is a result of “switching” from iSMC man molecules to iSMC auto molecules, between which there is a large gap in RouteScore values (see Figure 6). In other words, the algorithm begins to evaluate molecules that have less costly syntheses whose fluorescence spectra fall into the target energy interval in the very first steps of the optimization. The tertiary objective tolerance is satisfied almost immediately after beginning the optimization. As we improve upon the quaternary fluorescence rate objective, we observe a slight regression upon the tertiary objective: i.e., an increase in the spectral overlap.
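The idea of satisfying tolerances in order of importance can be illustrated with a deliberately simplified stand-in. The sketch below is not Chimera's scalarizing function: it only counts how many leading objectives in the hierarchy meet their tolerance and ranks candidates by that count. The thresholds follow the Figure 7 caption, reading the RouteScore tolerance as 10^5; the candidate molecules and their property values are invented for illustration.

```python
# (objective name, goal, absolute tolerance), in descending order of importance.
TOLERANCES = [
    ("peak_score", "max", 0.67),
    ("route_score", "min", 1e5),
    ("spectral_overlap", "min", 0.2),
    ("fluorescence_rate", "max", 0.16),
]


def satisfied_depth(mol: dict) -> int:
    """Number of leading objectives, in hierarchy order, whose tolerance is met."""
    depth = 0
    for name, goal, threshold in TOLERANCES:
        met = mol[name] > threshold if goal == "max" else mol[name] < threshold
        if not met:
            break
        depth += 1
    return depth


def is_satisfactory(mol: dict) -> bool:
    """True only when all four tolerances are satisfied simultaneously."""
    return satisfied_depth(mol) == len(TOLERANCES)


# Hypothetical candidates, not molecules from the paper.
candidates = {
    "mol_x": dict(peak_score=0.80, route_score=3.0e4, spectral_overlap=0.10, fluorescence_rate=0.20),
    "mol_y": dict(peak_score=0.90, route_score=2.0e7, spectral_overlap=0.05, fluorescence_rate=0.30),
    "mol_z": dict(peak_score=0.50, route_score=1.0e4, spectral_overlap=0.10, fluorescence_rate=0.25),
}

ranked = sorted(candidates, key=lambda k: satisfied_depth(candidates[k]), reverse=True)
```

In this toy ranking, mol_y has the best photophysical values but stalls at the second objective because its synthesis is too costly, mirroring how the RouteScore tolerance steers the search toward cheaper routes once the peak score is satisfied.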
To emphasize the effect of including the RouteScore in the set of objectives, we conduct additional optimization runs using only three objectives: peak score, spectral overlap, and fluorescence rate. The top 20 molecules according to the merit-based function constructed from this three-objective hierarchy and absolute tolerances are shown in Figure S16, with their objective values being given in Table S7. Optimization traces for these experiments are shown in Figure 7b–e in maroon. Without the additional task of minimizing the RouteScore, Gryffin identifies molecules as being meritorious solely on the basis of the properties derived from quantum chemical calculations. As such, molecules identified after 500 iterations have properties comparable to those of the molecules in the blue traces but are significantly more costly to synthesize (average RouteScore >10^7) in terms of a combination of effort, price, and materials needed. We also execute similar optimizations using the naïve score in place of the RouteScore (Figure S18) and find a similar performance in peak score and spectral overlap. However, the optimizations with the naïve score are not able to satisfy the fluorescence rate threshold. Recently, several studies have highlighted the efficiency of ML-driven experiment planners for achieving inverse design.[50−56] We follow suit for our simulated MAP by quantitatively comparing its aptitude for identifying synthetically feasible laser molecules to that of a simple random sampling strategy. Here we consider the following question: what fraction of total satisfactory molecules can each strategy identify given a budget of 500 evaluations (Figure S19)? In this context, satisfactory refers to a molecule whose properties simultaneously satisfy all of the tolerances. The Gryffin + Chimera strategy identifies on average 35 ± 3% of all satisfactory molecules after 500 evaluations, while random sampling identifies only 15 ± 1% of satisfactory candidates (Figure S19).
This corresponds to on average about 12 hits with Gryffin + Chimera but only 5 hits with random sampling. For the entirety of the optimization experiment, the Gryffin + Chimera strategy evaluates on average a greater fraction of total satisfactory molecules, indicating that ML-driven experiment planning strategies yield greater exposure to promising candidates given budgeted resources than does random sampling. The results of our MAP simulation indicate that the RouteScore can be seamlessly used alongside photophysical figures of merit in the multiobjective inverse design of organic laser molecules. In our 50 optimizations, the Gryffin + Chimera strategy identified 12 distinct molecules (Figure 8a), all of which can be synthesized using only automated iSMC reactions. The optimizations most frequently (42% of the time, Figure 8b) identify molecule 1 as the top candidate for synthesis. In contrast, the optimizations that only consider the peak score, spectral overlap, and fluorescence rate identify 10 molecules (Figure S20). Molecules 1, 2, and 3 in the four-objective (with RouteScore) optimizations are the same as molecules F, I, and J in the three-objective (without RouteScore) optimizations. However, these three molecules are only identified as the top choice in 12% of the three-objective optimizations, while they are identified as the top choice in 72% of the four-objective optimizations. Notably, many of the molecules identified in the RouteScore optimizations contain unusual substitution patterns for organic laser molecules.[38] For example, many of the top molecules are severely sterically hindered due to, e.g., 2,3- or ortho-substitution. This may be related to biases within the choice of building blocks and the target spectral range, since 400–460 nm corresponds to relatively high energy violet light. Nonetheless, this design motif may be worth exploring further experimentally and computationally.
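The hit counts quoted above follow from simple arithmetic, and the random-sampling figure can be cross-checked against what uniform sampling would be expected to give. The sketch below assumes the random baseline draws 500 distinct molecules uniformly from the 3458 candidates, an assumption about the baseline that is not spelled out in the text.

```python
total_molecules = 3458  # size of the search space
satisfactory = 34       # molecules meeting all four tolerances
budget = 500            # evaluations per optimization run

# Converting the reported fractions of satisfactory molecules into hit counts.
hits_gryffin = round(0.35 * satisfactory)  # about 12
hits_random = round(0.15 * satisfactory)   # about 5

# Expected hits for uniform sampling without replacement (hypergeometric mean).
expected_random_hits = satisfactory * budget / total_molecules
expected_random_fraction = budget / total_molecules
```

Under that assumption, uniform sampling is expected to find roughly 4.9 of the 34 satisfactory molecules, or about 14.5%, which is consistent with the reported 15 ± 1%.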

Figure 8

(a) Structures of the top molecules from the four-objective optimizations, numbered by their ranking by merit, and (b) the frequency with which they were found. Molecular structures of all satisfactory molecules are provided in Figure S16.

Conclusion

We have demonstrated a flexible new approach to quantifying the cost of synthesizing organic molecules, the RouteScore, on the basis of factors including the labor and monetary cost of the route, as well as the mass of material consumed. The RouteScore promotes more practical considerations about the amount of work required, rather than the elegance of the synthetic route. We have shown how this can be used to select the most efficient synthetic route to a well-known API with numerous reported syntheses. Furthermore, our approach—which aims to take into account the labor of both manual and automated synthesis—can be particularly useful as a tool in self-driving laboratories to expand the chemical space accessible by MAPs. To demonstrate this principle, we have carried out a multiobjective optimization to select a candidate organic laser molecule on the basis of its fluorescence within a desired wavelength range, its RouteScore, the overlap between its absorption and fluorescence spectra, and its fluorescence rate using the Gryffin and Chimera algorithms. The ML-driven optimizations efficiently identify top candidate molecules. In addition, optimizations that ignore the RouteScore identify molecules with similar predicted photophysical properties but that are more costly to synthesize according to RouteScore. Although we focused on organic materials, this method can be expanded to, e.g., inorganic materials synthesis. In general, the RouteScore and subway approaches may be a solution to the limited synthetic scope of self-driving laboratories. Although the RouteScore is generally robust, there are some important caveats. For example, determining the labor—and its cost—needed for each reaction will require more careful accounting than is typically carried out in academic laboratories. 
As discussed above, purification costs and yield remain very challenging to predict a priori because they are highly dependent on the physical properties of the molecules and materials used for purification. There are also additional costs that may be more significant for process chemistry, such as waste disposal and energy consumption. However, we believe that a better understanding of the underlying costs of material design will have significant benefits. Additionally, it may be desirable to remove the variance between RouteScore values in different currencies by normalizing price with respect to some commonly used chemical, similar to the Big Mac index.[57] Moreover, although the RouteScore is most easily compared between laboratories where the material and labor costs are relatively similar, the code released with this work is sufficiently flexible and easy to implement that we hope calculating the RouteScore for different laboratories does not impede its adoption. Finally, we are working to reduce the significant annotation effort required to provide accurate time and labor costs to calculate the RouteScore. By using a chemical descriptive language[9] or an algorithm that can convert synthesis procedure text into actions,[58] it should be possible to estimate time and effort automatically in a diverse chemical space.
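The currency-normalization idea mentioned above could be sketched as follows: each reagent price is divided by the local price of an agreed reference chemical, so that costs become dimensionless multiples of that reference. This is only an illustration. The function name, the reagent names, and all prices are hypothetical placeholders, and the code released with the paper is not assumed to provide anything like it.

```python
def normalize_prices(prices_per_gram, reference_price_per_gram):
    """Express each price as a multiple of the reference chemical's price
    (quoted in the same currency), which removes the currency unit."""
    if reference_price_per_gram <= 0:
        raise ValueError("reference price must be positive")
    return {name: price / reference_price_per_gram
            for name, price in prices_per_gram.items()}

# Hypothetical catalogs for the same two reagents, quoted in two currencies.
lab_a_prices = {"reagent_x": 12.0, "reagent_y": 3.0}
lab_b_prices = {"reagent_x": 15.0, "reagent_y": 3.75}

# Hypothetical local prices of the shared reference chemical.
norm_a = normalize_prices(lab_a_prices, reference_price_per_gram=0.5)
norm_b = normalize_prices(lab_b_prices, reference_price_per_gram=0.625)
print(norm_a)
print(norm_b)
```

In this made-up case the two laboratories differ only by an exchange-rate factor, so the normalized prices coincide; real catalogs would also differ in relative pricing, which normalization would not remove.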

