Literature DB >> 30276246

Phoenics: A Bayesian Optimizer for Chemistry.

Florian Häse¹, Loïc M Roch¹, Christoph Kreisbeck¹, Alán Aspuru-Guzik^1,2,3,4.

Abstract

We report Phoenics, a probabilistic global optimization algorithm identifying the set of conditions of an experimental or computational procedure which satisfies desired targets. Phoenics combines ideas from Bayesian optimization with concepts from Bayesian kernel density estimation. As such, Phoenics allows to tackle typical optimization problems in chemistry for which objective evaluations are limited, due to either budgeted resources or time-consuming evaluations of the conditions, including experimentation or enduring computations. Phoenics proposes new conditions based on all previous observations, avoiding, thus, redundant evaluations to locate the optimal conditions. It enables an efficient parallel search based on intuitive sampling strategies implicitly biasing toward exploration or exploitation of the search space. Our benchmarks indicate that Phoenics is less sensitive to the response surface than already established optimization algorithms. We showcase the applicability of Phoenics on the Oregonator, a complex case-study describing a nonlinear chemical reaction network. Despite the large search space, Phoenics quickly identifies the conditions which yield the desired target dynamic behavior. Overall, we recommend Phoenics for rapid optimization of unknown expensive-to-evaluate objective functions, such as experimentation or long-lasting computations.

Entities: Chemical Disease Species

Year: 2018 PMID： 30276246 PMCID： PMC6161047 DOI： 10.1021/acscentsci.8b00307

Source DB: PubMed Journal: ACS Cent Sci ISSN： 2374-7943 Impact factor: 14.553

Introduction

Optimization problems are ubiquitous in a variety of disciplines ranging from science to engineering and can take various facets: finding the lowest energy state of a system, searching for the optimal set of conditions to improve experimental procedures, or identifying the best strategies to realize industrial processes. They also have a rich history in chemistry. For example, conditions for chemical reactions are optimized with systematic methods like design of experiments (DOE).[1−3] More recently, optimization procedures assisted chemists in finding chemical derivatives of given molecules to best treat a given disease,[4] finding candidates for organic photovoltaics,[5] organic synthesis,[6,7] predicting reaction paths,[8−10] or automated experimentation.[11−14] Often, these applications are subject to multiple local optima, and involve costly evaluations of proposed conditions in terms of required experimentation or extensive computations. Optimization problems are typically formulated with an objective function, which for a given set of parameters returns a measure for the associated merit. This relates, for example, to measuring the yield of a chemical reaction conducted under specific experimental conditions. In the past, a variety of optimization algorithms have been developed. Gradient-based algorithms, such as gradient descent,[15] conjugate gradient,[16] or the more sophisticated BFGS,[17] are efficient at finding local optima. However, they require numerous evaluations, i.e., conducted experiments or computations, and are thus not well-suited for optimization problems in chemistry where evaluations of the objectives are often costly. Lately, the development of methods for finding the global optimum of nonconvex, expensive-to-evaluate objective functions has gained resurgence as an active field of research. Simplistic approaches consist in random searches, or systematic grid searches. While the advantages of random searches have been demonstrated in the context of hyperparameter optimization for machine learning models,[18,19] systematic grid search approaches like DOE were successfully applied to real-life experimentation planning.[1−3] More sophisticated methods include genetic algorithms[20,21] based on evolution strategies,[22,23] which are, for example, applied to the optimization of nanoalloy clusters,[24] or to resolve electronic spectra of rotamers of organic compounds.[25] Yet, such methods still require many evaluations of the objective function.[26] Bayesian optimization approaches have emerged as a popular and efficient alternative during the past decade.[27−33] The typical procedure of Bayesian optimization schemes consists of two major steps: First, an approximation (surrogate) to the merit landscape of the conditions is constructed. Second, a new set of conditions is proposed for the next evaluation based on this surrogate. As such, Bayesian optimization speculates about the experimental outcome using all previously conducted experiments, and verifies its speculations by requesting the evaluation of a new set of conditions. Several different models have been suggested for approximating the objective function landscape, ranging from random forests (RFs),[33−35] over Gaussian processes (GPs),[31,36] to Bayesian neural networks (BNNs).[37,38] Likewise, a variety of methods for proposing new conditions from the surrogate exists.[28,31,39−42] Bayesian optimization has been successfully employed for a variety of applications across chemistry. Examples include the property optimization of functional organic molecules,[43] calibrating high-throughput virtual screening results for organic photovoltaics,[44] or designing nanostructures for phonon transport.[45] Nevertheless, we identify three challenges in the deployment of Bayesian optimization techniques: One challenge is the domain specificity of the surrogate model: GPs typically perform well on continuous objective landscapes, while RFs are typically more suitable for discrete and quasidiscrete landscapes. Unknown experimental response landscapes are more amenable to methods which perform well on a large variety of possible landscapes. Another challenge is the parallelization capabilities: Traditionally, Bayesian optimization methods are sequential in nature, which prevents parallel evaluations of the objective function, for instance, by using multiple experimental platforms or computational resources. For parallel evaluation, however, a procedure enabling the generation of multiple informative parameter points is needed. Prior work on batched Bayesian optimization,[46−50] and nonmyopic approaches,[51] aimed to resolve this problem, but requires additional computation compared to sequential optimization. The third identified challenge is computational efficiency: Optimization strategies involving substantial additional computation are only applied efficiently if the time required to suggest a new set of conditions does not significantly exceed the execution time of the experiment. As such, the optimization procedure should be computationally efficient compared to the time required to evaluate the objectives of the considered problem, e.g., to run the experiment. The Probabilistic Harvard Optimizer Exploring Non-Intuitive Complex Surfaces (Phoenics) algorithm introduced in this study tackles the aforementioned challenges by supplementing ideas from Bayesian optimization with concepts from Bayesian kernel density estimation. More technically, we use BNNs to estimate kernel distributions associated with a particular objective function value from observed parameter points. Consequently, our approach differs from the traditional use of BNNs in the Bayesian optimization context, where objective function values are predicted from BNNs directly. Employing the estimated kernel distributions, we can construct a simple functional form of the approximation to the objective function. As a consequence, the computational cost of Phoenics scales linearly with the dimensionality of the search space and the number of observations, without the cost of numerous full evaluations of the BNN. Phoenics is available for download on GitHub.[52] We propose an inexpensive acquisition function, which enables intuitive search strategies for efficient parallelization. This is achieved by simultaneously proposing multiple parameter points with different sampling policies at negligible additional cost. Those policies are biased toward exploration or exploitation of the search space tuned by an intuitive hyperparameter. A synergistic effect is observed when proposing batches of parameter points with different sampling policies. Our batching policy not only helps to accelerate the optimization process, but also reduces the total number of required function evaluations. It is therefore to be seen as an improvement over trivial parallelization. In what follows, we start with a brief overview of related works. Then we detail the mathematical formulation of Phoenics. We further discuss performance results of Phoenics on analytic benchmark functions and compare to other Bayesian optimization methods. Before concluding we further highlight the applicability of our approach on the Oregonator, a model system for chemical reactions, on which we demonstrate the deployment of the proposed optimizer for practical problems in chemistry.

Background and Related Work

A plethora of global optimization algorithms has been developed to solve problems in different contexts, with different assumptions. The landscape of experimental responses could possibly be nonconvex, for instance, in scenarios where high reaction yields can be achieved with multiple different experimental conditions. Suitable optimization algorithms therefore need to be capable to overcome local optima to successfully find the global optimum. The most straightforward approaches to global optimization include random searches, grid searches, and (fractional) factorial design strategies.[1−3] While grid searches are embarrassingly parallel, they do not account for results obtained from recently executed experiments/computations. Computationally more involved methods, such as simulated annealing,[53,54] particle swarm,[55,56] or evolutionary strategies,[22,23] are able to make informed decisions about the next conditions to evaluate by accounting for results from recent evaluations. Recently, Bayesian optimization has gained increased attention as an alternative global optimization strategy as it was shown to reduce redundancy in the proposed conditions and, thus, locates global optima in fewer objective evaluations.[57] Bayesian optimization is a gradient-free strategy for the global optimization of possibly noisy black-box functions, which we denote with f from hereon.[27−30] It consists of two major steps: (i) construct a surrogate to f and (ii) propose new parameter points for querying f based on this probabilistic approximation. In the first step, the surrogate model is constructed by conditioning f on a prior ϕprior(θ) over the functional form, which is described by parameters θ. The parameters θ of the prior distribution are refined based on observations of n pairs of parameter values , denoting, for instance, experimental conditions, and corresponding objective function values f = f (), denoting the experimental responses such as reaction yield. The functional prior ϕprior is updated based on observations to yield a more informative posterior ϕpost. With more and more observations , the posterior ϕpost yields a better approximation and eventually converges to the objective function in the limit of infinitely many distinct observations, thus perfectly reproducing the experimental response landscape. In the second step of the general Bayesian optimization procedure, this surrogate model is used to propose new conditions for future evaluations via an acquisition function. Bayesian optimization therefore relies on both an accurate approximation to the objective function and also the formulation of an efficient acquisition function.

Constructing the Objective Function Approximation

A popular choice for modeling the functional prior ϕprior on the objective function are Gaussian processes (GPs),[30,31,50,58] and random forests (RFs).[33−35,59] GPs associate every point in the parameter domain with a normally distributed random variable. These normal distributions are then constructed via a similarity measure between observations given by a kernel function. A GP therefore provides a flexible way of finding analytic approximations to the objective function. Training a GP, however, is computationally costly as it involves the inversion of a dense covariance matrix, which scales cubically with the number of observations. Due to this limitation, GPs are typically used in relatively low-dimensional problems with an optimum that can be found in relatively few objective function evaluations. RFs are a collection of regression trees, which, in contrast to decision trees, have real numbers at their leaves. RFs have been shown to perform particularly well for categorical input data and classification tasks. RFs are therefore successfully applied to objective functions with discrete or quasidiscrete codomain. The computational cost of training a RF scales as with the number of observations and linearly with the dimensionality of the parameter space. Recently, Bayesian neural networks (BNNs) have been employed for Bayesian optimization,[37,38] retaining the flexibility of GPs at a computational scaling comparable to RFs. In contrast to traditional neural networks, weights and biases for neurons in BNNs are not single numbers but instead sampled from a distribution. BNNs are trained by updating the distributions from which weights and biases are sampled.

Acquisition Functions

The ideal acquisition function finds the adequate balance between exploration and exploitation. Exploration of the entire parameter space should be favored when no observations in vicinity to the global optimum have been made yet, and the acquisition function should only sample close to the global optimum once its general location has been determined. One of the earliest and most widely applied acquisition functions is expected improvement and variants thereof.[28,31,39] Expected improvement aims to measure the expected amount by which an observation of a point in parameter space improves over the current best value. Exploration and exploitation are implicitly balanced based on the posterior mean and the estimated uncertainty. More recently, alternative formulations of acquisition functions have been developed. The upper confidence bound method exploits confidence bounds for constructing an acquisition function which minimizes regret.[42] Variants of this acquisition function have been designed specifically to be applied in higher-dimensional parameter spaces.[49,50]Predictive entropy estimates the negative differential entropy of the location of the global optimum given the observations,[40,41] and has been shown to outperform expected improvement and upper confidence bound acquisitions.

Batched Bayesian Optimization

Many practical applications involve time-consuming evaluations of the objective function, but are amenable to parallelization. Examples include the execution of experiments on multiple experimental platforms or the distribution of computational models across multiple processors. Batched Bayesian optimization has been suggested with different assumptions and applicability scenarios. Marmin et. al proposed derivative-based expected improvement criterion for synchronous batch-sequential Bayesian optimization.[48] Other methods include approaches for estimating the expected objective function value for future observations,[31,50] or look ahead procedures.[51] In addition, ensemble procedures have been proposed where new batches of samples are proposed from GPs trained on subsets of the available data.[46,47] However, all proposed batch optimization methods require substantial additional computation compared to sequential evaluation strategies. A computationally less demanding batch optimization strategy has been proposed in the context of RF optimization.[33−35] Due to the computational efficiency of RF models, batch optimization is realized by running several RF optimization instances in parallel, while sharing and contributing to the same observation data set.

Formulating Phoenics

In this section we present the mathematical formulation of Phoenics. We assume that evaluations of the objective are expensive, where the cost could be related to any budgeted resource such as required execution time, experimental synthesis of chemical compounds, computing resources, and others. Phoenics combines ideas from Bayesian optimization with concepts from Bayesian kernel density estimation (BKDE).[60] The overall workflow of Phoenics is schematically represented in Figure and follows the general principles of traditional Bayesian optimization. At each iteration, a surrogate model is constructed, from which new conditions are proposed. In the following we detail how the surrogate model is constructed (Figure a–c) and how the surrogate model can be biased toward particular sampling strategies, i.e., exploration or exploitation, using an intuitive sampling parameter (Figure d).

Figure 1

Illustration of the workflow of Phoenics. (A) Unknown, possibly high-dimensional, objective function of an experimental procedure or computation. The objective function has been evaluated at eight different conditions (green), which comprise the set of observations in this illustration. (B) The observed conditions are processed by a Bayesian neural network yielding a probabilistic model for estimating parameter kernel densities. Note that the probabilistic approach allows for a higher flexibility of our surrogate model compared to standard kernel density estimation. (C) The surrogate model is constructed by weighting the estimated parameter kernel densities with their associated observed objective values. (D) The surrogate can be globally reshaped using a single sampling parameter λ to favor exploration (red) or exploitation (blue) of the parameter space.

Approximating the Objective Function

We suggest to use BNNs to estimate the parameter kernel density from the observed parameter points in an autoencoder-like architecture. As such, the BNN is used to nonlinearly estimate the density of the observed parameter points (see Figure b). A particular realization of the BNN represents a map projecting parameter points into the parameter space, i.e, . Thereby, we can construct an estimate to the parameter kernel density, which corresponds to a particular observed objective function value. The use of a BNN for the construction of the surrogate guarantees flexibility in the approximation as it has already been reported that BNNs are versatile function approximators at a favorable linear scaling with the number of observations.[37,38] Details on the BNN architecture are reported in the Supporting Information (see Sec. S.3). The implementation of Phoenics supports the construction and training of the surrogate model using either the PyMC3 library,[61,62] or the Edward library.[63] In both cases, the model parameters θ of the BNN are trained via variational inference. We start the sampling procedure with 500 samples of burn-in followed by another 1000 iterations retaining every 10th sample. This protocol was fixed for all tasks. We can construct an approximation to the kernel density from the distributions of BNN parameters learned from the sampling procedure. In particular, for observed conditions we compute the kernel densities, which are then used to approximate the objective function. The kernel density p() generated from a single observed parameter point (see Figure b) can therefore be written in closed form in eq , where ⟨·⟩ denotes the average over all sampled BNN architectures, and pred denotes the parameter points sampled from the BNN We formulate the approximation to the objective function as an ensemble average of the observed objective function values f taken over the set of computed kernel densities p() (see eq ).[64,65] In this ensemble average, each of the constructed distributions p() is rescaled by the value of the objective function f observed for the parameter point (Figure c). Note that α converges to the true objective f in the limit of infinitely many distinct function evaluations, as the kernel densities p become more peaked due to the increasing precision in the Gaussian prior. Details are provided in the Supporting Information (see Sec. S.5). This approximation to the objective function allows for inexpensive evaluations for any given parameter point as repetitive, full BNN evaluations are avoided.

Acquisition Function

In the resulting approximation α we effectively model the expectation value of f for a given parameter point based on prior observations. However, the parameter space could contain low-density regions, for which the objective function approximation α() is inaccurate (see Figure c). With these considerations, we propose an acquisition function based on the kernel densities p() for observations detailed in eq . The acquisition function differs from the approximation to the objective function (see eq ) by an additional term puniform() in the numerator and the denominator, which denotes the uniform distribution on the domain (see Figure d). In the numerator, puniform() is scaled by a factor λ, referred to as the sampling parameter from hereon. Note that this modification to α does not affect its convergence behavior (see the Supporting Information, Sec. S.5) The introduced parameter λ effectively compares the cumulative height of each rescaled density estimate p() to the uniform distribution. While the p() are constructed from the knowledge we acquired from previous experiments, puniform is used as a reference to indicate the lack of knowledge in parameter space regions where little or no information is available yet. The sampling parameter therefore balances between acquired knowledge and the lack of knowledge, which effectively tunes the exploitative and explorative behavior of the algorithm. Figure d illustrates the behavior of Phoenics on a one-dimensional objective function with different λ values. In this example, the acquisition function is constructed from eight observations indicated in green. Note that the acquisition function approximates the value of the objective function at observed parameter points. Acquisition functions which were constructed from a more positive λ show low values only in the vicinity of the observation with the lowest objective function value. In contrast, acquisition functions which were constructed from a more negative λ show low values far away from any observation. The choice for the value of the exploration parameter λ can therefore be directly related to explorative or exploitative behavior when proposing new conditions based on the global minimum of the surrogate. With a large positive value of λ, Phoenics favors exploitation, while a large negative value favors exploration. When λ = 0, the acquisition function shows no preference for a particular sampling strategy. Details on the global optimization of the surrogate are provided in the Supporting Information (see Sec. S.6). From Figure d we see that distinct points in parameter space are proposed based on particular values of the exploration parameter λ. The best choice of λ for a given objective function is a priori unknown. However, with the possibility to rapidly construct several acquisition functions with biases toward exploration or exploitation, we can propose multiple parameter points in batches based on different sampling strategies. The newly proposed parameter points are then evaluated on the black-box optimization function in possibly parallel evaluation runs.

Results and Discussion

In this section we report the performance of Phoenics and compare it to four frequently used optimization algorithms: particle swarm optimization (PSO),[55,56] covariance matrix adaptation evolution strategy (CMA-ES),[22,23] as well as Bayesian optimization based on GPs and based on RFs. PSO is implemented in the “pyswarms” python module.[66] An implementation of CMA is also available in the “cma” module.[67] Default settings as provided by the modules have been used for both optimization algorithms. The “spearmint” software package performs Bayesian optimization using GPs and the predictive entropy acquisition function.[31,36] Batch optimization is implemented in spearmint via estimating the expected objective value for future evaluations. The SMAC software employs RF models and allows for batch optimization by running multiple RF instances sharing the same set of samples.[33−35] Chemical systems can have complex, qualitatively different response surfaces. As chemical reactions are time-consuming to evaluate, we therefore assess the performance of each of these three algorithms on a set of 15 benchmark functions covering a large range of qualitatively diverse response surfaces for problems in chemistry. The employed functions are well-established benchmarks and include continuous and convex, nonconvex, or discrete functions with possibly multiple global minima. A complete list of the employed objective functions as well as their global minima is provided in the Supporting Information (see Table S.1). For reliable performance estimates we executed 20 independent optimization runs initialized with different random seeds, unless noted otherwise. During each run we record the lowest achieved objective function value after each iteration. We compare the averaged lowest achieved objective function values by relating to results from simple random searches. Each random search was run for 104 objective function evaluations, and results were averaged over 50 independent runs initialized with different random seeds. The average lowest achieved objective function values of the random search runs are summarized in the Supporting Information (see Table S.2). Our benchmark calculations indicate that Bayesian-based optimization algorithms outperform PSO and CMA-ES. As a matter of fact, after 200 function evaluations, PSO and CMA-ES yield significantly higher deviations from the global optimum than the three studied Bayesian optimization algorithms. Even when increasing the number of function evaluations by an order of magnitude, from 200 to 2000, PSO (and CMA-ES) fails to achieve lower deviations than the ones obtained with Phoenics after 200 evaluations for 12 (and 13) out of 15 objective functions. We therefore restrict our further analyses and comparisons to only Phoenics, GP optimization, and RF optimization. The benchmark results conducted with PSO and CMA-ES are detailed in the Supporting Information (see Figure S.6)

Analytic Benchmarks

Phoenics was set up with three different values for the sampling parameter, λ ∈ {−1, 0, 1}, to assess the effectiveness of a particular parameter choice. In Figure we report the number of objective function evaluations required by each of the optimization algorithms to reach an objective function value lower than the average lowest value found in random searches. Optimization traces for these runs on all 15 objective functions are reported in the Supporting Information (see Figure S.3).

Figure 2

Number of objective function evaluations required to reach objective function values lower than the average lowest achieved values of random searches with 104 evaluations for Phoenics (λ ∈ {−1, 0, 1}), RFs, and GPs. Results are reported for the Ackley (A), Dejong (B), Schwefel (C), and dAckley (D) objective functions. Details on the benchmark functions are provided in the Supporting Information, Sec. S.1. We find that GP optimization, as implemented in spearmint, generally quickly finds the global minimum if the objective function is strictly convex. In contrast, RF optimization, as implemented in SMAC, quickly finds the global minimum of objective functions with a discrete codomain. The performance of Phoenics varies with different values of the sampling parameter λ. When favoring exploitation over exploration, i.e., λ > 0, the algorithm performs better if the objective function features narrow and well-defined funnels (e.g., Ackley in Figure a or Schwefel in Figure c). With this choice for the sampling parameter, the algorithm is slightly biased toward exploring the local region around the current optimum. This behavior, however, is unfavorable in other cases, for instance, when the objective function has a discrete codomain (e.g., dAckley in Figure d). Since parameter points in the vicinity to the current optimum likely yield the same value if the objective function is discrete or quasidiscrete, Phoenics performs better on such objective functions when favoring exploration over exploitation, i.e., λ < 0.

Developing a Collective Sampling Strategy

The dependence of the performance of Phoenics on the sampling parameter λ could be eliminated by marginalizing over this parameter. Marginalization over the sampling parameter would effectively average out the advantageous effects of a bias toward exploitation for some objective functions and toward exploration for other objective functions. The shape of the objective function is a priori unknown, so suitable choices of the sampling parameter cannot be determined beforehand. However, since the sampling parameter can be directly related to the explorative and exploitative behavior of the algorithm we follow an approach to take full advantage of the sampling policy. We suggest to propose parameter points based on multiple different sampling parameter values. Note that values of λ are chosen beforehand and are kept fixed throughout the optimization procedure. Given a set of observations the construction of several objective surrogates with different values of λ is computationally cheap. This allows us to suggest multiple parameter points at each optimization iteration, which are proposed from more explorative and more exploitative parameter values, at almost no additional cost. With the observations on the simple benchmarks we would expect a synergistic effect of this batch optimization over sequential optimization with a single sampling parameter value. As parameter points can be proposed with both a bias toward exploration and a bias toward exploitation, we expect the number of required objective function evaluations to decrease. In addition, suggesting a batch of parameter points in one optimization step allows for the parallel evaluation of all proposed points, which accelerates the optimization process. The behavior of the three studied optimization algorithms under parallel optimization on the Ackley objective function is highlighted in Figure . Full results on the entire benchmark set are reported in the Supporting Information (see Sec. S.8). Figure depicts the minimum achieved objective function values for different runs with a different number of parallel evaluations of the objective function averaged over 20 independent runs. The minimum achieved objective function values are presented per number of objective function evaluations (left panel) and per batch evaluation (right panel).

Figure 3

Average minimum objective function values for the Ackley function achieved in 20 independent runs of the three optimization algorithms studied in this work: our optimizer (Phoenics), spearmint (GP), and SMAC (RF). For each run a different number of proposed samples p was evaluated in parallel. Minimum achieved objective function values are reported with respect to the total number of objective function evaluations and the number of evaluated batches. The dashed blue lines denote the minimum achieved error after 104 of random search for reference. We find that both spearmint and SMAC achieve low objective function values in fewer batches with an increasing number of points p proposed in each batch. While increasing the number of samples proposed per batch initially significantly improves the performance with respect to the number of proposed batches, this advantageous effect quickly levels off until there is no significant improvement beyond six samples per batch. However, when comparing the minimum achieved objective function values with respect to the total number of objective function evaluations, we did not observe any significant difference between runs with a different number of samples proposed per batch. In contrast, Phoenics shows a different behavior. Our algorithm not only reaches lower objective function values in a fewer number of batches when proposing more samples per batch, but also shows a better performance when considering the total number of function evaluations. This synergistic effect demonstrates that Phoenics indeed benefits from proposing points with multiple sampling strategies in cases in which proposed samples are evaluated sequentially. The performance improvement of Phoenics when proposing parameter points in batches at each optimization iteration is demonstrated on all 15 considered objective functions in the Supporting Information (see Sec. S.8). We ran our optimizer with four different sampling strategies using sampling parameter values evenly spaced across the [−1, 1] interval. All four proposed parameter points are then evaluated before we started another optimization iteration. For this particular batching protocol, we find that Phoenics outperforms RF-based optimization on all benchmark functions and GP-based optimization on 12 out of 15 benchmark functions. If and only if the objective function is convex, GP optimization finds lower objective function values. In addition, we observe the aforementioned synergistic effect of batch optimization for 12 out of 15 benchmark functions. Despite reducing the number of optimization iterations by a factor of 4, the achieved objective function values were found to be lower than values achieved in sequential optimizations with all three considered fixed sample parameter value. We suggest that this improved performance of the algorithm is due to the trade-off between exploration and exploitation. The exploration samples systematically sample the parameter space and ensure that the algorithm does not get stuck in local minima, while the exploitation samples explore the local environment of the current global minimum. This sampling behavior is illustrated in Figure for the Michalewicz function. The optimization runs on the Michalewicz function were all started from the same two random samples illustrated in black for all three investigated optimization algorithms. Bayesian optimization based on GPs as implemented in spearmint (lower panels) tends to sample many parameter points close to the boundaries of the domain space in this particular example. RF optimization as implemented in SMAC (central panels), however, shows a higher tendency of exploring the parameter space.

Figure 4

Progress of sample optimization runs of the three studied optimization algorithms on the two-dimensional Michalewicz function. Phoenics proposed a total of three samples per batch, which were then evaluated in parallel. Each sample was suggested based on a particular value of the exploration parameter λ ∈ {−1, 0, 1}. Left panels illustrate the parameter points proposed at each optimization iteration while right panels depict the achieved objective function values. Depicted points are more transparent at the beginning of the optimization and more opaque toward the end. Starting points for the optimization runs are drawn as black squares. Phoenics (Figure , upper panels) starts exploring the space and quickly finds a local minimum in vicinity of one of the initial samples. After finding this local minimum, samples which are proposed based on a more exploitative (positive) value of the sampling parameter λ explore the local environment of this local minimum while samples proposed from more explorative (negative) values of λ explore the entire parameter space. As soon as the exploration points find a point in parameter space with a lower value of the objective function, the exploitation points jump to this new region in parameter space and locally explore the region around the current best to quickly converge to the global minimum. Overall we have demonstrated that the value of the sampling parameter λ in the proposed acquisition function clearly influences the behavior of the optimization procedure toward a more explorative behavior for more negative values of this parameter and a more exploitative behavior for more positive parameter values. Batched optimization improves the performance of Phoenics even in terms of total objective function evaluations and reduces the number of required optimization iterations.

Increasing the Number of Dimensions

Practical chemistry problems are typically concerned with more than just two parameters. In fact, chemical reactions can be influenced by environmental conditions and experimental device settings, and computational studies frequently employ parameters to describe the system of interest. In this section we illustrate the performance of Phoenics in parameter spaces with dimensions k > 2. We evaluate the performance of the three optimization algorithms on the same objective function subset, but now successively increase the dimensionality of the parameter space from 2 to 20. Based on the results on batch optimization we ran GP optimization and RF optimization with one point per batch and the optimization algorithm introduced in this study with four points per batch on each considered benchmark function. Exploration parameter values were chosen to be evenly spaced across the [−1, 1] interval. For better comparisons we report the average deviation of the lowest encountered objective function value from the global minimum of each function taken over 20 independent optimization runs. Average deviations achieved by each of the optimization algorithms after 200 objective function evaluations are depicted in Figure .

Figure 5

Average deviations taken over 20 independent runs between the lowest encountered objective function value and the global minimum achieved after 200 objective function evaluations for different parameter set dimensions. Results are reported for Ackley (A), Dejong (B), Schwefel (C), and dAckley (D). Uncertainty bands illustrate bootstrapped estimates of the deviation of the means with one and two standard deviations. We observe that Phoenics maintains its rapid optimization properties for a variety of different objective functions even when increasing the number of dimensions. In the case of the Ackley function (Figure a) Phoenics appears to find and explore the major funnel close to the global optimum faster than the other two optimization algorithms regardless of the number of dimensions. The paraboloid (Figure b) is an easy case for the GP in low dimensions, but is optimized the fastest by Phoenics when considering parameter spaces with seven or more dimensions. No major differences are observed for the Schwefel function (Figure c). However, in the case of a discrete objective function (Figure d) Phoenics seems to have a slight advantage over the other two optimizers for lower dimensions and performs about as well as random forest optimization for higher dimensions.

Application to Chemistry

In this section, we demonstrate the applicability of Phoenics on the Oregonator, a model system of a chemical reaction described by a set of nonlinear coupled differential equations.[68] In particular, we demonstrate how Phoenics can be employed to propose a set of conditions for an experimental procedure. The experimental procedure can then be executed with the proposed conditions, and the results of the procedure are reported back to Phoenics. With this feedback, Phoenics can make more informed decisions and, thus, provides more promising sets of experimental conditions, to eventually result in the discovery of the optimal set of conditions. Most chemical reactions lead to a steady-state, i.e., a state in which the concentrations of involved compounds are constant in time. While chemical reactions described by linear differential equations always feature such a steady-state, more complicated dynamics phenomena can arise for reactions described by sets of nonlinear coupled differential equations. With the right choice of parameters, such differential equations may have a stable limit cycle, leading to periodic oscillations in the concentrations of involved compounds.[69,70] One of the earliest discovered reactions featuring a stable limit cycle for a set of reaction conditions is the Belousov–Zhabotinsky reaction.[71,72] This network of chemical reactions involves temporal oscillations of [CeIV] and [CeIII]. The entire reaction network can be written as a set of three subreactions listed in Scheme . For details on the mechanism we refer to a brief summary in the Supporting Information (see Sec. S.10) as well as to the literature.[68,72−75]

Scheme 1

Subreactions of the Belousov–Zhabotinsky Reaction[78]

Models at different levels of complexity have been developed to describe the dynamic behavior of the Belousov–Zhabotinsky reaction.[69,75−77] One of the simplest models of this reaction is the Oregonator.[68] The Oregonator consists of a set of three coupled first-order nonlinear differential equations for three model compounds X, Y, and Z, which are shown in eqs –6. These equations involve five reaction constants k, a stoichiometric factor f, determined by the prevalence of one subreaction over another subreaction, and the concentration of two additional chemical compounds A and B. A map of eqs –6 to Scheme is outlined in ref (68). The set of differential equations in the Oregonator can be reduced into a dimensionless form, such that the number of correlated parameters is reduced to a smaller set of independent parameters. This reduced version of the Oregonator, presented in the Supporting Information (see Sec. S.10), includes three dimensionless variables α, η, and ρ, which describe the concentration of chemical species, and four dimensionless reaction constants. Phoenics is used to reverse engineer the set of reaction conditions consisting of three initial concentrations (α0, η0, and ρ0) and four reaction constants (q, s, w, and f) from the concentration traces computed in the original publication.[68] Parameter values yielding the target concentration traces are reported in Table . The goal is 2-fold: (i) find a set of conditions for which the dynamical behavior qualitatively agrees with the behavior of the target, i.e., find chemical oscillations, and (ii) fine-tune these conditions such that we reproduce the dynamical behavior on a quantitative level. We aim to achieve these two goals while keeping the number of function evaluations, i.e., the number of experiments to run, to a minimum. Note that this constraint along with the dimensionality of the parameter space implies that grid searches or gradient-based algorithms are not suited for this problem.

Table 1

Reaction Parametersa of the Reduced Oregonator Model for the Belousov–Zhabotinsky Reaction

parameter	target value	range
s	77.27	0 ... 100
w	0.1610	0 ... 1
q	8.375 × 10^–6	10^–8 ... 10^–4
f	1	0 ... 5
α₀	2.0 × 10⁷	10⁴ ... 10⁹
η₀	3.3 × 10³	10³ ... 10⁵
ρ₀	4.1 × 10⁴	10³ ... 10⁶

Target parameters induce the existence of a limit cycle, from which chemical oscillations emerge. For finding these target parameters via optimization we constrained the domain space to the reported ranges. All reported quantities are dimensionless. Phoenics was run in parallel proposing four samples per batch with λ equally spaced on the [−1, 1] interval. We compare the performance to PSO in the pyswarms module, CMA-ES in the cma module, GP optimization in spearmint, and RF optimization in SMAC. Each of the five optimization algorithms was used in 50 independent optimization runs for 150 evaluations. All optimization procedures were carried out on a constrained parameter space reported in Table . Note that different choices of parameter sets within this bounded domain can result in quantitatively and qualitatively different dynamical behavior. In particular, parameter choices close to the target result in oscillatory behavior, for which the reduced concentrations α, ρ, and η change periodically over time, while other parameter choices can break the limit cycle and create a stable fixed point instead.[68,74,77] Concentration traces for a sampled set of reaction parameters were computed with a fourth-order Runge–Kutta integrator with adaptive time stepping. The integrator was run for a total of 107 integration steps covering 12 full concentration oscillations for the target parameter set. Sampled concentration traces are compared to the target concentration traces after a cubic spline interpolation. The distance (loss) between the sampled traces and the target traces is calculated as the euclidean distance between the points in time at which a concentration trace reaches a dimensionless concentration value of 100. Average achieved losses for all three optimization algorithms are displayed in Figure . Loss values between 300 and 500 indicate that the periodicity of the predicted concentration traces resembles the periodicity of the target traces; i.e., the predicted traces qualitatively agree with the target. Quantitative agreement, i.e., matching traces, is only achieved for loss values lower than about 100. Examples for concentration traces yielding different losses are presented in the Supporting Information (see Figure S.7).

Figure 6

Average achieved losses for finding reaction parameters of the reduced Oregonator model achieved by the five optimization algorithms employed in this study. Correct periodicities of the concentration traces are achieved for losses lower than 500. Uncertainty bands illustrate bootstrapped deviations on the mean for one standard deviation. Figure shows concentration traces associated with the lowest loss achieved by each of the three optimization algorithms across all 50 independent runs. Phoenics is the only algorithm reproducing qualitatively and quantitatively target dynamic behavior within 150 optimization iterations. RF optimization only finds parameter sets which qualitatively agree with the target. GP optimization finds only in rare occasions concentration traces in qualitative agreement with the target. Both PSO and CMA-ES consistently yield high losses for the first 75 evaluations, after which PSO slightly improves but never reaches the degree of agreement achieved by Phoenics.

Figure 7

Time traces of dimensionless concentrations of compounds in the Oregonator model. Target traces are depicted with solid, transparent lines while predicted traces are shown in dashed, opaque lines. Traces were simulated for a total of 12 dimensionless time units, but are only shown for the first six time units for clarity.

Conclusion and Outlook

We introduced Phoenics, an algorithm for global optimization in the context chemistry and experimentation. Phoenics is designed for scenarios where the merit of a set of conditions is evaluated via experimentation or expensive computations, which can possibly be parallelized. Our probabilistic optimizer combines Bayesian optimization with conceptual aspects of Bayesian kernel density estimation. As such, our algorithm is well-suited for applications where evaluations of the objective function are expensive with respect to budgeted resources such as time or money. Through an exhaustive benchmark study, we showed that Phoenics improves over optimization strategies based on particle swarms or evolutionary approaches, as well as on existing Bayesian global optimization methods and avoids redundant evaluations of the objective. We formulate an inexpensive acquisition function balancing the explorative and exploitative behavior of the algorithm. This acquisition function enables intuitive sampling policies for an efficient parallel search of global minima. By leveraging synergistic effects from running multiple sampling policies in batches, the performance of the algorithm improves, and requires a reduced total number of objective function evaluations. The applicability of Phoenics was highlighted on the Oregonator, a model system describing a complex chemical reaction network. Phoenics was able to determine the set of seven experimental conditions reproducing a target dynamic behavior in the concentrations of involved chemical species. High degrees of qualitative and quantitative agreement could be achieved with only 100 merit-evaluations of proposed conditions despite the rich solution space containing both steady-state systems and chemical oscillators. We believe that Phoenics has the potential to be applied to a wide range of problems, from optimization of reaction conditions and material properties, over control of robotics systems, to circuit design for quantum computing.[79,80] All in all, we recommend Phoenics for an efficient optimization of scalar, possibly nonconvex, black-box unknown objective functions.

16 in total

Review 1. Completely derandomized self-adaptation in evolution strategies.

Authors: N Hansen; A Ostermeier
Journal: Evol Comput Date: 2001 Impact factor: 3.277

2. A history of chemical oscillations and waves.

Authors: Anatol M. Zhabotinsky
Journal: Chaos Date: 1991-12 Impact factor: 3.642

3. Optimization by simulated annealing.

Authors: S Kirkpatrick; C D Gelatt; M P Vecchi
Journal: Science Date: 1983-05-13 Impact factor: 47.728

4. Pool-BCGA: a parallelised generation-free genetic algorithm for the ab initio global optimisation of nanoalloy clusters.

Authors: A Shayeghi; D Götz; J B A Davis; R Schäfer; R L Johnston
Journal: Phys Chem Chem Phys Date: 2014-12-08 Impact factor: 3.676

5. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction.

Authors: Marwin H S Segler; Mark P Waller
Journal: Chemistry Date: 2017-02-22 Impact factor: 5.236

6. Discovery of wall-selective carbon nanotube growth conditions via automated experimentation.

Authors: Pavel Nikolaev; Daylond Hooper; Nestor Perea-López; Mauricio Terrones; Benji Maruyama
Journal: ACS Nano Date: 2014-10-16 Impact factor: 15.881

7. Neural Networks for the Prediction of Organic Chemistry Reactions.

Authors: Jennifer N Wei; David Duvenaud; Alán Aspuru-Guzik
Journal: ACS Cent Sci Date: 2016-10-14 Impact factor: 14.553

8. Optimizing Chemical Reactions with Deep Reinforcement Learning.

Authors: Zhenpeng Zhou; Xiaocheng Li; Richard N Zare
Journal: ACS Cent Sci Date: 2017-12-15 Impact factor: 14.553

9. Human versus Robots in the Discovery and Crystallization of Gigantic Polyoxometalates.

Authors: Vasilios Duros; Jonathan Grizou; Weimin Xuan; Zied Hosni; De-Liang Long; Haralampos N Miras; Leroy Cronin
Journal: Angew Chem Int Ed Engl Date: 2017-08-03 Impact factor: 15.336

10. Prediction of Organic Reaction Outcomes Using Machine Learning.

Authors: Connor W Coley; Regina Barzilay; Tommi S Jaakkola; William H Green; Klavs F Jensen
Journal: ACS Cent Sci Date: 2017-04-18 Impact factor: 14.553

24 in total

Review 1. QSAR without borders.

Authors: Eugene N Muratov; Jürgen Bajorath; Robert P Sheridan; Igor V Tetko; Dmitry Filimonov; Vladimir Poroikov; Tudor I Oprea; Igor I Baskin; Alexandre Varnek; Adrian Roitberg; Olexandr Isayev; Stefano Curtarolo; Denis Fourches; Yoram Cohen; Alan Aspuru-Guzik; David A Winkler; Dimitris Agrafiotis; Artem Cherkasov; Alexander Tropsha
Journal: Chem Soc Rev Date: 2020-05-01 Impact factor: 54.564

2. Bayesian reaction optimization as a tool for chemical synthesis.

Authors: Benjamin J Shields; Jason Stevens; Jun Li; Marvin Parasram; Farhan Damani; Jesus I Martinez Alvarado; Jacob M Janey; Ryan P Adams; Abigail G Doyle
Journal: Nature Date: 2021-02-03 Impact factor: 49.962

3. Autonomous Chemical Experiments: Challenges and Perspectives on Establishing a Self-Driving Lab.

Authors: Martin Seifrid; Robert Pollice; Andrés Aguilar-Granda; Zamyla Morgan Chan; Kazuhiro Hotta; Cher Tian Ser; Jenya Vestfrid; Tony C Wu; Alán Aspuru-Guzik
Journal: Acc Chem Res Date: 2022-08-10 Impact factor: 24.466

4. Benchmarking the acceleration of materials discovery by sequential learning.

Authors: Brian Rohr; Helge S Stein; Dan Guevarra; Yu Wang; Joel A Haber; Muratahan Aykol; Santosh K Suram; John M Gregoire
Journal: Chem Sci Date: 2020-01-29 Impact factor: 9.825

5. Self-driving laboratory for accelerated discovery of thin-film materials.

Authors: B P MacLeod; F G L Parlane; T D Morrissey; F Häse; L M Roch; K E Dettelbach; R Moreira; L P E Yunker; M B Rooney; J R Deeth; V Lai; G J Ng; H Situ; R H Zhang; M S Elliott; T H Haley; D J Dvorak; A Aspuru-Guzik; J E Hein; C P Berlinguette
Journal: Sci Adv Date: 2020-05-13 Impact factor: 14.136

6. Chimera: enabling hierarchy based multi-objective optimization for self-driving laboratories.

Authors: Florian Häse; Loïc M Roch; Alán Aspuru-Guzik
Journal: Chem Sci Date: 2018-08-28 Impact factor: 9.825

7. Accurate Multiobjective Design in a Space of Millions of Transition Metal Complexes with Neural-Network-Driven Efficient Global Optimization.

Authors: Jon Paul Janet; Sahasrajit Ramesh; Chenru Duan; Heather J Kulik
Journal: ACS Cent Sci Date: 2020-03-11 Impact factor: 14.553