Florian Häse1, Loïc M Roch1, Christoph Kreisbeck1, Alán Aspuru-Guzik1,2,3,4. 1. Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States. 2. Department of Chemistry and Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3H6, Canada. 3. Vector Institute for Artificial Intelligence, Toronto, Ontario M5S 1M1, Canada. 4. Canadian Institute for Advanced Research (CIFAR) Senior Fellow, Toronto, Ontario M5S 1M1, Canada.
Abstract
We report Phoenics, a probabilistic global optimization algorithm identifying the set of conditions of an experimental or computational procedure which satisfies desired targets. Phoenics combines ideas from Bayesian optimization with concepts from Bayesian kernel density estimation. As such, Phoenics allows to tackle typical optimization problems in chemistry for which objective evaluations are limited, due to either budgeted resources or time-consuming evaluations of the conditions, including experimentation or enduring computations. Phoenics proposes new conditions based on all previous observations, avoiding, thus, redundant evaluations to locate the optimal conditions. It enables an efficient parallel search based on intuitive sampling strategies implicitly biasing toward exploration or exploitation of the search space. Our benchmarks indicate that Phoenics is less sensitive to the response surface than already established optimization algorithms. We showcase the applicability of Phoenics on the Oregonator, a complex case-study describing a nonlinear chemical reaction network. Despite the large search space, Phoenics quickly identifies the conditions which yield the desired target dynamic behavior. Overall, we recommend Phoenics for rapid optimization of unknown expensive-to-evaluate objective functions, such as experimentation or long-lasting computations.
We report Phoenics, a probabilistic global optimization algorithm identifying the set of conditions of an experimental or computational procedure which satisfies desired targets. Phoenics combines ideas from Bayesian optimization with concepts from Bayesian kernel density estimation. As such, Phoenics allows to tackle typical optimization problems in chemistry for which objective evaluations are limited, due to either budgeted resources or time-consuming evaluations of the conditions, including experimentation or enduring computations. Phoenics proposes new conditions based on all previous observations, avoiding, thus, redundant evaluations to locate the optimal conditions. It enables an efficient parallel search based on intuitive sampling strategies implicitly biasing toward exploration or exploitation of the search space. Our benchmarks indicate that Phoenics is less sensitive to the response surface than already established optimization algorithms. We showcase the applicability of Phoenics on the Oregonator, a complex case-study describing a nonlinear chemical reaction network. Despite the large search space, Phoenics quickly identifies the conditions which yield the desired target dynamic behavior. Overall, we recommend Phoenics for rapid optimization of unknown expensive-to-evaluate objective functions, such as experimentation or long-lasting computations.
Optimization problems
are ubiquitous in a variety of disciplines
ranging from science to engineering and can take various facets: finding
the lowest energy state of a system, searching for the optimal set
of conditions to improve experimental procedures, or identifying the
best strategies to realize industrial processes. They also have a
rich history in chemistry. For example, conditions for chemical reactions
are optimized with systematic methods like design of experiments (DOE).[1−3] More recently, optimization procedures assisted chemists in finding
chemical derivatives of given molecules to best treat a given disease,[4] finding candidates for organic photovoltaics,[5] organic synthesis,[6,7] predicting
reaction paths,[8−10] or automated experimentation.[11−14] Often, these applications are
subject to multiple local optima, and involve costly evaluations of
proposed conditions in terms of required experimentation or extensive
computations.Optimization problems are typically formulated
with an objective
function, which for a given set of parameters returns a measure for
the associated merit. This relates, for example, to measuring the
yield of a chemical reaction conducted under specific experimental
conditions. In the past, a variety of optimization algorithms have
been developed. Gradient-based algorithms, such as gradient descent,[15] conjugate gradient,[16] or the more sophisticated BFGS,[17] are
efficient at finding local optima. However, they require numerous
evaluations, i.e., conducted experiments or computations, and are
thus not well-suited for optimization problems in chemistry where
evaluations of the objectives are often costly.Lately, the
development of methods for finding the global optimum
of nonconvex, expensive-to-evaluate objective functions has gained
resurgence as an active field of research. Simplistic approaches consist
in random searches, or systematic grid searches. While the advantages
of random searches have been demonstrated in the context of hyperparameter
optimization for machine learning models,[18,19] systematic grid search approaches like DOE were successfully applied
to real-life experimentation planning.[1−3] More sophisticated methods
include genetic algorithms[20,21] based on evolution
strategies,[22,23] which are, for example, applied
to the optimization of nanoalloy clusters,[24] or to resolve electronic spectra of rotamers of organic compounds.[25] Yet, such methods still require many evaluations
of the objective function.[26]Bayesian
optimization approaches have emerged as a popular and
efficient alternative during the past decade.[27−33] The typical procedure of Bayesian optimization schemes consists
of two major steps: First, an approximation (surrogate) to the merit
landscape of the conditions is constructed. Second, a new set of conditions
is proposed for the next evaluation based on this surrogate. As such,
Bayesian optimization speculates about the experimental outcome using
all previously conducted experiments, and verifies its speculations
by requesting the evaluation of a new set of conditions. Several different
models have been suggested for approximating the objective function
landscape, ranging from random forests (RFs),[33−35] over Gaussian
processes (GPs),[31,36] to Bayesian neural networks (BNNs).[37,38] Likewise, a variety of methods for proposing new conditions from
the surrogate exists.[28,31,39−42]Bayesian optimization has been successfully employed for a
variety
of applications across chemistry. Examples include the property optimization
of functional organic molecules,[43] calibrating
high-throughput virtual screening results for organic photovoltaics,[44] or designing nanostructures for phonon transport.[45] Nevertheless, we identify three challenges in
the deployment of Bayesian optimization techniques:One challenge is
the domain specificity
of the surrogate model: GPs typically perform well on continuous objective
landscapes, while RFs are typically more suitable for discrete and
quasidiscrete landscapes. Unknown experimental response landscapes
are more amenable to methods which perform well on a large variety
of possible landscapes.Another challenge is the parallelization
capabilities: Traditionally, Bayesian optimization methods are sequential
in nature, which prevents parallel evaluations of the objective function,
for instance, by using multiple experimental platforms or computational
resources. For parallel evaluation, however, a procedure enabling
the generation of multiple informative parameter points is needed.
Prior work on batched Bayesian optimization,[46−50] and nonmyopic approaches,[51] aimed to resolve this problem, but requires additional computation
compared to sequential optimization.The third identified challenge is
computational efficiency: Optimization strategies involving substantial
additional computation are only applied efficiently if the time required
to suggest a new set of conditions does not significantly exceed the
execution time of the experiment. As such, the optimization procedure
should be computationally efficient compared to the time required
to evaluate the objectives of the considered problem, e.g., to run
the experiment.The Probabilistic Harvard
Optimizer Exploring Non-Intuitive Complex
Surfaces (Phoenics) algorithm introduced in this study tackles the
aforementioned challenges by supplementing ideas from Bayesian optimization
with concepts from Bayesian kernel density estimation. More technically,
we use BNNs to estimate kernel distributions associated with a particular
objective function value from observed parameter points. Consequently,
our approach differs from the traditional use of BNNs in the Bayesian
optimization context, where objective function values are predicted
from BNNs directly. Employing the estimated kernel distributions,
we can construct a simple functional form of the approximation to
the objective function. As a consequence, the computational cost of
Phoenics scales linearly with the dimensionality of the search space
and the number of observations, without the cost of numerous full
evaluations of the BNN. Phoenics is available for download on GitHub.[52]We propose an inexpensive acquisition
function, which enables intuitive
search strategies for efficient parallelization. This is achieved
by simultaneously proposing multiple parameter points with different
sampling policies at negligible additional cost. Those policies are
biased toward exploration or exploitation of the search space tuned
by an intuitive hyperparameter. A synergistic effect is observed when
proposing batches of parameter points with different sampling policies.
Our batching policy not only helps to accelerate the optimization
process, but also reduces the total number of required function evaluations.
It is therefore to be seen as an improvement over trivial parallelization.In what follows, we start with a brief overview of related works.
Then we detail the mathematical formulation of Phoenics. We further
discuss performance results of Phoenics on analytic benchmark functions
and compare to other Bayesian optimization methods. Before concluding
we further highlight the applicability of our approach on the Oregonator,
a model system for chemical reactions, on which we demonstrate the
deployment of the proposed optimizer for practical problems in chemistry.
Background
and Related Work
A plethora of global optimization algorithms
has been developed
to solve problems in different contexts, with different assumptions.
The landscape of experimental responses could possibly be nonconvex,
for instance, in scenarios where high reaction yields can be achieved
with multiple different experimental conditions. Suitable optimization
algorithms therefore need to be capable to overcome local optima to
successfully find the global optimum.The most straightforward
approaches to global optimization include
random searches, grid searches, and (fractional) factorial design
strategies.[1−3] While grid searches are embarrassingly parallel,
they do not account for results obtained from recently executed experiments/computations.
Computationally more involved methods, such as simulated annealing,[53,54] particle swarm,[55,56] or evolutionary strategies,[22,23] are able to make informed decisions about the next conditions to
evaluate by accounting for results from recent evaluations.Recently, Bayesian optimization has gained increased attention
as an alternative global optimization strategy as it was shown to
reduce redundancy in the proposed conditions and, thus, locates global
optima in fewer objective evaluations.[57] Bayesian optimization is a gradient-free strategy for the global
optimization of possibly noisy black-box functions, which we denote
with f from hereon.[27−30] It consists of two major steps:
(i) construct a surrogate to f and (ii) propose new
parameter points for querying f based on this probabilistic
approximation.In the first step, the surrogate model is constructed
by conditioning f on a prior ϕprior(θ) over the functional form, which is described
by parameters θ. The parameters θ of the prior
distribution are refined based on observations of n pairs of parameter values , denoting, for instance,
experimental
conditions, and corresponding objective function values f = f (), denoting the experimental
responses such as reaction yield. The functional prior ϕprior is updated based on observations to yield
a more informative posterior ϕpost. With more and
more observations , the posterior ϕpost yields
a better approximation and eventually converges to the objective function
in the limit of infinitely many distinct observations, thus perfectly
reproducing the experimental response landscape.In the second
step of the general Bayesian optimization procedure,
this surrogate model is used to propose new conditions for future
evaluations via an acquisition function. Bayesian optimization therefore
relies on both an accurate approximation to the objective function
and also the formulation of an efficient acquisition function.
Constructing
the Objective Function Approximation
A
popular choice for modeling the functional prior ϕprior on the objective function are Gaussian processes (GPs),[30,31,50,58] and random forests (RFs).[33−35,59] GPs associate every point in the parameter domain with a normally
distributed random variable. These normal distributions are then constructed
via a similarity measure between observations given by a kernel function.
A GP therefore provides a flexible way of finding analytic approximations
to the objective function. Training a GP, however, is computationally
costly as it involves the inversion of a dense covariance matrix,
which scales cubically with the number of observations. Due to this
limitation, GPs are typically used in relatively low-dimensional problems
with an optimum that can be found in relatively few objective function
evaluations. RFs are a collection of regression trees, which, in contrast
to decision trees, have real numbers at their leaves. RFs have been
shown to perform particularly well for categorical input data and
classification tasks. RFs are therefore successfully applied to objective
functions with discrete or quasidiscrete codomain. The computational
cost of training a RF scales as with the number of observations and linearly
with the dimensionality of the parameter space.Recently, Bayesian
neural networks (BNNs) have been employed for Bayesian optimization,[37,38] retaining the flexibility of GPs at a computational scaling comparable
to RFs. In contrast to traditional neural networks, weights and biases
for neurons in BNNs are not single numbers but instead sampled from
a distribution. BNNs are trained by updating the distributions from
which weights and biases are sampled.
Acquisition Functions
The ideal acquisition function
finds the adequate balance between exploration and exploitation. Exploration
of the entire parameter space should be favored when no observations
in vicinity to the global optimum have been made yet, and the acquisition
function should only sample close to the global optimum once its general
location has been determined.One of the earliest and most widely
applied acquisition functions is expected improvement and variants thereof.[28,31,39] Expected improvement aims to measure the expected amount by which
an observation of a point in parameter space improves over the current
best value. Exploration and exploitation are implicitly balanced based
on the posterior mean and the estimated uncertainty. More recently,
alternative formulations of acquisition functions have been developed.
The upper confidence bound method exploits confidence
bounds for constructing an acquisition function which minimizes regret.[42] Variants of this acquisition function have been
designed specifically to be applied in higher-dimensional parameter
spaces.[49,50]Predictive entropy estimates
the negative differential entropy of the location of the global optimum
given the observations,[40,41] and has been shown
to outperform expected improvement and upper confidence bound acquisitions.
Batched Bayesian Optimization
Many practical applications
involve time-consuming evaluations of the objective function, but
are amenable to parallelization. Examples include the execution of
experiments on multiple experimental platforms or the distribution
of computational models across multiple processors. Batched Bayesian
optimization has been suggested with different assumptions and applicability
scenarios. Marmin et. al proposed derivative-based expected improvement
criterion for synchronous batch-sequential Bayesian optimization.[48] Other methods include approaches for estimating
the expected objective function value for future observations,[31,50] or look ahead procedures.[51] In addition,
ensemble procedures have been proposed where new batches of samples
are proposed from GPs trained on subsets of the available data.[46,47] However, all proposed batch optimization methods require substantial
additional computation compared to sequential evaluation strategies.
A computationally less demanding batch optimization strategy has been
proposed in the context of RF optimization.[33−35] Due to the
computational efficiency of RF models, batch optimization is realized
by running several RF optimization instances in parallel, while sharing
and contributing to the same observation data set.
Formulating Phoenics
In this section we present the mathematical formulation of Phoenics.
We assume that evaluations of the objective are expensive, where the
cost could be related to any budgeted resource such as required execution
time, experimental synthesis of chemical compounds, computing resources,
and others. Phoenics combines ideas from Bayesian optimization with
concepts from Bayesian kernel density estimation (BKDE).[60] The overall workflow of Phoenics is schematically
represented in Figure and follows the general principles of traditional Bayesian optimization.
At each iteration, a surrogate model is constructed, from which new
conditions are proposed. In the following we detail how the surrogate
model is constructed (Figure a–c) and how the surrogate model can be biased toward
particular sampling strategies, i.e., exploration or exploitation,
using an intuitive sampling parameter (Figure d).
Figure 1
Illustration of the workflow of Phoenics. (A)
Unknown, possibly
high-dimensional, objective function of an experimental procedure
or computation. The objective function has been evaluated at eight
different conditions (green), which comprise the set of observations
in this illustration. (B) The observed conditions are processed by
a Bayesian neural network yielding a probabilistic model for estimating
parameter kernel densities. Note that the probabilistic approach allows
for a higher flexibility of our surrogate model compared to standard
kernel density estimation. (C) The surrogate model is constructed
by weighting the estimated parameter kernel densities with their associated
observed objective values. (D) The surrogate can be globally reshaped
using a single sampling parameter λ to favor exploration (red)
or exploitation (blue) of the parameter space.
Illustration of the workflow of Phoenics. (A)
Unknown, possibly
high-dimensional, objective function of an experimental procedure
or computation. The objective function has been evaluated at eight
different conditions (green), which comprise the set of observations
in this illustration. (B) The observed conditions are processed by
a Bayesian neural network yielding a probabilistic model for estimating
parameter kernel densities. Note that the probabilistic approach allows
for a higher flexibility of our surrogate model compared to standard
kernel density estimation. (C) The surrogate model is constructed
by weighting the estimated parameter kernel densities with their associated
observed objective values. (D) The surrogate can be globally reshaped
using a single sampling parameter λ to favor exploration (red)
or exploitation (blue) of the parameter space.
Approximating the Objective Function
We suggest to
use BNNs to estimate the parameter kernel density from the observed
parameter points in an autoencoder-like architecture. As such, the
BNN is used to nonlinearly estimate the density of the observed parameter
points (see Figure b). A particular realization of the BNN represents
a map projecting parameter points into the parameter space, i.e, . Thereby, we can construct an estimate
to the parameter kernel density, which corresponds to a particular
observed objective function value. The use of a BNN for the construction
of the surrogate guarantees flexibility in the approximation as it
has already been reported that BNNs are versatile function approximators
at a favorable linear scaling with the number of observations.[37,38] Details on the BNN architecture are reported in the Supporting Information (see Sec. S.3).The implementation of Phoenics supports the construction and training
of the surrogate model using either the PyMC3 library,[61,62] or the Edward library.[63] In both cases,
the model parameters θ of the BNN are trained via
variational inference. We start the sampling procedure with 500 samples
of burn-in followed by another 1000 iterations retaining every 10th
sample. This protocol was fixed for all tasks.We can construct
an approximation to the kernel density from the
distributions of BNN parameters learned from the sampling procedure.
In particular, for observed conditions we compute
the kernel densities, which
are then used to approximate the objective function. The kernel density p() generated from a single observed parameter point (see Figure b) can therefore be written
in closed form in eq , where ⟨·⟩ denotes the average over all sampled
BNN architectures, and pred denotes the parameter points sampled from the BNNWe formulate the
approximation to the objective function as an
ensemble average of the observed objective function values f taken over the set of computed
kernel densities p() (see eq ).[64,65] In this ensemble average, each
of the constructed distributions p() is rescaled by the value
of the objective function f observed for the parameter point (Figure c). Note that α converges to the true objective f in the limit of infinitely many distinct function evaluations,
as the kernel densities p become more peaked due to the increasing precision in the Gaussian
prior. Details are provided in the Supporting Information (see Sec. S.5).This approximation to the objective function allows for inexpensive
evaluations for any given parameter point as repetitive, full BNN evaluations are avoided.
Acquisition
Function
In the resulting approximation
α we effectively model the expectation value of f for a given parameter point based
on prior observations. However, the parameter space could contain
low-density regions, for which the objective function approximation
α() is inaccurate (see Figure c).With these
considerations, we propose an acquisition function based on the kernel
densities p() for observations detailed
in eq . The acquisition
function differs from the
approximation to the objective function (see eq ) by an additional term puniform() in the numerator
and the denominator, which denotes the uniform distribution on the
domain (see Figure d). In the numerator, puniform() is scaled by a factor λ, referred
to as the sampling parameter from hereon. Note that this modification
to α does not affect its convergence behavior (see the Supporting Information, Sec. S.5)The introduced parameter λ effectively compares the cumulative
height of each rescaled density estimate p() to the uniform
distribution. While the p() are constructed from the knowledge
we acquired from previous experiments, puniform is used as a reference to indicate the lack of knowledge in parameter
space regions where little or no information is available yet. The
sampling parameter therefore balances between acquired knowledge and
the lack of knowledge, which effectively tunes the exploitative and
explorative behavior of the algorithm.Figure d illustrates
the behavior of Phoenics on a one-dimensional objective function with
different λ values. In this example, the acquisition function
is constructed from eight observations indicated in green. Note that
the acquisition function approximates the value of the objective function
at observed parameter points. Acquisition functions which were constructed
from a more positive λ show low values only in the vicinity
of the observation with the lowest objective function value. In contrast,
acquisition functions which were constructed from a more negative
λ show low values far away from any observation. The choice
for the value of the exploration parameter λ can therefore be
directly related to explorative or exploitative behavior when proposing
new conditions based on the global minimum of the surrogate. With
a large positive value of λ, Phoenics favors exploitation, while
a large negative value favors exploration. When λ = 0, the acquisition
function shows no preference for a particular sampling strategy. Details
on the global optimization of the surrogate are provided in the Supporting Information (see Sec. S.6).From Figure d we
see that distinct points in parameter space are proposed based on
particular values of the exploration parameter λ. The best choice
of λ for a given objective function is a priori unknown. However,
with the possibility to rapidly construct several acquisition functions
with biases toward exploration or exploitation, we can propose multiple
parameter points in batches based on different sampling strategies.
The newly proposed parameter points are then evaluated on the black-box
optimization function in possibly parallel evaluation runs.
Results
and Discussion
In this section we report the performance
of Phoenics and compare
it to four frequently used optimization algorithms: particle swarm
optimization (PSO),[55,56] covariance matrix adaptation
evolution strategy (CMA-ES),[22,23] as well as Bayesian
optimization based on GPs and based on RFs.PSO is implemented
in the “pyswarms” python module.[66] An implementation of CMA is also available in
the “cma” module.[67] Default
settings as provided by the modules have been used for both optimization
algorithms. The “spearmint” software package performs
Bayesian optimization using GPs and the predictive entropy acquisition
function.[31,36] Batch optimization is implemented in spearmint
via estimating the expected objective value for future evaluations.
The SMAC software employs RF models and allows for batch optimization
by running multiple RF instances sharing the same set of samples.[33−35]Chemical systems can have complex, qualitatively different
response
surfaces. As chemical reactions are time-consuming to evaluate, we
therefore assess the performance of each of these three algorithms
on a set of 15 benchmark functions covering a large range of qualitatively
diverse response surfaces for problems in chemistry. The employed
functions are well-established benchmarks and include continuous and
convex, nonconvex, or discrete functions with possibly multiple global
minima. A complete list of the employed objective functions as well
as their global minima is provided in the Supporting Information (see Table S.1).For reliable performance estimates
we executed 20 independent optimization
runs initialized with different random seeds, unless noted otherwise.
During each run we record the lowest achieved objective function value
after each iteration. We compare the averaged lowest achieved objective
function values by relating to results from simple random searches.
Each random search was run for 104 objective function evaluations,
and results were averaged over 50 independent runs initialized with
different random seeds. The average lowest achieved objective function
values of the random search runs are summarized in the Supporting
Information (see Table S.2).Our
benchmark calculations indicate that Bayesian-based optimization
algorithms outperform PSO and CMA-ES. As a matter of fact, after 200
function evaluations, PSO and CMA-ES yield significantly higher deviations
from the global optimum than the three studied Bayesian optimization
algorithms. Even when increasing the number of function evaluations
by an order of magnitude, from 200 to 2000, PSO (and CMA-ES) fails
to achieve lower deviations than the ones obtained with Phoenics after
200 evaluations for 12 (and 13) out of 15 objective functions. We
therefore restrict our further analyses and comparisons to only Phoenics,
GP optimization, and RF optimization. The benchmark results conducted
with PSO and CMA-ES are detailed in the Supporting Information (see Figure S.6)
Analytic Benchmarks
Phoenics was
set up with three
different values for the sampling parameter, λ ∈ {−1,
0, 1}, to assess the effectiveness of a particular parameter choice.
In Figure we report
the number of objective function evaluations required by each of the
optimization algorithms to reach an objective function value lower
than the average lowest value found in random searches. Optimization
traces for these runs on all 15 objective functions are reported in
the Supporting Information (see Figure S.3).
Figure 2
Number of objective function evaluations required to reach objective
function values lower than the average lowest achieved values of random
searches with 104 evaluations for Phoenics (λ ∈
{−1, 0, 1}), RFs, and GPs. Results are reported for the Ackley
(A), Dejong (B), Schwefel (C), and dAckley (D) objective functions.
Details on the benchmark functions are provided in the Supporting Information, Sec. S.1.
Number of objective function evaluations required to reach objective
function values lower than the average lowest achieved values of random
searches with 104 evaluations for Phoenics (λ ∈
{−1, 0, 1}), RFs, and GPs. Results are reported for the Ackley
(A), Dejong (B), Schwefel (C), and dAckley (D) objective functions.
Details on the benchmark functions are provided in the Supporting Information, Sec. S.1.We find that GP optimization, as implemented in
spearmint, generally
quickly finds the global minimum if the objective function is strictly
convex. In contrast, RF optimization, as implemented in SMAC, quickly
finds the global minimum of objective functions with a discrete codomain.
The performance of Phoenics varies with different values of the sampling
parameter λ. When favoring exploitation over exploration, i.e.,
λ > 0, the algorithm performs better if the objective function
features narrow and well-defined funnels (e.g., Ackley in Figure a or Schwefel in Figure c). With this choice
for the sampling parameter, the algorithm is slightly biased toward
exploring the local region around the current optimum. This behavior,
however, is unfavorable in other cases, for instance, when the objective
function has a discrete codomain (e.g., dAckley in Figure d). Since parameter points
in the vicinity to the current optimum likely yield the same value
if the objective function is discrete or quasidiscrete, Phoenics performs
better on such objective functions when favoring exploration over
exploitation, i.e., λ < 0.
Developing a Collective
Sampling Strategy
The dependence
of the performance of Phoenics on the sampling parameter λ could
be eliminated by marginalizing over this parameter. Marginalization
over the sampling parameter would effectively average out the advantageous
effects of a bias toward exploitation for some objective functions
and toward exploration for other objective functions.The shape
of the objective function is a priori unknown, so suitable choices
of the sampling parameter cannot be determined beforehand. However,
since the sampling parameter can be directly related to the explorative
and exploitative behavior of the algorithm we follow an approach to
take full advantage of the sampling policy. We suggest to propose
parameter points based on multiple different sampling parameter values.
Note that values of λ are chosen beforehand and are kept fixed
throughout the optimization procedure.Given a set of observations the construction of several objective
surrogates
with different values of λ is computationally cheap. This allows
us to suggest multiple parameter points at each optimization iteration,
which are proposed from more explorative and more exploitative parameter
values, at almost no additional cost. With the observations on the
simple benchmarks we would expect a synergistic effect of this batch
optimization over sequential optimization with a single sampling parameter
value. As parameter points can be proposed with both a bias toward
exploration and a bias toward exploitation, we expect the number of
required objective function evaluations to decrease. In addition,
suggesting a batch of parameter points in one optimization step allows
for the parallel evaluation of all proposed points, which accelerates
the optimization process.The behavior of the three studied
optimization algorithms under
parallel optimization on the Ackley objective function is highlighted
in Figure . Full results
on the entire benchmark set are reported in the Supporting Information (see Sec. S.8). Figure depicts the minimum achieved objective function
values for different runs with a different number of parallel evaluations
of the objective function averaged over 20 independent runs. The minimum
achieved objective function values are presented per number of objective
function evaluations (left panel) and per batch evaluation (right
panel).
Figure 3
Average minimum objective function values for the Ackley function
achieved in 20 independent runs of the three optimization algorithms
studied in this work: our optimizer (Phoenics), spearmint (GP), and
SMAC (RF). For each run a different number of proposed samples p was evaluated in parallel. Minimum achieved objective
function values are reported with respect to the total number of objective
function evaluations and the number of evaluated batches. The dashed
blue lines denote the minimum achieved error after 104 of
random search for reference.
Average minimum objective function values for the Ackley function
achieved in 20 independent runs of the three optimization algorithms
studied in this work: our optimizer (Phoenics), spearmint (GP), and
SMAC (RF). For each run a different number of proposed samples p was evaluated in parallel. Minimum achieved objective
function values are reported with respect to the total number of objective
function evaluations and the number of evaluated batches. The dashed
blue lines denote the minimum achieved error after 104 of
random search for reference.We find that both spearmint and SMAC achieve low objective
function
values in fewer batches with an increasing number of points p proposed in each batch. While increasing the number of
samples proposed per batch initially significantly improves the performance
with respect to the number of proposed batches, this advantageous
effect quickly levels off until there is no significant improvement
beyond six samples per batch. However, when comparing the minimum
achieved objective function values with respect to the total number
of objective function evaluations, we did not observe any significant
difference between runs with a different number of samples proposed
per batch.In contrast, Phoenics shows a different behavior.
Our algorithm
not only reaches lower objective function values in a fewer number
of batches when proposing more samples per batch, but also shows a
better performance when considering the total number of function evaluations.
This synergistic effect demonstrates that Phoenics indeed benefits
from proposing points with multiple sampling strategies in cases in
which proposed samples are evaluated sequentially.The performance
improvement of Phoenics when proposing parameter
points in batches at each optimization iteration is demonstrated on
all 15 considered objective functions in the Supporting Information (see Sec. S.8). We ran our optimizer with four
different sampling strategies using sampling parameter values evenly
spaced across the [−1, 1] interval. All four proposed parameter
points are then evaluated before we started another optimization iteration.
For this particular batching protocol, we find that Phoenics outperforms
RF-based optimization on all benchmark functions and GP-based optimization
on 12 out of 15 benchmark functions. If and only if the objective
function is convex, GP optimization finds lower objective function
values. In addition, we observe the aforementioned synergistic effect
of batch optimization for 12 out of 15 benchmark functions. Despite
reducing the number of optimization iterations by a factor of 4, the
achieved objective function values were found to be lower than values
achieved in sequential optimizations with all three considered fixed
sample parameter value.We suggest that this improved performance
of the algorithm is due
to the trade-off between exploration and exploitation. The exploration
samples systematically sample the parameter space and ensure that
the algorithm does not get stuck in local minima, while the exploitation
samples explore the local environment of the current global minimum.
This sampling behavior is illustrated in Figure for the Michalewicz function. The optimization
runs on the Michalewicz function were all started from the same two
random samples illustrated in black for all three investigated optimization
algorithms. Bayesian optimization based on GPs as implemented in spearmint
(lower panels) tends to sample many parameter points close to the
boundaries of the domain space in this particular example. RF optimization
as implemented in SMAC (central panels), however, shows a higher tendency
of exploring the parameter space.
Figure 4
Progress of sample optimization runs of
the three studied optimization
algorithms on the two-dimensional Michalewicz function. Phoenics proposed
a total of three samples per batch, which were then evaluated in parallel.
Each sample was suggested based on a particular value of the exploration
parameter λ ∈ {−1, 0, 1}. Left panels illustrate
the parameter points proposed at each optimization iteration while
right panels depict the achieved objective function values. Depicted
points are more transparent at the beginning of the optimization and
more opaque toward the end. Starting points for the optimization runs
are drawn as black squares.
Progress of sample optimization runs of
the three studied optimization
algorithms on the two-dimensional Michalewicz function. Phoenics proposed
a total of three samples per batch, which were then evaluated in parallel.
Each sample was suggested based on a particular value of the exploration
parameter λ ∈ {−1, 0, 1}. Left panels illustrate
the parameter points proposed at each optimization iteration while
right panels depict the achieved objective function values. Depicted
points are more transparent at the beginning of the optimization and
more opaque toward the end. Starting points for the optimization runs
are drawn as black squares.Phoenics (Figure , upper panels) starts exploring the space and quickly finds
a local
minimum in vicinity of one of the initial samples. After finding this
local minimum, samples which are proposed based on a more exploitative
(positive) value of the sampling parameter λ explore the local
environment of this local minimum while samples proposed from more
explorative (negative) values of λ explore the entire parameter
space. As soon as the exploration points find a point in parameter
space with a lower value of the objective function, the exploitation
points jump to this new region in parameter space and locally explore
the region around the current best to quickly converge to the global
minimum.Overall we have demonstrated that the value of the
sampling parameter
λ in the proposed acquisition function clearly influences the
behavior of the optimization procedure toward a more explorative behavior
for more negative values of this parameter and a more exploitative
behavior for more positive parameter values. Batched optimization
improves the performance of Phoenics even in terms of total objective
function evaluations and reduces the number of required optimization
iterations.
Increasing the Number of Dimensions
Practical chemistry
problems are typically concerned with more than just two parameters.
In fact, chemical reactions can be influenced by environmental conditions
and experimental device settings, and computational studies frequently
employ parameters to describe the system of interest. In this section
we illustrate the performance of Phoenics in parameter spaces with
dimensions k > 2.We evaluate the performance
of the three optimization algorithms on the same objective function
subset, but now successively increase the dimensionality of the parameter
space from 2 to 20. Based on the results on batch optimization we
ran GP optimization and RF optimization with one point per batch and
the optimization algorithm introduced in this study with four points
per batch on each considered benchmark function. Exploration parameter
values were chosen to be evenly spaced across the [−1, 1] interval.For better comparisons we report the average deviation of the lowest
encountered objective function value from the global minimum of each
function taken over 20 independent optimization runs. Average deviations
achieved by each of the optimization algorithms after 200 objective
function evaluations are depicted in Figure .
Figure 5
Average deviations taken over 20 independent
runs between the lowest
encountered objective function value and the global minimum achieved
after 200 objective function evaluations for different parameter set
dimensions. Results are reported for Ackley (A), Dejong (B), Schwefel
(C), and dAckley (D). Uncertainty bands illustrate bootstrapped estimates
of the deviation of the means with one and two standard deviations.
Average deviations taken over 20 independent
runs between the lowest
encountered objective function value and the global minimum achieved
after 200 objective function evaluations for different parameter set
dimensions. Results are reported for Ackley (A), Dejong (B), Schwefel
(C), and dAckley (D). Uncertainty bands illustrate bootstrapped estimates
of the deviation of the means with one and two standard deviations.We observe that Phoenics maintains
its rapid optimization properties
for a variety of different objective functions even when increasing
the number of dimensions. In the case of the Ackley function (Figure a) Phoenics appears
to find and explore the major funnel close to the global optimum faster
than the other two optimization algorithms regardless of the number
of dimensions. The paraboloid (Figure b) is an easy case for the GP in low dimensions, but
is optimized the fastest by Phoenics when considering parameter spaces
with seven or more dimensions. No major differences are observed for
the Schwefel function (Figure c). However, in the case of a discrete objective function
(Figure d) Phoenics
seems to have a slight advantage over the other two optimizers for
lower dimensions and performs about as well as random forest optimization
for higher dimensions.
Application to Chemistry
In this
section, we demonstrate the applicability of Phoenics on
the Oregonator, a model system of a chemical reaction described by
a set of nonlinear coupled differential equations.[68] In particular, we demonstrate how Phoenics can be employed
to propose a set of conditions for an experimental procedure. The
experimental procedure can then be executed with the proposed conditions,
and the results of the procedure are reported back to Phoenics. With
this feedback, Phoenics can make more informed decisions and, thus,
provides more promising sets of experimental conditions, to eventually
result in the discovery of the optimal set of conditions.Most
chemical reactions lead to a steady-state, i.e., a state in
which the concentrations of involved compounds are constant in time.
While chemical reactions described by linear differential equations
always feature such a steady-state, more complicated dynamics phenomena
can arise for reactions described by sets of nonlinear coupled differential
equations. With the right choice of parameters, such differential
equations may have a stable limit cycle, leading to periodic oscillations
in the concentrations of involved compounds.[69,70]One of the earliest discovered reactions featuring a stable
limit
cycle for a set of reaction conditions is the Belousov–Zhabotinsky
reaction.[71,72] This network of chemical reactions involves
temporal oscillations of [CeIV] and [CeIII].
The entire reaction network can be written as a set of three subreactions
listed in Scheme .
For details on the mechanism we refer to a brief summary in the Supporting Information (see Sec. S.10) as well
as to the literature.[68,72−75]
Scheme 1
Subreactions of the
Belousov–Zhabotinsky Reaction[78]
Models at different levels
of complexity have been developed to
describe the dynamic behavior of the Belousov–Zhabotinsky reaction.[69,75−77] One of the simplest models of this reaction is the
Oregonator.[68] The Oregonator consists of
a set of three coupled first-order nonlinear differential equations
for three model compounds X, Y,
and Z, which are shown in eqs –6. These equations
involve five reaction constants k, a stoichiometric factor f, determined by
the prevalence of one subreaction over another subreaction, and the
concentration of two additional chemical compounds A and B. A map of eqs –6 to Scheme is outlined in ref (68).The set of differential equations in the Oregonator can be
reduced
into a dimensionless form, such that the number of correlated parameters
is reduced to a smaller set of independent parameters. This reduced
version of the Oregonator, presented in the Supporting Information (see Sec. S.10), includes three dimensionless variables
α, η, and ρ, which describe the concentration of
chemical species, and four dimensionless reaction constants.Phoenics is used to reverse engineer the set of reaction conditions
consisting of three initial concentrations (α0, η0, and ρ0) and four reaction constants (q, s, w, and f) from the concentration traces computed in the original publication.[68] Parameter values yielding the target concentration
traces are reported in Table . The goal is 2-fold: (i) find a set of conditions for which
the dynamical behavior qualitatively agrees with the behavior of the
target, i.e., find chemical oscillations, and (ii) fine-tune these
conditions such that we reproduce the dynamical behavior on a quantitative
level. We aim to achieve these two goals while keeping the number
of function evaluations, i.e., the number of experiments to run, to
a minimum. Note that this constraint along with the dimensionality
of the parameter space implies that grid searches or gradient-based
algorithms are not suited for this problem.
Table 1
Reaction
Parametersa of the Reduced Oregonator Model
for the Belousov–Zhabotinsky
Reaction
parameter
target value
range
s
77.27
0 ... 100
w
0.1610
0 ... 1
q
8.375 × 10–6
10–8 ... 10–4
f
1
0 ... 5
α0
2.0 × 107
104 ... 109
η0
3.3 × 103
103 ... 105
ρ0
4.1 × 104
103 ... 106
Target parameters induce the
existence of a limit cycle, from which chemical oscillations emerge.
For finding these target parameters via optimization we constrained
the domain space to the reported ranges. All reported quantities are
dimensionless.
Target parameters induce the
existence of a limit cycle, from which chemical oscillations emerge.
For finding these target parameters via optimization we constrained
the domain space to the reported ranges. All reported quantities are
dimensionless.Phoenics
was run in parallel proposing four samples per batch with
λ equally spaced on the [−1, 1] interval. We compare
the performance to PSO in the pyswarms module, CMA-ES in the cma module,
GP optimization in spearmint, and RF optimization in SMAC. Each of
the five optimization algorithms was used in 50 independent optimization
runs for 150 evaluations.All optimization procedures were carried
out on a constrained parameter
space reported in Table . Note that different choices of parameter sets within this bounded
domain can result in quantitatively and qualitatively different dynamical
behavior. In particular, parameter choices close to the target result
in oscillatory behavior, for which the reduced concentrations α,
ρ, and η change periodically over time, while other parameter
choices can break the limit cycle and create a stable fixed point
instead.[68,74,77]Concentration
traces for a sampled set of reaction parameters were
computed with a fourth-order Runge–Kutta integrator with adaptive
time stepping. The integrator was run for a total of 107 integration steps covering 12 full concentration oscillations for
the target parameter set. Sampled concentration traces are compared
to the target concentration traces after a cubic spline interpolation.
The distance (loss) between the sampled traces and the target traces
is calculated as the euclidean distance between the points in time
at which a concentration trace reaches a dimensionless concentration
value of 100.Average achieved losses for all three optimization
algorithms are
displayed in Figure . Loss values between 300 and 500 indicate that the periodicity of
the predicted concentration traces resembles the periodicity of the
target traces; i.e., the predicted traces qualitatively agree with
the target. Quantitative agreement, i.e., matching traces, is only
achieved for loss values lower than about 100. Examples for concentration
traces yielding different losses are presented in the Supporting Information
(see Figure S.7).
Figure 6
Average achieved losses
for finding reaction parameters of the
reduced Oregonator model achieved by the five optimization algorithms
employed in this study. Correct periodicities of the concentration
traces are achieved for losses lower than 500. Uncertainty bands illustrate
bootstrapped deviations on the mean for one standard deviation.
Average achieved losses
for finding reaction parameters of the
reduced Oregonator model achieved by the five optimization algorithms
employed in this study. Correct periodicities of the concentration
traces are achieved for losses lower than 500. Uncertainty bands illustrate
bootstrapped deviations on the mean for one standard deviation.Figure shows concentration
traces associated with the lowest loss achieved by each of the three
optimization algorithms across all 50 independent runs. Phoenics is
the only algorithm reproducing qualitatively and quantitatively target
dynamic behavior within 150 optimization iterations. RF optimization
only finds parameter sets which qualitatively agree with the target.
GP optimization finds only in rare occasions concentration traces
in qualitative agreement with the target. Both PSO and CMA-ES consistently
yield high losses for the first 75 evaluations, after which PSO slightly
improves but never reaches the degree of agreement achieved by Phoenics.
Figure 7
Time traces
of dimensionless concentrations of compounds in the
Oregonator model. Target traces are depicted with solid, transparent
lines while predicted traces are shown in dashed, opaque lines. Traces
were simulated for a total of 12 dimensionless time units, but are
only shown for the first six time units for clarity.
Time traces
of dimensionless concentrations of compounds in the
Oregonator model. Target traces are depicted with solid, transparent
lines while predicted traces are shown in dashed, opaque lines. Traces
were simulated for a total of 12 dimensionless time units, but are
only shown for the first six time units for clarity.
Conclusion and Outlook
We introduced Phoenics, an algorithm
for global optimization in the context chemistry and experimentation.
Phoenics is designed for scenarios where the merit of a set of conditions
is evaluated via experimentation or expensive computations, which
can possibly be parallelized. Our probabilistic optimizer combines
Bayesian optimization with conceptual aspects of Bayesian kernel density
estimation. As such, our algorithm is well-suited for applications
where evaluations of the objective function are expensive with respect
to budgeted resources such as time or money. Through an exhaustive
benchmark study, we showed that Phoenics improves over optimization
strategies based on particle swarms or evolutionary approaches, as
well as on existing Bayesian global optimization methods and avoids
redundant evaluations of the objective.We formulate an inexpensive
acquisition function balancing the explorative and exploitative behavior
of the algorithm. This acquisition function enables intuitive sampling
policies for an efficient parallel search of global minima. By leveraging
synergistic effects from running multiple sampling policies in batches,
the performance of the algorithm improves, and requires a reduced
total number of objective function evaluations.The applicability
of Phoenics was highlighted on the Oregonator,
a model system describing a complex chemical reaction network. Phoenics
was able to determine the set of seven experimental conditions reproducing
a target dynamic behavior in the concentrations of involved chemical
species. High degrees of qualitative and quantitative agreement could
be achieved with only 100 merit-evaluations of proposed conditions
despite the rich solution space containing both steady-state systems
and chemical oscillators.We believe that Phoenics has the potential
to be applied to a wide
range of problems, from optimization of reaction conditions and material
properties, over control of robotics systems, to circuit design for
quantum computing.[79,80] All in all, we recommend Phoenics
for an efficient optimization of scalar, possibly nonconvex, black-box
unknown objective functions.
Authors: Connor W Coley; Regina Barzilay; Tommi S Jaakkola; William H Green; Klavs F Jensen Journal: ACS Cent Sci Date: 2017-04-18 Impact factor: 14.553
Authors: Eugene N Muratov; Jürgen Bajorath; Robert P Sheridan; Igor V Tetko; Dmitry Filimonov; Vladimir Poroikov; Tudor I Oprea; Igor I Baskin; Alexandre Varnek; Adrian Roitberg; Olexandr Isayev; Stefano Curtarolo; Denis Fourches; Yoram Cohen; Alan Aspuru-Guzik; David A Winkler; Dimitris Agrafiotis; Artem Cherkasov; Alexander Tropsha Journal: Chem Soc Rev Date: 2020-05-01 Impact factor: 54.564
Authors: Benjamin J Shields; Jason Stevens; Jun Li; Marvin Parasram; Farhan Damani; Jesus I Martinez Alvarado; Jacob M Janey; Ryan P Adams; Abigail G Doyle Journal: Nature Date: 2021-02-03 Impact factor: 49.962
Authors: Martin Seifrid; Robert Pollice; Andrés Aguilar-Granda; Zamyla Morgan Chan; Kazuhiro Hotta; Cher Tian Ser; Jenya Vestfrid; Tony C Wu; Alán Aspuru-Guzik Journal: Acc Chem Res Date: 2022-08-10 Impact factor: 24.466
Authors: Brian Rohr; Helge S Stein; Dan Guevarra; Yu Wang; Joel A Haber; Muratahan Aykol; Santosh K Suram; John M Gregoire Journal: Chem Sci Date: 2020-01-29 Impact factor: 9.825
Authors: B P MacLeod; F G L Parlane; T D Morrissey; F Häse; L M Roch; K E Dettelbach; R Moreira; L P E Yunker; M B Rooney; J R Deeth; V Lai; G J Ng; H Situ; R H Zhang; M S Elliott; T H Haley; D J Dvorak; A Aspuru-Guzik; J E Hein; C P Berlinguette Journal: Sci Adv Date: 2020-05-13 Impact factor: 14.136
Authors: Loïc M Roch; Florian Häse; Christoph Kreisbeck; Teresa Tamayo-Mendoza; Lars P E Yunker; Jason E Hein; Alán Aspuru-Guzik Journal: PLoS One Date: 2020-04-16 Impact factor: 3.240