Literature DB >> 35235453

Adaptive Monte Carlo augmented with normalizing flows.

Marylou Gabrié^1,2, Grant M Rotskoff³, Eric Vanden-Eijnden⁴.

Abstract

SignificanceMonte Carlo methods, tools for sampling data from probability distributions, are widely used in the physical sciences, applied mathematics, and Bayesian statistics. Nevertheless, there are many situations in which it is computationally prohibitive to use Monte Carlo due to slow "mixing" between modes of a distribution unless hand-tuned algorithms are used to accelerate the scheme. Machine learning techniques based on generative models offer a compelling alternative to the challenge of designing efficient schemes for a specific system. Here, we formalize Monte Carlo augmented with normalizing flows and show that, with limited prior data and a physically inspired algorithm, we can substantially accelerate sampling with generative models.

Entities: Chemical

Keywords: Monte Carlo; free energy calculations; normalizing flows; phase transitions

Year: 2022 PMID： 35235453 PMCID： PMC8915891 DOI： 10.1073/pnas.2109420119

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN： 0027-8424 Impact factor: 12.779

Monte Carlo approximations are the method of choice to extract information from high-dimensional probability distributions encountered in the description of natural systems and statistical models. One generic feature of these distributions that is particularly challenging for sampling is multimodality (or metastability): that is, when low-probability regions separate high-probability regions (or basins) of the state space. Markov Chain Monte Carlo (MCMC) algorithms, which are driven primarily by local dynamics such as Hamiltonian Monte Carlo or Langevin dynamics, typically struggle to transition between metastable basins, leading to either extremely long correlation times along the chains and few effective independent samples or even failure to equilibrate at all. As a result, slow relaxation and metastability plague sampling problems that arise in chemistry and biophysics (1). On the other hand, generative models, which have garnered much attention in the machine learning literature, seem to efficiently sample complicated high-dimensional distributions, such as collections of images. Most of these generative models, including generative adversarial networks (2) and variational autoencoders (3), rely on the transformation of samples from a simple and tractable base distribution through a map parametrized with neural networks. After learning, the map transforms samples from the base to mimic samples of a given empirical distribution. This formulation allows for drawing independent samples from the model at a negligible cost. However, the conventional strategy for training a generative model requires an extensive dataset of samples. Arguably, these models have succeeded most dramatically in domains where the cost of generating and curating data are comparatively low (e.g., image recognition) (2, 4, 5). In scientific computing applications, obtaining data from the distribution is the primary goal. Furthermore, the quality metrics used in traditional machine learning applications are a priori quite different from the efficacy and precision in sampling a target distribution. Hence, it is natural to ask whether traditional MCMC methods and generative models can be successfully combined to accelerate sampling of complicated high-dimensional distributions? The prospect of enhancing sampling with suitable generative models is an active area of inquiry (4, 6–11). In particular, sampling via Metropolis–Hastings MCMC requires the computation of each transition generation probability and its inverse. As a result, the model architectures on which most generative neural networks rely are not conducive to Metropolis–Hasting MCMC. However, specific classes of neural networks have been designed with this in mind, allowing for efficient estimates of the probability of a generated sample, including autoregressive models (12) and normalizing flows (NFs), which are expressive invertible function representations (13, 14). At this point, NFs have been investigated as transition operators in MCMC algorithms and variational ansatz in a variety of contexts in the physical sciences and Bayesian applications (13, 15–20). These methods offer a promising speedup for sampling unimodal distributions without requiring preexisting data samples by relying on a self-training objective for the map (described in detail in ). However, the multimodal case requires prior knowledge about either the symmetries of the systems generating the degeneracy of the modes (20) or the location of the metastable basins (17). This necessity was noted in the influential work of Noé et al. (21) that proposes a training strategy for NFs to generate low-energy configurations, which are subsequently reweighted. The aim of this paper is to propose an alternative class of adaptive MCMC algorithms augmented with an NF trained on the fly with the generated samples and also, to carefully assess the prospects of these algorithms for accelerating sampling in cases where no extensive preexisting dataset is available. Our main contributions are as follows. We introduce an adaptive Metropolis–Hastings MCMC algorithm that augments a chain performing local steps with nonlocal resampling steps proposed by an NF. The corresponding proposal distribution is adapted along sampling by training the NF via optimization of a forward Kullback–Leibler divergence estimated on the generated data. As the adaptation of the map depends on the history of the chains, the convergence of the proposed algorithm, where training and sampling happen simultaneously, is not trivial. We show that the adaptive algorithm is akin to a nonlinear MCMC scheme (22), which we analyze in the continuous-time limit. In this limit, we show that the algorithm converges asymptotically with an exponential rate that can be explicitly estimated. We test this adaptive MCMC approach on complex examples in high dimension (random fields, transition paths, and interacting particle systems at phase coexistence) and show that it dramatically accelerates the sampling. In particular, we estimate the relative statistical weights of metastable states efficiently without constructing a specific pathway between the basins of interest. Our results also emphasize some key determinants for the success of sampling augmented with learning. One representative configuration per mode of interest in the target distribution must be known beforehand to initialize the chains. We critically assess the ability of using generative model proposals to discover unknown metastable states and show that this prospect is statistically unlikely without prior information about these states. Blending generative sampling with a standard MCMC strategy is typically required to guarantee good sampling of the target distribution; in particular, we show that relying on generative sampling alone may not be sufficient because it requires that we learn the generative model to a degree of accuracy that is hard to achieve in practice, especially in high-dimensional examples. Finally, our analysis and numerical experiments show how scaling to high dimensions is also facilitated by parametrizations of NFs that incorporate known structures of the target distributions, such as short-scale correlations. The possibility to inform the map, or the base distribution, with physical intuition alleviates the curse of dimensionality that would prevent general-purpose parametrizations from reaching the required level of precision with reasonably sized models as the dimension grows.

Design Challenges in MCMC Methods

The goal of sampling is to generate configurations in proportion to some probability measure , which we assume has probability density function . In physical systems, we typically write this in Boltzmann form:where is the potential energy function for the system. We assume that we have an explicit model for and can efficiently evaluate this energy function, although we may have little a priori information about the distribution of configurations associated with this energy and in general, do not know the normalization constant . MCMC algorithms avoid computing by generating a sequence of configurations with a transition kernel with for all , which quantifies the conditional probability density of a transition from state x into state y. Assume that the kernel is irreducible and aperiodic (23) and satisfies the detailed balance relation Then, the sequence will sample the target density in the sense that the empirical average of any suitable observable converges to its expectation over : that is, Designing a transition kernel π leading to fast convergence of the series in [3] is a generically challenging task for MCMC algorithms. In Metropolis–Hastings MCMC, one constructs a proposal distribution that creates new samples that are then accepted or rejected according to a criterion that maintains [2]. For example, in the Metropolis-adjusted Langevin algorithm (MALA) (24), new configurations are proposed by approximating the solution of the Langevin equation propagated on a fixed time interval. Metropolis–Hastings MCMC algorithms, however, involve a trade-off between two requirements that are hard to fulfill simultaneously. Proposal distributions using local dynamics like MALA suffer from long decorrelation times when there is metastability in the target density At the same time, seeking faster mixing times with nonlocal proposal distributions requires careful design to avoid high rejection rates. Recent work in the machine learning literature has suggested a data-driven approach to constructing the transition kernel (4, 8, 9) that aids in this design challenge; these approaches originally were pioneered in the context of adaptive and nonlinear MCMC algorithms (22, 25–27). Here, we explore the use of NFs to adaptively parameterize the transition kernel.

MCMC Sampling with NFs

An NF is an invertible map that is optimized to transport samples from a base measure (for example, a Gaussian with unit variance) to a given target distribution (14). The goal is to produce a map with inverse such that an expectation of an observable with respect to can be estimated by transforming samples from the base density to the target: that is, if is drawn from , then is a sample from so that for any suitable observable , we have The existence of such a map is guaranteed under the general conditions on and investigated (e.g., in the context of optimal transport theory) (28, 29). Of course, in practice we do not have direct access to this ideal map . Next, we discuss how any approximation T of can in principle be used to perform exact sampling of the target via Metropolis–Hastings MCMC and how the map T can be improved via training.

Metropolis–Hastings MCMC with NF.

Throughout, we denote the push forward of under the map T simply by ; it has the explicit formwhere denotes the inverse map [i.e., ]. In practice, the parametrization of the map T must be designed carefully to evaluate this density efficiently, requiring easily estimable Jacobian determinants and inverses. This issue has been one of the main foci in the NF literature (14) and is, for instance, solved using coupling layers (30, 31). Even if the map T is not the optimal [i.e., ], as long as and are either both positive or both zero at any point , we can still generate configurations using T with the correct statistical weight in the target distribution by using a Metropolis–Hastings MCMC algorithm with an accept–reject step; a proposed configuration from a given configuration x is accepted with probability This procedure is equivalent to using the transition kernelwhere . The formula in Eq. for the acceptance probability emphasizes that if the generated configurations do not have appreciable statistical weight in the target distribution [i.e., is very small], few configurations will be accepted. This problem can become fundamental in high-dimensional spaces because unless care is taken to ensure otherwise, the push-forward measure and the target will not overlap (a discussion of this issue and a precise measure-theoretic formulation of MCMC with NF are in ). In contrast, when the map yields an appreciable acceptance rate, the flow-based proposals may mix much faster than proposals based on local moves as independent configurations y can be directly sampled from . We illustrate these features in numerical experiments presented below.

Map Training.

Improving the map T requires that we optimize some objective function measuring the discrepancy between the and : for example, the Kullback–Leibler (KL) divergence of with respect to that is given by an expectation over ,where is a constant irrelevant for the optimization of this objective over T. Typically, this procedure is used in situations where a dataset from is available beforehand (4, 5) and can be used to construct an empirical approximation of Eq. ; in contrast, we are focused on situations where only limited data exist initially. In this context, it has been suggested (13, 15, 21) to use the reverse KL divergence of with respect to since it can be expressed as an expectation over : The (unknown) constant is irrelevant for the optimization of this objective over T. This approach seems to alleviate altogether the need of preexisting samples from ; however, it rests on the possibility to discover relevant regions on via sampling . In practice, this may be very hard to achieve unless we have a good estimate of the ideal to begin with, which is typically not the case; for this reason, here we will resort to optimizing an approximation of the direct KL in Eq. . This procedure, described in the next section, relies on a dynamical estimate of the forward KL divergence that uses data generated via an adaptive MCMC that synergistically takes advantage of the learning to produce samples of the target efficiently. We stress that once the map T becomes accurate enough, Eqs. and can also be combined for further training, as was done, for example, in the related context of Boltzmann generators (21) (for a road map of the different possible strategies to train T, we refer the reader to ). We also stress that trainable generative models other than NFs can be used as well, as long as they offer an easy way to sample some that can be adapted to the target ; this feature is illustrated in the numerical examples presented below. Adaptive MCMC: concurrent MCMC sampling and map training. 1: SampleTrain(, T, , τ, , ϵ) 2: Inputs: target energy, T initial map, initial data, time step, total duration, number of Langevin steps per resampling step, map training time step 3: k = 0 4: while do 5: for do 6: if then 7: 8: ⊳ push-forward via T 9: with probability , otherwise ⊳ resampling step 10: else 11: with ⊳ discretized Langevin step 12: with MALA acceptance probability or ULA, otherwise 13: 14: ⊳ evaluate on sampled data 15: ⊳ Update the map 16: return: , T

Adaptive MCMC: Concurrent Sampling and Training

The adaptive MCMC we propose concurrently acquires data by combining a local sampler with a nonlocal one based on an NF and uses these data to further optimize the flow. This procedure is summarized in with MALA as the local MCMC algorithm, and it involves the following components.

Sampling.

Our algorithm combines MCMC steps using a local kernel π with those obtained using the NF kernel π in Eq. . Assuming for simplicity that we make consecutive steps with each kernel, the algorithm uses the compounded kernelwhich satisfies the detailed balance relation Eq. because the transitions kernels π and π individually do. While the flow-based kernel π allows global mixing between modes once T is sufficiently optimized, alternating with the local kernel π improves the robustness of the scheme by ensuring that sampling proceeds in places within the modes where the map is not optimal. This is useful during the first iterations of the scheme when the map T is almost untrained as well as once training has converged, if the expressiveness of the map parametrization is not sufficient to capture all the features of the target distribution. In , we demonstrate numerically the benefit of retaining local components to the sampling scheme (). Let us also note that the convergence rate of a chain using is necessarily faster than that of MCMC using or individually; if we assume the existence of a spectral gap for both π and π and denote the leading eigenvalues of these kernels by , and , respectively, we have . While we employ MALA here, any detailed balance MCMC method could be used in Eq. . The transition kernel π does not need to be local; it should, however, have satisfactory acceptance rates. Note that in the experiments that follow, we used ULA because the time steps were sufficiently small to ensure a high acceptance rate.

Adaptation.

The kernel π of is adapted by using the newly sampled configurations as data to optimize the parameters of the NF T. Denoting by ρ the probability density of the chain with kernel after steps from initialization ρ0, we minimize the KL divergence of ρ with respect to , instead of the KL divergence of the unknown with respect to as in Eq. . Denoting by the sample of n chains after steps of MCMC, this amounts to using the following consistent estimator for up to an irrelevant constant: In practice, we use stochastic gradient descent on this loss function to update the parameters of the NF (, line 11). While the expression for the loss is written at iteration k, we can average gradients over multiple MCMC steps. Details of the map parametrizations and training procedures for the experiments presented in the next sections are described in .

Initialization.

To start the MCMC chains, we assume that we have configurations in the different modes of the target, but they are not necessarily drawn from . We emphasize that the method, therefore, applies in situations where the locations of the metastable states of interest are known a priori, and one should not expect the procedure to find states in basins distinct from initialization. We demonstrate that it is unlikely that the adaptive MCMC will discover new metastable basins without any initial information about their location in on the example of a Gaussian mixture model () and in for the random-field example discussed below. We initialize the map T as the identity transformation and propagate the initial data using . The initial sampling is essentially driven by the local MCMC, here Langevin dynamics, as the map is not adapted to the target. As the map improves, nonlocal moves start to be accepted, the autocorrelation time drops, and the Markov chains reallocate mass in proportion to the statistical weights of the different basins. These features are illustrated in in the context of the Gaussian mixture model discussed in and in Figs. 1 and 2 in the context of the random-field example discussed below.

Fig. 1.

Fig. 2.

Concurrent training and sampling of the stochastic Allen–Cahn model with a real-non volume preserving (real-NVP) NF. (A) The stochastic gradient descent using samples generated by the procedure decreases the negative log likelihood gradually. (B) As the training progresses, the acceptance rate in the Metropolis–Hasting using proposals from the NF improves gradually, reaching levels well beyond 50%. The rolling average over the last 50 time steps is plotted in darker color. (C) As independent proposals from the flow start getting accepted, the Markov Chain autocorrelation times drops abruptly. (D) Fast mixing is illustrated by looking at the consecutive states of one walker updated with the transition kernel combining local Langevin updates and resampling with the push forward. In 10 steps, the single walker has jumped between and .

Sampling metastable states of the stochastic Allen–Cahn model with Langevin dynamics augmented with an NF. (A) Configurations obtained by pushing independent samples from the informed base measure Eq. through the flow T at the beginning (black) and at the end of training (blue). Around of generated configurations are accepted according to the Metropolis–Hasting criteria. (B) The learned map T is local in space. (C) Fourier spectrum of the target samples: samples from a flow with informed base measure and uniformed base measure. An informed base measure is necessary to capture the higher-frequency features of the target density. (D) Computation of the free energy differences between positive and negative modes with importance sampling (IS) from the NF as a function of a local biasing field added in the Hamiltonian Eq. . Results are reported for inverse temperature β = 20, as in the rest of the plots, and for the same experiment repeated at temperature β = 10. Errors bars computed from estimator variance are smaller than the marker. Concurrent training and sampling of the stochastic Allen–Cahn model with a real-non volume preserving (real-NVP) NF. (A) The stochastic gradient descent using samples generated by the procedure decreases the negative log likelihood gradually. (B) As the training progresses, the acceptance rate in the Metropolis–Hasting using proposals from the NF improves gradually, reaching levels well beyond 50%. The rolling average over the last 50 time steps is plotted in darker color. (C) As independent proposals from the flow start getting accepted, the Markov Chain autocorrelation times drops abruptly. (D) Fast mixing is illustrated by looking at the consecutive states of one walker updated with the transition kernel combining local Langevin updates and resampling with the push forward. In 10 steps, the single walker has jumped between and .

Convergence.

Two important questions arise regarding . First, does this scheme produce samples that converge in distribution toward the target, and if so, does the adaptive training of the map T improve the rate of convergence to the target distribution? To analyze the properties of a transition operator that combines nonlocal moves with the NF and a local MCMC algorithm, we consider our approach in the continuous-time limit. In this limit, when using Langevin dynamics as local sampler, the density of the evolving with respect to the target , defined as , satisfieswhere and is an adjustable parameter that measures the balance between the Langevin and the resampling parts of the dynamics. Setting α = 0 amounts to using the Langevin dynamics alone; in that case, for any initial condition ρ0, we have that (i.e., ) as , but this convergence will be exponentially slow in general (32). The situation changes if we include the resampling step (i.e., consider Eq. with ). In , under various assumptions about , we derive convergence rates under the dynamics in Eq. for the Pearson divergence of ρ with respect to , which we denote as In particular, we study the situation where T learns the instantaneous distribution at all times: that is, (and hence, ) for all . While this is certainly a significant approximation, we observe in numerical experiments that there is a dramatic improvement in sampling once there is some mixing between metastable basins, which motivates this limiting scenario. In this case, under the assumptions that there exists some such that andwe show that This equation indicates that remains approximately constant for and then, decays exponentially with constant rate subsequently. The derivation of Eq. also shows that the exponential rate is controlled by the resampling step of the MCMC algorithm that relies on the NF, and this rate can only improve when we concurrently use Langevin dynamics steps. In , we connect the sampling scheme we use to a birth–death Fokker–Planck equation (33), which could also be implemented in practice as a Markov jump process; again, this analysis emphasizes the favorable convergence properties of the scheme.

Scalability: Model-Informed Base Distributions and Maps, Mixtures, Etc.

As the dimension of the problem grows, it becomes increasingly difficult to train a map to produce a push-forward distribution matching the target to a given level of accuracy. Before presenting numerical experiments, we emphasize a few additional ingredients, easing the learning of generative models for the sampling of complex high-dimensional systems. When training an NF to represent a target density for which a preexisting empirical dataset is available, a standardizing transformation or a “whitening layer” is typically added at the output (21). This layer centers and rescales the different input dimensions such that their covariance matches the identity covariance of the standard normal distribution usually used as base distribution. This operation, although it requires preexisting data, crucially improves the outcome of learning when the original covariance of the data is highly anisotropic. In the experiments below, we show that it is sometimes possible to rely on the knowledge of the target distribution to perform an operation akin to this whitening layer with no preexisting data samples. For instance, below we choose a base Gaussian distribution with covariance matching the short-scale correlations of equilibrium configurations of the system’s known Hamiltonian. We can also design physics-informed base distributions that are more adapted to the problem at hand than a Gaussian distribution; for example, in the interacting particle system, we used the uniform distribution of the particle in the domain, which is an ideal distribution in the gaseous phase. Using prior knowledge about the physics can also help in designing the class of maps T to optimize upon. For example, in the interacting particle system, we used maps that factorize in ways tailored to the system’s features. Yet another way of easing learning, especially when modes have very different fine structures or statistical weights, is to rely on a mixture of maps instead on a single map, training each component to represent a different mode. A similar idea was exploited in ref. 21 to compute free energy differences between basins after training. In practice, a map T is pretrained for each mode indexed by m using data generated with the local MCMC sampler initialized in the corresponding mode. Then, the adaptive MCMC procedure described in can be started. The nonlocal proposal is the mixture of the push-forward with initial weights p. The adaptive part of the proposal then amounts to optimizing the mixture weights p via Eq. , in a similar fashion as the parameters of the flow when using a single map.* This mixture method requires that we train several maps but allows for treatment of more complex systems, as demonstrated below.

Numerical Experiments

Fast-Mixing Augmented MCMC for Random Fields.

As a first example to illustrate the efficacy of adaptive sampling, we consider a stochastic Allen–Cahn model, a canonical and ubiquitous model for the microscopic physics of phase transitions in condensed matter systems (34).

Field system.

The stochastic Allen–Cahn equation is defined in terms of a random field that satisfieswhere a > 0 is a parameter, β is the inverse temperature, denotes the spatial variable, and η is a spatiotemporal white noise, and we impose Dirichlet boundary conditions in which throughout. This stochastic partial differential equation (SPDE) is well posed in one spatial dimension (35, 36), and its invariant measure is the Gibbs measure associated with the Hamiltonian The first term in the Hamiltonian [17] is a spatial coupling that penalizes changes in and hence, at low temperature, has the effect of aligning the field in positive or negative direction. As a result, the Hamiltonian [17] has two global minima, denoted by and , in which the typical values of are (Fig. 1). Because there is a free energy barrier between and , local updates via traditional MCMC based on, for example, using the stochastic Allen–Cahn Eq. will not mix, even on very long timescales. Indeed, if we wanted to compute the free energy difference between these basins, we would need to construct a pathway through configuration space and use importance sampling techniques along the path (37). Our adaptive MCMC algorithm, augmented with an NF, offers an alternative approach. Fig. 2 demonstrates that a map T can be trained to efficiently generate samples with high statistical weight in the target distribution enabling rapid mixing across the free energy barrier.

Informed base measure.

In order to learn the map robustly, a standard implementation of an NF model, with a standard Gaussian field with uncoupled spins as base measure, does not suffice in this instance. Using a base measure that is “informed” alleviates this issue. Explicitly, we sample the base measure corresponding to a Gaussian random field with a local coupling (a “Ornstein–Uhlenbeck bridge”), which corresponds to a system with Hamiltonian Importantly, this measure does not have any metastability and remains easy to sample. As discussed in , at this continuous-field level, we must choose this measure to ensure that the push-forward distribution has a nonvanishing statistical weight in the target distribution.

Numerical implementation and results.

In practice, we must discretize the field on a grid, and throughout, we take N = 100 with a lattice spacing , meaning that the map we must learn is high-dimensional We also use the associated Langevin equation as the discretized version of the SPDE [16] to generate samples as the local component of our compounded MCMC scheme. We trained maps T and along our adaptive MCMC with the informed base measure [18] and an uninformed Gaussian measure that lacked a coupling term (), respectively, using the same architecture and compared their suitability for resampling after an equal number of iterations. Typical configurations , in this case generated by the NF T, are shown in Fig. 1. For comparison, we show in samples generated with T. While T generates samples that are accepted in the MCMC procedure with average acceptance rate approaching 60% (Fig. 2), fails to produce samples that have appreciable statistical weight in the target distribution. The evident difference is in the local structure of the random fields that are produced. Fig. 1 shows the Fourier spectrum of field computed with samples from the target measure (obtained using the proposed MCMC method after convergence) as well as from the push forward in the informed and uninformed case. While accurately captures the decay of the Fourier components at all scales, fails to compensate for the uncoupled base measure, and does not accurately capture high-frequency oscillations of the field , which subsequently leads to high rejection rates in the MCMC procedure. While at the discrete level, the adequacy of the base measure is a priori less stringent than at the continuous level examined in , this experiment shows that at N = 100, it is already highly beneficial to preadapt the covariance of the push forward. In the absence of preexisting samples to compute an empirical whitening transform, it is the role of the proposed informed base measure.

Interpreting the map.

Examining the learned map T reveals its simple underlying structure. As shown in Fig. 1, the map is spatially local, transporting spins near the center of the domain to , while spins near the boundary are mapped closer to 0. It is again useful to examine the properties of a mapped configuration in Fourier space. The k = 0 mode reveals that the mean value is transported substantially; is approximately as shown in . However, higher-frequency modes are left invariant by the map ().

Calculating free energy differences.

Perhaps most remarkably, the learned map T can be used to evaluate free energy differences between the metastable basins and , even in thermodynamic conditions distinct from those in which the map was trained. Fig. 1 shows an estimate of the free energy difference between the positive and negative metastable basins as a function of an external field h, which enters the Hamiltonian as These estimates are produced with importance sampling using as described in . Analytical estimates at low temperature via a Laplace approximation reveal that the NF accurately recapitulates the free energy difference despite the fact that the map was optimized only with samples where Similar generalization properties were observed in refs. 21, 38, and 39, where a map was used at temperatures distinct from the temperature at which training data were collected. This approach is valid in cases where the modified parameter, here the field h, distorts the relative populations of the metastable basins but has a mild effect on the local structure of the field, which can be controlled by monitoring the variance of the estimator. Additional tests for related applications are presented in . For this stochastic Allen–Cahn system, we show that the method can be useful to sample configurations with domain walls by tilting the Hamiltonian (). In , we discuss a similar sampling problem that involves the nonequilibrium transition path, which we employ to illustrate the use of Brownian bridge base measures. This example is challenging as metastable basins have very different statistical weights, which is also the case for the particle systems discussed in the next section, where we demonstrate the usage of mixtures to tackle this circumstance.

Detecting Phase Transitions in Interacting Particle Systems.

Thermal systems undergoing a first-order phase transition are archetypal examples of models displaying metastability. Near the transition point, ergodic mixing from the unstable to the stable phase is broken in the thermodynamic limit, leading to the well-known challenge of detecting these transitions with molecular dynamic simulations. In this section, we show our method to be useful in this context.

Particle system and phase diagram.

As an example, we consider a system of N interacting particles evolving in a two-dimensional periodic box of lateral size L according to the Langevin equation (here written in the overdamped limit): The interaction W(x) is a pairwise attracting potential with range a > 0:which when , is well approximated by . These equations sample the Boltzmann–Gibbs distribution of the system where we denote by the state of the N particle system. When a is much smaller than L, in the thermodynamic limit (), this system displays a first-order phase transition between a gas-like phase, where the particles are uniformly distributed in the domain that is preferred at high temperatures, and a liquid-like phase, where they cluster in a droplet that is preferred at low temperatures. Typical particle configurations in these phases are shown in Fig. 3. The phase diagram of the model can be estimated using a mean-field approximation ( has details) and is shown in Fig. 4.

Fig. 3.

Fig. 4.

Detecting phase transitions in interacting particle systems. Blue curves and labels show the free energies of the gas and liquid phases, showing that a first-order phase transition occurs at the critical . For temperatures around this value, particles configurations in either the homogeneous or the clustered phase are highly metastable, and no transition between these states is observed in brute-force simulation of Eq. . Red curve and labels show the value of p in the mixture [23] learned by our adaptive procedure augmented with an NF starting from ; the algorithm correctly learns the right value of p in the mixture and thereby, is able to detect the phase transition.

Detecting phase transitions in interacting particle systems. (Left and Center) Two hundred particles seen in the gas and liquid phases, respectively, in dimension d = 2 at a temperature below the critical , at which both phases are metastable but the clustered one is thermodynamically preferred. (Right) A contour plot of the local density of the particles in the liquid phase plotted in log scale. Detecting phase transitions in interacting particle systems. Blue curves and labels show the free energies of the gas and liquid phases, showing that a first-order phase transition occurs at the critical . For temperatures around this value, particles configurations in either the homogeneous or the clustered phase are highly metastable, and no transition between these states is observed in brute-force simulation of Eq. . Red curve and labels show the value of p in the mixture [23] learned by our adaptive procedure augmented with an NF starting from ; the algorithm correctly learns the right value of p in the mixture and thereby, is able to detect the phase transition. Detecting this phase transition via brute-force simulation of Eq. is, however, challenging because the particles stay trapped in whichever configuration they occupy (homogeneous or droplet) for very long periods of time; in fact, the transition times from one phase to the other in a parameter regime where they are both metastable can be estimated as and , where and denote, respectively, free energy barriers between the liquid and the gas phase and vice versa. Since these barriers are both independent of N, these transition times diverge exponentially with the number of particles N.

Adaptive simulations augmented by nonlocal resampling.

Simulation of Eq. augmented by a nonlocal resampling map can detect the phase transition. As the two modes of interest have here very different structures and very different statistical weights across the phase transition, we resort to a parameterization of the nonlocal proposal density in terms of a mixture. For the homogeneous phase mixture component, it is straightforward to draw particles configurations; we can simply pick each of their individual positions uniformly in the box. For the droplet phase mixture component, it is natural to use as base distribution the uniform distribution that corresponds to the homogeneous phase and train a map T that then takes one such configuration and maps it onto a droplet configuration whose local density is close to that of the particles in the liquid phase. Denoting this local density by , we can also exploit the fact that the liquid droplet has no internal structure and factorize the map as with such that t(x) has density if x is drawn uniformly in : that is, , where is the inverse of the map t. All in all, this leads to a resampling mixture density that can be expressed aswhere is a factor to be learned, , and the local density needs to be estimated—in the simulations, we simply used the mean-field approximation recalled in to calculate numerically, but this density could also be estimated directly from the molecular dynamics (MD) simulations.

Implementation and results.

Consistent with , we run Eq. used as the local sampler for a fixed duration t and then attempt a resampling move by proposing a configuration from Eq. . This resampling step requires one to evaluate the Metropolis–Hastings probability in Eq. accurately, which is nontrivial when N is large. Here, we used as approximation withleading to the following explicit approximation for the ratio : This expression shows that the presence in the Metropolis–Hastings probability effectively accounts for the entropy of the particle configurations, as opposed to their energy accounted for by . Eq. also emphasizes the need for (i.e., the NF map in general) to be accurate enough; indeed, to get any significant probability of acceptance, even for a move aimed toward the thermodynamically preferred phase, the factor in the exponentials in Eq. must be of order 1 in N. If the map fails to achieve this accuracy and the factors remain of order N (which is their typical scale for a map that is unadapted), the move would be systematically rejected (or accepted between configurations with little resemblance to permitted ones). This issue will be generic for problems where the system’s energy is extensive. In the context of the present example, once has been estimated, the learning component of reduces to optimizing the parameter p. This is done by approximating the KL divergence using an estimator based on using the current state X(t) of the system. Specifically, we used as an objective on which we performed gradient descent in p concurrently with running the augmented MD strategy above. We applied the procedure above to a system of N = 200 particles drawn initially from the mixture density [23] using as the initial value. As can be seen in Fig. 4, this allowed us to train p to values converging either to p = 0 or p = 1 in a way that detects the phase transition. That is, the augmented procedure correctly reweights the homogeneous and droplet configurations and determines which of the two is the most likely even in situations where brute-force MD simulations would observe no transitions between these configurations. We stress that a simplifying feature of this example is that the particles experience no short-range repulsion (i.e., there is no order in the droplet phase). This is what allowed us to use the product of in the mixture density in Eq. . In systems with hard-core repulsions, this approximation is invalid, in which case more complicated mixtures (or equivalently, more complicated maps in the NF) will have to be used. We leave investigation of such situations for future work.

Conclusions

Connections and Differences with Previous Works.

Most of the methods that seek to train an NF to (approximately) sample from the Boltzmann distribution of a known target energy rely on the reverse KL training using Eq. (15, 18, 38, 39). However, this objective is known to be prone to “mode collapse,” where the estimation concentrates on the bulk of one mode. This failure comes typically from the fact that the map may never explore modes far away from the core of the base distribution—such that these modes are missed altogether (40). Additionally, the reverse KL objective is known to lead to underestimation of the tails (41). To alleviate the shortcomings of reverse KL training, ref. 21 relied on initial short trajectories to estimate and factor in the optimization objective of the forward KL divergence. Closer to our proposition, the authors of ref. 42 propose Markov score climbing, an adaptive MCMC strategy using the same estimate of the forward KL as Eq. to discover good variational approximators. Our method can be seen as an extension of Markov score climbing, introducing the additional alternation with a local sampler and the replacement of simple variational families by the more flexible NFs. Along the same lines, ref. 43 investigated proposals parametrized by lower triangular maps. Interestingly, ref. 21 also proposed an iteratively retrained variant of its Boltzmann generator that shares similarity with our proposition. Note that an advantage of the adaptive MCMC over the consecutive training, then sampling schemes with either reverse KL (15, 18, 38) or forward KL (11, 21) training objectives is to allow for real-time monitoring of the quality of training toward the final purpose of obtaining well-mixed samples. In the experiments, that was done by monitoring, for instance, acceptance rates and autocorrelation of the chains across iterations. The adaptive MCMC proposed here retains a local component in the sampling in the form of interleaved steps of a local sampler that brings robustness to the scheme. Albeit different, the adaptive MCMC proposed in ref. 21 also defines an intermediary between a purely global and a purely local procedure. The proposals consist of local steps in the latent space of the NF, while our query of the generative model yields completely independent resampling. To encourage transition across modes, the algorithm of ref. 21 was augmented with parallel tempering in ref. 39, a step that does not appear to be necessary in our scheme.

Outlook and Future Work.

As the use of data-driven methods from machine learning becomes increasingly routine in the physical sciences, we must carefully assess the cost of data acquisition and training to ensure that we can leverage machine learning methods in a productive fashion. Sampling systems with complex local structure and multiple metastable basins is a generically challenging task in high-dimensional systems, and we have already seen that neural networks can contend with this challenge in nontrivial settings (5, 6, 15, 18, 21). Nonlocal transport in MCMC algorithms can significantly enhance mixing, and NFs provide a compelling framework for designing adaptive schemes, even in cases where no statistically representative dataset is available at first. Nevertheless, we do not find that these methods enable discovery of unknown modes of a target distribution, emphasizing the importance of having some a priori information about the metastable states of the system. Many questions remain about how to ensure efficient learning in complex high-dimensional systems and encourage desirable properties of the map, such as locality and transferability. Incorporating known invariances and symmetries of target distributions into architectures is currently a key area of research (e.g., refs. 44 and 45) that will help scaling further applications of sampling methods enhanced by learning.

8 in total