Jesse van Rhijn1, Claudia Filippi1, Stefania De Palo2, Saverio Moroni2. 1. MESA+ Institute for Nanotechnology, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands. 2. CNR-IOM DEMOCRITOS, Istituto Officina dei Materiali, and SISSA Scuola Internazionale Superiore di Studi Avanzati, Via Bonomea 265, I-34136 Trieste, Italy.
Abstract
We present unbiased, finite-variance estimators of energy derivatives for real-space diffusion Monte Carlo calculations within the fixed-node approximation. The derivative dλE is fully consistent with the dependence E(λ) of the energy computed with the same time step. We address the issue of the divergent variance of derivatives related to variations of the nodes of the wave function both by using a regularization for wave function parameter gradients recently proposed in variational Monte Carlo and by introducing a regularization based on a coordinate transformation. The essence of the divergent variance problem is distilled into a particle-in-a-box toy model, where we demonstrate the algorithm.
We present unbiased, finite-variance estimators of energy derivatives for real-space diffusion Monte Carlo calculations within the fixed-node approximation. The derivative dλE is fully consistent with the dependence E(λ) of the energy computed with the same time step. We address the issue of the divergent variance of derivatives related to variations of the nodes of the wave function both by using a regularization for wave function parameter gradients recently proposed in variational Monte Carlo and by introducing a regularization based on a coordinate transformation. The essence of the divergent variance problem is distilled into a particle-in-a-box toy model, where we demonstrate the algorithm.
Variational Monte Carlo (VMC) and diffusion Monte Carlo (DMC) are
numerical stochastic approaches based on a real-space representation
of a correlated trial wave function to study many-body quantum systems,
including electronic structure problems. VMC calculates expectation
values of quantum operators on the trial function, which in turn is
optimized via minimization of a suitable cost function such as the
variational energy. DMC further improves the VMC results through a
stochastic implementation of the power method, which projects the
lowest-energy component of the trial function. Its accuracy, in the
fixed-node (FN) approximation almost invariably adopted to avoid the
sign problem, is ultimately limited by the error in the nodal surface
of the trial function.[1]In the past
decade, the efficient calculation of analytic energy
derivatives,[2,3] leveraging modern optimization
methods,[4−6] spawned impressive progress in both accuracy and
scope of the VMC method.[7−10]DMC largely benefits from advances in VMC because
improved trial
functions tend to have better nodes. However, it would be desirable
to have efficient and unbiased estimators of derivatives in DMC as
well to perform such tasks as the direct optimization of the nodal
surface or DMC structural relaxation. This is still an open issue,
with the latest developments featuring uncontrolled approximations
and/or a very low efficiency.[11−13] Here, we present an algorithm
to calculate unbiased energy derivatives in FN-DMC with finite variance
and demonstrate it on a model where the effect of the nodes is dominant.
In the Supporting Information, we also
include an application to the lithium dimer.
Energy
Derivatives
In both VMC and DMC, the energy is calculated
aswhere R represents the coordinates
of all the particles, EL(R) = HΨ(R)/Ψ(R) is the local energy of the trial function Ψ(R), and P(R) is proportional
to the underlying probability distribution: in VMC, P(R) = Ψ2(R), and
in DMC, P(R) = Ψ(R)Φ(R) with Φ(R) being
the FN solution. The derivative with respect to a parameter λ
isThe variance of this naïve
estimator is 0 if both
Ψ and its derivative dλΨ are exact; however,
for an approximate trial function, EL(R) diverges at the nodes as 1/d(R), where
d = |Ψ|/∥∇Ψ∥, and the variance diverges
as well.[14,15]
Regularized Estimators
In VMC, this
problem was fully solved by Attaccalite and Sorella[14] with a reweighting scheme, hereafter dubbed AS, whereby
one samples the square of a modified trial function Ψ̃
which differs from Ψ only for d smaller than a cutoff parameter
ϵ and stays finite on the nodal surface of Ψ. A similar
sampling scheme was proposed by Trail.[16] The AS estimator has the same average of the bare estimator of eq for any value of ϵ
and finite variance.An alternative regularized estimator, recently
proposed by Pathak and Wagner,[15] simply
consists in multiplying the term in brackets of eq by the polynomial fϵ(x) = 7x6 – 15x4 + 9x2, with x = d/ϵ, whenever x < 1. This estimator, hereafter dubbed PW, has finite variance
for any finite ϵ and a bias which vanishes as ϵ →
0. The polynomial is chosen in such a way to remove from the bias
the linear term, shown[15] to be ∝∫01(fϵ – 1)dx, and to be continuously
differentiable in (0, 1). The odd parity of fermionic wave functions
near a node then implies a cubic leading term in the bias. Since various
values of ϵ can be used in the same simulation, the bias can
be eliminated at no cost by extrapolation. The PW estimator has been
proposed for parameter gradients in the VMC optimization of Ψ,
but it is equally applicable to VMC interatomic forces and, as will
be shown, to generic derivatives in DMC.We introduce a third
regularized estimator, which we denote as
“warp” by analogy with the space–warp transformation
of ref (17) devised
to reduce the statistical noise of the forces. There, as a nucleus
is displaced, a transformation is applied to the coordinates of nearby
electrons, in such a way to maintain the electron–nucleus distances
approximately constant. Here, the goal is to maintain constant the
value of d(R) when the nodal surface is displaced
by a variation of the parameter λ. In this way, the diverging
term in the local energy does not change, and the variance of the
derivative is finite.The warp transformation, illustrated in Figure , is defined aswhere primed quantities are calculated for
the value λ′ of the parameter, n′
is the unit vector in the direction of ∇Ψ′(R), and u(d) is a cutoff function with
support [0, ϵ] which decreases smoothly from 1 to 0, restricting
the warp transformation to a region close to the nodal surface. We
use the quintic polynomial with zero first and second derivatives
at the boundaries of the support.
Figure 1
Schematic picture of the warp transformation.
The curve S is the nodal surface for the value λ0 of the parameter λ. The value of d for the current
configuration R is pictorially represented as the
distance from S (this is strictly true only if Ψ
is linear, i.e.,
close enough to the node). When a variation of λ from λ0 to λ′ shifts S to S′, the value of d changes to d′. Hence, we displace R by an amount Δ = d – d′ to R̅ in the direction of sign(Ψ′)∇Ψ′(R) so that the value of |Ψ′(R̅)|/∥∇Ψ′(R̅)∥
is approximately equal to d.
Schematic picture of the warp transformation.
The curve S is the nodal surface for the value λ0 of the parameter λ. The value of d for the current
configuration R is pictorially represented as the
distance from S (this is strictly true only if Ψ
is linear, i.e.,
close enough to the node). When a variation of λ from λ0 to λ′ shifts S to S′, the value of d changes to d′. Hence, we displace R by an amount Δ = d – d′ to R̅ in the direction of sign(Ψ′)∇Ψ′(R) so that the value of |Ψ′(R̅)|/∥∇Ψ′(R̅)∥
is approximately equal to d.For a finite increment of λ, the energy iswhere J = detJ =
det∂R̅/∂R is the Jacobian of the transformation.
For the practical
implementation, we formulate a differential expression of the finite
energy difference E′ – E so that no actual transformation is needed: the analytic derivative
dλE, calculated at the value of
the parameter λ = λ0, isThis warp regularized estimator has finite
variance and no bias
for any value of ϵ (see the Supporting Information).Note that all the functions in eq are evaluated at λ = λ0, where R̅ = R and J = 1.
Therefore, the warp transformation only contributes to the estimator
through the derivatives ∂λR̅|λ and
∂λln J|λ, while the sampling is done over one and the same distribution P(R) for any parameter we may vary.Furthermore, for λ = λ0, the cofactors of
the Jacobi matrix J are M = δ, and the seemingly awkward derivative of
the Jacobian greatly simplifies, ∂λln J|λ = ∑M∂λJ|λ = ∑∂λJ|λ, so that the implementation
of eq is not overly
complicated. In particular, most of the derivatives needed are already
present—or very similar to those already present—in
VMC codes with analytic derivatives for structural and full variational
optimization. The only exceptions are the off-diagonal components
of the Hessian ∂2Ψ/∂R∂R and their derivatives with respect to λ, which
contribute to ∂λJ. These
extra derivatives are only needed when the nodal distance is smaller
than the cutoff ϵ, that is, for a very small fraction of the
sampled configurations. Furthermore, we will show (heuristically)
that the bias incurred by neglecting these terms can be extrapolated
out at no cost.
Variational Monte Carlo
Before addressing
the derivatives in DMC, we compare the three regularized estimators
PW, AS, and warp in VMC. To this purpose, it is expedient to consider
a system stripped of all complexities of external and interparticle
potentials so that we can focus exclusively on the divergence of the
local energy at the nodal surface. Our toy model is a free particle
in an elliptic box with hard walls, meant to represent the configuration
of a generic system within a nodal pocket. Atomic units are used throughout.
We choose Ψ = Ψ0(x,y) = a2 – x2/C – y2/(C – 1) with C = [cosh(1)]2 = 2.3810978, which is positive inside the ellipse, vanishes
at the border, and is not the true ground state. Therefore, the ellipse
is defined through the wave function, and we take the derivative with
respect to the parameter a, which changes the size
of the ellipse at constant eccentricity.The average and variance
of the various estimators, calculated by quadrature, are shown in Figure . None of the regularized
estimators entails uncontrolled approximations. In particular, although
there is no rigorous way to establish the range where the leading
correction in ϵ is sufficient, or the number of powers in ϵ
needed over an arbitrary range, PW can be accurately extrapolated
to the unbiased result. Here, the bias of the PW estimator has a leading
contribution of ϵ2 because Ψ0 does
not have odd parity across the node. The second-order bias can be
removed with a different choice of the polynomial, for example, f̃ϵ(x) = 60x2 – 200x3 + 225x4 – 84x5, at the expense of a larger variance for the given value
of the cutoff ϵ. The right panel shows that the relative efficiency
may depend on the system at hand; in this example, it varies over
the range of a considered.
Figure 2
VMC calculations of dE, with integrals performed
by quadrature. Left panel: derivative
at a = 1 obtained with the warp, PW, and AS estimators
as a function of the cutoff ϵ, compared to the “exact”
result dEfit, defined as the derivative of a fit to E(a) calculated separately for several values of a. AS and warp, shown here only near ϵ = 0.2, are unbiased;
PW can be extrapolated to the unbiased result either including only
the leading term in ϵ, here ϵ2, on a sufficiently
small range (dashed line) or using a sufficiently large number of
terms on an extended range (solid line). Middle panel: variance (in
a logarithmic scale) of the various regularized estimators as a function
of the cutoff. As ϵ vanishes, all schemes regress to the infinite
variance of the bare estimator. Right panel: variance (in a logarithmic
scale) of the various regularized estimators as a function of the
parameter a. For the warp and AS estimators, ϵ
= 0.2; for PW, data are reported for ϵ = 0.0125, 0.025, 0.05,
0.15, and 0.2 in the order of decreasing variance. Common data in
the middle and the right panels are circled.
VMC calculations of dE, with integrals performed
by quadrature. Left panel: derivative
at a = 1 obtained with the warp, PW, and AS estimators
as a function of the cutoff ϵ, compared to the “exact”
result dEfit, defined as the derivative of a fit to E(a) calculated separately for several values of a. AS and warp, shown here only near ϵ = 0.2, are unbiased;
PW can be extrapolated to the unbiased result either including only
the leading term in ϵ, here ϵ2, on a sufficiently
small range (dashed line) or using a sufficiently large number of
terms on an extended range (solid line). Middle panel: variance (in
a logarithmic scale) of the various regularized estimators as a function
of the cutoff. As ϵ vanishes, all schemes regress to the infinite
variance of the bare estimator. Right panel: variance (in a logarithmic
scale) of the various regularized estimators as a function of the
parameter a. For the warp and AS estimators, ϵ
= 0.2; for PW, data are reported for ϵ = 0.0125, 0.025, 0.05,
0.15, and 0.2 in the order of decreasing variance. Common data in
the middle and the right panels are circled.The bias of the warp estimator when the off-diagonal elements of
the Hessian are neglected is compared to the bias of the PW estimator
in Figure . Since
we need a non-diagonal Hessian to start with, we consider (i) a rotated,
more eccentric ellipse with a further non-symmetrical distortion of
the wave function, and (ii–iv) the positive lobe of a wave
function limited to half of the original ellipse by the nodal line y + sin(αx) = 0, with α = 0.5,
1, and 1.5 (see the contour plots in Figure ). The derivative is taken with respect to a for Ψ1 and with respect to α for
Ψ2–Ψ4. Within this (very
limited) set of test cases, the bias is smaller and less system-dependent
for the warp than for the PW estimator. The important result is that
neither involves uncontrolled approximations as both can be extrapolated
to the unbiased result in a single run. We have also verified that
the warp estimator with the full Hessian is unbiased for finite ϵ.
Figure 3
Bias of
the VMC derivative for PW and warp with approximate Hessian,
calculated by quadrature for various wave functions Ψ1–Ψ4. In each case, we show a contour plot
of the normalized Ψ with level
lines from 0 in steps of 0.2 (the contour plot of Ψ0 in the top left inset defines the (x, y) scale). The colored labels near each curve indicate the powers
in ϵ needed to extrapolate to the unbiased value with a five-digit
accuracy. The bias of PW for Ψ2–Ψ4 has a leading term ϵ3 because of the use
of the modified polynomial f̃ϵ. Empirically, we see that the warp estimator has a cubic leading
term in all cases.
Bias of
the VMC derivative for PW and warp with approximate Hessian,
calculated by quadrature for various wave functions Ψ1–Ψ4. In each case, we show a contour plot
of the normalized Ψ with level
lines from 0 in steps of 0.2 (the contour plot of Ψ0 in the top left inset defines the (x, y) scale). The colored labels near each curve indicate the powers
in ϵ needed to extrapolate to the unbiased value with a five-digit
accuracy. The bias of PW for Ψ2–Ψ4 has a leading term ϵ3 because of the use
of the modified polynomial f̃ϵ. Empirically, we see that the warp estimator has a cubic leading
term in all cases.
Diffusion
Monte Carlo
We now consider
the derivative in DMC. The FN-DMC algorithm is a branching random
walk of many weighted walkers, generated by a short-time approximation G(R′,R) to the
importance-sampled Green’s function, which asymptotically samples
the distribution P(R) = Ψ(R)Φ(R).[1] The problem with the derivative estimator, eq , is the presence of the logarithmic derivative
of P(R), which is not a known function
of R. However, P is the marginal
distribution of the joint probability density Pjoint of the whole random walk, which does have an explicit
expression as a product of Green’s functions,where P0 is the
(largely arbitrary) probability distribution of the initial configuration R0. Therefore, it is sufficient to consider the
estimator of dλE in eq as an average over the whole trajectory
of the random walk, rather than over the current configuration, to
bring an explicitly known probability distribution to the fore.[13] This is similar to the calculation of forces
in path integral Monte Carlo.[18]In
practice, the inclusion of the entire trajectory in the estimator
is not necessary. As shown in ref (13), the logarithmic derivative of the DMC density
distribution P in the estimator of eq can be replaced with the summationover the last k steps of
the random walk, with R being the current configuration R of eq . The omitted term,[13] ⟨[EL(R) – E]dλ ln P(R)⟩, vanishes for
sufficiently large k because EL(R) and dλ ln P(R) become statistically
independent variables and ⟨EL – E⟩ = 0.In eq , G(R′,R) is the transition
rule from R to R′ of the random
walk. It includes a Metropolis test to reduce the time-step error;[1] therefore, R′ is the
configuration proposed when the walker is at R, and the next configuration R is R′ or R if the move is
accepted or rejected, respectively. Note that in the formal expression
of Pjoint, eq , the arguments of Green’s functions
are integrated over, whereas in the contribution to the estimator
of the derivative, eq , they are the particular values of the particles’ coordinates
effectively sampled by the random walk. Correspondingly, the actual
value taken by G(R′, R) iswhere T is the a priori transition
probability, p is the probability of accepting the
move, and W is the branching factor (see below).The inclusion of rejected configurations in eq and of the factor p or 1
– p in eq in the derivative of the full Green’s function
is instrumental to obtain an estimate of dλE completely consistent with the DMC energy E(λ) calculated at the same time step. Their omission still
can give an unbiased result in the limit τ → 0, but it
may cause an unacceptably large time step error on the derivative.[13]The functions T, p, and W are standard[19]Here,
τ is the time
step, V = ∇lnΨ is the so-called velocity,
and is the damping factor of its divergence
near the nodes; the logarithm of the branching factor is also damped
at the nodes, S(R) = [Eest – EL(R)]F(R) – ln(N/N0), where Eest is the best current estimate of the energy and N and N0 are the current and the target
number of walkers.The presence of Eest, the Monte Carlo
estimate of E, in the branching term S implies that the calculation of dλE includes a contribution proportional to dλE itself,This does not require
prior knowledge of the result: we can calculate
the factor F̅ and (dλE)0, the derivative when the contribution of eq is omitted, and combine
them to get the unbiased result as dλE = (dλE)0 + F̅dλE, or dλE = (dλE)0/(1 – F̅).The main
technical complication in DMC is the need to store a few
quantities for each derivative and for each value of ϵ over
the last k steps of each walker, namely, dλ ln G and ELdλ ln G, to implement eq with the probability distribution P of eq and F and ELF to
implement eq .The AS regularized estimator has been applied to approximate DMC
forces in refs (13) and (20). However,
it pushes a finite density of walkers on the nodes, which is presumably
not optimal in DMC. Furthermore, unlike in VMC, it requires[13] an extrapolation to ϵ → 0 which—at
difference with the PW and warp estimators—cannot be done in
a single run.Therefore, for the DMC derivatives, we consider
only the PW and
the warp estimators. For the former, we insert eq into eq and multiply each term of the resulting summation by a polynomial fϵ calculated at the appropriate configuration.
For the latter, we just insert eqs in 5. For the warp estimator,
the argument of the Jacobian needs some care: we evaluate ∂λ ln J in the proposed configuration R′ for both accepted and rejected moves; alternatively,
we can include ∂λ ln J only
for the accepted moves, provided the warp transformation is not considered
in the derivatives at R′ when the move is
rejected.We present results of DMC simulations with the wave
function Ψ0, time step τ = 0.1, and target
number of walkers N0 = 100. We have verified
that in the limit
τ → 0, we recover the analytic results[21] for the ground-state energy, E = 2q/a2 with q = 0.825352549, within a statistical error of less than 1 part in
10,000. To this level of accuracy, the population control bias[1] is negligible.Figure exposes
the drawback of the bare estimator: the probability distribution p(ξ) for the block averages of the derivative features
a right heavy tail, consistent with the expected[11,22] leading decay ∝|ξ – ξ0|–5/2. In the data trace, shown in the inset, heavy tails
result in large spikes that would mar the smooth convergence of structural
or variational optimization. Meaningful averages and statistical uncertainties
of heavy-tailed distributions with known tail indices can be computed
with a tail regression analysis.[23] This
technique, however, requires a heavy post-processing not very practical
for large-scale applications. The regularized estimators PW and warp,
instead, have nearly Gaussian distributions amenable to standard statistical
analysis with significantly smaller statistical errors and, most importantly,
no large spikes in the data trace.
Figure 4
Histogram of 46,000 block averages, each
of which taken over 10,000
steps of 100 walkers in a DMC calculation of dE|. The warp
and PW estimators (with ϵ = 0.2 and 0.0125, respectively) have
nearly Gaussian distributions. The bare estimator has a heavy-tailed
distribution; the largest value in the present sample exceeds x = 10. Inset: the data trace of the first 1000 block averages.
Histogram of 46,000 block averages, each
of which taken over 10,000
steps of 100 walkers in a DMC calculation of dE|. The warp
and PW estimators (with ϵ = 0.2 and 0.0125, respectively) have
nearly Gaussian distributions. The bare estimator has a heavy-tailed
distribution; the largest value in the present sample exceeds x = 10. Inset: the data trace of the first 1000 block averages.The central result of this work is shown in Figure . We calculate the
energy E and its derivative dE for a set of values of a and compare the DMC derivatives
with the derivative of a fit to the DMC energies. All the estimators
(bare, extrapolated PW, and warp) are unbiased, which demonstrates
the correctness of the proposed algorithm. For comparison, the variational
drift-diffusion (VD) approximation of ref (13) gives for a = 1 a bias of ∼0.2,
twice the full scale of the right panel, and it gets even worse for
smaller time steps (although the VD approximation is devised to exploit
good wave functions, while our Ψ0 is poor on purpose
to test the unbiased estimators).
Figure 5
DMC calculations of dE with τ = 0.1. Left panel: the warp
and PW results compare
favorably with the “exact” result, defined as the derivative
dEfit of
a fit to DMC calculations of E(a). The analytic result dE = −4q/a3 differs
from dEfit because the latter has a finite time step error. Right panel: the
difference of the calculated derivatives with dEfit is shown on an expanded scale
by subtracting the latter (hence, the solid black line is minus the
time step error of dEfit). The PW derivatives are extrapolated to ϵ →
0. We also include the bare-estimator derivatives, with averages and
statistical uncertainties obtained with the tail regression estimator
analysis toolkit made available in ref (23). Small horizontal shifts are applied to the
same—a data for clarity.
DMC calculations of dE with τ = 0.1. Left panel: the warp
and PW results compare
favorably with the “exact” result, defined as the derivative
dEfit of
a fit to DMC calculations of E(a). The analytic result dE = −4q/a3 differs
from dEfit because the latter has a finite time step error. Right panel: the
difference of the calculated derivatives with dEfit is shown on an expanded scale
by subtracting the latter (hence, the solid black line is minus the
time step error of dEfit). The PW derivatives are extrapolated to ϵ →
0. We also include the bare-estimator derivatives, with averages and
statistical uncertainties obtained with the tail regression estimator
analysis toolkit made available in ref (23). Small horizontal shifts are applied to the
same—a data for clarity.For given a, the same run is used for all the
estimators. Therefore, the statistical error is a direct measure of
the square root of their relative efficiency. The statistical errors
of the bare, PW, and warp estimators, averaged over the values of a shown in Figure , are in the ratio of 4.9:2.4:1. These figures may belittle
the PW estimator somewhat because in this particular example, a large
quadratic bias needs to be eliminated by extrapolation, but they convey
the relevant message that both PW and warp are significantly more
efficient than the bare estimator.
Conclusions
In summary, we have presented an algorithm to calculate unbiased,
finite-variance derivatives in DMC. The estimate of the derivative
with respect to a given parameter is fully consistent with the dependence
on that parameter of the FN energy, calculated with the same time
step. The tail regression statistical analysis[23] can cope with the problem of the infinite variance of the
bare estimator. Alternatively, and more efficiently, both the recently
proposed PW regularization[15] and the warp
regularization introduced in this work can be used to good effect
to eliminate the divergence of the variance.
Authors: P R C Kent; Abdulgani Annaberdiyev; Anouar Benali; M Chandler Bennett; Edgar Josué Landinez Borda; Peter Doak; Hongxia Hao; Kenneth D Jordan; Jaron T Krogel; Ilkka Kylänpää; Joonho Lee; Ye Luo; Fionn D Malone; Cody A Melton; Lubos Mitas; Miguel A Morales; Eric Neuscamman; Fernando A Reboredo; Brenda Rubenstein; Kayahan Saritas; Shiv Upadhyay; Guangming Wang; Shuai Zhang; Luning Zhao Journal: J Chem Phys Date: 2020-05-07 Impact factor: 3.488