Literature DB >> 26346321

Probabilistic numerics and uncertainty in computations.

Philipp Hennig¹, Michael A Osborne², Mark Girolami³.

Abstract

We deliver a call to arms for probabilistic numerical methods: algorithms for numerical tasks, including linear algebra, integration, optimization and solving differential equations, that return uncertainties in their calculations. Such uncertainties, arising from the loss of precision induced by numerical calculation with limited time or hardware, are important for much contemporary science and industry. Within applications such as climate science and astrophysics, the need to make decisions on the basis of computations with large and complex data have led to a renewed focus on the management of numerical uncertainty. We describe how several seminal classic numerical methods can be interpreted naturally as probabilistic inference. We then show that the probabilistic view suggests new algorithms that can flexibly be adapted to suit application specifics, while delivering improved empirical performance. We provide concrete illustrations of the benefits of probabilistic numeric algorithms on real scientific problems from astrometry and astronomical imaging, while highlighting open problems with these new algorithms. Finally, we describe how probabilistic numerical methods provide a coherent framework for identifying the uncertainty in calculations performed with a combination of numerical algorithms (e.g. both numerical optimizers and differential equation solvers), potentially allowing the diagnosis (and control) of error sources in computations.

Entities: Chemical Disease Gene

Keywords: inference; numerical methods; probability; statistics

Year: 2015 PMID： 26346321 PMCID： PMC4528661 DOI： 10.1098/rspa.2015.0142

Source DB: PubMed Journal: Proc Math Phys Eng Sci ISSN： 1364-5021 Impact factor: 2.704

Introduction

Probability theory is the quantitative framework for scientific inference [1]. It codifies how observations (data) combine with modelling assumptions (prior distributions and likelihood functions) to give evidence for or against a hypothesis and values of unknown quantities. There is continuing debate about how prior assumptions can be chosen and validated (e.g. [2, §1.2.3]). However, the role of probability as the language of uncertainty is rarely questioned; that is, as long as the subject of inference is a physical variable. What if the quantity in question is a mathematical statement, the solution to a computational task? Does it make sense to assign a probability measure p(x) over the solution of a linear system of equations Ax=b if A and b are known? If so, what is the meaning of p(x), and can it be identified with a notion of ‘uncertainty’? If one sees the use of probability in statistics as a way to remove ‘noise’ from ‘signal’, it seems misguided to apply it to a deterministic mathematical problem. But noise and stochasticity are themselves difficult to define precisely. Probability theory does not rest on the notion of randomness (aleatory uncertainty), but extends to quantifying epistemic uncertainty, arising solely from missing information.[1] Connections between deterministic computations and probabilities have a long history. Erdös & Kac [3] showed that the number of distinct prime factors in an integer follows a normal distribution. Their statement is precise, and useful for the analysis of factorization algorithms [4], even though it is difficult to ‘sample’ from the integers. It is meaningful without appealing to the concept of epistemic uncertainty. Probabilistic and deterministic methods for inference on physical quantities have shared dualities from very early on: Legendre introduced the method of least-squares in 1805 as a deterministic best fit for data without a probabilistic interpretation. Gauss' 1809 probabilistic formulation of the exact same method added a generative stochastic model for how the data might be assumed to have arisen. Legendre's least-squares is a useful method without the generative interpretation, but the Gaussian formulation adds the important notion of uncertainty (also interpretable as model capacity) that would later become crucial in areas such as the study of dynamical systems.[2] Several authors [12-14] have shown that the language of probabilistic inference can be applied to numerical problems, using a notion of uncertainty about the result of an intractable or incomplete computation, and giving rise to methods we will here call probabilistic numerics.[3] In such methods, uncertainty regularly arises solely from the lack of information inherent in the solution of an ‘intractable’ problem: a quadrature method, for example, has access only to finitely many function values of the integrand; an exact answer would, in principle, require infinitely many such numbers. Algorithms for problems such as integration and optimization proceed iteratively, each iteration providing information improving a running estimate for the correct answer. Probabilistic numerics provides methods that, in place of such estimates, update probability measures over the space of possible solutions. As Diaconis noted [12], it appears that Poincaré proposed such an approach already in the nineteenth century. The recent explosive growth of automated inference, and the increasing importance of numerics for science, has given this idea new urgency. This article connects recent results, promising applications and central questions for probabilistic numerics. We collate results showing that a number of basic, popular numerical methods can be identified with families of probabilistic inference procedures. The probability measures arising from this new interpretation of established methods can offer improved performance, enticing new functionality and conceptual clarity; we demonstrate this with examples drawn from astrometry and computational photography. The article closes by pointing out the propagation of uncertainty through computational pipelines as a guiding goal for probabilistic numerics. It may be helpful to separate the issues discussed here clearly from other areas of overlap between statistics and numerical mathematics: we here focus on well-posed deterministic problems, identifying degrees of uncertainty arising from the computation itself. This is in contrast to the notion of uncertainty quantification (e.g. [15,16]), which identifies degrees of freedom in ill-posed problems, and where epistemic uncertainty arises from the set-up of the computation, rather than the computation itself. It also differs from several concepts of stochastic numerical methods, which use (aleatory) random numbers either to quantify uncertainty from repeated computations (e.g. [17]) or to reduce computational cost through randomly chosen projections (see, for example, [18,19]). It is also important to note that, as a matter of course, existing frameworks already analyse and estimate the error created by a numerical algorithm. Theoretical analysis of computational errors generally yields convergence rates—bounds up to an unknown constant—made under certain structural assumptions. Where probabilistic numerical methods are derived from classic ones, as described below, they naturally inherit such analytical properties. At runtime, the error is also estimated for the specific problem instance. Such runtime error estimation is frequently performed by monitoring the dynamics of the algorithm's main estimate (see [20, §II.4], for ODEs,[21, §4.5], for quadrature). Similarly, in optimization problems, the magnitude of the gradient is often used to monitor the algorithm's progress. Such error estimates are informal, as are the solution estimates computed by the numerical method itself. They are meant to be used locally, mostly as criteria for the termination of a method and the adaptation of its internal parameters. They cannot typically be interpreted as a property (e.g. the variance) of a posterior probability measure, and thus cannot be communicated to other algorithms, and thus cannot be embedded in a larger framework of error estimation. They also do not usually inform the design of the numerical algorithm itself; instead, they are a diagnostic tool added post hoc. Below, we argue that the estimation of errors should be given a formal framework, and that probability theory is uniquely suited for this task. Describing numerical computations as inference on a latent quantity yields a joint, consistent, framework for the construction of solution and error estimates. The inference perspective can provide a natural intuition that may suggest extensions and improvements. And the probabilistic framework provides a lingua franca for numerical computations, which allows the communication of uncertainty between methods in a chain of computation.

Probabilistic numerical methods from classical ones

Numerical algorithms estimate quantities not directly computable, using the results of more readily available computations. Even existing numerical methods can thus be seen as inference rules, reasoning about latent quantities from ‘observables’ or ‘data’. At face value, this connection between inference and computation is vague. But several recent results have shown that it can be made rigorous, such that established deterministic rules for various numerical problems can be understood as maximum a posteriori estimates under specific priors (hypothesis classes) and likelihoods (observation models). The recurring picture is that there is a one-to-many relationship between a classic numerical method for a specific task and a family of probabilistic priors which give the same maximum posterior estimate but differing measures of uncertainty. Choosing one member of this family amounts to fitting an uncertainty, a task we call uncertainty calibration. The result of this process is a numerical method that returns a point estimate surrounded by a probability measure of uncertainty, such that the point estimate inherits the proved theoretical properties of the classic method, and the uncertainty offers new functionality.

Quadrature

We term the probabilistic numeric approach to quadrature Bayesian quadrature. Diaconis [12] may have been first to point out a clear connection between a Gaussian process regression model and a deterministic quadrature rule, an observation subsequently generalized by Wahba [22, §8] and O'Hagan [23], and also noted by [24]. Details can be found in these works; here we construct an intuitive example highlighting the practical challenges of assigning uncertainty to the result of a computation. For concreteness, consider (black in figure 1a). Evidently, f has a compact symbolic form and f(x) can be computed for virtually any x∈R in nanoseconds. It is a wholly deterministic object. Nevertheless, the real number has no simple analytic value, in the sense that it cannot be natively evaluated in low-level code. Quadrature rules offer ‘black box’ estimates of F. These rules have been optimized so heavily (e.g. [21]) that they could almost be called ‘low level’, but their results do not come with the strict error bounds of floating-point operations; instead, assumptions about f are necessary to bound error. Perhaps the simplest quadrature rule is the trapezoid rule, which amounts to linear interpolation of f (red line in figure 1a(i)): evaluate f(x) on a grid of N points, and compute

Figure 1.

Quadrature rules, illustrating the challenge of uncertainty calibration. (a(i)(ii)) Function, f(x) (black line), is approximately integrated using two different Gaussian process priors (a(i) linear spline; a(ii) exponentiated quadratic), giving posterior distributions and mean estimates. Grey lines are functions sampled from the prior. The thick coloured line is the posterior mean, thin lines are posterior samples and the delineation of two marginal standard deviations. The shading represents posterior probability density. (b(i)(ii)) As the number of evaluation points increases, the posterior mean (thick line with points) converges to the true integral value; note the more rapid convergence of the exponentiated-quadratic prior. The posterior covariance provides an error estimate whose scale is defined by the posterior mean alone (each thin coloured line in the plots corresponds to a different instance of such an estimate). But it is only a meaningful error estimate if it is matched well to the function's actual properties. (b(i)) shows systematic difference between the convergence of the real error and the convergence of the estimated error under the linear spline, whereas convergence of the estimated error under the exponentiated-quadratic prior is better calibrated to the real error. Grey grid lines in the background, bottom left, correspond to convergence of the error in the number N of function evaluations. (a(iii)(iv),b(iii)(iv)) The same experiment repeated with a function f drawn from the spline kernel prior. For this function, the trapezoid rule is the optimal statistical estimator of the integral (note well-calibrated error measure in b(iii)), while the Gaussian kernel GP is strongly over-confident.

Bayes–Hermite quadrature

A probabilistic description of F requires a joint probability distribution over f and F. Since f lies in an infinite-dimensional Banach or Hilbert space, no Lebesgue measure can be defined. But Gaussian process measures can be well-defined over such spaces, and offer a powerful framework for quadrature [22,25]. In particular, we may choose to model f on [−3,3] by , a Gaussian process with vanishing mean μ=0 and the linear spline covariance function [26] (a stationary variant of the Wiener process) More precisely, this assigns a probability measure over the function values f(x) on [−3,3], such that any restriction to finitely many evaluations =[ f(x1),…,f(x)] at X=[x1,…,x] is jointly Gaussian distributed with zero mean and covariance cov[ f(x),f(x)]=k(x,x). Samples drawn from this process are shown in grey in figure 1a(i). These sample paths represent the hypothesis class associated with this Gaussian process prior: they are continuous, but not differentiable, in mean square expectation [27, §2.2]. Because Gaussian processes are closed under linear projections (e.g. [27, p. 27]), this distribution over f is identified with a corresponding univariate Gaussian distribution [26], on F. The strength of this formulation is that it provides a clear framework under which observations f(x) can be incorporated. The measure on p(F) conditioned on the collected function values is another scalar Gaussian distribution where and K is the symmetric positive definite matrix with elements K=k(x,x). The final expression in the brackets, the posterior covariance, can be optimized with respect to X to design a ‘most informative’ dataset. For the spline kernel (2.3), this leads to placing X on a regular, equidistant, grid [26]. The spline kernel k of (2.3) is a piecewise linear function with a single point of non-differentiability at x=x′. The posterior mean is a weighted sum of N such kernels centred on the evaluation points X and constrained by the likelihood to pass through the values at the nodes X. Thus, the posterior mean is a linear spline interpolant of the evaluations, and the posterior mean of equation (2.5) is exactly equal to the trapezoid-rule estimate from X (see also [12]). That is, Bayesian quadrature with a linear spline prior provides a probabilistic interpretation of the trapezoidal rule; it supplements the estimate with a full probability distribution characterizing uncertainty. The corresponding conditional on f is a Gaussian process whose mean is the piecewise linear red function in figure 1a(i) (the figure also represents the conditional distribution p( f|f(x1),…,f(x)) as a red cloud, and some samples). Various other, more elaborate, quadrature rules (e.g. higher order spline interpolation and Chebyshev polynomials) can be cast probabilistically in analogous ways, simply by changing the covariance function k [12,22,23].

Added value, and challenge, of probabilistic output

A non-probabilistic analysis of the trapezoid rule is that, for sufficiently regular functions f, this rule converges at a rate of at least to the correct value for F as N increases (e.g. [21, eq. 2.1.6]) (other quadrature rules, like cubic splines, have better convergence for differentiable functions). Figure 1b(i) shows the mean estimate (equivalently, the trapezoid rule) as a thick line. Grey help lines represent convergence. Clearly, the asymptotic behaviour is steeper than those lines.[4] So the theoretical bound is correct, but it is also of little practical value: no linear bound is tight from start to convergence. On the probabilistic side, the standard deviation of the conditional (2.5) is regularly interpreted as an estimate of the imprecision of the mean estimate. In fact, this error estimate (thin red lines in figure 1a(i)) is analogous to the deterministic linear convergence bound for continuous functions. There is a family of such error bounds associated with the same posterior mean, with each line corresponding to a different value for the unknown constant in the bound and different values for the parameters b and c in the covariance.[5] It may seem as though the probabilistic interpretation had added nothing new, but since this view identifies quadrature rules of varying assumptions as parameter choices in Gaussian process regression, it embeds seemingly separate rules in a hierarchical space of models, from which models with good error modelling can be selected by hierarchical probabilistic inference. This can be done without collecting additional data points [25, §5]. More generally, the probabilistic numeric viewpoint provides a principled way to manage the parameters of numerical procedures. Where Markov chain Monte Carlo procedures might require the hand-tuning of parameters such as step sizes and annealing schedules, Bayesian quadrature allows the machinery of statistical inference procedure to be brought to bear upon such parameters. A practical example of the benefits of approximate Bayesian inference for the (hyper-)parameters of a Bayesian quadrature procedure is given by [28]. The function f used in this example is much smoother than typical functions under the Gaussian process prior distribution associated with the trapezoid rule (shown as thin grey samples in figure 1a(i)). Figure 1a(ii),b(ii) (using orange) shows analogous experiments with the exponentiated-quadratic covariance function , corresponding to a very strong smoothness assumption on f [29] (see grey samples in figure 1a(ii)), giving a very quickly converging estimate. In this case, the ‘error bars’ provided by the standard deviation converge in a qualitatively comparable way. Of course, the faster convergence of this quadrature rule based on the exponentiated-quadratic covariance prior is not a universal property. It is the effect of a much stronger set of prior assumptions. If the true integrand is rougher than expected under this prior, the quadrature estimate arising from this prior can be quite wrong. Figure 1a(iii)(iv),b(iii)(iv) shows analogous experiments on an integrand that is a true sample from a Gaussian process with the spline covariance (2.3). In this case, the spline prior is the optimal statistical estimator by construction, and its error estimate is perfectly calibrated (figure 1a(iii),b(iii)), while the exponentiated-quadratic kernel gives over-confident, and inefficient, estimates (figure 1a(iv),b(iv)). Identifying the optimal regression model from a larger class, just based on the collected function values, requires more computational work than to fix a regressor from the start. But it also gives better calibrated uncertainty. Contemporary general-purpose quadrature implementations (e.g. [21]) remain lightweight by recursively re-using previous computations. The above experiments show that it is possible to design Bayesian quadrature rules with well-calibrated posterior error estimates, but it remains a question how small the computational overhead from a probabilistic computation over these methods can be made. Even so, formulating quadrature as probabilistic regression precisely captures a trade-off between prior assumptions inherent in a computation and the computational effort required in that computation to achieve a certain precision. Computational rules arising from a strongly constrained hypothesis class can perform much better than less restrictive rules if the prior assumptions are valid. In the numerical setting—in contrast to many empirical situations in statistics—it is often possible to precisely check whether a particular prior assumption is valid: the machine performing the computation has access, at least in principle, to a formal, complete, description of its task, in the form of the source code describing the task (the integrand, the optimization objective, etc.). Using this source code, it is possible, for example, to test, at runtime, whether, and how many times, an integrand is continuously differentiable (e.g. [30]). Using this information will in future allow the design of improved quadrature methods. Once one knows that general-purpose quadrature methods effectively use a Gaussian process prior over function values, it is natural to ask whether this prior actually incorporates the salient information available for one's specific problem. Including such information in the prior leads to customized, ‘tailored’, numerical methods that can perform better. Probabilistic inference also furnishes the framework required to tackle numerics tasks using decision theory. For quadrature, the decision problem to be solved is that of node selection: that is, the determination of the optimal positions for points (or nodes) at which to evaluate the integrand. With the definition of an appropriate loss function, such as the posterior variance of the integral, such nodes can be optimally selected by minimizing the expected loss function. It seems clear that this approach can improve upon the simple uniform grids of many traditional quadrature methods, and can enable active learning: where surprising evaluations can influence the selection of future nodes. This decision-theoretic approach stands in contrast with that adopted by randomized, Monte Carlo, approaches to quadrature. Such approaches generate nodes with the use of a pseudo-random number generator, injecting additional epistemic uncertainty (about the value of the generator's outputs) into a procedure designed to reduce the uncertainty in an integral. It is worth noting that the use of pseudo-random generators burdens the procedure with additional computational overhead: pseudo-random numbers are cheap, but not free. The principal feature of Monte Carlo approaches is their conservative nature: the Monte Carlo policy will always, eventually, take an additional node arbitrarily close to an existing node. The disadvantage of this strategy is its waste of valuable evaluations: the convergence rate of Monte Carlo techniques, , is clearly improved upon by both traditional quadrature and Bayesian quadrature methods. This problem is only worsened by the common discarding of evaluations known as ‘burn-in’ and ‘thinning’. The advantage of Monte Carlo, of course, is its robustness to even highly non-smooth integrands. However, Bayesian quadrature can realize more value from evaluations by exploiting known structure (e.g. smoothness) in the integrand.

Empirical evaluation of Bayesian quadrature for astrometry

We illustrate this on an integration problem drawn from astrometry, the measurement of the motion of stars. In order to validate astrometric analysis, we aim to recover the number of planets present in synthetic data generated (with a known number of planets) to mimic that produced by an astrometric facility such as the GAIA satellite [31].[6] Here, quadrature's task is to compute the model evidence (marginal likelihood) of a model for orbital motions, where are the gathered observations and θ are the model parameters. Specifically, it is of interest to compare the evidence for models including differing numbers of exoplanets. For the following example, to provide a focal problem upon which to compare quadrature methods, we compute the evidence of a model with two such planets on data generated with a two-planet model. The corresponding integral is analytically intractable, with a multi-modal integrand (the likelihood ) and a 19-dimensional θ rendering the quadrature problem challenging. As the probabilistic method, we use a recent algorithm: warped, sequential, active Bayesian integration (WSABI[7]) [32]. WSABI is a Bayesian quadrature algorithm that uses an internal probabilistic model that is well-calibrated to the suspected properties of the problem's integrand. Firstly, like the exponentiated quadratic, WSABI's covariance function is suitable for smooth integrands, as are expected for the problem. Secondly, and beyond what is achievable with classic quadrature rules, WSABI explicitly encodes the fact that the integrand (the likelihood ) is strictly positive. WSABI also makes use of a further opportunity afforded by the probabilistic numeric approach: it actively selects nodes so as to minimize the uncertainty in the integral. This final contribution permits nodes to be selected that are far more informative than gridded or randomly selected evaluations. We compare this algorithm against two different Monte Carlo approaches to the problem: annealed importance sampling (AIS) [33] (which was implemented with a Metropolis–Hastings sampler) and simple Monte Carlo (SMC). We additionally compared against Bayesian Monte Carlo (BMC) [24], a Bayesian quadrature algorithm using the simpler exponentiated-quadratic model, and whose nodes were taken from the same samples selected by SMC. ‘Ground truth’ (Ztrue) was obtained through exhaustive SMC sampling (106 samples). The results in figure 2 show that the probabilistic quadrature method achieves improved precision drastically faster than the Monte Carlo estimates. It is important to point out that the plot's abscissa is ‘wall-clock’ time, not algorithmic steps. Probabilistic algorithms need not be expensive.

Figure 2.

We compare different quadrature methods in computing the 19-dimensional integral giving model evidence in an astrometry application. The Bayesian quadrature algorithm (WSABI) that employs active selection of nodes, along with prior knowledge of the smoothness and non-negativity of the integrand, converges faster than the Monte Carlo approaches: simple Monte Carlo (SMC) and annealed importance sampling (AIS). Note that Bayesian Monte Carlo (BMC) is a Bayesian quadrature algorithm that uses the same samples as SMC, explaining their similar performance. Note also the performance improvement offered by WSABI over BMC, suggesting the crucial role played by the active selection of nodes.

Linear algebra

Computational linear algebra covers various operations on matrices. We will here focus on linear optimization, where is known and the task is to find such that A= where is symmetric positive definite. We assume access to projections A for arbitrary . If N is too large for exact inversion of A, a widely known approach is the method of conjugate gradients (CG) [9], which produces a convergent sequence 0,…, of improving estimates for . Each iteration involves one matrix-vector multiplication and a small number of linear operations, to produce the update =−, and an ‘observation’ =A. The good performance of CG has been analysed extensively (e.g. [34, §5.1]). We will use the shorthands S=[1,…,] and Y=[1,…,] for the set of projection directions and ‘observations’ after M iterations of the method. Defining a probabilistic numerical algorithm requires a joint probability measure p(Y,A,,|S) over all involved variables, conditioned on the algorithm's active design decisions; and an ‘action rule’ (design, policy, control law) describing how the algorithm should collect data. Recent work by Hennig [10] describes such a model which exactly reproduces the sequence {} of CG: as prior, choose a Gaussian measure over the (assumed to exist) inverse H=A−1 of A This implies a Gaussian prior over x=Hb, and a non-Gaussian prior over A. In (2.6), is a Gaussian distribution over the N2 elements of H stacked into a vector . Γ is the symmetrization operator , and ⊗ is the Kronecker product ( is also known as the symmetric Kronecker product [35]). As projections are presumed noise-free, the likelihood is the Dirac distribution, the limit of a Gaussian with vanishing width, which can also be seen as performing a strict conditioning of the prior on the observed function values, The action rule at iteration i is to move to , where is the posterior mean p(H|Y). The optimal step α can be computed exactly in a linear computation. Equation (2.6) requires choices for 0 and W. The prior mean is set to unit, 0=. If is positive definite, then p(H) assigns finite measure to every symmetric , and the algorithm converges to the true in at most N steps, assuming exact computations [10]. In this sense it is non-restrictive, but of course some matrices are assigned more density than others. If W is set to a value among the set (this includes the true matrix W=H, but does not require access to it), then the resulting algorithm exactly reproduces the iteration sequence {} of the CG method, and can be implemented in the exact same way. However, this derivation also provides something new: a Gaussian posterior distribution over H. Details can be found in [10]; for the present discussion it suffices to know that both the posterior mean H and covariance parameter W are of manageable form. In particular, with a diagonal matrix and ‘skinny’ matrices , which are a function of the steps and observed projections . So, as in the case of quadrature, there is a family of Gaussian priors of varying width (scaled by σ), such that all members of the family give the same posterior mean estimate. And this posterior mean estimate is identical to a classic numerical method (CG). But each member of the family gives a different posterior covariance—a different uncertainty estimate. It is an interesting question to which degree the uncertainty parameter W can be designed to give a meaningful error estimate. Some answers can be found in [10]. Interestingly, the equivalence class of prior covariances W0 that match CG in the mean estimate has more degrees of freedom than the number M of observations (matrix-vector multiplications) collected by the algorithm during its typical runtime. Fitting the posterior uncertainty W thus requires strong regularization. The method advocated in [10] constructs such a regularized estimator exclusively from scalar numbers already collected during the run of the CG method, thus keeping computational overhead very small. But, as in quadrature, there are valuable applications for the probabilistic formulation that do not strictly require a well-calibrated width of the posterior. Applications that make primary use of the posterior mean may just require algebraic structure in the prior, up to an arbitrary scaling constant, to incorporate available, helpful, information. We will not do this here and instead highlight another use of probabilities over point estimates: the propagation of knowledge from one linear problem to another related problem. The approach described in the following is known in the numerical linear algebra community as the recycling of Krylov sequences [36]. However, while the framework of classic numerical analysis required challenging and bespoke derivation of this result, it follows naturally in the probabilistic viewpoint as the extension of a parametric regressor to a filter on a time-varying process. The probabilistic formulation of computation uses the universal and unique language of inference to enable the solution of similar problems across the breadth of numerics using similar techniques. This is in contrast to the compartmentalized state of current numerics, which demands distinct expert knowledge of each individual numeric problem in order to make progress towards it. Furthermore, there is social value in making results accessible to any practitioner with a graduate knowledge of statistics. In many set-ups, the same A features in a sequence of problems A=, t=1,…. In others, a map A changes slightly from step t to step t+1. Figure 3 describes a blind deconvolution problem from astronomical imaging:[8] the task is to remove an unknown linear blur from a sequence 1,…, of astronomical images [37]. Atmospherical disturbances create a blur that continuously varies over time. The model is that each frame is the noisy result of convolution of the same ground truth image with a spatially varying blur kernel , i.e. =+, where is white Gaussian noise. A matrix X encodes the convolution operation as =X+ (it is possible to ensure X is positive definite). A blind deconvolution algorithm iterates between estimating X and estimating the [37]. Each iteration thus requires the solution of K linear problems (with fixed X) to find the , then one larger linear problem to find . The naïve approach of running separate instances of CG wastes information, because the K linear problems all share the matrix X, and from one iteration to the next, the matrix X changes less and less as the iterations approach convergence. Instead, information can be propagated between related computations using the probabilistic interpretation of CG, by starting each computation in the sequence with a prior mean H0 set to the posterior mean H of the preceding problem. Owing to the low-rank structure of H, this has low cost. To prevent a continued rise in computational costs as more and more linear problems are solved, H can be restricted to a fixed rank approximation after each inner loop, also at low cost. Figure 3b shows the increase in computational efficiency: for the baseline of solving 40 linear problems independently in sequence, each converges about equally fast (figure 3b(i); each ‘jump’ is the start of a new problem). Figure 3b(ii) shows optimization progress of the same 40 problems when information is propagated from one problem to the next. The first problem amounts to standard CG, while subsequent iterations can make increasingly better use of the available information. The figure also shows the dominant eigenvectors of the inferred posterior means of X−1 after K=40 subsequent linear problems, which converge to a relatively generic basis for point-spread functions. Although not strictly correct, this scheme can be intuitively understood as inferring a pre-conditioner across the sequence of problems, by Bayesian filtering (the technical caveat is that the described scheme does not re-scale the linear problem itself, as a pre-conditioner would, it just shifts the initial solution estimate).

Figure 3.

Solving sequences of linear problems by using the probabilistic interpretation of CG, also known as the recycling of Krylov sequences. (a(i)) A sequence of observed astronomical images is modelled as the convolution of a stationary true image with a time-varying point-spread function (the matrix X encodes the convolution with ). Each individual deconvolution task requires one run of a linear solver, here chosen to be the method of CG. (b) If each problem is solved independently, each instance of the solver progresses similarly (b(i) optimization residuals over time; each ‘jump’ in the residual is the start of a new frame/deconvolution problem). If the posterior mean implied by the probabilistic interpretation of the solver is communicated from one problem to the next, the solvers progress increasingly faster (b(ii) note decreasing number of steps for each deconvolution problem, and decrease, by about one order of magnitude, of initial residuals). (a(ii)) Over time, the vectors spanning this posterior mean for X−1 converge to a generic basis for point-spread functions.

Further areas

These examples highlight the areas of quadrature and linear algebra. Analogous results, identifying existing numerical methods with maximum a posteriori estimators, have been established in other areas, too. We do not experiment with them here, but they help complete the picture of numerical methods as inference across problem boundaries, as follows. In nonlinear optimization, quasi-Newton methods such as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) rule are deeply connected to CG (in linear problems, BFGS and CG are the same algorithm [38]). The BFGS algorithm can be interpreted as a specific kind of autoregressive generalization of the Gaussian model for CG [39]. Among other things, this allows explicit modelling of noise on evaluated gradients, a pressing issue in large-scale machine learning [40]. One currently particularly exciting area for probabilistic numerics is ordinary differential equations; more specifically, the solution of initial value problems (IVPs) of the form (d/dt)(t)=f((t),t), where is a real-valued curve parametrized by t known to start at the initial value (t0)=0. Explicit Runge–Kutta methods are a basic and well-studied tool for such problems [20]—while not necessarily state of the art, they nevertheless perform well on many problems, and are conceptually very clear. From the inference perspective, Runge–Kutta methods are linear extrapolation rules. At increasing nodes t0 It is important to point out the central role that linear extrapolation, and linear computations more generally, play again here, as they did in the other numerical settings discussed above. It is not an oversimplification to note that numerical methods often amount to efficiently projecting an intractable problem into a tractable linear computation. Since the Gaussian family is closed under all linear operations, it is, perhaps, no surprise that Gaussian distributions play a central role in probabilistic re-interpretations of existing numerical methods. In the case of IVPs, Gaussian process extrapolation was previously suggested by Skilling [14] as a tool for their solution. Starting from scratch, Skilling arrived at a method that shares the linear structure of equation (2.8), but does not have the strong theoretical underpinning of Runge–Kutta methods; in particular, Skilling's method does not share the high convergence order of Runge–Kutta methods. But the probabilistic formulation allows for novel theoretical analysis of its own [41], and new kinds of applications, such as the marginalization over posterior uncertainty in subsequent computational steps and finite prior uncertainty over the initial value [42]. A considerably vaguer but much earlier observation was made, on the side of numerical mathematics, by Nordsieck [43], who noted that a class of methods he proposed, and which were subsequently captured in a wider nomenclature of ODE solvers, bore a resemblance to linear electrical filters. These, in turn, are closely connected to Gaussian process regression through the notion of Markov processes. Recently, Schober et al. [44] showed that these connections between Gaussian regression and the solution of IVPs, hitherto only conceptual, can be made tight. There is a family of Gauss–Markov priors that, used as extrapolation rules, give posterior Gaussian processes whose mean function exactly matches members of the Runge–Kutta family. (Thanks to its Markov property, the corresponding inference method can be implemented as a signal filter, and thus in linear computational complexity, like Runge–Kutta methods). Hence, as in the other areas of numerics, there is now a family of methods that returns the trusted point estimates of an established method, while giving a new posterior uncertainty estimate allowing new functionality.

Discussion

General recipe for probabilistic numerical algorithms

These recent results, identifying probabilistic formulations for classic numerical methods, highlight a general structure. Consider the problem of approximating the intractable variable z, if the algorithm has the ability to choose ‘inputs’ ={x} for computations that result in numbers ()={y(x)}. A blueprint for the definition of probabilistic numerical methods requires two main ingredients: (i) A generative model p(z,()) for all variables involved—that is, a joint probability measure over the intractable quantity to be computed, and the tractable numerical quantities computed in the process of the algorithm. Like all (sufficiently structured) probability measures, this joint measure can be written as i.e. separated into a prior p(z) and a likelihood p(()|z). The prior encodes a hypothesis class over solutions, and assigns a typically non-uniform measure over this class. The likelihood explains how the collected tractable numbers relate to z. It has the basic role of describing the numerical task. Often, in classic numerical problems, the likelihood is a deterministic conditioning rule, a point measure. (i) A design, action rule, or policy r, such that encoding how the algorithm should collect numbers. (Here 1: should be understood as the actions taken in the preceding steps 1 to i, and similarly for 1:). This rule can be simple; for example, it could be independent of collected data (regular grids for integration). Or it might have a Markov-type property that the decision at i only depends on k26]. The aforementioned results show that classic, base-case algorithms for several fundamental numerical problems can be cast as maximum a posteriori inference in specific cases of this description; typically under Gaussian priors, and often under simple action rules, such as uniform gridding (in quadrature) or greedy extrapolation (in linear algebra, optimization and the solution of ODEs). Table 1 gives a short summary.
Table 1.
Probabilistic description of several basic numerical problems (shortened notation for brevity). In quadrature, (symmetric positive definite) linear optimization, nonlinear optimization and the solution of ordinary differential equation IVPs, classic methods can be cast as maximum a posteriori estimation under Gaussian priors. In each case, the likelihood function is a strict conditioning, because observations are assumed to be noise-free. Because numerical methods are active (they decide which computations to perform), they require a decision rule. This is often ‘greedy’: evaluation under the posterior mean estimate. The exception is integration, which is the only area where the estimated solution of the numerical task is not required to construct the next evaluation.
problem class integration linear opt. nonlinear opt. ODE IVPs
inferred z z=f;∫f(x) dx z=A⁻¹;Ax=b z=B=∇∇⊺f z′(t)=f(z(t),t)
classic method Gaussian quad. conjugate gradients BFGS Runge–Kutta
p(z) GP(f;μ,k) N(A−1; M,V) GP(z;μ,k) GP(z;μ,k)
p(y|z) I(f(xi)=yi) I(yi=Axi) I(yi=Bxi) I(yi=z′(t))
decision rule minimize post. variance gradient at est. solution gradient under est. Hessian evaluate at est. solution
Probabilistic description of several basic numerical problems (shortened notation for brevity). In quadrature, (symmetric positive definite) linear optimization, nonlinear optimization and the solution of ordinary differential equation IVPs, classic methods can be cast as maximum a posteriori estimation under Gaussian priors. In each case, the likelihood function is a strict conditioning, because observations are assumed to be noise-free. Because numerical methods are active (they decide which computations to perform), they require a decision rule. This is often ‘greedy’: evaluation under the posterior mean estimate. The exception is integration, which is the only area where the estimated solution of the numerical task is not required to construct the next evaluation.
Current limitations
Numerical methods have undergone centuries of development and analysis. The result is a mature set of algorithms that have been ingrained in the scientific tool-set. By contrast, the probabilistic viewpoint suggested here is an emerging area. Many questions remain unanswered, and many aspects of practical importance are missing: formal analysis is at an early stage. Efficient and stable implementations are still in development. Convincing use-cases from various scientific disciplines are only beginning to emerge. We hope that the reader will take these issues as a motivation to contribute, rather than a hold-up. As the use of large-scale computation, simulation and the use of data permeate the quantitative sciences, there is clearly a need for a formal theory of uncertainty in computation.
New paths for research
In our opinion, the match between probabilistic inference and existing numerical methods lays a firm foundation for the analysis of probabilistic numerical methods. We see two primary, complementary goals, as follows. Firstly, implicit prior assumptions can now be questioned. This could be done in an ‘aggressive’ way, in the hope of finding either algorithms with faster convergence on a smaller set of problems satisfying stronger assumptions (as in the quadrature example of §3b). Conversely, a ‘conservative’ re-definition of prior assumptions might improve robustness at increased computational cost. A particularly important aspect in this regard is the action rule r. Wherever r is a function of previously collected ‘data’ (known as adaptive design in statistics and active learning in machine learning), a bias can occur. Where the collected data also influence the result of future actions (as in ODE solvers), a more severe problem, an exploration–exploitation trade-off, can arise. Checking for such biases, and potentially correcting them, can increase computational cost. But in some applications that require high robustness this effort can pay off. Secondly, the modelling assumptions, in particular the likelihood, can be extended to increase the reach of existing methods to new settings. A first point of interest is the explicit modelling of uncertainty or noise on the evaluations themselves. This generalization, which would be challenging to construct from a classical standpoint, is often straightforward once a probabilistic interpretation has been found. It may be as simple as replacing the point-mass likelihood functions in table 1 with Gaussian distributions. A prominent case of this aspect is the optimization of noisy functions, such as arises, for example, in the training of large-scale machine learning architectures from subsets of a large or infinite dataset. Other ideas, such as the propagation of knowledge between problems, as in figure 3, are just as difficult to motivate and study in a classic formulation, but suggest themselves quite naturally in the probabilistic formulation. There are also practical considerations that shape the research effort. Gaussian distributions play an important role, at least where the inferred quantity is continuous valued. This is not incidental: to a large degree, the point of a numerical method is to turn an intractable computation into a sequence of linear computations. The Gaussian exponential family is closed under linear projections, thus ideally suited for this task. Efficient adaptation of model hyper-parameters is crucial for a well-calibrated posterior measure. Models with fixed parameters often simply reproduce existing analytic bounds; only through parameter adaptation can uncertainty be actively ‘fitted’. Doing so is perhaps more challenging than elsewhere in statistics because numerical methods are ‘inner-loop’ algorithms used to solve more complex, higher level computations. It is important to find computationally lightweight parameter estimation methods, perhaps at the cost of accepting some limitations in model flexibility. Although the fundamental insight that numerical methods solve inference problems is not new, the study of probabilistic numerical methods is still young. Recent work has made progress, exposing a wealth of enticing applications in the process. We conclude this text by highlighting a most promising, if distant, application motivating ongoing research.
A vision: chained numerical methods communicating uncertainty
Fuelled by ubiquitous collection and communication of data, several academic and industrial fields are now interested in systems that use observations to adapt to, and interact with, their data source in an autonomous way. Figure 4 shows a conceptual sketch of an autonomous machine aiming to solve a given task by using observations (data) D to build a probabilistic model p(x|D,θ) of variables x that describe the environment's dynamics through model parameters θ. The model can be used to predict future states x as a function of actions a chosen by the machine. The goal is to choose actions that, over time, maximize some measure of utility that encodes the task. This requires a sequence of numerical steps: inference on x requires marginalization and expectations, i.e. integration. Fitting θ involves optimization. Prediction of x may entail solving differential equations. All three areas have linear base cases (inference in linear regression, optimization of quadratic functions, the solution of linear ODEs). The combination of a sequence of ‘black-box’ numerical methods in such automated set-ups gives rise to new challenges. Each method receives a point estimate from its precursor, performs its local computation (and adds its local error), and hands the result on. Errors can accumulate in unexpected ways along this chain, but modelling their accumulation provides value: it may be unnecessary to run a numerical method to convergence if its inputs are already known to be only rough estimates.
Figure 4.
Sketch of an autonomous system, collecting data to build a parametrized model of the environment. Predictions of future states from the model can be used to choose an action strategy. If intermediate operations are solved by numerical methods, both computational errors and inherent uncertainty should be propagated across the pipeline to monitor and target computational effort.
Sketch of an autonomous system, collecting data to build a parametrized model of the environment. Predictions of future states from the model can be used to choose an action strategy. If intermediate operations are solved by numerical methods, both computational errors and inherent uncertainty should be propagated across the pipeline to monitor and target computational effort. Specifically, numerical methods allowing for probabilistic inputs and outputs turn the sketch of figure 4 into a factor graph [45], and allow propagation of uncertainty estimates along the chain of computation, through message passing [46]. This would identify sources of computational error, allowing: the active management of a computational budget across the chain; the dedication of finite computer resources to steps that dominate the overall error; and the truncation of computations early once they reach sufficient precision. Uncertainty propagation through computations has been studied widely before ([47] gives a review), but the available algorithms focus on the effects of set-up uncertainties on the outcome of a computation, rather than the computation itself. This new functionality explicitly requires calibrated probabilistic uncertainty at each step of the computation, at runtime. Classic abstract convergence analyses cannot be used for this kind of estimation.
Conclusion
Numerical tasks can be interpreted as inference problems, giving rise to probabilistic numerical methods. Established algorithms for many tasks can be cast explicitly in this light. Doing so establishes connections between seemingly disparate problems, yields new functionality and can improve performance on structured problems. To allow interpretation of the posterior as a statement of uncertainty, care must be taken to ensure well-calibrated priors and models. But even where the uncertainty interpretation is not (yet) rigorously established, the probabilistic formulation already allows for the encoding of prior information about problem structure, including the propagation of collected information among problem instances, leading to improved performance. Many open questions remain for this exciting field. In the long run, probabilistic formulations may allow the propagation of uncertainty through pipelines of computation, and thus the active control of computational effort through hierarchical, modular computations.

problem class	integration	linear opt.	nonlinear opt.	ODE IVPs
inferred z	z=f;∫f(x) dx	z=A⁻¹;Ax=b	z=B=∇∇⊺f	z′(t)=f(z(t),t)
classic method	Gaussian quad.	conjugate gradients	BFGS	Runge–Kutta
p(z)	GP(f;μ,k)	N(A−1; M,V)	GP(z;μ,k)	GP(z;μ,k)
p(y\|z)	I(f(xi)=yi)	I(yi=Axi)	I(yi=Bxi)	I(yi=z′(t))
decision rule	minimize post. variance	gradient at est. solution	gradient under est. Hessian	evaluate at est. solution

1 in total

1. Randomized algorithms for the low-rank approximation of matrices.
Authors: Edo Liberty; Franco Woolfe; Per-Gunnar Martinsson; Vladimir Rokhlin; Mark Tygert
Journal: Proc Natl Acad Sci U S A       Date: 2007-12-04       Impact factor: 11.205

1 in total

6 in total

1. Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization.
Authors: Samir Bhatt; Ewan Cameron; Seth R Flaxman; Daniel J Weiss; David L Smith; Peter W Gething
Journal: J R Soc Interface       Date: 2017-09       Impact factor: 4.118
2. MEMe: An Accurate Maximum Entropy Method for Efficient Approximations in Large-Scale Machine Learning.
Authors: Diego Granziol; Binxin Ru; Stefan Zohren; Xiaowen Dong; Michael Osborne; Stephen Roberts
Journal: Entropy (Basel)       Date: 2019-05-31       Impact factor: 2.524
3. Using Probabilistic Approach to Evaluate the Total Population Density on Coarse Grids.
Authors: Manal Alqhtani; Khaled M Saad
Journal: Entropy (Basel)       Date: 2020-06-14       Impact factor: 2.524
Review 4. Timing errors and temporal uncertainty in clinical databases-A narrative review.
Authors: Andrew J Goodwin; Danny Eytan; William Dixon; Sebastian D Goodfellow; Zakary Doherty; Robert W Greer; Alistair McEwan; Mark Tracy; Peter C Laussen; Azadeh Assadi; Mjaye Mazwi
Journal: Front Digit Health       Date: 2022-08-18
5. Statistical analysis of differential equations: introducing probability measures on numerical solutions.
Authors: Patrick R Conrad; Mark Girolami; Simo Särkkä; Andrew Stuart; Konstantinos Zygalakis
Journal: Stat Comput       Date: 2016-06-02       Impact factor: 2.559
6. Dynamical Field Inference and Supersymmetry.
Authors: Margret Westerkamp; Igor Ovchinnikov; Philipp Frank; Torsten Enßlin
Journal: Entropy (Basel)       Date: 2021-12-08       Impact factor: 2.524

6 in total