Literature DB >> 35450025

Ensemble-SINDy: Robust sparse model discovery in the low-data, high-noise limit, with active learning and control.

U. Fasel, J. N. Kutz, B. W. Brunton, S. L. Brunton.

Abstract

Sparse model identification enables the discovery of nonlinear dynamical systems purely from data; however, this approach is sensitive to noise, especially in the low-data limit. In this work, we leverage the statistical approach of bootstrap aggregating (bagging) to robustify the sparse identification of nonlinear dynamics (SINDy) algorithm. First, an ensemble of SINDy models is identified from subsets of limited and noisy data. The aggregate model statistics are then used to produce inclusion probabilities of the candidate functions, which enables uncertainty quantification and probabilistic forecasts. We apply this ensemble-SINDy (E-SINDy) algorithm to several synthetic and real-world datasets and demonstrate substantial improvements to the accuracy and robustness of model discovery from extremely noisy and limited data. For example, E-SINDy uncovers partial differential equation models from data with more than twice as much measurement noise as has been previously reported. Similarly, E-SINDy learns the Lotka–Volterra dynamics from remarkably limited data of yearly lynx and hare pelts collected from 1900 to 1920. E-SINDy is computationally efficient, with similar scaling as standard SINDy. Finally, we show that ensemble statistics from E-SINDy can be exploited for active learning and improved model predictive control.
© 2022 The Authors.

Keywords:  active learning; ensemble methods; model discovery; nonlinear dynamics; probabilistic forecasting; sparse regression

Year:  2022        PMID: 35450025      PMCID: PMC9006119          DOI: 10.1098/rspa.2021.0904

Source DB:  PubMed          Journal:  Proc Math Phys Eng Sci        ISSN: 1364-5021            Impact factor:   2.704


Introduction

Data-driven model discovery enables the characterization of complex systems where first principles derivations remain elusive, such as in neuroscience, power grids, epidemiology, finance and ecology. A wide range of data-driven model discovery methods exist, including equation-free modelling [1], normal form identification [2-4], nonlinear Laplacian spectral analysis [5], Koopman analysis [6,7] and dynamic mode decomposition (DMD) [8-10], symbolic regression [11-15], sparse regression [16,17], Gaussian processes [18], combining machine learning and data assimilation [19,20], and deep learning [21-27]. Limited data and noisy measurements are fundamental challenges for all of these model discovery methods, often limiting the effectiveness of such techniques across diverse application areas. The sparse identification of nonlinear dynamics (SINDy) [16] algorithm is promising, because it enables the discovery of interpretable and generalizable models that balance accuracy and efficiency. Moreover, SINDy is based on simple sparse linear regression that is highly extensible and requires significantly less data in comparison with, for instance, neural networks. In this work, we unify and extend innovations of the SINDy algorithm by leveraging classical statistical bagging methods [28] to produce a computationally efficient and robust probabilistic model discovery method that overcomes the two canonical failure points of model discovery: noise and limited data. The SINDy algorithm [16] provides a data-driven model discovery framework, relying on sparsity-promoting optimization to identify parsimonious models that avoid overfitting. These models may be ordinary differential equations (ODEs) [16] or partial differential equations (PDEs) [17,29]. 
SINDy has been applied to a number of challenging model discovery problems, including for reduced-order models of fluid dynamics [30-35] and plasma dynamics [36-38], turbulence closures [39-41], mesoscale ocean closures [42], nonlinear optics [43], computational chemistry [44] and numerical integration schemes [45]. SINDy has been widely adopted, in part, because it is highly extensible. Extensions of the SINDy algorithm include accounting for control inputs [46] and rational functions [47,48], enforcing known conservation laws and symmetries [30], promoting stability [49], improved noise robustness through the integral formulation [37,50-54], generalizations for stochastic dynamics [44,55] and tensor formulations [56], and probabilistic model discovery via sparse Bayesian inference [57-61]. Many of these innovations have been incorporated into the open source software package PySINDy [62,63]. Today, the biggest challenge with SINDy, and more broadly in model discovery, is learning models from limited and noisy data, especially for spatio-temporal systems governed by PDEs. Model discovery algorithms are sensitive to noise because they rely on the accurate computation of derivatives, which is especially challenging for PDEs where noise can be strongly amplified when computing higher-order spatial derivatives. There have been two key innovations to improve the noise robustness of SINDy: control volume formulations and ensemble methods. The integral formulation of SINDy [50] has proven powerful, enabling the identification of PDEs in a weak form that averages over control volumes, which significantly improves its noise tolerance. This approach has been used to discover a hierarchy of PDE models for fluids and plasmas [37,51-54,64,65]. 
Several works have begun to explore ensemble methods to robustify data-driven modelling, including the use of bagging for DMD [66], ensemble-Lasso [67], subsample aggregating for improved discovery [61,68], statistical learning of PDEs to select model coefficients with high importance measures [69] and improved discovery using ensembles based on subsampling of the data [51,52,61,65]. Also, symbolic regression methods [11-13] and spectral proper orthogonal decomposition (SPOD) [70] are inherently imbued with ensembling ideas. Symbolic regression models are formed by initially randomly combining mathematical building blocks (library terms) and then recombining building blocks (equations or library terms) that best model the experimental data. In SPOD, a modal decomposition method closely related to DMD, optimally averaged DMD modes are obtained from an ensemble DMD problem. Thus, both these methods naturally include ensembling ideas. When dealing with noise-compromised data, it is also critical to provide uncertainty estimates of the discovered models. In this direction, recent innovations of SINDy use sparse Bayesian inference for probabilistic model discovery [57-60]. Such methods employ Markov Chain Monte Carlo, which is extremely computationally intensive. These extensions have all improved the robustness of SINDy for high-noise data, although they have been developed largely in isolation and they have not been fully characterized, exploited and/or integrated. In this work, we unify and extend recent innovations in ensembling and the weak formulation of SINDy to develop and characterize a more robust ensemble-SINDy (E-SINDy) algorithm. Furthermore, we show how this method can be useful for active learning and control. 
In particular, we apply b(r)agging[1] to SINDy to identify models of nonlinear ODEs of the form $\dot{\mathbf{x}}(t) = \mathbf{f}(\mathbf{x}(t))$ (1.1), with state $\mathbf{x}(t) \in \mathbb{R}^n$ and dynamics $\mathbf{f}$, and of nonlinear PDEs of the form $u_t = N(u, u_x, u_{xx}, \ldots, x, t; \boldsymbol{\mu})$ (1.2), with $N$ a nonlinear function of the state $u(x,t)$, its derivatives and parameters $\boldsymbol{\mu}$; partial derivatives are denoted with subscripts, such that $u_t = \partial u / \partial t$. We show that b(r)agging improves the accuracy and robustness of SINDy. The method also promotes interpretability through the inclusion probabilities of candidate functions, enabling uncertainty quantification. Importantly, the ensemble statistics are useful for producing probabilistic forecasts and can be used for active learning and nonlinear control. We also demonstrate library E-SINDy, which subsamples terms in the SINDy library. E-SINDy is computationally efficient compared with probabilistic model identification methods based on Markov Chain Monte Carlo sampling [60], which take several hours of CPU time to identify a model. By contrast, our method identifies models and summary statistics in seconds by leveraging the sparse regression of SINDy with statistical bagging techniques. Indeed, E-SINDy has similar computational scaling to standard SINDy. This method applies under the same conditions as the standard SINDy algorithm, where it is assumed that all relevant variables are measured at a sufficient temporal and spatial resolution so as to approximate derivatives. We investigate different ensemble methods, apply them to several synthetic and real-world datasets, and demonstrate that E-SINDy outperforms existing sparse regression methods, especially in the low-data and high-noise limit. A schematic of E-SINDy is shown in figure 1. We first describe SINDy for ODEs and PDEs in §2, before introducing the E-SINDy extension in §3, and discussing applications to challenging model discovery, active learning and control problems in §4.
Figure 1

(a–c) Schematic of the E-SINDy framework. E-SINDy exploits the statistical method of bootstrap aggregating (bagging) to identify ordinary and partial differential equations that govern the dynamics of observed noisy data. First, sparse regression is performed on bootstraps of the measured data, or on the library terms in case of library bagging, to identify an ensemble of SINDy models. The mean or median of the coefficients are then computed, coefficients with low inclusion probabilities are thresholded, and the E-SINDy model is aggregated and used for forecasting. (Online version in colour.)

Background

Here, we describe SINDy [16], a data-driven model discovery method to identify sparse nonlinear models from measurement data. First, we introduce SINDy to identify ODEs, followed by its generalization to identify PDEs [17,29].

Sparse identification of nonlinear dynamics

The SINDy algorithm [16] identifies nonlinear dynamical systems from data, based on the assumption that many systems have relatively few active terms in the dynamics in equation (1.1). SINDy uses sparse regression to identify these few active terms out of a library of candidate linear and nonlinear model terms. Therefore, sparsity-promoting techniques may be used to find parsimonious models that automatically balance model complexity with accuracy [16]. We first measure $m$ snapshots of the state $\mathbf{x}$ in time and arrange these into a data matrix $\mathbf{X} = [\mathbf{x}(t_1)\;\mathbf{x}(t_2)\;\cdots\;\mathbf{x}(t_m)]^{\top} \in \mathbb{R}^{m \times n}$. Next, we compute the library of candidate nonlinear functions $\boldsymbol{\Theta}(\mathbf{X})$, for example polynomials in the state, $\boldsymbol{\Theta}(\mathbf{X}) = [\mathbf{1}\;\;\mathbf{X}\;\;\mathbf{X}^{2}\;\cdots]$. This library is constructed to include any functions that might describe the data, and this choice is crucial. The underlying dynamical system is unknown and we cannot guarantee that the dynamics are well described by the span of the library. Therefore, the recommended strategy is to start with a basic choice, such as low-order polynomials, and then increase the complexity and order of the library until sparse and accurate models are obtained. We must also compute the time derivatives of the state, $\dot{\mathbf{X}}$, typically by numerical differentiation. We therefore need a suitable data sampling time that allows for the computation of the time derivatives, which may limit the applicability of the SINDy algorithm for certain datasets with coarse or uneven sampling in time. The system in equation (1.1) may then be written in terms of these data matrices as $\dot{\mathbf{X}} \approx \boldsymbol{\Theta}(\mathbf{X})\boldsymbol{\Xi}$. Each entry in $\boldsymbol{\Xi}$ is a coefficient corresponding to a term in the dynamical system. Many dynamical systems have relatively few active terms in the governing equations. Thus, we may employ sparse regression to identify a sparse matrix of coefficients $\boldsymbol{\Xi}$ signifying the fewest active terms from the library that result in a good model fit, $\boldsymbol{\Xi} = \operatorname{argmin}_{\boldsymbol{\Xi}'} \|\dot{\mathbf{X}} - \boldsymbol{\Theta}(\mathbf{X})\boldsymbol{\Xi}'\|_2 + R(\boldsymbol{\Xi}')$. The regularizer $R(\boldsymbol{\Xi})$ is chosen to promote sparsity in $\boldsymbol{\Xi}$.
For example, sequentially thresholded least-squares (STLS) [16] uses $R(\boldsymbol{\Xi}) = \lambda\|\boldsymbol{\Xi}\|_0$ with a single hyperparameter $\lambda$, whereas sequentially thresholded ridge regression (STRidge) [17] uses $R(\boldsymbol{\Xi}) = \alpha\|\boldsymbol{\Xi}\|_2^2 + \lambda\|\boldsymbol{\Xi}\|_0$ with two hyperparameters $\alpha$ and $\lambda$. STLS was first introduced to discover ODEs, and STRidge was introduced to discover PDEs, where data can be highly correlated and STLS tends to perform poorly. There are several other recently proposed regularizers and optimization schemes [49,71,72]. We illustrate STRidge in pseudo code in algorithm 1, noting that STRidge reduces to STLS for $\alpha = 0$.
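The thresholded regression loop can be sketched in a few lines of NumPy. The following is a minimal illustration (not the authors' implementation, which is available in PySINDy), applied to the logistic equation $\dot{u} = u - u^2$ with the exact derivative supplied so the recovery is clean; in practice the derivative would be approximated numerically:

```python
import numpy as np

def stridge(theta, dxdt, lam=0.1, alpha=1e-6, n_iter=10):
    """Sequentially thresholded ridge regression (STRidge) for one state variable.

    Repeatedly solves a ridge-regularized least-squares problem, zeroes out
    coefficients with magnitude below `lam`, and refits on the surviving terms.
    With alpha = 0 this reduces to sequentially thresholded least squares (STLS).
    """
    p = theta.shape[1]
    xi = np.linalg.solve(theta.T @ theta + alpha * np.eye(p), theta.T @ dxdt)
    for _ in range(n_iter):
        small = np.abs(xi) < lam
        xi[small] = 0.0
        big = ~small
        if big.any():
            sub = theta[:, big]
            xi[big] = np.linalg.solve(sub.T @ sub + alpha * np.eye(big.sum()),
                                      sub.T @ dxdt)
    return xi

# Demo on the logistic equation du/dt = u - u^2, using its exact solution.
t = np.arange(0.0, 10.0, 0.01)
u = 1.0 / (1.0 + 99.0 * np.exp(-t))         # exact solution with u(0) = 0.01
dudt = u * (1.0 - u)                        # exact derivative, for clarity
theta = np.column_stack([np.ones_like(u), u, u**2, u**3])  # library [1, u, u^2, u^3]
xi = stridge(theta, dudt, lam=0.1)
print(xi)  # approximately [0, 1, -1, 0]
```

The thresholding step is what distinguishes this from plain ridge regression: small coefficients are pruned outright, and the refit redistributes nothing back onto them.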

Discovering PDEs

SINDy was recently generalized to identify PDEs [17,29] in the partial differential equation functional identification of nonlinear dynamics (PDE-FIND) algorithm. PDE-FIND is similar to SINDy, but with the library including partial derivatives. Spatial time-series data are arranged into a single column vector $\mathbf{U} \in \mathbb{C}^{nm}$, with data collected over $m$ time points and $n$ spatial locations. Thus, for PDE-FIND, the library of candidate terms $\boldsymbol{\Theta}(\mathbf{U})$ includes the state and its spatial derivatives, for example $[\mathbf{1}\;\;\mathbf{U}\;\;\mathbf{U}^2\;\;\mathbf{U}_x\;\;\mathbf{U}\mathbf{U}_x\;\cdots]$. The PDE-FIND implementation of Rudy et al. [17] takes derivatives using finite differences for clean data or polynomial interpolation for noisy data, after which the library of candidate terms can be evaluated. The time derivative $\mathbf{U}_t$ is reshaped into a column vector, and the system in equation (1.2) is written as $\mathbf{U}_t = \boldsymbol{\Theta}(\mathbf{U})\boldsymbol{\xi}$ (2.6). For most PDEs, $\boldsymbol{\xi}$ is sparse and can be identified with a similar sparsity-promoting regression; STRidge improves model identification with the highly correlated data that are common in PDE regression problems. PDE-FIND is highly sensitive to noise, because noise is amplified when computing the high-order partial derivatives in $\boldsymbol{\Theta}(\mathbf{U})$. To make PDE-FIND more noise robust, integral [50] and weak formulations [51,54] were introduced. Instead of discovering a model based on equation (1.2), the PDE can be multiplied by a weight $w(x,t)$ and integrated over a domain $\Omega$. This can be repeated for a number of combinations of $w$ and $\Omega$. Stacking the results of the integration over different domains using different weights leads to a linear system $\mathbf{b} = \mathbf{Q}\boldsymbol{\xi}$ (2.8), with the integrated left-hand side $\mathbf{b}$ and the integrated library of candidate terms $\mathbf{Q}$, which replace $\mathbf{U}_t$ and the library of nonlinear functions $\boldsymbol{\Theta}(\mathbf{U})$. As with PDE-FIND, sparse regression can be employed to identify a sparse matrix of coefficients $\boldsymbol{\xi}$, using STLS, STRidge or other regularizers. For all of our results, we use this weak formulation as a baseline and as the basis of ensemble models.
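The weak formulation can be illustrated on a toy problem. The sketch below is a deliberately simplified stand-in for WSINDy (not the implementation of [54]): it uses polynomial bump test functions on random space-time windows, moves every derivative onto the test function by integration by parts, and recovers the advection equation $u_t = -u_x$ from samples of $u$ alone, never differentiating the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for the advection equation u_t = -u_x (exact solution sin(x - t)).
x = np.linspace(0.0, 2 * np.pi, 201)
t = np.linspace(0.0, 4.0, 201)
X, T = np.meshgrid(x, t, indexing="ij")
U = np.sin(X - T)

def bump(s, a, b):
    """Test function ((s-a)(b-s))^2 and its derivative; it vanishes (together
    with its first derivative) at the ends of [a, b], so boundary terms drop."""
    w = ((s - a) * (b - s)) ** 2
    dw = 2 * (s - a) * (b - s) * (a + b - 2 * s)
    return w, dw

rows_lhs, rows_feat = [], []
for _ in range(30):
    a = rng.uniform(0.0, 2 * np.pi - 2.0)   # random x-window [a, a+2]
    b = a + 2.0
    c = rng.uniform(0.0, 4.0 - 1.5)         # random t-window [c, c+1.5]
    d = c + 1.5
    ix = (x >= a) & (x <= b)
    it = (t >= c) & (t <= d)
    xs, ts = x[ix], t[it]
    Us = U[np.ix_(ix, it)]
    phi, dphi = bump(xs, a, b)
    psi, dpsi = bump(ts, c, d)
    dx_, dt_ = xs[1] - xs[0], ts[1] - ts[0]
    # Integration by parts:  ∫∫ w u_t = -∫∫ w_t u  and  ∫∫ w u_x = -∫∫ w_x u,
    # so no derivative of the (possibly noisy) data is ever taken.
    lhs = -np.sum(np.outer(phi, dpsi) * Us) * dx_ * dt_
    f_ux = -np.sum(np.outer(dphi, psi) * Us) * dx_ * dt_
    f_u = np.sum(np.outer(phi, psi) * Us) * dx_ * dt_
    rows_lhs.append(lhs)
    rows_feat.append([f_ux, f_u])

xi_weak, *_ = np.linalg.lstsq(np.array(rows_feat), np.array(rows_lhs), rcond=None)
print(xi_weak)  # expect approximately [-1, 0]:  u_t = -1 * u_x + 0 * u
```

Each random window contributes one row of the stacked linear system $\mathbf{b} = \mathbf{Q}\boldsymbol{\xi}$; the averaging over control volumes is what buys the noise robustness.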

Ensemble SINDy

In this work, we introduce E-SINDy, which incorporates ensembling techniques into data-driven model discovery. Ensembling is a well-established machine learning technique that combines multiple models to improve prediction. A range of ensembling methods exist, such as bagging (bootstrap aggregation) [28], bragging (robust bagging) [73,74] and boosting [75,76]. Structure learning techniques such as cross-validation [77] or stability selection [78] can also be considered ensembling methods, because they combine and use the information of a collection of learners or models. For model discovery, ensembling improves robustness and naturally provides inclusion probabilities and uncertainty estimates for the identified model coefficients, which enable probabilistic forecasting and active learning. Here, we propose two new ensemble model discovery methods: the first method is called b(r)agging E-SINDy, and the second method is called library E-SINDy. A general schematic of E-SINDy is shown in figure 1, and a schematic of the sparse regression problems for b(r)agging and library E-SINDy is shown in figure 2.
Our first method, b(r)agging E-SINDy, uses data bootstraps to discover an ensemble of models that are aggregated by taking the mean of the identified model coefficients in the case of bagging, and the median in the case of bragging. Bootstraps are data samples with replacement. Applied to SINDy to identify ODEs, we first build the library of candidate terms $\boldsymbol{\Theta}(\mathbf{X})$ and the derivatives $\dot{\mathbf{X}}$. From the $q$ rows of the data matrices $\boldsymbol{\Theta}(\mathbf{X})$ and $\dot{\mathbf{X}}$, corresponding to samples in time, we select bootstrapped data samples and generate the SINDy models in the ensemble. For each of these data bootstraps, $q$ new rows are sampled with replacement from the original $q$ rows of the data matrices. On average, each data bootstrap will contain around 63% of the distinct entries of the original data matrices, with some of these entries represented multiple times in the bootstrap; for large $q$ this quantity converges to $1 - 1/\mathrm{e} \approx 0.632$, which is the limit of $1 - (1 - 1/q)^q$ for $q \to \infty$. In this way, randomness and subsampling are inherent to the bootstrapping procedure. From the identified SINDy models in the ensemble, we can either directly aggregate the identified models, or first threshold coefficients with low inclusion probability. The procedure is illustrated in algorithm 2 for bagging E-SINDy using STRidge. The same procedure applies for bragging, taking the median instead of the mean, and using regularizers other than STRidge. Note that there are other random data subsampling approaches that may be used, such as generating models based on random subsamples of $q_s < q$ rows of the data without replacement, of which there are $\binom{q}{q_s}$. However, bootstrapping based on selection with replacement is the most standard procedure.
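The bagging loop described above can be sketched concretely. The following self-contained toy (using STLS and a synthetic sparse regression problem, not the paper's PySINDy implementation) shows the three ingredients: bootstrapped fits, inclusion probabilities, and mean/median aggregation:

```python
import numpy as np

rng = np.random.default_rng(1)

def stls(theta, y, lam=0.1, n_iter=10):
    """Sequentially thresholded least squares, the base SINDy regression."""
    xi = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < lam
        xi[small] = 0.0
        if (~small).any():
            xi[~small] = np.linalg.lstsq(theta[:, ~small], y, rcond=None)[0]
    return xi

# Synthetic problem: y = theta @ xi_true + noise, with a sparse xi_true.
n, p = 400, 6
theta = rng.normal(size=(n, p))
xi_true = np.array([0.0, 2.0, 0.0, -1.5, 0.0, 0.0])
y = theta @ xi_true + 0.1 * rng.normal(size=n)

n_models = 100
coefs = np.zeros((n_models, p))
for k in range(n_models):
    idx = rng.integers(0, n, size=n)       # bootstrap: sample rows with replacement
    coefs[k] = stls(theta[idx], y[idx])

inclusion = (coefs != 0).mean(axis=0)      # inclusion probability per library term
xi_bag = coefs.mean(axis=0)                # bagging: aggregate by the mean
xi_brag = np.median(coefs, axis=0)         # bragging: aggregate by the median
xi_bag[inclusion < 0.6] = 0.0              # threshold low-inclusion terms
```

The inclusion probabilities are simply the fraction of ensemble members in which a term survived thresholding; the 0.6 cutoff here is an illustrative choice.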
Figure 2

Schematic of SINDy and E-SINDy with b(r)agging and library bagging. Shown is a single model of the ensemble. In the case of b(r)agging, data bootstraps (data samples with replacement) are used to generate an ensemble of SINDy models. The E-SINDy model is aggregated by taking the mean of the identified coefficients for bagging, and the median for bragging. In case of library bagging, instead of data bootstraps, library term bootstraps are sampled without replacement. Library terms with low inclusion probability are discarded and the E-SINDy model can be identified on the smaller library using standard SINDy or b(r)agging E-SINDy. (Online version in colour.)

The second method proposed, library bagging E-SINDy, samples library terms instead of data pairs. We sample $p_s$ out of $p$ library terms without replacement. When sampling library terms, replacement does not affect the sparse regression problem. However, using smaller libraries can drastically speed up model identification, as the complexity of the least-squares algorithm scales as $O(qp^2)$ in the number of library terms $p$. Library bagging with small $p_s$ can therefore help counteract the increased computational cost of solving multiple regression problems in the ensemble. As with bagging E-SINDy, we obtain an ensemble of models and model coefficient inclusion probabilities. We can directly aggregate the models and threshold coefficients with low inclusion probabilities to get a library E-SINDy model. We can also use the inclusion probabilities to threshold the library, keeping only relevant terms, and run bagging E-SINDy using the smaller library. This can be particularly useful if we start with a large library: we first identify and remove all library terms that are clearly not relevant and then run bagging E-SINDy on the smaller library. However, the library bagging inclusion probability threshold needs to be selected carefully so as not to remove relevant terms from the library. We show pseudo code for library bagging E-SINDy in algorithm 3. E-SINDy provides inclusion probabilities and uncertainty estimates for the discovered model coefficients, thus connecting to Bayesian model identification techniques. The identified ensemble of model coefficients can be used to compute coefficient probability density functions, which form a posterior distribution over models. In terms of forecasting, we can either use the aggregated mean or median of the identified coefficients to forecast, or we can draw from multiple identified SINDy models to generate ensemble forecasts that represent posterior predictive distributions and provide prediction confidence intervals.

Results

We now apply E-SINDy to challenging synthetic and real-world datasets to identify ODEs and PDEs. We apply library bagging E-SINDy to a real-world ecological dataset, showing its performance in the very low data limit. For PDEs, we use the recent weak-SINDy (WSINDy) [54] as a baseline and show the improved noise robustness when using E-SINDy for identifying a range of PDEs. Trends for the noise and data length sensitivity of bagging, bragging and library bagging to identify the chaotic Lorenz system dynamics are presented in appendix A.

ODEs

We apply E-SINDy to a challenging real-world dataset from the Hudson Bay Company, which consists of the yearly number of lynx and hare pelts collected from 1900 to 1920. These pelt counts are thought to be roughly proportional to the populations of the two species [79]. Lynx are predators whose diet depends on hares. The population dynamics of the two species should, therefore, be well approximated by a Lotka–Volterra model. There are several challenges in identifying a SINDy model from this dataset: there are only 21 data points available, and there is large uncertainty in the measurements arising from weather variability, consistency in trapping and other factors that changed over the years measured. In figure 3, we show that E-SINDy correctly identifies the Lotka–Volterra dynamics, providing model coefficients, inclusion probabilities and confidence intervals for the reconstructed dynamics. We use library bagging, followed by bagging with a library of polynomials up to third order, to identify a sparse model in this very low data limit with only 21 data points per species. Similar results for the lynx–hare dataset were recently published using a probabilistic model discovery method [60] based on sparse Bayesian inference. That approach employed Markov Chain Monte Carlo, for which the computational effort to generate a probabilistic model is comparably high, taking several hours of CPU time. By contrast, E-SINDy takes only seconds to identify a model together with its coefficient distributions and inclusion probabilities.
Figure 3

Library bagging E-SINDy (LB-SINDy) on real data: data consisting of measurements by the Hudson Bay Company of lynx and hare pelts from 1900 to 1920. (a) Uncertainty in the identified model coefficients, (b) inclusion probabilities of the model coefficients (with 65% threshold) and (c) model reconstruction. LB-SINDy (continuous lines) uses the mean value of the identified coefficients for reconstruction, and the 95% confidence interval depicts ensemble reconstruction, drawing five models and averaging the coefficients for 1000 realizations. (Online version in colour.)

PDEs

In this section, we present results applying E-SINDy to discover PDEs from noisy data. We use the recent WSINDy implementation [54] as the baseline model for ensembling. WSINDy was successfully applied to identify models in the high-noise regime using large libraries. We perform library bagging on the system of equation (2.8) instead of equation (2.6), and refer to the resulting method as ensemble weak SINDy (E-WSINDy). We apply E-WSINDy to identify PDEs from synthetic data for the inviscid Burgers, Korteweg–de Vries, nonlinear Schrödinger, Kuramoto–Sivashinsky and reaction–diffusion equations. Details on the numerical methods for creating the data are discussed in appendix B and in our E-SINDy data and code repository. We quantify the accuracy and robustness of the model identification by assessing the success rate and model coefficient errors over a number of noise realizations. The success rate is defined as the rate of identifying the correct non-zero and zero terms in the library, averaged over all realizations. The model coefficient error quantifies how much the identified coefficients $\hat{\boldsymbol{\xi}}$ deviate from the true parameters $\boldsymbol{\xi}$ used to generate the data, measured by the normalized error $\|\hat{\boldsymbol{\xi}} - \boldsymbol{\xi}\|_2 / \|\boldsymbol{\xi}\|_2$. The results are summarized in figure 4. For all PDEs, E-WSINDy reduces the model coefficient error and increases the success rate of the model discovery. Moreover, E-WSINDy can accurately identify the correct model structure for the reaction–diffusion case, where WSINDy falsely identifies a linear oscillator model instead of the nonlinear reaction–diffusion model. To investigate the limits of E-WSINDy, we further increase the noise level for each case up to the point where the success rate drops below 90%. On average, for all investigated PDEs, we find that ensembling improves the noise robustness of WSINDy by a factor of 2.3. We conclude that ensembling significantly improves model discovery robustness and enables the identification of PDEs in the extreme noise limit.
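The two evaluation metrics can be written down directly. In this sketch, the coefficient-error normalization is an assumption (a standard relative $\ell_2$ error; the paper's exact definition may differ):

```python
import numpy as np

def coefficient_error(xi_true, xi_est):
    """Normalized deviation of identified from true coefficients
    (relative l2 error; one common choice of normalization)."""
    return np.linalg.norm(xi_est - xi_true) / np.linalg.norm(xi_true)

def success(xi_true, xi_est):
    """1 if the identified zero/non-zero sparsity pattern matches the truth."""
    return int(np.array_equal(xi_true != 0, xi_est != 0))

xi_true = np.array([0.0, 1.0, -1.0, 0.0])
xi_est = np.array([0.0, 0.9, -1.1, 0.0])
err = coefficient_error(xi_true, xi_est)   # 0.1
ok = success(xi_true, xi_est)              # 1: correct structure, inexact values
```

The success rate reported in figure 4 is the average of `success` over noise realizations; note that a model can have a perfect success score while still carrying coefficient error.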
Figure 4

Comparison of model error and success rate of discovered PDEs using weak-SINDy and ensemble weak-SINDy. Ensembling robustifies and improves the accuracy of PDE identification. (Online version in colour.)

Exploiting ensemble statistics for active learning

We now present results exploiting the ensemble statistics for active learning [80,81]. Active learning is a machine learning method that can reduce training data size while maintaining accuracy by actively exploring regions of the feature space that maximally inform the learning process [82,83]. This can be particularly effective for systems with large feature spaces that are expensive to explore, such as in biological systems or high-dimensional systems with control. In biological systems, collecting samples can be time consuming and expensive, but it may be possible to initialize specific new initial conditions of the system. Active learning can inform the selection of these initial conditions for improved data efficiency of the learning process and model discovery. For control problems, similarly, exploration of large feature spaces may be expensive, such as repeatedly performing robotics experiments. Active learning can enable data-efficient exploration by the guided collection of relevant and descriptive data that optimally supports the model discovery process of the controlled robotic system. Here, we leverage the ensemble statistics of E-SINDy to identify and sample high-uncertainty regions of phase space that maximally inform the sparse regression. In E-SINDy, we collect data from a single initial condition or from multiple randomly selected initial conditions and identify a model in one shot. Instead, we can successively identify E-SINDy models and exploit their ensemble statistics to identify new initial conditions with high information content, which can improve the data efficiency of the model discovery process. The basic idea is to compute ensemble forecasts from a large set of initial conditions using successively improved E-SINDy models and only explore regions with high ensemble forecast variance. 
Our simple but effective active E-SINDy approach iteratively identifies models in three steps: (1) collecting a small amount of randomly selected data to identify an initial E-SINDy model; (2) selecting a number of random initial conditions and computing the ensemble forecast variance for each initial condition using the current E-SINDy model; and (3) sampling the true system with the initial condition with highest variance. Finally, we concatenate the newly explored data to the existing dataset to identify a new E-SINDy model, and continue the model identification until the model accuracy and/or variance of the identified model coefficients converge. Here, we test active E-SINDy on the Lorenz system dynamics introduced in figure 1 and appendix A and show the results in figure 5. In figure 5a, we show five illustrative ensemble forecasts from different initial conditions after initializing the algorithm. In total, at each iteration, we compute ensemble forecasts at 200 different initial conditions. We found that at each initial condition, integrating a single time step forward in time is informative enough to compute ensemble forecasting variance. In figure 5b, we show the probability density functions of the identified model coefficients at initialization have wide distributions, and after 80 active learning steps the variance of the distributions is significantly reduced. Figure 5c also shows the improved data efficiency of the model discovery using active learning E-SINDy compared with E-SINDy. Through active E-SINDy, we reduce the variance of the identified model coefficients, increase the success rate of identifying the correct model structure and reduce the model coefficient error compared with standard E-SINDy.
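Step (2) of the loop above, scoring candidate initial conditions by ensemble forecast variance, can be sketched with a toy ensemble. Here samples of a single coefficient $a$ in $\dot{x} = ax$ stand in for an ensemble of E-SINDy models, and a single forward step per member is enough to rank the candidates, as noted above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy ensemble: posterior samples of the coefficient a in dx/dt = a*x,
# standing in for an ensemble of identified E-SINDy models.
a_ensemble = rng.normal(-1.0, 0.3, size=50)

dt = 0.1
candidates = rng.uniform(-5.0, 5.0, size=200)   # randomly selected initial conditions

# One Euler forecast step per ensemble member at every candidate IC; the spread
# of the ensemble forecast measures how much the models disagree there.
forecasts = candidates[None, :] + dt * a_ensemble[:, None] * candidates[None, :]
variance = forecasts.var(axis=0)
best_ic = candidates[np.argmax(variance)]       # explore where models disagree most
```

For this toy model the variance grows with $|x|$, so the selected initial condition is the candidate of largest magnitude; for a real E-SINDy ensemble, the high-variance regions reflect where the library terms are least constrained by the data collected so far.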
Figure 5

Exploiting ensemble statistics for active learning. Active E-SINDy randomly selects a number of initial conditions (IC), computes the ensemble forecast variance at each IC and explores the IC with highest variance. (a) Ensemble forecasts from different ICs. (b) Reduced variance of identified model coefficients after several active learning steps. (c) Improved data efficiency and accuracy of model discovery using active learning. (Online version in colour.)

Ensemble sparse identification of nonlinear dynamics model predictive control

It is also possible to use E-SINDy to improve model predictive control (MPC) [84-86]. MPC is a particularly compelling approach that enables control of strongly nonlinear systems with constraints, multiple operating conditions and time delays. The major challenge of MPC lies in the development of a suitable model. Deep neural network models have been increasingly used for deep MPC [87,88]; however, they often rely on access to massive datasets, have limited ability to generalize, do not readily incorporate known physical constraints and are computationally expensive. Kaiser et al. [46] showed that sparse models obtained via SINDy perform nearly as well with MPC, and may be trained with relatively limited data compared to a neural network. Here, we show that E-SINDy can further reduce the training data requirements compared to SINDy, enabling the control of nonlinear systems in the very low data limit. We use E-SINDy to identify a model of the forced Lorenz system dynamics and use MPC to stabilize one of the two unstable fixed points $(x, y, z) = (\pm\sqrt{\beta(\rho - 1)}, \pm\sqrt{\beta(\rho - 1)}, \rho - 1)$. The Lorenz system is introduced in figure 1, and we add a control input $u$ to the first state of the dynamics: $\dot{x} = \sigma(y - x) + u$. The control problem is based on Kaiser et al. [46]. We describe the MPC problem in more detail in appendix C. In figure 6, we show the performance of MPC based on E-SINDy models for different training data lengths. In figure 6a, we show the sensitivity of the mean MPC cost to training data length. We run 1000 noise realizations and average the mean MPC cost over all runs. Figure 6b shows trajectories of the controlled Lorenz system for models trained with E-SINDy and SINDy, using 50 and 150 time-step data points. E-SINDy significantly improves the MPC performance in the low data limit compared with SINDy.
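The receding-horizon structure of MPC can be illustrated in miniature. The sketch below is not the Lorenz MPC of appendix C: it stabilizes a scalar unstable system $\dot{x} = x + u$ at the origin, and a grid search over constant candidate inputs stands in for the MPC optimizer; the horizon, step size and cost weights are illustrative assumptions:

```python
import numpy as np

# Minimal receding-horizon control on the unstable scalar system
# x_{k+1} = x_k + dt*(x_k + u_k): at each step, pick the input minimizing
# a short-horizon quadratic cost, apply only its first action, and re-plan.
dt, horizon, r = 0.1, 5, 1e-3
u_grid = np.linspace(-10.0, 10.0, 201)   # candidate (constant) input sequences

def rollout_cost(x0, u):
    """Cost of applying the constant input u over the prediction horizon."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        x = x + dt * (x + u)
        cost += x**2 + r * u**2          # penalize state deviation and effort
    return cost

x = 2.0
for _ in range(30):
    costs = [rollout_cost(x, u) for u in u_grid]
    u_star = u_grid[int(np.argmin(costs))]   # first action of the best plan
    x = x + dt * (x + u_star)                # apply it to the "true" system
```

Replacing the grid search with a proper optimizer and the scalar model with an (E-)SINDy model of the controlled dynamics recovers the structure of the SINDy-MPC framework of Kaiser et al. [46].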
Figure 6

System identification in the low-data limit for model predictive control (MPC). (a) Average MPC cost versus the number of time steps used for training with E-SINDy (blue continuous line) and SINDy (red dashed line). Controlled trajectory coordinates x, y, z of the Lorenz system with MPC input u, for models trained with E-SINDy (blue continuous line) and SINDy (red dashed line) using (b) 50 and (c) 150 time-step data points. (Online version in colour.)

Discussion

This work has developed and demonstrated a robust variant of the SINDy algorithm based on ensembling. The proposed E-SINDy algorithm significantly improves the robustness and accuracy of SINDy for model discovery, reducing data requirements and increasing noise tolerance. E-SINDy exploits foundational statistical methods, such as bootstrap aggregating, to identify ensembles of ODEs and PDEs that govern the dynamics from noisy data. From this ensemble of models, aggregate model statistics are used to generate inclusion probabilities of candidate functions, which promotes interpretability in model selection and provides probabilistic forecasts.

We show that ensembling may be used to improve several standard SINDy variants, including the integral formulation for PDEs. Combining ensembling with the integral formulation of SINDy enables the identification of PDE models from data with more than twice as much measurement noise as has been previously reported. These results are promising for the discovery of governing equations for complex systems in neuroscience, power grids, epidemiology, finance and ecology, where governing equations have remained elusive.

Importantly, the computational effort to generate probabilistic models using E-SINDy is low: E-SINDy produces accurate probabilistic models in seconds, compared with existing Bayesian inference methods that take several hours. Library bagging has the additional advantage of making the least-squares computation more efficient by sampling only small subsets of the library. E-SINDy has also been incorporated into the open-source PySINDy package [62,63] to promote reproducible research.

We also present results exploiting the ensemble statistics for active learning and control. Recent active exploration methods [89] and active learning of nonlinear system identification [90] suggest exploration techniques that use trajectory planning to efficiently explore high-uncertainty regions of the feature space. We use the uncertainty estimates of E-SINDy to explore high-uncertainty regions that maximally inform the learning process. Active E-SINDy reduces the variance of the identified model coefficients, increases the success rate of identifying the correct model structure and reduces the model coefficient error compared with standard E-SINDy in the low-data limit.

Finally, we apply E-SINDy to improve nonlinear MPC. SINDy was recently used to generate models for real-time MPC of nonlinear systems, and we show that E-SINDy can significantly reduce the training data required to identify a model, thus enabling control of nonlinear systems with constraints in the very low-data limit. An exciting future extension of this computationally efficient probabilistic model discovery is to combine the active learning and MPC strategies based on E-SINDy. An important avenue of future work may also explore active sampling that is constrained by the physical limitations of a given simulation or experiment. Highly efficient exploration and identification of nonlinear models may also enable learning task-agnostic models that are fundamental components of model-based reinforcement learning.
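The bagging pipeline summarized in the discussion (bootstrap resamples → ensemble of sparse fits → inclusion probabilities and an aggregate model) can be sketched with plain NumPy. The STLSQ solver below is the textbook SINDy baseline; the threshold, ensemble size and aggregation by the median are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares (the basic SINDy solver).
    theta: (n_samples, n_features) library matrix; dxdt: (n_samples, n_states)."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for j in range(dxdt.shape[1]):
            big = ~small[:, j]
            if big.any():                      # refit only the surviving terms
                xi[big, j] = np.linalg.lstsq(theta[:, big], dxdt[:, j],
                                             rcond=None)[0]
    return xi

def ensemble_sindy(theta, dxdt, n_models=100, threshold=0.1, rng=None):
    """Bagging: fit SINDy on bootstrap resamples of the time samples, then
    aggregate into term inclusion probabilities and a median model."""
    rng = np.random.default_rng() if rng is None else rng
    n = theta.shape[0]
    xis = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)       # resample rows with replacement
        xis.append(stlsq(theta[idx], dxdt[idx], threshold))
    xis = np.stack(xis)
    inclusion = (xis != 0).mean(axis=0)        # P(term is active) per entry
    return np.median(xis, axis=0), inclusion
```

On noisy data generated from ẋ = −2x with a library [1, x, x²], the x term receives an inclusion probability near one while the spurious terms are pruned in almost every bootstrap fit, which is the mechanism behind the interpretability and probabilistic forecasts described above.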
Table 1

Spatial and temporal discretizations for different PDEs.

PDE                   | n       | m   | x-domain              | t-domain
inviscid Burgers      | 256     | 256 | [−4000, 4000]         | [0, 4]
Korteweg–de Vries     | 400     | 601 | [−π, π]               | [0, 0.006]
nonlinear Schrödinger | 256     | 251 | [−5, 5]               | [0, π]
Kuramoto–Sivashinsky  | 256     | 301 | [0, 100]              | [0, 150]
reaction–diffusion    | 256×256 | 201 | [−10, 10] × [−10, 10] | [0, 4]
(n: number of spatial grid points; m: number of time steps.)
References (30 in total; first 10 shown):

1.  Nonlinear Laplacian spectral analysis for time series with intermittency and low-frequency variability.

Authors:  Dimitrios Giannakis; Andrew J Majda
Journal:  Proc Natl Acad Sci U S A       Date:  2012-01-17       Impact factor: 11.205

2.  Active learning from stream data using optimal weight classifier ensemble.

Authors:  Xingquan Zhu; Peng Zhang; Xiaodong Lin; Yong Shi
Journal:  IEEE Trans Syst Man Cybern B Cybern       Date:  2010-04-01

3.  Bayesian differential programming for robust systems identification under uncertainty.

Authors:  Yibo Yang; Mohamed Aziz Bhouri; Paris Perdikaris
Journal:  Proc Math Phys Eng Sci       Date:  2020-11-25       Impact factor: 2.704

4.  Learning physically consistent differential equation models from data using group sparsity.

Authors:  Suryanarayana Maddu; Bevan L Cheeseman; Christian L Müller; Ivo F Sbalzarini
Journal:  Phys Rev E       Date:  2021-04       Impact factor: 2.529

5.  Learning partial differential equations via data discovery and sparse optimization.

Authors:  Hayden Schaeffer
Journal:  Proc Math Phys Eng Sci       Date:  2017-01       Impact factor: 2.704

6.  Sparse model selection via integral terms.

Authors:  Hayden Schaeffer; Scott G McCalla
Journal:  Phys Rev E       Date:  2017-08-02       Impact factor: 2.529

7.  Nonlinear stochastic modelling with Langevin regression.

Authors:  J L Callaham; J-C Loiseau; G Rigas; S L Brunton
Journal:  Proc Math Phys Eng Sci       Date:  2021-06-02       Impact factor: 2.704

8.  Weak SINDy for partial differential equations.

Authors:  Daniel A Messenger; David M Bortz
Journal:  J Comput Phys       Date:  2021-06-23       Impact factor: 4.645

9.  Efficient inference of parsimonious phenomenological models of cellular dynamics using S-systems and alternating regression.

Authors:  Bryan C Daniels; Ilya Nemenman
Journal:  PLoS One       Date:  2015-03-25       Impact factor: 3.240

10.  Robust learning from noisy, incomplete, high-dimensional experimental data via physically constrained symbolic regression.

Authors:  Patrick A K Reinbold; Logan M Kageorge; Michael F Schatz; Roman O Grigoriev
Journal:  Nat Commun       Date:  2021-05-28       Impact factor: 14.919

Related articles (3 in total):

1.  Data-driven discovery of the governing equations of dynamical systems via moving horizon optimization.

Authors:  Fernando Lejarza; Michael Baldea
Journal:  Sci Rep       Date:  2022-07-12       Impact factor: 4.996

2.  Data-driven model discovery of ideal four-wave mixing in nonlinear fibre optics.

Authors:  Andrei V Ermolaev; Anastasiia Sheveleva; Goëry Genty; Christophe Finot; John M Dudley
Journal:  Sci Rep       Date:  2022-07-26       Impact factor: 4.996

3.  An improved sparse identification of nonlinear dynamics with Akaike information criterion and group sparsity.

Authors:  Xin Dong; Yu-Long Bai; Yani Lu; Manhong Fan
Journal:  Nonlinear Dyn       Date:  2022-10-11       Impact factor: 5.741


Beijing Coyote Bioscience Co., Ltd. © 2022-2023.