Nan Chen
Department of Mathematics, University of Wisconsin-Madison, 480 Lincoln Dr., Madison, WI 53706, USA.
Abstract
Predicting complex nonlinear turbulent dynamical systems is an important and practical topic. However, due to the lack of a complete understanding of nature, the ubiquitous model error may greatly affect the prediction performance. Machine learning algorithms can overcome the model error, but they are often impeded by inadequate and partial observations in predicting nature. In this article, an efficient and dynamically consistent conditional sampling algorithm is developed, which incorporates the conditional path-wise temporal dependence into a two-step forward-backward data assimilation procedure to sample multiple distinct nonlinear time series conditioned on short and partial observations using an imperfect model. The resulting sampled trajectories succeed in reducing the model error and greatly enrich the training data set for machine learning forecasts. For a rich class of nonlinear and non-Gaussian systems, the conditional sampling is carried out by solving a simple stochastic differential equation, which is computationally efficient and accurate. The sampling algorithm is applied to create massive training data of multiscale compressible shallow water flows from highly nonlinear and indirect observations. The resulting machine learning prediction significantly outperforms the imperfect model forecast. The sampling algorithm also facilitates the machine learning forecast of a highly non-Gaussian climate phenomenon using extremely short observations.
Keywords:
conditional sampling; data assimilation; machine learning; model error; non-Gaussian systems; short and partial observations
Predicting complex nonlinear turbulent dynamical systems is an important and practical topic. In many areas, including engineering, geophysics, and climate science, dynamical or statistical models are widely used for prediction. However, due to the lack of a complete understanding of nature, the ubiquitous model error may greatly affect the prediction performance. On the other hand, machine learning forecasts based on the properties learned from massive training data possess a large potential to overcome the model error, and they have become the predominant forecast methods in quite a few industrial and engineering problems [1,2,3]. However, the prediction of many climate and geophysical phenomena using machine learning algorithms can still be very challenging. This is because high-resolution satellites and other refined measurements were not widely developed until recent times. As a result, the available training data are very limited, covering only about 40 years, which is far from enough to train the machine learning algorithms for predicting interannual or decadal variability. Another fundamental difficulty in utilizing machine learning to predict nature is that only partial observations are available in most applications. In other words, the typical observations involve merely a small subset of the variables, and sometimes there are even no direct observations of any variables of interest. Reconstructing the training data of the unobserved variables is quite difficult without the aid of a suitable model. Since the data in climate science, geophysics, and many other complex nonlinear systems are spatiotemporally correlated and intrinsically chaotic, the traditional data augmentation methods [4,5,6] for static data are not applicable to expanding the training data set of these problems.
On the other hand, thanks to the development of many physics-based dynamical models in describing nature, long correlated time series from these models have been used for training the machine learning algorithms [7,8]. The resulting machine learning prediction can often outperform the simple model forecast. However, such an enhancement of prediction accuracy mainly comes from the improvement of the forecast methodology, whereas model error remains in the training data. Thus, incorporating the available short observations into the imperfect model is essential to improving the quality of the training data.

Data assimilation is a well-known technique that combines observations with an imperfect model to improve the statistical state estimation [9,10,11,12]. The resulting posterior distribution is typically more informative and accurate than the model's prior distribution. Data assimilation also plays an important role in recovering the states of the unobserved variables via their dynamical coupling with the observed ones in the model.

Despite the idea of combining partial observations with an imperfect model to improve the quality of the training data for machine learning, there remain two key issues to be resolved. First, data assimilation only provides a statistical state estimation at each time instant. The more desirable format of training data is a time series that describes the spatiotemporal evolution of the underlying dynamics. In practice, the time series obtained by collecting all the posterior mean estimates is often used as an approximation of the true signal [13]. However, the posterior mean time series can be biased in reproducing even the basic dynamical properties of nature (see Appendix A).
Second, since the length of the assimilated states is consistent with that of the short observations, the insufficiency of the training data remains an unsolved problem.

In this article, an efficient and dynamically consistent conditional sampling algorithm for nonlinear time series is developed that resolves the two issues discussed above. The sampling algorithm starts with a forward and a backward data assimilation procedure to characterize the correct uncertainty of the posterior estimates. Then, the path-wise temporal dependence conditioned on the observations is combined with the point-wise posterior estimates to develop a recursive sampling scheme. For a rich class of nonlinear and non-Gaussian systems, the conditional sampling is carried out by solving a simple stochastic differential equation (SDE), where the short and partial observations serve as the implicit input. This allows an extremely efficient way to sample the nonlinear time series even in high dimensions. One crucial feature of the resulting sampled trajectories is that they outperform the posterior mean time series in that they succeed in reproducing the exact dynamical and statistical characteristics of nature in the perfect model setup. In the presence of an imperfect model, the model error can be significantly mitigated in the sampled trajectories due to the extra information from observations. What remains is to cope with the short observations. Here, the posterior uncertainties play a vital role in generating multiple distinct sampled trajectories conditioned on the same short observations, which effectively provide a large training set. In addition, this conditional sampling technique allows for sampling the trajectories of the unobserved variables. All these features facilitate the machine learning training and advance the effective prediction.

The rest of the article is organized as follows. The efficient and dynamically consistent conditional sampling algorithm is developed in Section 2.
The prediction schemes are described in Section 3. Section 4 aims at improving the prediction of multiscale compressible shallow water flows using indirect observations. Section 5 focuses on advancing the forecast of a highly non-Gaussian climate phenomenon using extremely short observations. The article is concluded in Section 6.
2. The Nonlinear Models and the Conditional Sampling Algorithm
The focus here is on a rich class of high-dimensional nonlinear and non-Gaussian turbulent models, the structure of which allows using closed analytic formulae to develop an efficient and dynamically consistent conditional sampling algorithm. The procedure of deriving such a conditional sampling technique can be extended to general nonlinear systems using particle methods.
2.1. The Nonlinear Modeling Framework
Consider the following class of nonlinear and non-Gaussian systems [14,15],
$$\begin{aligned}
\mathrm{d}\mathbf{u}_{\mathrm{I}} &= \left[\mathbf{A}_0(t,\mathbf{u}_{\mathrm{I}}) + \mathbf{A}_1(t,\mathbf{u}_{\mathrm{I}})\,\mathbf{u}_{\mathrm{II}}\right]\mathrm{d}t + \boldsymbol{\Sigma}_{\mathrm{I}}(t,\mathbf{u}_{\mathrm{I}})\,\mathrm{d}\mathbf{W}_{\mathrm{I}},\\
\mathrm{d}\mathbf{u}_{\mathrm{II}} &= \left[\mathbf{a}_0(t,\mathbf{u}_{\mathrm{I}}) + \mathbf{a}_1(t,\mathbf{u}_{\mathrm{I}})\,\mathbf{u}_{\mathrm{II}}\right]\mathrm{d}t + \boldsymbol{\Sigma}_{\mathrm{II}}(t,\mathbf{u}_{\mathrm{I}})\,\mathrm{d}\mathbf{W}_{\mathrm{II}},
\end{aligned}\tag{1}$$
where $\mathbf{u}_{\mathrm{I}}$ and $\mathbf{u}_{\mathrm{II}}$ are both multi-dimensional state variables. In (1), $\mathbf{A}_0$, $\mathbf{a}_0$, $\mathbf{A}_1$, $\mathbf{a}_1$, $\boldsymbol{\Sigma}_{\mathrm{I}}$, and $\boldsymbol{\Sigma}_{\mathrm{II}}$ are vectors and matrices that depend nonlinearly on the state variable $\mathbf{u}_{\mathrm{I}}$ and time t, while $\mathbf{W}_{\mathrm{I}}$ and $\mathbf{W}_{\mathrm{II}}$ are independent white noise sources. For nonlinear systems with partial observations, $\mathbf{u}_{\mathrm{I}}$ can be regarded as the collection of the observed variables, while $\mathbf{u}_{\mathrm{II}}$ contains the variables that are not directly observed.

The systems in (1) are called conditional Gaussian nonlinear systems due to the fact that, conditioned on a given realization of $\mathbf{u}_{\mathrm{I}}(s)$ for $s\le t$, the distribution of $\mathbf{u}_{\mathrm{II}}(t)$ is Gaussian. Despite the conditional Gaussianity, both the joint and marginal distributions can be highly non-Gaussian. Extreme events, intermittency, and complex nonlinear interactions between different variables all appear in such systems. The framework includes many physics-constrained nonlinear stochastic models, large-scale dynamical models in turbulence, fluids, and geophysical flows, as well as stochastically coupled reaction–diffusion models in neuroscience and ecology. A gallery of examples of conditional Gaussian systems in engineering, neuroscience, ecology, fluids, and geophysical flows can be found in [15]. Some well-known dynamical systems that belong to this framework are the noisy versions of various Lorenz systems, a variety of the stochastically coupled FitzHugh–Nagumo models, and the Boussinesq equations with noise.
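As a concrete instance of this family, consider a noisy Lorenz-63 system with x observed and (y, z) hidden: for a fixed realization of x, the drifts of y and z are linear in (y, z), so the system is conditional Gaussian. The Euler–Maruyama sketch below simulates such a system; the noise amplitudes are illustrative assumptions, not values taken from this article.

```python
import numpy as np

# Noisy Lorenz-63: with u_I = x observed and u_II = (y, z) hidden, the drift
# of each equation is linear in (y, z) for a given realization of x, so the
# system is conditional Gaussian. Classical Lorenz parameters; the noise
# levels are illustrative choices.
sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
noise = np.array([1.0, 1.0, 1.0])
dt, n_steps = 1e-3, 50_000
rng = np.random.default_rng(0)

state = np.array([1.0, 1.0, 25.0])
traj = np.empty((n_steps, 3))
for i in range(n_steps):
    x, y, z = state
    drift = np.array([sigma * (y - x),    # A_0 = -sigma*x, A_1 acts on y
                      x * (rho - z) - y,  # linear in (y, z) given x
                      x * y - beta * z])  # linear in (y, z) given x
    state = state + drift * dt + noise * np.sqrt(dt) * rng.standard_normal(3)
    traj[i] = state
```

Repeating the simulation with different seeds produces the kind of chaotic, intermittent trajectories targeted by the sampling algorithm developed below.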
2.2. The Nonlinear Data Assimilation
Data assimilation or filtering aims at solving the conditional (or posterior) distribution $p(\mathbf{u}_{\mathrm{II}}(t)\,|\,\mathbf{u}_{\mathrm{I}}(s), s\le t)$. One advantage of the conditional Gaussian nonlinear systems (1) is that such a conditional distribution can be written down using closed analytic formulae.

(Nonlinear Optimal Filter [14]). For the conditional Gaussian nonlinear systems (1), given one realization of $\mathbf{u}_{\mathrm{I}}(s)$ for $s\in[0,t]$, the filter posterior distribution is Gaussian,
$$p(\mathbf{u}_{\mathrm{II}}(t)\,|\,\mathbf{u}_{\mathrm{I}}(s), s\le t) \sim \mathcal{N}(\boldsymbol{\mu}_f(t), \mathbf{R}_f(t)),\tag{2}$$
where the filter mean $\boldsymbol{\mu}_f$ and the filter covariance $\mathbf{R}_f$ satisfy
$$\begin{aligned}
\mathrm{d}\boldsymbol{\mu}_f &= \left[\mathbf{a}_0 + \mathbf{a}_1\boldsymbol{\mu}_f\right]\mathrm{d}t + \mathbf{R}_f\mathbf{A}_1^{*}\left(\boldsymbol{\Sigma}_{\mathrm{I}}\boldsymbol{\Sigma}_{\mathrm{I}}^{*}\right)^{-1}\left[\mathrm{d}\mathbf{u}_{\mathrm{I}} - \left(\mathbf{A}_0 + \mathbf{A}_1\boldsymbol{\mu}_f\right)\mathrm{d}t\right],\\
\mathrm{d}\mathbf{R}_f &= \left[\mathbf{a}_1\mathbf{R}_f + \mathbf{R}_f\mathbf{a}_1^{*} + \boldsymbol{\Sigma}_{\mathrm{II}}\boldsymbol{\Sigma}_{\mathrm{II}}^{*} - \mathbf{R}_f\mathbf{A}_1^{*}\left(\boldsymbol{\Sigma}_{\mathrm{I}}\boldsymbol{\Sigma}_{\mathrm{I}}^{*}\right)^{-1}\mathbf{A}_1\mathbf{R}_f\right]\mathrm{d}t.
\end{aligned}$$
Here, the “optimality” is in the Bayesian sense. From now on, we assume that $\boldsymbol{\Sigma}_{\mathrm{I}}\boldsymbol{\Sigma}_{\mathrm{I}}^{*}$ is invertible, which is the case in most applications.
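To make the closed analytic filter concrete, the sketch below applies it to a hypothetical scalar model (an illustration, not one of this article's examples): the observed increments drive the posterior mean, and the posterior variance follows a Riccati equation.

```python
import numpy as np

# Hypothetical scalar conditional Gaussian model (illustration only):
#   du = (-u + v) dt + s1 dW1   (observed; A0 = -u, A1 = 1)
#   dv = -v dt      + s2 dW2   (hidden;   a0 = 0,  a1 = -1)
s1, s2 = 0.25, 0.5
dt, n = 1e-3, 200_000
rng = np.random.default_rng(1)

# One realization of the truth and the observations.
u = np.zeros(n); v = np.zeros(n)
for i in range(n - 1):
    u[i+1] = u[i] + (-u[i] + v[i]) * dt + s1 * np.sqrt(dt) * rng.standard_normal()
    v[i+1] = v[i] - v[i] * dt + s2 * np.sqrt(dt) * rng.standard_normal()

# Closed-form filter: posterior mean mu and posterior variance R.
mu = np.zeros(n); R = np.zeros(n); R[0] = s2**2 / 2   # equilibrium prior variance
for i in range(n - 1):
    du_obs = u[i+1] - u[i]
    gain = R[i] / s1**2                  # scalar version of R_f A1*(Sigma_I Sigma_I*)^-1
    mu[i+1] = mu[i] - mu[i] * dt + gain * (du_obs - (-u[i] + mu[i]) * dt)
    R[i+1] = R[i] + (-2 * R[i] + s2**2 - R[i]**2 / s1**2) * dt  # Riccati equation

rmse_filter = np.sqrt(np.mean((mu - v) ** 2))
rmse_clim = np.sqrt(np.mean(v ** 2))     # trivial climatological baseline
```

For this toy model the filter mean recovers the hidden variable with a clearly smaller error than the trivial climatological estimate, illustrating the role of the dynamical coupling between the observed and hidden components.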
2.3. The Optimal Conditional Sampling
The data assimilation exploits the observational information up to the current time instant for improving the initialization of real-time prediction. On the other hand, the optimal offline point-wise statistical state estimation can be carried out by making use of the observational information in the entire training period. This leads to a more accurate state estimation and is achieved by a forward pass of filtering and a backward pass of smoothing.

(Nonlinear Optimal Smoother [16]). For the conditional Gaussian nonlinear systems (1), given one realization of $\mathbf{u}_{\mathrm{I}}(s)$ for $s\in[0,T]$, the smoother posterior distribution is Gaussian,
$$p(\mathbf{u}_{\mathrm{II}}(t)\,|\,\mathbf{u}_{\mathrm{I}}(s), s\in[0,T]) \sim \mathcal{N}(\boldsymbol{\mu}_s(t), \mathbf{R}_s(t)),\tag{3}$$
where the smoother mean $\boldsymbol{\mu}_s$ and covariance $\mathbf{R}_s$ are solved backward in time, starting from the endpoint values $\boldsymbol{\mu}_s(T)=\boldsymbol{\mu}_f(T)$ and $\mathbf{R}_s(T)=\mathbf{R}_f(T)$:
$$\begin{aligned}
\overleftarrow{\mathrm{d}\boldsymbol{\mu}_s} &= -\left[\mathbf{a}_0 + \mathbf{a}_1\boldsymbol{\mu}_s + \left(\boldsymbol{\Sigma}_{\mathrm{II}}\boldsymbol{\Sigma}_{\mathrm{II}}^{*}\right)\mathbf{R}_f^{-1}\left(\boldsymbol{\mu}_s - \boldsymbol{\mu}_f\right)\right]\mathrm{d}t,\\
\overleftarrow{\mathrm{d}\mathbf{R}_s} &= -\left[\left(\mathbf{a}_1 + \boldsymbol{\Sigma}_{\mathrm{II}}\boldsymbol{\Sigma}_{\mathrm{II}}^{*}\mathbf{R}_f^{-1}\right)\mathbf{R}_s + \mathbf{R}_s\left(\mathbf{a}_1 + \boldsymbol{\Sigma}_{\mathrm{II}}\boldsymbol{\Sigma}_{\mathrm{II}}^{*}\mathbf{R}_f^{-1}\right)^{*} - \boldsymbol{\Sigma}_{\mathrm{II}}\boldsymbol{\Sigma}_{\mathrm{II}}^{*}\right]\mathrm{d}t.
\end{aligned}$$
The backward notations on the left-hand side of (3) are understood as $\overleftarrow{\mathrm{d}\boldsymbol{\mu}_s}(t) = \boldsymbol{\mu}_s(t-\mathrm{d}t) - \boldsymbol{\mu}_s(t)$, and similarly for $\overleftarrow{\mathrm{d}\mathbf{R}_s}$.

The nonlinear smoother provides the optimal statistical state estimate at each time instant t. However, merely using the information from the smoother posterior distributions (3) is not sufficient to draw unbiased model trajectories conditioned on the observations. This is because, in addition to the point-wise statistical state estimates (namely, at each fixed time instant) from the smoother (3), the conditional temporal dependence at nearby time instants also needs to be taken into account in creating these conditionally sampled trajectories. In practice, the posterior mean time series is often used as a surrogate of the recovered model trajectory conditioned on the observations. However, the posterior uncertainty and its temporal correlation are completely ignored in such an approximation. The consequence is that the posterior mean time series fails to capture many key dynamical and statistical features of the underlying dynamics, such as the temporal autocorrelation function (ACF) and the PDF (see Appendix A). Note that a naive way of involving the posterior uncertainty in sampling hidden trajectories conditioned on the observations is to draw independent random numbers from the posterior distributions at different time instants and then connect them together. However, such an approach not only leads to a noisy time series but also fails to capture the underlying dynamical features, because the temporal correlation of the uncertainty is not incorporated.

The result presented below exploits both the conditional path-wise temporal dependence and the point-wise optimal state estimate from the nonlinear smoother (3) to build an optimal conditional sampling algorithm. For the nonlinear systems (1), the conditional sampling can be carried out in an extremely efficient way by solving a simple SDE.
(Conditional Sampling Formula). For the conditional Gaussian nonlinear systems (1), conditioned on one realization of $\mathbf{u}_{\mathrm{I}}(s)$ for $s\in[0,T]$, a sampled trajectory of $\mathbf{u}_{\mathrm{II}}$ is solved backward in time, starting from $\mathbf{u}_{\mathrm{II}}(T)\sim\mathcal{N}(\boldsymbol{\mu}_f(T),\mathbf{R}_f(T))$, via the SDE
$$\overleftarrow{\mathrm{d}\mathbf{u}_{\mathrm{II}}} = \overleftarrow{\mathrm{d}\boldsymbol{\mu}_s} - \left(\mathbf{a}_1 + \boldsymbol{\Sigma}_{\mathrm{II}}\boldsymbol{\Sigma}_{\mathrm{II}}^{*}\mathbf{R}_f^{-1}\right)\left(\mathbf{u}_{\mathrm{II}} - \boldsymbol{\mu}_s\right)\mathrm{d}t + \boldsymbol{\Sigma}_{\mathrm{II}}\,\mathrm{d}\mathbf{W}_{\mathrm{II}},\tag{4}$$
where $\mathbf{W}_{\mathrm{II}}$ is an independent Wiener process.

See Appendix D for the proof of this Theorem. The mathematical conciseness of (4) indicates that the sampled trajectories meander around the smoother mean time series. Meanwhile, the second term on the right-hand side introduces a temporally correlated uncertainty into the sampled trajectories. Such temporally correlated uncertainty plays a crucial role in allowing the sampled trajectories to perfectly reproduce the path-wise and statistical features of nature in the absence of model error, a property that the posterior mean time series lacks. The uncertainty also facilitates the sampling of multiple distinct trajectories conditioned on the same short observational time series, which effectively provide a large training set for the machine learning algorithms. Notably, in the presence of an imperfect model, the model error can be significantly mitigated in the sampled trajectories due to the extra information from observations. The machine learning prediction based on these sampled trajectories is thus expected to outperform the model forecast using the imperfect model.

Thanks to the explicit formula in (4), the conditional sampling procedure is computationally much cheaper than the traditional particle methods. In Appendix A, an alternative conditional sampling formula is included. It requires only the filter estimate, which further reduces the computational cost. In addition, a block decomposition technique [17] can be adopted here for efficiently sampling many high-dimensional systems.

Finally, Figure 1 shows a schematic illustration of the entire procedure. Here, the nonlinear smoother is crucial for the state estimation. This is because the smoother improves the filter estimates by exploiting the entire observational information and leads to an unbiased conditional state estimation.
Utilizing the point-wise state estimates and the temporal dependence to derive the formula of the path-wise conditional sampling is another key step that fundamentally differs from the state estimation using data assimilation.
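The entire forward-backward procedure can be sketched on a hypothetical scalar model (an illustration, not one of this article's examples); the scalar reductions of the filter, smoother, and sampling formulae are used, and repeating step 3 with fresh noise yields multiple distinct trajectories conditioned on the same observations.

```python
import numpy as np

# Hypothetical scalar conditional Gaussian model (illustration only):
#   du = (-u + v) dt + s1 dW1   (observed; A0 = -u, A1 = 1)
#   dv = -v dt      + s2 dW2   (hidden;   a0 = 0,  a1 = -1)
s1, s2, a1 = 0.25, 0.5, -1.0
dt, n = 1e-3, 200_000
rng = np.random.default_rng(2)

# One realization of the truth and the observations.
u = np.zeros(n); v = np.zeros(n)
for i in range(n - 1):
    u[i+1] = u[i] + (-u[i] + v[i]) * dt + s1 * np.sqrt(dt) * rng.standard_normal()
    v[i+1] = v[i] + a1 * v[i] * dt + s2 * np.sqrt(dt) * rng.standard_normal()

# Step 1: forward filtering pass (mean mu_f, variance R_f).
mu_f = np.zeros(n); R_f = np.zeros(n); R_f[0] = s2**2 / 2
for i in range(n - 1):
    du_obs = u[i+1] - u[i]
    mu_f[i+1] = mu_f[i] + a1 * mu_f[i] * dt \
                + (R_f[i] / s1**2) * (du_obs - (-u[i] + mu_f[i]) * dt)
    R_f[i+1] = R_f[i] + (2 * a1 * R_f[i] + s2**2 - R_f[i]**2 / s1**2) * dt

# Step 2: backward smoothing pass, started from the filter endpoint.
mu_s = np.zeros(n); mu_s[-1] = mu_f[-1]
for i in range(n - 1, 0, -1):
    C = s2**2 / R_f[i]
    mu_s[i-1] = mu_s[i] - (a1 * mu_s[i] + C * (mu_s[i] - mu_f[i])) * dt

# Step 3: backward conditional sampling around the smoother mean with an
# independent noise realization; fresh noise gives a new distinct trajectory.
z = np.zeros(n)
z[-1] = mu_f[-1] + np.sqrt(R_f[-1]) * rng.standard_normal()
for i in range(n - 1, 0, -1):
    C = s2**2 / R_f[i]
    z[i-1] = z[i] + (mu_s[i-1] - mu_s[i]) \
             - (a1 + C) * (z[i] - mu_s[i]) * dt \
             + s2 * np.sqrt(dt) * rng.standard_normal()

rmse_s = np.sqrt(np.mean((mu_s - v) ** 2))
rmse_clim = np.sqrt(np.mean(v ** 2))
```

Unlike the smoother mean, the sampled trajectory z retains the temporally correlated uncertainty, so its variance and ACF mimic those of the hidden truth rather than being over-smoothed.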
Figure 1
Illustration of the conditional sampling procedure and its application to the augmentation of the training data for machine learning forecasts.
3. The Prediction Schemes
3.1. The Machine Learning Algorithm
The purpose of this article is to elucidate the advantage of using the multiple conditionally sampled trajectories to reduce the model error and improve the machine learning prediction. To this end, designing sophisticated machine learning algorithms is not the main focus here. Throughout this article, the long short-term memory (LSTM) network [18] is adopted as the machine learning algorithm. The LSTM network is an artificial recurrent neural network architecture in deep learning. Unlike standard feedforward neural networks, the LSTM has feedback connections. Therefore, it is applicable to predicting sequential data, in particular the time series from complex nonlinear dynamical systems. Only the vanilla LSTM network is used here. It starts with a sequence input layer followed by an LSTM layer. The network ends with a fully connected layer and a regression output layer.
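The vanilla architecture just described (sequence input, one LSTM layer, fully connected regression output) can be made explicit with a compact NumPy forward pass; the weights below are random placeholders rather than trained values, purely to expose the gating structure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, W, U, b, Wy, by):
    """Run a single-layer LSTM over x_seq of shape (T, d_in) and return the
    regression output at each step. W, U, b stack the input, forget, cell,
    and output gates (4 * h rows)."""
    h_dim = U.shape[1]
    h = np.zeros(h_dim); c = np.zeros(h_dim)
    outputs = []
    for x in x_seq:
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)      # cell state: gated memory update
        h = o * np.tanh(c)              # hidden state fed back at next step
        outputs.append(Wy @ h + by)     # fully connected regression layer
    return np.array(outputs)

rng = np.random.default_rng(0)
d_in, h_dim, T = 2, 8, 20
W = rng.standard_normal((4 * h_dim, d_in)) * 0.1
U = rng.standard_normal((4 * h_dim, h_dim)) * 0.1
b = np.zeros(4 * h_dim)
Wy = rng.standard_normal((1, h_dim)) * 0.1
by = np.zeros(1)
y = lstm_forward(rng.standard_normal((T, d_in)), W, U, b, Wy, by)
```

In practice the weights are trained with a gradient-based optimizer (Adam with a mean squared error loss in the experiments below); the feedback through h and c is what lets the network carry temporal memory across the sequence.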
3.2. The Ensemble Forecast
The ensemble forecast is a widely used model-based prediction approach, which involves making a set of forecasts. Due to the intrinsic chaotic behavior of the underlying dynamics, different forecast trajectories are distinct from each other. The ensemble mean forecast time series is often regarded as the path-wise prediction of the true signal and the ensemble spread characterizes the forecast uncertainty. In the following tests, the ensemble forecast is adopted as the prediction scheme for both the perfect and imperfect models. The associated ensemble mean time series will be compared with the LSTM prediction.
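The ensemble procedure can be sketched as follows; the toy chaotic model and the perturbation size are illustrative stand-ins, not this article's setup.

```python
import numpy as np

# Ensemble forecast sketch on a toy chaotic system (noisy Lorenz-63, used
# purely for illustration): perturb the initial condition, run each member
# forward, then report the ensemble mean and spread.
def step(state, dt, rng, noise=0.5):
    x, y, z = state
    drift = np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - 8.0 / 3.0 * z])
    return state + drift * dt + noise * np.sqrt(dt) * rng.standard_normal(3)

rng = np.random.default_rng(3)
n_ens, n_steps, dt = 50, 2000, 1e-3
ic = np.array([1.0, 1.0, 25.0])
members = np.empty((n_ens, n_steps, 3))
for m in range(n_ens):
    state = ic + 0.01 * rng.standard_normal(3)   # perturbed initialization
    for i in range(n_steps):
        state = step(state, dt, rng)
        members[m, i] = state

ens_mean = members.mean(axis=0)      # path-wise forecast of the true signal
ens_spread = members.std(axis=0)     # forecast uncertainty at each lead time
```

The growth of the spread with lead time is the signature of the intrinsic chaos mentioned above: initially indistinguishable members diverge, and the ensemble mean gradually loses path-wise skill.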
4. Predicting Multiscale Compressible Shallow Water Flows Using Lagrangian Observations
In many applications, one fundamental difficulty is that the variables of interest are not directly observed. The goal of this section is to predict multiscale compressible shallow water flows, where the available indirect observations are the Lagrangian tracer trajectories that are transported by the underlying flow field [19,20,21]. To mimic the real-world scenario, the only information in hand is an imperfect flow model and a short observational trajectory of the Lagrangian tracers.
4.1. The Shallow Water Equation
The linear rotating shallow water equation with doubly periodic boundary conditions reads [22,23]
$$\frac{\partial \mathbf{u}}{\partial t} = -\frac{1}{\epsilon}\,\mathbf{u}^{\perp} - \frac{1}{\epsilon}\,\nabla\eta, \qquad \frac{\partial \eta}{\partial t} = -\frac{1}{\epsilon}\,\nabla\cdot\mathbf{u},\tag{5}$$
where $\mathbf{u}=(u,v)^{\mathrm{T}}$ is the two-dimensional velocity field, $\mathbf{u}^{\perp}=(-v,u)^{\mathrm{T}}$, and $\eta$ is the height function. The non-dimensional numbers are the Rossby number $\mathrm{Ro}$, which represents the ratio between the advection term and the Coriolis term, and the Froude number $\mathrm{Fr}$; in (5), $\epsilon=\mathrm{Ro}$. For most geophysical problems, $\epsilon$ ranges from small to order-one values, representing fast to moderate rotations, while $\mathrm{Fr}=\mathrm{Ro}$ is a typical choice. Applying a plane wave ansatz, the solution of (5) is given by [22]
$$\begin{pmatrix}\mathbf{u}\\ \eta\end{pmatrix}(\mathbf{x},t) = \sum_{\mathbf{k}}\sum_{\alpha\in\{B,+,-\}} \hat{v}_{\mathbf{k},\alpha}(t)\,\mathbf{r}_{\mathbf{k},\alpha}\,e^{\mathrm{i}\mathbf{k}\cdot\mathbf{x}},\tag{6}$$
where $\mathbf{k}$ is the two-dimensional Fourier wavenumber. The set $\{B,+,-\}$ represents three different modes associated with each Fourier wavenumber. The modes with $\alpha=B$ are the geostrophically balanced (GB) modes, where the GB relation always holds. The GB flows are incompressible, which is embodied in the eigenvector $\mathbf{r}_{\mathbf{k},B}$, and the associated phase speed is $\omega_{\mathbf{k},B}=0$. The modes with $\alpha=\pm$ represent the compressible gravity waves with the phase speeds $\omega_{\mathbf{k},\pm}$. See Appendix B for details. Here, the temporal evolution of each random Fourier coefficient $\hat{v}_{\mathbf{k},\alpha}(t)$ is governed by a linear Gaussian process [24],
$$\mathrm{d}\hat{v}_{\mathbf{k},\alpha} = \left[\left(-d_{\mathbf{k},\alpha} + \mathrm{i}\,\omega_{\mathbf{k},\alpha}\right)\hat{v}_{\mathbf{k},\alpha} + f_{\mathbf{k},\alpha}(t)\right]\mathrm{d}t + \sigma_{\mathbf{k},\alpha}\,\mathrm{d}W_{\mathbf{k},\alpha}.\tag{7}$$
Such a representation is widely used in practice [25,26,27], where the damping $d_{\mathbf{k},\alpha}$ and random noise $\sigma_{\mathbf{k},\alpha}$ allow a simple way to mimic the turbulent features and the interactions between the resolved and unresolved scales. The large-scale deterministic forcings $f_{\mathbf{k},\alpha}(t)$ represent, for example, the seasonal cycle. It is clear that, when $\epsilon$ is small, the model becomes a multiscale system with slow GB flows and fast gravity waves.
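Each Fourier coefficient in (7) is a damped, rotated (complex) Ornstein-Uhlenbeck process. The sketch below simulates one such coefficient with illustrative parameters (not the values used in the experiments) and checks the stationary energy, which for this noise convention is sigma^2 / (2 d).

```python
import numpy as np

# One Fourier coefficient as a complex OU process, Eq. (7)-style; the
# parameters d, omega, f, sig are illustrative choices.
d, omega, f, sig = 0.5, 2.0, 0.0, 0.4
dt, n = 1e-3, 400_000
rng = np.random.default_rng(4)

vhat = np.zeros(n, dtype=complex)
for i in range(n - 1):
    # Complex white noise normalized so that E|noise|^2 = 1.
    noise = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    vhat[i+1] = vhat[i] + ((-d + 1j * omega) * vhat[i] + f) * dt \
                + sig * np.sqrt(dt) * noise

# Stationary statistics: E|vhat|^2 = sig^2 / (2 d) = 0.16 here.
var_emp = np.mean(np.abs(vhat[n // 2:]) ** 2)
```

The damping d sets the decorrelation time and omega sets the oscillation (phase) frequency, so an error in the phase speed of the gravity modes maps directly into a wrong omega in this process.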
4.2. The Lagrangian Observations
Assume there are L tracers. The equation of each tracer trajectory is given by Newton's law together with some small-scale uncertainties,
$$\mathrm{d}\mathbf{x}_l = \mathbf{u}(\mathbf{x}_l,t)\,\mathrm{d}t + \sigma_x\,\mathrm{d}\mathbf{W}_l = \sum_{\mathbf{k}}\sum_{\alpha\in\{B,+,-\}} \hat{v}_{\mathbf{k},\alpha}(t)\,\mathbf{r}^{\mathbf{u}}_{\mathbf{k},\alpha}\,e^{\mathrm{i}\mathbf{k}\cdot\mathbf{x}_l}\,\mathrm{d}t + \sigma_x\,\mathrm{d}\mathbf{W}_l, \qquad l=1,\ldots,L,\tag{8}$$
where $\mathbf{r}^{\mathbf{u}}_{\mathbf{k},\alpha}$ is a two-dimensional vector, containing the first two components of the eigenvector $\mathbf{r}_{\mathbf{k},\alpha}$. It is important to note that the GB and the gravity modes are coupled in the observations. In addition, the tracer equation is highly nonlinear because the prognostic variable $\mathbf{x}_l$ appears in the exponential function on the right-hand side. These features lead to a tough test in recovering and sampling the underlying velocity fields. Nevertheless, despite the intrinsic nonlinearity, the coupled system (6)–(8) belongs to the modeling framework (1), which facilitates the conditional sampling of the flow trajectories [28].
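The Lagrangian observation process can be illustrated by advecting noisy tracers in a frozen, doubly periodic, incompressible one-mode flow; this is a simplified stand-in for the random multiscale flow above, and the streamfunction below is an assumption for illustration only.

```python
import numpy as np

# Tracers advected by a frozen incompressible velocity field built from a
# single Fourier mode (illustrative stand-in for Eq. (8)).
def velocity(x):
    # Streamfunction psi = cos(x) cos(y): u = psi_y sign pattern below gives
    # divergence-free flow (u_x + v_y = 0).
    u = np.cos(x[0]) * np.sin(x[1])
    v = -np.sin(x[0]) * np.cos(x[1])
    return np.array([u, v])

n_tracers, dt, n_steps = 8, 1e-2, 5000
sigma_x = 0.05                               # small-scale uncertainty
rng = np.random.default_rng(5)
x = 2 * np.pi * rng.random((n_tracers, 2))   # random initial positions
for _ in range(n_steps):
    for l in range(n_tracers):
        x[l] = x[l] + velocity(x[l]) * dt \
               + sigma_x * np.sqrt(dt) * rng.standard_normal(2)
    x %= 2 * np.pi                           # doubly periodic domain
```

Note how the tracer positions enter the velocity through trigonometric (exponential-type) functions: this is the source of the strong nonlinearity of the observation process, even though the flow model itself is linear.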
4.3. Setup of the Numerical Tests
The modes with wavenumbers $|k_x|,|k_y|\le 1$ are used in the study here, which means the total number of the Fourier modes is 27. However, the first two components of the $(0,0)$-th GB mode are zero. This mode often represents the given background flow. Since it has no contribution to the observational process, it is removed here. Thus, the degree of freedom (DoF) of the underlying model is DoF = 26. See Appendix B for details. The parameters in (7) are chosen to be the same across the modes, which indicates an equipartition of the energy in different modes. The small-scale uncertainty $\sigma_x$ in the tracer Equation (8) is fixed at a small value. The number of tracers L is slightly smaller than the DoF of the resolved modes, which is the typical case in practice. Appendix B includes a sensitivity analysis of the prediction skill as a function of L, which shows qualitatively similar results as those in the main text here.

Below, a small Rossby number is used in the perfect model. In practice, the dynamics of the slowly varying GB modes is often known, but building an accurate model for the fast gravity modes is a challenging task, especially for estimating the phase speeds [29]. To mimic such a situation, an imperfect model is used here to sample and predict the gravity modes, where the model error comes from an overestimation of the Rossby number in the imperfect model, which leads to inaccurate phase speeds of the gravity waves. The available observational data covers only a short period, about five decorrelation times of the Fourier modes. The prediction test is taken on an independent period with 50 units, where the true signal generated from the perfect model in this period is used as the reference value to compute the prediction skill scores in the forecast experiments.
4.4. Conditional Sampling
To expand the training data set, many distinct trajectories from the conditional sampling algorithm are generated by exploiting the imperfect model, the short observations, and the formula in (4). Therefore, the effective training data set for the LSTM network is many times longer than the short observational record. Figure 2a shows the true signal and the smoother estimate of one of the gravity modes, where the uncertainty in the smoother estimate facilitates the sampling of multiple trajectories. The variability in these trajectories significantly enriches the short-term time evolution patterns that are associated with the underlying flow field but are not fully reflected in the short true time series. It is also important to note that the model error in the phase speed is reduced to a large extent in the conditionally sampled trajectories. In fact, Figure 2b includes the ACF of the sampled trajectory, which demonstrates a large improvement compared with that of the imperfect model. Since the ACF characterizes the temporal dependence, the result here suggests a potential enhancement of the prediction accuracy in the LSTM forecast using the conditionally sampled trajectories as training data.
Figure 2
Gravity mode of the shallow water equation. (a) the true signal and two sampled trajectories conditioned on the observations of the tracers' motions. The deep, moderate, and light magenta shading areas show the one, two, and three standard deviations (STDs) of the uncertainty in the smoother estimate, respectively; (b) the ACFs of the truth, the conditionally sampled trajectory, and the trajectory from a free run of the imperfect model.
4.5. The LSTM Setup
Recall that the total DoF of the underlying flow field is 26, where the DoF of the gravity modes is 18. Since the equations of different Fourier modes are independent of each other in the perfect model, it is reasonable to build nine LSTM networks for predicting these gravity modes, each of which contains two Fourier modes that form a complex conjugate pair. A similar manipulation is used for predicting the GB modes. Since no model error is involved in the GB modes, the LSTM prediction of the GB part of the flow is almost the same as the perfect model ensemble forecast. Note that certain weak coupling between different Fourier modes may exist in the conditionally sampled time series because the observations are a mixture of all the Fourier modes. Nevertheless, dividing a complex problem into several independent subproblems facilitates learning the features of the underlying dynamics. The input is a time series segment of a fixed number of time units, while the output is the segment over the subsequent time units. Each of the LSTMs used here contains only one hidden LSTM layer, where the number of neurons is 100. The maximum epoch is 100. The Adam optimization algorithm is adopted, and a mean squared error loss function is used.
4.6. Prediction
For both the model-based ensemble forecast and the machine learning prediction here, the initial conditions of the flow fields are assumed to be perfectly known in the prediction stage. Though in practice data assimilation is required for the initialization, the idealized setup here rules out the impact from initialization and allows us to study the predictability limit of different methods.

Figure 3 shows the root-mean-square error (RMSE) and the pattern correlation (Corr) of the predicted time series relative to the truth as a function of lead time. The skill scores for two of the gravity modes are included in panels (a,b). The prediction of the other gravity modes has similar behavior. Panel (c) shows the reconstructed velocity field associated with the gravity modes in physical space. Clearly, the skill scores of the LSTM prediction are quite close to those of the perfect model, while the imperfect model has a much larger forecast bias. The improvement in the LSTM prediction is due to the significant reduction of the model error and the extended length of the training time series resulting from the conditional sampling algorithm.
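The two skill scores used throughout can be computed as below; the synthetic forecast here, which decorrelates with lead time by construction, is an illustrative stand-in for real forecast output.

```python
import numpy as np

def skill_scores(truth, forecast):
    """RMSE and pattern correlation (Corr) at each lead time.
    truth, forecast: arrays of shape (n_forecasts, n_leads)."""
    err = forecast - truth
    rmse = np.sqrt(np.mean(err ** 2, axis=0))
    t_anom = truth - truth.mean(axis=0)
    f_anom = forecast - forecast.mean(axis=0)
    corr = np.sum(t_anom * f_anom, axis=0) / (
        np.sqrt(np.sum(t_anom ** 2, axis=0))
        * np.sqrt(np.sum(f_anom ** 2, axis=0)))
    return rmse, corr

# Synthetic check: a forecast that is perfect at lead 0 and pure noise at
# the longest lead time.
rng = np.random.default_rng(6)
truth = rng.standard_normal((500, 10))
noise = rng.standard_normal((500, 10))
weights = np.linspace(0.0, 1.0, 10)       # noise fraction grows with lead
forecast = (1 - weights) * truth + weights * noise
rmse, corr = skill_scores(truth, forecast)
```

As expected, the RMSE rises and the Corr decays with lead time; the lead at which the Corr crosses a fixed threshold is the usual definition of the useful prediction range.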
Figure 3
The RMSE and the Corr as a function of lead time. (a,b) two of the gravity modes; the RMSE and Corr are computed based only on the real part of the Fourier time series; (c) the reconstructed velocity field associated with the gravity modes in physical space, where the skill scores are the averaged values of the u and v components. In reconstructing the velocity field in physical space, a uniform mesh grid is used. The red, green, and black curves show the prediction skill scores using the perfect model, the imperfect model, and the LSTM network, respectively. The dashed lines in the RMSE panels indicate the one standard deviation of the true signal, and those in the Corr panels show the Corr threshold.
A case study is included in Figure 4, which compares the predicted flow fields at a fixed lead time. The first row shows the truth and the predicted velocity field associated with only the gravity modes, while the second row shows those of the total velocity field that includes both the GB and gravity modes. The perfect model succeeds in predicting both the compressible gravity flows and the total velocity field. In contrast, the predicted flow pattern associated with the gravity modes from the imperfect model is completely reversed. As a result, the imperfect model fails to forecast the vortices in the total flow field, in that they are overwhelmed by the large error from the forecasted gravity waves. On the other hand, despite a slight overestimation of the flow amplitude, the overall patterns are forecasted accurately by the LSTM network. In particular, both the meandering jets and the predominant vortices in the total velocity field are precisely predicted by the LSTM network.
Figure 4
Predicting the flow fields in physical space at a fixed lead time, starting from a fixed initial time in the prediction phase. (a–d) compare the truth, the perfect model forecast, the imperfect model forecast, and the LSTM forecast. The first row shows the velocity field associated with only the gravity modes, while the second row shows the total velocity field. The velocity field in each panel is shown by the quivers, and its amplitude is denoted by the shading area.
5. Predicting the Monsoon Intraseasonal Oscillation (MISO)
5.1. The MISO Index and the Low-Order Nonlinear Stochastic Models
Monsoon Intraseasonal Oscillation (MISO) [30,31,32] is one of the prominent modes of tropical intraseasonal variability. As a slow-moving, planetary-scale envelope of convection propagating northeastward, it strongly interacts with the boreal summer monsoon rainfall over South Asia. The MISO plays a crucial role in determining the onset and demise of the Indian summer monsoon, as well as affecting the seasonal amount of rainfall over the Indian subcontinent. Therefore, both real-time monitoring and accurate extended-range forecasts of MISO phases have large ecological and societal impacts. Recently, a new MISO index was developed [33] by applying a new nonlinear data analysis technique to the daily Global Precipitation Climatology Project (GPCP) rainfall data [34] over the Asian summer monsoon region (20°S–30°N, 30°E–140°E). This new index outperforms the traditional ones based on the extended empirical orthogonal function (EEOF) in capturing the intermittency and nonlinear features of the MISO. The associated MISO modes also have higher memory and predictability, stronger amplitude, higher fractional explained variance over the western Pacific, Western Ghats, and adjoining Arabian Sea regions, and a more realistic representation of the regional heat sources over the Indian and Pacific Oceans compared with those extracted via EEOF analysis. Panel (a) of Figure 5 shows this MISO index, which is a two-dimensional time series. Both components are intermittent, with active phases in summer and quiescent phases in winter. The associated PDFs, shown by the blue curves, are highly non-Gaussian.
Figure 5
MISO index and the associated statistics. (a) the two components of the MISO index; (b) the ACFs and the cross-correlation functions (CCFs) associated with the MISO time series in (a) and those from the simulation of the model (9) and (10) with the parameters listed in the top row of Table 1; (c) similar to (b) but for the PDFs; (d) the ACF of the MISO 1 component in different years; (e) similar to (d) but for the PDF.
To develop a model that describes such a MISO index, it is natural to start with a two-dimensional linear stochastic oscillator. However, the linear oscillator model itself is not sufficient to characterize all the features of the observed MISO index. In particular, it fails to capture the variability in the amplitude and the oscillation frequency. In fact, as is shown in panels (d,e) of Figure 5, the dynamical and statistical behavior of the MISO index differs markedly from year to year. Therefore, a simple but effective way to take these characteristics into account is to add two extra processes, representing the randomness in damping and phase [35]. The model reads
$$\begin{aligned}
\mathrm{d}u_1 &= \left[-d_u u_1 + \gamma\,v\,u_1 - (a+\omega)\,u_2\right]\mathrm{d}t + \sigma_u\,\mathrm{d}W_{u_1},\\
\mathrm{d}u_2 &= \left[-d_u u_2 + \gamma\,v\,u_2 + (a+\omega)\,u_1 + f(t)\right]\mathrm{d}t + \sigma_u\,\mathrm{d}W_{u_2},\\
\mathrm{d}v &= -d_v\,v\,\mathrm{d}t + \sigma_v\,\mathrm{d}W_{v},\\
\mathrm{d}\omega &= -d_\omega\,\omega\,\mathrm{d}t + \sigma_\omega\,\mathrm{d}W_{\omega},
\end{aligned}\tag{9}$$
where the time-periodic function
$$f(t) = f_0 + f_1\sin(\omega_f t + \phi)\tag{10}$$
provides a crude description of the seasonal cycle. In this model, $u_1$ and $u_2$ stand for the two components of the MISO index, while v and $\omega$ are the stochastic damping and the stochastic phase, respectively. The white noise sources $W_{u_1}$, $W_{u_2}$, $W_v$, and $W_\omega$ are independent of each other. Note that the nonlinear model developed here is slightly different from the physics-constrained model in [35]. However, these two models have almost the same skill in predicting the MISO index. The main reason to adopt the form in (9) and (10) is that the structure of this coupled nonlinear model facilitates the application of the conditional sampling algorithm.

To illustrate the skill of the coupled model (9) and (10) in capturing the nonlinear and non-Gaussian features of the MISO index, it is shown in panels (b,c) of Figure 5 that the model can almost perfectly recover the ACFs, the cross-correlation functions (CCFs), and the non-Gaussian PDFs of the entire MISO index. The model trajectories also highly resemble the true MISO index (see Figure A3 in Appendix C). The associated model parameters are listed in the top row of Table 1. Due to the high skill in describing the MISO index, the model (9) and (10) with these parameters is referred to as the "nearly perfect model".
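A simulation of a forced oscillator with stochastic damping v and stochastic phase omega is sketched below, using the parameters from the "nearly perfect" row of Table 1; the precise coupling is a plausible stand-in for the model (9)-(10), written for illustration rather than as the exact calibrated system.

```python
import numpy as np

# Two-component oscillator with stochastic damping v and stochastic phase w,
# driven by a seasonal forcing; parameters from the nearly perfect row of
# Table 1, coupling structure an illustrative stand-in.
du_, dv_, dw_, gam, a = 0.9, 0.6, 0.5, 0.2, 4.1
su, sv, sw = 0.5, 0.5, 0.7
f0, f1, wf, phi = 1.0, 4.7, 2 * np.pi / 12, -2.0

dt, n = 1e-3, 200_000
rng = np.random.default_rng(7)
u1 = u2 = v = w = 0.0
out = np.empty((n, 2))
for i in range(n):
    t = i * dt
    f = f0 + f1 * np.sin(wf * t + phi)   # seasonal cycle
    freq = a + w                          # stochastically modulated phase
    new_u1 = u1 + ((-du_ + gam * v) * u1 - freq * u2) * dt \
             + su * np.sqrt(dt) * rng.standard_normal()
    new_u2 = u2 + ((-du_ + gam * v) * u2 + freq * u1 + f) * dt \
             + su * np.sqrt(dt) * rng.standard_normal()
    v += -dv_ * v * dt + sv * np.sqrt(dt) * rng.standard_normal()
    w += -dw_ * w * dt + sw * np.sqrt(dt) * rng.standard_normal()
    u1, u2 = new_u1, new_u2
    out[i] = (u1, u2)
```

Because the damping and the phase are themselves random processes, the oscillation amplitude and frequency wander from one "year" to the next, which is exactly the year-to-year variability of the MISO index the extra processes are designed to capture.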
Figure A3
(a,b) the two components of the observed MISO index (the same as those in Figure 5); (c,d) a simulation from the nearly perfect model (9) and (10), where the parameters are listed in the top row of Table 1.
Table 1
Two sets of the parameters for the model (9) and (10).
                          du   dv   dω   γ    a    σu   σv   σω   f0   f1   ωf      φ
I. nearly perfect model   0.9  0.6  0.5  0.2  4.1  0.5  0.5  0.7  1    4.7  2π/12   −2
II. imperfect model       0.9  0.6  0.5  0.2  5.2  0.5  0.5  0.7  1    4.7  2π/12   −2
It is important to note that the nearly perfect model is obtained by exploiting all the observational information (1998–2013) for model calibration. Therefore, despite the fact that such a model succeeds in describing nature, it cannot be used for predicting the MISO index during the same period because of the overfitting issue. In the following, only the MISO index in year 1998 is used for model calibration. This mimics most applications in nature where the data for model calibration are very limited. On the other hand, the time series from year 2001 to year 2013 is adopted for testing the prediction skill, which provides a sufficiently long validation period to reach a robust conclusion.

Since the MISO index has distinct year-to-year behavior ((d,e) in Figure 5), the model parameters estimated by using only the time series of year 1998 differ from those of the nearly perfect model. These estimated parameters are shown in the bottom row of Table 1, and the resulting model is named the “imperfect model” for predicting the MISO time series from year 2001 to year 2013. Below, either a one-year or a three-year time series is combined with this imperfect model to enrich the training data set for the machine learning forecasts by applying the conditional sampling technique. Note that, although the imperfect model can in principle be re-calibrated as new data arrive sequentially in time, the parameters of the imperfect model are kept unchanged in the tests here for two reasons. First, online parameter estimation for complex nonlinear systems in real applications can be extremely expensive and is thus seldom adopted in practice. Second, there often exists an intrinsic model error in real applications. The imperfect parameters here mimic such an intrinsic barrier, since otherwise the model is nearly perfect.
5.2. Conditional Sampling
Below, the short observational data in (1) year 1998, (2) year 1999, (3) year 2000, or (4) the three-year period 1998–2000 are utilized together with the imperfect model to enrich the training data set of the machine learning forecast via the conditional sampling technique. The conditional sampling algorithm involves two steps. In Step 1, conditioned on the two components (u1, u2) of the observed MISO index, N trajectories of the hidden variables (v, ω) are sampled. In Step 2, conditioned on each of the sampled trajectories of (v, ω) from Step 1, a trajectory of (u1, u2) is sampled. The structure of the nonlinear model (9) and (10) allows the setup in both steps to belong to the nonlinear modeling family (1), such that the closed analytic Formulae (4) can be applied in sampling both (v, ω) and (u1, u2). Note that, although (u1, u2) does not appear in the equations of (v, ω), the conditional sampling in Step 2 is still necessary, since uncertainty exists in the processes of (v, ω). Simply plugging the sampled trajectories of (v, ω) into the equations of (u1, u2) leads to biased results.
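The forward–backward idea behind each sampling step can be illustrated with a discrete-time analogue: a forward Kalman filter followed by a backward pass that draws an entire trajectory from the smoothing distribution (forward filtering, backward sampling). The scalar linear-Gaussian model below is only an illustrative stand-in for the conditionally linear structure exploited by the closed analytic Formulae (4); all numerical values are assumptions.

```python
import numpy as np

# Forward filtering, backward sampling (FFBS) for a scalar linear-Gaussian
# state-space model x_{k+1} = A x_k + noise, y_k = H x_k + noise. This is a
# discrete-time sketch of the two-step conditional sampling idea, not the
# continuous-time algorithm of the paper.
rng = np.random.default_rng(1)
A, Q = 0.95, 0.1 ** 2        # state transition and process noise variance
H, R = 1.0, 0.3 ** 2         # observation operator and noise variance
N = 300

# synthetic truth and observations
x = np.zeros(N)
for k in range(1, N):
    x[k] = A * x[k - 1] + np.sqrt(Q) * rng.standard_normal()
y = H * x + np.sqrt(R) * rng.standard_normal(N)

# forward pass: Kalman filter means m_k and variances P_k given y_{1:k}
m, P = np.zeros(N), np.zeros(N)
m[0], P[0] = 0.0, 1.0
for k in range(1, N):
    mp, Pp = A * m[k - 1], A * P[k - 1] * A + Q        # predict
    K = Pp * H / (H * Pp * H + R)                      # Kalman gain
    m[k], P[k] = mp + K * (y[k] - H * mp), (1 - K * H) * Pp  # update

# backward pass: draw one trajectory from the smoothing distribution
xs = np.zeros(N)
xs[-1] = m[-1] + np.sqrt(P[-1]) * rng.standard_normal()
for k in range(N - 2, -1, -1):
    G = P[k] * A / (A * P[k] * A + Q)
    mean = m[k] + G * (xs[k + 1] - A * m[k])
    var = P[k] * (1 - G * A)
    xs[k] = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal()
```

Repeating the backward pass N times yields N distinct trajectories that are all dynamically consistent with the observations, which is exactly how the sampled data set is enriched here.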
5.3. The Setup of the LSTM and the Model Ensemble Forecast
In all the tests here, a 30-year sampled time series of (u1, u2) is used as the machine learning training data. This implies N = 30 when a one-year observational record is used in the conditional sampling and N = 10 when the three-year data from year 1998 to year 2000 are used. The hidden variables v and ω are only used to generate the sampled trajectories of (u1, u2); they are not directly utilized in the machine learning training and forecasting steps. The input data of the LSTM has a length of 30 days and the output is the next one day. The number of neurons in the LSTM layer is 200, and the maximum number of epochs is 100.

The model ensemble forecast is adopted as a comparison with the machine learning prediction. It exploits 50 ensemble members, which have been validated to provide an unbiased ensemble mean prediction time series. Since there are no observational data for the two hidden variables v and ω, the exact data assimilation scheme in (2) is utilized for their initialization.
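The training pairs described above (a 30-day input window mapped to the following day) can be constructed as follows. Here `series` is a synthetic stand-in for the sampled two-component time series; the LSTM itself (one layer with 200 neurons, trained for at most 100 epochs) is framework-specific and therefore omitted.

```python
import numpy as np

def make_windows(series, lag=30):
    """Return (X, y): X[k] is a `lag`-day window, y[k] the following day."""
    X = np.stack([series[k:k + lag] for k in range(len(series) - lag)])
    y = series[lag:]
    return X, y

# Synthetic stand-in for one year of a sampled two-component daily index.
t = np.linspace(0, 20, 365)
series = np.column_stack([np.sin(t), np.cos(t)])
X, y = make_windows(series)   # X: (335, 30, 2) windows, y: (335, 2) targets
```

The resulting arrays have the (samples, time steps, features) layout that recurrent layers such as an LSTM expect, with the two features corresponding to the two MISO components.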
5.4. Prediction
The MISO index prediction using different methods is shown in Figure 6. The solid curves in both (a) and (b) illustrate the overall prediction skill in the 2001–2013 period using the LSTM. The difference between the two panels is the following: in (a), the training data are the short observational data of either one year (blue, red, and green curves) or three years (black curves), while in (b), the 30-year sampled time series is used for the LSTM training. As a comparison, the model ensemble forecast results are also included in both panels, where the dashed pink curve shows the prediction using the nearly perfect model and the dashed cyan curve shows that using the imperfect model.
Figure 6
The MISO prediction. (a) the RMSE and Corr as a function of lead time (days) in predicting the MISO time series from 2001 to 2013, where the short observational data are used for training the LSTM. Here, the solid blue, red, and green curves are the LSTM predictions based on a one-year training data set using the observed time series of year 1998, 1999, and 2000, respectively. The solid black curve is the LSTM prediction based on a three-year training data set using the observed time series from 1998 to 2000. The dashed pink and cyan curves show the model predictions using the nearly perfect model and the imperfect model. The black dotted lines show the Corr = 0.5 threshold and one standard deviation of the true time series. (b) is similar to (a), but the training data of the LSTM are obtained by applying the conditional sampling algorithm to the observed time series in either one of the three years 1998 (blue), 1999 (red), and 2000 (green), or all three years (black). The total length of the sampled data is 30 years; (c) the maximum lead time of the skillful prediction in different years. Here, a skillful prediction is defined as one with Corr above 0.5 and RMSE below one standard deviation of the true signal.
The main conclusions are as follows. First, either a one-year or a three-year time series is not sufficient for training the LSTM. Among these short training data sets, using the time series of year 1998 leads to the worst prediction skill. This is because the dynamical and statistical features of the MISO index in year 1998 are more distinct from those of the entire observational period than are those of year 1999 and year 2000 (see (d,e) in Figure 5). The consequence is that the LSTM prediction using only the time series of year 1998 as the training data performs significantly worse even than the ensemble forecast using the imperfect model. Second, the LSTM prediction using the long sampled trajectories based on the extremely short observations of year 1999 or year 2000 has almost the same skill as the nearly perfect model forecast! In fact, the medium-range forecast (20 to 30 days) using the LSTM even slightly outperforms that using the nearly perfect model. Notably, the LSTM prediction using the long sampled data based on the time series of year 1998 is also greatly improved and outperforms the ensemble forecast using the imperfect model. This may seem counter-intuitive. Nevertheless, since the model is stochastic, it is able to generate patterns with different oscillation frequencies in the conditionally sampled trajectories. The LSTM forecast then exploits the input information to find the most similar patterns in the training data set. In other words, the LSTM gives different weights to all the possible events in the training data set, which can be regarded as different ensemble members. This is not the case in the model-based ensemble mean forecast, where the same weight is assigned to all ensemble members. Thus, despite making use of the same observational information, the LSTM performs better than the model ensemble mean forecast.
In Figure A4 of Appendix C, the model-generated training data are used for the LSTM prediction; the results there point to the same mechanism.
Figure A4
Skill scores of predicting the MISO index. The two panels here are similar to those in Figure 6, but the solid curves show the results when the training data of the LSTM come from a 30-year model simulation. The dashed curves show the model ensemble forecasts.
Panel (c) of Figure 6 shows the maximum lead time of skillful predictions in different years, where a skillful prediction is defined as one with Corr above 0.5 and RMSE below one standard deviation of the true signal. Although the useful prediction range varies from year to year, the LSTM prediction using the conditionally sampled trajectories based on the one-year observed time series of 1999 is comparable to the nearly perfect model forecast. The associated maximum lead time is longer than that using the three-year short training data and than the imperfect model forecast in most of the years. The only exception is year 2010. In fact, the MISO pattern of year 2010 is similar to that of year 1998, with a weak amplitude. Note that the winters of both year 1998 and year 2010 correspond to a strong positive phase of the Atlantic Zonal Mode (AZM) [36,37]. The results here confirm the negative correlation between the MISO and the AZM, as well as the relatively short predictability of the MISO at the positive phase of the AZM. Next, years 2002 and 2004 are recorded as drought years. The MISO index in these two years is more irregular than in the other years, which explains the lower overall prediction skill. On the other hand, year 2008 has an overall strong and regular MISO activity during the whole monsoon season, which results in a long predictability. Finally, Figure 7 shows the predicted time series at a lead time of 30 days, which clearly indicates that the LSTM with the conditionally sampled training data outperforms the imperfect model ensemble forecast in most of the years.
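The two skill scores and the maximum skillful lead time used in the discussion above can be computed as follows, assuming the common convention that a forecast at a given lead time counts as skillful when the correlation exceeds 0.5 and the RMSE stays below one standard deviation of the truth. The function names and input layout are illustrative.

```python
import numpy as np

def skill_scores(truth, pred):
    """RMSE and pattern correlation between a truth series and a forecast."""
    rmse = np.sqrt(np.mean((truth - pred) ** 2))
    corr = np.corrcoef(truth, pred)[0, 1]
    return rmse, corr

def max_skillful_lead(truth, preds_by_lead):
    """Largest lead time (1-based) whose forecast is still skillful:
    Corr > 0.5 and RMSE below one standard deviation of the truth."""
    sd = np.std(truth)
    best = 0
    for lead, pred in enumerate(preds_by_lead, start=1):
        rmse, corr = skill_scores(truth, pred)
        if corr > 0.5 and rmse < sd:
            best = lead
        else:
            break
    return best
```

Applying `max_skillful_lead` separately to each year of the validation period produces the per-year comparison shown in panel (c).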
Figure 7
Comparison of the predicted MISO index using different methods. (a) prediction using the imperfect model; (b) prediction using the LSTM, where the training data are the true observational data from 1998 to 2000; (c) prediction using the LSTM, where the training data are obtained by applying the conditional sampling algorithm to the observed MISO time series of year 1999, with a total of 30 years of sampled data; (d) prediction using the nearly perfect model.
6. Conclusions
An efficient and optimal algorithm for sampling nonlinear time series conditioned on observations is developed in this article. It exploits only short and partial observations to greatly reduce the model error and expand the training data set for machine learning forecast algorithms. The sampling algorithm succeeds in creating a large training data set of multiscale compressible shallow water flows from highly nonlinear and indirect observations. The resulting machine learning prediction significantly outperforms the imperfect model forecast. The sampling algorithm also facilitates the machine learning forecast of the highly non-Gaussian MISO time series using extremely short observations.

Note that the MISO is an intraseasonal variability, and therefore the total amount of MISO data available in the satellite era is actually not that restricted. The reason for adopting only a small portion of the observed MISO data for training is to mimic the real situations of predicting many other natural phenomena. Meanwhile, the remaining relatively long MISO trajectories allow us to study the improvement of the MISO forecast using the conditional sampling technique. On the other hand, the El Niño-Southern Oscillation (ENSO) is an interannual variability, which has a significant impact on the entire earth system. The ENSO has various spatiotemporal patterns, but the observational data of the ENSO are very scarce. In addition, the current operational models have large model errors in describing and predicting the ENSO. Thus, an interesting and useful task is to utilize the conditional sampling algorithm for improving the ENSO prediction, which is left as future work. Another future direction is to generalize the conditional sampling algorithm to general nonlinear and non-Gaussian systems using particle methods.