Scientific advice to the UK government throughout the COVID-19 pandemic has been informed by ensembles of epidemiological models provided by members of the Scientific Pandemic Influenza group on Modelling. Among other applications, the model ensembles have been used to forecast daily incidence, deaths and hospitalizations. The models differ in approach (e.g. deterministic or agent-based) and in assumptions made about the disease and population. These differences capture genuine uncertainty in the understanding of disease dynamics and in the choice of simplifying assumptions underpinning the model. Although analyses of multi-model ensembles can be logistically challenging when time-frames are short, accounting for structural uncertainty can improve accuracy and reduce the risk of over-confidence in predictions. In this study, we compare the performance of various ensemble methods to combine short-term (14-day) COVID-19 forecasts within the context of the pandemic response. We address practical issues around the availability of model predictions and make some initial proposals to address the shortcomings of standard methods in this challenging situation.
Keywords:
COVID-19; disease forecasting; model combination; model stacking; uncertainty quantification
Comprehensive uncertainty quantification in epidemiological modelling is a timely and
challenging problem. During the COVID-19 pandemic, a common requirement has been
statistical forecasting in the presence of an ensemble of multiple candidate models.
For example, multiple candidate models may be available to predict disease case
numbers, resulting from different modelling approaches (e.g. mechanistic or
empirical) or differing assumptions about spatial or age mixing. Alternative models
capture genuine uncertainty in scientific understanding of disease dynamics,
different simplifying assumptions underpinning each model derivation, and/or
different approaches to the estimation of model parameters. While the analysis of
multi-model ensembles can be computationally challenging, accounting for this
‘structural uncertainty’ can improve forecast accuracy and reduce the risk of
over-confident prediction.[1,2]
A common ensemble approach is Bayesian model averaging, which seeks an optimal
combination of models in the space spanned by all individual models.[3,4] However, in many settings this
approach can fail dramatically: (i) the required marginal likelihoods (or
equivalently Bayes factors) can depend on arbitrary specifications for
non-informative prior distributions of model-specific parameters; and (ii)
asymptotically the posterior model probabilities, used to weight predictions from
different models, converge to unity on the model closest to the ‘truth’. While this
second feature may be desirable when the set of models under consideration contains
the true model (the ℳ-closed setting), it is less desirable in more
realistic cases when the model set does not contain the true data generator
(the ℳ-complete and ℳ-open settings). Here, this property of Bayesian model
averaging asymptotically choosing a single model can be thought of as a form of
overfitting. For these latter settings, alternative methods of combining predictions
from model ensembles may be preferred, for example, via combinations of individual
predictive densities.[5] Combination weights can be chosen via application of
predictive scoring, as commonly applied in meteorological and economic
forecasting.[6,7]
If access to full posterior information is available, other approaches are also
possible. Model stacking methods,[8] see Section 2.2, can be applied
directly, ideally using leave-one-out predictions or sequential predictions of
future data to mitigate over-fitting. Alternatively, the perhaps confusingly named
Bayesian model combination method[9,10] could be employed, where the
ensemble is expanded to include linear combinations of the available models. For
computationally expensive models, where straightforward generation of predictive
densities is prohibitive, a statistical emulator built from a Gaussian process
prior[11] could be assumed for each expensive model. Stacking or model
combination can then be applied to the resulting posterior predictive distributions,
conditioning on model runs and data.

In this paper, we explore the process of combining short-term epidemiological
forecasts for COVID-19 daily deaths, and hospital and intensive care unit (ICU)
occupancy, within the context of supporting UK decision-makers during the pandemic
response. In practice, this context placed constraints on the information available
to the combination algorithms. In particular, the individual model posterior
distributions were unavailable, which prohibited the use of the preferred approach
outlined above, and so alternative methods had to be utilized. The construction of
consensus forecasts in the UK has been undertaken through a mixture of algorithmic
model combination and expert curation by the Scientific Pandemic Influenza group on
Modelling (SPI-M). For time-series forecasting, equally-weighted mixture models have
been employed.[12-14] We compare
this approach to more sophisticated ensemble methods. Important related work is the
nowcasting of the current state of the disease within the UK population through
metrics such as the effective reproduction number, growth rate and doubling
time.[15] The rest of the paper is organized as follows. Section 2
describes in more detail methods of combining individual model predictions.
Limitations of the available forecast data for the COVID-19 response are then
described in Section 3, and Section 4 compares the performance of ensemble
algorithms and individual model predictions. Section 5 provides some discussion and
areas for future work. This paper complements work undertaken in a wider effort to
improve the policy response to COVID-19, in particular a parallel effort to assess
forecast performance.[16]
Combinations of predictive distributions
Let $y$ represent the observed data and consider an ensemble of $K$ models, with the $k$th model having posterior predictive density $p_k(\tilde{y} \mid y)$, where $\tilde{y}$ denotes future data. We consider two categories of
ensemble methods: (i) those that stack the predictive densities as weighted mixture
distributions (also referred to as pooling in decision theory); and (ii) those that
use regression models to combine point predictions obtained from the individual
posterior predictive densities. In both cases, stacking and regression weights can
be chosen using scoring rules.
Scoring rules
Probabilistic forecast quality is often assessed via a scoring rule $S(p, \tilde{x})$,[17,18] with arguments $p$, a predictive density, and $\tilde{x}$, a realization of the future outcome $\tilde{X}$. Throughout, we apply negatively-orientated scoring rules, such that a lower score denotes a better forecast. A proper scoring rule ensures that the minimum expected score is obtained by choosing the data generating process as the predictive density. That is, if $q$ is the density function from the true data generating process, then
$$\mathbb{E}_{\tilde{X} \sim q}\left[S(q, \tilde{X})\right] \leq \mathbb{E}_{\tilde{X} \sim q}\left[S(p, \tilde{X})\right]$$
for any predictive density $p$. Common scoring rules include:

log-score: $S(p, \tilde{x}) = -\log p(\tilde{x})$;  (1)

continuous ranked probability score (CRPS): $S(P, \tilde{x}) = \mathbb{E}_P|X - \tilde{x}| - \tfrac{1}{2}\mathbb{E}_P|X - X'|$, with $X$ and $X'$ independent random variables distributed according to the predictive distribution $P$ and having finite first moment.

For deterministic predictions, that is, $p$ being a point mass density with support on $\hat{x}$, the CRPS reduces to the mean absolute error $|\hat{x} - \tilde{x}|$, and hence this score can be used to compare probabilistic and deterministic predictions. If only quantiles from $p$ are available, alternative scoring rules include:

quantile score: $S_\alpha(q_\alpha, \tilde{x}) = 2\left(\mathbb{1}\{\tilde{x} \leq q_\alpha\} - \alpha\right)\left(q_\alpha - \tilde{x}\right)$ for quantile forecast $q_\alpha$ from density $p$ at level $\alpha$;

interval score: $S^{I}_{\alpha}(l, u; \tilde{x}) = (u - l) + \frac{2}{\alpha}(l - \tilde{x})\,\mathbb{1}\{\tilde{x} < l\} + \frac{2}{\alpha}(\tilde{x} - u)\,\mathbb{1}\{\tilde{x} > u\}$ for $[l, u]$ being a central $(1-\alpha)\times 100\%$ prediction interval from $p$.

Quantile and interval scores can be averaged across the available quantiles/intervals to provide a score for the predictive density; the CRPS is the integral of the quantile score with respect to $\alpha$. Scoring rules can be used to rank predictions from individual models or to combine models from an ensemble, for example using stacking or regression methods.
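To make these definitions concrete, the following minimal Python sketch evaluates the quantile and interval scores from a set of reported forecast quantiles. The function names, the choice of quantile levels and the example numbers are illustrative assumptions rather than details taken from the SPI-M pipeline.

```python
import numpy as np

def quantile_score(q, alpha, x):
    """Quantile (pinball) score 2*(1{x <= q} - alpha)*(q - x); lower is better."""
    return 2.0 * ((x <= q) - alpha) * (q - x)

def interval_score(lower, upper, alpha, x):
    """Interval score for a central (1 - alpha)*100% prediction interval [lower, upper]."""
    width = upper - lower
    below = (2.0 / alpha) * (lower - x) * (x < lower)
    above = (2.0 / alpha) * (x - upper) * (x > upper)
    return width + below + above

# Illustrative forecast reported as 5%, 25%, 50%, 75% and 95% quantiles.
levels = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
quantiles = np.array([80.0, 95.0, 105.0, 118.0, 140.0])
observed = 124.0

print(quantile_score(quantiles, levels, observed).mean())          # averaged quantile score
print(interval_score(quantiles[0], quantiles[-1], 0.10, observed)) # 90% central interval
```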
Stacking methods
Given an ensemble of $K$ models, stacking methods result in a posterior predictive density of the form
$$p(\tilde{y} \mid y) = \sum_{k=1}^{K} w_k \, p_k(\tilde{y} \mid y),$$
where $p_k(\tilde{y} \mid y)$ is the (posterior) predictive density from model $k$ and $w_k$ weights the contribution of the $k$th model to the overall prediction, with $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$. Giving equal weighting to each model in the stack has proved surprisingly effective in economic forecasting.[7]

Alternatively, given a score function $S$ and out-of-sample ensemble training data $y_1, \ldots, y_n$, weights can be chosen via
$$\hat{w} = \operatorname*{arg\,min}_{w \in \mathcal{S}^{K}} \; \sum_{i=1}^{n} S\!\left(\sum_{k=1}^{K} w_k \, p_k(\cdot \mid y), \, y_i\right),$$
where $\mathcal{S}^{K}$ is the $K$-dimensional simplex. This approach is the essence of Bayesian model stacking as described by Yao et al.[8]

Alternatively, scoring functions can be used to construct normalized weights
$$w_k = \frac{f(\bar{S}_k)}{\sum_{j=1}^{K} f(\bar{S}_j)},  \qquad (2)$$
with $\bar{S}_k$ being the average score for the $k$th model, and $f$ being an inversely monotonic function; with the log-score (1) and $f(x) = \exp(-x)$, Akaike Information Criterion style weights are obtained.[19]
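As a rough illustration of the simplex-constrained weight optimization, the sketch below (Python with SciPy) selects stacking weights by minimizing the average log-score of the mixture on a toy two-model ensemble. The softmax reparameterization, the Gaussian component densities and the optimizer choice are assumptions made for the example, not the procedure used for the SPI-M forecasts.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy ensemble: two Gaussian predictive densities evaluated at training observations.
train_obs = np.array([102.0, 110.0, 98.0, 115.0])
component_pdfs = [
    lambda x: norm.pdf(x, loc=100.0, scale=10.0),
    lambda x: norm.pdf(x, loc=120.0, scale=15.0),
]

def neg_avg_log_score(z):
    """Average log-score of the stacked mixture; the unconstrained parameters z are
    mapped onto the simplex with a softmax so the optimizer stays feasible."""
    w = np.exp(z) / np.exp(z).sum()
    mixture = sum(w_k * pdf(train_obs) for w_k, pdf in zip(w, component_pdfs))
    return -np.mean(np.log(mixture))

result = minimize(neg_avg_log_score, x0=np.zeros(len(component_pdfs)), method="Nelder-Mead")
weights = np.exp(result.x) / np.exp(result.x).sum()
print("stacking weights:", weights)
```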
Regression-based methods
Ensemble model predictions can also be formed using regression methods with covariates that correspond to point forecasts from the model ensemble. These methods are particularly suited to ‘low-resolution’ posterior predictive information, where only posterior summaries are available and the covariates can be defined directly from, for example, reported quantiles. We consider two such regression-based methods: Ensemble Model Output Statistics[20] (EMOS) and Quantile Regression Averaging[21] (QRA).

EMOS defines the ensemble prediction in the form of a Gaussian distribution
$$\tilde{Y} \mid y \sim \mathcal{N}\!\left(a + \sum_{k=1}^{K} b_k \hat{y}_k, \; c + d\, s^2\right),  \qquad (3)$$
where $\hat{y}_1, \ldots, \hat{y}_K$ are point forecasts from the individual models with coefficients $b_1, \ldots, b_K$, $s^2$ is the ensemble variance with $\bar{y}$ the ensemble mean, and $c$ and $d$ are regression coefficients. Tuning of the coefficients is achieved by minimizing the average CRPS using out-of-sample data. For $\tilde{Y}$ following distribution (3), the CRPS is available in closed form.[20] To aid interpretation, the coefficients can be constrained to be non-negative (this version of the algorithm is known as EMOS+).
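A minimal sketch of an EMOS-style fit is given below (Python with SciPy), minimizing the closed-form Gaussian CRPS over the regression coefficients. The synthetic training data, starting values and bound handling are illustrative; note that in the application described in Section 4 the intercept was additionally fixed to zero.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a Normal(mu, sigma^2) predictive distribution at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def emos_objective(params, point_forecasts, ensemble_var, y):
    """Average CRPS of the EMOS Gaussian; params = (a, b_1..b_K, c, d)."""
    K = point_forecasts.shape[1]
    a, b = params[0], params[1:1 + K]
    c, d = params[1 + K], params[2 + K]
    mu = a + point_forecasts @ b
    sigma = np.sqrt(np.maximum(c + d * ensemble_var, 1e-8))
    return crps_normal(mu, sigma, y).mean()

# Hypothetical training data: point forecasts from 3 models over a 10-day training window.
rng = np.random.default_rng(2)
forecasts = 100 + 10 * rng.standard_normal((10, 3))
variance = forecasts.var(axis=1, ddof=1)
observed = 100 + 8 * rng.standard_normal(10)

K = forecasts.shape[1]
bounds = [(None, None)] + [(0, None)] * K + [(0, None), (0, None)]  # EMOS+: b_k >= 0
fit = minimize(emos_objective, x0=np.r_[0.0, np.full(K, 1.0 / K), 1.0, 1.0],
               args=(forecasts, variance, observed), bounds=bounds)
print("EMOS coefficients:", fit.x)
```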
QRA defines a predictive model for each quantile level, $\alpha$, of the ensemble forecast as
$$q^{\mathrm{ens}}_{\alpha} = \sum_{k=1}^{K} w_k \, q_{\alpha, k},  \qquad (4)$$
where $q_{\alpha, k}$ is the $\alpha$-level quantile of the (posterior) predictive distribution for model $k$. We make the parsimonious assumption that the non-negative coefficients, $w_k$, are independent of the level $\alpha$, and estimate them by minimizing the weighted average interval score across the central predictive intervals defined from the reported quantiles, and out-of-sample data points $y_1, \ldots, y_n$:
$$\hat{w} = \operatorname*{arg\,min}_{w \geq 0} \; \sum_{i=1}^{n} \sum_{\alpha} \frac{\alpha}{2} \, S^{I}_{\alpha}\!\left(l_{\alpha,i}(w), u_{\alpha,i}(w); y_i\right),$$
with $l_{\alpha,i}(w)$ and $u_{\alpha,i}(w)$ the bounds of the central $(1-\alpha)\times 100\%$ interval formed from the combined quantiles (4), and where the factors $\alpha/2$ weight the interval scores such that, in the limit of including all intervals, the score approaches the CRPS.
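The QRA fit can be sketched in a similar way. In the illustrative Python example below, model quantiles over the training window are held in an array indexed by (training day, model, quantile level), the reported levels are assumed symmetric about the median, non-negativity is imposed through bound constraints, and the median term receives a weight of one half; these are assumptions of the sketch rather than prescriptions from the text.

```python
import numpy as np
from scipy.optimize import minimize

levels = np.array([0.05, 0.25, 0.50, 0.75, 0.95])   # assumed reported quantile levels
# model_quantiles[i, k, j]: quantile at levels[j] from model k for training day i
rng = np.random.default_rng(0)
model_quantiles = np.sort(100 + 15 * rng.standard_normal((10, 3, levels.size)), axis=2)
observations = 100 + 5 * rng.standard_normal(10)

def weighted_interval_score(quantiles, x):
    """Weighted sum of interval scores over the central intervals defined by symmetric
    quantile pairs (weights alpha/2), plus half the absolute error of the median."""
    score = 0.5 * np.abs(quantiles[..., levels == 0.5].squeeze(-1) - x)
    for j in range(levels.size // 2):
        alpha = 2 * levels[j]                 # e.g. the 5% and 95% quantiles give alpha = 0.1
        lower, upper = quantiles[..., j], quantiles[..., -(j + 1)]
        penalty = (2 / alpha) * ((lower - x) * (x < lower) + (x - upper) * (x > upper))
        score += (alpha / 2) * (upper - lower + penalty)
    return score

def objective(w):
    combined = np.tensordot(model_quantiles, w, axes=([1], [0]))   # shape (day, level)
    return weighted_interval_score(combined, observations).mean()

n_models = model_quantiles.shape[1]
fit = minimize(objective, x0=np.full(n_models, 1.0 / n_models),
               bounds=[(0.0, None)] * n_models)
print("QRA coefficients:", fit.x)
```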
COVID-19 pandemic forecast combinations
For the COVID-19 response, probabilistic forecasts from multiple epidemiological
models were provided by members of SPI-M at weekly intervals. Especially in the
early stages of the pandemic, the model structures and parameters were evolving to
reflect an increased understanding of COVID-19 epidemiology and changes in social
restrictions. The impact of evolving models is discussed in Section 5. Practical
constraints on data bandwidth and rapid delivery schedules resulted in individual
model forecasts being reported as quantiles of the predictive distributions for the
forecast window, and minimal information was available about the individual
posterior distributions. A dataset similar to that used for this study, but for a
different time window, is available to the reader (doi: 10.5281/zenodo.6778105). Whilst
EMOS and QRA can be directly applied using only posterior summaries, to implement
stacking we estimated posterior densities as skewed-Normal distributions fitted to
each set of quantiles. Stacking weights are obtained similarly to (2), with $f$ taken to be the reciprocal function,[22] that is
$$w_k = \frac{1/\bar{S}_k}{\sum_{j=1}^{K} 1/\bar{S}_j},  \qquad (5)$$
with
$$\bar{S}_k = \frac{\sum_{i=1}^{n} \lambda^{\,t_n - t_i}\, S\!\left(p_k(\cdot \mid y), y_i\right)}{\sum_{i=1}^{n} \lambda^{\,t_n - t_i}},$$
and $t_1 < \cdots < t_n$ the observation times in the training window. The exponential decay term (with $0 < \lambda \leq 1$) controls the relative influence of more recent observations.

In Section 4, three choices of stacking weights are compared: (i) reciprocal weights (5), which are invariant with respect to the future observation time; (ii) equal weights $w_k = 1/K$; and (iii) time-varying weights constructed via exponential interpolation between (i) and (ii), to reduce the influence, on forecasts further into the future, of the performance of individual models in the training window.
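A small sketch of the reciprocal weighting in (5), with an exponential decay over the training window, is given below; the decay parameter, the placeholder scores and the indexing convention are illustrative.

```python
import numpy as np

def reciprocal_weights(scores, days_before_forecast, decay=0.9):
    """Stacking weights proportional to the reciprocal of an exponentially decayed
    average score; scores[k, i] is the score of model k on training day i."""
    lam = decay ** np.asarray(days_before_forecast, dtype=float)  # recent days count more
    avg_score = (scores * lam).sum(axis=1) / lam.sum()
    w = 1.0 / avg_score
    return w / w.sum()

# Illustrative: three models scored over a 10-day training window (day 10 is oldest).
scores = np.array([[2.1] * 10, [3.5] * 10, [2.8] * 10])
print(reciprocal_weights(scores, days_before_forecast=np.arange(10, 0, -1)))
```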
Comparing the performance of ensemble and individual forecasts
The performance of the different ensemble and individual forecasts from the models provided by SPI-M was assessed for four separate 14-day forecast windows. For each window, the ensemble methods were trained by comparing forecasts from the individual models, over the days immediately preceding the forecast window, to corresponding observational data obtained from government-provided data streams. Three different COVID-19-related quantities (see Table 1) for the four UK nations and seven regions were considered. Note that each model only provided forecasts for a subset of quantities and nations/regions. Since there was an overlap between the 14-day forecasts provided at weekly intervals for each model, training predictions were selected as the most recently reported that had not been conditioned on data from the combination training window. The forecast and training windows are summarized in Table 2. The assessment was conducted after a sufficient delay such that the effects of under-reporting on the observational data were negligible. However, it is unknown whether individual SPI-M models attempted to account for potential under-reporting bias in the data used for parameter fitting.
Table 1. COVID-19 value types (model outputs of interest) for which forecasts were scored.

Value type        Description
death_inc_line    New daily deaths by date of death
hospital_prev     Hospital bed occupancy
icu_prev          Intensive care unit (ICU) occupancy
Table 2. Forecast windows and the corresponding dates over which the ensemble methods were trained. The number of contributing individual models is provided for each forecast window. Model 6 was missing for the week beginning 23rd June, and Model 5 was missing for the week beginning 14th July.

Forecast window start date   Training window        Number of contributing models
23rd June 2020               13th June–22nd June    6
30th June 2020               20th June–29th June    7
7th July 2020                27th June–6th July     7
14th July 2020               4th July–13th July     6
We present results for the individual models and the stacking, EMOS and QRA ensemble methods, as described in Section 2. Data-driven, equal and time-varying weights were applied with model stacking (see Section 3). EMOS coefficients were estimated by minimizing CRPS, with the intercept set to zero to force the combination to use the model predictions. While this disabled some potential for bias correction, it was considered important that the combined forecasts could be interpreted as weighted averages of the individual model predictions. QRA was parameterized via minimization of the average of the weighted interval scores for the median (i.e. the quantile score for the median) and the reported central prediction intervals, as described in Section 2.3. The same score was used to calculate the stacking weights in (5). In each case, the ensemble training data points from the windows in Table 2 were used, and optimization was performed using a particle swarm algorithm.[23] Performance was measured for each model/ensemble method using the weighted average of the interval score, over the 14-day forecast window and across the reported central prediction intervals. In addition, three well-established assessment metrics were calculated: sharpness, bias and calibration.[24] Sharpness is a measure of prediction uncertainty, and is defined here as the average width of the central prediction interval over the forecast window. Bias ($\hat{\beta}$) measures over- or under-prediction of a predictive model as the proportion of predictions for which the reported median is greater than the data value over the forecast window. Calibration ($\hat{\gamma}$) quantifies the statistical consistency between the predictions and data, via the proportion of predictions for which the data lies inside the central predictive interval. The bias and calibration scores were linearly transformed such that a well-calibrated prediction with no bias or uncertainty corresponds to the origin of the calibration-bias plots shown below.
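The three assessment metrics can be computed directly from reported forecast quantiles and observations. The sketch below assumes, purely for illustration, that the interval used for sharpness and calibration is bounded by the lowest and highest reported quantiles and that forecasts are stored as simple NumPy arrays.

```python
import numpy as np

def assessment_metrics(lower, median, upper, observed):
    """Sharpness, bias and calibration over a forecast window; lower/upper bound the
    central prediction interval used for sharpness and calibration."""
    sharpness = np.mean(upper - lower)                  # average interval width
    bias = np.mean(median > observed)                   # proportion of over-predictions
    calibration = np.mean((observed >= lower) & (observed <= upper))
    return sharpness, bias, calibration

# Hypothetical 14-day forecast window.
rng = np.random.default_rng(1)
obs = 200 + 5 * rng.standard_normal(14)
med = obs + rng.normal(2.0, 3.0, size=14)
lo, hi = med - 20.0, med + 20.0
print(assessment_metrics(lo, med, hi, obs))
```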
Table 3 summarizes the
performance of the ensemble methods and averaged performance of the individual
models across the four forecast windows. Averaged across nations/regions, the
best-performing forecasts for new daily deaths were obtained using stacking with
time-invariant weights; for hospital bed occupancy QRA performed best; and for ICU
occupancy, the best method was EMOS. The lowest interval scores occur when
predicting new daily deaths, reflecting the more accurate and precise individual
model predictions for this output, relative to the others. The ensemble methods also
all perform similarly for this output. Importantly, every ensemble method improves
upon the average scores for the individual models. This means that with no prior
knowledge about model fidelity, if a single forecast is desired, it is better on
average to use an ensemble than to select a single model.
Table 3. Median interval and mean bias and calibration scores for each value type, averaged over regions/nations for each ensemble algorithm. The mean score for the individual models is also shown. Rows are ordered by increasing interval score in each subtable. The median was chosen for the interval score due to the presence of extreme values.

(a) death_inc_line
Model                              Median interval score   Mean bias   Mean calibration
Stacked: time-invariant weights    2.16                    0.66        0.49
Stacked: equal-weights             2.25                    0.72        0.57
QRA                                2.28                    0.65        0.51
Stacked: time-varying weights      2.43                    0.76        0.45
EMOS                               2.82                    0.76        0.54
Models                             3.03                    0.77        0.61

(b) hospital_prev
Model                              Median interval score   Mean bias   Mean calibration
QRA                                18.0                    0.80        0.78
Stacked: time-invariant weights    24.5                    0.81        0.68
Stacked: time-varying weights      25.4                    0.84        0.77
EMOS                               25.4                    0.82        0.76
Stacked: equal-weights             27.6                    0.82        0.79
Models                             33.2                    0.88        0.77

(c) icu_prev
Model                              Median interval score   Mean bias   Mean calibration
EMOS                               2.62                    0.76        0.63
QRA                                3.68                    0.84        0.75
Stacked: time-invariant weights    3.80                    0.78        0.75
Stacked: time-varying weights      3.84                    0.78        0.74
Stacked: equal-weights             4.07                    0.79        0.77
Models                             4.28                    0.85        0.74

QRA: Quantile Regression Averaging; EMOS: Ensemble Model Output Statistics.
To examine the performance of the ensemble methods and individual model predictions
in more detail, we plot results for the nations/regions separately for different
forecast windows; see Figure 1 for two examples. We plot the transformed bias score against the transformed calibration score and divide the plotting region into four
quadrants; the top-right quadrant represents perhaps the least worrying errors, as
here the methods over-predict the outcomes with prediction intervals that are
under-confident. Where both observational data and model predictions were available,
the metrics were evaluated for the three value types and 11 regions/nations.
Unfortunately, it is difficult to ascertain any patterns for either example. The
best forecasting model was also highly variable across nations/regions and value
types (as shown for calibration and bias in Figure 2), and was often an individual
model.
Figure 1.
Sharpness, bias and calibration scores for the (left) individual and (right)
ensemble forecasts, for all regions and value types delivered on (top) 30th
June and (bottom) 7th July 2020. Note that multiple points are hidden when
they coincide. The shading of the quadrants (from darker to lighter) implies
a preference for over-prediction rather than under-prediction, and for
prediction intervals that contain too many data points, rather than too
few.
Figure 2.
The best-performing individual model or ensemble method for each
region/nation and value type (for forecasts delivered on the 23rd and 30th
June, and 7th and 14th July 2020), evaluated using the absolute distance
from the origin on the calibration-bias plots. Ties were broken using the
sharpness score. For each date, only the overall best-performing
model/ensemble is displayed, but for clarity, the results are separated into
(left) individual models and (right) combinations. Region/nation and value
type pairs for which there were fewer than two individual models with both
training and forecast data available were excluded from the analysis.
The variability in performance was particularly stark for QRA and EMOS, which were
found in some cases to vastly outperform the individual and stacked forecasts, and
in others to substantially underperform (as shown in Figure 3). The former case generally
occurred when all the individual model training predictions were highly biased (e.g.
for occupied ICU beds in the South West for the 23rd June forecast). In these cases,
the fact that the QRA and EMOS coefficients are not constrained to form a convex combination allowed the regression-based forecasts to correct for this bias. Bias correction is, of course, the problem for which these methods were originally proposed in weather forecasting. Whether this behaviour is desirable depends upon whether the data is believed to be accurate, or itself subject to large systematic biases, such as under-reporting, that are not accounted for within the model predictions. The latter case of underperformance occurred in the presence of large discontinuities between the individual model training predictions and forecasts, corresponding to changes in model structure or parameters. This disruption to the learnt relationships between the individual model predictions and the data (as captured by the regression models) led to increased forecast bias (an example of this is shown in Figure 4).
Figure 3.
Performance of individual models and ensemble methods for each region/nation
and value type (for forecasts delivered on the 23rd and 30th June, and 7th
and 14th July 2020). The height of each bar is calculated as the reciprocal
of the weighted average interval score, so that higher bars correspond to
better performance. Gaps in the results correspond to region/nation and
value type pairs for which a model did not provide forecasts. No combined
predictions were produced for Scotland hospital_prev on the 7th July as
forecasts were only provided from a single model.
Figure 4.
Quantile Regression Averaging (QRA) forecast for hospital bed occupancy in
the North West region. A large discontinuity between the current and past
forecasts (black line) of the individual model corresponding to the
covariate with the largest regression coefficient can lead to increased bias
for the QRA algorithm. The QRA median and central prediction intervals are shown in blue,
while the data is shown in red.
In comparison, the relative performance of the stacking methods was more stable
across different regions/nations and value types (see Figure 3), which likely reflects the
conservative nature of the simplex weights in comparison to the optimized regression
coefficients. The latter were often observed to assign all the weight to a single
model, which is potentially not a robust strategy in the context of evolving model
structures. In terms of sharpness, the stacked forecasts were always outperformed by
the tight predictions of some of the individual models, and by the EMOS method.

It is worth noting the delay in the response of the combination methods to changes in
individual model performance. For example, the predictions from Model 8 for hospital bed occupancy in the Midlands improved considerably, making it the top-performing model on 7th July. The weight assigned to this model, for example in the stacking with time-invariant weights algorithm, increased between the 7th July and 14th July forecasts (out of two models providing forecasts for this value type and region), but the model only became the highest weighted for the forecast window beginning on 21st July.
This behaviour arises from the requirement for sufficient training data from the
improved model to become available in order to detect the change in performance.
Discussion
Scoring rules are a natural way of constructing weights for forecast combinations. In
comparison to Bayesian model averaging, weights obtained from scoring rules are
directly tailored to approximate a predictive distribution and reduce sensitivity to
the choice of prior distribution. Crucially, the use of scoring rules avoids the pitfall
associated with model averaging of convergence to a predictive density from a single
model, even when the ensemble of models does not include the true data generator.
Guidance is available in the literature on the situations in which different averaging and
ensemble methods are appropriate.[25] In this study, several
methods (stacking, EMOS and QRA) to combine epidemiological forecasts have been
investigated within the context of the delivery of scientific advice to
decision-makers during a pandemic. Their performance was evaluated using the
well-established sharpness, bias and
calibration metrics as well as the interval score. When
averaged over nations/regions, the best-performing forecasts, according to both the metrics and the interval score, originated from the time-invariant weights stacking method for new daily deaths, EMOS for ICU occupancy, and QRA for hospital bed occupancy.
However, the performance metrics for each model and ensemble method were found to
vary considerably over the different regions and value type combinations. Whilst
some individual models were observed to perform consistently well for particular
region and value type combinations, the extent to which the best-performing models
remain stable over time requires further investigation using additional forecasting
data.

The rapid evolution of the models (through changes in both parameterization and
structure) during the COVID-19 outbreak has led to substantial changes in each
model’s predictive performance over time. This represents a significant challenge
for ensemble methods that essentially use a model’s past performance to predict its
future utility, and has resulted in cases where the ensemble forecasts do not
represent an improvement on the individual model forecasts. For the stacking approaches, this challenge could be overcome by either (a) additionally providing quantile predictions from the latest version of the models for (but not fitted to) data points within a training window, or (b) providing sampled trajectories from the posterior predictive distribution of the latest model at data points that have been used for parameter estimation. Option (a) would allow direct application of the current algorithms to the latest models, but may be complicated by additions to the model structure due to, for example, changes in control measures, whilst (b) would enable
the application of full Bayesian stacking approaches using, for example,
leave-one-out cross validation.[8] However, it is important to
consider the practical constraints on the resolution of information available during
the rapid delivery of scientific advice during a pandemic. With no additional
information, it is possible to make the following simple modification to the
regression approaches to reduce the prediction bias associated with changes to the
model structure or parameters.

Analysis of the individual model forecasts revealed that large discrepancies between the past and present forecasts from an individual model could lead to increased bias, particularly for the QRA and EMOS combined forecasts. The overlapping of past and present forecasts allows this discrepancy to be characterized, and its impact reduced, by translating the individual model predictions (the covariates in the regression models) within the forecast window so that they match the training predictions at the start of the forecast window. For cases where the discrepancy is large (e.g. Figure 4), the reduction in the bias of this shifted QRA (SQRA) over QRA is striking (see Figure 5).
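A minimal interpretation of this shift is sketched below: each model's current forecast quantiles are translated so that its first-day median matches the prediction previously made for that day during the training window, before QRA is applied. The array layout, the use of the median as the reference point and the example numbers are assumptions.

```python
import numpy as np

def sqra_shift(current_forecast, previous_forecast_at_start):
    """Translate a model's current forecast quantiles (days x quantile levels) so that
    its median on the first forecast day matches the prediction made for that day by
    the model's previous (training-window) forecast."""
    median_col = current_forecast.shape[1] // 2   # assumes an odd, symmetric set of levels
    shift = previous_forecast_at_start - current_forecast[0, median_col]
    return current_forecast + shift

# Hypothetical example: a revised model whose new forecast jumps well above its previous one.
levels = [0.05, 0.25, 0.50, 0.75, 0.95]
current = np.linspace(500.0, 540.0, 14)[:, None] + np.array([-60.0, -20.0, 0.0, 20.0, 60.0])
shifted = sqra_shift(current, previous_forecast_at_start=430.0)
print(shifted[0])   # first-day quantiles now centred on the earlier prediction
```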
Figure 5.
QRA (blue) and SQRA (green) forecasts for hospital bed occupancy in the North
West region for a forecast window beginning on 14 May. SQRA corrects for the
discontinuity between past and current forecasts for the individual model
(black line) that corresponds to the covariate with the largest coefficient.
Data is shown in red.
Table 4 shows the
average interval scores for the SQRA algorithm over the four forecast windows. For
new daily deaths and ICU occupancy, the SQRA algorithm achieves better median scores
than the other methods considered. These promising results motivate future research
into ensemble methods that are robust to the practical limitations imposed by the
pandemic response context.
Table 4. Median interval and mean bias and calibration scores for shifted Quantile Regression Averaging (SQRA) for each value type, over the four forecast windows.

Value type        Median interval score   Mean bias   Mean calibration
death_inc_line    1.86                    0.45        0.46
hospital_prev     24.0                    0.86        0.72
icu_prev          2.23                    0.66        0.63