Literature DB >> 35394862

Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States.

Estee Y Cramer¹, Evan L Ray¹, Velma K Lopez², Johannes Bracher^3,4, Andrea Brennen⁵, Alvaro J Castro Rivadeneira¹, Aaron Gerding¹, Tilmann Gneiting^4,6, Katie H House¹, Yuxin Huang¹, Dasuni Jayawardena¹, Abdul H Kanji¹, Ayush Khandelwal¹, Khoa Le¹, Anja Mühlemann⁷, Jarad Niemi⁸, Apurv Shah¹, Ariane Stark¹, Yijin Wang¹, Nutcha Wattanachit¹, Martha W Zorn¹, Youyang Gu⁹, Sansiddh Jain¹⁰, Nayana Bannur¹⁰, Ayush Deva¹⁰, Mihir Kulkarni¹⁰, Srujana Merugu¹⁰, Alpan Raval¹⁰, Siddhant Shingi¹⁰, Avtansh Tiwari¹⁰, Jerome White¹⁰, Neil F Abernethy¹¹, Spencer Woody¹², Maytal Dahan¹³, Spencer Fox¹², Kelly Gaither¹³, Michael Lachmann¹⁴, Lauren Ancel Meyers¹², James G Scott¹⁵, Mauricio Tec¹⁶, Ajitesh Srivastava¹⁷, Glover E George¹⁸, Jeffrey C Cegan¹⁹, Ian D Dettwiller¹⁸, William P England¹⁸, Matthew W Farthing¹⁸, Robert H Hunter¹⁸, Brandon Lafferty¹⁸, Igor Linkov¹⁹, Michael L Mayo¹⁸, Matthew D Parno²⁰, Michael A Rowland¹⁸, Benjamin D Trump¹⁹, Yanli Zhang-James²¹, Samuel Chen²², Stephen V Faraone²¹, Jonathan Hess²¹, Christopher P Morley²³, Asif Salekin²⁴, Dongliang Wang²³, Sabrina M Corsetti²⁵, Thomas M Baer²⁶, Marisa C Eisenberg^27,28,29, Karl Falb²⁵, Yitao Huang²⁵, Emily T Martin²⁹, Ella McCauley²⁵, Robert L Myers²⁵, Tom Schwarz²⁵, Daniel Sheldon³⁰, Graham Casey Gibson³¹, Rose Yu^32,33, Liyao Gao³⁴, Yian Ma³⁵, Dongxia Wu³², Xifeng Yan³⁶, Xiaoyong Jin³⁶, Yu-Xiang Wang³⁶, YangQuan Chen³⁷, Lihong Guo³⁸, Yanting Zhao³⁹, Quanquan Gu⁴⁰, Jinghui Chen⁴⁰, Lingxiao Wang⁴⁰, Pan Xu⁴⁰, Weitong Zhang⁴⁰, Difan Zou⁴⁰, Hannah Biegel⁴¹, Joceline Lega⁴¹, Steve McConnell⁴², V P Nagraj⁴³, Stephanie L Guertin⁴³, Christopher Hulme-Lowe⁴⁴, Stephen D Turner⁴³, Yunfeng Shi⁴⁵, Xuegang Ban⁴⁶, Robert Walraven⁴⁷, Qi-Jun Hong^48,49, Stanley Kong⁵⁰, Axel van de Walle⁴⁹, James A Turtle⁵¹, Michal Ben-Nun⁵¹, Steven Riley⁵², Pete Riley⁵¹, Ugur Koyluoglu⁵³, David DesRoches⁵⁴, Pedro Forli⁵⁵, Bruce Hamory⁵⁶, Christina Kyriakides⁵⁷, Helen Leis⁵⁸, John Milliken⁵³, Michael Moloney⁵³, James Morgan⁵³, Ninad Nirgudkar⁵⁹, Gokce Ozcan⁵³, Noah Piwonka⁵⁸, Matt Ravi⁵⁹, Chris Schrader⁵⁸, Elizabeth Shakhnovich⁵⁸, Daniel Siegel⁵³, Ryan Spatz⁵⁹, Chris Stiefeling⁶⁰, Barrie Wilkinson⁶¹, Alexander Wong⁵⁷, Sean Cavany⁶², Guido España⁶², Sean Moore⁶², Rachel Oidtman^62,63, Alex Perkins⁶², David Kraus⁶⁴, Andrea Kraus⁶⁴, Zhifeng Gao⁶⁵, Jiang Bian⁶⁵, Wei Cao⁶⁵, Juan Lavista Ferres⁶⁵, Chaozhuo Li⁶⁵, Tie-Yan Liu⁶⁵, Xing Xie⁶⁵, Shun Zhang⁶⁵, Shun Zheng⁶⁵, Alessandro Vespignani^66,67, Matteo Chinazzi⁶⁷, Jessica T Davis⁶⁷, Kunpeng Mu⁶⁷, Ana Pastore Y Piontti⁶⁷, Xinyue Xiong⁶⁷, Andrew Zheng⁶⁸, Jackie Baek⁶⁸, Vivek Farias⁶⁹, Andreea Georgescu⁶⁸, Retsef Levi⁶⁹, Deeksha Sinha⁶⁸, Joshua Wilde⁶⁸, Georgia Perakis⁷⁰, Mohammed Amine Bennouna⁷⁰, David Nze-Ndong⁷⁰, Divya Singhvi⁷¹, Ioannis Spantidakis⁷⁰, Leann Thayaparan⁷⁰, Asterios Tsiourvas⁷⁰, Arnab Sarker⁷², Ali Jadbabaie⁷², Devavrat Shah⁷², Nicolas Della Penna⁷³, Leo A Celi⁷³, Saketh Sundar⁷⁴, Russ Wolfinger⁷⁵, Dave Osthus⁷⁶, Lauren Castro⁷⁷, Geoffrey Fairchild⁷⁷, Isaac Michaud⁷⁶, Dean Karlen^78,79, Matt Kinsey⁸⁰, Luke C Mullany⁸⁰, Kaitlin Rainwater-Lovett⁸⁰, Lauren Shin⁸⁰, Katharine Tallaksen⁸⁰, Shelby Wilson⁸⁰, Elizabeth C Lee⁸¹, Juan Dent⁸¹, Kyra H Grantz⁸¹, Alison L Hill⁸², Joshua Kaminsky⁸¹, Kathryn Kaminsky⁸³, Lindsay T Keegan⁸⁴, Stephen A Lauer⁸¹, Joseph C Lemaitre⁸⁵, Justin Lessler⁸¹, Hannah R Meredith⁸¹, Javier Perez-Saez⁸¹, Sam Shah⁸⁶, Claire P Smith⁸¹, Shaun A Truelove^81,87,88, Josh Wills⁸⁶, Maximilian Marshall⁸⁹, Lauren Gardner⁸⁹, Kristen Nixon⁸⁹, John C Burant⁹⁰, Lily Wang⁸, Lei Gao⁹¹, Zhiling Gu⁸, Myungjin Kim⁸, Xinyi Li⁹², Guannan Wang⁹³, Yueying Wang⁸, Shan Yu⁹⁴, Robert C Reiner⁹⁵, Ryan Barber⁹⁵, Emmanuela Gakidou⁹⁵, Simon I Hay⁹⁵, Steve Lim⁹⁵, Chris Murray⁹⁵, David Pigott⁹⁵, Heidi L Gurung⁹⁶, Prasith Baccam⁹⁶, Steven A Stage⁹⁷, Bradley T Suchoski⁹⁶, B Aditya Prakash⁹⁸, Bijaya Adhikari⁹⁹, Jiaming Cui⁹⁸, Alexander Rodríguez⁹⁸, Anika Tabassum¹⁰⁰, Jiajia Xie⁹⁸, Pinar Keskinocak¹⁰¹, John Asplund¹⁰², Arden Baxter¹⁰¹, Buse Eylul Oruc¹⁰¹, Nicoleta Serban¹⁰¹, Sercan O Arik¹⁰³, Mike Dusenberry¹⁰³, Arkady Epshteyn¹⁰³, Elli Kanal¹⁰³, Long T Le¹⁰³, Chun-Liang Li¹⁰³, Tomas Pfister¹⁰³, Dario Sava¹⁰³, Rajarishi Sinha¹⁰³, Thomas Tsai¹⁰⁴, Nate Yoder¹⁰³, Jinsung Yoon¹⁰³, Leyou Zhang¹⁰³, Sam Abbott¹⁰⁵, Nikos I Bosse¹⁰⁵, Sebastian Funk¹⁰⁵, Joel Hellewell¹⁰⁵, Sophie R Meakin¹⁰⁵, Katharine Sherratt¹⁰⁵, Mingyuan Zhou¹⁰⁶, Rahi Kalantari¹⁰⁷, Teresa K Yamana¹⁰⁸, Sen Pei¹⁰⁸, Jeffrey Shaman¹⁰⁸, Michael L Li⁶⁸, Dimitris Bertsimas⁶⁹, Omar Skali Lami⁶⁸, Saksham Soni⁶⁸, Hamza Tazi Bouardi⁶⁸, Turgay Ayer^101,109, Madeline Adee¹¹⁰, Jagpreet Chhatwal¹¹⁰, Ozden O Dalgic¹¹¹, Mary A Ladd¹¹⁰, Benjamin P Linas¹¹², Peter Mueller¹¹⁰, Jade Xiao¹⁰¹, Yuanjia Wang^113,114, Qinxia Wang¹¹³, Shanghong Xie¹¹³, Donglin Zeng¹¹⁵, Alden Green¹¹⁶, Jacob Bien¹¹⁷, Logan Brooks¹¹⁶, Addison J Hu¹¹⁶, Maria Jahja¹¹⁶, Daniel McDonald¹¹⁸, Balasubramanian Narasimhan^119,120, Collin Politsch¹²¹, Samyak Rajanala¹²⁰, Aaron Rumack¹²¹, Noah Simon¹²², Ryan J Tibshirani¹¹⁶, Rob Tibshirani¹²⁰, Valerie Ventura¹¹⁶, Larry Wasserman¹¹⁶, Eamon B O'Dea¹²³, John M Drake¹²³, Robert Pagano¹²⁴, Quoc T Tran¹²⁵, Lam Si Tung Ho¹²⁶, Huong Huynh¹²⁷, Jo W Walker², Rachel B Slayton², Michael A Johansson², Matthew Biggerstaff², Nicholas G Reich¹.

Abstract

Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. Starting in April 2020, the US COVID-19 Forecast Hub (https://covid19forecasthub.org/) collected, disseminated, and synthesized tens of millions of specific predictions from more than 90 different academic, industry, and independent research groups. A multimodel ensemble forecast that combined predictions from dozens of groups every week provided the most consistently accurate probabilistic forecasts of incident deaths due to COVID-19 at the state and national level from April 2020 through October 2021. The performance of 27 individual models that submitted complete forecasts of COVID-19 deaths consistently throughout this year showed high variability in forecast skill across time, geospatial units, and forecast horizons. Two-thirds of the models evaluated showed better accuracy than a naïve baseline model. Forecast accuracy degraded as models made predictions further into the future, with probabilistic error at a 20-wk horizon three to five times larger than when predicting at a 1-wk horizon. This project underscores the role that collaboration and active coordination between governmental public-health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks.

Entities: Chemical

Keywords: COVID-19; ensemble forecast; forecasting; model evaluation

Mesh：

Year: 2022 PMID： 35394862 PMCID： PMC9169655 DOI： 10.1073/pnas.2113561119

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN： 0027-8424 Impact factor: 12.779

Effective pandemic response requires federal, state, and local leaders to make timely decisions in order to reduce disease transmission. During the COVID-19 pandemic, surveillance data on the number of cases, hospitalizations, and disease-associated deaths were used to inform response policies (1, 2). While these data provide insight into recent trends in the outbreak, they only present a partial, time-lagged picture of transmission and do not show if and when changes may occur in the future. Anticipating outbreak change is critical for optimal resource allocation and response. Forecasting models provide quantitative, evaluable, and probabilistic predictions about the epidemic trajectory for the near-term future. Forecasts can inform operational decisions about allocation of healthcare supplies (e.g., personal protective equipment, therapeutics, and vaccines), staffing needs, and school closures (3). Providing prediction uncertainty is critical for such decisions, as it allows stakeholders to assess the most likely outcomes and plausible worst-case scenarios (3). Academic research groups, government agencies, industry teams, and individuals produced COVID-19 forecasts at an unprecedented scale starting in March 2020. Publicly available forecasts reflect varied approaches, data sources, and assumptions. Some models had mechanisms that allowed them to incorporate an estimated impact of current or potential future policies on human behavior and COVID-19 transmission. Other models assumed that currently observed trends would continue into the future without considering external data on policies in different jurisdictions. To leverage these forecasts for the COVID-19 response, the US Centers for Disease Control and Prevention (CDC) partnered with the Reich Laboratory at the University of Massachusetts Amherst to create the COVID-19 Forecast Hub (https://covid19forecasthub.org/) (4). Launched in early April 2020, the Forecast Hub facilitated the collection, archiving, evaluation, and synthesis of forecasts. Teams were explicitly asked to submit “unconditional” forecasts of the future, in other words, predictions that integrate across all possible changes in future dynamics. In practice, most individual models made predictions that were conditional on explicit or implicit assumptions of how policies, behaviors, and pathogens would evolve in the coming weeks. From these forecasts, a multimodel ensemble was developed, published weekly in real time, and used by the CDC in official public communications about the pandemic (https://www.cdc.gov/coronavirus/2019-ncov/covid-data/mathematical-modeling.html) (5). Forecasts were generated for the outcomes of reported cases, hospitalizations, and deaths due to COVID-19. This paper focuses on evaluating forecasts of reported deaths. Ensemble models incorporate the information and uncertainties from multiple forecasts, each with their own perspectives, strengths, and limitations, to create accurate predictions with well-calibrated uncertainty (6–11). Synthesizing multiple models removes the risk of overreliance on any single approach for accuracy or stability. It is challenging for individual models to make calibrated predictions of the future when the behavior of the system being studied is nonstationary due to continually changing policies and behaviors. Ensemble approaches have previously demonstrated superior performance compared with single models in forecasting influenza (12–14), Ebola (15), and dengue fever outbreaks (16). Preliminary research suggested that COVID-19 ensemble forecasts were also more accurate and precise than individual models in the early phases of the pandemic (17, 18). Predicting the trajectory of a novel pathogen outbreak such as COVID-19 is subject to many challenges. These include the role of human behavior and decision-making in outbreak trajectories, and the fact that epidemic forecasts may play a role in a “feedback loop” when and if the forecasts themselves have the ability to impact future societal or individual decision-making (19). There are also a host of data irregularities, especially in the early stages of the pandemic. It is important to systematically and rigorously evaluate forecasts designed to predict real-time changes to the outbreak in order to identify strengths and weaknesses of different approaches and to understand the extent to which the forecasts are a reliable input to public-health decisions. Knowledge of what leads to more or less accurate and well-calibrated forecasts can inform their development and their use within outbreak science and public policy. In this analysis, we sought to evaluate the accuracy of individual and ensemble probabilistic forecasts submitted to the Forecast Hub, focusing on forecasts of reported weekly incident deaths.

Results

Summary of Models.

Forecasts evaluated in this analysis are based on submissions in a continuous 79-wk period starting in late April 2020 and ending in late October 2021 (Fig. 1 and Methods). Forecasts were evaluated at 55 locations including all 50 states, four jurisdictions and territories (Guam, US Virgin Islands, Puerto Rico, and the District of Columbia), and the US national level. The evaluation period captured the decline of the spring 2020 wave, a late summer 2020 increase in several locations, a large late-fall/early-winter surge in 2020/2021, and the rise and fall of the Delta variant in the summer and fall of 2021 (Fig. 1).

Fig. 1.

Overview of the evaluation period included in the paper. Vertical dashed lines indicate “phases” of the pandemic analyzed separately in . (A) The reported number of incident weekly COVID-19 deaths by state or territory, per JHU CSSE reports. Locations are sorted by the cumulative number of deaths as of October 30th, 2021. (B) The time series of weekly incident deaths at the national level overlaid with example forecasts from the COVID-19 Forecast Hub ensemble model. (C) The number of models submitting forecasts for incident deaths each week. Weeks in which the ensemble was submitted are shown with a red asterisk. The number of models that submitted forecasts of incident deaths to the Forecast Hub and were screened for inclusion in this analysis increased from four models at the beginning of the evaluation period to an average of 41.2 models per week during the first 10 months of 2021 (Fig. 1 and ). A total of 28 models met inclusion criteria, yielding 1,791 submission files with 556,050 specific predictions for unique combinations of forecast dates, targets (horizons forecasted), and locations. The evaluated forecasts used different data sources and made varying assumptions about future transmission patterns (). All evaluated models, other than CEID-Walk, the COVIDhub-baseline, the COVIDhub-ensemble, and PSI-Draft, used case data as inputs to their forecast models. A total of 10 models included data on COVID-19 hospitalizations, 10 models incorporated demographic data, and 9 models used mobility data. Of the 28 evaluated models, 7 made explicit assumptions that social distancing and other behavioral patterns would change over the prediction period. Two naïve models were included. The COVIDhub-baseline is a neutral model built with median predicted incidence equal to the number of reported deaths in the most recent week with uncertainty around the median based on weekly differences in previous observations (see Methods). CEID-Walk is a random walk model with simple outlier handling ().

Overall Model Accuracy.

To evaluate probabilistic accuracy, the primary metric used was the weighted interval score (WIS), a nonnegative metric, which measures how consistent a collection of prediction intervals is with an observed value (20). For WIS, a lower value represents smaller error (see Methods and ). Led by the ensemble model, a majority of the evaluated models achieved better accuracy than the baseline model in forecasting incident deaths (Table 1). The COVIDhub-ensemble achieved a relative WIS of 0.61, which can be interpreted as achieving, on average, 39% less probabilistic error than the baseline forecast in the evaluation period, adjusting for the difficulty of the specific predictions made. An additional seven models achieved a relative WIS of less than or equal to 0.75. In total, 18 models had a relative WIS of less than 1, indicating lower probabilistic forecast error than the baseline model, and 10 models (including the baseline) had a relative WIS of 1 or greater (Table 1). Patterns in relative point forecast error were similar, with 18 models having equal or lower mean absolute error (MAE) than the baseline (Table 1). Values of relative WIS and rankings of models were robust to changing thresholds for submission inclusion criteria and to the inclusion or exclusion of individual outlying or revised observations (). When stratified by phase of the pandemic, different models showed the highest accuracy overall ().

Table 1.

Summary accuracy metrics for all submitted forecasts from 28 models meeting inclusion criteria, aggregated across locations (50 states only), submission week, and 1- through 4-wk forecast horizons

Model	No. forecasts	95% PI Coverage	50% PI Coverage	Relative WIS	Relative MAE
BPagano-RtDriven	10,864	0.72	0.36	0.77	0.80
CEID-Walk	12,161	0.78	0.45	1.00	1.03
CMU-TimeSeries	10,456	0.77	0.42	0.78	0.80
Covid19Sim-Simulator	11,770	0.34	0.11	1.02	0.85
CovidAnalytics-DELPHI	11,064	0.82	0.46	0.99	1.01
COVIDhub-baseline	15,460	0.88	0.51	1.00	1.00
COVIDhub-ensemble	14,260	0.90	0.53	0.61	0.66
CU-select	13,710	0.72	0.43	0.92	0.89
DDS-NBDS	12,261	0.86	0.43	1.25	2.19
epiforecasts-ensemble1	12,204	0.87	0.46	3.17	2.74
GT-DeepCOVID	13,585	0.84	0.41	0.75	0.82
IHME-SEIR	11,116	0.59	0.25	0.79	0.82
JHU_CSSE-DECOM	10,190	0.80	0.35	0.75	0.80
JHU_IDD-CovidSP	14,170	0.82	0.33	0.99	1.04
JHUAPL-Bucky	11,664	0.63	0.29	1.05	1.06
Karlen-pypm	13,060	0.86	0.47	0.64	0.70
LANL-GrowthRate	13,560	0.83	0.38	0.85	0.91
MOBS-GLEAM_COVID	15,452	0.71	0.37	0.77	0.78
OliverWyman-Navigator	10,548	0.82	0.45	0.72	0.76
PSI-DRAFT	13,209	0.34	0.15	1.51	1.27
RobertWalraven-ESG	13,430	0.51	0.28	1.13	0.97
SteveMcConnell-CovidComplete	12,063	0.8	0.45	0.74	0.77
UA-EpiCovDA	13,710	0.72	0.41	0.98	0.94
UCLA-SuEIR	10,549	0.31	0.09	1.37	1.21
UCSD_NEU-DeepGLEAM	11,664	0.91	0.7	0.83	0.78
UMass-MechBayes	14,660	0.93	0.56	0.63	0.67
UMich-RidgeTfReg	11,394	0.63	0.34	1.18	1.08
USC-SI_kJalpha	9,660	0.52	0.22	0.75	0.72

The “No. forecasts” column refers to the number of individual location/target/week combinations. Empirical prediction interval (PI) coverage rates calculate the fraction of times the 50% or 95% PIs covered the eventually observed value. Values within 5% coverage of the nominal rates are highlighted in boldface text. The “relative WIS” and “relative MAE” columns show the relative mean WIS and relative MAE, which compare each model to the baseline model while adjusting for the difficulty of the forecasts the given model made for state-level forecasts (see Methods). The baseline model is defined to have a relative score of 1. Models with relative WIS or MAE values lower than 1 had “better” accuracy relative to the baseline model (best score in bold).

Summary accuracy metrics for all submitted forecasts from 28 models meeting inclusion criteria, aggregated across locations (50 states only), submission week, and 1- through 4-wk forecast horizons The “No. forecasts” column refers to the number of individual location/target/week combinations. Empirical prediction interval (PI) coverage rates calculate the fraction of times the 50% or 95% PIs covered the eventually observed value. Values within 5% coverage of the nominal rates are highlighted in boldface text. The “relative WIS” and “relative MAE” columns show the relative mean WIS and relative MAE, which compare each model to the baseline model while adjusting for the difficulty of the forecasts the given model made for state-level forecasts (see Methods). The baseline model is defined to have a relative score of 1. Models with relative WIS or MAE values lower than 1 had “better” accuracy relative to the baseline model (best score in bold). The degree to which individual models provided calibrated predictions varied (Table 1). We measured the probabilistic calibration of model forecasts using the empirical coverage rates of prediction intervals (PIs). Across 1- through 4-wk-ahead horizons, 79 wk, and 50 states, only the ensemble model achieved near-nominal coverage rates for both the 50% and 95% PIs. Eight models achieved coverage rates within 5% of the desired coverage level for the 50% PI, and only the COVIDhub-ensemble and UMass-MechBayes achieved coverage rates within 5% for the 95% PI. Typically, observed coverage rates were lower than the nominal rate (Table 1 and ). Three models had very low coverage rates (less than 50% for the 95% PI or less than 15% for the 50% PI). In general, models were penalized more for underpredicting the eventually observed values than overpredicting (). Among the top-performing models, there was variation in data sources used, indicating that the inclusion of additional data sources was not a sufficient condition for high accuracy. Of the seven top individual models with a relative WIS less than or equal to 0.75 (Table 1), four used data beyond the epidemiological hospitalization, case, and death surveillance data from the Center for System Science and Engineering (CSSE) (). A total of 10 of the 18 individual models that performed better than the baseline used data other than epidemiological surveillance data (e.g., demographics or mobility). The top performers consisted of both models with mechanistic components and mostly phenomenological ones.

Model Accuracy Rankings Are Highly Variable.

The COVIDhub-ensemble was the only model that ranked in the top half of all models (standardized rank > 0.5) for more than 75% of the observations it forecasted, although it made the single best forecast less frequently than any other model (Fig. 2). We ranked models based on relative WIS for each combination of 1- through 4-wk-ahead horizons, 79 wk, and 55 locations, contributing to 17,006 possible predicted observations for each model (Fig. 2). All models showed large variability in relative skill, with each model having observations for which it had the best (lowest) WIS and thereby a standardized rank of 1. Some models, such as JHUAPL-BUCKY and PSI-DRAFT, show a bimodal distribution of standardized rank, with one mode in the top quartile of models and another in the bottom quartile. In these cases, the models frequently made overconfident predictions () resulting in either lower scores (indicating better performance) in instances in which their predictions were very close to the truth or higher scores (indicating worse performance) when their predictions were far from the truth. Similar patterns in ranking and relative model performance were seen when stratifying ranks by pandemic phase ().

Fig. 2.

A comparison of each model’s distribution of standardized rank of WIS for each location/target/week observation. A standardized rank of 1 indicates that the model had the best WIS for that particular location, target, and week, and a value of 0 indicates it had the worst WIS. The density plots show interpolated distributions of the standardized ranks achieved by each model for every observation that model forecasted. The quartiles of each model’s distribution of standardized ranks are shown in different colors: yellow indicates the top quarter of the distribution and purple indicates the bottom quarter of the distribution. The models are ordered by the first quartile of the distribution, with models that rarely had a low rank near the top.

Observations on Accuracy in Specific Weeks.

Forecasts from individual models showed variation in accuracy by forecast week and horizon (Fig. 3). The COVIDhub-ensemble model showed better average WIS than both the baseline model and the average error of all models across the entire evaluation period, except for 3 wk during which the baseline had lower 1-wk-ahead error than the ensemble. The COVIDhub-ensemble 1-wk-ahead forecast for EW02-2021 yielded its highest average WIS across all weeks (average WIS = 72.7), and 9 out of 26 other models that submitted for the same locations outperformed it that week. The 4-wk-ahead COVIDhub-ensemble forecasts were worse in EW49-2020 than in any other week during the evaluation period (average WIS = 111.7), and 15 out of the 26 models outperformed the ensemble that week at a forecast horizon of 4 wk.

Fig. 3.

Average WIS by the target forecasted week for each model across all 50 states. A shows the observed weekly COVID-19 deaths based on the CSSE-reported data as of May 25, 2021. B shows the average 1-wk-ahead WIS values per model (in gray). For all 21 wk in which the ensemble model (red triangle) is present, this model has lower WIS values than the baseline model (green square) and the average score of all models (blue circle). C shows the average 4-wk-ahead WIS values per model (in gray). For all 21 wk in which the ensemble model (red triangle) is present, this model has lower WIS values than the baseline model (green square) and the average score of all models (blue circle). The y-axes are truncated in B and C for readability of the majority of the data. There was high variation among the individual models in their forecast accuracy during periods of increasing deaths and near peaks (i.e., forecast dates in July through early August of 2020, November through March, and August through October of 2021; Fig. 3). High errors in the baseline model tended to be associated with large outliers in observed data for a particular week (e.g., times when a state reported a large backfill of deaths in the most recent week) (). In general, other models did not show unusual errors in their forecasts originating from these anomalous data, suggesting that their approaches, including possible adjustments to recent observations, were robust to anomalies in how data were reported.

Model Performance in Specific Pandemic Waves.

In addition to evaluating performance in aggregate across the entire evaluation period and separate phases, we evaluated model performance during important moments during the pandemic. To assess the impact of rapidly changing trends on incident death forecasting accuracy, we ran an analysis restricted to specific locations and time periods that experienced high rates of change during four different waves of the pandemic (Fig. 4): 1) the summer 2020 waves in the south and southwest, 2) the late-fall 2020 rise in deaths in the upper Midwest, 3) the wave driven by the Alpha SARS-CoV-2 variant in Michigan in in March/April 2021, and 4) the Delta variant wave in summer 2021 throughout most states in the United States.

Fig. 4.

Forecasts for selected states and pandemic waves, with PIs coverage. The first column shows every 1- and 4-wk-ahead forecast with 95% PIs made by the ensemble during the selected evaluation period. The second and third columns of plots show evaluations of PIs across 1- through 4-wk horizons (x-axis). The red line with triangle points corresponds to the coverage rates of the COVIDhub-ensemble forecasts, and green squares refer to the COVIDhub-baseline model. The boxplots represent the distribution of coverage rates from all component models. The second column evaluates only forecasts made for the dates shown in the first column. The third column evaluates forecasts across all weeks in the evaluation period. In the last two columns, the expected coverage rate (95%) is shown by the dashed line. Forecast performance varied substantially in these examples. Models in general systematically underpredicted the mortality curve as trends were rising and overpredicted as trends were falling. In some of the selected waves (e.g., North Dakota and Florida), the ensemble forecast showed inappropriate levels of uncertainty, with the 95% PIs covering the eventual observations less than 80% of the time. However, during other waves (e.g., Louisiana and Michigan), the ensemble forecast, while systematically biased first below and then above the eventually reported counts of deaths, did cover the observations at or above 95% of the time, although PIs were very wide. In general, lower-than-expected coverage rates and bias were more pronounced at a 4-wk horizon than a 1-wk horizon. These four examples appeared to be representative of trends observed when looking across a larger number of waves (Dataset S2).

Individual Model Forecast Performance Varies Substantially by Location.

Forecasts from individual models showed large variation in accuracy by location when aggregated across all weeks and targets (Fig. 5). Only the ensemble model showed superior accuracy when compared to baseline in all locations. Ensemble forecasts of incident deaths showed the largest relative accuracy improvements in New York, New Jersey, Indiana (relative WIS = 0.4), California, Massachusetts, and at the national level (relative WIS = 0.5) and the lowest relative accuracy in Vermont, Guam, and The Virgin Islands (relative WIS = 0.9). The COVIDhub-ensemble was the only model to outperform the baseline in every location when eligible in a specific pandemic phase ().

Fig. 5.

Relative WIS by location for each model across all horizons and submission weeks. The value in each box represents the relative WIS calculated from 1- to 4-wk-ahead targets available for a model at each location. Boxes are colored based on the relative WIS compared to the baseline model. Blue boxes represent teams that outperformed the baseline, and red boxes represent teams that performed worse than the baseline. Locations are sorted by cumulative deaths as of the end of the evaluation period (October 30, 2021). Teams are listed on the horizontal axis in order from the lowest to highest relative WIS values (Table 1).

Forecast Performance Degrades with Increasing Horizons.

Averaging across all states and weeks in the evaluation period, forecasts from all models showed lower accuracy and higher variance at a forecast horizon of 4 wk ahead compared to a horizon of 1 wk ahead; however, models generally showed improved performance relative to the naïve baseline model at larger horizons (). A total of 11 models showed a lower average WIS (range: 24.9 to 34.3) than the baseline at a 1-wk horizon (average WIS = 35.8). At a 4-wk-ahead horizon, 19 models had a lower average WIS (range: 39.9 to 65.2) than baseline (average WIS = 70.1). Across all models except one, the average WIS was higher than the median WIS, indicative of outlying forecasts impacting the mean value. When averaging across locations and stratifying by phase of the pandemic, there was variation in the top-performing models (). Four models had a lower mean WIS than baseline for both 1- and 4-wk-ahead targets in at least three out of four phases (COVIDhub-ensemble, GT-DeepCOVID, Karlen-pypm, and UMass-MechBayes). Additionally, UMass-MechBayes and COVIDhub-ensemble were the only models to appear in the top three models in three of the four phases analyzed (). In contrast to average WIS, PIs coverage rates did not change substantially across the 1- to 4-wk horizons for most models (). While many teams submitted only short-term (1- to 4-wk-ahead) forecasts, a smaller number of teams consistently submitted longer-term predictions with up to a 20-wk horizon for all 50 states (). Across all teams submitting forecasts for the 50 states, 4-wk-ahead forecasts had around 76% more error (based on relative WIS) than 1-wk-ahead forecasts, a relationship that was consistent across the entire evaluation period. Longer-term forecasts showed less accuracy on average than 1- and 4-wk-ahead forecasts. There were no clear overall differences in probabilistic model accuracy between 8- and 20-wk horizons, although in early summer 2020, late spring 2021, and fall of 2021, average WIS at 8-wk horizons were slightly lower than at longer horizons (). For the two teams who made 20-wk-ahead forecasts for all 50 states, the average WIS was 2.9 to 4 times higher at a 20-wk horizon than it was at a 1-wk horizon. The increased WIS at longer prediction horizons for these models were due to larger dispersion (i.e., wider predictive distributions representing increased uncertainty) as well as larger penalties for underprediction and overprediction (). The biggest increases in WIS were from increased penalties for underprediction, suggesting that the model forecasts did not accurately capture the possibility of increases in incidence at long horizons. Coverage rates for 95% PIs tended to be stable or decline as the horizon increased ().

Discussion

Given the highly visible role that forecasting has played in the response to the COVID-19 pandemic, it is critical that model consumers, such as decision-makers, the general public, and modelers themselves, understand how reliable models are. This paper provides a comprehensive and comparative look at the probabilistic accuracy of different modeling approaches for forecasting COVID-19–related deaths during the COVID-19 pandemic in the United States from April 2020 through October 2021. This work illustrates the tension between the desire for long-term forecasts, which would be helpful for public-health practitioners, and the decline in forecast accuracy at longer horizons shown by all forecasting methods. As seen in prior epidemic forecasting projects, ensemble forecasts simplify the information provided to model consumers and can provide a stable, accurate, and low-variance forecast (3, 14–16). The results presented here show high variation in accuracy between and within stand-alone models but more consistent accuracy from an ensemble forecast. This supports prior results and confirms that an ensemble model can provide a reliable and comparatively accurate forecast that exceeds the performance of most, if not all, of the models that contribute to it. The ensemble approach was the only model that 1) outperformed the baseline forecast in every location, 2) had better overall 4-wk-ahead accuracy than the baseline forecast in every week, and 3) ranked in the top half of forecasts for more than 75% of the forecasts it made. It achieved the best overall measures of point and probabilistic forecast accuracy for forecasting deaths. However, during key moments in the pandemic, while the ensemble outperformed many models, it often showed lower than desired accuracy (Figs. 3 and 4). These results strengthen the evidence base for synthesizing multiple models for public-health decision support. We summarize the key findings of the work as follows. The performance of all individual models forecasting COVID-19 mortality was highly variable even for short-term targets (Figs. 2 and 3). One source of variation was data inputs. Further investigation is needed to determine in what settings additional data can yield measurable improvements in forecast accuracy or add valuable diversity to a collection of models that are combined. A simple ensemble forecast that combined all submitted models each week was consistently the most accurate model when aggregating performance across forecast targets, weeks, or locations (Fig. 3 and 5 and ). Although rarely the “most accurate” model for individual predictions, the ensemble was consistently one of the top few models for any single prediction (Fig. 2). For public-health agencies concerned with using a model that shows dependably accurate performance, this is a desirable feature of a model. The high variation in ranks of models for each location/target/week suggests that all models, even those that are not as accurate on average, have observations for which they are the most accurate (Fig. 2). The post hoc evaluation of models during forecasting waves in select states showed poor accuracy of the ensemble model’s point forecasts for 1 and 4 wk ahead. During periods of increasing incident deaths, the ensemble tended to underpredict while tending to overpredict during periods of decreasing incident deaths. PIs coverage during these times varied (Fig. 4). Forecast accuracy and calibration were substantially degraded as forecast horizons increased, largely due to underestimating the possibility of increases in incidence at long horizons (). Model performance should be assessed both in aggregate (to compare models that showed the best overall performance) and in specific important moments during the pandemic. It is of public-health interest to evaluate how well models are able to predict points at which the observed trends change. However, we note that a post hoc evaluation that focuses only on times at which a specific type of trend was observed raises conceptual challenges. Extreme turning points in the pandemic are relatively rare compared with the many weeks during which trends continue or only slightly change from previous weeks. A post hoc evaluation that focuses exclusively on these “change-points” may reward models that may regularly predict extreme changes even when they do not occur at other times (21). Adapting proper scoring rules to weigh good performance in both kinds of situations is difficult. Rigorous evaluation of forecast accuracy faces many limitations in practice. The large variation and correlation in forecast errors across targets, submission weeks, and locations makes it difficult to create rigorous comparisons of models. Forecast comparison is also challenging because teams have submitted forecasts for different lengths of time, different locations, and for different numbers of horizons (). Some teams have also changed their models over time (). To account for some of this variability, we implemented specific inclusion criteria. However, those criteria may exclude valuable approaches that were not applied to a large fraction of locations or weeks (see Methods). Forecast performance may be affected by ground-truth data and forecast target. Ground-truth data are not static. They can be later revised as more data become available (Dataset S1). There are also instances in which data are not revised but rather left with large peaks or dips due to reporting effects, especially around holidays. Different sources for ground truth data can also have substantial differences that impact model performance. Lastly, because this evaluation focuses on incident death forecasts, it cannot speak to model performance for incident cases or hospitalizations. Deaths may serve as a lagging indicator of COVID-19, thus making it more predictable than hospitalization and case targets (22). While the Hub has provided many insights into what has and has not been predictable in the COVID-19 pandemic, it also has left many important questions unanswered. Due to the operational, real-time orientation of the project, the Hub did not collect data on experimental modeling studies for which certain features can be included or left out to explicitly test what features of a model increase predictive accuracy. An observational study could be conducted with forecasts collected by the Hub, but any such analysis would be confounded by other factors about how the model was built and validated. Other research in this area has shown small but measurable improvements in predictive accuracy by including other data streams available in real time (23). Continued research is needed to evaluate how behavioral, mobility, variant prevalence, or other data streams might enhance predictive modeling Short-term forecasts of COVID-19 mortality have informed public-health response and risk communication for the pandemic. The number of teams and forecasts contributing to the COVID-19 ensemble forecast model has exceeded forecasting activity for any prior epidemic or pandemic. These forecasts are only one component of a comprehensive public-health data and modeling system needed to help inform outbreak response. Preparedness for future pandemics could be facilitated by creating resources for creating and maintaining model submission formats. This project underscores the role that collaboration and active coordination between governmental public-health agencies, academic modeling teams, and industry partners can play in developing modeling capabilities to support local, state, and federal response to outbreaks.

Methods

Surveillance Data.

Early in the COVID-19 pandemic, the Johns Hopkins CSSE developed a publicly available data-tracking system and dashboard that was widely used (24). CSSE collected daily data on cumulative reported deaths due to COVID-19 at the county, state, territorial, and national levels and made these data available in a standardized format beginning in March 2020. Incident deaths were inferred from this time series as the difference in reported cumulative deaths on successive days. Throughout the real-time forecasting exercise described in this paper, the Forecast Hub stated that forecasts of deaths would be evaluated using the CSSE data as the ground truth and encouraged teams to train their models on CSSE data. Like data from other public-health systems, the CSSE data occasionally exhibited irregularities due to reporting anomalies. CSSE made attempts to redistribute large “backlogs” of data to previous dates in instances in which the true dates of deaths, or dates when the deaths would have been reported, were known. However, in some cases, these anomalous observations were left in the final dataset (). All updates were made available in a public GitHub repository (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#data-modification-records). Weekly incidence values were defined and aggregated based on daily totals from Sunday through Saturday, according to the standard definition of epidemiological weeks (EW) used by the CDC (25).

Forecast Format.

Research teams from around the world developed forecasting models and submitted their predictions to the COVID-19 Forecast Hub, a central repository that collected forecasts of the COVID-19 pandemic in the United States beginning in April 2020. The Forecast Hub submission process has been described in detail elsewhere (26). Incident death forecasts, the focus of this evaluation, could be submitted with predictions for horizons of 1 to 20 wk after the week in which a forecast was submitted. A prediction for a given target (e.g., “1-week ahead incident death”) and location (e.g., “California”) was specified by one or both of a point forecast (a single number representing the prediction of the eventual outcome) and a probabilistic forecast. Probabilistic forecasts were represented by a set of 23 quantiles at probability levels 0.01, 0.025, 0.05, 0.10, 0.15, …, 0.95, 0.975, 0.99.

Forecast Model Eligibility and Evaluation Period.

To create a set of standardized comparisons between forecasts, only models that met specific inclusion criteria were included in the analysis. For the 79 wk beginning in EW17-2020 and ending with EW42-2021, a model’s weekly submission was determined to be eligible for evaluation if the forecast was designated as the “primary” forecast model from a team (groups who submitted multiple parameterizations of similar models were asked to designate prospectively a single model as their scored forecast); contained predictions for at least 25 out of 51 focal locations (national level and states); contained predictions for each of the 1- through 4-wk-ahead targets for incident deaths; and contained a complete set of quantiles for all predictions. A model was included in the evaluation if it had submitted an eligible forecast for at least 60% (n = 47) of the submission weeks during the continuous 79-wk period (). Based on the eligibility criteria, we compared 28 models that had at least 47 eligible weeks during this time period.

Aggregated Forecast Evaluation of Pandemic Phases.

In a secondary analysis, forecasts were evaluated based on model submissions during four different phases of the pandemic. A model was eligible for inclusion in a given phase if it met the eligibility criteria listed in the Methods section: Forecast model eligibility and evaluation period, and had forecast submissions for at least 60% of the weeks during that phase. For the spring phase, models had to submit eligible forecasts for at least 6 out of 10 wk starting EW16-2020 and ending EW26-2020. For summer eligibility, a model required submissions for at least 12 out of 20 submission weeks between EW27-2020 and EW46-2020. For winter eligibility, a model required submissions for at least 14 out of 23 submission weeks between EW47-2020 and EW16-2021. For delta phase eligibility, a model required submissions for at least 16 out of 26 submission weeks between EW17-2021 and EW42-2021. These phases were determined based on the waves of deaths at the national level during pandemic (Fig. 1). Each phase includes a period of increasing and decreasing incident deaths, although forecasts for the spring phase did not begin early enough to capture the increase in many locations. Forecasts were scored using CSSE data available as of November 16, 2021. We did not evaluate forecasts on data first published in the 2 wk prior to this date due to possible revisions to the data.

Disaggregated Forecast Evaluation by Pandemic Wave.

In a post hoc secondary analysis, we evaluated forecasts made in selected locations during selected pandemic waves. We used the following criteria in selecting locations and waves to represent this analysis (Fig. 4 and Dataset S2). We selected states that had unusually severe waves or whose waves “led” the overall wave. Locations for which data for weekly deaths during the wave had been substantially revised after the initial report were excluded from consideration. We picked an initial date near the start of the first increase at the start of the wave and a last date at the end of the steep decline of the wave. To compare the forecasts during the waves, we plotted 1- and 4-wk-ahead forecasts and calculated 95% PIs coverage rates of forecasts made for the given location both during the wave of interest over all weeks. Coverage rates were computed for models that were included in the overall analysis (see eligibility in Methods section: Forecast model eligibility and evaluation period) and, for inclusion in the coverage calculations for each wave, the model additionally had to have made forecasts for at least 3 wk in the selected wave.

Forecast Locations.

Forecasts were submitted for 57 locations including all 50 states, six jurisdictions and territories (American Samoa, Guam, the Northern Mariana Islands, US Virgin Islands, Puerto Rico, and the District of Columbia), and a US national-level forecast. Because American Samoa and the Northern Mariana Islands had no reported COVID-19 deaths and one reported COVID-19 death, respectively, during the evaluation period, we excluded these locations from our analysis. In analyses for which measures of forecast skill were aggregated across locations, we typically only included the 50 states in the analysis. Including these territories in raw score aggregations would favor models that had forecasted for these regions because models were often accurate in predicting low or zero deaths each week, thereby reducing their average error. The national-level forecasts were not included in the aggregated scores because the large magnitude of scores at the national level strongly influences the averages. However, in analyses for which scores were stratified by location, we included forecasts for all US states, including territories and at the national level. This evaluation used the CSSE COVID-19 surveillance data as ground truth when assessing forecast performance. We did not score observations when ground-truth data showed negative values for weekly incident deaths (due to changes in reporting practices from state/local health agencies [e.g., removing “probable” COVID-19 deaths from cumulative counts]). This occurred 11 times.

Forecast Models.

For the primary evaluation, we compared 28 models that submitted eligible forecasts for at least 47 of the 79 wk considered in the overall model eligibility period (Fig. 1). Teams that submitted to the COVID-19 Forecast Hub used a wide variety of modeling approaches and input data (). Two of the evaluated models are from the COVID-19 Forecast Hub itself: a baseline model and an ensemble model. The COVIDhub-baseline model was designed to be a neutral model to provide a simple reference point of comparison for all models. This baseline model forecasted a predictive median incidence equal to the number of reported deaths in the most recent week (), with uncertainty around the median based on changes in weekly incidence that were observed in the past of the time series (details in ). The COVIDhub-ensemble model combined forecasts from all models that submitted a full set of 23 quantiles for 1- through 4-wk-ahead forecasts for incident deaths. The ensemble for incident weekly deaths was first submitted in the week ending June 6, 2020 (EW23). For submission from EW23 through EW29 (week ending July 18, 2020), the ensemble took an equally weighted average of forecasts from all models at each quantile level. For submissions starting in EW30 (week ending July 25, 2020), the ensemble computed the median across forecasts from all models at each quantile level (27). We evaluated more complex ensemble methods, and while they did show modest improvements in accuracy, they also displayed undesirable increases in variability in performance during this evaluation period (28, 29).

Forecast Submission Timing.

Of the 3,555 forecast submissions we included in the evaluation, 230 (6%) were either originally submitted or updated more than 24 h after the submission deadline. In all of these situations, modeling teams attested (via annotation on the public data repository) to the fact that they were correcting inadvertent errors in the code that produced the forecast and that the forecast used as input only data that would have been available before the original submission due date. In these limited instances, we evaluated the most recently submitted forecasts.

Evaluation Methodology.

We evaluated aggregate forecast skill using a range of scores that assessed both point and probabilistic accuracy. These scores were aggregated over time and locations for near-term forecasts (4 wk or less into the future) and, in a single analysis, for longer-term projections (5 to 20 wk into the future). Point forecast error was assessed using the MAE defined for a set of observations and each model’s designated point predictions as . To assess probabilistic forecast accuracy, we used two scores that are easily computable from the quantile representation for forecasts described in the Methods section, Forecast Format. Briefly, the WIS is a proper score that combines a set of interval scores for probabilistic forecasts that provide quantiles of the predictive forecast distribution (20): An individual interval score for a single prediction and uncertainty level can be broken into three additive components. These components—dispersion, underprediction, and overprediction as they appear, respectively, in the preceding equation—represent contributions to the score. As an example, say a 50% PIs (α = 0. 5) is (40, 60) and the observation is 30. The , where the dispersion is 20, the penalty for underprediction is 40, and there is no penalty for overprediction. Similarly, the WIS, which is computed as a weighted sum of interval scores across all available uncertainty levels as shown in the preceding equation, can be split into contributions from each of these components. These then can be used to summarize the average performance of a model in terms of the width of its intervals and the average penalties it receives for intervals missing below or above the observation. Proper scores promote “honest” forecasting by not providing forecasters with incentives to report forecasts that differ from their true beliefs about the future (30). We also evaluated the PIs coverage, the proportion of times a PIs of a certain level covered the observed value, to assess the degree to which forecasts accurately characterized uncertainty about future observations. Computational details for PIs coverage are provided in .

Forecast Comparisons.

Comparative evaluation of the considered models 1, …, M is hampered by the fact that not all of them provide forecasts for the same set of locations and time points. To adjust for the level of difficulty of each model’s set of forecasts, we computed 1) a standardized rank between 0 and 1 for every forecasted observation relative to other models that made the same forecast and 2) an adjusted relative WIS and MAE. To compute the WIS standardized rank score for model and observation (), we computed the number of models that forecasted that observation () and the rank of model among those models (). The model with the best (i.e., lowest) WIS received a rank of 1 and the worst received a rank of . The standardized rank then rescaled the ranks to between 0 and 1, where 0 corresponded to the worst rank and 1 to the best (31–33), as follows: This metric is not dependent on the scale of the observed data. If all models were equally accurate, distributions of standardized ranks would be approximately uniform. A procedure to compute a measure of relative WIS, which evaluates the aggregate performance of one model against the baseline model is described in . To adjust for the relative difficulty of beating the baseline model on the covered set of forecast targets, the chosen measure also takes into account the performance of all other available models. The same procedure was used to compute a relative MAE.

16 in total

1. Improved Weather and Seasonal Climate Forecasts from Multimodel Superensemble.

Authors:
Journal: Science Date: 1999-09-03 Impact factor: 47.728

2. Atmospheric science. Weather forecasting with ensemble methods.

Authors: Tilmann Gneiting; Adrian E Raftery
Journal: Science Date: 2005-10-14 Impact factor: 47.728

Review 3. Epidemic Forecasting is Messier Than Weather Forecasting: The Role of Human Behavior and Internet Data Streams in Epidemic Forecast.

Authors: Kelly R Moran; Geoffrey Fairchild; Nicholas Generous; Kyle Hickmann; Dave Osthus; Reid Priedhorsky; James Hyman; Sara Y Del Valle
Journal: J Infect Dis Date: 2016-12-01 Impact factor: 5.226

Review 4. Outbreak analytics: a developing data science for informing the response to emerging pathogens.

Authors: Jonathan A Polonsky; Amrish Baidjoe; Zhian N Kamvar; Anne Cori; Kara Durski; W John Edmunds; Rosalind M Eggo; Sebastian Funk; Laurent Kaiser; Patrick Keating; Olivier le Polain de Waroux; Michael Marks; Paula Moraga; Oliver Morgan; Pierre Nouvellet; Ruwan Ratnayake; Chrissy H Roberts; Jimmy Whitworth; Thibaut Jombart
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2019-07-08 Impact factor: 6.237

5. Evaluating epidemic forecasts in an interval format.

Authors: Johannes Bracher; Evan L Ray; Tilmann Gneiting; Nicholas G Reich
Journal: PLoS Comput Biol Date: 2021-02-12 Impact factor: 4.779

6. Recommended reporting items for epidemic forecasting and prediction research: The EPIFORGE 2020 guidelines.

Authors: Simon Pollett; Michael A Johansson; Nicholas G Reich; David Brett-Major; Sara Y Del Valle; Srinivasan Venkatramanan; Rachel Lowe; Travis Porco; Irina Maljkovic Berry; Alina Deshpande; Moritz U G Kraemer; David L Blazes; Wirichada Pan-Ngum; Alessandro Vespigiani; Suzanne E Mate; Sheetal P Silal; Sasikiran Kandula; Rachel Sippy; Talia M Quandelacy; Jeffrey J Morgan; Jacob Ball; Lindsay C Morton; Benjamin M Althouse; Julie Pavlin; Wilbert van Panhuis; Steven Riley; Matthew Biggerstaff; Cecile Viboud; Oliver Brady; Caitlin Rivers
Journal: PLoS Med Date: 2021-10-19 Impact factor: 11.069

7. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States.

Authors: Nicholas G Reich; Logan C Brooks; Spencer J Fox; Sasikiran Kandula; Craig J McGowan; Evan Moore; Dave Osthus; Evan L Ray; Abhinav Tushar; Teresa K Yamana; Matthew Biggerstaff; Michael A Johansson; Roni Rosenfeld; Jeffrey Shaman
Journal: Proc Natl Acad Sci U S A Date: 2019-01-15 Impact factor: 11.205

8. An open challenge to advance probabilistic forecasting for dengue epidemics.

Authors: Michael A Johansson; Karyn M Apfeldorf; Scott Dobson; Jason Devita; Anna L Buczak; Benjamin Baugher; Linda J Moniz; Thomas Bagley; Steven M Babin; Erhan Guven; Teresa K Yamana; Jeffrey Shaman; Terry Moschou; Nick Lothian; Aaron Lane; Grant Osborne; Gao Jiang; Logan C Brooks; David C Farrow; Sangwon Hyun; Ryan J Tibshirani; Roni Rosenfeld; Justin Lessler; Nicholas G Reich; Derek A T Cummings; Stephen A Lauer; Sean M Moore; Hannah E Clapham; Rachel Lowe; Trevor C Bailey; Markel García-Díez; Marilia Sá Carvalho; Xavier Rodó; Tridip Sardar; Richard Paul; Evan L Ray; Krzysztof Sakrejda; Alexandria C Brown; Xi Meng; Osonde Osoba; Raffaele Vardavas; David Manheim; Melinda Moore; Dhananjai M Rao; Travis C Porco; Sarah Ackley; Fengchen Liu; Lee Worden; Matteo Convertino; Yang Liu; Abraham Reddy; Eloy Ortiz; Jorge Rivero; Humberto Brito; Alicia Juarrero; Leah R Johnson; Robert B Gramacy; Jeremy M Cohen; Erin A Mordecai; Courtney C Murdock; Jason R Rohr; Sadie J Ryan; Anna M Stewart-Ibarra; Daniel P Weikel; Antarpreet Jutla; Rakibul Khan; Marissa Poultney; Rita R Colwell; Brenda Rivera-García; Christopher M Barker; Jesse E Bell; Matthew Biggerstaff; David Swerdlow; Luis Mier-Y-Teran-Romero; Brett M Forshey; Juli Trtanj; Jason Asher; Matt Clay; Harold S Margolis; Andrew M Hebbeler; Dylan George; Jean-Paul Chretien
Journal: Proc Natl Acad Sci U S A Date: 2019-11-11 Impact factor: 11.205

9. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S.

Authors: Nicholas G Reich; Craig J McGowan; Teresa K Yamana; Abhinav Tushar; Evan L Ray; Dave Osthus; Sasikiran Kandula; Logan C Brooks; Willow Crawford-Crudell; Graham Casey Gibson; Evan Moore; Rebecca Silva; Matthew Biggerstaff; Michael A Johansson; Roni Rosenfeld; Jeffrey Shaman
Journal: PLoS Comput Biol Date: 2019-11-22 Impact factor: 4.475

10. Applying infectious disease forecasting to public health: a path forward using influenza forecasting examples.

Authors: Chelsea S Lutz; Mimi P Huynh; Monica Schroeder; Sophia Anyatonwu; F Scott Dahlgren; Gregory Danyluk; Danielle Fernandez; Sharon K Greene; Nodar Kipshidze; Leann Liu; Osaro Mgbere; Lisa A McHugh; Jennifer F Myers; Alan Siniscalchi; Amy D Sullivan; Nicole West; Michael A Johansson; Matthew Biggerstaff
Journal: BMC Public Health Date: 2019-12-10 Impact factor: 3.295

20 in total

1. Trend estimation and short-term forecasting of COVID-19 cases and deaths worldwide.

Authors: Ekaterina Krymova; Benjamín Béjar; Dorina Thanou; Tao Sun; Elisa Manetti; Gavin Lee; Kristen Namigai; Christine Choirat; Antoine Flahault; Guillaume Obozinski
Journal: Proc Natl Acad Sci U S A Date: 2022-08-03 Impact factor: 12.779

2. Extended compartmental model for modeling COVID-19 epidemic in Slovenia.

Authors: Miha Fošnarič; Tina Kamenšek; Jerneja Žganec Gros; Janez Žibert
Journal: Sci Rep Date: 2022-10-08 Impact factor: 4.996

3. Experimental guidance for discovering genetic networks through hypothesis reduction on time series.

Authors: Breschine Cummins; Francis C Motta; Robert C Moseley; Anastasia Deckard; Sophia Campione; Marcio Gameiro; Tomáš Gedeon; Konstantin Mischaikow; Steven B Haase
Journal: PLoS Comput Biol Date: 2022-10-10 Impact factor: 4.779

4. Optimal control of the spatial allocation of COVID-19 vaccines: Italy as a case study.

Authors: Joseph Chadi Lemaitre; Damiano Pasetto; Mario Zanon; Enrico Bertuzzo; Lorenzo Mari; Stefano Miccoli; Renato Casagrandi; Marino Gatto; Andrea Rinaldo
Journal: PLoS Comput Biol Date: 2022-07-08 Impact factor: 4.779

5. Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States.

Authors: Evan L Ray; Logan C Brooks; Jacob Bien; Matthew Biggerstaff; Nikos I Bosse; Johannes Bracher; Estee Y Cramer; Sebastian Funk; Aaron Gerding; Michael A Johansson; Aaron Rumack; Yijin Wang; Martha Zorn; Ryan J Tibshirani; Nicholas G Reich
Journal: Int J Forecast Date: 2022-07-01

6. Enhancing long-term forecasting: Learning from COVID-19 models.

Authors: Hazhir Rahmandad; Ran Xu; Navid Ghaffarzadegan
Journal: PLoS Comput Biol Date: 2022-05-19 Impact factor: 4.779

7. An ensemble model based on early predictors to forecast COVID-19 health care demand in France.

Authors: Juliette Paireau; Alessio Andronico; Nathanaël Hozé; Maylis Layan; Pascal Crépey; Alix Roumagnac; Marc Lavielle; Pierre-Yves Boëlle; Simon Cauchemez
Journal: Proc Natl Acad Sci U S A Date: 2022-04-27 Impact factor: 12.779

8. An ensemble n -sub-epidemic modeling framework for short-term forecasting epidemic trajectories: Application to the COVID-19 pandemic in the USA.

Authors: Gerardo Chowell; Sushma Dahal; Amna Tariq; Kimberlyn Roosa; James M Hyman; Ruiyan Luo
Journal: medRxiv Date: 2022-06-21

9. An evaluation of Internet searches as a marker of trends in population mental health in the US.

Authors: Evgeniy Gabrilovich; Gregory A Wellenius; Uma Vaidyanathan; Yuantong Sun; Tomer Shekel; Katherine Chou; Sandro Galea
Journal: Sci Rep Date: 2022-05-27 Impact factor: 4.996

10. Aggregating human judgment probabilistic predictions of COVID-19 transmission, burden, and preventative measures.

Authors: Allison Codi; Damon Luk; David Braun; Juan Cambeiro; Tamay Besiroglu; Eva Chen; Luis Enrique Urtubey de Cèsaris; Paolo Bocchini; Thomas McAndrew
Journal: ArXiv Date: 2022-04-05