Forecasting for COVID-19 has failed
John P. A. Ioannidis, Sally Cripps, Martin A. Tanner
Abstract
Epidemic forecasting has a dubious track-record, and its failures became more prominent with COVID-19. Poor data input, wrong modeling assumptions, high sensitivity of estimates, lack of incorporation of epidemiological features, poor past evidence on effects of available interventions, lack of transparency, errors, lack of determinacy, consideration of only one or a few dimensions of the problem at hand, lack of expertise in crucial disciplines, groupthink and bandwagon effects, and selective reporting are some of the causes of these failures. Nevertheless, epidemic forecasting is unlikely to be abandoned. Some (but not all) of these problems can be fixed. Careful modeling of predictive distributions rather than focusing on point estimates, considering multiple dimensions of impact, and continuously reappraising models based on their validated performance may help. If extreme values are considered, extremes should be considered for the consequences of multiple dimensions of impact so as to continuously calibrate predictive insights and decision-making. When major decisions (e.g. draconian lockdowns) are based on forecasts, the harms (in terms of health, economy, and society at large) and the asymmetry of risks need to be approached in a holistic fashion, considering the totality of the evidence.
Keywords: Bayesian models; Bias; COVID-19; Forecasting; Hospital bed utilization; Mortality; SIR models; Validation
Year: 2020 PMID: 32863495 PMCID: PMC7447267 DOI: 10.1016/j.ijforecast.2020.08.004
Source DB: PubMed Journal: Int J Forecast ISSN: 0169-2070
Some predictions about hospital bed needs and their rebuttal by reality: examples from news coverage of some influential forecasts.
| State | Prediction made | What happened |
|---|---|---|
| New York | “Sophisticated scientists, Mr. Cuomo said, had studied the coming coronavirus outbreak and…” | “But the number of intensive care beds being used declined for the first time in the crisis, to 4,908, according to daily figures released on Friday. And the total number hospitalized with the virus, 18,569, was far lower than the darkest expectations.” |
| Tennessee | “Last Friday, the model suggested Tennessee would see the peak of the pandemic on about April 19 and would need an estimated 15,500 inpatient beds, 2,500 ICU beds and nearly 2,000 ventilators to keep COVID-19 patients alive.” | “Now, it is projecting the peak to come four days earlier and that the state will need 1,232 inpatient beds, 245 ICU beds and 208 ventilators. Those numbers are all well below the state’s current health care capacity.” |
| California | “In California alone, at least 1.2 million people over the age of 18 are projected to need hospitalization from the disease, according to an analysis published March 17 by the Harvard Global Health Institute and the Harvard T.H. Chan School of Public Health… California needs 50,000 additional hospital beds to meet the incoming surge of coronavirus patients, Gov. Gavin Newsom said last week.” | “In our home state of California, for example, COVID-19 patients occupy fewer than two in 10 ICU beds, and the growth in COVID-19-related utilization, thankfully, seems to be flattening out. California’s picture is even sunnier when it comes to general hospital beds. Well under five percent are occupied by COVID-19 patients.” |
Forecasting what will happen after reopening.
| PREDICTION FOR REOPENING | WHAT ACTUALLY HAPPENED |
|---|---|
| “Results indicate that lifting restrictions too soon can result in a second wave of infections and deaths. Georgia is planning to open some businesses on April 27th. The tool shows that COVID-19 is not yet contained in Georgia and even lifting restrictions gradually over the next month can result in over 23,000 deaths.” | Number of deaths over one month: 896 instead of the predicted 23,000 |
| “…administration is privately projecting a steady rise in the number of coronavirus cases and deaths over the next several weeks. The daily death toll will reach about 3,000 on June 1, according to an internal document obtained by The New York Times, a 70 percent increase from the current number of about 1,750.” | Number of daily deaths on June 1: 731 instead of the predicted 3,000, i.e. a 60% decrease instead of a 70% increase |
| “According to the Penn Wharton Budget Model (PWBM), reopening states will result in an additional 233,000 deaths from the virus — even if states don’t reopen at all and with social distancing rules in place. This means that if the states were to reopen, 350,000 people in total would die from coronavirus by the end of June, the study found.” | Based on the JHU dashboard death count, the number of additional deaths as of June 30 was 5,700 instead of 233,000, i.e. total deaths were 122,700 instead of 350,000. It is also unclear whether any of the 5,700 deaths were due to reopening rather than to error in the original model calibration of the number of deaths without reopening. |
| “Dr. Ashish Jha, the director of the Harvard Global Health Institute, told CNN’s Wolf Blitzer that the current data shows that somewhere between 800 to 1000 Americans are dying from the virus daily, and even if that does not increase, the US is poised to cross 200,000 deaths sometime in September.” | Within less than 4 weeks of this quote, the number of daily deaths was much lower than the quoted 800–1000 (516 daily average for the week ending July 4). It then increased again to over 1000 daily average in the first three weeks of August, and then decreased again to a 710 daily average by the last week of September. Predictions are precarious with such volatile behavior of the epidemic wave. |
Potential reasons for the failure of COVID-19 forecasting along with examples and extent of potential amendments.
| Reasons | Examples | How to fix: extent of potential amendments |
|---|---|---|
| Poor data input on key features of the pandemic that go into theory-based forecasting (e.g. SIR models) | Early data providing estimates for case fatality rate, infection fatality rate, basic reproductive number, and other key numbers that are essential in modeling were inflated. | May be unavoidable early in the course of the pandemic when limited data are available; should be possible to correct when additional evidence accrues about true spread of the infection, proportion of asymptomatic and non-detected cases, and risk-stratification. Investment should be made in the collection, cleaning, and curation of data. |
| Poor data input for data-based forecasting (e.g. time series) | Lack of consensus as to what is the ‘ground truth’, even for seemingly hard-core data such as the daily number of deaths. These counts may vary because of reporting delays, changing definitions, data errors, etc. Different models were trained on different and possibly highly inconsistent versions of the data. | As above: investment should be made in the collection, cleaning, and curation of data. |
| Wrong assumptions in the modeling | Many models assume homogeneity, i.e. all people having equal chances of mixing with each other and infecting each other. This is an untenable assumption and, in reality, tremendous heterogeneity of exposures and mixing is likely to be the norm. Unless this heterogeneity is recognized, estimates of the proportion of people eventually infected before reaching herd immunity can be markedly inflated (see the sketch following this table). | Need to build probabilistic models that allow for more realistic assumptions; quantify uncertainty and continuously re-adjust models based on accruing evidence |
| High sensitivity of estimates | For models that use exponentiated variables, small errors may result in major deviations from reality (also illustrated in the sketch following this table) | Inherently impossible to fix; can only acknowledge that uncertainty in calculations may be much larger than it seems |
| Lack of incorporation of epidemiological features | Almost all COVID-19 mortality models focused on number of deaths, without considering age structure and comorbidities. This can give very misleading inferences about the burden of disease in terms of quality-adjusted life-years lost, which is far more important than simple death count. For example, the Spanish flu killed young people with an average age of 28, and its burden in terms of number of quality-adjusted person-years lost was about 1000-fold higher than that of COVID-19 (at least as of June 8, 2020). | Incorporate best epidemiological estimates of age structure and comorbidities in the modeling; focus on quality-adjusted life-years rather than deaths |
| Poor past evidence on effects of available interventions | The core evidence to support “flatten-the-curve” efforts was based on observational data from the 1918 Spanish flu pandemic in 43 US cities. These data are >100 years old, of questionable quality, unadjusted for confounders, based on ecological reasoning, and pertain to an entirely different (influenza) pathogen that had… | While some interventions in the broader package of lockdown measures are likely to have beneficial effects, assuming huge benefits is incongruent with past (weak) evidence and should be avoided. Large benefits may be feasible from precise, focused measures (e.g. early, intensive testing with thorough contact tracing for the early detected cases, so as not to allow the epidemic wave to escalate [e.g. Taiwan or Singapore]; or draconian hygiene measures and thorough testing in nursing homes) rather than from blind lockdown of whole populations. |
| Lack of transparency | The methods of many models used by policy makers were not disclosed; most models were never formally peer-reviewed, and the vast majority have not appeared in the peer-reviewed literature even many months after they shaped major policy actions | While formal peer-review and publication may unavoidably take more time, full transparency about the methods and sharing of the code and data that inform these models is indispensable. Even with peer-review, many papers may still be glaringly wrong, even in the best journals. |
| Errors | Complex code can be error-prone, and errors can be made even by experienced modelers; using old-fashioned software or languages can make things worse; failure to share code and data (or sharing them late) does not allow errors to be detected and corrected | Promote data and code sharing; use up-to-date and well-vetted tools and processes that minimize the potential for error through auditing loops in the software and code |
| Lack of determinacy | Many models are stochastic and need to have a large number of iterations run, perhaps also with appropriate burn-in periods; superficial use may lead to different estimates | Promote data and code sharing to allow checking the use of stochastic processes and their stability |
| Looking at only one or a few dimensions of the problem at hand | Almost all models that had a prominent role in decision-making focused on COVID-19 outcomes, often just a single outcome or a few outcomes (e.g. deaths or hospital needs). Models primed for decision-making need to take into account the impact on multiple fronts (e.g. other aspects of health care, other diseases, dimensions of the economy, etc.) | Interdisciplinarity is desperately needed; as it is unlikely that single scientists or even teams can cover all this space, it is important for modelers from diverse walks of life to sit at the same table. Major pandemics happen rarely, and what is needed are models which combine information from a variety of sources. Information from data, from experts in the field, and from past pandemics needs to be combined in a logically consistent fashion if we wish to get any sensible predictions. |
| Lack of expertise in crucial disciplines | The credentials of modelers are sometimes undisclosed; when they have been disclosed, these teams are led by scientists who may have strengths in some quantitative fields, but these fields may be remote from infectious diseases and clinical epidemiology; modelers may operate in subject matter vacuum | Make sure that the modelers’ team is diversified and solidly grounded in terms of subject matter expertise |
| Groupthink and bandwagon effects | Models can be tuned to get desirable results and predictions; e.g. by changing the input of what are deemed to be plausible values for key variables. This is especially true for models that depend on theory and speculation, but even data-driven forecasting can do the same, depending on how the modeling is performed. In the presence of strong groupthink and bandwagon effects, modelers may consciously fit their predictions to what is the dominant thinking and expectations – or they may be forced to do so. | Maintain an open-minded approach; unfortunately, models are very difficult, if not impossible, to pre-register, so subjectivity is largely unavoidable and should be taken into account in deciding how much forecasting predictions can be trusted |
| Selective reporting | Forecasts may be more likely to be published or disseminated if they are more extreme | Very difficult to diminish, especially in charged environments; needs to be taken into account in appraising the credibility of extreme forecasts |
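Two rows of the table above lend themselves to a quick numerical illustration. The sketch below is ours, not from the paper: part (a) contrasts the classical homogeneous herd-immunity threshold with one closed-form heterogeneity adjustment circulating in the 2020 literature, and part (b) shows how a small error in an exponentiated growth rate compounds. All parameter values are hypothetical.

```python
# Minimal sketch (not from the paper); all parameter values are hypothetical.
import numpy as np

# (a) Herd-immunity threshold (HIT): homogeneous mixing vs. one
# heterogeneity adjustment proposed in the 2020 literature,
# HIT = 1 - (1/R0)**(1/(1 + CV**2)), where CV is the coefficient of
# variation of individual susceptibility/exposure. CV = 0 recovers the
# classical 1 - 1/R0; larger CV lowers the threshold.
R0 = 2.5
for cv in [0.0, 1.0, 2.0]:
    hit = 1 - (1 / R0) ** (1 / (1 + cv**2))
    print(f"R0={R0}, CV={cv}: herd-immunity threshold ~ {hit:.1%}")
# -> 60.0%, 36.8%, 16.7%

# (b) Sensitivity of exponentiated variables: cases(t) = cases(0)*exp(r*t),
# so a 10% relative error in the daily growth rate r balloons over 30 days.
cases0, t = 100, 30
for r in [0.20, 0.22]:
    print(f"r={r}: projected cases at day {t} = {cases0 * np.exp(r * t):,.0f}")
# -> 40,343 vs. 73,510: an ~82% difference from a 10% error in r
```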
Fig. 1. Predictions for ICU beds made by the IHME models on March 31 for three states: California, New Jersey, and New York. For New York, the model initially overpredicted enormously and then underpredicted. For New Jersey, a neighboring state, the model started well but then underpredicted, while for California it predicted a peak that never eventuated.
Fig. 2. Performance of four data-driven models (IHME, YYG, UT, and LANL) used to predict COVID-19 death counts by state in the USA for the following day, i.e. predictions made only 24 h in advance of the day in question. The figure shows the percentage of times that a particular model’s prediction was within 10% of the ground truth, by state. All models failed in terms of accuracy; for the majority of states, this percentage was less than 20%.
Fig. 3. Performance of the same models examined in Fig. 2 in terms of their uncertainty quantification. If a model’s assessment of uncertainty is accurate, we would expect 95% of the ground-truth values to fall within the 95% prediction interval. Only one of the 4 models (the UT model) approached this level of accuracy.
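For concreteness, the two scoring rules described in the captions of Figs. 2 and 3 can each be computed in a few lines. The sketch below uses made-up numbers; the function and column names are ours, not from the paper.

```python
# Minimal sketch of the two scoring rules described for Figs. 2-3.
# Data and names are hypothetical.
import numpy as np
import pandas as pd

def within_10pct(pred, truth):
    """Fraction of predictions within 10% of the ground truth (Fig. 2)."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean(np.abs(pred - truth) <= 0.10 * truth)

def pi95_coverage(lo, hi, truth):
    """Fraction of ground-truth values inside the 95% prediction interval
    (Fig. 3). A well-calibrated model should score close to 0.95."""
    lo, hi, truth = (np.asarray(x, float) for x in (lo, hi, truth))
    return np.mean((truth >= lo) & (truth <= hi))

# Toy example with made-up daily death counts:
df = pd.DataFrame({
    "truth": [100, 120, 90, 150],
    "pred":  [105, 100, 92, 160],
    "lo95":  [80, 110, 70, 100],
    "hi95":  [130, 115, 120, 200],
})
print(within_10pct(df["pred"], df["truth"]))               # 0.75
print(pi95_coverage(df["lo95"], df["hi95"], df["truth"]))  # 0.75
```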
Fig. 4. Snapshot from https://reichlab.io/covid19-forecast-hub/ (a very useful site that collates information and predictions from multiple forecasting models) as of 11:14 AM PT on June 3, 2020. Predictions for the number of US deaths during week 27 (only 3 weeks downstream) with these 8 models ranged from 2,419 to 11,190, a 4.5-fold difference, and the spectrum of 95% confidence intervals ranged from fewer than 100 deaths to over 16,000 deaths, almost a 200-fold difference.
Adverse consequences of precautionary actions taken in expectation of excess hospitalization and ICU needs (as forecast by multiple models).
| PRECAUTIONARY ACTION | JUSTIFICATION | WHAT WENT WRONG |
|---|---|---|
| Stop elective procedures, delay other treatments | Focus available resources on preparing for the COVID-19 onslaught | Treatments for major conditions like cancer were delayed |
| Send COVID-19 patients to nursing homes | Acute hospital beds are needed for the predicted COVID-19 onslaught; models predict hospital beds will not be enough | Thousands of COVID-19-infected patients were sent to nursing homes |
| Inform the public that we are doing our best, but it is likely that hospitals will be overwhelmed by COVID-19 | Honest communication with the general public | Patients with major problems like heart attacks did not come to the hospital to be treated |
| Re-orient all hospital operations to focus on COVID-19 | Be prepared for the COVID-19 wave; strengthen the response to crisis | Most hospitals saw no major COVID-19 wave but did see a massive reduction in overall operations, with major financial cost leading to furloughs and layoffs of personnel; this makes hospitals less prepared for any major crisis in the future |
Taleb’s main statements and our responses.
| TALEB’S STATEMENT | OUR RESPONSE |
|---|---|
| Forecasting single variables in fat tailed domains is in violation of both common sense and probability theory. | |
| Pandemics are extremely fat tailed. | Yes, and so are many other phenomena. The distribution of financial rewards from mining activity, for example, is incredibly fat tailed and very asymmetric. As such, it is important to accurately quantify the entire distribution of forecasts. From a Bayesian perspective, we can rely on the posterior distribution (as well as the posterior predictive distribution) as the basis of statistical inference and prediction. |
| Science is not about making single-point predictions but understanding properties (which can… | We agree, and that is why the focus should be on the… |
| Risk management is concerned with tail properties and distribution of extrema, not averages or survival functions. | Quality data and calibrated (Bayesian) statistical models may be useful in estimating the behaviour of a random variable across the whole spectrum of outcomes, not just point estimates of summary statistics. While the three-parameter Pareto distribution can be developed based on interesting mathematics, it is not clear that it will provide a measurably better fit to skewed data (e.g. the pandemic data in ref. 19) than would a two-parameter Gamma distribution fitted to the log counts (a sketch of this Gamma-on-log-counts fit follows this table). It is certainly not immediately obvious how to generalize either skewed distribution to allow for the use of all available sources of information in a logically consistent fully probabilistic model. In this regard, we note that upon examining the NY daily death count data studied in… |
| Naive fortune cookie evidentiary methods fail to work under both risk management and fat tails as absence of evidence can play a large role in the properties. | A passenger may well get off the plane if it is on the ground and the skills of the pilot are in doubt, but what if s/he awakes to find s/he is on a nonstop from JFK to LAX? The poor passenger can stay put, cross her/his fingers and say a few prayers, or can get a parachute and jump, assuming s/he is able and willing to open the exit door in midflight. The choice is not so easy when there are considerable risks associated with either decision and the decision needs to be made in real time. We argue that acquiring further information on the pilot’s skill level, perhaps from the flight attendant as s/he strolls down the aisle with the tea trolley, as well as checking that the parachute under the seat (if available) has no holes, would be prudent courses of action. This is exactly the situation that faced New York: the state did not arrange to be ground zero of the COVID-19 pandemic in the US. Various models forecast very high demand for ICU beds in New York state. As a result of this forecast, a decision was made to send COVID-19 patients to nursing homes, with tragic consequences. |
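As a rough illustration of the Gamma-on-log-counts alternative mentioned in the response above, the sketch below fits a two-parameter Gamma to the logs of a synthetic skewed series. The synthetic data and the scipy calls are our choices, not the authors'; real daily death counts would replace the draws.

```python
# Minimal sketch: two-parameter Gamma fitted to log counts.
# Synthetic data; NOT the series analyzed in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
counts = rng.lognormal(mean=4.0, sigma=1.0, size=300)  # skewed toy "death counts"
log_counts = np.log(counts)

# Two-parameter Gamma: estimate shape and scale, pin location at 0.
shape, _, scale = stats.gamma.fit(log_counts, floc=0)
print(f"Gamma on log counts: shape={shape:.2f}, scale={scale:.2f}")

# Crude goodness-of-fit check for the fitted distribution:
ks = stats.kstest(log_counts, "gamma", args=(shape, 0, scale))
print(f"KS statistic = {ks.statistic:.3f}, p = {ks.pvalue:.3f}")
```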
Fig. 5. Population age-risk categories and COVID-19 deaths per age-risk category. The illustration uses estimates for a symptomatic case fatality rate of 0.05% in ages 0–49, 0.2% in ages 50–64, and 1.3% in ages 65 and over, similar to the CDC main planning scenario (https://www.cdc.gov/coronavirus/2019-ncov/hcp/planning-scenarios.html). It also assumes that 50% of infections are asymptomatic in ages 0–49, 30% in ages 50–64, and 10% in ages 65 and over. Furthermore, it assumes that among people in nursing homes and related facilities (0.5% of the population in the USA), the infection fatality rate is 26%, as per Arons, Hatfield, Reddy, Kimball, James, et al. (2020). Finally, it assumes that a modest prognostic model is available whereby the highest-risk 4% of people aged 0–49 account for 50% of the death risk in that category, the top 10% account for 70% of the deaths in the 50–64 category, and the top 30% account for 90% of the risk in the 65-and-over category. Based on available prognostic models (e.g. Williamson et al. (2020)), this prognostic classification should be readily attainable. As shown, <10% of the population is at high risk (shown with dense colors, and thus worth special protection and more aggressive measures), and these people account for >90% of the potential deaths. More than 90% of the population could possibly continue with non-disruptive measures, as they account for only <10% of the total potential deaths.
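The caption's arithmetic is easy to reproduce. In the sketch below, the fatality rates, asymptomatic fractions, and risk-concentration percentages come from the caption, but the total population, the age-band shares, and the attack rate are our rough assumptions (and the overlap between the 65-and-over band and nursing-home residents is ignored), so the absolute numbers are purely illustrative.

```python
# Minimal sketch of the Fig. 5 arithmetic. From the caption: symptomatic
# CFRs, asymptomatic fractions, risk concentration, nursing-home IFR of 26%.
# ASSUMED by us: total population, age-band shares, 20% attack rate.
POP = 330e6          # assumed US population
ATTACK_RATE = 0.20   # hypothetical: 20% of every group gets infected

#           pop share, symptomatic CFR, asymptomatic fraction,
#           high-risk share of band, share of band deaths in high risk
bands = {
    "0-49":  (0.640, 0.0005, 0.50, 0.04, 0.50),
    "50-64": (0.190, 0.0020, 0.30, 0.10, 0.70),
    "65+":   (0.165, 0.0130, 0.10, 0.30, 0.90),
}

total_deaths = hr_deaths = 0.0
hr_pop = 0.005  # nursing homes (0.5% of population) counted as high risk
for name, (share, scfr, asymp, hr_frac, hr_death_frac) in bands.items():
    deaths = POP * share * ATTACK_RATE * (1 - asymp) * scfr  # sCFR on symptomatic cases
    total_deaths += deaths
    hr_deaths += hr_death_frac * deaths
    hr_pop += hr_frac * share
    print(f"{name:>5}: {deaths:>9,.0f} expected deaths")

# Nursing homes: infection fatality rate of 26% applied directly.
nh_deaths = POP * 0.005 * ATTACK_RATE * 0.26
total_deaths += nh_deaths
hr_deaths += nh_deaths

print(f"nursing homes: {nh_deaths:,.0f}")
print(f"high-risk population share: {hr_pop:.1%}")                    # ~9.9%, i.e. <10%
print(f"high-risk share of deaths:  {hr_deaths / total_deaths:.1%}")  # ~90%
```

Under these hypothetical inputs, roughly 10% of the population carries about 90% of the expected deaths, matching the qualitative point of the caption.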