Short-term local predictions of COVID-19 in the United Kingdom using dynamic supervised machine learning algorithms.

Xin Wang^1,2, Yijia Dong^3, William David Thompson^4, Harish Nair^2, You Li^1,2.

Abstract

Background: Short-term prediction of COVID-19 epidemics is crucial to decision making. We aimed to develop supervised machine-learning algorithms on multiple digital metrics including symptom search trends, population mobility, and vaccination coverage to predict local-level COVID-19 growth rates in the UK.
Methods: Using dynamic supervised machine-learning algorithms based on log-linear regression, we explored optimal models for 1-week, 2-week, and 3-week ahead prediction of COVID-19 growth rate at lower tier local authority level over time. Model performance was assessed by calculating the mean squared error (MSE) of prospective predictions, with a naïve model and a fixed-predictors model as reference models. We assessed real-time model performance at eight checkpoints, five weeks apart, between 1st March and 14th November 2021. We developed an online application (COVIDPredLTLA) that visualises the real-time predictions for the present week and for the next one and two weeks.
Results: Here we show that the median MSEs of the optimal models for 1-week, 2-week, and 3-week ahead prediction are 0.12 (IQR: 0.08-0.22), 0.29 (0.19-0.38), and 0.37 (0.25-0.47), respectively. Compared with naïve models, the optimal models maintain increased accuracy (reducing MSE by 21-35%), including during May-June 2021, when the delta variant spread across the UK. Compared with the fixed-predictors model, the advantage of the dynamic models is observed after several iterations of updating.
Conclusions: With a flexible, data-driven predictor selection process, our dynamic modelling framework shows promise in predicting short-term changes in COVID-19 cases. The online application (COVIDPredLTLA) could assist decision-making for control measures and planning of healthcare capacity during future epidemic growth.

Keywords: Disease prevention; Epidemiology

Year: 2022 PMID: 36168444 PMCID: PMC9509378 DOI: 10.1038/s43856-022-00184-7

Source DB: PubMed Journal: Commun Med (Lond) ISSN: 2730-664X

Introduction

The COVID-19 pandemic has caused a substantial health and economic burden in the UK and globally. In the UK, SARS-CoV-2 has caused about 22 million cases, 960 thousand hospital admissions, and 201 thousand deaths as of 12th August 2022[1]. Given the continuous spread of SARS-CoV-2, short-term predictions are important to assist effective decision making for control measures and planning of healthcare capacity[2-4]. Population-aggregated digital data from personal devices and applications have been widely used to predict COVID-19 epidemics because they capture changes in population behaviours in near real time without disclosing personal information. Population-level mobility and internet searches are two of the most important digital data metrics commonly used in modelling infectious disease outbreaks. Population-level mobility data are publicly accessible from several sources, including Google COVID-19 Community Mobility Reports[5], Apple COVID-19 Mobility Trends[6], and Facebook Data for Good[7], and have been increasingly used to understand changes in public physical contacts during the COVID-19 pandemic[8,9]. These sources differ in how the data are collected and provided. Our earlier work found that mobility at different types of locations is associated with varying degrees of change in the transmission of SARS-CoV-2, and that visits to retail and recreation areas, workplaces, and transit stations are key drivers of the COVID-19 epidemic in the UK[10].
Moreover, online search queries on Google were used to detect and predict influenza and other infectious diseases before the COVID-19 pandemic[11-13]. The search trends of infection-related symptoms reflect infected people searching for their symptoms in real time, and can therefore also be used as an early warning of confirmed cases, which are subject to testing capacity and delays[14,15].
Integrating the two metrics above, which capture different types of population behaviours, may improve prediction from the perspective of COVID-19 early warning[3,16]. Kogan and colleagues[16] developed an early warning approach to monitor COVID-19 activity with multiple digital databases in three US states, which showed that the predictions of COVID-19 cases and deaths were improved by integrating multiple digital traces. However, that study did not apply dynamic models that allowed for data-driven adjustment over time (e.g., a fixed set of COVID-19 symptom search terms was used), and the predictions were made for March–September 2020, before the emergence of variants of concern and the mass roll-out of COVID-19 vaccination. The Zoe COVID study[17] found that new symptoms were added over time, e.g., after vaccination or potentially in association with the emergence of new variants. Moreover, COVID-19 vaccination coverage may also be relevant to the prediction, given that the mass roll-out of COVID-19 vaccines could have reduced COVID-19 transmission[18]. As the COVID-19 situation changes rapidly, it is essential for COVID-19 prediction modelling studies to develop flexible algorithms that adapt to the most up-to-date data. In this study, we predicted the short-term changes in COVID-19 epidemics at a finer geographical scale in the UK, through a dynamic supervised machine learning algorithm that could reflect the best real-time prediction informed by data on internet searches for COVID-19 symptoms, mobility, and vaccination coverage. The models achieved better predictive accuracy than two reference models, showing promise in forecasting future local COVID-19 outbreaks. Furthermore, we developed a publicly accessible web application to present the predictions.

Methods

Overview

Our primary objective was to develop data-driven machine-learning models for 1-, 2- and 3-week ahead predictions of growth rates in COVID-19 cases (defined as 1-, 2- and 3-week growth rate, respectively) at lower-tier local authority (LTLA) level in the UK. In the UK, COVID-19 cases are reported by publication date (i.e., the date when the case was registered on the reporting system) and by the date of collection of specimen. Therefore, there were six prediction targets in our study: 1-, 2- and 3-week growth rates by publication date, and the same three growth rates by the date of collection of specimen (Table 1). We focused on prediction by publication date in the main models, considering that delayed reporting of COVID-19 cases by the collection date of specimen could affect real-time assessment of model performance (i.e., the prediction would be biased downwards due to delayed reporting).

Table 1

Prediction targets.

Outcome	Mathematical formula
Y1p_t: 1-week-ahead change in COVID-19 cases (as 1-week growth rate) compared with week t, by publication date^a	$\log \frac{\mathrm{Case}P_{t+1}}{\mathrm{Case}P_{t}}$
Y2p_t: 2-week-ahead change in COVID-19 cases (as 2-week growth rate) compared with week t, by publication date	$\log \frac{\mathrm{Case}P_{t+2}}{\mathrm{Case}P_{t}}$
Y3p_t: 3-week-ahead change in COVID-19 cases (as 3-week growth rate) compared with week t, by publication date	$\log \frac{\mathrm{Case}P_{t+3}}{\mathrm{Case}P_{t}}$
Y1s_t: 1-week-ahead change in COVID-19 cases (as 1-week growth rate) compared with week t, by collection date of specimen^b	$\log \frac{\mathrm{Case}S_{t+1}}{\mathrm{Case}S_{t}}$
Y2s_t: 2-week-ahead change in COVID-19 cases (as 2-week growth rate) compared with week t, by collection date of specimen	$\log \frac{\mathrm{Case}S_{t+2}}{\mathrm{Case}S_{t}}$
Y3s_t: 3-week-ahead change in COVID-19 cases (as 3-week growth rate) compared with week t, by collection date of specimen	$\log \frac{\mathrm{Case}S_{t+3}}{\mathrm{Case}S_{t}}$

CaseP – number of COVID-19 cases by publication date at week t; CaseS – number of COVID-19 cases by collection date of specimen at week t.

^a Publication date refers to the date when the case was registered on the reporting system.

^b Collection date of specimen refers to the date when the respiratory specimen was taken for testing.
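
The growth-rate targets in Table 1 can be illustrated with a minimal Python sketch. It assumes that "log" denotes the natural logarithm and that weeks with zero or not-yet-observed counts are simply skipped; neither choice is stated in the text, and the function names and the case counts below are hypothetical.

```python
import math

def growth_rate(cases, t, horizon):
    """Return log(cases[t + horizon] / cases[t]) for one LTLA.

    `cases` is a list of weekly case counts (hypothetical layout).
    Returns None when the target week is not yet observed or when a count
    is zero, because the log ratio is undefined there (an assumption).
    """
    if t + horizon >= len(cases):
        return None
    if cases[t] <= 0 or cases[t + horizon] <= 0:
        return None
    return math.log(cases[t + horizon] / cases[t])

def percentage_change(rate):
    """Convert a natural-log growth rate back to a percentage change in cases."""
    return (math.exp(rate) - 1.0) * 100.0

weekly_cases = [100, 120, 150, 90]      # made-up counts for illustration
y1 = growth_rate(weekly_cases, 0, 1)    # 1-week growth rate at week 0
y3 = growth_rate(weekly_cases, 0, 3)    # 3-week growth rate at week 0
```

Under the natural-log assumption, a growth rate converts back to a percentage change in cases through `percentage_change`, which may be easier to read than the log ratio itself.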

Data sources

We analysed the Google Search Trends symptoms dataset[5], the Google Community Mobility Reports[19,20], COVID-19 vaccination coverage, and the number of confirmed COVID-19 cases for the UK[1]. These data were formatted, aggregated from daily to weekly level where needed, and then linked by week and LTLA. We considered only the time series from 1st June 2020 (defined as week 1) for modelling, given that case reporting was relatively consistent and reliable at LTLA level from that date. The modelling work began on 15th May 2021 and has been continuously updated using the latest available data since then; when models were fitted, only the versions of the data that were available in real time were used. In this study, we used 14th November 2021 as the time cut-off for reporting (i.e., data between 1st June 2020 and 14th November 2021 were included for modelling), although our model continues to update regularly.
The Google symptom search trends show the relative popularity of symptoms in searches within a geographical area over time[21]. We used the percentage change in symptom searches for each week during the pandemic compared with the pre-pandemic period (the three-year average for the same week during 2017–2019). We considered 173 symptoms for which the search trends had a high level of completeness. These search trends were provided at upper-tier local authority level and were extrapolated to each LTLA.
The Google mobility dataset records daily population mobility relative to a baseline level for six specific areas, namely workplaces, residential areas, parks, retail and recreational areas, grocery and pharmacy, and transit stations[22]. The weekly averages of each of the six mobility metrics for each LTLA were the model inputs. The mobility metrics for the LTLAs of Hackney and City of London were averaged, given that they were grouped into one LTLA in other datasets. Cornwall and Isles of Scilly were combined likewise.
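
As a rough illustration of the symptom-search input described above, the following Python sketch computes the percentage change of one week's search value relative to the mean of the same week in the three pre-pandemic years. The function name, the handling of a zero baseline, and the numbers are assumptions made for the example, not details taken from the study.

```python
def pct_change_from_baseline(current, baseline_years):
    """Percentage change of a weekly search value versus the mean of the
    same calendar week in the baseline years (2017-2019 in the text).

    Returns None if the baseline mean is zero, since the relative change
    is undefined in that case (an assumption of this sketch).
    """
    baseline = sum(baseline_years) / len(baseline_years)
    if baseline == 0:
        return None
    return (current - baseline) / baseline * 100.0

# Made-up values: the same week in 2017, 2018, 2019, then the pandemic week.
change = pct_change_from_baseline(6.0, [4.0, 5.0, 6.0])
```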
The COVID-19 vaccination coverage dataset records, for each day, the cumulative percentage of the population vaccinated with the first dose and with the second dose. Before the start of the vaccination rollout (7th December 2020 for first dose and 28th December 2020 for second dose), the coverage was deemed to be zero. We used the weekly maximum cumulative percentage of people vaccinated for the first dose and second dose for each LTLA in our models. Missing values in symptom search trends, mobility, and vaccination coverage were imputed using linear interpolation for each LTLA[23]. Thirteen LTLAs were excluded as data were insufficient to allow for linear interpolation.
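
The two preprocessing steps named here, taking the weekly maximum of a daily cumulative series and filling gaps by linear interpolation, might look like the Python sketch below. It is a simplified stand-in rather than the authors' R code: it assumes weeks are consecutive blocks of seven days, marks missing values as `None`, and leaves leading or trailing gaps unfilled.

```python
def weekly_max(daily_values):
    """Collapse a daily series into weekly maxima, seven days per week.

    Missing days are given as None and ignored; a week with no observed
    day yields None.
    """
    weeks = []
    for start in range(0, len(daily_values), 7):
        chunk = [v for v in daily_values[start:start + 7] if v is not None]
        weeks.append(max(chunk) if chunk else None)
    return weeks

def interpolate_linear(series):
    """Fill interior None gaps by linear interpolation between the nearest
    observed neighbours; leading and trailing gaps stay None."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

coverage_weekly = weekly_max([0, 1, 2, 3, 4, 5, 6, 7, 8])   # made-up daily values
filled = interpolate_linear([1.0, None, None, 4.0, None])
```

A series with too few observed points cannot be filled this way, which is consistent with the exclusion of LTLAs whose data were insufficient for interpolation.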

Models

Algorithm for model selection

We developed a dynamic supervised machine learning algorithm based on log-linear regression. The algorithm allowed the optimal prediction models to vary over time given the best available data to date, and therefore reflected the best real-time prediction given all available data. Figure 1 shows the iteration of model selection and assessment. We started with a baseline model[24] that included LTLA (as dummy variables), the six Google mobility metrics, vaccination coverage for the first and second doses, and eight base symptoms from the Google symptom search trends (cough, fever, fatigue, diarrhoea, vomiting, shortness of breath, confusion, and chest pain), which were most relevant to COVID-19 symptoms based on existing evidence[25]. Dysgeusia and anosmia, two other main symptoms of COVID-19[26], were not included as base symptoms because Google symptom search data on the two symptoms were only sufficient to allow for modelling in about 56% of the LTLAs (the two symptoms were included as base symptoms in the sensitivity analysis described below). We then selected and assessed the optimal lag combination[15,27,28] between each predictor and growth rate. Next, starting from the eight base symptoms, we applied a forward data-driven method for including additional symptoms in the model. This allowed the inclusion of other symptoms that could improve model predictability. Lastly, we assessed the different predictor combinations (Fig. 1; Supplementary Methods and Supplementary Table 1).
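
The forward, data-driven inclusion of additional symptoms can be sketched generically in Python as a greedy search. This is an illustration under stated assumptions, not the study's procedure: the scoring function is supplied by the caller (a toy score stands in for the prediction error of a fitted regression), and the stopping rule (stop when no candidate lowers the score) is assumed, since the exact rule is given only in the Supplementary Methods.

```python
def forward_select(base, candidates, score):
    """Greedy forward selection.

    Starting from the `base` predictors, repeatedly add the candidate that
    gives the lowest `score` (lower is better) and stop as soon as no
    candidate improves on the current score.
    """
    selected = list(base)
    remaining = [c for c in candidates if c not in selected]
    best = score(selected)
    while remaining:
        trial_best, trial_pred = min((score(selected + [c]), c) for c in remaining)
        if trial_best >= best:
            break
        selected.append(trial_pred)
        remaining.remove(trial_pred)
        best = trial_best
    return selected, best

# Toy score standing in for a model's prediction error (hypothetical values).
toy_gain = {"cough": 0.2, "fever": 0.1, "headache": 0.05,
            "sore throat": 0.0, "insomnia": -0.02}

def toy_score(predictors):
    return 1.0 - sum(toy_gain[p] for p in predictors)

chosen, chosen_score = forward_select(
    ["cough", "fever"], ["headache", "sore throat", "insomnia"], toy_score)
```

In a real application `score` would refit the regression with the trial predictor set and return its prediction error, which makes each step considerably more expensive than this toy example suggests.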

Fig. 1

Schematic figure showing model selection and assessment.

SE: squared error; MSE: mean squared error. In each of the assessment steps, the optimal model had the smallest MSE. The predictors X in the figure are the mobility metrics at six locations, the search metrics of the eight base symptoms, and COVID-19 vaccination coverage for the first and second doses. Details are in the Supplementary Methods.

At each of the steps, model performance was assessed by calculating the average mean squared error (MSE) of the predictions over the previous four weeks, i.e., the 4-week MSE, with the MSE for each week being evaluated separately by fitting the same candidate model (Fig. 1 and Supplementary Methods). The 4-week MSE reflected the average predictability of candidate models over the previous four weeks (referred to as the retrospective 4-week MSE). Models with the minimum 4-week MSE were considered for inclusion in each step. Separate model selection processes were conducted for each of the prediction targets. In addition, we considered naïve models as alternative model candidates for selection; naïve models (which assumed no changes in the growth rate) carried forward the last available observation for each of the outcomes as the prediction. As with the full models (i.e., models with predictors), we considered a time lag between zero and three weeks, and used the 4-week MSE for naïve models (Supplementary Table 2).
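
A minimal Python sketch of the retrospective 4-week MSE, applied to a naïve carry-forward model with a lag of zero to three weeks, is given below. The data layout (one list of weekly growth rates per LTLA), the helper names, and the indexing convention for the lag (predicting week w from week w - 1 - lag) are assumptions made for illustration; the exact definitions are in the Supplementary Methods.

```python
def weekly_mse(observed, predicted):
    """Mean squared error across LTLAs for a single week."""
    errors = [(observed[k] - predicted[k]) ** 2 for k in observed]
    return sum(errors) / len(errors)

def four_week_mse(series_by_ltla, predict, current_week, n_weeks=4):
    """Average of the weekly MSEs over the n_weeks weeks ending at current_week.

    `predict(series, week)` must return a prediction for `week` using only
    values observed before that week.
    """
    weekly = []
    for week in range(current_week - n_weeks + 1, current_week + 1):
        observed = {k: s[week] for k, s in series_by_ltla.items()}
        predicted = {k: predict(s, week) for k, s in series_by_ltla.items()}
        weekly.append(weekly_mse(observed, predicted))
    return sum(weekly) / len(weekly)

def make_naive(lag):
    """Naive model: carry forward the observation from week - 1 - lag."""
    def predict(series, week):
        return series[week - 1 - lag]
    return predict

# Made-up growth rates for two hypothetical LTLAs, weeks 0 to 7.
rates = {"A": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], "B": [0.0] * 8}
mse_by_lag = {lag: four_week_mse(rates, make_naive(lag), 7) for lag in range(4)}
best_lag = min(mse_by_lag, key=mse_by_lag.get)
```

The same `four_week_mse` function could score any candidate model, so naïve and full models can be ranked on a common scale.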

Prospective evaluation of model predictability

After selecting the optimal model based on the retrospective 4-week MSE, we evaluated model predictability prospectively by calculating the prediction errors for forecasts of growth rates in the following 1–3 weeks (for the three prediction timeframes), referred to as the prospective MSE (Supplementary Methods and Supplementary Table 3). As the optimal prediction models changed over time under our modelling framework, we selected a priori eight checkpoints that were five weeks apart for assessing model predictability (we did not assess every week due to the considerable computational time required): year 1/week 40 (the week of 1st March 2021), 1/45 (5th April), 1/50 (10th May), 2/3 (14th June), 2/8 (19th July), 2/13 (30th August), 2/18 (4th October) and 2/23 (14th November). For each checkpoint, we presented the composition of the optimal models as well as the corresponding prospective MSE.
Two reference models were used to help evaluate our dynamic optimal models. We considered naïve models (with the optimal time lag based on the 4-week retrospective MSE) as the first reference model, to understand how much the models driven by covariates could outperform models that assume the status quo. As the second reference model, to further demonstrate the advantages of our dynamic model selection approach over a conventional model with a fixed list of predictors, we used the optimal model for the first checkpoint (i.e., year 1/week 40) and fixed its covariates (referred to as the fixed-predictors model); we then compared its prospective MSEs for the next seven checkpoints (i.e., year 1/week 45 onwards), allowing the model coefficients to vary.
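
The comparison against a reference model can be expressed as a percentage reduction in prospective MSE. The Python sketch below shows the arithmetic on made-up numbers; the function names and values are hypothetical and do not reproduce any result from the study.

```python
def mse(observed, predicted):
    """Mean squared error between observed and predicted growth rates."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def pct_reduction(reference_mse, model_mse):
    """Percentage by which the model lowers MSE relative to a reference model."""
    return (reference_mse - model_mse) / reference_mse * 100.0

observed = [0.1, -0.2, 0.3]      # made-up growth rates at one checkpoint
dynamic = [0.2, -0.1, 0.2]       # made-up predictions from the selected model
naive = [0.3, 0.0, 0.1]          # made-up predictions from the naive model

dynamic_mse = mse(observed, dynamic)
naive_mse = mse(observed, naive)
reduction = pct_reduction(naive_mse, dynamic_mse)
```

A positive value means the selected model has the lower prospective error; a negative value would mean the reference model did better at that checkpoint.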

Sensitivity analyses

As a sensitivity analysis, the base symptoms were expanded to include dysgeusia and anosmia, as well as headache, nasal congestion, and sore throat, which have recently been reported as common symptoms of COVID-19[17], to assess how the predictive accuracy was affected.

Web application

We developed a web application, COVIDPredLTLA, using R Shiny, presenting our best prediction results at local level in the UK given all available data to date. COVIDPredLTLA (https://leoly2017.github.io/COVIDPredLTLA/), officially launched on 1st December 2021, uses real-time data from the above sources and currently updates twice per week. The application presents the predicted percentage changes (and uncertainties where applicable) in COVID-19 cases in the present week (nowcasts) and one and two weeks ahead (forecasts) compared with the previous week, using the optimal models (which technically could be naïve models or any of the full models), for both case definitions (by publication date and by collection date of specimen) for each LTLA.
Analyses were done with R software (version 4.1.1). We followed the STROBE guidelines for the reporting of observational studies as well as the EPIFORGE guidelines for the reporting of epidemic forecasting and prediction research. All the data included in the analyses were population-aggregated data available in the public domain; therefore, ethical approval was not required.
