Antonio Naimoli. Università di Salerno, Dipartimento di Scienze Economiche e Statistiche (DISES), Via Giovanni Paolo II, 132, 84084, Fisciano, SA, Italy.
Abstract
The current Covid-19 pandemic is severely affecting public health and global economies. In this context, accurately predicting its evolution is essential for planning and providing resources effectively. This paper aims at capturing the dynamics of the positivity rate (PPR) of the novel coronavirus using the Heterogeneous Autoregressive (HAR) model. The use of this model is motivated by two main empirical features arising from the analysis of PPR time series: the changing long-run level and the persistent autocorrelation structure. Compared to the most frequently used Autoregressive Integrated Moving Average (ARIMA) models, the HAR is able to reproduce the strong persistence of the data by using components aggregated at different interval sizes, remaining parsimonious and easy to estimate. The relative merits of the proposed approach are assessed by performing a forecasting study on the Italian dataset. As a robustness check, the analysis of the positivity rate is also conducted by considering the case of the United States. The ability of the HAR-type models to predict the PPR at different horizons is evaluated through several loss functions, comparing the results with those generated by ARIMA models. The Model Confidence Set is used to test the significance of differences in the predictive performances of the models under analysis. Our findings suggest that HAR-type models significantly outperform ARIMA specifications in terms of forecasting accuracy. We also find that the PPR could represent an important metric for monitoring the evolution of hospitalizations, as the peak of patients in intensive care units occurs within 12-16 days after the peak in the positivity rate. This can help governments in planning socio-economic and health policies in advance.
Following China's report (31 December 2019) of a cluster of cases of pneumonia of unknown aetiology (later identified as a new coronavirus, Sars-CoV-2) in the city of Wuhan, the World Health Organisation (WHO) declared the novel coronavirus outbreak (Covid-19) a Public Health Emergency of International Concern on 30 January 2020. On 11 March 2020, the WHO identified the Covid-19 outbreak as a global pandemic. As of 01 December 2020, Covid-19 had infected more than 64.06 million people, with 1.54 million deaths worldwide since its emergence. One year later, on 01 December 2021, the number of confirmed cases had risen to 263.52 million, while the number of confirmed deaths had reached 5.23 million.¹

The general understanding of the evolution of the Covid-19 pandemic by researchers and policymakers is often based on data on confirmed cases and deaths. These are the variables that guide public policy in both the introduction and relaxation of non-pharmaceutical interventions, such as masking, social distancing, and other crucial public health measures. However, looking only at the number of cases and deaths can lead to a highly misleading picture of the true scale and spread of the Covid-19 pandemic. These data can only be meaningfully interpreted in conjunction with an accurate understanding of the extent and allocation of virus testing [1]. Rapid, reliable, and accurate testing to confirm cases is a prerequisite for successful contact tracing because, without it, infected individuals may remain unidentified and continue to act as sources of community transmission. Control of the epidemic can be hampered when the percentage of asymptomatic cases is high, as in the case of Covid-19 [2]. Thus, when the virus spreads, the only way to identify all infected individuals would be through a universal testing program. Therefore, test positivity is a metric of potential relevance in situations where the percentage of asymptomatic cases is particularly high [3].
In addition, according to the World Health Organization [4], before governments can ease restrictions or begin to reopen, the positivity rate (or percent-positive rate) should be 5% or less for at least 14 days. Accordingly, the positivity rate (PPR) is a critical measure, as it provides an indication of how prevalent the infection is in the testing area and whether testing levels are keeping up with disease transmission levels.

The Covid-19 pandemic, and the associated stringent preventive measures taken by governments in response to the rapid growth in infections and deaths, have led to unprecedented socio-economic challenges around the world. For example, analyzing the economic connection with Covid-19, Coccia [5] pointed out that the pandemic and containment policies produced negative effects on economic growth. As a result, the pandemic shocked the global economy, from financial markets, where asset prices declined and volatility increased, reflecting both the impact of the pandemic and the uncertainty about its future course [6–9], to supply chains [10–12]. Italy was the first Western country to experience a Covid-19 emergency, with a spiral of infections that placed the country at the top of the international rankings (surpassing China on 19 March 2020 with 41,035 confirmed cases and 3,405 deaths²), and consequently faced large-scale health and socio-economic challenges [13]. Several hypotheses have been advanced to explain the rapid spread of the virus within the country, including various geographical, environmental, and socio-economic similarities between Hubei Province in China (where the epidemic broke out) and Northern Italy [14]. Several studies have shown that airborne transmission is the dominant route of spread of Covid-19, i.e., the main human-to-human diffusion mechanism [15]. However, other variables have been identified as potentially contributing to the spread of Covid-19, such as demographic factors [16–18], environmental and climatic factors [19–21], air pollution [14,22,23] and social interactions associated with the mobility of people and economic activities [24–27]. Along these lines, Panarello and Tassinari [28], making use of a containment index, sanction data, and Google's movement trends across Italian provinces, provided evidence of a deterrent effect on mobility given by the increase in sanction rate and positivity rate among the population.

Overall, the spread of the novel coronavirus has severely affected the integrity of the global economic, social, financial, and behavioral system. Analyzing the relationships between epidemiological and economic models, Verikios [29] pointed out that, because of the greater uncertainty surrounding its nature and the stringent preventive measures taken by governments, Covid-19 is likely to be of longer duration and consequently more severe in its economic effects than previous pandemics. The current outbreak has turned out to be the first extraordinary long-term disruption of the global supply chain [11,30,31] and, differently from the past, all supply chain players have been severely affected by this pandemic [10,12,32,33].
For example, for some supply chains, demand for necessary items such as personal protective equipment and beauty and personal care products has increased [12,30,32]. Conversely, for other industries, such as transportation and manufacturing, demand and supply have decreased dramatically, causing a halt in production [34,35]. In response to the current vulnerability of the entire supply system, several resilience strategies have been suggested to mitigate the impacts of Covid-19 and to recover from the current pandemic (see, e.g., Refs. [12,36–42], among others). The rapid spread of Covid-19 has also significantly affected sustainability, raising several environmental, economic, and social issues [43–49]. However, in this negative scenario, it is worth highlighting some positive aspects mainly related to environmental sustainability: improved air quality, low carbon dioxide and greenhouse gas emissions, and decreased energy use and environmental pollution [43,44,49,50].

In this context, it is necessary to formulate functional planning for the health infrastructure and services in order to curb the spread of the Covid-19 pandemic. An accurate forecast of the epidemiological trends is essential for health system management and government reform planning. Therefore, to support non-pharmaceutical intervention policies during the Covid-19 outbreak, several models have been proposed for fitting and forecasting the epidemic evolution (see, e.g., Refs. [51–53]).

In the past, the Autoregressive Integrated Moving Average (ARIMA) model has been widely employed for forecasting time series of epidemic diseases. For example, ARIMA models have been successfully applied to estimate the incidence of Severe Acute Respiratory Syndrome (SARS) [54], malaria [55], tuberculosis [56], influenza viruses [57] and brucellosis [58].
This class of models is also being used to estimate and predict the evolution of the ongoing pandemic. Benvenuto et al. [59] applied an ARIMA model to world data to predict the epidemiological trend of the prevalence and incidence of Covid-2019. Singh et al. [60] used it to predict confirmed cases, deaths, and recoveries for the top 15 countries. Sahai et al. [61] employed ARIMA models on the daily time series of total infected cases for the US, Brazil, India, Russia and Spain to forecast the spread of Covid-19. Monllor et al. [62] applied it to analyze the series of infected persons in China, Italy and Spain, finding a common pattern of disease spread. Ceylan [63] used ARIMA specifications to predict the epidemiological trend of total confirmed cases of Covid-19 in Italy, Spain, and France.

Most of these papers that rely on ARIMA specifications model the dynamics of the total number of infected cases, deaths or recoveries. To support policymakers in defining guidelines for the management of health systems, as well as to facilitate the development of plans for economic recovery, this paper investigates the dynamics of the Covid-19 positivity rate, defined as the number of new positive cases divided by the number of total tests. Our empirical analysis reveals that the positivity rate is characterized by a slowly moving long-run level and a highly persistent autocorrelation structure.

ARMA models are not well suited to modelling the long-term behavior of time series. Long memory processes are characterized by a high-order correlation structure indicating persistent dependence between distant observations, implying that the effect of shocks takes a very long time to disappear. The conventional ARMA process is often referred to as a short memory process, as it is unable to capture the dynamics of a long memory series.
On the other hand, the Autoregressive Fractionally Integrated Moving Average (ARFIMA) process, which allows the order of integration of a series to take on fractional values, provides a useful tool for modelling and forecasting time series with long memory properties [64]. However, being a fractional integration model, the ARFIMA is not trivial to estimate and lacks a clear economic interpretation. This leads us to introduce a new approach to directly model and forecast the Covid-19 positivity rate. Namely, the time series behavior of the PPR can be adequately captured by the Heterogeneous Autoregressive (HAR) model of Corsi [65]. In this context, the HAR model represents an attractive alternative because of its computational simplicity, ease of interpretation, and remarkable forecasting performance.

Although formally not belonging to the class of long memory models, the HAR model is able to closely mimic the observed long memory behavior by using variables aggregated at different interval sizes. Therefore, differently from the standard ARIMA specifications, HAR-type models are based on an additive cascade of components, from high to low frequencies, which makes it possible to capture both the high degree of persistence (through the long-term component) and the short-term dynamics (through the short- and medium-term components) that characterize the PPR behavior. Along with conventional HAR-type models, we also consider the possibility of selecting relevant lagged components through flexible HAR specifications based on the use of the least absolute shrinkage and selection operator (lasso) [66] and the adaptive lasso [67].

The aim of this paper is to assess the usefulness of the HAR as a new modelling approach to predict the spread of Covid-19 by capturing the short-, medium- and long-term dynamics of the PPR.
Accurate short- and long-term predictions of the PPR can be essential both for developing non-pharmaceutical strategic planning by policymakers to address the current health emergency and for shaping new policies to overcome the severe negative impacts experienced by businesses and supply chains because of the pandemic. The positivity rate measures both the severity of the outbreak and the limitations of testing. That is, the PPR is a useful measure of whether sufficient testing has been done and what the current level of SARS-CoV-2 transmission is in the community. Therefore, this approach could provide a useful tool for both monitoring the spread of the virus and guiding policymakers to undertake actions to curb the spread of the disease. HAR-type models prove to be particularly useful as, on the one hand, they are able to adequately predict the short-, medium- and long-term trend of the positivity rate and, on the other hand, the lasso-based HAR specifications are completely data-driven, thus reducing uncertainty in the choice of predictor lags. Therefore, the proposed approach provides a reliable tool that simplifies the decision-making process by moving towards a single data-driven direction.

The usefulness of the HAR approach in forecasting the Covid-19 positivity rate is evaluated through an application to the Italian dataset, as Italy was the first European country to be seriously affected by the pandemic. On 30 March 2020, more than 101 thousand people were positive for Covid-19.³ The empirical application shows that HAR-type models outperform the most popular ARIMA models, and that the improvements are especially significant for longer forecast horizons, as detected by the Model Confidence Set (MCS) [68]. These findings are also confirmed by analyzing the positivity rate of the United States, which was considered as a reference country for a robustness check of the proposed approach. Finally, the positivity rate exhibits predictive ability with respect to hospitalizations.

The remainder of the paper is organized as follows. Section 2 presents the Heterogeneous Autoregressive model and its lasso-based extensions. Section 3 describes the data and the main non-pharmaceutical measures adopted by the Italian government. Section 4 illustrates the results of the empirical study and robustness checks. Section 5 presents a broader discussion of the positivity rate along with some limitations and caveats. Finally, Section 6 summarizes the findings with concluding remarks.
Heterogeneous autoregressive models
Inspired by the Heterogeneous Market Hypothesis [69] and the asymmetric propagation of volatility between long-term and short-term horizons, Corsi [65] proposed the HAR model to parsimoniously capture the strong persistence typically observed in Realized Volatility (RV) [70] through the sum of lagged RV components aggregated over different interval sizes.

The HAR model is commonly used in modelling the dynamics of financial volatility, as it is able to reproduce the main stylized facts of financial data, such as the long memory and asymmetric propagation of volatility over time. In most empirical applications, the HAR model is specified as an additive cascade of three volatility components aggregated over different time intervals, that is daily, weekly and monthly, which implies a fixed (1,5,22) lag structure. However, the structure (1,5,22) may not fully reflect the characteristics of the data. Thus, determining the optimal lag structure of the HAR could significantly improve the predictive ability of the model. In this direction, Audrino and Knaus [71] showed that the HAR-implied lag structure can be recovered asymptotically by the lasso only if the HAR is the underlying data generating process (DGP). On the other hand, differently from Audrino and Knaus [71], who employed the lasso in the Autoregressive (AR) framework, Audrino et al. [72] used the adaptive lasso to investigate whether the lag structure implied by the HAR can be identified. However, their results highlight the difficulty of outperforming the forecast performance of the standard HAR model based on the daily, weekly and monthly components.

The HAR model can be easily estimated by ordinary least squares (OLS) and shows remarkably good volatility forecasting performance. It is therefore widely used to model the dynamics of RV, but it has never been used to predict the evolution of pandemics.
In light of its inherent characteristics, we propose to apply the HAR modelling approach to capture the slowly decaying autocorrelation structure (also known as long memory) of the Covid-19 positivity rate series.

Let $PPR_t$ be the positivity rate at time $t$. The HAR model for the $h$-step-ahead daily $PPR_{t+h}$ can be specified as

$$PPR_{t+h} = \beta_0 + \beta_1 PPR_t^{(1)} + \beta_2 PPR_t^{(7)} + \beta_3 PPR_t^{(28)} + \epsilon_{t+h}, \qquad (1)$$

where $PPR_t^{(k)} = \frac{1}{k}\sum_{j=0}^{k-1} PPR_{t-j}$ is the $k$-period average of daily PPR and $\epsilon_{t+h}$ is a zero mean innovation.

This specification, substantially, states that tomorrow's PPR is a weighted sum of daily, weekly and monthly averages of PPRs that can be characterized by different dynamics of virus infection and transmission over time. For example, to take into account the different periods of time between infection and development of clinical symptoms as well as the transmission periods, the model in Equation (1) can be further extended with components aggregated over additional interval sizes. Thus, the HAR model is parsimonious, approximates long memory in a very simple way, and can be consistently estimated by OLS.

In this context, it becomes crucial to define the lag structure and the maximum order of the model. It is worth noting that the HAR can be represented as a constrained AR(p) model [65]. Considering the HAR process introduced in Equation (1), we can write it as a restricted AR(28) process, namely

$$PPR_{t+h} = \beta_0 + \sum_{i=1}^{28} \phi_i\, PPR_{t-i+1} + \epsilon_{t+h},$$

where

$$\phi_i = \begin{cases} \beta_1 + \beta_2/7 + \beta_3/28, & i = 1, \\ \beta_2/7 + \beta_3/28, & i = 2, \dots, 7, \\ \beta_3/28, & i = 8, \dots, 28. \end{cases}$$

In contrast to the fixed daily-weekly-monthly time scale, extensions of the HAR model have been proposed to allow for potentially different predictive information arising from a different lag structure. In this direction, to investigate whether a more general lag structure provides more accurate predictions than the fixed (1,5,22) lag index, Audrino and Knaus [71] and Audrino et al. [72] compared the standard HAR model to a lasso-based method.

Let the daily $PPR_t$ be denoted by $x_t$, with $(x_{t-1}, \dots, x_{t-p})'$ the predictor variables. Then, the lasso⁴ [66] estimator of the AR(p) model

$$x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$$

is obtained as

$$\hat{\phi}^{lasso} = \arg\min_{c,\,\phi} \sum_{t} \Big( x_t - c - \sum_{i=1}^{p} \phi_i x_{t-i} \Big)^2 + \lambda \sum_{i=1}^{p} |\phi_i|,$$

where $\lambda \geq 0$ is the tuning parameter which controls the strictness of the penalty term, with $\lambda = 0$ leading to the OLS estimator. The solution for the constant $c$ is $\hat{c} = \bar{x}\,\big(1 - \sum_{i=1}^{p} \hat{\phi}_i\big)$, with $\bar{x}$ the sample mean of $x_t$, that is zero for demeaned data.

It has been shown that the lasso suffers from some drawbacks due to the lack of oracle properties. On the other hand, the adaptive lasso [67] estimator fulfils the oracle property in the sense introduced by Fan and Li [73], as it allows asymptotically consistent and efficient variable selection and provides asymptotically unbiased and normally distributed estimates of the non-zero coefficients.

The adaptive lasso estimator is given by

$$\hat{\phi}^{alasso} = \arg\min_{c,\,\phi} \sum_{t} \Big( x_t - c - \sum_{i=1}^{p} \phi_i x_{t-i} \Big)^2 + \lambda \sum_{i=1}^{p} \lambda_i |\phi_i|,$$

where the weights $\lambda_i$ can be computed as the inverse of the absolute value of the corresponding preliminary ridge regression or OLS estimator. The ordinary lasso is obtained as a special case for $\lambda_i = 1$, $\forall i = 1, \dots, p$.

K-fold cross-validation is used to determine the optimal tuning parameter $\lambda$. Specifically, the data are randomly divided into $K$ groups $(G_1, \dots, G_K)$ and for each group the mean squared error is estimated on the validation set by

$$MSE_k(\lambda) = \frac{1}{|G_k|} \sum_{t \in G_k} \big( x_t - \hat{x}_t^{(-k)}(\lambda) \big)^2,$$

where $\hat{x}_t^{(-k)}(\lambda)$ denotes the prediction from the model estimated on the data excluding $G_k$. For each tuning parameter value, the average error over all folds is computed as

$$CV(\lambda) = \frac{1}{K} \sum_{k=1}^{K} MSE_k(\lambda).$$

Thus, the optimal $\lambda$ is chosen by minimizing the $CV(\lambda)$ function, that is

$$\hat{\lambda} = \arg\min_{\lambda} CV(\lambda).$$

The Flexible HAR (FHAR) of Audrino et al. [72] can be estimated by applying the adaptive lasso procedure to select the active terms to be included in the model, considering the following equation

$$PPR_{t+h} = \beta_0 + \sum_{k=1}^{p} \beta_k\, PPR_t^{(k)} + \epsilon_{t+h}.$$

Motivated by these theoretical developments, this paper aims to apply the HAR and its lasso extensions to Covid-19 data to predict its spread and provide a good alternative to other models that have been proposed to study the dynamics of the ongoing pandemic.
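As a rough illustration of the HAR building blocks described in this section, the following Python sketch computes k-period averages, forms a HAR prediction, and maps the HAR slopes into the lag coefficients of the implied restricted AR process. The window set (1, 7, 28) is an assumption chosen to match the restricted AR(28) representation mentioned in the text; the coefficient values, the toy series and all function names are hypothetical and are not taken from the paper. In practice the coefficients would be estimated by OLS rather than fixed by hand.

```python
def k_period_average(x, t, k):
    """Average of the k most recent observations x[t-k+1], ..., x[t]."""
    return sum(x[t - k + 1 : t + 1]) / k


def har_forecast(x, t, beta, windows=(1, 7, 28)):
    """HAR prediction: intercept plus one slope per aggregated component."""
    components = [k_period_average(x, t, k) for k in windows]
    return beta[0] + sum(b * c for b, c in zip(beta[1:], components))


def restricted_ar_coefficients(beta, windows=(1, 7, 28)):
    """Lag coefficients phi_1..phi_p (p = largest window) implied by the HAR slopes."""
    phi = [0.0] * max(windows)
    for b, k in zip(beta[1:], windows):
        for i in range(k):  # a k-period average puts weight 1/k on each of k lags
            phi[i] += b / k
    return phi


def ar_forecast(x, t, intercept, phi):
    """AR prediction using x[t], x[t-1], ... with coefficients phi_1, phi_2, ..."""
    return intercept + sum(c * x[t - i] for i, c in enumerate(phi))


# Toy positivity-rate series (hypothetical values, in percent).
ppr = [2.0 + 0.1 * (d % 7) + 0.05 * d for d in range(40)]
beta = (0.1, 0.4, 0.3, 0.2)  # hypothetical (beta_0, beta_1, beta_2, beta_3)
t = len(ppr) - 1

phi = restricted_ar_coefficients(beta)
har_pred = har_forecast(ppr, t, beta)
ar_pred = ar_forecast(ppr, t, beta[0], phi)
print(har_pred, ar_pred)  # the two predictions coincide up to rounding
```

The example makes the parsimony argument concrete: three slopes determine all 28 lag coefficients, which take only three distinct values.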
Data description and management of the 2020 Covid-19 pandemic in Italy
The data used in this paper refer to the daily number of confirmed Covid-19 cases and daily total tests in Italy between 24 February and 20 December 2020, for a total of 301 days. The data have been downloaded from the official Civil Protection Department website⁵ - Presidency of the Council of Ministers.

Table 1 provides summary statistics for the new positive cases, number of tests (swabs) performed and PPR at a daily level in Italy for the full sample period. The occurrence of new cases of a disease developing in a population over a period of time, also known as “incidence” in epidemiology, can be used to map the frequency with which Covid-19 develops in a community.⁶ New cases peaked at 40,902 during the second wave (October–November 2020), along with the number of tests performed in a single day, 254,908. On the other hand, the positivity rate reached 46.21% during the first wave (February–March 2020) and fell to its minimum of 0.23% in June, during Phase 2, which was characterized by an easing of previously adopted restrictive measures. A possible explanation for the high positivity rate in the very early phase of the epidemic is that on 25 February 2020, the Italian Ministry of Health issued more stringent testing policies. That is, testing was prioritized for patients with more severe clinical symptoms who were suspected of having Covid-19 and required hospitalization. Consequently, testing was limited for people who were asymptomatic or had mild symptoms. This strategy inevitably led to a high percentage of positive tests [74].
Table 1
Summary statistics.

| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Std.Dev. |
|---|---|---|---|---|---|---|---|
| Daily new positive cases | 78 | 384 | 1,501 | 6,498 | 5,560 | 40,902 | 10,143.22 |
| Daily tests | 964 | 41,867 | 61,725 | 84,259 | 108,019 | 254,908 | 62,825.32 |
| Daily positivity rate (%) | 0.23 | 0.90 | 2.39 | 6.69 | 11.61 | 46.21 | 7.61 |
Fig. 1 displays the time series of the daily positivity rate, given by (new positive cases/total tests) × 100, for Italy. It also shows that the PPR exhibits persistence, i.e. large changes in the positivity rate are often followed by other large changes and small changes are often followed by small changes. The presence of long memory can be identified by a data-driven empirical approach in terms of the persistence of observed autocorrelations. This feature is highlighted in Fig. 2, which displays the daily and weekly autocorrelation functions for the PPR up to lag 40. The correlograms show that the autocorrelations exhibit a clear pattern of slow decay and persistence. In particular, the sample autocorrelations reveal the presence of a hyperbolic decay rate, which is much slower than the usual geometric rate associated with stationary ARMA processes. Also, the null hypothesis of a unit root could not be rejected at the 5% level using the Augmented Dickey-Fuller test. However, the classical trend-stationary I(0) and unit-root I(1) representations may be too restrictive with respect to the low-frequency dynamics of the series. This compelling evidence of long memory, i.e. the persistent impact of historical PPRs on the future PPR, suggests that the Covid-19 positivity rate can be adequately modelled through HAR specifications.
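The persistence diagnostics discussed above rest on the sample autocorrelation function. The Python sketch below shows one common way to compute it and, separately, contrasts a geometric decay ρ^k with a hyperbolic decay k^(2d−1) of the kind associated with long memory. The values ρ = 0.9 and d = 0.4 and the five-point toy series are arbitrary illustrative assumptions, not estimates from the paper, and the function name is hypothetical.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom
        for lag in range(max_lag + 1)
    ]


# Tiny toy series, only to show the mechanics of the estimator.
acf = sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
print(acf)

# Theoretical decay patterns up to lag 40 (illustrative parameter values).
rho, d = 0.9, 0.4
geometric = [rho ** k for k in range(1, 41)]            # short-memory, ARMA-like
hyperbolic = [k ** (2 * d - 1) for k in range(1, 41)]   # long-memory-like
print(geometric[-1], hyperbolic[-1])
```

At lag 40 the hyperbolic sequence is still far from zero while the geometric one has almost vanished, which is the qualitative pattern the correlograms are described as showing.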
Fig. 1
Time series of the daily positivity rate for Italy. Sample period: 24 February 2020 - 20 December 2020.
Fig. 2
Autocorrelation function of the daily (left) and weekly (right) positivity rate for Italy.
To better understand the dynamics of the contagion and the importance of PPR behavior in guiding decisions about reopening schools and businesses, we briefly report the main measures taken by the Italian government to contain the epidemic.⁷
For a more extensive analysis of policy interventions implemented by the Italian government and their impact on health and non-health outcomes, see Berardi et al. [13].

The Italian government confirmed the first cases of the disease in the country on 30 January 2020, when the novel coronavirus was detected in two Chinese tourists visiting Italy. At the request of the Italian Health Authorities, all flights to/from the People's Republic of China (PRC), including Hong Kong, Macao and Taipei, were suspended. Once the first internal outbreak was discovered, one of the first measures adopted was the quarantine of 11 municipalities in Northern Italy located in Lombardy and Veneto.

On 23 February, the Council of Ministers decreed the total closure of the municipalities with active outbreaks. Fig. 1 shows that the positivity rate continued to grow after 23 February, reaching its maximum of 46.21% on 09 March. During this time, there was a succession of different measures aimed at containing the epidemic, and on 09 March the so-called Phase 1 began, with the country being locked down until 03 May 2020. Italy was the first country to implement a national quarantine due to the 2020 coronavirus outbreak. As a result, the positivity rate began to slowly decline towards zero and, on 26 April, the then-Prime Minister Giuseppe Conte announced the so-called Phase 2, which would start on 04 May. Phase 2 was characterized by a gradual relaxation of previous containment measures. Italy therefore tried to restart by reopening bars, restaurants and shops, all while observing the new safety rules, ranging from social distancing to the use of face masks. As can clearly be seen from Fig. 1, the infection curve tended to flatten out, and Phase 3, of coexistence with Covid-19, ran from 15 June (end of Phase 2) to 07 October.

Following the rise of the epidemic curve in the autumn, renewed restrictions were progressively introduced, mainly concerning commercial and private activities rather than restricting movement. This led to the second wave of the pandemic, with the positivity rate rising and new restrictive measures being introduced between 08 October and 05 November. Starting from 03 November 2020, the Regions and Autonomous Provinces of Trento and Bolzano were classified into three areas, namely red, orange and yellow, according to the degree of risk, for which specific restrictive measures were envisaged. This classification is based on the ordinances issued by the Ministry of Health.

To mitigate the effects of Covid-19 and make appropriate health, economic and social system decisions, it is crucial to understand the pandemic evolution. The effective reproductive number ($R_t$) is a parameter that has been widely used to measure the transmissibility of the ongoing epidemic infection. Therefore, as a further tool to understand the actions enacted by the Italian government to counter the spread of the coronavirus, the $R_t$ is also estimated. The $R_t$ is a fundamental epidemiological parameter that characterizes the temporal dynamics of infectious disease, as it measures the average number of secondary cases caused by an infected individual in a population composed of both susceptible and non-susceptible individuals [75–77].

The Covid-19 $R_t$ has been estimated by the Cori et al. [76] approach, using the EpiEstim package [78] in the R software [79]. To explicitly take into account the uncertainty in the serial interval (SI) distribution, the mean and standard deviation of the SI (time interval between the onset of symptoms in the primary and secondary cases) are allowed to vary according to truncated normal distributions, employing parameters estimated from existing studies (see [80–82], among others). Therefore, the $R_t$ was estimated on sliding weekly windows, with values drawn from a Gamma distribution, with the mean and standard deviation sampled from 1,000 truncated normal distributions, for which we used an average mean serial interval of 4.8 days (sd = 2.3, min = 3.8, max = 5.8) and an average standard deviation of 2.3 days (sd = 2.3, min = 1.3, max = 3.3).

The resulting weekly $R_t$ series is reported in Fig. 3. It clearly shows the peak of transmissibility during the first wave (February–March 2020), while between April and around mid-June the $R_t$ index remains below one, indicating that the spread of infection is decreasing. However, with the start of Phase 3 (15 June), a slight but steady increase in the national transmission index was noted, reaching a summer peak in the period 15 August - 31 August, when $R_t$ reached 1.59 (95% CI: 1.41–1.75), with incidence calculated as daily cases. It is worth noting that the estimates have a large stochastic variability, especially with regard to the summer period, which is overall characterized by a small number of cases.
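To give a sense of how a window-based estimate of the effective reproductive number works, the sketch below implements the posterior mean from the Cori et al. approach in plain Python: with a Gamma prior on $R_t$, the posterior mean over a window is (prior shape + cases in the window) / (1 / prior scale + total infection pressure in the window). This is a simplified sketch under stated assumptions, not the EpiEstim implementation: the serial interval is fixed at the point values quoted in the text (mean 4.8, sd 2.3) instead of being resampled from truncated normals, its discretization is a simple normalization of the Gamma density at integer days (EpiEstim uses a different discretization), the prior shape 1 and scale 5 are assumed, and the incidence series are synthetic. All function names are hypothetical.

```python
import math


def serial_interval_weights(mean, sd, max_days=30):
    """Weights w_1..w_max_days proportional to a Gamma density at integer days."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    log_norm = math.lgamma(shape) + shape * math.log(scale)
    raw = [
        math.exp((shape - 1) * math.log(s) - s / scale - log_norm)
        for s in range(1, max_days + 1)
    ]
    total = sum(raw)
    return [r / total for r in raw]


def infection_pressure(incidence, w, t):
    """Lambda_t = sum over s >= 1 of I_{t-s} * w_s (past cases weighted by the SI)."""
    return sum(incidence[t - s] * w[s - 1] for s in range(1, min(t, len(w)) + 1))


def posterior_mean_rt(incidence, w, t, window=7, prior_shape=1.0, prior_scale=5.0):
    """Posterior mean of R_t over the window of days ending at day t."""
    days = range(t - window + 1, t + 1)
    cases = sum(incidence[s] for s in days)
    pressure = sum(infection_pressure(incidence, w, s) for s in days)
    return (prior_shape + cases) / (1.0 / prior_scale + pressure)


w = serial_interval_weights(mean=4.8, sd=2.3)

flat = [100.0] * 60                              # constant daily incidence
growing = [100.0 * 1.05 ** d for d in range(60)]  # 5% daily growth

r_flat = posterior_mean_rt(flat, w, t=59)
r_growing = posterior_mean_rt(growing, w, t=59)
print(r_flat, r_growing)
```

With constant incidence the estimate sits close to one, and with growing incidence it exceeds one, which matches the interpretation of the threshold used in the text.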
Fig. 3
Effective reproductive number for Italy.
Estimates of the weekly effective reproductive number ($R_t$, solid line) and 95% credible interval (grey area) during the coronavirus disease 2019 outbreak in Italy. Sample period: 24 February 2020–20 December 2020.
Because of the increase in the number of infections, on 16 August 2020 the Minister of Health, Roberto Speranza, signed an ordinance ordering the closure of discos and dance halls and making it compulsory, from 6 p.m. to 6 a.m., to wear masks even in public spaces. In September, with the start of the new school year, classroom activities resumed. To reduce the risk of infection, school staff were allowed to undergo serological testing. Finally, during the second wave (October–November 2020), an $R_t$ index above 1.5 was recorded for most of October, falling below 1 from mid-November until the end of December. As of 20 December 2020, a total of 622,760 people tested positive for Covid-19, with 68,799 deaths and 1,261,626 patients discharged/healed, nationwide.
Empirical analysis
In this section, we conduct several empirical studies to compare the in-sample and out-of-sample performance of the HAR with the commonly used ARMA models. The data employed for the empirical analysis consist of the daily Covid-19 positivity rate recorded in Italy between 24 February and 20 December 2020, with a full-sample period of 301 days.
Regarding the Flexible-HAR (FHAR) based on the adaptive lasso, following Audrino et al. [72], the maximum lag order is set to p = 50, while the tuning parameter λ is chosen by five-fold cross-validation. The weights λ in the adaptive lasso are calibrated as the inverse of the absolute value of the corresponding preliminary ridge regression estimator [72].
We also estimate a Flexible-HAR based on the lasso method to select an appropriate HAR lag length, resulting in the Lasso-HAR (LHAR) model. All the lasso estimates are obtained using the glmnet package [83].
For the purpose of comparing the in-sample and out-of-sample performance of the analyzed models, we consider the following loss functions [84]: MAE, MAEsd, MAElog, MAEprop, MSE, MSEsd, MSElog, MSEprop and QLIKE, where $x_t$ is the PPR and $\hat{x}_t$ is the prediction obtained by the HAR-type or ARIMA-type models.
In addition, we assess the significance of differences in the forecasting performance of all competing models by means of the MCS [68]. The MCS relies on a sequence of statistical tests to identify, at a certain confidence level (1 − α), the set of superior models with respect to some appropriately chosen measures of predictive ability. The MCS p-values are obtained from 5,000 bootstrap resamples generated by a block-bootstrap procedure, with the optimal block length estimated through the method described in Patton et al. [85].
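As a minimal sketch, the nine criteria can be computed as below. The functional forms are an assumption: they are the ones commonly attached to these loss names in the forecast-evaluation literature (errors taken on the level, square-root, log and proportional scales, plus QLIKE), and they should be checked against [84] before use. The function name `loss_table` is hypothetical, and both inputs are assumed to be strictly positive.

```python
import math

LOSS_NAMES = ["MAE", "MAEsd", "MAElog", "MAEprop",
              "MSE", "MSEsd", "MSElog", "MSEprop", "QLIKE"]


def loss_table(actual, predicted):
    """Average each loss over paired observations x (actual) and f (forecast).

    Assumed forms (not taken from the paper's own equations):
      level: x - f          sd:   sqrt(x) - sqrt(f)
      log:   log x - log f  prop: x / f - 1
      QLIKE: log f + x / f
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty series of equal length")
    totals = dict.fromkeys(LOSS_NAMES, 0.0)
    for x, f in zip(actual, predicted):
        if x <= 0 or f <= 0:
            raise ValueError("log and sqrt losses need strictly positive values")
        errors = {
            "": x - f,
            "sd": math.sqrt(x) - math.sqrt(f),
            "log": math.log(x) - math.log(f),
            "prop": x / f - 1.0,
        }
        for tag, err in errors.items():
            totals["MAE" + tag] += abs(err)   # absolute-error variant
            totals["MSE" + tag] += err * err  # squared-error variant
        totals["QLIKE"] += math.log(f) + x / f
    n = len(actual)
    return {name: total / n for name, total in totals.items()}
```

Calling `loss_table(ppr, fitted)` on two equal-length lists returns one average per criterion, which is the quantity tabulated for each model.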
In-sample results
Table 2 shows the results of the model comparison in terms of in-sample accuracy. In particular, we compare 15 ARIMA specifications with the HAR based on different lag structures, together with LHAR and FHAR. To determine whether differencing of the PPR series is required, we use the Augmented Dickey-Fuller (ADF) test, which suggests that one difference is needed to make the data stationary. This leads to using ARIMA(p,1,q) specifications.
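The count of 15 specifications is consistent with letting p and q each range from 0 to 3 and dropping the case p = q = 0. This is an inference from the models listed in Table 2, not something stated in the text. The short sketch below enumerates that grid and shows the first-differencing step implied by d = 1; the ADF test itself is not implemented, and both function names are hypothetical.

```python
from itertools import product


def arima_grid(max_order=3, d=1):
    """All (p, d, q) with 0 <= p, q <= max_order, excluding p = q = 0."""
    return [(p, d, q)
            for p, q in product(range(max_order + 1), repeat=2)
            if (p, q) != (0, 0)]


def first_difference(series):
    """Return x_t - x_{t-1}; the output is one element shorter than the input."""
    return [curr - prev for prev, curr in zip(series, series[1:])]
```

Each tuple in `arima_grid()` can then be passed to whichever ARIMA estimation routine is in use.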
Table 2
In-sample model comparison.
| Model | MAE | MAEsd | MAElog | MAEprop | MSE | MSEsd | MSElog | MSEprop | QLIKE |
|---|---|---|---|---|---|---|---|---|---|
| ARIMA(1,1,0) | 0.5317 | 0.1336 | 0.1953 | 0.1931 | 0.6574 | 0.0282 | 0.0691 | 0.0694 | 1.8096 |
| ARIMA(2,1,0) | 0.5357 | 0.1324 | 0.1882 | 0.1863 | 0.6979 | 0.0287 | 0.0631 | 0.0627 | 1.8068 |
| ARIMA(3,1,0) | 0.5505 | 0.1350 | 0.1889 | 0.1889 | 0.7837 | 0.0315 | 0.0647 | 0.0679 | 1.8081 |
| ARIMA(0,1,1) | 0.5263 | 0.1296 | 0.1827 | 0.1807 | 0.6878 | 0.0283 | 0.0609 | 0.0608 | 1.8058 |
| ARIMA(0,1,2) | 0.5589 | 0.1367 | 0.1915 | 0.1912 | 0.7891 | 0.0314 | 0.0640 | 0.0664 | 1.8077 |
| ARIMA(0,1,3) | 0.5716 | 0.1393 | 0.1946 | 0.1976 | 0.8328 | 0.0328 | 0.0654 | 0.0740 | 1.8093 |
| ARIMA(1,1,1) | 0.5413 | 0.1330 | 0.1872 | 0.1856 | 0.7296 | 0.0295 | 0.0620 | 0.0626 | 1.8065 |
| ARIMA(1,1,2) | 0.5252 | 0.1301 | 0.1851 | 0.1883 | 0.6919 | 0.0283 | 0.0603 | 0.0685 | 1.8068 |
| ARIMA(1,1,3) | 0.5284 | 0.1307 | 0.1857 | 0.1891 | 0.7011 | 0.0286 | 0.0607 | 0.0694 | 1.8070 |
| ARIMA(2,1,1) | 0.5428 | 0.1331 | 0.1869 | 0.1857 | 0.7400 | 0.0298 | 0.0617 | 0.0629 | 1.8064 |
| ARIMA(2,1,2) | 0.4708 | 0.1169 | **0.1691** | 0.1747 | 0.5997 | 0.0244 | **0.0556** | 0.0672 | 1.8049 |
| ARIMA(2,1,3) | 0.4698 | 0.1185 | 0.1745 | 0.1810 | 0.5776 | 0.0242 | 0.0581 | 0.0727 | 1.8064 |
| ARIMA(3,1,1) | 0.5478 | 0.1347 | 0.1889 | 0.1889 | 0.7727 | 0.0312 | 0.0649 | 0.0684 | 1.8082 |
| ARIMA(3,1,2) | 0.5437 | 0.1366 | 0.1978 | 0.1982 | 0.7413 | 0.0312 | 0.0736 | 0.0788 | 1.8123 |
| ARIMA(3,1,3) | 0.4673 | 0.1177 | 0.1729 | 0.1790 | 0.5736 | 0.0240 | 0.0575 | 0.0715 | 1.8060 |
| HAR(1,7,14) | 0.4593 | 0.1222 | 0.1872 | 0.1674 | 0.4711 | 0.0240 | 0.0743 | 0.0504 | 1.8081 |
| HAR(1,7,21) | 0.4674 | 0.1300 | 0.2076 | 0.1798 | 0.4796 | 0.0264 | 0.0900 | 0.0561 | 1.8139 |
| HAR(1,7,28) | 0.4835 | 0.1320 | 0.2073 | 0.1799 | 0.5079 | 0.0274 | 0.0909 | 0.0562 | 1.8141 |
| HAR(1,7,14,21) | 0.5008 | 0.1413 | 0.2276 | 0.1933 | 0.5285 | 0.0303 | 0.1060 | 0.0628 | 1.8199 |
| HAR(1,7,14,28) | 0.4539 | 0.1209 | 0.1852 | 0.1665 | 0.4567 | 0.0232 | 0.0718 | 0.0498 | 1.8072 |
| HAR(1,7,14,21,28) | 0.4432 | 0.1157 | 0.1731 | 0.1589 | 0.4572 | 0.0223 | 0.0641 | **0.0485** | **1.8046** |
| LHAR(1,7,19,25) | **0.4255** | **0.1118** | 0.1693 | **0.1563** | **0.4460** | **0.0218** | 0.0645 | 0.0490 | 1.8048 |
| FHAR(1,7,27) | 0.4410 | 0.1149 | 0.1726 | 0.1590 | 0.4796 | 0.0231 | 0.0675 | 0.0509 | 1.8059 |
The table reports the average values of the different loss functions for the models under analysis. The lowest value of the loss in each column is displayed in bold. The sample runs from 24 February 2020 to 20 December 2020. Note that for LHAR and FHAR the lags are not imposed; the lag structures reported are those selected by the lasso and adaptive lasso methods, respectively.
The empirical results in Table 2 highlight that the lag structure selected by the adaptive lasso for the FHAR is (1,7,27), which is in line with the canonical daily-weekly-monthly lag structure of the standard HAR model, while the lasso suggests using an additional biweekly/triweekly lag for the LHAR, i.e. (1,7,19,25). It is worth noting that in seven out of nine cases the loss functions considered are minimized by HAR-type models. In particular, the LHAR minimizes MAE, MAEsd, MAEprop, MSE and MSEsd, whereas the HAR(1,7,14,21,28) minimizes MSEprop and QLIKE. Finally, the ARIMA(2,1,2) is the specification that minimizes MAElog and MSElog (even though for MAElog the LHAR returns a very similar result).
In Fig. 4, we plot the actual daily PPR (black line) together with the values estimated by the LHAR model (red line) and the ARIMA(2,1,2) model (green line). It can easily be seen that while the LHAR better follows the dynamics of the positivity rate in both low and high infection periods, the ARIMA(2,1,2), being smoother, is not able to fully capture all variations and peaks in the actual PPR, especially in periods characterized by a high viral spread rate. These considerations remain essentially valid for the other HAR and ARIMA specifications, which are not shown for ease of interpretation.
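To make the lag notation concrete, the sketch below builds the regressors for a fixed-lag specification such as HAR(1,7,14). It assumes the usual HAR construction, in which each component is the average of the last k daily observations and the target is the next day's value. That definition is not restated in this section, so it is an assumption here, and the function name `har_design` is hypothetical.

```python
def har_design(series, lags):
    """Build HAR regressors and one-step-ahead targets.

    For each day t, the row holds the mean of the last k observations
    (ending at t) for every k in `lags`; the target is the value at t + 1.
    """
    if not lags or min(lags) < 1:
        raise ValueError("lags must be positive integers")
    longest = max(lags)
    rows, targets = [], []
    # Start once the longest average is available; stop one day before the end.
    for t in range(longest - 1, len(series) - 1):
        rows.append([sum(series[t - k + 1:t + 1]) / k for k in lags])
        targets.append(series[t + 1])
    return rows, targets
```

For example, `har_design(ppr, (1, 7, 14))` yields one row of daily, weekly and biweekly averages per usable day, and the coefficients can then be estimated by regressing the targets on those rows plus an intercept.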
Fig. 4
Actual vs estimated positivity rate.
Comparison of actual (black) and in-sample prediction of the Lasso-HAR (red) and of ARIMA(2,1,2) (green) of the daily positivity rate for Italy.
Out-of-sample results
To investigate the predictive ability of the models, we conduct an h-step-ahead rolling window study at the forecasting horizons h = 1, h = 3 and h = 7. The forecasts are obtained by re-estimating the model parameters at each step with a rolling window of 200 observations (2/3 of the sample). To compare the forecasting performances, we consider the set of loss functions specified in Section 4, while to assess the significance of differences among the competing models we refer to the MCS relying on the semi-quadratic statistic and the confidence levels of 75% and 90%.
We first consider the case of one-day-ahead PPR forecasts (h = 1), showing the results in Table 3. Overall, HAR-type models provide significantly better performance than ARIMA models. The lowest loss values are always returned by the LHAR, with the single exception of the MSE, which is minimized by the HAR(1,7,14,21,28). The superiority of the HAR models is also confirmed by the MCS, as the only ARIMAs falling in the 75% MCS are the ARIMA(2,1,3) for MAE and one further MAE-type loss and the ARIMA(3,1,2) for MAE. Some other ARIMA specifications with orders p, q ≥ 2 are included in the less restrictive 90% MCS in a few isolated cases. On the other hand, the LHAR and FHAR are the only models always entering the set of superior models for all the forecast criteria considered.
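The rolling-window scheme described above can be sketched as follows. The forecaster is passed in as a function, and `last_week_mean` is only a placeholder used to exercise the loop; it is not one of the models compared in the paper. The indexing convention (the window ends at the forecast origin and the target lies h steps after the last training observation) is an assumption, as are all names.

```python
def rolling_forecasts(series, window, horizon, fit_predict):
    """Slide a fixed-length window through `series`.

    At each step the model is re-fitted on the latest `window` observations
    and asked for an h-step-ahead forecast.
    Returns a list of (target_index, forecast, actual) tuples.
    """
    results = []
    for end in range(window, len(series) - horizon + 1):
        train = series[end - window:end]   # most recent `window` values
        target = end + horizon - 1         # 0-based index being forecast
        results.append((target, fit_predict(train, horizon), series[target]))
    return results


def last_week_mean(train, horizon):
    """Placeholder forecaster: the mean of the last seven training values."""
    recent = train[-7:]
    return sum(recent) / len(recent)
```

Under this convention a 301-day sample with a 200-day window gives 101 forecasts at h = 1, 99 at h = 3 and 95 at h = 7. The paper does not report these counts, so they are illustrative only.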
Table 3
Out-of-sample model comparison for Italy: forecast horizon h = 1.
The table reports the average values of the different loss functions for the models under analysis. Bold numbers indicate the best performing model by each criterion at the forecast horizon h = 1. The numbers shaded in gray and light-gray denote that the corresponding models are included in the 75% and 90% MCS, respectively. We use a rolling window of 200 observations to estimate the coefficients of the models at each step.
For the remaining forecast horizons, ARIMA models perform poorly in general. Table 4 and Table 5 report the forecast performance of all considered models for h = 3 and h = 7 periods ahead, respectively. It can easily be seen that the superiority of the HAR models remains unchanged over longer forecast horizons. Among the seven HAR-type models, the LHAR shows a dominant position for predicting the PPR at h = 3, since it minimizes the loss functions in eight out of nine cases and is the only model that always enters the 75% MCS (Table 4). At the same time, the FHAR provides better forecast accuracy at a weekly (h = 7) horizon, achieving the lowest losses for all criteria used and being the only model permanently included in the 75% MCS (Table 5).
Table 4
Out-of-sample model comparison for Italy: forecast horizon h = 3.
The table reports the average values of the different loss functions for the models under analysis. Bold numbers indicate the best performing model by each criterion at the forecast horizon h = 3. The numbers shaded in gray and light-gray denote that the corresponding models are included in the 75% and 90% MCS, respectively. We use a rolling window of 200 observations to estimate the coefficients of the models at each step.
Table 5
Out-of-sample model comparison for Italy: forecast horizon h = 7.
The table reports the average values of the different loss functions for the models under analysis. Bold numbers indicate the best performing model by each criterion at the forecast horizon h = 7. The numbers shaded in gray and light-gray denote that the corresponding models are included in the 75% and 90% MCS, respectively. We use a rolling window of 200 observations to estimate the coefficients of the models at each step.
Summarizing, the out-of-sample results clearly show that HAR-type specifications outperform ARIMA models in predicting the positivity rate at the considered forecast horizons h = 1, h = 3 and h = 7. It should also be noted that, overall, both the Lasso-HAR and the Flexible-HAR, which take into account uncertainty in model specification, outperform the standard HAR based on a fixed lag index at each forecast horizon and by each criterion. These results suggest that allowing a more general specification of the HAR is successful, probably because the lasso-based models include only active predictors, letting the lag structure approximate the long memory observed in the data.
Robustness check: out-of-sample results for the United States
We investigate the robustness of the proposed approach by considering the evolution of the PPR for the United States (US). The 2019 coronavirus pandemic has led to massive social upheaval around the world and in the US. As mentioned above, the first outbreak of the virus occurred in the city of Wuhan in China's Hubei province in December 2019, but the virus then spread to Asia, Europe and North America between January and March 2020. By the end of March, there were more than 700,000 confirmed cases of Covid-19 worldwide and more than 34,000 people had died from causes related to the virus, with the US reaching more confirmed cases than any other country, surpassing China and Italy with more than 86,000 positive tests [86].
The data used for our analysis of the PPR for the US were downloaded from the Our World in Data website by Roser et al. [87]. Since data on testing are not available for the early phase of the pandemic, the sample period goes from 01 March to 20 December 2020. As with the Italian PPR, forecasts are obtained by re-estimating the model parameters every day over a 200-day rolling window. Accordingly, the out-of-sample period runs from 16 September to 20 December 2020. To keep the presentation compact, the tables with the forecasting results are reported in the Appendix (Tables A1–A3), and only the main findings are discussed here.
Considering the forecasting horizons h = 1, h = 3 and h = 7, the results for the US confirm what was found for Italy. In particular, it turns out that LHAR and FHAR are the only models that consistently enter the MCS regardless of the chosen forecast horizon, always minimizing each of the nine loss functions considered. For h = 1 and h = 3 no other models are included in the MCS, while for h = 7 some HAR specifications with a fixed lag index enter the set of superior models, but only for MAE-type loss functions. Overall, for short forecast horizons, ARMA-type models and HAR-type models with a fixed lag structure tend to have similar performance, but for longer forecast horizons HAR specifications prevail. On the other hand, the lasso-based HAR specifications outperform competing models in each forecasting scenario, capturing both short- and long-run PPR dynamics.
Discussion
As discussed above, Covid-19 is an infection characterized by a high percentage of asymptomatic cases. Several studies have shown that more than 40% of cases may not reveal symptoms. This means that no country knows the true total number of people affected by Covid-19; all we know is the infection status of those who have been tested. As a result, testing is a crucial element in understanding the spread of the ongoing pandemic [1]. More informative than a simple count of the total number of tests, and used in conjunction with data on confirmed cases, the positivity rate represents a key metric for understanding the pandemic, as it measures both the severity of the epidemic and the limitations of testing. According to WHO, before a country can loosen restrictions or begin reopening, the positivity rate for a comprehensive testing program should be 5% or less for at least 14 days.
High rates of positivity occur when, for example, the only people being tested are patients with more severe clinical symptoms who are suspected of having Covid-19 and have required hospitalization. Consequently, a high PPR means that countries should probably aim for a larger and more comprehensive testing program, and suggests that it is not a reasonable time to relax restrictions designed to reduce coronavirus transmission. At the same time, because a high positivity rate suggests high rates of infection in the community caused by rapid transmission of the virus, it indicates that it may be useful to impose restrictions to slow the spread of the disease. A low test positivity rate may be the result of a testing volume large enough that asymptomatic and mild cases, as well as exposed contacts, are monitored. On the other hand, low positivity rates can be the result of enacting different types of public health interventions, such as encouraging smart working, closing schools, banning mass gatherings and restricting eating in public places, along with stay-at-home measures ranging from permissive to total lockdown.
Such a situation requires the parallel development of at least two dimensions: on the one hand, policymakers should plan ahead for needs in terms of medical facilities and equipment, while, on the other hand, analytical tools and models allowing the generation of reliable forecasts and future scenarios should be developed.
For example, analyzing the data in Fig. 5, which shows the time series of the normalized 7-day moving average of the PPR and of patients in Intensive Care Units (ICU) for Italy, it turns out that the peak of patients in ICU occurs between 12 and 16 days after the peak of the PPR. Similar scenarios also arise at the regional level [88]. In this perspective, the approach proposed in this paper could help decision makers to plan public health policies in advance, because the PPR could have a predictive capacity with respect to hospitalizations, allowing the intensity of these interventions to be adjusted over the course of the epidemic. In addition, the ability of the HAR to reproduce the short- and medium/long-term trend of the positivity rate could help avoid both the immediate economic costs of lockdown and the societal costs of social distancing measures. For example, a long-term upward prediction of the PPR could lead to a testing strategy on a larger population scale to monitor and reduce viral transmission. This will inevitably also have repercussions on the social networks of the various actors in the supply chain.
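The comparison behind Fig. 5 can be sketched with three small helpers: a 7-day moving average, min-max scaling to the unit interval, and the distance in days between the two peaks. The trailing (rather than centred) average and the use of the global maximum as "the peak" are simplifying assumptions, and all function names are hypothetical. For a series with several waves the lag would have to be computed wave by wave.

```python
def moving_average(series, width=7):
    """Trailing moving average; the first width - 1 days are dropped."""
    return [sum(series[i - width + 1:i + 1]) / width
            for i in range(width - 1, len(series))]


def min_max(series):
    """Rescale a series so that its minimum is 0 and its maximum is 1."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [(value - lo) / (hi - lo) for value in series]


def peak_lag(leading, lagging):
    """Days from the peak of `leading` to the peak of `lagging`.

    A positive result means `lagging` peaks later. Both series must cover
    the same dates, and only the global maximum of each is used.
    """
    return lagging.index(max(lagging)) - leading.index(max(leading))
```

With daily lists `ppr` and `icu` covering the same dates, `peak_lag(min_max(moving_average(ppr)), min_max(moving_average(icu)))` gives the number of days by which the ICU peak follows the PPR peak.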
Fig. 5
Positivity rate vs hospitalized ICU patients.
Comparison of 7-day moving average of positivity rate in black and hospitalized patients in intensive care units (ICU) in red for Italy. For ease of comparison, variables were normalized to have a scale between 0 and 1.
Conclusion
Understanding the dynamics of the current epidemic is essential for the development of non-pharmaceutical interventions and thus for reducing the health, economic and social impacts caused by the pandemic. This paper focuses on the positivity rate as it represents a crucial metric for understanding the Covid-19 outbreak. The positivity rate offers a measure of how adequately countries are testing and provides insight into the current level of coronavirus transmission in the community. Since ARMA models have been found to be poorly suited to modelling the long-term behavior of time series, in this paper we propose to use the HAR approach to model the slowly moving long-run level and the highly persistent autocorrelation structure that characterize the Covid-19 positivity rate.
The empirical study of the Italian positivity rate, along with robustness checks on the US data, shows that HAR models generally outperform ARIMA specifications under various criteria and forecast horizons. The forecasting superiority of the HAR emerges from the MCS, where the standard HAR and its lasso-based alternatives significantly outperform ARIMA models at the forecast horizons h = 1, h = 3 and h = 7. In particular, the gains widen as the forecast horizon increases. The out-of-sample results also indicate that the more general lasso-based HAR lag structure is preferable to the fixed HAR lag structure. These results are confirmed by the in-sample analysis, as allowing for model specification uncertainty under the HAR framework leads to improvements in model fitting, minimizing the loss functions considered.
Thus, this approach is particularly useful as it allows for accurately forecasting short-, medium-, and long-term trends of the positivity rate. In this regard, monitoring the trend in the positivity rate and in ICU admissions suggests that the PPR might have predictive ability with respect to hospitalizations, because peaks in the positivity rate precede peaks in hospitalizations, which occur on average 12–16 days later. Generating accurate forecasts of the PPR at different horizons is relevant for reducing uncertainty around interventions, helping to optimize resource and investment allocations. Therefore, understanding the trend in Covid-19 positivity rates would allow governments to modify their social and health policies in advance. Also, since the model components are chosen in a completely data-driven fashion, the uncertainty and arbitrariness associated with model specification are significantly reduced. In this respect, lasso-based HAR-type models simplify the decision-making process by steering it in a common direction driven by the dynamics of the data.
However, it is worth noting that policy decisions should not be determined only by the positivity rate, but also by other health, economic, demographic, environmental, and climate variables. This is because the positivity rate provides some useful information about testing capacity and the spread of the virus in the community, but it can also depend on factors such as how it is calculated, the accessibility of testing and the timeliness of laboratories in providing results. Of course, no single metric gives us a complete picture of the prevalence of Covid-19 in the community. Therefore, each week, we need to monitor PPR trends along with other metrics such as recovered and active cases, percent change in new cases, hospitalizations and deaths.
In the United States, there are no federal standards for reporting Covid-19 test data. As a result, test data are reported differently, which makes it impossible to offer a single view of testing data at the national level. In addition, there are several possible ways to calculate test positivity. For example, the Johns Hopkins University & Medicine webpage "Differences in Positivity Rates" outlines four possible ways to calculate positivity rates.
Regarding Italy, the Italian Civil Protection provides comprehensive data at the regional level on some variables of interest such as swabs analyzed, cumulative confirmed cases, home isolation cases, hospitalized cases, ICU cases, and deaths. However, regionalization of the health care system and data fragmentation pose challenges in the management of the Covid-19 outbreak in Italy. This has led to the enactment of several regional policies, especially in terms of testing strategies. Consequently, in order to trace the real extent of Covid-19 infection, official data must be interpreted with caution, considering several aspects. For example, because there are some inconsistencies and delays in the transmission of these data, on some days negative values of positive cases, tests, and deaths are reported. In addition, short-term fluctuations could affect the reliability of daily data. These fluctuations may be the result of laboratory delays (laboratory saturation) or calendar effects (typically, the number of tests tends to decrease on weekends). Thus, these inconsistencies should be considered to account for data variability. In this direction, in addition to official national bulletins, it might be useful to cross-reference information from different data sources.
Currently, a complete picture of the drivers of Covid-19 spread that clarifies the causes of the variability in infections across provinces and regions within countries is still lacking. Although human-to-human transmission is recognized as the primary vehicle for virus transmission, several studies have argued that virus circulation may also be associated with geographical, environmental, and socio-economic factors. Accordingly, using information on these factors could allow the definition and refinement of epidemiological modelling and thus the design of appropriate policy responses to manage this threat to population health and, more generally, to socio-economic systems. In this perspective, the proposed approach could be improved by including health, environmental, and socio-economic variables in the HAR models.
Along these lines, as a direction for future research, it could be useful to investigate, and potentially include in HAR models, the relationships between the positivity rate and epidemiological variables (effective reproductive number R); demographic parameters (social interactions, age, and sex); environmental and climatic factors (humidity, wind speed, and temperature); pollution indicators (air quality); and socio-economic activities (economic and social interactions within and between countries). At the same time, it might be appropriate to consider data inconsistencies. Considering these variables when generating forecasts could provide decision makers with better guidance in establishing additional control measures, loosening restrictions, enhancing benefits, and preventing the failure of measures already taken.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Declaration of competing interest
The author has no conflicts of interest to disclose.
Table A1
Out-of-sample model comparison for United States: forecast horizon h = 1
The table reports the average values of the different loss functions for the models under analysis. Bold numbers indicate the best performing model by each criterion at the forecast horizon h = 1. The numbers shaded in gray and light-gray denote that the corresponding models are included in the 75% and 90% MCS, respectively. We use a rolling window of 200 observations to estimate the coefficients of the models at each step.
Table A2
Out-of-sample model comparison for United States: forecast horizon h = 3
The table reports the average values of the different loss functions for the models under analysis. Bold numbers indicate the best performing model by each criterion at the forecast horizon h = 3. The numbers shaded in gray and light-gray denote that the corresponding models are included in the 75% and 90% MCS, respectively. We use a rolling window of 200 observations to estimate the coefficients of the models at each step.
Table A3
Out-of-sample model comparison for United States: forecast horizon h = 7
The table reports the average values of the different loss functions for the models under analysis. Bold numbers indicate the best performing model by each criterion at the forecast horizon h = 7. The numbers shaded in gray and light-gray denote that the corresponding models are included in the 75% and 90% MCS, respectively. We use a rolling window of 200 observations to estimate the coefficients of the models at each step.