Predicting the spread of COVID-19 in Italy using machine learning: Do socio-economic factors matter?

Francesco Bloise¹, Massimiliano Tancioni².

Abstract

We exploit the provincial variability of COVID-19 cases registered in Italy to select the territorial predictors of the pandemic. Absent an established theoretical diffusion model, we apply machine learning to isolate, among 77 potential predictors, those that minimize the out-of-sample prediction error. We first estimate the model considering cumulative cases registered before the containment measures displayed their effects (i.e. at the peak of the epidemic in March 2020), then cases registered between the peak date and when containment measures were relaxed in early June. In the first estimate, the results highlight the dominance of factors related to the intensity and interactions of economic activities. In the second, the relevance of these variables is highly reduced, suggesting mitigation of the pandemic following the lockdown of the economy. Finally, by considering cases at onset of the "second wave", we confirm that the territorial distribution of the epidemic is associated with economic factors.

Keywords: COVID-19; Coronavirus; Economic networks; Epidemic; Machine learning; Economic structure

Year: 2021 PMID: 35317020 PMCID: PMC7994006 DOI: 10.1016/j.strueco.2021.01.001

Source DB: PubMed Journal: Struct Chang Econ Dyn ISSN: 0954-349X

Introduction

The COVID-19 pandemic has opened up new challenges for understanding the factors associated with the spread of contagious diseases and the role played by social, economic, and environmental conditions. In this study we investigate the case of Italy, the first European country to experience a large number of registered cases in early 2020, to address three main questions. First, we use a methodology (based on a machine learning algorithm) for selecting the relevant predictors of registered COVID-19 cases from a large set of official, provincial-level data in Italy. These data include economic activity indicators within a conditioning set that considers a high number of potential triggers addressed in the current literature. Second, by repeating the analysis on cumulative cases observed at different points in time, we evaluate whether some selected predictors lose their relevance following the time-specific containment measures implemented by the Italian government, which might signal that these measures have been effective in mitigating the pandemic's further spread. Third, we set up a simulation strategy to verify the external validity of the model by testing its ability to predict the diffusion of COVID-19 in "unseen" areas.
Before going into the details of the analysis, we provide some background on the pandemic in Italy. On January 30, 2020, the first two cases of COVID-19 were detected, and the respective patients were hospitalized in Rome, Italy. Only about 20 days later, on February 20, 2020, the first COVID-19 outbreak was identified in Codogno, a municipality belonging to the province of Lodi in the Lombardy region. Italy then emerged as the first European country to be severely hit by the COVID-19 pandemic. Specifically, as of March 8, 7,375 COVID-19 cases were registered in Italy, of which nearly 57% were in Lombardy.
On March 21, when the highest number of newly diagnosed positive cases was registered (6,557), the cumulative number of cases increased to 53,578 (of which 47.6% were in Lombardy), exhibiting a growth rate close to 14%. Several increasingly restrictive containment measures were adopted by the Italian government after the detection of the first outbreak, leading to a nearly complete lockdown of the country's economic activities on March 21, 2020. This strong containment measure (stage one) was in effect until May 3, 2020. From this date until June 3 (stage two), the containment was partially relaxed. At the end of stage two on June 3, the number of registered cases was 233,836, with a 0.1% growth rate. The country then entered a "third stage" in the management of the COVID-19 spread, mostly based on the adoption of prescriptions for personal protection and social distancing. One of the main characteristics of the COVID-19 pandemic in Italy is its highly heterogeneous territorial distribution. As of June 3, 2020, Lombardy's official disease count accounted for 38.2% of all Italian cases, a number that was 128% higher than that expected for a homogeneous territorial distribution (i.e. a 16.7% regional share of the national population). Within Lombardy, official disease counts also indicated high heterogeneity in incidence across provinces. On June 3, 2020, COVID-19 counts per 100,000 inhabitants ranged from 408 cases, registered in the province of Varese, to 1,802 cases in the province of Cremona. A high territorial heterogeneity of the spread was also found in other Italian regions. In the central region of Lazio, the incidence ranged from a minimum of 94 cases per 100,000 inhabitants in the province of Latina to a maximum of 248 cases, registered in the province of Rieti. In Sicily, official disease counts ranged from a minimum of 30 cases per 100,000 inhabitants in the province of Ragusa to a maximum of 258 cases in the province of Enna.
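The two percentages quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the growth rate is defined as new cases over the previous day's cumulative total, and that the "128% higher" figure compares Lombardy's case share with its population share. Both definitions are our reading of the text, not something the paper spells out.

```python
# Figures quoted in the text (Italy, 2020).
new_cases_mar21 = 6_557
cumulative_mar21 = 53_578
lombardy_case_share = 38.2  # % of national cases on June 3
lombardy_pop_share = 16.7   # % of national population

# Assumed definition: new cases relative to the previous day's cumulative total.
previous_total = cumulative_mar21 - new_cases_mar21
growth_rate = 100 * new_cases_mar21 / previous_total

# Assumed definition: relative excess of the case share over the population share.
excess = 100 * (lombardy_case_share / lombardy_pop_share - 1)

print(f"growth rate on March 21: {growth_rate:.1f}%")      # 13.9%
print(f"excess over a homogeneous share: {excess:.1f}%")   # 128.7%
```

With the rounded inputs the excess comes out at about 128.7%, which is consistent with the 128% reported in the text.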
A number of explanations have been suggested in the ongoing debate for the uneven geographical spread of the pandemic. A natural conjecture is that such heterogeneity in disease counts reflects differences in the territorial distribution of its triggers. Demographics (Dowd et al., 2020), health care system characteristics (Black et al., 2020; Gan et al., 2020; Brindle and Gawande, 2020), enrollment in education systems (Chang et al., 2020; Li et al., 2020; Viner et al., 2020; Zhang et al., 2020), transport and mobility specificities (Do et al., 2020; Li et al., 2020; Zheng et al., 2020), climate factors (Bashir et al., 2020; Sajadi et al., 2020; Wang et al., 2020), pollution (Yongjian et al., 2020; Wu et al., 2020), social attitudes, and family ties (Bayer and Kuhn, 2020; Belloc et al., 2020; Borgonovi and Andrieu, 2020) are the general categories into which most of the proposed disease triggers can be grouped. Only a few highly focused studies address the role of economic factors in the spread of the pandemic (Barbieri et al., 2020; Dingel and Neiman, 2020; Qiu et al., 2020; Fogli and Veldkamp, 2020), whereas much more attention is being devoted to the opposite causal nexus, that is, the investigation of the effects of the pandemic on economic activities and conditions (Atkenson, 2020; Baker et al., 2020; Baqaee and Fahri, 2020; Bronka et al., 2020; Decerf et al., 2020; Dosi et al., 2020; Fernandes, 2020; Gregory et al., 2020; Guan et al., 2020; Ludvigson et al., 2020; Palomino et al., 2020). Each of the many explanations has its own rationale. However, in the absence of randomized testing on individuals or of an established diffusion model for the infectious disease, their objective relevance for the prediction of the uneven spread of the epidemic remains questionable.
Empirical analyses that focus on specific triggers in non-experimental environments are unavoidably at risk of strong under-specification biases, unless they rely on strategies able to consider all these factors jointly. Detecting the relevant correlates of the geographical heterogeneity of the epidemic from aggregate data is a difficult task, since potential predictors are many and highly correlated. In this setting, analyses that strictly focus on specific aspects often lead to biased estimates due to the omission of important controls. When the set of correlates is enlarged to reduce the bias, the standard errors of the estimates obtained with conventional estimators (such as OLS) tend to inflate, implying that the statistical relevance of the conditioning sets shrinks toward zero and the model's out-of-sample predictive performance deteriorates. In other words, standard estimation tools are bound to an underspecification/overfitting trade-off. A viable strategy to handle the bias–variance trade-off and minimize the mean squared prediction error is to leave the predictors' set unrestricted (thus taking an agnostic perspective) and use statistical learning to select a parsimonious model. This basically implies the introduction of weights in the OLS estimator in the form of penalties that are able to select a compact model structure with "optimal" out-of-sample predictive properties. Given our empirical setting, we use the elastic net learning algorithm (Zou and Hastie, 2005; Hastie et al., 2009), which provides an estimator belonging to this general strategy. The elastic net estimator combines the properties of ridge regression, which basically mitigates the multicollinearity problem through regularization, and of the least absolute shrinkage and selection operator (LASSO), which further increases the predictive ability of models. This method has many advantages.
First, it allows joint consideration of a very large number of correlated predictors in a unified empirical framework and reduction of the risk of overfitting by selecting only those variables that provide the highest predictive performance. Second, relative to other machine learning algorithms, the elastic net ensures good performances, even when the number of tested predictors is high (in principle even larger than the number of observations) relative to the sample size (Zou and Hastie, 2005). Third, since the elastic net can be conceptualized as a generalization of OLS, it allows a standard interpretation of results. With this strategy, we exploit an extended information set including COVID-19 cases observed at the provincial level and the many potential triggers addressed in the literature. Specifically, we consider a set of 77 province-specific candidate predictors that can be grouped in the following conceptual sub-sets of indicators: i) economic activity and intensity, ii) climate and pollution, iii) socio-demographic, iv) geographical and territorial distance, v) health-care-system-related indicators, vi) public and private mobility, and vii) educational-system-related indicators. As a first robustness check, we include regional dummies to check whether our results are driven by region-specific characteristics, an occurrence that would switch the set of predictors selected by the elastic net estimator. Robustness is also tested with respect to the role of the control identifying the distance from the first outbreak. To the best of our knowledge, this is the first study in which a very large set of the potential triggers of the geographical diffusion of COVID-19 addressed in the literature are jointly considered and analyzed in a unified empirical framework. The results point to a substantial improvement of the model's predictive properties with elastic net estimates. 
The gain is striking, as compared with OLS, and evident with respect to ridge regression and LASSO. The model estimated with the elastic net using the registered cumulative cases on March 21, 2020 identifies few relevant predictors of the geographical distribution of the epidemic among the 77 explanatory variables being considered. Importantly, we find that five out of the 10 selected triggers belong to the economic sub-set. In order of relative importance, productivity (value added per employee), the intensity of firms' international relationships, the general employment rate, and the share of labor employed in manufacturing show a positive correlation with the prevalence of COVID-19 cases, whereas the share of labor in agriculture is selected as a favorable trigger (negative correlation). The highest positive correlation is obtained by the measure of close Euclidean distance from the first outbreak (i.e. within 50 km from the province of Lodi). Health characteristics are shown to affect the geographical spread through a positive correlation with the mortality rate for infectious diseases. Climatic and pollution factors selected as critical triggers include the number of frost days in a year and the average concentration of PM10. Family ties, proxied by the average family size, are shown to have a weak but negative correlation with COVID-19 cases. Re-running the elastic net estimates on June 3, 2020 case data and controlling for cumulative cases registered on March 21, 2020 yields results that cancel three out of the five economic triggers, namely, value added per employee, the share of employment in manufacturing, and the general employment rate. We interpret this result as evidence of the effectiveness of the strong containment measures adopted by the government. This evidence, which confirms the results provided by Flaxman et al. (2020) for a set of 11 European countries, is reinforced by the deletion of the close distance identifier, which was selected as the most important trigger of the pandemic in the pre-containment sample. In the post-containment sample, the registered cumulative cases on March 21, 2020 (which can be conceived as a measure of the attack rate of the disease in the second stage) are found to be the most important trigger of the subsequent spread. Baseline results are robust to the introduction of regional dummies. From a simulation exercise based on estimates obtained over bootstrapped samples, the models selected with the elastic net algorithm are shown to largely outperform those estimated with OLS in terms of out-of-sample properties. Unlike those from OLS, predictions obtained with the models estimated with the elastic net are shown to be stable across simulations and to replicate the real pattern correctly, irrespective of the randomly generated sample being considered. Furthermore, models estimated on five different training samples (in which 20% of provinces are iteratively excluded from the estimation set) are shown to maintain their predictive power also in "unseen" areas, namely those excluded from the training sample. This is a signal of the external validity of the model being selected by the elastic net algorithm, at least for the Italian case. Finally, using information on cumulative COVID-19 cases recorded between September and October 2020, we show that cross-province economic differences are confirmed as key factors to predict the spread of the epidemic in Italy also during the "second wave".
The paper proceeds as follows. Section 2 briefly describes the evolving literature. Section 3 discusses the data used in the analysis and provides some stylized descriptive evidence. Section 4 describes our estimation strategy and discusses its main advantages.
Section 5 presents and discusses the main results, provides information about their robustness, shows the prediction improvement of the model selected by the elastic net algorithm as compared with OLS, and evaluates its external validity. Section 6 provides the conclusion.
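To make the selection idea in this introduction concrete, the following is a minimal sketch of an elastic net fitted by coordinate descent. It is not the authors' implementation. The data are synthetic, the names `elastic_net`, `lam`, and `alpha` are ours, and the penalty values are fixed by hand, whereas the paper selects them by minimizing the out-of-sample prediction error. The objective assumed here is (1/2n)·Σ(y − Xb)² + lam·(alpha·|b|₁ + (1 − alpha)/2·|b|₂²), so that `alpha = 1` gives the LASSO and `alpha = 0` gives ridge regression.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly zero inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, lam, alpha, sweeps=100):
    """Coordinate descent for the elastic net (X columns and y assumed centered)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_ms = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of predictor j with the residual that ignores predictor j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            # L1 part: soft-thresholding (selection); L2 part: extra shrinkage.
            updated = soft_threshold(rho, lam * alpha) / (col_ms[j] + lam * (1 - alpha))
            step = updated - beta[j]
            if step != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * step
                beta[j] = updated
    return beta

# Synthetic example (hypothetical): 20 candidate predictors, only two matter.
random.seed(0)
n, p = 200, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
for j in range(p):  # center each column
    mean_j = sum(row[j] for row in X) / n
    for row in X:
        row[j] -= mean_j
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0, 0.5) for row in X]
mean_y = sum(y) / n
y = [v - mean_y for v in y]

beta = elastic_net(X, y, lam=0.3, alpha=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected predictors:", selected)
```

The soft-thresholding step is what sets irrelevant coefficients exactly to zero, while the ridge term in the denominator stabilizes the estimates when predictors are correlated.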

The current literature

A number of contributions from authors belonging to different disciplines have recently tried to identify the reasons behind the territorial heterogeneity of COVID-19 spreads observed at the global level. Different views are emerging from separate studies focused on specific research questions. Such heterogeneity of views reflects the existence of both an objective puzzle and an investigation difficulty. On the one hand, in the case of Italy, the considerably higher rate of infection in Lombardy and in other northern provinces cannot be explained only by the fact that at the beginning of the epidemic in February (i.e. before the first important Italian outbreak in the province of Lodi was discovered and public authorities implemented stringent containment measures) most infections occurred in the northern provinces. Evidence from registered cases clearly shows that, for any given distance from the first outbreak, there were provinces in the north with a high rate of infection and others that displayed much lower prevalence. On the other hand, a common feature of recent contributions performed with standard tools is that the correlations between the diffusion of COVID-19 and its triggers emerge from investigations that lack a comprehensive treatment of the potential predictors that can reasonably be conjectured. To contextualize our analysis, here we briefly mention only a few of them, approximately covering the domain of the explanations to which interest has been directed. Among possible explanations of the territorial spread of the epidemic, interest has been first focused on demographic factors. From this perspective, large streams of studies have addressed the role of the age structure, sex, and intergenerational interactions for the observed territorial heterogeneity in the diffusion and fatality rate of COVID-19 (Dowd et al., 2020). Other works tried to evaluate the possible association between COVID-19 cases and climatic factors.
Investigations were mostly focused on the role of air temperature and humidity, obtaining mixed results (Bashir et al., 2020; Zhu and Xie, 2020; Sajadi et al., 2020; Wang et al., 2020). Moreover, the infection risk for health care workers in hospitals and their role in the outward transmission of the disease have been the focus of other studies. Black et al. (2020), by noting that nearly 45% of secondary cases could be infected by index cases in a pre-symptomatic stage, and that in fully tested settings asymptomatic cases ranged from 51% to 88%, made a strong case for the strategic role of mass testing of health care workers to prevent propagation within and out of hospitals. From a similar perspective, Gan et al. (2020) addressed the case of the Singapore health system, and Brindle and Gawande (2020) studied the specifics of managing the pandemic risk in surgical systems. All these contributions implicitly conjecture that health care system characteristics and policies are critical for the spread of the pandemic. This possibility, still missing objective evidence, is the subject of harsh debate and judicial investigations in Italy. Some evidence on influenza outbreaks attributed an important role in the transmission of infectious diseases to school and education system arrangements (Jackson et al., 2016; Bin et al., 2018). This stimulated interest in investigations on the role of class attendance and participation in the education system as a trigger of the COVID-19 pandemic. In an early review study focusing on schools, Viner et al. (2020) showed that the evidence on school closures for the containment of the COVID-19 pandemic is weak or mixed. Unless reinforced with other stringent social distancing measures, social contacts outside schools are not less risky than child activities in schools (Chang et al., 2020). However, Zhang et al. (2020), in a model-based analysis focused on China, showed that school closure, by delaying the epidemic spread, can significantly reduce the peak incidence. Li et al. (2020), in a cross-country study on the time-varying effectiveness of a set of containment measures on the COVID-19 replication rate, showed that school closure decreases transmission by 15% after four weeks, while school reopening could increase transmission by 24%. Interest has also been directed at evaluating the role of environmental characteristics, with a specific focus on air pollution. Yongjian et al. (2020), in a strictly focused investigation, found a positive correlation between short-term exposure to air pollution and the number of confirmed COVID-19 cases in China. Their investigation, however, does not take into account other factors that can be simultaneously correlated to the spread of the pandemic and air pollution. Wu et al. (2020) suggested that air pollution might be correlated to higher mortality rates in the United States, even after controlling for some confounding factors. Other studies focused on the correlation between the infection rates and social/family habits. Intergenerational family ties and cohabitation, known to be very high in Italy as compared with other high-income countries (Reher, 1998; Di Giulio and Rosina, 2007; Santarelli and Cottone, 2009), have been evaluated as a possible trigger of the epidemic. From this perspective, Bayer and Kuhn (2020) found a positive correlation between a measure of family vertical integration and the COVID-19 fatality rate, using cross-country data recorded at an early stage of the "first wave" of the pandemic. Belloc et al. (2020) argued that this result might be driven by country-specific factors simultaneously correlated to both intergenerational family ties and the spread of COVID-19.
Such a potential selection bias is obviously high in analyses in which the variability across structurally and institutionally different countries or regions is exploited. In fact, Belloc et al. (2020) showed that the correlation between the COVID-19 fatality rate and a measure of vertical social integration (i.e. the share of adults aged 18–34 living with their parents) turns negative when the sample variability in the diffusion of the epidemic refers to the 20 Italian regions, where the southern regions display stronger family ties and lower case fatality rates. From a similar perspective, Borgonovi and Andrieu (2020) evaluated the role of social capital (an index comprising both social norms and networks, obtained from an assortment of measured human attitudes, activities, and behaviors) in the response of U.S. county communities to COVID-19-related containment policies, measured in terms of changes in mobility patterns. They found that the social capital index is negatively correlated with mobility during the COVID-19 outbreak and is thus associated with a lower risk of contagion. With regard to the economic literature, efforts are mostly focused on the potential effects of the pandemic on the economy. Different aspects have been addressed: global economic performances (Atkenson, 2020; Baqaee and Fahri, 2020; Fernandes, 2020; Gregory et al., 2020; Ludvigson et al., 2020), global supply chains (Guan et al., 2020), economic uncertainty (Baker et al., 2020), and the distribution of income and poverty (Bonacini et al., 2020; Bronka et al., 2020; Decerf et al., 2020; Dosi et al., 2020; Palomino et al., 2020). There are fewer studies that analyze economic factors as potential triggers of the pandemic. To cite some of them, Qiu et al. (2020) showed that the transmission rate of the infection increases with per capita GDP. Their result suggests that economic factors should be further investigated as important predictors of the COVID-19 diffusion.
Fogli and Veldkamp (2020) suggested that areas that are more productive are socially and economically connected with each other and with the rest of the world and are thus more vulnerable to spreads of infectious diseases. Ascani et al. (2020) found an association between the geographical spread of COVID-19 in Italy during the "first wave" and the structure of local economies. However, using the OLS estimator, they are forced to consider a limited number of explanatory variables to avoid multicollinearity issues. From a microeconomic perspective, Barbieri et al. (2020) evaluated the extent to which the probability of being infected varies across different categories of workers. Dingel and Neiman (2020) showed that the probability of infection is related to the possibility of working from home. These studies suggest that the link between a pandemic crisis and economic activity should be addressed considering two directions of causality. On the one hand, areas that are more productive and interconnected are more likely to be affected by high infection rates due to their denser social and economic networks. On the other hand, highly infected areas contribute more to global supply chains and value added, such that global economic growth may be strongly reduced in the event of a pandemic crisis, irrespective of the containment measures adopted to reduce the infection's transmission rate (Guan et al., 2020). This brief and unavoidably incomplete review of the related literature provides a sketch of the contributions to which our work is related. It also helps in forming an idea about the difficulties arising from analyses focused on a few specific predictors of the COVID-19 pandemic in a non-experimental environment.
In the following sections, we propose an analysis that is able to circumvent these problems, with the specific goal of selecting, among a large set of candidate economic and non-economic triggers, those that have the highest predictive power for the heterogeneous diffusion of the COVID-19 pandemic.

Data and descriptive evidence

Our investigation analyzes the geographical diffusion of the COVID-19 pandemic in Italy and identifies its predictors in a unified empirical framework. We collect data on registered COVID-19 cases and on a large set of potential predictors observed at the provincial level from different official data sources. First, we take information on COVID-19 cases provided by the Italian Civil Protection Department (ICPD) on a daily basis since the first cases were identified in the municipality of Codogno at the end of February. Although we are aware that information on registered cases is likely affected by a high degree of measurement error (the number of infected people has been largely underestimated due to the low number of swabs and tests carried out at the beginning of the epidemic), data provided by the ICPD are so far generally recognized as the sole official and controlled source available in Italy. We refer to the number of cumulative COVID-19 cases registered by the ICPD on two different dates: March 21, 2020 and June 3, 2020. The former date is selected to focus on the heterogeneity in the geographical distribution of COVID-19 cases observed before the containment measures implemented by the Italian government had their effects. In considering cumulative cases as of June 3, 2020, we focus on the geographical distribution of COVID-19 cases registered between March 21, 2020 and June 3, 2020, that is, on infection events that occurred when the strongest containment measures were in place. By exploiting this difference in policy implementation, we evaluate whether the tested predictors of the spread of the epidemic vary due to the implementation of the measures. Fig. 1 shows the geographical diffusion of infected people per 100,000 inhabitants registered on March 21, 2020 (left-hand map) and on June 3, 2020 (right-hand map) by grouping the 107 provinces in deciles of the national distribution of COVID-19 cases.
The two maps clearly show that most of Lombardy's provinces have been severely hit by the epidemic and are in the top decile of the national distribution, with 219 to 761 (819 to 1,802) cases per 100,000 inhabitants as of March 21, 2020 (June 3, 2020). As of March 21, 2020, aside from some areas in northern Italy, the two bordering provinces of Rimini and Pesaro–Urbino are the only other provinces along the eastern coast of Italy that belong to the top decile of the distribution. On June 3, 2020, they fall to the 8th and 9th decile, respectively. Most of the provinces in the central or southern part of Italy show a lower degree of infections with 2 to 7 (28 to 51) cases per 100,000 inhabitants on March 21, 2020 (June 3, 2020).
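The decile grouping used for the maps can be sketched as follows. The province names and figures are invented for illustration and are not the ICPD data. The helper names `per_100k` and `decile_groups` are ours, and ties between provinces are broken arbitrarily by sort order, which may differ from the authors' procedure.

```python
def per_100k(cases, population):
    """Cumulative cases per 100,000 inhabitants."""
    return 100_000 * cases / population

def decile_groups(incidence):
    """Map each province to a decile 1..10 by rank (1 = lowest incidence)."""
    ranked = sorted(incidence, key=incidence.get)
    n = len(ranked)
    return {name: (rank * 10) // n + 1 for rank, name in enumerate(ranked)}

# Hypothetical data: 20 made-up provinces, each with (cases, population).
provinces = {f"P{i:02d}": (50 * (i + 1), 250_000) for i in range(20)}
incidence = {name: per_100k(c, pop) for name, (c, pop) in provinces.items()}
groups = decile_groups(incidence)

print(groups["P00"], groups["P19"])  # lowest and highest incidence: 1 10
```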

Fig. 1

Geographical distribution of COVID-19 cumulative cases per 100,000 inhabitants.

However, differences can be detected across provinces in each region or area. For instance, the number of infected people per 100,000 inhabitants is clearly higher in northern Sardinia (province of Sassari) than in the rest of the island, whereas the province of Enna in central Sicily, on June 3, 2020, shows a much higher degree of infection compared with other Sicilian provinces. To provide a comprehensive analysis of the many potential predictors of COVID-19 diffusion across Italian provinces, we use 77 explanatory variables, all observed at the provincial level. Data come from different official sources: the national statistical office (ISTAT), the Ministry of Economy and Finance, the Ministry of Economic Development, and the Ministry of Health. These data can be grouped into seven conceptual sub-sets of indicators: i) economic activity and intensity (19 variables), ii) climate and pollution (9 variables), iii) socio-demographic (9 variables), iv) geographical and territorial distance (8 variables), v) health-care-system-related (13 variables), vi) public and private mobility (12 variables), and vii) educational-system-related indicators (7 variables).
In particular, the set of economic predictors includes labor market characteristics, specifically the employment and unemployment rates, the percentage of employment in agriculture, manufacturing, and services, and the percentage of self-employed workers; economic and distribution characteristics, specifically the value added per employee (productivity), the value added per capita, overall and in agriculture, manufacturing, and services, and the poverty rate; and firm characteristics, specifically the firm size, the firm density, the share of employment in industrial districts, the intensity of firms' export relationships, the amount of goods unloaded in provincial harbors, the density of livestock units, and the density of firms producing animal-derived products. The full set of indicators for each conceptual sub-set is listed and described in detail in the Appendix (Tables A.1 to A.7).

Table A.1

Description of economic activity predictors.

Predictor	Description	Source (year)
Employment rate	Employed people over provincial population	ISTAT (2017)
Unemployment rate	Percentage of active provincial population aged 15-74 who are unemployed	ISTAT (2019)
Percentage of employment in agriculture	Percentage of total employees who work in agriculture activities	ISTAT (2017)
Percentage of employment in manufacturing	Percentage of total employees who work in manufacturing activities	ISTAT (2017)
Percentage of employment in services	Percentage of total employees employed in service activities	ISTAT (2017)
Percentage of self-employed workers	Percentage of provincial workers who are self-employed	ISTAT (2011)
Value added per employee	Value added in euro per employee (productivity)	ISTAT (2017)
Value added per capita	Value added in euro per capita	ISTAT (2017)
Value added per capita - Agriculture	Value added of agriculture in euro per resident	ISTAT (2017)
Value added per capita - Manufacturing	Value added of manufacturing in euro per resident	ISTAT (2017)
Value added per capita - Services	Value added of services in euro per resident	ISTAT (2017)
Poverty rate	Percentage of taxpayers declaring less than 10,000 euro in 2018	Italian Ministry of Economy and Finance (2019)
Firm density	Number of firms per km²	ISTAT (2017)
Firm size	Average number of employees per firm	ISTAT (2017)
Percentage of employment in industrial districts	Percentage of total employees who work in industrial districts	ISTAT (2017)
Intensity of export relationships	Average number of areas of the world (e.g. Europe, BRICs, rest of the world) where firms export their products	ISTAT (2018)
Unloaded goods in the local harbours	Tons of goods unloaded in the local harbours per inhabitant	ISTAT (2018)
Cattle density	Number of livestock units per km²	ISTAT (2010)
Density of firms producing animal-derived products	Number of firms producing goods derived from animal products per km²	Italian Ministry of Health (2018)

Table A.7

Description of education system predictors.

Predictor	Description	Source (year of reference)
Percentage of compulsory school students	Percentage of compulsory school students over total population	ISTAT (2018)
Percentage of high-school graduates	Percentage of high school graduates over total population	ISTAT (2018)
Percentage of people below upper secondary education	Percentage of people below upper secondary education	ISTAT (2011)
Percentage of pre-school students	Percentage of provincial students enrolled in pre-school	ISTAT (2018)
Percentage of students	Percentage of students over provincial population	ISTAT (2018)
Percentage of tertiary graduates	Percentage of provincial population with a tertiary degree	ISTAT (2011)
Percentage of university students	Percentage of provincial students who are enrolled in universities	ISTAT (2018)

In line with many recent studies, we can exemplify the problems that potentially emerge by analyzing simple correlations between registered log cases per 100,000 residents in a province and specific factors that so far have been identified as possible predictors of the geographical spread of COVID-19 infections. For instance, Fig. 2 shows that COVID-19 cases are positively correlated with the average number of frost days in a year (Panel A), with the average concentration of PM10 (particles with diameter ≤ 10 μm), which can serve as a proxy for air pollution in the province (Panel B), and with the mortality rate from infectious diseases (Panel D), while they are negatively correlated with the percentage of families with at least five members (Panel C). The latter result basically replicates that of Belloc et al. (2020), which was obtained at the regional level.

Fig. 2

Estimated correlations between log cumulative cases per 100,000 inhabitants and selected covariates as of March 21, 2020.

It is noteworthy that such analyses are not able to predict accurately the distribution of cases across areas. Single correlations, even if statistically relevant, may be driven by many other confounding factors that should be considered in a comprehensive predictive model. This risk is particularly high when, as in this case, there is no sound theoretical support for variable selection, from which a structural model of the unequal diffusion of COVID-19 across different territories can be derived. Moreover, although using few explanatory variables minimizes the risk of overfitting, we can hardly obtain a good predictive performance of the COVID-19 spread by focusing on single potential predictors. This is why in the following sections we base our model specification on a machine learning algorithm capable of selecting the pandemic's relevant triggers from a joint consideration of many potential explanatory variables.

Estimation strategy

We perform our estimation using the elastic net machine learning algorithm originally proposed by Zou and Hastie (2005). The elastic net algorithm combines the ridge and LASSO regularizations to further increase the predictive ability of a model.4 The role of these penalties is to select a parsimonious model (and/or shrink the size of its coefficients) when the number of explanatory variables is very high (possibly exceeding the number of observations) and when the conditioning set displays near or exact collinearity. Model selection implies conditioning the penalties on an optimal target, that is, the maximization of the model's out-of-sample predictive ability (or, equivalently, the minimization of the out-of-sample mean squared error). In practice, the penalty parameters are obtained from repeated rounds of model validation, known as cross-validation, in which the estimates obtained on estimation sets are used to predict unseen data (i.e. predictive sets). More specifically, the penalties are obtained by maximizing the model's ability to predict data that are not used in the estimates.

Since the elastic net can be conceptualized as a generalization of OLS (i.e. a methodology belonging to the family of regularized least squares), we provide the details of the estimation method in relation to some key properties of the OLS estimator. The OLS estimator is very often exploited to predict a given number of observations of an outcome variable using a vector of predictors. Usually, the outcome variable of interest is predicted by estimating those parameters that make the in-sample sum of squared residuals as small as possible. However, there are two fundamental aspects of an estimator to be considered in the evaluation of its predictive performance: the bias and the variance. The former quantifies the error that is introduced by approximating an unknown data generating process. 
Specifically, if we assume N random samples associated to different data generating processes, we could obtain a range of predictions, one for each randomly drawn sample. The bias is thus a measure of the distance between the expected value of the prediction and the unknown function which captures the true relationship between the outcome variable and predictors. The variance is the variability of a model prediction around its expected value. According to the definition of bias and variance, the prediction performance of a model can be evaluated by looking at its mean squared error (MSE), which is the expected error in predicting a given outcome variable:

$$\mathrm{MSE} = E\left[\left(y - \hat{f}(x)\right)^{2}\right] \qquad (1)$$

In our study, $y$ denotes registered log cases at a given date per 100,000 inhabitants observed at the provincial level, and $\hat{f}(x)$ denotes predicted log cases per 100,000 inhabitants, which is a function of the vector $x$ of provincial predictors included in the model. The statistical learning literature (Hastie et al., 2009) shows that the MSE can be decomposed as follows:

$$E\left[\left(y - \hat{f}(x)\right)^{2}\right] = \mathrm{Var}\left(\hat{f}(x)\right) + \left[\mathrm{Bias}\left(\hat{f}(x)\right)\right]^{2} + \mathrm{Var}(\varepsilon) \qquad (2)$$

where the first term is the variance of the model; the second term is the squared bias; and the last is the noise term (the variance of the irreducible error $\varepsilon$), which cannot be reduced. To attain the best prediction, we should minimize the MSE by reducing both the first and the second components of Equation (2). In finite samples, the well-known trade-off between variance and bias requires a balance between the first two components of Equation (2), such that the lowest attainable MSE is conditional on the specific set of predictors at our disposal. Although OLS is the best linear unbiased estimator, it produces very poor predictions in the following two cases: i) when there is a high degree of correlation between predictors included in the vector $x$; ii) when too many predictors need to be included in the model with respect to the number of observations. 
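
The decomposition in Equation (2) and the first of the two problematic cases (strongly correlated predictors) can be illustrated with a small simulation. The sketch below is not the authors' code: the data generating process, the sample size, the number of predictors, and the ridge penalty of 1.0 are all assumptions chosen for illustration. It repeatedly draws training samples, predicts one fixed unseen observation, and splits the expected squared error into variance, squared bias, and noise for OLS and for a shrinkage estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, sigma = 30, 10, 1.0            # assumed: small sample, several predictors
beta_true = np.linspace(1.0, 0.1, p)  # assumed true coefficients

def draw_X(n_rows):
    # Highly correlated predictors: a common factor plus small idiosyncratic noise.
    common = rng.normal(size=(n_rows, 1))
    return common + 0.1 * rng.normal(size=(n_rows, p))

x0 = draw_X(1)[0]                     # one fixed "unseen" observation
f0 = float(x0 @ beta_true)            # true regression function evaluated at x0

def fit_predict(X, y, ridge_penalty):
    # ridge_penalty = 0 gives OLS; a positive value shrinks the coefficients.
    A = X.T @ X + ridge_penalty * np.eye(p)
    beta_hat = np.linalg.solve(A, X.T @ y)
    return float(x0 @ beta_hat)

def decompose(ridge_penalty, n_sims=2000):
    preds = np.empty(n_sims)
    for s in range(n_sims):
        X = draw_X(n)
        y = X @ beta_true + sigma * rng.normal(size=n)
        preds[s] = fit_predict(X, y, ridge_penalty)
    variance = preds.var()                      # first term of Equation (2)
    bias_sq = (preds.mean() - f0) ** 2          # second term of Equation (2)
    reducible = np.mean((preds - f0) ** 2)      # equals variance + bias_sq
    return variance, bias_sq, reducible

results = {name: decompose(pen) for name, pen in [("OLS", 0.0), ("ridge", 1.0)]}
for name, (variance, bias_sq, reducible) in results.items():
    total = variance + bias_sq + sigma ** 2     # add the irreducible noise term
    print(f"{name:5s} variance={variance:.3f} bias^2={bias_sq:.3f} expected MSE={total:.3f}")
```

In a setting like this one, the shrinkage estimator would be expected to accept some bias in exchange for a lower prediction variance, which is the trade-off the text describes.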
In worst-case scenarios, typical of investigations that lack the support of an established theoretical model, the number of predictors might exceed the number of observations, such that it is not even possible to estimate the parameters of interest using OLS. In both cases, the variance of the prediction can be extremely high, such that the predictive performance of the OLS estimator is very low, even though the bias component is minimized. Therefore, in very complex models, allowing for a small degree of bias is essential to obtain a strong reduction in variance and improve the prediction performance of the model. In our estimation problem, we need to identify the main predictors of the geographical spread of the epidemic considering a large set of potential predictors addressed by the literature, whose coefficients are to be estimated with a small sample size (i.e. the 107 Italian provinces). This is a typical case in which the prediction performance of the OLS estimator is very low, given several issues related to multicollinearity and overfitting. For this reason, we handle the bias-variance trade-off by using the elastic net regularization algorithm originally proposed by Zou and Hastie (2005):

$$\hat{\beta} = \arg\min_{\beta}\left\{\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^{2} + \lambda\left[\alpha\sum_{j=1}^{p}\left|\beta_j\right| + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}\right]\right\} \qquad (3)$$

where $n$ is the number of observations and $p$ is the number of predictors. The elastic net algorithm combines the penalties of the ridge regression and LASSO and mitigates some of the known drawbacks that affect LASSO, which has been shown to saturate when the number of predictors is very high with respect to the sample size, or when there is a high degree of correlation among predictors. In Equation (3), the parameter $\lambda$ controls the relevance of the regularization term, which shrinks the coefficients toward zero to reduce overfitting. When the parameter $\alpha = 0$, the elastic net algorithm collapses to the ridge regression and no predictor is excluded from the model. When $\alpha = 1$, the elastic net algorithm is equivalent to the LASSO, which has the potential ability to set some of the coefficients equal to zero. 
When both $\alpha$ and $\lambda$ are greater than zero, the algorithm can set some coefficients exactly to zero and shrink others to minimize the prediction error. On the contrary, when both $\alpha$ and $\lambda$ equal zero, the elastic net algorithm collapses to the OLS case, such that all predictors are exploited to predict the outcome variable without any shrinkage. Therefore, it is possible to get very different predictive models and estimated coefficients for each combination of values of $\alpha$ and $\lambda$. Among all possible specifications, we select $\alpha$ and $\lambda$ by using k-fold cross-validation to minimize the out-of-sample MSE, and we evaluate the external predictive performance of the model by testing its ability to predict new data that were not used for its estimation (James et al., 2013). K-fold cross-validation is a re-sampling procedure that randomly splits the sample into K subsets (folds). Then, each of the K folds is iteratively defined as the test set, and the K-1 remaining folds are used to estimate the model coefficients. Following Mullainathan and Spiess (2017), we calibrate our algorithm and evaluate its out-of-sample performance in different steps:

i) we randomly divide our data into a training sample (80% of the observations) and a hold-out sample (the remaining 20% of the data);
ii) in the 80% training sample, we use 5-fold cross-validation to select a specific pair among a set of different possible combinations of $\alpha$ and $\lambda$. The selected $\alpha$ and $\lambda$ are the ones that minimize the average MSE computed across the five folds;5
iii) we run the algorithm in the training sample using the ($\alpha$, $\lambda$) combination selected through k-fold validation;
iv) we compute the out-of-sample prediction error in the hold-out sample to test the model's capability to predict “unseen” data.
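
The four calibration steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the coordinate-descent solver and the scaling of the objective (written in the docstring) are assumed, the helper names `elastic_net`, `mse`, and `cv_select` are hypothetical, and the grid is much smaller than the 10 values of α and 50 values of λ reported in Table 1.

```python
import numpy as np

def elastic_net(X, y, lam, alpha, n_sweeps=50):
    """Coordinate descent for the (assumed) objective
    (1/2n)*||y - b0 - X b||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, resid = X - x_mean, y - y_mean           # centring absorbs the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            resid = resid + Xc[:, j] * b[j]      # partial residual without predictor j
            rho = Xc[:, j] @ resid / n
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)  # soft threshold
            b[j] = shrunk / (col_sq[j] + lam * (1.0 - alpha))
            resid = resid - Xc[:, j] * b[j]
    return y_mean - x_mean @ b, b

def mse(X, y, b0, b):
    return float(np.mean((y - b0 - X @ b) ** 2))

def cv_select(X, y, lams, alphas, k=5, seed=0):
    """Return (cv_mse, lam, alpha) minimizing the average MSE over k folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    best = (np.inf, None, None)
    for alpha in alphas:
        for lam in lams:
            errs = []
            for test in folds:
                train = np.setdiff1d(np.arange(len(y)), test)
                b0, b = elastic_net(X[train], y[train], lam, alpha)
                errs.append(mse(X[test], y[test], b0, b))
            if np.mean(errs) < best[0]:
                best = (float(np.mean(errs)), float(lam), float(alpha))
    return best

# Synthetic stand-in data: 107 "provinces", 20 standardized predictors, 3 relevant.
rng = np.random.default_rng(42)
n, p = 107, 20
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2] + 0.5 * rng.normal(size=n)

# i) 80/20 split into training and hold-out samples.
perm = rng.permutation(n)
n_train = round(0.8 * n)
train, hold = perm[:n_train], perm[n_train:]

# ii) 5-fold cross-validation on the training sample (reduced grid, for speed).
cv_mse, lam, alpha = cv_select(X[train], y[train],
                               lams=np.logspace(-2, 0, 8), alphas=[0.1, 0.4, 0.7, 1.0])

# iii) refit on the full training sample with the selected pair.
b0, b = elastic_net(X[train], y[train], lam, alpha)

# iv) prediction error on the hold-out sample.
hold_mse = mse(X[hold], y[hold], b0, b)
print(f"alpha={alpha:.2f} lambda={lam:.3f} selected={int(np.sum(b != 0))} hold-out MSE={hold_mse:.3f}")
```

Keeping the hold-out sample out of the cross-validation loop is the point of the design: the pair (α, λ) is tuned only on the training folds, so the final MSE is computed on data that played no role in either estimation or tuning.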

Results and discussion

Results

In this section we present the results of our analysis by using, as the dependent variable in our regressions, the log of COVID-19 cumulative cases per 100,000 residents in the province, measured at different dates. In our baseline analysis, we refer to positive cases registered until March 21, 2020 to consider the geographical spread of cases across Italian provinces in the first stage of the epidemic (Table 1). Then, we focus on cumulative cases registered between March 22, 2020 and June 3, 2020 to detect whether, and to what extent, predictors change when containment measures are in force (Table 4).

Table 1

Elastic net regression of log cumulative cases per 100,000 inhabitants on March 21, 2020.

	Baseline
Distance from the first outbreak: less than or equal to 50 km	0.496
Value-added per employee	0.244
Intensity of export relationships	0.188
Nr. of frost days in a year	0.161
Mortality from infectious diseases	0.093
PM10	0.072
Employment rate	0.071
Percentage of employment in manufacturing	0.048
Average family members	-0.028
Percentage of employment in agriculture	-0.150
Observations	107
α selected by 5-fold cross-validation	0.333
λ selected by 5-fold cross-validation	0.393
MSE (hold-out sample)	0.527
Nr. of α values tested	10
Nr. of λ values tested	50

Constant terms and unselected predictors are not shown. The (α, λ) combination has been selected in the 80% training sample using 5-fold cross-validation. The out-of-sample predictive performance is tested in the 20% remaining observations.

Table 4

Elastic net regression of log cumulative cases per 100,000 inhabitants—pre and post containment measures.

	March 21 (Model 1)	March 22-June 3 (Model 2)	March 22-June 3 (Model 3)
Distance from the first outbreak: less than or equal to 50 km	0.496	0.149	0.000
Distance from the first outbreak: between 51 km and 100 km	0.000	0.141	0.059
Value-added per employee	0.244	0.125	0.000
Intensity of export relationships	0.188	0.107	0.091
Mean altitude of the province	0.000	0.085	0.056
Frost days in a year	0.161	0.111	0.000
Mortality from infectious diseases	0.093	0.081	0.000
Percentage of hospital beds for the elderly	0.000	0.048	0.000
Municipality density	0.000	0.046	0.066
Mortality rate from pneumonia	0.000	0.000	0.052
Average hospital size	0.000	0.040	0.000
Foggy days in a year	0.000	0.038	0.000
PM10	0.072	0.036	0.000
NO2	0.000	0.019	0.000
Mortality rate	0.000	0.000	0.010
Employment rate	0.071	0.000	0.000
Percentage of employment in manufacturing	0.048	0.000	0.000
Hot days in a year	0.000	-0.061	0.000
Percentage of families with 5 or more members	0.000	-0.065	-0.077
Average family members	-0.028	-0.093	-0.064
Percentage of employment in agriculture	-0.150	-0.088	-0.028
Hours of continuity health care services per capita	0.000	-0.111	0.000
Log cases per 100,000 people on March 21	Not included	Not included	0.493
Observations	107	107	107

Constant terms and predictors that are not selected in any of the three models are not shown.

To compare the magnitude of the estimated coefficients and identify which predictors are more relevant for our analysis, we standardize all explanatory variables so that we can interpret each estimated coefficient multiplied by 100 as the percentage increase of cases per 100,000 inhabitants for a one standard deviation increase in the predictor. Table 1 shows our baseline results obtained with the elastic net calibrated using 5-fold cross-validation.6 In this case, among all 77 explanatory variables considered in the analysis, the algorithm selects only 10 predictors, presented in descending order of importance. The coefficients of the other 67 explanatory variables are penalized to zero, denoting that they are not relevant predictors of the epidemic, and are not shown in Table 1. On the contrary, although the coefficients of the 10 variables selected by the algorithm are reduced in size to minimize the risk of overfitting, they are not set equal to zero. The results suggest that for a given province of the initial outbreak, all provinces within 50 km are more likely to have a high infection rate (49.6% more cases per 100,000 people) about 1 month after the beginning of the epidemic. Nevertheless, none of the other distance dummies is selected as a potential predictor, suggesting that for distances above 50 km, there are other diffusion factors not related to the geographical distance. Among the other selected variables, economic factors are shown to be the main predictors of the epidemic spread. 
In particular, for a one standard deviation increase, the number of registered cases per 100,000 residents increases with value-added per employee (by 24.4%), intensity of firms' export relationships (18.8%), overall employment rate (7.1%), and the percentage of employment in manufacturing (4.8%), and decreases with the percentage of employment in agriculture (-15.0%). Our results suggest that provinces that are more productive are more likely to be severely hit by the epidemic. Additionally, more intensive international relationships, a higher employment rate, and a large share of employees in the manufacturing industry are triggers of the initial COVID-19 geographical spread. It should be noted that manufacturing is the most strongly integrated sector in the global economy, as it is involved in global value chains and produces goods that make up the majority of exports in OECD countries (De Backer et al., 2015). Outside economic triggers and the Euclidean distance, the most relevant explanatory variable identified as a predictor of the COVID-19 spread is the average number of days with temperature below 3°C (+16.1% for a one standard deviation increase). The rate of positive cases per 100,000 people is also higher where mortality from infectious diseases and PM10 concentration are higher (+9.3% and +7.2% for a one standard deviation increase, respectively). These results suggest that provinces with high COVID-19 registered cases are also those where the transmission rate of infections is generally higher and, as suggested by previous works (Wu et al., 2020), where air pollution is higher. Finally, the other variable that has been selected by the algorithm (with a lower estimated coefficient) is the average size of the family (-2.8% for a one standard deviation increase). 
The latter result could indicate that stricter family ties reduce the exposure of family members to the spread of infections through external social and professional networks.7

In the OLS case (last three columns of Table A.8 in the Appendix), all 77 coefficients, aside from the omitted categories, are estimated. However, given the large number of predictors with respect to the sample size and the high degree of collinearity among explanatory variables, all coefficients are imprecisely estimated and most of them are not statistically significant. Moreover, the out-of-sample MSE, which is the measure of the out-of-sample prediction error of the model, is considerably higher than the in-sample MSE (5.763 vs. 0.045, respectively). This result suggests that, even if OLS performs very well within the specific sample we are using, it performs very poorly in predicting external “unseen” observations. Therefore, given the high degree of overfitting and multicollinearity, we can neither identify the main predictors of the geographical spread of COVID-19 across Italian provinces nor obtain a good out-of-sample predictive performance of our model using the standard OLS estimator.
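
The percentage readings quoted for the elastic net coefficients follow the usual log-point convention: because the outcome is in logs and the predictors are standardized, a coefficient β is read as roughly 100·β percent per one standard deviation. The short sketch below, which is an illustration and not part of the original analysis, compares that reading with the exact conversion 100·(exp(β) − 1) for the baseline coefficients of Table 1. The function names are hypothetical.

```python
import math

def log_point_pct(beta):
    # Reading used in the text: coefficient multiplied by 100.
    return 100.0 * beta

def exact_pct(beta):
    # Exact percentage change in the level of the outcome implied by a
    # change of `beta` in its logarithm.
    return 100.0 * (math.exp(beta) - 1.0)

# Baseline elastic net coefficients as reported in Table 1.
coefficients = {
    "Distance from the first outbreak: <= 50 km": 0.496,
    "Value-added per employee": 0.244,
    "Intensity of export relationships": 0.188,
    "Nr. of frost days in a year": 0.161,
    "Mortality from infectious diseases": 0.093,
    "PM10": 0.072,
    "Employment rate": 0.071,
    "Percentage of employment in manufacturing": 0.048,
    "Average family members": -0.028,
    "Percentage of employment in agriculture": -0.150,
}

for name, beta in coefficients.items():
    print(f"{name:45s} log-point {log_point_pct(beta):6.1f}%   exact {exact_pct(beta):6.1f}%")
```

The two readings nearly coincide for small coefficients and drift apart as the coefficient grows in absolute value, so the ranking of predictors is unaffected while the largest magnitudes are somewhat understated by the log-point reading.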

Table A.8

Regression of log cumulative cases per 100,000 inhabitants on March 21, 2020: Full results.

	Elastic net	OLS
	Coefficient	Coefficient	S.E.	P-value
Distance from the first outbreak: less than or equal to 50 km	0.496	2.066	1.212	0.098
Value-added per employee	0.244	0.380	0.985	0.702
Intensity of export relationships	0.188	-0.121	0.255	0.637
Nr. of frost days in a year	0.161	0.576	0.279	0.047
Mortality from infectious diseases	0.093	0.153	0.154	0.328
Concentration of PM10	0.072	0.244	0.305	0.430
Employment rate	0.071	-0.005	1.403	0.997
Percentage of employment in manufacturing	0.048	Omitted category
Percentage of employment in industrial districts	0.000	0.068	0.183	0.710
Percentage of employment in services	0.000	0.234	0.433	0.593
Percentage of workers who are self-employed	0.000	-0.011	0.139	0.937
Hospital beds per capita	0.000	-0.273	0.259	0.299
Percentage of hospital beds in private clinics	0.000	-0.147	0.193	0.450
Percentage of hospital beds for the elderly	0.000	-0.012	0.122	0.920
Average firm size	0.000	0.179	0.287	0.537
Average hospital size	0.000	0.207	0.193	0.293
Population density	0.000	-0.122	1.835	0.947
Municipality density	0.000	-0.101	0.270	0.711
Hospital density	0.000	0.110	1.059	0.918
Firm density	0.000	-0.099	1.177	0.933
Percentage of tertiary graduates	0.000	0.394	0.306	0.207
Percentage of high-school graduates	0.000	-0.300	0.235	0.211
Percentage of people below upper secondary education	0.000	Omitted category
Nr. of flights passengers per capita	0.000	0.020	0.146	0.894
Percentage of passengers from international locations	0.000	0.042	0.154	0.786
Nr. of public transport passengers per capita	0.000	-0.222	0.320	0.493
Nr. of public transport seats per km/resident	0.000	0.244	0.341	0.478
Car density	0.000	-0.142	0.208	0.498
Mortality rate for respiratory diseases	0.000	0.052	0.236	0.827
Mortality rate for pneumonia	0.000	-0.341	0.274	0.223
Mortality rate	0.000	-0.333	0.403	0.415
Percentage of students	0.000	-0.300	0.256	0.251
Percentage of university students	0.000	Omitted category
Percentage of compulsory school students	0.000	-0.237	0.273	0.391
Percentage of pre-school students	0.000	0.078	0.273	0.778
Value-added per capita	0.000	-3.255	2.991	0.284
Poverty rate	0.000	0.546	0.849	0.525
Percentage of families with 5 or more members	0.000	0.035	0.468	0.941
Percentage of males	0.000	-0.168	0.158	0.295
Average age of the population	0.000	0.293	1.199	0.808
Percentage of immigrants	0.000	0.572	0.361	0.123
Percentage of people aged 65 or more	0.000	-0.007	1.131	0.995
Concentration of PM2.5	0.000	-0.395	0.307	0.206
Nr. of foggy days in a year	0.000	0.036	0.194	0.853
Concentration of NO2	0.000	0.112	0.230	0.628
Nr. of windy days in a year	0.000	0.007	0.195	0.972
Nr. of sunny days in a year	0.000	0.236	0.417	0.576
Nr. of hot days in a year	0.000	-0.230	0.175	0.198
Nr. of rainy days in a year	0.000	0.299	0.270	0.276
Percentage of people who live close to a train station	0.000	-0.167	0.171	0.337
Mean altitude of the province	0.000	0.700	0.273	0.015
Altitude of the province capital	0.000	-0.545	0.230	0.024
Unemployment rate	0.000	0.396	0.231	0.096
Commuters as a share of the population	0.000	0.282	0.272	0.307
Percentage of commuters outside their municipality of residence	0.000	-0.075	0.290	0.797
Unloaded goods in the local harbours per capita	0.000	0.180	0.170	0.299
Percentage of commuters who use a private vehicle	0.000	0.077	0.179	0.672
Agriculture value-added per capita	0.000	0.264	0.188	0.170
Services value-added per capita	0.000	1.676	1.150	0.154
Manufacturing value-added per capita	0.000	1.889	0.979	0.062
Yearly ship passengers arriving in the local harbours over total population	0.000	0.055	0.125	0.661
Yearly registered visitors in accommodation facilities as a percentage of the population	0.000	0.163	0.212	0.448
People that actually live in the province as a percentage of residents	0.000	-0.088	0.164	0.597
Percentage of people who live close to the sea	0.000	-0.217	0.135	0.116
Cattle density	0.000	0.138	0.106	0.204
Firm density (derived from animal products)	0.000	-0.252	0.273	0.362
General practitioner per capita	0.000	0.187	0.294	0.529
Hours of continuity health care services per capita	0.000	-0.151	0.255	0.558
Cases handled by the medical homecare as a share of the population	0.000	-0.056	0.145	0.700
Clinic density	0.000	0.362	0.306	0.246
Degree of provincial interconnection	0.000	0.185	0.288	0.525
Distance from the first outbreak: between 51 km and 100 km	0.000	0.924	1.101	0.407
Distance from the first outbreak: between 101 km and 300 km	0.000	0.694	0.903	0.448
Distance from the first outbreak: between 301 km and 500 km	0.000	0.542	0.714	0.453
Distance from the first outbreak: more than 500 km	0.000	Omitted category
Average family members	-0.028	-0.104	0.562	0.855
Percentage of employment in agriculture	-0.150	-0.168	0.313	0.596
MSE (hold-out-sample)	0.578	5.763
MSE (training sample)	0.463	0.045

One common problem that arises in machine learning is the lack of a measure of dispersion of the estimated coefficients to evaluate their precision. This limitation cannot be easily overcome with standard methods since the theoretical distribution of the estimator is unknown. Following Hastie et al. (2015), we circumvent this drawback in post-selection inference by using a bootstrap re-sampling method to approximate the data-specific distribution of the estimated coefficients. Based on this bootstrapped distribution, we evaluate how often each of the 77 coefficients is estimated to be different from zero. Specifically, using the elastic net algorithm properly calibrated through 5-fold cross-validation, we take 200 bootstrap replications of the 80% training sample to calculate how many times a given predictor exhibits a non-zero coefficient. Fig. A.2 in the Appendix shows that, although all potential predictors are selected at least once across the 200 bootstrapped samples, six relevant predictors selected by elastic net (i.e. value-added per employee, average frost days in a year, the dummy that identifies provinces located within 50 km from Lodi, mortality from infectious diseases, the percentage of employment in agriculture, and PM10) have a non-zero coefficient in more than 95% of the 200 replications, while the probability of selecting other predictors is generally lower.
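
The bootstrap selection frequency described above can be sketched as follows. This is an illustrative sketch rather than the authors' procedure: the data are synthetic, the coordinate-descent solver and the helper names are hypothetical, and the penalty pair is held fixed across replications (the text does not state whether the pair is re-tuned in each replication, so this is an assumption).

```python
import numpy as np

def elastic_net_coefs(X, y, lam, alpha, n_sweeps=50):
    """Coordinate descent for the (assumed) objective
    (1/2n)*||y - b0 - X b||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2).
    Returns the slope coefficients only."""
    n, p = X.shape
    Xc, resid = X - X.mean(axis=0), y - y.mean()
    col_sq = (Xc ** 2).sum(axis=0) / n
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            resid = resid + Xc[:, j] * b[j]
            rho = Xc[:, j] @ resid / n
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)
            b[j] = shrunk / (col_sq[j] + lam * (1.0 - alpha))
            resid = resid - Xc[:, j] * b[j]
    return b

def selection_frequency(X, y, lam, alpha, n_boot=200, seed=0):
    """Share of bootstrap replications in which each coefficient is non-zero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        rows = rng.integers(0, n, size=n)        # resample rows with replacement
        counts += (elastic_net_coefs(X[rows], y[rows], lam, alpha) != 0)
    return counts / n_boot

# Synthetic training sample: 86 observations, 15 predictors, 2 truly relevant.
rng = np.random.default_rng(3)
X = rng.normal(size=(86, 15))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * rng.normal(size=86)

freq = selection_frequency(X, y, lam=0.2, alpha=0.5)   # assumed, fixed penalty pair
for j, f in enumerate(freq):
    flag = "  <- selected in more than 95% of replications" if f > 0.95 else ""
    print(f"predictor {j:2d}: {f:.2f}{flag}")
```

A predictor that is picked up only occasionally across replications is sensitive to the particular sample drawn, which is why the text singles out the predictors that clear the 95% threshold.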

Fig. A.2

Post-selection inference.

Robustness checks

As a first robustness check, we estimate a different model that also includes regional dummies to verify whether our results are driven by region-specific characteristics that modify the set of predictors selected by the elastic net algorithm. Since disease testing policies are managed at the regional level, these controls also capture potential differences in the testing ability and in the degree of measurement error in the registered infections.8 The results presented in Table 2, which are based on March 21, 2020 data, clearly show that both the selection of the main predictors of COVID-19 spread and all estimated coefficients are robust to the inclusion of the regional dummies and comparable to the baseline model. Moreover, none of the regional dummies is selected by the elastic net algorithm.

Table 2

Elastic net regression of log cumulative cases per 100,000 inhabitants on March 21, 2020: sensitivity to the inclusion of regional dummies.

	Baseline	Including regional dummies
Distance from the first outbreak: less than or equal to 50 km	0.496	0.495
Value-added per employee	0.244	0.244
Intensity of export relationships	0.188	0.188
Frost days in a year	0.161	0.160
Mortality from infectious diseases	0.093	0.092
PM10	0.072	0.072
Employment rate	0.071	0.071
Percentage of employment in manufacturing	0.048	0.048
Average family members	-0.028	-0.028
Percentage of employment in agriculture	-0.150	-0.150
Observations	107	107

Constant terms and predictors that are not selected in any of the two models are not shown.

In a second robustness check, we evaluate whether the Euclidean distance control is key for the characterization of the other predictors of the spread. By excluding this explanatory variable, we basically take the perspective of a pandemic event whose triggers are unconditional with respect to the geographical detection of the first registered outbreak. Table 3 shows that baseline results are confirmed even in the absence of the distance controls. The only additional variable selected by the elastic net algorithm is the number of foggy days in a year, and the size of the coefficients of the other predictors is basically unaffected.

Table 3

Elastic net regression of log cumulative cases per 100,000 inhabitants on March 21, 2020, excluding predictors related to the geographical location of the first outbreak.

	Baseline	Excluding distance
Distance from the first outbreak: less than or equal to 50 km	0.496	Not included
Value-added per employee	0.244	0.262
Intensity of export relationships	0.188	0.188
Frost days in a year	0.161	0.162
Mortality from infectious diseases	0.093	0.082
PM10	0.072	0.083
Employment rate	0.071	0.057
Percentage of employment in manufacturing	0.048	0.034
Foggy days in a year	0.000	0.033
Average family members	-0.028	-0.022
Percentage of employment in agriculture	-0.150	-0.145
Observations	107	107

Constant terms and predictors that are not selected in any of the two models are not shown.


Exploring the transmission channels of the containment measures

Table 4 presents the results of three alternative models. In the first column we summarize the results of the baseline model estimated by using the log cases of COVID-19 per 100,000 residents registered until March 21, 2020. In the second column, we evaluate how predictors change by considering cases registered between March 22, 2020 and June 3, 2020. Finally, in the last column we re-estimate the second model by including the log cases registered until March 21, 2020 in the conditioning set. This additional control is useful for eliminating the effect of the first stage of the epidemic on the transmission of infections that occurred after March 21 and, thus, on the selection of the relevant predictors of cases that occurred between March 22, 2020 and June 3, 2020.

The results shown in Table 4 suggest that the containment measures and the lockdown of economic activities might have reduced the transmission rate of the epidemic mostly by reducing the relative importance of economic factors. Specifically, moving from Model 1 to Model 3, all economic predictors, except for the percentage of workers in agriculture and the intensity of export relationships (which show a coefficient closer to zero in the updated estimate), are no longer included among the relevant predictors of the epidemic. The same result holds for the dummy that identifies provinces within 50 km from the province of the initial outbreak and for previously selected climate factors (i.e. the number of frost days). On the contrary, in the second stage of the epidemic some additional predictors unrelated to economic factors (i.e. percentage of families with 5 or more members, average family size, municipality density, mortality from pneumonia, mortality rate and mean altitude of the province) become relevant.9

Predictive performance

In this section we evaluate the predictive performance of our model by adopting different methodological perspectives. We first compare the predictive performance of elastic net with those obtainable with LASSO, ridge regression, and OLS by training each estimator in the 80% training sample and testing the corresponding out-of-sample performance in the 20% hold-out sample. Table 5 shows that elastic net outperforms LASSO, ridge regression, and OLS, providing the lowest out-of-sample MSE (even if OLS, as expected, exhibits the best in-sample predictive performance). It is relevant to note that the out-of-sample prediction error is also very imprecisely estimated using OLS. Specifically, the standard error of the MSE calculated using 200 bootstrapped replications of the hold-out sample is more than 14 times higher in the OLS case than in the case of elastic net.

Table 5

Predictive out-of-sample and in-sample performance of different estimators.

	Estimated MSE
	Hold-out sample	Training Sample
Elastic net	0.527	0.463
	(0.147)
Lasso	0.580	0.459
	(0.168)
Ridge regression	0.617	0.352
	(0.187)
OLS	5.763	0.045
	(2.159)
Observations	21	86

Bootstrapped standard errors (200 replications) in parentheses.
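
The bootstrapped standard errors reported in Table 5 can be approximated by resampling the hold-out observations with replacement and recomputing the MSE in each replication. The sketch below shows the idea with the Python standard library only. The squared errors are hypothetical placeholder values, not the paper's data, and the function name is an assumption.

```python
import random
import statistics

def bootstrap_se_of_mse(squared_errors, n_boot=200, seed=0):
    """Standard deviation of the MSE across bootstrap resamples of the
    hold-out observations (each resample has the original size)."""
    rng = random.Random(seed)
    n = len(squared_errors)
    boot_mses = []
    for _ in range(n_boot):
        sample = [squared_errors[rng.randrange(n)] for _ in range(n)]
        boot_mses.append(sum(sample) / n)
    return statistics.stdev(boot_mses)

# Hypothetical squared prediction errors for 21 hold-out provinces.
gen = random.Random(1)
squared_errors = [gen.gauss(0.0, 0.7) ** 2 for _ in range(21)]

mse = sum(squared_errors) / len(squared_errors)
se = bootstrap_se_of_mse(squared_errors)
print(f"hold-out MSE = {mse:.3f} (bootstrapped s.e. = {se:.3f})")
```

With only 21 hold-out observations, a few large errors can dominate the resampled MSE, which is one way to read the much larger standard error reported for OLS.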

We then provide graphical comparisons of the predictive out-of-sample performance of the elastic net algorithm and of the OLS estimator using two alternative strategies. With the first, we provide an additional graphical intuition of the extent to which the elastic net (and other regularization methods) is able to reduce the variance component of the MSE. Specifically, we generate three bootstrapped realizations of the training sample to calibrate and estimate our model. Then, we exploit the three sets of estimated coefficients to predict COVID-19 cases per 100,000 residents registered as of March 21, 2020 in the hold-out sample. Using the OLS estimator, although the in-sample MSE is very low (see Table 5), the predicted COVID-19 cases in the hold-out sample vary substantially across the three bootstrapped realizations of the training sample (Fig. A.3 in the Appendix). Moreover, in the OLS case there are some provinces that are predicted to be in the upper decile of the geographical COVID-19 distribution that instead belong to the lowest deciles of the actual distribution. Finally, we find that the OLS estimator highly overestimates COVID-19 cases per 100,000 inhabitants in the top decile and predicts many provinces to have zero COVID-19 cases.

Fig. A.3

Graphical illustration of the variability of the out-of-sample performance of OLS.

Using elastic net and the same three bootstrapped realizations of the training sample to calibrate and estimate the coefficients, we find that the predictive performance in the hold-out sample is very stable and generally close to the actual number of registered cases (Fig. A.4 in the Appendix). Moreover, even if for some provinces the algorithm mispredicts the actual decile of the COVID-19 distribution, there are no cases in which a province in the upper deciles is predicted to be in the lower deciles or vice versa. Besides accurately predicting the geographical pattern of the spread of the epidemic, the elastic net algorithm predicts the minimum and maximum values of each decile with minimal errors.

Fig. A.4

Graphical illustration of the variability of the out-of-sample performance of elastic net.
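The decile comparison discussed for Fig. A.3 and Fig. A.4 can be sketched as follows. This is only an illustration under the assumption that provinces are assigned to deciles by their rank in the actual and in the predicted distribution. The helper names and the toy values are hypothetical.

```python
def decile_ranks(values):
    """Map each value to a decile from 1 to 10 by its rank position."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, i in enumerate(order):
        ranks[i] = position * 10 // n + 1
    return ranks


def decile_gaps(actual, predicted):
    """Absolute distance, in deciles, between actual and predicted position."""
    return [abs(a - p) for a, p in zip(decile_ranks(actual), decile_ranks(predicted))]


# Toy example with 20 areas (made-up values, not the paper's data)
actual = [float(i) for i in range(1, 21)]

# Predictions that only swap neighbouring areas: at most one decile off
close = list(actual)
for i in range(1, 19, 2):
    close[i], close[i + 1] = close[i + 1], close[i]

# Predictions that invert the ranking: top areas land in the bottom decile
inverted = list(reversed(actual))

max_gap_close = max(decile_gaps(actual, close))
max_gap_inverted = max(decile_gaps(actual, inverted))
```

A small maximum gap corresponds to the elastic net pattern described in the text, whereas a gap of nine deciles corresponds to a province in the upper decile being predicted in the lowest one.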

The second evaluation methodology of the predictive performance of elastic net and OLS relies on randomly dividing our data into five equally sized hold-out samples. These provincial subgroups are iteratively used to predict COVID-19 cases per 100,000 inhabitants, using each corresponding 80% training sample to estimate the coefficients.¹⁰ This strategy helps to further illustrate the extent to which the elastic net algorithm, as compared to the standard OLS estimator, dramatically improves prediction in "unseen" areas. Thus, it could be conceived as a tool capable of predicting the possible geographical diffusion of the epidemic at a given point in time. The geographical distribution of registered cases is accurately predicted by the elastic net algorithm, whereas the OLS estimator performs poorly, failing to identify the true decile for many provinces (Fig. 3). Specifically, OLS largely overestimates cases in provinces in the top decile of the distribution, reaching an implausibly high maximum of 9,000 predicted cases per 100,000. Moreover, OLS predicts some provinces in the lowest decile to have zero registered cases when, as of March 21, 2020, no province had registered zero infected people per 100,000 inhabitants (compare Fig. 1 and Fig. 3).

Fig. 3

Graphical illustration of the predictive out-of-sample performances of elastic net and OLS.
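The rotation of hold-out samples behind Fig. 3 can be sketched as below. The sketch assumes a simple random partition of the 107 provinces into five groups, so the groups can only be approximately equal in size. The `fit` and `predict` callbacks stand in for any estimator, and all names and toy outcomes are hypothetical.

```python
import random


def make_folds(units, k=5, seed=0):
    """Randomly split the units into k groups of (almost) equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def predict_unseen(units, outcome, fit, predict, k=5, seed=0):
    """Predict every unit with a model trained only on the other k-1 groups."""
    predictions = {}
    for held_out in make_folds(units, k, seed):
        held = set(held_out)
        training = [u for u in units if u not in held]
        model = fit(training, outcome)
        for u in held_out:
            predictions[u] = predict(model, u)
    return predictions


# Toy setting: 107 areas with made-up outcomes (not the paper's data)
provinces = list(range(107))
outcome = {u: float(u % 7) for u in provinces}


def fit_mean(training, outcome):
    """Placeholder estimator: the training-sample mean of the outcome."""
    return sum(outcome[u] for u in training) / len(training)


def predict_mean(model, unit):
    """Placeholder prediction: the same fitted mean for every unit."""
    return model


folds = make_folds(provinces)
predictions = predict_unseen(provinces, outcome, fit_mean, predict_mean)
```

Every province receives exactly one prediction, always from a model that never saw it, which is the sense in which the areas are "unseen".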

Characterizing the predictors of the “second wave” of the COVID-19 epidemic: do economic factors still matter?

As of October 2020, Europe has been severely hit by a “second wave” of the COVID-19 epidemic. The strong containment measures adopted in Italy during the "first wave" prevented the spread of the contagion in the central and southern provinces. Therefore, even though in the baseline analysis we control for regional dummies and the Euclidean distance from the first outbreak, some of the results obtained might still be related to provincial characteristics that are specific to the location of the first COVID-19 outbreak. However, after containment measures were relaxed in June, intense tourism movements during the summer caused a reshuffling of the provincial distribution of the COVID-19 contagion. As a result, the "second wave" of the COVID-19 epidemic in Italy can no longer be related to the geographical location of any initial outbreak. For this reason, we update our analysis by considering cumulative provincial cases per 100,000 inhabitants recorded between September 1 and October 30, 2020 and the same conditioning set used in the baseline analysis. This allows us to verify if, and to what extent, the relevant predictors selected on data from March 21, 2020 are confirmed once the spread of the COVID-19 epidemic is unrelated to any initial outbreak and to an infection coming from abroad. The results summarized in Table 6 show that, once again, many relevant economic variables are selected by the elastic net algorithm. Specifically, richer and more productive provinces are more likely to experience higher infection rates, while rural areas and provinces characterized by high unemployment rates are generally less affected by the COVID-19 epidemic. It is noteworthy that, once COVID-19 outbreaks are spread throughout the country, the variables capturing the degree of international economic connections (i.e. the share of workers in manufacturing and the intensity of export relationships), which are selected as relevant during the "first wave" of the epidemic in March, are no longer selected by elastic net from the prediction set.

Table 6

Elastic net regression of log cumulative cases per 100,000 inhabitants between September 1, 2020 and October 30, 2020.

Predictor	Coefficient
Percentage of the population that lives close to a train station	0.049
Mean altitude of the province	0.046
People that actually live in the province as a percentage of residents	0.028
Firm density	0.028
Population density	0.024
Value-added per employee	0.019
Mortality from pneumonia	0.013
Windy days in a year	-0.012
Daily hours of sunshine	-0.041
Hours of continuity health care services per capita	-0.043
Unemployment rate	-0.051
Poverty rate	-0.060
Percentage of employment in agriculture	-0.103
Observations	107
α selected by 5-fold cross-validation	0.111
λ selected by 5-fold cross-validation	0.502
MSE (hold-out sample)	0.201
Nr. of α values tested	10
Nr. of λ values tested	50

Constant terms and unselected predictors are not shown. The (α, λ) combination has been selected in the 80% training sample using 5-fold cross-validation. The out-of-sample predictive performance is tested on the remaining 20% of observations.

Additional triggers of the epidemic (i.e. the share of population that lives close to a train station, population density, and the number of people that actually live in the province as a share of the residents) emerge in the “second wave”. We conjecture that these additional triggers remained hidden in our baseline analysis because the containment measures adopted in early March prevented the spread of the epidemic in some of the most populated Italian cities. As a further result, the number of hours of continuity health care services (Guardia medica) per capita, a measure of the degree of proximity of the Italian health system, is selected as a relevant predictor negatively correlated with the number of infected people per 100,000 inhabitants. Thus, the response to the spread of COVID-19 is probably associated with the capacity of the health system to provide proper territorial medical care. Finally, the geographical spread of COVID-19 cases per 100,000 inhabitants is confirmed to be strongly associated with climate and geographic factors such as daily hours of sunshine, the annual number of windy days, and the mean altitude of the province (Table 6). This result is consistent with the COVID-19 epidemic being subject to seasonality.
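To illustrate why some predictors are "unselected" in Table 6, the following minimal sketch fits an elastic net by coordinate descent. It assumes one common parametrization, (1/2n)·RSS + λ[α·Σ|β| + (1−α)/2·Σβ²], without an intercept. This may differ from the definition and software used in the paper, and the function names and the toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink a value toward zero and set it to zero inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, sweeps=50):
    """Coordinate descent for (1/2n)*RSS + lam*(alpha*L1 + (1-alpha)/2*L2).

    Assumed parametrization, no intercept; X is a list of rows.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of column j with the residual that excludes coefficient j
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            z = sum(c * c for c in col) / n
            new = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for r, c in zip(residual, col)]
                beta[j] = new
    return beta


# Toy orthogonal design (made-up data): y = 2*x1 + 0.1*x2
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]

lasso_like = elastic_net(X, y, lam=0.5, alpha=1.0)   # pure L1 penalty
mixed = elastic_net(X, y, lam=0.5, alpha=0.5)        # L1 and L2 combined
unpenalized = elastic_net(X, y, lam=0.0, alpha=0.5)  # reduces to least squares
```

In this toy case the weak second predictor is set exactly to zero as soon as a penalty is applied, while the strong first predictor is only shrunk. In the paper's setting, the values of α and λ would be chosen on a grid (10 and 50 values, respectively, in Table 6) by 5-fold cross-validation.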

Concluding remarks

The analysis we propose here is motivated by the observation of a highly heterogeneous spread of COVID-19 across Italian geographical areas. Such heterogeneity also seems to characterize the pandemic experience of other countries preceding and following the Italian case. The intensity and specificity of the uneven distribution of the spread in Italy go far beyond what can be explained by distance and geographical interrelation alone. This signals that other triggers, beyond those that characterize the standard transmission mechanics from index to secondary cases, are at work.

A number of explanations have been proposed in the increasingly rich current literature. Studies are addressing the predictive ability of variables belonging to quite different conceptual clusters. We noted that each potential trigger proposed by the literature has its own rationale, as each one basically captures a different aspect of the human relationships emerging in a connected territorial, economic, and social environment. However, these studies are inherently focused on a specific aspect of the story, thus lacking a central requirement of analyses oriented at maximizing the out-of-sample predictive abilities of a model (i.e. its instrumental validity). Since a central feature of our analysis is that it considers economic and non-economic factors in a unified empirical framework, we adopted an empirical strategy, based on statistical learning, which is able to select, among a large set of potential triggers, those that maximize the out-of-sample predictive power of the selected model. From this perspective, we showed that the elastic net estimator clearly outperforms OLS and alternative regularization methods such as ridge regression and LASSO. The results point to a small subset of relevant triggers among the 77 included in the analysis. Within this subset, economic factors are shown to play an important role.

This result signals a possible trade-off between economic intensity and epidemic risks. Since areas that are more productive are socially and economically connected domestically and internationally, they are more exposed to the spread of new infectious diseases. The link between a pandemic crisis and economic activity should thus be considered in both directions: economically developed areas are more likely to be affected by high infection rates due to their intense economic networks. At the same time, since highly infected areas contribute more to value-added, global economic growth may be strongly reduced when a pandemic crisis occurs, irrespective of the containment measures adopted to reduce the infection's transmission rate. Repeating the analysis on the sub-sample of cumulative cases emerging after a time window in which containment measures were in force showed that some of these economic triggers were no longer selected. This result is likely to signal the transmission channels of the containment measures adopted by the Italian government. Finally, using provincial data on the early stages of the “second wave” of the epidemic, which cannot be linked to a specific initial outbreak, we confirm that highly productive provinces are more likely to be affected by infectious diseases. The opposite holds for poor areas characterized by higher levels of employment in agriculture.

From the simulations of models estimated over bootstrapped samples, we showed that the elastic net algorithm maintains the stability and correctness of predictions of the actual spread of the pandemic by minimizing the variance component of the out-of-sample mean squared error. A further interesting result emerged from simulations obtained with models estimated on reduced samples. We showed that these models provide reliable predictions of the pandemic spread also in unseen areas, that is, in territories not considered in the retained sub-samples. This result highlights the external validity of the model selected by the elastic net. In future research, we wish to test whether this property continues to hold in other countries.

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Table A.2

Description of climate and pollution predictors.

Predictor	Description	Source (year)
NO2	Concentration of nitrogen dioxide (µg/m³) in the provincial capital	ISTAT (2018)
PM10	Concentration of particulate matter 10 micrometres or less in diameter (µg/m³) in the provincial capital	ISTAT (2018)
PM2.5	Concentration of particulate matter 2.5 micrometres or less in diameter (µg/m³) in the provincial capital	ISTAT (2018)
Hot days in a year	Number of days in a year with a max temperature above 30°C: 2008 - 2018 average values.	Il Sole 24 ore. Quality of life index (2018)
Frost days in a year	Number of days in a year with a max temperature below 3°C: 2008 - 2018 average values.	Il Sole 24 ore. Quality of life index (2018)
Rainy days in a year	Number of rainy days in a year: 2008 - 2018 average values.	Il Sole 24 ore. Quality of life index (2018)
Foggy days in a year	Number of foggy days in a year: 2008 - 2018 average values.	Il Sole 24 ore. Quality of life index (2018)
Daily hours of sunshine	Number of daily hours of sunshine: 2008 - 2018 average values.	Il Sole 24 ore. Quality of life index (2018)
Number of windy days in a year	Number of days in a year with wind gusts greater than 25 knots: 2008 - 2018 average values.	Il Sole 24 ore. Quality of life index (2018)

Table A.3

Description of socio-demographic predictors.

Predictor	Description	Source (year)
Age of the population	Average age of provincial population	ISTAT (2018)
Family size	Average number of family members	ISTAT (2011)
Percentage of families with 5 or more members	Percentage of families with at least 5 members	ISTAT (2011)
Percentage of immigrants	Percentage of foreign residents in the provincial population	ISTAT (2019)
Percentage of males	Percentage of male individuals	ISTAT (2019)
Percentage of population aged 65 or more	Percentage of provincial population aged 65 years old or more	ISTAT (2019)
Percentage of the population living close to a train station	Percentage of provincial population living in a municipality with at least one station with more than 2,500 daily visitors in a year	Ministry of Economic Development (2014)
Percentage of the population living close to the sea	Percentage of provincial population living in a municipality located close to the sea	ISTAT (2019)
People that actually live in the province as a percentage of residents	Number of people actually living in the province as a percentage of provincial residents	ISTAT (2011)
Population density	Number of people per km²	ISTAT (2019)

Table A.4

Description of geographical and territorial predictors.

Predictor	Description	Source (year)
Altitude of the province capital	Altitude of the provincial capital measured at City Hall	ISTAT
Distance from the first outbreak: less than or equal to 50 km	Dummy for a provincial capital which is 50 km or less away from the province of the first outbreak (Lodi)	Own elaboration using latitude, longitude and the curvature constant
Distance from the first outbreak: between 51 and 100 km	Dummy for a provincial capital which is between 51 km and 100 km away from the province of the first outbreak (Lodi)	Own elaboration using latitude, longitude and the curvature constant
Distance from the first outbreak: between 101 and 300 km	Dummy for a provincial capital which is between 101 km and 300 km away from the province of the first outbreak (Lodi)	Own elaboration using latitude, longitude and the curvature constant
Distance from the first outbreak: between 301 and 500 km	Dummy for a provincial capital which is between 301 km and 500 km away from the province of the first outbreak (Lodi)	Own elaboration using latitude, longitude and the curvature constant
Distance from the first outbreak: more than 500 km	Dummy for a provincial capital which is more than 500 km away from the province of the first outbreak (Lodi)	Own elaboration using latitude, longitude and the curvature constant
Municipality density	Number of municipalities per km²	ISTAT (2020)
Mean altitude of the province	Mean altitude of the provincial territory	ISTAT
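The distance dummies in Table A.4 can be illustrated with a short sketch. It assumes that "latitude, longitude and the curvature constant" refers to a great-circle (haversine) distance with a mean Earth radius of 6,371 km, which the table does not state explicitly. The coordinates are approximate and purely illustrative, the band edges are treated as continuous thresholds, and all names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_band(km):
    """Assign a distance to one of the five bands listed in Table A.4."""
    if km <= 50:
        return "<=50 km"
    if km <= 100:
        return "51-100 km"
    if km <= 300:
        return "101-300 km"
    if km <= 500:
        return "301-500 km"
    return ">500 km"


# Approximate coordinates (latitude, longitude), for illustration only
LODI = (45.31, 9.50)
MILAN = (45.46, 9.19)
ROME = (41.90, 12.50)
PALERMO = (38.12, 13.36)

bands = {
    name: distance_band(haversine_km(*LODI, *coords))
    for name, coords in {"Milan": MILAN, "Rome": ROME, "Palermo": PALERMO}.items()
}
```

Each provincial capital falls into exactly one band, so the five dummies are mutually exclusive and one of them has to be dropped when an intercept is included.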

Table A.5

Description of health care system predictors.

Predictor	Description	Source (year)
Average hospital size	Average number of hospital beds per hospital	Italian Ministry of Health (2018)
Hospital beds per capita	Number of hospital beds per capita	Italian Ministry of Health (2018)
Hospital density	Number of hospitals per km²	Italian Ministry of Health (2018)
Mortality from infectious diseases	Number of deaths from infectious diseases per 10,000 people	ISTAT (2017)
Mortality rate	Number of deaths per 10,000 people	ISTAT (2017)
Mortality rate from pneumonia	Number of deaths from pneumonia per 10,000 people	ISTAT (2017)
Mortality rate from respiratory diseases	Number of deaths from respiratory diseases per 10,000 people	ISTAT (2017)
Percentage of hospital beds in private clinics	Percentage of total hospital beds hosted by private clinics	Italian Ministry of Health (2018)
Percentage of total hospital beds for the elderly	Percentage of provincial hospital beds dedicated to the elderly	Italian Ministry of Health (2018)
General practitioner per capita	Number of general practitioners as a share of the provincial population	Italian Ministry of Health (2018)
Hours of continuity health care services per capita	Total yearly hours of continuity health care services per capita	Italian Ministry of Health (2018)
Cases handled by the medical homecare as a share of the population	Number of patients handled by the medical homecare as a share of the population	Italian Ministry of Health (2018)
Clinic density	Number of clinics per km²	Italian Ministry of Health (2018)

Table A.6

Description of mobility predictors.

Predictor	Description	Source (year)
Car density	Number of cars per km²	ISTAT (2018)
Nr. of flight passengers per capita	Number of flight passengers in provincial airports between January and February 2020 as a percentage of the provincial population	Association of Italian airport operators (2020)
Nr. of public transport passengers per capita	Number of yearly public transport passengers per inhabitant in the provincial capital	ISTAT (2015)
Nr. of public transport seats per km/resident	Number of public transport seats per km/resident in the provincial capital	ISTAT (2015)
Percentage of commuters	Percentage of daily commuters in provincial population	ISTAT (2011)
Percentage of commuters outside their municipality of residence	Percentage of total commuters who travel daily outside the municipality of residence.	ISTAT (2011)
Percentage of commuters who use a private vehicle	Percentage of total commuters who use a private vehicle.	ISTAT (2011)
Percentage of flight passengers from international locations	Percentage of total flight passengers in local airports going to or coming from international locations between January and February 2020	Association of Italian airport operators (2020)
Yearly registered visitors in accommodation facilities (percentage)	Number of yearly visitors registered in accommodation facilities as a percentage of the provincial population	ISTAT (2018)
Yearly ship passenger arrivals in local harbours (percentage)	Number of ship passengers landing in provincial harbours as a percentage of provincial population	ISTAT (2018)
Percentage of the population living close to a train station	Percentage of provincial population living in a municipality with at least one station with more than 2,500 daily visitors in a year	Ministry of Economic Development (2014)
Degree of provincial interconnection	Number of commuters going to or coming from other provinces as a share of the provincial population	ISTAT (2011)
