
Predicting the spread of COVID-19 in Italy using machine learning: Do socio-economic factors matter?

Francesco Bloise, Massimiliano Tancioni

Abstract

We exploit the provincial variability of COVID-19 cases registered in Italy to select the territorial predictors of the pandemic. Absent an established theoretical diffusion model, we apply machine learning to isolate, among 77 potential predictors, those that minimize the out-of-sample prediction error. We first estimate the model considering cumulative cases registered before the containment measures displayed their effects (i.e. at the peak of the epidemic in March 2020), then cases registered between the peak date and when containment measures were relaxed in early June. In the first estimate, the results highlight the dominance of factors related to the intensity and interactions of economic activities. In the second, the relevance of these variables is greatly reduced, suggesting mitigation of the pandemic following the lockdown of the economy. Finally, by considering cases at the onset of the "second wave", we confirm that the territorial distribution of the epidemic is associated with economic factors.
© 2021 Elsevier B.V. All rights reserved.

Keywords:  COVID-19; Coronavirus; Economic networks; Epidemic; Machine learning; Economic structure

Year:  2021        PMID: 35317020      PMCID: PMC7994006          DOI: 10.1016/j.strueco.2021.01.001

Source DB:  PubMed          Journal:  Struct Chang Econ Dyn        ISSN: 0954-349X


Introduction

The COVID-19 pandemic has opened up new challenges for understanding the factors associated with the spread of contagious diseases and the role played by social, economic, and environmental conditions. In this study we investigate the case of Italy, the first European country to experience a large number of registered cases in early 2020, with three main aims. First, we use a methodology (based on a machine learning algorithm) for selecting the relevant predictors of registered COVID-19 cases from a large set of official, provincial-level data in Italy. These data include economic activity indicators within a conditioning set that considers a high number of potential triggers addressed in the current literature. Second, by repeating the analysis on cumulative cases observed at different points in time, we evaluate whether some selected predictors lose their relevance following the time-specific containment measures implemented by the Italian government, which might signal that these measures have been effective in mitigating the pandemic's further spread. Third, we set up a simulation strategy to verify the external validity of the model by testing its ability to predict the diffusion of COVID-19 in "unseen" areas.
Before going into the details of the analysis, we provide some background on the pandemic in Italy. On January 30, 2020, the first two cases of COVID-19 were detected, and the respective patients were hospitalized in Rome, Italy. Only about 20 days later, on February 20, 2020, the first COVID-19 outbreak was identified in Codogno, a municipality belonging to the province of Lodi in the Lombardy region. Italy thus became the first European country to be severely hit by the COVID-19 pandemic. Specifically, as of March 8, 7,375 COVID-19 cases were registered in Italy, of which nearly 57% were in Lombardy.
On March 21, when the highest number of diagnosed new positive cases was registered (6,557), the cumulative number of cases increased to 53,578 (of which 47.6% were in Lombardy), exhibiting a growth rate close to 14%. Several increasingly restrictive containment measures were adopted by the Italian government after the detection of the first outbreak, leading to a nearly complete lockdown of the country's economic activities on March 21, 2020. This strong containment measure (stage one) was in effect until May 3, 2020. From this date until June 3 (stage two), the containment was partially relaxed. At the end of stage two on June 3, the number of registered cases was 233,836, with a 0.1% growth rate. The country then entered a "third stage" in the management of the COVID-19 spread, mostly based on the adoption of prescriptions for personal protection and social distancing. One of the main characteristics of the COVID-19 pandemic in Italy is its highly heterogeneous territorial distribution. As of June 3, 2020, Lombardy's official disease count accounted for 38.2% of all Italian cases, a number that was 128% higher than that expected for a homogeneous territorial distribution (i.e. a 16.7% regional share in the national population). Within Lombardy, official disease counts also indicated high heterogeneity in incidences across provinces. On June 3, 2020, COVID-19 counts per 100,000 inhabitants ranged from 408 cases, registered in the province of Varese, to 1,802 cases in the province of Cremona. A high territorial heterogeneity of the spread was also found in other Italian regions. In the central region of Lazio, the incidences ranged between a minimum of 94 cases per 100,000 inhabitants in the province of Latina and a maximum of 248 cases, registered in the province of Rieti. In Sicily, official disease counts ranged from a minimum of 30 cases per 100,000 inhabitants in the province of Ragusa to a maximum of 258 cases in the province of Enna.
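The over-representation figure quoted above follows from comparing Lombardy's share of cases with its share of the national population. A minimal Python sketch of that arithmetic is given here; the function name `excess_share_pct` is ours, introduced only for illustration, and the inputs are the rounded shares reported in the text, so the result agrees with the reported 128% only up to rounding.

```python
def excess_share_pct(case_share_pct, population_share_pct):
    """Percentage by which a region's case share exceeds its population share.

    Under a homogeneous territorial distribution the two shares coincide
    and the function returns 0.
    """
    return (case_share_pct / population_share_pct - 1.0) * 100.0


# Rounded shares for Lombardy as of June 3, 2020, as reported in the text.
lombardy_excess = excess_share_pct(38.2, 16.7)
print(f"Case share exceeds population share by about {lombardy_excess:.1f}%")
```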
A number of explanations have been suggested in the ongoing debate for the uneven geographical spread of the pandemic. A natural conjecture is that such heterogeneity in disease counts reflects differences in the territorial distribution of its triggers. Most of the potential disease triggers proposed so far can be grouped into the following general categories: demographics (Dowd et al., 2020), health care system characteristics (Black et al., 2020; Gan et al., 2020; Brindle and Gawande, 2020), enrollment in education systems (Chang et al., 2020; Li et al., 2020; Viner et al., 2020; Zhang et al., 2020), transport and mobility specificities (Do et al., 2020; Li et al., 2020; Zheng et al., 2020), climate factors (Bashir et al., 2020; Sajadi et al., 2020; Wang et al., 2020), pollution (Yongjian et al., 2020; Wu et al., 2020), and social attitudes and family ties (Bayer and Kuhn, 2020; Belloc et al., 2020; Borgonovi and Andrieu, 2020). Only a few highly focused studies address the role of economic factors in the spread of the pandemic (Barbieri et al., 2020; Dingel and Neiman, 2020; Qiu et al., 2020; Fogli and Veldkamp, 2020), whereas much more attention is being devoted to the opposite causal nexus, that is, the investigation of the effects of the pandemic on economic activities and conditions (Atkenson, 2020; Baker et al., 2020; Baqaee and Fahri, 2020; Bronka et al., 2020; Decerf et al., 2020; Dosi et al., 2020; Fernandes, 2020; Gregory et al., 2020; Guan et al., 2020; Ludvigson et al., 2020; Palomino et al., 2020). Each of the many explanations has its own rationale. However, in the absence of randomized testing on individuals or of an established diffusion model for the infectious disease, their objective relevance for the prediction of the uneven spread of the epidemic remains questionable.
Empirical analyses that focus on specific triggers in non-experimental environments are unavoidably at risk of strong under-specification biases, unless they rely on strategies able to consider all these factors jointly. Detecting the relevant correlates of the geographical heterogeneity of the epidemic from aggregate data is a difficult task, since potential predictors are many and highly correlated. In this setting, analyses that strictly focus on specific aspects often lead to biased estimates due to the omission of important controls. When the set of correlates is enlarged to reduce the bias, the standard errors of the estimates obtained with standard unpenalized estimators (such as OLS) tend to inflate, implying that the statistical relevance of the conditioning sets shrinks toward zero and the model's out-of-sample predictive performance deteriorates. In other words, standard estimation tools are bound by an underspecification/overfitting trade-off. A viable strategy to handle the bias–variance trade-off and minimize the mean squared prediction error is to leave the predictors' set unrestricted (thus taking an agnostic perspective) and use statistical learning to select a parsimonious model. This implies adding penalty terms to the OLS objective that are able to select a compact model structure with "optimal" out-of-sample predictive properties. Given our empirical setting, we use the elastic net learning algorithm (Zou and Hastie, 2005; Hastie et al., 2009), which provides an estimator belonging to this general strategy. The elastic net estimator combines the properties of ridge regression, which mitigates the multicollinearity problem through regularization, and of the least absolute shrinkage and selection operator (LASSO), which performs variable selection by setting some coefficients exactly to zero and can thereby further increase the predictive ability of models. This method has many advantages.
First, it allows joint consideration of a very large number of correlated predictors in a unified empirical framework and reduction of the risk of overfitting by selecting only those variables that provide the highest predictive performance. Second, relative to other machine learning algorithms, the elastic net ensures good performances, even when the number of tested predictors is high (in principle even larger than the number of observations) relative to the sample size (Zou and Hastie, 2005). Third, since the elastic net can be conceptualized as a generalization of OLS, it allows a standard interpretation of results. With this strategy, we exploit an extended information set including COVID-19 cases observed at the provincial level and the many potential triggers addressed in the literature. Specifically, we consider a set of 77 province-specific candidate predictors that can be grouped in the following conceptual sub-sets of indicators: i) economic activity and intensity, ii) climate and pollution, iii) socio-demographic, iv) geographical and territorial distance, v) health-care-system-related indicators, vi) public and private mobility, and vii) educational-system-related indicators. As a first robustness check, we include regional dummies to check whether our results are driven by region-specific characteristics, an occurrence that would switch the set of predictors selected by the elastic net estimator. Robustness is also tested with respect to the role of the control identifying the distance from the first outbreak. To the best of our knowledge, this is the first study in which a very large set of the potential triggers of the geographical diffusion of COVID-19 addressed in the literature are jointly considered and analyzed in a unified empirical framework. The results point to a substantial improvement of the model's predictive properties with elastic net estimates. 
The gain is striking, as compared with OLS, and evident with respect to ridge regression and LASSO. The model estimated with the elastic net using the registered cumulative cases on March 21, 2020 identifies few relevant predictors of the geographical distribution of the epidemic among the 77 explanatory variables being considered. Importantly, we find that five of the 10 selected triggers belong to the economic sub-set. In order of relative importance, productivity (value-added per employee), the intensity of firms' international relationships, the general employment rate, and the share of labor employed in manufacturing show a positive correlation with the prevalence of COVID-19 cases, whereas the share of labor in agriculture is selected as a favorable trigger (negative correlation). The highest positive correlation is obtained by the measure of close Euclidean distance from the first outbreak (i.e. within 50 km from the province of Lodi). Health characteristics are shown to affect the geographical spread through a positive correlation with the mortality rate for infectious diseases. Climatic and pollution factors selected as critical triggers include the number of frost days in a year and the average concentration of PM10. Family ties, proxied by the average family size, are shown to have a weak but negative correlation with COVID-19 cases. Re-running the elastic net estimates on June 3, 2020 case data and controlling for cumulative cases registered on March 21, 2020 yields results that cancel three out of the five economic triggers, namely, value-added per employee, the share of employment in manufacturing, and the general employment rate. We interpret this result as evidence of the effectiveness of the strong containment measures adopted by the government. This evidence, which confirms the results provided by Flaxman et al. (2020) for a set of 11 European countries, is reinforced by the deletion of the close distance identifier, which was selected as the most important trigger of the pandemic in the pre-containment sample. In the post-containment sample, the registered cumulative cases on March 21, 2020 (which can be conceived as a measure of the attack rate of the disease in the second stage) are found to be the most important trigger of the subsequent spread. Baseline results are robust to the introduction of regional dummies. From a simulation exercise based on estimates obtained over bootstrapped samples, the models selected with the elastic net algorithm are shown to largely outperform those estimated with OLS in terms of out-of-sample properties. Unlike those from OLS, predictions obtained with the models estimated with the elastic net are shown to be stable across simulations and to replicate the real pattern correctly, irrespective of the randomly generated sample being considered. Furthermore, models estimated on five different training samples (in which 20% of provinces are iteratively excluded from the estimation set) are shown to maintain their predictive power also in "unseen" areas, that is, those excluded from the training sample. This is a signal of the external validity of the model being selected by the elastic net algorithm, at least for the Italian case. Finally, using information on cumulative COVID-19 cases recorded between September and October 2020, we show that cross-province economic differences are confirmed as key factors to predict the spread of the epidemic in Italy also during the “second wave”.
The paper proceeds as follows. Section 2 briefly describes the evolving literature. Section 3 discusses the data used in the analysis and provides some stylized descriptive evidence. Section 4 describes our estimation strategy and discusses its main advantages.
Section 5 presents and discusses the main results, provides information about their robustness, shows the prediction improvement of the model selected by the elastic net algorithm as compared with OLS, and evaluates its external validity. Section 6 provides the conclusion.
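To make the penalized estimator described in this Introduction concrete, the following is a minimal sketch of an elastic net fitted by cyclic coordinate descent. It is an illustration under stated assumptions and not the authors' implementation: it assumes a common parameterization of the objective, (1/(2n))·||y − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||²), with standardized predictors and a centered outcome. The function names, the synthetic data, and the values of `lam` and `alpha` are hypothetical; in practice both tuning parameters would be chosen by minimizing out-of-sample prediction error.

```python
import random


def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def standardize(column):
    """Center a column and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]


def elastic_net(X, y, lam, alpha, n_sweeps=200):
    """Elastic net coefficients via cyclic coordinate descent (no intercept).

    alpha = 1 gives the LASSO penalty, alpha = 0 the ridge penalty,
    and lam = 0 reduces the problem to ordinary least squares.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Covariance of predictor j with the partial residual
            # (the residual computed as if beta[j] were zero).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Synthetic illustration (hypothetical data): two correlated relevant
# predictors followed by four pure-noise predictors.
random.seed(0)
n_obs = 60
x1 = [random.gauss(0, 1) for _ in range(n_obs)]
x2 = [a + 0.3 * random.gauss(0, 1) for a in x1]  # strongly correlated with x1
noise = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(4)]
y_raw = [a + b + 0.5 * random.gauss(0, 1) for a, b in zip(x1, x2)]

columns = [standardize(c) for c in [x1, x2] + noise]
X_demo = [[col[i] for col in columns] for i in range(n_obs)]
y_mean = sum(y_raw) / n_obs
y_demo = [v - y_mean for v in y_raw]

coefs = elastic_net(X_demo, y_demo, lam=0.1, alpha=0.5)
print([round(b, 3) for b in coefs])
```

The ridge component of the penalty appears in the denominator of the update, which stabilizes the estimates when predictors are highly correlated, while the soft-thresholding step is what sets weak predictors exactly to zero.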

The current literature

A number of contributions from authors belonging to different disciplines have recently tried to identify the reasons behind the territorial heterogeneity of COVID-19 spreads observed at the global level. Different views are emerging from separate studies focused on specific research questions. Such heterogeneity of views reflects the existence of both an objective puzzle and an investigation difficulty. On the one hand, in the case of Italy, the considerably higher rate of infection in Lombardy and in other northern provinces cannot be explained only by the fact that at the beginning of the epidemic in February (i.e. before the first important Italian outbreak in the province of Lodi was discovered and public authorities implemented stringent containment measures) most people got infected in the northern provinces. Evidence from registered cases clearly shows that for any given distance from the first outbreak, there were provinces in the north with a high rate of infection and others that displayed much lower prevalence. On the other hand, a common feature of recent contributions performed with standard tools is that the emerging correlations between the diffusion of COVID-19 and its triggers emerge from investigations that lack a comprehensive treatment of the potential predictors that can reasonably be conjectured. To contextualize our analysis, here we briefly mention only a few of them, approximately covering the range of explanations to which interest has been directed. Among possible explanations of the territorial spread of the epidemic, interest has first focused on demographic factors. From this perspective, a large stream of studies has addressed the role of the age structure, sex, and intergenerational interactions in the observed territorial heterogeneity in the diffusion and fatality rate of COVID-19 (Dowd et al., 2020). Other works tried to evaluate the possible association between COVID-19 cases and climatic factors.
Investigations were mostly focused on the role of air temperature and humidity, obtaining mixed results (Bashir et al., 2020; Zhu and Xie, 2020; Sajadi et al., 2020; Wang et al., 2020). Moreover, the infection risk for health care workers in hospitals and their role in the outward transmission of the disease have been the focus of other studies. Black et al. (2020), noting that nearly 45% of secondary cases could be infected by index cases in a pre-symptomatic stage, and that in fully tested realities asymptomatic cases ranged from 51% to 88%, made a strong case for the strategic role of mass testing of health care workers to prevent propagation within and out of hospitals. From a similar perspective, Gan et al. (2020) addressed the case of the Singapore health system, and Brindle and Gawande (2020) studied the specifics of managing the pandemic risk in surgical systems. All these contributions implicitly conjecture that health care system characteristics and policies are critical for the spread of the pandemic. This possibility, still missing objective evidence, is the subject of harsh debate and juridical investigations in Italy. Some evidence on influenza outbreaks attributed an important role in the transmission of infectious diseases to school and education system arrangements (Jackson et al., 2016; Bin et al., 2018). This stimulated interest in investigations on the role of class attendance and participation in the education system as a trigger of the COVID-19 pandemic. In an early review study focusing on schools, Viner et al. (2020) showed that the evidence on school closures for the containment of the COVID-19 pandemic is weak or mixed. Unless reinforced with other stringent social distancing measures, social contacts outside schools are not less risky than child activities in schools (Chang et al., 2020). However, Zhang et al. (2020), in a model-based analysis focused on China, showed that school closure, by delaying the epidemic spread, can significantly reduce the peak incidence. Li et al. (2020), in a cross-country study on the time-varying effectiveness of a set of containment measures on the COVID-19 replication rate, showed that school closure decreases transmission by 15% after four weeks, while school reopening could increase transmission by 24%. Interest has also been directed at evaluating the role of environmental characteristics, with a specific focus on air pollution. Yongjian et al. (2020), in a strictly focused investigation, found a positive correlation between short-term exposure to air pollution and the number of confirmed COVID-19 cases in China. Their investigation, however, does not take into account other factors that can be simultaneously correlated with the spread of the pandemic and air pollution. Wu et al. (2020) suggested that air pollution might be correlated with higher mortality rates in the United States, even after controlling for some confounding factors. Other studies focused on the correlation between the infection rates and social/family habits. Intergenerational family ties and cohabitation, known to be very high in Italy as compared with other high-income countries (Reher, 1998; Di Giulio and Rosina, 2007; Santarelli and Cottone, 2009), have been evaluated as a possible trigger of the epidemic. From this perspective, Bayer and Kuhn (2020) found a positive correlation between a measure of family vertical integration and the COVID-19 fatality rate, using cross-country data recorded at an early stage of the "first wave" of the pandemic. Belloc et al. (2020) argued that this result might be driven by country-specific factors simultaneously correlated with both intergenerational family ties and the spread of COVID-19.
Such a potential selection bias is obviously high in analyses in which the variability across structurally and institutionally different countries or regions is exploited. In fact, Belloc et al. (2020) showed that the correlation between the COVID-19 fatality rate and a measure of vertical social integration (i.e. the share of adults aged 18–34 living with their parents) turns negative when the analysis exploits variability across the 20 Italian regions, where the southern regions display stronger family ties and lower case fatality rates. From a similar perspective, Borgonovi and Andrieu (2020) evaluated the role of social capital (an index comprising both social norms and networks, obtained from an assortment of measured human attitudes, activities, and behaviors) in the response of U.S. county communities to COVID-19-related containment policies, measured in terms of changes in mobility patterns. They found that the social capital index is negatively correlated with mobility during the COVID-19 outbreak and is thus associated with a lower risk of contagion. With regard to the economic literature, efforts are mostly focused on the potential effects of the pandemic on the economy. Different aspects have been addressed: global economic performances (Atkenson, 2020; Baqaee and Fahri, 2020; Fernandes, 2020; Gregory et al., 2020; Ludvigson et al., 2020), global supply chains (Guan et al., 2020), economic uncertainty (Baker et al., 2020), and the distribution of income and poverty (Bonacini et al., 2020; Bronka et al., 2020; Decerf et al., 2020; Dosi et al., 2020; Palomino et al., 2020). There are fewer studies that analyze economic factors as potential triggers of the pandemic. To cite some of them, Qiu et al. (2020) showed that the transmission rate of the infection increases with per capita GDP. Their result suggests that economic factors should be further investigated as important predictors of the COVID-19 diffusion.
Fogli and Veldkamp (2020) suggested that areas that are more productive are socially and economically connected with each other and with the rest of the world and thus, are more vulnerable to spreads of infectious diseases. Ascani et al. (2020) find an association between the geographical spread of COVID-19 in Italy during the “first wave” and the structure of local economies. However, using the OLS estimator, they are forced to consider a limited number of explanatory variables to avoid multicollinearity issues. From a microeconomic perspective, Barbieri et al. (2020) evaluated the extent to which the probability of being infected varies across different categories of workers. Dingel and Neiman (2020) showed that the probability of infection is related to the possibility of working from home. These studies suggest that the link between a pandemic crisis and economic activity should be addressed considering two directions of causality. On the one hand, areas that are more productive and interconnected are more likely to be affected by high infection rates due to their higher degree of social and economic networks. On the other hand, highly infected areas contribute more to global supply chains and value-added, such that global economic growth may be strongly reduced in the occurrence of a pandemic crisis, irrespective of the containment measures adopted to reduce the infection's transmission rate (Guan et al., 2020). This brief and unavoidably incomplete review of the related literature provides a sketch of the contributions to which our work is related. It also helps in forming an idea about the difficulties arising from analyses focused on few specific predictors of the COVID-19 pandemic in a non-experimental environment. 
In the following sections, we propose an analysis that is able to circumvent these problems, with the specific goal of selecting, among a large set of candidate economic and non-economic triggers, those that have the highest predictive power for the heterogeneous diffusion of the COVID-19 pandemic.

Data and descriptive evidence

Our investigation analyzes the geographical diffusion of the COVID-19 pandemic in Italy and identifies its predictors in a unified empirical framework. We collect data on registered COVID-19 cases and on a large set of potential predictors observed at the provincial level from different official data sources. First, we take information on COVID-19 cases provided by the Italian Civil Protection Department (ICPD) on a daily basis since the first cases were identified in the municipality of Codogno at the end of February. Although we are aware that information on registered cases is likely affected by a high degree of measurement error (the number of infected people has been largely underestimated due to the low number of swabs and tests carried out at the beginning of the epidemic), data provided by the ICPD are so far generally recognized as the sole official and controlled source available in Italy. We refer to the number of cumulative COVID-19 cases registered by the ICPD on two different dates: March 21, 2020 and June 3, 2020. The former date is selected to focus on the heterogeneity in the geographical distribution of COVID-19 cases observed before the containment measures implemented by the Italian government had taken effect. In considering cumulative cases as of June 3, 2020, we focus on the geographical distribution of COVID-19 cases registered between March 21, 2020 and June 3, 2020, that is, on infection events that occurred when the strongest containment measures were in place. By exploiting this difference in policy implementation, we evaluate whether the tested predictors of the spread of the epidemic vary due to the implementation of the measures. Fig. 1 shows the geographical diffusion of infected people per 100,000 inhabitants registered on March 21, 2020 (left map) and on June 3, 2020 (right map) by grouping the 107 provinces in deciles of the national distribution of COVID-19 cases.
The two maps clearly show that most of Lombardy's provinces have been severely hit by the epidemic and are in the top decile of the national distribution, with 219 to 761 (819 to 1,802) cases per 100,000 inhabitants as of March 21, 2020 (June 3, 2020). As of March 21, 2020, aside from some areas in northern Italy, the two bordering provinces of Rimini and Pesaro–Urbino are the only other provinces along the eastern coast of Italy that belong to the top decile of the distribution. On June 3, 2020, they fall to the 8th and 9th decile, respectively. Most of the provinces in the central or southern part of Italy show a lower degree of infections with 2 to 7 (28 to 51) cases per 100,000 inhabitants on March 21, 2020 (June 3, 2020).
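The grouping behind the two maps can be sketched in a few lines of Python: convert cumulative cases to a rate per 100,000 inhabitants, then assign each province a decile by its rank in the national distribution. This is only an illustrative sketch; the function names and the province figures are hypothetical (they are not ICPD data), and ties are broken by input order, which may differ from the rule used for the published maps.

```python
def cases_per_100k(cases, population):
    """Cumulative cases per 100,000 inhabitants."""
    return cases / population * 100_000


def decile_groups(values):
    """Assign each value a decile from 1 (lowest) to 10 (highest) by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable sort: ties keep input order
    deciles = [0] * n
    for rank, idx in enumerate(order):
        deciles[idx] = rank * 10 // n + 1
    return deciles


# Hypothetical provinces: (cumulative cases, resident population).
provinces = {
    "P01": (40, 800_000), "P02": (1_500, 350_000), "P03": (90, 600_000),
    "P04": (700, 1_200_000), "P05": (15, 250_000), "P06": (2_400, 420_000),
    "P07": (310, 900_000), "P08": (55, 150_000), "P09": (1_100, 500_000),
    "P10": (20, 400_000),
}
rates = [cases_per_100k(c, pop) for c, pop in provinces.values()]
for name, rate, dec in zip(provinces, rates, decile_groups(rates)):
    print(f"{name}: {rate:7.1f} cases per 100,000 -> decile {dec}")
```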
Fig. 1

Geographical distribution of COVID-19 cumulative cases per 100,000 inhabitants.

However, differences can be detected across provinces in each region or area. For instance, the number of infected people per 100,000 inhabitants is clearly higher in northern Sardinia (province of Sassari) than in the rest of the island, whereas the province of Enna in central Sicily, on June 3, 2020, shows a much higher degree of infection compared with other Sicilian provinces. To provide a comprehensive analysis of the many potential predictors of COVID-19 diffusion across Italian provinces, we use 77 explanatory variables, all observed at the provincial level. Data come from different official sources: the national statistical office (ISTAT), the Ministry of Economy and Finance, the Ministry of Economic Development, and the Ministry of Health. These data can be grouped into seven conceptual sub-sets of indicators: i) economic activity and intensity (19 variables), ii) climate and pollution (9 variables), iii) socio-demographic (9 variables), iv) geographical and territorial distance (8 variables), v) health-care-system-related (13 variables), vi) public and private mobility (12 variables), and vii) educational-system-related indicators (7 variables).
In particular, the set of economic predictors includes labor market characteristics, specifically the employment and unemployment rates, the percentage of employment in agriculture, manufacturing, and services, and the percentage of self-employed workers; economic and distribution characteristics, specifically the value added per employee (productivity), the value added per capita (overall and in agriculture, manufacturing, and services), and the poverty rate; and firm characteristics, specifically the firm size, firm density, the share of employment in industrial districts, the intensity of firms’ export relationships, the tons of goods unloaded in provincial harbors per inhabitant, the density of livestock units, and the density of firms producing animal-derived products. The full set of indicators for each conceptual sub-set is listed and described in detail in the Appendix (Tables A.1 to A.7).
Table A.1

Description of economic activity predictors.

Predictor | Description | Source (year)
Employment rate | Employed people over provincial population | ISTAT (2017)
Unemployment rate | Percentage of active provincial population aged 15-74 who are unemployed | ISTAT (2019)
Percentage of employment in agriculture | Percentage of total employees who work in agriculture activities | ISTAT (2017)
Percentage of employment in manufacturing | Percentage of total employees who work in manufacturing activities | ISTAT (2017)
Percentage of employment in services | Percentage of total employees employed in service activities | ISTAT (2017)
Percentage of self-employed workers | Percentage of provincial workers who are self-employed | ISTAT (2011)
Value added per employee | Value added in euro per employee (productivity) | ISTAT (2017)
Value added per capita | Value added in euro per capita | ISTAT (2017)
Value added per capita - Agriculture | Value added of agriculture in euro per resident | ISTAT (2017)
Value added per capita - Manufacturing | Value added of manufacturing in euro per resident | ISTAT (2017)
Value added per capita - Services | Value added of services in euro per resident | ISTAT (2017)
Poverty rate | Percentage of taxpayers declaring less than 10,000 euro in 2018 | Italian Ministry of Economy and Finance (2019)
Firm density | Number of firms per km² | ISTAT (2017)
Firm size | Average number of employees per firm | ISTAT (2017)
Percentage of employment in industrial districts | Percentage of total employees who work in industrial districts | ISTAT (2017)
Intensity of export relationships | Average number of areas of the world (e.g. Europe, BRICs, rest of the world) where firms export their products | ISTAT (2018)
Unloaded goods in the local harbours | Tons of goods unloaded in the local harbours per inhabitant | ISTAT (2018)
Cattle density | Number of livestock units per km² | ISTAT (2010)
Density of firms producing animal-derived products | Number of firms producing goods derived from animal products per km² | Italian Ministry of Health (2018)
Table A.7

Description of education system predictors.

Predictor | Description | Source (year of reference)
Percentage of compulsory school students | Percentage of compulsory school students over total population | ISTAT (2018)
Percentage of high-school graduates | Percentage of high school graduates over total population | ISTAT (2018)
Percentage of people below upper secondary education | Percentage of people below upper secondary education | ISTAT (2011)
Percentage of pre-school students | Percentage of provincial students enrolled in pre-school | ISTAT (2018)
Percentage of students | Percentage of students over provincial population | ISTAT (2018)
Percentage of tertiary graduates | Percentage of provincial population with a tertiary degree | ISTAT (2011)
Percentage of university students | Percentage of provincial students who are enrolled in universities | ISTAT (2018)
In line with many recent studies, we can illustrate the problems that potentially emerge by analyzing simple correlations between registered log cases per 100,000 residents in a province and specific factors that have so far been identified as possible predictors of the geographical spread of COVID-19 infections. For instance, Fig. 2 shows that COVID-19 cases are positively correlated with the average number of frost days in a year (Panel A), with the average concentration of PM10 (particles with diameter ≤ 10 μm), which can serve as a proxy for air pollution in the province (Panel B), and with the mortality rate for infectious diseases (Panel D), while they are negatively correlated with the percentage of families with at least five members (Panel C). The latter result basically replicates that of Belloc et al. (2020), which was obtained at the regional level.
Fig. 2

Estimated correlations between log cumulative cases per 100,000 inhabitants and selected covariates as of March 21, 2020.

It is noteworthy that such analyses are not able to predict accurately the distribution of cases across areas. Single correlations, even if statistically relevant, may be driven by many other confounding factors that should be considered in a comprehensive predictive model. This risk is particularly high when, as in this case, there is no sound theoretical support for variable selection, from which a structural model of the unequal diffusion of COVID-19 across different territories can be derived. Moreover, although using few explanatory variables minimizes the risk of overfitting, we can hardly obtain a good predictive performance of the COVID-19 spread by focusing on single potential predictors. This is why in the following sections we base our model specification on a machine learning algorithm capable of selecting the pandemic's relevant triggers from a joint consideration of many potential explanatory variables.

Estimation strategy

We perform our estimation using the elastic net machine learning algorithm originally proposed by Zou and Hastie (2005). The elastic net algorithm combines the ridge and LASSO regularizations to further increase the predictive ability of a model.4 The role of these penalties is to select a parsimonious model (and/or shrink the size of its coefficients) when the number of explanatory variables is very high (possibly exceeding the number of observations) and when the conditioning set displays near or exact collinearity. Model selection implies tuning the penalties to an optimal target, that is, the maximization of the model's out-of-sample predictive ability (or, equivalently, the minimization of the out-of-sample mean squared error). In practice, the penalty parameters are obtained from repeated rounds of model validation, known as cross-validation, in which the estimates obtained on estimation sets are used to predict unseen data (i.e. predictive sets). More specifically, the penalties are chosen to maximize the model's ability to predict data that are not used in estimation.

Since the elastic net can be conceptualized as a generalization of OLS (i.e. a methodology belonging to the family of regularized least squares), we provide the details of the estimation method in relation to some key properties of the OLS estimator. The OLS estimator is very often used to predict an outcome variable from a vector of predictors. Usually, the outcome variable of interest is predicted by estimating the parameters that make the in-sample sum of squared residuals as small as possible. However, there are two fundamental aspects of an estimator to be considered in the evaluation of its predictive performance: the bias and the variance. The former quantifies the error that is introduced by approximating an unknown data generating process.
Specifically, if we assume N random samples associated to different data generating processes, we could obtain a range of predictions, one for each randomly drawn sample. The bias is thus a measure of the distance between the expected value of the prediction and the unknown function that captures the true relationship between the outcome variable and the predictors. The variance is the variability of a model prediction around its expected value. According to the definition of bias and variance, the prediction performance of a model can be evaluated by looking at its mean squared error (MSE), which is the expected error in predicting a given outcome variable:

$$\mathrm{MSE} = E\left[\left(y_i - \hat{f}(x_i)\right)^2\right] \tag{1}$$

In our study, $y_i$ denotes registered log cases at a given date per 100,000 inhabitants observed at the provincial level, and $\hat{f}(x_i)$ denotes predicted log cases per 100,000 inhabitants, which is a function of the vector $x_i$ of provincial predictors included in the model. The statistical learning literature (Hastie et al., 2009) shows that the MSE can be decomposed as follows:

$$E\left[\left(y_i - \hat{f}(x_i)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_i)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_i)\right)\right]^2 + \mathrm{Var}(\varepsilon_i) \tag{2}$$

where the first term is the variance of the model; the second term is the squared bias; and the last is the noise term, which cannot be reduced. To attain the best prediction, we should minimize the MSE by reducing both the first and the second components of Equation (2). In finite samples, the well-known trade-off between variance and bias requires a balance between the first two components of Equation (2), such that the lowest attainable MSE is conditional on the specific set of predictors at our disposal. Although OLS is the best linear unbiased estimator, it produces very poor predictions in the following two cases:

i) when there is a high degree of correlation between the predictors included in the vector $x_i$;
ii) when too many predictors need to be included in the model with respect to the number of observations.
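The decomposition in Equation (2) and the first of the two problem cases can be illustrated with a small simulation. The sketch below is not part of the original analysis: the data are synthetic, the "true" coefficients, sample sizes, and the ridge penalty of 5 are arbitrary assumptions, and ridge regression is used only as a simple example of a penalized estimator. It redraws the training sample many times and measures how the prediction for one fixed unseen observation varies.

```python
import numpy as np

def prediction_error_parts(ridge_penalty, n_obs=40, n_pred=30,
                           n_samples=500, noise_sd=1.0, seed=0):
    """Monte Carlo estimate of the variance and squared bias of a prediction.

    ridge_penalty = 0 gives OLS; a positive value gives ridge regression.
    All settings are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(n_pred)
    beta[:3] = [1.0, 0.5, -0.5]          # hypothetical true coefficients
    x0 = rng.normal(size=n_pred)         # one fixed "unseen" observation
    f0 = x0 @ beta                       # true regression function at x0

    preds = np.empty(n_samples)
    for s in range(n_samples):
        # New training sample; a shared component makes predictors collinear.
        common = rng.normal(size=(n_obs, 1))
        X = 0.9 * common + 0.45 * rng.normal(size=(n_obs, n_pred))
        y = X @ beta + noise_sd * rng.normal(size=n_obs)
        gram = X.T @ X + ridge_penalty * np.eye(n_pred)
        preds[s] = x0 @ np.linalg.solve(gram, X.T @ y)

    variance = preds.var()                    # first term of Equation (2)
    bias_sq = (preds.mean() - f0) ** 2        # second term of Equation (2)
    reducible = np.mean((preds - f0) ** 2)    # equals variance + bias_sq
    mse = reducible + noise_sd ** 2           # add the irreducible noise term
    return variance, bias_sq, reducible, mse

var_ols, bias_ols, red_ols, mse_ols = prediction_error_parts(ridge_penalty=0.0)
var_rdg, bias_rdg, red_rdg, mse_rdg = prediction_error_parts(ridge_penalty=5.0)
print(f"OLS   variance={var_ols:.2f}  bias^2={bias_ols:.2f}  MSE={mse_ols:.2f}")
print(f"Ridge variance={var_rdg:.2f}  bias^2={bias_rdg:.2f}  MSE={mse_rdg:.2f}")
```

Both calls use the same seed, so OLS and ridge are compared on identical simulated samples; any difference in variance comes from the penalty alone.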
In worst-case scenarios, typical of investigations that lack the support of an established theoretical model, the number of predictors might exceed the number of observations, such that it is not even possible to estimate the parameters of interest using OLS. In both cases, the variance of the prediction can be extremely high, such that the predictive performance of the OLS estimator is very low, even though the bias component is minimized. Therefore, in very complex models, allowing for a small degree of bias is essential to obtain a strong reduction in variance and improve the prediction performance of the model. In our estimation problem, we need to identify the main predictors of the geographical spread of the epidemic considering a large set of potential predictors addressed by the literature, whose coefficients are to be estimated with a small sample size (i.e. the 107 Italian provinces). This is a typical case in which the prediction performance of the OLS estimator is very low, given several issues related to multicollinearity and overfitting. For this reason, we handle the variance-bias trade-off by using the elastic net regularization algorithm originally proposed by Zou and Hastie (2005):

$$\hat{\beta} = \arg\min_{\beta}\left\{\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 + \lambda\left[\alpha\sum_{j=1}^{p}|\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\right]\right\} \tag{3}$$

where $\beta$ is the vector of coefficients on the $p$ predictors and $n$ is the number of observations. The elastic net algorithm combines the penalties of the ridge regression and LASSO and mitigates some of the known drawbacks that affect LASSO, which has been shown to saturate when the number of predictors is very high with respect to the sample size, or when there is a high degree of correlation among predictors. In Equation (3), the $\lambda$ parameter controls the relevance of the regularization term, which shrinks the coefficients toward zero to reduce overfitting. When the parameter $\alpha = 0$, the elastic net algorithm collapses to the ridge regression and no predictor is excluded from the model. When $\alpha = 1$, the elastic net algorithm is equivalent to the LASSO, which has the potential ability to set some of the coefficients exactly to zero.
When both $\alpha$ and $\lambda$ are greater than zero, the algorithm is able to set some coefficients exactly to zero and to shrink others to minimize the prediction error. On the contrary, when both $\alpha$ and $\lambda$ equal zero, the elastic net algorithm collapses to the OLS case, such that all predictors are exploited to predict the outcome variable without any shrinkage. Therefore, it is possible to get very different predictive models and estimated coefficients for each combination of values of $\alpha$ and $\lambda$. Among all possible specifications, we select $\alpha$ and $\lambda$ by using k-fold cross-validation to minimize the out-of-sample MSE, and we evaluate the external predictive performance of the model by testing its ability to predict new data that were not used for its estimation (James et al., 2013). K-fold cross-validation is a re-sampling procedure that randomly splits the sample into K subsets (folds). Then, each of the K folds is in turn defined as the test set, and the K-1 remaining folds are used to estimate the model coefficients. Following Mullainathan and Spiess (2017), we calibrate our algorithm and evaluate its out-of-sample performance in four steps:

i) we randomly divide our data into a training sample (80% of the observations) and a hold-out sample (the remaining 20% of the data);
ii) in the 80% training sample we use 5-fold cross-validation to select a specific pair among a set of different possible combinations of $\alpha$ and $\lambda$. The selected $\alpha$ and $\lambda$ are the ones that minimize the average MSE computed across the five folds;5
iii) we run the algorithm in the training sample using the $\alpha$-$\lambda$ combination selected through k-fold validation;
iv) we compute the out-of-sample prediction error in the hold-out sample to test the model's capability to predict “unseen” data.
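The four calibration steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are synthetic stand-ins for the provincial data set, the grid of candidate values is much smaller than the one used in the paper, and the objective is assumed to have the common form (1/2n)·RSS + λ[α‖β‖₁ + (1−α)/2·‖β‖₂²], minimized here by a plain coordinate descent routine written for this example (the helper names `elastic_net`, `fit_predict`, and `cv_mse` are hypothetical).

```python
import numpy as np

def elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for
    (1/2n)*||y - X b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2),
    assuming standardized columns of X and a centered y (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]                   # partial residual without x_j
            rho = X[:, j] @ resid / n
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)  # LASSO part
            b[j] = shrunk / (col_sq[j] + lam * (1 - alpha))           # ridge part
            resid -= X[:, j] * b[j]
    return b

def fit_predict(X_tr, y_tr, X_te, lam, alpha):
    """Standardize with training moments, fit, and predict the test rows."""
    mu, sd, y_bar = X_tr.mean(axis=0), X_tr.std(axis=0), y_tr.mean()
    b = elastic_net((X_tr - mu) / sd, y_tr - y_bar, lam, alpha)
    return y_bar + ((X_te - mu) / sd) @ b

def cv_mse(X, y, lam, alpha, k=5, seed=0):
    """Average test-fold MSE over k folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(X[train], y[train], X[test], lam, alpha)
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic stand-in for the provincial data (107 rows, 15 correlated predictors).
rng = np.random.default_rng(1)
n, p = 107, 15
X = 0.6 * rng.normal(size=(n, 1)) + rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.normal(size=n)

# i) 80% training sample, 20% hold-out sample
order = rng.permutation(n)
n_train = round(0.8 * n)
train, hold = order[:n_train], order[n_train:]

# ii) 5-fold cross-validation over a small grid of (alpha, lambda) pairs
alphas = [0.0, 0.333, 0.667, 1.0]
lambdas = [0.01, 0.03, 0.1, 0.3, 1.0]
grid = [(a, lam) for a in alphas for lam in lambdas]
best_alpha, best_lam = min(grid, key=lambda g: cv_mse(X[train], y[train], g[1], g[0]))

# iii) + iv) refit on the full training sample, then score on the hold-out sample
pred_hold = fit_predict(X[train], y[train], X[hold], best_lam, best_alpha)
holdout_mse = float(np.mean((y[hold] - pred_hold) ** 2))
print(best_alpha, best_lam, round(holdout_mse, 3))
```

The hold-out rows are never touched during steps ii) and iii), which is what makes the final MSE a measure of performance on “unseen” data.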

Results and discussion

Results

In this section we present the results of our analysis by using, as dependent variable in our regressions, the log of COVID-19 cumulative cases per 100,000 residents in the province, measured at different dates. In our baseline analysis, we refer to positive cases registered until March 21, 2020 to consider the geographical spread of cases across Italian provinces in the first stage of the epidemic (Table 1). Then, we focus on cumulative cases registered between March 22, 2020 and June 3, 2020 to detect whether, and to what extent, predictors change when containment measures are in force (Table 4).
Table 1

Elastic net regression of log cumulative cases per 100,000 inhabitants on March 21, 2020.

Predictor | Baseline
Distance from the first outbreak: less than or equal to 50 km | 0.496
Value-added per employee | 0.244
Intensity of export relationships | 0.188
Nr. of frost days in a year | 0.161
Mortality from infectious diseases | 0.093
PM10 | 0.072
Employment rate | 0.071
Percentage of employment in manufacturing | 0.048
Average family members | -0.028
Percentage of employment in agriculture | -0.150
Observations | 107
α selected by 5-fold cross-validation | 0.333
λ selected by 5-fold cross-validation | 0.393
MSE (hold-out sample) | 0.527
Nr. of α values tested | 10
Nr. of λ values tested | 50

Constant terms and unselected predictors are not shown. The α-λ combination has been selected in the 80% training sample using 5-fold cross-validation. The out-of-sample predictive performance is tested in the remaining 20% of observations.

Table 4

Elastic net regression of log cumulative cases per 100,000 inhabitants—pre and post containment measures.

Predictor | March 21 (Model 1) | March 22-June 3 (Model 2) | March 22-June 3 (Model 3)
Distance from the first outbreak: less than or equal to 50 km | 0.000 | 0.149 | 0.000
Distance from the first outbreak: between 51 km and 100 km | 0.496 | 0.141 | 0.059
Value-added per employee | 0.244 | 0.125 | 0.000
Intensity of export relationships | 0.188 | 0.107 | 0.091
Mean altitude of the province | 0.000 | 0.085 | 0.056
Frost days in a year | 0.161 | 0.111 | 0.000
Mortality from infectious diseases | 0.093 | 0.081 | 0.000
Percentage of hospital beds for the elderly | 0.000 | 0.048 | 0.000
Municipality density | 0.000 | 0.046 | 0.066
Mortality rate from pneumonia | 0.000 | 0.000 | 0.052
Average hospital size | 0.000 | 0.040 | 0.000
Foggy days in a year | 0.000 | 0.038 | 0.000
PM10 | 0.072 | 0.036 | 0.000
NO2 | 0.000 | 0.019 | 0.000
Mortality rate | 0.000 | 0.000 | 0.010
Employment rate | 0.071 | 0.000 | 0.000
Percentage of employment in manufacturing | 0.048 | 0.000 | 0.000
Hot days in a year | 0.000 | -0.061 | 0.000
Percentage of families with 5 or more members | 0.000 | -0.065 | -0.077
Average family members | -0.028 | -0.093 | -0.064
Percentage of employment in agriculture | -0.150 | -0.088 | -0.028
Hours of continuity health care services per capita | 0.000 | -0.111 | 0.000
Log cases per 100,000 people on March 21 | Not included | Not included | 0.493
Observations | 107 | 107 | 107

Constant terms and predictors that are not selected in any of the three models are not shown.

To compare the magnitude of the estimated coefficients and identify which predictors are more relevant for our analysis, we standardize all explanatory variables so that we can interpret each estimated coefficient multiplied by 100 as the percentage increase of cases per 100,000 inhabitants for a one standard deviation increase in the predictor. Table 1 shows our baseline results obtained with the elastic net calibrated using 5-fold cross-validation.6 In this case, among all 77 explanatory variables considered in the analysis, the algorithm selects only 10 predictors, presented in descending order of importance. The coefficients of the remaining explanatory variables are penalized to zero, denoting that they are not relevant predictors of the epidemic, and are not shown in Table 1. On the contrary, although the coefficients of the 10 variables selected by the algorithm are reduced in size to minimize the risk of overfitting, they are not set to zero. The results suggest that all provinces within 50 km of the province of the initial outbreak are more likely to have a high infection rate (49.6% more cases per 100,000 people) about 1 month after the beginning of the epidemic. Nevertheless, none of the other distance dummies is selected as a predictor, suggesting that for distances above 50 km, there are other diffusion factors not related to geographical distance. Among the other selected variables, economic factors are shown to be the main predictors of the epidemic spread.
In particular, for a one standard deviation increase in the predictor, registered cases per 100,000 residents increase with value-added per employee (by 24.4%), the intensity of firms' export relationships (18.8%), the overall employment rate (7.1%), and the percentage of employment in manufacturing (4.8%), and decrease with the percentage of employment in agriculture (-15.0%). Our results suggest that provinces that are more productive are more likely to be severely hit by the epidemic. Additionally, more intensive international relationships, a higher employment rate, and a larger share of employees in the manufacturing industry are triggers of the initial COVID-19 geographical spread. It should be noted that manufacturing is the most strongly integrated sector in the global economy, as it is involved in global value chains and produces goods that make up the majority of exports in OECD countries (De Backer et al., 2015). Apart from economic triggers and the Euclidean distance, the most relevant explanatory variable identified as a predictor of the COVID-19 spread is the average number of days with temperature below 3°C (+16.1% for a one standard deviation increase). The rate of positive cases per 100,000 people is also higher where mortality from infectious diseases and PM10 concentration are higher (+9.3% and +7.2% for a one standard deviation increase, respectively). These results suggest that provinces with high COVID-19 registered cases are also those where the transmission rate of infections is generally higher and, as suggested by previous work (Wu et al., 2020), where air pollution is higher. Finally, the other variable selected by the algorithm (with a smaller estimated coefficient) is the average size of the family (-2.8% for a one standard deviation increase).
The latter result could indicate that stricter family ties reduce the exposure of family members to the spread of infections through external social and professional networks.7 In the OLS case (last three columns of Table A.8 in the Appendix), all 77 coefficients, aside from the omitted categories, are estimated. However, given the large number of predictors with respect to the sample size and the high degree of collinearity among explanatory variables, all coefficients are imprecisely estimated and most of them are not statistically significant. Moreover, the out-of-sample MSE, which is the measure of the out-of-sample prediction error of the model, is considerably higher than the in-sample MSE (5.763 vs. 0.045, respectively). This result suggests that, even if the OLS performs very well within the specific sample we are using, it performs very poorly in predicting external “unseen” observations. Therefore, given the high degree of overfitting and multicollinearity, we can neither identify the main predictors of the geographical spread of COVID-19 across Italian provinces nor obtain a good out-of-sample predictive performance of our model using the standard OLS estimator.
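The reading of the Table 1 coefficients as percentage changes can be checked with a short calculation. Because the outcome is in logs, multiplying a coefficient by 100 is an approximation; the exact change implied by a log outcome is 100·(exp(β) − 1), a standard property of log-linear models rather than something stated in the paper. The sketch below compares the two for a few coefficients from Table 1; the function names are hypothetical.

```python
import math

# Standardized elastic net coefficients reported in Table 1.
table1 = {
    "Distance from the first outbreak <= 50 km": 0.496,
    "Value-added per employee": 0.244,
    "Average family members": -0.028,
    "Percentage of employment in agriculture": -0.150,
}

def approx_pct_change(beta):
    """Reading used in the text: 100 * beta."""
    return 100.0 * beta

def exact_pct_change(beta):
    """Exact change implied by a log outcome: 100 * (exp(beta) - 1)."""
    return 100.0 * (math.exp(beta) - 1.0)

for name, beta in table1.items():
    print(f"{name}: approx {approx_pct_change(beta):+.1f}%, "
          f"exact {exact_pct_change(beta):+.1f}%")
```

The two readings nearly coincide for small coefficients and diverge for large ones, so the approximation matters most for the distance dummy.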
Table A.8

Regression of log cumulative cases per 100,000 inhabitants on March 21, 2020: Full results.

Predictor | Elastic net: Coefficient | OLS: Coefficient | OLS: S.E. | OLS: P-value
Distance from the first outbreak: less than or equal to 50 km | 0.496 | 2.066 | 1.212 | 0.098
Value-added per employee | 0.244 | 0.380 | 0.985 | 0.702
Intensity of export relationships | 0.188 | -0.121 | 0.255 | 0.637
Nr. of frost days in a year | 0.161 | 0.576 | 0.279 | 0.047
Mortality from infectious diseases | 0.093 | 0.153 | 0.154 | 0.328
Concentration of PM10 | 0.072 | 0.244 | 0.305 | 0.430
Employment rate | 0.071 | -0.005 | 1.403 | 0.997
Percentage of employment in manufacturing | 0.048 | Omitted category
Percentage of employment in industrial districts | 0.000 | 0.068 | 0.183 | 0.710
Percentage of employment in services | 0.000 | 0.234 | 0.433 | 0.593
Percentage of workers who are self-employed | 0.000 | -0.011 | 0.139 | 0.937
Hospital beds per capita | 0.000 | -0.273 | 0.259 | 0.299
Percentage of hospital beds in private clinics | 0.000 | -0.147 | 0.193 | 0.450
Percentage of hospital beds for the elderly | 0.000 | -0.012 | 0.122 | 0.920
Average firm size | 0.000 | 0.179 | 0.287 | 0.537
Average hospital size | 0.000 | 0.207 | 0.193 | 0.293
Population density | 0.000 | -0.122 | 1.835 | 0.947
Municipality density | 0.000 | -0.101 | 0.270 | 0.711
Hospital density | 0.000 | 0.110 | 1.059 | 0.918
Firm density | 0.000 | -0.099 | 1.177 | 0.933
Percentage of tertiary graduates | 0.000 | 0.394 | 0.306 | 0.207
Percentage of high-school graduates | 0.000 | -0.300 | 0.235 | 0.211
Percentage of people below upper secondary education | 0.000 | Omitted category
Nr. of flights passengers per capita | 0.000 | 0.020 | 0.146 | 0.894
Percentage of passengers from international locations | 0.000 | 0.042 | 0.154 | 0.786
Nr. of public transport passengers per capita | 0.000 | -0.222 | 0.320 | 0.493
Nr. of public transport seats per km/resident | 0.000 | 0.244 | 0.341 | 0.478
Car density | 0.000 | -0.142 | 0.208 | 0.498
Mortality rate for respiratory diseases | 0.000 | 0.052 | 0.236 | 0.827
Mortality rate for pneumonia | 0.000 | -0.341 | 0.274 | 0.223
Mortality rate | 0.000 | -0.333 | 0.403 | 0.415
Percentage of students | 0.000 | -0.300 | 0.256 | 0.251
Percentage of university students | 0.000 | Omitted category
Percentage of compulsory school students | 0.000 | -0.237 | 0.273 | 0.391
Percentage of pre-school students | 0.000 | 0.078 | 0.273 | 0.778
Value-added per capita | 0.000 | -3.255 | 2.991 | 0.284
Poverty rate | 0.000 | 0.546 | 0.849 | 0.525
Percentage of families with 5 or more members | 0.000 | 0.035 | 0.468 | 0.941
Percentage of males | 0.000 | -0.168 | 0.158 | 0.295
Average age of the population | 0.000 | 0.293 | 1.199 | 0.808
Percentage of immigrants | 0.000 | 0.572 | 0.361 | 0.123
Percentage of people aged 65 or more | 0.000 | -0.007 | 1.131 | 0.995
Concentration of PM2.5 | 0.000 | -0.395 | 0.307 | 0.206
Nr. of foggy days in a year | 0.000 | 0.036 | 0.194 | 0.853
Concentration of NO2 | 0.000 | 0.112 | 0.230 | 0.628
Nr. of windy days in a year | 0.000 | 0.007 | 0.195 | 0.972
Nr. of sunny days in a year | 0.000 | 0.236 | 0.417 | 0.576
Nr. of hot days in a year | 0.000 | -0.230 | 0.175 | 0.198
Nr. of rainy days in a year | 0.000 | 0.299 | 0.270 | 0.276
Percentage of people who live close to a train station | 0.000 | -0.167 | 0.171 | 0.337
Mean altitude of the province | 0.000 | 0.700 | 0.273 | 0.015
Altitude of the province capital | 0.000 | -0.545 | 0.230 | 0.024
Unemployment rate | 0.000 | 0.396 | 0.231 | 0.096
Commuters as a share of the population | 0.000 | 0.282 | 0.272 | 0.307
Percentage of commuters outside their municipality of residence | 0.000 | -0.075 | 0.290 | 0.797
Unloaded goods in the local harbours per capita | 0.000 | 0.180 | 0.170 | 0.299
Percentage of commuters who use a private vehicle | 0.000 | 0.077 | 0.179 | 0.672
Agriculture value-added per capita | 0.000 | 0.264 | 0.188 | 0.170
Services value-added per capita | 0.000 | 1.676 | 1.150 | 0.154
Manufacturing value-added per capita | 0.000 | 1.889 | 0.979 | 0.062
Yearly ship passengers arriving in the local harbours over total population | 0.000 | 0.055 | 0.125 | 0.661
Yearly registered visitors in accommodation facilities as a percentage of the population | 0.000 | 0.163 | 0.212 | 0.448
People that actually live in the province as a percentage of residents | 0.000 | -0.088 | 0.164 | 0.597
Percentage of people who live close to the sea | 0.000 | -0.217 | 0.135 | 0.116
Cattle density | 0.000 | 0.138 | 0.106 | 0.204
Firm density (derived from animal products) | 0.000 | -0.252 | 0.273 | 0.362
General practitioner per capita | 0.000 | 0.187 | 0.294 | 0.529
Hours of continuity health care services per capita | 0.000 | -0.151 | 0.255 | 0.558
Cases handled by the medical homecare as a share of the population | 0.000 | -0.056 | 0.145 | 0.700
Clinic density | 0.000 | 0.362 | 0.306 | 0.246
Degree of provincial interconnection | 0.000 | 0.185 | 0.288 | 0.525
Distance from the first outbreak: between 51 km and 100 km | 0.000 | 0.924 | 1.101 | 0.407
Distance from the first outbreak: between 101 km and 300 km | 0.000 | 0.694 | 0.903 | 0.448
Distance from the first outbreak: between 301 km and 500 km | 0.000 | 0.542 | 0.714 | 0.453
Distance from the first outbreak: more than 500 km | 0.000 | Omitted category
Average family members | -0.028 | -0.104 | 0.562 | 0.855
Percentage of employment in agriculture | -0.150 | -0.168 | 0.313 | 0.596
MSE (hold-out sample) | 0.578 | 5.763
MSE (training sample) | 0.463 | 0.045
One common problem that arises in machine learning is the lack of a measure of dispersion of the estimated coefficients with which to evaluate their precision. This limitation cannot be easily overcome with standard methods since the theoretical distribution of the estimator is unknown. Following Hastie et al. (2015), we circumvent this drawback in post-selection inference by using a bootstrap re-sampling method to approximate the data-specific distribution of the estimated coefficients. Based on this bootstrapped distribution, we evaluate how often each of the 77 coefficients is estimated to be different from zero. Specifically, using the elastic net algorithm calibrated through 5-fold cross-validation, we take 200 bootstrap replications of the 80% training sample to calculate how many times a given predictor exhibits a non-zero coefficient. Fig. A.2 in the Appendix shows that, although all potential predictors are selected at least once across the 200 bootstrapped samples, six of the predictors selected by elastic net (namely value-added per employee, average frost days in a year, the dummy that identifies provinces located within 50 km from Lodi, mortality from infectious diseases, the percentage of employment in agriculture, and PM10) have a non-zero coefficient in more than 95% of the 200 replications, while the probability of selecting other predictors is generally lower.
Fig. A.2

Post-selection inference.
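The bootstrap selection frequencies summarized in Fig. A.2 can be computed along the following lines. This is a sketch on synthetic data, not the original code: the coordinate descent solver `elastic_net` and the helper `selection_frequency` are hypothetical, the objective is assumed to be (1/2n)·RSS + λ[α‖β‖₁ + (1−α)/2·‖β‖₂²], and the penalty values are simply borrowed from Table 1 and held fixed across replications.

```python
import numpy as np

def elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for the elastic net on standardized X, centered y."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]
            rho = X[:, j] @ resid / n
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)
            b[j] = shrunk / (col_sq[j] + lam * (1 - alpha))
            resid -= X[:, j] * b[j]
    return b

def selection_frequency(X, y, lam, alpha, n_boot=200, seed=0):
    """Share of bootstrap replications in which each coefficient is non-zero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        rows = rng.integers(0, n, size=n)      # resample rows with replacement
        Xb, yb = X[rows], y[rows]
        Xb = (Xb - Xb.mean(axis=0)) / Xb.std(axis=0)
        counts += elastic_net(Xb, yb - yb.mean(), lam, alpha) != 0.0
    return counts / n_boot

# Synthetic training sample: 86 rows, 10 predictors, only the first two matter.
rng = np.random.default_rng(7)
X = rng.normal(size=(86, 10))
y = 1.0 * X[:, 0] + 0.6 * X[:, 1] + 0.5 * rng.normal(size=86)

freq = selection_frequency(X, y, lam=0.393, alpha=0.333)
for j, f in enumerate(freq):
    print(f"predictor {j}: selected in {100 * f:.0f}% of replications")
```

Predictors with a genuine effect should be selected in nearly every replication, while irrelevant ones enter only occasionally, which mirrors the 95% threshold used in the text.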

Robustness checks

As a first robustness check, we estimate a different model that also includes regional dummies to verify whether our results are driven by region-specific characteristics that modify the set of predictors selected by the elastic net algorithm. Since disease testing policies are managed at the regional level, these controls also capture potential differences in the testing ability and in the degree of measurement error in the registered infections.8 The results presented in Table 2, which are based on March 21, 2020 data, clearly show that both the selection of the main predictors of COVID-19 spread and all estimated coefficients are robust to the inclusion of the regional dummies and comparable to the baseline model. Moreover, none of the regional dummies is selected by the elastic net algorithm.
Table 2

Elastic net regression of log cumulative cases per 100,000 inhabitants on March 21, 2020: sensitivity to the inclusion of regional dummies.

Predictor | Baseline | Including regional dummies
Distance from the first outbreak: less than or equal to 50 km | 0.496 | 0.495
Value-added per employee | 0.244 | 0.244
Intensity of export relationships | 0.188 | 0.188
Frost days in a year | 0.161 | 0.160
Mortality from infectious diseases | 0.093 | 0.092
PM10 | 0.072 | 0.072
Employment rate | 0.071 | 0.071
Percentage of employment in manufacturing | 0.048 | 0.048
Average family members | -0.028 | -0.028
Percentage of employment in agriculture | -0.150 | -0.150
Observations | 107 | 107

Constant terms and predictors that are not selected in any of the two models are not shown.

In a second robustness check, we evaluate whether the Euclidean distance control is key for the characterization of the other predictors of the spread. By excluding this explanatory variable, we basically take the perspective of a pandemic event whose triggers are unconditional with respect to the geographical location of the first registered outbreak. Table 3 shows that the baseline results are confirmed even in the absence of the distance controls. No variable other than the number of foggy days in a year is additionally selected by the elastic net algorithm, and the size of the coefficients of the single predictors is basically unaffected.
Table 3

Elastic net regression of log cumulative cases per 100,000 inhabitants on March 21, 2020, excluding predictors related to the geographical location of the first outbreak.

Predictor | Baseline | Excluding distance
Distance from the first outbreak: less than or equal to 50 km | 0.496 | Not included
Value-added per employee | 0.244 | 0.262
Intensity of export relationships | 0.188 | 0.188
Frost days in a year | 0.161 | 0.162
Mortality from infectious diseases | 0.093 | 0.082
PM10 | 0.072 | 0.083
Employment rate | 0.071 | 0.057
Percentage of employment in manufacturing | 0.048 | 0.034
Foggy days in a year | 0.000 | 0.033
Average family members | -0.028 | -0.022
Percentage of employment in agriculture | -0.150 | -0.145
Observations | 107 | 107

Constant terms and predictors that are not selected in any of the two models are not shown.


Exploring the transmission channels of the containment measures

Table 4 presents the results of three alternative models. In the first column we summarize the results of the baseline model estimated by using the log cases of COVID-19 per 100,000 residents registered until March 21, 2020. In the second column, we evaluate how predictors change by considering cases registered between March 22, 2020 and June 3, 2020. Finally, in the last column we re-estimate the second model by including the log cases registered until March 21, 2020 in the conditioning set. This additional control is useful for eliminating the effect of the first stage of the epidemic on the transmission of infections that occurred after March 21 and, thus, on the selection of the relevant predictors of cases that occurred between March 22, 2020 and June 3, 2020. The results shown in Table 4 suggest that the containment measures and the lockdown of economic activities might have reduced the transmission rate of the epidemic mostly by reducing the relative importance of economic factors. Specifically, moving from Model 1 to Model 3, all economic predictors, except for the percentage of workers in agriculture and the intensity of export relationships (which show a coefficient closer to zero in the updated estimate), are no longer included among the relevant predictors of the epidemic. The same result holds for the dummy that identifies provinces within 50 km from the province of the initial outbreak and for previously selected climate factors (i.e. the number of frost days). On the contrary, in the second stage of the epidemic some additional predictors unrelated to economic factors (i.e. percentage of families with 5 or more members, average family size, municipality density, mortality from pneumonia, mortality rate and mean altitude of the province) become relevant.9

Predictive performance

In this section we evaluate the predictive performance of our model by adopting different methodological perspectives. We first compare the predictive performance of elastic net with that obtainable with LASSO, ridge regression, and OLS by training each estimator in the 80% training sample and testing the corresponding out-of-sample performance in the 20% hold-out sample. Table 5 shows that elastic net outperforms LASSO, ridge regression, and OLS, providing the lowest out-of-sample MSE (even if OLS, as expected, exhibits the best in-sample predictive performance). It is relevant to note that the out-of-sample prediction error is also very imprecisely estimated using OLS. Specifically, the standard error of the MSE calculated using 200 bootstrapped replications of the hold-out sample is more than 14 times higher in the OLS case than in the case of elastic net.
Table 5

Predictive out-of-sample and in-sample performance of different estimators.

Estimator | Estimated MSE: hold-out sample | Estimated MSE: training sample
Elastic net | 0.527 (0.147) | 0.463
Lasso | 0.580 (0.168) | 0.459
Ridge regression | 0.617 (0.187) | 0.352
OLS | 5.763 (2.159) | 0.045
Observations | 21 | 86

Bootstrapped standard errors (200 replications) in parenthesis.
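The bootstrapped standard errors in Table 5 measure how precisely the hold-out MSE itself is estimated. A minimal sketch of one way to obtain such a standard error is given below; it resamples the hold-out observations with replacement and recomputes the MSE each time. The function name `bootstrap_mse` and the two sets of synthetic predictions are illustrative assumptions; only the final ratio uses numbers taken from Table 5.

```python
import numpy as np

def bootstrap_mse(y_true, y_pred, n_boot=200, seed=0):
    """Hold-out MSE and its bootstrap standard error."""
    rng = np.random.default_rng(seed)
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    n = len(sq_err)
    # Resample hold-out observations with replacement and recompute the MSE.
    boot = [sq_err[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)]
    return sq_err.mean(), np.std(boot, ddof=1)

# Synthetic hold-out sample of 21 observations with two sets of predictions:
# one with small errors and one with large, erratic errors.
rng = np.random.default_rng(42)
y_hold = rng.normal(size=21)
pred_stable = y_hold + 0.5 * rng.normal(size=21)
pred_unstable = y_hold + 2.5 * rng.normal(size=21)

mse_s, se_s = bootstrap_mse(y_hold, pred_stable)
mse_u, se_u = bootstrap_mse(y_hold, pred_unstable)
print(f"stable:   MSE={mse_s:.3f} (SE {se_s:.3f})")
print(f"unstable: MSE={mse_u:.3f} (SE {se_u:.3f})")

# Ratio of the standard errors reported in Table 5 (OLS vs. elastic net).
table5_ratio = 2.159 / 0.147
print(f"Table 5 ratio of standard errors: {table5_ratio:.1f}")
```

With only 21 hold-out observations, a few very large squared errors dominate the resampled MSE, which is why an erratic estimator also shows a much larger standard error.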

We then provide graphical comparisons of the predictive out-of-sample performance of the elastic net algorithm and of the OLS estimator using two alternative strategies. With the first, we provide an additional graphical intuition of the extent to which the elastic net (and other regularization methods) is able to reduce the variance component of the MSE. Specifically, we generate three bootstrapped realizations of the training sample to calibrate and estimate our model. Then, we exploit the three sets of estimated coefficients to predict COVID-19 cases per 100,000 residents registered as of March 21, 2020 in the hold-out sample. Using the OLS estimator, although the in-sample MSE is very low (see Table 5), the predicted COVID-19 cases in the hold-out sample vary substantially across the three bootstrapped realizations of the training sample (Fig. A.3 in the Appendix). Moreover, in the OLS case there are some provinces that are predicted to be in the upper decile of the geographical COVID-19 distribution that instead belong to the lowest deciles of the actual distribution. Finally, we find that the OLS estimator greatly overestimates COVID-19 cases per 100,000 inhabitants in the top decile and predicts many provinces to have zero COVID-19 cases.
Fig. A.3

Graphical illustration of the variability of the out-of-sample performance of OLS.

Using elastic net and the same three bootstrapped realizations of the training sample to calibrate and estimate the coefficients, we find that the predictive performance in the hold-out sample is very stable and generally close to the actual number of registered cases (Fig. A.4 in the Appendix). Moreover, even if there are some specific provinces for which the algorithm mispredicts the actual decile of the COVID-19 distribution, there are no cases in which a province in the upper deciles is predicted to be in the lower deciles or vice versa. The elastic net algorithm, besides accurately predicting the geographical pattern of the spread of the epidemic, is able to predict the minimum and maximum values of each decile with minimal errors.
Fig. A.4

Graphical illustration of the variability of the out-of-sample performance of elastic net.
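The decile comparison described for Fig. A.3 and Fig. A.4 can be sketched as follows: each province is assigned a decile in the actual and in the predicted distribution, and the gap between the two is inspected. The helper names and the numbers are hypothetical, chosen only to show the two extreme outcomes (ordering preserved versus ordering inverted); they are not taken from the paper's data.

```python
def decile_ranks(values):
    """Return the decile (1 = lowest, 10 = highest) of each value by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, i in enumerate(order):
        ranks[i] = position * 10 // n + 1
    return ranks


def decile_gaps(actual, predicted):
    """Absolute difference between actual and predicted decile for each unit."""
    return [abs(a - p)
            for a, p in zip(decile_ranks(actual), decile_ranks(predicted))]


# Hypothetical cases per 100,000 residents for 20 provinces (made-up numbers).
actual = [12, 18, 25, 31, 40, 47, 55, 63, 72, 80,
          95, 110, 130, 150, 175, 200, 240, 290, 350, 420]
close_pred = [a * 1.05 + 2 for a in actual]  # preserves the ordering
scrambled_pred = list(reversed(actual))      # inverts the ordering

print(max(decile_gaps(actual, close_pred)))      # 0: every decile matches
print(max(decile_gaps(actual, scrambled_pred)))  # 9: top and bottom deciles swapped
```

A large maximum gap corresponds to the failure described for OLS, where a province in the lowest deciles of the actual distribution is predicted to be in the upper decile.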

The second evaluation methodology of the predictive performance of elastic net and OLS relies on randomly dividing our data into five equally sized hold-out samples. COVID-19 cases per 100,000 inhabitants are then predicted for each provincial subgroup in turn, using the corresponding 80% training sample to estimate the coefficients.10 This strategy helps to further illustrate the extent to which the elastic net algorithm, as compared to the standard OLS estimator, is able to dramatically improve prediction in "unseen" areas. Thus, the algorithm can be regarded as a tool capable of predicting the possible geographical diffusion of the epidemic at a given point in time. The geographical distribution of registered cases is accurately predicted by the elastic net algorithm, whereas the OLS estimator performs poorly, failing to identify the true decile for many provinces (Fig. 3). Specifically, OLS largely overestimates cases in provinces in the top decile of the distribution, with an implausible maximum predicted value of 9,000 cases per 100,000 inhabitants. Moreover, OLS predicts some provinces in the lowest decile to have zero registered cases when, as of March 21, 2020, no province registered zero infected people per 100,000 inhabitants (compare Fig. 1 and Fig. 3).
Fig. 3

Graphical illustration of the predictive out-of-sample performances of elastic net and OLS.
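The five-sample evaluation can be sketched in a few lines: the units are split at random into five equally sized groups, and each group is predicted from a model estimated on the remaining 80%. The sketch below is an illustration under stated assumptions, not the authors' implementation: it fits the elastic net by plain coordinate descent using the common glmnet-style penalty, takes the tuning values `alpha` and `lam` as given rather than selecting them by cross-validation, assumes predictors of comparable scale, and runs on synthetic data. All function names are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha, lam, sweeps=200):
    """Coordinate descent for the elastic net; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred predictors
    resid = [yi - y_mean for yi in y]                           # residuals at beta = 0
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in Xc]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def predict(model, X):
    """Linear prediction from an (intercept, coefficients) pair."""
    intercept, beta = model
    return [intercept + sum(b * v for b, v in zip(beta, row)) for row in X]


def five_fold_predictions(X, y, alpha, lam, k=5, seed=0):
    """Predict every unit from a model trained only on the other folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    out = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_elastic_net([X[i] for i in train], [y[i] for i in train],
                                alpha, lam)
        for i, pred in zip(held_out, predict(model, [X[i] for i in held_out])):
            out[i] = pred
    return out


# Synthetic data: 100 units, 8 predictors, only two of which matter.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]
y = [1.0 + 2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.5) for row in X]
preds = five_fold_predictions(X, y, alpha=0.5, lam=0.1)
cv_mse = sum((a - p) ** 2 for a, p in zip(y, preds)) / len(y)
print(round(cv_mse, 3))
```

Because every prediction comes from a model that never saw the unit being predicted, the resulting error is a direct measure of performance in "unseen" areas.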

Characterizing the predictors of the “second wave” of the COVID-19 epidemic: do economic factors still matter?

As of October 2020, Europe has been severely hit by a “second wave” of the COVID-19 epidemic. The strong containment measures adopted in Italy during the "first wave" prevented the spread of the contagion in the central and southern provinces. This is why, even though in the baseline analysis we control for regional dummies and the Euclidean distance from the first outbreak, some of the results obtained might still be related to provincial characteristics that are specific to the location of the first COVID-19 outbreak. However, as containment measures were relaxed in June, intense tourism movements during summer caused a reshuffling of the provincial distribution of the COVID-19 contagion. As a result, the "second wave" of the COVID-19 epidemic in Italy can no longer be related to the geographical location of any initial outbreak. For this reason, we update our analysis by considering cumulative provincial cases per 100,000 inhabitants recorded between September 1 and October 30, 2020 and the same conditioning set used in the baseline analysis. This allows us to verify if, and to what extent, relevant predictors selected on data from March 21, 2020 are confirmed once the spread of the COVID-19 epidemic is unrelated to any initial outbreak and to an infection coming from abroad. The results summarized in Table 6 show that, once again, many relevant economic variables are selected by the elastic net algorithm. Specifically, richer and more productive provinces are more likely to experience higher infection rates, while rural areas and provinces characterized by high unemployment rates are generally less affected by the COVID-19 epidemic. It is noteworthy that, once COVID-19 outbreaks are spread throughout the country, the variables capturing the degree of international economic connections (i.e. the share of workers in manufacturing and the intensity of export relationships), which are selected as relevant during the "first wave" of the epidemic in March, are no longer selected by elastic net from the prediction set.
Table 6

Elastic net regression of log cumulative cases per 100,000 inhabitants between September 1, 2020 and October 30, 2020.

| Predictor | Coefficient |
|---|---|
| Percentage of the population that lives close to a train station | 0.049 |
| Mean altitude of the province | 0.046 |
| People that actually live in the province as a percentage of residents | 0.028 |
| Firm density | 0.028 |
| Population density | 0.024 |
| Value-added per employee | 0.019 |
| Mortality from pneumonia | 0.013 |
| Windy days in a year | -0.012 |
| Daily hours of sunshine | -0.041 |
| Hours of continuity health care services per capita | -0.043 |
| Unemployment rate | -0.051 |
| Poverty rate | -0.060 |
| Percentage of employment in agriculture | -0.103 |
| Observations | 107 |
| α selected by 5-fold cross-validation | 0.111 |
| λ selected by 5-fold cross-validation | 0.502 |
| MSE (hold-out sample) | 0.201 |
| Nr. of α values tested | 10 |
| Nr. of λ values tested | 50 |

Constant terms and unselected predictors are not shown. The combination of α and λ has been selected in the 80% training sample using 5-fold cross-validation. The out-of-sample predictive performance is tested in the remaining 20% of observations.
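For orientation, the two tuning parameters reported in Table 6 can be read against the elastic net objective in its common glmnet-style parameterization. This form is an assumption made here for illustration; the scaling and the roles of α and λ differ across software, so the paper's own definition takes precedence where it differs. Here $N$ is the number of training observations, $y_i$ the log cumulative cases per 100,000 inhabitants in province $i$, and $x_i$ the vector of predictors.

```latex
\left(\hat{\beta}_0, \hat{\beta}\right)
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
    + \lambda\left[\alpha\,\lVert\beta\rVert_1
    + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
```

Under this convention α = 1 corresponds to LASSO and α = 0 to ridge regression, while λ sets the overall strength of the penalty. A value of α near zero would therefore place the selected model close to the ridge end of the range while still allowing some coefficients to be set exactly to zero.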

Additional triggers of the epidemic (i.e. the share of population that lives close to a train station, population density, and the number of people that actually live in the province as a share of the residents) emerge in the “second wave”. We assume that these additional triggers remained hidden in our baseline analysis given that the containment measures adopted in early March prevented the spread of the epidemic in some of the most populated Italian cities. As a further result, the number of hours of continuity health care services (Guardia medica) per capita, a measure of the degree of proximity of the Italian health system, is selected as a relevant predictor negatively correlated with the number of infected people per 100,000 inhabitants. Thus, the response to the spread of COVID-19 is probably associated with the capacity of the health system to provide proper territorial medical care. Finally, the geographical spread of COVID-19 cases per 100,000 inhabitants is confirmed to be associated with climate and geographic factors such as daily hours of sunshine, the number of windy days in a year, and the mean altitude of the province. This result suggests that the COVID-19 epidemic might be subject to seasonality.

Concluding remarks

The analysis we propose here is motivated by the observation of a highly heterogeneous spread of COVID-19 across Italian geographical areas. Such heterogeneity also seems to characterize the pandemic experience of other countries preceding and following the Italian case. The intensity and specificity of the uneven distribution of the spread in Italy go far beyond what is conceivable by taking into account only distance and geographical interrelating factors. This signals that other triggers, beyond those that characterize the standard transmission mechanics from index to secondary cases, are at work. A number of explanations have been proposed in the increasingly rich current literature. Studies are addressing the predictive ability of variables belonging to quite different conceptual clusters. We noted that each potential trigger proposed by the literature has its own rationale, as each one basically captures a different aspect of the human relationships emerging in a connected territorial, economic, and social environment. However, these studies are inherently focused on a specific aspect of the story, thus lacking a central requirement of analyses oriented at maximizing the out-of-sample predictive abilities of a model (i.e. its instrumental validity). Since a central feature of our analysis is that it considers economic and non-economic factors in a unified empirical framework, we adopted an empirical strategy, based on statistical learning, that selects, among a large set of potential triggers, those that maximize the out-of-sample predictive power of the selected model. From this perspective, we showed that the elastic net estimator clearly outperforms OLS and alternative regularization methods such as ridge regression and LASSO. The results point to a small subset of relevant triggers among the 77 included in the analysis. Within this subset, economic factors are shown to play an important role.
This result signals a possible trade-off between economic intensity and epidemic risks. Since areas that are more productive are socially and economically connected domestically and internationally, they are more exposed to the spread of new infectious diseases. The link between a pandemic crisis and economic activity should thus be considered in both directions: economically developed areas are more likely to be affected by high infection rates due to their intense economic networks. Conversely, since highly infected areas contribute more to value-added, global economic growth may be strongly reduced when a pandemic crisis occurs, irrespective of the containment measures adopted to reduce the infection's transmission rate. Repeating the analysis on the sub-sample of cumulative cases emerging after a time window in which containment measures were in force showed that some of these economic triggers lose their relevance. This result is likely to signal the channels through which the containment measures adopted by the Italian government operated. Finally, using provincial data on the early stages of the “second wave” of the epidemic, which cannot be linked to a specific initial outbreak, we confirm that highly productive provinces are more likely to be affected by infectious diseases. The opposite holds for poor areas characterized by higher levels of employment in agriculture. From the simulations of models estimated over bootstrapped samples, we showed that the elastic net algorithm is able to maintain stability and correctness of predictions of the actual spread of the pandemic by minimizing the variance component of the out-of-sample mean squared error. A further interesting result emerged from simulations obtained with models estimated on reduced samples. We showed that these models are able to provide reliable predictions of the pandemic spread also in unseen areas, that is, territories not considered in the retained sub-samples.
This result highlights the external validity of the model selected by the elastic net. In future research, we intend to test whether this property also holds in other countries.

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Table A.2

Description of climate and pollution predictors.

| Predictor | Description | Source (year) |
|---|---|---|
| NO2 | Concentration of nitrogen dioxide (µg/m³) in the provincial capital | ISTAT (2018) |
| PM10 | Concentration of particulate matter 10 micrometres or less in diameter (µg/m³) in the provincial capital | ISTAT (2018) |
| PM2.5 | Concentration of particulate matter 2.5 micrometres or less in diameter (µg/m³) in the provincial capital | ISTAT (2018) |
| Hot days in a year | Number of days in a year with a max temperature above 30°C: 2008 - 2018 average values. | Il Sole 24 ore. Quality of life index (2018) |
| Frost days in a year | Number of days in a year with a max temperature below 3°C: 2008 - 2018 average values. | Il Sole 24 ore. Quality of life index (2018) |
| Rainy days in a year | Number of rainy days in a year: 2008 - 2018 average values. | Il Sole 24 ore. Quality of life index (2018) |
| Foggy days in a year | Number of foggy days in a year: 2008 - 2018 average values. | Il Sole 24 ore. Quality of life index (2018) |
| Daily hours of sunshine | Number of daily hours of sunshine: 2008 - 2018 average values. | Il Sole 24 ore. Quality of life index (2018) |
| Number of windy days in a year | Number of days in a year with wind gusts greater than 25 knots: 2008 - 2018 average values. | Il Sole 24 ore. Quality of life index (2018) |
Table A.3

Description of socio-demographic predictors.

| Predictor | Description | Source (year) |
|---|---|---|
| Age of the population | Average age of provincial population | ISTAT (2018) |
| Family size | Average number of family members | ISTAT (2011) |
| Percentage of families with 5 or more members | Percentage of families with at least 5 members | ISTAT (2011) |
| Percentage of immigrants | Percentage of foreign residents in the provincial population | ISTAT (2019) |
| Percentage of males | Percentage of male individuals | ISTAT (2019) |
| Percentage of population aged 65 or more | Percentage of provincial population aged 65 years old or more | ISTAT (2019) |
| Percentage of the population living close to a train station | Percentage of provincial population living in a municipality with at least one station with more than 2,500 daily visitors in a year | Ministry of Economic Development (2014) |
| Percentage of the population living close to the sea | Percentage of provincial population living in a municipality located close to the sea | ISTAT (2019) |
| People that actually live in the province as a percentage of residents | Number of people actually living in the province as a percentage of provincial residents | ISTAT (2011) |
| Population density | Number of people per km² | ISTAT (2019) |
Population densityNumber of people per km2ISTAT (2019)
Table A.4

Description of geographical and territorial predictors.

| Predictor | Description | Source (year) |
|---|---|---|
| Altitude of the province capital | Altitude of the provincial capital measured at City Hall | ISTAT |
| Distance from the first outbreak: less than or equal to 50 km | Dummy for a provincial capital which is 50 km or less away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: between 51 and 100 km | Dummy for a provincial capital which is between 51 km and 100 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: between 101 and 300 km | Dummy for a provincial capital which is between 101 km and 300 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: between 301 and 500 km | Dummy for a provincial capital which is between 301 km and 500 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: more than 500 km | Dummy for a provincial capital which is more than 500 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Municipality density | Number of municipalities per km² | ISTAT (2020) |
| Mean altitude of the province | Mean altitude of the provincial territory | ISTAT |
Table A.5

Description of health care system predictors.

| Predictor | Description | Source (year) |
|---|---|---|
| Average hospital size | Average number of hospital beds per hospital | Italian Ministry of Health (2018) |
| Hospital beds per capita | Number of hospital beds per capita | Italian Ministry of Health (2018) |
| Hospital density | Number of hospitals per km² | Italian Ministry of Health (2018) |
| Mortality from infectious diseases | Number of deaths from infectious diseases per 10,000 people | ISTAT (2017) |
| Mortality rate | Number of deaths per 10,000 people | ISTAT (2017) |
| Mortality rate from pneumonia | Number of deaths from pneumonia per 10,000 people | ISTAT (2017) |
| Mortality rate from respiratory diseases | Number of deaths from respiratory diseases per 10,000 people | ISTAT (2017) |
| Percentage of hospital beds in private clinics | Percentage of total hospital beds hosted by private clinics | Italian Ministry of Health (2018) |
| Percentage of total hospital beds for the elderly | Percentage of provincial hospital beds dedicated to the elderly | Italian Ministry of Health (2018) |
| General practitioner per capita | Number of general practitioners as a share of the provincial population | Italian Ministry of Health (2018) |
| Hours of continuity health care services per capita | Total yearly hours of continuity health care services per capita | Italian Ministry of Health (2018) |
| Cases handled by the medical homecare as a share of the population | Number of patients handled by the medical homecare as a share of the population | Italian Ministry of Health (2018) |
| Clinic density | Number of clinics per km² | Italian Ministry of Health (2018) |
Table A.6

Description of mobility predictors.

| Predictor | Description | Source (year) |
|---|---|---|
| Car density | Number of cars per km² | ISTAT (2018) |
| Nr. of flight passengers per capita | Number of flight passengers in provincial airports between January and February 2020 as a percentage of the provincial population | Association of Italian airport operators (2020) |
| Nr. of public transport passengers per capita | Number of yearly public transport passengers per inhabitant in the provincial capital | ISTAT (2015) |
| Nr. of public transport seats per km/resident | Number of public transport seats per km/resident in the provincial capital | ISTAT (2015) |
| Percentage of commuters | Percentage of daily commuters in provincial population | ISTAT (2011) |
| Percentage of commuters outside their municipality of residence | Percentage of total commuters who travel daily outside the municipality of residence. | ISTAT (2011) |
| Percentage of commuters who use a private vehicle | Percentage of total commuters who use a private vehicle. | ISTAT (2011) |
| Percentage of flight passengers from international locations | Percentage of total flight passengers in local airports going to or coming from international locations between January and February 2020 | Association of Italian airport operators (2020) |
| Yearly registered visitors in accommodation facilities (percentage) | Number of yearly visitors registered in accommodation facilities as a percentage of the provincial population | ISTAT (2018) |
| Yearly ship passenger arrivals in local harbours (percentage) | Number of ship passengers landing in provincial harbours as a percentage of provincial population | ISTAT (2018) |
| Percentage of the population living close to a train station | Percentage of provincial population living in a municipality with at least one station with more than 2,500 daily visitors in a year | Ministry of Economic Development (2014) |
| Degree of provincial interconnection | Number of commuters going to or coming from other provinces as a share of the provincial population | ISTAT (2011) |