During the COVID-19 pandemic, most US states have taken measures of varying strength, enforcing social and physical distancing in the interest of public safety. These measures have enabled counties and states, with varying success, to slow down the propagation and mortality of the disease by matching the propagation rate to the capacity of medical facilities. However, each state's government was making its decisions based on limited information and without the benefit of being able to look retrospectively at the problem at large and to analyze the commonalities and the differences among the states and the counties across the country. We developed models connecting people's mobility, socioeconomic, and demographic factors with severity of the COVID pandemic in the US at the County level. These models can be used to inform policymakers and other stakeholders on measures to be taken during a pandemic. They also enable in-depth analysis of factors affecting the relationship between mobility and the severity of the disease. With the exception of one model, that of COVID recovery time, the resulting models accurately predict the vulnerability and severity metrics and rank the explanatory variables in the order of statistical importance. We also analyze and explain why recovery time did not allow for a good model.
It has been anecdotally and scientifically established since the early days of the COVID-19 pandemic that social and physical distancing, as well as personal protective equipment (PPE) and general hygienic practices, all contributed to people’s safety. Yet for people working in stores, restaurants, and gas stations; people working in taxi, ride-sharing, and delivery services; law enforcement officers; medics; itinerant workers, etc., staying put is not an option. They have to travel during the workday, sometimes to other cities and counties away from their primary residence locations. Their exposure to coronavirus is presumably greater than that of people in the more affluent communities, who can afford to stay at home, ordering food and supplies to be delivered to them. Consequently, it stands to reason that in the poorer communities people would be more vulnerable to coronavirus than in the wealthier communities. This has been corroborated by multiple studies by CDC, WHO, and academic researchers (e.g., [1], or [2]). There have also been articles published describing the use of mathematical methods in correlation analysis (e.g., [3]) and prediction (e.g., [4]) of the spread of COVID-19.
Our goal is to formulate comprehensive predictive models of the severity of the disease and of the risk to individuals and communities, based on which policymakers can develop holistic approaches to dealing with pandemics, targeting such approaches at reducing the severity of the disease and the risks of contracting it in their constituent communities.
To achieve this goal, we deemed it necessary not only to rely on existing metrics such as disease propagation rate and mortality, but also to formulate new pandemic severity measures, to take into account the joint effects of mobility, socioeconomic, and demographic metrics at the US county level, and to rank the metrics used as features in our models in the order of their relative importance for each of the pandemic severity measures.
In the early days of the pandemic, Oxford University came up with a way to quantify the strength of government response measures, creating a set of metrics for countries and for US states called OxCGRT (Oxford Covid Government Response Tracker). We explored the possibility of using OxCGRT composite indices to better understand how government actions influenced the propagation and severity of the disease. At the same time, the understanding that government-imposed restrictions may not be effective if the population does not observe them led us to formulate a new statistically and theoretically sound composite metric of mobility restraint, the Mobility Restraint Index (MRI), based on observed mobility data.
To measure COVID severity, we used the standard metrics: disease rate and mortality. However, these metrics do not point directly to the risk and severity of the disease. To rectify this situation, we formulated three new statistically and theoretically sound metrics covering disease severity and a probabilistic measure of risk: Recovery Time, Disease Severity Metric, and Risk. Details are presented in Section 3 and in the Appendices.
Our models are built on data from the Data For Good Humanitarian Data Exchange (HDX), the US Census Bureau, the Bureau of Economic Analysis, the Johns Hopkins Coronavirus Resource Center, and the Oxford Covid Government Response Tracker (OxCGRT) to understand the relative importance of the metrics describing people’s mobility, socioeconomic status, and governmental responses in the United States for the risk of contracting COVID-19. Our intent was to take a step towards furthering our understanding of community resilience in a disaster at the scale of a deadly pandemic.
The rest of this paper is organized as follows. Section 2 is a literature review. In Section 3, we describe our hypothesis and the methodological details, including the data sources, the modeling approach, and the exploratory data analysis. Section 4 describes the modeling results with regard to model evaluation and feature importance, while Section 5 is where we discuss the results and outline directions for further research opened by this work. In Section 6, we draw the final conclusions.
Literature review
Epidemiology
Epidemiology is defined in [5] as “the study of occurrence and distribution of health-related events, states and processes in specified populations, including the study of the determinants influencing such processes, and the application of such knowledge to control relevant health problems”. To understand how disease severity can be controlled, we need to understand the models used by epidemiologists. Two primary models are currently in use — Susceptible–Infected–Susceptible (SIS) and Susceptible–Infected–Recovered (SIR). [6] provide details of these models, while [7] describes it at a more general level. [4] demonstrate how epidemiological models can be used to predict near-term transmissibility of COVID-19. Their paper is singularly important in that Hong Kong digital transaction metadata have been made available to the researchers. Due to privacy concerns, we have not been able to avail ourselves of such data in the United States. However, there are several authoritative sources of data on COVID propagation in the United States, most notably Johns Hopkins University [8], which adhere to the US privacy requirements, and which we used in our study. [9] underlines the importance of keeping in mind that the most ubiquitous derived metrics of disease severity, such as case rate and mortality, are relative; therefore the denominator (population or number of infected people, respectively) is very important.
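As an illustration of the compartmental models mentioned above, the following is a minimal sketch of an SIR simulation. It is not taken from [6] or [7]; the function name `simulate_sir`, the forward-Euler integration, and all parameter values are assumptions chosen only to show how the susceptible, infected, and recovered fractions evolve.

```python
def simulate_sir(beta, gamma, s0, i0, rec0, days, dt=0.1):
    """Integrate the SIR equations with a simple forward-Euler scheme.

    beta  -- transmission rate per day (hypothetical value supplied by the caller)
    gamma -- recovery rate per day (1 / mean infectious period)
    s0, i0, rec0 -- initial susceptible, infected, and recovered fractions
    Returns a list of (S, I, R) tuples, one per whole day (day 0 included).
    """
    s, i, r = s0, i0, rec0
    steps_per_day = round(1 / dt)
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i * dt   # S -> I flow
            new_recoveries = gamma * i * dt      # I -> R flow
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameters, for illustration only.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=0.999, i0=0.001, rec0=0.0, days=160)
peak_day = max(range(len(trajectory)), key=lambda day: trajectory[day][1])
print(f"Infected fraction peaks on day {peak_day}")
```

An SIS variant would return recovered individuals to the susceptible compartment instead of accumulating them in R.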
Socioeconomic conditions
Socioeconomic conditions, such as income level, will have an effect on the individual’s mobility: we assume that mobility will be different in the wealthy and upper-middle-class communities vs. in the lower-income and poor communities. There is a vast amount of research on measures of socioeconomic conditions, including a variety of composite metrics. We are interested in a link between socioeconomic inequality and objective well-being conditions between communities. [10] draw parallels between socioeconomic inequity and health outcomes, measured as infant mortality and life expectancy at birth. This analysis was conducted at the country aggregation level, and connected “health production (access, coverage and prevention) and intersectoral variables: demographic, socioeconomic, governance and health risks”. This work accounts for the cultural traits and physical and financial resources. The authors focus on health-related financial resources and do not look at the more generalizable socioeconomic metrics like personal income and its distribution, which are the focus of our analysis: we are interested in the general wealth of the community and its effect on mobility and COVID vulnerability.
An important finding is reported by [11]: “... there were disparities in excess mortality depending on community socioeconomic positions”. The authors also point out the importance of keeping track of the second-tier effect of COVID:
“Patients with severe illness who visit the ED (emergency department) may also experience delays in proper treatment, which are due to a lack of resources resulting from the consumption of many medical resources, such as beds, intensive care units (ICUs) and medical staff, in the course of responding to the COVID-19 outbreak.”
With regard to personal income and its distribution, we found two important papers: [12], [13]. The former performs a thorough analysis of personal income distribution (PID) and concludes that, while the income bins used in computing the Gini index have been historically too wide to enable an accurate estimation, “relevant overall PIDs (personal income distribution) evidence that the fixed income hierarchy has been observed from the very beginning of the CPS (US Census Current Population Survey)”. The latter proposes that the power position of the individual, measured as the number of their subordinates, is the strongest determinant of the individual’s personal income. Extending the logic of comparing a city with a person proposed by [14], it can be hypothesized that the influence of a county or a city can be expected to have an effect on its average personal income.
Mobility
Multiple researchers report a relatively strong connection between mobility and propagation of the pandemic. Thus, [15] used a parametric model (multiple regression) to investigate the effect of mobility habits on the spread of COVID-19 in Italy. They report the coefficient of determination, $R^2$, of their best linear model. While the authors correctly say that this is not an unusual value for $R^2$, it means that their model only explains 42% of the variance in the COVID spread metric that they aimed to predict. The authors use a “mobility habits” index for each region in Italy, which they calculate as the product of the region’s population, the average daily mobility rate prior to COVID-19, and the percentage of variation due to the pandemic. It is similar to one of the mobility metrics that we use in our research: change in the range of daily movement. The average daily mobility rate (ADMR) is estimated as “percentage of residents in the ... region who during a day make at least one trip, for whatever purpose, with the exception of pedestrian trips shorter than 5 min”. This is similar to the percentage of limited-mobility residents, the other of the two mobility metrics we use in our work.
On the other hand, [16] propose a model for predicting the spread of COVID based on mobility and mask mandate information. Their work is important for our research: by using a nonparametric model (LSTM neural network), they achieved predictions significantly better than those obtained with the standard ARIMA method. However, the exceedingly low value of average percent difference that the authors achieved for New York (0.5%) corresponds to $R^2 = 0.995$, which may point to overfitting. Their results for Florida, in contrast, were generally acceptable in terms of both the average percent difference and the corresponding $R^2$. This tells us that nonparametric models of COVID spread by mobility and other factors have a high potential of explaining the variance in the predicted metric.
In [17], the author employs fixed effects models with correction for time trend. The author goes on to identify the modes of transportation and types of assembly that have a positive or a negative effect on $R_t$, the reproduction rate of the disease. The author points out that “(t)he mobility measures provided are all highly correlated, making individual inferences on each impossible if all are included within the same model”. Therefore, the author estimates the independent effect of each of the mobility measures (retail, grocery, parks, transit, workplace, and residence) on $R_t$ using a log-linear model and concludes that a significant reduction of mobility outside residence is needed to achieve $R_t < 1$. The author reports adjusted $R^2$ between 0.38 and 0.70. In this case, given the large sample sizes and the relatively small numbers of parameters, the difference between adjusted $R^2$ and $R^2$ can be deemed insignificant. An important observation made by the author about the need to account for the incubation period of the disease is taken into account in our work by time-shifting the mobility data relative to the COVID spread and mortality data.
[4] develop a nowcasting framework predicting the COVID reproduction rate, accounting for mobility, mixing of people, their ages, and other parameters. This work made it possible “to accurately track the local effective reproduction number of COVID-19 in near real time”, eliminating the 9-day delay between infection and reporting of cases. The authors do not provide an estimation of their model fit, however, other than graphical representations of observed data and their predictions with confidence intervals. The fit is not great, but may be considered acceptable. In the spirit of [4] and [17], [18] demonstrates the possibility of forecasting COVID-related changes in mobility in six different categories (Retail and Recreation, Grocery and Pharmacy, Parks, Transit Stations, Workplaces, and Residential). The author demonstrates that, with the exception of Residential, all categories experienced a reduction in mobility in the Philippines during the early months of 2020, when the country experienced an onset of COVID-19.
Government response to pandemics
Government response to pandemics describes state government policies and is very important for modeling disease severity, propagation rate, and risk to individuals and communities. [19] conduct a regression analysis of public trust and compliance with government measures to stem the spread of COVID-19. They conclude that (i) stricter government restrictions lead to an increase in compliance and (ii) “(a)mong political regimes, higher levels of public trust significantly increase the predicted compliance as stringency level rises in authoritarian and democratic countries”. The analysis in [20] differs from [19] in that the authors investigate the differences in compliance among states of one country (Mexico), which is similar to our approach for the United States. The authors brought together the University of Oxford Coronavirus Government Response Tracker (OxCGRT), population size and density, poverty, and mobility, and calculated the doubling time of COVID-19 cases as the response variable of their model. The parallel with our work is obvious. The authors use a linear model to predict doubling time, putting together the listed metrics. The model suggests that “income per capita emerged as a strong significant positive predictor of the doubling time of COVID-19 cases”.
At the same time, it appeared that mobility and prevalence of poverty, too, are positively correlated with the doubling time, that is, with a reduced COVID spread rate. On the surface, this seems paradoxical. The authors explain this by going deeper into the socioeconomic and demographic situation in the states with a higher poverty rate. We believe that in this case, a linear model with strongly correlated features (GDP and Poverty; Population and its Density; an unknown relationship between Poverty and Mobility, etc.) should not have been used directly because of multicollinearity, as it may lead to paradoxical effects (e.g., Simpson’s paradox, [21], [22]): collinear features interact in ways that appear strongly counter-intuitive. That said, main effects are not susceptible to Simpson’s paradox. Therefore, the authors’ statement that “increased mobility was significantly associated with more rapid viral transmission” is statistically sound.
On the other hand, the use of doubling time as the response variable in a linear regression should be treated very cautiously: time between independent events does not follow the Gaussian distribution, resembling instead an Exponential, a Gamma, or a Power-Law distribution. For these reasons, linear regression should not be used in this kind of analysis unless confidence intervals of the prediction are unimportant. Overall, the work by the authors of [20] is highly innovative and inspirational.
The authors of [23] explain that the spread of a pandemic should be modeled using the S–I–R (Susceptible–Infected–Recovered) model, not by a linear formula. This, too, may be responsible for the logical discrepancies reported by [20]. This is corroborated by [24], who extend the S–I–R model to account for the deviations from steady-state conditions and compute the basic and effective infection rates to predict the peaks in the numbers of COVID cases in the US, Brazil, China, and Italy.
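For readers unfamiliar with the doubling time discussed above as the response variable in [20], the following minimal sketch shows one common way to compute it from two cumulative case counts, assuming exponential growth between them. The function name and the example counts are hypothetical, and [20] may have used a different estimator.

```python
import math


def doubling_time(cases_start, cases_end, days_elapsed):
    """Doubling time in days, assuming exponential growth between two counts.

    Derived from cases_end = cases_start * 2 ** (days_elapsed / T).
    """
    if cases_start <= 0 or cases_end <= cases_start:
        raise ValueError("counts must be positive and strictly increasing")
    return days_elapsed * math.log(2) / math.log(cases_end / cases_start)


# Hypothetical counts: 100 cases growing to 400 cases over 10 days.
print(doubling_time(100, 400, 10))  # two doublings in 10 days
```

Because the result is a positive waiting time, its distribution is skewed, which is consistent with the caution expressed above about modeling it with ordinary linear regression.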
Community resilience and cohesion during COVID-19
It takes the entire community to stop a community-wide disaster. Coordination and willingness to undertake the necessary measures are key to stopping it. Therefore, cohesion and resilience are very important factors in this research.
[25] developed a general framework for measuring system resilience, where the system’s performance function, $F(t)$, shifts level from a stable value due to a disturbance and after a time recovers to a new stable value. The difference between the original stable value and the disturbed level, divided by the difference between the recovered level and the disturbed level, is proposed as a measure of community resilience: $\mathcal{R} = \dfrac{F(t_0) - F(t_d)}{F(t_r) - F(t_d)}$, where $F(t_0)$ is the last measurement of $F$ before the disturbance; $t_d$ is the time of stabilization at the disturbed state; $t_r$ is the time of stabilization at the recovered level of $F$. [26] further developed the measurement of community resilience. The authors take the idea of resilience as a process and view its metric as a property of the community, proposing to measure it by the sensitivity of stabilization time to the size of the disturbance. They then propose a self-correcting Bayesian method to measure community resilience as the inverse of the sensitivity of the recovery time to the severity of the disturbance. This approach is demonstrated to be viable in a set of longitudinal simulations, where one community is assumed to be taking multiple disturbances. We believe that, when multiple communities are hit with the same disturbing event, such as the COVID-19 pandemic, the same approach can be applied as well.
[27] is one of the most recent publications on the topic of community resilience during the COVID-19 onset. The authors outline the current state of research into social cohesion and resilience during the pandemic. They describe equity, sustainable development, and separation between the public and the government as factors affecting social cohesion. However, no data are proposed in support of the analysis in this paper.
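The resilience ratio attributed to [25] above can be sketched directly from its verbal description: the drop from the original stable level to the disturbed level, divided by the rebound from the disturbed level to the recovered level. The function and argument names below are hypothetical, and the sketch does not reproduce the exact notation of [25].

```python
def resilience_ratio(original_level, disturbed_level, recovered_level):
    """Ratio of performance lost to performance regained, as described in the text."""
    loss = original_level - disturbed_level      # drop caused by the disturbance
    regain = recovered_level - disturbed_level   # rebound after stabilization
    if regain == 0:
        raise ValueError("the system has not recovered from the disturbed level")
    return loss / regain


# Hypothetical performance levels, for illustration only.
print(resilience_ratio(100.0, 60.0, 100.0))  # full recovery gives a ratio of 1
print(resilience_ratio(100.0, 60.0, 80.0))   # partial recovery gives a ratio above 1
```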
It is intuitive that communication, especially risk communication, can be instrumental in building community resilience. In [28], the authors examine the importance of public awareness of risk and the role of public participation in risk communication and resilience building. The authors warn:
“A community’s resilience to tragedies strongly depends on the communication surrounding them before, during, and after crises. ... Understanding of this process, however, is limited, and efforts to improve public awareness of hazards and removing people from harm’s way have proven ineffective.”
They further explain that “...social media should be viewed as complementing traditional media, not as operating in opposition to traditional media (Jin and Liu 2010, op.cit., p. 199)”. We would like to add to this that even when social media operates with traditional media, they should be viewed as independent entities. This is of importance for our work, as it pertains to the analysis, among other features, of the effect of state government measures in response to COVID on its contagion and mortality.
Model quality baselining
Our research involved creating and modeling new metrics. It is important to see what portion of the variance in these metrics our models explain, and whether they are on par with models reported in the literature by other researchers. This question is answered by baselining the $R^2$ of the models. Based on the literature review and our models, we have compiled Table 1. Importantly, the results presented in it are only to be used to qualitatively establish that our models are at least as good as the models we found in the literature. As we see from Table 1, the LSTM neural network model of [16] has the best quality of the compared models, although, as we discussed in the Literature Review, it may be an overfit for some communities (New York, $R^2 = 0.995$). With our approach, the model quality exceeds two out of three cited models for 4 out of 5 metrics, with one metric in the same ballpark as [17] and better than [15].
Table 1
Benchmarking the quality of our models with other researchers’ models.
| Metric | Model | $R^2$ | Publication |
|---|---|---|---|
| COVID Rate | RFR | 0.93 | * |
| Duration | RFR | 0.61 | * |
| Mortality | RFR | 0.76 | * |
| DSM | RFR | 0.85 | * |
| Risk | RFR | 0.86 | * |
| COVID Spread | Linear Model | 0.41 | [15] |
| COVID Spread | LSTM NN | 0.95+ | [16] |
| Rt | Fixed-Effect | 0.38–0.70 | [17] |
The asterisk (*) denotes our research.
“DSM” stands for Disease Severity Metric.
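The $R^2$ values compared in Table 1 measure the share of variance in the observed metric that a model explains. The sketch below shows the standard computation; the function name and the sample values are hypothetical and are not data from this study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical observations and model predictions, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, predicted), 3))
```

A model that always predicts the mean of the observations scores 0, and a perfect model scores 1, which is why values very close to 1 on a small sample can be a sign of overfitting.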
Methodology
In this section, we describe how we connected the predicted metrics – traditional (COVID rate, Mortality) and derived (Recovery Time, Severity, and Risk) – with government measures and mobility, as well as demographic and socioeconomic factors. Derivation of all metrics is outlined in Fig. 1 and detailed in the Appendices.
Fig. 1
Origins of the New metrics.
Hypothesis and working model
Our hypothesis is that people’s socioeconomic status, population density, government measures to contain the pandemic, and mobility have an impact on the risk of contracting the virus and on the severity of the illness. We validate it by modeling and analyzing the quality of our models. Five categories of metrics interact in complex feedback and feed-forward relationships in our models: (i) Government Response; (ii) Socioeconomic Metrics; (iii) Population Mobility; (iv) Population Density Metrics; (v) COVID Metrics. We put together data from several sources, quantifying people’s mobility, socioeconomic conditions, population dynamics, and state government responses to COVID-19. (See Fig. 2 for illustration.) We predict COVID Rate, Mortality, Disease Severity Metric, Risk, and Recovery Time as functions of mobility, government measures, socioeconomic, and demographic metrics. Details on the definitions of dependent and independent metrics are presented in Section 3.5. Derivations are provided in the Appendix.
Fig. 2
The Modeling Flow.
Data sources
In this research, we used the following publicly available data sources:
- The Humanitarian Data Exchange (HDX) is an online assembly of data, contributed by industry, academia, and governments, enabling researchers and policymakers to analyze a number of important metrics, such as movement range data during the COVID-19 pandemic. URL: https://bit.ly/3qmYvwb
- Johns Hopkins Coronavirus Data. The Johns Hopkins Coronavirus Resource Center (CRC) is a continuously updated source of COVID-19 data and expert guidance. The CRC team “...collect and analyze the best data available on cases, deaths, tests, hospitalizations, and vaccines to help the public, policymakers, and healthcare professionals worldwide respond to the pandemic”. URL: https://coronavirus.jhu.edu/data; Github: https://bit.ly/3qnN1Z0
- Personal Income Data by US County. This dataset was downloaded from the Bureau of Economic Analysis. URL: https://bit.ly/2OtgbJf
- Population and Land Area of US Counties. These data were downloaded from the US Census Bureau site. URL: https://bit.ly/2PvVdK4
- Oxford Covid Government Response Tracker (OxCGRT) was created by the Blavatnik School of Government at the country level during the early days of the pandemic. Later, a US-specific tracker was developed. We downloaded it and used it in our research. URL: https://bit.ly/38foBuu; Github: https://bit.ly/38foBuu
Data alignment
To ensure that we would be able to infer causation where causation was possible (e.g., increased mobility leading to higher COVID rates) and would not be discovering false causalities (e.g., lower COVID rates leading to more stringent government-imposed social-distancing restrictions), we used the broadly available information about approximate incubation period of COVID-19 ( days). We phase-shifted COVID metrics days back from mobility and OxCGRT metrics. For the same reason, we phase-shifted COVID death data 4 more days back. It is impossible to accurately estimate the time it took each state to formulate each measure in response to the pandemic. No phase-shift was implemented for OxCGRT measures. This imposes certain precautions necessary when estimating the causality of the correlations between government measures and other metrics.
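The phase shift described above can be sketched as pairing each day's driver value (for example, mobility) with the response value (for example, COVID cases) observed a fixed number of days later. The function name, the series, and the lag of 2 days are all hypothetical; they only illustrate the mechanics and are not the lag values used in this study.

```python
def align_with_lag(driver, response, lag_days):
    """Pair driver[t] with response[t + lag_days] for every day where both exist."""
    if lag_days < 0:
        raise ValueError("lag_days must be non-negative")
    usable_days = max(min(len(driver), len(response) - lag_days), 0)
    return [(driver[t], response[t + lag_days]) for t in range(usable_days)]


# Hypothetical daily series, for illustration only.
mobility = [10, 20, 30, 40, 50]
new_cases = [0, 0, 1, 2, 3]
print(align_with_lag(mobility, new_cases, lag_days=2))
```

Note that the trailing days of the driver series have no matching response and are dropped, so a longer lag shortens the usable sample.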
Modeling approach
Having collected data from the sources listed in Section 3.2, we performed feature engineering to derive metrics describing (i) reduction of people’s mobility, (ii) COVID severity, and (iii) risk. That done, we analyzed pairwise and more complex relationships among metrics and built predictive models using the methodology described in [29] and outlined in Algorithm 1. The rationale for the modeling tools we chose is outlined in A.1 and A.2.
This enabled us to rank the feature metrics in the order of their impact on the target variables (COVID Metrics) in our models. These results are reported in Section 4. To confirm or disprove our hypothesis described in Section 3.1, we model COVID metrics (dependent variables) as functions of mobility, socioeconomic, and demographic metrics (independent variables, or features). While this approach does not provide insights into the mechanisms via which the dependent variables are influenced by the selected features, it enables researchers to evaluate the relative importance of the features for each of the dependent variables. To build an independent model for each of the dependent variables, we need to verify that they are uncorrelated with each other. This is done in Section 3.5.
Metrics
In this work, we analyzed the possibility of modeling COVID-19 metrics based on government measures, as well as mobility, socioeconomic, and demographic factors. In this subsection, we list the metrics available to us, the derived metrics, and how we used each of these metrics in our models.
Derivations, mathematical expressions, and other details pertaining to the metrics are provided in the Appendices.
Available and standard derived metrics
Johns Hopkins University Coronavirus Resource Center [8] provides data on the number of confirmed COVID cases (cumulative and daily) and the total number of deaths in the US at the county granularity level. We call COVID rate and COVID mortality standard derived metrics because they are computed and used in this research in the same way as, e.g., defined in [9].
We used two standard derived metrics in our work:
COVID rate. COVID rate is an important measure of disease propagation in a county. It expresses the proportion of cases in the population. It was computed as
$$r = \frac{N}{P}, \tag{1}$$
where $N$ is the number of cases in the county, and $P$ is the county’s population count.
COVID mortality. COVID mortality is an aggregate measure of severity of the disease and the state of preparedness, or lack thereof, of the county’s medical facilities, supplies, and personnel, including first responders and hospital workers, for handling the pandemic. It was computed as the number of deaths divided by the number of cases: $m = D / N$, where $D$ is the day’s death toll.
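The two standard derived metrics defined above reduce to simple ratios. The sketch below computes them for a single county; the function names and the county figures are hypothetical and serve only as a worked example.

```python
def covid_rate(cases, population):
    """Proportion of the county's population with a confirmed case."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population


def covid_mortality(deaths, cases):
    """Number of deaths divided by the number of cases."""
    if cases <= 0:
        raise ValueError("mortality is undefined without cases")
    return deaths / cases


# Hypothetical county: 50,000 residents, 1,250 cases, 25 deaths.
print(covid_rate(1_250, 50_000))    # 2.5% of residents infected
print(covid_mortality(25, 1_250))   # 2% of cases fatal
```

As [9] stresses, the two ratios have different denominators (population versus infected people), so they are not directly comparable with each other.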
New (GRM) derived metrics
To further our understanding of the diverse aspects of the pandemic represented by the data of Johns Hopkins University’s Coronavirus Resource Center (CRC), and of the impact of mobility and socioeconomic metrics on the severity of the disease and people’s risk, we developed the following new metrics of COVID severity and risk.
Recovery time (average duration of being ill). Disease recovery time is a very important metric for community planning and preparations for pandemics and other health-related disasters. It is an important indicator of disease severity and a critical element in estimating capacity and personnel requirements for medical treatment facilities and the impact of the disease on the economy. Eq. (2) gives the average time to recover (duration of being ill) in terms of the number of cases, the average recovery time (duration of the person’s being ill), and the number of deaths; the same derivation yields the recovery speed and the concurrent number of recovered COVID patients. For the derivation, please refer to Appendix B.3.1. Hereinafter, Recovery Time and Duration of Being Ill will be used interchangeably.
Disease Severity Metric. To streamline the analysis and modeling, and to avoid confounding introduced by correlated features, we formulated a Disease Severity Metric (DSM): a linear combination of three measures, namely total number of cases, new cases (arrival rate), and death toll. Eq. (3) defines the value of the DSM at each timestamp in terms of the set of principal components (see below) and the eigenvalue (size) of each principal component at that timestamp. The rationale behind this approach is as follows. As we explained in Appendix B.1, Johns Hopkins University’s CRC provides three important standard derived metrics, namely total number of cases, new cases (arrival rate), and death toll, and they are pairwise correlated. To resolve the correlations, we used PCA. (See Appendix A.2 for details. For the derivation, please refer to Appendix B.3.2.) The Disease Severity Metric is a statistically sound composite metric combining the signal from new COVID cases, total COVID cases, and number of deaths. It carries as much information as its component metrics, but it removes the redundancy intrinsic in multicollinearity. For these reasons, we use the Disease Severity Metric (DSM), rather than its elements, as an output variable of our model in this study.
Covid risk. Risk is the metric bringing together the severity of the disease and the probability of getting the disease. It is expressed in terms of severity, measured by the Disease Severity Metric, and the COVID rate computed as defined by Eq. (1). For details, please refer to Appendix B.3.3.
Mobility restraint index. In Section 3.1, we stated our hypothesis that mobility, along with socioeconomic status, population density, and government measures, has an effect on the severity and propagation rate of the disease. We have two aggregate metrics of daily mobility during COVID: the proportion of each county’s population that stayed within the same Level 16 Bing tile (600 × 600 m squares) during the day, and the range of movement change, measured as the relative change in the number of boundaries of Level 16 Bing tiles that each individual crossed during the day. They describe two different aspects of mobility restraint within each US county. To combine the information from these two metrics and eliminate redundancy, we formulate a metric of mobility restraint exercised by the population, the Mobility Restraint Index (MRI). Like the Disease Severity Metric, it was formulated using Principal Component Analysis (PCA): we defined the MRI using an equation similar to Eq. (3), in which the MRI at a given time is expressed through the first and the second principal components at that time, as returned by PCA. Please see Appendix C.3 for details.
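To make the PCA-based construction of a composite index more concrete, the following is a minimal sketch for the two-metric case (as with the two mobility metrics). It standardizes both metrics, uses the closed-form eigen-decomposition of a 2 × 2 correlation matrix, and then combines the principal component scores with eigenvalue weights. The eigenvalue-weighted combination is an assumption made for illustration; the exact formulas of this work are given in Eq. (3) and the Appendices and are not reproduced here. All function names and sample values are hypothetical.

```python
import math


def standardize(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def pca_two_metrics(x, y):
    """PCA of two standardized metrics.

    Returns ((lam1, lam2), scores) with lam1 >= lam2 and scores a list of
    (pc1, pc2) pairs, one per observation.
    """
    zx, zy = standardize(x), standardize(y)
    rho = sum(a * b for a, b in zip(zx, zy)) / len(zx)  # Pearson correlation
    sign = 1.0 if rho >= 0 else -1.0
    # For the correlation matrix [[1, rho], [rho, 1]] the eigenvectors are
    # (1, sign)/sqrt(2) and (1, -sign)/sqrt(2), with eigenvalues
    # 1 + |rho| and 1 - |rho| respectively.
    eigenvalues = (1 + abs(rho), 1 - abs(rho))
    root2 = math.sqrt(2)
    scores = [((a + sign * b) / root2, (a - sign * b) / root2)
              for a, b in zip(zx, zy)]
    return eigenvalues, scores


def composite_index(x, y):
    """Assumed form: eigenvalue-weighted average of the two component scores."""
    (lam1, lam2), scores = pca_two_metrics(x, y)
    return [(lam1 * pc1 + lam2 * pc2) / (lam1 + lam2) for pc1, pc2 in scores]


# Hypothetical daily values for one county, for illustration only.
stayed_put = [0.20, 0.30, 0.50, 0.40, 0.60]
range_change = [-0.10, -0.25, -0.35, -0.30, -0.55]
(lam1, lam2), scores = pca_two_metrics(stayed_put, range_change)
index = composite_index(stayed_put, range_change)
```

The sign of each principal component is arbitrary, so in practice the index would need to be oriented so that larger values mean stronger restraint.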
Exploratory data analysis. Important correlations
We analyzed pairwise correlations of COVID metrics with population density, socioeconomic metrics (average personal income), state government response measures, and mobility. One of the important discoveries was that, while the linear correlations are weak (see Fig. 3), most relationships are characterized by one or two critical values of the driving variable. For example, the government-response Stringency Index has a critical value above which we observed a reduction in COVID rate and mortality (Fig. 4). Income level, on the other hand, has two critical values, falling at the Pew Research Center’s boundaries of middle income, between which the Mobility Restraint Index was at its highest (Fig. 6, Fig. 7).
Fig. 3
State Government Measures (OxCGRT data), Mobility, and Risk of COVID.
Fig. 4
StringencyIndex of the State Government Response (OxCGRT data) and Covid Rate (a) and Mortality (b). The critical value of Stringency Index is marked as the vertical red dashed line.
Fig. 6
Income level and Mobility Restraint Index in June 2020. (Other months look similar.) The logarithmic scale of the X axis shows the low-income communities’ trend towards stronger mobility restraint as the average personal income increases. It also shows the upper-income communities’ trend towards weaker mobility restraint as the income increases.
Fig. 7
Pew Research Center minimum income boundaries. MRI and income “buckets” are brought together in Table 4.
State government response measures and other metrics
From Fig. 3, we see that there is a noticeable negative correlation between StringencyIndex (hereinafter SI) and county COVID rate, as well as correlations with other relevant metrics.
Because SI shows the stronger correlations with COVID-related, mobility, and socioeconomic metrics, we focus on it as the primary measure of government response. Table 2 provides the pairwise Pearson correlations between SI and some other metrics. The P values in Fig. 3 and in Table 2 are less than 0.001, meaning that the correlations are statistically significant at the 0.1% level.
Table 2
State Government Stringency Index correlations with COVID risk factors and mobility.
Pair                                          rho
StringencyIndex : COVID Rate                  −0.416
StringencyIndex : Mobility Restraint Index    0.143
StringencyIndex : Disease Severity Metric     0.081
StringencyIndex : Risk                        0.273
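The pairwise correlations in Table 2 and their significance can be checked along the following lines. This is a minimal sketch on synthetic data: the paper does not state how its P values were computed, so the permutation test used here is a swapped-in technique, and the names `pearson_r` and `permutation_p_value` are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D samples."""
    return float(np.corrcoef(x, y)[0, 1])


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation P value for the Pearson correlation.

    ASSUMPTION: a permutation test stands in for whichever test the
    authors used; it shuffles y to break any association with x.
    """
    rng = np.random.default_rng(seed)
    observed = abs(pearson_r(x, y))
    hits = 0
    for _ in range(n_perm):
        if abs(pearson_r(x, rng.permutation(y))) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic (county, day) sample: rate declines as stringency grows.
rng = np.random.default_rng(1)
stringency = rng.uniform(0.0, 100.0, size=1000)
covid_rate = 0.05 - 0.0003 * stringency + rng.normal(0.0, 0.01, size=1000)

rho = pearson_r(stringency, covid_rate)
p_value = permutation_p_value(stringency, covid_rate)
print(f"rho = {rho:.3f}, permutation P value = {p_value:.4f}")
```

With 2000 permutations, the smallest P value the test can report is 1/2001, just under the 0.001 threshold quoted in the text; more permutations would be needed to resolve smaller values.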
Note that the correlations between SI and COVID Risk and DSM (i) are weak and (ii) should not be read as increases in government response stringency causing an increase in disease severity or risk. It is reasonable to assume that in states where COVID was more advanced, governments took stricter measures against the disease. The time shift described in Section 3.3 accounts for the delay in implementation of government measures and the normal time constant inherent in the system.
The visualizations below show that a large part of the variance in the dependent variables (on the vertical axes of Fig. 4 and Fig. 5(b)) is not explainable by the single independent variable shown on the respective horizontal axis. However, it is clear that the relationships shown in these visualizations are nonlinear, and many have a critical point where the behavior of the dependent variable changes. We illustrate this with vertical dotted lines at such points and with arrows highlighting the visible behavior at the lower and the higher values of the independent variables. These illustrations indicate that linear modeling will not work well in describing and predicting the behavior of the metrics of interest, and that we need nonparametric modeling (e.g., Random Forest Regression, RFR) to derive rigorous models. At the same time, the critical points identified in these plots will help inform government policies and the sensitivity analysis of the RFR models. We see from Fig. 4 that there is a nonlinear relationship between the Stringency Index and the rate of COVID spread, as well as between the Stringency Index and COVID mortality.
Fig. 5
Relationship of Mobility Restraint Index to COVID Risk (a) and Mortality (b).
Note that there is a critical value of StringencyIndex, at about 40% of the range, at which the COVID-19 rate starts visibly decreasing. We shall call it the rate-critical value of SI. At the same time, COVID mortality kept increasing past this value, despite the increase in StringencyIndex (Fig. 4(b)). While a piecewise regression analysis would be needed to better identify the critical values and to characterize the behavior of the metrics below, at, and above such values of the StringencyIndex, the focus of this paper goes beyond bivariate correlation analysis: we are building a comprehensive predictive model of COVID risk and severity metrics that accounts for the other features and their interactions. That said, Table 3 provides numerical estimates of the mean and the 90th percentile (p90) of rate and mortality below, at, and above the corresponding critical values (CVs) of the StringencyIndex. The numbers corroborate the visual observations.
Table 3
Stringency Index and COVID rate and mortality below, at, and above the critical values identified in Fig. 4.
Bucket      Rate avg   Rate p90   Mortality avg   Mortality p90
Below CV    0.32       0.086      0.022           0.047
At CV       0.031      0.079      0.031           0.082
Above CV    0.011      0.038      0.028           0.083
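Bucket summaries of the kind shown in Table 3 can be computed as sketched below. The width of the "At CV" band (`half_width`), the function name `bucket_summary`, and the synthetic data are assumptions for illustration; the paper does not state how wide a band around the critical value was used.

```python
import numpy as np


def bucket_summary(x, y, critical_value, half_width):
    """Mean and 90th percentile of y below, at, and above a critical value of x.

    ASSUMPTION: "at" means within +/- half_width of the critical value.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    masks = {
        "Below CV": x < critical_value - half_width,
        "At CV": np.abs(x - critical_value) <= half_width,
        "Above CV": x > critical_value + half_width,
    }
    return {
        name: {
            "avg": float(y[mask].mean()),
            "p90": float(np.percentile(y[mask], 90)),
        }
        for name, mask in masks.items()
    }


# Synthetic data: rate is flat below SI = 40 and declines above it.
rng = np.random.default_rng(2)
stringency = rng.uniform(0.0, 100.0, size=2000)
trend = np.where(stringency < 40.0, 0.03, 0.03 - 0.0004 * (stringency - 40.0))
rate = np.clip(trend + rng.normal(0.0, 0.002, size=2000), 0.0, None)

summary = bucket_summary(stringency, rate, critical_value=40.0, half_width=5.0)
for bucket, stats in summary.items():
    print(f"{bucket:9s} avg={stats['avg']:.4f} p90={stats['p90']:.4f}")
```

Reporting p90 alongside the mean shows whether the upper tail of the metric moves together with its center when the driving variable crosses the critical value.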
The increase in COVID mortality at SI below its critical value is not to be interpreted as a statement that government actions were causing an increase in mortality; it is more reasonable to assume that restrictions imposed by governments were a response to the growing rate and mortality of the disease. However, causality may go in both directions. Indeed, many actions taken by federal, state, and local governments in the US, especially in the early days of the pandemic, were based on extremely limited information about COVID-19’s spread mechanisms and its contagiousness, under conditions of long test turnaround times and a shortage of test kits. These decisions, and delays, often resulted in mistakes, leading to increased contagion and mortality. At a mortality-critical value of StringencyIndex, the tide began to turn.
Mobility restraint index, COVID mortality, and risk
From Eq. (4), it is clear that there is a negative nonlinear correlation between MRI and COVID risk. This is intuitively understandable: counties where people travel more frequently, and over longer distances, are likely to have higher COVID risk than the lower-mobility counties, and we see this in Fig. 5(a). We also see from Fig. 5(b) that there is a clear critical value of MRI above which COVID mortality starts going down.
Mobility restraint index and personal income
The relationship between personal income and mobility restraint due to COVID has turned out to be more complicated than we had expected. We see from Fig. 6 that middle-class counties had a higher Mobility Restraint Index than their wealthier and lower-income counterparts. Income data are strongly skewed, which obscures the Mobility Restraint Index values for the low- and middle-income counties. To resolve this, we used a logarithmic transformation of the X axis in Fig. 6. It shows that MRI in the low-income counties increases with income, reaches its maximum in the middle-class range (see Fig. 6), and goes down at higher incomes. The peak of the Mobility Restraint Index (MRI) falls in the range which, according to the Pew Research Center [30], corresponds to the upper middle class in the US (see Fig. 7). Table 4 shows the aggregate values of MRI in the three income buckets delineated by the vertical lines in Fig. 6.
Table 4
Personal Income and MRI in the three income buckets.
Bucket    MRI avg   MRI p90
Low       0.187     0.369
Middle    0.205     0.388
High      0.175     0.310
We see that both the average and the high (p90) levels of MRI are higher in the middle-class counties than in the predominantly low- and high-income counties. Based on the analysis of Fig. 6 and Fig. 7, we conclude that there may be a causal relationship between mobility restraint and income, with middle-class counties having been more restrained in their mobility. A possible explanation is as follows. Low-income communities have a low Mobility Restraint Index: their members cannot afford to stay at home. Communities in the higher income range, approximately from the middle of the middle-class range, are characterized by the highest MRI values. Circumstantially, we can connect this with (i) the closures and workforce reductions of many businesses considered non-essential during the COVID pandemic, which led to people not driving to and from work every day, and (ii) people being able to work from home during COVID. Finally, for the upper-middle-class and wealthy communities, the Mobility Restraint Index goes down. Following the same logic, we can say that their mobility plays a role in their being in the high-income range. US Census data for personal income from the last pre-COVID year, 2019, were used in this analysis; the economic impact of the pandemic is outside the scope of this research.
Summary of COVID metrics and their factors
In Section 3.5, we made the following important findings: (i) There are many collinear metrics in our datasets, and it is important to be able to group them. Principal Component Analysis (PCA) provides one way of doing so; we used it to formulate the Disease Severity Metric (DSM) and the Mobility Restraint Index (MRI). (ii) There is a strong correlation between mobility restraint and risk of COVID, defined in Eq. (4) as the product of COVID rate and the Disease Severity Metric, which we defined in Section 3.5.2. (iii) There is also a strong nonlinear correlation between socioeconomic conditions (average personal income) and mobility restraint. (iv) Nonlinear correlations exist between MRI and COVID rate, as well as between MRI and COVID mortality. (v) Other variables, such as personal income and population density, interact with MRI, Stringency Index, and COVID metrics.
Results. Model evaluations and factor importance analysis
In this section, we build and evaluate descriptive models that can be used to predict COVID severity and propagation rate, to help inform state policies and identify local measures aimed at preventing the spread of pandemics. Such models can be used in a simulation environment to answer a variety of what-if questions. In their simplest form, they can be used to measure the relative importance of the factors that we have taken into consideration. We used the approach outlined in [29], with some modifications. Specifically, due to the new (GRM) features we derived (see Fig. 1), we did not need to perform any further dimensionality reduction: the set of features is relatively small. The Random Forest Regression (see Appendix A.1 for the rationale behind choosing this method) was fitted using the following features: (i) Mobility Restraint Index (MRI); (ii) Population Density; (iii) Personal Income; (iv) StringencyIndex; (v) ContainmentHealthIndex; (vi) EconomicSupportIndex.
We predicted the following COVID metrics: (i) COVID Rate; (ii) COVID Mortality; (iii) Average duration of being ill (Recovery Time); (iv) Disease Severity Metric (DSM); (v) COVID Risk.
Model hyperparameters and evaluation procedure
In this analysis, we used the default RFR parameters of the sklearn.ensemble.RandomForestRegressor Python class, with one change: we increased the number of decision trees (estimators) in the random forest to 50. (The RandomForestRegressor version we used had a default of 10; beyond 50, we did not notice significant improvements in model performance.) The loss function was set to the L2 norm, i.e., mean squared error (MSE), to maximize the models’ robustness to outliers and pivot points. We enabled bootstrapping to ensure that each decision tree in the random forest got a sufficient number of data points, and we did not limit the maximum number of features used in each tree. Had we obtained poor results in terms of model quality, we would have tuned these hyperparameters for each model.
To evaluate the models and to ascertain their ability to predict unseen data, we used a 3:1 train:test split of the data aggregated by county and day, so that each data point corresponds to a (county, day) pair. As discussed above, we used feature engineering techniques, such as time shifting, to avoid finding spurious correlations. We understand that relationships among the metrics develop over time, and that leaving time out of the list of independent variables therefore adds uncertainty to our models; nevertheless, we focused on finding the underlying fundamental relationships between our features and the predicted metrics. In the future, we plan to add a temporal component to our research (see Section 5.2). Abstracting away time enabled us to streamline the model quality evaluation by applying the standard 3:1 randomized split between the training and the testing datasets.
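The configuration described above can be sketched with scikit-learn as follows. The feature names follow Table 5, but the data and the target are synthetic placeholders, and the random seeds are arbitrary; this is an assumption-laden illustration of the setup, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

feature_names = [
    "mobility_restraint_index",
    "population_density",
    "personal_income",
    "StringencyIndex",
    "ContainmentHealthIndex",
    "EconomicSupportIndex",
]

# ASSUMPTION: synthetic (county, day) rows stand in for the real dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, len(feature_names)))
# Hypothetical nonlinear target driven by two of the six features.
y = 3.0 * X[:, 3] + np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 0.05, size=400)

# 3:1 randomized train:test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 50 trees, bootstrapping on, no cap on features per split.
# The split criterion is left at its default, which is squared error (MSE).
model = RandomForestRegressor(
    n_estimators=50, bootstrap=True, max_features=None, random_state=0
)
model.fit(X_train, y_train)

r2_test = model.score(X_test, y_test)  # out-of-sample R-squared
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(f"out-of-sample R2 = {r2_test:.2f}")
for name, importance in ranking:
    print(f"{name:26s} {importance:.6f}")
```

The impurity-based importances returned by `feature_importances_` sum to one for each fitted model, which is the property that allows the per-metric rankings in Table 5 to be read as shares.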
Model quality analysis
In order to estimate the degree of trust in the ranking of features produced by the Random Forest Regression models, we must know how good the models are. Each of the COVID metrics we analyzed in this research uniquely quantifies the risk, severity, or other aspects of the pandemic. This makes it important to evaluate the model independently, and to identify the order of importance of its features, for each COVID metric. Using a 3:1 train:test split of the datasets for each model enabled us to ascertain that the models generalize sufficiently well. Table 5 summarizes our findings. As it shows, features that remained outside the scope of our work are responsible for 7% to 39% of the variance in the respective COVID metrics. Visual analysis of the best and the worst models, as scatter plots of model prediction against observed metric value, tells us whether the model is good, or whether we need to change the model hyperparameters, search for additional features, or conclude that this direction is not going to be successful.
Table 5
Feature Importance and Model Quality based on Out-Of-Sample R-Squared.
Metric          Feature                    Importance   R2     Unexplained variance
COVID Rate      StringencyIndex            0.360186     0.93   0.07
                ContainmentHealthIndex     0.221393
                population_density         0.166269
                personal_income            0.123443
                EconomicSupportIndex       0.077283
                mobility_restraint_index   0.051426
Recovery Time   mobility_restraint_index   0.274447     0.61   0.39
                StringencyIndex            0.227994
                ContainmentHealthIndex     0.200501
                population_density         0.142050
                personal_income            0.110997
                EconomicSupportIndex       0.044011
Mortality       personal_income            0.220112     0.76   0.24
                ContainmentHealthIndex     0.217296
                population_density         0.216265
                StringencyIndex            0.126412
                mobility_restraint_index   0.118064
                EconomicSupportIndex       0.101851
DSM             population_density         0.619673     0.85   0.15
                mobility_restraint_index   0.121538
                ContainmentHealthIndex     0.091995
                personal_income            0.081905
                StringencyIndex            0.063282
                EconomicSupportIndex       0.021607
Risk            population_density         0.479956     0.86   0.14
                ContainmentHealthIndex     0.154382
                StringencyIndex            0.137285
                personal_income            0.104640
                mobility_restraint_index   0.079215
                EconomicSupportIndex       0.044522
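The last two columns of Table 5 are linked by simple arithmetic: the unexplained variance is one minus the out-of-sample R-squared. The sketch below shows the standard definition of R-squared and reproduces the unexplained-variance column from the reported R2 values; the function name `out_of_sample_r2` is a hypothetical helper.

```python
import numpy as np


def out_of_sample_r2(y_true, y_pred):
    """Coefficient of determination on held-out data: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# R2 values as reported in Table 5.
reported_r2 = {
    "COVID Rate": 0.93,
    "Recovery Time": 0.61,
    "Mortality": 0.76,
    "DSM": 0.85,
    "Risk": 0.86,
}
unexplained = {metric: round(1.0 - r2, 2) for metric, r2 in reported_r2.items()}
for metric, share in unexplained.items():
    print(f"{metric:14s} unexplained variance = {share:.2f}")
```

A model that always predicts the mean of the held-out data scores an R-squared of zero under this definition, so the reported values measure improvement over that baseline.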
Model summary
To visualize model quality, we plotted model prediction as a function of observed values in a scatter plot. When the points fall more or less on the line, there is little bias left in the model prediction to remove. Fig. 8 shows two examples: (a) COVID Rate and (b) Recovery Time. We observe a good fit for COVID Rate, but a clear bias for Recovery Time. In this case, the model for Recovery Time is underfitting the data: we do not have the features needed to explain the variance of this metric.
Fig. 8
Model prediction quality for COVID Rate (a) and for Recovery Time (b). In the ideal fit, all points fall around, and near, the line. Local deviations from this line usually point to bias, which often can be mitigated by expanding the feature search.
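The visual check described for Fig. 8 can be complemented by a numerical one: the mean residual (prediction minus observation) within bins of the observed value. The sketch below is one possible diagnostic, not a procedure taken from the paper; the function name `binned_bias`, the number of bins, and both synthetic predictors are assumptions.

```python
import numpy as np


def binned_bias(observed, predicted, n_bins=5):
    """Mean of (predicted - observed) within quantile bins of observed."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    edges = np.quantile(observed, np.linspace(0.0, 1.0, n_bins + 1))
    # Map each observation to its bin; the maximum goes to the last bin.
    index = np.clip(
        np.searchsorted(edges, observed, side="right") - 1, 0, n_bins - 1
    )
    residual = predicted - observed
    return np.array([residual[index == b].mean() for b in range(n_bins)])


rng = np.random.default_rng(3)
observed = rng.uniform(0.0, 10.0, size=1000)

# A well-fitting model: predictions scatter evenly around the observations.
good = observed + rng.normal(0.0, 0.3, size=1000)
# An underfitting model: predictions are pulled towards the overall mean.
shrunk = 5.0 + 0.5 * (observed - 5.0) + rng.normal(0.0, 0.3, size=1000)

bias_good = binned_bias(observed, good)
bias_shrunk = binned_bias(observed, shrunk)
print("good fit   :", np.round(bias_good, 2))
print("underfitted:", np.round(bias_shrunk, 2))
```

A bias that is positive in the lowest bins and negative in the highest ones is the numerical counterpart of a point cloud that is flatter than the line in the scatter plot.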
Feature analysis
A brief summary of feature importances is below.
COVID Rate: As we see in Table 5, the most important feature for COVID Rate is StringencyIndex, followed by ContainmentHealthIndex, Population Density, and Personal Income. MRI and EconomicSupportIndex are not as critical for COVID rate as the other four features.
Recovery Time: Of the features used in our model, MRI, StringencyIndex, and ContainmentHealthIndex are of comparable importance, while Population Density, Personal Income, and EconomicSupportIndex play a significantly smaller role in this metric. Other factors that are outside the scope of this study will be taken into consideration in follow-up research. They include the COVID-readiness of the local hospitals (availability of ICU beds, of COVID-specific diagnostic and treatment equipment, and of medication), the staffing and fatigue levels of the front-line medical personnel, the supply of personal protective equipment (PPE), and infected individuals’ age, medical history, and immediate pre-COVID conditions.
COVID Mortality: Here too, availability of rapid test kits, hospital personnel, and ICU beds, the supply of personal protective equipment (PPE), and other factors that are outside the scope of this study must be taken into consideration. While the model was not the worst fit, at R2 = 0.76, it still leaves 24% of the variance unexplained. Of the features used in modeling, Personal Income, ContainmentHealthIndex, and Population Density are the top three predictors of COVID Mortality.
Disease Severity Metric: Notably, Population Density is the single most important feature of our model for the Disease Severity Metric (DSM), a linear combination of new cases, death toll, and total number of cases; it is followed, by a large margin, by MRI. Given that the model fit was good at R2 = 0.85 (the model explained 85% of the variance in DSM, leaving only 15% to factors that were out of scope of this research), it is reasonable to say that, despite well-developed medical and emergency infrastructure in the high-density counties, the higher probability of contact with infected individuals has led to higher DSM values there than in the rural (low-density) counties.
COVID Risk: Because of the way we defined risk in Eq. (4), it is no surprise that Population Density is the most important factor for this metric. What is surprising is that MRI has shifted down to the fifth position out of six, displaced by ContainmentHealthIndex, StringencyIndex, and Personal Income. Intuitively, we believe two factors are at play here: (i) MRI does not account for how many county lines the individuals crossed during the day, or for the COVID situation in the counties they entered; and (ii) personal protective equipment (PPE) was becoming available, and mandated by state and local regulations, at about the same time as MRI was restabilizing (Fig. 18). These circumstances have confounded the effect of MRI on COVID Risk. Mathematically, the ranking also makes sense: ContainmentHealthIndex and StringencyIndex are the most important features for COVID Rate, and Rate is one of the components of our definition of COVID Risk.
Fig. 18
Mobility Restraint Index (a), Low-Mobility Fraction of the Population (b), and Mobility Range Change (c).
Discussion and directions of further research
We have identified the metrics related to the propagation of the COVID-19 pandemic. We analyzed their interactions and, based on these interactions, developed composite metrics that work well to predict disease severity, rate, and mortality in a pandemic.
These metrics and disease characteristics were used as features and target variables, respectively, in Random Forest Regression models to rank the features in order of their effect on each of the selected metrics.
Observations and analysis
StringencyIndex, ContainmentHealthIndex, Personal Income, and Population Density, as well as the Mobility Restraint Index, are strong contributors to the COVID-related metrics. EconomicSupportIndex has little effect on any of the five COVID metrics we have considered in this study.
Government measures (their StringencyIndex and ContainmentHealthIndex) have a strong effect on COVID rate, along with (to a smaller degree) Population Density and Personal Income. The Mobility Restraint Index (MRI) has negligible effect on COVID rate.
Mortality is about equally affected by Personal Income, ContainmentHealthIndex, and Population Density. Stringency Index (SI), MRI, and EconomicSupportIndex each have a smaller effect (importances of about 0.10 to 0.13 in Table 5).
Of the metrics considered in this study, duration of being ill (days_sick) needs the most work in the future. Mobility restraint has the biggest effect on it; however, there are other factors that are outside the scope of this work.
The Disease Severity Metric (DSM), a linear combination of new COVID cases, total COVID cases, and COVID death counts, is primarily affected by Population Density and the Mobility Restraint Index (MRI). Personal Income and government measures have a smaller effect on this metric.
COVID Risk, defined as the product of DSM and COVID rate, is primarily affected by Population Density, government health- and containment-related measures, and the overall stringency of government measures. The effects of Personal Income and MRI are significantly smaller.
Further research
This work has posed important questions along the following axes:
RECOVERY TIME: As we saw in Section 4, COVID recovery time is affected by features that remained outside the scope of the current research.
GOVERNMENT MEASURES AND MOBILITY: We saw in Fig. 3 that the correlation between state governments’ StringencyIndex and MRI is very low, at 0.14. Fig. 4 tells us that their relationship is very complex, with possible causal feedback loops, as well as other factors, including political and social events happening at the same time. We would like to better understand the factors at play, to enable state and local governments to form more effective policies and measures that elicit a more intuitive response from the population.
PUBLIC AND SOCIAL MEDIA, GOVERNMENT SUPPORT, AND COMMUNITY RESILIENCE: Public media, and people’s communication through social media, increase people’s awareness of the risks of COVID. This serves as feedback to people’s movement range, moderating mobility metrics. In turn, this reduces the risk of spread of the coronavirus, but it has the unintended consequence of risk to the livelihood and welfare of those people whose income is not high enough to enable them to stay in place through the day (e.g., to work from home). To offset this, federal, state, and local governments have been taking economic support measures, ranging from anti-eviction regulations and rent control to stimulus packages, that are designed to help low-income individuals sustain their livelihoods and welfare during the pandemic. This works as a feedback loop. Its effect on community resilience is very important to understand in order to inform public policies and to foster resilient communities.
TIME AS AN EXPLANATORY VARIABLE: As we discussed in Section 4.1, time is an important parameter affecting how the relationships between the explanatory and the dependent variables evolve. Now that we have a way to reduce the scope of the modeling problems by highlighting the statistically important features via nonparametric modeling, we can incorporate time as an explanatory variable, either (i) by parametric modeling of the relationships we found in the current work and analyzing the time dependency of the parameters of such models, or (ii) by incorporating modern developments in multivariate time series analysis and deep learning, as proposed, e.g., by [16].
Conclusions
Fighting the deadly pandemic with science and data has been paying off. Consequently, many sources of authoritative publicly available data have been formed. To better understand the complexity of the processes happening during a pandemic, and to help formulate policies in the future, we have built predictive models explaining the joint impact of mobility, socioeconomic, and demographic variables on disease severity.
In this process, to measure disease severity, we used the standard metrics of disease rate and mortality, and also defined new statistically and theoretically justified metrics: Disease Severity Metric (DSM), Average Recovery Time (Duration of Being Ill), and Risk. To streamline the analysis of the effects of mobility on COVID metrics, we formulated a metric that we called Mobility Restraint Index (MRI). This enabled us to (i) investigate the applicability of state-government response metrics formulated by Oxford University (OxCGRT) to predicting the outlook of the pandemic; (ii) determine the existence of critical values of OxCGRT composite metrics at which COVID spread slows down and patients’ chances of survival improve; (iii) build comprehensive models connecting socioeconomic, demographic, mobility, and government response metrics with pandemic-related measures, such as rate, mortality, severity, and risk; (iv) rank the factors in the order of their relative importance for each of the pandemic measures, identifying which features should be used in predicting each of the dependent variables. The new metrics and the models contribute to our understanding of the pandemic and open new opportunities for further research. We outlined some of these directions in Section 5.2.
CRediT authorship contribution statement
Alexander Gilgur: Equally contributed to the detailed development of the idea, Contributed the identification of data sources, Data engineering, Tool development, Implementation of the metrics and the algorithms, Analyses of the results. Jose Emmanuel Ramirez-Marquez: Contributed the idea, Advising, Editing, Reviewing during all stages of our work on this paper, Equally contributed to the detailed development of the idea.
Table 6
Statistical Summary of COVID Rate across all US counties.
Statistic   Covid_rate
mean        2.40e−02
std         2.94e−02
min         9.96e−08
max         2.06e−01
Table 7
Statistical summary of Gini Index for a sampling of 828 US counties.