Literature DB >> 33636583

Early warning of vulnerable counties in a pandemic using socio-economic variables.

Damian J Ruck¹, R Alexander Bentley², Joshua Borycz³.

Abstract

In the U.S. in early 2020, heterogenous and incomplete county-scale data on COVID-19 hindered effective interventions in the pandemic. While numbers of deaths can be used to estimate actual number of infections after a time lag, counties with low death counts early on have considerable uncertainty about true numbers of cases in the future. Here we show that supplementing county-scale mortality statistics with socioeconomic data helps estimate true numbers of COVID-19 infections in low-data counties, and hence provide an early warning of future concern. We fit a LASSO negative binomial regression to select a parsimonious set of five predictive variables from thirty-one county-level covariates. Of these, population density, public transportation use, voting patterns and % African-American population are most predictive of higher COVID-19 death rates. To test the model, we show that counties identified as under-estimating COVID-19 on an early date (April 17) have relatively higher deaths later (July 1) in the pandemic.

Entities: Chemical

Keywords: COVID-19; County-scale estimation; LASSO; Negative binomial regression; Pandemics; Social networks; Socioeconomic covariates

Mesh：

Year: 2021 PMID： 33636583 PMCID： PMC8054145 DOI： 10.1016/j.ehb.2021.100988

Source DB: PubMed Journal: Econ Hum Biol ISSN： 1570-677X Impact factor: 2.184

Introduction

Coronavirus disease 2019 (COVID-19), the pandemic that emerged in Wuhan, China in 2019 (World Health Organization, 2020), increased rapidly across the United States in early 2020, challenging the capacity for a coordinated response. In the absence of a vaccine, two strategies for containing the virus have been social distancing and widespread testing (Bedford et al., 2020, Maharaj and Kleczkowski, 2012). Widespread testing reduces selection bias in estimating the numbers of undocumented infections, a crucial variable in the dynamics of spread (Munster et al., 2020, Li et al., 2020). In early 2020, however, there were not enough COVID-19 testing data in the U.S. to predict infections and health care demand (Munster et al., 2020), given substantial heterogeneity in testing rates in both geographic and socio-economic terms (Chowell and Mizumoto, 2020). Under-reporting of COVID-19 infections was likely substantial, perhaps by orders of magnitude in the U.S., both overall and at the county level (Bendavid et al., 2020, Lau et al., 2020). During the urgent early phase of such a pandemic, decisions at the level of both individual behavior and public health response are not only crucial (e.g., Dehning et al., 2020), but “have to be made using the scarce data available” (Zhang and Qian, 2020). Rapid, approximate estimates of infection rates, using online data, are valuable in this phase (Bentley and Ormerod, 2010, McIver and Brownstein, 2014, Bancroft and Lee, 2014, Chunara et al., 2013). Here we use U.S. COVID-19 data, from early (17 April 2020) and later (1 July 2020) in the progression of the pandemic, to test a means of county-scale estimation of pandemic virus infections when testing data are still incomplete and heterogeneous. This then offers a means of identifying the most vulnerable counties that have not yet reported significant statistics. At the scale of U.S. counties, we assume that the numbers of recorded deaths by a given day are the most complete data on the extent of the virus (Baud et al., 2020, Flaxman et al., 2020, Marchant et al., 2020). Even though there will naturally be a distribution of times between infection and death, for purposes of statistical prediction, we assume the number of infections will be proportional to the number of deaths a certain number of days afterward. With the clinical literature as a guide (Huang et al., 2020), we follow Diebner and Timmesfeld (2020) in using regressions to determine the most predictive number of days lag between cases and deaths. We expect the optimal lag, in terms of predictive value to be between one and two weeks. In two clinical studies of patients with confirmed COVID-19 cases in Wuhan, China, in early 2020, the median time from onset of symptoms to ICU admission was 10.5 days ( patients) in one study (Huang et al., 2020) and 10 days ( patients) in another study (Wilson et al., 2020). Comparing the Pearson correlations between the logarithm of cumulative COVID-19 cases versus the logarithm of cumulative COVID-19 deaths in Germany at different time lags, Diebner and Timmesfeld (2020) find that 13 days was the optimal lag. The same 13-day optimal lag was found when comparing log-transformed new (daily) deaths versus cases (Diebner and Timmesfeld, 2020). Other studies determine or use a time delay of 13 days from illness onset to death (Linton et al., 2019, Wang et al., 2020, Yang et al., 2020), so while our regression estimate of ten days (below) delay for the U.S. seems reasonable, we find only limited effect on our regression results by using a 14-day delay. Similarly, an epidemiological model found little impact between using lags of 13, 15 or 18 days (Flaxman et al., 2020). Although COVID-19 mortality rates are age-dependent (Kucharski et al., 2020, Verity et al., 2020), we use a generalized fatality rate in our regressions, as this may be the only variable available from early county-aggregated data, as well as include two age variables in our regressions (mean age and % over 65 years old). In using county-scale death statistics, a source of statistical uncertainty is the relatively low numbers of deaths early in the pandemic. On April 17, 2020, for example, a large portion of U.S. counties were not reporting any deaths, whereas by 1 July 2020 many more counties were reporting nonzero deaths (Fig. 1a and b). Since the early increase of infections was exponential (Fig. S2), as expected in compartmental models (Wu et al., 2020b, Kucharski et al., 2020), the level of under-reporting can outpace increases in testing rates. As the early numbers of COVID-19 cases in states can differ by orders of magnitude (e.g., New York vs. Great Plains states; Figs. 1a and S2), their differences in under-reporting could be substantial.

Fig. 1

Maps of the numbers, per million in each county population, of confirmed COVID-19 cases by April (upper left) versus July 2020 (upper right) and deaths by April (lower left) versus July 2020 (lower right). Gray counties designate zero reported cases/deaths. This raises the concern as to which counties reporting few deaths and cases early in the pandemic might be most vulnerable to relatively higher cases and deaths as the pandemic spreads. To address this, we can make use of the rich co-variate data available at the U.S. county level to improve estimates of under-reporting. Although these county-level estimates are our main objective, the multivariate regression also identifies factors most predictive of COVID-19 deaths. As our objective is prediction, we note that the most predictive variables may or may not be the most causal. Given, for example, the importance of age, physical clustering and pre-existing conditions to COVID-19 (Kucharski et al., 2020, Verity et al., 2020, Centers for Disease Control, 2020, Thompson, 2020, Lu et al., 2020, Yusef et al., 2020), certain county-level variables may have direct correlation; others will have indirect correlation. For example, a likely direct effect on the COVID-19 death to case ratio is the number of hospital ICU beds in the county (Schulte et al., 2020, King, 2020). An indirect effect may be median household income: high income counties may have more jobs that can be performed remotely (del Rio-Chanona et al., 2020). Some variables likely subsume the effects of other variables. Nonwhite populations may have higher rates of COVID-19 infections and/or deaths, for reasons that could include distrust in healthcare (Armstrong et al., 2008), reliance on public transportation (Anderson, 2016), exposure to greater air pollution (Ard and Bullock, 2020) or higher incidence of chronic health conditions (Fang et al., 2020, Lighter et al., 2020). For our ‘early warning’ estimates, we do not use data on protective behaviors such as social distancing and mask wearing. While these behaviors affect (reduce) the number of infections, they are delayed, dynamic responses to the number of cases (see Supplementary Fig. S1) and unlikely to be available early in the pandemic. In the case of COVID-19, for example, survey data on mask wearing were collected and reported in mid-July 2020 (Katz et al., 2020), months after an early warning system could be implemented. In essence, we use the county-level death statistics to estimate how many people in the county were infected by COVID-19 on a given date. Even if the death counts were 100% accurate, however, due to all the counties with low numbers, we need to impute the underlying likelihood of death in those counties. The county-scale likelihood estimates provide non-zero values for predictive purposes on all counties, including sparsely-populated counties or counties that the epidemic has not yet fully reached. Comparing observations of cases and deaths from April and July 2020, we can test whether the method identified in April the most vulnerable counties to subsequent infections in July. Due to the exponential growth in numbers plus the sparseness of count data from many counties, here we will be using a negative binomial regression with the Least Absolute Shrinkage and Selection Operator (LASSO) method to estimate deaths across the country at a given date (a new regression must be carried for each selected date). When these death estimates are divided by a generalized fatality rate, the result is an estimation of the true numbers of infected people in each county. Comparing the estimated numbers of infected people to the observed number of confirmed cases gives us a measure of case under-estimation in each county.

Methods

To improve estimations from heterogeneous COVID-19 death data at the U.S. county level, here we introduce thirty-one covariates (Table 1 ) into a negative-binomial regression. These covariates cover nine broad categories: clustering of population, voting behavior, health-care access, preexisting health conditions, age, ethnicity, links to COVID-19 hotspots, employment and education. Included among these covariates are three scalar variables to capture some of the international spread of COVID-19 from early hotspots (Callaway et al., 2020), specifically China, Italy, and Iran (Table 1). These fixed scalar effects for each U.S. county are derived from the Facebook social connectedness index (Bailey et al., 2018).

Table 1

Covariates in matrix X	Description	Source
Clustering
P	Population size	U.S. Census
ρ	Population density	U.S. Census
ph	Persons per house	U.S. Census
pb	Persons per bedroom	U.S. Census
pr	Persons per room	U.S. Census
Pt	Public transport	U.S. Census

Voteshare
v	% Democratic – Republican	MIT Election Lab

Health care
B	ICU beds per capita	Schulte et al. (2020)
H	Hospitals	Schulte et al. (2020)
U	% no health insurance	U.S. Census

Health
O	% Obese (BMI ≥ 30)	CDC
D	Diabetes	CDC
Ht	Hypertension	Olives et al. (2013)
pm2.5	Air pollution (	Wu et al. (2020a)

Age
a	Mean age	U.S. Census
a65	% over 65 years old	U.S. Census

Ethnicity
Black	Black	U.S. Census
Hispanic	Hispanic	U.S. Census
Native	Native American	U.S. Census

Facebook connections to:
SCICN	China (per capita)	Facebook SCI
SCIIT	Italy (per capita)	Facebook SCI
SCIIR	Iran (per capita)	Facebook SCI

Employment
In	Median household income	U.S. Census
Jp, Js,	Jobs: professional, service	U.S. Census
Jo, Jt, Jr	Jobs: office, trade, transport	U.S. Census
Ic	Incarcerated	Vera Inst. (2020)

Education
EBA	College educated	U.S. Census
EHS	No high school	U.S. Census

Covariates for predicting cumulative COVID-19 deaths and under-reporting rates, at the county level of aggregation. The portal for U.S. Census data (U.S. Census Bureau, 2020a, U.S. Census Bureau, 2020b) is www.census.gov. All variables in Table 1 represent fixed, county-level effects. While dynamic effects of social networks and inter-county travel were surely a factor in coronavirus spread (Maier and Brockmann, 2020), we do not employ such dynamic data here for two reasons. The first is that we aim for early estimation using fixed county-level covariates that would be already available at the onset of a future pandemic. Secondly, there is not currently an established method for adding network covariates to the method we employ here, a LASSO Negative Binomial regression (Hays et al., 2010, Silk et al., 2017, Leifeld and Cranmer, 2019). We reserve the challenge of adding dynamic network data for future work. Since case numbers grow exponentially, our regressions use the logarithm of county-level death counts on day , , where the vector contains entries for each of 3088 counties. Even when log-transformed, however, the count data are likely to be overdispersed and subsequent standard errors underestimated. For this reason, we use a negative binomial distribution of errors in the regressions, which allows us to use number of COVID-19 deaths as count data for predictions. To estimate the numbers of deaths, , in all counties on day , the regression relationship is:where is the vector of weights for the covariates (see Table 1). The errors follow a negative binomial distribution and have a variance for a given mean, , of , where is a dispersion parameter. In the regression, we use sandwich corrected estimates of standard errors (Luque-Fern et al., 2016). Using thirty-one related covariates, while comprehensive, will likely result in both overfitting and colinearity in the Negative Binomial regression. To mitigate these risks, we employ a LASSO penalization in the loss function of the Negative Binomial regression. LASSO selects the most predictive variables by regularization, forcing many of the estimated effect sizes to zero; the most predictive covariates are those left with non-zero effects. Importantly, as LASSO is a method of dimension reduction that focuses on prediction, covariates that are set to zero may actually be causal in the real world. Conversely, highly predictive covariates may actually not be causal, they may have just subsumed the variance of many other truly causal covariates. In the LASSO process, the choice of the regularization parameter, , is important, as higher results in fewer non-zero parameters. We optimize using 2-fold cross-validation over a range of possible values. The LASSO regularization minimizes the following function, which plays-off the log likelihood, versus the sum of the individual coefficients in multiplied by (Lehman and Archer, 2019):where in this case, indexes the 3088 counties in the sample. For a given likelihood , the higher is, the smaller must be, and the fewer non-zero parameters are allowed. We apply this negative binomial regression to COVID-19 deaths data on both 17 April and 1 July, 2020. To maximize the number of non-zero predictors, we chose the largest that did not markedly reduce the out-of-sample predictive power of the LASSO regression. This turned out to be ; higher values reduced the cross-validated log likelihood of the LASSO regression (see Fig. S3). Next, as our estimate of the number actually infected in county on date , , we divide by the fatality rate, . We then define the under-estimation in the county, , as the ratio of confirmed cases on record, , to . Measuring under-estimation as a ratio—the difference of the log-transformed values ()—rather than a simple difference helps account for the highly skewed distribution of cases across counties. Using vector notation to represent all 3088 counties in the sample, we determine : Using data from the two dates allows us to determine whether determined for low-data counties in April is predictive of unexpectedly high numbers of COVID-19 deaths in July.

Results

For estimating COVID-19 infection rates from death rates, we first determined an optimum lag between COVID-19 infection and death statistics (log-transformed) in 3088 counties. We ran forty regressions of the logged number of new deaths on April 13 against the logged number of new cases, at lag times ranging between 1 and 40 days. The largest value occurred with a lag of (see Fig. 2 ), which is in line with findings from other studies (Backer et al., 2020, Huang et al., 2020, Russell et al., 2020, Wilson et al., 2020).

Fig. 2

Left: COVID-19 cases on day versus deaths on day (both log scale) for all U.S. counties with more than 1,000,000 people, from to days’ delay. All are comparing versus a case date of 04/13/2020. Right: Daily deaths versus daily cases in the U.S., with logarithmic y-axis. We confirm that our results are robust to variation in lag time by showing our under-reporting statistic still predicts more future deaths, using mean lag times between contracting COVID-19 and death, , of 0, 7, 10 and 14 days (Supplementary Figs. S4–S6). We find positive slopes in all cases (Supplementary Tables S3–S6), indicating that our under-reporting statistic is predictive of a higher number of future COVID-19 cases. We stress that the value of is strictly a parameter for use in prediction, not a definitive statement about duration of infection for individuals. Indeed the parameter —which we estimated for COVID-19 in the U.S. in 2020—would ideally be estimated independently for each case study. The lag would likely differ, for example, in a future pandemic and/or within health/hospital systems of the world (e.g., Wood, 2020).

Five predictive covariates

Of the thirty-one variables (Table 1) entered into the LASSO negative binomial regression, five were retained by the LASSO analysis at the optimal level of as most predictive of COVID-19 deaths by county (Table 2 ). Consistent with existing literature (discussed below), the five predictive variables are: population size, population density, public transport, voteshare, and % African-American population. Table 2 shows their regression coefficients via Eq. (1). These five variables explain a significant proportion of the variance in county death data: comparing predicted versus confirmed death counts (both log-transformed) yields for 17 April, and for 1 July (Table 2). Using these coefficients, Fig. 3 compares vs in all counties for two dates in 2020, April 17 and July 1. Using a similar LASSO regression (see Supplement), we confirm that deaths are a much better predictor of than the full 31 covariates of Table 1. By themselves, these fixed socioeconomic variables cannot predict which counties are likely to under-report COVID-19 cases.

Table 2

	Deaths	Deaths
Co-variate	17 April	1 July
Population size (logP)	0.87	0.92
Population density (logρ)	0.14	0.13
Public transport (Pt)	5.36	1.36
Voteshare (v)	0.18	0.05
% African American	1.07	1.96
R2 vs. observed deaths	0.47	0.60

Fig. 3

Predicted log-transformed COVID-19 infections, () versus log-transformed confirmed cases (). Each plot represents a different state, and datum points are counties within each state. Colors show determinations for two dates, April 17, 2020 (red) and July 1, 2020 (blue). Regression lines use points weighted by the county populations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Coefficients for the five variables retained by the LASSO regression () for prediction of COVID-19 deaths and cases, on 17 April 2020 and 1 July 2020. Variables from Table 1 with zero effect at are not shown. Population variables were log-transformed. There were 3088 observations. For a full list of effects at different levels of , see Supplementary Tables. Predicted log-transformed COVID-19 infections, () versus log-transformed confirmed cases (). Each plot represents a different state, and datum points are counties within each state. Colors show determinations for two dates, April 17, 2020 (red) and July 1, 2020 (blue). Regression lines use points weighted by the county populations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Since LASSO is a factor-reduction technique, rather than a definitive statement of causality, we discuss first the predictive value of these five covariates as an ‘early warning system’ to identify US counties of particular concern. The predictions are mapped at county level in Fig. 4 , which shows county-level predicted COVID-19 cases and deaths for April 17, and July 1, 2020. The maps fill in the data gaps across the less populated counties in the middle of the country, including the Great Plains (cf. Fig. 1). Similar to the raw numbers (Fig. 1), the predicted numbers are highest on the coasts, the northeast, Florida, and major urban areas.

Fig. 4

County-level predicted COVID-19 infections, (top row) and deaths, (bottom row) for April 17, and July 1, 2020. Data are log-transformed.

County-level predicted COVID-19 infections, (top row) and deaths, (bottom row) for April 17, and July 1, 2020. Data are log-transformed. Fig. 5 shows county-level maps of the under-estimation measure, , for April 17 and July 1. These maps of have less obvious patterning than maps of observed infections or deaths (Fig. 1, Fig. 4). Many of the highest values of in July are found in counties in the northern states (Maine, Idaho, Montana and Michigan), as well as parts of northern California and Oregon. Using a fatality rate of 1%, our estimations of actual infections, , are one to two orders of magnitude larger than confirmed case numbers, , as shown in Fig. 6 a, which shows both and on April 17 and July 1. This is broadly consistent with localized estimates, such as a California county where the COVID-19 infection rate was “50–85-fold more than the number of confirmed cases” in early April (Bendavid et al., 2020). In China, by comparison, 6 out of 7 COVID-19 infections (before 23 January) were potentially not reported (Li et al., 2020).

Fig. 5

COVID-19 under-estimation, , for April 17, and July 1, 2020.

Fig. 6

(a) Deaths versus COVID-19 infections in each county. (Green dots: confirmed cases vs deaths on 17 April, 2020. Blue dots: confirmed cases vs deaths on 1 July, 2020. Red band: Predicted infections for a range of fatality rates between 0.6 and 2%.) (b) Under-reporting on 17 April, 2020 vs deaths on July 1, 2020; where green circles are counties where on 17 April (), blue circles where () and red circles where (). Solid lines are best fits from a negative binomial regression. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

COVID-19 under-estimation, , for April 17, and July 1, 2020. (a) Deaths versus COVID-19 infections in each county. (Green dots: confirmed cases vs deaths on 17 April, 2020. Blue dots: confirmed cases vs deaths on 1 July, 2020. Red band: Predicted infections for a range of fatality rates between 0.6 and 2%.) (b) Under-reporting on 17 April, 2020 vs deaths on July 1, 2020; where green circles are counties where on 17 April (), blue circles where () and red circles where (). Solid lines are best fits from a negative binomial regression. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 6a shows that the gap between the confirmed cases and the predicted infections (red band) increases with number of deaths (i.e., population size). We test how well estimated for low-information counties on 17 April identifies those counties with high death counts by July. For low-data counties, our estimate of can offer an indicator of risk at the county level, as shown in Fig. 6b. Determining for 17 April, we predict which counties were at most risk for relatively high levels of COVID-19 by July 1. Fig. 6b shows that, for counties with little data in April 2020, a high under-reporting estimate was an effective early warning signal of excess deaths by July 2020. Fig. 6b shows this for the 1171 counties with 0 (), 1 () or 2 () confirmed cases in April. The fits show that counties with larger under report scores had higher COVID-19 deaths, ranging between just over 0 and 5 on average. This may seem like a small number, but with our assumed fatality rate of 1%, this represents an outbreak stretching into the hundreds in counties with fewer than 3 reported cases in April. There is also a substantial range on these predictions, with 13 out of the 1171 counties reporting 15 or more deaths, indicating an outbreak of more than 1500 cases.

Discussion and conclusion

In this prediction exercise, certain variables identified by LASSO explain more of the variation in the outcome than others. We should avoid assigning actual causation with the value these variables have in predicting true number of infections. Nevertheless, the five covariates in Table 2 are among the prominent risk factors listed by the CDC, and we may speculate on how they relate to the twenty-two variables not retained by the LASSO at the optimized value (see Supplement for results using other values of ). Equally notable are the twenty-two variables from Table 1 not returned in the LASSO results (Table 2). This does not mean these variables are not important in the real world, but that for the purposes of predicting case numbers, the information in the five are sufficient to supersede, or encompass, the information residing in the other twenty-six. Notably, none of the four pre-existing health conditions—obesity, diabetes, hypertension or pollution—was selected by LASSO as predictive of COVID-19 deaths or cases. Given the studies showing these to be genuine risk factors for individuals (Lighter et al., 2020), one of the five variables in Table 2 must be subsuming their predictive effects. The importance of African-American proportion of the population is consistent with the literature on socioeconomic correlates with COVID-19. In different ZIP codes of New York City, for example, Lieberman-Cribbin et al. (2020) found that, as the proportion of white residents increased, the number of COVID-19 tests increased and fraction of those testing positive decreased. Again, as LASSO is a factor-reduction method, we suspect that the proportion African American has such predictive significance that it subsumed the predictive effects of other covariates such as diabetes, hypertension, pollution and income. This may explain why income was not an important predictive variable in the LASSO results, despite the relative lack of testing resources in poorer U.S. counties (Schulte et al., 2020, van Dorn et al., 2020) and the overall disparity in COVID-19 testing efforts and resources attributable to socioeconomic and racial disparities in healthcare access (Karaye and Horney, 2020, Lieberman-Cribbin et al., 2020). As COVID-19 mortality rates depend on an individual's age (Kucharski et al., 2020, Verity et al., 2020), it may seem surprising to see both % over 65 years old and mean adult age not among the most important factors emerging from the LASSO results (Table 2). Connections with Italy, China or Iran were also not among the five most predictive factors, in April or in July. This despite evidence that the U.S. received confirmed “index” cases from Europe and from China (Spiteri et al., 2020). Voteshare in the last election (% Democratic % Republican) was an important variable in predicting deaths, both in April 17 and on July 1 (Table 2). The use of two dates helps us rule out the effect of counties in New York state acting as outliers. In April 2020, New York state had the highest number of reported COVID-19 cases, and was also among the most Democratic-leaning. The voteshare effect remains, however, by July 1, when both cases and deaths were higher in many other parts of the country, including counties in the U.S. South, where voteshare is much different. It could be that democratic voteshare captures aspects left unrepresented by the other crowding variables. Various surveys in the U.S. in 2020 (FiveThirtyEight, 2020) have shown Democratic voters to have higher levels of concern about COVID-19 than Republican voters. Given that Democratic voteshare predicts more deaths, this might be due to higher levels of liberal behavior among states from the West coast and North-East driven by differences in culture (Harrington and Gelfand, 2014). Democratic voteshare may act as a proxy for more openness, tolerance (Ruck et al., 2020) and looser norms (Harrington and Gelfand, 2014). By contrast, collectivist cultures may be better equipped to mitigate a pandemic through a tendency to obey authority (Kemmelmeier et al., 2003), conform with peers (Murray et al., 2011) and engage in less physical contact (Wu et al., 2019). Further consideration of this hypothesis would need to account for the demographics of strongly Democratic counties. Overall, we find that the use of socioeconomic determinants allows—as a supplement to existing SIR models—for rapid, early warning estimations for vulnerable counties at times when county-scale reporting data is heterogeneous and incomplete. In using death counts with a ten-day time lag, we were able to predict which counties were under-reporting COVID-19 cases in April and validate these predictions against fatality rates in July. In future pandemics this ‘early warning system’ could be used to identify vulnerable counties where disease outbreaks have not yet occurred. False positives will be produced but this may be improved by expanding our set of 31 covariates. While the results present a parsimonious set of socioeconomic risk factors for COVID-19 prevalence, additional covariate data will inevitably become available for early warning tools in future events. With further research, the methodology we have laid out here can be adapted to incorporate dynamic and/or network data, such as seasonality and ultraviolet light exposure (Carleton et al., 2020, Merow and Urban, 2020), inter-county migration or the Facebook connectedness Index. The incorporation of the more complex data into these early warning tools is a goal for future work.

Author contributions

Conception and design of study: R.A. Bentley, D.J. Ruck, J. Borycz Acquisition of data: D.J. Ruck, J. Borycz, R.A. Bentley Analysis and/or interpretation of data: D.J. Ruck, R.A. Bentley, J. Borycz Drafting the manuscript: R.A. Bentley Revising the manuscript: R.A. Bentley, D.J. Ruck Approval of the submitted manuscript: R.A. Bentley, D.J. Ruck, J. Borycz

Data accessibility

Data and relevant code for this research work are stored in GitHub repository: https://github.com/damianruck/EWS-pandemic-socioeconomic-variables.

Conflict of interest

The authors declare that there is no conflict of interest.

47 in total

1. A rapid method for assessing social versus independent interest in health issues: a case study of 'bird flu' and 'swine flu'.

Authors: R Alexander Bentley; Paul Ormerod
Journal: Soc Sci Med Date: 2010-04-24 Impact factor: 4.634

2. The coronavirus pandemic in five powerful charts.

Authors: Ewen Callaway; David Cyranoski; Smriti Mallapaty; Emma Stoye; Jeff Tollefson
Journal: Nature Date: 2020-03 Impact factor: 49.962

3. Disparities in COVID-19 Testing and Positivity in New York City.

Authors: Wil Lieberman-Cribbin; Stephanie Tuminello; Raja M Flores; Emanuela Taioli
Journal: Am J Prev Med Date: 2020-06-25 Impact factor: 5.043

4. Culture Change and Affectionate Communication in China and the United States: Evidence From Google Digitized Books 1960-2008.

Authors: Michael Shengtao Wu; Boyuan Li; Liangliang Zhu; Chan Zhou
Journal: Front Psychol Date: 2019-05-22

5. Prevalence, awareness, treatment, and control of hypertension in United States counties, 2001-2009.

Authors: Casey Olives; Rebecca Myerson; Ali H Mokdad; Christopher J L Murray; Stephen S Lim
Journal: PLoS One Date: 2013-04-05 Impact factor: 3.240

6. COVID-19 exacerbating inequalities in the US.

Authors: Aaron van Dorn; Rebecca E Cooney; Miriam L Sabin
Journal: Lancet Date: 2020-04-18 Impact factor: 79.321

7. The COVID-19 pandemic in the USA: what might we expect?

Authors: Gerardo Chowell; Kenji Mizumoto
Journal: Lancet Date: 2020-04-04 Impact factor: 79.321

8. Real estimates of mortality following COVID-19 infection.

Authors: David Baud; Xiaolong Qi; Karin Nielsen-Saines; Didier Musso; Léo Pomar; Guillaume Favre
Journal: Lancet Infect Dis Date: 2020-03-12 Impact factor: 25.071

9. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2).

Authors: Ruiyun Li; Sen Pei; Bin Chen; Yimeng Song; Tao Zhang; Wan Yang; Jeffrey Shaman
Journal: Science Date: 2020-03-16 Impact factor: 47.728

10. Effective containment explains subexponential growth in recent confirmed COVID-19 cases in China.

Authors: Benjamin F Maier; Dirk Brockmann
Journal: Science Date: 2020-04-08 Impact factor: 47.728

6 in total

1. Inequality between identity groups and social unrest.

Authors: Christian Houle; Damian J Ruck; R Alexander Bentley; Sergey Gavrilets
Journal: J R Soc Interface Date: 2022-03-23 Impact factor: 4.118

2. Exploring COVID-19 Daily Records of Diagnosed Cases and Fatalities Based on Simple Nonparametric Methods.

Authors: Hans H Diebner; Nina Timmesfeld
Journal: Infect Dis Rep Date: 2021-04-01

3. Asian-Origin Approved COVID-19 Vaccines and Current Status of COVID-19 Vaccination Program in Asia: A Critical Analysis.

Authors: Chiranjib Chakraborty; Ashish Ranjan Sharma; Manojit Bhattacharya; Govindasamy Agoramoorthy; Sang-Soo Lee
Journal: Vaccines (Basel) Date: 2021-06-04