Theresa Kuchler, Dominic Russel, Johannes Stroebel.
Abstract
We use aggregated data from Facebook to show that COVID-19 is more likely to spread between regions with stronger social network connections. Areas with more social ties to two early COVID-19 "hotspots" (Westchester County, NY, in the U.S. and Lodi province in Italy) generally had more confirmed COVID-19 cases by the end of March. These relationships hold after controlling for geographic distance to the hotspots as well as the population density and demographics of the regions. As the pandemic progressed in the U.S., a county's social proximity to recent COVID-19 cases and deaths predicts future outbreaks over and above physical proximity and demographics. In part due to its broad coverage, social connectedness data provides additional predictive power to measures based on smartphone location or online search data. These results suggest that data from online social networks can be useful to epidemiologists and others hoping to forecast the spread of communicable diseases such as COVID-19.
Keywords: COVID-19; Communicable disease; Coronavirus; Social connectedness
Year: 2021 PMID: 35250112 PMCID: PMC8886493 DOI: 10.1016/j.jue.2020.103314
Source DB: PubMed Journal: J Urban Econ ISSN: 0094-1190
Fig. 1. Social Network Distributions from Westchester and COVID-19 Cases in the U.S.
Note: Panel (a) shows the social connectedness to Westchester for U.S. counties. Panel (b) shows the number of confirmed COVID-19 cases per 10,000 residents by U.S. county on March 30, 2020. Panels (c) and (d) show binscatter plots with counties more than 50 miles from Westchester as the unit of observation. To generate the plot in Panel (c), we group counties by their social connectedness to Westchester into 100 equal-sized bins and plot the average connectedness in each bin against the corresponding average case density. Panel (d) is constructed in a similar manner, except that we first regress social connectedness and cases per 10,000 residents on a set of control variables and plot the residualized values on each axis. Red lines show quadratic fit regressions. The controls for Panel (d) are 100 dummies for the percentile of the county’s geographic distance to Westchester; population density; median household income; and dummies for the six National Center for Health Statistics Urban-Rural county classifications.
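The binscatter construction described in the note can be sketched as follows. This is a minimal illustration on made-up data, not the authors' code: it assumes a single categorical control (a hypothetical distance bucket), for which regressing on a full set of dummies is the same as subtracting group means. The function names `residualize_on_groups` and `binscatter` are introduced here for illustration only, and the quadratic fit lines are omitted.

```python
from collections import defaultdict
from statistics import mean


def residualize_on_groups(values, groups):
    """Residuals from regressing `values` on a full set of group dummies.

    When dummies for one categorical variable are the only regressors,
    the OLS fitted value is the group mean, so the residual is the
    deviation from that mean.
    """
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    group_mean = {g: mean(vs) for g, vs in by_group.items()}
    return [v - group_mean[g] for v, g in zip(values, groups)]


def binscatter(x, y, n_bins):
    """Sort by x, cut into n_bins near-equal-sized bins, average x and y per bin."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty bins when n_bins exceeds the sample size
            points.append((mean(p[0] for p in chunk), mean(p[1] for p in chunk)))
    return points


# Toy data (made up): connectedness, case density, and a distance bucket.
connectedness = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cases = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
distance_bucket = ["near", "near", "near", "near", "far", "far", "far", "far"]

# Analogue of Panel (c): raw values.
raw_points = binscatter(connectedness, cases, n_bins=4)

# Analogue of Panel (d): both axes residualized on the control first.
resid_points = binscatter(
    residualize_on_groups(connectedness, distance_bucket),
    residualize_on_groups(cases, distance_bucket),
    n_bins=4,
)
```

With several controls, the group-demeaning step would be replaced by the residuals of a full multivariate regression.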
Fig. 2. Incremental R-Squared from Adding Connections to Individual U.S. Counties.
Note: Panels show results from regressions to predict COVID-19 cases per 10k people by county on March 30, 2020. The incremental R-squared is the increase in R-squared from adding measures of social connectedness to a particular U.S. county, over and above a set of baseline control variables: 100 dummies for percentiles of distance to the county under investigation; population density; median household income; and dummies for the six National Center for Health Statistics Urban-Rural county classifications. The graphs show the distributions of the incremental R-squared values from adding social connectedness to each county with a population over 50,000 in turn. Each regression in panels (a) and (b) excludes counties within 50 and 150 miles of the county of interest, respectively. In each panel the 10 largest incremental R-squared values are labeled.
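The incremental R-squared idea can be illustrated with a small sketch: fit a baseline regression, refit with a connectedness regressor added, and take the difference in R-squared. The data below are made up, the baseline contains a single control rather than the full set of dummies listed in the note, and `ols_r2` is a hypothetical helper written for this example.

```python
def ols_r2(columns, y):
    """R-squared from an OLS regression of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k+1) matrix.
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy data (made up): one baseline control and one connectedness measure.
control = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
connectedness = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0]
cases = [1 + 2 * c + 3 * s for c, s in zip(control, connectedness)]

r2_base = ols_r2([control], cases)
r2_full = ols_r2([control, connectedness], cases)
incremental_r2 = r2_full - r2_base
```

Repeating this for each candidate county, with that county's connectedness as the added regressor, would give one incremental value per county, which is what the figure's distributions summarize.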
Fig. 3. Social Network Distributions of Lodi and COVID-19 Cases in Italy.
Note: Panel (a) shows the social connectedness to Lodi for Italian provinces. Panel (b) shows the number of confirmed COVID-19 cases by Italian province on March 30, 2020. Panels (c) and (d) show binscatter plots with provinces more than 50 km from Lodi as the unit of observation. To generate the plot in Panel (c), we group provinces by their social connectedness to Lodi into 30 equal-sized bins and plot the average connectedness in each bin against the corresponding average case density. Panel (d) is constructed in a similar manner, except that we first regress social connectedness and cases per 10,000 residents on a set of control variables and plot the residualized values on each axis. Red lines show quadratic fit regressions. The controls for Panel (d) are 20 dummies for quantiles of the province’s geographic distance to Lodi; GDP per inhabitant; and population density.
COVID-19 Case Growth and Prior Proximity to Cases.
| Panel A | log(Change in Cases per 10k Residents + 1) | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
| 2 Week Lag: | 0.589*** | 0.415*** | | | | | 0.414*** | 0.321*** |
| log(Change in Social Proximity to Cases + 1) | (0.041) | (0.036) | | | | | (0.041) | (0.037) |
| 4 Week Lag: | -0.124*** | -0.080** | | | | | -0.002 | 0.010 |
| log(Change in Social Proximity to Cases + 1) | (0.037) | (0.032) | | | | | (0.036) | (0.032) |
| Share of Friends within 50 Miles | | | 0.096 | 0.031 | | | 0.050 | 0.076 |
| | | | (0.106) | (0.086) | | | (0.100) | (0.082) |
| Share of Friends within 150 Miles | | | 0.018 | 0.214* | | | -0.256** | 0.143 |
| | | | (0.123) | (0.113) | | | (0.124) | (0.109) |
| 2 Week Lag: | | | | | 1.432*** | 1.754*** | 1.244*** | 1.388*** |
| log(Change in Physical Proximity to Cases + 1) | | | | | (0.129) | (0.184) | (0.118) | (0.176) |
| 4 Week Lag: | | | | | -1.208*** | -1.433*** | -1.037*** | -1.225*** |
| log(Change in Physical Proximity to Cases + 1) | | | | | (0.131) | (0.196) | (0.121) | (0.187) |
| 2 Week Lag: | 0.317*** | 0.316*** | 0.646*** | 0.526*** | 0.604*** | 0.514*** | 0.372*** | 0.351*** |
| log(Change in Cases per 10k Residents + 1) | (0.022) | (0.018) | (0.012) | (0.011) | (0.011) | (0.010) | (0.022) | (0.019) |
| 4 Week Lag: | 0.113*** | 0.092*** | 0.077*** | 0.063*** | 0.097*** | 0.072*** | 0.071*** | 0.056*** |
| log(Change in Cases per 10k Residents + 1) | (0.019) | (0.016) | (0.009) | (0.008) | (0.009) | (0.008) | (0.019) | (0.017) |
| Time x Pop. Density FEs | Y | Y | Y | Y | Y | Y | Y | Y |
| Time x Median Household Income FEs | Y | Y | Y | Y | Y | Y | Y | Y |
| Time x State FEs | | Y | | Y | | Y | | Y |
| Sample Mean | 2.177 | 2.177 | 2.177 | 2.177 | 2.177 | 2.177 | 2.177 | 2.177 |
| R-Squared | 0.717 | 0.755 | 0.706 | 0.752 | 0.718 | 0.754 | 0.725 | 0.757 |
| N | 47,040 | 47,025 | 47,040 | 47,025 | 47,040 | 47,025 | 47,040 | 47,025 |
| Panel B | log(Change in Deaths per 10k Residents + 1) | | | | | | | |
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
| 4 Week Lag: | 0.471*** | 0.240*** | | | | | 0.273*** | 0.141*** |
| log(Change in Social Proximity to Deaths + 1) | (0.058) | (0.049) | | | | | (0.049) | (0.046) |
| 8 Week Lag: | -0.018 | -0.057 | | | | | 0.187*** | 0.084* |
| log(Change in Social Proximity to Deaths + 1) | (0.054) | (0.041) | | | | | (0.052) | (0.043) |
| Share of Friends within 50 Miles | | | 0.109 | 0.149** | | | 0.060 | 0.156** |
| | | | (0.076) | (0.070) | | | (0.066) | (0.066) |
| Share of Friends within 150 Miles | | | 0.040 | 0.129 | | | -0.014 | 0.116 |
| | | | (0.083) | (0.078) | | | (0.081) | (0.074) |
| 4 Week Lag: | | | | | 0.738*** | 0.899*** | 0.691*** | 0.802*** |
| log(Change in Physical Proximity to Deaths + 1) | | | | | (0.069) | (0.125) | (0.067) | (0.124) |
| 8 Week Lag: | | | | | -0.657*** | -0.828*** | -0.699*** | -0.865*** |
| log(Change in Physical Proximity to Deaths + 1) | | | | | (0.077) | (0.136) | (0.078) | (0.142) |
| 4 Week Lag: | 0.163*** | 0.230*** | 0.467*** | 0.366*** | 0.425*** | 0.361*** | 0.247*** | 0.276*** |
| log(Change in Deaths per 10k Residents + 1) | (0.032) | (0.027) | (0.021) | (0.018) | (0.018) | (0.016) | (0.027) | (0.026) |
| 8 Week Lag: | 0.016 | 0.052** | 0.025 | 0.019 | 0.063*** | 0.032* | -0.064** | -0.019 |
| log(Change in Deaths per 10k Residents + 1) | (0.033) | (0.022) | (0.019) | (0.018) | (0.018) | (0.018) | (0.032) | (0.023) |
| Time x Pop. Density FEs | Y | Y | Y | Y | Y | Y | Y | Y |
| Time x Median Household Income FEs | Y | Y | Y | Y | Y | Y | Y | Y |
| Time x State FEs | | Y | | Y | | Y | | Y |
| Sample Mean | 0.375 | 0.375 | 0.375 | 0.375 | 0.375 | 0.375 | 0.375 | 0.375 |
| R-Squared | 0.374 | 0.455 | 0.360 | 0.454 | 0.384 | 0.459 | 0.392 | 0.461 |
| N | 21,952 | 21,945 | 21,952 | 21,945 | 21,952 | 21,945 | 21,952 | 21,945 |
Note: Table shows results from regression 5. In Panel A, each observation is a county × two-week period (between March 30 and November 2, 2020). The dependent variable is the log of one plus the number of new COVID-19 cases per 10,000 residents. In Panel B, each observation is a county × four-week period (between April 28 and November 2, 2020). The dependent variable is the log of one plus the number of new COVID-19 deaths per 10,000 residents. Columns 1 and 2 include the log of growth in social proximity to cases (deaths) lagged by one and two periods (two and four weeks in Panel A, four and eight weeks in Panel B). Columns 5 and 6 include analogous measures of physical proximity to cases (deaths). Columns 3 and 4 also control for the share of a county’s Facebook connections that are within 50 and 150 miles. Columns 7 and 8 include all measures. All columns include controls for one and two period lagged changes in cases (deaths), as well as time-specific fixed effects for percentiles of county population density and median household income. Columns 2, 4, 6, and 8 include additional time × state fixed effects. Standard errors are clustered at the time × state level. Significance levels: *(p<0.10), **(p<0.05), ***(p<0.01).
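As a worked illustration of the dependent variable described in the note, the sketch below turns a series of cumulative counts into log(1 + new cases per 10,000 residents) for each period. The counts and population are made up, and `log_new_per_10k` is a hypothetical helper name; the sketch assumes cumulative counts never fall between periods.

```python
import math


def log_new_per_10k(cumulative, population):
    """log(1 + new cases per 10,000 residents) for each period after the first."""
    out = []
    for prev, curr in zip(cumulative, cumulative[1:]):
        new_per_10k = (curr - prev) / population * 10_000
        # log1p(x) = log(1 + x); a period with no new cases maps to exactly 0.
        out.append(math.log1p(new_per_10k))
    return out


# Made-up cumulative case counts at the end of consecutive two-week periods.
series = log_new_per_10k([0, 50, 50, 250], population=100_000)
```

Adding one before taking logs keeps periods with zero new cases in the sample, which a plain logarithm would drop.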
COVID-19 Case Growth and Prior Proximity to Cases, by Two-Week Period.
| | log(Change in Cases per 10k Residents + 1) | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | March 31 - April 13 | April 14 - April 27 | April 28 - May 11 | May 12 - May 25 | May 26 - June 8 | June 9 - June 22 | June 23 - July 6 | July 7 - July 20 | July 21 - Aug. 10 | Aug. 11 - Aug. 24 | Aug. 25 - Sep. 7 | Sep. 8 - Sep. 21 | Sep. 22 - Oct. 5 | Oct. 6 - Oct. 19 | Oct. 20 - Nov. 2 |
| 2 Week Lag: | 0.735*** | 0.411*** | 0.150** | 0.204*** | 0.580*** | 0.178** | 0.287*** | 0.225*** | 0.243*** | 0.305*** | 0.314*** | 0.152** | 0.371*** | 0.149** | 0.313*** |
| log(Change in Social Proximity to Cases + 1) | (0.093) | (0.088) | (0.060) | (0.061) | (0.062) | (0.074) | (0.057) | (0.067) | (0.072) | (0.077) | (0.080) | (0.068) | (0.073) | (0.069) | (0.069) |
| 4 Week Lag: | 0.339 | -0.190 | 0.157* | 0.053 | -0.116* | 0.196*** | 0.057 | 0.097 | 0.084 | 0.024 | -0.046 | 0.041 | 0.128* | 0.049 | -0.141** |
| log(Change in Social Proximity to Cases + 1) | (0.434) | (0.127) | (0.082) | (0.060) | (0.062) | (0.074) | (0.058) | (0.062) | (0.069) | (0.075) | (0.083) | (0.072) | (0.070) | (0.068) | (0.065) |
| Share of Friends within 50 Miles | 0.250 | -0.167 | 0.069 | -0.180 | -0.218 | 0.039 | 0.484* | -0.424* | 0.774*** | 0.203 | 0.060 | -0.547** | -0.455* | 0.398* | 0.776*** |
| | (0.247) | (0.292) | (0.278) | (0.272) | (0.261) | (0.281) | (0.262) | (0.251) | (0.232) | (0.253) | (0.267) | (0.259) | (0.250) | (0.229) | (0.214) |
| Share of Friends within 100 Miles | 0.066 | 0.657* | 0.259 | 0.913*** | 0.191 | -0.043 | -0.514* | 0.824*** | -0.213 | 0.300 | -0.018 | 0.629** | 0.845*** | -0.475* | -0.655*** |
| | (0.284) | (0.336) | (0.320) | (0.311) | (0.298) | (0.322) | (0.300) | (0.286) | (0.264) | (0.285) | (0.301) | (0.292) | (0.282) | (0.258) | (0.243) |
| 2 Week Lag: | 1.125*** | 0.486 | 2.089*** | 1.207*** | -0.112 | 2.281*** | 1.401*** | 1.821*** | 2.129*** | 2.098*** | 1.977*** | 0.985* | 2.723*** | 4.118*** | 3.876*** |
| log(Change in Physical Proximity to Cases + 1) | (0.189) | (0.388) | (0.284) | (0.256) | (0.316) | (0.436) | (0.355) | (0.428) | (0.607) | (0.761) | (0.752) | (0.521) | (0.637) | (0.456) | (0.480) |
| 4 Week Lag: | -2.193*** | -0.156 | -1.686*** | -1.072*** | 0.429 | -2.705*** | -1.551*** | -1.802*** | -2.127*** | -1.929** | -1.824** | -0.949* | -2.452*** | -3.748*** | -3.664*** |
| log(Change in Physical Proximity to Cases + 1) | (0.724) | (0.430) | (0.289) | (0.274) | (0.287) | (0.443) | (0.332) | (0.405) | (0.618) | (0.806) | (0.768) | (0.544) | (0.633) | (0.448) | (0.496) |
| 2 Week Lag: | 0.172*** | 0.381*** | 0.554*** | 0.463*** | 0.276*** | 0.361*** | 0.328*** | 0.346*** | 0.371*** | 0.315*** | 0.283*** | 0.423*** | 0.255*** | 0.367*** | 0.327*** |
| log(Change in Cases per 10k Residents + 1) | (0.059) | (0.050) | (0.037) | (0.036) | (0.036) | (0.041) | (0.033) | (0.036) | (0.037) | (0.041) | (0.042) | (0.035) | (0.037) | (0.035) | (0.035) |
| 4 Week Lag: | -0.083 | 0.124* | -0.026 | 0.046 | 0.128*** | -0.005 | 0.003 | 0.016 | 0.004 | 0.033 | 0.131*** | 0.056 | 0.014 | 0.076** | 0.155*** |
| log(Change in Cases per 10k Residents + 1) | (0.247) | (0.075) | (0.047) | (0.037) | (0.035) | (0.040) | (0.033) | (0.034) | (0.036) | (0.040) | (0.043) | (0.039) | (0.036) | (0.033) | (0.033) |
| Pop. Density FEs | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Median Household Income FEs | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| State FEs | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Sample Mean | 1.239 | 1.257 | 1.334 | 1.372 | 1.429 | 1.586 | 2.038 | 2.530 | 2.707 | 2.675 | 2.627 | 2.629 | 2.840 | 3.040 | 3.356 |
| R-Squared | 0.608 | 0.574 | 0.644 | 0.648 | 0.666 | 0.615 | 0.673 | 0.705 | 0.732 | 0.657 | 0.606 | 0.617 | 0.637 | 0.680 | 0.713 |
| N | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 | 3,135 |
Note: Table shows time-specific results from regression 5. Each observation is a county. The dependent variable is the log of one plus the number of new COVID-19 cases per 10,000 residents in one two-week period between March 30 and November 2, 2020. All columns include the log of growth in social and physical proximity to cases, as well as the log of growth in actual cases, lagged by two and four weeks (one and two time periods). All columns include time-specific fixed effects for percentiles of population density and median household income, time-specific fixed effects for state, and controls for the share of a county’s Facebook connections that are within 50 and 150 miles. Significance levels: *(p<0.10), **(p<0.05), ***(p<0.01).
COVID-19 Case Growth, Prior Proximity to Cases, and Other Predictive Measures.
| | log(Change in Cases per 10k Residents + 1) | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
| 2 Week Lag: | 0.414*** | 0.321*** | 0.362*** | 0.277*** | 0.351*** | 0.270*** | 0.141*** | 0.141*** |
| log(Change in Social Proximity to Cases + 1) | (0.041) | (0.037) | (0.047) | (0.039) | (0.047) | (0.039) | (0.050) | (0.045) |
| 4 Week Lag: | -0.002 | 0.010 | -0.008 | 0.022 | 0.001 | 0.027 | -0.039 | -0.014 |
| log(Change in Social Proximity to Cases + 1) | (0.036) | (0.032) | (0.042) | (0.035) | (0.042) | (0.035) | (0.051) | (0.050) |
| Google searches related to Fever (% Change) | | | 0.286*** | 0.231*** | | | | |
| Google searches related to Cough (% Change) | | | 0.158*** | 0.117*** | | | | |
| Google searches related to Fatigue (% Change) | | | 0.022 | 0.021 | | | | |
| 2 Week Lag: | | | | | 0.165*** | 0.123*** | | |
| Google searches related to Fever (% Change) | | | | | (0.021) | (0.018) | | |
| 2 Week Lag: | | | | | 0.189*** | 0.139*** | | |
| Google searches related to Cough (% Change) | | | | | (0.021) | (0.017) | | |
| 2 Week Lag: | | | | | 0.013 | 0.019 | | |
| Google searches related to Fatigue (% Change) | | | | | (0.019) | (0.018) | | |
| 2 Week Lag: | | | | | | | 0.269*** | 0.179*** |
| log(Change in LEX Proximity to Cases + 1) | | | | | | | (0.025) | (0.022) |
| 4 Week Lag: | | | | | | | 0.006 | 0.006 |
| log(Change in LEX Proximity to Cases + 1) | | | | | | | (0.023) | (0.022) |
| Share of Friends within 50 Miles | 0.050 | 0.076 | 0.006 | 0.084 | 0.010 | 0.086 | 0.036 | 0.022 |
| | (0.100) | (0.082) | (0.091) | (0.075) | (0.092) | (0.076) | (0.093) | (0.081) |
| Share of Friends within 150 Miles | -0.256** | 0.143 | -0.213* | 0.168* | -0.220* | 0.171* | -0.126 | 0.287*** |
| | (0.124) | (0.109) | (0.110) | (0.097) | (0.112) | (0.098) | (0.115) | (0.101) |
| 2 Week Lag: | 1.244*** | 1.388*** | 1.116*** | 1.272*** | 1.117*** | 1.261*** | 0.971*** | 1.104*** |
| log(Change in Physical Proximity to Cases + 1) | (0.118) | (0.176) | (0.105) | (0.167) | (0.105) | (0.168) | (0.104) | (0.170) |
| 4 Week Lag: | -1.037*** | -1.225*** | -0.915*** | -1.077*** | -0.912*** | -1.066*** | -0.852*** | -0.996*** |
| log(Change in Physical Proximity to Cases + 1) | (0.121) | (0.187) | (0.108) | (0.178) | (0.108) | (0.179) | (0.107) | (0.180) |
| 2 Week Lag: | 0.372*** | 0.351*** | 0.498*** | 0.467*** | 0.484*** | 0.456*** | 0.536*** | 0.510*** |
| log(Change in Cases per 10k Residents + 1) | (0.022) | (0.019) | (0.024) | (0.019) | (0.024) | (0.019) | (0.023) | (0.020) |
| 4 Week Lag: | 0.071*** | 0.056*** | 0.027 | 0.013 | 0.034* | 0.019 | -0.005 | 0.001 |
| log(Change in Cases per 10k Residents + 1) | (0.019) | (0.017) | (0.021) | (0.017) | (0.021) | (0.017) | (0.022) | (0.019) |
| Time x Pop. Density FEs | Y | Y | Y | Y | Y | Y | Y | Y |
| Time x Median Household Income FEs | Y | Y | Y | Y | Y | Y | Y | Y |
| Time x State FEs | | Y | | Y | | Y | | Y |
| Sample Mean | 2.177 | 2.177 | 2.279 | 2.279 | 2.279 | 2.279 | 2.333 | 2.333 |
| R-Squared | 0.725 | 0.757 | 0.768 | 0.800 | 0.767 | 0.799 | 0.795 | 0.827 |
| N | 47,040 | 47,025 | 38,520 | 38,520 | 38,520 | 38,520 | 30,210 | 30,195 |
Note: Table shows results from regression 5. Each observation is a county × two-week period (between March 30 and November 2, 2020). The dependent variable is the log of one plus the number of new COVID-19 cases per 10,000 residents. Columns 1 and 2 are the same as columns 7 and 8 in Table 1. Columns 3 and 4 add the percent growth in Google searches related to fever, cough, and fatigue from the week prior to the period to the second week of the period. Columns 5 and 6 include analogous measures lagged by one period. Columns 7 and 8 add LEX-based proximity to cases. Standard errors are clustered at the time × state level. Significance levels: *(p<0.10), **(p<0.05), ***(p<0.01).
Predicting COVID-19 cases in U.S., with and without Social Proximity to Cases.
| | RMSE: Baseline Model | | | RMSE: Best Available Model | | | RMSE: Counties w/ Google + LEX Only | | |
|---|---|---|---|---|---|---|---|---|---|
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) |
| | Without Social Proximity to Cases | With Social Proximity to Cases | Diff. from Social Proximity to Cases | Without Social Proximity to Cases | With Social Proximity to Cases | Diff. from Social Proximity to Cases | Without Social Proximity to Cases | With Social Proximity to Cases | Diff. from Social Proximity to Cases |
| (1) April 14 - April 27 | 1.636 | 1.534 | -0.102 | 1.488 | 1.387 | -0.102 | 1.399 | 1.299 | -0.100 |
| (2) April 28 - May 11 | 0.900 | 0.838 | -0.062 | 0.954 | 0.889 | -0.066 | 0.887 | 0.835 | -0.053 |
| (3) May 12 - May 25 | 0.746 | 0.722 | -0.024 | 0.771 | 0.746 | -0.025 | 0.671 | 0.646 | -0.025 |
| (4) May 26 - June 8 | 0.704 | 0.680 | -0.024 | 0.687 | 0.675 | -0.012 | 0.584 | 0.581 | -0.003 |
| (5) June 9 - June 22 | 0.800 | 0.776 | -0.024 | 0.779 | 0.766 | -0.013 | 0.669 | 0.660 | -0.010 |
| (6) June 23 - July 6 | 0.859 | 0.838 | -0.021 | 0.809 | 0.798 | -0.011 | 0.665 | 0.667 | 0.002 |
| (7) July 7 - July 20 | 0.793 | 0.780 | -0.013 | 0.733 | 0.730 | -0.003 | 0.530 | 0.526 | -0.004 |
| (8) July 21 - Aug. 10 | 0.755 | 0.719 | -0.036 | 0.725 | 0.701 | -0.024 | 0.508 | 0.509 | 0.002 |
| (9) Aug. 11 - Aug. 24 | 0.770 | 0.740 | -0.030 | 0.741 | 0.720 | -0.022 | 0.530 | 0.517 | -0.014 |
| (10) Aug. 25 - Sep. 7 | 0.725 | 0.719 | -0.005 | 0.728 | 0.722 | -0.006 | 0.503 | 0.503 | 0.000 |
| (11) Sep. 8 - Sep. 21 | 0.699 | 0.691 | -0.008 | 0.694 | 0.686 | -0.009 | 0.495 | 0.494 | -0.001 |
| (12) Sep. 22 - Oct. 5 | 0.748 | 0.719 | -0.029 | 0.726 | 0.705 | -0.021 | 0.513 | 0.511 | -0.002 |
| (13) Oct. 6 - Oct. 19 | 0.688 | 0.662 | -0.026 | 0.684 | 0.658 | -0.025 | 0.475 | 0.479 | 0.004 |
| (14) Oct. 20 - Nov. 2 | 0.667 | 0.652 | -0.015 | 0.647 | 0.628 | -0.018 | 0.462 | 0.455 | -0.007 |
Note: Table shows results from county-level predictions of COVID-19 case growth. The predicted outcome is log of one plus the number of new COVID-19 cases per 10,000 residents. All columns show root mean squared errors (RMSEs) from a random forest model trained on data from all periods prior to the period of interest. The model inputs in column 1 are population density; median household income; and log of growth in physical proximity to cases and actual cases, lagged by two and four weeks (one and two time periods). Columns 4 and 7 include information on one and two period lagged measures of LEX proximity to cases, and one period lagged percent changes in Google searches related to fever, cough, and fatigue. Column 4 includes predictions for 3136 counties using, for each county, a model that utilizes the most available information possible. Column 7 limits to the 1976 counties for which we have both Google symptom search and LEX data. Columns 2, 5, and 8 add lagged measures of social proximity to cases to columns 1, 4, and 7. Columns 3, 6, and 9 show the change in RMSE from adding social proximity to cases.
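The evaluation scheme in the note (train on all earlier periods, predict the period of interest, compare RMSE with and without social proximity) can be sketched as below. To stay dependency-free, this sketch swaps the random forest for a one-nearest-neighbour predictor, so it illustrates only the expanding-window comparison and not the authors' model. The data, feature layout, and function names are all made up for illustration.

```python
import math


def rmse(pred, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in model (NOT a random forest): copy the outcome of the closest training row."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def expanding_window_rmse(periods, feature_idx):
    """For each period t >= 1, train on periods 0..t-1 and report RMSE on period t."""
    results = []
    for t in range(1, len(periods)):
        train = [row for p in periods[:t] for row in p]
        train_X = [[feats[j] for j in feature_idx] for feats, _ in train]
        train_y = [outcome for _, outcome in train]
        preds = [
            nearest_neighbor_predict(train_X, train_y, [feats[j] for j in feature_idx])
            for feats, _ in periods[t]
        ]
        results.append(rmse(preds, [outcome for _, outcome in periods[t]]))
    return results


# Made-up data: each row is ([lagged_cases, social_proximity], outcome).
periods = [
    [([1.0, 0.0], 1.0), ([1.0, 2.0], 3.0), ([3.0, 0.0], 3.0), ([3.0, 2.0], 5.0)],
    [([1.0, 2.0], 3.0), ([3.0, 0.0], 3.0)],
    [([1.0, 0.0], 1.0), ([3.0, 2.0], 5.0)],
]

without_social = expanding_window_rmse(periods, feature_idx=[0])
with_social = expanding_window_rmse(periods, feature_idx=[0, 1])
# Negative values mean that adding social proximity lowered the error.
diff = [w - wo for w, wo in zip(with_social, without_social)]
```

The `diff` list plays the role of the "Diff. from Social Proximity to Cases" columns: one out-of-sample RMSE change per evaluated period.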
Predicting U.S. Hotspot COVID-19 spread, trained on Italian Hotspot spread.
| | Linear Regression | | | Random Forest | | |
|---|---|---|---|---|---|---|
| | (1) | (2) | (3) | (4) | (5) | (6) |
| | Without SCI to Hotspot | With SCI to Hotspot | Diff. from SCI to Hotspot | Without SCI to Hotspot | With SCI to Hotspot | Diff. from SCI to Hotspot |
| (1) RMSE | 0.990 | 0.972 | -0.018 | 1.041 | 1.010 | -0.031 |
| (2) Rank-Rank Corr. w/ Truth | 0.238 | 0.350 | 0.112 | 0.254 | 0.315 | 0.061 |
Note: Table shows results from county-level predictions of COVID-19 cases per 10,000 residents. Columns 1–3 and 4–6 show results from linear regression and random forest models, respectively. The models are trained using information from Italy on March 10 and tested using information from the U.S. on March 30. All measures are normalized by subtracting the mean and then dividing by the standard deviation. Row (1) shows the prediction root mean squared errors (RMSEs) and row (2) shows the rank-rank correlation of the predictions with the truth. The model inputs in columns 1 and 4 are geographic distance to the hotspot (Lodi in Italy, Westchester in the U.S.), population density, and household income (GDP per inhabitant in Italy). Columns 2 and 5 add SCI to the hotspots. Columns 3 and 6 show the change in each measure from adding SCI.
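The two evaluation measures in this table can be illustrated with a short sketch. It assumes that the normalization uses the population standard deviation (the note does not say which variant is used) and that the rank-rank correlation is the Pearson correlation between the ranks of predictions and the ranks of true values. The numbers are made up and the helper names are hypothetical.

```python
import math


def zscore(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def rank_rank_corr(pred, truth):
    """Correlation between the ranks of predictions and the ranks of the truth."""
    return pearson(ranks(pred), ranks(truth))


def rmse(pred, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


# Made-up true and predicted case rates for four regions.
truth = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 45.0, 33.0]

z_truth, z_pred = zscore(truth), zscore(pred)
error = rmse(z_pred, z_truth)
rho = rank_rank_corr(pred, truth)
```

Because ranks are unchanged by the normalization, the rank-rank correlation is the same whether it is computed on raw or normalized values; RMSE is not.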
Predicting COVID-19 deaths in U.S., with and without Social Proximity to Deaths.
| | RMSE: Baseline Model | | | RMSE: Best Available Model | | | RMSE: Counties w/ Google + LEX Only | | |
|---|---|---|---|---|---|---|---|---|---|
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) |
| | Without Social Proximity to Deaths | With Social Proximity to Deaths | Diff. from Social Proximity to Deaths | Without Social Proximity to Deaths | With Social Proximity to Deaths | Diff. from Social Proximity to Deaths | Without Social Proximity to Deaths | With Social Proximity to Deaths | Diff. from Social Proximity to Deaths |
| (1) May 12 - June 8 | 0.765 | 0.731 | -0.034 | 0.709 | 0.690 | -0.019 | 0.709 | 0.711 | 0.002 |
| (2) June 9 - July 6 | 0.480 | 0.435 | -0.045 | 0.438 | 0.402 | -0.036 | 0.356 | 0.339 | -0.017 |
| (3) July 7 - Aug. 10 | 0.457 | 0.453 | -0.004 | 0.455 | 0.451 | -0.003 | 0.431 | 0.428 | -0.002 |
| (4) Aug. 11 - Sep. 7 | 0.471 | 0.462 | -0.008 | 0.458 | 0.454 | -0.003 | 0.400 | 0.401 | 0.001 |
| (5) Sep. 8 - Oct. 5 | 0.488 | 0.469 | -0.019 | 0.475 | 0.460 | -0.015 | 0.371 | 0.365 | -0.006 |
| (6) Oct. 6 - Nov. 2 | 0.549 | 0.543 | -0.006 | 0.544 | 0.539 | -0.005 | 0.405 | 0.406 | 0.001 |
Note: Table shows results from county-level predictions of COVID-19 deaths. The predicted outcome is log of one plus the number of new COVID-19 deaths per 10,000 residents. All columns show root mean squared errors (RMSEs) from a random forest model trained on data from all periods prior to the period of interest. The model inputs in column 1 are population density; median household income; and log of growth in physical proximity to deaths and actual deaths, lagged by four and eight weeks (one and two time periods). Columns 4 and 7 include information on one and two period lagged measures of LEX proximity to deaths, and one period lagged percent changes in Google searches related to fever, cough, and fatigue. Column 4 includes predictions for 3156 counties using, for each county, a model that utilizes the most available information possible. Column 7 limits to 1976 counties for which we have both Google symptom search and LEX data. Columns 2, 5, and 8 add lagged measures of social proximity to deaths to columns 1, 4, and 7. Columns 3, 6, and 9 show the change in RMSE from adding social proximity to deaths.