Antonio Paez, School of Earth, Environment and Society, McMaster University, Hamilton, Canada.
Abstract
The emergence of the novel SARS-CoV-2 coronavirus and the global COVID-19 pandemic in 2019 led to explosive growth in scientific research. Alas, much of the research in the literature lacks the conditions needed to be reproducible, and recent publications on the association between population density and the basic reproductive number of SARS-CoV-2 are no exception. Relatively few papers share code and data sufficiently, which hinders not only verification but also additional experimentation. In this article, an example of reproducible research shows the potential of spatial analysis for epidemiology research during COVID-19. Transparency and openness mean that independent researchers can, with only modest effort, verify findings and use different approaches as appropriate. Given the high stakes of the situation, it is essential that scientific findings, on which good policy depends, are as robust as possible; as the empirical example shows, reproducibility is one of the keys to ensuring this.
The emergence of the novel SARS‐CoV‐2 coronavirus in 2019, and the global pandemic that followed in its wake, led to an explosive growth of research around the globe. According to Fraser et al. (2021), over 125,000 COVID‐19‐related papers were released in the first 10 months from the first confirmed case of the disease. Of these, more than 30,000 were shared in pre‐print servers, the use of which also exploded in the past year (Kwon 2020; Vlasschaert, Topf, and Hiremath 2020; Añazco et al. 2021).
Given the ruinous human and economic cost of the pandemic, there has been a natural tension in the scientific community between the need to publish research results quickly and the imperative to maintain consistently high‐quality standards in scientific reporting; indeed, a call for maintaining the standards in published research termed the deluge of COVID‐19 publications a “carnage of substandard research” (Bramstedt 2020). Part of the challenge of maintaining quality standards in published research is that, despite an abundance of recommendations and guidelines (e.g., Ince et al. 2012; Ioannidis et al. 2014; Broggini et al. 2017; Brunsdon and Comber 2020), in practice reproducibility has remained a lofty and somewhat aspirational goal (Konkol et al. 2019; Konkol and Kray 2019). As reported in the literature, only a woefully small proportion of published research was actually reproducible before the pandemic (Iqbal et al. 2016; Stodden, Seiler, and Ma 2018), and the situation does not appear to have changed substantially since (Gustot 2020; Sumner et al. 2020).
The push for open software and data (Páez 2021; Arribas‐Bel et al. 2021; Bivand 2020), along with more strenuous efforts toward open, reproducible research, is simply a continuation of long‐standing scientific practices of independent verification. 
Despite the (at times disproportionate) attention that high profile scandals in science tend to elicit in the media, science as a collective endeavor is remarkable for being a self‐correcting enterprise, one with built‐in mechanisms and incentives to weed out erroneous ideas. Over the long term, facts tend to prevail in science. At stake are the shorter‐term impacts that research may have in other spheres of economic and social life. The case of economists Reinhart and Rogoff comes to mind: by the time the inaccuracies and errors in their research were uncovered (see Herndon et al. 2014), their claims about debt and economic growth had already been seized by policy‐makers on both sides of the Atlantic to justify austerity policies in the aftermath of the Great Recession of 2007–2009.
As later research has demonstrated, those policies cast a long shadow, and their effects continued to be felt for years (Basu, Carney, and Kenworthy 2017).
In the context of COVID‐19, a topic that has grabbed the imagination of numerous thinkers has been the prospect of life in cities after the pandemic (e.g., Florida et al. 2020); as a result, the implications of the pandemic for urban planning, design, and management are the topic of ongoing research (e.g., Sharifi and Khavarian‐Garmsir 2020). The fact that the worst of the pandemic was initially felt in dense population centers such as Wuhan, Milan, Madrid, and New York unleashed a torrent of research into the associations between density and the spread of the pandemic. The answers to some important questions hang on the results of these research efforts. For example, are lower density regions safer from the pandemic? Are de‐densification policies warranted, even if just in the short term? In the longer term, will the risks of life in high density regions presage a flight from cities? And what are the implications of the pandemic for future urban planning and practice? Over the past year, numerous papers have sought to throw light on the underlying issue of density and the pandemic; nonetheless the results, as will be detailed next, remain mixed. Furthermore, to complicate matters, precious few of these studies appear to be sufficiently open to support independent verification.
The objective of this article is to illustrate the importance of reproducibility in research in the context of the flood of COVID‐19 papers. For this, I focus on a recent study by Sy, White, and Nichols (2021) that examined the correlation between the basic reproductive number of COVID‐19, $R_0$, and population density. The basic reproductive number is a summary measure of contact rates, probability of transmission of a pathogen, and duration of infectiousness. In rough terms, it measures how many new infections each infection begets. 
The paper of Sy, White, and Nichols (2021) was selected for being, in the literature examined, almost alone in supporting reproducible research. Accordingly, I wish to be clear that my objective in singling out their work for discussion is not to malign their efforts, but rather to demonstrate how open and reproducible research efforts can greatly help to accelerate discovery. More concretely, open data and open code mean that an independent researcher can, with only modest effort, not only verify the findings reported, but also examine the same data from a perspective which may not have been available to the original researchers due to differences in disciplinary perspectives, methodological traditions, and/or training, among other possible factors. The example, which shows consequential changes in the conclusions reached by different analyses, should serve as a call to researchers to redouble their efforts to increase transparency and reproducibility in their research. In this spirit, the present article also aims to show how data can be packaged in well‐documented, shareable units, and code can be embedded into self‐contained documents suitable for review and independent verification. The source for this article is an R Markdown document which, along with the data package, is available in a public repository.
Background: The intuitive relationship between density and spread of contagious diseases
The concern with population density and the spread of the virus during the COVID‐19 pandemic was fueled, at least in part, by dramatic scenes seen in real‐time around the world from large urban centers such as Wuhan, Milan, Madrid, and New York. In theory, there are good reasons to believe that higher density could have a positive association with the transmission of a contagious virus. It has long been known that the potential for interpersonal contact is greater in regions with higher density (see for example the research on urban fields and time‐geography, including Moore 1970; Moore and Brown 1970; Farber and Páez 2011). Mathematically, models of exposure and contagion indicate that higher densities can catalyze the transmission of contagious diseases (Li, Richmond, and Roehner 2018; Rocklöv and Sjödin 2020). The idea is intuitive and likely at the root of messages, by some figures in positions of authority, that regions with sparse population densities faced lower risks from the pandemic.
As Rocklöv and Sjödin (2020) note, however, mathematical models of contagion are valid at small‐to‐medium spaces (and presumably, smaller time intervals too, such as time spent in restaurants, concert halls, cruises), and the results do not necessarily transfer to larger spatial units and longer time periods. There are solid reasons for this: while in a restaurant, one can hardly avoid being in proximity to other customers. On the other hand, a person can choose to (or be forced to as a matter of policy) not go to a restaurant in the first place. Nonetheless, the idea that high density correlates with high transmission is so seemingly sensible that it is often taken for granted even at the scale of large spaces (e.g., Cruz et al. 2020; Micallef et al. 2020). 
In such conditions, however, there exists the possibility of behavioral adaptations, which are difficult to capture in the mechanistic framework of differential equations (or can be missing in agent‐based models, e.g., Gomez et al. 2021); these adaptations, in fact, can be a key aspect of disease transmission.
A plausible behavioral adaptation during a pandemic, especially one broadcast as widely and intensely as COVID‐19, is risk compensation. Risk compensation is a process whereby people adjust their behavior in response to their perception of risk (Noland 1995; Richens, Imrie, and Copas 2000; Phillips et al. 2011). In the case of COVID‐19, Chauhan et al. (2021) have found that perception of risks in the United States varies between rural, suburban, and urban residents, with rural residents in general expressing less concern about the virus. It is possible that people who listened to the message of leaders saying that they were safe from the virus because of low density may not have taken adequate precautions. Conversely, people in dense places who could more directly observe the impact of the pandemic may have become overly cautious. Both Paez et al. (2020) and Hamidi, Ewing, and Sabouri (2020b) posit this mechanism (i.e., greater compliance with social distancing in denser regions) to explain the results of their analyses. The evidence available does indeed show that there were important changes in behavior with respect to mobility during the pandemic (Jamal and Paez 2020; Molloy et al. 2020; Harris and Branion‐Calles 2021); furthermore, shelter in place orders may have had greater buy‐in from the public in higher density regions (Feyman et al. 2020; Hamidi and Zandiatashbar 2021), and the associated behavior may have persisted beyond the duration of official social‐distancing policies (Praharaj et al. 2020). In addition, there is evidence that changes in mobility correlated with the trajectory of the pandemic (Paez 2020; Noland 2021). 
Given the potential for behavioral adaptation, the question of density becomes more nuanced: it is not just a matter of proximity, but also of human behavior, which is better studied using population‐level data and models.
Background: But what does the literature say?
When it comes to population density and the spread of COVID‐19, the international literature to date remains inconclusive.
On the one hand, there are studies that report positive associations between population density and various COVID‐19‐related outcomes. Bhadra, Mukherjee, and Sarkar (2021), for example, reported a moderate positive correlation between the spread of COVID‐19 and population density at the district level in India; however, their analysis was bivariate and did not control for other variables, such as income. Similarly, Kadi and Khelfaoui (2020) found a positive and significant correlation between number of cases and population density in cities in Algeria in a series of simple regression models (i.e., without other controls). A question in these relatively simple analyses is whether density is a proxy for other factors. Other studies have included controls, such as Pequeno et al. (2020), a team that reported a positive association between density and cumulative counts of confirmed COVID‐19 cases in state capitals in Brazil after controlling for covariates, including income, transport connectivity, and economic status. In a similar vein, Fielding‐Miller, Sundaram, and Brouwer (2020) reported a positive relationship between the absolute number of COVID‐19 deaths and population density (rate) in rural counties in the United States. Roy and Ghosh (2020) used a battery of machine learning techniques to find discriminatory factors, and a positive and significant association between COVID‐19 infection and death rates in U.S. states. Wong and Li (2020) also found a positive and significant association between population density and number of confirmed COVID‐19 cases in U.S. counties, using both univariate and multivariate regressions with spatial effects. More recently, Sy, White, and Nichols (2021) reported that the basic reproductive number of COVID‐19 in U.S. counties tended to increase with population density, but at a decreasing rate at higher densities.
On the other hand, a number of studies report non‐significant or negative associations between population density and COVID‐19 outcomes. This includes the research of Sun et al. (2020), who did not find evidence of significant correlation between population density and confirmed number of cases per day in conditions of lockdown in China. This finding echoes the results of Paez et al. (2020), who in their study of provinces in Spain reported nonsignificant associations between population density and infection rates in the early days of the first wave of COVID‐19, and negative significant associations in the later part of the first lockdown. Similarly, Skórka et al. (2020) found zero or negative associations between population density and infection numbers/deaths by country. Fielding‐Miller, Sundaram, and Brouwer (2020) contrast their finding about rural counties with a negative relationship between COVID‐19 deaths and population density in urban counties in the United States. For their part, in their investigation of doubling time, White and Hébert‐Dufresne (2020) identified a negative and significant correlation between population density and doubling time in U.S. states. Likewise, Khavarian‐Garmsir, Sharifi, and Moradpour (2021) found a small negative (and significant) association between population density and COVID‐19 morbidity in districts in Tehran. Finally, two of the most complete studies in the United States, by Hamidi, Ewing, and Sabouri (2020a, b), used an extensive set of controls to find negative and significant correlations between density and COVID‐19 cases and fatalities at the level of counties in the United States.
As can be seen, these studies are implemented at different scales in different regions of the world. 
They also use a range of techniques, from correlation analysis, to multivariate regression, spatial regressions, and machine learning techniques. This is natural and to be expected: individual researchers have only limited time and expertise. This is why reproducibility is important. To pick an example (which will be further elaborated in later sections of this article), the study of Sy, White, and Nichols (2021), hereafter referred to as SWN, would immediately grab the attention of a researcher with expertise in spatial analysis.
Reproducibility of research
SWN investigated the basic reproductive number of COVID‐19 in U.S. counties, and its association with population density, median household income, and prevalence of private mobility. For their multivariate analysis, SWN used mixed linear models. This is an appropriate modeling choice: $R_0$ is an interval‐ratio variable that is suitably modeled using linear regression; further, as SWN note, there is a likelihood that the process is not independent “among counties within each state, potentially due to variable resource allocation and differing health systems across states” (p. 3). A mixed linear model accounts for this by introducing random components; in the case of SWN, these are random intercepts at the state level. SWN estimated various models with different combinations of variables, including median household income and prevalence of travel by private transportation. These controls help to account for potential variations in behavior: people in more affluent counties may have greater opportunities to work from home, and use of private transportation reduces contact with strangers. Moreover, they also conducted various sensitivity analyses. After these efforts, SWN concluded that there is a positive association between the basic reproductive number and population density at the level of counties in the United States.
One salient aspect of the analysis in SWN is that the basic reproductive number can only be calculated reliably with a minimum number of cases, and a large number of counties did not meet this threshold. As researchers do, SWN made modeling decisions, in this case basing their analysis only on counties with valid observations. A modeler with expertise in spatial analysis would likely ask some of the following questions on reading SWN’s paper: how were missing counties treated? What are the implications of the spatial sampling framework used in the analysis? Is it possible to spatially interpolate the missing observations? 
Was there spatial residual autocorrelation in the models, or was the use of mixed models sufficient to capture spatial dependencies? These questions are relevant and their implications important. Fortunately, SWN are an example of a reasonably open, reproducible research product: their paper is accompanied by (most of) the data and (most of) the code used in the analysis. This means that an independent researcher can, with only a moderate investment of time and effort, reproduce the results in the paper, as well as ask additional questions.
Alas, reproducibility is not necessarily the norm in the relevant literature. There are various reasons why a project can fail to be reproducible. In some cases, there might be legitimate reasons to withhold the data, perhaps due to confidentiality and privacy concerns (e.g., Lee et al. 2020). But in many other cases the data are publicly available, which in fact has commonly been the case with population‐level COVID‐19 information. Typically the provenance of the data is documented, but in numerous studies the data themselves are not shared (Cruz et al. 2020; Feng et al. 2020; Fielding‐Miller, Sundaram, and Brouwer 2020; Hamidi, Ewing, and Sabouri 2020a, b; Souris and Gonzalez 2020; Amadu et al. 2021; Bhadra, Mukherjee, and Sarkar 2021; Inbaraj, George, and Chandrasingh 2021). As any researcher can attest, collecting, organizing, and preparing data for a project can take a substantial amount of time. Pointing to the sources of data, even when these sources are public, is a small step toward reproducibility, but only a very small one. The prospect of having to recreate a data set from raw sources is probably sufficient to dissuade all but the most dedicated (or stubborn) researcher from independent verification. This is true even if part of the data are shared (e.g., Wong and Li 2020). In other cases, data are shared, but the processes followed in the preparation of the data are not fully documented (Ahmad et al. 2020; Skórka et al. 2020). These processes matter, as shown by the errors in the spreadsheets of Reinhart and Rogoff (see Herndon et al. 2014 for the discovery of these errors), as well as by the data of biologist Jonathan Pruitt that led to an “avalanche” of paper retractions (see Viglione 2020). Another situation is when papers share well‐documented data, but fail to provide the code used in the analysis (Pequeno et al. 2020; Noury et al. 2021; Wang et al. 2021). Making code available only “on demand” (e.g., Brandtner et al. 2021) is an unnecessary barrier when most journals offer the facility to share supplemental materials online. Then there are those papers that more closely comply with reproducibility standards, and share well‐documented processes and data, as well as the code used in any analyses reported (Feyman et al. 2020; Paez et al. 2020; White and Hébert‐Dufresne 2020; Stephens, Chernyavskiy, and Bruns 2021; Sy, White, and Nichols 2021). Even in this case, the pressure to publish “new findings” instead of replication studies can act as a deterrent. This may be particularly true for younger researchers.
In the following sections, the analysis of SWN is reproduced, some relevant questions from the perspective of an independent researcher with expertise in spatial analysis are asked, and the data are reanalyzed.
Reproducing SWN
SWN examined the association between the basic reproductive number of COVID‐19 and population density. The basic reproductive number $R_0$ is a summary measure of contact rates, probability of transmission of a pathogen, and duration of infectiousness. In rough terms, $R_0$ measures how many new infections each infection begets. Infectious disease outbreaks generally tend to die out when $R_0 < 1$, and to grow when $R_0 > 1$. Reliable calculation of $R_0$ requires a minimum number of cases to be able to assume that there is community transmission of the pathogen. Accordingly, SWN based their analysis only on counties that had at least 25 cases at the end of the exponential growth phase (see Fig. 1). Their final sample included 1,151 counties in the United States, including in Alaska, Hawaii, Puerto Rico, and island territories. SWN used COVID‐19 data collected by the New York Times and made available (with versioning) in a GitHub repository.
For each county, SWN assumed that the exponential growth period began one week prior to the second daily increase in cases, and assumed that the period of exponential growth lasted approximately 18 days.
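To make the role of the exponential growth window concrete, the following is a minimal sketch of how a growth rate, and from it a reproductive number, could be obtained from a short series of case counts. It is not SWN's estimator: the log-linear fit, the conversion $R_0 = 1 + rD$ (a simple SIR-type approximation), the 5-day infectious period `infectious_days`, and the synthetic case series are all assumptions made here for illustration only.

```python
import math


def growth_rate(cases):
    """Least-squares slope of log(cases) against the day index (per-day rate)."""
    n = len(cases)
    days = list(range(n))
    logs = [math.log(c) for c in cases]
    mean_t = sum(days) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
    sxx = sum((t - mean_t) ** 2 for t in days)
    return sxy / sxx


def r0_from_growth(r, infectious_days):
    """Illustrative approximation R0 = 1 + r * D (an assumption, not SWN's method)."""
    return 1.0 + r * infectious_days


# Hypothetical county: cumulative cases growing at a rate of 0.2 per day
# over an 18-day window, starting from 25 cases.
cases = [25 * math.exp(0.2 * t) for t in range(18)]
r = growth_rate(cases)
r0 = r0_from_growth(r, infectious_days=5.0)
print(f"growth rate = {r:.3f} per day, R0 = {r0:.2f}")
```

The sketch also shows why a minimum case count matters: with only a handful of cases the logarithms are noisy (or undefined at zero), and the fitted slope becomes unreliable.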
Figure 1
Basic reproductive rate in U.S. counties (Alaska, Hawaii, Puerto Rico, and territories not shown).
Table 1 reproduces the first three models of SWN (the fourth model did not have any significant variables; see Table 1 in SWN). It is possible to verify that the results match, with only the minor (and irrelevant) exception of the magnitude of the coefficient for travel by private transportation, which is due to a difference in the input (here the variable is changed to 1% units, instead of the 10% units used by SWN). The mixed linear model gives random intercepts (i.e., the intercept is a random variable), and the standard deviation is reported in the fifth row of Table 1. It is useful to map the random intercepts: as seen in Figure 2, other things being equal, counties in Texas tend to have somewhat lower values of $R_0$ (i.e., a negative random intercept), whereas counties in South Dakota tend to have higher values of $R_0$. The key result of the analysis, after extensive sensitivity analysis, is a robust finding that population density has a positive association with the basic reproductive number. But does it?
Table 1
Reproducing SWN: Models 1–3

| Variable | Model 1 β | Model 1 95% CI | Model 2 β | Model 2 95% CI | Model 3 β | Model 3 95% CI |
|---|---|---|---|---|---|---|
| Intercept | 2.274 | [2.167, 2.381] | 3.347 | [2.676, 4.018] | 3.386 | [2.614, 4.157] |
| Log of population density | 0.162 | [0.133, 0.191] | 0.145 | [0.115, 0.176] | 0.147 | [0.113, 0.18] |
| Percent of private transportation | | | −0.013 | [−0.02, −0.005] | −0.013 | [−0.021, −0.005] |
| Median household income ($10,000) | | | | | −0.003 | [−0.033, 0.026] |
| Standard deviation (Intercept) | 0.166 | [0.108, 0.254] | 0.136 | [0.081, 0.229] | 0.137 | [0.081, 0.232] |
| Within‐group standard error | 0.665 | [0.638, 0.693] | 0.665 | [0.638, 0.693] | 0.665 | [0.638, 0.694] |
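For readers less familiar with mixed linear models, the random-intercept specification behind Model 3 can be written out as follows. The notation is an assumption introduced here for illustration and does not come from SWN: $i$ indexes counties, $j$ indexes states, $D$ is population density, $P$ is the percent of travel by private transportation, and $I$ is median household income in units of $10,000.

```latex
R_{0,ij} = \beta_0 + u_j + \beta_1 \log(D_{ij}) + \beta_2 P_{ij} + \beta_3 I_{ij} + \epsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \qquad \epsilon_{ij} \sim N(0, \sigma_\epsilon^2)
```

Under this reading, the row "Standard deviation (Intercept)" in Table 1 corresponds to $\sigma_u$ (the spread of the state-level intercepts $u_j$) and "Within‐group standard error" to $\sigma_\epsilon$. Models 1 and 2 drop the terms whose cells are blank in the table.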
Figure 2
Random intercepts of Model 3 (Alaska, Hawaii, Puerto Rico, and territories not shown).
Expanding on SWN
The preceding section shows that thanks to the availability of code and data, it is possible to verify the results reported by SWN. As noted earlier, though, an independent researcher might have wondered about the implications of the spatial sampling procedure used by SWN. The decision to use a sample of counties with reliable basic reproductive numbers, although apparently sensible, results in a non‐random spatial sampling scheme. Turning our attention back to Fig. 1, there is a distinct impression that many counties without reliable values of $R_0$ are in more rural, less dense parts of the United States. This impression is reinforced when the boundaries of urban areas with population greater than 50,000 are overlaid on the counties with valid values of $R_0$ (see Fig. 3). The fact that $R_0$ could not be accurately computed in many counties without large urban areas does not mean that there was no transmission of the virus: it simply means that we do not know with sufficient precision to what extent that was the case. The low number of cases may be related to low population and/or low population density. This is intriguing, to say the least: by excluding cases based on the ability to calculate $R_0$ we are potentially selecting the sample in a non‐random way.
Figure 3
Urban areas with population >50,000 (Alaska, Hawaii, Puerto Rico, and territories not shown).
A problematic issue with non‐random sample selection is that parameter estimates can become unreliable, and numerous techniques have been developed to address this. A model useful for sample selection problems is Heckman’s selection model (see Maddala 1983). The selection model is in fact a system of two equations, as follows:
$$y_i^{S*} = \boldsymbol{\beta}^{S\prime}\mathbf{x}_i^{S} + \epsilon_i^{S}$$
$$y_i^{O*} = \boldsymbol{\beta}^{O\prime}\mathbf{x}_i^{O} + \epsilon_i^{O}$$
where $y_i^{S*}$ is a latent variable for the sample selection process and $y_i^{O*}$ is the latent outcome. Vectors $\mathbf{x}_i^{S}$ and $\mathbf{x}_i^{O}$ are explanatory variables (with the possibility that $\mathbf{x}_i^{S} = \mathbf{x}_i^{O}$). Both equations include random terms (i.e., $\epsilon_i^{S}$ and $\epsilon_i^{O}$). The first equation is designed to model the probability of sampling, and the second equation the outcome of interest (say $R_0$). The random terms are jointly distributed and correlated with parameter $\rho$.
What the analyst observes is the following:
$$y_i^{S} = \begin{cases} 0 & \text{if } y_i^{S*} < 0 \\ 1 & \text{otherwise} \end{cases}$$
and:
$$y_i^{O} = \begin{cases} \text{not observed} & \text{if } y_i^{S} = 0 \\ y_i^{O*} & \text{otherwise} \end{cases}$$
In other words, the outcome of interest is observed only for certain cases ($y_i^{S} = 1$, i.e., for sampled observations). The probability of sampling depends on $\mathbf{x}_i^{S}$. For the cases observed, the outcome depends on $\mathbf{x}_i^{O}$.
A sample selection model is estimated using the same selection of variables as SWN Model 3. This is Sample Selection Model 1 in Table 2. The first thing to notice about this model is that the sample selection process and the outcome are correlated ($\rho \neq 0$ at the 5% significance level). The selection equation indicates that the probability that a county is in the sample increases with population density (but at a decreasing rate due to the log‐transformation), when travel by private modes is more prevalent, and as median household income in the county is higher. This is in line with the impression made by Fig. 3 that counties with reliable values of $R_0$ tended to be those with larger urban centers. Once the selection probabilities are accounted for in the model, several things happen with the outcomes model. First, the coefficient for population density is still positive, but the magnitude changes: in effect, it appears that the effect of density is more pronounced than what SWN Model 3 indicated. The coefficient for percent of private transportation changes signs. And the coefficient for median household income is now significant.
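The consequence of ignoring a correlated selection process can be illustrated with a small simulation. The sketch below is not the model estimated in this article: the coefficients, the error correlation (`RHO = 0.9`), the sample size, and the single explanatory variable are all hypothetical. It generates a selection equation and an outcome equation with correlated random terms, and compares the least-squares slope on the full data with the slope obtained only from the selected observations.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(42)
RHO = 0.9   # hypothetical correlation between selection and outcome errors
N = 20000   # hypothetical number of spatial units

xs, ys, selected = [], [], []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    e_sel = random.gauss(0.0, 1.0)
    # Outcome error built to have correlation RHO with the selection error.
    e_out = RHO * e_sel + (1.0 - RHO ** 2) ** 0.5 * random.gauss(0.0, 1.0)
    latent_selection = x + e_sel          # selection equation
    outcome = 1.0 + 1.0 * x + e_out       # outcome equation, true slope = 1
    xs.append(x)
    ys.append(outcome)
    selected.append(latent_selection > 0.0)

slope_full = ols_slope(xs, ys)
slope_selected = ols_slope(
    [x for x, s in zip(xs, selected) if s],
    [y for y, s in zip(ys, selected) if s],
)
print(f"slope, all units:      {slope_full:.3f}")
print(f"slope, selected units: {slope_selected:.3f}")
```

In this setup the slope estimated from the selected units alone falls well below the true value of 1, because units with low values of the explanatory variable enter the sample only when their selection error, and hence their outcome error, is large. A selection model is designed to correct for exactly this kind of distortion.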
Table 2
Estimation results of sample selection models

| Variable | Selection Model 1 β | Selection Model 1 95% CI | Selection Model 2 β | Selection Model 2 95% CI |
|---|---|---|---|---|
| Sample Selection Model | | | | |
| Intercept | −2.237 | [−3.109, −1.365] | −7.339 | [−8.381, −6.297] |
| Log of population density | 0.385 | [0.352, 0.418] | | |
| Density (1,000 per sq.km) | | | 2.484 | [2.13, 2.838] |
| Density squared | | | −0.387 | [−0.473, −0.3] |
| Percent of private transportation | 0.025 | [0.016, 0.034] | 0.057 | [0.046, 0.067] |
| Median household income ($10,000) | 0.202 | [0.168, 0.235] | 0.32 | [0.283, 0.357] |
| Outcome Model | | | | |
| Intercept | 0.605 | [−0.257, 1.466] | 2.784 | [1.652, 3.915] |
| Log of population density | 0.39 | [0.354, 0.426] | | |
| Density (1,000 per sq.km) | | | 0.758 | [0.509, 1.008] |
| Density squared | | | −0.132 | [−0.187, −0.077] |
| Percent of private transportation | 0.01 | [0.001, 0.018] | −0.011 | [−0.021, −0.001] |
| Median household income ($10,000) | 0.126 | [0.094, 0.159] | 0.002 | [−0.033, 0.037] |
| σ | 0.954 | [0.904, 1.003] | 0.684 | [0.652, 0.716] |
| ρ | 0.971 | [0.961, 0.98] | −0.199 | [−0.377, −0.022] |
The second model in Table 2 (Selection Model 2) changes the way the variables are entered into the model. The log‐transformation of density in SWN and Selection Model 1 assumes that the association between density and $R_0$ is monotonically increasing (if the sign of the coefficient is positive) or decreasing (if the sign of the coefficient is negative). There are some indications that the relationship may actually not be monotonic. For example, Paez et al. (2020) found a positive (if non‐significant) relationship between density and incidence of COVID‐19 in the provinces of Spain at the beginning of the pandemic. This changed to a negative (and significant) relationship during the lockdown. In the case of the United States, Fielding‐Miller, Sundaram, and Brouwer (2020) found that the association between COVID‐19 deaths and population density was positive in rural counties, but negative in urban counties. A variable transformation that allows for non‐monotonic changes in the relationship is the square of the density.
As seen in the table, Selection Model 2 replaces the log‐transformation of population density with a quadratic expansion. The results of this analysis indicate that with this variable transformation, the selection and outcome processes are still correlated ($\rho \neq 0$ at the 5% significance level). But a few other interesting things emerge. On examination of the outcomes model, the quadratic expansion has a positive coefficient for the first order term, but a negative coefficient for the second order term. This indicates that $R_0$ initially tends to increase as density grows, but only up to a point, after which the negative second term (which grows more rapidly due to the square) becomes increasingly dominant. Secondly, the sign of the coefficient for travel by private transportation becomes negative again. This, of course, makes more sense than the positive sign of Selection Model 1: if people tend to travel in private transportation, the potential for contact should be lower instead of higher. And finally median household income is no longer significant, similar to SWN Model 3.
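The "point" at which the quadratic term takes over can be computed directly from the coefficients. For a term of the form $b_1 d + b_2 d^2$ with $b_2 < 0$, the maximum is at $d = -b_1 / (2 b_2)$. The short sketch below applies this to the point estimates of Selection Model 2 in Table 2; the helper name `turning_point` is introduced here for illustration, and the calculation ignores the uncertainty in the estimates.

```python
def turning_point(b_linear, b_quadratic):
    """Value of d that maximizes b_linear * d + b_quadratic * d**2.

    Requires a negative quadratic coefficient (a concave relationship).
    """
    if b_quadratic >= 0:
        raise ValueError("quadratic coefficient must be negative for a maximum")
    return -b_linear / (2.0 * b_quadratic)


# Point estimates from Table 2, Selection Model 2.
# Density is measured in 1,000 people per sq.km.
outcome_peak = turning_point(0.758, -0.132)     # outcome equation
selection_peak = turning_point(2.484, -0.387)   # selection equation

print(f"outcome equation peaks near {outcome_peak:.2f} thousand per sq.km")
print(f"selection equation peaks near {selection_peak:.2f} thousand per sq.km")
```

With these point estimates, the association between density and the outcome is increasing up to roughly 2.9 thousand people per sq.km and decreasing beyond that, which is what distinguishes the quadratic specification from the monotonic log‐transformation.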
Proceed with caution: Spatial effects ahead
The results of the selection models, in particular Selection Model 2, make us reassess the original conclusion that density has a positive association with the basic reproductive number of COVID‐19. A spatial analyst might still wonder about spatial residual autocorrelation. A challenge here is that spatial models tend to be technically more demanding, and although spatial models for qualitative variables exist, a spatial implementation of the sample selection model does not appear to exist. It might be argued that a reproducible research project can also allow a researcher to be more adventurous with their modeling decisions: since data and code are shared, other researchers can promptly and with relative ease poke the methods and see if they appear to be sound.
In the present case, it appears that an application of spatial filtering (see Getis and Griffith 2002; Griffith 2004; Paez 2019) can help. Spatial filtering provides an elegant solution to regression problems that may have difficulties handling the spatial structures of spatial statistical and econometric models (Griffith 2000). A key issue in the present example is the fact that there are numerous missing observations, which prevents the calculation of autocorrelation statistics, let alone the estimation of models with spatial components.
The following is an unorthodox, but potentially effective, use of filters in a sample selection model:
1. Estimate a sample selection model and retrieve the residuals of the outcome. This will be a vector with missing values for locations that were not sampled.
2. Fit a spatial filter to the residuals. This is done by regressing the estimated residuals of the observed data on the corresponding values of the Moran eigenvectors.
3. The resulting filter will correlate highly with the known residuals, and will provide information in non‐sampled locations that is consistent with the spatial pattern of the known residuals.
4. Test the filter for spatial autocorrelation:
4.1 If significant spatial autocorrelation is detected, this would be indicative of a residual spatial pattern. Introduce the filter as a covariate in the outcome model of the sample selection model and return to step 1.
4.2 If no significant spatial autocorrelation is detected, this would be indicative of a random residual pattern. Stop.
This procedure is implemented using a stopping criterion whereby the search for the filter only stops when the P‐value of Moran’s Coefficient of the filter fitted to the residuals is greater than 0.25, which was chosen as a sufficiently conservative value for testing for autocorrelation. The correlation of the known residuals with the corresponding elements of the filter is consistently high (the correlation coefficient typically is greater than 0.9). The results of implementing this procedure appear in Table 3 as Selection Model 3. The results are consistent with Selection Model 2, with two intriguing differences: 1) the variance of Selection Model 3 is smaller; and 2) the selection and outcome processes are no longer correlated (the confidence interval of ρ includes zero). It appears that by capturing the spatial pattern of the residuals, which is likely strongly determined by the non‐random sampling framework, the outcome model is not only substantially more precise, but also appears to be independent from the selection process.
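The filtering and testing steps of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for Selection Model 3: the lattice and the simulated residuals are toy data, all function names are hypothetical, the filter uses the eigenvectors with the k largest eigenvalues (the text does not say how eigenvectors are selected), and the P‐value comes from a simple permutation test (the text does not say how it was computed). Estimating the sample selection model itself, and re‐estimating it with the filter as a covariate, is left out.

```python
import numpy as np

def rook_weights(rows, cols):
    """Binary rook-contiguity weights for a toy rows x cols lattice."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < rows:
                W[i, i + cols] = W[i + cols, i] = 1.0
    return W

def moran_eigenvectors(W, k):
    """Eigenvectors of the centred matrix M W M with the k largest eigenvalues."""
    n = W.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(M @ W @ M)  # W is symmetric, so M W M is too
    return vecs[:, np.argsort(vals)[::-1][:k]]

def fit_filter(residuals, E):
    """Regress the known residuals on the eigenvectors (assumed selection: all k
    columns of E) and return fitted values at every location, sampled or not."""
    known = ~np.isnan(residuals)
    X = np.column_stack([np.ones(known.sum()), E[known]])
    coef, *_ = np.linalg.lstsq(X, residuals[known], rcond=None)
    return coef[0] + E @ coef[1:]

def morans_i(x, W):
    """Moran's coefficient of x under weights W."""
    z = x - x.mean()
    return len(x) / W.sum() * (z @ W @ z) / (z @ z)

def moran_p_value(x, W, n_perm=999, seed=0):
    """One-sided permutation P-value for positive autocorrelation (an assumption)."""
    rng = np.random.default_rng(seed)
    observed = morans_i(x, W)
    sims = np.array([morans_i(rng.permutation(x), W) for _ in range(n_perm)])
    return (1 + np.sum(sims >= observed)) / (n_perm + 1)

# Toy stand-in for step 1: spatially patterned residuals, 30 of 100 unsampled.
rng = np.random.default_rng(42)
W = rook_weights(10, 10)
E = moran_eigenvectors(W, k=10)
residuals = 2.0 * E[:, 0] + 1.0 * E[:, 1] + rng.normal(0, 0.02, 100)
residuals[rng.choice(100, size=30, replace=False)] = np.nan

# Steps 2 and 3: fit the filter and compare it with the known residuals.
spatial_filter = fit_filter(residuals, E)
known = ~np.isnan(residuals)
corr = np.corrcoef(spatial_filter[known], residuals[known])[0, 1]

# Step 4: test the filter; the 0.25 threshold is the one quoted in the text.
p_value = moran_p_value(spatial_filter, W)
needs_another_round = p_value <= 0.25
print(f"correlation = {corr:.3f}, P-value = {p_value:.3f}, "
      f"add filter and re-estimate: {needs_another_round}")
```

Because the filter is a linear combination of eigenvectors that are defined for every location, it has no missing values, which is what makes the autocorrelation test possible even though the residuals themselves are incomplete.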
Table 3
Estimation results of sample selection model with spatial filter
Selection Model 3
| Variable | β | 95% CI |
| --- | --- | --- |
| Sample Selection Model | | |
| Intercept | −7.304 | [−8.346, −6.262] |
| Density (1,000 per sq.km) | 2.445 | [2.089, 2.802] |
| Density squared | −0.380 | [−0.468, −0.292] |
| Percent of private transportation | 0.056 | [0.046, 0.067] |
| Median household income ($10,000) | 0.318 | [0.28, 0.356] |
| Outcome Model | | |
| Intercept | 2.563 | [2.497, 2.629] |
| Density (1,000 per sq.km) | 0.760 | [0.746, 0.774] |
| Density squared | −0.133 | [−0.135, −0.13] |
| Percent of private transportation | −0.011 | [−0.012, −0.011] |
| Median household income ($10,000) | 0.002 | [−0.001, 0.004] |
| Spatial filter | 1.000 | [0.998, 1.001] |
| σ | 0.017 | [0.015, 0.019] |
| ρ | −0.304 | [−0.957, 0.349] |
Clearly, the various models display some intriguing differences; but how relevant are said differences from a more substantive standpoint? Fig. 4 shows the relationship between density and R0 implied by SWN Model 3, Selection Model 2, and Selection Model 3. The left panel of the figure shows the non‐linear but monotonic relationship implied by SWN Model 3. The conclusion is that at higher densities, R0 is always higher. The two panels on the right, in contrast, show that Selection Model 2 and Selection Model 3 agree that R0 tends to increase as density grows. This continues until a density of approximately 2.9 (1,000 people per sq.km). At densities higher than that, the relationship between density and R0 begins to weaken, and the relationship becomes negative at densities higher than approximately 5.7 (1,000 people per sq.km).
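The two thresholds quoted above can be recovered from the outcome model coefficients of Selection Model 3 in Table 3. As a worked sketch, and assuming that the two density terms enter the outcome equation additively so that the other covariates only shift the curve, write f(d) for the contribution of density d (in 1,000 people per sq.km). The symbols f, d* and d0 are introduced here only for illustration:

```latex
\begin{aligned}
f(d) &= 0.760\,d - 0.133\,d^{2} \\
f'(d) &= 0.760 - 0.266\,d = 0
  \;\Rightarrow\; d^{*} = \frac{0.760}{2 \times 0.133} \approx 2.86 \\
f(d_{0}) &= 0,\; d_{0} > 0
  \;\Rightarrow\; d_{0} = \frac{0.760}{0.133} \approx 5.71
\end{aligned}
```

After rounding, these agree with the approximate values of 2.9 and 5.7 given in the text: the contribution of density peaks at d* and falls below zero beyond d0.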
Figure 4
Effect of density according to SWN Model 3 and Sample Selection Model 2.
To put this into context, other things being equal, the effect of density in a county like Charlottesville in Virginia (density ~1,639 people per sq.km) is roughly the same as that in a county like Philadelphia (density ~4,127 people per sq.km). In contrast, the effect of density on R0 in a county like Arlington in Virginia (density ~3,093 people per sq.km) is stronger than either of the previous two examples. Lastly, the density of counties like San Francisco in California, or Queens and Bronx in NY, which are among the densest in the United States, contributes even less to R0 than even the most rural counties in the country.
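The comparison between counties can be checked numerically. The short Python sketch below evaluates the density terms of the outcome model of Selection Model 3 (coefficients copied from Table 3) at the three densities quoted above. The function name density_contribution is hypothetical, and the calculation assumes that all other covariates are held equal.

```python
# Outcome model coefficients of Selection Model 3, copied from Table 3.
B_DENSITY = 0.760      # density, in 1,000 people per sq.km
B_DENSITY_SQ = -0.133  # density squared

def density_contribution(d):
    """Contribution of the two density terms at density d (1,000 per sq.km)."""
    return B_DENSITY * d + B_DENSITY_SQ * d ** 2

# Approximate densities quoted in the text, converted to 1,000 people per sq.km.
counties = {"Charlottesville": 1.639, "Philadelphia": 4.127, "Arlington": 3.093}

effects = {name: density_contribution(d) for name, d in counties.items()}
for name, value in effects.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the contributions for Charlottesville and Philadelphia are about 0.89 and 0.87, while that for Arlington is about 1.08, which matches the ordering described in the text.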
Discussion
It is worth recalling at this point Cressie’s dictum about modeling: “[w]hat is one person’s mean structure could be another person’s correlation structure” (Cressie 1989, p. 201). There are almost always multiple ways to approach a modeling situation, as vividly illustrated by a recent paper that reports the results of a crowdsourced modeling experiment (Schweinsberg et al. 2021). In the present case, I would argue that spatial sampling is an important aspect of the modeling process. Importantly, by adopting high reproducibility standards, SWN made a valuable contribution to the collective enterprise of seeking knowledge. Their effort, and subsequent efforts to validate and expand on their work, can potentially help to clarify ongoing conversations about the relevance of density to the spread of COVID‐19.
In particular, it is noteworthy that a sample selection model with a different variable transformation does not lend support to the thesis that higher density is always associated with a greater risk of spread of the virus (in the words of Wong and Li (2020), “‘Density is destiny’ is probably an overstatement”). At the same time, the results presented here also stand in contrast to the findings of Hamidi et al., who found that higher density was either not significantly associated with the rate of the virus in a cross‐sectional study (Hamidi, Ewing, and Sabouri 2020b), or was negatively associated with it in a longitudinal setting (Hamidi, Ewing, and Sabouri 2020a). In this sense, the conclusion that density does not aggravate the pandemic may have been somewhat premature; instead, reanalysis of the data of SWN suggests that Fielding‐Miller, Sundaram, and Brouwer (2020) might be onto something with respect to the difference between rural and urban counties. More generally, there is no doubt that in population‐level studies density is indicative of proximity, but it is also potentially a proxy for adaptive behavior. It is possible that the determining factor during COVID‐19, at least in the United States, has been variations in perceptions of the risks associated with contagion (Chauhan et al. 2021), and subsequent compensations in behavior in more and less dense regions.
Conclusion
The tension between the need to publish research potentially useful in dealing with a global pandemic, and a potential “carnage of substandard research” (Bramstedt 2020), highlights the importance of efforts to maintain the quality of scientific outputs during COVID‐19. An important part of quality control is the ability of independent researchers to verify and examine the results of materials published in the literature. As previous research illustrates, reproducibility in scientific research remains an important but elusive goal (e.g., Gustot 2020; Iqbal et al. 2016; Stodden, Seiler, and Ma 2018; Sumner et al. 2020). This idea is reinforced by the review conducted for this paper in the context of research about population density and the spread of COVID‐19.
Taking one recent example from the literature (Sy, White, and Nichols 2021), the present article illustrates the importance of good reproducibility practices. Sharing data and code can catalyze research by allowing independent verification of findings, as well as additional research. After verifying the results of SWN, experiments with sample selection models and variations in the definition of model inputs lead to an important reappraisal of the conclusion that high density is associated with greater spread of the virus. Instead, the possibility of a non‐monotonic relationship between population density and contagion is raised. I do not claim that the analysis presented here is the last word on the topic of density and the spread of COVID‐19, and there is always the possibility that someone else will be better equipped to analyze these data with greater competence. By opening up the analysis, documenting the way data were pre‐processed, and sharing analysis‐ready data, my hope is that others will be able to discover the limitations of my own analysis and improve on it, as appropriate.
More generally, my hope is that the research of Sy, White, and Nichols (2021), the present article, and similar reproducible publications will continue to encourage others to adopt higher reproducibility standards in their research.