Literature DB >> 36168515

Bayesian negative binomial regression with spatially varying dispersion: Modeling COVID-19 incidence in Georgia.

Fedelis Mutiso¹, John L Pearce², Sara E Benjamin-Neelon^3,4, Noel T Mueller^5,6, Hong Li¹, Brian Neelon^1,7.

Abstract

Overdispersed count data arise commonly in disease mapping and infectious disease studies. Typically, the level of overdispersion is assumed to be constant over time and space. In some applications, however, this assumption is violated, and in such cases, it is necessary to model the dispersion as a function of time and space in order to obtain valid inferences. Motivated by a study examining spatiotemporal patterns in COVID-19 incidence, we develop a Bayesian negative binomial model that accounts for heterogeneity in both the incidence rate and degree of overdispersion. To fully capture the heterogeneity in the data, we introduce region-level covariates, smooth temporal effects, and spatially correlated random effects in both the mean and dispersion components of the model. The random effects are assigned bivariate intrinsic conditionally autoregressive priors that promote spatial smoothing and permit the model components to borrow information, which is appealing when the mean and dispersion are spatially correlated. Through simulation studies, we show that ignoring heterogeneity in the dispersion can lead to biased and imprecise estimates. For estimation, we adopt a Bayesian approach that combines full-conditional Gibbs sampling and Metropolis-Hastings steps. We apply the model to a study of COVID-19 incidence in the state of Georgia, USA from March 15 to December 31, 2020.

Entities: Chemical

Keywords: zzm321990zzm321990Bzzm321990-splines; Conditionally autoregressive prior; Overdispersion; Pólya-gamma distribution; Spatial data analysis

Year: 2022 PMID： 36168515 PMCID： PMC9500097 DOI： 10.1016/j.spasta.2022.100703

Source DB: PubMed Journal: Spat Stat

Introduction

In disease mapping studies, the Poisson distribution is commonly used to examine geographic variation in disease counts (Best et al., 2000, Wakefield, 2006). A well-known limitation of the Poisson distribution, however, is equidispersion, or equivalence of the mean and variance. In many applications, this assumption is violated and the data exhibit overdispersion in which the variance exceeds the mean. Less commonly, the data may be underdispersed whereby the variance is less than the mean. To address overdispersion, the Poisson mean is often allowed to vary across covariate groups or geographic units using fixed and random effects. For example, many disease mapping studies will allow the mean to vary according to a prior distribution that induces marginal overdispersion in the response. In the simplest case, the Poisson mean is assumed to follow a Gamma prior distribution, leading to a marginally overdispersed negative binomial (NB) distribution. More generally, it is possible to introduce spatial random effects into the Poisson mean to capture unobserved geographical variation in the mean and induce marginal overdispersion. Conditional on the random effects, however, the data are once again assumed to follow an equidispersed Poisson distribution. In some cases, this conditional equidispersion assumption is itself limited, and the data are not only marginally overdispersed, but also conditionally overdispersed. Failing to account for this conditional overdispersion can lead to downwardly biased standard errors (Hilbe, 2011). One remedy is to assume a conditional NB (or related) distribution to accommodate residual overdispersion that may be present after conditioning on random effects. In particular, the NB model introduces a separate dispersion parameter that accounts for overdispersion in the data not explained by latent heterogeneity in the mean. Although this approach is sensible in many applications, most overdispersed count distributions, including NB, Conway-Maxwell-Poisson (Shmueli et al., 2005) or Poisson-lognormal (Stewart, 1994), constrain the dispersion parameter to be constant across covariate groups or, in the case of spatial data, across geographic areas. This assumption may be untenable in some applications. For example, in studies of COVID-19 incidence rates, we might expect the degree of conditional overdispersion to vary across covariate groups and across geographic regions, such as counties or census tracts. In such cases, it is necessary to properly model not only the mean, but also the dispersion term itself as a function of covariates and potentially spatial random effects to appropriately capture the observed and unobserved heterogeneity in the data. Several authors have explored the utility of varying dispersion parameters in Poisson-gamma models. Heydecker and Wu (2001) noted that the dispersion parameter could be modeled as a function of covariates in a study to identify sites for road accident remedial work. They postulated that the Poisson-gamma model with non-constant dispersion can better represent the nature of crash data than traditional Poisson-gamma model with constant dispersion parameter. Miaou and Lord (2003) proposed a Poisson-gamma model with a dispersion parameter dependent on traffic flows and flow ratios to study traffic crash-flow relationships for intersections. They found that assuming a fixed dispersion parameter could seriously bias both the empirical and full Bayes estimates for individual sites. Miranda-Moreno et al. (2005) compared the performance of alternative risk models for identifying hazardous sites and found that Poisson-gamma models with covariate-dependent dispersion parameters were superior to traditional models. More recently, Geedipally and Lord (2008) studied the effects of covariate-dependent dispersion parameters on estimation of confidence intervals in Poisson-gamma models. They noted that models with a non-constant dispersion parameter produced smaller confidence intervals and thus more precise estimates for both the gamma mean and the predicted response compared to models with a fixed dispersion parameter. Collectively, this work suggests that incorporating heterogeneous dispersion into count models can improve statistical inferences, particularly with respect to precision of parameter estimates. However, these studies did not incorporate spatial information, which is our focus here. Our goal, therefore, is to extend this work to the spatial setting by modeling spatial heterogeneity not only in the mean of the NB distribution, but also the dispersion parameter. Again, this is relevant in cases where the degree of overdispersion varies across regions. To account for this heterogeneity, we introduce covariates and random effects for both the mean and dispersion components of the NB model. The random effects are chosen to have a spatial structure to induce smoothing and account for spatial dependence among neighboring geographic areas. We allow for correlation between the random effects in the mean and dispersion components, which is appealing in settings where the mean response is associated with the degree of overdispersion. For example, in our application examining daily COVID-19 incidence rates in state of Georgia in the United States (US), it is likely that counties with higher average COVID-19 case counts also have more heterogeneous count distributions due to the presence of extreme counts or outliers. It is therefore advantageous to build this dependence into the model, while reserving uncorrelated components as a special case. For inference, we adopt a Bayesian approach and develop an efficient Markov chain Monte Carlo (MCMC) algorithm that combines full-conditional Gibbs sampling and Metropolis–Hastings (MH) steps.

Case study

As a motivating application, we examined associations between county-level social vulnerability and daily COVID-19 incident case data for the state of Georgia from March 15 to December 31, 2020, for a total of 292 study days. The COVID-19 data were downloaded from the Johns Hopkins University (JHU) Center for Systems Science and Engineering (Johns Hopkins University, 2020). In Georgia, the first COVID-19 case was reported on March 2, 2020 (Georgia Department of Public Health, 2020) and there were relatively few cases in early March. As a result, we initiated our analysis on March 15. To examine associations between COVID-19 incidence and social vulnerability, we downloaded the Social Vulnerability Index (SVI) data from the CDC’s Agency for Toxic Substances and Disease Registry (Centers for Disease Control and Prevention, 2018). The SVI is a percentile-based measure of social vulnerability derived from four themes that explain different aspects of vulnerability: socioeconomic status; household composition; race, ethnicity and language; and housing and transportation availability. The four themes comprise a total of 15 variables obtained from the US Census Bureau’s American Community Survey (Centers for Disease Control and Prevention, 2018). These variables include the percent of county residents below poverty, percent unemployed, percent of residents aged 65 and older, and crowding among others. Table S1 in Appendix A of the Supplementary material provides detailed information of the census variables used to construct each theme. Counties are ranked on each of the 15 items and the overall SVI rank for each county is formed by summing the individual rankings. To address potential confounding, we obtained several additional variables unrelated to the SVI from the Robert Wood Johnson Foundation’s 2019 County Health Rankings & Roadmaps: Rankings Data & Documentation (RWJ Foundation, 2020). These variables included the percentage of county residents in fair or poor health, the percentage of female residents in the county, the percentage of adult smokers in the county, and population density defined as the number of residents per square mile in the county. Data from various sources were at the county level, which allowed seamless merging using each county’s unique Federal Information Processing Standards (FIPS) code to create the final analysis dataset. Fig. 1 presents a map of the incidence rate per 100 000 individuals (panel a) and overdispersion estimates (panel b) from separate intercept-only NB models for all 159 counties in Georgia over the course of the study period, from March 15 to December 31, 2020. The color shades correspond to the quintile distribution of the predicted incidence and estimated overdispersion. The white shade represents the lowest quintile, whereas the darker shade denotes the uppermost quintile. The predicted incidence rate ranged from 8.42 to 68.65 cases per 100 000 individuals while overdispersion ranged from 0.712 to 5.443. Fig. 1(a) suggests high incidence case rates for select counties in the southwest, southeast, and northern corners of Georgia. The majority of these counties are small and rural with high social vulnerability indices. In contrast, the unadjusted predicted incidence is lower in larger suburban counties such as those within the Atlanta area in the north central portion of the state. Fig. 1(b) suggests high overdispersion in select counties in the southwest, central and northern corners of the state. As with the incidence rates, most of the overdispersed counties are rural and highly vulnerable. The elevated overdispersion in the southwest counties may be due to presence of extreme counts emanating from COVID-19 outbreaks early in the pandemic (Galofaro, 2020). Overall, Fig. 1(b) indicates that there is substantial spatial variation in overdispersion across counties. This highlights the importance of properly modeling dispersion in infectious disease settings, where localized outbreaks can lead to spikes in overdispersion that vary over time and across geographic regions.

Fig. 1

Descriptive maps for incidence rate and overdispersion from intercept only models for data from March 15, 2020 to December 31, 2020 for the 159 counties in Georgia. Panel (a): Predicted incidence. Panel (b): Overdispersion. Legends correspond to sample quintiles.

Spatial negative binomial model with heterogeneous dispersion

Model specification and Bayesian inference

The spatial NB model with heterogeneous dispersion extends the conventional NB model to accommodate covariates in the dispersion parameter, as well as spatial effects in both the mean and dispersion. Let denote the number of COVID-19 cases in county () on day (). We assume that is distributed as NB() with the following probability mass function The conditional mean and variance of are given by where captures the overdispersion in the data. As , the mean and variance of the counts are approximately the same and a Poisson model could be assumed. As , the counts of county on day become increasingly dispersed relative to the Poisson distribution. To facilitate Bayesian computation, we follow Pillow and Scott (2012) and model using an inverse-logit function: where is a vector of county-level covariates, is a vector of fixed effect coefficients, is a smooth function of time evaluated at day , and is a county-level spatial random effect. The linear predictor, , also includes two offset terms: , the population size for county , and , an additional rate constant that allows us to model the COVID-19 incident rate per individuals (e.g. =100 000) in a county of size . The above representation of NB allows us to exploit a key connection between the inverse-logit function for and the Pólya-Gamma () density for efficient Bayesian inference. In particular, the parameterization in Eq. (3) enables us to introduce latent weight terms, , that follow independent prior distributions. By conditioning on these weights, the NB GLM can be represented as a scale mixture of normal distributions with precision terms, which allows a straightforward and efficient Gibbs sampling update for NB regression parameters. Details on key properties of are provided in Appendix B of the Supplementary material. For additional details on data augmentation and the above parameterization of the NB model, see Neelon, 2019, Neelon et al., 2022. To accommodate spatial and temporal variation in dispersion, we assume the following functional form for : where is a vector of county-level covariates, is a vector of fixed effect coefficients, is a smooth function of time evaluated at day , and is a county-level spatial random effect for overdispersion. In subsequent discussion, we assume that the fixed effect covariates for the mean and dispersion components are identical (i.e., and thus ) although in general this is not necessary. To allow for sufficient smoothness over time, we model and using cubic -splines (Eilers and Marx, 1996) of the form where is the model component (1 for the mean and 2 for the dispersion component); is a vector of -splines basis functions for time , where and is the number of interior knots; is a vector of fixed effect basis coefficients for component that capture the overall temporal trend. The non-intercept terms of in Eq. (3), that is, , represent log rate ratios (RRs). Thus, for , denotes a rate ratio. The parameter vector in Eq. (4) denotes a vector of log overdispersion terms. For example, represents the change in for a one-unit change in , holding other variables fixed. The random effects and capture, respectively, the variability in rate ratio and overdispersion across regions. Adjusted for observed covariates, large values of suggest a higher incidence of the outcome in region compared to other regions while larger values of imply more overdispersion in counts for region relative to other regions. The spatial random effects account for unmeasured regional level factors that may influence the incidence and dispersion in related ways. For instance, regions with a high incidence may exhibit more overdispersion, possibly due to presence of extreme counts. To allow this association and to encourage spatial smoothing and borrowing of information across neighboring areas, we assume a bivariate intrinsic CAR (BICAR) prior distribution for (Mardia, 1988): where represents the number of neighbors of region , is the set of neighbors for region and is a 2 × 2 variance–covariance matrix of conditional on the remaining spatial random effects, . In presence of a fixed effect intercept, a sum-to-zero constraint must be applied to to ensure the model is identifiable. The BICAR prior implies that the conditional mean of is the average of neighboring regions’ spatial effects, and thus incorporates information from the neighbors by allowing adjacent regions to borrow information. The sharing of information can lead to more reliable random-effect estimates for regions with small sample sizes. The covariance matrix scaled by the number of neighbors for region allows for spatial smoothing since, as the number of neighbors increases, more information is borrowed in predicting , and hence more prior confidence that is conditionally similar to the average of adjacent regions. The off-diagonal element of , denoted , is the covariance between and and controls the degree of association between the two model components. When , regions with higher incidence may tend to have more overdispersed counts. When , the two models are uncorrelated; thus, we retain separate components as a special case, and this assumption can be evaluated by inspecting whether the credible interval for includes . The joint prior distribution for the spatial random effects matrix follows from Brook’s lemma (Banerjee et al., 2014) and is proportional to the multivariate normal distribution given by where and is adjacency matrix with , if regions and are neighbors and otherwise. Since is singular, the joint prior distribution in (7) is improper, but the posterior of is itself proper. To complete Bayesian model specification, we assign proper but diffuse priors to the respective model parameters. For the mean component, we conveniently combine the fixed effect and spline parameters, and , into a single parameter vector, say of dimension , for a single Gibbs update from the full conditional distribution of . We assume an exchangeable normal prior for . That is, , where is the default choice leading to independent priors for the fixed effects and spline coefficients. To update the parameters for dispersion component, and , we take advantage of the correlation between these two parameter vectors to efficiently update them as a block, say of dimension , using a Metropolis–Hastings step. Similar to , we assume an exchangeable normal prior for . That is, . For the spatial covariance matrix, , we assume an inverse-Wishart () prior with default degrees of freedom and as the default choice for . For posterior computation, we implement an MCMC algorithm that combines full-conditional Gibbs steps with Metropolis–Hastings updates. For , we adopt the computationally efficient data-augmented Gibbs sampler proposed by Pillow and Scott (2012) for posterior computation of . The sampler introduces latent weights, for all and , that are scale mixtures of normal with independent precision terms (Polson et al., 2013a). For and , is said to have a PG distribution if where the ’s are independently distributed according to a Gamma distribution. If we draw independently from PG, where is defined in (3), then conditional on , the latent response variable is distributed according to where is the combined vector of fixed-effect covariates and the -splines basis functions. This leads to a convenient data-augmentation scheme for updating : (1) for all , draw from PG(); (2) form ; (3) assuming a prior, update from its closed-form full conditional, where Here, is an design matrix of fixed-effect covariates and the -splines basis functions, is an diagonal matrix with diagonal entries , is an vector of latent responses, is an vector of spatial random effects, and is an vector of population offsets. The proof of (8) is provided in Appendix B of the Supplementary material. To update the fixed and spline effects for the dispersion, , we adopt a random walk MH steps with symmetric multivariate proposal density centered at the previous value of and acceptance ratio where and are the proposed and current values of at iteration , respectively, and and are the probability distribution functions for NB and the variate normal prior distribution with mean and covariance evaluated at . Similarly, to update the 2 × 1 random effect vector , we use a random walk MH step with a symmetric bivariate proposal density centered at the previous and acceptance ratio where and are the proposed and current values of , respectively and is distributed according to the conditional bivariate CAR prior distribution of given in (6). Lastly, assuming an IW prior, we update the random effects covariance from a conjugate inverse-Wishart (IW) distribution where and is the random effects matrix centered at its mean. The first column of is equal to and the second column equal to . The Gibbs sampler for posterior inference can be summarized as follows: For county on day , draw from PG() using the accept-reject algorithm implemented using the R package BayesLogit (Polson et al., 2013b) For all and , create the latent response variable Update from its conjugate full conditional, where and are provided in (8) Update as a block using a random walk MH step with acceptance ratio given in (9) For , update the vector using a MH step with acceptance ratio given in (10) Finally, update the random effects covariance matrix, , from a conjugate IW distribution given in (11). As a sensitivity analysis, we consider an alternative first-order random walk (RW1) prior for the spline coefficients and of the form is a prior precision term and is a penalty matrix of rank with entries This prior introduces autoregressive dependence and creates additional smoothing which may be desirable in some cases particularly for the mean portion of the model. The Gibbs and MH update steps under this alternative RW1 prior for are outlined in Appendix C of the Supplementary material. As described below in Section 5, an additional sensitivity analysis was performed using this prior with COVID-19 case data from USAFacts (USAFacts, 2022). MCMC convergence is monitored using standard Bayesian diagnostic approaches such as trace plots and Geweke (1992) statistics, which we implement using package (Plummer et al., 2006). To compare different models, we use the “widely applicable information criterion” or WAIC (Watanabe, 2010) Bayesian measure, which combines a measure of predictive accuracy of the fitted model with a penalty for model complexity.

Parameter interpretation for a simple model

In this section, we illustrate parameter interpretation for a simple model. Consider now the following fixed effects version of the model in Eq. (1), where Under this specification, the model parameters have the following interpretations: represents the incidence rate (i.e., cases per individuals) when and . denotes the incidence rate ratio corresponding to a one-unit change in , holding the dispersion parameter fixed. If both and , then and hence . In this case, we have essentially no overdispersion and the model reduces to a Poisson-like equidispersed model in which the variance in Eq. (2) reduces to the mean . Thus, the Poisson model arises as a special case of the proposed model. If but , then we have a conventional NB model with constant . Hence, the NB model arises as a special case of the proposed model. Finally, if and , then the degree of overdispersion varies as a function of . For example, if is a binary indicator for the presence of a hospital facility in the county, this would imply different levels of overdispersion across counties based on presence or absence of a hospital facility. More generally, the model proposed in Eqs. (1), (3) permits the dispersion to vary flexibly over space and time through the inclusion of spatiotemporal fixed and random effects. At the same time, it includes the Poisson and NB as special cases, and this assumption can be assessed using model comparison metrics such as WAIC (Watanabe, 2010). Thus, the model accommodates flexible dispersion structures but can reduce to more basic models if the data support a lower degree of overdispersion.

Simulation study

We conducted a simulation study to examine the performance of the model and to explore the effect of ignoring spatial heterogeneity during model fitting. Specifically, we compared estimates under the proposed model to estimates from three misspecified models: a model that included fixed and uncorrelated effects rather than BICAR spatial effects for the two components; a model that included fixed and spatial effects for the mean component but only fixed effects for overdispersion parameter; and a model with fixed and spatial effects for the mean and a constant overdispersion parameter. To mimic the spatial structure of our application, we used the US Census county-level adjacency matrix for Georgia. This matrix has counties and unique adjacencies. We generated the outcome data, , under the following model comparable to one given in (1): where for and , denotes incident COVID-19 case count for county on day ; is a county level covariate such as the percent of county designated as rural; is a county level binary covariate; and are smooth functions of time, , for the mean and dispersion components; and is a vector of spatial intercepts that is assumed to follow a BICAR prior distribution. For component , we approximate using cubic -splines as described in Eq. (5) with interior knots placed days apart for a total spline coefficients. For the county sample sizes, , we used the US Census county population estimates for the state of Georgia (US Census Bureau, 2021), resulting in a range of 1 537 to 1 063 937. We assumed a rate constant in order to model incidence rates per 100 000 individuals. The true values for the model parameters (excluding splines) are given in Table 1. Figure S1 in Appendix D of the Supplementary material presents a partial histogram of the counts. We assumed independent N priors for the fixed effects for both the count and dispersion components of the model. To complete prior specification, we assigned an IW prior to . We set the initial values for the fixed effects of the mean, , at the maximum likelihood estimates from a fixed effect NB model while the fixed effects for dispersion component, , were initialized to . We sampled the initial values for and from independent standard normal distributions, and was initialized to .

Table 1

Model	Component	Parameter	Truth	Post. Mean (95% CrI)	Bias
Model 1 (Simulated Model)	Count	β10	0.50	0.42(0.28,0.57)	−0.08
		β11	−0.25	−0.25(−0.32,−0.18)	0.003
		β12	0.50	0.41(0.28,0.53)	−0.09
	Dispersion	β20	−0.50	−0.54(−0.59,−0.47)	−0.04
		β21	0.50	0.50(0.46,0.57)	0.001
		β22	0.25	0.18(0.08,0.28)	−0.07
	Random effects	Σ11	0.70	0.61(0.44,0.85)	−0.09
		Σ12	0.20	0.16(0.05,0.30)	−0.04
		Σ22	0.50	0.45(0.31,0.63)	−0.05

Model 2	Count	β10	0.50	0.42(0.25,0.59)	−0.08
		β11	−0.25	−0.22(−0.29,−0.15)	0.03
		β12	0.50	0.46(0.35,0.57)	−0.04
	Dispersion	β20	−0.50	−0.52(−0.61,−0.43)	−0.02
		β21	0.50	0.53(0.48,0.58)	0.03
		β22	0.25	0.19(0.10,0.28)	−0.06
	Random effects	Σ11	0.70	0.48(0.35,0.64)	−0.22
		Σ22	0.50	0.38(0.28,0.51)	−0.12

Model 3	Count	β10	0.50	0.16(0.00,0.33)	−0.34
		β11	−0.25	−0.36(−0.43,−0.29)	−0.11
		β12	0.50	0.36(0.21,0.50)	−0.14
	Dispersion	β20	−0.50	−0.77(−0.83,−0.71)	0.23
		β21	0.50	0.39(0.36,0.42)	−0.11
		β22	0.25	0.08(0.03,0.13)	−0.17
	Random effects	Σ11	0.70	0.74(0.59,0.93)	0.04

Model 4	Count	β10	0.50	0.11(−0.04,0.25)	−0.39
		β11	−0.25	−0.75(−0.80,−0.69)	−0.50
		β12	0.50	0.26(0.14,0.38)	−0.24
	Dispersion	α	–	2.40(2.33,2.48)	–
	Random effects	Σ11	0.70	0.74(0.59,0.94)	0.04

For comparison, we fitted three misspecified models (labeled Models ). Model assumed uncorrelated random intercepts, and , in Eq. (14). Hence, this model assumes that the process governing random variation in incidence is unrelated to the process governing random variation in the dispersion. Consequently, we assumed univariate ICAR distribution for the spatial random effects for each component with conditional precision (). We assigned independent conjugate Gamma priors to the conditional precision parameters for both components. We set the initial values for the fixed effects of the mean, , at the maximum likelihood estimates from a fixed effect NB model while the fixed effects for dispersion component, , were initialized to . We sampled the initial values for the random effects from independent standard normal distributions, and the precision terms were initialized to one. To update and , we used separate MH steps with a univariate proposal centered at the previous . Model assumed fixed and spatial random effects for the mean but only fixed effects in the dispersion, thus ignoring random spatial variation in this component. We assumed a univariate ICAR distribution for the spatial random effect, , for the mean with conditional precision . We assigned an independent conjugate Gamma prior to . We set the initial values for the fixed effects of the mean, , at the maximum likelihood estimates from a fixed effect NB model while the fixed effects for dispersion component, , were initialized to . We sampled the initial values for for the mean from independent standard normal distribution, and the precision term was initialized to one. As with Model , a MH step with a univariate proposal centered at the previous was used to update . Model assumed fixed and spatial random effects for the mean but assumed constant dispersion ( for all and ). Similar to Models and , we assumed a univariate ICAR distribution for the spatial random effects for the mean with conditional precision . We assigned independent conjugate Gamma prior to . We set the initial values for the fixed effects of the mean, , at the maximum likelihood estimates from a fixed effect NB model. We sampled the initial values for the random effects for the mean from independent standard normal distribution, and the precision term was initialized to one. As with Models and , a MH step with a univariate proposal was used to update . We initialized the overdispersion parameter, , to 1.2. To update , we used a MH step with a zero-truncated normal proposal with variance 0.0018. We conducted the simulations in R version 4.0.3 (R Core Team, 2021) and ran all the models for 30 000 iterations each with a burn-in of 15 000, which was sufficient in all cases to ensure convergence based on trace plots and Geweke diagnostics p-values. To reduce auto-correlation, we retained every 10th observation. Across all models, we observed adequate MH acceptance ratios ranging from 35% to 43%. The simulation results are detailed in Table 1. The columns present the true parameter values, the posterior means, 95% credible intervals (CrIs), and the estimated bias for the proposed (Model ) and misspecified models (Models ). The results for the fixed effect parameters for Models and were similar and well estimated. The random effect variance components of Model were all close to their true values. In Model , the bias for , , was , which is substantially smaller than the bias for from Model (). In addition, the 95% CrI for from Model did not include the true value. Most parameters for Model were poorly estimated. Likewise, all the parameters for Model exhibited large bias and poor CrI coverage. As expected, the proposed model performed best, followed by Models and . These findings suggest that ignoring or oversimplifying spatial heterogeneity in the dispersion component has an impact on the mean. This highlights a critical point that proper specification for dispersion is needed for correct model inferences, even when interest is in the mean. Table 2 summarizes the goodness of fit statistics for different models using WAIC. Rows indicate the model while the first three columns represent various components of the WAIC statistic (likelihood, penalty, and Total WAIC). The last column represents the successive differences in WAIC values from the previous model. Model slightly outperformed Model ; Models and showed poor fit. These results again suggest that ignoring the spatial effects in dispersion (Model ) and assuming constant dispersion in addition to ignoring spatial effects in dispersion (Model ) can lead to poor model fit when the mean and dispersion processes do, in fact, vary spatially in a correlated manner.

Table 2

Model comparison results for Simulation Study .

	WAIC estimate
	Likelihood	Penalty	Total	Successive difference
Model 1	−75701.53	261.44	151925.90	–
Model 2	−75714.45	259.09	151947.10	21.20
Model 3	−76294.23	202.50	152993.40	1046.30
Model 4	−77208.52	187.39	154791.80	1798.42

Results from simulation study comparing a spatially correlated NB model with three misspecified models: Model 1: Spatially correlated random intercepts in the mean and dispersion components. Model 2: Separate spatial random intercepts in the mean and dispersion components. Model 3: Spatial random intercept in mean component and only covariates in the dispersion component. Model 4: Spatial random intercept in mean component and a constant dispersion parameter. Note that in all models and in models 1 and 2. Figs. 2(a) and 2(b) present, respectively, the true and estimated incidence rate and overdispersion trends over time. In particular, the mean trend followed a sinusoidal pattern with higher counts around days , and . A similar sinusoidal pattern for overdispersion was observed with higher overdispersion around days , and . Overall, the estimated time trends closely mimic the simulated trends suggesting the proposed model properly fits the data. If the effect of time on incidence case rates or overdispersion was negligible, the line trend for or would be horizontal through zero; in that case, the spline function for that component would be unnecessary.

Fig. 2

(a): Average time trend for the mean in simulation study. (b): Average time trend for dispersion in simulation study.

Model comparison results for Simulation Study . (a): Average time trend for the mean in simulation study. (b): Average time trend for dispersion in simulation study. Figs. 3(a)–3(d) present a map of the true values and posterior mean predictions for and under Model . In both cases, the spatial pattern for the estimated effects closely resembles the true spatial distribution, suggesting that the proposed model precisely recovered the underlying spatial pattern in the data. Of note, the model was able to accurately capture the spatial pattern for the dispersion, which can in turn lead to improved model fit.

Fig. 3

Simulated and predicted random effects for the mean and dispersion components for Model 1: spatial NB model with correlated random intercepts. Panel (a): Simulated random intercept for the mean component. Panel (b): Predicted random intercept for the mean component. Panel (c): Simulated random intercept for dispersion component. Panel (d): Predicted random intercept for dispersion component. Legends correspond to sample quintiles.

Analysis of Georgia COVID-19 data

To analyze the Georgia COVID-19 data, we follow the same setup described in Section 4 and fit the following spatial NB model with correlated random intercepts for the mean and dispersion: where SVI is the social vulnerability index for county , PctPoorHealth is the percentage in fair or poor health in county , PctFemale is the percentage of female residents in county , PctSmoke is the percentage of adult smokers in county , and PopDensity is the population density of county . As in the simulation, we model time as a smooth function, (), which we approximate by cubic -splines with interior knots placed days apart. For SVI, we considered three functional forms: linear, quadratic, and cubic -splines. However, because the WAIC values for all the three models were nearly identical, we opted for the more parsimonious model with linear SVI trend. Since population density provides a measure of rurality, we did not include rurality as an independent predictor in the model. Moreover, because the final SVI measure is composed of variables such as percent poverty, percent minority, and percent age 65 and older, we did not include these variables as separate independent variables in the models. Similar to Section 4, we assigned weakly informative normal priors to both the fixed effect and spline regression coefficients. We assigned a BICAR prior for with an IW() prior for the spatial covariance . We ran the MCMC algorithm for 20 000 iterations with a burn-in of 10 000 and a thin of . Model diagnostics such as the trace plots and Geweke diagnostics p-values greater than 0.05 suggest the MCMC chains achieved convergence. Posterior mean RRs (exponentiated coefficients) for the mean and dispersion components are presented in Table 3. The SVI was positively associated with COVID-19 incident cases in the adjusted model, while the percentage of female residents and population density exhibited a negative association with COVID-19 incident cases. The SVI estimate suggests that each unit increase in SVI was on the average associated with a 20% increase in COVID-19 incidence (RR ). Additionally, these results confirm the somewhat surprising descriptive finding noted earlier in Section 2 about a possible inverse relationship between COVID-19 incidence and population density in the state of Georgia. Specifically, results from the adjusted model suggest that each unit increase in population density was on the average associated with a 20% decrease in COVID-19 incidence (RR ). When compared to the mean component, most estimates for the dispersion component preserve the same direction. The SVI was significant and positively associated with overdispersion while the percentage of female residents was significantly negatively associated with decreased overdispersion. In particular, each unit increase in SVI was associated with a 12% increase in overdispersion of COVID-19 incident case counts per 100 000 individuals (RR ).

Table 3

Posterior mean rate ratios and 95% posterior intervals for the COVID-19 study from the spatial NB model with correlated intercepts (excluding spline parameters for time) .

Model component	Variable	Rate ratio (95% CrI)
Mean	SVI	1.20(1.07,1.32)
	% fair or poor health	1.06(0.94,1.22)
	% female	0.82(0.78,0.86)
	% of adult smokers	0.98(0.84,1.13)
	Population density	0.80(0.75,0.85)
Dispersion	SVI	1.12(1.05,1.15)
	% fair or poor health	0.94(0.89,1.00)
	% female	0.85(0.83,0.89)
	% of adult smokers	1.07(0.99,1.13)
	Population density	0.83(0.80,0.86)
Random effects	var(ϕ1i)	0.82(0.63,1.06)
	cov(ϕ1i,ϕ2i)	0.54(0.40,0.72)
	var(ϕ2i)	0.52(0.41,0.67)
	corr(ϕ1i,ϕ2i)	0.82(0.75,0.87)

The variance component estimates suggest a moderate degree of variability in the random effects across counties, with greater heterogeneity in the mean than in the dispersion component (). This is not surprising since we expect counties to differ more in terms of underlying factors that affect the incidence rates than in terms of factors that affect overdispersion. The estimate for the random effect correlation was 0.82 (, suggesting a strong positive correlation between the mean and dispersion components, and further supporting the appropriateness of a model with spatially correlated random intercepts. Posterior mean rate ratios and 95% posterior intervals for the COVID-19 study from the spatial NB model with correlated intercepts (excluding spline parameters for time) . Figs. 4(a) and 4(b) display the estimated average trends across counties for incident cases and overdispersion, respectively. The observed daily mean incidence rates across counties are overlayed in Fig. 4(a). The average daily incidence increased steadily from March 15 to mid-April before leveling off in May to early June. This plateau may be the result of social distancing mandates put into effect in April, 2020. From mid June to July, the incidence rose steadily and remained elevated through August. The incidence tapered off in September and early October, before a precipitous increase through the end of year as a “third wave” of infections set in. Fig. 4(b) suggests an increasing rate of overdispersion from March 15 through the first week of April. This increase in overdispersion could be due to lack of incident case reports across many counties early on in the pandemic, and hence the data were possibly zero-inflated. The overdispersion trend leveled-off between mid-April to late May before steadily decreasing until early August. From September through the end of study period, the overdispersion trend appeared to flatten, possibly suggesting that a model with a constant time trend would provide adequate fit to the data during this time interval. Overall, however, the non-constant overdispersion trend further highlights the model fitting flexibility afforded by the proposed model.

Fig. 4

Daily incidence rate and overdispersion trends for the COVID-19 study. Panel (a): Mean incidence rate trend across counties. Panel (b): Mean overdispersion () across counties.

Daily incidence rate and overdispersion trends for the COVID-19 study. Panel (a): Mean incidence rate trend across counties. Panel (b): Mean overdispersion () across counties. Fig. 5 presents the model-based spatial random effect predictions for the mean and dispersion components. Both figures suggest a substantial variation in the spatial distribution of random effects within each component but similar spatial patterns across components. The between-component similarities in spatial patterns support the hypothesis of increased overdispersion in counties with higher incident rates.

Fig. 5

Model-based spatial random effect estimates for the COVID-19 study. Panel (a): Posterior mean predicted random effects for the mean component . Panel (b): Posterior mean predicted random effects for the dispersion component . Legends correspond to sample quintiles.

Counties with darker shades in panel (a) exhibit higher expected incidence compared to typical (i.e., ) counties while those in white shade have lower expected incident rates. Similarly, the dark shaded counties in panel (b) suggest higher overdispersion compared to typical (i.e., ) counties while those in white shade have lower overdispersion. Specifically, the maps indicate high incidence rates and levels of overdispersion among rural counties, particularly those in the southwest that were home to severe outbreaks early in the pandemic (Galofaro, 2020). This result corroborates the negative association for population density that we observed in Table 3. In effect, rural areas were prone to occasional outbreaks that caused extreme spikes in incidence given the small population sizes. This in turn led to increased overdispersion in these rural counties. The random effects for counties within large suburban areas such as Atlanta (north central portion) and Augusta (central eastern border) are lightly or moderately shaded suggesting lower incidence and overdispersion relative to more rural and less populated counties such as in southwest Georgia. Model-based spatial random effect estimates for the COVID-19 study. Panel (a): Posterior mean predicted random effects for the mean component . Panel (b): Posterior mean predicted random effects for the dispersion component . Legends correspond to sample quintiles. As a sensitivity analysis, we refit the model using the RW1 prior for the spline coefficients, as descried in Section 3. The results are presented in Supplementary Figures S2 and S3 and closely mirror those from our primary analysis assuming independent normal priors for the spline coefficients. Finally, because COVID-19 case data can be susceptible to various types of measurement error, we conducted a second sensitivity analysis using data from USAFacts (USAFacts, 2022) to determine if our primary results based on JHU data were consistent across different publicly available databases. USAFacts has been used widely in COVID-19 studies (Millett et al., 2020, Whaley et al., 2021, Madlock-Brown et al., 2022). Results using the USAFacts data with RW1 prior are presented in Supplementary Figures S2 and S3 and are nearly identical to those based on the JHU data, suggesting that the findings are robust to different sources of publicly available data.

Discussion

This paper proposed a NB model with spatially correlated random effects for the mean and dispersion components. Both components consist of region level predictors, temporal trends, and random effects that are modeled via a bivariate CAR prior. The choice of this prior is appealing since it induces correlation between the mean and dispersion components of the model, an attractive feature if regions with high incidence rates also exhibit more overdispersion. Results from our simulation study suggest that ignoring the spatial correlation when the two components are truly correlated leads to biased fixed effect and covariance estimates. Moreover, modeling the overdispersion parameter as a function of region level covariates (e.g. SVI) and temporal -splines provides additional model flexibility, as it is uncommon for overdispersion to remain constant over space and time in disease-mapping applications such as the COVID-19 study presented here. Our COVID-19 study for incident cases in Georgia revealed several important findings. First, counties with high SVI had significantly higher incident case rates and more overdispersion. Second, results suggest that incident case rates and overdispersion were generally higher in counties with low population density. This finding suggests that the pandemic has had a disproportionate effect on the incident rate in more vulnerable, rural communities. This may point to inadequate access to health care or lack of appropriate health resources in rural areas. The maps of model-based spatial random effect estimates in Fig. 5 accurately highlight several rural counties in southwest Georgia that reported COVID-19 outbreaks arising from superspreader events such as funerals (Galofaro, 2020). Thus, the proposed model could be a useful tool to inform policy decisions related to COVID-19 or future pandemics. In particular, the model could be used to ensure effective distribution of health resources and personnel in more vulnerable communities in Georgia, throughout the US, and in other parts of the world faced with emerging infectious diseases that may exhibit spatially and temporarily varying overdispersion. There are some limitations to our analysis. In particular, measurement error (due, for example, to underreporting) poses a challenge in analyzing COVID-19 case data. While our sensitivity analysis using the USAFacts data does not address underreporting per se, it lends some confidence that our analyses yield consistent results across different publicly available and widely used data sources. Future work might tackle the issue of measurement error more directly using recent approaches proposed by Zhang and Yi (2022) and de Oliveira et al. (2022). Our model could also be applied to COVID-19 deaths, which are less prone to underreporting; however, zero-inflated models may be warranted in this case due to the high proportion of zeros present in COVID-19 fatality data. Future work could also permit county-specific time trends by incorporating space–time interactions into the model, as discussed in [20], Rushworth et al. (2014), and Lee et al. (2018). Finally, following work of Sahu and Böhning (2021) and Neelon et al. (2022), the proposed model could also be extended to jointly model cases and deaths as part of future work. The methods discussed in this paper can be applied to other research areas, such as the analysis of the human microbiome. This work could be also extended to address underdispersed data through the development of generalized Poisson (Consul, 1989) models with spatially varying dispersion parameters. Underdisepresion has increasingly been recognized as an important feature of COVID-19 data (Kobak, 2022). More generally, our approach should prove useful when interest lies in modeling spatial count data with non-constant dispersion.

Funding and disclaimer

This work was supported by grant HX002299-01A2 from the Department of Veteran Affairs Health Services Research and Development (HSR&D) program and in part by the Biostatistics Shared Resource, Hollings Cancer Center, Medical University of South Carolina (P30 CA138313). This article represents the views of the authors and not those of the Department of Veterans Affairs, HSR&D, the United States government, or the Biostatistics Shared Resource, Hollings Cancer Center. The funders had no influence on the study design, implementation, or findings.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

6 in total

1. Disease mapping and spatial regression with count data.

Authors: Jon Wakefield
Journal: Biostatistics Date: 2006-06-29 Impact factor: 5.899

2. Zero-inflated models for adjusting varying exposures: a cautionary note on the pitfalls of using offset.

Authors: Cindy Feng
Journal: J Appl Stat Date: 2020-07-25 Impact factor: 1.416

3. Underdispersion: A statistical anomaly in reported Covid data.

Authors: Dmitry Kobak
Journal: Signif (Oxf) Date: 2022-03-29

4. A model to estimate the impact of changes in MMR vaccine uptake on inequalities in measles susceptibility in Scotland.

Authors: Gary Napier; Duncan Lee; Chris Robertson; Andrew Lawson; Kevin G Pollock
Journal: Stat Methods Med Res Date: 2016-08 Impact factor: 3.021

5. Assessing the Association Between Social Gatherings and COVID-19 Risk Using Birthdays.

Authors: Christopher M Whaley; Jonathan Cantor; Megan Pera; Anupam B Jena
Journal: JAMA Intern Med Date: 2021-08-01 Impact factor: 44.409

6. Assessing differential impacts of COVID-19 on black communities.

Authors: Gregorio A Millett; Austin T Jones; David Benkeser; Stefan Baral; Laina Mercer; Chris Beyrer; Brian Honermann; Elise Lankiewicz; Leandro Mena; Jeffrey S Crowley; Jennifer Sherwood; Patrick S Sullivan
Journal: Ann Epidemiol Date: 2020-05-14 Impact factor: 3.797

6 in total