Gina S Lovasi, David S Fink, Stephen J Mooney, Bruce G Link.
Abstract
Accounting for non-independence in health research often warrants attention. Particularly, the availability of geographic information systems data has increased the ease with which studies can add measures of the local "neighborhood" even if participant recruitment was through other contexts, such as schools or clinics. We highlight a tension between two perspectives that is often present, but particularly salient when more than one type of potentially health-relevant context is indexed (e.g., both neighborhood and school). On the one hand, a model-based perspective emphasizes the processes producing outcome variation, and observed data are used to make inference about that process. On the other hand, a design-based perspective emphasizes inference to a well-defined finite population, and is commonly invoked by those using complex survey samples or those with responsibility for the health of local residents. These two perspectives have divergent implications when deciding whether clustering must be accounted for analytically and how to select among candidate cluster definitions, though the perspectives are by no means monolithic. There are tensions within each perspective as well as between perspectives. We aim to provide insight into these perspectives and their implications for population health researchers. We focus on the crucial step of deciding which cluster definition or definitions to use at the analysis stage, as this has consequences for all subsequent analytic and interpretational challenges with potentially clustered data.
Keywords: epidemiologic measurement; health surveys; multilevel analysis
Year: 2017 PMID: 29276757 PMCID: PMC5737714 DOI: 10.1016/j.ssmph.2017.07.005
Source DB: PubMed Journal: SSM Popul Health ISSN: 2352-8273
Fig. 2. Schematic representation of observations within clusters differing in density and sampling strategy. Notes: Panel (a) shows a balanced sampling pattern with 10 dots sampled by design within each of 10 randomly selected clusters, a situation often handled through the use of weights or so-called “fixed effects” (dummy indicator variables for all clusters except for an omitted reference cluster). Panel (b) shows a sparse, unbalanced pattern of 100 dots arranged randomly across 100 clusters, resulting in few observations per cluster (including some clusters with zero observations). Panel (c) shows an unbalanced pattern of 1000 dots within 100 clusters.
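The three sampling patterns in this caption can be mimicked with a short simulation. The sketch below is only an illustration under stated assumptions: the random seed, the uneven cluster weights used for panel (c), and all variable and function names are hypothetical and do not come from the article.

```python
import random
from collections import Counter


def cluster_sizes(assignments, n_clusters):
    """Count observations per cluster, keeping clusters with zero observations."""
    counts = Counter(assignments)
    return [counts.get(c, 0) for c in range(n_clusters)]


rng = random.Random(0)  # arbitrary fixed seed (assumption) so the run is repeatable

# Panel (a): 10 observations sampled by design in each of 10 clusters.
balanced = [c for c in range(10) for _ in range(10)]

# Panel (b): 100 observations placed at random across 100 clusters.
sparse = [rng.randrange(100) for _ in range(100)]

# Panel (c): 1000 observations across 100 clusters with uneven cluster sizes.
# The squared-uniform weights are an illustrative assumption, not the article's design.
weights = [rng.random() ** 2 for _ in range(100)]
dense = rng.choices(range(100), weights=weights, k=1000)

for name, assignments, k in [("a", balanced, 10), ("b", sparse, 100), ("c", dense, 100)]:
    sizes = cluster_sizes(assignments, k)
    empty = sum(1 for s in sizes if s == 0)
    print(f"panel ({name}): min={min(sizes)} max={max(sizes)} empty clusters={empty}")
```

Comparing the minimum, maximum, and number of empty clusters across the three patterns shows why a sparse design such as panel (b) leaves little information for estimating cluster-specific quantities.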
Fig. 1. A schematic diagram of overlapping sources of clustering. Subjects recruited from schools A and D are both clustered in schools and in an overlapping subset of census tracts. Which, if any, of these clustering sources does an analyst need to account for? Notes: This study recruited students from schools A and D, then measured neighborhood conditions in census tracts referring to students in those tracts (1, 2, 4, 5, 6, 8, and 9). Does the analyst need to account for clustering on tracts, on schools, or both? How should we decide, noting that the clusters are overlapping and not hierarchical? A design-based perspective would emphasize the recruitment setting, indicating that inference about students in general must account for clustering of students within schools. A model-based perspective would emphasize whether clustering is important to approximating the probability model generating the observed data.
Summary of the contrasting perspectives on accounting for non-independence described here.
| | Model-based | Design-based |
|---|---|---|
| Goal of accounting for clustering | Better approximating the probability model for the data generating process | Accounting for sampling strategy to allow inference to a finite population of interest |
| Implications for analysis | A tendency toward more complexity, potentially including cross-classified models to avoid misspecification | A tendency toward less complexity; focused attention on accounting for sampling may be seen as sufficient |
| Cluster definition source | Relatively more emphasis on the a priori structure of the data generating process, or empirical analysis suggestive of residual clustering | Relatively more emphasis on the investigator-controlled and empirically-informed model relating cluster membership to sampling probabilities |
| Key analytic technique(s) | Multi-level models, generalized estimating equations or cluster robust standard errors | Models incorporating complex sampling weights, which may include multi-level models or generalized estimating equations |
Note: While we emphasize for clarity the divergent implications of the model-based and design-based perspectives, both perspectives are flexible and there is much potential for overlap and integration.
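To make the choice among candidate cluster definitions concrete, the following minimal sketch computes a cluster-robust standard error for a sample mean under two non-nested cluster definitions (school and census tract) and compares both with a naive standard error that assumes independence. Everything in it is an assumption made for illustration: the simulated data, the numbers of schools and tracts, the effect sizes, and the function names are hypothetical, and the estimator is the basic sandwich form without a small-sample correction rather than a method attributed to the authors.

```python
import math
import random
from collections import defaultdict


def naive_se_of_mean(values):
    """Standard error of the mean assuming independent observations."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((y - mean) ** 2 for y in values) / (n - 1)
    return math.sqrt(variance / n)


def cluster_robust_se_of_mean(values, cluster_ids):
    """Cluster-robust (sandwich) standard error of the mean.

    Residuals are summed within each cluster before squaring, which allows
    arbitrary correlation inside a cluster. No small-sample correction.
    """
    n = len(values)
    mean = sum(values) / n
    totals = defaultdict(float)
    for y, c in zip(values, cluster_ids):
        totals[c] += y - mean
    return math.sqrt(sum(t * t for t in totals.values())) / n


# Hypothetical cross-classified data: each student has a school and a tract.
rng = random.Random(42)
n_students, n_schools, n_tracts = 600, 20, 30
school_effect = [rng.gauss(0, 1.0) for _ in range(n_schools)]
tract_effect = [rng.gauss(0, 0.5) for _ in range(n_tracts)]
school = [rng.randrange(n_schools) for _ in range(n_students)]
tract = [rng.randrange(n_tracts) for _ in range(n_students)]
outcome = [
    school_effect[s] + tract_effect[t] + rng.gauss(0, 1.0)
    for s, t in zip(school, tract)
]

print("naive SE:              ", naive_se_of_mean(outcome))
print("clustered on school SE:", cluster_robust_se_of_mean(outcome, school))
print("clustered on tract SE: ", cluster_robust_se_of_mean(outcome, tract))
```

The two clustered standard errors generally differ from each other and from the naive one, which is the practical face of the question the table raises. Cluster-robust estimates of this kind are known to be unreliable when the number of clusters is small, so a study with only two recruited schools would need a different approach.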