Francesco Innocenti, Math J J M Candel, Frans E S Tan, Gerard J P van Breukelen.
Abstract
In multilevel populations, there are two types of population means of an outcome variable, ie, the average of all individual outcomes ignoring cluster membership and the average of cluster-specific means. To estimate the first mean, individuals can be sampled directly with simple random sampling (SRS) or with two-stage sampling (TSS), that is, sampling clusters first and then individuals within the sampled clusters. When cluster size varies in the population, three TSS schemes can be considered, ie, sampling clusters with probability proportional to cluster size and then sampling the same number of individuals per cluster; sampling clusters with equal probability and then sampling the same percentage of individuals per cluster; and sampling clusters with equal probability and then sampling the same number of individuals per cluster. Unbiased estimation of the average of all individual outcomes is discussed under each sampling scheme, assuming cluster size to be informative. Furthermore, the three TSS schemes are compared in terms of efficiency with each other and with SRS under the constraint of a fixed total sample size. The relative efficiency of the sampling schemes is shown to vary across cluster size distributions. However, sampling clusters with probability proportional to size is the most efficient TSS scheme for many cluster size distributions. Model-based and design-based inference are compared and shown to give similar results. The results are applied to the distribution of high school size in Italy and the distribution of patient list size for general practices in England.
Keywords: design-based inference; hierarchical population; informative cluster size; model-based inference; two-stage sampling
Year: 2018 PMID: 30575062 PMCID: PMC6590157 DOI: 10.1002/sim.8070
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
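The three two-stage sampling schemes described in the abstract can be illustrated with a small Monte Carlo sketch. Everything below is an assumption made for illustration: the synthetic population, the parameter names (`gamma`, `sd_u`, `sd_e`), the labels TSS1 to TSS3 following the order of the schemes in the abstract, and the estimators themselves, which are standard survey-sampling choices (plain mean of cluster means under probability-proportional-to-size sampling, size-weighted means under equal-probability sampling) and may differ in detail from the estimators derived in the paper.

```python
import random
import statistics


def mean(xs):
    return sum(xs) / len(xs)


def make_population(rng, K=200, beta0=10.0, gamma=0.5, sd_u=0.8, sd_e=2.0):
    """Synthetic clusters whose mean outcome depends on size (informative cluster size)."""
    sizes = [rng.randint(20, 200) for _ in range(K)]
    mean_n, sd_n = mean(sizes), statistics.pstdev(sizes)
    clusters = []
    for N in sizes:
        # Cluster effect: part explained by standardized cluster size, part random.
        u = gamma * (N - mean_n) / sd_n + rng.gauss(0.0, sd_u)
        clusters.append([beta0 + u + rng.gauss(0.0, sd_e) for _ in range(N)])
    return clusters


def population_mean(clusters):
    """Average of all individual outcomes, ignoring cluster membership."""
    return sum(map(sum, clusters)) / sum(map(len, clusters))


def tss1(rng, clusters, k, n):
    """Clusters drawn with probability proportional to size (with replacement),
    n individuals per cluster; estimator: unweighted mean of cluster sample means."""
    sizes = [len(c) for c in clusters]
    drawn = rng.choices(range(len(clusters)), weights=sizes, k=k)
    return mean([mean(rng.sample(clusters[j], n)) for j in drawn])


def tss2(rng, clusters, k, frac):
    """Clusters drawn with equal probability, same fraction of individuals per
    cluster; estimator: mean of all sampled individuals."""
    drawn = rng.sample(range(len(clusters)), k)
    pooled = []
    for j in drawn:
        n_j = max(1, round(frac * len(clusters[j])))
        pooled.extend(rng.sample(clusters[j], n_j))
    return mean(pooled)


def tss3(rng, clusters, k, n):
    """Clusters drawn with equal probability, n individuals per cluster;
    estimator: cluster sample means weighted by cluster size."""
    drawn = rng.sample(range(len(clusters)), k)
    num = sum(len(clusters[j]) * mean(rng.sample(clusters[j], n)) for j in drawn)
    return num / sum(len(clusters[j]) for j in drawn)


def simulate(seed=1, reps=1000, k=20, n=10):
    """Return the true mean and, per scheme, (average estimate, sampling variance)."""
    rng = random.Random(seed)
    clusters = make_population(rng)
    # Fraction chosen so that TSS2 samples about n individuals per cluster on average.
    frac = n * len(clusters) / sum(map(len, clusters))
    draws = {"TSS1": [], "TSS2": [], "TSS3": []}
    for _ in range(reps):
        draws["TSS1"].append(tss1(rng, clusters, k, n))
        draws["TSS2"].append(tss2(rng, clusters, k, frac))
        draws["TSS3"].append(tss3(rng, clusters, k, n))
    mu = population_mean(clusters)
    return mu, {s: (mean(v), statistics.pvariance(v)) for s, v in draws.items()}


if __name__ == "__main__":
    mu, results = simulate()
    print(f"true mean: {mu:.3f}")
    for scheme, (avg, var) in results.items():
        print(f"{scheme}: average estimate {avg:.3f}, sampling variance {var:.4f}")
```

Comparing the printed sampling variances gives a rough empirical counterpart to the relative efficiencies studied in the paper, although the total sample size under TSS2 is only fixed on average in this sketch.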
Notation
| Quantity | Population | Sample |
|---|---|---|
| Number of clusters | | k |
| Number of individuals within cluster | | |
| Number of individuals | | |
| Average cluster size | | |
| Cluster size variance | | |
| Coefficient of variation of cluster size | | |
| Skewness of cluster size distribution | | ‐ |
| Kurtosis of cluster size distribution | | ‐ |
| Correlation between cluster effect and cluster size | | ‐ |
| Unexplained between‐cluster variance | | ‐ |
| Within‐cluster variance | | ‐ |
| Total unexplained outcome variance | | ‐ |
| Intraclass correlation coefficient | ρ | ‐ |
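The cluster size descriptors listed in the notation (average, variance, coefficient of variation, skewness, kurtosis) can be computed from a list of cluster sizes as in the minimal sketch below. The function name and the conventions are assumptions for illustration: moments use the population divisor K, and kurtosis is reported as the non-excess fourth standardized moment, which may not match the convention used in the paper.

```python
import math


def cluster_size_summary(sizes):
    """Descriptive statistics of a cluster size distribution (population moments)."""
    K = len(sizes)
    avg = sum(sizes) / K
    # Central moments of order 2, 3 and 4, each divided by K.
    m2 = sum((N - avg) ** 2 for N in sizes) / K
    m3 = sum((N - avg) ** 3 for N in sizes) / K
    m4 = sum((N - avg) ** 4 for N in sizes) / K
    sd = math.sqrt(m2)
    return {
        "mean": avg,
        "variance": m2,
        "cv": sd / avg,             # coefficient of variation
        "skewness": m3 / sd ** 3,   # third standardized moment
        "kurtosis": m4 / m2 ** 2,   # fourth standardized moment (non-excess)
    }


if __name__ == "__main__":
    print(cluster_size_summary([10, 20, 30]))
```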
Estimators of the population mean μ: conditional and marginal expectations and variancesᵃ
| | SRS | TSS1 | TSS2 | TSS3 |
|---|---|---|---|---|
ᵃ Derivations are given in Appendix B. Here k is the number of clusters sampled with any TSS scheme, and k_SRS is the number of clusters indirectly sampled with SRS. The expressions involve the sample and population coefficients of variation of cluster size, and the skewness and kurtosis of the cluster size distribution. The fourth row shows whether the estimator is unbiased or approximately unbiased (ie, for k sufficiently large). SRS, simple random sampling; TSS1, two‐stage sampling 1; TSS2, two‐stage sampling 2; TSS3, two‐stage sampling 3.
Relative efficiencies of two‐stage sampling (TSS) schemes versus simple random sampling (SRS) and each otherᵃ
ᵃ Derivations are given in section 2 of the Supplementary Material. Recall that ρ is the intraclass correlation, defined as the ratio of the unexplained between‐cluster variance to the total unexplained outcome variance.
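As a worked reminder, the intraclass correlation can be written out as below. The symbols are assumptions for illustration, since the original notation is not shown here: \sigma_u^2 for the unexplained between‐cluster variance, \sigma_\varepsilon^2 for the within‐cluster variance, and \sigma_y^2 for their sum. The second and third lines give the classical textbook case of k equally sized clusters with n sampled individuals each and noninformative cluster size; they are included only for orientation and are not taken from the paper's table.

```latex
\rho = \frac{\sigma_u^2}{\sigma_y^2},
\qquad \sigma_y^2 = \sigma_u^2 + \sigma_\varepsilon^2

% Textbook special case: k clusters, n individuals per cluster, equal cluster sizes
\operatorname{Var}(\bar{y}_{\mathrm{TSS}})
  = \frac{\sigma_u^2}{k} + \frac{\sigma_\varepsilon^2}{kn}
  = \frac{\sigma_y^2}{kn}\,\bigl[1 + (n-1)\rho\bigr]

% Relative efficiency versus SRS with the same total sample size kn
\mathrm{RE}(\mathrm{TSS}\ \text{vs}\ \mathrm{SRS})
  = \frac{\sigma_y^2/(kn)}{\operatorname{Var}(\bar{y}_{\mathrm{TSS}})}
  = \frac{1}{1 + (n-1)\rho}
```

For example, with ρ = 0.1 and n = 10 the bracketed design effect is 1.9, so two‐stage sampling would be about 53% as efficient as SRS in this simplified setting.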
Figure 1. Model‐based Relative Efficiency of TSS2 versus TSS1 for a given total sample size, as a function of the (absolute value of the) correlation between cluster effect and cluster size (ie, corr(u, N)), for different values of the average number of individuals sampled per cluster and of the coefficient of variation of cluster size (ie, τ) (curves), and different cluster size distributions (panels). The values of the relative efficiency at corr(u, N) = 0 and corr(u, N) = 1 refer to ω and λ, respectively.
Figure 2. Model‐based Relative Efficiencies of TSS3 versus TSS2, for a given total sample size and noninformative cluster size (ie, γ = 0), as a function of the coefficient of variation of cluster size (ie, τ), for different values of the intraclass correlation (ie, ρ) (curves) and for different average numbers of individuals sampled per cluster (panels).
Population mean μ estimator and sampling variance per sampling scheme under the design‐based approachᵃ
| | SRS | TSS1 | TSS2 | TSS3 |
|---|---|---|---|---|
| Under Equation (7) | | | | |
ᵃ Note that is the number of individuals sampled with SRS, k is the number of clusters sampled with a TSS scheme, and . For any TSS scheme, we assume and or sampling with replacement at each stage, and for SRS or sampling with replacement. In the third row, the outcome variable is assumed to be described by Equation (7). For large enough k, the variances in the third row are equal to those in the last two rows of Table 1. Note that is the skewness, and is the kurtosis of the cluster size distribution in the population. Derivations are given in section 3 of the Supplementary Material. SRS, simple random sampling; TSS1, two‐stage sampling 1; TSS2, two‐stage sampling 2; TSS3, two‐stage sampling 3.
Figure 3. First column: Distribution of public high school size in Italy (first row), distribution of patient list size for general practices in England (second row). Second column: Model‐based Relative Efficiency of TSS2 versus TSS1, as a function of the (absolute value of the) correlation between cluster effect and cluster size (ie, corr(u, N)), for different values of the intraclass correlation coefficient ρ (curves). Third column: Design‐based Relative Efficiency of TSS2 versus TSS1, as a function of the (absolute value of the) correlation between cluster mean and cluster size, for different values of the intraclass correlation coefficient ρ (curves). TSS1, two‐stage sampling 1; TSS2, two‐stage sampling 2.