
Relative efficiencies of two-stage sampling schemes for mean estimation in multilevel populations when cluster size is informative.

Francesco Innocenti¹, Math J J M Candel¹, Frans E S Tan¹, Gerard J P van Breukelen¹,².

Abstract

In multilevel populations, there are two types of population means of an outcome variable ie, the average of all individual outcomes ignoring cluster membership and the average of cluster-specific means. To estimate the first mean, individuals can be sampled directly with simple random sampling or with two-stage sampling (TSS), that is, sampling clusters first, and then individuals within the sampled clusters. When cluster size varies in the population, three TSS schemes can be considered, ie, sampling clusters with probability proportional to cluster size and then sampling the same number of individuals per cluster; sampling clusters with equal probability and then sampling the same percentage of individuals per cluster; and sampling clusters with equal probability and then sampling the same number of individuals per cluster. Unbiased estimation of the average of all individual outcomes is discussed under each sampling scheme assuming cluster size to be informative. Furthermore, the three TSS schemes are compared in terms of efficiency with each other and with simple random sampling under the constraint of a fixed total sample size. The relative efficiency of the sampling schemes is shown to vary across different cluster size distributions. However, sampling clusters with probability proportional to size is the most efficient TSS scheme for many cluster size distributions. Model-based and design-based inference are compared and are shown to give similar results. The results are applied to the distribution of high school size in Italy and the distribution of patient list size for general practices in England.


Keywords: design-based inference; hierarchical population; informative cluster size; model-based inference; two-stage sampling


Year: 2018 PMID: 30575062 PMCID: PMC6590157 DOI: 10.1002/sim.8070

Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373

INTRODUCTION

Hierarchical or multilevel populations arise when individuals or micro‐units are nested within clusters or macro‐units.1, 2 Considering, for the sake of simplicity, only populations with two levels of nesting, examples include patients clustered in general practices, elderly people nested in nursing homes, and students grouped in schools. In these populations, the overall mean of an outcome variable (eg, cholesterol level, blood pressure, body mass index) can be defined in two ways, ie, as the mean of all individuals in the population ignoring cluster membership (ie, first, pooling all patients from all clusters in the population, and then computing the average cholesterol level); or as the mean of all cluster‐specific means (ie, first, computing the mean cholesterol level within each cluster, and then averaging all the cluster‐specific means). These two definitions coincide only under special conditions, as will be seen later, but this paper focuses on the first definition only. Related to these two definitions is the concept of informative cluster size. When clusters vary in size in the population (eg, small versus large general practices), cluster sizes can be seen as realizations of a random variable,3 and the outcome variable of interest may be related to cluster size (eg, surgeons operating on many patients might have better performances than those operating on fewer patients4). If this is the case, then cluster size is said to be informative.5 Nevalainen et al6 describe and give practical examples of three data‐generating mechanisms that can lead to informative cluster size. Briefly, a latent variable (eg, the competence of the surgeon) influences cluster size (eg, the number of patients) and the outcome variable (eg, success of the operation) at the same time; or cluster size affects the outcome variable (eg, surgeons become better by practice); or vice versa, the outcome variable affects cluster size (eg, better surgeons get more referrals). 
Relatedly, Seaman et al5 point out that the standard methods to analyze clustered data, namely, generalized linear mixed models (GLMMs) and generalized estimating equations (GEEs), implicitly assume that cluster size is unrelated to the outcome variable, and discuss different methods to handle informative cluster size for cluster‐specific inference with GLMM and population‐average inference with GEE. The topic of this paper is the unbiased and efficient estimation of the population mean in the presence of informative cluster size. To estimate the population mean, individuals can be sampled either with simple random sampling (SRS), that is, directly from the population, or with two‐stage sampling (TSS), that is, sampling first clusters and then individuals within the sampled clusters.7, 8, 9 Given cluster size variation in the population, at least three alternative TSS schemes can be considered: (1) sampling clusters with probability proportional to cluster size and then sampling the same number of individuals from each sampled cluster; (2) sampling clusters with equal probability and then sampling, per sampled cluster, a number of individuals proportional to cluster size; and (3) sampling clusters with equal probability and then sampling the same number of individuals per cluster. In order to evaluate each sampling scheme in terms of unbiasedness and efficiency of mean estimation, it is useful to distinguish two approaches to inference in the survey sampling literature10: the design‐based paradigm7, 8, 9 and the model‐based approach.11, 12, 13 In the design‐based approach, the outcome value for each unit (eg, patient) in the population is assumed to be a fixed unknown quantity. The random variable is then the inclusion indicator, that is, the variable that states whether or not a unit is included in the sample. Thus, inference is based on the distribution of the inclusion indicator over repeated samples with a probability sampling design. 
In contrast, the model‐based approach assumes that the outcome value in the real finite population is a realization of a stochastic model, representing a hypothetical infinite population. Inference is then based on the probabilistic model. As long as the assumptions of the model are met, model‐based inference can then ignore the sampling scheme and condition on the observed sample.8, 10, 12, 13 However, if the model residuals (ie, the stochastic part) are correlated with the variables which determine the sampling probabilities (and then with the sampling probabilities themselves), the sampling design is said to be informative.2(p222),10, 13, 14, 15, 16 When this is the case, model‐based analysis is biased, unless the sampling design is taken into account.2(p237) In the multilevel modeling literature, many authors have investigated unbiased estimation when TSS with unequal sampling probabilities is informative, but they assumed noninformative cluster size.16, 17, 18, 19, 20 In this paper, this sampling scheme is informative due to the cluster size being informative. In this paper, cluster size is treated as a random variable and assumed to be informative, but the special case of noninformative cluster size will also be covered briefly. Furthermore, a simple hierarchical linear model,1, 2 for the outcome variable in the population, is assumed and used to define the parameter of interest (ie, the population mean). We thus adopt a model‐based approach but will also make a comparison with design‐based inference. It will be shown that the type of analysis (ie, unweighted versus weighted analysis) needed for unbiased estimation of the population mean depends on the chosen sampling scheme. Furthermore, the three aforementioned TSS schemes will be compared with each other and with SRS in terms of their efficiency under the constraint of a fixed total sample size. It will also be shown that their relative efficiencies depend on the cluster size distribution. 
The rest of this paper is organized as follows. In Section 2, the assumptions on which our findings are based and the considered sampling schemes are presented in more detail. In Section 3, the population mean is derived under a linear mixed model for a two‐level hierarchical population with varying and informative cluster size. Furthermore, Section 3 deals with the estimation of the population mean under different sampling schemes, presenting both the expectation and sampling variance of the estimator under each scheme. In Section 4, the three TSS schemes are compared with each other and with SRS in terms of efficiency for a given total sample size (number of individuals). In Section 5, the relative efficiencies of the three TSS schemes are derived under the design‐based approach and compared with those obtained under the model‐based framework. The results of this paper are applied in Section 6 to two real populations, ie, high schools in Italy and general practices in England. Some final remarks are offered in Section 7. The online Supplementary Material contains part of the derivations of the equations given in this paper as well as additional tables and figures.

ASSUMPTIONS AND SAMPLING SCHEMES

The structure of the data is hierarchical with two levels of nesting (eg, pupils are nested within schools, patients within general practitioners (GPs)). The results of this paper are based on the following assumptions (the notation is summarized in Table A1 in Appendix A).

Table A1

Notation

	Population	Sample
Number of clusters	$K$	$k$
Number of individuals within cluster $j$	$N_j$	$n_j$ or $n$
Number of individuals	$N_{pop} = \sum_{j=1}^{K} N_j$	$m = \bar{n}k = \sum_{j=1}^{k} n_j$
Average cluster size	$\theta_N$	$\bar{N} = \sum_{j=1}^{k} N_j / k$
Cluster size variance	$\sigma_N^2$	$S_N^2 = \sum_{j=1}^{k} (N_j - \bar{N})^2 / k$
Coefficient of variation of cluster size	$\tau_N = \sigma_N / \theta_N$	$CV_N = S_N / \bar{N}$
Skewness of cluster size distribution	$\zeta_N = E[(N_j - \theta_N)^3] / \sigma_N^3$	‐
Kurtosis of cluster size distribution	$\eta_N = E[(N_j - \theta_N)^4] / \sigma_N^4$	‐
Correlation between cluster effect and cluster size	$corr(u_j, N_j)$	‐
Unexplained between‐cluster variance	$\sigma_\nu^2$	‐
Within‐cluster variance	$\sigma_\varepsilon^2$	‐
Total unexplained outcome variance	$\sigma_y^2 = \sigma_\nu^2 + \sigma_\varepsilon^2$	‐
Intraclass correlation coefficient	$\rho = \sigma_\nu^2 / \sigma_y^2$	‐

Assumption 1: The population is composed of $K$ clusters (eg, schools, GPs) and each cluster $j$ contains $N_j$ individuals (eg, students, patients), that is, clusters are allowed to have different sizes. The total number of individuals in the population (ie, the population size) is $N_{pop} = \sum_{j=1}^{K} N_j$.

Assumption 2: Sampling is either SRS of individuals in one stage, or else TSS. In TSS, we first sample $k$ clusters, and then sample $n_j$ or $n$ individuals from each sampled cluster $j$. In case of TSS, the population is very large relative to the sample size at each design level, that is, $k/K \approx 0$ and $\bar{n}/\theta_N \approx 0$, where $\bar{n}$ is the average number of individuals sampled per sampled cluster and $\theta_N$ is the mean cluster size in the population. In case of SRS, $N_{pop}$ is very large relative to $m$, the number of individuals sampled (ie, $m/N_{pop} \approx 0$).

Assumption 3: The outcome variable $Y$ is quantitative (eg, cholesterol level) and measured at the individual (eg, patient) level. Furthermore, $Y$ shows variation at the cluster level as well as at the individual level. Therefore, sampling error occurs at each design level. This is taken into account by assuming the following two‐level random intercept model for the outcome of the $i$th individual from the $j$th cluster:

$$Y_{ij} = \beta_0 + u_j + \varepsilon_{ij}, \qquad (1)$$

where $E(\varepsilon_{ij}) = 0$, $V(\varepsilon_{ij}) = \sigma_\varepsilon^2$, $u_j \perp \varepsilon_{ij}$, and $u_j$ will be defined in the next assumption. Note that multilevel models, such as Equation (1), are not only a standard procedure for modeling hierarchical populations1, 2 but also a natural way for taking into account the clustering induced by TSS in a model‐based approach. See other works.1(pp212,213),2(pp218,223),8(pp200,262‐264),10, 11(p256),12(p65),13, 21, 22

Assumption 4: The cluster effect $u_j$ is allowed to be linearly related to the size of the cluster in the population $N_j$, that is, $u_j = \alpha + \gamma N_j + \nu_j$, where $\alpha = -\gamma\theta_N$ for model identifiability, $\nu_j$ has mean zero and variance $\sigma_\nu^2$, and $\nu_j \perp N_j$.
In order to deal with cluster size variation and informative cluster size in estimating the population mean (ie, the average of all individual outcomes), three competing TSS schemes are considered, which will be compared with SRS of individuals and with each other, under the constraint that all sampling schemes have the same total sample size.

Two‐Stage Sampling 1 (TSS1): Stage 1: Sample $k$ clusters with probability proportional to cluster size $N_j$, that is, $N_j/N_{pop}$ is the probability of cluster $j$ being sampled if one cluster is randomly sampled, and so the inclusion probability for the $j$th cluster, that is, the probability that cluster $j$ is sampled given a total of $k$ sampled clusters, is approximately $\pi_j \approx kN_j/N_{pop}$ when each cluster is small relative to the population;9(p51) this approximation will be used. Stage 2: Sample the same number of individuals $n$ per cluster, so that $\pi_{i|j} = n/N_j$, where $\pi_{i|j}$ denotes the probability of including the $i$th individual from cluster $j$ in the sample, given that, at the first stage, the $j$th cluster is sampled. Note that, under this sampling scheme, all individuals have the same unconditional probability of selection, that is, $\pi_{ij} = \pi_j\pi_{i|j} = kn/N_{pop}$. A potential drawback of TSS1 is that we must know the sizes of all clusters in the population to draw the $k$ clusters for the sample.

Two‐Stage Sampling 2 (TSS2): Stage 1: Sample $k$ clusters with SRS, that is, $\pi_j = k/K$. Stage 2: Sample the same percentage $p$ of individuals per cluster, that is, the number of individuals sampled per cluster (ie, $n_j$) is proportional to the cluster size in the population (ie, $N_j$), and so $n_j = pN_j$ and $\pi_{i|j} = n_j/N_j = p$, $\forall j = 1,…,K$. Under this sampling scheme, the unconditional probability of being included in the sample is the same for all individuals, that is, $\pi_{ij} = \pi_j\pi_{i|j} = kp/K$. In contrast to TSS1, we now need to know the cluster sizes only for the sampled clusters before sampling individuals from those clusters.

Two‐Stage Sampling 3 (TSS3): Stage 1: Sample $k$ clusters with SRS, that is, $\pi_j = k/K$. Stage 2: Sample the same number of individuals $n$ per cluster, so that $\pi_{i|j} = n/N_j$. 
The unconditional sample inclusion probability of the $i$th individual in the $j$th cluster is $\pi_{ij} = \pi_j\pi_{i|j} = kn/(KN_j)$. Thus, individuals from different clusters have a different probability of being drawn from their cluster (the larger $N_j$, the smaller this probability). This has consequences for the data analysis, as will be seen in the next section. As a final remark on this section, note that the three TSS schemes considered here can be seen as three particular cases of a larger family of alternative TSS schemes. At the first stage, a more general expression for $\pi_j$ is $\pi_j = kX_j/\sum_{j=1}^{K} X_j$, where $X_j$ is an arbitrary auxiliary variable available before sampling. At the second stage, a general form for $\pi_{i|j}$ is $\pi_{i|j} = n_jZ_{ij}/\sum_{i=1}^{N_j} Z_{ij}$, where $Z_{ij}$ is an auxiliary variable available for individuals prior to sampling. Thus, TSS1 follows by imposing $X_j = N_j$, $Z_{ij} = 1$, and $n_j = n$. Instead, TSS2 results from $X_j = 1$, $Z_{ij} = 1$, and $n_j = pN_j$, whereas TSS3 is obtained with $X_j = 1$, $Z_{ij} = 1$, and $n_j = n$.
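The inclusion probabilities of the three TSS schemes can be sketched numerically. The following is a minimal illustration, not part of the original paper: the function name, the toy cluster sizes, and the values of k, n, and p are hypothetical, and TSS1 uses the large‐population approximation (first‐stage probability k N_j / N_pop) adopted in the text.

```python
import numpy as np


def inclusion_probabilities(N, k, scheme, n=None, p=None):
    """Return (pi_j, pi_i_given_j, pi_ij) for every cluster in the population.

    N      : sizes N_j of all K population clusters
    k      : number of clusters sampled at stage 1
    scheme : "TSS1", "TSS2" or "TSS3"
    n      : individuals sampled per cluster (TSS1 and TSS3)
    p      : fraction of individuals sampled per cluster (TSS2)
    """
    N = np.asarray(N, dtype=float)
    K, N_pop = N.size, N.sum()
    if scheme == "TSS1":
        # Clusters drawn with probability proportional to size, fixed n per cluster.
        pi_j = k * N / N_pop          # approximation used in the text
        pi_i_given_j = n / N
    elif scheme == "TSS2":
        # Clusters drawn with equal probability, fixed fraction p per cluster.
        pi_j = np.full(K, k / K)
        pi_i_given_j = np.full(K, float(p))
    elif scheme == "TSS3":
        # Clusters drawn with equal probability, fixed n per cluster.
        pi_j = np.full(K, k / K)
        pi_i_given_j = n / N
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return pi_j, pi_i_given_j, pi_j * pi_i_given_j


# Hypothetical toy population (far too small for the large-population
# assumption to hold; it only illustrates the formulas).
N = np.array([200, 400, 600, 800, 1000])
pi_tss1 = inclusion_probabilities(N, k=2, scheme="TSS1", n=10)[2]
pi_tss2 = inclusion_probabilities(N, k=2, scheme="TSS2", p=0.05)[2]
pi_tss3 = inclusion_probabilities(N, k=2, scheme="TSS3", n=10)[2]
print(pi_tss1, pi_tss2, pi_tss3)
```

Under TSS1 and TSS2 the unconditional probability is constant across individuals, whereas under TSS3 it decreases with cluster size, which is why TSS3 needs a size‐weighted analysis.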

DEFINITION AND ESTIMATION OF THE POPULATION MEAN μ

To find the population mean $E(Y_{ij})$ and variance $V(Y_{ij})$, defined from model (1) as the marginal expectation and variance of $Y_{ij}$ over cluster effect $u_j$ and individual effect $\varepsilon_{ij}$, the marginal expectation and variance of cluster effect $u_j$ (ie, $E(u_j)$ and $V(u_j)$, respectively) are needed. If cluster size is noninformative (ie, $\gamma = 0$ in Assumption 4), then $E(u_j) = 0$ and $V(u_j) = \sigma_\nu^2$, leading to $E(Y_{ij}) = \beta_0$ and $V(Y_{ij}) = \sigma_\nu^2 + \sigma_\varepsilon^2$. In contrast, if cluster size is informative (ie, $\gamma \neq 0$ in Assumption 4), the expectation and variance of $u_j$ depend on the sampling scheme. To prevent misunderstanding, note that the cluster effect $u_j$ in the population does not depend on the sampling design. Nevertheless, the sampling design determines the cluster effect sampling distribution, which is, for a sample of size one, equal to the population distribution of $u_j$ if clusters are sampled with equal probabilities, and equal to its size‐weighted counterpart if clusters are sampled with probabilities proportional to their size.

Under TSS2 or TSS3, the $k$ clusters are sampled with equal probabilities from the population of $K$ clusters, and then (for proofs, see Appendix A)

$$E(u_j) = 0, \qquad (2a)$$

$$V(u_j) = \gamma^2\sigma_N^2 + \sigma_\nu^2. \qquad (2b)$$

Note that $\gamma^2\sigma_N^2$ is the component of $V(u_j)$ explained by $N_j$, and $\sigma_\nu^2$ is the unexplained variance of $u_j$. Hence, the following expression for the mean of the cluster‐specific means comes from model (1) and Equation (2a):

$$E(Y_{ij}) = \beta_0 + E(u_j) = \beta_0, \qquad (3)$$

which can be interpreted as the expected outcome for an arbitrary individual from an arbitrary cluster. To estimate $\beta_0$ unbiasedly, large and small clusters should be weighted equally, both in the sampling scheme and in the estimator (see Appendix B). However, $\beta_0$ is not the parameter of interest in this paper.

Under SRS, $m$ individuals are sampled directly from the population of individuals and with equal probabilities (ie, $\pi_i = m/N_{pop}$, $\forall i = 1,…,N_{pop}$). Now, the probability that a selected individual belongs to a cluster of size $N_j$ is proportional to cluster size, meaning that large clusters have a higher chance of being represented in the SRS sample. 
Hence, under SRS, $k_{SRS}$ clusters are indirectly sampled from the population with sampling probability proportional to size, and $k_{SRS}$ can run from 1 to $m$. Likewise, under TSS1, $k$ clusters are sampled with probabilities proportional to their size, and so large clusters are more likely to be drawn. Therefore, under SRS and TSS1, the marginal expectation and variance of cluster effect $u_j$ are (for proofs, see Appendix A)

$$E(u_j) = \gamma\theta_N\tau_N^2, \qquad (4a)$$

$$V(u_j) = \gamma^2\sigma_N^2\left[\tau_N(\zeta_N - \tau_N) + 1\right] + \sigma_\nu^2, \qquad (4b)$$

where $\tau_N = \sigma_N/\theta_N$ and $\zeta_N = E[(N_j - \theta_N)^3]/\sigma_N^3$ are the coefficient of variation and the skewness of the cluster size distribution in the population, respectively. Note that Equation (4b) reduces to Equation (2b) if one of the following conditions holds: (i) $\tau_N = 0$ (ie, no cluster size variation), (ii) $\gamma = 0$ (ie, cluster size is noninformative), or (iii) $\zeta_N = \tau_N$ (eg, $N_j$ is Poisson distributed, see Table S.M.1 in the Supplementary Material). Likewise, Equation (4a) reduces to Equation (2a) if either condition (i) or (ii) holds. Thus, from model (1) and Equation (4a), the population mean that we here want to estimate is as follows:

$$\mu = \beta_0 + \gamma\theta_N\tau_N^2. \qquad (5)$$

This mean can be interpreted as the expected outcome for an individual randomly sampled by SRS from the population, ignoring cluster membership. Note that the two definitions of the population mean in Equations (3) and (5) coincide if either clusters have the same size in the population (ie, $\tau_N = 0$) or cluster size is not related to the outcome (ie, $\gamma = 0$). Given the focus of this paper on $\mu$, model (1) can be rewritten from Equation (5) as follows:

$$Y_{ij} = \mu + \tilde{u}_j + \varepsilon_{ij}, \qquad (6)$$

where $\tilde{u}_j = u_j - \gamma\theta_N\tau_N^2$ (see Equation (4a)) with $E(\tilde{u}_j) = 0$ and $V(\tilde{u}_j) = V(u_j)$ (see Equation (4b)). To estimate $\mu$ unbiasedly, the weight of a cluster should be proportional to its size, either in the sampling scheme or in the estimator (for details, see Appendices A and B). 
For each sampling scheme, the first row of Table 1 presents the unbiased or approximately unbiased (ie, for $k$ sufficiently large) estimator $\hat{\mu}$ of $\mu$ under model (6), the second and third rows present the conditional expectation and variance of $\hat{\mu}$, the fourth row gives the marginal expectation of $\hat{\mu}$, and the last two rows show the two components of the marginal variance of $\hat{\mu}$ (ie, $V(\hat{\mu}) = E(V(\hat{\mu} \mid N^*)) + V(E(\hat{\mu} \mid N^*))$, where the conditioning is on the sizes of the $k$ sampled clusters under TSS and of the $k_{SRS}$ indirectly sampled clusters under SRS) (for proofs, see Appendix B). As the first row of Table 1 shows, the estimator of $\mu$ is a weighted sum of cluster means in each sampling scheme, but the weights differ between schemes. Under SRS, $k_{SRS}$ clusters are indirectly sampled from the population and large clusters have a higher chance of being sampled, thus the unweighted estimator is unbiased for $\mu$ (recall that, from Assumption 2, $N_{pop}$ is very large relative to $m$, which implies that $k_{SRS} \to m$). Under TSS1, clusters are sampled with probabilities proportional to their size, and so $\mu$ is estimated unbiasedly by the unweighted average of cluster means. Under TSS3 and TSS2, cluster means must be weighted by cluster size (ie, $N_j$ in TSS3, and also in TSS2 because $n_j = pN_j$) in the analysis, because clusters are weighted equally by these sampling designs, that is, all clusters have equal sampling probability (for details, see Appendix B). An exception to this is the special case of noninformative cluster size (ie, $\gamma = 0$), in which the two definitions of population means coincide (ie, $\mu = \beta_0$). It then follows that $E(u_j) = 0$ for any sampling scheme (see Appendix A), and from model (1), it then results that $E(Y_{ij}) = \beta_0$. Thus, any estimator of $\mu = \beta_0$ of the form $\hat{\mu} = \sum_{j=1}^{k} w_j\bar{y}_j / \sum_{j=1}^{k} w_j$ is unbiased then, although some weights $w_j$ are more efficient than others.23, 24
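The difference between the two population means, and the role of size weighting, can be checked on a small finite population. The sketch below is illustrative only: the values of beta0 and gamma and the cluster sizes are hypothetical, the residual terms are set to zero so that the arithmetic is exact, and cluster size moments are computed over the K population clusters.

```python
import numpy as np

beta0, gamma = 50.0, 0.02                 # hypothetical model parameters
N = np.array([200, 400, 600, 800, 1000])  # hypothetical cluster sizes N_j

theta_N = N.mean()                        # mean cluster size
sigma2_N = N.var()                        # cluster size variance (divides by K)
tau_N = np.sqrt(sigma2_N) / theta_N       # coefficient of variation

# Cluster-specific means implied by u_j = gamma * (N_j - theta_N),
# with the unexplained cluster and individual effects set to zero.
cluster_means = beta0 + gamma * (N - theta_N)


def unweighted_mean(cluster_means):
    """Average of cluster means: targets beta0 when clusters are equally likely."""
    return float(np.mean(cluster_means))


def size_weighted_mean(cluster_means, sizes):
    """Cluster means weighted by cluster size: targets the individual-level mean."""
    return float(np.sum(sizes * cluster_means) / np.sum(sizes))


mean_of_cluster_means = unweighted_mean(cluster_means)
mean_of_individuals = size_weighted_mean(cluster_means, N)
mu_formula = beta0 + gamma * theta_N * tau_N**2

print(mean_of_cluster_means, mean_of_individuals, mu_formula)
```

With every cluster included, the unweighted average returns beta0, while the size‐weighted average agrees with beta0 + gamma * theta_N * tau_N^2, the individual‐level mean targeted in this paper.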

Table 1

Estimators of the population mean $\mu$: conditional and marginal expectations and variances^a

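	SRS	TSS1	TSS2	TSS3
$\hat{\mu}$	$\sum_{i=1}^{m} y_i / m$	$\sum_{j=1}^{k} \bar{y}_j / k$	$\sum_{j=1}^{k} pN_j\bar{y}_j / \sum_{j=1}^{k} pN_j$	$\sum_{j=1}^{k} N_j\bar{y}_j / \sum_{j=1}^{k} N_j$
$E(\hat{\mu} \mid N^*)$	$\beta_0 + \gamma(\bar{N}_{SRS} - \theta_N)$	$\beta_0 + \gamma(\bar{N} - \theta_N)$	$\beta_0 + \gamma[\bar{N}(CV_N^2 + 1) - \theta_N]$	$\beta_0 + \gamma[\bar{N}(CV_N^2 + 1) - \theta_N]$
$V(\hat{\mu} \mid N^*)$	$\frac{\sigma_\nu^2 + \sigma_\varepsilon^2}{m}$	$\frac{n\sigma_\nu^2 + \sigma_\varepsilon^2}{nk}$	$\frac{\bar{n}(CV_N^2 + 1)\sigma_\nu^2 + \sigma_\varepsilon^2}{\bar{n}k}$	$\frac{n\sigma_\nu^2 + \sigma_\varepsilon^2}{nk} \times (CV_N^2 + 1)$
$E(\hat{\mu})$	$\beta_0 + \gamma\theta_N\tau_N^2$	$\beta_0 + \gamma\theta_N\tau_N^2$	$\beta_0 + \gamma\theta_N\frac{k-1}{k}\tau_N^2$	$\beta_0 + \gamma\theta_N\frac{k-1}{k}\tau_N^2$
$E(V(\hat{\mu} \mid N^*))$	$\frac{\sigma_\nu^2 + \sigma_\varepsilon^2}{m}$	$\frac{n\sigma_\nu^2 + \sigma_\varepsilon^2}{nk}$	$\frac{p\theta_N\frac{k(\tau_N^2+1)}{\tau_N^2+k}\sigma_\nu^2 + \sigma_\varepsilon^2}{p\theta_N k}$	$\frac{(n\sigma_\nu^2 + \sigma_\varepsilon^2)\frac{k(\tau_N^2+1)}{\tau_N^2+k}}{nk}$
$V(E(\hat{\mu} \mid N^*))$	$\frac{\gamma^2\sigma_N^2[\tau_N(\zeta_N - \tau_N) + 1]}{m}$	$\frac{\gamma^2\sigma_N^2[\tau_N(\zeta_N - \tau_N) + 1]}{k}$	$\frac{\gamma^2\sigma_N^2}{k}\left\{\left(\frac{k-1}{k}\right)^2\tau_N^2\left[\eta_N - \frac{k-3}{k-1} + \tau_N(\tau_N - 2\zeta_N)\right] + 2\frac{k-1}{k}\tau_N(\zeta_N - \tau_N) + 1\right\}$	$\frac{\gamma^2\sigma_N^2}{k}\left\{\left(\frac{k-1}{k}\right)^2\tau_N^2\left[\eta_N - \frac{k-3}{k-1} + \tau_N(\tau_N - 2\zeta_N)\right] + 2\frac{k-1}{k}\tau_N(\zeta_N - \tau_N) + 1\right\}$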

^a Derivations are given in Appendix B. Note that the conditioning is on $N^*$, the sizes of the sampled clusters, where $k$ is the number of clusters sampled with any TSS scheme; under SRS, $\bar{N}_{SRS}$ is the average size of the $k_{SRS}$ clusters indirectly sampled with SRS; under any TSS scheme, $\bar{N} = \sum_{j=1}^{k} N_j / k$; $CV_N = S_N/\bar{N}$ is the sample coefficient of variation of cluster size, where $S_N^2 = \sum_{j=1}^{k} (N_j - \bar{N})^2 / k$; $\tau_N = \sigma_N/\theta_N$ is the population coefficient of variation of cluster size; $\zeta_N$ is the skewness and $\eta_N$ is the kurtosis of the cluster size distribution. The fourth row shows whether $\hat{\mu}$ is unbiased or approximately unbiased (ie, for $k$ sufficiently large). SRS, simple random sampling; TSS1, two‐stage sampling 1; TSS2, two‐stage sampling 2; TSS3, two‐stage sampling 3.


RELATIVE EFFICIENCIES OF TSS SCHEMES VERSUS SRS AND EACH OTHER

Under the constraint of a fixed total sample size (ie, $m = \bar{n}k$), the efficiency of the three TSS schemes can be investigated by computing their relative efficiencies, defined as the ratio of the sampling variances of $\hat{\mu}$ under two competing sampling schemes (ie, the variances obtained as the sum of the last two rows of Table 1). For instance, the relative efficiency of TSS1 versus SRS is defined as the ratio of $V(\hat{\mu})$ for SRS to $V(\hat{\mu})$ for TSS1 (ie, $RE_{TSS1\ vs\ SRS} = V(\hat{\mu}_{SRS})/V(\hat{\mu}_{TSS1})$). The relative efficiencies are given in Table 2 (for proof, see section 2 of the Supplementary Material), whereas the relative efficiency of TSS2 versus TSS1 is plotted in Figure 1. As shown by Table 2, the numerator and denominator of the relative efficiency are both a weighted sum of two components, respectively $E(V(\hat{\mu} \mid N^*))$ and $V(E(\hat{\mu} \mid N^*))$ from the last two rows of Table 1, with weights determined by the correlation between cluster effect and cluster size $corr(u_j, N_j)$. The component with weight $1 - corr(u_j, N_j)^2$ depends on the intraclass correlation $\rho$, the coefficient of variation of cluster size $\tau_N$, and the average number of individuals sampled per cluster $\bar{n}$. The other component, ie, $V(E(\hat{\mu} \mid N^*))$, weighted by $corr(u_j, N_j)^2$, is a function of the coefficient of variation $\tau_N$, the skewness $\zeta_N$, and (for TSS2 and TSS3 only) the kurtosis $\eta_N$ of the cluster size distribution. Denote by $\omega$ the relative efficiency under noninformative cluster size (ie, $RE = \omega$ if $corr(u_j, N_j) = 0$), and by $\lambda$ the relative efficiency under a perfect linear relation between $u_j$ and $N_j$ (ie, $RE = \lambda$ if $corr(u_j, N_j)^2 = 1$). These two extremes can be derived directly from Table 2 and Figure 1, which plots the $RE$ against $|corr(u_j, N_j)|$. The $RE$ thus moves from $\omega$ to $\lambda$ as $|corr(u_j, N_j)|$ moves from zero to one. For small to moderate correlations (say, $|corr(u_j, N_j)| < 0.7$), $\omega$ receives more weight in the relative efficiency. If $\omega$ and $\lambda$ are both smaller than or equal to one, the relative efficiency is also smaller than or equal to one. 
Now, the $\omega$'s shown in Table 2 are all smaller than one, which entails the following ordering of the sampling schemes in terms of efficiency based on $\omega$ (from most to least efficient): SRS, TSS1, TSS2, and TSS3. Under a perfect linear relation between cluster effect and cluster size (ie, $corr(u_j, N_j)^2 = 1$), $RE = \lambda$, and SRS is more efficient than TSS1, whereas TSS2 and TSS3 are equally efficient. Furthermore, TSS1 is more efficient than TSS2 and TSS3 (ie, $\lambda \le 1$) if one of the following conditions is met (for proofs, see section 2 of the Supplementary Material): the cluster size distribution is positively skewed (ie, $\zeta_N > 0$) with $\tau_N \in [0, \zeta_N]$, or is symmetric (ie, $\zeta_N = 0$) with $\tau_N \in [0,1]$ and , or is Normal. Thus, under any of these conditions and for any value of $corr(u_j, N_j)$, the ordering of the sampling schemes in terms of efficiency based on the $RE$ is (from most to least efficient) as follows: SRS, TSS1, TSS2, and TSS3. However, if none of the aforementioned conditions is met, $\lambda$ might be bigger than one and then, to see whether TSS1 is more efficient than TSS2 and TSS3, the relative efficiency must be computed for the specific cluster size distribution.

Table 2

Relative efficiencies of two‐stage sampling (TSS) schemes versus simple random sampling (SRS) and each other^a

For brevity, write $r = corr(u_j, N_j)$, $A = \tau_N(\zeta_N - \tau_N) + 1$, $B = \left(\frac{k-1}{k}\right)^2\tau_N^2\left[\eta_N - \frac{k-3}{k-1} + \tau_N(\tau_N - 2\zeta_N)\right] + 2\frac{k-1}{k}\tau_N(\zeta_N - \tau_N) + 1$, and $C = \frac{k(\tau_N^2 + 1)}{\tau_N^2 + k}$.
$RE_{TSS1\ vs\ SRS}$	$\frac{(1 - r^2) + r^2\rho A}{(1 - r^2)[1 + (n - 1)\rho] + r^2 n\rho A}$
$RE_{TSS2\ vs\ SRS}$	$\frac{(1 - r^2) + r^2\rho A}{(1 - r^2)[1 + (\bar{n}C - 1)\rho] + r^2\bar{n}\rho B}$
$RE_{TSS3\ vs\ SRS}$	$\frac{(1 - r^2) + r^2\rho A}{(1 - r^2)C[1 + (n - 1)\rho] + r^2 n\rho B}$
$RE_{TSS2\ vs\ TSS1}$	$\frac{(1 - r^2)[1 + (n - 1)\rho] + r^2 n\rho A}{(1 - r^2)[1 + (\bar{n}C - 1)\rho] + r^2\bar{n}\rho B}$
$RE_{TSS3\ vs\ TSS1}$	$\frac{(1 - r^2)[1 + (n - 1)\rho] + r^2 n\rho A}{(1 - r^2)C[1 + (n - 1)\rho] + r^2 n\rho B}$
$RE_{TSS3\ vs\ TSS2}$	$\frac{(1 - r^2)[1 + (\bar{n}C - 1)\rho] + r^2\bar{n}\rho B}{(1 - r^2)C[1 + (n - 1)\rho] + r^2 n\rho B}$

^a Derivations are given in section 2 of the Supplementary Material. Recall that $\rho$ is the intraclass correlation, defined as $\rho = \sigma_\nu^2/\sigma_y^2$, where $\sigma_y^2 = \sigma_\nu^2 + \sigma_\varepsilon^2$ is the total unexplained outcome variance.
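The relative efficiencies of Table 2 can be evaluated numerically along the following lines. This is a sketch under stated assumptions rather than the authors' code: the function names and the parameter values are hypothetical, the formulas are transcribed from the last two rows of Table 1 after scaling every variance by the same positive factor, and n stands for the (average) number of individuals sampled per cluster, taken equal across schemes because the total sample size is fixed.

```python
def scaled_variances(r, rho, n, k, tau, zeta, eta):
    """Sampling variance of the mean estimator per scheme, up to a common factor.

    r    : correlation between cluster effect and cluster size
    rho  : intraclass correlation
    n    : (average) number of individuals sampled per cluster
    k    : number of sampled clusters
    tau, zeta, eta : coefficient of variation, skewness, kurtosis of cluster size
    """
    r2 = r ** 2
    c = (k - 1) / k
    a = tau * (zeta - tau) + 1
    b = (c ** 2 * tau ** 2 * (eta - (k - 3) / (k - 1) + tau * (tau - 2 * zeta))
         + 2 * c * tau * (zeta - tau) + 1)
    cc = k * (tau ** 2 + 1) / (tau ** 2 + k)
    return {
        "SRS": (1 - r2) + r2 * rho * a,
        "TSS1": (1 - r2) * (1 + (n - 1) * rho) + r2 * n * rho * a,
        "TSS2": (1 - r2) * (1 + (n * cc - 1) * rho) + r2 * n * rho * b,
        "TSS3": (1 - r2) * cc * (1 + (n - 1) * rho) + r2 * n * rho * b,
    }


def relative_efficiency(scheme, reference, **params):
    """RE of `scheme` versus `reference`: variance(reference) / variance(scheme)."""
    v = scaled_variances(**params)
    return v[reference] / v[scheme]


# Hypothetical design and cluster size distribution.
base = dict(rho=0.1, n=20, k=50, tau=0.5, zeta=1.0, eta=4.5)

omega_tss1_srs = relative_efficiency("TSS1", "SRS", r=0.0, **base)
lambda_tss1_srs = relative_efficiency("TSS1", "SRS", r=1.0, **base)
re_tss2_tss1 = relative_efficiency("TSS2", "TSS1", r=0.3, **base)
print(omega_tss1_srs, lambda_tss1_srs, re_tss2_tss1)
```

Setting r to 0 or 1 returns the two extremes called ω and λ in the text; intermediate values of r trace a curve like those plotted in Figure 1.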

Figure 1

Model‐based Relative Efficiency of TSS2 versus TSS1 for a given total sample size $m$, as a function of the (absolute value of the) correlation between cluster effect and cluster size (ie, $corr(u_j, N_j)$), for different values of the average number of individuals sampled per cluster (ie, $\bar{n}$) and of the coefficient of variation of cluster size (ie, $\tau_N$) (curves), and different cluster size distributions (panels). The values of the relative efficiency at $corr(u_j, N_j) = 0$ and $corr(u_j, N_j) = 1$ refer to $\omega$ and $\lambda$, respectively

Given that $RE = \omega$ if $corr(u_j, N_j) = 0$ and $\omega$ has more weight than $\lambda$ in the $RE$ for $|corr(u_j, N_j)| < 0.7$, it is useful to have a closer look at the patterns of the $\omega$'s shown in Table 2. First, the $\omega$ of any TSS scheme versus SRS is a decreasing function of the intraclass correlation $\rho$, the average number of individuals sampled per cluster $\bar{n}$, and (only for TSS2 and TSS3) of the coefficient of variation of cluster size $\tau_N$. Second, $\omega$(TSS2 vs TSS1), $\omega$(TSS3 vs TSS1), and $\omega$(TSS3 vs TSS2) are decreasing functions of the coefficient of variation of cluster size $\tau_N$. Third, as the intraclass correlation $\rho$ and/or the average number of individuals sampled per cluster $\bar{n}$ increase, TSS2 moves away from TSS1 and toward TSS3 in terms of efficiency as expressed by $\omega$ (see Figure 2).

Figure 2

Model‐based Relative Efficiencies of TSS3 versus TSS2, for a given total sample size $m$ and noninformative cluster size (ie, $\gamma = 0$), as a function of the coefficient of variation of cluster size (ie, $\tau_N$), for different values of the intraclass correlation (ie, $\rho$) (curves) and for different average numbers of individuals sampled per cluster (ie, $\bar{n}$) (panels)

When the outcome variable is unrelated to the cluster size (ie, $\gamma = 0$ and so also $corr(u_j, N_j) = 0$), the population mean $\mu$ is equal to $\beta_0$, as shown in Section 3. In this special case, any estimator of $\mu$ of the form $\hat{\mu} = \sum_{j=1}^{k} w_j\bar{y}_j / \sum_{j=1}^{k} w_j$ is unbiased. However, some weights are more efficient than others. For TSS2, weighting cluster means by their inverse variance (ie, $w_j = 1/V(\bar{y}_j)$, where $V(\bar{y}_j) = \sigma_\nu^2 + \sigma_\varepsilon^2/n_j$ because $\gamma = 0$) is optimal, and unweighted analysis (ie, $w_j = 1$) is more or less efficient than cluster size weighting (ie, $w_j = pN_j$), depending on the intraclass correlation $\rho$ and the average cluster size in the sample.3, 23 The conditional variance of the optimal estimator is $\left[\sum_{j=1}^{k} (\sigma_\nu^2 + \sigma_\varepsilon^2/n_j)^{-1}\right]^{-1}$.3, (eq.(6)) Under TSS1 and TSS3, the same number of individuals is sampled per cluster (ie, $n_j = n$, $\forall j = 1,…,k$), so the estimator with $w_j = 1/V(\bar{y}_j)$ reduces to $\sum_{j=1}^{k} \bar{y}_j / k$. Thus, for TSS1 and TSS3, $w_j = 1$ is optimal and its sampling variance is given in the fifth row of the TSS1 column in Table 1 (for proof, see Appendix B or section 2.3 of the Supplementary Material), so TSS1 and TSS3 are then equally efficient, given equal weighting of cluster means, but TSS3 is more practical because, unlike TSS1, it does not require knowledge of all cluster sizes in the population. The optimal estimator of TSS2 is less efficient than that of TSS3 and TSS1 (for proof, see section 2.3 of the Supplementary Material). Therefore, TSS3 combined with $w_j = 1$ is the best strategy to estimate $\mu$ if cluster size is not informative. To prevent misunderstanding, note that the ordering of sampling schemes in this last paragraph only holds if noninformative cluster size is combined with optimal weighting of cluster means. 
Those weights differ from the ones in Table 1 first row, on which Table 2 and Figures 1 and 2 are based, and which are needed for unbiased estimation of the population mean if cluster size is informative.
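The noninformative case can be illustrated with a short numerical comparison. The sketch below assumes the usual random intercept variance of a cluster mean, sigma2_nu + sigma2_eps / n_j, with gamma = 0; the function names, variance components, and per‐cluster sample sizes are hypothetical. It compares inverse‐variance weighting with unequal n_j (as under TSS2) against equal weighting with a common n (as under TSS1 or TSS3) at the same total sample size.

```python
import numpy as np


def var_inverse_variance_weighted(n_j, sigma2_nu, sigma2_eps):
    """Variance of the inverse-variance weighted mean of cluster means."""
    n_j = np.asarray(n_j, dtype=float)
    v_j = sigma2_nu + sigma2_eps / n_j      # variance of each cluster mean
    return 1.0 / np.sum(1.0 / v_j)


def var_equal_n_unweighted(n, k, sigma2_nu, sigma2_eps):
    """Variance of the unweighted mean of k cluster means, n individuals each."""
    return (n * sigma2_nu + sigma2_eps) / (n * k)


sigma2_nu, sigma2_eps = 1.0, 9.0            # hypothetical variance components
n_unequal = [5, 10, 15, 30]                 # TSS2-like: n_j proportional to N_j
k = len(n_unequal)
n_equal = sum(n_unequal) // k               # same total sample size, 15 per cluster

v_tss2_optimal = var_inverse_variance_weighted(n_unequal, sigma2_nu, sigma2_eps)
v_tss3 = var_equal_n_unweighted(n_equal, k, sigma2_nu, sigma2_eps)
print(v_tss2_optimal, v_tss3)
```

Because the precision of a cluster mean is a concave function of n_j, spreading a fixed total sample equally over clusters cannot do worse than an unequal allocation, which is consistent with the ordering stated in the text.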

DESIGN‐BASED INFERENCE FOR TSS WHEN CLUSTER SIZE IS INFORMATIVE

The aim of this section is to study the relative efficiencies of the three TSS schemes compared with SRS and with each other under the design‐based approach. It is important to emphasize that the inferential framework of this section is different from the model‐based approach adopted in the rest of this paper. So far, the outcome variable $Y_{ij}$ and cluster size $N_j$ were both seen as random variables, and inference was based on the probability distribution of $Y_{ij}$ given in model (1). In contrast, in the design‐based approach (ie, this section), the outcome variable $Y_{ij}$ and cluster size $N_j$ are fixed quantities, the inclusion indicator is the only random variable (eg, for cluster $j$, it is defined as $I_j = 1$ if cluster $j$ is included in the sample, which occurs with probability $\pi_j$, and $I_j = 0$ otherwise), and inference is based on the probability distribution induced by the sampling scheme. The notation of this section remains the same as before, with the important distinction that all population quantities here must be interpreted as relating to the finite population. Thus, the two types of population means can be expressed as $\mu = \sum_{j=1}^{K}\sum_{i=1}^{N_j} Y_{ij} / N_{pop}$ and $\beta_0 = \sum_{j=1}^{K} \bar{Y}_j / K$, respectively, where $\bar{Y}_j = \sum_{i=1}^{N_j} Y_{ij} / N_j$ is the mean of all $N_j$ individuals within cluster $j$. Furthermore, in the population the outcome variable for the $i$th individual within the $j$th cluster can be decomposed (combining model (1) with Assumption 4) as follows:

$$Y_{ij} = \beta_0 + \gamma(N_j - \theta_N) + \nu_j + \varepsilon_{ij}, \qquad (7)$$

where $\nu_j$ is the cluster effect with mean zero and variance $\sigma_\nu^2$ over the $K$ clusters, and $\nu_j \perp N_j$, whereas $\varepsilon_{ij}$ is the individual effect with mean zero within each cluster, variance $\sigma_\varepsilon^2$, and $\nu_j \perp \varepsilon_{ij}$, which entails that $\bar{Y}_j$ here represents $\beta_0 + u_j$ in model (1). Note that, in this section, no distributional assumptions are made for Equation (7): all quantities (ie, $Y_{ij}$, $N_j$, $\nu_j$, and $\varepsilon_{ij}$) are just fixed constants, the only random variable is the inclusion indicator, and its probability distribution is the foundation of inference. From Equation (7), it follows that $\mu = \beta_0 + \gamma\theta_N\tau_N^2$, an expression that is similar to Equation (5) but refers to the finite population (for proof, see section 3 of the Supplementary Material). 
Hence, under both inferential paradigms, the two population means coincide (ie, μ = β 0) only if either there is no cluster size variation in the population (ie, τ = 0), or cluster size is noninformative (ie, γ = 0). For each sampling scheme, Table 3 shows in the first row the estimator of the population mean μ, in the second row the sampling variance of as available in the design‐based literature,7, 8, 9, 25 and in the third row again the sampling variance of but under the assumption that Equation (7) describes the outcome variable Y in the population (for proofs, see section 3 of the Supplementary Material). For large enough k (say, k ≥ 30), the model‐based variances given in Table 1 are equal to the design‐based variances given in the third row of Table 3. Furthermore, the estimators of Table 3 are the same as those of the model‐based approach (ie, Table 1, first row). The estimators under SRS and TSS1 are unbiased,7(p308),8(p236) whereas the estimator under TSS2 and TSS3, the so‐called ratio estimator, is only approximately unbiased,8(p186),25(pp323,324) and then the number of sampled clusters k is assumed to be large enough to neglect this bias. It is important to emphasize that, under the design‐based paradigm, the properties of an estimator (ie, approximate unbiasedness, variance as given in the second row of Table 3) are based only on the sampling scheme.8(p147),9(p239) The assumption that the outcome variable is described by Equation (7) (ie, Table 3, third row) is needed to allow a fair comparison with the results obtained under the model‐based approach. However, the assumption of a model, like Equation (7), to evaluate competing sampling schemes is appropriate under the design‐based framework, provided that inference is then based on the sampling scheme only.7(p256),8(p205),26, 27
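The relation between the two population means can be checked numerically on a small synthetic finite population. The sketch below is illustrative only: it assumes cluster means of the form β0 + γ(N_j − θ_N) + ν_j (our reading of the decomposition in Equation (7)), a finite‐population variance of cluster size with divisor K, and made‐up values for K, β0, and γ. The function name `build_cluster_means` is hypothetical.

```python
import random

def build_cluster_means(K=500, beta0=10.0, gamma=0.02, seed=1):
    """Synthetic finite population of K clusters (all values are hypothetical).

    Returns cluster sizes N_j and cluster means
    Ybar_j = beta0 + gamma * (N_j - theta) + nu_j,
    where nu_j is forced to have mean zero and zero covariance with N_j.
    """
    rng = random.Random(seed)
    sizes = [rng.randint(20, 400) for _ in range(K)]
    theta = sum(sizes) / K
    var_n = sum((n - theta) ** 2 for n in sizes) / K  # divisor K (assumption)

    # Raw cluster effects, residualised on cluster size so that nu is exactly
    # centred and uncorrelated with N in this finite population.
    raw = [rng.gauss(0.0, 1.0) for _ in range(K)]
    raw_mean = sum(raw) / K
    slope = sum((n - theta) * (r - raw_mean) for n, r in zip(sizes, raw)) / (K * var_n)
    nu = [r - raw_mean - slope * (n - theta) for n, r in zip(sizes, raw)]

    means = [beta0 + gamma * (n - theta) + v for n, v in zip(sizes, nu)]
    return sizes, means, theta, var_n

beta0, gamma = 10.0, 0.02
sizes, means, theta, var_n = build_cluster_means(beta0=beta0, gamma=gamma)
sigma_n = var_n ** 0.5
tau_n = sigma_n / theta  # coefficient of variation of cluster size

mu = sum(n * y for n, y in zip(sizes, means)) / sum(sizes)  # mean of all individuals
mean_of_means = sum(means) / len(means)                     # mean of cluster means

print("mu                      :", mu)
print("beta0 + gamma*sigma*tau :", beta0 + gamma * sigma_n * tau_n)
print("mean of cluster means   :", mean_of_means)
```

With γ > 0, larger clusters have larger means, so the size‐weighted mean μ exceeds the unweighted mean of cluster means; setting `gamma=0.0` makes the two coincide.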

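Table 3

Population mean μ estimator and sampling variance per sampling scheme under the design‐based approach^a

	SRS	TSS1	TSS2	TSS3
$\hat{\mu}$	$\frac{1}{m}\sum_{i=1}^{m} y_i$	$\frac{1}{k}\sum_{j=1}^{k} \bar{y}_j$	$\frac{\sum_{j=1}^{k} N_j \bar{y}_j}{\sum_{j=1}^{k} N_j}$	$\frac{\sum_{j=1}^{k} N_j \bar{y}_j}{\sum_{j=1}^{k} N_j}$
	7 (p22); 8 (eq (2.8), p35); 25 (eq (5), p21)	7 (eq (11.39), p308); 8 (p236); 25 (eq (2), p359)	7 (eq (11.25), p303); 8 (eq (5.26), p186); 25 (eq (76), p317)	7 (eq (11.25), p303); 8 (eq (5.26), p186); 25 (eq (76), p317)
$V(\hat{\mu})$	$\frac{\sum_{i=1}^{N_{pop}} (Y_i - \mu)^2}{m (N_{pop} - 1)}$	$\frac{1}{k}\sum_{j=1}^{K} \frac{N_j}{N_{pop}} (\bar{Y}_j - \mu)^2 + \frac{1}{k}\sum_{j=1}^{K} \frac{N_j}{N_{pop}} \frac{\sum_{i=1}^{N_j} (Y_{ij} - \bar{Y}_j)^2}{n (N_j - 1)}$	$\frac{1}{k}\sum_{j=1}^{K} \left(\frac{N_j}{\theta_N}\right)^2 \frac{(\bar{Y}_j - \mu)^2}{K - 1} + \frac{1}{kK}\sum_{j=1}^{K} \left(\frac{N_j}{\theta_N}\right)^2 \frac{\sum_{i=1}^{N_j} (Y_{ij} - \bar{Y}_j)^2}{n_j (N_j - 1)}$	$\frac{1}{k}\sum_{j=1}^{K} \left(\frac{N_j}{\theta_N}\right)^2 \frac{(\bar{Y}_j - \mu)^2}{K - 1} + \frac{1}{kK}\sum_{j=1}^{K} \left(\frac{N_j}{\theta_N}\right)^2 \frac{\sum_{i=1}^{N_j} (Y_{ij} - \bar{Y}_j)^2}{n (N_j - 1)}$
	7 (eq (2.8), p23); 8 (eq (2.9), p36); 25 (eq (39), p29)	7 (eq (11.33), p307); 25 (eq (14), p362)	7 (eq (11.27), p304); 25 (eq (96), p325)	7 (eq (11.27), p304); 25 (eq (96), p325)
$V(\hat{\mu})$ under Equation (7)	$\frac{\sigma_\nu^2 + \sigma_\varepsilon^2 + \gamma^2 \sigma_N^2 (\tau_N \zeta_N - \tau_N^2 + 1)}{m}$	$\frac{n \sigma_\nu^2 + \sigma_\varepsilon^2}{nk} + \frac{\gamma^2 \sigma_N^2 (\tau_N \zeta_N - \tau_N^2 + 1)}{k}$	$\frac{p \theta_N (\tau_N^2 + 1) \sigma_\nu^2 + \sigma_\varepsilon^2}{p \theta_N k} + \frac{\gamma^2 \sigma_N^2 [\tau_N^4 + \tau_N^2 (\eta_N - 3) + 2 \zeta_N \tau_N (1 - \tau_N^2) + 1]}{k}$	$\frac{(\tau_N^2 + 1)(n \sigma_\nu^2 + \sigma_\varepsilon^2)}{nk} + \frac{\gamma^2 \sigma_N^2 [\tau_N^4 + \tau_N^2 (\eta_N - 3) + 2 \zeta_N \tau_N (1 - \tau_N^2) + 1]}{k}$

^a Note that m is the number of individuals sampled with SRS and k is the number of clusters sampled with a TSS scheme. For any TSS scheme, we assume either a negligible sampling fraction at each stage or sampling with replacement at each stage, and, for SRS, either a negligible sampling fraction or sampling with replacement. In the third row, the outcome variable is assumed to be described by Equation (7). For large enough k, the variances in the third row are equal to those in the last two rows of Table 1. Note that $\zeta_N$ is the skewness and $\eta_N$ is the kurtosis of the cluster size distribution in the population. Derivations are given in section 3 of the Supplementary Material. SRS, simple random sampling; TSS1, two‐stage sampling 1; TSS2, two‐stage sampling 2; TSS3, two‐stage sampling 3.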
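To make the second‐row (design‐based) variance concrete, the following sketch evaluates the TSS1 formula on a small synthetic finite population and compares it with the empirical variance of the TSS1 estimator over repeated samples. It is a minimal illustration under stated assumptions: the population, its size range, the seeds, and the function names are hypothetical, both sampling stages are done with replacement, and the formula's divisor $N_j - 1$ makes it a close approximation rather than an exact match under with‐replacement sampling.

```python
import random

def make_population(K=60, seed=3):
    """Hypothetical finite population: a list of clusters, each a list of outcomes."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(K):
        size = rng.randint(50, 300)
        centre = 5.0 + 0.03 * size + rng.gauss(0.0, 1.0)  # informative cluster size
        clusters.append([centre + rng.gauss(0.0, 2.0) for _ in range(size)])
    return clusters

def tss1_design_variance(clusters, k, n):
    """TSS1 design-based variance; needs every N_j, Ybar_j and Y_ij in the population."""
    n_pop = sum(len(c) for c in clusters)
    mu = sum(sum(c) for c in clusters) / n_pop
    between = within = 0.0
    for c in clusters:
        share = len(c) / n_pop                      # N_j / N_pop
        ybar = sum(c) / len(c)
        s2 = sum((y - ybar) ** 2 for y in c) / (len(c) - 1)
        between += share * (ybar - mu) ** 2
        within += share * s2 / n
    return mu, (between + within) / k

def tss1_estimate(clusters, k, n, rng):
    """One TSS1 sample: k clusters drawn with probability proportional to size,
    then n individuals per drawn cluster, both stages with replacement."""
    weights = [len(c) for c in clusters]
    drawn = rng.choices(clusters, weights=weights, k=k)
    return sum(sum(rng.choices(c, k=n)) / n for c in drawn) / k

clusters = make_population()
k, n, reps = 20, 5, 4000
mu, v_formula = tss1_design_variance(clusters, k, n)

rng = random.Random(11)
estimates = [tss1_estimate(clusters, k, n, rng) for _ in range(reps)]
mc_mean = sum(estimates) / reps
mc_var = sum((e - mc_mean) ** 2 for e in estimates) / (reps - 1)

print("population mean, Monte Carlo mean:", mu, mc_mean)
print("formula variance, Monte Carlo variance:", v_formula, mc_var)
```

Note that `tss1_design_variance` loops over the entire population, which is exactly the information a survey planner does not have in practice.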

Similarly to Section 4, the relative efficiency of two competing sampling schemes is defined as the ratio of their variances (as given in the third row of Table 3). For large enough k (say, k ≥ 30), these relative efficiencies (given in Table S.M.2 and shown in Figures S.M.1‐2 of the Supplementary Material) are approximately equal to those shown in Table 2 because the variances in Table 1 and those in the third row of Table 3 are approximately equal. The only distinction is that $\mathrm{corr}(u_j, N_j)$ is replaced with the correlation between cluster mean and cluster size, $\mathrm{corr}(\bar{Y}_j, N_j)$. As in Section 4, the numerator and denominator of the relative efficiency are both made up of two components, weighted by $1 - \mathrm{corr}(\bar{Y}_j, N_j)^2$ and $\mathrm{corr}(\bar{Y}_j, N_j)^2$, respectively, and only the component weighted by $\mathrm{corr}(\bar{Y}_j, N_j)^2$ depends on the skewness and kurtosis of the cluster size distribution. The extreme cases of the relative efficiency, namely, under noninformative cluster size and under a perfect relation between cluster mean and cluster size, are denoted by ω and λ, respectively. The patterns and the ordering of the relative efficiencies are then those of Section 4. Specifically, for any value of $\mathrm{corr}(\bar{Y}_j, N_j)$, SRS is the most efficient sampling scheme, followed by TSS1 (under the conditions given in Section 4), TSS2, and finally TSS3.

To conclude, even though the mathematical foundations of the two inferential approaches are different, in the considered setting they yield almost the same results, ie, the population mean estimators are the same, as are the relative efficiencies, provided that k is large enough and Equation (7) holds in the population. An advantage of the design‐based approach is robustness because the unbiasedness and the variance of a design‐based estimator do not depend on the assumptions of a model. Nevertheless, the model‐based approach has a practical advantage when designing a survey, more specifically for choosing a sampling scheme and computing the sample size. The sampling variances in Table 1 (last two rows) and Table 3 (last row), and the relative efficiencies in Table 2, all based on Equation (7), can be obtained by specifying the intraclass correlation ρ, the correlation $\mathrm{corr}(u_j, N_j)$, and four parameters of the cluster size distribution (ie, $\theta_N$, $\tau_N$, $\zeta_N$, and $\eta_N$). In contrast, the sampling variances in Table 3 (second row) from the design‐based approach require knowledge of cluster size $N_j$ and cluster mean $\bar{Y}_j$ for all the K clusters in the population. If that information were available, then the population mean μ would also be known, making the survey superfluous.
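The third‐row variances of Table 3 need only a handful of summary parameters, which the sketch below turns into a small calculator. It is a hedged illustration, not the authors' code: the formulas follow our reading of the third row of Table 3, the TSS2 sampling fraction is set to p = n/θ_N so that all schemes share the same expected total sample size m = kn, the relative efficiency of scheme A versus scheme B is taken as Var(B)/Var(A), and all numerical inputs are made up.

```python
def variances_under_eq7(sig_nu2, sig_eps2, gamma2_sig_n2, theta, tau, zeta, eta, k, n):
    """Sampling variances of the mean estimator under Equation (7), per scheme.

    sig_nu2, sig_eps2     : cluster-level and individual-level residual variances
    gamma2_sig_n2         : gamma^2 * sigma_N^2 (variance explained by cluster size)
    theta, tau, zeta, eta : mean, coefficient of variation, skewness and kurtosis
                            of the cluster size distribution
    k, n                  : sampled clusters and (average) individuals per cluster
    """
    m = k * n          # SRS sample size matched to the TSS total (assumption)
    p = n / theta      # TSS2 sampling fraction such that p * theta = n (assumption)
    skew_term = tau * zeta - tau**2 + 1
    kurt_term = tau**4 + tau**2 * (eta - 3) + 2 * zeta * tau * (1 - tau**2) + 1
    return {
        "SRS": (sig_nu2 + sig_eps2 + gamma2_sig_n2 * skew_term) / m,
        "TSS1": (n * sig_nu2 + sig_eps2) / (n * k) + gamma2_sig_n2 * skew_term / k,
        "TSS2": (p * theta * (tau**2 + 1) * sig_nu2 + sig_eps2) / (p * theta * k)
                + gamma2_sig_n2 * kurt_term / k,
        "TSS3": (tau**2 + 1) * (n * sig_nu2 + sig_eps2) / (n * k)
                + gamma2_sig_n2 * kurt_term / k,
    }

def relative_efficiency(variances, scheme, reference):
    """Efficiency of `scheme` relative to `reference`: Var(reference) / Var(scheme)."""
    return variances[reference] / variances[scheme]

# Hypothetical inputs, chosen only to illustrate the calculation.
v = variances_under_eq7(sig_nu2=0.05, sig_eps2=0.9, gamma2_sig_n2=0.05,
                        theta=100.0, tau=0.5, zeta=1.0, eta=4.5, k=50, n=20)
for scheme, value in v.items():
    print(scheme, value)
print("RE of TSS2 versus TSS1:", relative_efficiency(v, "TSS2", "TSS1"))
```

For these particular inputs the variances increase from SRS to TSS1 to TSS2 to TSS3; other cluster size distributions can be explored by changing `tau`, `zeta`, and `eta`.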

APPLICATION TO TWO REAL CLUSTER SIZE DISTRIBUTIONS

With the aim of planning a survey to estimate the population mean μ of a quantitative outcome variable Y in a two‐level population, we want to establish whether TSS1 is more efficient than TSS2 for the population under study and to assess its efficiency gain relative to TSS2. The outcome variable Y is assumed to be decomposed as shown in Equation (7), but the analysis is carried out for both the model‐based and the design‐based approach. Two real cluster size distributions are considered, ie, the distribution of public high school size in Italy and the distribution of patient list size for general practices in England.

School size and alcohol consumption. In the adolescent health literature, it has been shown that a greater connection between students and school (eg, positive relations with teachers and peers, participation in school activities) is associated with less emotional distress, substance consumption (eg, alcohol, cigarettes, marijuana), violence, and suicidal intentions.28 Furthermore, it has been found that school connectedness and school size are inversely related,29, 30 which suggests that school size can be informative for health risk behaviors in adolescents. Suppose that we want to estimate the average weekly alcohol consumption (in liters) among high school students in Italy. According to the Italian Ministry of Education,31 in the school year 2016/2017, there were $K = 6{,}235$ public high schools in Italy with a total of $N_{pop} = 2{,}515{,}060$ students enrolled. The distribution of public high school size in Italy (with parameters $\theta_N = 403$, $\tau_N = 0.912$, $\zeta_N = 1.256$, and $\eta_N = 4.315$) is plotted in Figure 3 (first column, first row). The first row of Figure 3 also shows the relative efficiency of TSS2 versus TSS1, for a sample of $k = 50$ schools and students per school, as a function of the (absolute value of the) correlation between school size and school‐specific mean, for different values of the intraclass correlation, under the model‐based (second column) and the design‐based approach (third column). As can be seen from Figure 3, under both inferential approaches, TSS1 is more efficient than TSS2 and allows a sizeable efficiency gain (about 15%) even for noninformative school size and a small intraclass correlation (ρ = 0.01).

Figure 3

First column: Distribution of public high school size in Italy (first row) and distribution of patient list size for general practices in England (second row). Second column: Model‐based relative efficiency of TSS2 versus TSS1, as a function of the (absolute value of the) correlation between cluster effect and cluster size (ie, $\mathrm{corr}(u_j, N_j)$), for different values of the intraclass correlation coefficient ρ (curves). Third column: Design‐based relative efficiency of TSS2 versus TSS1, as a function of the (absolute value of the) correlation between cluster mean and cluster size (ie, $\mathrm{corr}(\bar{Y}_j, N_j)$), for different values of the intraclass correlation coefficient ρ (curves). TSS1, two‐stage sampling 1; TSS2, two‐stage sampling 2.

Patient list size for general practices and government expenditure on health. According to Eurostat,32 in 2016, health was the second largest area of government expenditure in the United Kingdom with a share of 7.6% of the Gross Domestic Product (GDP). Spending for hospital services represented the largest component of the government expenditure on health, with a share of 5.7% of the GDP.32 In reducing such costs, general practices can play a role by effectively treating those conditions that can lead to avoidable hospitalizations (eg, influenza, diabetic complications). Kelly and Stoye33 have found that small practices (defined as those with three or fewer full‐time equivalent (FTE) practitioners) had higher rates of hospitalizations for such preventable conditions in 2010/2011 in England. This suggests that patient list size can be informative for government expenditure on health, given that the number of patients per general practice was proportional to the number of FTE practitioners (see figure 2.6 and table 2.3 in the work of Kelly and Stoye33). Suppose we want to estimate the average per capita government expenditure on health in England. According to the Health and Social Care Information Centre,34 in October 2017, $N_{pop} = 58{,}719{,}921$ patients were registered at $K = 7{,}353$ general practices in England. The distribution of patient list size for general practices in England (with parameters $\theta_N = 7{,}986$, $\tau_N = 0.633$, $\zeta_N = 2.12$, and $\eta_N = 14.549$) is plotted in Figure 3 (first column, second row). The second row of Figure 3 shows the relative efficiency of TSS2 versus TSS1, for a sample of $k = 50$ practices and patients per practice, as a function of the (absolute value of the) correlation between patient list size and general practice‐specific mean, for different values of the intraclass correlation, under the model‐based (second column) and the design‐based approach (third column). As shown in the second row of Figure 3, TSS1 is more efficient than TSS2 under both inferential paradigms, and its efficiency gain increases as the intraclass correlation and/or the correlation between patient list size and the general practice‐specific mean increases.

To conclude, the two examples show that TSS2 leads to important efficiency losses relative to TSS1 and that, in planning a survey, it is more practical to use variances based on a model, like those given in Table 1 or in the third row of Table 3, than the design‐based variances in the second row of Table 3, which require prior knowledge of all cluster sizes $N_j$ as well as all cluster means $\bar{Y}_j$ in the population.
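The kind of calculation behind the two applications can be sketched from the reported cluster size parameters alone. The code below is illustrative and makes several assumptions that are not taken from the paper: the number of individuals per cluster (`n = 20`) is hypothetical, the total outcome variance is normalised to 1 with intraclass correlation ρ = σ_u²/(σ_u² + σ_ε²), the share of σ_u² explained by cluster size is r² for a correlation r between cluster effect (or cluster mean) and cluster size, and the TSS2 sampling fraction satisfies pθ_N = n. The printed values are therefore not expected to reproduce Figure 3.

```python
def re_tss2_vs_tss1(rho, r, tau, zeta, eta, n):
    """Var(TSS1) / Var(TSS2) under Equation (7), total outcome variance set to 1.

    Assumed parameterisation: sigma_u^2 = rho, sigma_eps^2 = 1 - rho, and the part
    of sigma_u^2 explained by cluster size is r^2 * rho.
    """
    sig_eps2 = 1.0 - rho
    sig_nu2 = (1.0 - r**2) * rho
    g2s2 = r**2 * rho  # gamma^2 * sigma_N^2
    skew_term = tau * zeta - tau**2 + 1
    kurt_term = tau**4 + tau**2 * (eta - 3) + 2 * zeta * tau * (1 - tau**2) + 1
    # The common factor 1/k cancels in the ratio, so k is not needed here.
    v1 = (n * sig_nu2 + sig_eps2) / n + g2s2 * skew_term
    v2 = (n * (tau**2 + 1) * sig_nu2 + sig_eps2) / n + g2s2 * kurt_term
    return v1 / v2

# Cluster size parameters as reported in the text for the two populations.
populations = {
    "Italian public high schools": dict(K=6235, n_pop=2515060,
                                        tau=0.912, zeta=1.256, eta=4.315),
    "English general practices": dict(K=7353, n_pop=58719921,
                                      tau=0.633, zeta=2.12, eta=14.549),
}

n = 20  # hypothetical number of individuals sampled per cluster (not from the paper)
results = {}
for name, d in populations.items():
    theta_check = d["n_pop"] / d["K"]  # mean cluster size implied by the totals
    grid = {(rho, r): re_tss2_vs_tss1(rho, r, d["tau"], d["zeta"], d["eta"], n)
            for rho in (0.01, 0.05, 0.10) for r in (0.0, 0.4, 0.8)}
    results[name] = (theta_check, grid)
    print(name, "mean cluster size:", round(theta_check, 1))
    for (rho, r), re in grid.items():
        print(f"  rho={rho:.2f}  |corr|={r:.1f}  RE(TSS2 vs TSS1)={re:.3f}")
```

The mean cluster sizes implied by the reported totals agree with the reported $\theta_N$ values after rounding, and every relative efficiency on this grid is below 1, in line with the conclusion that TSS1 is more efficient than TSS2 for both distributions.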

DISCUSSION

In multilevel populations, two types of overall means can be defined, ie, the mean of all individual outcomes in the population ignoring cluster membership and the mean of all cluster‐specific means. For unbiased estimation of the first population mean, individuals can be sampled not only by SRS but also with three alternative TSS schemes, ie, sampling clusters with probability proportional to cluster size and then taking an SRS of the same number of individuals within each sampled cluster (ie, TSS1); drawing an SRS of clusters and then sampling the same percentage of individuals per cluster (ie, TSS2); and taking an SRS of clusters and then an SRS of the same number of individuals within each sampled cluster (ie, TSS3). The results of this paper are the following.

First, it was shown that the first population mean gives equal weight to all individuals and thus more weight to large clusters than to small clusters, the second mean gives equal weight to all clusters irrespective of their size, and these two means coincide only if cluster size does not vary or is unrelated to the outcome variable of interest (ie, noninformative).

Second, for estimation of the first population mean (ie, the average of all individual outcomes), the unweighted average of cluster means is unbiased under TSS1, and the average of cluster means weighted by cluster size is asymptotically unbiased under TSS2 or TSS3.

Third, it was shown that the relative efficiency of any TSS scheme versus SRS is a decreasing function of the intraclass correlation, the average number of individuals sampled per cluster, and (only for TSS2 and TSS3) the coefficient of variation of cluster size. Furthermore, the relative efficiencies of TSS2 and TSS3 versus TSS1 and of TSS3 versus TSS2 are decreasing functions of the coefficient of variation of cluster size, but the efficiency loss of TSS3 compared with TSS2 becomes smaller as the intraclass correlation and/or the average number of individuals sampled per cluster increases. All relative efficiencies also depend on other features of the cluster size distribution, in particular, on its skewness and (only for those involving TSS2 and TSS3) kurtosis. Nevertheless, SRS is always the most efficient sampling scheme, followed (for many cluster size distributions) by TSS1, and then by TSS2, which, in turn, is always more efficient than TSS3. With respect to choosing between the three TSS schemes, we do not expect TSS1 to be less efficient than TSS2 in practice, and thus we recommend TSS1 provided all cluster sizes are known before sampling.

Fourth, it was shown that model‐based and design‐based inference in survey sampling yield almost the same results, at least if the model assumptions are met. Although design‐based inference has the advantage of being robust against violations of the model assumptions, comparing the four sampling schemes in terms of their relative efficiencies, as well as sample size planning, can only be done by taking a model‐based approach. Sample size planning within the design‐based approach would require knowledge of the size and outcome mean of all clusters in the population (see Table 3, second row), which, in turn, would imply that the population mean is already known. Furthermore, models are also needed to deal with missing data and measurement error.9

The results of this paper could be extended by (i) deriving the optimal design of these three TSS schemes under a cost constraint and comparing their efficiencies under that constraint instead of the present constraint of a fixed total sample size, (ii) investigating different variance estimation methods, (iii) considering binary outcome variables, and (iv) deriving the optimal design for a scheme that samples different numbers and percentages of individuals at the second stage, that is, a sampling scheme in between TSS2 and TSS3.
