
Optimal two-stage sampling for mean estimation in multilevel populations when cluster size is informative.

Francesco Innocenti¹, Math J. J. M. Candel¹, Frans E. S. Tan¹, Gerard J. P. van Breukelen¹,².

Abstract

To estimate the mean of a quantitative variable in a hierarchical population, it is logistically convenient to sample in two stages (two-stage sampling), i.e. selecting first clusters, and then individuals from the sampled clusters. Allowing cluster size to vary in the population and to be related to the mean of the outcome variable of interest (informative cluster size), the following competing sampling designs are considered: sampling clusters with probability proportional to cluster size, and then the same number of individuals per cluster; drawing clusters with equal probability, and then the same percentage of individuals per cluster; and selecting clusters with equal probability, and then the same number of individuals per cluster. For each design, optimal sample sizes are derived under a budget constraint. The three optimal two-stage sampling designs are compared, in terms of efficiency, with each other and with simple random sampling of individuals. Sampling clusters with probability proportional to size is recommended. To overcome the dependency of the optimal design on unknown nuisance parameters, maximin designs are derived. The results are illustrated, assuming probability proportional to size sampling of clusters, with the planning of a hypothetical survey to compare adolescent alcohol consumption between France and Italy.


Keywords: Cross-national comparisons; informative cluster size; maximin design; optimal design; sample size calculation; two-stage sampling


Year: 2020 PMID: 32940135 PMCID: PMC8172256 DOI: 10.1177/0962280220952833

Source DB: PubMed Journal: Stat Methods Med Res ISSN: 0962-2802 Impact factor: 3.021

1 Introduction

For the purpose of estimating the mean or prevalence of an outcome variable (e.g. alcohol consumption or smoking) in a hierarchical population (e.g. students within schools, patients within general practices), or of comparing subpopulations with respect to such a mean or prevalence, it is often convenient, for economic or logistic reasons, to sample in two stages: first, clusters (e.g. schools, general practices) are sampled and then individuals (e.g. students, patients) are drawn from the sampled clusters.[1-3] Examples of these multi-stage sampling designs include school-based surveys for monitoring substance use among adolescents,[4-6] and national surveys for estimating the average length of stay for discharges from hospitals, or nursing homes. The topic of this paper is the efficient design of two-stage sampling (TSS) schemes for estimating the mean of a quantitative outcome variable in a two-level population. In practice, clusters usually vary in size (e.g. small versus large schools) and then, to estimate the population mean, a sample can be drawn with at least three alternative TSS schemes: sampling clusters with probability proportional to cluster size, and then sampling the same number of individuals from each selected cluster (TSS1); sampling clusters with equal probability, and then sampling the same percentage of individuals from each sampled cluster (TSS2); sampling clusters with equal probability, and then sampling the same number of individuals per cluster (TSS3). These three TSS schemes will be considered in this paper and compared with Simple Random Sampling (SRS) of individuals. Additionally to cluster size variation, further complications arise with informative cluster sizes, that is, when cluster size is related to the outcome of interest. 
For instance, cluster size is informative when the amount of alcohol consumed by an adolescent is related to the number of students enrolled in the school, as small schools might provide a more supportive environment,[11-13] or when the number of patients registered with a general practice affects its efficacy in preventing expensive hospitalisations, thus impacting on public expenditure on health per patient. Informative cluster sizes can not only have direct policy implications, such as introducing a limit to school or general practice size, but also have consequences for statistical data analysis and sample size planning. In the informative cluster size literature (see the review by Seaman et al., and references therein), the main focus has been on how to handle informative cluster size when the target of inference is the association between the outcome variable and some covariates (e.g. a risk factor). For instance, Seaman et al. have discussed several methods to make cluster-specific inferences with Generalized Linear Mixed Models and population-average inferences with Generalized Estimating Equations when cluster size is informative. Innocenti et al., instead, have investigated a different topic: the implications of informative cluster size for unbiased and efficient estimation of a population mean in surveys conducted with the three aforementioned TSS schemes. The present paper is also about mean estimation for these three TSS schemes when cluster size is informative, but focuses instead on sample size planning, and the consequences of informative cluster size for the required sample sizes and budget. Innocenti et al.'s results are the starting point of this paper and are therefore summarized here. First, there are two definitions of overall mean in a two-level population, namely the average of all individual outcomes and the average of all cluster-specific means. These two definitions coincide only if cluster sizes are either equal or non-informative. 
Second, when cluster size is informative, estimation of the mean of all individual outcomes (i.e. the definition used in this paper) is unbiased under TSS1 with the unweighted average of cluster means, and asymptotically unbiased under TSS2 and TSS3 with the average of cluster means weighted by cluster size. In contrast, when cluster size is non-informative, the unweighted average of cluster means is unbiased for all sampling schemes, but optimally efficient for TSS1 and TSS3 only. Third, under the constraint of a fixed total sample size, SRS is more efficient than any TSS scheme, TSS3 is the least efficient TSS scheme, and TSS1 is the most efficient for many cluster size distributions. Indeed, when cluster size is informative, the relative efficiency of these sampling schemes depends on some features of the cluster size distribution in the population, such as the coefficient of variation, the skewness, and the kurtosis. However, when cluster size is non-informative, TSS1 and TSS3 are equally efficient and outperform TSS2. Fourth, the two inferential paradigms in survey sampling, namely the model-based and the design-based approach, give similar results in terms of unbiased and efficient estimation of the average of all individual outcomes with the three aforementioned TSS schemes, at least if the model assumptions are met. Furthermore, sample size planning and sampling scheme comparisons, which are the topics of this paper, are much more feasible with the assumption of a model for the outcome variable of interest. For these two reasons, the model-based approach is adopted here. This work extends the results of Innocenti et al. in the following ways. First, for each of the three aforementioned TSS schemes, the optimal design is derived. Here, the optimal design is defined as that design (i.e. number of clusters and number of individuals per cluster) that minimizes the sampling variance of the population mean estimator subject to a cost constraint. 
Second, the three optimal TSS schemes are compared with SRS and with each other under the constraint of a fixed budget. Third, to take care of uncertainty with respect to model parameters and distributional features of cluster size, as a practical alternative, maximin designs are derived. Fourth, sample size calculations for making comparisons between populations are derived and illustrated. This paper is structured as follows. In section 2, the assumptions of this paper are presented, as well as the sampling schemes and the corresponding mean estimators. Furthermore, the findings of a simulation study to assess the accuracy of some results in Innocenti et al. that are relevant to the present paper are summarized. In section 3, the optimal design for each TSS scheme is derived, and these optimal TSS designs are compared with each other and with SRS for a fixed budget. Furthermore, the consequences of ignoring informative cluster size at the design phase of a study are investigated. Section 4 deals with the maximin approach, that is, a strategy to solve the dependency of the optimal design on unknown nuisance parameters. Section 5 provides a procedure for computing sample sizes for surveys aimed to make cross-population comparisons, and the procedure is illustrated in planning a survey for comparing the average alcohol consumption among adolescents in France and Italy. Section 6 offers some final remarks. The mathematical derivations of the results, the description of the simulation study discussed in section 2, and additional figures and tables can be found in the Supplementary Material 1 (S.M.1). The Supplementary Material 2 (S.M.2) provides the R code of the simulation study and other R codes to apply some of the mathematical results of this paper.

2 Assumptions, sampling schemes and mean estimators

The results of Innocenti et al. and this paper are based on the following assumptions (the notation used in the main text is summarized in the Appendix). Assumption 1: The population is composed of clusters and each cluster contains individuals, that is, in the population clusters vary in size ( ). The population size is . Assumption 2: Sampling is either SRS of individuals in one stage, or else TSS. In TSS, we first sample clusters, and then sample or individuals per selected cluster . In case of TSS, the population is very large relative to the sample size at each design level (i.e. and , where is the average sample size per sampled cluster, and is the population mean of cluster size). In case of SRS, is very large relative to , the number of individuals sampled (i.e. ). Assumption 3: The outcome variable is quantitative (e.g. alcohol consumption) and measured at the individual level. Further, shows variation at the cluster level as well as at the individual level. Therefore, sampling error occurs at each design level. This is taken into account by assuming the following two-level random intercept model for the outcome of the -th individual from the -th cluster , where , and cluster effect and individual effect are unrelated (i.e. ). The distribution of will be defined in the next assumption. Assumption 4: Cluster effect is linearly related to cluster size , that is, , where for model identifiability, , and is the component of cluster effect that does not depend on cluster size (i.e. ). Thus, the conditional distribution of given is . Innocenti et al. show that in model (1) is the average of all cluster-specific means in the population, and differs from the average of all individual outcomes in the population , unless cluster size is non-informative ( ) or constant across clusters, as can be seen from the following expression where , , and are, respectively, the population mean, the coefficient of variation, and the variance of cluster size. 
The distinction between and comes from considering the distribution of cluster effect over either the population of clusters (which yields ) or the population of individuals (which yields ). This paper focuses on . With the aim of estimating , the three aforementioned TSS schemes are studied in this paper. For each of these TSS schemes and SRS, Table 1 summarizes the sampling procedure (i.e. sample size and inclusion probability per design stage) and the required knowledge before sampling. Furthermore, Table 1 shows the population mean estimator and the sampling variance for each sampling scheme. Denote by the correlation between and , where and , and by the degree of informativeness of cluster size. From Table 1, note that for TSS1 and TSS3, while for TSS2, where is the average population size of the sampled clusters (not to be confused with , that is, the average sample size of the sampled clusters). Furthermore, for TSS2 . The sampling variances in Table 1 are functions of the total unexplained outcome variance , the intraclass correlation coefficient , the sample sizes ( , ), the parameter , and some features of the cluster size distribution in the population: the coefficient of variation , the skewness , and (for TSS2 and TSS3 only) the kurtosis . When cluster size is non-informative ( ), depends only on , , , , and (for TSS2 and TSS3 only) . The estimators associated with SRS and TSS1 are unbiased, and their sampling variances are exact expressions.
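The difference between the two definitions of the overall mean, and the reason why the unweighted average of cluster means is unbiased under TSS1, can be illustrated with a toy population. The sketch below is only an illustration: the three cluster sizes and cluster-specific means are hypothetical numbers, not taken from the paper, and within-cluster sampling error is ignored.

```python
# Toy two-level population (hypothetical numbers, for illustration only).
sizes = [10, 20, 70]               # cluster sizes N_j
cluster_means = [5.0, 4.0, 2.0]    # cluster-specific means; larger clusters have lower means

n_pop = sum(sizes)

# Average of all individual outcomes: cluster means weighted by cluster size.
mean_individuals = sum(N * m for N, m in zip(sizes, cluster_means)) / n_pop

# Average of all cluster-specific means: every cluster counts equally.
mean_clusters = sum(cluster_means) / len(cluster_means)

# Expected value of the mean of one cluster drawn with probability
# proportional to size (first stage of TSS1).
pps_probs = [N / n_pop for N in sizes]
expected_pps = sum(p * m for p, m in zip(pps_probs, cluster_means))

# Expected value of the mean of one cluster drawn with equal probability
# (first stage of TSS2 and TSS3, before weighting by cluster size).
expected_equal = sum(cluster_means) / len(cluster_means)

print(mean_individuals, mean_clusters, expected_pps, expected_equal)
```

In this toy case the probability-proportional-to-size draw reproduces the average of all individual outcomes, whereas the equal-probability draw reproduces the average of cluster-specific means, which is why TSS2 and TSS3 need cluster means weighted by cluster size.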

Table 1.

Sampling schemes, required prior knowledge, population mean estimators, and sampling variances.

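Scheme	Item	Description
TSS1	Stage 1	$k$ clusters with probability $\pi_j \approx kN_j / \sum_{j=1}^{K} N_j$
	Stage 2	$n$ individuals per sampled cluster with probability $\pi_{i|j} = n/N_j$
	Required prior knowledge	List of all $K$ clusters in the population and their sizes $N_j$. List of all individuals within the $k$ sampled clusters.
	$\hat{\mu}$	$\sum_{j=1}^{k} \bar{y}_j / k$
	$V(\hat{\mu})$	$\frac{\sigma_y^2}{nk}\left\{1+\rho\left[(n-1)+n\psi\left(\tau_N(\zeta_N-\tau_N)+1\right)\right]\right\}$
TSS2	Stage 1	$k$ clusters with probability $\pi_j = k/K$
	Stage 2	$n_j = pN_j$ individuals per sampled cluster with probability $\pi_{i|j} = n_j/N_j = p$
	Required prior knowledge	List of all $K$ clusters in the population. List of all individuals within the $k$ sampled clusters.
	$\hat{\mu}$	$\frac{\sum_{j=1}^{k} n_j\bar{y}_j}{\sum_{j=1}^{k} n_j} = \frac{\sum_{j=1}^{k} pN_j\bar{y}_j}{\sum_{j=1}^{k} pN_j} = \frac{\sum_{j=1}^{k} N_j\bar{y}_j}{\sum_{j=1}^{k} N_j}$
	$V(\hat{\mu})$	$\frac{\sigma_y^2}{\bar{n}k}\left\{1+\rho\left[(\tau_N^2+1)\left(\frac{\tau_N^2}{k}+1\right)\bar{n}-1+\bar{n}\psi\left(\left(\frac{k-1}{k}\right)^2\tau_N^2\left[\eta_N-\frac{k-3}{k-1}+\tau_N(\tau_N-2\zeta_N)\right]+2\,\frac{k-1}{k}\,\tau_N(\zeta_N-\tau_N)+1\right)\right]\right\}$
TSS3	Stage 1	$k$ clusters with probability $\pi_j = k/K$
	Stage 2	$n$ individuals per sampled cluster with probability $\pi_{i|j} = n/N_j$
	Required prior knowledge	List of all $K$ clusters in the population. List of all individuals within the $k$ sampled clusters.
	$\hat{\mu}$	$\frac{\sum_{j=1}^{k} N_j\bar{y}_j}{\sum_{j=1}^{k} N_j}$
	$V(\hat{\mu})$	$\frac{\sigma_y^2}{nk}\left\{(\tau_N^2+1)\left(\frac{\tau_N^2}{k}+1\right)+\rho\left[(\tau_N^2+1)\left(\frac{\tau_N^2}{k}+1\right)(n-1)+n\psi\left(\left(\frac{k-1}{k}\right)^2\tau_N^2\left[\eta_N-\frac{k-3}{k-1}+\tau_N(\tau_N-2\zeta_N)\right]+2\,\frac{k-1}{k}\,\tau_N(\zeta_N-\tau_N)+1\right)\right]\right\}$
SRS	Stage 1	$m$ individuals with probability $\pi_i = m/N_{pop}$
	Stage 2	(none)
	Required prior knowledge	List of all $N_{pop}$ individuals in the population.
	$\hat{\mu}$	$\sum_{i=1}^{m} y_i / m$
	$V(\hat{\mu})$	$\frac{\sigma_y^2}{m}\left\{1+\rho\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]\right\}$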

Note: For TSS1, follows from if . For TSS2, , where , and .
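As a minimal numerical sketch, the SRS and TSS1 sampling variances in Table 1 can be coded directly. The sketch assumes the Table 1 entries read $V(\hat{\mu}_{SRS}) = \frac{\sigma_y^2}{m}\{1+\rho\psi[\tau_N(\zeta_N-\tau_N)+1]\}$ and $V(\hat{\mu}_{TSS1}) = \frac{\sigma_y^2}{nk}\{1+\rho[(n-1)+n\psi(\tau_N(\zeta_N-\tau_N)+1)]\}$; the function names and the parameter values used in the checks are hypothetical.

```python
def size_factor(tau, zeta):
    """Cluster size factor tau_N * (zeta_N - tau_N) + 1 shared by SRS and TSS1."""
    return tau * (zeta - tau) + 1.0


def var_srs(sigma2_y, rho, psi, tau, zeta, m):
    """Sampling variance of the sample mean under SRS of m individuals."""
    return sigma2_y / m * (1.0 + rho * psi * size_factor(tau, zeta))


def var_tss1(sigma2_y, rho, psi, tau, zeta, n, k):
    """Sampling variance of the unweighted average of k cluster means under TSS1."""
    inner = (n - 1) + n * psi * size_factor(tau, zeta)
    return sigma2_y / (n * k) * (1.0 + rho * inner)


# Sanity check 1: with one individual per cluster, the TSS1 expression
# coincides with the SRS expression for m = k.
v_tss1_n1 = var_tss1(1.0, 0.1, 0.35, 0.5, 1.0, n=1, k=50)
v_srs_50 = var_srs(1.0, 0.1, 0.35, 0.5, 1.0, m=50)

# Sanity check 2: with psi = 0 the TSS1 expression reduces to the familiar
# design effect 1 + rho * (n - 1) divided by the total sample size.
v_tss1_psi0 = var_tss1(1.0, 0.2, 0.0, 0.5, 1.0, n=8, k=40)
```

Both checks follow algebraically from the two expressions above and do not depend on the particular numbers chosen.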

The estimators associated with TSS2 and TSS3 are only asymptotically unbiased, and the corresponding sampling variances are based on first-order Taylor series approximations. The accuracy of these approximations was evaluated through a simulation study discussed in supplementary material S.M.1 (section 1), but the main findings are summarized here. Sampling clusters guarantees nearly unbiased estimates of under TSS2 and TSS3 independently of the cluster size distribution, and fair accuracy (i.e. bias ) of the variances in Table 1 (TSS2 and TSS3 row) when , , and and are relatively close (say, ) to those of the Normal distribution (i.e. and ). However, for cluster size distributions with extreme skewness and kurtosis (e.g. and ) at least clusters must be sampled to achieve a reasonable accuracy (i.e. bias 6%) of the sampling variances in Table 1, for and . Furthermore, the simulations showed that the two lower-bounds for (i.e. , and ) guarantee the corresponding accuracy level across different values for (at least for ). To contextualize these two lower-bounds for , in a school-based survey for studying substance use among adolescents in European countries, Shackleton et al. have reported that, across countries, (Median ) and (Median ).

3 Optimal design and relative efficiencies for a given budget

3.1 Optimal design

For any sampling scheme, the precision of the estimator $\hat{\mu}$, and thus also the width of a confidence interval for $\mu$ and the statistical power for testing a hypothesis on $\mu$, depends on the number of clusters $k$ and on the sample size per cluster $n$ (Table 1). This raises the question of the best combination of sample sizes at each design stage (i.e. sampling many clusters versus sampling many individuals per cluster). Define the optimal design as that design (i.e. number of clusters and number of individuals per cluster) which minimizes $V(\hat{\mu})$ subject to a cost constraint, given that time and budget are limited in practice. For TSS, the cost constraint is assumed to be $C = k(c_2 + c_1 n)$, where $C$ is the budget for sampling and measuring (excluding costs for constructing the sampling frame and other costs not related to sample size). From now on $C$ is called the research budget. Furthermore, $c_2$ is the average cost for sampling a cluster, $c_1$ is the average cost for sampling an individual from a sampled cluster, and $c_2 + c_1 n$ is the cost per cluster including the costs for sampling $n$ individuals from that cluster (recall that for TSS2 $n = \bar{n}$). For SRS, the cost constraint is $C = c_0 + c_{srs} m$, where $m$ is the number of individuals to sample, $c_{srs}$ is the average cost for sampling an individual directly from the population, and $c_0$ represents the extra-cost due to constructing the sampling frame for a SRS compared with the sampling frame for a TSS. For each TSS scheme, the optimal design (i.e. the optimal sample sizes $n^*$ and $k^*$) for estimating $\mu$ and the optimal variance $V^*(\hat{\mu})$ (i.e. $V(\hat{\mu})$ under the optimal design) are given in Table 2 (for proofs, see section 2.2 of S.M.1). For TSS2, one can obtain the optimal proportion of individuals to sample per cluster, $p^*$, from the optimal $n^*$, by dividing $n^*$ as given in Table 2 (TSS2 row) by $\theta_N$. The optimal TSS2 and TSS3 designs depend on two approximations of $V(\hat{\mu})$: the first-order Taylor approximation mentioned in section 2 and evaluated in S.M.1 (section 1), which underlies the equations in Table 1, and an approximation based on large $k$ (i.e. $k$ such that $\tau_N^2/k \approx 0$, $(k-1)/k \approx 1$, and $(k-3)/(k-1) \approx 1$) to simplify the expressions in Table 1. These two approximations give the following equations (for details, see section 2.1 of S.M.1)

$$V(\hat{\mu}_{TSS2}) \approx \frac{\sigma_y^2}{\bar{n}k}\left\{1+\rho\left[\bar{n}(\tau_N^2+1)-1+\bar{n}\psi\left(\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right)\right]\right\} \quad (3)$$

and

$$V(\hat{\mu}_{TSS3}) \approx \frac{\sigma_y^2}{nk}\left\{(\tau_N^2+1)+\rho\left[(\tau_N^2+1)(n-1)+n\psi\left(\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right)\right]\right\} \quad (4)$$

where for TSS2, $\bar{n} = p\theta_N$. Recall from section 2 that, for TSS2 and TSS3, $k$ must be large anyway, because the estimators $\hat{\mu}_{TSS2}$ and $\hat{\mu}_{TSS3}$ given in Table 1 are only asymptotically unbiased. As a special case, $\psi = 0$ gives the optimal design and optimal variance for non-informative cluster size (in which case the two definitions of the overall mean coincide), which under TSS1 coincide with the equations available for cluster randomized trials (for instance, see Moerbeek et al.). There is no such equivalence under TSS2 due to sample size variation between clusters, and under TSS3 due to weighting cluster means by cluster size if informative cluster size is assumed in the design phase. Indeed, under non-informative cluster size, no weighting is needed under TSS3, and then the optimal design equations for TSS1 apply to TSS3 as well.

Table 2.

Optimal design and optimal variance for each sampling scheme.

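Scheme	Item	Expression
SRS	$V^*(\hat{\mu})$	$\dfrac{c_{srs}\,\sigma_y^2\left\{1+\rho\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]\right\}}{C-c_0}$
TSS1	Optimal design	$n^* = \sqrt{c_r\,\dfrac{1-\rho}{\rho}\,\dfrac{1}{1+\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]}}$, $k^* = \dfrac{C}{c_1(c_r+n^*)}$
TSS1	$V^*(\hat{\mu})$	$\dfrac{c_1\sigma_y^2\left(\sqrt{c_r\rho\left\{1+\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}{C}$
TSS2	Optimal design	$n^* = \sqrt{c_r\,\dfrac{1-\rho}{\rho}\,\dfrac{1}{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]}}$, $k^* = \dfrac{C}{c_1(c_r+n^*)}$
TSS2	$V^*(\hat{\mu})$	$\dfrac{c_1\sigma_y^2\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}{C}$
TSS3	Optimal design	$n^* = \sqrt{c_r\,\dfrac{1-\rho}{\rho}\,\dfrac{\tau_N^2+1}{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]}}$, $k^* = \dfrac{C}{c_1(c_r+n^*)}$
TSS3	$V^*(\hat{\mu})$	$\dfrac{c_1\sigma_y^2\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{(1-\rho)(\tau_N^2+1)}\right)^2}{C}$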

Note: Derivations are given in section 2.2 in supplementary material S.M.1. Note that , implies that that, in turn, entails that since . Note that for any distribution (for proof, see section 2.1, S.M.1). Recall that for TSS2 .

Note from Table 2 that the optimal number of clusters $k^*$ and the optimal number of individuals per cluster $n^*$ are inversely related, and that $n^*$ is an increasing function of the cluster-to-individual cost ratio $c_r$ and a decreasing function of $\rho$ and $\psi$. These relations between the optimal design and $c_r$, $\rho$, and $\psi$ hold under TSS1 when a condition on the cluster size distribution is satisfied, and always under TSS2 and TSS3 (for proof, see section 2.1 of S.M.1). That condition is met by all the distributions in Tables S.2 and S.7 (S.M.1). Hence, it is assumed to be satisfied when considering results for TSS1 in the sequel.
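The Table 2 expressions lend themselves to a short numerical sketch. The function below is an illustration rather than the authors' R code from S.M.2: its name, argument names, and the input values in the usage lines are hypothetical, and it assumes the reading of Table 2 in which $n^* = \sqrt{c_r \frac{1-\rho}{\rho} \frac{w}{a}}$, $k^* = C/(c_1(c_r+n^*))$, and $V^* = c_1\sigma_y^2(\sqrt{c_r\rho a}+\sqrt{(1-\rho)w})^2/C$, with $a$ the cluster-level term of each scheme and $w = \tau_N^2+1$ for TSS3 and $w = 1$ otherwise.

```python
from math import sqrt


def optimal_design(scheme, C, c1, cr, sigma2_y, rho, psi, tau, zeta, eta=3.0):
    """Return (n*, k*, V*) for scheme 'TSS1', 'TSS2' or 'TSS3'.

    Requires 0 < rho < 1. The returned sample sizes are not rounded;
    in practice they must be rounded to integers within the budget.
    """
    if scheme == "TSS1":
        cluster_part = 1.0 + psi * (tau * (zeta - tau) + 1.0)
    elif scheme in ("TSS2", "TSS3"):
        d = tau**4 + tau**2 * (eta - 3.0) + 2.0 * zeta * tau * (1.0 - tau**2) + 1.0
        cluster_part = tau**2 + 1.0 + psi * d
    else:
        raise ValueError("scheme must be 'TSS1', 'TSS2' or 'TSS3'")
    # Individual-level multiplier: only TSS3 carries the factor tau^2 + 1.
    indiv_part = tau**2 + 1.0 if scheme == "TSS3" else 1.0

    n_opt = sqrt(cr * (1.0 - rho) / rho * indiv_part / cluster_part)
    k_opt = C / (c1 * (cr + n_opt))
    v_opt = (c1 * sigma2_y
             * (sqrt(cr * rho * cluster_part) + sqrt((1.0 - rho) * indiv_part)) ** 2
             / C)
    return n_opt, k_opt, v_opt


# Hypothetical inputs: budget 10000, c1 = 10, cr = 16, rho = 0.2.
n1, k1, v1 = optimal_design("TSS1", 10000, 10, 16, 1.0, 0.2, 0.0, 0.5, 1.0)
designs = {s: optimal_design(s, 10000, 10, 16, 1.0, 0.2, 0.35, 0.5, 0.0, 3.0)
           for s in ("TSS1", "TSS2", "TSS3")}
```

For TSS2 the returned $n^*$ is the average sample size per cluster, so the optimal sampling proportion would be obtained by dividing it by the mean cluster size.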

3.2 Effect of cluster size informativeness on the optimal design and study budget needed

The optimal number of individuals per cluster for TSS1 and TSS3 is plotted in Figure 1, for two real-life cluster size distributions: the general practice list size distribution in England, and the public high school size distribution in Italy (both distributions are shown in Figure S.1, S.M.1). The behaviour of for other cluster size distributions is shown in Figures S.2 and S.4 (S.M.1) for TSS1 and TSS3, respectively, and in Figure S.3 (S.M.1) for TSS2. In most scenarios in Figure 1 and Figures S.2–S.4 (S.M.1), the difference between for (i.e. ) and for (i.e. ) is small, which means that the ratio of under the design assuming to under the design assuming , when the true , is close to 1. So, the optimal designs in Table 2 are quite robust against misspecification of , in the sense of being efficient relative to the optimal design for the true and given a fixed research budget . However, ignoring informativeness can lead to serious underestimation of the sampling variance of the mean estimator, and thereby also of the budget needed, as will be seen below. Further, the optimal design depends not only on , but also on and the cluster size distribution ( , , ). That dependence will be addressed in section 4.

Figure 1.

Optimal number of individuals per cluster n* under TSS1 (left column) and TSS3 (right column), as a function of ρ, for different values of cr and ψ (curves), and different cluster size distributions (rows). The cluster size distributions are shown in Figure S.1 (S.M.1). Note that ψ = 0.35 corresponds to ρuN = ±0.51.

An example will now show that (a) given a study budget, the optimal design is robust against misspecification of cluster size informativeness, but (b) the budget needed is very sensitive to misspecification. Suppose we plan a survey to estimate in the population of all patients of all general practices in England. The parameters of the general practice patient list size distribution are , , and (Table S.2, S.M.1). Furthermore, suppose that , , and . The optimal TSS1 samples individuals and clusters assuming , and and assuming (see Table 2, TSS1 row). If the true , for the design correctly assuming , and for the design incorrectly assuming (see variance equation in Table 1, TSS1 row), giving a variance ratio . Additional results for TSS1, TSS2, and TSS3 are given in Table S.8 (S.M.1), which shows that even in some more extreme cases (e.g. , i.e. ) the variance ratio still exceeds 0.8. The example given here and those in Table S.8 (S.M.1) show that the optimal designs in Table 2 are quite robust against misspecification of , in the sense of being efficient relative to the optimal design for the true and given a fixed research budget . However, ignoring informativeness can lead to serious underestimation of the budget needed. Suppose one wants to test the null hypothesis that against the alternative hypothesis that . The budget that guarantees the desired power level for the chosen type I error rate , is then obtained by equating in Table 2 with , where is the th percentile of the standard normal distribution. This gives , where is the numerator of in Table 2 excluding , and is the standardized difference between true mean and mean according to . 
Since is an increasing function of , , and , the required budget for the desired power level also increases with , , and . Likewise, increases with , at least up to (for proofs, see section 2.2 in S.M.1). The required budget to detect a standardized difference of medium size ( ), with power and two-tailed , is plotted in Figure 2 for TSS1 and TSS3, as function of , for the general practice list size distribution in England and the public high school size distribution in Italy, and assuming . As can be seen in Figure 2, the research budget is not robust against misspecification of . For example, the required budget for the optimal TSS1, assuming the English general practice list size distribution, , , and (Figure 2, left column, first row), is underestimated by 29% if one incorrectly assumes when the true . The required budget is also shown, for other cluster size distributions, in Figures S.5 and S.7 (S.M.1) for TSS1 and TSS3, respectively, and in Figure S.6 (S.M.1) for TSS2. These figures show that increases with , , and , and that the impact of the cluster size distribution on becomes more relevant as increases. Hence, ignoring informative cluster size at the design phase of the survey can lead to underestimating the required budget for the chosen effect size and desired power level. Finally, for the desired power level, the required budget is smallest with the optimal TSS1, and largest with the optimal TSS3.
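The budget calculation described above can be sketched for TSS1 as follows. This is an illustration under stated assumptions, not the paper's own code: it assumes a two-tailed test (as in Figure 2), so that the optimal variance is equated with $(\delta/(z_{1-\alpha/2}+z_{power}))^2$ for a raw difference $\delta$, which gives a budget equal to the numerator of the TSS1 optimal variance in Table 2 (excluding $\sigma_y^2$) times $(z_{1-\alpha/2}+z_{power})^2/d_0^2$. The function name and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def required_budget_tss1(d0, power, alpha, c1, cr, rho, psi, tau, zeta):
    """Budget needed by the optimal TSS1 design to detect a standardized
    difference d0 with the given power, using a two-tailed test at level alpha."""
    z = NormalDist().inv_cdf  # percentile function of the standard normal
    size_factor = tau * (zeta - tau) + 1.0
    # Numerator of the optimal TSS1 variance in Table 2, excluding sigma_y^2.
    g = c1 * (sqrt(cr * rho * (1.0 + psi * size_factor)) + sqrt(1.0 - rho)) ** 2
    return g * (z(1.0 - alpha / 2.0) + z(power)) ** 2 / d0 ** 2


# Hypothetical inputs: d0 = 0.5, 90% power, alpha = 0.05, c1 = 10, cr = 16, rho = 0.2.
budget_ignoring = required_budget_tss1(0.5, 0.90, 0.05, 10, 16, 0.2, 0.0, 0.5, 1.0)
budget_informative = required_budget_tss1(0.5, 0.90, 0.05, 10, 16, 0.2, 0.35, 0.5, 1.0)

# Share of the truly required budget that is missed when psi = 0 is assumed.
underestimation = 1.0 - budget_ignoring / budget_informative
```

With these hypothetical inputs the budget computed under $\psi = 0$ falls short of the budget computed under $\psi = 0.35$, which mirrors the qualitative message of Figure 2.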

Figure 2.

Budget C needed for the optimal design to detect a standardized difference between hypothesized and true population mean of medium size (d0 = 0.5), with 90% power using a two-tailed test with α = 0.05, as a function of ψ, for different values of ρ and cr (curves) with c1 = 10, different sampling schemes (columns), and different cluster size distributions (rows). The cluster size distributions are shown in Figure S.1 (S.M.1). Note that ψ∈[0,1.3] corresponds to ρuN∈[−0.75,+0.75].

3.3 Relative efficiencies for a given budget

We now compare the efficiency of the optimal designs in Table 2 with each other and with SRS, under the constraint of a fixed research budget. The relative efficiency ( ) of the optimal designs for two sampling schemes is defined as the ratio of their optimal variances in Table 2, more specifically, RE . These s are shown in Table 3 (for proofs, see section 2.3, S.M.1), which also gives the sufficient (but not necessary) conditions under which each is smaller than one.

Table 3.

Relative efficiencies of TSS schemes versus SRS and each other for a given budget.

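D1 vs D2	$RE_{D1\,vs\,D2} = V^*_{D2}(\hat{\mu}) / V^*_{D1}(\hat{\mu})$	Sufficient (but not necessary) conditions such that $RE \le 1$
TSS1 vs SRS	$\dfrac{1+\rho\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]}{\left(\sqrt{c_r\rho\left\{1+\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}\times\dfrac{c_{srs}}{c_1}\times\dfrac{C}{C-c_0}$	$\zeta_N \ge \tau_N-\dfrac{1}{\tau_N}-\dfrac{1}{\tau_N\psi}$, and $\dfrac{c_{srs}}{c_1}=\dfrac{C}{C-c_0}=1$
TSS2 vs SRS	$\dfrac{1+\rho\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]}{\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}\times\dfrac{c_{srs}}{c_1}\times\dfrac{C}{C-c_0}$	$\zeta_N \le \tau_N-\dfrac{1}{\tau_N}$ or $\zeta_N \ge \tau_N+\dfrac{1}{\tau_N c_r}-\dfrac{1}{\tau_N}$ or $N_j \sim N(\theta_N,\sigma_N^2)$, and $\dfrac{c_{srs}}{c_1}=\dfrac{C}{C-c_0}=1$
TSS3 vs SRS	$\dfrac{1+\rho\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]}{\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{(1-\rho)(\tau_N^2+1)}\right)^2}\times\dfrac{c_{srs}}{c_1}\times\dfrac{C}{C-c_0}$	$\zeta_N \le \tau_N-\dfrac{1}{\tau_N}$ or $\zeta_N \ge \tau_N+\dfrac{1}{\tau_N c_r}-\dfrac{1}{\tau_N}$ or $N_j \sim N(\theta_N,\sigma_N^2)$, and $\dfrac{c_{srs}}{c_1}=\dfrac{C}{C-c_0}=1$
TSS2 vs TSS1	$\dfrac{\left(\sqrt{c_r\rho\left\{1+\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}{\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}$	$\tau_N-\dfrac{1}{\tau_N}-\dfrac{1}{\tau_N\psi} \le \zeta_N \le \tau_N-\dfrac{1}{\tau_N}$ or $\zeta_N \ge \tau_N$ or $N_j \sim N(\theta_N,\sigma_N^2)$
TSS3 vs TSS1	$\dfrac{\left(\sqrt{c_r\rho\left\{1+\psi\left[\tau_N(\zeta_N-\tau_N)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}{\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{(1-\rho)(\tau_N^2+1)}\right)^2}$	$\tau_N-\dfrac{1}{\tau_N}-\dfrac{1}{\tau_N\psi} \le \zeta_N \le \tau_N-\dfrac{1}{\tau_N}$ or $\zeta_N \ge \tau_N$ or $N_j \sim N(\theta_N,\sigma_N^2)$
TSS3 vs TSS2	$\dfrac{\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{1-\rho}\right)^2}{\left(\sqrt{c_r\rho\left\{\tau_N^2+1+\psi\left[\tau_N^4+\tau_N^2(\eta_N-3)+2\zeta_N\tau_N(1-\tau_N^2)+1\right]\right\}}+\sqrt{(1-\rho)(\tau_N^2+1)}\right)^2}$	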

Note: Derivations are given in section 2.3 in supplementary material S.M.1. Recall that is the optimal variance in Table 2, , , and . The conditions for in the rightmost column are valid for and are satisfied by all distributions in Table S.7 (S.M.1).

The RE of a TSS scheme compared with SRS (Table 3, first three rows) is composed of three ratios. The first ratio is a function of $c_r$, $\rho$, $\psi$, $\tau_N$, $\zeta_N$, and $\eta_N$, and is always smaller than one for $\psi = 0$, and also for $\psi > 0$ at least under the conditions for $RE \le 1$ given in the rightmost column of Table 3 (for proofs, see section 2.3, S.M.1). The other two components of RE are the ratio $c_{srs}/c_1$, for the costs per individual in SRS relative to TSS, and the budget ratio $C/(C-c_0)$. Since sampling an individual directly from the population will be more expensive than sampling an individual after having sampled the cluster to which he/she belongs (i.e. $c_{srs} \ge c_1$), and constructing the sampling frame for a SRS has extra-costs compared with constructing the sampling frame for a TSS (i.e. $c_0 \ge 0$), the ratios $c_{srs}/c_1$ and $C/(C-c_0)$ will always be at least one and often larger than one. As a result, the RE can become larger than one, implying that SRS can be less efficient than TSS under the constraint of a fixed budget. The REs of the optimal TSS1 and TSS3 versus SRS are shown in Figure 3, for the general practice list size distribution in England and the public high school size distribution in Italy, and assuming $c_{srs}/c_1 = C/(C-c_0) = 1$ (note that values greater than 1 give a higher RE of TSS versus SRS). Further, Figures S.8–S.10 (S.M.1) show the REs of the optimal TSS1, TSS2, and TSS3 versus SRS for other cluster size distributions. For $\psi = 0$, the RE of any optimal TSS versus SRS is a decreasing function of (i) $c_r$ (Table 3), (ii) $\rho$ (at least over the range of values shown in Figure 3, and Figures S.8–S.10 in S.M.1), and (iii), only for TSS2 and TSS3, $\tau_N$ (Table 3). For $\psi > 0$, the patterns remain almost the same as before and the REs also do not seem to vary much across cluster size distributions (Figure 3, and Figures S.8–S.10 in S.M.1).
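To make the three-ratio structure concrete, the sketch below evaluates the RE of the optimal TSS1 versus SRS. It assumes the first row of Table 3 reads as the ratio of $1+\rho\psi[\tau_N(\zeta_N-\tau_N)+1]$ to $(\sqrt{c_r\rho\{1+\psi[\tau_N(\zeta_N-\tau_N)+1]\}}+\sqrt{1-\rho})^2$, multiplied by the cost ratio and the budget ratio; the function name, argument names, and input values are hypothetical.

```python
from math import sqrt


def re_tss1_vs_srs(cr, rho, psi, tau, zeta, cost_ratio=1.0, budget_ratio=1.0):
    """Relative efficiency of the optimal TSS1 versus SRS for a given budget.

    cost_ratio stands for c_srs / c_1 and budget_ratio for C / (C - c_0).
    Values above one mean that TSS1 is more efficient than SRS.
    """
    size_factor = tau * (zeta - tau) + 1.0
    numerator = 1.0 + rho * psi * size_factor
    denominator = (sqrt(cr * rho * (1.0 + psi * size_factor)) + sqrt(1.0 - rho)) ** 2
    return numerator / denominator * cost_ratio * budget_ratio


# Hypothetical inputs: cr = 16, rho = 0.2, non-informative cluster size.
re_equal_costs = re_tss1_vs_srs(16, 0.2, 0.0, 0.5, 1.0)
# Same design parameters, but SRS is four times as costly per individual
# and its sampling frame consumes half of the budget.
re_costly_srs = re_tss1_vs_srs(16, 0.2, 0.0, 0.5, 1.0, cost_ratio=4.0, budget_ratio=2.0)
```

The second call illustrates how the cost and budget ratios alone can push the RE above one even though the first ratio is below one.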

Figure 3.

Relative efficiency of the optimal TSS1 versus SRS (left column), and of the optimal TSS3 versus SRS (right column), for a given research budget C and assuming (csrs/c1) = (C/(C-c0)) = 1 (values greater than 1 give a higher RE of TSS versus SRS), as a function of ρ, for different values of cr and ψ (curves), and different cluster size distributions (rows). The cluster size distributions are shown in Figure S.1 (S.M.1). Note that ψ = 0.35 corresponds to ρuN = ±0.51.

The REs of the three TSS schemes compared with each other (Table 3, last three rows) are functions of $c_r$, $\rho$, $\psi$, $\tau_N$, $\zeta_N$, and $\eta_N$. The optimal TSS2 is more efficient than the optimal TSS3 since $RE_{TSS3\,vs\,TSS2} \le 1$ (unless $\rho = 1$, Table 3, or $\tau_N = 0$, since in both cases $RE_{TSS3\,vs\,TSS2} = 1$). The REs of TSS2 and TSS3 versus TSS1 are smaller than one, and so the optimal TSS1 is the most efficient TSS scheme, at least for cluster size distributions satisfying the conditions in Table 3 (rightmost column), such as all distributions in Table S.7 (S.M.1). For other cluster size distributions, one must compute the RE for that particular distribution to see whether $RE \le 1$. However, for $\psi = 0$, the REs in the last three rows of Table 3 are all smaller than one for any cluster size distribution, making TSS1 the most efficient TSS scheme, followed by TSS2. Note that this only holds if informative cluster size ($\psi > 0$) is assumed at the design stage, such that in TSS3 cluster means are weighted by cluster size to estimate $\mu$ (Table 1). If non-informative cluster size ($\psi = 0$) is assumed already in the design stage, then no weighting is needed for TSS3, and TSS3 then is as efficient as TSS1. The REs of the optimal TSS2 and TSS3 versus the optimal TSS1 are shown in Figure 4, for the general practice list size distribution in England and the public high school size distribution in Italy, and in Figures S.11–S.12 (S.M.1) for four other cluster size distributions. For $\psi = 0$, these REs reduce to $RE_{TSS2\,vs\,TSS1} = \left(\sqrt{c_r\rho}+\sqrt{1-\rho}\right)^2 / \left(\sqrt{c_r\rho(\tau_N^2+1)}+\sqrt{1-\rho}\right)^2$ and $RE_{TSS3\,vs\,TSS1} = 1/(\tau_N^2+1)$, which are both decreasing functions of $\tau_N$, but $RE_{TSS2\,vs\,TSS1}$ also decreases as $c_r$ and/or $\rho$ increases. 
For , the patterns are the same as before with two major differences. First, both s decrease as increases (Table 3). Second, for , both s differ at most from their values at (Figure 4, and Figures S.11–S.12 in S.M.1), except for the English general practice (GP) list size distribution that, having an extreme kurtosis (i.e. ), shows a drop in (compared with the case ) larger than . Note that TSS1 is the most efficient design in Figure 4 and Figures S.11–S.12 (S.M.1).

Figure 4.

Relative efficiency of the optimal TSS2 versus the optimal TSS1 (left column), and of the optimal TSS3 versus the optimal TSS1 (right column), for a given research budget, as a function of ρ, for different values of cr and ψ (curves), and different cluster size distributions (rows). The cluster size distributions are shown in Figure S.1 (S.M.1). Note that ψ = 0.35 corresponds to ρuN = ±0.51.

4 Maximin design

In section 3.2 it was noted that the optimal designs in Table 2 require a priori knowledge of some nuisance parameters (i.e. the intraclass correlation ρ, the informativeness parameter ψ, and the coefficient of variation, skewness, and kurtosis of the cluster size distribution). This is known as the local optimality problem in the optimal design literature: the optimal design is optimal only for certain values of these nuisance parameters. In this paper, the local optimality problem is solved by taking a maximin approach.[20-22] This approach has been applied in several contexts, such as longitudinal studies,[23-25] fMRI experiments, cluster randomized and multicentre trials,[27-29] cost-effectiveness studies, life-event studies, test construction, and biological and pharmacological studies.[34-37] The maximin approach is composed of the following steps:

1. Define the parameter space, that is, for each unknown parameter (i.e. each of the five nuisance parameters listed above) determine the range of plausible values.
2. Define the design space, that is, the set of all candidate designs (combinations of the number of clusters and the number of individuals per cluster). In this step, one can rule out those designs that are unfeasible in practice (e.g. too many clusters to cover relative to the time available for data collection), thus preventing sample size adjustments afterwards.
3. For each design in the design space, find those values of the nuisance parameters which minimize the efficiency (and thus maximize the variance of the mean estimator) within the range of their plausible values, as defined in step 1.
4. Choose the design that maximizes the minimum efficiency obtained in step 3. In other words, choose the number of clusters and the number of individuals per cluster that minimize this variance given the worst-case values of the nuisance parameters chosen in step 3.

The resulting design is called the maximin design, which is the optimal design for the worst-case scenario, as defined by the set of parameter values chosen in step 3.
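The four steps can be illustrated with a deliberately simplified sketch. It assumes non-informative cluster size, equal cluster sizes, unit outcome variance, and the classical variance of a cluster-sample mean, (ρ + (1 − ρ)/n)/k, so that ρ is the only nuisance parameter; it does not use equations (3) and (4) or Table 2 of this paper. The function names, the grids, and the budget of 5000 are hypothetical choices made for illustration only.

```python
from math import sqrt


def variance_of_mean(n, rho, budget, cost_individual, cost_cluster):
    """Simplified variance of the sample mean for cluster size n under a budget.

    Assumes unit outcome variance and non-informative, equal cluster sizes.
    """
    k = budget / (cost_cluster + cost_individual * n)  # clusters the budget allows
    return (rho + (1.0 - rho) / n) / k


def maximin_design(n_grid, rho_grid, budget, cost_individual, cost_cluster):
    """Grid search: minimise over designs the worst-case variance over rho."""
    best = None
    for n in n_grid:  # step 2: loop over the design space
        # Step 3: worst case (largest variance) over the parameter space.
        worst_var, worst_rho = max(
            (variance_of_mean(n, rho, budget, cost_individual, cost_cluster), rho)
            for rho in rho_grid
        )
        # Step 4: keep the design with the smallest worst-case variance.
        if best is None or worst_var < best["worst_variance"]:
            best = {
                "n": n,
                "k": budget / (cost_cluster + cost_individual * n),
                "worst_rho": worst_rho,
                "worst_variance": worst_var,
            }
    return best


rho_grid = [i / 100 for i in range(11)]  # step 1: rho assumed to lie in [0, 0.10]
n_grid = range(2, 61)                    # candidate numbers of individuals per cluster
design = maximin_design(n_grid, rho_grid, budget=5000.0,
                        cost_individual=10.0, cost_cluster=200.0)

# Closed-form optimum of the same simplified model at the worst-case rho.
closed_form_n = sqrt((200.0 / 10.0) * (1.0 - 0.10) / 0.10)
print(design, closed_form_n)
```

In this simplified setting the worst case is always the largest plausible ρ, so the grid search agrees with plugging that value into the closed-form optimum, which mirrors the plug-in rule described for TSS1.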
The advantage of the maximin design is that it not only maximizes the efficiency and the power in the worst-case scenario, but also guarantees at least that same efficiency and power level for all other parameter values within the parameter space. Indeed, the variance of the mean estimator is smaller, and the power for hypothesis testing on the population mean is larger, for all other parameter values than for the worst-case values chosen in step 3, given any fixed sample size (i.e. number of clusters and number of individuals per cluster). Following the four steps above, we now explain how to find the maximin design for each sampling scheme. The optimal design for TSS1 depends on ρ, ψ, and two parameters of the cluster size distribution. However, to draw a TSS1 sample we need to know the cluster size distribution in the population anyway, which means that these two distribution parameters are also known before sampling. Thus, for TSS1, only ρ and ψ are unknown. The maximin design for TSS1 is obtained by plugging into the optimal sample size equations (Table 2, TSS1 row) the largest realistic values of ρ and ψ (for proofs, see section 3.1 in S.M.1). Unlike for TSS1, when sampling with TSS2 or TSS3 the researcher needs no prior knowledge of the whole cluster size distribution. Indeed, if such information is available, sampling with TSS1 is a better choice (Table 3). The maximin design for TSS2 and TSS3 is obtained by plugging into the optimal design equations (Table 2) the upper-bounds of the ranges for , , and , and the worst-case value of (for proofs, see section 3.1 in S.M.1). The latter value can be obtained with an R function given in S.M.2 (section 2), which searches numerically for the value of that maximizes the variance (i.e. equations (3) and (4)) within its range of plausible values, given the worst-case values for , , and . For several upper-bounds for , , , and , a numerical evaluation was performed and this always gave as worst-case value of within the range (for details, see section 3.2 in S.M.1). To be on the safe side in sample size planning, one can assume for the parameter range in health and medical research, , and in educational research.
Lacking empirical evidence for or , we propose , which corresponds to . The range can be justified by considering Table S.7 (S.M.1), and the extreme cases of an exponential cluster size distribution, for which , and of a binary distribution with half of all clusters having size and the other half having size , for which . Finally, for and the ranges and can be chosen based on Table S.7 (S.M.1). Since under TSS2 and TSS3 is an increasing function of and (at least if , which will usually hold), assuming positive skewness and positive excess kurtosis (i.e. ) is a safe choice. As mentioned in section 3.1, the optimal design for TSS2 and TSS3 depends on two approximations: the first-order Taylor series approximation used to derive for TSS2 and TSS3 in Table 1, and the large approximation to simplify the equations in Table 1 into equations (3) and (4). Since the maximin design is the optimal design for the worst-case scenario, the same approximations also underlie the maximin design. Based on the simulation study and the numerical evaluation discussed in S.M.1 (sections 1 and 3.3), it turned out that each approximation induces a bias of at most in the used to derive the optimal/maximin design if the optimal/maximin , or, for and , . Since , a simple solution is to increase the maximin with to ensure sufficient power at the expense of a higher budget . However, if the maximin or (for and ) both approximations are biased by more than . A solution is to first increase such that maximin or (for and ) , and then further increase by .

5 Sample size calculation for cross-population comparisons

The results of the previous sections make it possible to plan a survey efficiently not only for estimating a mean, but also for comparing different populations, provided the samples are independent. An example of such a study is the ESPAD study, which compares substance use among 15–16-year-old students across 35 European countries. For a fixed separate budget per population, the optimal design per population is given in Table 2 and the maximin design in section 4. However, the design can be further optimized by constraining the total budget (i.e. the sum of the separate budgets) instead of each separate budget and finding the optimal (or maximin) budget split between populations (for details, see section 4 of S.M.1). For the case of comparing two populations, this optimization was formalized into a procedure to compute maximin sample sizes per population and the maximin budget split between populations, obtained by extending the results of Van Breukelen and Candel to TSS1 with informative cluster sizes and different cluster size distributions per population. This procedure for comparing two populations is implemented in R code given in section 4 of S.M.2. To use this program, the researcher needs to specify the costs c1 and c2 per population, the relevant parameters of the cluster size distribution of each population, the largest plausible values for ψ and ρ, a range for the ratio of the outcome standard deviations between the two populations, the smallest difference worth detecting, the maximum sum of outcome variances in both populations, the power level, and the type I error rate. The R code (S.M.2, section 4) returns the maximin sample sizes per population and the maximin budget split. The steps of this procedure are given in S.M.1 (section 4). This procedure is presented only for TSS1, because it is the most efficient sampling scheme for many cluster size distributions. Let us demonstrate the procedure with the following example.
Suppose that we want to plan a survey to estimate and compare the average alcohol consumption among high school students between France and Italy. Similar to the ESPAD study, alcohol consumption is measured as the average volume of ethanol (in centilitres) consumed on the last drinking day. Based on the adolescent health literature, at the design stage, school size (i.e. total number of students) can be assumed to be informative, that is, related to alcohol consumption. Indeed, it has been found that school size and school connectedness, broadly defined as the degree of belonging at school, are inversely related, as are school connectedness and alcohol use. TSS1 is the most efficient two-stage sampling scheme for both high school size distributions (this can be verified by checking the conditions in the rightmost column of Table 3, with the numbers given in the second and third row of Table S.7 of S.M.1), and so it is chosen for both populations. Suppose that we want to test the null hypothesis that the population means of alcohol consumption in France and Italy are equal against the two-sided alternative hypothesis that they differ. Since the French and the Italian samples are independent, we can apply the procedure above to determine how many schools and how many students per school one has to sample per country, and how to split the total budget between countries. The results are shown in Table 4 for four different cost scenarios. Two largest plausible values are assumed for each of ψ and ρ (ψmax = 0 or 0.35, and ρmax = 0.1 or 0.2). This combination of costs and model parameters (Table 4, first six columns) gives a total of 16 scenarios, each corresponding to a row in Table 4. The seventh column in Table 4 gives the maximin budget split (i.e. the ratio of the budget for France, CF, to that for Italy, CI), and the eighth to the eleventh columns show the maximin sample sizes per country.
Finally, the rightmost column of Table 4 shows the total budget C required to detect a standardized difference of medium size with the specified power, using a two-tailed test at the specified type I error rate. From Table 4, it can be seen that the maximin n per country is an increasing function of the ratio of the cost per cluster to the cost per individual, a decreasing function of ρmax and ψmax, and is inversely related to the maximin k. Furthermore, the maximin budget split CF/CI equals 1 only for ψmax = 0 and homogeneous costs (c1,F = c1,I and c2,F = c2,I). In all other scenarios CF/CI < 1, meaning that more budget is allocated to the Italian sample than to the French sample. Given that ρmax and ψmax are the same for both countries, this occurs because (i) sampling a student is more expensive in Italy than in France (c1,I > c1,F), or (ii) sampling a school is more expensive in Italy than in France (c2,I > c2,F), or (iii) only for ψmax = 0.35, the school size distribution in Italy differs from that in France (see Tables S.7 and S.9 of S.M.1). Finally, the total budget required for the desired power is larger for ψmax = 0.35 than for ψmax = 0 (Table 4, rightmost column), suggesting that ignoring informative cluster size at the design stage leads to a research budget that is too low for the desired power level. Specifically, in Table 4 informative cluster size requires C to increase by roughly 23% to 32%, depending on the scenario (the larger ρmax and/or the ratio of the cost per cluster to the cost per individual in Italy, the larger this relative increase, see Table 4, rightmost column).
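The link between a design and its power can be checked with a short sketch. The numerical values of the effect size, power, and type I error rate are not legible in this copy of the text, so the sketch assumes a standardized difference of 0.5 (the conventional "medium" size), a two-tailed α of 0.05, and unit outcome variance in both countries; it also assumes the classical non-informative variance (ρ + (1 − ρ)/n)/k for each country mean, which is only meant to apply to the ψmax = 0 rows. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def var_mean(k, n, rho, sigma2=1.0):
    """Assumed variance of one country's mean: k clusters, n individuals each."""
    return sigma2 * (rho + (1.0 - rho) / n) / k


def power_two_sided(delta, var_diff, alpha=0.05):
    """Power of a two-tailed z-test for a difference delta between two means."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)   # critical value of the test
    shift = delta / sqrt(var_diff)        # standardized true difference
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


# First row of Table 4: psi_max = 0, rho_max = 0.1, k = 14.04 and n = 13.42 in both countries.
var_diff = var_mean(14.04, 13.42, 0.1) + var_mean(14.04, 13.42, 0.1)
power = power_two_sided(delta=0.5, var_diff=var_diff, alpha=0.05)
print(round(power, 3))
```

Under these assumed inputs the first-row design reaches a power very close to 0.90, which suggests, but does not confirm, that this is the power level targeted in Table 4.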

Table 4.

Maximin design (nF, nI, kF, kI) and budget C needed to detect a standardized difference of medium size with the specified power using a two-tailed test, as a function of the maximum ψ (ψmax), the maximum ρ (ρmax), the cost per individual in France (c1,F) and in Italy (c1,I), and the cost for sampling a cluster in France (c2,F) and in Italy (c2,I).

ψmax	ρmax	c1,F	c2,F	c1,I	c2,I	Maximin budget split CF/CI	nF^MD	nI^MD	kF^MD	kI^MD	C
0	0.1	10	200	10	200	1	13.42	13.42	14.04	14.04	9386.54
0	0.1	10	200	20	200	0.74	13.42	9.49	14.04	16.38	11077.36
0	0.1	10	200	10	400	0.64	13.42	18.97	14.04	12.39	12002.01
0	0.1	10	200	20	400	0.50	13.42	13.42	14.04	14.04	14079.82
0	0.2	10	200	10	200	1	8.94	8.94	24.33	24.33	14084.50
0	0.2	10	200	20	200	0.79	8.94	6.32	24.33	27.44	16002.68
0	0.2	10	200	10	400	0.60	8.94	12.65	24.33	22.13	18692.58
0	0.2	10	200	20	400	0.50	8.94	8.94	24.33	24.33	21126.75
0.35	0.1	10	200	10	200	0.98	11.31	11.10	18.52	19.08	11734.95
0.35	0.1	10	200	20	200	0.74	11.31	7.85	18.52	21.91	13620.31
0.35	0.1	10	200	10	400	0.61	11.31	15.70	18.52	17.09	15317.98
0.35	0.1	10	200	20	400	0.49	11.31	11.10	18.52	19.08	17670.90
0.35	0.2	10	200	10	200	0.97	7.54	7.40	32.58	33.63	18187.89
0.35	0.2	10	200	20	200	0.79	7.54	5.23	32.58	37.39	20365.46
0.35	0.2	10	200	10	400	0.57	7.54	10.47	32.58	30.97	24601.40
0.35	0.2	10	200	20	400	0.49	7.54	7.40	32.58	33.63	27402.39
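The entries of Table 4 can be cross-checked with a small sketch. It assumes that the budget per country is k(c2 + c1·n), with c1 the cost per individual and c2 the cost per cluster, and that for ψmax = 0 the number of individuals per cluster follows the classical closed form n = sqrt((c2/c1)(1 − ρ)/ρ). Both are assumptions inferred from the numbers in the table rather than formulas quoted from Table 2, and the closed form is not expected to hold for the ψmax = 0.35 rows. The function names are hypothetical.

```python
from math import sqrt


def classical_n(cost_individual, cost_cluster, rho):
    """Classical optimal number of individuals per cluster (non-informative case)."""
    return sqrt((cost_cluster / cost_individual) * (1.0 - rho) / rho)


def country_budget(k, n, cost_individual, cost_cluster):
    """Budget spent in one country: k clusters, n individuals in each."""
    return k * (cost_cluster + cost_individual * n)


# Second row of Table 4: psi_max = 0, rho_max = 0.1, student cost 10 (France) vs 20 (Italy).
n_F = classical_n(cost_individual=10, cost_cluster=200, rho=0.1)
n_I = classical_n(cost_individual=20, cost_cluster=200, rho=0.1)

# Budgets recomputed from the rounded sample sizes printed in the table.
C_F = country_budget(k=14.04, n=13.42, cost_individual=10, cost_cluster=200)
C_I = country_budget(k=16.38, n=9.49, cost_individual=20, cost_cluster=200)
total = C_F + C_I
split = C_F / C_I
print(round(n_F, 2), round(n_I, 2), round(total, 2), round(split, 3))
```

Because the table reports rounded sample sizes, the recomputed total and split agree with the printed values only up to rounding error.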

6 Discussion

To estimate an overall mean, two-stage sampling is a logistically convenient way to collect data from a multilevel population. In practice, resources (time and money) for sampling are limited. Thus, this paper presents optimal sample sizes per design stage that either maximize the precision of the population mean estimate for the available research budget, or minimize the research budget for the required precision of estimation. Such optimal designs were derived for three TSS schemes: sampling clusters with probability proportional to cluster size, and then the same number of individuals per cluster (TSS1); sampling clusters with equal probability, and then the same percentage of individuals per cluster (TSS2); and sampling clusters with equal probability, and then the same number of individuals per cluster (TSS3). The optimal sample size equations were derived allowing cluster size to be informative, that is, to be related to the outcome variable of interest. It turned out that the optimal designs given in Table 2 are quite robust against misspecification of the degree of informativeness of cluster size ψ. As shown in section 3.2 and in Table S.8 (S.M.1), the relative efficiency of the optimal TSS1 assuming ψ = 0 (i.e. non-informative cluster size) versus the optimal TSS1 assuming ψ > 0 (i.e. informative cluster size), when the true ψ > 0, was close to one. Nevertheless, ignoring informative cluster size is risky for two reasons. First, assuming ψ = 0 one would be tempted to combine the unweighted average of cluster means with TSS3, because this strategy (i.e. combination of sampling scheme and estimator) is unbiased and efficient for ψ = 0. However, this strategy is biased and inefficient if the true ψ > 0. Thus, assuming ψ > 0 is always prudent because it leads to combining the unweighted average of cluster means with TSS1, that is, choosing a strategy which is unbiased and highly efficient for both informative and non-informative cluster size.
Second, assuming ψ = 0 can lead to underestimating the research budget for the desired power level, because the research budget is an increasing function of ψ (see Figure 2, and Table 4, rightmost column). This applies not only to TSS1, but also if, because of practical constraints, one has to choose TSS2 or TSS3 as a sampling scheme. For these two reasons, we recommend assuming ψ > 0 at the design stage of the survey. The optimal designs of the three TSS schemes were compared with each other and with SRS under the constraint of a fixed budget. In contrast to what was the case under the constraint of a fixed total sample size, SRS can be less efficient than TSS, because it is more expensive to construct a sampling frame of all individuals in the population than of those from the selected clusters only (c0 > 0), and because it is more costly to sample and measure geographically dispersed individuals than those that are grouped in a natural cluster (e.g. school, general practice) (csrs > c1). Under informative cluster size, the optimal TSS1 was shown to be the most efficient sampling scheme for many cluster size distributions, followed by TSS2, and then TSS3. We thus recommend TSS1, provided all cluster sizes are known before sampling. The optimal design depends on several unknown parameters (i.e. the intraclass correlation ρ, the informativeness parameter ψ, and the cluster size distribution’s coefficient of variation, skewness, and kurtosis). To address this issue the maximin approach was proposed. For the considered TSS schemes, this strategy consists of plugging the worst-case value for each unknown parameter into the optimal design equations in Table 2. For , , and , the largest plausible value is the worst-case value. If all plausible values for , then the largest plausible value for is also the worst-case value. The worst-case value for can be obtained with R code, given in S.M.2 (section 2).
However, a numerical evaluation showed that if the largest plausible value for is 1, this is the worst-case value for . The R code also returns the worst-case value for in the rather unrealistic case that some plausible values for . The maximin approach has the advantages of being relatively simple to implement, and being robust against misspecification of the unknown parameters by maximizing the minimum efficiency over the ranges of their plausible values. An alternative approach is to obtain estimates of the nuisance parameters from a pilot study and use these in the sample size calculation. However, the intraclass correlation risks being underestimated (and thus the main survey being under-powered), unless the pilot study samples a large number of clusters and of individuals per cluster, which means a sizeable portion of the limited resources for the main survey has to be devoted to the pilot study. The underestimation is likely to be even more severe for skewness and kurtosis, given that their traditional estimators are biased downwards unless the sample size is large or (only for the skewness) cluster size is normally distributed. For all these reasons, we recommend the maximin approach. Relatedly, to improve the planning of future surveys, empirical studies should report values of these nuisance parameters as in Table S.7 (S.M.1). The results of this paper also make it possible to plan surveys efficiently for comparing different populations, provided the samples are independent. For TSS1, a procedure to derive maximin sample sizes and the maximin budget split between populations was obtained by extending Van Breukelen and Candel’s findings to informative cluster size. Analogous extensions for TSS2 and TSS3 could be explored. However, when either cluster size is non-informative (ψ = 0), or the cluster size distribution as well as the informativeness parameter is the same in both populations (e.g. treated and control groups in a cluster randomized trial), we have that (see equation (2)) and then the equations given in this paper reduce to simpler expressions as also derived by Van Breukelen and Candel (i.e. those for TSS1 with ). Finally, in this paper the model-based approach to survey sampling was adopted. However, the results of this paper are also valid under the design-based approach, provided model (1) and assumption 4 hold and inference is then based on the sampling scheme. Future research could extend the results of this paper by considering dichotomous outcomes and three-level populations, and by deriving the optimal design for longitudinal studies to monitor trends.

Supplemental material, sj-zip-1-smm-10.1177_0962280220952833 and sj-pdf-2-smm-10.1177_0962280220952833, for Optimal two-stage sampling for mean estimation in multilevel populations when cluster size is informative by Francesco Innocenti, Math JJM Candel, Frans ES Tan and Gerard JP van Breukelen in Statistical Methods in Medical Research.

23 in total

1. The National Nursing Home Survey: 1999 summary.

Authors: Adrienne Jones
Journal: Vital Health Stat 13 Date: 2002-06

2. Maximin D-optimal designs for longitudinal mixed effects models.

Authors: Mario J N M Ouwens; Frans E S Tan; Martijn P F Berger
Journal: Biometrics Date: 2002-12 Impact factor: 2.571

3. Sample size calculation for treatment effects in randomized trials with fixed cluster sizes and heterogeneous intraclass correlations and variances.

Authors: Math J J M Candel; Gerard J P van Breukelen
Journal: Stat Methods Med Res Date: 2014-12-17 Impact factor: 3.021

4. How big should the pilot study for my cluster randomised trial be?

Authors: Sandra M Eldridge; Ceire E Costelloe; Brennan C Kahan; Gillian A Lancaster; Sally M Kerry
Journal: Stat Methods Med Res Date: 2015-06-12 Impact factor: 3.021

5. Sample size calculation in cost-effectiveness cluster randomized trials: optimal and maximin approaches.

Authors: Md Abu Manju; Math J J M Candel; Martijn P F Berger
Journal: Stat Med Date: 2014-07-10 Impact factor: 2.373

6. School connectedness in the health behavior in school-aged children study: the role of student, school, and school neighborhood characteristics.

Authors: Douglas R Thompson; Ronaldo Iachan; Mary Overpeck; James G Ross; Lori A Gross
Journal: J Sch Health Date: 2006-09 Impact factor: 2.118

7. Protecting adolescents from harm. Findings from the National Longitudinal Study on Adolescent Health.

Authors: M D Resnick; P S Bearman; R W Blum; K E Bauman; K M Harris; J Jones; J Tabor; T Beuhring; R E Sieving; M Shew; M Ireland; L H Bearinger; J R Udry
Journal: JAMA Date: 1997-09-10 Impact factor: 56.272

8. Patterns of common drug use in teenagers.

Authors: G C Patton; M Hibbert; M J Rosier; J B Carlin; J Caust; G Bowes
Journal: Aust J Public Health Date: 1995-08

9. Maximin optimal designs for cluster randomized trials.

Authors: Sheng Wu; Weng Kee Wong; Catherine M Crespi
Journal: Biometrics Date: 2017-02-09 Impact factor: 1.701

10. Efficient design of cluster randomized trials with treatment-dependent costs and treatment-dependent unknown variances.

Authors: Gerard J P van Breukelen; Math J J M Candel
Journal: Stat Med Date: 2018-06-10 Impact factor: 2.373
