Comparing different ways of calculating sample size for two independent means: A worked example.

Lei Clifton¹, Jacqueline Birks¹, David A Clifton².

Abstract

We discuss different methods of sample size calculation for two independent means, aiming to provide insight into the calculation of sample size at the design stage of a parallel two-arm randomised controlled trial (RCT). We compare different methods for sample size calculation, using published results from a previous RCT. We use variances and correlation coefficients to compare sample sizes using different methods, including 1. The choice of the primary outcome measure: post-intervention score vs. change from baseline score. 2. The choice of statistical methods: t-test without using correlation coefficients vs. analysis of covariance (ANCOVA). We show that the required sample size will depend on whether the outcome measure is the post-intervention score, or the change from baseline score, with or without baseline score included as a covariate. We show that certain assumptions have to be met when using simplified sample size equations, and discuss their implications in sample size calculation when planning an RCT. We strongly recommend publishing the crucial result "mean change (SE, standard error)" in a study paper, because it allows (i) the calculation of the variance of the change score in each arm, and (ii) to pool the variances from both arms. It also enables us to calculate the correlation coefficient in each arm. This subsequently allows us to calculate sample size using change score as the outcome measure. We use simulation to demonstrate how sample sizes by different methods are influenced by the strength of the correlation.

Keywords: Arm; Baseline; Change score; Correlation; Covariate; Independent; Means; Outcome measure; Post-intervention; RCT; Sample size; Standard deviation; Standard error; Variance

Year: 2018 PMID: 30582068 PMCID: PMC6297128 DOI: 10.1016/j.conctc.2018.100309

Source DB: PubMed Journal: Contemp Clin Trials Commun ISSN: 2451-8654

Background

Sample size calculations for a parallel two-arm trial with a continuous outcome measure can be undertaken based on (i) a pre-specified difference between arms at the post-intervention endpoint and (ii) an estimate of the standard deviation (SD) of the outcome measure. If the outcome variable is also measured at baseline, an alternative outcome measure is change from baseline instead of the post-intervention measure. Use of this alternative outcome measure would result in a different power calculation from that obtained using the post-intervention as the outcome measure. It is possible to carry out a power calculation based on analysis of covariance (ANCOVA) where the baseline measure is included as a covariate in the analysis. Sample size calculations typically use published results from trials similar to those under consideration. We use results from a published paper for the MOSAIC trial [1] to compare different methods for sample size calculation. We examine the assumptions made by each method for calculating sample size, and discuss the implications of these assumptions when calculating the required sample size for a new RCT. We aim to provide insight into sample size calculations at the design stage of an RCT. We introduce the notion of change scores, and show how to derive variances of these change scores along with related correlation coefficients in Section 3, using published results. We then calculate and compare sample sizes using different methods in Section 4. A description of the simulation of different strengths of the correlation is presented in Section 5, with the aim of investigating its influence on the calculation of sample sizes using different methods. Section 6 discusses simplified sample size equations when certain assumptions are met. Finally, we consider implications in sample size calculation when planning an RCT in Section 7.

Method

Published results of the MOSAIC trial

The MOSAIC trial is an RCT using continuous positive airway pressure (CPAP) for symptomatic obstructive sleep apnoea. The trial randomised 391 patients between two treatment arms (CPAP vs. standard care). It has two primary outcomes at 6 months: change in Epworth Sleepiness Score (ESS), and change in predicted 5-year mortality using a cardiovascular risk score. The authors also reported the energy/vitality score (referred to as the “energy score” hereafter) of the 36-item short-form questionnaire (SF-36). The change in SF-36 energy score at 6 months is a secondary outcome of the MOSAIC trial, and an investigator might conduct another RCT using it as the primary outcome. The online supplement of the MOSAIC paper [1] states that all data were analysed using multiple variable regression models adjusting for the minimisation variables and baseline value of the variable being analysed. Table 1 shows data concerning the SF-36 energy score, taken from Table 4 in the MOSAIC paper [1]. The outcome measure is energy score in the SF-36 questionnaire, measured at baseline and at 6 months post-intervention. An increase in the energy score indicates an improvement in health status. The table shows that the adjusted treatment effect (6.6) is the same as the unadjusted treatment effect (10.8–4.2 = 6.6). The baseline mean scores are similar in both arms, being 49.7 and 49.8, respectively.

Table 1

SF-36 energy score at baseline and 6-month post-intervention, reproduced using results from the MOSAIC trial.

Energy	Control arm (N = 168)	CPAP arm (N = 171)
Baseline mean score (SD)	49.7 (23.7)	49.8 (22.4)
6-month mean score (SD)	53.9 (22.5)	60.6 (20.9)
Mean change (SE)	+4.2 (1.4)	+10.8 (1.3)
Adjusted treatment effect (95% CI)	+6.6 (+3.1 to +10.1)
p value	p < 0.0001

CPAP, continuous positive airway pressure; SF-36, 36-item Short-Form health survey; SD, standard deviation; SE, standard error; CI, confidence interval; N, number of participants.

Table 4

Correlation r	0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1.0
N by ANCOVA	170	169	164	155	143	128	109	87	62	33	0
N by t-test on post score	170	170	170	170	170	170	170	170	170	170	170
N by t-test on change score (Fig. 1)	112	112	112	112	112	112	112	112	112	112	112
N by t-test on change score (Fig. 2)	363	326	290	254	218	182	146	110	73	37	0

In the following sections, we show how to derive the variances of the change scores and correlation coefficients between baseline and 6 month measurements for both arms, using the results reported in Table 1 including “Mean change (SE)”.

Deriving the sample variance of the change score

We use generic notation in this paper, noting that the proposed method is applicable to arbitrary continuous outcome measures. Suppose the primary continuous outcome measure is $Y$, with $Y_0$ and $Y_1$ denoting $Y$ at baseline and post-intervention, respectively. For simplicity, we will call $Y_0$ the “baseline score”, $Y_1$ the “post score”, and $(Y_1 - Y_0)$ the “change score”. Let $s_{Y_0}^2$ denote the sample variance of the baseline score $Y_0$, $s_{Y_1}^2$ the sample variance of the post score $Y_1$, and $s_{(Y_1 - Y_0)}^2$ the sample variance of the change score $(Y_1 - Y_0)$. Let $s_{Y_0}$, $s_{Y_1}$, and $s_{(Y_1 - Y_0)}$ denote their corresponding standard deviations (SD). We show how to derive $s_{(Y_1 - Y_0)}^2$ in each arm, for the purpose of calculating sample size. Let $SE_{(Y_1 - Y_0)}$ denote the standard error (SE) of $(Y_1 - Y_0)$, and $N$ denote the number of participants; $s_{(Y_1 - Y_0)}$ can then be expressed as

$$s_{(Y_1 - Y_0)} = SE_{(Y_1 - Y_0)} \times \sqrt{N}.$$

The SEs reported in Table 1 (1.4 and 1.3 in the control and intervention arms, respectively) are the results that allow us to derive $s_{(Y_1 - Y_0)}$ using the relationship above. For the control arm, we have $s_{(Y_1 - Y_0)} = 1.4 \times \sqrt{168} = 18.15$. For the intervention arm, we have $s_{(Y_1 - Y_0)} = 1.3 \times \sqrt{171} = 17.00$. These derived values of $s_{(Y_1 - Y_0)}$ differ between the two treatment arms; therefore, we will need to use their pooled variance for the calculation of sample size. Using the equation shown in the Appendix, the pooled sample variance of $(Y_1 - Y_0)$ is

$$s_{(Y_1 - Y_0)}^2 = \frac{(168 - 1) \times 18.15^2 + (171 - 1) \times 17.00^2}{168 + 171 - 2} = 17.58^2.$$

The calculation of $s_{(Y_1 - Y_0)}^2$ above requires the knowledge of “mean change (SE)” reported in Table 1. The presence of the correlation coefficient $r$ between $Y_0$ and $Y_1$ is implicitly acknowledged, and we will use $s_{(Y_1 - Y_0)}^2$ to derive the value of $r$ in the next section.
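The arithmetic in this section can be checked with a few lines of code. The following is a minimal sketch in Python, assuming only the values reported in Table 1 and the standard two-sample pooled-variance formula; the function names `sd_from_se` and `pooled_variance` are illustrative and do not come from the paper.

```python
import math

def sd_from_se(se, n):
    """Recover the SD of the change score from the SE of the mean change."""
    return se * math.sqrt(n)

def pooled_variance(sd_a, n_a, sd_b, n_b):
    """Pooled sample variance of two independent arms."""
    return ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)

# "Mean change (SE)" and N from Table 1: control arm, then CPAP arm.
sd_control = sd_from_se(1.4, 168)
sd_cpap = sd_from_se(1.3, 171)
pooled_sd = math.sqrt(pooled_variance(sd_control, 168, sd_cpap, 171))

print(f"{sd_control:.2f} {sd_cpap:.2f} {pooled_sd:.2f}")
```

The sketch carries unrounded SDs into the pooling step, so the last digit may differ slightly from a hand calculation that rounds at each stage.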

Deriving the correlation coefficient between $Y_0$ and $Y_1$

This section shows how to use the variance sum law to derive the correlation coefficient $r$ between $Y_0$ and $Y_1$. The variance sum law states

$$s_{(Y_1 - Y_0)}^2 = s_{Y_0}^2 + s_{Y_1}^2 - 2 r s_{Y_0} s_{Y_1}. \qquad (1)$$

Let $r_c$ and $r_i$ denote $r$ in the control and intervention arms, respectively. Substituting $s_{Y_0}$, $s_{Y_1}$, and our derived $s_{(Y_1 - Y_0)}$ into the variance sum law above, we have $r_c = 0.6925$ and $r_i = 0.6937$. Table 2 summarises the sample variances and correlation coefficients for the exemplar study. Here we have explicitly calculated the value of $r$ using $s_{(Y_1 - Y_0)}^2$ derived in the previous section.
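Solving the variance sum law for the correlation coefficient is a one-line rearrangement. The sketch below, in Python, assumes the SDs from Table 1 and the rounded change-score SDs derived in the previous section (18.15 and 17.00); the function name `correlation_from_sds` is an illustrative choice, not the paper's.

```python
def correlation_from_sds(sd_baseline, sd_post, sd_change):
    """Rearrange the variance sum law to solve for the correlation r."""
    numerator = sd_baseline**2 + sd_post**2 - sd_change**2
    return numerator / (2 * sd_baseline * sd_post)

# Control arm, then CPAP arm (SDs from Table 1, change-score SDs as derived).
r_control = correlation_from_sds(23.7, 22.5, 18.15)
r_cpap = correlation_from_sds(22.4, 20.9, 17.00)

print(round(r_control, 4), round(r_cpap, 4))
```

Because the correlation depends on a difference of squared SDs, it is somewhat sensitive to how the inputs are rounded; using unrounded change-score SDs can shift the fourth decimal place.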

Table 2

Summary of sample variances.

Energy score	Control arm (N = 168)	CPAP arm (N = 171)	Pooled
Variance of baseline score, $s_{Y_0}^2$	23.7²	22.4²	23.1²
Variance of post score, $s_{Y_1}^2$	22.5²	20.9²	21.7²
Variance of change score, $s_{(Y_1 - Y_0)}^2$	18.15²	17.00²	17.58²
Correlation between baseline and post scores	0.6925	0.6937	–

The derived correlation coefficients in the control and intervention arms, $r_c$ and $r_i$, are very similar, being approximately equal to 0.7; therefore, we will use $r = 0.7$ for the sample size calculation in the following sections. We note that if $r_c \neq r_i$, the sample size method via ANCOVA in this paper will not be valid; in this example, the values of $r_c$ and $r_i$ are very close, supporting the use of ANCOVA for sample size calculation. We will discuss the implication of different values for $r_c$ and $r_i$ in later sections.

Comparing different sample size calculations

The calculation of sample size will depend on whether the outcome measure is to be the post score or the change score, with or without the baseline score included as a covariate.

Sample size: t-test on post score

Using $Y_1$ as the outcome measure in our example, the pooled variance of $Y_1$ is (see Appendix)

$$s_{Y_1}^2 = \frac{(168 - 1) \times 22.5^2 + (171 - 1) \times 20.9^2}{168 + 171 - 2} = 21.7^2.$$

For a two-sided significance level $\alpha$ at power $1 - \beta$, with pooled variance $s^2$ of the outcome measure, the required number of patients per arm is approximately [2]

$$N = \frac{2 (z_{1 - \alpha/2} + z_{1 - \beta})^2 s^2}{\delta^2}, \qquad (2)$$

where $\delta$ is the target mean difference between the two treatment arms, and where $z_{1 - \alpha/2}$ and $z_{1 - \beta}$ are the corresponding quantiles of the standard normal distribution. If assuming an equal variance in both arms, simply substitute that common variance for $s^2$ in Equation (2). In the exemplar considered by this paper, we use two-sided significance level $\alpha = 0.05$ and power $1 - \beta = 0.8$, corresponding to $z_{1 - \alpha/2} = 1.96$ and $z_{1 - \beta} = 0.84$, respectively. In our example, the target mean difference is set to be the reported treatment effect in Table 1, $\delta = 6.6$. The variances of $Y_1$ in the two arms are different, and we have calculated the pooled variance $s_{Y_1}^2 = 21.7^2$. The required number of patients per arm is approximately

$$N = \frac{2 (1.96 + 0.84)^2 \times 21.7^2}{6.6^2} \approx 170.$$

In the trial design stage, the characteristics of the planned RCT will inevitably differ from those of a previously-published trial, and it is therefore desirable to calculate sample sizes over a range of variances. For example, assuming equal variance using $22.5^2$ and $20.9^2$ in Equation (2), the resulting sample sizes are $N = 183$ and $N = 158$, respectively. The pooled variance produces an intermediate sample size, $N = 170$. In practice, one may choose to calculate $N$ using the most conservative (i.e., the greatest) value of the variances when designing a new RCT.
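The per-arm sample size for a two-sided comparison of two means under the normal approximation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `n_per_arm` is hypothetical, exact normal quantiles from the standard library are used instead of the rounded 1.96 and 0.84, and results are rounded up to the next whole patient, which is an assumed convention.

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided test of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sd**2 / delta**2
    return math.ceil(n)  # round up to a whole patient (assumed convention)

# Pooled SD of the post score, then the two arm-specific SDs from Table 1.
for sd in (21.7, 22.5, 20.9):
    print(sd, n_per_arm(6.6, sd))
```

Running the function over a small grid of SDs, as here, is one way to see how sensitive the required sample size is to the assumed variance.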

Sample size: t-test on change score

When using the change score $(Y_1 - Y_0)$ as the outcome measure, we can still use Equation (2) to calculate $N$, using the pooled variance of $(Y_1 - Y_0)$, $s_{(Y_1 - Y_0)}^2$. We have derived $s_{(Y_1 - Y_0)}^2 = 17.58^2$ in the previous section; substituting the latter into Equation (2) gives

$$N = \frac{2 (1.96 + 0.84)^2 \times 17.58^2}{6.6^2} \approx 112.$$

For comparison, if we assume equal variance using $18.15^2$ and $17.00^2$ in Equation (2), the resulting sample sizes are $N = 119$ and $N = 105$, respectively. The pooled variance produces an intermediate sample size, $N = 112$. We have used this pooled variance in the sample size calculation shown in Table 3.

Table 3

Comparing sample sizes using different outcome measures and statistical methods.

Outcome	N in each arm by ANCOVA	N in each arm by t-test
$Y_1$	87 (85)	170 (171)
$(Y_1 - Y_0)$	–	112 (113)

$N$, number of patients in each arm. $N$ calculated by equation is shown together with $N$ produced by PASS software: $N$ by equation ($N$ by PASS).

We strongly recommend publishing the resulting “mean change (SE)” in a study paper, because it allows the calculation of $s_{(Y_1 - Y_0)}^2$ in each arm, and the pooling of the variances from both arms. We note here that deriving $s_{(Y_1 - Y_0)}^2$ does not require knowledge of the correlation coefficient $r$ between $Y_0$ and $Y_1$, as long as the SE of $(Y_1 - Y_0)$ is reported. As shown in previous sections, the derived $s_{(Y_1 - Y_0)}^2$ enables us to calculate $r$ in each arm. This subsequently allows us to calculate sample size using the change score $(Y_1 - Y_0)$ as the outcome measure. We will use the derived $r$ to calculate $N$ via ANCOVA in the next section.

Sample size: assumptions of ANCOVA on $Y_1$ adjusting for $Y_0$

When using $Y_1$ as the outcome while adjusting for $Y_0$, the sample size can be calculated via ANCOVA. Let $\sigma_{Y_0}^2$ and $\sigma_{Y_1}^2$ be the variances of $Y_0$ and $Y_1$, respectively. Let $(Y_{0ij}, Y_{1ij})$ be the paired data of $Y_0$ and $Y_1$, where $i$ represents the two treatment arms, and where $j$ represents each of the patients. If we assume $(Y_0, Y_1)$ follow a bivariate normal distribution, then the distribution of $Y_1$, which is conditioned on $Y_0$, is a univariate normal distribution with a variance of $(1 - r^2)\sigma_{Y_1}^2$, as shown in the Appendix. We note that $\sigma_{Y_0}^2$, the variance of the baseline score $Y_0$, does not appear in the conditional variance of $Y_1$. This relationship indicates a variance deflation factor $(1 - r^2)$ that can be used for sample size calculation. However, this variance deflation factor only holds under the assumption of a bivariate normal distribution of $(Y_0, Y_1)$. As stated above, this means that the marginal distribution of $Y_0$ is normal, and that the marginal distribution of $Y_1$ is also normal, hence the usual assumed normality for a t-test is met. However, the marginal normal distributions of $Y_0$ and $Y_1$ do not guarantee the bivariate normal distribution of $(Y_0, Y_1)$. Therefore, the assumption of a bivariate normal distribution of $(Y_0, Y_1)$ is a stronger assumption than the assumption in a t-test for sample size, and can be violated in practice. It is necessary to examine the assumption of a bivariate normal distribution of $(Y_0, Y_1)$ before applying the variance deflation factor in the sample size calculation. It is straightforward to visualise $(Y_0, Y_1)$ by plotting the data in a two-dimensional space, with treatment arm on the horizontal axis, and on the vertical axis. This visualisation will immediately reveal whether the assumption of a bivariate normal distribution is violated. It is possible that data will form two clusters corresponding to the control and intervention arms, respectively, which therefore violates the assumption. Borm, Fransen et al. [3] used this relationship for sample size calculation via ANCOVA, but the authors did not explicitly discuss its assumption.
There are several other assumptions one must make before applying the variance deflation factor $(1 - r^2)$. In this paper, we give mathematical details in the Appendix and explicitly examine all the assumptions, summarised below:

1. All pairs $(Y_0, Y_1)$, including all patients in both arms, follow a bivariate normal distribution. We recommend visualising the data to examine whether this assumption is violated, as discussed above.
2. The values of the correlation coefficient $r$ between $Y_0$ and $Y_1$ are the same in both arms. This means that there exists no interaction between baseline score and the treatment arm. This assumption is adequately met in our example, where $r \approx 0.7$ in both arms of the trial.
3. The variances of $Y_1$, denoted $\sigma_{Y_1}^2$, are the same in both arms. We note that the variance of $Y_0$, denoted $\sigma_{Y_0}^2$, does not affect the variance deflation factor, hence it does not have to take the same value in both arms. This assumption is only mildly violated in our example: Table 2 shows that the pooled $s_{Y_0}^2$ and $s_{Y_1}^2$ are quite similar, being $23.1^2$ and $21.7^2$, respectively. The resulting sample size by ANCOVA shown in Table 3 should still be a reasonable estimate, due to these similar values of the pooled $s_{Y_0}^2$ and $s_{Y_1}^2$.

If all of the above assumptions hold, then the conditional variance of $Y_1$ is $(1 - r^2)\sigma_{Y_1}^2$, indicating a variance deflation factor of $(1 - r^2)$. Let $N$ be the sample size (i.e., the number of patients in each arm) by a t-test on $Y_1$; then the sample size by an ANCOVA on $Y_1$ adjusting for $Y_0$ is

$$N_{\text{ANCOVA}} = (1 - r^2) N \qquad (3)$$

while achieving the same power as a t-test on $Y_1$. Since $0 \le r^2 \le 1$, ANCOVA never requires a larger sample size than a t-test on $Y_1$, and requires a strictly smaller one whenever $r \neq 0$, as illustrated in the first row of Table 3. In our example, the variance of $Y_1$ in the control and intervention arms is different ($22.5^2$ and $20.9^2$, respectively), hence it does not strictly meet the assumption of equal variance above (#3).
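The variance deflation factor translates directly into code. The sketch below, in Python, assumes that the t-test sample size on the post score (170 per arm) has already been obtained and that results are rounded up to a whole patient; the function name `n_ancova` is illustrative only.

```python
import math

def n_ancova(n_ttest_post, r):
    """Deflate the t-test sample size on the post score by (1 - r^2)."""
    return math.ceil((1 - r**2) * n_ttest_post)

# r = 0.7 is the correlation derived for the exemplar study.
print(n_ancova(170, 0.7))
```

With `r = 0` the function returns the t-test sample size unchanged, which reflects that the baseline covariate carries no information about the post score in that case.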

Comparing sample sizes using different methods

This section summarises and compares different methods for sample size calculation. We discuss the following two factors:

1. The choice of the primary outcome measure: post score $Y_1$ vs. change score $(Y_1 - Y_0)$.
2. The choice of statistical methods: t-test without using $r$ vs. ANCOVA.

In all sample size calculations in this paper (including those for which the results are shown in Table 3), we have used the target mean difference $\delta = 6.6$, two-sided $\alpha = 0.05$, allocation ratio = 1, achieving 80% power. All sample sizes are produced using the corresponding pooled variance derived in this paper. We used the PASS 15 system (NCSS, LLC) to validate our sample size calculation by equations, shown as “($N$ by PASS)” in Table 3, and where “$N$ by equation” refers to our derived $N$ in previous sections. The algorithm implemented by the PASS software cites Borm, Fransen et al. [3] as its reference for sample size via ANCOVA, and its results (“$N$ by PASS”) are similar to the “$N$ by equation”. The efficiency (i.e. smaller $N$ while maintaining the same statistical power) gained in ANCOVA by using $r$ comes from making strong assumptions. We have used Equation (3) from Section 4.3 (i.e., sample size via ANCOVA) in Table 3, but we note that its assumptions are not fully met in individual arms, and therefore one should not directly use the variance of individual arms for the sample size calculation in ANCOVA. In this instance, our approach is to use the pooled variance of both arms in the sample size equation via ANCOVA. Acknowledging its limitation in practice, one can produce sample sizes using a range of variances to gain a better sense of the required sample size. In Table 3, we have used $r = 0.7$ for sample size via ANCOVA, as stated previously. In both the “t-test” and “ANCOVA” methods, we have used the pooled variance $s_{Y_1}^2 = 21.7^2$ for the t-test on $Y_1$, and $s_{(Y_1 - Y_0)}^2 = 17.58^2$ for the t-test on $(Y_1 - Y_0)$. In the example corresponding to the results shown in Table 3, ANCOVA produces the smallest sample size, while use of a t-test on $Y_1$ produces the largest.

Calculating sample size via a t-test for outcome $Y_1$ does not consider the correlation between $Y_0$ and $Y_1$, hence will always yield a sample size at least as large as that obtained when using an ANCOVA (which involves the use of the value of $r$). However, $N$ via a t-test for outcome $(Y_1 - Y_0)$ is not always larger than $N$ via ANCOVA, depending on the strength of the correlation and on meeting the assumptions presented earlier.

Simulated sample sizes at different values of $r$

We here simulate different values of $r$, and then compare the sample sizes calculated using different methods. The pooled variances $s_{Y_0}^2 = 23.1^2$, $s_{Y_1}^2 = 21.7^2$, and $s_{(Y_1 - Y_0)}^2 = 17.58^2$ are used in all simulations in this section. The variance sum law shown in Equation (1) indicates that we have the following two options for simulation when varying the value of $r$:

Option 1: Keeping the variance of the change score (i.e., $s_{(Y_1 - Y_0)}^2$) fixed at the derived value of $17.58^2$. The implication is that $s_{Y_0}^2$ and $s_{Y_1}^2$ are allowed to vary according to $r$.

Option 2: Allowing the variance of the change score to vary with $r$, while keeping $s_{Y_0}^2$ and $s_{Y_1}^2$ fixed at the derived values, $23.1^2$ and $21.7^2$, respectively.

We show the simulated sample sizes of these two options in the following sections. The simulated results using both options are shown in Table 4, and are plotted in Fig. 1 and Fig. 2. The same parameter values as presented in Table 3 are used for simulation throughout this section.

Fig. 1

Comparing values of sample size $N$ produced using different methods at different values of $r$, using the same parameter values as are shown in Table 3. The values of $s_{(Y_1 - Y_0)}^2$ remain fixed for all values of $r$, resulting in a constant value of $N$ via a t-test for outcome $(Y_1 - Y_0)$, shown by the short-dashed line. Fig. 1 is intended to be compared with Fig. 2, where the values of $s_{(Y_1 - Y_0)}^2$ are allowed to vary according to the values of $r$.

Fig. 2

Similar to Fig. 1 above, except that the values of $s_{(Y_1 - Y_0)}^2$ are allowed to vary according to the values of $r$. Note that the range of the y-axis here is different from that in Fig. 1.

Table 4: simulated sample sizes at different values of $r$. “$N$ by ANCOVA” produced by option 1 (plotted in Fig. 1) are the same as those produced by option 2 (plotted in Fig. 2). “$N$ by t-test on post score” remains at a constant value of 170 throughout. In contrast, “$N$ by t-test on change score” by options 1 and 2 are different, and are plotted in Figs. 1 and 2, respectively.

Option 1: keeping the variance of the change score fixed

Fig. 1 compares sample sizes obtained under option 1 using different methods at different values of $r$. Sample size $N$ via a t-test for outcome $Y_1$ is shown by the long-dashed line, calculated using the equation in Section 4.1. Sample size $N$ via a t-test for outcome $(Y_1 - Y_0)$ is shown by the short-dashed line, calculated using the equation in Section 4.2. The value of $N$ produced by both of these t-tests is not influenced by the correlation $r$, hence remains the same at different values of $r$. In contrast, the values of $N$ for outcome $Y_1$ via ANCOVA, produced by Equation (3) in Section 4.3, heavily depend on the value of $r$; the larger the value of the correlation $r$, the smaller the resulting value of $N$. The results shown in Table 3 correspond to values of $N$ at $r = 0.7$, where the value of $N$ obtained via ANCOVA is smaller than the value of $N$ obtained via a t-test on the outcome $(Y_1 - Y_0)$. However, $N$ by ANCOVA becomes larger than $N$ by a t-test on $(Y_1 - Y_0)$ once $r$ decreases to values below approximately 0.6, as shown in Fig. 1. The value of $N$ obtained via a t-test on the outcome $Y_1$ remains the largest among the three methods at all values of $r > 0$.

Option 2: varying the variance of the change score according to $r$

Alternatively, we can allow the values of $s_{(Y_1 - Y_0)}^2$ to vary according to $r$, while keeping the values of $s_{Y_0}^2$ and $s_{Y_1}^2$ fixed in Equation (1). Fig. 2 shows the resulting sample sizes obtained by the three different methods, to be compared with Fig. 1. In Fig. 2, the resulting $N$ via ANCOVA remain the same as those shown in Fig. 1, but $N$ via a t-test for outcome $(Y_1 - Y_0)$ are different from those in Fig. 1 due to $s_{(Y_1 - Y_0)}^2$ varying with the values of $r$. Fig. 2 also provides a convenient way of assessing the assumption of equal variance required in Equation (4). If the assumption that $Y_0$ and $Y_1$ have the same variance is met, the long-dashed line in Fig. 2 (representing the value of $N$ obtained via a t-test on $Y_1$) and the short-dashed line (representing the value of $N$ obtained via a t-test on $(Y_1 - Y_0)$) will cross at $r = 0.5$. These two lines cross at approximately $r = 0.53$ in Fig. 2, indicating this assumption is only mildly violated.
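The two simulation options can be sketched together in Python. The code below is an illustration under stated assumptions, not the authors' simulation code: it uses the pooled SDs from Table 2, a target difference of 6.6, exact normal quantiles, and rounding up to a whole patient; the names `n_ttest` and `simulate` are hypothetical. Because of rounding in the inputs, the option 2 column produced this way need not match Table 4 exactly at every value of the correlation.

```python
import math
from statistics import NormalDist

DELTA = 6.6                    # target mean difference (Table 1)
SD_BASE, SD_POST = 23.1, 21.7  # pooled SDs of baseline and post scores (Table 2)
SD_CHANGE = 17.58              # pooled SD of the change score (Table 2)
Z2 = (NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)) ** 2

def n_ttest(variance):
    """Patients per arm by a t-test on an outcome with this variance."""
    return math.ceil(2 * Z2 * variance / DELTA**2)

def simulate(r):
    """Return (ANCOVA, t-test on post, t-test on change opt. 1, opt. 2)."""
    n_post = n_ttest(SD_POST**2)
    n_ancova = math.ceil((1 - r**2) * n_post)
    n_change_fixed = n_ttest(SD_CHANGE**2)  # option 1: change variance fixed
    # Option 2: change variance follows the variance sum law.
    var_change = SD_BASE**2 + SD_POST**2 - 2 * r * SD_BASE * SD_POST
    n_change_varying = n_ttest(var_change)
    return n_ancova, n_post, n_change_fixed, n_change_varying

rows = {k / 10: simulate(k / 10) for k in range(11)}
for r, row in rows.items():
    print(r, *row)
```

Printing all four columns side by side makes it easy to see where the ANCOVA column drops below the fixed change-score column as the correlation increases.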

Simplified sample size equations under assumptions

The variance sum law when assuming equal variance

Assuming $Y_0$ and $Y_1$ have the same variance $s^2$, the variance sum law (Equation (1)) can be simplified to

$$s_{(Y_1 - Y_0)}^2 = 2 (1 - r) s^2.$$

This means that when $(Y_1 - Y_0)$ is the outcome measure, its variance deflation factor is $2(1 - r)$, assuming that $Y_0$ and $Y_1$ have an equal variance $s^2$. This variance deflation factor gives us a simplified Equation (4) for sample size. Let $N$ be the sample size (i.e., the number of patients in each arm) obtained by a t-test on $Y_1$; then a t-test on $(Y_1 - Y_0)$ will require

$$2 (1 - r) N \qquad (4)$$

patients to achieve the same power, assuming equal variance of $Y_0$ and $Y_1$. Since $2(1 - r) < 1$ if $r > 0.5$, and $2(1 - r) > 1$ if $r < 0.5$, Equation (4) shows that calculating sample size using a t-test on $(Y_1 - Y_0)$ will require fewer patients than a t-test on $Y_1$ if $r > 0.5$, and more patients if $r < 0.5$. The two methods yield the same number of patients if $r = 0.5$. We emphasise that this relationship only strictly applies when $Y_0$ and $Y_1$ have equal variance $s^2$. In practice, if $s_{Y_0}^2$ and $s_{Y_1}^2$ are sufficiently similar in value, Equation (4) can still give a reasonable estimate of $s_{(Y_1 - Y_0)}^2$, and hence give a reasonable estimate of sample size. This is further illustrated by Fig. 2, where the long-dashed and short-dashed lines cross at approximately $r = 0.53$, a value close to 0.5, indicating a mild violation of the assumption of equal variance. In our example, Table 2 shows that $Y_0$ and $Y_1$ do not have equal variance, hence the above formula is not directly applicable. However, Table 2 also shows that the values of pooled $s_{Y_0}^2$ and $s_{Y_1}^2$ are quite similar, being $23.1^2$ and $21.7^2$, respectively. In practice, one can use Equation (4) to calculate $N$ assuming $s_{Y_0}^2$ and $s_{Y_1}^2$ are the same, to be compared with the $N$ derived using actual results. It turns out that if $r = 0.7$, Equation (4) will yield $N = 2 \times (1 - 0.7) \times 170 = 102$, which is quite similar to our derived $N = 112$.
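The simplified relationship and the crossing point of the two t-test curves can be checked numerically. The sketch below, in Python, assumes the pooled SDs from Table 2 and leaves the result of the simplified equation unrounded; the function names are illustrative. The crossing correlation follows from setting the change-score variance in Equation (1) equal to the post-score variance, which gives a ratio of the baseline SD to twice the post SD.

```python
def n_change_equal_variance(n_post, r):
    """Simplified per-arm sample size for a t-test on the change score,
    assuming the baseline and post scores share one variance."""
    return 2 * (1 - r) * n_post

def crossing_correlation(sd_base, sd_post):
    """Correlation at which the change-score variance from the variance
    sum law equals the post-score variance."""
    return sd_base / (2 * sd_post)

approx_n = n_change_equal_variance(170, 0.7)
r_cross = crossing_correlation(23.1, 21.7)
print(round(approx_n), round(r_cross, 2))
```

When the two SDs are equal, `crossing_correlation` returns exactly 0.5, which is the equal-variance case described in the text.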

Sample sizes when all assumptions are met

Let $N$ be the sample size by a t-test on $Y_1$. If all assumptions discussed in Section 4.3 and Section 5.1 are met, calculating sample size via ANCOVA on $Y_1$ while adjusting for $Y_0$ will require $(1 - r^2)N$ patients in each arm, whereas using a t-test on $(Y_1 - Y_0)$ will require $2(1 - r)N$ patients. Using $(1 - r)^2 \ge 0$, we have

$$(1 - r^2) N \le 2 (1 - r) N, \qquad (5)$$

where equality occurs at $r = 1$. The left hand and right hand sides of Equation (5) correspond to the sample size obtained via ANCOVA on $Y_1$ while adjusting for $Y_0$ and via a t-test on $(Y_1 - Y_0)$, respectively. In practice, we always have $r < 1$; therefore ANCOVA on $Y_1$ adjusting for $Y_0$ always yields a smaller sample size than would be obtained using a t-test on $(Y_1 - Y_0)$, if all assumptions hold. Fig. 2 in Section 5.2 also illustrates Equation (5), where the short-dashed line showing $N$ by t-test on $(Y_1 - Y_0)$ is always above the solid line showing $N$ by ANCOVA on $Y_1$ adjusting for $Y_0$, except at $r = 1$.
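The comparison between the two deflation factors can be made explicit in one line of algebra. The following worked step is a sketch, writing $N$ for the per-arm sample size by a t-test on the post score and $r$ for the correlation between baseline and post scores:

```latex
\begin{align*}
2(1 - r)N - (1 - r^2)N
  &= \left(2 - 2r - 1 + r^2\right) N \\
  &= \left(1 - 2r + r^2\right) N \\
  &= (1 - r)^2 N \;\ge\; 0 .
\end{align*}
```

The difference vanishes only when $r = 1$, so for any correlation below 1 the ANCOVA sample size is strictly the smaller of the two, provided the assumptions behind both deflation factors hold.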

Discussion

The implications of correlation coefficient

When designing a new RCT, one needs to consider whether the duration of the planned trial will differ from that of previous trials. The correlation between $Y_0$ and $Y_1$ is likely to decrease (i.e., a smaller $r$) for an increased trial period, and vice versa. In the example used in this paper, the derived correlation coefficient is similar in both treatment arms, being approximately 0.7. If the correlation between $Y_0$ and $Y_1$ in the two treatment arms is different, one will need to consider the interaction between the treatment arm and baseline measure.

If “mean change (SE)” is not reported

If “mean change (SE)” is not reported for a study, we can calculate a range of potential variances of $(Y_1 - Y_0)$ by setting a plausible range of values of $r$, using the variance sum law, as shown in Section 3.3. The simulation method shown in Section 5 can be used to compare sample sizes obtained using different methods at different values of $r$, providing a sense of the required sample size in the trial design stage.
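As an illustration of this fallback, the sketch below (Python) scans a range of correlations, converts each into a change-score SD via the variance sum law, and reports the corresponding per-arm sample size. The range of correlations is an arbitrary assumption chosen for illustration, the SDs are the pooled values from Table 2, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def change_score_sd(sd_base, sd_post, r):
    """SD of the change score implied by the variance sum law."""
    return math.sqrt(sd_base**2 + sd_post**2 - 2 * r * sd_base * sd_post)

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Approximate patients per arm, rounded up (assumed convention)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * z**2 * sd**2 / delta**2)

plausible_r = [0.5, 0.6, 0.7, 0.8]  # assumed range, for illustration only
table = [(r, change_score_sd(23.1, 21.7, r)) for r in plausible_r]
for r, sd in table:
    print(f"r={r:.1f}  SD of change={sd:.2f}  N per arm={n_per_arm(6.6, sd)}")
```

A designer might then plan for the sample size at the lowest plausible correlation, since that gives the largest change-score variance in the range.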

Future work

In this paper we have used the change score $(Y_1 - Y_0)$ as a choice of outcome measure without questioning its validity. In fact, one should be cautious of using change score as the outcome measure, due to the well-known statistical phenomenon of “regression to the mean”. This will be investigated in a future paper.

Declarations

Ethics approval and consent to participate

N/A. Not required.

Consent for publication

Yes.

Availability of data and material

N/A. Not required.

Competing interests

None.

Funding

N/A.

Authors' contributions

LC conceived the research idea, and led the writing of the paper. JB and DC also contributed to writing the paper.
