
Sample Size Calculations for Comparing Groups with Binary Outcomes.

Xunan Zhang, Jiangnan Lyu, Justin Tu, Jinyuan Liu, Xiang Lu.

Abstract

Sample size is a critical parameter for clinical studies. However, to many biomedical and psychosocial investigators, power and sample size analysis seems like a magic trick of statisticians. In this paper, we continue to discuss power and sample size calculations by focusing on binary outcomes. We again emphasize the importance of close interactions between investigators and biostatisticians in setting up hypotheses and carrying out power analyses.


Keywords:  binary outcomes; sample size

Year:  2017        PMID: 29276357      PMCID: PMC5738522          DOI: 10.11919/j.issn.1002-0829.217132

Source DB:  PubMed          Journal:  Shanghai Arch Psychiatry        ISSN: 1002-0829


1. Introduction

Sample size plays a critical role in clinical research studies. It provides information for optimal use of available resources to detect treatment differences. In the last article, we discussed sample size calculations for comparing means of continuous outcomes between two groups. In this report, we continue our discussion of this topic by extending our earlier considerations to binary outcomes. Sample size is determined through power analysis. Unlike data analysis, power analysis is carried out at the design stage of a clinical study, before any data are collected. Because no data are available during power analysis, study investigators need to provide information about treatment differences, which not only allows biostatisticians to proceed with power analysis, but also makes the power analysis results meaningful and reliable. Thus, power analysis is not a "trick" played by the statistician, but rather an integrative process involving close interactions between study investigators and biostatisticians. Note that editors of some medical journals sometimes ask the authors of a manuscript to provide power analysis results for their study to support its findings. Such post-hoc power analysis generally makes no logical sense. Most research studies are conducted on a random sample from a study population of interest; once the data are collected, the random component of the study disappears and results from power analysis become meaningless. Before the study begins, the study sample is unknown and the outcomes of interest are random.
Power analysis shows the probability, or likelihood, that a test statistic (a function of the data) will detect a hypothesized difference between two populations, such as the t statistic for comparing mean blood pressure levels between a hypertensive and a normal population. Once the study is complete, we observe a sample, i.e., a particular group of subjects among the many such groups in the study population, and the data from this group of subjects become non-random. In this article, we focus on comparing proportions of binary outcomes between two groups. As in our previous article on power analysis for comparing two group means for continuous outcomes, we consider both independent and paired groups. We begin our discussion with a brief overview of the concept of power analysis within the context of one group. Although most studies involve comparing two or more treatment groups, the simplified setting of one group helps better illustrate the basic steps of sample size calculations.

2. Sample Size for One Group

Consider a binary outcome X with values 0 and 1. In most clinical studies, the value X = 1 generally denotes the occurrence of a disease or exposure of interest, such as depression or trauma. A binary outcome is generally modeled by the Bernoulli distribution with probability p of the occurrence of 1 in the outcome X, denoted by X ~ Bern(p). Note that unlike the normal distribution for a continuous outcome, the Bernoulli distribution has only one parameter. This is because, unlike the normal distribution, the variance of the Bernoulli, p(1 − p), is completely determined by p. As with the normal distribution, the mean is a parameter: p is also the mean of X, i.e., the proportion of 1's. For example, if X indicates the presence (X = 1) or absence (X = 0) of major depression in an individual from a population of interest, then p is the prevalence of major depression in that population. Consider testing the hypothesis

H0: p = b  vs.  H1: p ≠ b,  (1)

where b is a known constant, and H0 and H1 are known as the null and alternative hypotheses, respectively. The above is known as a two-sided hypothesis, as no direction of effect is specified in the alternative H1: p ≠ b. As two-sided alternatives are the most popular in clinical research, we only focus on such hypotheses in what follows unless stated otherwise. Let X1, X2, …, Xn be a random sample from X ~ Bern(p) and X̄ = (1/n) Σj Xj be the sample mean, which, in the current context of a binary outcome, is the percent, or proportion, of 1's in the sample. If X indicates the presence or absence of major depression, X̄ is the percent with major depression in the sample. If the n subjects are randomly sampled, X̄ is an estimate of the prevalence p of major depression in the study population (for a non-random sample, X̄ is still an estimate of a prevalence, but not necessarily that of the general population, because of potential selection bias). If the null H0 is true, X̄ has a high probability of being close to b.
However, because X̄ is random, it is still possible for X̄ to be distant from b, although the probability of this is small, especially for large n. The type I error α, a quantity introduced to indicate such an error rate, is the probability that X̄ falls far from b when H0 is true. This error rate is typically set at α = 0.05 for most studies and at α = 0.01 for studies with large sample sizes. Given α, power is the probability that H0 is rejected when it is false. The decision to reject the null is based on a statistic that captures the difference between X̄ and b, together with the distribution of that distance measure. The most popular and well-known measure is the z-score:

z = (X̄ − b) / √(b(1 − b)/n),  (2)

which approximately follows the standard normal distribution for large sample size n. Thus, we reject H0 at type I error α if |z| > zα/2, where zα/2 denotes the upper α/2 quantile of the standard normal distribution, i.e., Φ(zα/2) = 1 − α/2, with Φ denoting the cumulative distribution function of the standard normal distribution. For example, for α = 0.05, zα/2 = 1.96. If H0: p = b is true, the probability of rejecting H0, and therefore committing a type I error, is given by:

Pr(|z| > zα/2 | H0) = α.  (3)

Note that if we compare the z-score in (2) with the z-score for the continuous outcome in the previous article, we see that they only differ in the denominator. This is because the standard deviation of the Bernoulli distribution under H0 is √(b(1 − b)), which is a function of b, rather than a separate parameter as for continuous outcomes. In clinical studies, we are really interested in the opposite: H0 is set up as a straw man to help formulate and test our hypotheses against it. Statistical power allows us to quantify the likelihood of rejecting H0 in favor of the alternative H1, which supports a different proportion d ≠ b, i.e.,

H1: p = d, with d ≠ b.  (4)

Without loss of generality, we assume d > b.
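As a concrete sketch of the decision rule based on the z-score in (2), the code below computes the z-score and a two-sided p-value for a hypothetical sample: n = 100 subjects with 40 events, testing b = 0.3. The data values are made up purely for illustration.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def one_sample_z(successes, n, b):
    """z-score in (2) for H0: p = b, with its two-sided p-value."""
    p_hat = successes / n
    z = (p_hat - b) / sqrt(b * (1 - b) / n)
    p_value = 2.0 * (1.0 - norm_cdf(abs(z)))
    return z, p_value

# Hypothetical data: 40 events out of n = 100, testing H0: p = 0.3
z, p = one_sample_z(40, 100, b=0.3)
# z is about 2.18, which exceeds z_{alpha/2} = 1.96, so H0 is rejected at alpha = 0.05
```

Here the observed proportion 0.40 lies more than 1.96 null standard errors above b = 0.3, so the two-sided test rejects H0 at the 5% level.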
For power analysis, we must also specify a known value d for p a priori, in addition to the value b under H0, in order to quantify our ability to reject H0 in favor of H1. Such explicit specification is not required for data analysis after the data are observed. Given a type I error α and a specific d in H1, we then calculate power, the probability that (the absolute value of) the standardized difference in (2) exceeds the threshold zα/2, i.e.,

Power(n, α, H0, H1) = Pr(|z| > zα/2 | H1).  (5)

Comparing the above with (3), we see that the only difference in (5) is the change of the conditioning from H0 to H1. The probability in (5) is again readily evaluated to yield:

Power(n, α, H0, H1) ≈ Φ( (√n(d − b) − zα/2 √(b(1 − b))) / √(d(1 − d)) ) + Φ( (−√n(d − b) − zα/2 √(b(1 − b))) / √(d(1 − d)) ).  (6)

Thus, as in the case of continuous outcomes, power Power(n, α, H0, H1) is a function of the sample size n, the type I error α, and the values of the Bernoulli parameter p specified in the null H0 and alternative H1. Once α is selected, power is only a function of the sample size n and the values b and d specified in the null and alternative hypotheses. To determine the sample size n, we must specify b and d to reflect treatment effects, which are study specific and require investigators' knowledge. As power is quite sensitive to these parameters, careful consideration and justification of these quantities is critical for the calculated sample size to be meaningful, reliable and informative. Thus, power analysis is not merely an algebraic and computational exercise by biostatisticians, but an integrative process involving critical input from content researchers. Power increases as n grows and approaches 1 as n grows unbounded. Thus, by increasing the sample size, we gain more power to reject the null, i.e., to ascertain the treatment effect.
However, we must be mindful in selecting an appropriate power level, as arbitrarily increasing the sample size not only wastes precious manpower and resources, but also increases the likelihood of failed studies due to logistic constraints, and of diminishing interest and return due to rapid scientific progress and changing technologies. Power is generally set at some reasonable level such as 0.80. Also, a small treatment effect may have little clinical relevance. Thus, it is critical that we specify treatment effects that correspond to clinically meaningful differences, which again requires critical input from investigators specializing in the field of study. Given a type I error α, a pre-specified power, often denoted 1 − β, and H0 and H1, the sample size is the smallest n such that the test has the given power to reject H0 under H1:

Power(n, α, H0, H1) ≥ 1 − β.  (7)

Although it is generally difficult to find an analytical formula for the smallest n satisfying (7), such an n is readily obtained using statistical packages. Note that power in the literature is typically denoted by 1 − β, where β, known as the type II error rate, denotes the probability that the null H0 is accepted when in fact it is false. For continuous outcomes, the difference µ1 − µ0 between µ1 under H1 and µ0 under H0 is generally expressed as an "effect size" to remove its dependence on the scale of X:

(µ1 − µ0) / σ,  (8)

where σ denotes the common standard deviation. For binary outcomes, such standardization is not needed, as the variance of X is completely determined by the parameter p of the Bernoulli distribution. Thus, for binary outcomes, the difference p1 − p0 between p1 under H1 and p0 under H0 is often interpreted as the effect size. Note that for large sample size n, the z-score in (2) approximately follows the standard normal distribution, which provides the basis for evaluating power using the expression in (6) when testing the hypothesis in (4).
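The power expression in (6) and the search for the smallest n satisfying (7) can be sketched directly; the values b = 0.3 and d = 0.4 below are hypothetical and only serve to illustrate the calculation under the normal approximation.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

Z_ALPHA2 = 1.959964  # upper 0.025 quantile of the standard normal (two-sided 5% alpha)

def power_one_group(n, b, d, z_a=Z_ALPHA2):
    """Approximate power in (6) for H0: p = b vs. H1: p = d (two-sided test)."""
    shift = sqrt(n) * (d - b)
    s0 = sqrt(b * (1 - b))  # standard deviation under the null
    s1 = sqrt(d * (1 - d))  # standard deviation under the alternative
    return (norm_cdf((shift - z_a * s0) / s1)
            + norm_cdf((-shift - z_a * s0) / s1))

def sample_size_one_group(b, d, target=0.80):
    """Smallest n with power >= target, found by direct search as in (7)."""
    n = 1
    while power_one_group(n, b, d) < target:
        n += 1
    return n

# Hypothetical example: b = 0.3 under H0, d = 0.4 under H1, 80% power
n_required = sample_size_one_group(0.3, 0.4)
```

Under this approximation, about n = 172 subjects are needed for 80% power in this hypothetical setting; at n = 100, power is only around 0.58, illustrating how quickly power erodes with smaller samples.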
For moderate sample sizes, the normal approximation can still be used if np ≥ 5 and n(1 − p) ≥ 5, where p is either p0 or p1. If these conditions are not met, the z-score may deviate significantly from the normal distribution and the expression in (6) no longer provides reliable power estimates, so different methods must be used. For example, in exact inference, we use the binomial distribution of the count of 1's to derive the power function. Exact methods work for both small and large sample sizes. However, for large sample sizes, evaluating the power function can take a long time, even with modern computing power. Thus, exact methods are usually used only in cases where p or n or both are small.
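A sketch of the exact approach: the rejection region is built from the Binomial(n, b) distribution under the null, and power is the probability of that region under Binomial(n, d). The "minimum-likelihood" two-sided p-value used here is one common convention (not necessarily the one used in the original paper), and the numbers n = 50, b = 0.3, d = 0.5 are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_two_sided_pvalue(k, n, b):
    """Two-sided exact p-value: sum the null probabilities of all outcomes
    no more likely than the observed count k (minimum-likelihood convention)."""
    pk = binom_pmf(k, n, b)
    return sum(binom_pmf(j, n, b) for j in range(n + 1)
               if binom_pmf(j, n, b) <= pk + 1e-12)

def exact_power(n, b, d, alpha=0.05):
    """Power of the exact two-sided test of H0: p = b when the truth is p = d."""
    return sum(binom_pmf(k, n, d) for k in range(n + 1)
               if exact_two_sided_pvalue(k, n, b) <= alpha)

# Hypothetical small study: n = 50, testing b = 0.3 when the true rate is d = 0.5
pw = exact_power(50, b=0.3, d=0.5)
```

Because the binomial distribution is discrete, the exact test's actual size is at most α, which is why exact power is typically a bit below the normal-approximation power at small n.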

3. Sample Size for Two Independent Groups

Now consider two independent samples and let Xij (i = 0, 1; j = 1, …, ni) denote the random outcomes from the two samples. We assume that both groups' outcomes follow Bernoulli distributions, Xij ~ Bern(pi), with parameters pi (i = 0, 1). Consider testing the hypothesis:

H0: p1 − p0 = 0  vs.  H1: p1 − p0 = δ ≠ 0.  (9)

Let X̄i denote the sample mean of the ith group (i = 0, 1). As in the one-sample case, the difference between the two sample means, X̄1 − X̄0, should be close to 0 if H0 is true. Again, because X̄1 and X̄0 are random, it is still possible for X̄1 − X̄0 to be very different from 0, although such probabilities are small, especially for large n. As noted earlier, the type I error α is set at 0.05 or 0.01 depending on the sample size. In most clinical trials, the groups have equal sample sizes, i.e., n0 = n1, although some studies may have a larger sample size for one of the groups. In our discussion below, we assume equal sample sizes, so that n0 = n1 = n. The difference between the sample means is standardized by the z-score

z = (X̄1 − X̄0) / √( X̄1(1 − X̄1)/n + X̄0(1 − X̄0)/n ),  (10)

which approximately follows the standard normal distribution for large n. If H0: p1 − p0 = 0 is true, the probability of rejecting H0, and therefore committing a type I error, is:

Pr(|z| > zα/2 | H0) = α,

where zα/2 is the upper α/2 quantile of the standard normal distribution. For power analysis, our goal is to reject the null H0 in favor of the alternative H1. Without loss of generality, we assume δ > 0. Given a significance level α, H0 and H1, we then calculate power, the probability that (the absolute value of) the standardized difference in (10) exceeds the threshold zα/2, i.e.,

Power(n, α, H0, H1) = Pr(|z| > zα/2 | H1).  (11)

Note that as the power function depends on the parameters of the Bernoulli distributions for both groups, p1 and p0, not just the difference δ = p1 − p0, we must specify both parameters, in addition to the difference δ, to compute power. This step is similar to specifying the standard deviations, σ1 and σ0, in addition to the mean difference, µ1 − µ0, when calculating power for continuous outcomes as discussed in the previous article. As the standard deviation of X̄1 (X̄0) is determined by p1 (p0), the parameter p1 (p0) plays the role of specifying the standard deviation for X̄1 (X̄0). Thus, we recast the hypotheses in (9) equivalently by specifying both parameters under the alternative:

H0: p1 − p0 = 0  vs.  H1: p1 and p0 take specified values, with δ = p1 − p0 > 0.  (12)

Given a type I error α, a power 1 − β, and H0 and H1, we can again readily determine numerically the smallest n such that the test has the given power to reject the null H0 under H1, i.e., Power(n, α, H0, H1) ≥ 1 − β.
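The two-sample power function described above can be evaluated under the normal approximation. The sketch below uses a pooled variance under H0 and an unpooled variance under H1, which is one common convention (statistical packages may differ slightly); the values p1 = 0.5, p0 = 0.3 and n = 100 per group are hypothetical.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

Z_ALPHA2 = 1.959964  # upper 0.025 quantile of the standard normal

def power_two_proportions(n, p1, p0, z_a=Z_ALPHA2):
    """Approximate power of the two-sided z-test of H0: p1 - p0 = 0
    with n subjects per group (equal allocation assumed)."""
    delta = p1 - p0
    p_bar = (p1 + p0) / 2.0
    se_null = sqrt(2.0 * p_bar * (1 - p_bar) / n)       # pooled SE under H0
    se_alt = sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / n)  # SE under H1
    return (norm_cdf((delta - z_a * se_null) / se_alt)
            + norm_cdf((-delta - z_a * se_null) / se_alt))

# Hypothetical example: p1 = 0.5 vs. p0 = 0.3 with n = 100 per group
pw = power_two_proportions(100, p1=0.5, p0=0.3)
```

With these hypothetical inputs the approximate power is about 0.83, and as the text notes, the value depends on both p1 and p0, not just their difference.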

4. Sample Size for Paired Groups

In the last section, the two groups were assumed independent. This assumption is satisfied when the groups are formed by different subjects, such as male vs. female or depressed vs. healthy control subjects. In many studies, we may also be interested in changes before and after an intervention in the same individuals. For example, suppose we are interested in the effect of a new antidepressant medication. We may give the drug to a group of depressed patients and measure their depression severity before and after taking the medication. Unlike groups formed by different subjects, the control (before medication) and intervention (after medication) groups are formed by the same individuals, and outcomes generally become dependent between the two groups. For example, patients with higher depression severity before the medication likely remain higher after the medication. As a result, the power function for testing two independent groups discussed earlier no longer applies to such dependent, "paired" groups. Consider a study with n pairs of observations and let (X0j, X1j) denote the two paired outcomes of the jth pair (1 ≤ j ≤ n). For each pair, the treatment difference is Dj = X1j − X0j. If the difference Dj has mean δ = 0, then there is no treatment effect. Thus, we are interested in testing the hypothesis:

H0: δ = 0  vs.  H1: δ ≠ 0.  (13)

For continuous outcomes X1j and X0j, the difference Dj = X1j − X0j is also continuous. Thus, (13) becomes a hypothesis about whether the mean of Dj is 0, and sample size calculations can be carried out using the power function for the one-group case, as discussed in the previous article on power analysis for continuous outcomes. This approach, however, does not work in the current context of binary outcomes, since the difference Dj = X1j − X0j may take on the value −1 in addition to 0 and 1 and thus no longer follows a Bernoulli distribution. The most common approach for paired groups with binary outcomes is McNemar's test.
By displaying the n paired outcomes in a 2 × 2 contingency table (Table 1), where the rows (columns) denote the two values of X0 (X1) and nij denotes the count for the cell defined by the ith row and jth column (i, j = 1, 2), we can, for data analysis, estimate the mean p0 of X0, the mean p1 of X1, and the mean δ = p1 − p0 of Dj from the cell counts:

p̂0 = n2+/n,  p̂1 = n+2/n,  δ̂ = p̂1 − p̂0 = (n12 − n21)/n,

where p̂0, p̂1 and δ̂ denote the sample versions of the (population) parameters p0, p1 and δ, respectively. For large n, McNemar's test statistic

z = (n12 − n21) / √(n12 + n21)

approximately follows the standard normal distribution under the null H0 in (13). It is interesting to note that |n12 − n21| in the numerator of McNemar's statistic involves only the discordant pairs, and large values of this statistic indicate evidence against the null H0 in favor of the alternative H1. Thus, only the discordant pairs contribute information to testing the null hypothesis, which makes perfect sense, since concordant pairs indicate no change before and after the intervention. Note that McNemar's test statistic is also often expressed in the form of an approximate chi-square statistic:

χ² = (n12 − n21)² / (n12 + n21) ~ χ²(1),

where χ²(1) denotes the chi-square distribution with one degree of freedom. If n or any of the cell counts nij is small, the normal (or chi-square) distribution may approximate the sampling distribution of McNemar's statistic poorly, and other methods may be used to compute p-values. For example, in exact inference, we use an alternative form of McNemar's statistic, whose exact sampling distribution is determined from the distribution of the discordant counts, to compute p-values for testing the null in (13). Both the normal (or chi-square) and exact methods can be used to derive power functions for performing power analysis, with the exact method providing more reliable estimates for relatively small sample sizes.
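The statistic needs only the two discordant cell counts n12 and n21; the sketch below computes the z and chi-square forms, with hypothetical counts n12 = 25 and n21 = 10.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def mcnemar(n12, n21):
    """McNemar's test from the two discordant cell counts of the 2x2 table."""
    z = (n12 - n21) / sqrt(n12 + n21)
    chi2 = z * z  # equivalent chi-square form with 1 degree of freedom
    p_value = 2.0 * (1.0 - norm_cdf(abs(z)))
    return z, chi2, p_value

# Hypothetical discordant counts: 25 pairs changed 0 -> 1, 10 pairs changed 1 -> 0
z, chi2, p = mcnemar(25, 10)
```

With these counts the chi-square statistic is 225/35 ≈ 6.43, and the two-sided p-value falls below 0.05; the concordant counts n11 and n22 never enter the statistic.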

5. Illustrations

In this section, we illustrate power and sample size calculations for comparing two independent and two paired groups. We continue to use G*Power in our examples, as it is free and easy to use. In all cases, we set power at 80% and the two-sided type I error at α = 0.05.

Example 1. A San Diego-based biopharmaceutical company plans to conduct a study to test the efficacy of an experimental Ebola drug. To determine the sample size, the investigators use their pilot data and obtain the following estimated death rates: 0.22 for the company's new drug and 0.38 for standard care. The problem is to estimate the sample size for the study to detect this difference in death rates between the two treatment conditions. Let p1 (p0) denote the percent of deaths for the new drug (standard care). We can express the corresponding statistical hypothesis as:

H0: p1 − p0 = 0  vs.  H1: p1 − p0 ≠ 0.

Since subjects will be randomized to either the new drug or standard care, the study sample forms two independent groups. For convenience, we assume that the two treatment groups have the same number of subjects, i.e., n0 = n1. To calculate the sample size using the G*Power package, we enter the following information:

Test family > z tests
Statistical test > Proportions: Difference between two independent proportions
Type of power analysis > A priori: Compute required sample size - given α, power and effect size
Tails > Two
Proportion p2 > 0.22
Proportion p1 > 0.38
α err prob > 0.05
Power (1 - β err prob) > 0.80
Allocation ratio N2/N1 > 1

By clicking on "Calculate", we obtain a sample size of 128 for each group, or a total of 256 for both groups (see Figure 1).
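The 128-per-group result can be cross-checked against a standard closed-form approximation (pooled null variance, unpooled alternative variance). This is one common textbook formula and may differ slightly from G*Power's internal computation, but with these inputs it yields the same answer.

```python
from math import sqrt, ceil

Z_ALPHA2 = 1.959964  # upper 0.025 normal quantile (two-sided 5% alpha)
Z_BETA = 0.841621    # upper 0.20 normal quantile (80% power)

def n_per_group(p1, p0, z_a=Z_ALPHA2, z_b=Z_BETA):
    """Closed-form per-group sample size for a two-sided test of two
    independent proportions, pooled variance under H0, unpooled under H1."""
    p_bar = (p1 + p0) / 2.0
    num = (z_a * sqrt(2.0 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p0 * (1 - p0))) ** 2
    return ceil(num / (p1 - p0) ** 2)

# Death rates from the example: 0.38 for standard care vs. 0.22 for the new drug
n = n_per_group(0.38, 0.22)  # 128 per group, matching the G*Power result
```

The raw value is about 127.6, which rounds up to 128 per group, or 256 in total.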
Figure 1.

Screenshot of G*Power for calculating sample size for comparing two independent proportions using the asymptotic method for Example 1

G*Power also offers an exact method to calculate the sample size. In this case, we enter the following information:

Test family > Exact
Statistical test > Proportions: Inequality, two independent groups (Fisher's exact test)
Type of power analysis > A priori: Compute required sample size - given α, power and effect size
Tails > Two
Proportion p2 > 0.22
Proportion p1 > 0.38
α err prob > 0.05
Power (1 - β err prob) > 0.80
Allocation ratio N2/N1 > 1

By clicking on "Calculate", we obtain a sample size of 139 for each group, or a total of 278 for both groups (see Figure 2). The estimated sample size using the exact method is slightly higher than that from the asymptotic method based on the standard normal distribution. Here the sample size is moderate, and the discrepancy between the asymptotic and exact methods likely reflects the limited sample size. In general, if exact methods are used for analysis, we should go with the sample size estimated from such methods. Fortunately, differences between asymptotic and exact methods diminish as the sample size increases, so such differences generally do not have any major impact on real studies.
Figure 2.

Screenshot of G*Power for calculating sample size for comparing two independent proportions using the exact method for Example 1

Example 2. A research team is interested in conducting research on sexual behaviors in the Botswana Defense Force. The team has learned from other similar studies that self-reported sexual behavior based on a daily diary is more accurate than a retrospective survey. They estimate that about 50% of subjects would report having sex with a spouse within the last two weeks by daily diary, while only 20% would report such events by retrospective recall. Before conducting the survey, the research team wants to confirm this discrepancy to justify their use of a daily diary for their study. Let p1 (p0) denote the percent reporting sex in a daily diary (retrospective recall). Then the team's interest can be stated as the hypothesis:

H0: p1 − p0 = 0  vs.  H1: p1 − p0 ≠ 0.

Since both the daily diary and the retrospective recall are completed by the same subject, the outcomes from the diary and the retrospective recall are not independent. Thus, we use McNemar's test for comparing sexual behaviors reported by the two assessment strategies and estimate the sample size using the method for paired groups. To use G*Power, we need to enter the odds ratio and the proportion of discordant pairs under H1. To compute these quantities, it is helpful to create a 2 × 2 table indicating both the marginal probabilities p1 (p0) (specified in the hypothesis) and the joint probabilities (calculated from the marginal probabilities); see Table 2.
Both the odds ratio and the proportion of discordant pairs are readily computed from this table; here the odds ratio is 2.333 and the proportion of discordant pairs is 0.18. We then enter these quantities, along with the other required information, into G*Power:

Test family > Exact
Statistical test > Proportions: Inequality, two dependent groups (McNemar)
Type of power analysis > A priori: Compute required sample size – given α, power and effect size
Tails > Two
Odds ratio > 2.333
α err prob > 0.05
Power (1 - β err prob) > 0.80
Prop discordant pairs > 0.18

By clicking on "Calculate", we obtain a sample of 273 subjects to detect the hypothesized difference in reported sexual activity between the daily diary and retrospective recall (see Figure 3).
Figure 3.

Screenshot of G*Power for calculating sample size for comparing two paired proportions using the asymptotic method for Example 2

6. Conclusion

Sample size estimation is an essential component of planning clinical research studies. It provides critical information for assessing the feasibility of a planned study. For power analysis to be informative and useful, it requires reliable information on effect size, which can only be provided by biomedical and psychosocial investigators specializing in the field of the study. Thus, although power and sample size analysis relies on solid statistical theory, efficient computational methods and modern computing power, sample size estimates obtained from state-of-the-art methods and cutting-edge computing power are of little use without input from scientific investigators.
Death rate for new drug: 0.22
Death rate for standard care: 0.38
Table 1.

A contingency table for joint distribution of paired binary outcomes

                  X1 = 0    X1 = 1    Marginal total
X0 = 0            n11       n12       n1+
X0 = 1            n21       n22       n2+
Marginal total    n+1       n+2       n
Table 2.

Marginal and joint cell probabilities for the marginal and joint distribution of paired binary outcomes.

                        X1 = 0    X1 = 1    Marginal probability
X0 = 0                  p11       p12       1 − p0
X0 = 1                  p21       p22       p0
Marginal probability    1 − p1    p1        1
