Literature DB >> 34425632

The statistical properties of RCTs and a proposal for shrinkage.

Erik van Zwet¹, Simon Schwab^2,3, Stephen Senn⁴.

Abstract

We abstract the concept of a randomized controlled trial as a triple ( β , b , s ) , where β is the primary efficacy parameter, b the estimate, and s the standard error ( s > 0 ). If the parameter β is either a difference of means, a log odds ratio or a log hazard ratio, then it is reasonable to assume that b is unbiased and normally distributed. This then allows us to estimate the joint distribution of the z-value z = b / s and the signal-to-noise ratio SNR = β / s from a sample of pairs ( b i , s i ) . We have collected 23 551 such pairs from the Cochrane database. We note that there are many statistical quantities that depend on ( β , b , s ) only through the pair ( z , SNR ) . We start by determining the estimated distribution of the achieved power. In particular, we estimate the median achieved power to be only 13%. We also consider the exaggeration ratio which is the factor by which the magnitude of β is overestimated. We find that if the estimate is just significant at the 5% level, we would expect it to overestimate the true effect by a factor of 1.7. This exaggeration is sometimes referred to as the winner's curse and it is undoubtedly to a considerable extent responsible for disappointing replication results. For this reason, we believe it is important to shrink the unbiased estimator, and we propose a method for doing so. We show that our shrinkage estimator successfully addresses the exaggeration. As an example, we re-analyze the ANDROMEDA-SHOCK trial.

Entities: Chemical

Keywords: Cochrane review; achieved power; exaggeration; randomized controlled trial; type M error

Mesh：

Year: 2021 PMID： 34425632 PMCID： PMC9290572 DOI： 10.1002/sim.9173

Source DB: PubMed Journal: Stat Med ISSN： 0277-6715 Impact factor: 2.497

INTRODUCTION

It is nearly three quarters of a century since what is generally regarded as the first modern randomized clinical trial, the UK Medical Research Council study of the effectiveness of streptomycin in tuberculosis. Since then, tens of thousands of randomized controlled trials (RCT) have been conducted. The purpose of this article is to study this wealth of information, and to try to learn from it. We have collected the results of more than 20 000 RCTs from the Cochrane Database of Systematic Reviews (CDSR), which is the leading journal and database for systematic reviews in health care. These data allow us to determine the broad statistical properties of RCTs. In particular, we are able to estimate the distribution of the achieved power across all RCTs. We find that the achieved power is often quite low. Low statistical power has been noticed before in specific domains of biomedical research, see for instance. , The fact that achieved power is typically low does not imply that the usual sample size calculations aiming for 80% or 90% power are wrong. The goal of such calculations is to guarantee high power against a particular alternative that is considered to be of clinical interest. The fact that high power is often not achieved is merely an indication that treatments often do not provide the benefit that was hoped for. Low power has important implications for the interpretation of the results of a given trial. It is well known that conditional on statistical significance, the estimate of the effect size is positively biased. This bias is sometimes called the “winner's curse”, and it is especially large when the power is low. , , With the Cochrane data we can quantify the bias quite precisely. To be more specific, we will represent an RCT by a triple where is the primary effect parameter and b is an unbiased, normally distributed estimator of , with standard error s. The main idea of the article is that we can use the observed pairs from the RCTs in the Cochrane database to estimate the joint distribution of the z‐value and the signal‐to‐noise ratio . We can use this joint distribution to calculate the distribution of many interesting statistical quantities, conditional on the observed z‐value. For example, we can compute the conditional coverage of the usual 95% confidence interval, given z. Or we can compute the conditional median of the exaggeration , given z. We realize that some readers may consider the very notion of a probability distribution for the 's to be inherently Bayesian. However, non‐Bayesians may wish to think of this distribution as just an empirical model for the true effects across all trials, and in order to understand the variation of estimates in relation to putative parameter values invoking a (joint) probability model is a powerful tool. Applying our method, we find considerable overestimation of the effect and under‐coverage of the confidence interval when the z‐value exceeds 1.5. These are serious issues which are undoubtedly a part of the explanation for the phenomenon of poor replication. , , , In Section 4 we note that the joint distribution of the SNR and the z‐value allows us to compute the conditional expectation . Since , this immediately suggest as a (shrinkage) estimator for . We find that this new estimator almost entirely eliminates the problem of overestimation. In summary, the purpose of the article is 2‐fold. First, we analyze some key statistical properties of the population of RCTs in the Cochrane database, such as the achieved power, the exaggeration, and the coverage. Second, we propose a novel shrinkage estimator to address the exaggeration. We conclude this article with a short discussion.

STATISTICAL ANALYSIS OF A COLLECTION OF RCTS

In this article, we abstract the concept of a RCT as a triple , where is the primary efficacy parameter, b the estimate and s the standard error (). The parameter is either a difference of means, a log odds ratio or a log hazard ratio. We will ignore small sample issues by assuming that b is a normally distributed, unbiased estimator of with known standard error s. In other words, we will make the following assumption. Conditionally on and s, b is normally distributed with mean and SD s. While this assumption may appear to be overly simplistic, we emphasize that inference based on Wald type confidence intervals and associated P‐values is very common. When the sample size is not too small, say at least 60, this is quite appropriate. Of course, we never observe the true effect but Assumption 1 implies that we can use observed pairs () to estimate the joint distribution of the z‐value and the signal‐to‐noise ratio . This works as follows. We start by estimating the marginal distribution of the z‐value directly from the observed sample . By Assumption 1, the z‐value is the sum of the SNR and independent standard normal “noise”. This means that the distribution of the z‐value is the convolution of the distribution of the SNR and the standard normal distribution. Therefore, we can obtain the marginal distribution of the SNR by deconvolution of the estimated density of the z‐value and the standard normal density. Now we have the marginal distributions of both the z‐value and the SNR. Assumption 1 tells us that the conditional distribution of the z‐value given the SNR is normal with mean SNR and SD 1. Putting it all together, we have the joint distribution of the z‐value and the SNR. We start with the estimation of the marginal distribution of the z‐value. We have obtained pairs from 23 551 RCTs from the Cochrane Database of Systematic Reviews (CDSR). These derived data are publicly available at Reference 9. Each pair belongs to a different study, and we have tried to obtain the estimate and standard error of the first or primary effect. The R code to generate the data and additional details about the data collection are provided in Reference 10. Briefly, for each systematic review the data were downloaded from the Cochrane Library as an XML file, parsed, and aggregated in a large database in R. The database included the sample sizes, mean and SD (for continuous outcomes) or the number of events (for dichotomous outcomes) for both the intervention and control arm. Intervention effects and standard errors were estimated using mean differences for continuous outcomes, and odds ratios for dichotomous outcomes. In order to identify which studies were RCTs, parsing of all the study characteristics tables on the Cochrane Library website was required and the database was extended with this information. The histogram of the observed z‐values is shown in the top panel Figure 1. We see that it is nearly symmetric, but not quite. This (near) symmetry was also observed by Djulbegovic et al who conducted a large meta‐analysis of 860 published and unpublished phase III RCTs. We prefer not to use the information about the sign of the z‐values, for two reasons. First, using this information implies that we would treat the active treatment differently from control in our inferences. In most settings, that would not be acceptable. Second, many of the results that we present here depend on the z‐value only through its absolute value.

FIGURE 1

Top panel: The histogram of the observed z‐values together with our fit based on a mixture of 4 zero‐mean normal distributions. Bottom panel: The symmetrized histogram together with the same fit

Top panel: The histogram of the observed z‐values together with our fit based on a mixture of 4 zero‐mean normal distributions. Bottom panel: The symmetrized histogram together with the same fit Disregarding the sign, the symmetrized histogram of the z‐values is shown in the bottom panel of Figure 1. We constructed it by using the R command hist(c(‐z,z)). The symmetrized histogram of the z‐values is symmetric and unimodal. Such distributions can be well approximated by a mixture of zero‐mean Gaussians. We have used the R package “flexmix”, which implements the EM algorithm, to estimate it. We tried 1 up to 6 components, and found that more than 4 components made no discernible difference to the estimated distribution. The estimated mixture density fits the original histogram of the z‐values reasonably well. The fit to the symmetrized histogram is excellent. The estimates of the 4 mixture components are given in Table 1.

TABLE 1

Estimated 4‐part normal mixture distributions of the z‐value and the SNR

	Comp. 1	Comp. 2	Comp. 3	Comp. 4
Proportions	0.32	0.31	0.30	0.07
SD of the z‐value	1.17	1.74	2.38	5.73
SD of the SNR	0.61	1.42	2.16	5.64

Estimated 4‐part normal mixture distributions of the z‐value and the SNR Recalling that the z‐value is the sum of the SNR and independent standard normal noise, we can obtain the distribution of the SNR by deconvolution. Since the distribution of the z‐value is a mixture of normal distributions, this is particularly easy; we simply subtract 1 from the variances of the components. The resulting SDs are given in Table 1. By Assumption 1, the conditional distribution of the z‐value given the SNR is normal with mean SNR and SD 1. The conditional distribution of the SNR given the z‐value is slightly more complicated, but it can be obtained by straightforward application of Gaussian theory. It is again a mixture of Gaussians, and in the Appendix we provide the relevant formulas and a few lines of R code to compute it. We demonstrate the code in Section 4.3.

POWER, EXAGGERATION, AND COVERAGE

Many statistical quantities depend on only through the z‐value and the SNR. Since we now have an estimate of their joint distribution across all the RCTs in the Cochrane database, we can quantify the achieved power, exaggeration (ie, overestimation of the magnitude of ), and coverage.

Power

The achieved power of the two‐sided Wald test of at level 5% depends on the SNR and is given by where is the cumulative distribution function of the standard normal distribution. The power is an even function, so it depends on the SNR only through its absolute value. In fact, the power is a strictly increasing function of absolute value of the SNR. Using formula (1), it is easy to transform a sample from the distribution of the SNR into a sample from the distribution of the power. We generated such a sample of size and show the histogram in Figure 2. In Table 2 we report a number of quantiles of the distribution of the power. For example, the median of the power is 13%. The average power is 28%. O'Hagan et al discuss the importance of average power to which they refer as the “assurance”.

FIGURE 2

Histogram of a sample of size from the estimated distribution of the power

TABLE 2

Estimated quantiles of the absolute value of the signal‐to‐noise ratio and the achieved power

	Q10	Q25	Q50	Q75	Q90
\|SNR\|	0.14	0.36	0.83	1.73	3.03
Power	0.05	0.07	0.13	0.41	0.86

Histogram of a sample of size from the estimated distribution of the power Estimated quantiles of the absolute value of the signal‐to‐noise ratio and the achieved power Most RCTs are designed to have 80% or 90% power against an alternative that is considered to be of clinical interest. As we can see from Figure 2 and Table 2, the achieved power is usually much lower than 80%. Indeed, we estimate that about 88% of RCTs have power less than 80%. However, this does not mean that most sample size calculations are mistaken. It merely indicates that many treatments do not have the effect that was considered to be of clinical interest. Finding good treatments is not easy! Other factors may also contribute to low power, such as difficulties with subject recruitment or larger between‐subject variation than anticipated. In any case, the fact remains that the achieved power in the majority RCTs is quite low, and this has consequences for our inferences.

Exaggeration

Another important statistical quantity is the ratio Its expectation could be called the relative bias of the magnitude. In the absence of bias, it is equal to one. However, the fact that b is unbiased for implies that is positively biased for (Jensen's inequality). So the unbiasedness of b implies that is always greater than one. Gelman and Carlin define the “expected type M error” or “exaggeration ratio” as It represents the factor by which the magnitude of the effect may be expected to be overestimated when we condition on significance at the 5% level (two‐sided). This exaggeration is sometimes referred to as the winner's curse. Undoubtedly, the winner's curse is to a considerable extent responsible for disappointing replication results. It turns out that the exaggeration ratio depends on and s only through the absolute value of the SNR. In fact, van Zwet and Cator show that the exaggeration ratio is decreasing in . Since the power is a strictly increasing function of , it follows the exaggeration ratio is also a decreasing function of the power, see also References 4, 5. In Figure 3 we show the exaggeration ratio as a function of |SNR| and as a function of the achieved power. If the achieved power is 13% (which is the median across the RCTs in the Cochrane database), then the exaggeration is about 3. This means that if a study with median power reaches significance, we would expect the magnitude of the effect to be overestimated by a factor of 3.

FIGURE 3

These curves show the mathematical relation between the exaggeration ratio (expected type M error) as a function of the SNR and the power

These curves show the mathematical relation between the exaggeration ratio (expected type M error) as a function of the SNR and the power Of course, in practice we do not know so we know neither the achieved power nor the exaggeration ratio. However, note that R is just a function of the SNR and the z‐value Since we have the joint distribution of the SNR and the z‐value, we can calculate the conditional distribution of R given z. We represent the conditional distribution of R given z by its three quartiles in the left panel of Figure 4. We notice that considerable bias is already present at fairly small values of z. When , the conditional median of R is about 1.7.

FIGURE 4

Left panel: The distribution of the ratio conditional on the z‐value, represented by its three quartiles. Right panel: The distribution of the ratio conditional on the z‐value

Coverage

We conclude this section by considering the conditional coverage of the usual 95% confidence interval given the observed z‐value. Note that Again, the joint distribution of the SNR and the z‐value allows us to calculate this expression. We plot the conditional coverage in Figure 5. We see that if z exceeds 2, the conditional coverage is much lower than the nominal level of 95%.

FIGURE 5

Probability that the interval covers conditional on the z‐value

SHRINKAGE

When the z‐value exceeds about 1.5, we see substantial overestimation of the magnitude of the effect in Figure 4 and undercoverage of the confidence interval in Figure 5. These are clearly very serious problems. In this section we propose a novel, data‐based shrinkage estimator by conditioning on the observed z‐value.

A shrinkage estimator

We have estimated the conditional distribution of the SNR given the z‐value, so we also have the conditional expectation . Since , this immediately suggests an alternative estimator for , namely Since the marginal distribution of the SNR is a mixture of zero‐mean normal distributions, will be “pulled” towards zero so that will be less than . In other words, is a shrinkage estimator. We stress that the estimator does not apply a universal shrinkage factor. More shrinkage will be applied to noisy estimates with small z‐values, while almost no shrinkage is applied to estimates with large z‐values. In this sense, the amount of shrinkage is adaptive to the amount of information in the data. We can see in the left panel of Figure 4 that is typically very biased for . In the right panel we show the three quartiles of the conditional distribution of given z. We see that the shrinkage successfully addresses the exaggeration.

Independence assumption

Until now, we carried out all our inferences conditionally on the z‐value. However, we observe both b and s, so we are ignoring some potentially relevant information. In particular, we may not conclude that our shrinkage estimator is equal to the Bayes estimator (posterior mean) . Assumption 1 in Section 2 actually makes it possible to estimate the complete joint distribution of from a sample of pairs . However, modeling the three‐dimensional distribution of is much harder than the two‐dimensional distribution of , especially since z and SNR have such a simple relation to each other. The following Proposition formalizes when the joint distribution of z and SNR is sufficient for computing the conditional distribution of given . Let denote the conditional density of given and denote the conditional density of given z. If the pair and the SNR are conditionally independent, given the z‐value, then Let denote the conditional density of given . Since , we have The conditional independence condition now says and the result follows. Under the condition of Proposition 1, the conditional distribution of given is equal to the conditional distribution of the SNR given z, scaled by a factor s. In particular, Moreover, from the conditional distribution of the SNR given z, we can construct an interval such that the conditional probability of covering the SNR is 95%. Under the condition of Proposition 1 we can simply scale this interval by s to get an interval for which has 95% coverage, conditionally on z. The distribution of the z‐value across the Cochrane database carries only information about the distribution of the SNR, which in turn carries only information about the distribution of the achieved power. So, if we use the observed z‐value as a basis for inference about via (6), then we are using only information about the distribution of the achieved power across the Cochrane database. We are ignoring any additional information that may be present in the pair . Doing inference about via (6) has one more practical advantage. Scaling by s means that the inference about is equivariant under changes of the unit of measurement of the outcome. In other words, it does not matter for our conclusions if the outcome happens to be measured in grams or kilograms, for example.

Example

To demonstrate our shrinkage estimator, we will consider the recent ANDROMEDA‐SHOCK RCT. The aim of this trial was to determine if treating patients on the basis of so‐called capillary refill time (CRT) is superior to treating them on the basis of blood lactate levels. The primary outcome was survival at 28 days and the primary effect parameter was the hazard ratio which was estimated at 0.75 with 95% confidence interval from 0.55 to 1.02. This is just shy of statistical significance at the 5% level, which led the investigators to conclude that the experimental strategy “did not reduce all‐cause 28‐day mortality”. We can relate the results of this particular RCT to what we know about RCTs in general. This means that we must view the ANDROMEDA‐SHOCK trial as a “typical” RCT in the sense that its signal‐to‐noise ratio is exchangeable with that of the other RCTs in the Cochrane database. In other words, we use only the information that ANDROMEDA‐SHOCK trial is an RCT. We use none of the particular features of the trial such as the disease, the treatment, the population, the sample size, the type of outcome, et cetera. Moreover, we also treat the two arms of the trial exchangeably. The estimated log hazard ratio is with 95% confidence interval from 0.6 to 0.02. From this confidence interval, we can compute the standard error of b to be . The z‐value is . We can now run the code below to compute our shrinkage estimator of the log hazard ratio. Note that the function conditional(), which is provided in the Appendix, must be sourced first. We find and notice the considerable shrinkage compared to . Next, we can also construct an interval that covers the SNR with probability 95% conditionally on the observed z‐value. Computing the quantiles of a one‐dimensional normal mixture distribution is not difficult. It can be done via a root‐finding algorithm, but we can also use the function qmixnorm from the R package “KScorrect”. We find the 95% interval from 3.0 to 0.43 for the SNR. We multiply this interval by s to get an interval for which also has 95% coverage conditionally on z. In this way, we find the interval from 0.48 to 0.07 for . Finally, we can also compute the conditional probability that is negative, given the observed z‐value. We find that this probability is 91%. This is in line with a Bayesian re‐analysis of the ANDROMEDA‐SHOCK trial.

DISCUSSION

We have analyzed the primary results of a collection of more than 20 000 RCTs from the Cochrane Database of Systematic Reviews (CDSR). By estimating the joint distribution of the SNR and the z‐value, we have been able to study the distribution of the achieved power. We have found that the achieved power is often very low. We have demonstrated how this leads to overestimation and undercoverage when the observed z‐value exceeds about 1.5. We propose a shrinkage estimator that addresses these issues by conditioning on the observed z‐value. The estimator is easy to compute from the reported effect estimate and its standard error. We have demonstrated its use by re‐analyzing the ANDROMEDA‐SHOCK trial. Our shrinkage estimator allows us to take into account the well‐known fact that many RCTs have low achieved power. We are not suggesting that our estimator should replace other approaches, but we do believe that it provides a useful addition to whichever pure Bayesian or frequentist alternative the reader might prefer as their first choice of analysis based on their own philosophical preference. We expect that Bayesians would want to use more prior information than the distribution of the achieved power across all RCTs. See Reference 16 for an extensive and balanced discussion of Bayesian approaches to RCTs. On the other hand, frequentists would prefer not to use any prior information at all. We are not the first to use the Cochrane database to obtain general information about RCTs. Turner et al and Rhodes et al used the database to get information about the heterogeneity of effects within a meta‐analysis of a particular treatment. We do believe that we are the first to propose using the database to obtain prior information about the SNR, or equivalently, the achieved power across all RCTs. While the Cochrane collaboration puts much effort in collecting all available studies on a particular topic (both published and unpublished), it is likely that the database suffers from at least some publication bias, file drawer effects, fishing, forking paths etc. Unfortunately, this means that the true (achieved) power is likely to be even lower and the exaggeration even greater. Strictly speaking, our analysis concerns only the Cochrane database. However, we believe that problems of low power and exaggeration are similar—if not worse—in many other areas of research.

14 in total

1. Contradicted and initially stronger effects in highly cited clinical research.

Authors: John P A Ioannidis
Journal: JAMA Date: 2005-07-13 Impact factor: 56.272

2. STREPTOMYCIN treatment of pulmonary tuberculosis.

Authors:
Journal: Br Med J Date: 1948-10-30

3. Empirical evidence for low reproducibility indicates low pre-study odds.

Authors: Katherine S Button; John P A Ioannidis; Claire Mokrysz; Brian A Nosek; Jonathan Flint; Emma S J Robinson; Marcus R Munafò
Journal: Nat Rev Neurosci Date: 2013-10-23 Impact factor: 34.870

4. Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.

Authors: Andrew Gelman; John Carlin
Journal: Perspect Psychol Sci Date: 2014-11

Review 5. Power failure: why small sample size undermines the reliability of neuroscience.

Authors: Katherine S Button; John P A Ioannidis; Claire Mokrysz; Brian A Nosek; Jonathan Flint; Emma S J Robinson; Marcus R Munafò
Journal: Nat Rev Neurosci Date: 2013-04-10 Impact factor: 34.870

6. Assessing treatment effects and publication bias across different specialties in medicine: a meta-epidemiological study.

Authors: Simon Schwab; Giuachin Kreiliger; Leonhard Held
Journal: BMJ Open Date: 2021-09-14 Impact factor: 3.006

7. Medical research: Trial unpredictability yields predictable therapy gains.

Authors: Benjamin Djulbegovic; Ambuj Kumar; Paul Glasziou; Branko Miladinovic; Iain Chalmers
Journal: Nature Date: 2013-08-22 Impact factor: 49.962

8. Predicting the extent of heterogeneity in meta-analysis, using empirical data from the Cochrane Database of Systematic Reviews.

Authors: Rebecca M Turner; Jonathan Davey; Mike J Clarke; Simon G Thompson; Julian Pt Higgins
Journal: Int J Epidemiol Date: 2012-03-29 Impact factor: 7.196

9. Implementing informative priors for heterogeneity in meta-analysis using meta-regression and pseudo data.

Authors: Kirsty M Rhodes; Rebecca M Turner; Ian R White; Dan Jackson; David J Spiegelhalter; Julian P T Higgins
Journal: Stat Med Date: 2016-08-30 Impact factor: 2.373

10. Low statistical power in biomedical science: a review of three human research domains.

Authors: Estelle Dumas-Mallet; Katherine S Button; Thomas Boraud; Francois Gonon; Marcus R Munafò
Journal: R Soc Open Sci Date: 2017-02-01 Impact factor: 2.963

3 in total

1. Ten simple rules for good research practice.

Authors: Simon Schwab; Perrine Janiaud; Michael Dayan; Valentin Amrhein; Radoslaw Panczak; Patricia M Palagi; Lars G Hemkens; Meike Ramon; Nicolas Rothen; Stephen Senn; Eva Furrer; Leonhard Held
Journal: PLoS Comput Biol Date: 2022-06-23 Impact factor: 4.779

2. How large should the next study be? Predictive power and sample size requirements for replication studies.

Authors: Erik W van Zwet; Steven N Goodman
Journal: Stat Med Date: 2022-04-08 Impact factor: 2.497

3. The statistical properties of RCTs and a proposal for shrinkage.

Authors: Erik van Zwet; Simon Schwab; Stephen Senn
Journal: Stat Med Date: 2021-08-23 Impact factor: 2.497

3 in total