Literature DB >> 33150663

Using Monte Carlo experiments to select meta-analytic estimators.

Sanghyun Hong, W Robert Reed.

Abstract

The purpose of this study is to show how Monte Carlo analysis of meta-analytic estimators can be used to select estimators for specific research situations. Our analysis conducts 1620 individual experiments, where each experiment is defined by a unique combination of sample size, effect size, effect size heterogeneity, publication selection mechanism, and other research characteristics. We compare 11 estimators commonly used in medicine, psychology, and the social sciences. These are evaluated on the basis of bias, mean squared error (MSE), and coverage rates. For our experimental design, we reproduce simulation environments from four recent studies. We demonstrate that relative estimator performance differs across performance measures. Estimator performance is a complex interaction of performance indicator and aspects of the application. An estimator that may be especially good with respect to MSE may perform relatively poorly with respect to coverage rates. We also show that the size of the meta-analyst's sample and effect heterogeneity are important determinants of relative estimator performance. We use these results to demonstrate how these observable characteristics can guide the meta-analyst to choose the most appropriate estimator for their research circumstances.
© 2020 John Wiley & Sons Ltd.

Keywords:  Monte Carlo; estimator performance; experiments; meta-analysis; publication bias; simulation design

Year:  2020        PMID: 33150663      PMCID: PMC8074967          DOI: 10.1002/jrsm.1467

Source DB:  PubMed          Journal:  Res Synth Methods        ISSN: 1759-2879            Impact factor:   5.273


Despite much previous research, meta‐analysts do not have much guidance when it comes to selecting a "best" estimator. This study shows how Monte Carlo experiments can be used to select the "best" estimator for a given research situation. We compare 11 estimators commonly used in medicine, psychology, and the social sciences. The estimators are evaluated on four performance measures: bias, mean squared error (MSE), coverage rates, and Type I error rates. We conduct 1620 individual experiments, where an experiment is defined by a unique combination of sample size, effect size, effect size heterogeneity, publication selection mechanism, and other research characteristics. Estimators that are relatively good on one performance measure may perform relatively poorly on another. The size of the meta‐analyst's sample and effect heterogeneity are important determinants of relative estimator performance. We demonstrate how the observable characteristics of sample size and effect heterogeneity can guide the meta‐analyst to select the most appropriate estimator for their research circumstances.

INTRODUCTION

Meta‐analysts have an embarrassment of riches when it comes to choosing an estimator for measuring mean effects. The list of potential estimators is long and growing. Accordingly, a literature has arisen that attempts to provide guidance to those seeking a "best" estimator. The purpose of this study is not to produce yet another attempt at recommending estimators. Instead, this study lays out a procedure for how one can identify a best estimator for a given research application. While we provide an example of how such a procedure could work, the purpose of the example is to demonstrate the feasibility and practicality of our approach.

Selecting a best estimator for meta‐analysis (MA) is complicated. "Best" depends on the meta‐analyst's goals. Different meta‐analysts can have different goals. Further, no estimator performs well in every situation. Yet relatively little is known about the circumstances which would cause a given estimator to perform better than others. In an ideal world, there would exist a flow‐chart that would guide researchers toward the estimator that was best for their research application. Given the current state of knowledge, it is not even clear what factors should be included in such a flow‐chart. This study attempts to make progress on this issue. In the remainder of this introduction, we elaborate on what this study does and the results that we obtain.

Our study performs the largest, most extensive Monte Carlo analysis of MA estimators to date. We conduct 1620 individual experiments, where each experiment is defined by a specific combination of sample size (ie, number of estimates in the meta‐analyst's sample), effect size, effect size heterogeneity, publication selection mechanism, and other research characteristics. We compare 11 estimators commonly used in medicine, psychology, and the social sciences. We assess these estimators on the basis of bias, mean squared error (MSE), coverage rates, and Type I error rates.
Our Monte Carlo experiments reproduce experimental designs from four previous studies: Stanley et al; Alinaghi and Reed; Bom and Rachinger; and Carter et al. We do this rather than design our own experiments for two reasons. Monte Carlo experiments are by definition artificial representations of a complex reality. They involve a large number of subjective judgments. We wanted to select designs that had to some extent been approved by the peer review process. We also wanted to use multiple experimental designs to see if results would differ across simulation environments.

Our research produces four major findings. First, estimators that rank relatively high in terms of average performance on one criterion frequently do not perform as well on other criteria. From this we conclude that meta‐analysts need to prioritize which criteria (bias, MSE, etc.) are most important to them. Second, estimators that perform relatively well in one experimental design often do not perform as well in others. We identify two possible reasons for this difference across experimental designs, though more research needs to be done to better understand why this is so. Third, we show that effect size heterogeneity and the number of estimates in the meta‐analyst's sample ("sample size") are important determinants of estimator performance. Both these characteristics are observable to the meta‐analyst. As such, they can serve as elements of a "flow‐chart" that allows the meta‐analyst to match experimental results to their own research situation, and thus guide them to the best estimator for the problem at hand. Lastly, we give a specific example of how this would work. Our example assumes that the meta‐analyst wants an estimator that minimizes MSE. The meta‐analyst's sample consists of 100 estimates characterized by a high degree of effect heterogeneity. Further, they believe that the Monte Carlo design of Carter et al most closely matches their own research situation.
The meta‐analyst gathers all the experimental results associated with the Carter et al experimental design having sample sizes of 100 and a high degree of heterogeneity. They then compare MSE performance across all 11 estimators for this set of experiments and select a "best" estimator, which can then be used for their own research.

Our study proceeds as follows. Section 2 describes the estimators that we study. Section 3 highlights the main characteristics of the different simulation environments used for our analysis. Section 4 defines the performance measures. Section 5 presents our results. Section 6 gives an example of how Monte Carlo experimental results can be used to guide the selection of a "best" estimator for a given research situation. Section 7 concludes with a summary of our main results and suggestions for future research. All of the programming code and output files associated with this project are available at https://osf.io/pr4mb/. We note that our code borrows considerably from Carter et al.

THE ESTIMATORS

As noted by Carter et al (p117), while many studies have analyzed the performance of meta‐analytic estimators, "…there is very little overlap among these studies in either the methods they have examined or the simulated conditions they have explored." Table 1 summarizes a selection of previous Monte Carlo studies and compares them in terms of the number of experiments and estimators studied. Our study analyzes and compares the performance of 11 estimators. This compares favorably with previous studies both in terms of the number of estimators and the variety in the types of estimators. We chose our estimators because they either are widely used in the MA literature or have recently appeared in prominent publications.
TABLE 1

Summary of selected Monte Carlo studies of estimator performance: number of experiments and estimators studied

Study | Experiments | Estimators
Stanley et al (1) | 180 | RE, WLS, WAAP, PP
Alinaghi and Reed (2) | 74 | WLS‐FE, WLS‐RE, PP
Bom and Rachinger (3) | 215 | FE, RE, WAAP, PP, EK
Carter et al (4) | 432 | TF, pC, pU, RE, 3PSM, WAAP, PP
Hedges and Vevea (6) | 176 | 5PSM
McShane et al (7) | 125 | pC, pU, 3PSM
Moreno et al (8) | 240 | TF(FE‐FE), TF(FE‐RE), TF(RE‐RE), FE, RE, FE‐se, RE‐se, D‐se, FE‐var, RE‐var, D‐var, Harbord, Peters, and Harbord‐C
Reed (9) | 36 | OLS, PET, PEESE, FE, WLS, RE
Rucker et al (10) | 36 | TF, CSM, RE, LMA
Simonsohn et al (11) | 30 | TF, pC, FE
Stanley (12) | 120 | WLS, FE, PP
Stanley and Doucouliagos (13) | 60 | FE, RE, Top10, PEESE, PP, WLS‐se, WLS‐Quadratic, WLS‐Cubic
van Aert et al (14) | 25 | pC, pU, FE, RE
van Assen et al (15) | 36 | FE, TF, pU, TES
Our study | 1620 | TF, pC, pU, RE, 3PSM, 4PSM, AK1, AK2, WAAP, PP, EK

Note: Estimators:

3PSM/4PSM/5PSM = Three‐Parameter, Four‐Parameter, and Five‐Parameter Selection Models

AK1 = Andrews and Kasy's “symmetric selection” model

AK2 = Andrews and Kasy's "asymmetric selection" model

CSM = Copas selection model (Copas )

EK = Bom and Rachinger's Endogenous Kink estimator

FE = Fixed Effects

FE‐se, RE‐se, and WLS‐se/D‐se/PET = Estimates the model effect_i = β0 + β1·SE_i + ε_i using FE, RE, and WLS, respectively

FE‐var, RE‐var, and PEESE/D‐var = Estimates the model effect_i = β0 + β1·SE_i² + ε_i using FE, RE, and WLS, respectively

Harbord/Harbord‐C = Harbord, Egger, and Sterne's “Regression test for small‐study effects” and variant

LMA = Limit meta‐analysis (Rucker et al ).

OLS = OLS regression of estimated effects on a constant.

pC = p‐curve

pU = p‐uniform

Peters = Peters et al's “Regression test for funnel asymmetry”

PP = PET‐PEESE (Stanley and Doucouliagos )

RE = Random Effects

TES = Test for excess significance (Ioannidis and Trikalinos )

TF/TF(RE‐RE) = Trim and Fill with RE used for both the “trim” and “fill” components

TF(FE‐FE)/TF(FE‐RE) = Trim and Fill with variants depending on whether FE or RE is used for the “trim” and “fill” components, respectively

Top10 = Estimator which uses only the most precise 10% of estimates (Stanley et al. )

WLS/WLS‐FE = Weighted Least Squares with weights 1/SE_i²

WLS‐RE = Weighted Least Squares with weights 1/(SE_i² + τ²)

WLS‐Quadratic = Estimates a model including a quadratic function of SE_i using WLS

WLS‐Cubic = Estimates a model including a cubic function of SE_i using WLS

WAAP = Stanley et al's Weighted Average of the Adequately Powered‐WLS‐FE hybrid estimator.


The context

The estimators are best described within a research context. The following example focuses on a linear regression model, but is easily extended to analyses involving Cohen's d and Log‐Odds/logistic regression. Suppose a researcher is interested in synthesizing the results of an empirical literature. The literature consists of studies that estimate the effect of X on Y using the following linear regression model,

Y_ij = α_i + β_i·X_ij + ε_ij,  j = 1, 2, …, n_i,  (1)

where i identifies a given regression having n_i observations. The true effect of X on Y in any given regression is given by β_i. β_i can differ across regressions for many reasons that are unobservable to the meta‐analyst. The distribution of the population effect across regressions is represented by β_i ~ (β̄, τ²). Let β̂_i be the estimated effect from regression i. The meta‐analyst collects a sample of estimates, β̂_i, i = 1, 2, …, N, and wants to estimate β̄, the population mean effect of X on Y, for example via the meta‐regression on a constant,

β̂_i = β̄ + u_i.  (2)

They know that publication selection may distort their sample of estimates. They have the following estimators available to them: Trim‐and‐Fill, p‐curve, p‐uniform, Random Effects, Three‐Parameter and Four‐Parameter Selection Models, Andrews and Kasy's "symmetric selection" and "asymmetric selection" models, the Weighted Average of the Adequately Powered‐WLS hybrid estimator, PET‐PEESE, and Bom and Rachinger's Endogenous Kink estimator. Each of these is briefly described below.
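As an illustration of this research context, the following Python sketch simulates primary studies and collects their estimated effects and SEs into a meta-analyst's sample. All parameter values (true mean effect 0.5, heterogeneity SD 0.2, 100 observations per study) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_study(beta_bar=0.5, tau=0.2, n_obs=100):
    """Simulate one primary study estimating Y = a + beta_i * X + e by OLS."""
    beta_i = beta_bar + rng.normal(0, tau)   # true effect varies across studies
    x = rng.normal(0, 1, n_obs)
    y = 1.0 + beta_i * x + rng.normal(0, 1, n_obs)
    xc = x - x.mean()
    b_hat = (xc @ (y - y.mean())) / (xc @ xc)          # OLS slope
    resid = (y - y.mean()) - b_hat * xc
    se = np.sqrt((resid @ resid) / (n_obs - 2) / (xc @ xc))   # slope SE
    return b_hat, se

# The meta-analyst's sample: N estimated effects with their SEs
sample = np.array([simulate_study() for _ in range(100)])
effects, ses = sample[:, 0], sample[:, 1]
```

Absent publication selection, the mean of `effects` is an unbiased (if inefficient) estimate of the population mean effect.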

Trim and Fill (TF)

Trim and Fill (Duval and Tweedie ) is a method that assumes that any asymmetry in the distribution of effect sizes and SEs is due to publication selection. The method works by iteratively removing individual observations until symmetry in the distribution of effect sizes and SEs is achieved. The removed observations are then added back into the sample, along with artificially generated effect/SE observations that are the mirror images of the removed observations. This ensures that the reconstructed MA sample achieves symmetry. Our estimates are obtained using the metafor package in R.

p‐curve (pC)/p‐uniform (pU)

The p‐curve (Simonsohn et al ) and p‐uniform (van Assen et al ) methods are conceptually identical and similar in implementation. Both estimate the mean true effect from the sample of MA estimates that are statistically significant; that is, have P‐values less than 5%. Both assume that estimates with P‐values less than .05 are equally likely to be published, and that the respective P‐values are independently distributed. Both methods work from the starting point that the distribution of P‐values (the “p‐curve”) will be uniformly distributed between 0 and .05 if the null hypothesis is true. Larger, positive effects produce a right skewness to the shape of the “p‐curve.” Neither is recommended in the presence of effect size heterogeneity (van Aert et al ). Conceptually, both methods estimate the value of the true (unobserved) effect that would produce a “p‐curve” closest to the observed “p‐curve.” Both define a loss function that measures the distance between the (transformed) expected and the observed p‐curves and choose a mean true effect that minimizes that loss function. The two methods differ in the metric they use to measure distance. P‐curve uses the Kolmogorov‐Smirnov test statistic as a distance metric, while p‐uniform's metric is based on the Irwin‐Hall distribution. They also differ in that the p‐curve estimator does not produce a SE. We follow standard practice and only include significant estimates that are same‐signed (positive in our case). Our p‐curve estimates are obtained from the programming code in Carter et al. Our p‐uniform estimates use method one in the puniform package in R.
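The loss-function idea behind p-curve can be sketched as follows, under the simplifying assumption of z-tests with known SEs. This is a hedged illustration, not the Simonsohn et al implementation the paper uses: the candidate effect whose implied conditional p-values ("pp-values") look most uniform, measured by the Kolmogorov-Smirnov statistic, is chosen as the estimate.

```python
import numpy as np
from scipy import stats, optimize

def pp_values(effects, ses, delta):
    """Conditional p-values of significant estimates under candidate effect delta."""
    z = effects / ses
    mu = delta / ses                      # implied noncentrality for each estimate
    num = stats.norm.sf(z - mu)           # P(Z > z_obs | true effect delta)
    den = stats.norm.sf(1.96 - mu)        # P(significant | true effect delta)
    return num / den                      # uniform on (0,1) at the true delta

def p_curve_estimate(effects, ses):
    sig = (effects / ses) > 1.96          # keep positive, significant estimates only
    e, s = effects[sig], ses[sig]
    ks = lambda d: stats.kstest(pp_values(e, s, d), "uniform").statistic
    res = optimize.minimize_scalar(ks, bounds=(0.0, 2.0), method="bounded")
    return res.x
```

A p-uniform-style variant would swap the KS statistic for a statistic based on the sum of log pp-values (the Irwin-Hall route); the conditional-probability machinery is the same.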

Random effects (RE)

The random effects estimator is arguably the most commonly used meta‐analytic estimator. It does not explicitly correct for publication selection other than giving greater weight to more precise estimates of β. It estimates the population mean effect assuming the following specification:

β̂_i = β̄ + ν_i + ε_i,  ν_i ~ N(0, τ²),  ε_i ~ N(0, σ_i²),

where σ_i² is the variance in β̂_i due to sampling error, and τ² is the variance of true effects across studies. σ_i² is estimated by SE_i², where SE_i is the (estimated) SE of the estimated effect, β̂_i. A variety of procedures have been developed to estimate τ². Our RE estimates are obtained using the R package metafor, where τ² is calculated using the restricted maximum likelihood method.
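The paper obtains its RE estimates via REML in metafor. As a simpler, self-contained sketch of the same two-stage logic (estimate τ², then reweight), here is the classic DerSimonian-Laird moment estimator:

```python
import numpy as np

def random_effects_dl(effects, ses):
    """DerSimonian-Laird random-effects estimate of the mean effect."""
    w = 1.0 / ses**2                               # fixed-effect weights
    mu_fe = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - mu_fe) ** 2)         # Cochran's Q
    k = len(effects)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)             # moment estimate of tau^2
    w_re = 1.0 / (ses**2 + tau2)                   # random-effects weights
    mu_re = np.sum(w_re * effects) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return mu_re, se_re, tau2
```

When the estimates are homogeneous, τ̂² collapses to zero and the RE mean coincides with the fixed-effect (inverse-variance) mean.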

Three‐parameter and four‐parameter selection models (3PSM and 4PSM)

A variety of selection models have been proposed in the literature to correct for publication bias (Iyengar and Greenhouse; Vevea and Hedges; Vevea and Woods). A common model is the Three‐Parameter Selection Model (3PSM). 3PSM assumes that standardized effect sizes are distributed normally in the population. Publication selection induces differential probabilities of being published, with publication probabilities following a step function. The general method allows researchers to set the values of the steps. For our 3PSM analysis, we follow Carter, Schönbrodt, Gervais, and Hilgard in allocating estimates to two categories depending on whether the estimates are (a) correctly signed (positive) and statistically significant; or (b) not correctly signed and significant. These have relative publication probabilities equal to 1 and p, respectively (see Panel A of Figure 1). The "Three Parameters" correspond to the mean true effect, the extent of effect heterogeneity (τ²), and p.
FIGURE 1

Illustration of 3PSM and 4PSM. [Colour figure can be viewed at wileyonlinelibrary.com]

We also consider a Four‐Parameter Selection Model. Our 4PSM adds another category to the 3PSM model: positive and insignificant estimates. The respective categories then become (a) positive and significant; (b) positive and insignificant; and (c) negative. The associated relative publication probabilities are equal to 1, p1, and p2, respectively (see Panel B of Figure 1); with the mean true effect, τ², p1, and p2 accounting for the Four Parameters. We use R's weightfunct package to estimate 3PSM and 4PSM. When the relative probabilities of being published are equal to one (ie, no publication selection), these models collapse to the RE model.
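The selection-model logic can be sketched as a maximum-likelihood problem. The following is a hedged 3PSM illustration, not the weightfunct implementation used in the paper: estimates are normal with variance τ² + SE², estimates that are not positive-and-significant are published with relative probability p, and the likelihood of each published estimate is its weighted density divided by the expected publication weight.

```python
import numpy as np
from scipy import stats, optimize

def neg_loglik(theta, effects, ses):
    mu, log_tau2, logit_p = theta
    tau2, p = np.exp(log_tau2), 1 / (1 + np.exp(-logit_p))
    sd = np.sqrt(tau2 + ses**2)
    zcrit = 1.96 * ses                    # significance cut-off on the effect scale
    sig = effects > zcrit                 # positive and significant
    w = np.where(sig, 1.0, p)             # step publication weights
    p_sig = stats.norm.sf(zcrit, loc=mu, scale=sd)
    norm_const = p_sig + p * (1 - p_sig)  # expected publication weight
    dens = stats.norm.pdf(effects, loc=mu, scale=sd)
    return -np.sum(np.log(w * dens / norm_const))

def fit_3psm(effects, ses):
    res = optimize.minimize(neg_loglik,
                            x0=[np.mean(effects), np.log(0.01), 0.0],
                            args=(effects, ses), method="Nelder-Mead")
    mu, log_tau2, logit_p = res.x
    return mu, np.exp(log_tau2), 1 / (1 + np.exp(-logit_p))
```

Fitting this to a sample generated with step-function selection pulls the mean estimate back below the naive (upward-biased) sample mean.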

AK1 and AK2

Similar to 3PSM and 4PSM are two new estimators from Andrews and Kasy. Like 3PSM and 4PSM, these models categorize estimated effects into groups with different probabilities of being published. The AK1 model groups estimates into significant and insignificant estimates without respect to sign: (i) significant; and (ii) insignificant. Andrews and Kasy refer to this as the "symmetric selection" case (see Panel A of Figure 2). The relative probability that a significant estimate is published is fixed at 1, while estimates that are insignificant are published with probability p.
FIGURE 2

Illustration of AK1 and AK2 [Colour figure can be viewed at wileyonlinelibrary.com]

Andrews and Kasy propose another estimator that recognizes that the sign of the estimated effect may also affect selection. The AK2 estimator allocates estimates into four groups defined by sign and statistical significance. These have relative publication probabilities equal to 1, p1, p2, and p3 (see Panel B of Figure 2). Andrews and Kasy call this the "asymmetric selection" case. Because the P‐values produced by AK1 and AK2 are based on t‐statistics, they require four and six observations, respectively, in order to obtain P‐values for all the parameter estimates. This can be a problem for meta‐analyses with very small samples, as is common in medicine. We use the programming code that accompanies Andrews and Kasy to obtain our AK1 and AK2 estimates.

Weighted average of the adequately powered‐weighted least squares hybrid estimator (WAAP)

The Weighted Average of the Adequately Powered‐Weighted Least Squares hybrid estimator was introduced in Stanley et al. Conceptually, this estimator chooses a subset of the N estimates that are "adequately powered," defined as coming from regression equations having a power of at least 80%. Weighted Least Squares (weights = 1/SE_i²) is used to estimate Equation (2) in order to obtain an initial estimate of β̄. To determine whether a particular estimate comes from an "adequately powered" regression equation, the WAAP estimator determines a threshold value for the effect SE:

SE_threshold = |β̂_WLS|/2.8,

where β̂_WLS is the WLS estimate of β̄ in Equation (2) based on the full sample of N estimated effects. Note that this initial estimate of β̄ does not correct for publication bias. WAAP then selects all the β̂_i for which SE_i ≤ SE_threshold. Let M ≤ N of the β̂_i satisfy this criterion. It then uses WLS to re‐estimate Equation (2) using only the M estimates (the "adequately powered" estimates) to obtain a revised estimate of β̄. A problem can arise when there are too few effect estimates that are adequately powered. Following Stanley et al, if there are fewer than two adequately powered effect estimates, the WAAP estimator uses the WLS estimate from the full sample of N estimated effects.
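Because the meta-regression (2) contains only a constant, WLS here reduces to an inverse-variance weighted mean, which makes WAAP short to sketch. The 2.8 divisor reflects 80% power under a 5% two-tailed test (1.96 + 0.84 ≈ 2.8).

```python
import numpy as np

def wls_mean(effects, ses):
    """WLS estimate of Equation (2): inverse-variance weighted mean."""
    w = 1.0 / ses**2
    return np.sum(w * effects) / np.sum(w)

def waap(effects, ses):
    beta_wls = wls_mean(effects, ses)      # initial full-sample WLS estimate
    threshold = abs(beta_wls) / 2.8        # SE cut-off for 80% power
    powered = ses <= threshold
    if powered.sum() < 2:                  # too few adequately powered estimates
        return beta_wls                    # fall back to full-sample WLS
    return wls_mean(effects[powered], ses[powered])
```

If the precise (adequately powered) estimates cluster around the true effect while imprecise estimates are inflated by selection, WAAP discards the inflated subset.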

PET‐PEESE (PP)

PET‐PEESE stands for Precision Effect Test and Precision Effect Estimate with SE (Stanley and Doucouliagos). The PP estimator proceeds in two steps. The first step estimates a publication‐corrected variant of Equation (2) using WLS:

β̂_i = β0 + β1·SE_i + ε_i,  (4a)

with weights equal to 1/SE_i². It then tests H0: β0 = 0. If it fails to reject this hypothesis, then PP takes β̂0 from Equation (4a) as an estimate of β̄. If it rejects H0, it then estimates

β̂_i = β0 + β1·SE_i² + ε_i.  (4b)

The estimate of β0 from Equation (4b) then becomes the updated PP estimate of β̄. Following Stanley's recommendation, we use a one‐tailed test when testing H0. While endogeneity lies outside the purview of our study, it should be noted that PET‐PEESE, unlike the other estimators, easily accommodates IV methods.
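The two-step logic can be sketched directly. This is a hedged illustration of the conditional PET-to-PEESE switch, with a simple WLS intercept SE; implementation details (degrees-of-freedom conventions, critical values) vary across published versions.

```python
import numpy as np

def wls_fit(y, x, w):
    """WLS of y on [1, x]; returns coefficients and the SE of the intercept."""
    X = np.column_stack([np.ones_like(x), x])
    xtwx = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(xtwx, X.T @ (w * y))
    resid = y - X @ beta
    sigma2 = np.sum(w * resid**2) / (len(y) - 2)
    se_intercept = np.sqrt(sigma2 * np.linalg.inv(xtwx)[0, 0])
    return beta, se_intercept

def pet_peese(effects, ses, t_crit=1.645):
    w = 1.0 / ses**2
    beta_pet, se0 = wls_fit(effects, ses, w)         # PET: effects on SE (4a)
    if beta_pet[0] / se0 > t_crit:                   # one-tailed test of H0: b0 = 0
        beta_peese, _ = wls_fit(effects, ses**2, w)  # PEESE: effects on SE^2 (4b)
        return beta_peese[0]
    return beta_pet[0]
```

When the PET intercept is insignificant, the (less biased near zero) PET estimate is retained; otherwise the quadratic PEESE specification supplies the final estimate.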

Endogenous Kink (EK)

Bom and Rachinger recently proposed a modification to the PET‐PEESE model. The modification concerns a nonlinearity between the size of the bias due to publication selection and the SE. When β̄ is nonzero, there is no publication selection when SE is very small because all or virtually all estimates are statistically significant. As SE increases, the degree of publication selection increases. This induces a non‐linearity in the relationship between bias and SE. This nonlinearity is the reason why Stanley and Doucouliagos propose including SE² in Equation (4b). As an alternative, Bom and Rachinger propose the following kinked regression specification:

β̂_i = β0 + β1·(SE_i − a)·D_i + ε_i,

where D_i is a dummy variable that takes the value 1 whenever SE_i is larger than a cut‐off point a. This induces a kink at SE_i = a. To determine a, Bom and Rachinger follow a two‐step procedure. They first estimate β̄ as if one were implementing the first stage of the PET‐PEESE procedure. Assuming the estimated effect is positive, they then calculate the lower bound of a 95% confidence interval around this estimate, where the SE is derived from a RE model (to accommodate effect heterogeneity). The cutoff value a is the value of SE that satisfies the equality 1.96·SE = lower bound, that is, a = lower bound/1.96. Below a, most estimates of β are likely to be statistically significant and thus unaffected by publication selection. Beyond a, publication selection is likely to become an increasing problem, causing the bias to be linearly related to SE. To estimate the EK model, we use programming code provided by Bom and Rachinger.
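The description above can be sketched as follows. This is a hedged illustration, not Bom and Rachinger's released code: for simplicity, the confidence-interval SE comes from the first-stage WLS fit rather than from a RE model.

```python
import numpy as np

def ek_estimate(effects, ses):
    """Endogenous Kink sketch: flat below the cut-off a, linear in SE above it."""
    w = 1.0 / ses**2
    ones = np.ones_like(ses)
    # First stage: PET-style WLS of effects on SE, to locate the kink
    X1 = np.column_stack([ones, ses])
    xtwx1 = X1.T @ (w[:, None] * X1)
    b1 = np.linalg.solve(xtwx1, X1.T @ (w * effects))
    resid = effects - X1 @ b1
    sigma2 = np.sum(w * resid**2) / (len(effects) - 2)
    se_b0 = np.sqrt(sigma2 * np.linalg.inv(xtwx1)[0, 0])
    # Cut-off: SE at which a 95% CI lower bound is just significant
    lower = b1[0] - 1.96 * se_b0
    a = max(lower, 0.0) / 1.96
    kink = np.maximum(ses - a, 0.0)    # bias term switches on only above the kink
    if not np.any(kink > 0):
        return b1[0]                   # no kink inside the observed SE range
    X2 = np.column_stack([ones, kink])
    b2 = np.linalg.solve(X2.T @ (w[:, None] * X2), X2.T @ (w * effects))
    return b2[0]
```

The intercept of the kinked regression is the EK estimate of the mean effect.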

THE SIMULATION ENVIRONMENTS

To assess the 11 estimators above, we reproduce the simulation designs from four recently published studies: Stanley et al, Alinaghi and Reed, Bom and Rachinger, and Carter et al. We chose to work with multiple simulation environments in light of Carter et al's (p117) assessment of previous research:

"Different simulation studies have implemented bias differently, have drawn sample sizes from different distributions, and have varied widely in the value and form of the simulated true underlying effects. This lack of overlap is not surprising given that there is an effectively infinite number of possible combinations of different conditions to explore and no way of determining which conditions actually underlie real‐world data. In other words, not only is there an inherent dimensionality problem in these simulation studies, but there is also no ground truth. These problems are often not discussed in reports of simulation studies, and indeed, many of the reports just cited—explicitly or implicitly—recommended the use of a single method, despite the fact that each study examined performance of only a handful of correction methods in only a limited subset of possible conditions."

Working with multiple simulation environments allows us to determine the sensitivity of our results to alternative experimental designs. Our choice of simulation environments was made to ensure that we covered scenarios of interest to multiple disciplines. Stanley et al was published in Statistics in Medicine. Carter et al was published in Advances in Methods and Practices in Psychological Science. Alinaghi and Reed and Bom and Rachinger were recently published in Research Synthesis Methods. Each of the simulation designs is briefly described below. More extensive discussions can be found in the original articles. While the simulation designs cover many different scenarios, many relevant scenarios are missing from the designs.

Stanley, Doucouliagos, and Ioannidis

SD&I consider two scenarios where researchers are interested in determining the effect of a given treatment. In the "Log Odds Ratio" scenario, primary studies track the effect of a treatment on a binary indicator of "success." Individual observations are simulated such that the probability of "success" (Y = 1) is 10% for the control group, and (10% + a fixed effect + a mean zero, random component) for the treatment group. Effect heterogeneity is regulated by the variance of the random component. Primary studies estimate a logistic regression to determine the effect of the treatment on Prob(Y = 1). The parameter of interest is the coefficient on the treatment variable. Each study produces a single estimated effect. Variation in the SE of the estimated effects across studies is generated by allowing the primary studies to have different numbers of observations. The mean effect of treatment across all studies equals 0.0, 0.30, or 0.54, depending on the experiment. Sample sizes for the simulated meta‐analyses vary across experiments and are pre‐determined to consist of 5, 10, 20, 40, or 80 estimated effects. In the absence of publication selection, a regression of the estimated treatment effects on a constant should produce an unbiased estimate of the mean treatment effect in any given MA sample. Publication selection consists of two regimes: no publication selection, or 50% publication selection. Under 50% publication selection, estimates are sequentially evaluated for inclusion in the meta‐analyst's sample. Each estimate has a 50% chance of being "selected." If it avoids selection, the estimate is "published" without consideration of its sign and statistical significance. If selected, the estimate is "published" only if it is positive and significant. If not, new estimates are generated until a positive and significant estimate is found. This continues until the meta‐analyst's sample attains its pre‐determined size (see Panel A of Appendix 1 in Data S1).
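The 50% publication selection mechanism just described can be sketched as follows. The true effect and SE values are illustrative assumptions; only the selection logic mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_estimate(true_effect=0.3, se=0.2):
    """One candidate estimated effect with its (fixed, illustrative) SE."""
    return rng.normal(true_effect, se), se

def published_sample(n_estimates, selection_prob=0.5):
    sample = []
    while len(sample) < n_estimates:
        est, se = draw_estimate()
        if rng.random() < selection_prob:
            # "selected": redraw until the estimate is positive and significant
            while not (est > 0 and est / se > 1.96):
                est, se = draw_estimate()
        sample.append((est, se))          # unselected estimates publish as-is
    return np.array(sample)
```

Because half of the published estimates are forced to be positive and significant, the sample mean exceeds the true mean effect, which is exactly the bias the estimators above try to undo.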
In the second scenario, "Cohen's d," primary studies estimate the effect of a treatment, but this time the dependent variable is continuous. The difference in outcomes between the treatment and control groups is equal to a fixed effect plus a random component that differs across studies. Effect heterogeneity is introduced through this random component, which is regulated by its variance. Each primary study calculates Cohen's d, which is the standardized difference in the mean outcome values across the two groups. The mean value of d across studies is set equal to either 0 or 0.5, depending on the experiment. Differences in the SEs of d are generated by allowing the simulated primary studies to have different sample sizes. In the absence of publication selection, a regression of the estimated treatment effects on a constant will produce an unbiased estimate of the population mean of d. Sample sizes for the simulated meta‐analyses are pre‐determined to consist of 5, 10, 20, 40, or 80 estimated effects, depending on the experiment. The Cohen's d experiments include the no publication selection and 50% publication selection scenarios used for the Log OR scenario, plus one more: 75/100% publication selection. Under 75/100% publication selection, positive and statistically significant estimates are selected with probability 75%, but 100% of the estimates are restricted to be positive (see Panel B of Appendix 1 in Data S1).

Alinaghi and Reed

A&R study univariate regression models where a variable X affects a continuous variable Y. The parameter of interest is the coefficient on X. In the "Random Effects" data environment, each study produces one estimate and the population effect differs across studies. The coefficient on X equals a fixed component plus a random component that is fixed within a study but varies across studies. The overall mean effect of X on Y is given by the fixed component. To estimate it, the meta‐analyst regresses the study‐specific estimates on a constant. In the absence of publication selection, the resulting estimate will be unbiased. A distinctive feature of A&R's experiments is that they fix the size of the sample of estimated effects before publication selection, rather than after. The size of the meta‐analyst's sample is thus determined endogenously, and is affected by the size of the effect. For example, very large population effects will be subject to relatively little publication selection, as most estimates will satisfy the selection criteria, whether it be statistical significance or correct sign. Another distinctive feature of A&R's experiments is that they separate statistical significance from the sign of the estimated effect as criteria for selection. Other studies commonly combine these two, assuming a mechanism that selects estimates that are both positive and statistically significant. A&R's experiments accommodate the fact that these two criteria have different, sometimes conflicting, consequences for estimator performance. All significant/correctly‐signed estimates are "published," while insignificant/wrong‐signed estimates only have a 10% chance of getting published. A&R design their simulations to be representative of meta‐analyses in economics and business. These typically have samples of several hundred estimates and substantial effect heterogeneity.
In addition to the “Random Effects” data environment described above, A&R also construct a “Panel Random Effects” data environment, where each study has 10 estimates. This models the fact that the overwhelming share of meta‐analyses in economics and business have multiple estimates per study. Effect estimates and SEs are simulated to be more similar within studies than across studies. Publication selection targets the study rather than individual estimates. To be included in the meta‐analyst's sample, a study must have at least 7 out of the 10 estimates be significant/correctly signed.
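The study-level selection rule in the "Panel Random Effects" environment can be sketched as follows. The effect and SE values are illustrative assumptions; the essential feature is that the publish/suppress decision applies to whole studies of 10 estimates, not to individual estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_study(mean_effect=0.5, se=0.2, n_estimates=10):
    """One simulated study contributing 10 estimates with a common SE."""
    effects = rng.normal(mean_effect, se, n_estimates)
    return effects, np.full(n_estimates, se)

def published_studies(n_studies, mean_effect=0.5):
    published = []
    while len(published) < n_studies:
        effects, ses = draw_study(mean_effect)
        n_sig = np.sum((effects > 0) & (effects / ses > 1.96))
        if n_sig >= 7:                    # study-level selection rule: 7 of 10
            published.append((effects, ses))
    return published
```

Selection at the study level induces within-study correlation in who gets published, which is part of what makes this environment harder for estimators that assume independent estimates.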

Bom and Rachinger

B&R consider univariate regression environments where researchers are interested in estimating the effect of a variable X on a dependent variable Y, represented by a slope parameter β. Variation in the SEs of estimated effects is accomplished by allowing sample sizes to differ across primary studies. Effect heterogeneity is introduced via an omitted variable that is correlated with X. The coefficient on the omitted variable is randomly distributed across studies with mean zero. Individual estimates of β will be biased whenever that coefficient is nonzero. In the population of all studies, the omitted variable bias averages out. However, publication selection induces a bias in the meta‐analyst's sample when selection depends on the sign and significance of the estimated effect. The experiments are designed to produce 5, 10, 20, 40, or 80 "studies" for a given simulated MA, with each study consisting of one estimated effect. In the absence of publication selection, the regression on a constant produces an unbiased estimate of β, where β equals either 0 or 1 depending on the experiment. Publication selection consists of four regimes: no publication selection, 25%, 50%, and 75% publication selection. The publication selection algorithm is modeled after SD&I's 50% publication selection algorithm (see Panel A of Appendix 1 in Data S1).

Carter, Schönbrodt, Gervais, and Hilgard

In the simulation environment of CSG&H (for Carter, Schönbrodt, Gervais, and Hilgard), primary studies estimate the effect of a treatment using Cohen's d as their measure of effect. The difference in outcomes for the treatment and control groups equals a fixed effect plus a random component that differs across studies. Effect heterogeneity is introduced through this random component, whose spread is regulated by the heterogeneity parameter τ. The mean value of d takes on four values depending on the experiment: 0, 0.2, 0.5, and 0.8. Differences in the SEs of d for a given experiment are generated by allowing the simulated primary studies to have different sample sizes. CSG&H introduce two types of distortions in the research environment. First, they employ a publication selection algorithm in which the probability of estimates being “published” depends nonlinearly on both the sign of the estimated effect and its P‐value. They construct three different publication selection regimes, which they call “No Publication Bias,” “Medium Publication Bias,” and “Strong Publication Bias,” by altering the parameters of the publication selection algorithm. Second, they simulate four different types of “questionable research practices” (QRPs): (a) optional removal of outliers, (b) optional selection between two dependent variables, (c) optional use of moderators, and (d) optional stopping. Finally, CSG&H also construct experiments in which the simulated MA samples take on four different sizes: 10, 30, 60, and 100. Table 2 reports the number of experiments for each of the four simulation environments, categorized by the number of estimates included in the meta‐analyst's sample (“Sample Size”) and the extent of measured effect heterogeneity (I²). We calculate I² as

I² = τ̂² / (τ̂² + s̃²),

where τ̂² is the estimate of effect heterogeneity obtained using restricted maximum likelihood, s̃² = (k − 1) Σᵢwᵢ / [(Σᵢwᵢ)² − Σᵢwᵢ²] is the typical sampling variance of the estimates (with wᵢ the inverse squared SE of estimate i), and k is the number of estimates in the meta‐analyst's sample.
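The I² calculation can be sketched directly from a sample of estimates and SEs. For brevity this sketch estimates τ² with the DerSimonian-Laird moment estimator rather than the REML estimator the paper uses; the I² formula itself is the same.

```python
import numpy as np

def i_squared(estimates, ses):
    """I^2 = tau2 / (tau2 + s2): share of total variance attributable to
    heterogeneity in true effects.  tau2 is estimated here with the
    DerSimonian-Laird moment estimator (the paper uses REML); s2 is the
    'typical' sampling variance of Higgins and Thompson."""
    w = 1.0 / ses**2
    k = len(estimates)
    mu_fixed = np.sum(w * estimates) / np.sum(w)
    q = np.sum(w * (estimates - mu_fixed) ** 2)          # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                   # DL estimate of tau^2
    s2 = (k - 1) * np.sum(w) / (np.sum(w) ** 2 - np.sum(w**2))
    return tau2 / (tau2 + s2)

rng = np.random.default_rng(2)
ses = rng.uniform(0.1, 0.5, 50)
homogeneous = 0.3 + ses * rng.normal(size=50)                              # tau = 0
heterogeneous = 0.3 + rng.normal(0, 0.5, 50) + ses * rng.normal(size=50)   # tau = 0.5
print(i_squared(homogeneous, ses), i_squared(heterogeneous, ses))
```

With true heterogeneity of τ = 0.5 against sampling SEs of 0.1 to 0.5, I² lands in the “high” category, while the homogeneous sample produces a value near zero.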
TABLE 2

Number of experiments by sample size and extent of effect heterogeneity

A. Stanley, Doucouliagos, and Ioannidis 1
Sample size             Low I²         Moderate I²          High I²      Total
                        (I² ≤ 0.25)    (0.25 < I² ≤ 0.75)   (0.75 < I²)
{5, 10}                     30             27                   15          72
20                          15             10                   11          36
40                          15             10                   11          36
80                          13             12                   11          36
{100, 200, 400, 800}        51             49                   44         144
Total                      124            108                   92         324

Note: The table lists the number of experiments for each {sample size, effect heterogeneity} category, by simulation environment. An experiment is defined as a unique set of parameters determining (a) effect size, (b) effect heterogeneity, (c) publication selection, (d) sample size, and (e) (for Carter et al., 2019) questionable research practices (see Appendix 2 in Data S1). Each experiment consists of 3000 simulated meta‐analyses. I² measures the share of effect size variance that is due to heterogeneity in true effects. It is based on τ̂², which we, following Carter et al, estimate using restricted maximum likelihood (REML) [see Equation (6) in the text and the associated discussion].

I² takes values between 0 and 100%. It is often interpreted as a measure of the share of effect size variance that is due to heterogeneity in true effects in the population. However, Augusteijn et al demonstrate that it is affected by publication selection. The effect of publication selection can be large, and it can either increase or decrease the value of I². Our simulations calculate I² post‐publication selection. Whether that vitiates the usefulness of I² in the selection of estimators is an empirical question. In order to induce greater overlap in the simulation environments, we added simulations to the SD&I, B&R, and CSG&H experimental designs that allow for larger sample sizes. These are yellow‐highlighted in the table. This resulted in a total of 1620 experiments, where an experiment is defined as a unique set of parameters determining (a) effect size, (b) effect heterogeneity, (c) publication selection, (d) sample size, and (e) (for CSG&H) questionable research practices. This compares favorably with previous studies (see Table 1). Details about the experiments are reported in Appendix 2 in Data S1.

THE PERFORMANCE MEASURES

We evaluate estimators on four performance measures: (a) Bias, (b) mean squared error (MSE), (c) 95% coverage rates, and (d) Type I error rates. With respect to bias, the average bias for any given experiment k is calculated by

Bias_k = (1/N_k) Σ_{s=1..N_k} (θ̂_ks − θ),

where θ̂_ks is the estimate produced in iteration s, θ is the true mean effect, and N_k is the total number of iterations for that experiment (typically 3000). Note that Bias_k can be positive or negative. When aggregating over experiments to obtain a summary measure of performance, we calculate the average of absolute values, |Bias| = (1/R) Σ_{k=1..R} |Bias_k|, where R is the total number of experiments included in the evaluation. “Best estimator” with respect to bias is the estimator with the smallest value of average |Bias_k|. MSE for a given experiment k is calculated by

MSE_k = (1/N_k) Σ_{s=1..N_k} (θ̂_ks − θ)².

When used as a summary measure of performance, it is averaged over experiments: MSE = (1/R) Σ_{k=1..R} MSE_k. “Best estimator” with respect to MSE is the estimator with the smallest value of MSE. |Coverage‐0.95| is the absolute value of the difference between the coverage rate (the percent of times the 95% confidence interval covers the true effect) and 95%. For example, if the coverage rate in one experiment was 97% and in another it was 93%, the mean absolute difference would be 2%. “Best estimator” with respect to |Coverage‐0.95| is the estimator that produces values closest to 0. The preceding performance measures apply to all 1620 experiments. The last performance measure, the Type I error rate, applies only to experiments where the true value of the mean effect equals 0. It measures how often the estimator finds a statistically significant effect when in fact there is none. Good performance on this criterion is represented by values close to 5%.
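The four measures can be sketched for a single experiment's replications as follows (assuming, for illustration, a symmetric 1.96-SE confidence interval; the intervals of the individual estimators may be constructed differently):

```python
import numpy as np

def performance(estimates, ses, true_effect):
    """Performance measures over one experiment's simulated replications:
    signed average bias, MSE, |coverage - 0.95| for a nominal 95% CI, and
    the Type I error rate (meaningful only when the true effect is zero)."""
    errors = estimates - true_effect
    bias = errors.mean()                                   # can be + or -
    mse = (errors ** 2).mean()
    coverage = (np.abs(errors) <= 1.96 * ses).mean()       # CI covers true effect
    significant = np.abs(estimates / ses) >= 1.96          # test of H0: effect = 0
    return {
        "bias": bias,
        "mse": mse,
        "|coverage-0.95|": abs(coverage - 0.95),
        "type1": significant.mean() if true_effect == 0 else None,
    }

rng = np.random.default_rng(3)
# an unbiased, correctly calibrated estimator over 3000 replications
est = rng.normal(0.0, 0.1, 3000)
res = performance(est, np.full(3000, 0.1), true_effect=0.0)
print(res)
```

For this well-calibrated toy estimator, bias is near zero, coverage is near 95%, and the Type I error rate is near 5%; the estimators in Table 3 often deviate sharply from these benchmarks.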

A caveat about using average performance to assess estimators

A number of the estimators (3PSM, 4PSM, AK1, AK2, pC, pU, and TF) use maximization procedures to produce their estimates. In some cases, the algorithms do not converge and no estimate is produced. This can cause comparisons of average performance to be misleading. Consider the extreme case where all estimators perform well in simulation environment A but poorly in simulation environment B, but one of the estimators always fails to converge in B. That estimator will appear to be superior based on its average performance. Its superior average performance would reflect differences in experiments, and not differences in actual performance. As an aggregate measure, average performance is suggestive, but conclusions regarding performance should always be based on an inspection of performance at the level of individual experiments. Section 6 provides a demonstration of this approach.
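A toy numerical illustration of this caveat, with entirely hypothetical numbers: two estimators with identical error distributions, one of which never returns a result in the hard environment.

```python
import numpy as np

rng = np.random.default_rng(4)

# Environment A is easy (small errors); environment B is hard (large errors).
# Both hypothetical estimators have IDENTICAL error distributions, but
# estimator 2 never converges in B, so its B results are simply missing.
errors_a = rng.normal(0.0, 0.05, 1000)    # environment A errors
errors_b = rng.normal(0.5, 0.05, 1000)    # environment B errors

avg_est1 = np.abs(np.concatenate([errors_a, errors_b])).mean()  # converges everywhere
avg_est2 = np.abs(errors_a).mean()                              # B missing entirely

# Estimator 2's average |bias| looks far better despite identical performance.
print(avg_est1, avg_est2)
```

The gap between the two averages reflects only which experiments entered each average, which is exactly why convergence rates are flagged in Tables 4 and 6.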

RESULTS

Relative performance differs across criteria

Table 3 ranks average performance of the estimators for all 1620 experiments. Results are separated by performance measure (Bias, MSE, etc.). The purpose of this table is not to demonstrate overall superiority for any given estimator. In addition to the problem of convergence rates discussed above, there is too much heterogeneity in these average numbers for them to be very useful. The main purpose of this table is to note that estimators that dominate on one criterion may perform relatively poorly on another.
TABLE 3

Comparison of estimator performance: all experiments

Performance criterion
|Bias|           MSE              |Coverage‐0.95| (a)   Type I Error (b)
EK    0.076      WAAP  0.075      AK2   0.172           AK2   0.113
PP    0.081      PP    0.106      3PSM  0.212           EK    0.236
AK2   0.083      EK    0.107      4PSM  0.258           3PSM  0.243
4PSM  0.090      TF    0.110      PP    0.290           PP    0.267
3PSM  0.101      AK1   0.120      EK    0.297           4PSM  0.274
WAAP  0.109      AK2   0.136      WAAP  0.310           WAAP  0.516
AK1   0.132      3PSM  0.140      AK1   0.346           AK1   0.566
TF    0.140      pU    0.160      TF    0.396           TF    0.586
RE    0.216      4PSM  0.163      RE    0.512           pU    0.589
pU    0.229      RE    0.195      pU    0.578           RE    0.640
pC    0.333      pC    0.608      pC    NA              pC    NA

Note: The values in the table represent the average values of the respective performance measures across all 1620 experiments for the first three columns. The last column only reports results for those experiments where the true mean effect = 0. The three “best” performing estimators on the dimensions of Bias, MSE, and Coverage rates/Type I Error (EK, WAAP, and AK2) are color‐coded to facilitate comparison across performance measures.

Estimators:

3PSM/4PSM = Three‐Parameter/Four‐Parameter Selection Models

AK1 = Andrews and Kasy's “symmetric selection” model

AK2 = Andrews and Kasy's “asymmetric selection” model

EK = Bom and Rachinger's Endogenous Kink estimator

pC = p‐curve

pU = p‐uniform

PP = PET‐PEESE (Stanley and Doucouliagos)

RE = Random Effects

TF = Trim and Fill

WAAP = Stanley et al's Weighted Average of the Adequately Powered‐WLS hybrid estimator.

(a) This column reports the average absolute value of the difference between (a) the percent of times the 95% confidence interval contains the true mean value and (b) 95%.

(b) This column reports the percentage of false positives when the true mean effect = 0; that is, the percent of times an estimate is statistically significant when there is no true effect.

For example, on the dimension of bias, Bom and Rachinger's Endogenous Kink (EK) estimator produces the lowest overall mean absolute bias (“|Bias|”). However, it is dominated by Stanley, Doucouliagos, and Ioannidis's WAAP estimator when it comes to mean squared error (“MSE”), and by Andrews and Kasy's “asymmetric selection” estimator (AK2) with respect to |Coverage‐0.95| and Type I error rates. The table color‐codes the three estimators with the best average performance on the respective criteria to facilitate comparison. It highlights that superior performance on one dimension does not guarantee superior performance on another.
As a result, when choosing an estimator, meta‐analysts should prioritize which performance measure is of most importance. Table 3 also highlights the poor performance of all the estimators when it comes to coverage and Type I error rates. While AK2 may be the “best” performing estimator on these dimensions, the mean absolute deviation between the coverage rate and 95% is 17%. The mean Type I error rate is 11%, compared to an expectation of 5%. The other estimators perform much worse. Most of the estimators have coverage rates below 70% and Type I error rates larger than 25%. This should cause researchers to question the reliability of any hypothesis testing about effect sizes that is performed in meta‐analyses that use these estimators.

Relative performance differs across simulation environments

We next focus on the sensitivity of results to simulation environment. Table 4 collects the results from all 1620 experiments and breaks them out according to each of the four simulation environments and each of the four performance criteria. In both panels, we are looking for consistency in relative performance across simulation environments.
TABLE 4

Comparison of Estimator Performance across Simulation Environments

A. |Bias|
SD&I             A&R              B&R              CSG&H
AK2   0.031      AK2   0.200      AK2   0.071      PP    0.058
3PSM  0.036      EK    0.213      EK    0.089      WAAP  0.062
4PSM  0.040      PP    0.256      4PSM  0.099      AK1   0.064
PP    0.050      WAAP  0.263      PP    0.124      EK    0.071
EK    0.053      TF    0.284      3PSM  0.147      3PSM  0.080
AK1   0.060      4PSM  0.298      WAAP  0.187      TF    0.091
WAAP  0.083      AK1   0.390      TF    0.238      4PSM  0.095
TF    0.088      3PSM  0.468      AK1   0.262      pU    0.105
RE    0.107      RE    0.550      RE    0.361      AK2   0.107
pU    0.146      pC    1.530      pU    0.373      pC    0.114
pC    0.420      pU    1.556      pC    0.521      RE    0.150

Note: The four panels rank the performance of the 11 estimators on the basis of their average Bias, MSE, |Coverage‐0.95|, and Type I Error performance, disaggregated by simulation environment. Estimators are ranked from “best” (least Bias, smallest MSE, etc.) to worst. Values in the tables are the average values for the respective performance measures and simulation environments. In each panel the best and second best performing estimators in the SD&I environments are color‐coded brown and gray, respectively. This allows one to track their relative performance across the remaining three simulation environments.

It is important to note that the maximization procedures that underlie some of the estimators do not always converge. Averages across estimators will not be comparable if they average across different experiments due to lack of convergence. The table therefore distinguishes three types of convergence behavior: boldfaced estimators converged 99% of the time or more (eg, AK1); conventional, non‐boldfaced type indicates convergence between 90% and 99% of the time (eg, AK1); and italicized estimators converged less than 90% of the time (eg, AK1).

This panel reports the average, absolute value of the difference between (i) the percent of times the 95% confidence interval contains the true mean value and (ii) 95%.

This panel reports the percentage of false positives when the true mean effect = 0; that is, the percent of times an estimate is statistically significant when there is no true effect.

In Panel A, which reports performance with respect to bias, simulations for three of the four simulation environments show that the AK2 estimator has best average performance, as measured by smallest mean absolute bias. However, in the CSG&H simulations, the AK2 estimator ranks 9th of 11.
The 3PSM estimator ranks second in the SD&I simulations with respect to mean absolute bias, but 8th in A&R's, 5th in B&R's, and 5th in CSG&H's simulations. These inconsistencies are not unusual. Panel B ranks average performance of the estimators with respect to MSE. The AK2 estimator ranks 1st in the SD&I and A&R simulations, but 6th in B&R's simulations and 9th in CSG&H's. Other inconsistencies across simulation environments are easily found in Panels C and D. Table 1 demonstrated that it was difficult to draw guidance from previous studies about which estimator to use because there was little overlap in the estimators being compared. Table 4 identifies a more critical issue: even when the same estimators are being compared, one can obtain different results depending on which simulation design is used. This raises the question: what factors are responsible for these differences? A full treatment of the question lies beyond the scope of this study. However, we make an initial effort at answering it by focusing on two features of the simulation designs: the number of estimates in the simulated MA samples (“sample size”) and the extent of effect size heterogeneity, as measured by I². Table 2 highlights that the different simulation environments differ on these dimensions. If these two features systematically affect estimator performance, then differences in the combinations of sample size and effect heterogeneity would provide at least a partial explanation for the differences in average performance across simulation environments.

The influence of sample size and effect heterogeneity on relative estimator performance

It is well‐known that estimator performance generally declines as effect heterogeneity increases and improves as the meta‐analyst's sample size gets larger (Moreno et al; Stanley). Less well‐known is that relative estimator performance is also affected by these factors. In this section we demonstrate both phenomena. We use the results from the CSG&H simulations to estimate the following regressions for each of the 11 estimators j:

|Bias|_jk = β0,j + βSampleSize,j × SampleSize_k + βI²,j × I²_k + ε_jk,  (8a)

MSE_jk = β0,j + βSampleSize,j × SampleSize_k + βI²,j × I²_k + ν_jk.  (8b)

Regressions were estimated using OLS with bootstrapped t‐statistics to obtain P‐values. Each regression used the Bias/MSE results for a given estimator j; the respective samples were constructed from the individual results of the 756 experiments in the CSG&H simulations (see Panel D of Table 2). Table 5 presents the results. They provide strong evidence that Bias and MSE increase as effect heterogeneity (I²) increases. With only one exception, the coefficient on the I² term is positive and significant in both the Bias and MSE regressions for each of the 11 estimators. The exception is the I² coefficient in the MSE regression for the p‐curve estimator (pC). Sample size is also strongly associated with MSE performance: it is negatively and significantly associated with MSE for each of the 11 estimators. The evidence for sample size affecting Bias is not as strong. Still, 9 of the 11 estimated coefficients are negative, with 5 of the 11 negative and significant at the 5% level.
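The estimation strategy of Equations (8a)/(8b) can be sketched with a pairs bootstrap on synthetic experiment-level data (the data-generating numbers below are invented for illustration and do not reproduce Table 5):

```python
import numpy as np

rng = np.random.default_rng(5)

def ols_pairs_bootstrap(y, x, n_boot=500):
    """OLS of y on [1, x] with pairs-bootstrap standard errors, mirroring
    the estimation of Equations (8a)/(8b)."""
    design = np.column_stack([np.ones(len(y)), x])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    draws = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, len(y), len(y))            # resample {y, x} pairs
        draws[b] = np.linalg.lstsq(design[idx], y[idx], rcond=None)[0]
    return beta, draws.std(axis=0)

# synthetic stand-in for the 756 experiment-level CSG&H results
n = 756
sample_size = rng.choice([10, 30, 60, 100, 200, 400, 800], n) / 1000  # scaled as in Table 5
i2 = rng.uniform(0.0, 1.0, n)
abs_bias = 0.01 - 0.02 * sample_size + 0.15 * i2 + rng.normal(0, 0.02, n)

beta, boot_se = ols_pairs_bootstrap(abs_bias, np.column_stack([sample_size, i2]))
print(beta, boot_se)   # beta[1]: sample-size slope, beta[2]: I^2 slope
```

With the assumed data-generating process, the recovered slopes have the signs reported in Table 5: negative on sample size and positive on I².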
TABLE 5

Sample size and effect heterogeneity as determinants of absolute estimator performance: CSG&H simulation environment

Estimator    Bias: βSampleSize       Bias: βI²               MSE: βSampleSize        MSE: βI²
AK1          −0.0143*   (0.0074)     0.1147***  (0.0068)     −0.0101***  (0.0018)     0.0240***  (0.0018)
4PSM          0.0112    (0.0116)     0.2214***  (0.0098)     −0.0160***  (0.0045)     0.0812***  (0.0040)
3PSM          0.0029    (0.0101)     0.1624***  (0.0088)     −0.0156***  (0.0034)     0.0536***  (0.0030)
WAAP         −0.0366*** (0.0091)     0.1163***  (0.0078)     −0.0235***  (0.0027)     0.0344***  (0.0024)
TF           −0.0150    (0.0121)     0.1555***  (0.0093)     −0.0121***  (0.0042)     0.0413***  (0.0033)
AK2          −0.0355**  (0.0168)     0.1883***  (0.0122)     −0.0458***  (0.0099)     0.0676***  (0.0075)
PP           −0.0206*** (0.0069)     0.0868***  (0.0060)     −0.0443***  (0.0040)     0.0468***  (0.0037)
RE           −0.0222    (0.0178)     0.2180***  (0.0151)     −0.0139**   (0.0083)     0.0927***  (0.0071)
EK           −0.0286*** (0.0058)     0.0125***  (0.0058)     −0.055***   (0.0041)     0.0399***  (0.0039)
pU           −0.0180    (0.0124)     0.1352***  (0.0121)     −0.0182***  (0.0050)     0.0575***  (0.0053)
pC           −0.0403*** (0.0151)     0.1360***  (0.0150)     −0.1140***  (0.0302)    −0.0025    (0.0318)

Note: The table reports the results of estimating Equations (8a) and (8b) in the text. Regressions were estimated using OLS with bootstrapped t‐statistics to obtain p‐values. Each regression used the Bias/MSE results for a given estimator j. The respective samples were constructed from the individual results of the 756 experiments in the Carter, Schönbrodt, Gervais, and Hilgard simulations. Bootstrap SEs are reported in parentheses. When estimating the model we use Sample size/1000. This transformation increases the size of βSampleSize by a factor of 1000, but leaves economic and statistical significance unchanged.

While Table 5 documents changes in absolute estimator performance, Table 6 presents evidence of changes in relative performance. Once again we use the CSG&H simulation results and focus on Bias and MSE. We divide the 756 CSG&H experiments into 21 separate cells depending on sample size (10, 30, 60, 100, 200, 400, 800) and effect heterogeneity (I² ≤ 0.25, 0.25 < I² ≤ 0.75, 0.75 < I²). Panel D of Table 2 reports the number of experiments for each {sample size, I²} cell.
TABLE 6

The relationship between relative estimator performance, sample size, and I²: CSG&H simulation environment

A. Sample size = 10
Bias                                                 MSE
Low I²         Moderate I²     High I²              Low I²         Moderate I²     High I²
AK1   0.028    4PSM  0.071     AK1   0.097          AK1   0.006    AK1   0.027     AK1   0.027
4PSM  0.033    3PSM  0.074     WAAP  0.104          3PSM  0.007    pU    0.042     TF    0.045
3PSM  0.035    PP    0.087     TF    0.110          4PSM  0.008    TF    0.043     WAAP  0.057
WAAP  0.040    AK1   0.088     AK2   0.119          TF    0.010    3PSM  0.043     3PSM  0.069
TF    0.042    EK    0.098     3PSM  0.146          WAAP  0.010    WAAP  0.046     RE    0.075
AK2   0.047    pU    0.107     EK    0.153          AK2   0.010    4PSM  0.049     AK2   0.078
PP    0.063    WAAP  0.112     PP    0.160          RE    0.018    RE    0.068     4PSM  0.093
RE    0.082    TF    0.127     4PSM  0.177          PP    0.021    PP    0.081     pU    0.114
pU    0.090    pC    0.147     RE    0.179          pU    0.023    EK    0.092     pC    0.164
EK    0.101    AK2   0.160     pU    0.253          EK    0.030    AK2   0.102     PP    0.209
pC    0.150    RE    0.188     pC    0.270          pC    0.278    pC    0.203     EK    0.220

Note: The panels above rank the performance of the 11 estimators on the basis of their average Bias and MSE performance, disaggregated by {sample size, effect heterogeneity} categories. Estimators are ranked from “best” (least Bias, smallest MSE) to worst. Values in the tables are the average values for the respective performance measures and {sample size, effect heterogeneity} categories. For both Bias and MSE, the top two estimators in the cell for the smallest sample size (10) and lowest effect heterogeneity (low I²) are identified by color‐coding. For Bias, these are the AK1 and 4PSM estimators. For MSE, they are AK1 and 3PSM. The relative positions of these estimators are then tracked as sample size and effect heterogeneity increase.

It is important to note that the maximization procedures that underlie some of the estimators do not always converge. Averages across estimators will not be comparable if they average across different experiments due to lack of convergence. The table therefore distinguishes three types of convergence behavior: boldfaced estimators converged 99% of the time or more (eg, AK1); conventional, non‐boldfaced type indicates convergence between 90% and 99% of the time (eg, AK1); and italicized estimators converged less than 90% of the time (eg, AK1).

For both Bias and MSE, we identify the top two estimators in the cell for the smallest sample size (10) and lowest effect heterogeneity (low I²). For Bias, these are the AK1 and 4PSM estimators. For MSE, they are AK1 and 3PSM. We then track the relative position of these estimators as sample size and effect heterogeneity increase. The respective estimators are color‐coded to facilitate tracking across cells.
Table 6 clearly reveals that there is substantial movement in the relative rankings of average estimator performance as sample size and effect heterogeneity change. For the sake of brevity, we only report results for Bias and MSE. In some cases, the change in relative ranking is dramatic. When sample size = 10, the 4PSM estimator ranks 2nd and 1st on Bias, respectively, for low and moderate effect heterogeneity. It falls to 9th when effect heterogeneity is high. In other cases, relative performance is relatively stable: Across all sample sizes, AK1 is either ranked 1st or 2nd in terms of smallest average MSE. The table demonstrates two things. It underscores a point made previously that no estimator dominates in all research settings. However, it also suggests that there may be circumstances where one estimator is generally preferred. For example, if a researcher is interested in estimator efficiency and works in an area where effect heterogeneity is expected to be high, and if the researcher is convinced that the CSG&H simulation environment captures the key elements of their research situation, then Table 6 suggests that AK1 may be the best estimator for their analysis. However, the Table 6 results are based on average performance within a given {sample size, effect heterogeneity} cell. As demonstrated previously, averages can conceal much variation. The next section illustrates how further investigation can lead to a more definitive conclusion regarding “best” estimator.

AN EXAMPLE OF HOW SIMULATION EXPERIMENTS CAN GUIDE THE SELECTION OF A “BEST” ESTIMATOR

Previous sections demonstrated that there is no superior estimator for all research situations. “Best” is conditional on the performance measure and depends on observable characteristics of the meta‐analyst's sample, such as sample size and effect heterogeneity. It can also depend on unobservable characteristics such as the type of publication selection (statistical significance, correct sign, or both), the extent of publication selection, and other factors such as assorted questionable research practices (QRPs). By conditioning on observables and investigating performance over unobservables, one can study the relative performance of estimators and use the results to guide estimator selection for a given research situation. This section demonstrates how this can be done. Suppose a meta‐analyst is studying the empirical literature on a given “effect,” measured by Cohen's d. They collect a sample of 100 estimates. Initial analysis indicates a high degree of effect heterogeneity (0.75 < I²). While they are unsure whether publication selection is a problem, they believe that if it does exist, selection would depend on both correct sign and statistical significance. Looking over the alternatives, it is their experienced judgment that the CSG&H simulation environment best captures the salient aspects of their research situation. However, they do not have strong priors about the size of the effect, the severity of publication selection, or the extent of QRPs. While they would like an estimator that minimizes bias, produces accurate coverage rates, and provides reliable tests of significance, their main priority is choosing an estimator that is efficient. We show how simulation results can be used to guide that selection. Table 7 reports the individual experimental results for sample size = 100 and high I². There are a total of 30 experimental results (cf.
Table 2), covering a wide range of effect sizes {0, 0.2, 0.5, 0.8}, severities of publication selection {No, Medium, Strong}, and QRP behaviors {None, Medium, High} (see Appendix 2 in Data S1). We suppose the meta‐analyst is interested in not just average MSE performance, but also the variation of MSE across situations. Since they do not know which of the respective experiments best represents their research situation, they want to avoid an estimator that occasionally produces a bad result, even if it does well on average.
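The selection logic can be sketched as follows, using a handful of illustrative MSE values in the spirit of Table 7 (the specific numbers and the min-max decision rule are assumptions made for demonstration):

```python
import numpy as np

# Illustrative MSE values for three estimators across four experiments that
# all match the observables {sample size = 100, high I^2}; these are
# hypothetical stand-ins, not the full Table 7.
mse = {
    "AK1": [0.002, 0.001, 0.029, 0.081],
    "EK":  [0.025, 0.034, 0.024, 0.043],
    "RE":  [0.002, 0.003, 0.136, 0.277],
}

# Compare average MSE and worst-case MSE across the matching experiments.
summary = {name: (float(np.mean(v)), max(v)) for name, v in mse.items()}

# A meta-analyst who cannot tell which experiment matches their situation
# may prefer the estimator with the smallest worst-case MSE (min-max rule).
robust_choice = min(summary, key=lambda name: summary[name][1])
print(summary)
print(robust_choice)
```

In this toy comparison AK1 wins on average MSE, but EK has the smallest worst case, which echoes the pattern in the (Smallest, Largest) row of Table 7, where EK's range is the narrowest.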
TABLE 7

Comparison of MSE performance: sample size = 100, high I², CSG&H simulation environment

{Effect size, I², QRP, Selection}   TF     pC     pU     RE     3PSM   4PSM   AK1    AK2    WAAP   PP     EK
{0, 0.822, None, No}                0.005  0.207  0.203  0.002  0.004  0.008  0.002  0.012  0.007  0.019  0.025
{0.2, 0.821, None, No}              0.006  0.110  0.106  0.002  0.004  0.008  0.002  0.012  0.017  0.019  0.025
{0.5, 0.818, None, No}              0.007  0.032  0.029  0.002  0.004  0.008  0.003  0.011  0.014  0.012  0.027
{0.8, 0.810, None, No}              0.010  0.004  0.003  0.002  0.005  0.008  0.004  0.009  0.010  0.013  0.029
{0, 0.864, Med, No}                 0.004  0.101  0.096  0.003  0.020  0.028  0.001  0.025  0.005  0.030  0.034
{0.2, 0.856, Med, No}               0.006  0.046  0.040  0.003  0.032  0.037  0.003  0.034  0.016  0.035  0.039
{0.5, 0.835, Med, No}               0.012  0.008  0.006  0.003  0.048  0.066  0.012  0.079  0.015  0.027  0.043
{0.8, 0.805, Med, No}               0.019  0.003  0.004  0.003  0.051  0.079  0.021  0.083  0.010  0.016  0.039
{0, 0.879, High, No}                0.003  0.070  0.065  0.004  0.040  0.050  0.001  0.043  0.005  0.042  0.045
{0.2, 0.869, High, No}              0.006  0.029  0.024  0.005  0.063  0.065  0.004  0.055  0.015  0.050  0.052
{0.5, 0.837, High, No}              0.010  0.005  0.003  0.006  0.096  0.109  0.018  0.120  0.013  0.034  0.052
{0.8, 0.803, High, No}              0.020  0.005  0.007  0.004  0.109  0.142  0.031  0.142  0.011  0.019  0.047
{0, 0.769, None, Med}               0.022  0.056  0.055  0.053  0.009  0.020  0.021  0.009  0.020  0.014  0.010
{0, 0.763, Med, Med}                0.047  0.009  0.009  0.100  0.006  0.018  0.014  0.009  0.020  0.011  0.008
{0, 0.757, High, Med}               0.061  0.003  0.003  0.125  0.010  0.012  0.013  0.005  0.021  0.013  0.010
{0, 0.920, None, Med}               0.031  0.201  0.197  0.103  0.006  0.086  0.041  0.023  0.040  0.030  0.027
{0.2, 0.859, None, Med}             0.034  0.107  0.103  0.106  0.009  0.063  0.027  0.022  0.033  0.034  0.022
{0.5, 0.785, None, Med}             0.012  0.030  0.028  0.058  0.007  0.021  0.011  0.018  0.016  0.012  0.017
{0.8, 0.774, None, Med}             0.003  0.003  0.003  0.023  0.004  0.007  0.003  0.012  0.007  0.008  0.026
{0, 0.908, Med, Med}                0.050  0.100  0.090  0.136  0.072  0.151  0.029  0.048  0.040  0.027  0.024
{0.2, 0.829, Med, Med}              0.036  0.045  0.038  0.119  0.060  0.146  0.009  0.060  0.030  0.030  0.021
{0, 0.901, High, Med}               0.060  0.069  0.060  0.153  0.108  0.136  0.026  0.042  0.042  0.030  0.027
{0.2, 0.816, High, Med}             0.038  0.029  0.022  0.124  0.111  0.145  0.006  0.067  0.030  0.032  0.022
{0, 0.755, None, Strong}            0.071  0.056  0.056  0.133  0.010  0.009  0.036  (a)    0.033  0.030  0.011
{0, 0.895, None, Strong}            0.116  0.202  0.196  0.255  0.019  0.101  0.081  (a)    0.087  0.074  0.043
{0.2, 0.807, None, Strong}          0.080  0.108  0.104  0.185  0.018  0.043  0.050  (a)    0.064  0.053  0.023
{0.5, 0.759, None, Strong}          0.019  0.030  0.028  0.083  0.009  0.013  0.014  (a)    0.023  0.015  0.014
{0.8, 0.768, None, Strong}          0.002  0.003  0.003  0.031  0.005  0.006  0.003  (a)    0.008  0.008  0.027
{0, 0.843, Med, Strong}             0.130  0.101  0.090  0.271  0.051  0.053  0.056  (a)    0.081  0.060  0.034
{0, 0.823, High, Strong}            0.135  0.071  0.060  0.277  0.041  0.041  0.051  (a)    0.080  0.057  0.031
Average MSE                         0.035  0.062  0.058  0.079  0.034  0.056  0.020  0.041  0.027  0.028  0.028
Smallest                            0.002  0.003  0.003  0.002  0.004  0.006  0.001  0.005  0.005  0.008  0.008
Largest                             0.135  0.207  0.203  0.277  0.111  0.151  0.081  0.142  0.087  0.074  0.052

Note: This table reports estimator MSE performance results for the 30 experiments included within the {sample size = 100, high I²} category of the CSG&H simulations. The estimators are described in Section 2 of the text. The first column gives details about the individual experiment (cf. the bottom panel in Appendix 2 in Data S1). Each cell represents results for a single experiment consisting of 3000 simulated meta‐analyses. Each simulated meta‐analysis produces a single estimate of the mean population effect. The numbers in the table are the averaged mean squared error (MSE) values for the 3000 simulated meta‐analyses for that estimator and experiment. The final rows of the table report the overall average MSE, followed by the smallest and largest (average) MSE values over the 30 experiments. Yellow‐highlighted cells in the upper panel of the table identify the smallest (average) MSE for each experiment. The yellow‐highlighted cell in the bottom panel identifies the estimator (AK1) with the lowest overall averaged MSE value. The blue‐highlighted cells identify estimators that are close to AK1 in terms of overall performance.

(a) Indicates that all estimates failed to converge for that experiment.
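As the table note explains, each cell averages squared errors over the 3000 simulated meta-analyses. In symbols, with R = 3000 replications, true mean population effect θ, and θ̂ᵣ the estimate from replication r, the reported value is the standard MSE, which decomposes into squared bias plus variance:

```latex
\mathrm{MSE}
= \frac{1}{R}\sum_{r=1}^{R}\bigl(\hat{\theta}_r - \theta\bigr)^2
= \underbrace{\bigl(\bar{\hat{\theta}} - \theta\bigr)^2}_{\text{squared bias}}
+ \underbrace{\frac{1}{R}\sum_{r=1}^{R}\bigl(\hat{\theta}_r - \bar{\hat{\theta}}\bigr)^2}_{\text{variance}}
```

where θ̄̂ denotes the mean of the R estimates. This decomposition is why an estimator can rank well on bias yet poorly on MSE (or vice versa): MSE also penalizes dispersion.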

The top part of the table reports the individual MSE experimental results. We yellow‐highlight the minimum MSE value in each experiment. Of the 11 estimators, all but two (WAAP and PP) are "best" in at least one experiment. This again highlights the fact that no estimator is best in all research situations. To assist in processing the large amount of information in the table, we report average performance for the 30 experiments, along with minimum and maximum MSE values, at the bottom of the table.
Given that the researcher does not know which simulated situation best represents their actual research situation, they first consider the estimator with the lowest overall average MSE. That is the AK1 estimator, with an overall average of 0.020. The next best estimator is the WAAP, with an overall average of 0.027. AK1 also takes on a relatively narrow range of values across the 30 experiments: its minimum is 0.001 and its maximum is 0.081. This compares favorably with most, but not all, of the other estimators. For example, Bom and Rachinger's EK estimator, while producing a slightly larger overall average MSE (0.028), takes on a narrower set of values (minimum = 0.008, maximum = 0.052). The WAAP and PP estimators have similar characteristics.

With respect to AK1, it is worth noting that simulations will tend to be biased toward selection models, because selection models are designed to capture the very kinds of behaviors built into the simulation's selection algorithms. This is not necessarily a bad thing. However, to the extent that actual publication selection behavior differs from simulated selection behavior, the results may overstate the performance of selection models in real‐world datasets.

The researcher's choice thus comes down to a trade‐off between mean and dispersion, a choice complicated by the fact that randomness in the simulation process cautions against attaching too much significance to small numerical differences. We propose one possible solution: the researcher chooses the AK1 estimator as best (yellow‐highlighted), while also choosing one or two other estimators (WAAP, PP, EK; highlighted in blue) for robustness checking.
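The selection rule described above (lowest average MSE, with near-ties kept for robustness checking) can be sketched as a small summary step. This is a hypothetical helper, not code from the paper: the function `summarize` and the 0.010 tolerance are illustrative choices, and only a few estimators' Table 7 averages are included as single-value stand-ins for the full 30-experiment columns.

```python
def summarize(mse_by_estimator, tolerance=0.010):
    """For each estimator, compute (average, min, max) MSE across experiments,
    pick the estimator with the lowest average, and shortlist any others whose
    average falls within `tolerance` of it (candidates for robustness checks)."""
    summary = {name: (sum(v) / len(v), min(v), max(v))
               for name, v in mse_by_estimator.items()}
    best = min(summary, key=lambda n: summary[n][0])
    shortlist = [n for n in summary
                 if n != best and summary[n][0] - summary[best][0] <= tolerance]
    return best, shortlist, summary

# Illustrative inputs: overall average MSEs from the bottom row of Table 7,
# each wrapped in a one-element list in place of the full 30-experiment column.
table7_avgs = {
    "TF": [0.035], "RE": [0.079], "AK1": [0.020],
    "WAAP": [0.027], "PP": [0.028], "EK": [0.028],
}
best, shortlist, _ = summarize(table7_avgs)
# best == "AK1"; WAAP, PP, and EK fall within the 0.010 tolerance
```

With the full per-experiment columns in place of the one-element lists, the returned (min, max) values would reproduce the Smallest/Largest rows of the table, letting the researcher weigh the mean-versus-dispersion trade-off directly.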

CONCLUSION

The subject of MA estimator performance has received much attention in the literature (Alinaghi and Reed; Bom and Rachinger; Carter et al.; Hedges and Vevea; McShane et al.; Moreno et al.; Rücker et al.; Simonsohn et al.; Stanley; Stanley and Doucouliagos; Stanley et al.; van Aert et al.; van Assen et al.). A goal of many of these studies has been to find a "best" estimator. However, there is an increasing awareness that no single estimator is "best" in all circumstances (Carter et al.). Unfortunately, the way previous studies have been conducted and reported has not been conducive to guiding meta‐analysts toward the best estimator for their particular research applications. Different studies examine different sets of estimators, making it difficult to aggregate results across studies. They employ different experimental designs with different features, without being clear about which features are important for estimator performance. Finally, they typically do not distinguish between the influence of observable and unobservable characteristics. For example, knowing that an estimator performs well when a MA sample is unaffected by publication selection, but poorly when it is affected, is not helpful if the meta‐analyst cannot observe whether their particular sample has this problem. What is missing is something akin to a flow‐chart that maps observable characteristics to experimental results, which the meta‐analyst could then use to select the best estimator for their situation. This study contributes toward that goal. We demonstrate how two characteristics that can be observed by the meta‐analyst, the number of estimates in the meta‐analyst's sample ("sample size") and the degree of effect heterogeneity as measured by I², can be used to guide the meta‐analyst to the experimental results that are most germane to their research application.
We construct an extensive database of 1620 experiments and give an example of how the database can be used to select a "best" estimator, or a best set of estimators. In our example of sample size = 100 and I² > 0.75, we find that Andrews and Kasy's symmetric selection estimator ("AK1") performs best with respect to minimizing MSE, closely followed by Bom and Rachinger's "EK" estimator; Stanley, Doucouliagos, and Ioannidis' WAAP estimator; and Stanley and Doucouliagos' PET‐PEESE estimator. However, this example assumes the simulation design of Carter et al. Other simulation designs with the same sample size and effect heterogeneity give different results. Thus, a major challenge going forward is to gain a better understanding of the factors that determine estimator performance. A final contribution of our study is that we have made all of our experimental results publicly accessible via a ShinyApp at https://hong-reed.shinyapps.io/HongReedInteractiveTables. Table 7 presented the results of 30 Monte Carlo experiments from the Carter, Schönbrodt, Gervais, and Hilgard simulation environment for sample sizes of 100 and high effect heterogeneity. The online results allow researchers to explore other scenarios that may be more relevant for their particular research situations.

CONFLICT OF INTEREST

The authors declare no conflicts of interest.

SUPPORTING INFORMATION

Data S1. Supplementary Information.
REFERENCES (10 of 18 shown)

1. Duval S, Tweedie R. Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics. 2000.

2. Rücker G, Schwarzer G, Carpenter JR, Binder H, Schumacher M. Treatment-effect estimates adjusted for small-study effects via a limit meta-analysis. Biostatistics. 2010.

3. Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Comparison of two methods to detect publication bias in meta-analysis. JAMA. 2006.

4. Ioannidis JPA, Trikalinos TA. An exploratory test for an excess of significant findings. Clin Trials. 2007.

5. Stanley TD, Doucouliagos H. Meta-regression approximations to reduce publication selection bias. Res Synth Methods. 2013.

6. van Assen MALM, van Aert RCM, Wicherts JM. Meta-analysis using effect size distributions of only statistically significant studies. Psychol Methods. 2014.

7. Bom PRD, Rachinger H. A kinked meta-regression model for publication bias correction. Res Synth Methods. 2019.

8. Stanley TD, Doucouliagos H, Ioannidis JPA. Finding the power to reduce publication bias. Stat Med. 2017.

9. Augusteijn HEM, van Aert RCM, van Assen MALM. The effect of publication bias on the Q test and assessment of heterogeneity. Psychol Methods. 2018.

10. van Aert RCM, Wicherts JM, van Assen MALM. Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspect Psychol Sci. 2016.