Literature DB >> 32459806

Reproducibility of individual effect sizes in meta-analyses in psychology.

Esther Maassen¹, Marcel A L M van Assen^1,2, Michèle B Nuijten¹, Anton Olsson-Collentine¹, Jelte M Wicherts¹.

Abstract

To determine the reproducibility of psychological meta-analyses, we investigated whether we could reproduce 500 primary study effect sizes drawn from 33 published meta-analyses based on the information given in the meta-analyses, and whether recomputations of primary study effect sizes altered the overall results of the meta-analysis. Results showed that almost half (k = 224) of all sampled primary effect sizes could not be reproduced based on the reported information in the meta-analysis, mostly because of incomplete or missing information on how effect sizes from primary studies were selected and computed. Overall, this led to small discrepancies in the computation of mean effect sizes, confidence intervals and heterogeneity estimates in 13 out of 33 meta-analyses. We provide recommendations to improve transparency in the reporting of the entire meta-analytic process, including the use of preregistration, data and workflow sharing, and explicit coding practices.

Entities: Chemical Disease Gene Species

Year: 2020 PMID： 32459806 PMCID： PMC7252608 DOI： 10.1371/journal.pone.0233107

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

The ever-increasing growth of scientific publication output [1] has increased the need for -and use of- systematic reviewing of evidence. Meta-analysis is a widely used method to synthesize quantitative evidence from multiple primary studies. Meta-analysis involves a set of procedural and statistical techniques to arrive at an overall effect size estimate, and can be used to inspect whether study outcomes differ systematically based on particular study characteristics [2]. Careful considerations are needed when conducting a meta-analysis because of the many (sometimes arbitrary) decisions and judgments that one has to make during various stages of the research process [3]. Procedural differences in coding primary studies could lead to variation in results, thus potentially affecting the validity of drawn conclusions [4]. Likewise, meta-analysts often need to perform complex computations to synthesize primary study results, which increases the risk of faulty data handling and erroneous estimates [5]. When these decisions and calculations are not carefully undertaken and specifically reported, the methodological quality of the meta-analysis cannot be assessed [6,7]. Additionally, reproducibility (i.e., reanalyzing the data by following reported procedures and arriving at the same result) is undermined by reporting errors and by inaccurate, inconsistent, or biased decisions in calculating effect sizes. Research from various fields has demonstrated issues arising from substandard reporting practices. Transformations from primary study effect sizes to the standardized mean difference (SMD) used in 27 biomedical meta-analyses were found to be commonly inaccurate [8], leading to irreproducible pooled meta-analytic effect sizes. Randomized controlled trials often contain an abundance of outcomes, leaving many opportunities for cherry-picking outcomes that can influence conclusions in various ways [9,10]. Similar evidence of incomplete and nontransparent meta-analytic reporting practices was found in the organizational sciences [11-13]. However, the numerous choices and judgment calls meta-analysts need to make do not always influence meta-analytic effect size estimates [14]. Finally, a severe scarcity of relevant information related to effect size extraction, coding, and adhering to reporting guidelines hinders and sometimes obstructs reproducibility in psychological meta-analyses [15]. This project builds upon previous efforts examining reproducibility of psychological primary study effect sizes and associated meta-analytic outcomes. In the first part, we considered effect sizes reported in 33 randomly chosen meta-analytic articles from psychology, and searched for the corresponding primary study articles to examine whether we could recompute 500 effect sizes reported in the meta-analytic articles; we refer to this as primary study effect size reproducibility. In the second part, we considered if correcting any errors in these primary study effect sizes affected main meta-analytic outcomes. Specifically, we looked at the estimate of the average effect size and its confidence interval, and heterogeneity parameter τ2. We refer to this as meta-analysis reproducibility. Although we acknowledge that more aspects in a meta-analysis could be checked for reproducibility (e.g., search strategy, application of inclusion and exclusion criteria), we focus here on the effect sizes and analyses thereof. While primary study effect size reproducibility is important for assessing the accuracy of the specific effect size calculations, meta-analytic reproducibility bears on the overall conclusions drawn from the meta-analysis. Without appropriate reporting on what was done and what results were found it is untenable to determine the validity of the meta-analytic results and conclusion [16].

Part 1: Primary study effect size reproducibility

In Part 1, we documented primary study effect sizes as they are reported in meta-analyses (i.e., in a data table) and attempted to reproduce them based on the calculation methods specified in the meta-analysis and the estimates reported in the primary study articles. There exist several reasons why primary study effect sizes might not be reproducible. First, the primary study article may lack sufficient information to reproduce the effect size (e.g., missing standard deviations). Second, it might be unclear which information from the primary study was used to compute the effect size. That is, multiple statistical results may be reported in the paper, and ambiguous reporting in the meta-analytic paper might obscure which information from the primary study was used in the computation or which calculation steps were performed to standardize the effect size. Finally, it could also be that retrieval, calculation, or reporting errors were made during the meta-analysis. We hypothesized that a sizeable proportion of reproduced primary effect sizes would be different from the original calculation of the authors because of ambiguous reporting or errors in effect size transformations [8]. We expected more discrepancies in effect size estimates that require more calculation steps (i.e., SMDs) compared to effect sizes that are often extracted directly (i.e., correlations). We also expected more errors in unpublished primary studies compared to published primary studies, because the former are less likely to adhere to strict reporting standards and are sometimes not peer-reviewed. Our goal was to document the percentage of irreproducible effect sizes and to categorize these irreproducible effect sizes as being incomplete (i.e., not enough information is available), incorrect (i.e., an error was made), or ambiguous (i.e., it is unclear which effect size or standardization was chosen). The hypotheses, design, and analysis plan of our study were preregistered and can be found at https://osf.io/v2m9j. In this paper, we focus only on primary study effect size and meta-analysis reproducibility. Additional preregistered results concerning reporting standards can be found in S1 File (https://osf.io/pf4x9/). We deviated from our preregistration in some ways. First, we extended our preregistration by also checking whether irreproducible primary study effect sizes affected the meta-analytic estimate of heterogeneity τ2, in addition to the already preregistered meta-analytic pooled effect size estimate and its confidence interval. We also checked the differences in reproducibility of primary study effect sizes that we classified as outliers compared to those we classified as non-outliers, which was not included in the preregistration. Another deviation from the preregistration was that we did not explore the between-study variance across meta-analyses with and without moderators. The reason for this is that the Q and I2 statistics tend to be biased in certain conditions when true effect sizes are small and when publication bias and the number of studies is large [17], and we did not take that into account when we preregistered our study.

Method

Sample

Meta-analysis selection

The goal of the meta-analysis selection was to obtain a representative sample of psychological meta-analyses. We therefore included a sample from the PsycArticles database from 2011 and 2012 that matched the search criteria “meta-a*”, “research synthesis”, “systematic review”, or “meta-anal*”. Only meta-analyses that contained a data table with primary studies, sample sizes and effect sizes, and had at least ten primary studies were included. Earlier research by Wijsen (https://osf.io/xswvg/) and [18] already drew random samples from the eligible meta-analytic articles, resulting in 33 meta-analyses. For this study we used the same sample of 33 meta-analyses to assess reproducibility. A list of meta-analyses, a detailed sampling scheme and flowcharts can be found in S2 File: https://osf.io/43ju5/. In total, we selected 33 meta-analytic articles containing 1,978 primary study effect sizes, of which we sampled 500 (25%) primary study effect sizes to reproduce. We decided on 500 primary study effect sizes because of feasibility constraints given the substantial time needed to fully reproduce effect sizes. The coding process entailed selecting, retrieving, and recomputing each effect size by two independent coders (EM and AOC). Differences or disagreements in coding were mostly due to both coders selecting different effects from the primary study, due to ambiguous reporting in the meta-analysis. These disagreements were solved after discussion and one primary study effect size was chosen. The interrater reliability for which category the effect size belonged to (i.e., a reproduced, different, incomplete or ambiguous effect) was κ = .71 [19]. We estimate the total time spent on selecting, retrieving, recomputing, verifying, and discussing all primary study effect sizes to be over 500 hours.

Primary study selection

We wanted to ensure that our sample of primary studies would give the most accurate reflection of the reproducibility of primary effect sizes. There are specific primary study characteristics that may relate to reproducibility. For instance, some data errors in meta-analyses are likely to result in outliers that inflate variance estimates [20]. Since outliers occur less often than non-outliers and we wanted to ensure a fair distribution of types of effect sizes, we oversampled primary study effect sizes that could be considered outliers compared to the rest of the primary studies in the meta-analysis. We later corrected for this oversampling. More information on how we classified outlier primary study effect size is given in S2 File: https://osf.io/43ju5/. Detailed information on how we corrected for oversampling, including an example, is displayed in S5 File: https://osf.io/u2j3z/. For each meta-analysis separately, we first fitted a random-effects model in which we used the Q statistic from the leave-one-out-function in the metafor package (version 1.9–9) in R (version 3.3.2) to classify all primary studies with their reported effect sizes as either being an outlier or non-outlier [21,22]. If the Q statistic after the leave-one-out function showed a statistically significant (α = 0.05 for all analyses) difference from the Q statistic from the complete meta-analysis, the left out primary study effect size was classified as an outlier. We then randomly sampled ten outlier primary study effect sizes per meta-analysis, if that many could be obtained. The median number of outlier primary effect sizes was seven. In total, we selected 197 outlier primary study effect sizes from 33 meta-analyses, leaving 303 effect sizes that remained to be sampled from the non-outlier primary study effect sizes. We randomly selected approximately ten non-outlier primary study effect sizes per meta-analysis. If fewer than ten could be selected, we randomly divided the remaining number among other meta-analyses that had effect sizes left to be sampled.

Procedure

To recompute the primary study effect sizes, we strictly followed calculation methods as described in the meta-analytic article. In cases where the meta-analysis was unclear as to which methods were used, we employed well-known formulas for effect size transformation. The information we extracted and formulas we used to compute and transform all primary study effect sizes are displayed for each meta-analysis in S3 File: https://osf.io/pqt9n/. In case of a two-group design we assumed equal group sizes if no further information was given in the meta-analysis or the primary study article. We adjusted all effects in such a way that insofar the prediction of the meta-analysts was corroborated, the mean effect size would be positive, which entailed changing the sign of the effect sizes of five meta-analyses. Consequently, a negative effect size indicates an effect opposite to what was expected. We analyzed results using the metafor package (version 2.1–0) in R (version 3.6.0) [21,22]. Relevant code and files to reproduce the results from this study can be found at https://osf.io/7nsmd/files/. Our procedure for recalculating primary study effect sizes is displayed in Fig 1. We categorized reproduced primary study effects into one of four categories we considered, ranging from best to worst outcome: (0) reproducible: we could reproduce the effect size as reported in the meta-analytic article or with a margin of error of correlation r < .025 or Hedges’ g < .049; (1) incomplete: not enough information was available to reproduce the effect size (e.g., SDs are missing in the primary article). In this case we copied the original effect size as reported in the meta-analysis because we were not sure what computations the meta-analysts performed, or whether they had contacted the authors for necessary statistics; (2) incorrect: our recalculation resulted in a different effect size of at least r ≥ .025 or Hedges’ g ≥ .049 (i.e., a potential calculation or reporting error was made); or (3) ambiguous: it was unclear what steps the meta-analysts followed so we categorize the effect as ambiguous. We consider ambiguous effects (category 3) to be more problematic than incorrect effects (category 2), as for the incorrect effects it was clear which effect size the meta-analysts had chosen, whereas for the ambiguous effects it was impossible to ascertain if we had selected the correct effect size, and what steps were taken to get to the effect size as reported. We acknowledge this categorization is open to discussion, and only used it descriptively throughout this study. When we found multiple appropriate effects given the information available in the meta-analysis, we chose the primary study effect size that showed the least discrepancy with the effect size as reported in the meta-analysis.

Fig 1

Decision tree of primary study effect size recalculation and classification of discrepancy categories.

A composite effect refers to a combination of two or more effects.

Decision tree of primary study effect size recalculation and classification of discrepancy categories.

A composite effect refers to a combination of two or more effects. Next to checking whether certain primary study effect sizes were irreproducible due to incomplete, erroneous, or ambiguous reporting, we were also interested in quantifying whether the discrepancy between the reported and reproduced effect size estimates were small, moderate, or large. Effect sizes in various psychological fields (e.g., personality, social, developmental, clinical psychology, intelligence) show small, moderate, and large effect sizes corresponding approximately to r = 0.10, 0.25, and 0.35 [23,24]. Based on these results, we chose the classifications of discrepancies in correlation r to be small [≥ 0.025, <0.075], moderate [≥ 0.075, <0.125] and large [≥ 0.125] and transformed them to similar classifications for the other effect sizes based on N = 64, relating to the 50th percentile of the degrees of freedom of reported test statistics in eight major psychology journals [24]. For Hedges’ g, classifications were small [≥ .049, < .151], moderate [≥ .151, < .251], and large [≥ .250]. For Cohen’s d, classifications were small [≥ .050, < .152], moderate [≥ .152, < .254] and large [≥ .254]. For Fisher’s r-to-z transformed correlation, classifications were the same as r.

Results

Out of 500 sampled primary study effect sizes we could reproduce 276 without any issues (reproducible: 55%). For 54 effect sizes, the primary study paper did not contain enough information to reproduce the effect (incomplete: 11%), so the original effect size was copied. For 74 effect sizes, a different effect than originally stated in the meta-analytic article was calculated (incorrect: 15%), whereas for 96 effect sizes it was unclear what procedure was followed by the meta-analysts (ambiguous: 19%), and the effect size which was most relevant and closest to the reported effect size was chosen. Fig 2 displays all primary study effect sizes that used either Cohen’s d or Hedges’ g as the meta-analytic effect size (k = 247), where all primary study effect sizes in Cohen’s d were transformed to Hedges’ g. The horizontal axis displays the original reported effect sizes and the vertical axis the reproduced effects; all data points on the diagonal line indicate no discrepancy between reported and reproduced primary study effect sizes. Likewise, Fig 3 displays all primary study effect sizes that used either a product-moment correlation r or Fisher’s r-to-z transformed correlation (k = 253), where all primary study effect sizes using a product-moment correlation r were transformed to Fisher’s z correlation. The results as illustrated in Figs 2 and 3 can be found separately per meta-analysis in S4 File: https://osf.io/65b8z/.

Fig 2

Scatterplot of 247 original and reproduced standardized mean difference effect sizes from 33 meta-analyses.

All effect sizes are transformed to Hedges’ g.

Fig 3

Scatterplot of 253 original and reproduced correlation effect sizes from 33 meta-analyses.

All effect sizes are transformed to Fisher’s z.

Scatterplot of 247 original and reproduced standardized mean difference effect sizes from 33 meta-analyses.

All effect sizes are transformed to Hedges’ g.

Scatterplot of 253 original and reproduced correlation effect sizes from 33 meta-analyses.

All effect sizes are transformed to Fisher’s z. In total, 114 out of 500 recalculated primary study effect sizes (23%) showed effect size discrepancies compared to the primary study effect sizes as reported in the meta-analytic articles. Of those 114 discrepancies, 62 were small (54%), 21 were moderate (18%), and 31 were large (27%). We note that it is possible for a primary study effect size to be classified as irreproducible, even if we found no discrepancies between the reported and recalculated primary study effect size estimate. This happened for instance for all primary studies that did not contain enough information to reproduce the effect size, for which we copied the reported effect size. The number of effect sizes we calculated that were larger than reported (k = 162) was approximately equal to those that we calculated that were smaller than reported (k = 165), which indicates no systematic bias in either direction. The most common reason for not being able to reproduce a primary study effect size was missing or unclear information in the meta-analysis (i.e., ambiguous effect sizes, k = 96, 19%). More specifically, it was often unclear which specific effect was extracted from the primary study because multiple effects were relevant to the research question, and we did not know if and how the included effect was constructed from a combination of multiple effects. Other prevalent issues pertaining to potential data errors or lack of clarity in the meta-analytic process were inconsistencies and unclear reporting in inclusion criteria within a meta-analytic article (k = 67), uncertainty about which samples or time points were included (k = 50), reporting formulas or corrections that were incorrect or not used (k = 23), including the same sample of respondents for multiple effects without correction (k = 7), lacking information on how the primary study effect was transformed to the effect size included in the meta-analysis (k = 3), and mistaking the standard error for the standard deviation when calculating effects (k = 2). Within over a quarter of primary studies (147; 29%, see Table 1) we combined multiple effects into one overall effect size estimate for that primary study. The percentage of irreproducible effect sizes is relatively large within this group. Within this subset, 18% was classified as incorrect (single effect sizes: 13%), 34% as ambiguous (single effect sizes: 13%), 1% as incomplete (single effect sizes: 15%), and 46% as reproducible (single effect sizes: 59%); X2 (3, N = 500) = 45.78, p < .0001, ΦCramer = 0.30, showing that combining multiple effect sizes into one overall estimate is moderately associated with irreproducibility of effect size estimates.

Table 1

Reproducibility frequencies separated by primary study effect sizes consisting of one or multiple combined effect sizes.

	Single effect size	Combined effect sizes
Reproducible	208 (59%)	68 (46%)
Incorrect	47 (13%)	27 (18%)
Incomplete	52 (15%)	2 (1%)
Ambiguous	46 (13%)	50 (34%)
Total	353 (100%)	147 (100%)

Table 2 contains descriptive statistics on reproducibility and various primary study characteristics. We hypothesized that primary studies with SMDs would be less reproducible than studies with other effect sizes. In line with our hypothesis, we found that 57% of all primary studies containing SMDs were irreproducible, whereas for primary studies with correlations this was 33% (see Table 2). This difference between SMDs and correlations can also clearly be seen in Fig 2 and Fig 3. Constructing SMDs requires more transformations and calculations of effects, whereas correlations are often extracted from the primary paper as is. As such, it is not surprising that we found that correlations are more reproducible than SMDs in our sample.

Table 2

Reproducibility frequencies separated by primary study characteristics.

	Irreproducible	Reproducible	Total
SMD	140 (57%)	107 (43%)	247 (100%)
Correlation	84 (33%)	169 (67%)	253 (100%)
Outlier	77 (39%)	120 (61%)	197 (100%)
Non-outlier	147 (49%)	156 (51%)	303 (100%)
Published	216 (47%)	239 (53%)	455 (100%)
Unpublished	8 (18%)	37 (82%)	45 (100%)

We hypothesized that effect sizes from unpublished studies would be less reproducible compared to published studies, but contrary to our expectation we found that 18% of unpublished studies and 47% of published studies were irreproducible (see Table 2). Because we oversampled primary study effect sizes classified as outliers, our sample of 500 primary study effect sizes is not representative for the 1,978 effect sizes we sampled from. In the sample of 1,978 effect sizes, we classified 30% as an outlier, compared to 39% in our sample of 500. This means our sample contains too many outlier primary study effect sizes, and too few non-outliers. To calculate the probability that any given effect size in a certain meta-analysis is irreproducible, we needed to correct for this overrepresentation of outliers by design (S5 File: https://osf.io/u2j3z/). We calculated correction weights using type of effect size (outlier or non-outlier) as the auxiliary variable, and used the sample proportions of outliers and non-outliers to determine the probability of finding a potential error (i.e., either an incomplete, ambiguous, or different primary study effect size) for each meta-analysis and across all 33 meta-analyses in total. The computed primary study error probability for the 33 meta-analyses varied from 0 to 1. Across all meta-analyses we estimated the chance of any randomly chosen primary study effect size being irreproducible to be 37%. Fig 4 displays a bar plot with the frequency of irreproducible effect sizes per meta-analysis. The distribution of reproducible effect sizes (category 0) ranged from 0% to 100% with a mean of 53% and median of 56%. Only three of the 33 samples of primary studies were completely reproducible, and one was completely irreproducible (k = 11). The percentage of incorrect effect sizes (category 1) ranged from 0% to 91% across meta-analyses, (mean = 14%, median = 11%); incomplete effect sizes ranged from 0% to 67% (mean = 12%, median = 5%), and ambiguous effect sizes (category 3) ranged from 0% to 91% (mean = 19%, median = 11%). Note that the reporting within meta-analyses is often at least partly ambiguous (24 out of 33 meta-analyses contain at least one ambiguous effect size).

Fig 4

Frequencies of reproduced primary study effect sizes with and without errors, per meta-analysis.

Exploratory findings

The previously reported results can be considered confirmatory because we preregistered our hypotheses and procedures. Next to confirmatory analyses we also performed one exploratory analysis, in which we compared the reproducibility of outlier and non-outlier primary study effect sizes. We found that 39% of all outlier effect sizes were irreproducible, whereas for non-outlier effect sizes it was 49% (see also Table 2).

Conclusion

In Part 1 we set out to investigate the reproducibility of 500 primary study effect sizes as reported in 33 psychological meta-analyses. Of the 500 reported primary study effect sizes, almost half (224) could not be reproduced, and 30 out of 33 meta-analyses contained effect sizes that could not be reproduced. Poor reproducibility at the primary study level might affect meta-analytic outcomes, which is worrisome because it could bias the meta-analytic evidence and lead to substantial changes in conclusions. We investigate this in Part 2 of this study.

Part 2. Meta-analysis reproducibility

In Part 2, we examined whether irreproducible primary study effect sizes affect three meta-analytic outcomes; the overall effect size estimates, its confidence interval, and the estimate of heterogeneity parameter τ2. We hypothesized to find discrepancies between the reported and reproduced primary studies, and subsequently also expected several meta-analytic pooled effect size estimates to be irreproducible. Discrepancies in primary study effect sizes we found in Part 1 can be either systematic or random. Systematic errors in primary study effect sizes will bias results and thus have a larger impact on the meta-analytic mean estimate, whereas random primary study errors can be expected to increase the estimate of heterogeneity in the meta-analysis. Because we expected most primary study errors to be random instead of systematic, we hypothesized corrected primary study effect sizes would have a larger impact on the boundaries of the confidence interval (i.e., smaller CIs after adjustment) than the meta-analytic effect size estimate. To investigate meta-analysis reproducibility, we first documented for each of the 33 meta-analyses which software was used to run the meta-analysis, what type of meta-analysis was conducted (i.e., fixed effect, random-effects, or mixed-effects), and what kind of estimator was originally used. The information we extracted to reproduce the meta-analytic effects can be found for each meta-analysis in S3 File: https://osf.io/pqt9n/. For each of the 33 meta-analyses, we conducted two analyses: one with the primary study effect sizes as reported in the meta-analysis, and one with the primary study effect sizes as we recalculated them. We then documented the pooled meta-analytic effect size estimate, its confidence interval, and the estimated τ2 parameter for both the reported and reproduced meta-analysis, and compared these outcomes for discrepancies. For Part 2 we upheld the same discrepancy measures as in Part 1. The results presented next are based on the subset of primary study effect sizes that were sampled for checking, instead of complete meta-analyses including all primary studies. We decided to report on subsets of the meta-analysis containing only sampled studies, because the effect of corrected primary study effect sizes on meta-analytic outcomes can best be shown when only the corrected primary study effect sizes are included in the meta-analysis. We also conducted analyses on all 33 complete meta-analyses, for which results are reported in S1 File: https://osf.io/pf4x9/. We first documented which procedures the meta-analysts used for their analyses. We found the same level of imprecise reporting here as in Part 1: most meta-analytic articles reported scarce information related to estimation methods. Many meta-analyses simply referred to well-known meta-analysis books without mentioning specifically which method was used. Out of 33 meta-analyses, only two explicitly reported which models, software and estimator they used for their estimation. For the other meta-analyses, we were forced to guess the used estimation method. Most meta-analytic authors (m = 25, 76%) used either a random-effects or both fixed effect and random-effects models. For meta-analytic outcomes, 13 out of 33 meta-analyses (39%) showed discrepancies in either the pooled effect size estimate, its confidence interval, or τ2 parameter. For example, in meta-analysis no. 17 the pooled effect size estimate the subset we sampled was g = 0.35, 95% CI [-0.02, 0.72], which dropped to g = 0.23, 95% CI [-0.01, 0.47] after three (out of 11) primary study effect sizes showed large discrepancies between the reported and recalculated results. Note that even though the number of primary study effect sizes that were irreproducible was large, the number of discrepancies between the reported and reproduced primary study effect sizes was relatively small, mostly due to our decisions to copy the effect size if not enough information was available, or choosing the estimate that most closely resembled the reported one when the effect was ambiguous. We found small discrepancies in the pooled effect size estimates for nine out of 33 meta-analyses (27%), displayed in panel a of Fig 5 (for all meta-analyses using SMDs), and Fig 6 (for all meta-analyses using correlations). We plotted the difference between the upper and lower bound of the confidence intervals in Figs 5 and 6 (panel b). We found 13 meta-analyses with discrepancies in the confidence intervals (39%), of which nine were small (Hedges’ g: ≥ .049 and < .151, Fisher’s z: ≥ .025 and < .075) and three were moderate (Hedges’ g: ≥ .151 and < .251, Fisher’s z: ≥ .075 and < .125). In line with our hypothesis, this result shows that corrected primary study effect sizes have a larger impact on the boundaries of the confidence interval than the pooled effect size estimate. In none of the meta-analyses was the statistical significance of the average effect size affected by using the recalculated primary study effect sizes.

Fig 5

Scatterplot of reported and reproduced meta-analytic outcomes for meta-analyses using standardized mean differences, where all Cohen’s d estimates are transformed to Hedges’ g.

Fig 6

Scatterplot of reported and reproduced meta-analytic outcomes for meta-analyses using correlations, where all product-moment correlations r are transformed to Fisher’s z.

That the scale of panel c is different from panel a and b.

Scatterplot of reported and reproduced meta-analytic outcomes for meta-analyses using correlations, where all product-moment correlations r are transformed to Fisher’s z.

That the scale of panel c is different from panel a and b. We did not find any evidence for systematic bias in meta-analytic results; we estimated 19 pooled effect sizes to be larger than originally reported and 14 to be smaller than originally reported. We estimated wider pooled effect size CIs in 18 cases and smaller CIs in 15. This latter result is contrary our expectations that CI estimates would be smaller after adjustment; we think this is due to the large number of primary study effect sizes that we found were ambiguous. We extended our preregistration by checking whether irreproducible primary study effect sizes affected the meta-analytic estimate of heterogeneity τ2. The discrepancies in τ2 estimates are displayed in panel c of Fig 5 and Fig 6. The heterogeneity estimate changed from statistically significant to non-significant in one meta-analysis, and from statistically non-significant to significant in another meta-analysis. In total, 17 τ2 parameter estimates were larger after recalculating the primary study effect sizes, 12 were smaller, and four showed no difference. A reviewer to a previous version of this paper rightfully claimed that it is problematic to include the incomplete effect sizes in tests of meta-analytic reproducibility, since these effects could not be reproduced due to a lack of information. Including these incomplete effect sizes might deflate the differences between the reported and reproduced meta-analyses. As such, we decided to compare the reported and reproduced meta-analyses again, only including effect sizes we could calculate: correct, incorrect, and ambiguous effect sizes. In total, 14 out of 33 meta-analyses (42%) showed discrepancies in either the pooled effect size estimate, its confidence interval, or τ2 parameter, which is one meta-analysis more than in the previous analyses when all four effects were included. The discrepancies between the original and reproduced meta-analyses became larger in this analysis, but the statistical significance of the overall pooled effect sizes did not change; detailed results are displayed in S1 File. We acknowledge that it is impossible to formulate definite conclusions regarding the reproducibility of full meta-analyses because we took samples of individual effect sizes from each. However, under some strong assumptions we can predict the probability that a full meta-analysis is reproducible. A χ2 test showed that it is unrealistic to assume that the probability of a reproducible individual effect size is equal across all meta-analyses (χ (32) = 156.12, p < .001, χ2 test of independence). We performed a multilevel logistic regression analysis with (in)correctly reproduced individual effect sizes as the dependent variable, and the 33 meta-analyses as the grouping variable. We used the estimates from this model (intercept = .214, variance = 2.09) to approximate the distribution of the probability of a reproducible effect size with 1,000 points, and used this approximated distribution to calculate the probability of reproducing a meta-analysis with a given number of effect sizes. The analysis script is located at https://osf.io/5k4as/. Using this model, we predict meta-analyses of size 1, 5, 10, 16 or larger to be fully reproducible with respective probabilities .538, .170 .082, and < .050.

General discussion

In Part 1 we investigated the reproducibility of 500 primary study effect sizes from 33 meta-analyses. In Part 2, we then looked at the effect of irreproducible effect sizes on meta-analytic pooled effect size estimates, confidence intervals and heterogeneity estimates. Almost half of the primary study effect sizes could not be reproduced based on the reported information in the meta-analysis, due to incomplete, incorrect, or ambiguous reporting. However, overall, the consequences were limited for the main meta-analytic outcomes. We found small and a few moderate discrepancies in meta-analytic outcomes in 39% of meta-analyses, but most discrepancies were negligible. In none of the meta-analyses did the use of recalculated primary study effect sizes change the statistical significance of the pooled effect size, whereas the statistical significance of the heterogeneity estimate changed in two meta-analyses. These two meta-analyses were characterized by many ambiguous effect size computations (meta-analysis 5 in Fig 5) and one relatively large effect size discrepancy (meta-analysis 29, Fig 6). In this study, we focused only on the effect of primary study errors on meta-analytic estimates. Since we found errors in primary studies to have a (minimal) effect on meta-analytic mean and heterogeneity estimates, we expect the errors to also have a (small) effect on methods that correct for publication bias. However, as primary study errors seemed mostly random rather than systematic, we expect an increase in variance of estimates when correcting for publication bias. An increase in variation might affect the results by obscuring true patterns of bias, because of diminished power in some analyses of bias. For instance, in Egger’s test, the added random variation might lower the power to detect asymmetry in the funnel plot that is indicative of publication bias. We should note that we were conservative in our estimations for primary study effect sizes for which we did not have enough information from the original papers to recompute them ourselves. We decided to copy those effect sizes as they were reported, meaning they did not count towards any discrepancies when rerunning the meta-analyses in Part 2. Similarly, for ambiguous effects, we chose the estimate that was closest to the reported one, leading to conservative estimates of the discrepancies. We acknowledge that results of our study could have been (very) different if we had decided to include either the minimum, maximum, or a random effect size with these ambiguous effect sizes. Moreover, we took samples from each meta-analysis to keep our coding time manageable, and so we leave it to the interested (and industrious) reader to recompute specific meta-analytic outcomes after checking all effect size computations featured in it. Based on previous research on the reproducibility of meta-analyses in medicine [8], we expect the effect may be significantly more detrimental. Surprisingly, we found unpublished and non-outlier primary studies to be more reproducible than published or outlier primary studies. It could be that meta-analytic authors are more cautious when calculating effect sizes from unpublished articles because the article is perhaps not peer-reviewed. Similarly, perhaps many meta-analysts do pay more attention to effect sizes that are relatively large.

Limitations

We recognize some limitations in our study. Our primary study effect size sample is not completely random because we first identified which primary study effect sizes were outliers and oversampled these, before we took a random sample from both outlier and non-outlier primary studies. However, we believe our estimate of the average probability of being able to reproduce a random primary study effect size from a meta-analysis (37%) is accurate, as we corrected for our planned oversampling of outlier primary effect sizes and included a large and systematically drawn sample. Additionally, although our selection of meta-analyses was random and based on a large and fairly comprehensive database, we only included meta-analyses that contained a data table with a minimum amount of information. These meta-analyses can be considered to be relatively well reported compared to meta-analyses lacking any data table. Based on the finding that reluctance to share data is associated with more reporting errors in primary studies ([25]; but see also [26]), one would expect meta-analyses accompanied by open data to be of higher (reporting) quality. Thus, we expect meta-analyses that we omitted because of lacking data to show even weaker reproducibility. If meta-analysts wish to convince readers of reproducible outcomes, they could start sharing their data table and reporting in clear manner how their computations were performed. Another limitation is that we coded a subset of 500 primary study effect sizes, instead of all 1,978 effects. We acknowledge that it is hard to get a comprehensive view on the reproducibility of meta-analyses based on only a subset of primary study effect sizes. It should however be noted that for 19 out of 33 meta-analyses (57%) we attempted to reproduce at least half of all included primary study effect sizes. Moreover, the coding process for the primary study effect sizes and meta-analyses proved to be very labor intensive, due to the substandard reporting practices that hindered the reproducibility and verifiability of many results. Our study revealed that most meta-analysts do not adhere to the Meta-Analysis Reporting Standards (MARS; [27]) and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines [16]. This also implies that our reproduced primary study effect sizes might not always have been the primary study effect sizes that were selected by the meta-analytic authors; in many cases (19%) the reporting on which specific effect was extracted from the primary study was vague, and which effects we deemed fitting might differ substantially from what the meta-analysts intended to include. These issues made this project particularly time-consuming, and it is clear that almost none of the published meta-analyses in our sample were easily reproducible, making it nearly impossible to retrace the steps taken during the research. A final limitation is that our sampled meta-analyses were from 2011 and 2012. It could be that the increasing emphasis on reproducibility in psychology of the last years has incentivized meta-analysts to make reporting more extensive and result checking more diligent. However, results from [15], who checked meta-analyses from 2013–2014, show similar results and emphasize a lack of adherence to reporting standards. More research to adherence of reporting standards and subsequent reproducibility of results in recent meta-analyses is needed to help us understand whether there is any sign of improvement. Meta-analyses are increasingly being used and widely cited. In 2018 alone, the 33 meta-analyses in our sample have been cited a total of 846 times. It is important that future meta-analyses improve in how they report their results.

Recommendations

The results of this study call for improvement in reporting practices in psychological literature and particularly in meta-analyses. Such improvements entail specifically sharing necessary details regarding the entire meta-analytic process. Meta-analyses should report not only a data table containing the basic primary study statistics, but also details regarding study design and analyses. Specifying conditions, samples, outcomes, variables, and methods used to find and analyze data is imperative for reproducibility. The meta-analytic reporting guidelines such as MARS and PRISMA provide good guidance on meta-analytic reporting [16,27]. Moreover, a recent study found explicit mention of the PRISMA guidelines to be associated with more complete reporting of meta-analyses in psychology [6]. For a meta-analysis to be completely reproducible, we would add to the existing guidelines the requirement that effect size computations should be specified per effect size in supplementary materials. It should be clear which decisions were made and when, and each primary study effect size should be able to be uniquely identified. For an example, we refer to S3 File (https://osf.io/pqt9n), where we documented the relevant text, references, and formulas from all 33 meta-analyses to indicate which specific transformations we made to the effect sizes within meta-analyses. Moreover, in our codebook (https://osf.io/7abwu) we specified the names of the groups and variables that were compared for each primary study effect size, exactly as they were reported in the primary study. If certain groups or effects were combined, it was also necessary to add a comment on how and in which order these were combined. We acknowledge it is hard to document all relevant information related to effect size computation in meta-analyses, and emphasize that sharing all data, code, and documentation used in the process would benefit reproducibility of meta-analyses tremendously. For more information on best practices in systematic reviewing, we refer to [28]. Ideally, specification of the methods takes place before the data collection starts to counter biasing effects. By preregistering the meta-analytic methods, an explicit distinction can be made between exploratory and confirmatory analyses. This helps avoid hypothesizing after results are known [29] and potential biases caused by the many seemingly arbitrary choices meta-analysts make during the collection and processing of data. [30] composed a checklist for various types of researcher degrees of freedom related to psychological research, such as failing to specify the direction of effects and failing to report failed studies. It would be worthwhile to expand this checklist to the context of meta-analyses. Other suggestions for improvement in reporting practices include opening data, materials and workflow to use transparency as an accountability measure (e.g., by dynamic documenting through RMarkdown [31]), or the use of tools that benefit data extraction for systematic reviews [32]. Fortunately, many online initiatives promote preregistration and data sharing practices, with increasingly more journals requiring authors to share the data that would be needed by someone wishing to validate or replicate the research (e.g., PLOS ONE, Scientific Reports, the Open Science Framework, the Dataverse Project). Accurately conducted and reported meta-analyses are necessary considering the increasing advancement of research and knowledge, and it is crucial that methods in meta-analyses become more reproducible. Only then will trust in meta-analytic results be justified for building better theories, steering future research efforts, and informing practitioners in a wide range of settings.

Additional results.

https://osf.io/pf4x9/. (PDF) Click here for additional data file.

Sampling scheme and flowcharts.

https://osf.io/43ju5/. (PDF) Click here for additional data file.

Formulas and methods.

https://osf.io/pqt9n/. (PDF) Click here for additional data file.

Reported vs reproduced primary study effect sizes.

https://osf.io/65b8z/. (PDF) Click here for additional data file.

Post stratification calculations.

https://osf.io/u2j3z/. (PDF) Click here for additional data file. 16 Mar 2020 PONE-D-20-01919 Reproducibility of individual effect sizes in meta-analyses in psychology PLOS ONE Dear Ms Maassen, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. I have now received two reviews from experts in meta-analytic methods. As you can see, both reviewers were very positive and suggested publication of your manuscript. I concur with their assessment. The reviewers raised some minor points that you should clarify in a revision. Let me also add two points of my own: If I understood it correctly, “reproducible” effect sizes were classified as such if the difference between the original and recalculated effect size was smaller than r = .025 (page 10); so, you were trying to recalculate the effect size with some margin of error. I would recommend clarifying this point when introducing your classification scheme at the end of page 9. It would be informative to describe Figure 4 in more detail. In particular, what was the distribution (e.g., median, quartiles) of the proportion of reproducible effect sizes in the meta-analyses? Is the reported proportion of 37% (page 14) an adequate estimate for all meta-analyses or is the distribution rather skewed; for example, it might be the case that in many meta-analyses most effect sizes are irreproducible, whereas few meta-analyses primarily reported irreproducible effect sizes. In addition to these points, both reviewers made further excellent suggestions that you should consider in your revision. I strongly encourage you to address these issues and submit a revised version of your manuscript. We would appreciate receiving your revised manuscript by Apr 30 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. We look forward to receiving your revised manuscript. Kind regards, Timo Gnambs Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements: 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Thank you for the opportunity to review this manuscript. This manuscript is an important contribution to the methodological literature on the conduct and reporting of systematic reviews that include meta-analyses. The manuscript focuses on one of many ways that meta-analyses can fail to be reproducible – the computation of effect sizes. The manuscript makes a clear and compelling case for transparency and the use of open science practices in the conduct and reporting of meta-analysis particularly with regard to the computation of effect sizes. The associated registry of their own conduct of this research project provides an example of how to implement transparency in research. The manuscript also points out issues that often encountered in reviewing meta-analyses and that reflect issues with current meta-analysis practice. Few reviews adhere to the published standards in MARS or PRISMA. Few published meta-analyses provide complete data tables, much less information on how effect sizes were computed from primary studies. This manuscript should provide more evidence that meta-analyses need to follow current reporting standards and use open science practices. One hypothesis about why some effect sizes might be irreproducible or ambiguous relates to how the meta-analysis handles multiple effect sizes from primary studies. Since the publication date of the meta-analyses used in this manuscript, methods for handling dependent effect sizes are being used more widely. Prior to the publication of Hedges, Tipton & Johnson (2010), researchers either chose a single effect size from a study or combined effect sizes measuring a similar construct in a non-transparent manner. Though this may be one contributing factor to the irreproducibility of effect sizes in this manuscript, the general message still holds today that more transparency is needed when reporting effect size computations in a meta-analysis. As a minor point, I wondered how many of the irreproducible and ambiguous effect sizes were related to the issue of dependent effect size. In sum, the manuscript will add to the growing evidence that meta-analyses need to embrace fully open science practices given the dependence of many policy decisions on meta-analytic results. Reviewer #2: I think that this work is of great interest given the important and lofty role meta-analysis plays in scientists' and practitioners' evaluation of research results. It is useful to finally have some data on the reproducibility of these research results. If I understand the results correctly, the reproducibility of individual effect sizes is poor, but this does not seem to have a substantial or predictable effect on the reproducibility of meta-analytic means and heterogeneity. This is useful to know as we continue to practice meta-analysis, calibrate the degree of trust we put in meta-analysis, and consider which transparency guidelines are most important in meta-analysis. I can't go through all the data and code, although I appreciate its presence in the Github repository. I loaded up codebook-primary-studies-complete.xlsx just to check on those d > 3 estimates. I guess the SIRS is a very powerful instrument. This looks like a complete piece of work that has already been through several rounds of peer review. I only have some minor suggestions: It's not clear to me that "ambiguous" is necessarily worse than "incorrect". As introduced in line 190 I thought this was going to be the basis for some sort of ordinal regression but was relieved to see it is merely descriptive and the badness of these categories merely a statement of the authors' beliefs. Discussion When you say "For a meta-analysis to be completely reproducible, we would add to these guidelines the requirement that effect size computations should be specified per effect size in supplementary materials. It should be clear which decisions were made and when, and each primary study effect size should be able to be uniquely identified." Can you point to examples in your own codebook to provide a concrete demonstration of how this should look? Future directions: Authors will sometimes confuse SE and SD in their own writing, leading to meta-analysts using the SE instead of the SD when calculating effect sizes, yielding effects that are too big. Did you observe this with any frequency in this data set? This could yield an effect size that is reproducible given the primary study and yet incorrect. These errors seem to have little influence over the mean and heterogeneity. How might these errors influence less robust statistics like PET/PEESE, p-curve, etc? I always sign my reviews, Joe Hilgard ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Terri Pigott Reviewer #2: Yes: Joseph Hilgard [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step. 26 Apr 2020 April 26, 2020 Dear Dr. Gnambs, We hereby resubmit our manuscript ‘Reproducibility of Individual Effect Sizes in Meta-Analyses in Psychology’. We have responded to all points raised by you and the reviewers. Below, we respond to all the points in italics, and we have added the specific edits we made to the manuscript after in red. We hope our revision has addressed the comments of both you and the reviewers sufficiently. We look forward to your reply. Kind regards, Esther Maassen also on behalf of, Marcel A.L.M. van Assen, Michèle B. Nuijten, Anton Olsson-Collentine, and Jelte M. Wicherts. emaassen@protonmail.com ------ 1. If I understood it correctly, “reproducible” effect sizes were classified as such if the difference between the original and recalculated effect size was smaller than r = .025 (page 10); so, you were trying to recalculate the effect size with some margin of error. I would recommend clarifying this point when introducing your classification scheme at the end of page 9. As per your recommendation, we have clarified there is a margin of error when calculating {ir}reproducible effect sizes. We have now explicitly stated this margin for the two most common effect sizes (correlation r and standardized mean difference Hedges' g) on page 9 and 10: We categorized reproduced primary study effects into one of four categories we considered, ranging from best to worst outcome: (0) reproducible: we could reproduce the effect size as reported in the meta-analytic article or with a margin of error of correlation r < .025 or Hedges’ g < .049; (1) incomplete: not enough information was available to reproduce the effect size (e.g., SDs are missing in the primary article). In this case we copied the original effect size as reported in the meta-analysis because we were not sure what computations the meta-analysts performed, or whether they had contacted the authors for necessary statistics; (2) incorrect: our recalculation resulted in a different effect size of at least r ≥ .025 or Hedges’ g ≥ .049 (i.e., a potential calculation or reporting error was made); 2. It would be informative to describe Figure 4 in more detail. In particular, what was the distribution (e.g., median, quartiles) of the proportion of reproducible effect sizes in the meta-analyses? Is the reported proportion of 37% (page 14) an adequate estimate for all meta-analyses or is the distribution rather skewed; for example, it might be the case that in many meta-analyses most effect sizes are irreproducible, whereas few meta-analyses primarily reported irreproducible effect sizes. We have described the distribution of effect size errors in more detail on page 15. We added percentages per category per meta-analysis to Figure 4. We added the following: Fig 4 displays a bar plot with the frequency of irreproducible effect sizes per meta-analysis. The distribution of reproducible effect sizes (category 0) ranged from 0% to 100% with a mean of 53% and median of 56%. Only three of the 33 samples of primary studies were completely reproducible, and one was completely irreproducible (k = 11). The percentage of incorrect effect sizes (category 1) ranged from 0% to 91% across meta-analyses, (mean = 14%, median = 11%); incomplete effect sizes ranged from 0% to 67% (mean = 12%, median = 5%), and ambiguous effect sizes (category 3) ranged from 0% to 91% (mean = 19%, median = 11%). Note that the reporting within meta-analyses is often at least partly ambiguous (24 out of 33 meta-analyses contain at least one ambiguous effect size). 3. As a minor point, I wondered how many of the irreproducible and ambiguous effect sizes were related to the issue of dependent effect size. Reviewer 1 recommended to add more details on the reproducibility rates when using dependent effect sizes (i.e., when multiple effects within one primary study are combined). We did so on page 13, and added a new table (Table 1) containing percentages of effect size reproducibility for single and multiple effect sizes. More specifically, we added the following paragraph: Within over a quarter of primary studies (147; 29%, see Table 1) we combined multiple effects into one overall effect size estimate for that primary study. The percentage of irreproducible effect sizes is relatively large within this group. Within this subset, 18% was classified as incorrect (single effect sizes: 13%), 34% as ambiguous (single effect sizes: 13%), 1% as incomplete (single effect sizes: 15%), and 46% as reproducible (single effect sizes: 59%); Χ^2(3, N = 500) = 45.78, p < .0001, ΦCramer = 0.30, showing that combining multiple effect sizes into one overall estimate is moderately associated with irreproducibility of effect size estimates. Table 1. Reproducibility frequencies separated by primary study effect sizes consisting of one or multiple combined effect sizes Single effect size Combined effect sizes Reproducible 208 (59%) 68 (46%) Incorrect 47 (13%) 27 (18%) Incomplete 52 (15%) 2 (1%) Ambiguous 46 (13%) 50 (34%) Total 353 (100%) 147 (100%) 4. It's not clear to me that "ambiguous" is necessarily worse than "incorrect". As introduced in line 190 I thought this was going to be the basis for some sort of ordinal regression but was relieved to see it is merely descriptive and the badness of these categories merely a statement of the authors' beliefs. On page 10 we clarified that the categorization of effect sizes is our own, and is only used descriptively. More specifically: We consider ambiguous effects (category 3) to be more problematic than incorrect effects (category 2), as for the incorrect effects it was clear which effect size the meta-analysts had chosen, whereas for the ambiguous effects it was impossible to ascertain if we had selected the correct effect size, and what steps were taken to get to the effect size as reported. We acknowledge this categorization is open to discussion, and only used it descriptively throughout this study. 5. When you say "For a meta-analysis to be completely reproducible, we would add to these guidelines the requirement that effect size computations should be specified per effect size in supplementary materials. It should be clear which decisions were made and when, and each primary study effect size should be able to be uniquely identified." Can you point to examples in your own codebook to provide a concrete demonstration of how this should look? Reviewer 2 asked us to point to an example in our own codebook to provide a concrete demonstration on how specific effect size computation should look like. We added an extra reference to our supplemental materials in text, which indicate all inclusion criteria and formulas used per meta-analysis, as well as our codebook that documents the relevant information per primary study. More specifically: For a meta-analysis to be completely reproducible, we would add to the existing guidelines the requirement that effect size computations should be specified per effect size in supplementary materials. It should be clear which decisions were made and when, and each primary study effect size should be able to be uniquely identified. For an example, we refer to S3 (https://osf.io/pqt9n/), where we documented the relevant text, references, and formulas from all 33 meta-analyses to indicate which specific transformations we made to the effect sizes within meta-analyses. Moreover, in our codebook (https://osf.io/7abwu/) we specified the names of the groups and variables that were compared for each primary study effect size, exactly as they were reported in the primary study. If certain groups or effects were combined, it was also necessary to add a comment on how and in which order these were combined. We acknowledge it is hard to document all relevant information related to effect size computation in meta-analyses, and emphasize that sharing all data, code, and documentation used in the process would benefit reproducibility of meta-analyses tremendously. For more information on best practices in systematic reviewing, we refer to [28]. 6. Authors will sometimes confuse SE and SD in their own writing, leading to meta-analysts using the SE instead of the SD when calculating effect sizes, yielding effects that are too big. Did you observe this with any frequency in this data set? This could yield an effect size that is reproducible given the primary study and yet incorrect. Reviewer 2 stated that a common mistake made by authors is confusing the SE and SD. We indicated on page 12 and 13 how often this occurred in our set of primary studies, and added estimates for other occurrences (e.g., how often it was unclear which variables to include in the primary study effect size estimation). More specifically: The most common reason for not being able to reproduce a primary study effect size was missing or unclear information in the meta-analysis (i.e., ambiguous effect sizes, k = 96, 19%). More specifically, it was often unclear which specific effect was extracted from the primary study because multiple effects were relevant to the research question, and we did not know if and how the included effect was constructed from a combination of multiple effects. Other prevalent issues pertaining to potential data errors or lack of clarity in the meta-analytic process were inconsistencies and unclear reporting in inclusion criteria within a meta-analytic article (k = 67), uncertainty about which samples or time points were included (k = 50), reporting formulas or corrections that were incorrect or not used (k = 23), including the same sample of respondents for multiple effects without correction (k = 7), lacking information on how the primary study effect was transformed to the effect size included in the meta-analysis (k =3), and mistaking the standard error for the standard deviation when calculating effects (k = 2). 7. These errors seem to have little influence over the mean and heterogeneity. How might these errors influence less robust statistics like PET/PEESE, p-curve, etc? Reviewer 2 asked how the reproducibility errors might influence less robust statistics like PET/PEESE and p-curve. Since our focus in this study was based on meta-analytic outcomes, we decided not to check for each of the 33 meta-analyses separately what the effects of these errors on publication bias methods were. We added a paragraph to our discussion section on page 22 on what we expect the effect on publication bias methods to be. More specifically: In this study, we focused only on the effect of primary study errors on meta-analytic estimates. Since we found errors in primary studies to have a (minimal) effect on meta-analytic mean and heterogeneity estimates, we expect the errors to also have a (small) effect on methods that correct for publication bias. However, as primary study errors seemed mostly random rather than systematic, we expect an increase in variance of estimates when correcting for publication bias. An increase in variation might affect the results by obscuring true patterns of bias, because of diminished power in some analyses of bias. For instance, in Egger’s test, the added random variation might lower the power to detect asymmetry in the funnel plot that is indicative of publication bias. Submitted filename: response-to-reviewers.pdf Click here for additional data file. 29 Apr 2020 Reproducibility of individual effect sizes in meta-analyses in psychology PONE-D-20-01919R1 Dear Dr. Maassen, We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements. Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication. Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. With kind regards, Timo Gnambs Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: 4 May 2020 PONE-D-20-01919R1 Reproducibility of individual effect sizes in meta-analyses in psychology Dear Dr. Maassen: I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. For any other questions or concerns, please email plosone@plos.org. Thank you for submitting your work to PLOS ONE. With kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Timo Gnambs Academic Editor PLOS ONE

16 in total

Review 1. Meta-analyses indexed in PsycINFO had a better completeness of reporting when they mention PRISMA.

Authors: Victoria Leclercq; Charlotte Beaudart; Sara Ajamieh; Véronique Rabenda; Ezio Tirelli; Olivier Bruyère
Journal: J Clin Epidemiol Date: 2019-06-26 Impact factor: 6.437

Review 2. Multiple outcomes and analyses in clinical trials create challenges for interpretation and research synthesis.

Authors: Evan Mayo-Wilson; Nicole Fusco; Tianjing Li; Hwanhee Hong; Joseph K Canner; Kay Dickersin
Journal: J Clin Epidemiol Date: 2017-05-18 Impact factor: 6.437

3. How to Do a Systematic Review: A Best Practice Guide for Conducting and Reporting Narrative Reviews, Meta-Analyses, and Meta-Syntheses.

Authors: Andy P Siddaway; Alex M Wood; Larry V Hedges
Journal: Annu Rev Psychol Date: 2018-08-08 Impact factor: 24.137

4. A method for evaluating research syntheses: The quality, conclusions, and consensus of 12 syntheses of the effects of after-school programs.

Authors: Jeffrey C Valentine; Harris Cooper; Erika A Patall; Diana Tyson; Jorgianne Civey Robinson
Journal: Res Synth Methods Date: 2010-01-28 Impact factor: 5.273

5. Features and functioning of Data Abstraction Assistant, a software application for data abstraction during systematic reviews.

Authors: Jens Jap; Ian J Saldanha; Bryant T Smith; Joseph Lau; Christopher H Schmid; Tianjing Li
Journal: Res Synth Methods Date: 2018-11-19 Impact factor: 5.273

6. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results.

Authors: Jelte M Wicherts; Marjan Bakker; Dylan Molenaar
Journal: PLoS One Date: 2011-11-02 Impact factor: 3.240

Review 7. Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking.

Authors: Jelte M Wicherts; Coosje L S Veldkamp; Hilde E M Augusteijn; Marjan Bakker; Robbie C M van Aert; Marcel A L M van Assen
Journal: Front Psychol Date: 2016-11-25

8. The Reporting Quality of Systematic Reviews and Meta-Analyses in Industrial and Organizational Psychology: A Systematic Review.

Authors: Naomi Schalken; Charlotte Rietbergen
Journal: Front Psychol Date: 2017-08-22

9. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement.

Authors: David Moher; Alessandro Liberati; Jennifer Tetzlaff; Douglas G Altman
Journal: PLoS Med Date: 2009-07-21 Impact factor: 11.069

10. Epidemiology and Reporting Characteristics of Systematic Reviews of Biomedical Research: A Cross-Sectional Study.

Authors: Matthew J Page; Larissa Shamseer; Douglas G Altman; Jennifer Tetzlaff; Margaret Sampson; Andrea C Tricco; Ferrán Catalá-López; Lun Li; Emma K Reid; Rafael Sarkis-Onofre; David Moher
Journal: PLoS Med Date: 2016-05-24 Impact factor: 11.069

9 in total

1. Future Objectivity Requires Perspective and Forward Combinatorial Meta-Analyses.

Authors: Barbara Hanfstingl
Journal: Front Psychol Date: 2022-06-17

2. Statistical Significance Filtering Overestimates Effects and Impedes Falsification: A Critique of.

Authors: Jonathan Z Bakdash; Laura R Marusich; Jared B Kenworthy; Elyssa Twedt; Erin G Zaroukian
Journal: Front Psychol Date: 2020-12-22

Review 3. Null and Void? Errors in Meta-analysis on Perceptual Disfluency and Recommendations to Improve Meta-analytical Reproducibility.

Authors: Sophia C Weissgerber; Matthias Brunmair; Ralf Rummer
Journal: Educ Psychol Rev Date: 2021-02-03

4. The REPRISE project: protocol for an evaluation of REProducibility and Replicability In Syntheses of Evidence.

Authors: Matthew J Page; David Moher; Fiona M Fidler; Julian P T Higgins; Sue E Brennan; Neal R Haddaway; Daniel G Hamilton; Raju Kanukula; Sathya Karunananthan; Lara J Maxwell; Steve McDonald; Shinichi Nakagawa; David Nunan; Peter Tugwell; Vivian A Welch; Joanne E McKenzie
Journal: Syst Rev Date: 2021-04-16

5. A Meta-Analysis on Mobile-Assisted Language Learning Applications: Benefits and Risks.

Authors: Mariela Mihaylova; Simon Gorin; Thomas P Reber; Nicolas Rothen
Journal: Psychol Belg Date: 2022-09-16

6. Development of a checklist to detect errors in meta-analyses in systematic reviews of interventions: study protocol.

Authors: Raju Kanukula; Matthew Page; Kerry Dwan; Simon Turner; Elizabeth Loder; Evan Mayo-Wilson; Tianjing Li; Adya Misra; Steve McDonald; Andrew Forbes; Joanne McKenzie
Journal: F1000Res Date: 2021-06-08

7. A new method for testing reproducibility in systematic reviews was developed, but needs more testing.

Authors: Dawid Pieper; Simone Heß; Clovis Mariano Faggion
Journal: BMC Med Res Methodol Date: 2021-07-29 Impact factor: 4.615

8. A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses.

Authors: Heidi Seibold; Severin Czerny; Siona Decke; Roman Dieterle; Thomas Eder; Steffen Fohr; Nico Hahn; Rabea Hartmann; Christoph Heindl; Philipp Kopper; Dario Lepke; Verena Loidl; Maximilian Mandl; Sarah Musiol; Jessica Peter; Alexander Piehler; Elio Rojas; Stefanie Schmid; Hannah Schmidt; Melissa Schmoll; Lennart Schneider; Xiao-Yin To; Viet Tran; Antje Völker; Moritz Wagner; Joshua Wagner; Maria Waize; Hannah Wecker; Rui Yang; Simone Zellner; Malte Nalenz
Journal: PLoS One Date: 2021-06-21 Impact factor: 3.240

Review 9. A meta-review of transparency and reproducibility-related reporting practices in published meta-analyses on clinical psychological interventions (2000-2020).

Authors: Rubén López-Nicolás; José Antonio López-López; María Rubio-Aparicio; Julio Sánchez-Meca
Journal: Behav Res Methods Date: 2021-06-26

9 in total