
Low statistical power and overestimated anthropogenic impacts, exacerbated by publication bias, dominate field studies in global change biology.

Yefeng Yang, Helmut Hillebrand, Malgorzata Lagisz, Ian Cleasby, Shinichi Nakagawa

Abstract

Field studies are essential for reliably quantifying ecological responses to global change because they expose organisms to realistic climatic conditions. Yet such studies are limited in replication, resulting in low statistical power and, therefore, potentially unreliable effect estimates. Furthermore, while manipulative field experiments are assumed to be more powerful than non-manipulative observations, this assumption has rarely been scrutinized with extensive data. Here, using 3847 field experiments designed to estimate the effects of environmental stressors on ecosystems, we systematically quantified their statistical power and their magnitude (Type M) and sign (Type S) errors. Our investigation focused on the reliability of field experiments for assessing the effects of stressors on both an ecosystem's response magnitude and its response variability. When controlling for publication bias, single experiments were underpowered to detect response magnitude (median power: 18%-38%, depending on the effect size metric). Single experiments also had much lower power to detect response variability (6%-12%, depending on the effect size metric) than response magnitude. Such underpowered studies could exaggerate estimates of response magnitude by two to three times (Type M error) and of response variability by four to ten times. Type S errors were comparatively rare. These observations indicate that low power, coupled with publication bias, inflates estimates of anthropogenic impacts. Importantly, we found that meta-analyses largely mitigated the issues of low power and exaggerated effect size estimates. Rather surprisingly, manipulative experiments and non-manipulative observations yielded very similar results in terms of power and Type M and S errors; the previously assumed superiority of manipulative experiments in terms of power is therefore overstated. These results call for highly powered field studies, achieved through more collaboration and team science and through large-scale ecosystem facilities, to reliably inform theory building and policymaking. Future studies also require transparent reporting and open science practices to move toward reproducible and reliable empirical work and evidence synthesis.
© 2021 The Authors. Global Change Biology published by John Wiley & Sons Ltd.

Keywords:  climate change; exaggerated effect size; experimentation; meta-research; meta-science; reproducibility; second-order meta-analysis; selective reporting bias; small-study effect; transparency

Year:  2021        PMID: 34736291      PMCID: PMC9299651          DOI: 10.1111/gcb.15972

Source DB:  PubMed          Journal:  Glob Chang Biol        ISSN: 1354-1013            Impact factor:   13.211


INTRODUCTION

As human-induced environmental changes accelerate, it is more important than ever that we can reliably quantify ecological responses to a range of environmental stressors (Hanson & Walker, 2020; Sage, 2020; Way, 2021). Although laboratory experiments can elucidate the underlying mechanisms of such ecological responses, they are often too small, too short-lived, and too artificial to accurately reflect naturally occurring responses (Rineau et al., 2019). Therefore, field experiments (both manipulations and non-manipulative observations) are essential to understand how an ecosystem responds to global change (Elmendorf et al., 2015; Sternberg & Yakir, 2015; Wolkovich et al., 2012). Field experimental manipulations are particularly valuable because they can quantify the effects of stressor magnitudes that go well beyond currently observed levels (Hillebrand et al., 2020; Rineau et al., 2019). Accordingly, thousands of field experiments have been conducted to investigate ecological responses to a wide range of anthropogenic environmental impacts such as climate change, biodiversity loss, and agricultural intensification (Hanson & Walker, 2020; Scheffer et al., 2001).

Yet few researchers seem to have asked whether these thousands of global change experiments can provide statistically reliable results to advance our understanding of the ecosystems of the future (Korell et al., 2020). While field experiments offer the possibility of working with realistic abundances and naturally occurring environmental conditions (and their variation), their replication is often limited by logistical constraints (Filazzola & Cahill, 2021; Fraser et al., 2020; Nakagawa & Parker, 2015). Therefore, it is essential to know whether these field experiments are adequately powered and reliable. Earlier work suggests that ecological studies are underpowered in some subfields (Fidler et al., 2017; Jennions & Møller, 2003; T. H. Parker et al., 2016); that is, a study usually has a sample size too small to detect a "true" effect size as statistically significant (at an alpha level of .05).

An important yet often underappreciated consequence of underpowered studies is that empirical studies with small sample sizes often present distorted estimates of true effects (Button et al., 2013; Nakagawa & Foster, 2004). This is because, in an underpowered study, the observed effect usually fails to achieve statistical significance (i.e., a two-tailed p-value < .05) unless the effect is overestimated. In other words, when an observed effect reaches statistical significance in an underpowered or small-sample study, the observed effect will always be larger in magnitude than the corresponding "true" effect (Lemoine et al., 2016; Young et al., 2008; see also a simulated example in Figure S1). Then, owing to the preferential publication of statistically significant effects (i.e., publication bias), such overestimated effects come to dominate the literature. The inflation of magnitude relative to a "true" effect is known as the exaggeration ratio or Type M (magnitude) error. A related concept is the Type S (sign) error, the probability of obtaining a statistically significant effect in the opposite direction to the true effect (Gelman & Carlin, 2014).

Recently, a few papers have pointed out the importance of quantifying Type M and S error rates (Cleasby et al., 2021; Lemoine et al., 2016; T. H. Parker et al., 2018). For example, Lemoine et al. (2016) showed that reported effect sizes of global warming on plant growth were, on average, three times larger than a "true" effect approximated by an overall meta-analytic mean (Type M error rate: 3). In animal tracking studies, Cleasby et al. (2021) demonstrated, using effect sizes derived from a previous meta-analysis (Cohen's d = 0.1; Bodey et al., 2018), that researchers could be overestimating the effect of bio-logging devices on animal behavior by 10-fold (Type M error rate) and estimating the direction of the effect incorrectly 20% of the time (Type S error rate). Accordingly, both studies argued that understanding Type M (and S) error rates, along with statistical power, would lead to better interpretation of results and improved experimental design in a field of study (cf. Button et al., 2013; Ioannidis et al., 2017; T. Stanley et al., 2018). However, no previous publication has systematically quantified statistical power and Type M and S error rates across global change studies (but see Lemoine et al., 2016).

Importantly, although earlier work often used a meta-analytic mean as a surrogate for the true effect when quantifying statistical power and error rates (e.g., Cleasby et al., 2021; Lemoine et al., 2016), large-scale power analyses from other fields have shown that meta-analytic means often suffer from publication bias (Button et al., 2013; Ioannidis et al., 2017; T. Stanley et al., 2018). This can lead to an overestimation of statistical power unless the bias is corrected (Button et al., 2013; Ioannidis et al., 2017; T. Stanley et al., 2018). Furthermore, environmental stressors are likely to influence not only ecological responses in magnitude (the mean value of a given ecological trait) but also the variance around that magnitude (i.e., heteroscedasticity; Figure 1a; for biological explanations of heteroscedasticity, see Cleasby & Nakagawa, 2011; De Villemereuil et al., 2018; Seekell et al., 2011). Therefore, it is important to quantify the three statistical parameters not only for response magnitude but also for response variability. As far as we know, no such investigation of response variability exists anywhere in the scientific literature.
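
To make the inflation mechanism concrete, here is a minimal R sketch (all numbers hypothetical, not drawn from any study cited here) of why significant results from an underpowered design must overestimate the true effect:

```r
# Simulate many replications of a small two-group study with a modest true effect.
set.seed(42)
true_d <- 0.2                 # assumed true standardized mean difference
n      <- 15                  # replicates per group, as in many field studies
se     <- sqrt(2 / n)         # approximate standard error of Cohen's d
est    <- rnorm(1e5, mean = true_d, sd = se)  # sampling distribution of estimates
sig    <- abs(est) > qnorm(0.975) * se        # two-tailed p < .05
mean(sig)                     # empirical power (far below the nominal 80%)
mean(abs(est[sig])) / true_d  # exaggeration among significant estimates (Type M)
mean(est[sig] < 0)            # significant estimates with the wrong sign (Type S)
```
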
FIGURE 1

Conceptual diagrams of effect size calculations from existing field studies and meta-analyses in global change biology, and of the analytic approaches used to assess the reliability of manipulative experiments and non-manipulative observations for evaluating the effect of stressors on both an ecosystem's response magnitude and its response variability. (a) An overview of the effect sizes used to quantify the ecosystem's response magnitude and variability. Mean-difference metrics were used to quantify the response magnitude to environmental stressors (i.e., lnRR, SMD, and SMDH), while variance-difference metrics were used to characterize the response variability to environmental stressors (i.e., lnVR and lnCVR). In the context of this paper, response variability is an indicator of heteroscedasticity (also known as heterogeneous or unequal variances). The detailed definitions and formulas for these effect size metrics are reported in Table 1. (b) An overview of the datasets used to quantify statistical power and Type M and Type S errors. The datasets were derived from the work of Hillebrand et al. (2020), which compiled 36 meta-analyses. Our lnRR* dataset contained the 30 meta-analyses whose effect size metrics were originally expressed as lnRR. Our lnRR dataset contained the lnRR metric recalculated from descriptive statistics available in 12 of the 30 meta-analyses in the lnRR* dataset. The SMD, SMDH, lnVR, and lnCVR datasets contained the corresponding metrics, also calculated from descriptive statistics available in 12 of the 30 meta-analyses in the lnRR* dataset. n_MA represents the number of meta-analyses per dataset. (c) The three-step modeling procedure employed to test our hypotheses.

TABLE 1

The formulas for the effect size statistics used to quantify the effect of environmental stressors on ecosystem response magnitude (mean difference: lnRR, SMD, and SMDH) and response variability (variance difference or heteroscedasticity: lnVR and lnCVR). In this paper's context, lnRR, SMD, and SMDH represent differences in mean values (magnitude) between a group under a global change stressor and another group under a benign environment, whereas lnVR and lnCVR represent differences in the variance around the mean between the two groups, without and with adjustment for the effect of mean change, respectively.

Effect size | Statistic | Annotation
Natural logarithm of response ratio, lnRR (ratio of means) | $\mathrm{lnRR} = \ln\left(\frac{m_p}{m_c}\right)$ (1) | $m_p$ and $m_c$ denote the average values of measurements from a group with an environmental stressor (p) and a control (c) group
Sampling variance of lnRR | $S^2_{\mathrm{lnRR}} = \frac{sd_p^2}{n_p m_p^2} + \frac{sd_c^2}{n_c m_c^2}$ (2) | $sd_p^2$ and $sd_c^2$ denote the corresponding variances of $m_p$ and $m_c$ (squared standard deviations of the sample), and $n_p$ and $n_c$ denote the sample sizes for the environmental stressor (p) and control (c) groups. Other symbols are as in Equation (1)
Standardized mean difference, SMD (Hedges' g or Cohen's d) | $\mathrm{SMD} = \frac{m_p - m_c}{\sqrt{\frac{(n_p - 1)sd_p^2 + (n_c - 1)sd_c^2}{n_p + n_c - 2}}}$ (3) | Symbols are as in Equations (1) and (2)
Sampling variance of SMD | $S^2_{\mathrm{SMD}} = \frac{n_p + n_c}{n_p n_c} + \frac{\mathrm{SMD}^2}{2(n_p + n_c)}$ (4) | Symbols are as in Equations (1) and (2)
Standardized mean difference with heteroscedasticity, SMDH | $\mathrm{SMDH} = \frac{m_p - m_c}{\sqrt{(sd_p^2 + sd_c^2)/2}}$ (5) | Symbols are as in Equations (1) and (2)
Sampling variance of SMDH | $S^2_{\mathrm{SMDH}} = \mathrm{SMDH}^2\,\frac{sd_p^4/(n_p - 1) + sd_c^4/(n_c - 1)}{2(sd_p^2 + sd_c^2)^2} + \frac{sd_p^2/(n_p - 1) + sd_c^2/(n_c - 1)}{(sd_p^2 + sd_c^2)/2}$ (6) | Symbols are as in Equations (1) and (2)
Natural logarithm of variability ratio, lnVR | $\mathrm{lnVR} = \ln\left(\frac{sd_p}{sd_c}\right) + \frac{1}{2}\left[\frac{1}{n_p - 1} - \frac{1}{n_c - 1}\right]$ (7) | Positive values of lnVR indicate that the environmental stressor increases the variance of measurements without adjusting for the effect of mean change (i.e., more variable traits). Symbols are as in Equations (1) and (2)
Sampling variance of lnVR | $S^2_{\mathrm{lnVR}} = \frac{1}{2}\left[\frac{1}{n_p - 1} + \frac{1}{n_c - 1}\right]$ (8) | Symbols are as in Equation (2)
Natural logarithm of the ratio of coefficients of variation, lnCVR | $\mathrm{lnCVR} = \ln\left(\frac{CV_p}{CV_c}\right) + \frac{1}{2}\left[\frac{1}{n_p - 1} - \frac{1}{n_c - 1}\right]$ (9) | $CV_p$ and $CV_c$ are the coefficients of variation (i.e., standard deviation divided by the mean) for the environmental stressor (p) and control (c) groups. Positive values of lnCVR indicate that the environmental stressor increases the variance of measurements while adjusting for the effect of mean change (i.e., more variable traits). Other symbols are as in Equation (2)
Sampling variance of lnCVR | $S^2_{\mathrm{lnCVR}} = \frac{sd_p^2}{n_p m_p^2} + \frac{sd_c^2}{n_c m_c^2} + \frac{1}{2}\left[\frac{1}{n_p - 1} + \frac{1}{n_c - 1}\right]$ (10) | Symbols are as in Equations (1) and (2)
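
The effect sizes in Table 1 can be computed from descriptive statistics with metafor's escalc() function; the sketch below (not the authors' exact script) uses a hypothetical single-row data frame. In escalc(), the measure codes "ROM", "VR", and "CVR" correspond to lnRR, lnVR, and lnCVR, respectively.

```r
library(metafor)

# Hypothetical descriptive statistics for one stressor (p) vs. control (c) contrast
dat1 <- data.frame(m_p = 5.2, sd_p = 1.4, n_p = 12,
                   m_c = 4.1, sd_c = 1.0, n_c = 12)

# yi is the effect size (Eqs 1, 3, 5, 7, 9); vi its sampling variance (Eqs 2, 4, 6, 8, 10)
for (msr in c("ROM", "SMD", "SMDH", "VR", "CVR")) {
  es <- escalc(measure = msr,
               m1i = m_p, sd1i = sd_p, n1i = n_p,
               m2i = m_c, sd2i = sd_c, n2i = n_c, data = dat1)
  cat(msr, ": yi =", round(es$yi, 3), " vi =", round(es$vi, 3), "\n")
}
```
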
To this end, we conduct the first large-scale quantification of statistical power and Type M and S error rates, using manipulative field experiments and non-manipulative observations covering the dominant stressors in global change biology (cf. Sage, 2020). More specifically, we quantify these three parameters at two levels, the single experiment and the meta-analysis (e.g., the statistical power of a field experiment vs. that of a meta-analysis), for both ecological response magnitude and variability (i.e., mean and variance differences between an environmental stressor and a benign or control environment). In addition, we estimate true effects with and without correcting for publication bias because, as mentioned, failing to correct for publication bias can lead to the overestimation of statistical power and of Type M and S errors. We hypothesize that global change studies are generally underpowered, with high exaggeration ratios, although Type S error rates are relatively low. We also predict that manipulative field experiments will have greater statistical power and lower Type M and S errors than non-manipulative field observations, because manipulative experiments often involve stressor levels beyond currently observed levels, so that ecological responses (i.e., effect sizes) should be larger in both magnitude and variation (Hillebrand et al., 2020; Kreyling & Beier, 2013).

MATERIALS AND METHODS

An overview of the methodology

To address the aims above, we used a database of global change biology containing 30 meta-analyses (3847 field experiments/observations) spanning a multitude of environmental stressors (see Section 2.2 below; Hillebrand et al., 2020). Using this database, we calculated five standardized effect size statistics to quantify response magnitude (mean difference) and response variability (variance difference) to environmental stressors in global change studies. For response magnitude, we used (1) the natural logarithm of the response ratio, lnRR (Hedges et al., 1999), (2) the standardized mean difference, SMD (also known as Hedges' g or Cohen's d; Hedges, 1982), and (3) the standardized mean difference with heteroscedastic population variances in the two groups, SMDH (see formulas in Table 1). Note that SMD assumes homoscedasticity (i.e., equal variances; Hedges, 1982), whereas SMDH allows for heteroscedasticity (Bonett, 2008, 2009). Also, heteroscedasticity only affects the sampling variance of lnRR, not its point estimate (Sánchez-Tójar et al., 2020). For quantifying response variability, we used (4) the natural logarithm of the variability ratio, lnVR (Nakagawa et al., 2015), and (5) the natural logarithm of the ratio of coefficients of variation, lnCVR (Nakagawa et al., 2015), which adjusts for changes in mean values (see formulas in Table 1).

We used a three-step modeling procedure to test our main hypotheses (Figure 1c). In the first step, we used a meta-analytic approach to obtain the key quantity for power calculations: an estimate of the "true" effect size of a phenomenon (Nakagawa & Foster, 2004). To achieve this, we employed the meta-analytic (overall) mean, rather than the "observed" effect size from a given study, as a proxy of the true effect to avoid overestimating statistical power (for examples of this approach, see Button et al., 2013; Cleasby et al., 2021). We therefore meta-analyzed the five effect size statistics (Table 1) separately to obtain meta-analytic means for each meta-analytic case (Section 2.3). For lnRR, SMD, and SMDH, we also estimated bias-corrected versions of the corresponding effect sizes to adjust for publication bias (also known as the small-study effect; Vevea & Hedges, 1995; Section 2.4). By contrast, we could not calculate bias-corrected lnVR and lnCVR because statistical significance, rather than response variability (heteroscedasticity or variance difference), drives publication bias (see Senior, Gosby, et al., 2016). We therefore assumed that lnVR and lnCVR were not affected by publication bias in the way lnRR, SMD, and SMDH were.

In the second step, we calculated the statistical power to detect the estimates of true effects, together with their magnitude (Type M) and sign (Type S) error rates, for each meta-analysis and for every single experiment included in each meta-analysis (Section 2.5.1; Table 2). In the third step, to obtain overall estimates of the three parameters across different meta-analyses (providing comparable summaries of the three parameters), we used a weighted regression to statistically aggregate the three parameters obtained at the meta-analysis level, and a mixed-effects model to aggregate them at the experiment level. Both procedures involved aggregating the parameters across meta-analyses (i.e., between-meta-analysis modeling; Section 2.5.2). We also conducted a secondary synthesis of the true effects estimated in the first step across meta-analyses (i.e., a meta-analysis of the overall means obtained from the 30 included meta-analyses, also referred to as a second-order meta-analysis or meta-meta-analysis; cf. Nakagawa et al., 2019; Section 2.6). We conducted all analyses in the R environment v. 4.0.3 (R Core Team, 2020). All relevant data and code can be found at https://zenodo.org/record/5496789#.YTmbiI4zY2w.
TABLE 2

The definitions of statistical power, Type M and S error rates. For the definitions of lnRR, SMD, SMDH, lnVR, and lnCVR, see Table 1

Term | Definition
Statistical power | The probability of detecting a statistically significant effect size, whether response magnitude (lnRR and SMD) or response variability (lnVR or lnCVR), given that the effect size is non-zero. For a given sample size, the smaller the true effect size (response magnitude or variability), the lower the statistical power. Note also that statistical power is 1 minus the Type II error rate
Type S error | The probability of a statistically significant effect size having the opposite sign to the true direction (for lnRR, SMD, lnVR, or lnCVR), given that the true effect size is non-zero. For a given sample size, the smaller the effect size (response magnitude or variability), the higher the Type S error rate
Type M error | The multiplicative factor by which the magnitude of an effect size (lnRR, SMD, lnVR, or lnCVR) may be exaggerated when the true effect size is non-zero. For a given sample size, the smaller the effect size (response magnitude or variability), the higher the Type M error

Global change meta‐analysis database

Our global change meta-analysis database reflected a range of responses of ecosystem processes to the most pervasive anthropogenic global change stressors, including climate warming, fire, eutrophication, and nitrogen fertilization (Hillebrand et al., 2020). The database was originally used to quantify how evident thresholds were in ecological responses to anthropogenic global change (available at https://zenodo.org/record/5496789#.YTmbiI4zY2w). The dataset did not contain single-species experiments; it included manipulative community-level experiments, mostly in the field, and non-manipulative observations. It followed strict inclusion and exclusion criteria (as described in Hillebrand et al., 2020) and contained 36 meta-analyses (providing 4601 effect sizes in the form of lnRR). We excluded six meta-analyses from the original database because they did not provide the sampling variances (Table 1) required for formal weighted meta-analyses and for calculations of statistical power and Type M and S errors. Thus, our final database contained 30 meta-analyses (Figure 1b), which provided 3850 estimates of lnRR, each paired with a corresponding estimate of its sampling variance. For these 30 meta-analyses in the form of lnRR (referred to as the lnRR* dataset), the number of studies (N) included per meta-analysis ranged from 11 to 186 (M = 37.3, median = 26.5, SD = 37.1). The number of effect sizes (k) in lnRR* ranged from 35 to 562 (M = 128.2, median = 85.0, SD = 121). In addition, within the lnRR* dataset, 12 of the 30 meta-analyses provided the descriptive statistics of the included primary studies: means (m_p or m_c), standard deviations (sd_p or sd_c), and sample sizes (n_p or n_c), which enabled us to calculate SMD, SMDH, lnVR, and lnCVR and their sampling variances for these 12 meta-analyses. We also recalculated lnRR from these 12 meta-analyses (referred to as the lnRR dataset, to distinguish it from lnRR*) so as to compare the statistical power and Type M and S errors across lnRR, SMD, SMDH, lnVR, and lnCVR (Section 2.5). For the 12 meta-analyses (effect sizes in the form of lnRR, SMD, SMDH, lnVR, and lnCVR), N ranged from 11 to 186 (M = 42.8, median = 19, SD = 58.2) and k ranged from 44 to 450 (M = 164.8, median = 119.5, SD = 119.2). The replicates (n; sample size per study) in each study of the 12 datasets ranged from 4 to 10,000 (M = 38.4, median = 12, SD = 83.0). Of the 30 meta-analyses, 11 used non-manipulative observations, 17 used manipulative experiments, and 2 used both. We followed the original database in defining the categories of environmental stressors, namely acidification (Acid, k = 62; Nagelkerken & Connell, 2015), biodiversity loss (BD loss, k = 942; Cardinale et al., 2006; Griffin et al., 2013; Östman et al., 2016), fertilization (Fert, k = 811; Akiyama et al., 2010; Elser et al., 2007; Liang et al., 2016; Treseder, 2008), bush fire (Fire, k = 179; Dijkstra & Adams, 2015; Dooley & Treseder, 2012), plant invasion (Inv, k = 316; Gaertner et al., 2014; Gallardo et al., 2016; Vilà et al., 2011), land use change (LUC, k = 612; Gibson et al., 2011; Van Lent et al., 2014), precipitation (Precip, k = 138; Liu et al., 2016), and global warming (Warm, k = 790; Ateweberhan & McClanahan, 2010; Lin et al., 2010; Lu et al., 2013).

Meta‐analyses and estimating the proxies of “true” effects

As the first step of our three‐step modeling procedure, we estimated various proxies of “true” effects for each meta‐analysis. The proxies of “true” effects included (1) meta‐analytic overall means (MAOMs), which represented a common “true” effect shared by the multiple experiments within a given meta‐analysis (Section 2.3.1), (2) effect size specific predictions (ESSPs), which represented experiment‐dependent effects (i.e., multiple true effects within a given meta‐analysis; Section 2.3.2), and (3) the publication‐bias‐corrected versions of MAOMs and ESSPs (Section 2.4).

Meta‐analytic overall means

To estimate "true" effects for each meta-analysis, we employed a multilevel model to estimate MAOMs (Nakagawa & Santos, 2012), in which the non-independence in the datasets (i.e., multiple effect sizes per study) was accounted for by incorporating effect size and study identities as random factors (Noble et al., 2017). We used the rma.mv function in the metafor package (Viechtbauer, 2010) to run the following multilevel meta-analytic model for lnRR*, lnRR, SMD, SMDH, lnVR, or lnCVR, respectively (Nakagawa & Santos, 2012):

$$\mathrm{ES}_{ij} = \beta_0 + s_j + u_{ij} + e_{ij}, \quad s_j \sim N(0, \sigma_s^2), \quad u_{ij} \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, v_{ij}), \tag{11}$$

where $N(\cdot,\cdot)$ is a normal distribution with two parameters, mean and variance. Here $\mathrm{ES}_{ij}$ is the observed effect size estimate (i.e., lnRR, SMD, SMDH, lnVR, or lnCVR), $\beta_0$ is the intercept (i.e., the MAOM), $s_j$ is the between-study effect for study j, $u_{ij}$ is the within-study effect for effect size i in study j, $e_{ij}$ is the sampling error for effect size i in study j, and $\sigma_s^2$, $\sigma_u^2$, and $v_{ij}$ are the associated variance components.
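
A minimal sketch of fitting Equation (11) with metafor::rma.mv() follows; the simulated data frame and its column names (study_ID, es_ID) are hypothetical stand-ins for a real meta-analytic dataset.

```r
library(metafor)
set.seed(1)
k    <- 30
dat2 <- data.frame(study_ID = rep(1:10, each = 3),  # 10 studies, 3 effect sizes each
                   es_ID    = 1:k,
                   vi       = runif(k, 0.02, 0.2))  # known sampling variances v_ij
dat2$yi <- 0.3 + rnorm(10, 0, 0.2)[dat2$study_ID] +  # beta_0 + s_j
           rnorm(k, 0, 0.1) +                        # u_ij
           rnorm(k, 0, sqrt(dat2$vi))                # e_ij

mod <- rma.mv(yi, vi,
              random = list(~ 1 | study_ID,  # between-study effect s_j
                            ~ 1 | es_ID),    # within-study effect u_ij
              data = dat2)
summary(mod)  # the intercept is the MAOM (beta_0); sigma^2 = variance components
```
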

Effect size specific predictions

Given the high heterogeneity of ecological datasets (I² > 90%; Senior, Grueber, et al., 2016), there rarely exists a common effect size shared among the different studies within a meta-analysis. For example, nutrient enrichment has a large effect on plant biomass, whereas a lack of light will largely reduce this effect. Therefore, we used an alternative proxy of the true effect to accommodate such experiment-dependent effects (i.e., multiple true effects within a given meta-analysis): the ESSP (see Equation 12). ESSPs can be estimated using best linear unbiased predictions (BLUPs) at the observation level, which are defined as the (conditional) point estimates given a set of random effects in a mixed-effects model (Hadfield et al., 2010). We defined ESSPs as follows:

$$\mathrm{ESSP}_{ij} = \hat{\beta}_0 + \hat{s}_j + \hat{u}_{ij}, \tag{12}$$

where the notation is as in Equation (11), and $\hat{\beta}_0$, $\hat{s}_j$, and $\hat{u}_{ij}$ are the estimated parameters from Equation (11). Equation (12) shows that ESSPs are the sum of the overall mean (MAOM), the between-study effect $\hat{s}_j$, and the within-study (effect-size-specific) effect $\hat{u}_{ij}$. ESSPs were obtained using the rma.mv function in metafor (Viechtbauer, 2010).
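
Building on 'mod' and 'dat2' from the sketch above, ESSPs can be assembled as in Equation (12) from the BLUPs returned by metafor's ranef(); note that the 'intrcpt' column name assumed below follows ranef()'s output for intercept-only random effects and is worth checking with str(re) for your metafor version.

```r
re <- ranef(mod)  # list with one data frame of BLUPs per random factor
essp <- coef(mod)[1] +                                          # beta_0-hat (MAOM)
        re$study_ID[as.character(dat2$study_ID), "intrcpt"] +   # s_j-hat
        re$es_ID[as.character(dat2$es_ID), "intrcpt"]           # u_ij-hat
head(essp)  # one experiment-specific "true" effect per effect size
```
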

Obtaining bias‐corrected meta‐analytic estimates

For response magnitude (i.e., lnRR, SMD, and SMDH), publication bias can translate into overestimated meta-analytic means (MAOMs; Vevea & Hedges, 1995). To alleviate such bias, we employed an extended version of Egger's regression approach (a multilevel meta-regression; cf. Nakagawa, Lagisz, Jennions, et al., 2021), which yielded a bias-corrected version of the MAOMs. In brief, this approach incorporates an uncertainty term as a moderator in a multilevel meta-regression model: the square root of the inverse of the "effective sample size", or that inverse itself (strictly speaking, the effective sample size is $\tilde{n} = \frac{n_p n_c}{n_p + n_c}$):

$$\mathrm{ES}_{ij} = \beta_0 + \beta_1 \sqrt{1/\tilde{n}_{ij}} + s_j + u_{ij} + e_{ij}, \tag{13}$$

$$\mathrm{ES}_{ij} = \beta_0 + \beta_1 \left(1/\tilde{n}_{ij}\right) + s_j + u_{ij} + e_{ij}, \tag{14}$$

where the notation is otherwise as in Equation (11). $\beta_0$ is the (conditional) bias-corrected meta-analytic overall mean (cMAOM, hereafter) obtained by assuming that no uncertainty exists: $\sqrt{1/\tilde{n}_{ij}} = 0$ in Equation (13) or $1/\tilde{n}_{ij} = 0$ in Equation (14). If $\beta_1$ in Equation (13) (the slope of $\sqrt{1/\tilde{n}_{ij}}$) is statistically non-significant (p-value > .05), $\beta_0$ in Equation (13) is the best estimate of the cMAOM. If $\beta_1$ in Equation (13) is statistically significant (p-value < .05), $\beta_0$ in Equation (14) is the best estimate of the cMAOM (T. D. Stanley & Doucouliagos, 2014; T. D. Stanley et al., 2017). We note that the slope ($\beta_1$) of Equation (13) could be in the opposite direction to that expected from publication bias (Figure S2); in such cases, we considered the dataset not to suffer from publication bias and used the MAOMs as their cMAOMs. Eighteen meta-analyses within the lnRR* dataset did not report the replicates (n; sample size per study) needed to calculate the effective sample size; for these, we used the sampling error (SE, the square root of the sampling variance) in place of $\sqrt{1/\tilde{n}_{ij}}$ in Equation (13) and the sampling variance (SE²) in place of $1/\tilde{n}_{ij}$ in Equation (14). When calculating statistical power and Type M and S error rates, we used the unconditional standard error rather than a conditional one (viz., using the SE of $\beta_0$ in Equation 11 to replace that of Equations 13 or 14). The models in Equations (13) and (14) were implemented with the rma.mv function in metafor. Furthermore, with the cMAOMs, we used Equation (12) to obtain bias-corrected effect size specific predictions (cESSPs). In our datasets, 20 of 30 (lnRR*), 6 of 12 (lnRR), 5 of 12 (SMD), and 5 of 12 (SMDH) meta-analyses showed no statistical evidence of the small-study effect (Figure S3).
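
A hedged sketch of this two-step correction (Equations 13 and 14), continuing with 'dat2' from above and using the sampling standard error and variance as the uncertainty moderators (the substitution described for meta-analyses without reported replicates):

```r
dat2$sei <- sqrt(dat2$vi)  # sampling error as a stand-in for sqrt(1/n-tilde)

pet   <- rma.mv(yi, vi, mods = ~ sei,   # Equation (13) analogue
                random = list(~ 1 | study_ID, ~ 1 | es_ID), data = dat2)
peese <- rma.mv(yi, vi, mods = ~ vi,    # Equation (14) analogue
                random = list(~ 1 | study_ID, ~ 1 | es_ID), data = dat2)

# If the slope of Equation (13) is non-significant, its intercept is the cMAOM;
# otherwise the intercept of Equation (14) is used.
cmaom <- if (pet$pval[2] > 0.05) coef(pet)[1] else coef(peese)[1]
cmaom
```
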

Estimating statistical power, Type M and S error rates

(Within‐)meta‐analysis level modeling

We calculated statistical power and Type M and S errors at two levels: the meta-analysis level (i.e., the three parameters for each meta-analysis) and the single experiment level (i.e., the three parameters for each experiment or effect size within a given meta-analysis; Figure 1c). We expected statistical power at the meta-analysis level to be much higher than at the single experiment level, although it remained possible that a meta-analysis might not have enough statistical power to detect the estimated overall effect (i.e., a non-significant overall effect; Cohn & Becker, 2003). In addition to the proxies of "true" effects (i.e., MAOMs, ESSPs, cMAOMs, and cESSPs), we required an SE for each effect size estimate to calculate statistical power and Type M and S errors. At the meta-analysis level, we used the SEs from the meta-analytic models (i.e., Equations 11, 13, or 14). At the single experiment level, we used the square root of the sampling variance of each effect size (see Table 1) as the SE.
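
Given a proxy "true" effect and an SE, the three parameters of Table 2 can be computed with a retrodesign-style function (after Gelman & Carlin, 2014); the sketch below is a generic illustration with hypothetical inputs, not the authors' exact code.

```r
retro <- function(true_es, se, alpha = 0.05, n_sim = 1e5) {
  z     <- qnorm(1 - alpha / 2)
  lam   <- true_es / se
  power <- pnorm(lam - z) + pnorm(-lam - z)  # P(|estimate/SE| > z)
  typeS <- pnorm(-lam - z) / power           # significant but wrong-signed
                                             # (assumes true_es > 0)
  est   <- rnorm(n_sim, true_es, se)         # simulate the exaggeration ratio
  sig   <- abs(est) > z * se
  typeM <- mean(abs(est[sig])) / abs(true_es)
  c(power = power, typeM = typeM, typeS = typeS)
}
set.seed(3)
retro(true_es = 0.2, se = 0.3)  # e.g., a cMAOM of 0.2 and an experiment SE of 0.3
```
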

Between‐meta‐analysis modeling

Importantly, we also obtained the overall (average) statistical power and Type M and S errors for each effect size statistic across the different meta-analyses (i.e., between-meta-analysis estimates; Figure 1c). Such overall estimates provided comparable summaries of statistical power and Type M and S errors. At the meta-analysis level, we used a weighted regression, implemented with the base R function lm, with the number of effect sizes (k) in each meta-analysis as the weight. The weighted regression models allowed us to average the meta-analysis-level estimates of power and Type M and S errors (using MAOMs and cMAOMs). At the single experiment level, we used mixed-effects models employing the lmer function in the R package lme4 (Bates et al., 2014), with study identity as a random factor. These mixed-effects models allowed us to average the single-experiment-level estimates (using MAOMs, cMAOMs, ESSPs, and cESSPs). Furthermore, to compare the average statistical power and Type M and S errors between manipulative experiments and non-manipulative observations, we added study approach (manipulative experiment vs. non-manipulative observation) to these mixed-effects models as a fixed factor, and stressor category as a random factor. Before fitting the above models with lm and lmer, we ln-transformed the response variables (estimates of statistical power and Type M and S error rates) to better meet the assumption of normal residuals (Figures S4–S6). For ease of interpretation, we back-transformed (i.e., exponentiated) the intercepts of the lm and lmer models to obtain the median value on the original scale (Nakagawa et al., 2017); we also obtained the mean value on the original scale (using equation 5.8 in Nakagawa et al., 2017). For the Type S error rate, we added 0.025 to all cases because the estimates included many zeros and extremely small values, which made ln-transformation impossible or ineffective; when back-transforming estimates from these models, we accordingly subtracted 0.025 on the original scale. Finally, when back-transformed estimates (statistical power and Type S error) fell below or above the boundary values (i.e., 0 or 1, respectively), we constrained them to the boundaries.
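
A sketch of the experiment-level aggregation with lme4 follows (simulated, hypothetical data): ln-transformed power is modeled with study identity as a random factor, and the intercept is back-transformed to a median and, with the lognormal correction exp(b + sigma^2/2) (equation 5.8 in Nakagawa et al., 2017), to a mean on the original scale.

```r
library(lme4)
set.seed(4)
df  <- data.frame(power    = runif(200, 0.02, 0.6),   # hypothetical power estimates
                  study_ID = rep(1:40, each = 5))
fit <- lmer(log(power) ~ 1 + (1 | study_ID), data = df)

b    <- fixef(fit)[1]                             # intercept on the ln scale
sig2 <- sum(as.data.frame(VarCorr(fit))$vcov)     # random-effect + residual variance
c(median = unname(exp(b)),                        # back-transformed median
  mean   = unname(exp(b + sig2 / 2)))             # back-transformed mean
```
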

Response magnitude and variability across environmental stressors

To estimate the overall response magnitude and variability across meta-analyses (i.e., a between-meta-analysis synthesis), we conducted a secondary synthesis of the estimates of response magnitude and variability from each meta-analysis. Of note, each meta-analysis represented one specific stressor (e.g., a meta-analysis of acidification, a meta-analysis of global warming; see Section 2.2). We also assessed whether such overall effects differed significantly between manipulative experiments and non-manipulative observations. To achieve this, we first obtained the absolute values of the (c)MAOMs and their sampling variances (i.e., the mean and variance estimated from a folded normal distribution; see Morrissey, 2016) for each meta-analysis (that is, across stressors). Second, we statistically aggregated these absolute estimates and their sampling variances via a random-effects model using the rma function in the R package metafor (Viechtbauer, 2010). Third, we conducted a meta-regression with study approach as a moderator to quantify the effects for manipulative experiments and non-manipulative observations separately (excluding the two meta-analyses that contained both experimental and observational data; see Section 2.2).
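
A hedged sketch of this secondary synthesis: each (c)MAOM is folded to its absolute value using the mean and variance of a folded normal distribution (Morrissey, 2016), then aggregated with a random-effects model; the four MAOMs and SEs below are hypothetical.

```r
library(metafor)

fold <- function(mu, se) {
  # E|X| and Var|X| for X ~ N(mu, se^2), i.e., the folded normal moments
  mu_f <- se * sqrt(2 / pi) * exp(-mu^2 / (2 * se^2)) +
          mu * (1 - 2 * pnorm(-mu / se))
  c(mu_f = mu_f, var_f = mu^2 + se^2 - mu_f^2)
}

maom <- c(0.31, -0.12, 0.55, 0.08)  # one (c)MAOM per meta-analysis (hypothetical)
se   <- c(0.10,  0.07, 0.20, 0.05)
folded <- t(mapply(fold, maom, se))

rma(yi = folded[, "mu_f"], vi = folded[, "var_f"])  # overall absolute effect
```
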

RESULTS

The effects of stressors on ecosystem response magnitude and variability

Overall, environmental stressors had a statistically significant impact on response magnitude (an increase of more than 33.7%; Figure 2a). For the results for each stressor, see Figures S7–S9 (each meta-analysis focused on a specific stressor, but a given stressor may be covered by multiple meta-analyses; e.g., Warm 1, Warm 2, and Warm 3 were three meta-analyses all concerned with global warming). Bias-corrected estimates of response magnitude were 17%–31% lower (Figure 2b). Similarly, stressors had a statistically significant effect on response variability (an increase of more than 20%; Figure 2c; shown by stressor in Figure S10). Furthermore, manipulative experiments showed a statistically significantly larger response magnitude than non-manipulative observations for some effect size types (i.e., uncorrected SMD, uncorrected SMDH, and corrected SMDH; Table S1). In contrast, the differences in response variability between manipulative experiments and non-manipulative observations were not statistically significant.
FIGURE 2

Orchard (forest-like) plots showing the weighted average of response magnitude and variability across all environmental stressors. (a) The effects of environmental stressors on ecosystem response magnitude measured as lnRR*, lnRR, SMD, and SMDH. (b) Bias-corrected ecosystem response magnitude. (c) The effects of environmental stressors on ecosystem response variability measured as lnVR and lnCVR. The unfilled circles represent the weighted overall averages of response magnitude and variability. The filled circles represent the associated MAOM for each type of environmental stressor (MAOMs or cMAOMs estimated for each meta-analysis). The size of the filled circles scales the estimates of single stressors proportionally to their precision (precision is the inverse of the standard error, SE). Bold whisker line = 95% confidence interval (CI); thin whisker line = 95% prediction interval (PI); k = number of effect sizes (in the context of this figure, the number of MAOM or cMAOM estimates). cMAOM, bias-corrected meta-analytic overall mean; MAOM, meta-analytic overall mean. We used the R package orchaRd (Nakagawa, Lagisz, O'Dea, et al., 2021) for visualizations.


Statistical power in global change studies

Statistical power in detecting response magnitude

Across all stressors, single experiments had much lower power to detect bias-corrected response magnitude than the nominal 80% level (Table 3): 23.3% for lnRR* (Figure 3a), 38.5% for lnRR (Figure 3a), 19.1% for SMD (Figure 3b), and 18.2% for SMDH (Figure 3d). When each experiment was assumed to have its own true effect (cESSP), the power values were similar to those estimated from a common true effect (cMAOM; Table 3; Figure 3). The corresponding power values for uncorrected response magnitude were 19%–66% higher than those for the bias-corrected versions (Table 3; Figure 3). The median proportions of single experiments with adequate power to detect bias-corrected lnRR*, lnRR, SMD, and SMDH were only 16.3%, 33.2%, 6.6%, and 6.9%, respectively (Figure 3). As expected, the median power of meta-analyses to detect bias-corrected response magnitude was greater than that of single experiments, although it still fell short of the nominal 80% level: 42.4%–63.5%, depending on the effect size type (Table 3; Figure 3). As at the single experiment level, uncorrected meta-analyses overestimated power by ~2%–33% compared with the bias-corrected versions (Table 3; Figure 3).
TABLE 3

The model estimates of statistical power to detect the effect of environmental stressors on ecosystem response magnitude (lnRR*, lnRR, SMD, and SMDH and their publication bias‐corrected versions) and response variability (or heteroscedasticity: lnVR and lnCVR). The model estimates of power were reported both on single experiment level and meta‐analysis level. We used mixed‐effects models and weighted regression models to average over single experiment level statistical power (using MAOMs, cMAOMs, ESSPs, and cESSPs), and meta‐analysis level statistical power (using MAOMs and cMAOMs), respectively. We noted that (1) the confidence intervals of statistical estimates were asymmetrical due to the back‐transformation, (2) statistical power estimates below or above the boundary values (i.e., 0 or 1) were constrained to the boundaries (i.e., 0# or 1#)

Effect size | True effect | Median | CI.lb | CI.ub | Mean | k | N
(Median, CI.lb, CI.ub, and Mean are the model estimates of statistical power.)

Single experiment
lnRR* | cMAOM | 0.233 | 0.218 | 0.248 | 0.433 | 3847 | 1119
lnRR* | cESSP | 0.279 | 0.262 | 0.2887 | 0.547 | 3847 | 1119
lnRR* | MAOM | 0.277 | 0.260 | 0.2885 | 0.515 | 3847 | 1119
lnRR* | ESSP | 0.286 | 0.269 | 0.304 | 0.560 | 3847 | 1119
lnRR | cMAOM | 0.385 | 0.353 | 0.420 | 0.716 | 1940 | 516
lnRR | cESSP | 0.359 | 0.331 | 0.390 | 0.704 | 1940 | 516
lnRR | MAOM | 0.523 | 0.486 | 0.780 | 0.973 | 1940 | 516
lnRR | ESSP | 0.401 | 0.370 | 0.436 | 0.786 | 1940 | 516
SMD | cMAOM | 0.191 | 0.179 | 0.205 | 0.356 | 1977 | 516
SMD | cESSP | 0.209 | 0.194 | 0.225 | 0.195 | 1977 | 516
SMD | MAOM | 0.318 | 0.288 | 0.343 | 0.591 | 1977 | 516
SMD | ESSP | 0.268 | 0.249 | 0.288 | 0.526 | 1977 | 516
SMDH | cMAOM | 0.182 | 0.170 | 0.195 | 0.339 | 1977 | 516
SMDH | cESSP | 0.187 | 0.174 | 0.201 | 0.367 | 1977 | 516
SMDH | MAOM | 0.269 | 0.250 | 0.2881 | 0.501 | 1977 | 516
SMDH | ESSP | 0.234 | 0.217 | 0.252 | 0.458 | 1977 | 516
lnVR | MAOM | 0.115 | 0.109 | 0.122 | 0.214 | 1902 | 514
lnVR | ESSP | 0.186 | 0.172 | 0.201 | 0.365 | 1902 | 514
lnCVR | MAOM | 0.064 | 0.062 | 0.067 | 0.120 | 1886 | 513
lnCVR | ESSP | 0.105 | 0.098 | 0.112 | 0.205 | 1886 | 513

Meta-analysis
lnRR* | cMAOM | 0.424 | 0.286 | 0.628 | 0.583 | 3847 | 1119
lnRR* | MAOM | 0.567 | 0.424 | 0.756 | 0.780 | 3847 | 1119
lnRR | cMAOM | 0.512 | 0.249 | 1# | 0.704 | 1940 | 516
lnRR | MAOM | 0.665 | 0.195 | 1# | 0.915 | 1940 | 516
SMD | cMAOM | 0.621 | 0.330 | 1# | 0.855 | 1977 | 516
SMD | MAOM | 0.645 | 0.357 | 1# | 0.887 | 1977 | 516
SMDH | cMAOM | 0.635 | 0.352 | 1# | 0.873 | 1977 | 516
SMDH | MAOM | 0.646 | 0.362 | 1# | 0.889 | 1977 | 516
lnVR | MAOM | 0.439 | 0.250 | 0.77 | 0.604 | 1902 | 514
lnCVR | MAOM | 0.526 | 0.315 | 0.878 | 0.723 | 1886 | 513

Abbreviations: cESSP, bias‐corrected effect size‐specific prediction; cMAOM, bias‐corrected meta‐analytic overall mean; ESSP, effect size‐specific prediction; k, the number of effect sizes; MAOM, meta‐analytic overall mean; N, the number of primary studies.

FIGURE 3

Single experiments' median power to detect response magnitude and variability for each category of environmental stressors (on the y-axis; different subscripts denote that a given stressor may be covered by multiple meta-analytic cases), assuming one common "true" effect per stressor (MAOM), experiment-specific "true" effects within a stressor (ESSP), or their bias-corrected estimates (cMAOM and cESSP) as the "true" effects. The use of meta-analysis increased the statistical power for some environmental stressors (MAOM.MA and cMAOM.MA). (a) The dataset lnRR* (n_MA = 30, k = 3847). (b) The dataset SMD (n_MA = 12, k = 1977). (c) The dataset lnVR (n_MA = 12, k = 1902). (d) The dataset SMDH (n_MA = 12, k = 1977). (e) The dataset lnCVR (n_MA = 12, k = 1886). Acid, acidification; BD loss, biodiversity loss; cESSP, bias-corrected effect size-specific prediction; cMAOM, bias-corrected meta-analytic overall mean; ESSP, effect size-specific prediction; Fert, fertilization; Fire, bush fire; Inv, plant invasion; k, the number of effect sizes; LUC, land use change; MAOM, meta-analytic overall mean; n_MA, the number of meta-analyses per dataset; Precip, precipitation; Warm, global warming.


Statistical power in detecting response variability

Overall, at the single experiment level, lnVR and lnCVR had much lower statistical power to detect heteroscedasticity than the nominal 80% level: 11.5% for lnVR and 6.4% for lnCVR (Table 3; Figure 3c,e). The median proportions of experimental lnVR and lnCVR estimates with adequate power to detect response variability were only 3.7% and 0%, respectively (Figure 3). Meta-analysis increased the overall power to identify response variability roughly four- to six-fold, to 43.9% for lnVR and 52.6% for lnCVR (Table 3; Figure 3). The proportions with adequate power increased to 33.3% and 16.7% when using meta-analysis to detect lnVR and lnCVR, respectively (Figure 4).
FIGURE 4

Single experiments' median Type M error rates (i.e., exaggeration ratios) in detecting the response magnitude to each category of environmental stressors (on the y-axis; different subscripts denote that a given stressor may be covered by multiple meta-analytic cases), assuming one common "true" effect per stressor (MAOM), experiment-specific "true" effects within a stressor (ESSP), or their bias-corrected estimates (cMAOM and cESSP) as the "true" effects. The use of meta-analysis reduced the Type M error rates for some environmental stressors (MAOM.MA). (a) The dataset lnRR*. (b) The dataset SMD. (c) The dataset lnVR. (d) The dataset SMDH. (e) The dataset lnCVR. The definition of the Type M error rate can be found in Table 2. Gray cells indicate Type M errors greater than 3. See further details in the legend of Figure 3.


Type M and S error rates in global change studies

Type M and S error rates in detecting response magnitude

Single experiments consistently tended to overestimate the effects of environmental stressors (Type M error rates; Table 4; Figure 4). Depending on the effect size metric used, estimates from single experiments were on average two to three times larger than the true effect size estimated as the MAOM. Single experiments rarely misestimated the sign of the true effect size (Type S error rate; Table 5; Figure 5). As expected, meta-analyses largely reduced the Type M error rates (to 1–2; see Table 4; Figure 4). When bias correction was not employed, the overestimation of the true effect was even larger (Type M error rates of 2–6 and Type S error rates of 10%–30%).
TABLE 4

The model estimates of Type M error rate in detecting the effect of environmental stressors on ecosystem response magnitude (lnRR*, lnRR, SMD, and SMDH and their publication bias‐corrected versions) and response variability (or heteroscedasticity: lnVR and lnCVR). The model estimates of Type M error rate were reported both on single experiment level and meta‐analysis level. See more details in Table 3

Effect size | True effect | Median | CI.lb | CI.ub | Mean | k | N
(Median, CI.lb, CI.ub, and Mean are the model estimates of the Type M error rate.)

Single experiment
lnRR* | cMAOM | 3.220 | 2.960 | 3.503 | 6.286 | 3847 | 1119
lnRR* | cESSP | 2.900 | 2.666 | 3.154 | 6.947 | 3847 | 1119
lnRR* | MAOM | 2.604 | 2.429 | 2.793 | 5.084 | 3847 | 1119
lnRR* | ESSP | 2.727 | 2.539 | 2.930 | 6.533 | 3847 | 1119
lnRR | cMAOM | 2.004 | 1.835 | 2.188 | 3.911 | 1940 | 516
lnRR | cESSP | 2.100 | 1.946 | 2.267 | 5.031 | 1940 | 516
lnRR | MAOM | 1.526 | 1.431 | 1.628 | 2.980 | 1940 | 516
lnRR | ESSP | 1.968 | 1.819 | 2.127 | 4.714 | 1940 | 516
SMD | cMAOM | 2.875 | 2.680 | 3.085 | 5.613 | 1977 | 516
SMD | cESSP | 3.016 | 2.778 | 3.274 | 7.226 | 1977 | 516
SMD | MAOM | 2.028 | 1.902 | 2.162 | 3.958 | 1977 | 516
SMD | ESSP | 2.450 | 2.272 | 2.641 | 5.869 | 1977 | 516
SMDH | cMAOM | 2.936 | 2.748 | 3.137 | 5.731 | 1977 | 516
SMDH | cESSP | 3.151 | 2.912 | 3.409 | 7.548 | 1977 | 516
SMDH | MAOM | 2.259 | 2.116 | 2.413 | 4.410 | 1977 | 516
SMDH | ESSP | 2.703 | 2.498 | 2.924 | 6.474 | 1977 | 516
lnVR | MAOM | 3.949 | 3.734 | 4.176 | 7.709 | 1902 | 514
lnVR | ESSP | 3.386 | 3.132 | 3.660 | 8.112 | 1902 | 514
lnCVR | MAOM | 9.925 | 9.311 | 10.58 | 19.375 | 1886 | 513
lnCVR | ESSP | 6.292 | 5.713 | 6.929 | 15.073 | 1886 | 513

Meta-analysis
lnRR* | cMAOM | 1.823 | 1.252 | 2.648 | 2.037 | 3847 | 1119
lnRR* | MAOM | 1.345 | 1.123 | 1.610 | 1.504 | 3847 | 1119
lnRR | cMAOM | 1.600 | 0.897 | 2.839 | 1.788 | 1940 | 516
lnRR | MAOM | 1.251 | 0.879 | 1.776 | 1.399 | 1940 | 516
SMD | cMAOM | 1.379 | 0.836 | 2.265 | 1.542 | 1977 | 516
SMD | MAOM | 1.292 | 0.868 | 1.917 | 1.445 | 1977 | 516
SMDH | cMAOM | 1.305 | 0.875 | 1.940 | 1.459 | 1977 | 516
SMDH | MAOM | 1.286 | 0.874 | 1.887 | 1.438 | 1977 | 516
lnVR | MAOM | 1.555 | 1.081 | 2.231 | 1.738 | 1902 | 514
lnCVR | MAOM | 1.488 | 0.911 | 2.421 | 1.664 | 1886 | 513
TABLE 5

The model estimates of Type S error rate in detecting the effect of environmental stressors on ecosystem response magnitude (lnRR*, lnRR, SMD, and SMDH and their publication bias‐corrected versions) and response variability (or heteroscedasticity: lnVR and lnCVR). The model estimates of Type S error rate were reported both on single experiment level and meta‐analysis level. See more details in Table 3

Effect size | True effect | Median | CI.lb | CI.ub | Mean | k | N
(Median, CI.lb, CI.ub, and Mean are the model estimates of the Type S error rate.)

Single experiment
lnRR* | cMAOM | 0.032 | 0.029 | 0.036 | 0.079 | 3847 | 1119
lnRR* | cESSP | 0.027 | 0.024 | 0.030 | 0.070 | 3847 | 1119
lnRR* | MAOM | 0.025 | 0.022 | 0.028 | 0.060 | 3847 | 1119
lnRR* | ESSP | 0.027 | 0.024 | 0.03 | 0.069 | 3847 | 1119
lnRR | cMAOM | 0.014 | 0.011 | 0.017 | 0.035 | 1940 | 516
lnRR | cESSP | 0.018 | 0.015 | 0.020 | 0.042 | 1940 | 516
lnRR | MAOM | 0.007 | 0.005 | 0.009 | 0.016 | 1940 | 516
lnRR | ESSP | 0.015 | 0.012 | 0.018 | 0.038 | 1940 | 516
SMD | cMAOM | 0.023 | 0.020 | 0.027 | 0.046 | 1977 | 516
SMD | cESSP | 0.028 | 0.024 | 0.032 | 0.064 | 1977 | 516
SMD | MAOM | 0.013 | 0.010 | 0.015 | 0.025 | 1977 | 516
SMD | ESSP | 0.020 | 0.016 | 0.023 | 0.045 | 1977 | 516
SMDH | cMAOM | 0.026 | 0.022 | 0.029 | 0.049 | 1977 | 516
SMDH | cESSP | 0.030 | 0.026 | 0.034 | 0.065 | 1977 | 516
SMDH | MAOM | 0.016 | 0.013 | 0.019 | 0.031 | 1977 | 516
SMDH | ESSP | 0.023 | 0.019 | 0.026 | 0.051 | 1977 | 516
lnVR | MAOM | 0.050 | 0.046 | 0.056 | 0.077 | 1902 | 514
lnVR | ESSP | 0.037 | 0.033 | 0.042 | 0.083 | 1902 | 514
lnCVR | MAOM | 0.199 | 0.187 | 0.213 | 0.260 | 1886 | 513
lnCVR | ESSP | 0.087 | 0.078 | 0.097 | 0.171 | 1886 | 513

Meta-analysis
lnRR* | cMAOM | 0.014 | 0.003 | 0.029 | 0.017 | 3847 | 1119
lnRR* | MAOM | 0.004 | 0# | 0.009 | 0.007 | 3847 | 1119
lnRR | cMAOM | 0.014 | 0# | 0.045 | 0.017 | 1940 | 516
lnRR | MAOM | 0.004 | 0# | 0.017 | 0.007 | 1940 | 516
SMD | cMAOM | 0.009 | 0# | 0.031 | 0.012 | 1977 | 516
SMD | MAOM | 0.007 | 0# | 0.022 | 0.010 | 1977 | 516
SMDH | cMAOM | 0.007 | 0# | 0.022 | 0.010 | 1977 | 516
SMDH | MAOM | 0.006 | 0# | 0.021 | 0.009 | 1977 | 516
lnVR | MAOM | 0.007 | 0# | 0.021 | 0.010 | 1902 | 514
lnCVR | MAOM | 0.005 | 0# | 0.021 | 0.008 | 1886 | 513
FIGURE 5

Single experiments' median Type S error rates in detecting the response magnitude to each category of environmental stressors (on the y-axis; different subscripts denote that a given stressor may be covered by multiple meta-analytic cases), assuming one common "true" effect per stressor (MAOM), experiment-specific "true" effects within a stressor (ESSP), or their bias-corrected estimates (cMAOM and cESSP) as the "true" effects. The use of meta-analysis reduced the Type S error rates for some environmental stressors (MAOM.MA). (a) The dataset lnRR*. (b) The dataset SMD. (c) The dataset lnVR. (d) The dataset SMDH. (e) The dataset lnCVR. The definition of the Type S error rate can be found in Table 2. See further details in the legend of Figure 3.


Type M and S error rates in variance differences

At the single experiment level, lnVR and lnCVR showed large average Type M error rates (~4 and ~10, respectively; Table 4; Figure 4) but low Type S error rates (5%–19.9%; Table 5; Figure 5). By contrast, meta-analyses overestimated lnVR and lnCVR by only 1.6-fold and 1.5-fold, respectively.

Contrasting manipulative experiments and non‐manipulative observations

Both single manipulative experiments and non-manipulative observations were underpowered to detect the effects of environmental stressors on ecosystem response magnitude and variability (16%–39%, depending on the effect size metric; Figure 6a–f). With one exception, the differences in power between manipulative experiments and non-manipulative observations were not statistically significant (Figure 6d): when bias correction of ESSPs was employed, manipulative experiments had statistically greater power than non-manipulative observations (32% vs. 20%). Similarly, the differences between manipulative experiments and non-manipulative observations in Type M error rates were not significant, with one exception (bias-corrected lnRR*; Figure 6g–l): manipulative experiments had a statistically larger Type M error than non-manipulative observations when bias correction of ESSPs was used (twofold vs. sixfold). A similar pattern was found for the Type S errors of manipulative experiments and non-manipulative observations (Figure 6m–r).
FIGURE 6

Forest plots showing the model estimates of statistical power, Type M and S errors. The mixed‐effects models were used to compare the statistical power, Type M and S error rates between manipulative experiments and non‐manipulative observations. (a–f) Statistical power of manipulative experiments and non‐manipulative observations to detect response magnitude (lnRR*, lnRR, SMD, and SMDH) and variability (lnVR and lnCVR). (g–l) Type M errors in manipulative experiments and non‐manipulative observations. (m–r) Type S errors in manipulative experiments and non‐manipulative observations. *Indicates a statistically significant difference between manipulative experiments and non‐manipulative observations. See more details in the legend of Figure 3

Forest plots showing the model estimates of statistical power, Type M and S errors. The mixed‐effects models were used to compare the statistical power, Type M and S error rates between manipulative experiments and non‐manipulative observations. (a–f) Statistical power of manipulative experiments and non‐manipulative observations to detect response magnitude (lnRR*, lnRR, SMD, and SMDH) and variability (lnVR and lnCVR). (g–l) Type M errors in manipulative experiments and non‐manipulative observations. (m–r) Type S errors in manipulative experiments and non‐manipulative observations. *Indicates a statistically significant difference between manipulative experiments and non‐manipulative observations. See more details in the legend of Figure 3

DISCUSSION

We have conducted the first study to systematically assess the statistical power and the Type M and Type S error rates of global change studies. Consistent with our hypotheses, global change studies are generally underpowered, resulting in high Type M error rates (overestimation of response magnitude), whereas Type S error rates (estimation of the wrong sign) are relatively low. Across different ecosystems and stressors, single experiments were underpowered to detect bias‐corrected response magnitude (~18%–38% depending on effect size types; Table 3; Figure 3). Similarly, single experiments had much lower power to detect response variability (heteroscedasticity) than response magnitude (~6%–12%; Table 3; Figure 3). Such underpowered field experiments could exaggerate an effect by 2–3 times for response magnitude (with bias correction) and by 4–10 times for response variability when their results are statistically significant (Table 4; Figure 4). Also, single experiments rarely estimated the direction of the true anthropogenic impact incorrectly (Table 5; Figure 5). Notably, our results were consistent regardless of whether we assumed one "true" effect per meta‐analysis (e.g., cMAOM) or experiment‐specific "true" effects within a meta‐analysis (cESSP). Contrary to our expectation, and apart from one exception, manipulative field experiments and non‐manipulative observations did not differ statistically in their statistical power or Type M/S errors. Taken together, we conclude that low statistical power, coupled with publication bias, may have led to distorted estimates of anthropogenic impacts in the literature. Below, we first extend our discussion of the comparison between manipulative experiments and non‐manipulative observations. We then consider three statistical (but biologically relevant) points that emerged from our results and discuss how they can improve future empirical studies (manipulative experiments and non‐manipulative observations) and meta‐analyses in global change biology in general.
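To make these quantities concrete, the following minimal sketch implements the design‐analysis calculations of Gelman and Carlin (2014) that underlie our definitions: given a hypothetical "true" effect and the standard error of a single experiment, it returns statistical power, the share of significant results with the wrong sign (Type S), and the expected exaggeration ratio among significant results (Type M). The code and its numeric inputs are illustrative assumptions, not our analysis pipeline.

```python
import numpy as np
from scipy import stats

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Power, Type S, and Type M error for a single study
    (design analysis sensu Gelman & Carlin, 2014)."""
    z = stats.norm.ppf(1 - alpha / 2)            # two-tailed critical value
    lam = true_effect / se                       # signal-to-noise ratio
    power = stats.norm.sf(z - lam) + stats.norm.cdf(-z - lam)
    type_s = stats.norm.cdf(-z - lam) / power    # wrong-sign share of significant results
    rng = np.random.default_rng(seed)
    est = rng.normal(true_effect, se, n_sims)    # simulated replicate estimates
    sig = np.abs(est) > z * se                   # keep only the "significant" ones
    type_m = np.mean(np.abs(est[sig])) / true_effect  # exaggeration ratio
    return power, type_s, type_m

# A hypothetical underpowered experiment: a modest true effect (e.g., lnRR = 0.1)
# estimated with a comparable standard error gives power of roughly 17% and a
# Type M error well above 2, mirroring the patterns reported above.
print(retrodesign(true_effect=0.1, se=0.1))
```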

Manipulative experiments and non‐manipulative observations both lack power

Rather surprisingly, the statistical power of manipulative experiments and non‐manipulative observations was similar (e.g., uncorrected SMD and bias‐corrected SMD in Table S1). Differences between manipulative experiments and non‐manipulative observations have often been assumed because experimental work usually produces larger effect magnitudes (Palmer, 2000). Yet, as far as we are aware, no work has empirically tested whether such differences occur. The lack of a power difference between manipulative experiments and non‐manipulative observations may be due to a trade‐off between the magnitude of effect sizes and the number of replicates (i.e., sample size): the larger effect sizes of manipulative experiments are offset by their smaller sample sizes. Indeed, we found that manipulative experiments had larger effects than non‐manipulative observations; for example, manipulative experiments had statistically larger estimates of SMD than non‐manipulative observations (see Table S1). In contrast, non‐manipulative observations had, on average, 2.5 times as many replicates as manipulative experiments (25 vs. 10; Figures S11 and S12). Although we may tend to think that manipulative experiments have greater power and are therefore more reliable, this assumption is not tenable, at least in the field of global change studies.
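This trade‐off can be illustrated with a back‐of‐the‐envelope power calculation. The numbers below are hypothetical planning values chosen only to mirror the pattern above (a larger standardized effect with n = 10 per group versus a smaller effect with n = 25 per group); they are not estimates from our dataset.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Larger effect, fewer replicates (manipulative-like design):
p_manip = analysis.power(effect_size=0.8, nobs1=10, alpha=0.05)
# Smaller effect, 2.5x the replicates (observational-like design):
p_obs = analysis.power(effect_size=0.5, nobs1=25, alpha=0.05)
print(f"manipulative-like: {p_manip:.2f}, observational-like: {p_obs:.2f}")  # both ~0.4
```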

Meta‐analysis is not only a powerful tool but maybe the only tool?

As expected, meta‐analyses increased the power to detect response magnitude (both before and after correcting for publication bias) by at least 30% compared to single experiments. For example, the overall power of meta‐analyses was 51.2% and 62.1% for lnRR and SMD, respectively, compared to 38.5% and 19.1% for single experiments (Table 3). Indeed, the nominal 80% power is difficult to achieve at the single‐experiment level in many disciplines, such as neuroscience (median power = 21%; Button et al., 2013), clinical medicine (median power = 20%; Lamberink et al., 2018), psychology (median power = 36%; T. Stanley et al., 2018), and economics (median power = 18%; Ioannidis et al., 2017). Such low average statistical power for single experiments highlights the importance of meta‐analyzing response magnitude (Gurevitch et al., 2018). We note that, although single experiments are often underpowered and more prone to Type M error, they are essential to global change biology research. Such experiments contribute to evidence accumulation, providing the raw material for systematic reviews and meta‐analyses. Perhaps more importantly, local field experiments are an effective way to reveal the causal mechanisms of ecological responses in a particular ecosystem, and idiosyncrasies among ecosystems from different localities (Rineau et al., 2019; Roy et al., 2021). Similarly, meta‐analysis of variance (i.e., synthesizing lnVR and lnCVR from individual studies; Nakagawa et al., 2015) is a powerful approach to detect response variability (i.e., heteroscedasticity); see the sketch after this paragraph. Indeed, we found that meta‐analysis of variance increased statistical power four‐ to sixfold (meta‐analytic lnVR vs. individual lnVR: 43.9% vs. 11.5%; meta‐analytic lnCVR vs. individual lnCVR: 52.6% vs. 6.4%; Table 3). Furthermore, meta‐analysis of variance could mitigate Type M and S error rates compared to single experiments. Ecologists have long been aware of the difficulty of detecting response variability reliably (Andersen et al., 2009; Carpenter & Brock, 2006; Seekell et al., 2011), and have already discussed the need for large sample sizes (Engle, 1982; Seekell et al., 2011). Yet the number of replicates (n; sample size per study) in global change studies was usually too small to detect response variability reliably (median n = 12 in our dataset). Practically speaking, to obtain adequate sample sizes for estimating effects on response variability, we need to organize more global research collaboration networks, such as the Nutrient Network (NutNet; Harpole et al., 2016; Lekberg et al., 2021), the US Long‐Term Ecological Research network (LTER; Crossley et al., 2020), and the Zostera Experimental Network (ZEN; Wu et al., 2017). Alternatively, we would require heavily instrumented and controlled environmental facilities (e.g., UHasselt Ecotron; see Clobert et al., 2018; Rineau et al., 2019; Roy et al., 2021). Fortunately, meta‐analysis of variance provides an alternative approach for increasing the chance of detecting changes in response variability hidden in global change studies.
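As a concrete illustration of meta‐analysis of variance, the sketch below computes lnVR and its sampling variance using the formulas of Nakagawa et al. (2015) and pools effect sizes with a simple DerSimonian–Laird random‐effects model. This is a deliberately simplified two‐level sketch; our actual analyses used multilevel models, and lnCVR would be handled analogously with its own formulas.

```python
import numpy as np

def ln_vr(sd_t, n_t, sd_c, n_c):
    """Log variability ratio (lnVR) and its sampling variance
    (formulas from Nakagawa et al., 2015)."""
    yi = np.log(sd_t / sd_c) + 1 / (2 * (n_t - 1)) - 1 / (2 * (n_c - 1))
    vi = 1 / (2 * (n_t - 1)) + 1 / (2 * (n_c - 1))
    return yi, vi

def random_effects(yi, vi):
    """DerSimonian-Laird random-effects pooling of effect sizes yi
    with sampling variances vi; returns the pooled mean and its SE."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    w = 1 / vi
    mu_fe = np.sum(w * yi) / np.sum(w)            # fixed-effect mean
    q = np.sum(w * (yi - mu_fe) ** 2)             # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(yi) - 1)) / c)      # between-study variance
    w_re = 1 / (vi + tau2)
    mu = np.sum(w_re * yi) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    return mu, se
```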

Publication bias may have exacerbated the inflation of anthropogenic effects

We have shown that meta‐analyses deliver a sizeable increase in power over single experiments, although some meta‐analyses were still underpowered relative to the nominal value of 80% power (Table 3; Figure 3). Furthermore, only half of the meta‐analyses (15 of 30) had tested for the existence of publication bias in their datasets. The methods used to assess publication bias were funnel plots (n = 8), rank correlation tests (n = 4), fail‐safe N (n = 4), Egger's regression (n = 1), and normal quantile plots (n = 1). Among these, only two meta‐analyses had corrected for the potential influence of publication bias (using the trim‐and‐fill method; see Gallardo et al., 2016; Liu et al., 2016). This means that meta‐analyses in global change biology are likely to overestimate overall effects. In this study, we used a recently proposed multilevel meta‐regression approach (Nakagawa, Lagisz, Jennions, et al., 2021) to adjust for publication bias in meta‐analyses; a simplified sketch of this family of adjustments follows this paragraph. After adjustment for publication bias, the magnitude of the overall effect sizes declined by 17%–32% (see Figure 2), the corresponding single‐experiment power decreased by 9%–66%, and Type M error rates increased by 20%, indicating that publication bias might have exacerbated the overestimation of anthropogenic impacts in global change studies. Our results indicate that effect sizes in global change studies are severely exaggerated and call into question their "reproducibility." Peer‐reviewed journals are more likely to publish statistically significant results, perhaps using statistical significance as a gate‐keeping tool to maintain their "prestige" (e.g., inflated impact factors). Under the publish‐or‐perish research culture, ecologists may intentionally "pick" significant results or engage in questionable research practices such as p‐hacking and HARKing to pursue a more publishable result (Amrhein et al., 2017; Fraser et al., 2018). However, such gate‐keeping might not work well (e.g., it fails to increase the citation of papers; Wardle, 2012) and, more importantly, does not equal good scientific research. Evidence from other disciplines has also shown that meta‐analyses that do not correct for publication bias lead to biased assessments of power (see Button et al., 2013; Ioannidis et al., 2017; T. Stanley et al., 2018). However, even our bias‐corrected effect sizes may still be overestimated to some degree, because our meta‐regression approach could not control for heterogeneities between studies, which may have prevented more accurate adjustments for publication bias (i.e., potentially important moderators were not available to incorporate in the meta‐regression; Nakagawa & Santos, 2012; Noble et al., 2017). Therefore, it is necessary not only to test for publication bias and adjust for its influence in every meta‐analysis, but also to report all predictors and model information transparently in each publication so that any researcher can implement such adjustments later.
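For readers who wish to experiment with bias adjustment, the sketch below is a deliberately simplified, single‐level analogue in the regression‐based family of corrections: the intercept of a precision‐weighted regression of effect sizes on their standard errors approximates the effect of a hypothetical study with zero sampling error (a PET‐style estimate). It is not the multilevel meta‐regression of Nakagawa, Lagisz, Jennions, et al. (2021) itself, which additionally models non‐independence among effect sizes.

```python
import numpy as np
import statsmodels.api as sm

def pet_adjust(yi, vi):
    """PET-style publication-bias adjustment: regress effect sizes on their
    standard errors, weighting by precision; the intercept estimates the
    effect at se = 0, i.e., an effect less contaminated by small-study
    effects. Returns the adjusted effect and its standard error."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    X = sm.add_constant(np.sqrt(vi))          # column of 1s plus standard errors
    fit = sm.WLS(yi, X, weights=1 / vi).fit()
    return fit.params[0], fit.bse[0]
```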

The choice of effect sizes for global change studies

Our study provides the first empirical evidence that lnRR is, on average, a more powerful and less biased effect size than SMD and SMDH. Experimental lnRR was twice as powerful as SMD and SMDH (lnRR vs. SMD vs. SMDH: 38.5% vs. 19.1% vs. 18.2%; see Table 3; Figure 3) and less vulnerable to overestimation: lnRR was exaggerated twofold, whereas SMD and SMDH were exaggerated threefold (Table 4; Figure 4). However, lnRR has a major disadvantage: it is only appropriate for ratio‐scale data (i.e., measurements bounded at zero; cf. Houle et al., 2011; Nakagawa et al., 2015). Nonetheless, lnRR has many other merits over SMD (Nakagawa et al., 2015), including: (1) being more robust with small sample sizes (SMD is estimated with bias at small N; cf. Hamman et al., 2018), (2) incorporating heteroscedasticity (note that SMDH does accommodate heteroscedasticity; cf. Bonett, 2008, 2009; Sánchez‐Tójar et al., 2020), and (3) being less affected by scale dependence (Spake et al., 2021); the standard estimators for both metrics are given in the sketch below. Incidentally, unlike the choice among mean‐difference metrics, which can be based on power, the choice between lnCVR and lnVR depends on the biological question, as described elsewhere (Nakagawa et al., 2015; Senior et al., 2020).
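For reference, the sketch below gives the standard estimators for the two families of mean‐difference metrics discussed here: lnRR with its delta‐method sampling variance (valid only for ratio‐scale data), and Hedges' g, the small‐sample‐corrected SMD under the homoscedasticity assumption that SMDH relaxes.

```python
import numpy as np

def ln_rr(m_t, sd_t, n_t, m_c, sd_c, n_c):
    """Log response ratio (lnRR) and its delta-method sampling variance.
    Only valid for ratio-scale data (means bounded at zero)."""
    yi = np.log(m_t / m_c)
    vi = sd_t**2 / (n_t * m_t**2) + sd_c**2 / (n_c * m_c**2)
    return yi, vi

def hedges_g(m_t, sd_t, n_t, m_c, sd_c, n_c):
    """Hedges' g (small-sample-corrected SMD) and its sampling variance,
    assuming homoscedasticity (cf. SMDH, which relaxes this assumption)."""
    df = n_t + n_c - 2
    s_pool = np.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (m_t - m_c) / s_pool                 # Cohen's d
    j = 1 - 3 / (4 * df - 1)                 # small-sample correction factor
    v_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return j * d, j**2 * v_d
```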

CONCLUSIONS AND FUTURE PERSPECTIVES

We have demonstrated that low statistical power and exaggerated effect size estimates are potentially widespread across experimental studies in global change biology, especially when correcting for the influence of publication bias. Manipulative field experiments are not superior to non‐manipulative observations in terms of their statistical power and Type M and S errors. Therefore, single experiments, whether manipulative or not, may fail, on average, to provide reliable insights into the anthropogenic impacts of global change by themselves. Likewise, although response variability (heteroscedasticity or variance differences) has important biological and statistical implications in the field, our results show that single experiments are too underpowered to detect response variability reliably. Therefore, to address questions associated with variance, researchers should use meta‐analysis of variation to increase the power to detect response variability reliably (we found that 8 of 12 meta‐analyses showed significant response variability in lnCVR, which had never been revealed before; see Figure S10). Such use of meta‐analysis of variation can generate new biological hypotheses and inform methodological decisions (i.e., the choice of standardized mean effect size; Nakagawa et al., 2015; Senior et al., 2020). Future global change research warrants highly powered field studies to reliably inform theory building and policymaking. Such studies are likely to call for more collaboration and team science (Camerer et al., 2016; O'Dea et al., 2021), and the use of large‐scale ecosystem research infrastructures (Roy et al., 2021). Moreover, researchers should strive for open and transparent science practices (Gallagher et al., 2020), such as controlling for magnitude and sign errors when planning field experiments (i.e., an extension of power analysis; Lemoine et al., 2016; see the sketch below), archiving and sharing data following the FAIR principles (i.e., findable, accessible, interoperable, and reusable data; Wilkinson et al., 2016; see also Crystal‐Ornelas et al., 2021), increasing transparent reporting (T. H. Parker et al., 2016), embracing preregistration and registered reports (T. Parker et al., 2019), and implementing more replication projects (Fraser et al., 2020). Adopting these practices will not only aid further meta‐analytical syntheses but also make ecological findings more reproducible and reliable in general (Nakagawa & Parker, 2015; O'Dea et al., 2021).
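As an illustration of such an extension of power analysis, the short loop below (reusing the retrodesign() sketch given earlier, with hypothetical planning values) selects the smallest per‐group sample size that satisfies both a conventional power target and a ceiling on the expected exaggeration ratio.

```python
import numpy as np

# Design-analysis loop (hypothetical planning values): find the smallest
# per-group n that meets both a power target and a Type M (exaggeration)
# target, reusing the retrodesign() sketch defined earlier.
d, sd = 0.2, 1.0                        # assumed true mean difference and within-group SD
for n in range(5, 1000):
    se = sd * np.sqrt(2 / n)            # SE of a two-group mean difference
    power, type_s, type_m = retrodesign(d, se)
    if power >= 0.80 and type_m <= 1.15:
        print(f"n per group = {n}: power = {power:.2f}, Type M = {type_m:.2f}")
        break
```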

CONFLICT OF INTEREST

The authors declare no conflict of interest.
REFERENCES (showing 10 of 67)

1.  Conditional heteroscedasticity as a leading indicator of ecological regime shifts.

Authors:  David A Seekell; Stephen R Carpenter; Michael L Pace
Journal:  Am Nat       Date:  2011-08-25       Impact factor: 3.926

2.  Ecological impacts of invasive alien plants: a meta-analysis of their effects on species, communities and ecosystems.

Authors:  Montserrat Vilà; José L Espinar; Martin Hejda; Philip E Hulme; Vojtěch Jarošík; John L Maron; Jan Pergl; Urs Schaffner; Yan Sun; Petr Pyšek
Journal:  Ecol Lett       Date:  2011-05-19       Impact factor: 9.492

3.  Experiment, monitoring, and gradient methods used to infer climate change effects on plant communities yield consistent patterns.

Authors:  Sarah C Elmendorf; Gregory H R Henry; Robert D Hollister; Anna Maria Fosaa; William A Gould; Luise Hermanutz; Annika Hofgaard; Ingibjörg S Jónsdóttir; Janet C Jorgenson; Esther Lévesque; Borgþór Magnusson; Ulf Molau; Isla H Myers-Smith; Steven F Oberbauer; Christian Rixen; Craig E Tweedie; Marilyn D Walker
Journal:  Proc Natl Acad Sci U S A       Date:  2014-12-29       Impact factor: 11.205

4.  Making conservation science more reliable with preregistration and registered reports.

Authors:  Timothy Parker; Hannah Fraser; Shinichi Nakagawa
Journal:  Conserv Biol       Date:  2019-05-22       Impact factor: 6.560

5.  Meta-analysis of magnitudes, differences and variation in evolutionary parameters.

Authors:  M B Morrissey
Journal:  J Evol Biol       Date:  2016-10       Impact factor: 2.411

6.  Research Weaving: Visualizing the Future of Research Synthesis.

Authors:  Shinichi Nakagawa; Gihan Samarasinghe; Neal R Haddaway; Martin J Westgate; Rose E O'Dea; Daniel W A Noble; Malgorzata Lagisz
Journal:  Trends Ecol Evol       Date:  2018-12-20       Impact factor: 17.712

7.  Illustrating the importance of meta-analysing variances alongside means in ecology and evolution.

Authors:  Alfredo Sánchez-Tójar; Nicholas P Moran; Rose E O'Dea; Klaus Reinhold; Shinichi Nakagawa
Journal:  J Evol Biol       Date:  2020-07-06       Impact factor: 2.411

8.  Advancing global change biology through experimental manipulations: Where have we been and where might we go?

Authors:  Paul J Hanson; Anthony P Walker
Journal:  Glob Chang Biol       Date:  2019-11-29       Impact factor: 10.863

9.  Ecotrons: Powerful and versatile ecosystem analysers for ecology, agronomy and environmental science.

Authors:  Jacques Roy; François Rineau; Hans J De Boeck; Ivan Nijs; Thomas Pütz; Samuel Abiven; John A Arnone; Craig V M Barton; Natalie Beenaerts; Nicolas Brüggemann; Matteo Dainese; Timo Domisch; Nico Eisenhauer; Sarah Garré; Alban Gebler; Andrea Ghirardo; Richard L Jasoni; George Kowalchuk; Damien Landais; Stuart H Larsen; Vincent Leemans; Jean-François Le Galliard; Bernard Longdoz; Florent Massol; Teis N Mikkelsen; Georg Niedrist; Clément Piel; Olivier Ravel; Joana Sauze; Anja Schmidt; Jörg-Peter Schnitzler; Leonardo H Teixeira; Mark G Tjoelker; Wolfgang W Weisser; Barbro Winkler; Alexandru Milcu
Journal:  Glob Chang Biol       Date:  2021-01-28       Impact factor: 10.863

10.  Low statistical power and overestimated anthropogenic impacts, exacerbated by publication bias, dominate field studies in global change biology.

Authors:  Yefeng Yang; Helmut Hillebrand; Malgorzata Lagisz; Ian Cleasby; Shinichi Nakagawa
Journal:  Glob Chang Biol       Date:  2021-12-10       Impact factor: 13.211
