Benjamin Djulbegovic¹, Muhammad Muneeb Ahmed², Iztok Hozo³, Despina Koletsi⁴, Lars Hemkens⁵,⁶,⁷, Amy Price⁸, Rachel Riera⁹, Paulo Nadanovsky¹⁰, Ana Paula Pires Dos Santos¹¹, Daniela Melo¹², Ranjan Pathak¹³, Rafael Leite Pacheco¹⁴, Luis Eduardo Fontes¹⁴,¹⁵, Enderson Miranda¹⁵, David Nunan¹⁵,¹⁶.
1. Department of Computational & Quantitative Medicine, Beckman Research Institute, City of Hope, Duarte, California, USA.
2. Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada.
3. Department of Mathematics, Indiana University Northwest, Gary, Indiana, USA.
4. Clinic of Orthodontics and Pediatric Dentistry, Center of Dental Medicine, University of Zurich, Zurich, Switzerland.
5. Department of Clinical Research, University of Basel, Basel Institute for Clinical Epidemiology & Biostatistics, University Hospital Basel, Basel, Switzerland.
6. Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA.
7. Meta-Research Innovation Center Berlin (METRIC-B), Berlin Institute of Health, Berlin, Germany.
8. Anesthesia Informatics and Media Lab, Stanford University, Stanford, California, USA.
9. Universidade Federal de São Paulo, Escola Paulista de Medicina, Brazil (Unifesp), São Paulo, Brazil.
10. Department of Epidemiology and Quantitative Methods in Health, National School of Public Health, Fundação Oswaldo Cruz (FIOCRUZ) - Department of Epidemiology, Institute of Social Medicine, Universidade do Estado do Rio de Janeiro (UERJ), Rio de Janeiro, Brazil.
11. Department of Pharmaceutical Sciences, Universidade Federal de São Paulo (Unifesp), Rio de Janeiro, Brazil.
12. Department of Community and Preventive Dentistry, Faculty of Dentistry, Universidade do Estado do Rio de Janeiro (UERJ), Rio de Janeiro, Brazil.
13. Department of Medical Oncology and Therapeutics Research, City of Hope, Duarte, California, USA.
14. Centro Universitário São Camilo, Researcher at the Center of Health Technology Assessment, Hospital Sirio-Libanês, São Paulo, Brazil.
15. Department of Intensive Care, and Emergency Medicine at Faculdade de Medicina de Petrópolis, in Petrópolis, Rio de Janeiro, Brazil.
16. Kellogg College, University of Oxford, Oxford, UK.
Abstract
RATIONALE, AIMS, AND OBJECTIVES: It is generally believed that low-quality evidence generates inaccurate estimates of treatment effects more often than evidence of high quality (certainty) of evidence (CoE). As a result, we would expect that (a) estimates of effects of health interventions initially based on high CoE change less frequently than the effects estimated by lower CoE and (b) the estimates of magnitude of effect size differ between high and low CoE. Empirical assessment of these foundational principles of evidence-based medicine has been lacking. METHODS: We reviewed the Cochrane Database of Systematic Reviews from January 2016 through May 2021 for pairs of original and updated reviews for change in CoE assessments based on the Grading of Recommendations Assessment, Development and Evaluation (GRADE) method. We assessed the difference in effect sizes between the original versus updated reviews as a function of change in CoE, which we report as a ratio of odds ratio (ROR). We compared ROR generated in the studies in which CoE changed from very low/low (VL/L) to moderate/high (M/H) versus M/H to VL/L. Heterogeneity and inconsistency were assessed using the tau and I² statistics. We also assessed the change in precision of effect estimates by calculating the ratio of standard errors (seR), and the absolute deviation in estimates of treatment effects (aROR). RESULTS: Four hundred and nineteen pairs of reviews were included, of which 414 (207 × 2) informed the CoE appraisal and 384 (192 × 2) the assessment of effect size. We found that CoE originally appraised as VL/L had 2.1 [95% confidence interval (CI): 1.19-4.12; p = 0.0091] times higher odds of being changed in future studies than M/H CoE. However, the effect size was not different (p = 1) when CoE changed from VL/L → M/H [ROR = 1.02 (95% CI: 0.74-1.39)] compared with M/H → VL/L [ROR = 1.02 (95% CI: 0.44-2.37)].
Similar overlap in aROR between the VL/L → M/H versus M/H → VL/L subgroups was observed [median (IQR): 1.12 (1.07-1.57) vs. 1.21 (1.12-2.43)]. We observed large inconsistency across ROR estimates (I² = 99%). There was larger imprecision in treatment effects when CoE changed from VL/L → M/H (seR = 1.46) than when it changed from M/H → VL/L (seR = 0.72). CONCLUSIONS: We found that low-quality evidence changes more often than high CoE. However, the effect size did not systematically differ between the studies with low versus high CoE. The finding that the effect size did not differ between low and high CoE indicates an urgent need to refine current EBM critical appraisal methods.
A foundational epistemological principle underpinning evidence‐based medicine (EBM) is based on the assumption that the estimates of the effects of health interventions are closer to the ‘truth’ if they are based on higher than on lower quality (certainty) of evidence (CoE).
If the estimated treatment effects are close to the ‘true’ effects, this would also imply that they would be less likely to change as evidence accumulates after new studies are completed. Conversely, because its relation to the ‘truth’ is less certain, this also implies that the estimated effects when evidence is of low quality would be more likely to change in future research. Research to date indicates that guideline panels are willing to issue stronger recommendations when they deem evidence to be of high quality, thus indirectly affirming this central EBM assumption.
However, whether this indirect assessment of quality of evidence based on guidelines panels' decision‐making is accurate is not known. It is possible that current methods of critical appraisal of CoE do not discriminate well between accurate and inaccurate estimates of treatment effects. That is, the effects of health interventions based on low quality of evidence may turn out to reflect ‘true effects’ when tested in subsequent studies. On the other hand, what was originally deemed as high‐quality evidence may be undermined by future studies more often than initially expected. Thus, it is not known if low‐quality evidence is more often revised than high‐quality evidence. Empirical evidence supporting this foundational principle of EBM is lacking.
The main purpose of this report is to assess if (a) low certainty evidence is more often revised than high certainty evidence in subsequent studies and if (b) the magnitude of effect size differs between high and low CoE.
METHODS
We assessed the change in CoE between the original and updated Cochrane systematic reviews, which reported rating of CoE as per the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for critical appraisal of medical evidence.
We used GRADE as this has been widely recognized as the most advanced system for operationalization of fundamental principles of EBM and critical evaluation of medical evidence.
GRADE was developed in the first decade of the 21st century after critical appraisal of 106 systems for rating of the quality of medical research evidence showed that none of them was capable of distinguishing low from high‐quality evidence.
We focused on the assessment of systematic reviews, rather than on individual trials, because the second important EBM principle is that assessment of the true effects of health interventions is best accomplished by evaluating the total evidence on the topic rather than a study selected to favour a particular claim. GRADE is also considered a suitable method to assess certainty of evidence at the level of systematic review/meta‐analysis. Thus, the unit of our analysis was a systematic review/meta‐analysis (SR/MA).
Cochrane Reviews are regularly updated, providing a unique opportunity to assess when and whether the assessment of CoE changes between the original and updated reviews as a result of new evidence generated between two reviews. Since 2013 Cochrane Reviews have mandated the use of GRADE Summary of Findings (SoF) to summarize CoE and the magnitude of effects of interventions that the reviews assessed. We evaluated all Cochrane reviews published in the last 5 years in the Cochrane Database of Systematic Reviews [https://www.cochranelibrary.com/cdsr/about-cdsr].
We used SoFs from the original and updated reviews to extract data for the primary outcome related to CoE and to assess the magnitude and direction of effect. (In case of multiple primary outcomes, the data were extracted from the first one listed in the SoF table that contained data in both the original and updated review.) Eligible SR/MAs were divided into five groups; data were extracted from each group by pairs of independent reviewers. Kappa interrater agreement was calculated for each pair regarding CoE. As explained, we recorded CoE according to GRADE criteria (very low, low, moderate and high).
We also extracted summary meta‐analytic estimates for the primary outcome from each pair of reviews, that is, point estimates, dispersion (e.g., 95% confidence interval), metric used (e.g., relative risk, odds ratio, hazard ratio, standardized mean differences, etc.), number of trials per meta‐analysis, number of participants, type of comparator (active vs. placebo/no treatment), type of treatment (pharmaceutical vs. non‐pharmaceutical), whether the authorship of the original and updated reviews changed (to capture potential differences in judgment of CoE by the review team), and type of studies (randomized controlled trials vs. observational studies) that were meta‐analyzed.
We converted all effect estimates into odds ratio (OR). We also converted all effect sizes to the same direction, with OR < 1 indicating reduction of undesirable outcomes (i.e., more beneficial treatment). Because GRADE separates recommendations as strong versus weak based on the CoE, typically endorsing strong versus weak (conditional) recommendations based on moderate/high versus low/very low, respectively, our key analysis focused on the differences in effect sizes between these subgroups. We conducted McNemar's test for paired (before vs. after) data to test the null hypothesis of equal probability that CoE remained the same, that is, in very low/low CoE versus moderate/high CoE groups. To test for linear trend in change of CoE over all categories (from very low to high), we employed a symmetry test with marginal homogeneity tests (which reduces to McNemar's test for two non‐independent categories of observations).
To assess differences in the magnitude of effect size between original and updated evidence as a function of change in the assessment of CoE, we calculated the ratio of odds ratio (ROR) across meta‐analytic estimates.
ROR compares intervention effects in meta‐analysis of trials with very low/low versus those with moderate/high CoE (or vice versa).
Thus, if the comparison of ORs with very low/low versus those with moderate/high CoE yields ROR < 1, this would mean that treatment effects were more beneficial in meta‐analysis of trials with very low/low CoE, while ROR > 1 would indicate the opposite.
A test of interactions was performed to assess the hypothesis of no difference between the subgroups (i.e., treatment effects in very low/low vs. moderate/high CoE).
Because of assumed correlations in comparison of treatment effects, we calculated standard errors for ROR by correlating the effect sizes observed in the original versus updated reviews.
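The correlation‐adjusted standard error described above can be sketched as follows. This is a minimal illustration, not the Stata code used in the analysis: the function names, the orientation of the ratio (updated over original), and the recovery of standard errors from 95% CIs on the log scale are all assumptions made for the example.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a two-sided 95% CI


def se_from_ci(lower, upper):
    """Standard error of ln(OR), assuming the CI is symmetric on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)


def ratio_of_odds_ratios(or_orig, ci_orig, or_upd, ci_upd, r):
    """ROR (assumed here as updated / original) with a correlation-adjusted SE.

    r is the assumed correlation between the two log odds ratios.
    Returns (ROR, lower 95% limit, upper 95% limit, SE of ln ROR).
    """
    se_o = se_from_ci(*ci_orig)
    se_u = se_from_ci(*ci_upd)
    log_ror = math.log(or_upd) - math.log(or_orig)
    # Variance of a difference of two correlated estimates.
    var = se_o ** 2 + se_u ** 2 - 2 * r * se_o * se_u
    se = math.sqrt(max(var, 0.0))
    return (math.exp(log_ror),
            math.exp(log_ror - Z95 * se),
            math.exp(log_ror + Z95 * se),
            se)
```

Setting r = 0 reproduces the analysis that assumes no correlation between effect sizes; larger positive values of r shrink the standard error of ln(ROR).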
We obtained the values for correlation coefficients from the data. We performed sensitivity analyses by: (a) assuming one correlation coefficient between effect sizes in the original versus updated reviews and (b) calculating correlation coefficients for each subgroup according to direction of treatment effects (i.e., we calculated separate correlation coefficients for the subgroup showing positive, negative and no change in direction of effects between the original versus updated review, three correlation coefficients in total). We also repeated all analyses assuming no correlations between the effect sizes. Since we observed no differences in the results regardless of the postulated assumptions, we report the default analysis based on calculation with three different correlation coefficients.
Our hypothesis was that ROR between the subgroups would differ; in addition, we would expect that the effect size would be larger if CoE changed from moderate/high to very low/low than the other way around.
The analyses were based on a random‐effects Sidik‐Jonkman model. We assessed heterogeneity, that is, dispersion of effect size across the meta‐analytic estimates, by calculating the τ (tau) statistic.
We used the I² statistic to assess inconsistency; I² represents the estimated proportion of the observed variance that is due to variation in true effect sizes across individual meta‐analyses rather than sampling error; it depends both on heterogeneity and total variation in the estimates between the analyses.
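To make the relationship between τ and I² concrete, the sketch below computes Cochran's Q, a between‐study variance and I² from a set of log effect estimates and their standard errors. It uses the simpler DerSimonian‐Laird moment estimator for τ², which is a substitution for illustration only and not the Sidik‐Jonkman estimator used in the analysis; the function name and inputs are hypothetical.

```python
import math


def heterogeneity(log_effects, std_errors):
    """Cochran's Q, DerSimonian-Laird tau^2 (illustrative stand-in) and I^2."""
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum_w
    # Q: weighted squared deviations from the fixed-effect pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    df = len(log_effects) - 1
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # I^2: share of total variation attributed to between-estimate heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"Q": q, "tau": math.sqrt(tau2), "tau2": tau2, "I2": i2}
```

Because I² is a proportion, it can be very high even when τ is modest if the individual estimates are very precise, which is why the two statistics are reported together.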
We complemented assessment of heterogeneity with calculation of the absolute deviation of treatment effects (aROR) as a function of change in CoE.
By definition, aROR is positive and reflects the x‐fold deviation of treatment effect from OR = 1 on the OR scale. Thus, if ROR = 0.8 or ROR = 1.25, the absolute deviation is equal to aROR = 1.25. aROR across all SR/MAs was expressed as (unweighted) median and interquartile range (IQR).
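The definition of aROR given above amounts to taking the absolute value on the log scale and exponentiating back. A minimal sketch, with hypothetical function names and with Python's inclusive quantile rule standing in for whatever quartile definition the statistical package applied:

```python
import math
import statistics


def absolute_ror(ror):
    """x-fold deviation from 1 on the ratio scale: exp(|ln ROR|)."""
    return math.exp(abs(math.log(ror)))


def median_iqr(values):
    """Unweighted median and interquartile range (inclusive quantile rule)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return med, (q1, q3)


# ROR = 0.8 and ROR = 1.25 both deviate 1.25-fold from 1.
arors = [absolute_ror(r) for r in (0.8, 1.25, 0.5, 1.0, 2.0)]
summary = median_iqr(arors)
```

Working on the log scale treats a halving and a doubling of the odds ratio as deviations of equal size, which a simple |ROR − 1| would not.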
We also evaluated how the precision of the estimates changed by calculating the ratio of standard errors for each subgroup summarized as (unweighted) median and IQR.
Values >1 indicate larger standard errors (less precision) associated with a given category (e.g., very low/low vs. moderate/high) of CoE.
A number of subgroup analyses (all defined a priori and published in the protocol to provide further methodological details) were performed. These included assessment of differences between patient‐oriented (e.g., mortality, quality of life, etc.) versus disease‐oriented outcomes (e.g., disease response, laboratory outcomes, etc.), effect of a change in authorship between the original and updated reviews, effect of comparator intervention (active treatment vs. placebo/no treatment control) and type of treatment category (pharmaceutical vs. non‐pharmaceutical). Finally, in some cases, the SRs included observational studies along with randomized controlled trials (RCTs), and the conversion from standardized mean differences generated implausibly large ORs. We further analyzed these results by performing sensitivity analyses excluding SRs with observational studies and large ORs from the analysis.
This paper is reported per PRISMA guidelines.
All analyses were conducted with the Stata, version 17, statistical package.
RESULTS
The original search, performed on 20 October 2020, identified 3323 potentially eligible reviews of which 419 SR were included in the final analysis (Figure 1). Of these, 414 (207 × 2) and 384 (192 × 2) pairs of the reviews were eligible for the analysis of CoE and effect size, respectively. Total number of trials included in 414 reviews was 4217 (1814 before and 2403 after); mean number of trials per meta‐analysis was 10 (minimum: 1, maximum: 133). Total number of participants was 3,057,956; mean number of participants per meta‐analysis was 10,506 (minimum: 16; maximum: 1,202,382). Interrater kappa agreement between the reviewers varied from 0.79 to 0.97.
Figure 1
PRISMA diagram (study flow diagram for evidence source and selection)
Figure 2 shows comparison of CoE in the original and updated Cochrane reviews across all categories of CoE (Figure 2A) and from very low/low to moderate/high (Figure 2B) according to GRADE criteria. Consistent with EBM principles, evidence judged to be of very low/low CoE had 2.1 (1.19–4.12; p = 0.0065) times higher odds to be upgraded in the future studies than moderate/high CoE (Figure 2B). Similarly, across all categories of CoE, the test for trend was highly significant, indicating an increased probability of change in CoE from very low to high CoE (p = 0.0021 for linear trend). We observed no instance in which high or moderate quality evidence was re‐assessed as very low‐quality evidence in the updated SR, while very low CoE was upgraded to moderate or high CoE in 9/39 of updated SR (Figure 2A).
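For readers unfamiliar with the paired comparison behind the odds reported above, the sketch below shows how McNemar's test works on the two discordant cells of a before/after table. The discordant counts for this study are not given in the text, so the function is shown without data from the paper; the function name and the choice of the exact (binomial) version of the test are assumptions for illustration.

```python
import math


def mcnemar_exact(b, c):
    """Exact McNemar test for paired binary data.

    b: pairs that moved in one direction (e.g., very low/low -> moderate/high)
    c: pairs that moved in the opposite direction
    Returns (matched-pairs odds ratio b/c, two-sided exact p-value).
    """
    n = b + c
    k = min(b, c)
    # Under the null, each discordant pair is equally likely to go either way.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    odds_ratio = b / c if c else math.inf
    return odds_ratio, p_value
```

Only pairs whose CoE category changed contribute to the test; pairs that stayed in the same category carry no information about the direction of change.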
Figure 2
Change in certainty of evidence (CoE) in original and updated Cochrane systematic review. (A) across all categories of CoE as characterized by GRADE; (B) grouped as very low/low versus moderate/high‐quality evidence
However, we detected no effect of change in CoE on the magnitude of treatment effects [ROR = 1.02 (95% CI: 0.74–1.39) for change of CoE from very low/low to moderate/high versus 1.02 (95% CI: 0.44–2.37) for moderate/high to very low/low CoE]. Test between the subgroups was not significant (p = 1) (Figure 3). Although, as explained earlier, from guidelines recommendations perspectives, GRADE typically groups CoE as moderate/high versus low/very low, we also tried to compare the effect sizes at the two extremes of CoE: very low versus high. Because we observed no study with high CoE that changed into very low CoE (Figure 2A), ROR was impossible to calculate for this comparison.
Figure 3
Comparison of effects of health interventions in meta‐analyses in which certainty of evidence (CoE) changed from very low/low to moderate/high versus effects in meta‐analyses where CoE changed from moderate/high to very low/low (A); (B) summary of studies shown in (A) with addition of comparison of meta‐analyses where CoE did not change. ROR, ratio of odds ratio; τ² (tau²) statistic and H², measures of heterogeneity; I² statistic, measure of inconsistency
Nevertheless, there was larger dispersion in ROR in meta‐analyses where CoE changed from moderate/high to very low/low than in the opposite direction. This was probably driven by the low power of the analysis rather than reflecting the hypothesis that effect size would be larger if CoE changed from moderate/high to very low/low than the other way around. [We had half as many meta‐analyses available for the assessment of ROR based on change of CoE from moderate/high to very low/low (n = 16) as those in which CoE changed from very low/low to moderate/high (n = 33).]
aROR was similar between the subgroups [median (IQR): 1.12 (1.07–1.57) vs. 1.21 (1.12–2.43)] (Figure 4A, Table 1). As in the case of ROR, we observed larger dispersion in aROR in meta‐analyses where CoE changed from moderate/high to very low/low than in the opposite direction (Figures 4A,B).
Figure 4
(A) Absolute deviation (AD) of treatment effects (aROR) in meta‐analyses in which certainty of evidence (CoE) changed from very low/low to moderate/high versus effects in meta‐analyses where CoE changed from moderate/high to very low/low; (B) summary of aROR by change in CoE (For graph displaying aROR for all studies, including those that did not have change in CoE, see Supporting Information Appendix, App Figure S4 and App S4a)
Table 1
Summary of aROR (absolute deviation of treatment effects away from OR = 1)
| Subgroup | All data, median [IQR] | After dropping outliers (a), median [IQR] |
| All studies | 1.14 [1.05, 1.65] | 1.12 [1.03, 1.40] |
| VeryLow/Low → Mod/High | 1.12 [1.07, 1.57] | 1.11 [1.06, 1.47] |
| Mod/High → VeryLow/Low | 1.21 [1.12, 2.43] | 1.19 [1.11, 1.52] |
| CoE didn't change | 1.13 [1.04, 1.66] | 1.12 [1.03, 1.39] |
(a) After dropping studies that were converted to OR from studies that originally used standardized mean difference [SMD] (n = 20) and mean difference [MD] (n = 19) metrics to summarize treatment effects.
The meta‐analyses with no change in CoE had similar ROR [ROR = 1.01 (95% CI: 0.85 to 1.21)] (Figure 3B) and aROR [median (IQR): 1.13 (1.04–1.66)] (Table 1, App Figure S4 and App Figure SA) to those MAs in which CoE changed (Figure 4 and App Figure SA). Inconsistency was large across all meta‐analytic estimates (I² = 99%). There was larger imprecision in treatment effects when CoE changed from VL/L → M/H (seR = 1.46) than when it changed from M/H → VL/L (seR = 0.72).
Qualitative analysis indicated that direction of the effect changed in 6 SR/MAs only: two in the reviews in which CoE changed from very low/low to moderate/high (of which one was statistically significant) and four in SR/MAs with no change in the assessment of CoE (of which one was statistically significant) (Figure 5, App Figures S12 and S13).
Figure 5
Change in effect size, qualitative analysis (see also App Figures S12 and S13)
Sensitivity analyses for all pre‐defined subgroups showed no change in the results. In fact, when non‐randomized studies or outliers were excluded from the analyses, no statistically significant changes were seen in any of the analyses (Appendix).
DISCUSSION
Almost 30 years ago, EBM was introduced to a wide medical audience, subsequently being assessed to represent one of the most important medical milestones of the last 160 years, in the same category as innovations such as antibiotics and anesthesia. At the heart of EBM is the notion that ‘not all evidence is created equal’: some evidence is more credible than others; the higher the quality of evidence, the more accurate and trustworthy are our estimates about true effects of health interventions. Surprisingly, however, the relationship between CoE and estimates of treatment effects has not been empirically evaluated.
Here, we provide the first empirical support for the foundational EBM principle that low‐quality evidence changes more often than high CoE (Figure 2). However, we found no difference in effect sizes between studies appraised as very low versus high [or, very low/low versus moderate/high CoE (Figure 3)]. This implies that effects that are assessed as less trustworthy/potentially unreliable (as when CoE is low) cannot be distinguished from those that are presumably more trustworthy/accurate (as when CoE is high). If the magnitude of treatment effects cannot be meaningfully distinguished between evidence appraised as high versus low quality, then the core principle of EBM seems to be challenged.
Our ‘negative’ results should not be construed as a challenge to sound, normative EBM epistemological principles, which hold that optimal practice of medicine requires explicit and conscientious attention to the nature of medical evidence.
Rather, in assessing the relationship between CoE and ‘true’ effects of health interventions, the more salient question is whether the current appraisal methods capture CoE as intended by the EBM principles. Critical appraisal of CoE is an integral aspect of the conduct of systematic reviews and guidelines development and is widely integrated in the curricula of most medical and allied professional schools across the world. Over the years, many critical appraisal methods have been developed, eventually culminating in the development of GRADE methodology, which has been endorsed by more than 110 professional organizations. However, as we demonstrate here, despite GRADE's capacity to distinguish CoE across its categories, it could not reliably discern the influence of CoE on the estimates of treatment effects (and we suspect that none of the other appraisal methods that GRADE has replaced could either). The results agree with those of Gartlehner et al, who, based on cumulative meta‐analysis of 37 Cochrane reviews, found limited value of GRADE in predicting stability of strength of evidence as new studies emerged. Other authors also questioned the validity of GRADE as a system that is sufficiently empirically justified to ensure that our judgments are proportional to the underlying quality of evidence.
The finding that the magnitude of effect size is not reflected in a change of CoE is surprising, as elucidating the effects of bias that resulted in misleading advice to patients has been one of the key reasons for the rise of EBM. For example, a large body of observational evidence indicated that hormone replacement therapy (HRT) can reduce heart attack by 40%–50%, which resulted in advice to millions of women to take HRT to prevent heart attack. However, when high quality of evidence was generated, the opposite was observed: more women died from heart attack if they took HRT than if they took placebo. Similarly, thousands of women with breast cancer were advised to undergo highly toxic stem cell transplant based on unreliable observational evidence indicating improvement in disease‐free survival by about 50% compared with historical control, findings that were overturned once high‐quality randomized trials were done.
In addition, previous meta‐epidemiological studies showed that various study limitations that affect CoE significantly influence estimates of treatment effects (although not always consistently). For example, as measured by ROR, inadequate or unclear (vs. adequate) random‐sequence generation, inadequate or unclear (vs. adequate) allocation concealment, or lack of or unclear double‐blinding (vs. double‐blinding) led to statistically significant exaggeration of treatment effects by 11%, 7% and 13%, respectively. These study limitations are taken into account in rating of CoE using the GRADE method, so one would expect that effect size would differ between low versus high CoE in the GRADE assessment. However, on further examination, we observe that GRADE combines the study limitations such as adequacy of allocation concealment, blinding, etc. (risk of bias) with the assessment of inconsistency, imprecision, indirectness and publication bias to assign the final rating of CoE (from very low to high quality) in additive fashion. It appears that using additive means to report the properties of negative and positive changes in treatment effect could unhelpfully neutralize this effect and cause imprecision in the overall estimate. Thus, one can have the same estimates of treatment effects but completely different GRADE ratings. This is, however, problematic because a central assumption of GRADE is that estimates underpinned by high CoE are unlikely to change, whereas the very low/low CoE estimates are more likely to change.
A potential limitation of our study is that we have not collected data on the individual factors that drove assessment of CoE (i.e., study limitations/risk of bias vs. inconsistency, imprecision, or indirectness, for example). However, the present empirical report targets, for the first time, the end‐stage level assessment of CoE, according to GRADE specifications, which is how CoE is used in practice to aid interpretation of evidence and affect development of clinical guidelines.
We also detected imprecision in the estimates of effect sizes and relatively wide ROR confidence intervals, particularly in the subgroup of meta‐analyses describing treatment effects when CoE changed from moderate/high to low/very low. It may be argued that the current methods of CoE appraisal are simply not sensitive enough and that with a much larger sample size of SR/MAs, we would be able to differentiate between effect sizes across categories of CoE. This point was made by Howick and colleagues
who showed no change in the CoE between original and updated reviews in a set of the 48 trials they examined, albeit they made no attempt to identify changes in effect sizes. We also found that in 71 cases the updated reviews were based on inclusion of only 1 extra trial, which might not be enough to overturn or appreciably revise the effect estimate. However, sensitivity analyses comparing the changes in effect size as a function of the number of trials added in the updated meta‐analyses showed no difference in the results, regardless of the choice of cut‐off for the inclusion of these additional trials in the analysis (e.g., 1 vs. ≥3, or any other way). Importantly, critical appraisal (and GRADE) applies to both evidence obtained in single and multiple trials and is required in the Cochrane Reviews regardless of the quantity of existing evidence. Obtaining the larger sample sizes is also unrealistic given that we reviewed almost all SRs in the Cochrane database since the GRADE assessment of CoE was mandated (up to May 2021). Finally, few Cochrane Reviews we analyzed included observational studies. It is possible that GRADE may not differentiate the quality of randomized evidence well but that it may perform better if the comparison is made between randomized versus observational studies. The Cochrane Reviews, however, are typically based on randomized trials. Therefore, categorization of CoE based on currently mandated critical appraisal system using GRADE in the Cochrane Reviews does not meaningfully separate effect sizes across the existing gradation of CoE (although, capacity of GRADE to distinguish the magnitude of effect size between randomized and observational studies outside of the purview of Cochrane Reviews remains a worthwhile goal for further empirical research).Given that studies can be well done, and correctly estimated treatment effects, but be poorly reported,
,
it is also possible that we could not detect influence of CoE on the estimates of treatment effects because current critical appraisal methods depend on the quality of reporting of the trials that are selected for meta‐analysis. However, if we believe that quality of reporting does not matter, then the entire critical appraisal efforts can be considered misplaced to begin with.
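To make the remark about wide ROR confidence intervals concrete, the following is a minimal sketch of how a ratio of odds ratios and an approximate 95% confidence interval can be computed from two pooled odds ratios. It is not the analysis used in this study: the function name, the direction of the ratio (updated over original), the input values, and the treatment of the two estimates as independent are all illustrative assumptions. Independence in particular is a simplification, since an updated meta‐analysis usually contains the trials of the original one.

```python
import math


def ratio_of_odds_ratios(or_original, ci_original, or_updated, ci_updated, z=1.96):
    """Return (ROR, lower, upper) for ROR = or_updated / or_original.

    Hypothetical helper for illustration only. Standard errors are
    recovered from the reported 95% CIs on the log scale, and the two
    estimates are treated as independent (a simplifying assumption).
    """
    def se_from_ci(lower, upper):
        # A 95% CI on the log scale spans 2 * z standard errors.
        return (math.log(upper) - math.log(lower)) / (2 * z)

    se_original = se_from_ci(*ci_original)
    se_updated = se_from_ci(*ci_updated)

    log_ror = math.log(or_updated) - math.log(or_original)
    # Variances add for the difference of two independent log odds ratios.
    se_ror = math.sqrt(se_original ** 2 + se_updated ** 2)

    return (
        math.exp(log_ror),
        math.exp(log_ror - z * se_ror),
        math.exp(log_ror + z * se_ror),
    )


# Made-up inputs: pooled OR (95% CI) before and after a review update.
ror, lower, upper = ratio_of_odds_ratios(0.80, (0.60, 1.07), 0.85, (0.70, 1.03))
print(f"ROR = {ror:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the variances of the two log odds ratios add, the interval around the ROR is wider than either input interval, which is one reason a small subgroup of meta‐analyses can yield an ROR that is compatible with both an increase and a decrease in effect size.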
CONCLUSIONS
To the extent that it is central to the epistemology of EBM that what is justifiable or reasonable to believe depends on CoE, our findings indicate an urgent need to refine current EBM critical appraisal methods. If EBM is going to flourish, it is crucial to develop methods that can categorize CoE so as to reliably differentiate effect magnitudes that are potentially biased from those that are accurate and trustworthy. The major opportunity, therefore, lies in addressing the main limitation of this study: carefully and painstakingly discerning the various aspects of CoE (from the components related to study limitations/risk of bias to inconsistency, imprecision, or indirectness) to better characterize CoE and its relationship to the magnitude of effects of health interventions.
CONFLICT OF INTERESTS
The authors declare that there are no conflicts of interest.
AUTHOR CONTRIBUTIONS
The authors are an interdisciplinary team of EBM practitioners and instructors who work as clinicians, mathematicians, epidemiologists, statisticians, methodologists, and researchers across academic institutions, hospitals and clinics in the UK, Canada, USA, Brazil and Switzerland. Their research experience ranges from recently acquired doctorates to over 40 years in research and clinical practice. All authors contributed to the methods, commented on the analysis and contributed to writing and revising the manuscript. Our sources and selection criteria are contained within the document, the data are publicly available from the Cochrane Database, and our statistical methods are outlined in the methods, figures and tables. PRISMA was used to report our findings. BD serves as the guarantor of the article. Conceptual idea: Benjamin Djulbegovic; Design: Benjamin Djulbegovic and David Nunan; Protocol development: Benjamin Djulbegovic, Muhammad Muneeb Ahmed, David Nunan, Lars Hemkens, Despina Koletsi, Amy Price, Rachel Riera, Paulo Nadanovsky, Ana Paula Pires dos Santos, Daniela Melo, Rafael Leite Pacheco, Luis Eduardo Fontes; Data acquisition: Muhammad Muneeb Ahmed, Despina Koletsi, Amy Price, Rachel Riera, Paulo Nadanovsky, Ana Paula Pires dos Santos, Daniela Melo, Rafael Leite Pacheco, Luis Eduardo Fontes, Ranjan Pathak; Statistical analysis: Iztok Hozo, Benjamin Djulbegovic, Lars Hemkens; Drafting manuscript: Benjamin Djulbegovic; Critical revision of the manuscript for important intellectual content: Benjamin Djulbegovic, Lars Hemkens, David Nunan, Amy Price, Despina Koletsi, Rachel Riera, Paulo Nadanovsky, Ana Paula Pires dos Santos, Daniela Melo, Rafael Leite Pacheco, Luis Eduardo Fontes, Ranjan Pathak; Administrative, technical, or material support: Benjamin Djulbegovic, Muhammad Muneeb Ahmed; Supervision: Benjamin Djulbegovic.
Authors: Heloisa P Soares; Stephanie Daniels; Ambuj Kumar; Mike Clarke; Charles Scott; Suzanne Swann; Benjamin Djulbegovic Journal: BMJ Date: 2004-01-03
Authors: Alessandro Liberati; Douglas G Altman; Jennifer Tetzlaff; Cynthia Mulrow; Peter C Gøtzsche; John P A Ioannidis; Mike Clarke; P J Devereaux; Jos Kleijnen; David Moher Journal: J Clin Epidemiol Date: 2009-07-23 Impact factor: 6.437
Authors: Gordon H Guyatt; Andrew D Oxman; Regina Kunz; James Woodcock; Jan Brozek; Mark Helfand; Pablo Alonso-Coello; Paul Glasziou; Roman Jaeschke; Elie A Akl; Susan Norris; Gunn Vist; Philipp Dahm; Vijay K Shukla; Julian Higgins; Yngve Falck-Ytter; Holger J Schünemann Journal: J Clin Epidemiol Date: 2011-07-31 Impact factor: 6.437
Authors: W P Peters; M Ross; J J Vredenburgh; B Meisenberg; L B Marks; E Winer; J Kurtzberg; R C Bast; R Jones; E Shpall Journal: J Clin Oncol Date: 1993-06 Impact factor: 44.544
Authors: Jelena Savović; Hayley E Jones; Douglas G Altman; Ross J Harris; Peter Jüni; Julie Pildal; Bodil Als-Nielsen; Ethan M Balk; Christian Gluud; Lise Lotte Gluud; John P A Ioannidis; Kenneth F Schulz; Rebecca Beynon; Nicky J Welton; Lesley Wood; David Moher; Jonathan J Deeks; Jonathan A C Sterne Journal: Ann Intern Med Date: 2012-09-18 Impact factor: 25.391
Authors: David Atkins; Martin Eccles; Signe Flottorp; Gordon H Guyatt; David Henry; Suzanne Hill; Alessandro Liberati; Dianne O'Connell; Andrew D Oxman; Bob Phillips; Holger Schünemann; Tessa Tan-Torres Edejer; Gunn E Vist; John W Williams Journal: BMC Health Serv Res Date: 2004-12-22 Impact factor: 2.655
Authors: Benjamin Djulbegovic; Muhammad Muneeb Ahmed; Iztok Hozo; Despina Koletsi; Lars Hemkens; Amy Price; Rachel Riera; Paulo Nadanovsky; Ana Paula Pires Dos Santos; Daniela Melo; Ranjan Pathak; Rafael Leite Pacheco; Luis Eduardo Fontes; Enderson Miranda; David Nunan Journal: J Eval Clin Pract Date: 2022-01-28 Impact factor: 2.336