Adriani Nikolakopoulou (1,2), Dimitris Mavridis (2,3), Matthias Egger (1), Georgia Salanti (1,2,4)
1. Institute of Social and Preventive Medicine (ISPM), University of Bern, Bern, Switzerland.
2. Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece.
3. Department of Primary Education, University of Ioannina, Ioannina, Greece.
4. Bern Institute of Primary Care (BIHAM), University of Bern, Bern, Switzerland.
Abstract
Pairwise and network meta-analysis (NMA) are traditionally used retrospectively to assess existing evidence. However, the current evidence often undergoes several updates as new studies become available. In each update recommendations about the conclusiveness of the evidence and the need of future studies need to be made. In the context of prospective meta-analysis future studies are planned as part of the accumulation of the evidence. In this setting, multiple testing issues need to be taken into account when the meta-analysis results are interpreted. We extend ideas of sequential monitoring of meta-analysis to provide a methodological framework for updating NMAs. Based on the z-score for each network estimate (the ratio of effect size to its standard error) and the respective information gained after each study enters NMA we construct efficacy and futility stopping boundaries. A NMA treatment effect is considered conclusive when it crosses an appended stopping boundary. The methods are illustrated using a recently published NMA where we show that evidence about a particular comparison can become conclusive via indirect evidence even if no further trials address this comparison.
Keywords:
Sequential methods; efficacy and futility boundaries; multiple treatments; stopping rules; update of systematic reviews
In 1898, George Gould, the first president of the Association of Medical Librarians, presented his vision regarding the optimal use of existing evidence. He was looking forward to a situation where “a puzzled worker in any part of the civilized world shall in an hour be able to gain a knowledge pertaining to a subject of the experience of every other man in the world”.[1] Highlighting the increasing information overload and the pivotal role of systematic reviews in health care,[2] Mike Clarke updated Gould’s vision in 2004, hoping for a system in which decision makers “would be able, in 15 minutes, to obtain up-to-date, reliable evidence of the effects of interventions they might choose, based on all the relevant research.”[3]

In essence Gould and Clarke call for cumulative (network) meta-analyses of randomized trials of health care interventions.[4-7] Ideally, cumulative meta-analyses are prospectively planned: investigators establish a collaboration before the design of their trials is finalized, so that study procedures, interventions and outcomes can be harmonized and analyses can be done as soon as the results become available.[6,7] Prospectively planned meta-analyses have the potential to reduce bias because key decisions on inclusion criteria, outcome definition and other procedures are made a priori.[7] Several prospective meta-analyses have been conducted in recent years, for example in cardiology[8,9] or oncology.[10,11]

However, the vast majority of meta-analyses are not prospectively planned. Reviewers tend to update their meta-analysis when relevant studies are published but have no direct influence over the planning of future studies. Nevertheless, after each update they need to characterize the evidence (for a particular treatment comparison and outcome) as conclusive or not, decide whether future updates of the evidence are needed and recommend whether or not further studies should be carried out.
The Cochrane Collaboration has a policy about when a systematic review should be updated.[12] Updating a meta-analysis (either because it is prospectively planned or because its result would be used to form a decision about conclusiveness) involves multiple tests as evidence accumulates and effect sizes are recalculated at each step, resulting in an inflated type I error.[13-15] Sequential methods for standard pairwise meta-analysis have been developed to account for multiple testing and adjust nominal significance.[5,16-19]

For many conditions, several treatment options exist and data on their comparative effectiveness are of primary interest to clinicians. At present comparisons of one treatment with no treatment, or with placebo, continue to dominate clinical research, and head-to-head comparisons remain uncommon. Network meta-analysis (NMA) addresses this situation. Under the condition that studies are similar with respect to the variables that might modify the treatment effects, NMA can synthesize evidence from trials that form a network of interventions in a single analysis. Summary estimates of comparative effectiveness for all treatment options are thus obtained, including treatments that have not been compared directly in head-to-head comparisons.[20,21] In line with recent calls for comparative evidence at the time of market authorization,[22,23] Naci and O’Connor suggested the use of prospective, cumulative NMA in the regulatory setting.[19] Evidence on relative effects of treatments can become conclusive even if there are no new trials that directly compare them because of new studies contributing indirect evidence.

In this article we extend ideas of sequential monitoring of trials to provide methods for updating NMAs.
We argue that sequential methods are relevant in any setting where a decision is to be made based on the results of an updated meta-analysis; when future studies are to be planned based on existing meta-analytic results (prospective meta-analysis) or when decisions are made about the necessity of future updates. We introduce cumulative NMA, discuss ways to adjust for multiple testing and recommend graphical representations of the sequential NMA process. We then discuss how important outputs of NMA can be monitored when updating a NMA.
2 Illustrative example: Coronary revascularization in diabetic patients
To illustrate the methodology we use a recently published NMA evaluating the optimal revascularization technique in diabetic patients.[24] The primary outcome examined is a composite of all-cause mortality, non-fatal myocardial infarction and stroke, measured using the odds ratio (OR). The authors combined 15 studies examining the effectiveness of three interventions: percutaneous coronary intervention with bare-metal stents (BMS) or drug-eluting stents (DES), and coronary artery bypass grafting (CABG). For illustration purposes we consider that the NMA has been undertaken sequentially; each study is included in the data as soon as it is published and the systematic reviewers have to decide, after each update, whether future updates of the NMA are necessary to provide a conclusive answer. This particular NMA was chosen because it examines few treatments and includes a substantial number of studies, so that the methods can be easily exemplified and the sequential process conveniently presented. Throughout we assume that comparability between trial populations and characteristics that may act as potential effect modifiers is justified, so that the synthesis of the planned trials in a NMA model is sensible. The data set comprises 12 two-arm studies and three three-arm studies. NMA suggests that the best treatment is CABG, which is significantly better than BMS (OR 0.59; 95% confidence interval 0.44 to 0.78) and marginally better than DES (OR 0.73; 95% confidence interval 0.54 to 0.98). Studies were published between 2007 and 2013 and it would be interesting to see whether significance is sustained after correcting for multiple testing and, if so, at which point in time the accumulated evidence was conclusive.
Note that when updating NMA the comparison of ‘BMS versus CABG’ can become statistically significant via indirect comparison even when the newly published studies address ‘BMS versus DES’.

In order to undertake a sequential analysis, one needs to specify type I and type II errors, as well as the alternative hypothesis. The specification of the effect size to be detected is of crucial importance as the alternative hypothesis should express a clinically important effect reflecting the perspectives, needs and preferences of different individuals.[15,25-28] However, determination of an effect that reflects patient perceptions is very challenging when the primary outcome is a composite endpoint.[24] For illustrative reasons, in the remainder of the paper we will use arbitrary (yet clinically plausible) log ORs for the three comparisons of approximately 0.28, 0.10 and 0.18; these correspond to ORs of 1.32, 1.11 and 1.20, respectively. In clinical applications however, we recommend the consideration of a variety of alternative hypotheses taking into account patient preferences that may be driven by discomfort, inconvenience and risk of adverse events.[28] Note that in a NMA context, the alternative treatment effects need to be consistent (e.g. here 0.28 = 0.10 + 0.18 on the log OR scale). Particular attention is needed when more than three treatments are examined; alternative effect sizes should be determined for all comparisons in the network and consistency between them needs to be satisfied. Clinicians who suggest values for the alternative effect sizes are often asked to guess absolute effects for the various treatments. Relative effects derived from such absolute effects are consistent by construction, so the assumption of consistency would be satisfied in practice.
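The consistency requirement on the alternative effect sizes can be checked numerically. The following is a minimal sketch, assuming the three ORs quoted above and rounding the log ORs to two decimals; the helper name `is_consistent` is hypothetical and not part of the original analysis.

```python
import math

# Alternative odds ratios quoted in the text for the three comparisons.
alternative_ors = [1.32, 1.11, 1.20]

# Work on the additive (log OR) scale, rounded to two decimals.
log_ors = [round(math.log(x), 2) for x in alternative_ors]

def is_consistent(largest, part_a, part_b, tol=1e-9):
    """True if one log OR equals the sum of the other two (triangle consistency)."""
    return abs(largest - (part_a + part_b)) <= tol

consistent = is_consistent(log_ors[0], log_ors[1], log_ors[2])
print(log_ors, consistent)
```

In a larger network the same check would have to hold for every closed loop of comparisons, which is easiest to guarantee by deriving all relative effects from one set of assumed absolute effects.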
3 Methods
3.1 Cumulative NMA
Consider a network of n trials forming a set of T competing interventions for a healthcare condition. We assume that the evidence base is updated sequentially; each trial, indexed with $i = 1, \ldots, n$, enters the analysis when its results become available. After the inclusion of each study, pairwise and NMA models are updated and cumulative treatment effects are derived. We assume that the number and timing of interim analyses are not known at the start of NMA and that updates take place after the publication of any new study that meets the inclusion criteria. The method can be generalized to NMAs that are updated after more than one study is included.

Let $\hat{\boldsymbol{\mu}}^{D}_{i}$ be the vector of all cumulative direct relative effects for each treatment comparison after the inclusion of trial i. Vector $\hat{\boldsymbol{\mu}}^{N}_{i}$ contains the respective cumulative NMA treatment effects, derived from any appropriate statistical NMA model which integrates direct and indirect evidence and accounts for the correlation introduced by multi-arm trials.[29-31] Elements of $\hat{\boldsymbol{\mu}}^{D}_{i}$ and $\hat{\boldsymbol{\mu}}^{N}_{i}$ are replaced with the addition of each study i to represent the updated treatment estimates. As evidence is accumulated and treatments are added to the evidence base, $\hat{\boldsymbol{\mu}}^{D}_{i}$ and $\hat{\boldsymbol{\mu}}^{N}_{i}$ may change dimensions to include additional treatment effects, with the dimension of $\hat{\boldsymbol{\mu}}^{D}_{i}$ being equal to or smaller than the dimension of $\hat{\boldsymbol{\mu}}^{N}_{i}$. In the last step they will contain at most $T(T-1)/2$ treatment effects and will be denoted as $\hat{\boldsymbol{\mu}}^{D}_{n}$ and $\hat{\boldsymbol{\mu}}^{N}_{n}$ respectively; note that the dimensions of both vectors will be exactly $T(T-1)/2$ in a fully connected network. We may focus on each element of $\hat{\boldsymbol{\mu}}^{D}_{i}$ and $\hat{\boldsymbol{\mu}}^{N}_{i}$ or restrict ourselves to a subset of comparisons that are of more interest. Reasons to restrict the set of comparisons of interest may include the establishment of their comparative effectiveness or safety, their association with adverse events or even the withdrawal of certain treatments from the market.

Consider for instance the comparison ‘Y versus X’; $\hat{\mu}^{D}_{XY,i}$ and $\hat{\mu}^{N}_{XY,i}$ denote the respective cumulative direct and NMA treatment effects with standard errors $s^{D}_{XY,i}$ and $s^{N}_{XY,i}$, where X and Y are two of the T treatments and i indexes the last study introduced. Similarly to cumulative pairwise meta-analysis, a cumulative NMA is a mechanism of displaying the cumulative NMA treatment effects along with their confidence intervals for $i = 1, \ldots, n$ in a table or in a plot. Each ‘Y versus X’ NMA cumulative effect is modified not only when a study comparing the particular set of treatments is performed, but also when indirect evidence that informs the ‘Y versus X’ comparison becomes available. From this point we will focus on $\hat{\mu}^{N}_{XY,i}$ to illustrate the sequential methodology for NMA estimates. The developments equally apply to any element of $\hat{\boldsymbol{\mu}}^{N}_{i}$ as well as to the direct cumulative estimates.
3.2 Assumptions underlying the updating of NMA
The justification of similarity in effect modifiers is important to ensure the plausibility of the transitivity assumption after each update of the network.[19,20,32] Throughout, we assume that the transitivity assumption is epidemiologically evaluated and deemed reasonable. The consistency assumption is the statistical manifestation of transitivity and rests on the statistical agreement between different sources of evidence.[33] A statistical test for inconsistency can be monitored as soon as its evaluation is possible, that is when a closed loop (not composed only of multi-arm trials) is formed. Large amounts of inconsistency should prohibit a joint synthesis of the data and prompt an exploration of the differences between the various sources of evidence. However, the power of inconsistency tests might be low even after the inclusion of several studies in NMA.[34,35] In collaborative prospective NMAs inconsistency is likely to be avoided through the efforts of the researchers to ensure the comparability of the studies and maximize the chances of transitivity.

We adopt a random effects NMA model and we assume a network-specific heterogeneity variance $\tau^2$. One could re-estimate the heterogeneity variance at each step of the analysis; this process would be associated with poor estimation of heterogeneity while the number of included studies is small.[36,37] To overcome this limitation we choose to inform the unknown heterogeneity parameter by predictive distributions conditional on the type of outcome and treatment comparison based on findings from previous meta-analyses.[38,39] In order to account for uncertainty in the imputations of the heterogeneity parameter, we suggest the use of the 25th, the 50th and the 75th quantiles of the respective predictive distribution of heterogeneity formulated in Turner et al. (for binary outcomes) and Rhodes et al. (when continuous outcomes are assessed).[38,39] Setting a priori an expected value for heterogeneity might be more appropriate in the setting of prospective meta-analysis as studies are prospectively designed and their inclusion criteria are similar. Alternative strategies for heterogeneity, such as the re-estimation of heterogeneity after a sizeable number of included studies, could be applied.
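To make the idea of a fixed, imputed heterogeneity concrete, the sketch below runs a cumulative random-effects synthesis for a single pairwise comparison, repeated for three heterogeneity values. It is a simplified pairwise illustration, not the NMA model itself; the study effects and the three values of the heterogeneity variance are hypothetical stand-ins for quantiles of a predictive distribution.

```python
import math

def cumulative_random_effects(studies, tau2):
    """Cumulative inverse-variance random-effects estimates with a fixed tau^2.

    studies: list of (effect, sampling_variance) in order of publication.
    Returns one (pooled_effect, standard_error) pair per update.
    """
    updates = []
    sum_w = 0.0
    sum_wy = 0.0
    for effect, variance in studies:
        weight = 1.0 / (variance + tau2)  # heterogeneity is imputed, not re-estimated
        sum_w += weight
        sum_wy += weight * effect
        updates.append((sum_wy / sum_w, math.sqrt(1.0 / sum_w)))
    return updates

# Hypothetical log odds ratios and sampling variances for one comparison.
studies = [(-0.61, 0.25), (-0.16, 0.04), (-0.17, 0.17)]

# Hypothetical low, middle and high heterogeneity variances standing in for
# the 25th, 50th and 75th quantiles of a predictive distribution.
for tau2 in (0.01, 0.05, 0.15):
    final_effect, final_se = cumulative_random_effects(studies, tau2)[-1]
    print(f"tau2={tau2}: logOR={final_effect:.3f}, SE={final_se:.3f}")
```

Because the heterogeneity variance enters every weight, a larger imputed value yields a larger standard error at every update, which is why the sensitivity analysis over several quantiles is informative.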
3.3 Z-score and relevant information of cumulative network estimates
The cumulative network estimate $\hat{\mu}^{N}_{XY,i}$ of ‘Y versus X’ after update i is assumed to approximately follow the normal distribution; we assume variances to be known and equal to the sampling variances, denoted as $(s^{N}_{XY,i})^2$. The null hypothesis $H_0: \mu_{XY} = 0$ is tested using the statistic

$$Z^{N}_{XY,i} = \frac{\hat{\mu}^{N}_{XY,i}}{s^{N}_{XY,i}}$$

which we refer to as z-score. It is rejected if $|Z^{N}_{XY,i}| > z_{1-\alpha/2}$ for a two-sided test, where the value $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution.

Several approaches have been suggested for measuring information in pairwise meta-analysis; we adopt an approach which is directly related to the precision of the meta-analytic estimates and consequently to the amount of evidence accumulated.[5,15,40,41] According to that approach the information contained within each comparison in the network can be measured as

$$I^{N}_{XY,i} = \frac{1}{(s^{N}_{XY,i})^2}$$

We will conventionally refer to $I^{N}_{XY,i}$ as the ‘amount of information’. Plotting the z-score versus the amount of information at each update i provides a visualization of the accumulation of evidence for the network estimate ‘Y versus X’.
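As a small illustration of the two quantities plotted against each other, the sketch below converts a sequence of cumulative estimates and standard errors into z-scores and amounts of information, taking information as the inverse of the squared standard error. The numbers are hypothetical and the function name is an assumption made for this example.

```python
def z_score_and_information(estimate, standard_error):
    """Return (z-score, amount of information) for one cumulative estimate."""
    z = estimate / standard_error
    information = 1.0 / standard_error ** 2
    return z, information

# Hypothetical cumulative log odds ratios and standard errors after updates 1..4.
cumulative = [(0.40, 0.50), (0.35, 0.30), (0.30, 0.20), (0.32, 0.12)]

# Each element is one point of the path in the information versus z-score plane.
path = [z_score_and_information(m, s) for m, s in cumulative]
for update, (z, info) in enumerate(path, start=1):
    print(f"update {update}: information={info:.1f}, z={z:.2f}")
```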
3.4 Construction of efficacy stopping boundaries
Several methods have been proposed to control type I error in clinical trials when multiple looks at the data are taken, through the construction of stopping boundaries for deciding whether or not to reject the null hypothesis. These methods include the Haybittle-Peto method, the Pocock boundaries and the O’Brien-Fleming monotone decreasing boundaries.[42] Application of different stopping boundaries can lead to different conclusions regarding early stopping of a clinical trial in an interim analysis. It has been suggested that the O’Brien-Fleming method is closer to the behavior of data monitoring committees, who require a great beneficial effect to stop a trial at an early stage.[43] An important problem associated with standard sequential methods is the necessity to define the number of interim analyses at the beginning and the requirement of equally spaced interim analyses. These problems are handled by the introduction of alpha spending functions, which extend group sequential designs to allow flexibility in the number and timing of interim analyses.[44] An alpha spending function describes the rate at which the total significance level is spent at each intermediate testing; the information fraction t indicates the proportion of the information that has been accumulated.

Appending efficacy boundaries to the plane of z-score against amount of information can lead to a stopping framework when updating NMA. We adopt the continuous alpha spending function which resembles the O’Brien-Fleming boundaries, defined as

$$\alpha(t) = 2 - 2\Phi\left(\frac{z_{1-\alpha/2}}{\sqrt{t}}\right)$$

where Φ represents the cumulative standard normal distribution.[42,44] The parameter t indicates the position in the analysis regarding the accumulated information and is calculated as $t_{XY,i} = I^{N}_{XY,i} / I^{\max}_{XY}$, where $I^{N}_{XY,i}$ is the amount of information accumulated for ‘Y versus X’ after update i and $I^{\max}_{XY}$ is the maximum information.

As the total amount of information that will be employed is unknown, the specification of $I^{\max}_{XY}$ needs to rest on assumptions. In order to specify the respective quantity in a sequential framework for pairwise meta-analysis, Wetterslev et al. assume that studies are approximating one big trial and follow conventional calculations made in sequential analysis of individual trials.[41] Higgins et al. use values obtained from the O’Brien and Fleming design for specific values of the alternative effect size, type I and type II errors.[15] We specify $I^{\max}_{XY}$ following conventional power calculations, imposing consistency between the alternative effect sizes for all comparisons involved in the network and taking into account the multiplicity induced by multiple comparisons. Specifically, $I^{\max}_{XY}$ is derived as the information that would be needed in an adequately powered multi-arm trial. As $I^{N}_{XY,i}$ involves the estimation of heterogeneity, $t_{XY,i}$ and the z-score $Z^{N}_{XY,i}$ are also affected by the heterogeneity value. In particular, larger heterogeneity values are associated with smaller $I^{N}_{XY,i}$ and with the respective meta-analytic estimates occupying places which are further to the left in the plane. Details on the derivation of $I^{\max}_{XY}$ can be found in Appendix A1.

The alpha spending function is used to allocate a portion of the total α to each $t_{XY,i}$. The efficacy boundaries are the quantiles corresponding to $\alpha(t_{XY,i})$. If $Z^{N}_{XY,i}$ crosses the boundaries (that is, if its absolute value exceeds the boundary at $t_{XY,i}$) the meta-analysis has reached a conclusive answer for the ‘Y versus X’ comparison. Note that even when a NMA effect estimate is deemed conclusive, indirect evidence may continue to feed into that particular comparison if the rest of the comparisons in the network do not contain sufficient evidence to infer about their conclusiveness.
Similarly, a NMA effect estimate is updated and might reach conclusiveness even in the absence of trials addressing that particular comparison because of indirect evidence. A counter-intuitive situation can occur when a conclusive result becomes inconclusive in the next update. This could be the result of an important increase in heterogeneity or inconsistency leading to less precise effect estimates when data are synthesized using the random effects approach. In such situations, formal exploration and interpretation of sources of variability is required before the new evidence is included in an update of the network.

Figure 1 panel a presents the plane of z-score against amount of information for a fictional example where nine studies are synthesized sequentially and conclusiveness is achieved after eight studies. The plane along with the derived efficacy stopping boundaries can equivalently be presented by repeated confidence intervals on the estimates of the summary effects, calculated as the estimate plus or minus the efficacy boundary multiplied by the standard error of the estimate.[44,45] The repeated confidence interval would include 0 when the absolute z-score lies below the efficacy boundary. This particular representation provides the same information regarding stopping decisions while it offers the advantage of displaying the NMA stopping framework in a forest plot along with the effect estimates,[15] as shown in Figure 1 panel b.
Figure 1.
Panel a: Hypothetical stopping framework for efficacy and futility. Futility here means that Y will not be shown better than X by more than 0.5 effect size. Panel b: Hypothetical forest plot with repeated confidence intervals (dotted lines).
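The efficacy boundaries and repeated confidence intervals described in this section can be sketched in a few lines. The example assumes the standard O'Brien-Fleming-type spending function, alpha(t) = 2 - 2*Phi(z / sqrt(t)) with z the (1 - alpha/2) normal quantile, and takes the boundary as the pointwise normal quantile corresponding to alpha(t), which simplifies to z / sqrt(t). Boundaries computed by dedicated group-sequential software, which integrate over the joint distribution of successive looks, may differ somewhat. All function names and the example values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def alpha_spent(t, alpha=0.05):
    """Cumulative two-sided type I error spent at information fraction t."""
    t = min(t, 1.0)  # information beyond the planned maximum spends no extra alpha
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * STD_NORMAL.cdf(-z / math.sqrt(t))  # equals 2 - 2*Phi(z / sqrt(t))

def efficacy_boundary(t, alpha=0.05):
    """Pointwise quantile corresponding to alpha_spent(t); equals z / sqrt(t)."""
    t = min(t, 1.0)
    return STD_NORMAL.inv_cdf(1 - alpha / 2) / math.sqrt(t)

def repeated_confidence_interval(estimate, standard_error, t, alpha=0.05):
    """Estimate plus or minus the efficacy boundary times the standard error."""
    half_width = efficacy_boundary(t, alpha) * standard_error
    return estimate - half_width, estimate + half_width

# Hypothetical update: log OR 0.30, standard error 0.12, 45% of maximum information.
estimate, se, t = 0.30, 0.12, 0.45
z_score = estimate / se
conclusive = abs(z_score) > efficacy_boundary(t)
print(round(alpha_spent(t), 4), round(efficacy_boundary(t), 2), conclusive)
print(repeated_confidence_interval(estimate, se, t))
```

By construction the repeated confidence interval excludes zero exactly when the absolute z-score exceeds the boundary, so the two displays lead to the same stopping decision.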
3.5 Construction of futility stopping boundaries
Future updates of NMA can be considered unnecessary when there are early signs of efficacy or because it is considered unlikely that the relative superiority of a treatment will be shown in subsequent steps of analysis. Such decisions in clinical trials are known as stopping for futility.[46] Roughly, there are four major methods used to stop further experiments for futility: conditional power; predictive power, which is the analogue of conditional power in Bayesian analysis; construction of triangular regions, also known as sequential probability ratio tests; and beta spending functions.[46-48] We choose to transfer the latter method for stopping for futility to NMA because of its analogy to the alpha spending functions and its convenient visualization along with the efficacy boundaries in the plane of z-score against amount of information. Note that the use of conditional power in NMA has been considered elsewhere.[49]

We adopt a method described by Lachin to determine futility boundaries.[50] Without loss of generality we assume that positive values of the NMA estimate $\hat{\mu}^{N}_{XY,i}$ represent a relative advantage of treatment Y. We consider that a study is futile if it cannot show that Y is better than X with an effect size of at least $\delta_{XY}$. The treatment effect parameter is an additive measure and should be defined so that it represents a clinically significant advantage of Y over X. Then, a decision to stop for futility can be specified if the upper confidence limit of the interim effect estimate does not exceed the pre-defined value $\delta_{XY}$. It turns out that the futility confidence limit is equivalent to the determination of a futility stopping boundary for the interim z-score $Z^{N}_{XY,i} = \hat{\mu}^{N}_{XY,i} / s^{N}_{XY,i}$. Writing $I^{N}_{XY,i} = 1/(s^{N}_{XY,i})^2$ for the amount of information and $z_c$ for the critical value used in the upper confidence limit, the futility stopping rule for a relative advantage of Y over X can be expressed as

$$Z^{N}_{XY,i} < \delta_{XY}\sqrt{I^{N}_{XY,i}} - z_c$$

and we define the futility boundaries for Y over X on the plane as $F_{XY,i} = \delta_{XY}\sqrt{I^{N}_{XY,i}} - z_c$. Note that while $\delta_{XY}$ is constant throughout the analysis, $F_{XY,i}$ depends on the amount of information accumulated at the ith update.

The value $\delta_{XY}$ could be set equal to the alternative effect size employed in power calculations. Values of clinical significance should be chosen so as to satisfy consistency; that is, for a triangular network that includes treatments X, Y and Z, we need to specify the respective values for only two out of the three treatment comparisons. If we specify $\delta_{XY}$ and $\delta_{XZ}$ it turns out that the value of clinical significance for the comparison ‘Z versus Y’ is $\delta_{YZ} = \delta_{XZ} - \delta_{XY}$. In the hypothetical example illustrated in Figure 1 panel a we present the futility boundary for the case that we expect Y to be better than X with an effect size of at least 0.5, which results in a decision of stopping for futility after the inclusion of the third study.

It has been shown that under the alternative hypothesis, stopping for futility inflates type II error.[51] A common solution to this limitation is to delay inferences regarding stopping for futility in the updating procedure, for instance appending futility boundaries only after at least half of the total planned information has been accumulated (that is at an information fraction of $t = 0.5$).[50] The vertical line in Figure 1 panel a indicates this point in the analysis, which is termed ‘half-information futility assessment’.
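The futility rule can be sketched directly from its verbal definition: stop when the upper confidence limit of the interim estimate falls below the clinically relevant effect, and only assess futility once half of the maximum information has accumulated. In the sketch below the critical value of the confidence limit is passed in as a parameter because its level is an assumption of this example, and all names and numbers are hypothetical.

```python
import math

def futility_boundary(delta, information, critical_value):
    """z-score below which the upper confidence limit falls short of delta."""
    return delta * math.sqrt(information) - critical_value

def stop_for_futility(z, delta, information, max_information,
                      critical_value, min_fraction=0.5):
    """Apply the futility rule, delayed until min_fraction of max_information."""
    if information / max_information < min_fraction:
        return False  # too early: futility is not assessed yet
    return z < futility_boundary(delta, information, critical_value)

# Hypothetical update: estimate 0.05, standard error 0.25 (information 16),
# clinically relevant effect 0.5, maximum information 20, one-sided 95% limit.
estimate, se = 0.05, 0.25
information = 1.0 / se ** 2
decision = stop_for_futility(estimate / se, 0.5, information, 20.0, 1.645)
print(decision)
```

The boundary form and the confidence-limit form are algebraically the same rule: dividing the inequality "estimate + critical value x standard error < delta" by the standard error gives the z-score inequality used in the function.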
3.6 Other network characteristics to be monitored
Monitoring changes in the conclusions from NMA should be accompanied by an evaluation of changes in inconsistency and heterogeneity (if re-estimated at each update) so as to put results into context. Investigators planning NMA should make sure that the inclusion criteria of the studies ensure their comparability and maximize the chances of transitivity and that the distribution of effect modifiers is comparable across treatment comparisons. However, even after careful planning, there is always the possibility of inconsistency in the assembled data.[20,52] Thus, we consider that each update of NMA includes an estimation of inconsistency; here, we consider the cumulative performance of the loop-specific approach.[53] Taking into account the low power of tests for inconsistency, we do not recommend adjusting for multiple testing.[34,35] Any signs of inconsistency in interim stages should be explored and the inclusion of new evidence should be carefully reconsidered.

Monitoring changes in the treatment ranking might also be useful, in particular in large networks where many treatments are compared. Probabilities for each treatment being at each possible rank can be obtained and the surface under the cumulative ranking probabilities (SUCRAs) and their equivalent P-scores or mean ranks can be illustrated in graphs.[54,55] As these measures are based on the estimated summary effects at each update, their uncertainty should be expressed by the repeated confidence intervals while P-scores could be based on the adjusted p values.
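A loop-specific inconsistency check of the kind mentioned above compares the direct estimate of one comparison with the indirect estimate obtained through the third treatment of a triangle. The sketch below assumes the three direct estimates come from independent two-arm evidence, so that variances add; the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist

def loop_inconsistency(direct_xy, direct_xz, direct_zy):
    """Inconsistency in the X-Y-Z loop from direct evidence only.

    Each argument is (estimate, standard_error) on an additive scale:
    direct_xy is 'Y versus X', direct_xz is 'Z versus X', direct_zy is 'Y versus Z'.
    """
    indirect_xy = direct_xz[0] + direct_zy[0]        # 'Y versus X' through Z
    inconsistency = abs(direct_xy[0] - indirect_xy)
    se = math.sqrt(direct_xy[1] ** 2 + direct_xz[1] ** 2 + direct_zy[1] ** 2)
    p_value = 2 * (1 - NormalDist().cdf(inconsistency / se))
    return inconsistency, se, p_value

# Hypothetical direct log odds ratios with their standard errors.
inconsistency, se, p_value = loop_inconsistency((0.6, 0.2), (0.2, 0.1), (0.1, 0.2))
print(f"IF={inconsistency:.2f}, SE={se:.2f}, p={p_value:.3f}")
```

Repeating this calculation at each update, without multiplicity adjustment, gives the cumulative view of inconsistency; a loop formed only by a multi-arm trial is consistent by construction and is not informative.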
4 Application
We apply our methodology to the network of trials for coronary revascularization in diabetic patients.[24] Arm level data for the 15 studies along with the year of publication and the respective ORs are given in Appendix Table 1. For a ‘non-pharmacological versus any’ intervention comparison type and a semi-objective outcome a log-normal distribution for heterogeneity has been recommended corresponding to 25th, 50th and 75th quantiles , , and respectively (Appendix Figure 4).[39] We adopt a significance level of and a type II error . Using the alternative effect sizes described in section 2 we estimate the maximum information needed to detect them as , , and . To derive futility boundaries we assumed values equal to the alternative effect sizes; that is we consider it is futile to continue undertaking trials if we cannot show that CABG is better than DES and BMS with log ORs and , respectively. From consistency it follows that (in favor of DES).
Appendix Table 1.
Data of the network of coronary revascularization in diabetic patients.
| Study | Publication time | Set of treatments compared | Treatment X | Treatment Y | Events X | Events Y | Sample size X | Sample size Y | OR (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| 1. Jimenez-Quevedo | August 2007 | ‘DES vs. BMS’ | DES | BMS | 7 | 12 | 80 | 80 | 0.54 (0.20, 1.46) |
| 2. Rodriguez | September 2007 | ‘DES vs. BMS vs. CABG’ | DES | CABG | 11 | 6 | 47 | 39 | 1.68 (0.56, 5.05) |
| 2. Rodriguez | September 2007 | ‘DES vs. BMS vs. CABG’ | DES | BMS | 11 | 5 | 47 | 39 | 2.08 (0.65, 6.60) |
| 2. Rodriguez | September 2007 | ‘DES vs. BMS vs. CABG’ | BMS | CABG | 5 | 6 | 39 | 39 | 0.81 (0.22, 2.91) |
| 3. Kirtane | February 2008 | ‘DES vs. BMS’ | DES | BMS | 58 | 68 | 408 | 419 | 0.86 (0.58, 1.25) |
| 4. Maresta | June 2008 | ‘DES vs. BMS’ | DES | BMS | 14 | 16 | 75 | 75 | 0.85 (0.38, 1.89) |
| 5. Booth | July 2008 | ‘BMS vs. CABG’ | BMS | CABG | 7 | 9 | 68 | 74 | 0.83 (0.29, 2.36) |
| 6. Chan | November 2008 | ‘DES vs. BMS’ | DES | BMS | 2 | 4 | 54 | 29 | 0.24 (0.04, 1.40) |
| 7. Caixeta | September 2009 | ‘DES vs. BMS’ | DES | BMS | 40 | 37 | 195 | 233 | 1.37 (0.83, 2.24) |
| 8. Kapur | February 2010 | ‘DES vs. BMS vs. CABG’ | BMS | CABG | 13 | 26 | 82 | 248 | 1.61 (0.78, 3.30) |
| 8. Kapur | February 2010 | ‘DES vs. BMS vs. CABG’ | DES | CABG | 20 | 26 | 172 | 248 | 1.12 (0.61, 2.09) |
| 8. Kapur | February 2010 | ‘DES vs. BMS vs. CABG’ | DES | BMS | 20 | 13 | 172 | 82 | 0.70 (0.33, 1.48) |
| 9. Mauri | December 2010 | ‘DES vs. BMS’ | DES | BMS | 35 | 19 | 555 | 132 | 0.40 (0.22, 0.73) |
| 10. Onuma | March 2011 | ‘DES vs. BMS vs. CABG’ | DES | BMS | 25 | 28 | 159 | 112 | 0.56 (0.31, 1.02) |
| 10. Onuma | March 2011 | ‘DES vs. BMS vs. CABG’ | DES | CABG | 25 | 16 | 159 | 96 | 0.93 (0.47, 1.85) |
| 10. Onuma | March 2011 | ‘DES vs. BMS vs. CABG’ | BMS | CABG | 28 | 16 | 112 | 96 | 1.67 (0.84, 3.31) |
| 11. Park | May 2011 | ‘DES vs. CABG’ | DES | CABG | 12 | 9 | 102 | 90 | 1.20 (0.48, 2.30) |
| 12. Sinning | March 2012 | ‘DES vs. BMS’ | DES | BMS | 30 | 30 | 95 | 95 | 1.00 (0.54, 1.84) |
| 13. Farkouh | December 2012 | ‘DES vs. CABG’ | DES | CABG | 253 | 177 | 953 | 947 | 1.57 (1.26, 1.95) |
| 14. Kamalesh | February 2013 | ‘DES vs. CABG’ | DES | CABG | 27 | 19 | 101 | 97 | 1.50 (0.77, 2.92) |
| 15. Kappetein | May 2013 | ‘DES vs. CABG’ | DES | CABG | 54 | 39 | 231 | 221 | 1.42 (0.90, 2.26) |
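The OR column of the table can be reproduced from the arm-level counts. The sketch below uses the usual log odds ratio with a Wald-type confidence interval and applies it to the first row of the table (7 events out of 80 for DES, 12 out of 80 for BMS); the function name is hypothetical and no continuity correction for zero cells is included.

```python
import math
from statistics import NormalDist

def odds_ratio_with_ci(events_x, n_x, events_y, n_y, level=0.95):
    """Odds ratio of X versus Y with a Wald confidence interval on the log scale."""
    odds_x = events_x / (n_x - events_x)
    odds_y = events_y / (n_y - events_y)
    log_or = math.log(odds_x / odds_y)
    # Standard error of the log OR: square root of the summed reciprocal cell counts.
    se = math.sqrt(1 / events_x + 1 / (n_x - events_x)
                   + 1 / events_y + 1 / (n_y - events_y))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# First row of the table: DES 7/80 versus BMS 12/80.
odds_ratio, lower, upper = odds_ratio_with_ci(7, 80, 12, 80)
print(f"{odds_ratio:.2f} ({lower:.2f}, {upper:.2f})")  # 0.54 (0.20, 1.46)
```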
Appendix Figure 5.
Cumulative pairwise (black) and network meta-analysis (red) estimates for the three comparisons in the network of coronary revascularization in diabetic patients along with predictive intervals (dotted lines). Heterogeneity standard deviation is assumed to be equal to the median of the predictive distribution, . Effects are measured as log odds ratios. logOR: log odds ratio. CI: confidence interval. PrI: predictive interval.
4.1 Description of the accumulation of evidence
When evidence is updated regularly, researchers perform both pairwise and NMA and evaluate the criteria of stopping early for efficacy or futility for both procedures. Appendix Figure 5 shows the cumulative pairwise and NMA effect estimates along with their confidence and predictive intervals after the inclusion of each study.

Figure 2 shows the stopping framework for the three evaluated comparisons in the network assuming a heterogeneity standard deviation equal to the median of the predictive distribution.
Figure 2.
Stopping framework for efficacy (solid lines) and futility (dashed lines) for the network of coronary revascularization in diabetic patients. Maximum information is not displayed in the graphs as it is everywhere larger than 10. Heterogeneity standard deviation is assumed to be equal to the median of the predictive distribution, . Black circles indicate that the latest update comes from a study with direct evidence; blue circles indicate that the latest update comes from indirect evidence and red circles indicate that the latest update comes from a three-arm trial (both direct and indirect evidence). Stopping for efficacy is taking place if observations are outside the efficacy boundaries. The arrow on the Y-axis indicates the side of the futility boundary that suggests stopping. Conventional significance thresholds are represented with dotted lines.
While inference regarding the comparison ‘BMS versus CABG’ is inconclusive using evidence only from the four trials providing direct evidence, this is not the case for the accumulated evidence from NMA. More specifically, the 13th study was published in December 2012 and examined the relative effectiveness of DES compared to CABG. This study informs the comparison ‘BMS versus CABG’ indirectly, leading to a conclusion that further research is not needed for that particular comparison. Note that the comparison ‘BMS versus CABG’ would have become marginally significant after the inclusion of the 12th study in an unadjusted cumulative NMA as the respective z-score lies on the dotted boundary which represents conventional stopping. The inclusion of nearly half of the included studies rendered the ‘DES versus CABG’ comparison statistically significant in favor of DES in an unadjusted cumulative NMA; adjusting for multiple testing though, both ‘DES versus BMS’ and ‘DES versus CABG’ comparisons remain inconclusive using either pairwise meta-analysis or NMA (Figure 2).

For all three comparisons, the accumulated data do not cross the futility boundaries, so no decision to stop for futility is made at any point in the updating process of NMA or pairwise meta-analysis.

During the updating process, the inclusion of the 13th study would lead investigators to reach conclusive results for one of the three evaluated comparisons, indicating that CABG is better than BMS. After the inclusion of 15 studies, DES would appear to have a non-significant advantage over BMS and CABG a statistically non-significant benefit over DES. As the maximum information for these two comparisons would not have been reached, studies would continue to be performed (if the particular comparisons were still of interest).

The information regarding stopping for efficacy using results from pairwise meta-analysis or NMA given in Figure 2 can also be visualized in the form of repeated confidence intervals (Appendix Figure 6).
Appendix Figure 6.
Cumulative pairwise (black) and network meta-analysis (red) estimates for the three comparisons in the network of coronary revascularization in diabetic patients along with repeated confidence intervals (dotted lines). Heterogeneity standard deviation is assumed to be equal to the median of the predictive distribution, . Exclusion of line of no effect from the repeated confidence interval suggests that the particular comparison provides conclusive evidence after adjusting for multiple testing. Effects are measured as log odds ratios and are given in Appendix Figure 5. RCI: repeated confidence intervals.
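The relation between an efficacy boundary on the z-scale and a repeated confidence interval can be sketched in a few lines of Python. The sketch assumes an O'Brien-Fleming-type alpha-spending function in its Lan-DeMets form and takes the boundary value at the current update as a user-supplied input rather than deriving it. The function names, the example estimate, and the boundary values are illustrative assumptions and are not taken from the analysis.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided type I error spent at a given information fraction.

    Assumes the Lan-DeMets O'Brien-Fleming-type spending function:
    alpha*(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), for 0 < t <= 1.
    """
    if not 0.0 < info_fraction <= 1.0:
        raise ValueError("information fraction must lie in (0, 1]")
    z_half_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z_half_alpha / info_fraction ** 0.5))


def repeated_ci(estimate, std_error, boundary_z):
    """Repeated confidence interval: estimate +/- boundary_z * std_error.

    boundary_z is the efficacy boundary on the z-scale at the current update;
    it is supplied by the caller here (an assumption of this sketch).
    """
    half_width = boundary_z * std_error
    return estimate - half_width, estimate + half_width


def is_conclusive(estimate, std_error, boundary_z, null_value=0.0):
    """True when the repeated confidence interval excludes the line of no effect."""
    lower, upper = repeated_ci(estimate, std_error, boundary_z)
    return not (lower <= null_value <= upper)


# Hypothetical update: log odds ratio -0.50 with standard error 0.20.
print(is_conclusive(-0.50, 0.20, boundary_z=1.96))  # conventional threshold
print(is_conclusive(-0.50, 0.20, boundary_z=3.00))  # stricter, adjusted boundary
```

Excluding the null value from the interval is the same check as the absolute z-score exceeding the boundary, which is why the two displays carry the same stopping information.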
Assuming the 25th or the 75th quantile of the predictive distribution for heterogeneity instead of the median does not markedly change the conclusions of the stopping framework (Appendix Figure 7 and Appendix Figure 8). The influence of the 13th study on the stopping decisions remains pronounced. In general, greater values of heterogeneity widen the repeated confidence intervals, so stopping for efficacy is less likely to occur.
Appendix Figure 7.
Cumulative pairwise (black) and network meta-analysis (red) estimates for the three comparisons in the network of coronary revascularization in diabetic patients along with repeated confidence intervals (dotted lines). Heterogeneity standard deviation is assumed to be equal to the 25th quantile of the predictive distribution, . Exclusion of line of no effect from the repeated confidence interval suggests that the particular comparison provides conclusive evidence after adjusting for multiple testing. Effects are measured as log odds ratios and are given in Appendix Figure 5. RCI: repeated confidence intervals.
Appendix Figure 8.
Cumulative pairwise (black) and network meta-analysis (red) estimates for the three comparisons in the network of coronary revascularization in diabetic patients along with repeated confidence intervals (dotted lines). Heterogeneity standard deviation is assumed to be equal to the 75th quantile of the predictive distribution, . Exclusion of line of no effect from the repeated confidence interval suggests that the particular comparison provides conclusive evidence after adjusting for multiple testing. Effects are measured as log odds ratios and are given in Appendix Figure 5. RCI: repeated confidence intervals.
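Why larger heterogeneity widens the repeated confidence intervals can be illustrated with a simplified pairwise random-effects calculation. The following is a minimal sketch, assuming inverse-variance weights of the form 1 / (v_i + tau^2); the within-study variances, the heterogeneity values, and the boundary value are hypothetical and are not the quantiles used in the analysis.

```python
from math import sqrt


def random_effects_se(within_variances, tau):
    """Standard error of an inverse-variance random-effects summary estimate.

    Each study is weighted by 1 / (v_i + tau^2), so the variance of the
    summary estimate is 1 / sum(weights).
    """
    weights = [1.0 / (v + tau ** 2) for v in within_variances]
    return sqrt(1.0 / sum(weights))


# Hypothetical within-study variances (log odds ratio scale).
variances = [0.10, 0.08, 0.15, 0.05]
boundary_z = 2.5  # hypothetical efficacy boundary at the current update

for tau in (0.05, 0.15, 0.30):
    half_width = boundary_z * random_effects_se(variances, tau)
    print(f"tau={tau:.2f}  RCI half-width={half_width:.3f}")
```

Because tau^2 is added to every within-study variance, each weight shrinks as tau grows, and the interval half-width increases for a fixed boundary.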
Appendix Figure 9 shows the cumulative estimates of the inconsistency factor for the loop ‘DES-BMS-CABG’. The inconsistency factor decreased from 1.34 (on the logOR scale) in 2007 to 0.84 in 2009 and finally to a relatively small value of 0.26 in 2013. The confidence intervals become narrower as more studies are included and, although the method is underpowered, initial concerns that the network might be inconsistent are challenged.
Appendix Figure 9.
Accumulated inconsistency plot using the loop specific approach for the network of coronary revascularization in diabetic patients. Heterogeneity standard deviation is assumed to be equal to the median of the predictive distribution, . Dotted lines indicate the latest update comes from a three-arm trial. Inconsistency factors are measured on the logOR scale. IF: inconsistency factor.
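The loop-specific inconsistency factor compares the direct estimate of one comparison with the indirect estimate obtained through the other two comparisons in the loop. The sketch below is a minimal illustration under the assumption that the direct and indirect sources are independent, which ignores the correlation induced by multi-arm trials. The function name and all numbers are hypothetical.

```python
from math import sqrt


def loop_inconsistency(direct, var_direct, leg_estimates, leg_variances, z=1.96):
    """Loop-specific inconsistency factor (IF) for one closed loop.

    direct, var_direct: direct estimate (e.g. a log odds ratio) and its
        variance for the comparison that closes the loop.
    leg_estimates, leg_variances: estimates and variances of the other two
        comparisons, oriented so that their sum is the indirect estimate.
    Returns (IF, standard error, lower limit, upper limit).
    """
    indirect = sum(leg_estimates)
    inconsistency = abs(direct - indirect)
    # Independence assumption: variances of direct and indirect evidence add up.
    std_error = sqrt(var_direct + sum(leg_variances))
    return (inconsistency, std_error,
            inconsistency - z * std_error, inconsistency + z * std_error)


# Hypothetical loop A-B-C: direct A vs C, indirect via A vs B plus B vs C.
if_value, se, lower, upper = loop_inconsistency(
    direct=0.50, var_direct=0.04,
    leg_estimates=[0.10, 0.20], leg_variances=[0.03, 0.05],
)
print(f"IF={if_value:.2f}, SE={se:.2f}, CI=({lower:.2f}, {upper:.2f})")
```

Recomputing this quantity after each update gives the kind of accumulated inconsistency display described in the caption above.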
We calculate the SUCRAs of the three treatments at each interim analysis, allowing for the uncertainty expressed by the repeated confidence intervals. Cumulative estimation of SUCRAs is illustrated in Figure 3. Repeated SUCRAs are relatively close to each other in the first years of the sequential NMA but become increasingly distinct as evidence accumulates.
Figure 3.
Accumulated SUCRAs for the network of coronary revascularization in diabetic patients. Heterogeneity standard deviation is assumed to be equal to the median of the predictive distribution, . SUCRA: surface under the cumulative ranking probabilities.
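For reference, a SUCRA value is the average of a treatment's cumulative ranking probabilities over the first a - 1 ranks, where a is the number of treatments. The sketch below computes it from ranking probabilities that are taken as given; how those probabilities are derived from the repeated confidence intervals is not reproduced here, and the probabilities shown are hypothetical.

```python
def sucra(rank_probabilities):
    """SUCRA for one treatment from its ranking probabilities.

    rank_probabilities[k] is the probability of being ranked (k + 1)-th best
    among a treatments. SUCRA is the mean of the first a - 1 cumulative
    ranking probabilities: 1 means certainly best, 0 means certainly worst.
    """
    n_treatments = len(rank_probabilities)
    if n_treatments < 2:
        raise ValueError("at least two treatments are needed")
    if abs(sum(rank_probabilities) - 1.0) > 1e-8:
        raise ValueError("ranking probabilities must sum to 1")
    cumulative, total = 0.0, 0.0
    for probability in rank_probabilities[:-1]:
        cumulative += probability
        total += cumulative
    return total / (n_treatments - 1)


# Hypothetical ranking probabilities (best, second, worst) for three treatments.
rank_probs = {
    "CABG": [0.80, 0.15, 0.05],
    "DES": [0.15, 0.60, 0.25],
    "BMS": [0.05, 0.25, 0.70],
}
for treatment, probabilities in rank_probs.items():
    print(treatment, round(sucra(probabilities), 3))
```

Across all treatments in a network the SUCRA values sum to a / 2, which is a convenient sanity check on the ranking probabilities.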
It is important to note that the assumptions feeding into the analysis (values for heterogeneity, type I and type II error, alternative effect sizes) may not be acceptable to all health-care professionals and patients. Thus, results from a sequential NMA should be interpreted in the light of such decisions. Moreover, firm recommendations on the need for further studies should take into account that new studies might be useful for the examination of a secondary outcome; indeed Tu et al. point out that although CABG seems to be better than BMS and DES in terms of the primary outcome, it is associated with an increased risk of stroke and might not be preferred for patients at high risk of such an event.[24] In that case, it is even more important to avoid undertaking further trials that involve CABG because its superiority has been established and further experimentation might be deemed unethical. Instead, indirect evidence, e.g. by planning more ‘BMS versus DES’ studies, should be sought for all comparisons of interest. In general, clinical judgment considering several outcomes that might be of interest to patients is necessary to evaluate which intervention is appropriate for which patient group.
5 Discussion
We suggest formal statistical monitoring when decisions need to be made every time a NMA is updated. The outlined method is adapted from the respective methodologies developed for clinical trials and pairwise meta-analyses. We consider two situations in which our methodology can be appropriate; in both, studies are analyzed as their results become available. The first is the prospective design of a NMA at the time of market entry of a new drug, as suggested by Naci and O’Connor.[19] In contrast to current practice, in which drug approval often relies on the evaluation of each drug in placebo-controlled trials, such a procedure would provide regulatory agencies with the optimal level of evidence regarding the comparative efficacy and safety of the new drug. Establishing prospective NMAs in the regulatory setting may be challenged by the potential reluctance of manufacturers to compare their treatments with all competing alternatives, which might lead to selective inclusion of pieces of evidence. Moreover, efforts to reduce the cost of performing a series of trials might lead to postponing the design of a prospective NMA until a competing company has collected enough relevant evidence. Health technology assessments that inform policy decision-making could also include an evaluation of the sufficiency of the included evidence using the methods described in this paper.
The second context in which our method can be used is the regular update of systematic reviews that contain multiple treatments when new trials become available. Statistical monitoring is of particular interest to organizations that produce and maintain systematic reviews, such as the Cochrane Collaboration. As the main aim of the Cochrane Collaboration is to provide the best available and most up-to-date evidence, authors not only prepare systematic reviews but are also committed to updating them. This commitment aims to minimize the risk of reviews becoming out of date and potentially misleading. Frequent updates of systematic reviews, however, can result in an inflated type I error, in a similar manner as in a genuinely prospective NMA.
Appending a stopping rule to the meta-analysis context has received considerable criticism.[15,56] In particular, the concerns expressed highlight the lack of direct control over the process of collecting and synthesizing studies, in the sense that the meta-analyst is not in a position to decide whether more trials are to be conducted. We consider formal statistical monitoring to be relevant for situations in which a researcher can have control over, or at least provide recommendations on, future updates of the meta-analysis.
Our methods are similar to those proposed by Whitehead and Higgins et al. for pairwise meta-analyses, extended to the case where multiple treatments are competing.[5,15] Whitehead developed a sequential method for meta-analysis using the triangular test in a series of concurrent clinical trials, and Higgins et al. focused on the restricted procedure of Whitehead, equivalent to an O’Brien and Fleming boundary.[5,15] Wetterslev et al. have developed an alternative sequential method for pairwise meta-analysis;[41] they have also created software (www.ctu.dk/tsa) that has been widely applied in practice, and they argue that their methodology should be adopted by Cochrane authors.[25] Their approach has technical and conceptual similarities and dissimilarities with that proposed by Higgins et al.[15,25] For instance, Wetterslev et al. adjust the required sample size by a factor that depends on the estimated heterogeneity. As estimation of heterogeneity is difficult at the beginning of the sequential process, the estimate at the final update is employed, which is feasible only in a retrospective cumulative meta-analysis. Higgins et al. explore several ways to handle heterogeneity in a sequential random-effects meta-analysis approach, including incorporating a prior distribution for the between-studies variance parameter. They argue in the discussion that “further empirical research is needed to characterize the degree of heterogeneity that can be anticipated in a meta-analysis with particular clinical and methodological features, so that realistic informative prior distributions can be formulated.”[15] As such empirical research has been conducted since then,[38,39] we here employ informative priors for the heterogeneity variance.
Whitehead suggests that the sequential procedure in meta-analysis may be more justifiable for safety outcomes, while Higgins et al. propose the area of adverse effects of pharmacological interventions as a potential application of sequential methods.[5,15] Whether the proposed methods work well in the context of rare events remains to be investigated. It has been argued that when a major adverse event is rare it might be inappropriate to control for inflated type I error, as even a small signal could be sufficient for the meta-analysis to ‘stop’.[5] In any case, the practice of accumulating evidence in a formal way becomes even more imperative in the context of rare events.
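The inflation of type I error under repeated updating, mentioned above for frequently updated reviews, can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not part of the paper: a fixed-effect pairwise meta-analysis, a true effect of zero, studies of equal precision, and an unadjusted test at the conventional 1.96 threshold after every new study. The function name and settings are hypothetical.

```python
import random


def any_look_rejection_rate(n_looks, n_sim=4000, crit=1.96, seed=1):
    """Proportion of simulated meta-analyses that are 'significant' at least once.

    Under the null, each new study contributes an effect estimate drawn from
    N(0, 1). After k studies the cumulative fixed-effect z-score is
    sum(effects) / sqrt(k), and it is tested against the same unadjusted
    threshold at every update.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        running_sum = 0.0
        for k in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / k ** 0.5) > crit:
                rejections += 1
                break  # the analysis would have been declared significant here
    return rejections / n_sim


for looks in (1, 5, 10):
    print(looks, "updates:", any_look_rejection_rate(looks))
```

With a single analysis the rejection rate stays near the nominal 5%, while testing after every update pushes the chance of at least one false-positive finding well above that level, which is the motivation for adjusted boundaries.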
6 Concluding remarks
Technological advances can contribute decisively to the realization of living systematic reviews (high-quality, up-to-date online summaries that are updated as new research becomes available) by semi-automating the production process. The inclusion of all available treatment options in such ‘real-time’ syntheses has been termed “live cumulative network meta-analysis” and can further facilitate informed research prioritization and decision-making.[57] Development, refinement, and evaluation of appropriate statistical methodology, as well as guidance on the optimal updating of systematic reviews, can help living systematic reviews and live cumulative NMA bridge the gap between research evidence and health care practice.[4]
The methodology described in this paper should ideally be viewed as part of a holistic framework for strengthening existing evidence by judging when evidence summaries provide conclusive answers,[28,58] planning new studies when needed[27,49,58,59] and subsequently updating the meta-analysis to include the (assumed justified) future studies. While methodological developments regarding parts of this process have appeared in the literature, they are rarely used in practice. In order to shift the paradigm to evidence-based research planning, methodology needs to be refined and summarized in a comprehensive global framework, while its properties need to be evaluated in real-world examples. The development of user-friendly software routines along with educational material could also contribute to the usefulness and applicability of the methodology.