Mediation analysis is a useful tool to illuminate the mechanisms through which an exposure affects an outcome but statistical challenges exist with time-to-event outcomes and longitudinal observational data. Natural direct and indirect effects cannot be identified when there are exposure-induced confounders of the mediator-outcome relationship. Previous measurements of a repeatedly-measured mediator may themselves confound the relationship between the mediator and the outcome. To overcome these obstacles, two recent methods have been proposed, one based on path-specific effects and one based on an additive hazards model and the concept of exposure splitting. We investigate these techniques, focusing on their application to observational datasets. We apply both methods to an analysis of the UK Cystic Fibrosis Registry dataset to identify how much of the relationship between onset of cystic fibrosis-related diabetes and subsequent survival acts through pulmonary function. Statistical properties of the methods are investigated using simulation. Both methods produce unbiased estimates of indirect and direct effects in scenarios consistent with their stated assumptions but, if the data are measured infrequently, estimates may be biased. Findings are used to highlight considerations in the interpretation of the observational data analysis.
Mediation analysis is a useful tool to illuminate the mechanisms through which an exposure affects an outcome but statistical challenges exist with time-to-event outcomes and longitudinal observational data. Natural direct and indirect effects cannot be identified when there are exposure-induced confounders of the mediator-outcome relationship. Previous measurements of a repeatedly-measured mediator may themselves confound the relationship between the mediator and the outcome. To overcome these obstacles, two recent methods have been proposed, one based on path-specific effects and one based on an additive hazards model and the concept of exposure splitting. We investigate these techniques, focusing on their application to observational datasets. We apply both methods to an analysis of the UK Cystic Fibrosis Registry dataset to identify how much of the relationship between onset of cystic fibrosis-related diabetes and subsequent survival acts through pulmonary function. Statistical properties of the methods are investigated using simulation. Both methods produce unbiased estimates of indirect and direct effects in scenarios consistent with their stated assumptions but, if the data are measured infrequently, estimates may be biased. Findings are used to highlight considerations in the interpretation of the observational data analysis.
Mediation analysis is used to statistically explore the possible mechanisms through which a treatment or exposure affects an outcome. To achieve this, the analysis attempts to decompose the total effect of the exposure on the outcome into an indirect effect and a direct effect. The indirect effect is the part of the total effect that is realised by the exposure acting on the mediator and that mediator then acting on the outcome. The direct effect captures the portion of the total effect that does not act via the mediator. Through this decomposition we hope to gain insight about the process through which an effect occurs. We refer readers to VanderWeele
for a thorough overview of mediation analysis.Most methods for mediation analysis have focused on a single mediator and a continuous or binary outcome, though extensions to more complex settings are emerging. In this paper, we are interested in the setting where the outcome is time to an event and the mediator is a time-dependent variable for which repeated measurements are available. The repeatedly-measured nature of the mediator allows us to focus on a process that evolves over time. For example, consider a situation where the exposure is the onset of a condition that causes a progressive elevation in a biomarker and that high levels of this biomarker lead to increased risk of death. One challenge that arises is that the mediator measurement from one time point may confound the association between the mediator measured at a later time point and the outcome. This is an example of exposure-induced confounding of the mediator-outcome association and, for identification, many methods assume this type of confounding does not exist. Time-to-event outcomes pose another difficulty as survival for a given time is a post-exposure confounder of the association between a mediator measured at a later time and subsequent survival (i.e. survival is required to take a mediator measurement) and, therefore, the exposure affects the mediator both directly and indirectly via survival time.Vansteelandt et al.
recently proposed a method for mediation analysis in this setting by estimating the combination of path-specific effects where the exposure first acts upon the repeatedly-measured mediator. In other words, this method estimates how the effect of the exposure on the outcome is mediated via the entire longitudinal mediator process. As their approach also accommodates time-varying mediator-outcome confounders, it is suitable for situations where multiple mediators may exist. A second mediation method for this setting was proposed by Aalen et al.
and is based on dynamic path analysis using the additive hazards model. As this method requires that control for confounding be made using only covariate measurements taken prior to the exposure, it is restricted to settings without time-varying confounders.In Buse et al.
the method proposed by Vansteelandt et al.
was applied to data from a randomised controlled trial (the LEADER trial) to identify possible mediators of the effect of treatment on the risk of cardiovascular events. In another application, this method was used to quantify the indirect effect of treatment on the risk of a composite kidney disease outcome via several candidate mediators.
In analyses such as these based on data from randomised controlled trials, control for confounding of the exposure-outcome and exposure-mediator relationships is rendered unnecessary via the randomisation and there is a clearly defined starting time for each individual at the time of randomisation. To the best of our knowledge, to date the methods of Vansteelandt et al.
and Aalen et al.
have only been applied to data from randomised controlled trials. Our focus is on the use of these methods to address mediation questions using observational data. When working with observational data, such as a registry dataset, control for confounding can only be accomplished via adjustment with measured covariates and in cases where the exposure is onset of a condition, there is no natural time zero for comparison of the exposed and unexposed. Further, measurements of time-updated variables are taken on a schedule designed for long-term data collection as opposed to targeting a specific research question. We discuss these issues in our motivating example, which is based on data from the UK Cystic Fibrosis Registry dataset.Our primary aim is to more thoroughly evaluate two methods available for mediation analysis in the setting of a repeatedly-measured mediator and a time-to-event outcome: the method of Vansteelandt et al.
and the method of Aalen et al.
Although Vansteelandt et al.
provide a basic simulation illustrating their approach for cases with and without direct and indirect effects in an appendix, we are not aware of any extensive simulation study of their method or any simulation study of the method of Aalen. Here, we use simulation to look both at scenarios where we expect good performance as well as at scenarios that may challenge the methods. Further, we apply these methods to analyse mediation in cystic fibrosis-related diabetes. This is the first application of causal mediation methods to the UK Cystic Fibrosis Registry dataset and may motivate further research into mechanisms of disease progression. The paper is organised as follows. Section 2 introduces the motivating application: cystic fibrosis-related diabetes. In Section 3, a brief introduction to causal mediation analysis is provided as well as descriptions of the two mediation analysis methods studied. We present a simulation study assessing the performance of the two methods with a focus on bias in Section 4. Section 5 contains an analysis of the UK Cystic Fibrosis Registry dataset to investigate mediation of the effect of cystic fibrosis-related diabetes on survival. We conclude with a discussion in Section 6.
This study is motivated by the setting of cystic fibrosis (CF) and the desire to better understand the mechanisms associated with mortality. CF is a genetic, life-shortening disease that affects more than 10,500 people in the UK
and approximately 100,000 people worldwide.
It is characterised by a progressive loss of lung function and most people with CF die from respiratory failure.
Although there is no cure for CF, improved care and early diagnosis have led to substantial improvements in life expectancy, with a median predicted survival age for babies born today in the UK of 50.6 years.
The increasing lifespan of people with CF is concomitant with an increased risk of co-morbidities, the most common being CF-related diabetes (CFRD). CFRD has been shown to be associated with an increased risk of mortality[9-12] but the mechanisms for this effect are not well understood. One hypothesis is based on the association of CFRD with worse pulmonary function.[13-15] Using causal mediation analysis, we aim to quantify how much of the effect of CFRD on survival is mediated through lung function.We use data from the UK CF Registry, which holds data on more than 12,000 people with CF representing over 99% of the CF population in the UK. This registry dataset contains demographic information, genotype, and time-updated measures of pulmonary function, bacterial infections and other health indicators.
Data are systematically collected at approximately annual routine monitoring visits conducted at specialist CF centres.
3 Causal mediation
3.1 Background
We begin by outlining some concepts for the setting of a single mediator and a non-time-to-event outcome. Many mediation analyses based on counterfactuals will aim to estimate the natural indirect effect (NIE) and natural direct effect (NDE).[17,18] In a setting with a binary exposure
, a single mediator
measured once and outcome
, define
and
to be the outcome for individual
if the exposure had been set to 1 (exposed) or 0 (unexposed), respectively. We can only observe one of these outcomes for each individual
; the other is counterfactual. We similarly define
and
as the mediator value that would have been seen if
were exposed or unexposed, respectively. Let
represent the potential outcome for individual
if their exposure was set to
and the mediator was set to the level it would have taken if the exposure had been
. Note that
may or may not equal
. Using this nested counterfactual framework, the NIE and NDE can be defined as:The NIE captures the change in outcome that would result from fixing the exposure but changing the mediator from the level it would have taken if exposed to the level it would have taken if unexposed. The NDE captures the effect of the exposure on the outcome if the mediator had taken the level it would have taken without exposure. The total causal effect of the exposure on the outcome is the sum of the NIE and NDE. These natural effects can be estimated under the assumption of no unmeasured exposure-outcome, mediator-outcome or exposure-mediator confounding and that there is no exposure-induced mediator-outcome confounding. These are strong assumptions that not even a randomised controlled trial may satisfy. In particular, the assumption of no exposure-induced mediator-outcome confounding is problematic because it requires that no such confounder exists. In other words, being able to measure such a confounder does not allow us to obtain unbiased estimates of the NIE and NDE. Although this assumption will not hold in many settings, it may be reasonable if the mediator is measured a very short time after the exposure.In the setting studied here involving a repeatedly-measured mediator, survival outcome, and possible time-varying confounding, natural indirect and direct effects cannot be identified. As mentioned in the Introduction, there are problems with exposure-induced mediator-outcome confounding and with survival acting as a post-exposure confounder. In the next sections, we outline the methods of Vansteelandt et al.
and Aalen et al.
which have been developed to overcome these limitations.
3.2 Method of Vansteelandt et al. (2019)
Vansteelandt et al.
proposed a mediation analysis method suitable for settings with a time-to-event outcome, time-updated mediator and time-varying confounding of the mediator-outcome association using counterfactual scenarios based on a hypothetical intervention on the mediator. Their proposal infers the effect of the exposure on the outcome through combinations of path-specific effects via the time-updated mediator measurements. A key contribution of this method is that it allows for time-varying mediator-outcome confounders, which could themselves be affected by the exposure.Consider a situation in which mediators and other time-dependent covariates are observed at regular visit times. The data-generating mechanism and causal ordering for the case of two post-exposure visits is shown in Figure 1. Let
be a binary exposure measured at time
0 and
index the visit at which the mediator and other time-updated variables are measured.
and
are the values of the mediator and the time-varying confounder(s) at visit
, respectively.
includes an indicator of whether the person remains at risk at visit
. The set of baseline confounders, which may include
and
, are represented by
.
Figure 1.
Data-generating mechanism assumed in the method of Vansteelandt for the case of two post-exposure visits. The time-varying confounder
may influence
and
may influence
. Survival to visit
is included in the definition of
.
indicates survival past the second mediator measurement and
is the set of baseline covariates, which includes
and
.
Data-generating mechanism assumed in the method of Vansteelandt for the case of two post-exposure visits. The time-varying confounder
may influence
and
may influence
. Survival to visit
is included in the definition of
.
indicates survival past the second mediator measurement and
is the set of baseline covariates, which includes
and
.The indirect effect of the exposure on the outcome via the mediator is taken to be the combination of paths where one or more measurements of the mediator are directly influenced by the exposure and the mediator subsequently affects the outcome. These pathways that make up the indirect effect are shown in black in Figure 2 – upper panel. Conversely, the effect of the exposure on the outcome not via the mediator, referred to as the direct effect, is defined as the combination of paths where the exposure does not directly influence the mediator. These pathways include those in which the exposure first affects
and then
affects
as shown in black in Figure 2 – bottom panel.
Figure 2.
Highlighted path-specific effects. In the top directed acyclic graph (DAG), the path-specific indirect effect of
on
via
is the combination of pathways in black. This is the set of pathways where
first affects
. On the bottom, the combination of pathways in black captures the effect of
on
not via
. Note that these pathways may involve
, but only when
affects another variable first. This is referred to as the direct effect.
Highlighted path-specific effects. In the top directed acyclic graph (DAG), the path-specific indirect effect of
on
via
is the combination of pathways in black. This is the set of pathways where
first affects
. On the bottom, the combination of pathways in black captures the effect of
on
not via
. Note that these pathways may involve
, but only when
affects another variable first. This is referred to as the direct effect.Under the method of Vansteelandt, we estimate the survival probabilities
, where
denotes the counterfactual event time had a person’s exposure been set to
and their mediators set to the level they would have taken if the exposure had been
. For example, for
this is the probability of survival to time
if all individuals had exposure
, time-varying confounders
remain at the levels they would have taken with exposure
and mediator levels
were set to the levels they would have taken with exposure
if
had taken the value it would have taken under
. Because
and
can affect one another, their counterfactual values rely on all previously set levels. For
,
is set to
and
is set to
. By varying
and
, three survival functions of interest are estimated:
,
and
and these can be contrasted as differences or ratios to estimate the path-specific effects via
and not via
. We use ratios as this simplifies estimation of the method of Aalen (see next section) and define the IE(
) and DE(
) as:
Based on the edge g-formula applied to counterfactuals
and assuming no unmeasured confounding, the causal ordering shown in Figure 1, non-informative censoring and that the causal structure represents a non-parametric structural equation model with independent errors, Vansteelandt et al.
demonstrate that
can be identified by:
where
(
) is the history of the mediator (time-varying confounders) up to time
and
is the visit time at or before time
. Note that
contains both the individual’s measured clinical levels and an indicator of whether they remain at risk at time
. Although similar, this is distinct from the mediational g-formula for non-nested counterfactuals.
Repeated regressions are used to estimate
. Models are specified for each term in equation (5) and fitted using the observed data. The models are fitted in a sequential way, working backwards from the last visit time to the first. At each step, the outcome of a model is a predicted fitted value between 0 and 1 from a previous model. The final result is an estimated survival probability to time
for each individual, conditional on
, and these are averaged to estimate
. We refer readers to Vansteelandt et al.
for details of the procedure. Because this method does not assume a particular parametric model, any suitable model may be used in each regression.
3.3 Method of Aalen et al. (2020)
Aalen et al.
proposed a mediation analysis method for the special case of a time-to-event outcome and time-updated mediator where control for confounding of the exposure-mediator and exposure-outcome relationships can be achieved using only the set of baseline confounders. When using this method, it is assumed that there are no time-varying confounders of the mediator-outcome association and only a single mediator. A key idea in the method of Aalen is exposure separation.[22,23] This assumes that the exposure can be separated into two components: one that acts on the mediator process and one that affects survival either directly or through pathways not involving the mediator. Biologically, this means the exposure must be able to be split into separate physiological mechanisms that we could, in theory, manipulate independently of one another.We use the same notation as described in the method of Vansteelandt and introduce
and
, which denote the separate components of the exposure
.
represents the part of the exposure that acts through the mediator process and
represents the part of the exposure that affects survival via pathways not through the mediator. Define
to be an indicator of survival time
being greater than
. Figure 3 shows the data-generating mechanism for two post-exposure visits. The separation of the exposure into two components is represented by bold arrows. Because individuals will either be exposed or unexposed, we will only observe the scenario where
. However, exposure separation contemplates hypothetical interventions on both
and
where
is not necessarily equal to
.
Figure 3.
Data-generating mechanism assumed in the method of Aalen for the case of two post-exposure visits. The effect of the exposure
(bold arrows) is split into one component,
, that affects the mediator and one,
, that affects survival through other pathways not involving the mediator.
indicate survival to visits 1 and 2, respectively and
represents the survival outcome past the second visit.
is the set of baseline covariates, which includes
.
Data-generating mechanism assumed in the method of Aalen for the case of two post-exposure visits. The effect of the exposure
(bold arrows) is split into one component,
, that affects the mediator and one,
, that affects survival through other pathways not involving the mediator.
indicate survival to visits 1 and 2, respectively and
represents the survival outcome past the second visit.
is the set of baseline covariates, which includes
.The estimand is a survival probability,
where
is the counterfactual survival time given a hypothetical intervention where
was set to level
and
was set to level
. This is estimated using a continuous-time mediational g-formula derived from a linear mediator model and an additive hazards model. Aalen et al.
show how a mediator model that is marginal over the past history of
is specified. The mediator measurement
for the
individual (
=1, …, n) at the
event time (
) is:
where
indexes the unique ordered event times
,
is the intercept, the error term
is assumed independent of
and
, and
and
are coefficients.An additive hazards model for the hazard of the event at time
conditional on the history of the mediator, the exposure and the baseline covariates is:
where
is the intercept and
,
and
are coefficients, all of which may be time-dependent. Here,
for
and
is the history of the mediator values for times
. The model in (7) incorporates an assumption of this method that only the most recent value of the mediator,
, directly affects the hazard at time
; the prior mediator history does not impact it. This is shown in Figure 3 by an absence of an arrow from
to
. Assuming the models in (6) and (7), and using the mediational g-formula, the estimand of interest can be written:
where
is a function representing the part of the expression that depends only on
and
.The assumptions made in equations (6) and (7) result in the special form of the mediational g-formula in (8). This allows for simple expressions for the IE and DE based on the probabilities
:
When the direct and indirect effects are defined as ratios, the resulting quantities are independent of
.In the estimation procedure, event times are modelled as a counting process. At each event time, equation (7) is used to regress the change in the counting process onto the mediator and exposure and equation (6) is used to regress the mediator onto the exposure. The integrals in equations (9) and (10) are estimated as cumulative sums of the estimates of the model coefficients, with the integral in equation (10) being the standard cumulative coefficient reported from an additive hazards model. We refer readers to Aalen et al.
and Strohmaier et al.
for a detailed description of their dynamic path analysis approach to estimating the above quantities.
3.4 Comparison of the method of Vansteelandt and the method of Aalen
3.4.1 Conceptual considerations
The two approaches outlined above differ fundamentally in their conceptual approach to mediation in a survival context. A key difference is the nature of the counterfactuals. In the method of Vansteelandt, we consider a hypothetical intervention on the mediator to set it to a level that would have been seen if the exposure were different. This could lead to ill-defined effects when the outcome is survival because if the individual would survive longer when exposed (
) than unexposed (
), the level of the mediator when unexposed is undefined after the time of death. Vansteelandt et al.
suggest this can be conceptualised as a hypothetical intervention that sets the mediator to the level the individual would have had if their death had been prevented under the unexposed scenario. As Vansteelandt et al. acknowledge, this may be difficult to envisage. Although we are not required to imagine a mediator value for a person at time
in a scenario in which they would have died before time
, we do need to be able to imagine a mediator value for a person at time
in a hypothetical scenario in which they are alive at time
, but who in reality died before time
. The result is a complication in the interpretation of the mediation results. An alternative interpretation of the Vansteelandt et al. identification result (equation 5) is given by the stochastic randomised intervention approach.[25,26] Instead of hypothetically intervening on the mediator using a counterfactual scenario had the exposure been
, we can equivalently imagine that the level of the mediator at a given time is set based on a random draw from the distribution of the mediator among surviving individuals with exposure
, conditional on the history of covariates. This assumes that individuals surviving to time
in the exposed and unexposed scenarios are similar given covariate histories. Vansteelandt et al.
point out that although generally survival bias could lead to exposed individuals being different from unexposed individuals, the assumption of no unmeasured common causes of the mediators and time-varying confounders in their framework escapes this issue. The conceptual difficulty of different survival times under exposed/unexposed counterfactual scenarios is avoided entirely in the method of Aalen. In this method, there is no problem because the hypothetical intervention is on the separated exposure variable, not on the mediator. Therefore, we do not need to imagine a counterfactual outcome under exposure
and mediator
where
.The practical value of methods based on nested counterfactuals, which the method of Vansteelandt relies on, has also been discussed more generally.[27,23,28,29] It involves considering an individual had their exposure been set to one level, but had their mediator been set to the value that would have been seen under a different exposure level. This is not a situation that could ever be observed in practice, which has raised conceptual concerns about the interpretation of the resulting estimands.Although the exposure separation approach used in the method of Aalen avoids the use of nested counterfactuals, there are also conceptual hurdles involved in exposure separation. Here, physiologically, we must be able to decompose the exposure status into one component that affects survival but not the mediator and one component that affects the mediator but not survival. The independence of these two components is essential; if, for example, the proposed component affecting the mediator also affects survival, the assumptions of the analysis will not be valid. While an imagined exposure separation could correspond to a testable intervention, in practice, it may be difficult or even clinically impossible to individually manipulate the two components separately.
3.4.2 Statistical considerations
Both approaches require that there is no unmeasured confounding of the exposure-outcome, mediator-outcome and exposure-mediator relationships in order to obtain unbiased estimates of the estimands that they target. The method of Aalen requires that this control for confounding be via confounders measured at baseline and expressly forbids the existence of an exposure-induced mediator-outcome confounder. In contrast, the method of Vansteelandt was designed for settings with time-varying mediator-outcome confounders, including those affected by the exposure, and, as long as they can be measured, identification is possible. Another difference is that the method of Vansteelandt does not rely on parametric models in equation (5) for identification. In theory, arbitrary models may be selected for estimation. In the method of Aalen, however, the simplicity of the form of the IE and DE estimands is due to assumptions that the mediator follows a linear model, that the hazard of an event follows an additive hazards model and that only the most recent value of the mediator is necessary to model the hazard.Despite these fundamental differences between the two approaches, Vansteelandt et al.
describe their approach as ‘a generalisation of dynamic path analysis’. Further, they showed the equivalence of the two approaches when there are no time-varying confounders, the mediator and hazard for the event follow additive models as in equations (6) and (7) and all individuals survive to the first mediator measurement. Under those conditions, using method of Vansteelandt equations (5) to calculate survival probabilities and (3) and (4) to calculate the IE(t) and DE(t), the resulting expressions are equivalent to those obtained from the method of Aalen for IE(t) and DE(t) in equations (9) and (10). This shows a connection between the nested counterfactual approach and the exposure splitting approach under certain conditions.
4 Simulation Study
4.1 Design
4.1.1 Overview and aims
To expand our understanding of the performance of these two mediation methods in more complex scenarios with a time-to-event outcome and repeatedly measured mediator, we conducted a simulation study. Both methods were evaluated using scenarios where we expected good performance as well as scenarios with data issues commonly found in observational datasets such as time-varying confounding and infrequent measurements of longitudinal variables. To assist other researchers interested in these methods, R code for generating truth data and simulated data as described in this manuscript is available from https://github.com/KamTan/MediationSimulation. Additionally, we use the results of this simulation study to aid in the interpretation of our analysis of the UK CF Registry dataset (see Section 5).
4.1.2 Data-generating mechanisms
Several different scenarios were studied and we begin by describing a reference scenario which is consistent with the assumptions of both methods. We consider a setting where there is a binary exposure
at time
0 and a binary time-fixed confounder,
. The study period is 4 years during which event times
are observed. Individuals who do not have the event are administratively censored at time
4. A repeatedly-measured continuous mediator is measured at baseline (
) and at post-exposure visits
1,2,3. Note that
may be affected by
. Figure 4 illustrates this reference scenario.
Figure 4.
Illustration of the data-generating mechanism. The exposure
occurs at time
and there are baseline measurements of the confounder
and mediator
.
is an indicator of survival defined as
. Event times
are generated in waves. The mediator is measured at visit time
for all individuals with
.
Illustration of the data-generating mechanism. The exposure
occurs at time
and there are baseline measurements of the confounder
and mediator
.
is an indicator of survival defined as
. Event times
are generated in waves. The mediator is measured at visit time
for all individuals with
.For each scenario considered, three sub-scenarios are studied: (1) both a direct and an indirect effect of the exposure on the outcome are present (“DE+IE”); (2) only an indirect effect is present, meaning there is no path from
to
that does not first go through
(“NoDE”); and (3) only a direct effect is present because the exposure does not affect the mediator (“NoIE”). In Figure 4, removal of all lines from
to
corresponds to the NoDE sub-scenario and removal of all lines from
to
corresponds to the NoIE sub-scenario.The following steps were used to generate data according to the data generating mechanisms illustrated in Figure 4 for individuals
:Generate
from a Bernoulli distribution with probability
=0.5.Generate
from a Bernoulli distribution with probability
, where
0.6 and
0.4. This results in
0.5 and is equivalent to simulating the setting where the baseline covariate affects the exposure.Generate values for the longitudinal mediator measurement as random draws from:
and, for
1,2,3:
, the random intercept and
, the random time slope are generated from a bivariate normal distribution with
and
. The effect of the exposure on the mediator,
may be time-varying but is constant in the reference scenario.Generate the conditional hazard from:
Here,
represents the visit time at or prior to
or 0 for
,
is the constant baseline hazard,
is the time-varying effect of the mediator on the hazard, and
and
are the time-fixed effects of the exposure and baseline confounder on the hazard, respectively. In the reference scenario,
was constant but it was allowed to have an increasing, step change and decreasing effect in scenarios R1-R3, respectively. An additive hazards model was chosen to generate survival times because the method of Aalen assumes additive hazards.Generate event times
in waves
0,1,2,3 for the intervals between visit times [0,1), [1,2), [2,3), [3,4). For individuals still at risk in wave
:GenerateCalculateIf
, then
. Otherwise, generate a new event time in the next wave.At
4, administratively censor all individuals still at risk.Generate an event indicator
equal to 1 if
4 and 0 otherwise.The result is a dataset with values of
,
,
,
,
and
(
1,2,3) for all simulated individuals. Table 1 in the Supplemental Information provides the parameter values used in data generation. Values of parameters were chosen to yield a positive hazard at all times.To further probe each method, we considered two additional scenario types: one with infrequent mediator measurements and one with time-varying confounding present. Table 1 provides a summary of the simulation scenarios investigated. To create the additional scenarios, some modification of the data generating procedure was needed as outlined below.
Table 1.
Listing of all simulation scenarios, the abbreviated name used in the Results section, the percent of simulated individuals experiencing an event prior to time
=4, and the table number where full results are provided in the Supplemental Information for that scenario.
Scenario Type
Description
Scenario name
Events %
Results table
Reference
Constant effect of M on hazard
Baseline
87–89
3
Increasing effect of M on hazard
R1
87–89
–
Immediate effect of M on hazard
R2
90–92
–
Delayed effect of M on hazard
R3
84–87
–
Infrequent mediator measurements
βAt>0
F1
87–89
5
βAt<0
F2
78–87
6
Time-varying confounders
L with E[μl1]=0.5
L1
84–86
7
L with E[μl1]=5.0
L2
87–88
8
L with E[μl1]=15.0
L3
91–92
9
L with σμl1=5.0
L4
85–87
10
L with σμl1=10.0
L5
85–86
11
L where A affects L
L6
78–82
12
Listing of all simulation scenarios, the abbreviated name used in the Results section, the percent of simulated individuals experiencing an event prior to time
=4, and the table number where full results are provided in the Supplemental Information for that scenario.To create scenarios for investigating the impact of an infrequently measured mediator, the above described procedure was adapted to generate mediator values at time intervals of 0.25 (i.e. at times
0, 0.25, 0.5,
,3.75). Event times were generated in waves at the same frequency as the mediator measurements, however, effect estimation was done using only the four mediator measurements at
0, 1, 2, 3. The two scenarios generated in this manner differed only in the direction of the effect of
on
. The exposure positively affected the mediator in scenario F1 while it negatively affected the mediator in scenario F2.Finally, to create scenarios with a time-varying confounder,
, we additionally generate a random intercept
and random time slope
where
and
indexes individuals. We assume a causal ordering where
, measured at the same time as
, may influence
and
may influence
. First,
and
are generated as:
where
is the effect of
on
and
is the effect of
on
. Then,
and
are successively generated for
1, 2, 3 using:
Event times are generated using the same procedure as described above but the conditional hazard also depends on
as shown below:
We generated six scenarios with a time-varying confounder (Table 1). Scenarios L1–L3 are used to study a time-varying confounder generated from different distributions. The impact of greater variability in the random slope is studied with scenarios L4 and L5 and scenario L6 considers the case where
affects the values of
.
4.1.2 Estimands
We focus on two estimands: the indirect effect of the exposure on the outcome via the mediator and the direct effect of the exposure on the outcome not via the mediator (see equations (3) and (4) for the method of Vansteelandt and equations (9) and (10) for the method of Aalen). Total effect (TE) estimates are also reported for completeness. We do not consider proportion mediated as an estimand. Although it is intuitively appealing for quantifying mediation, it tends to have wide confidence intervals
and, when the total effect estimate is small, it becomes unstable as the denominator approaches zero.
4.1.3 Methods
Both the method of Vansteelandt and the method of Aalen were applied to each simulation scenario for effect estimation. We implemented the method of Vansteelandt with an additive hazards model as the simulated event times were generated under this assumption. We use the notation Vansteelandt
to remind readers of this choice. A quasi-binomial regression with a logit link was used to model the predictions resulting from each step in the repeated regression estimation procedure. The method of Aalen was implemented with an additive hazards model and linear mediator model. All models for both methods included main effects only. As the implementation provided in Aalen et al.
requires survival to the first visit time, we define our population for both methods to be only those individuals with survival time
. Note that this does not affect the estimate of indirect effect, which cannot be estimated prior to the first mediator measurement. See the Supplemental Information for details. We included both
and
in all analyses. We refer readers to Landau et al.
in which they concluded that mediation analysis of trial data should include baseline measurements of key variables not only to improve precision but also to avoid potential confounding bias.
4.1.4 Performance measures
The primary performance measure assessed is bias in the estimates of DE(
), IE(
), and TE(
), and we report this at three time points: the times corresponding to the 20th, 50th and 80th percentile of event occurrence (which differ by scenario). Let
be the true value of the estimand,
the estimated value of the estimand,
the mean of
and
the number of simulated datasets. Following Morris et al.
we define the bias and Monte Carlo standard error (MCSE) of the bias as:Based on the results of several simulation runs, we expect the Var(
) to be less than 0.015. Because we require the MCSE to be below 0.005, the minimum number of simulated datasets needed will be 600 given that
. To accommodate cases with slightly greater variance, we use
=1000 with
=2000 simulated individuals per dataset.
4.1.5 Generation of the true values of the estimands
The true values of the estimands were estimated using a large (
=3,500,000) simulated dataset that gives stable values following the approach used, for example, by Keogh et al.
For each simulated individual, data were generated for four cases:
The same equations used to generate the simulated datasets were used except that the exposure was not affected by
or
, thereby ensuring that there was no exposure-outcome confounding. All simulated individuals with event times prior to the first mediator measurement time were removed. The probability of survival at time
in each of the four cases equals the proportion still at risk at time
as the only censoring in the dataset is administrative censoring at time
4. Given these survival probabilities,
, the TE, DE and IE at time
are given by:
Note that in the NoDE sub-scenario,
and the total effect equals the indirect effect. Similarly, in the NoIE sub-scenario, the total effect equals the direct effect because
. This is depicted graphically in the Supplemental Information, Figure 1.withwithwithwith
4.1.6 Software
R v4.0.2
was used for all analyses and generation of simulated data. We used the timereg
R package for the additive hazards model. For the method of Aalen, we used R code provided in the Supplemental Materials by the authors.
4.2 Results
4.2.1 Reference scenario
Using the reference scenario, the estimated TE, DE and IE were approximately unbiased for both methods for all three sub-scenarios (DE+IE, NoDE, NoIE). Full results are shown in Supplemental Information Table 3. The MCSE of all bias estimates was <0.005. The empirical standard error for the method of Aalen was lower in both the DE+IE and NoDE sub-scenarios leading to a relative efficiency greater than one (1.55–2.94) over the method of Vansteelandt (Supplemental Information Table 4).The reference scenario was extended to study three settings with a time-varying effect of the mediator on the hazard: (R1) an effect that increases over time, (R2) an immediate effect only, and (R3) a delayed effect. Again, both methods produced approximately unbiased effect estimates (percent bias <2% and Monte Carlo standard error <0.005) at all time points for all sub-scenarios (results not shown).
4.2.2 Infrequent mediator measurements
For both methods and both scenarios (F1, F2) investigating the impact of infrequent mediator measurements, the estimated indirect effect was closer to 1.0 (no effect) than the true indirect effect in the DE+IE and NoDE scenarios. Figure 5 shows the estimated and true IE (left) and estimated and true DE (right) for scenarios F1–DE+IE and F2–DE+IE, using the method of Vansteelandt. Method of Aalen results were similar. In the F1–DE+IE scenario, both methods over-estimated the IE by 7% when 50% of individuals had experienced an event (time
1.66); this increased to 18% by the time 80% of individuals had an event (time
2.45). For scenario F2, at the time at which 50% of individuals had an event, the absolute bias in the estimate of IE was
0.13, equivalent to a percent bias of
11% in the DE+IE sub-scenario, and was
0.15 (
12%) in the NoDE sub-scenario for both methods.
Figure 5.
Effect estimates using the method of Vansteelandt when the mediator is infrequently measured. On the left, the true IE (solid line) and the estimated IE (dashed line) are plotted over time. A similar plot for the true DE (solid line) and estimated DE (dashed line) is shown on the right. Estimates and true values for scenario F1 (
) are in blue and for F2 (
) in orange. Exposure occurs at time 0.
Effect estimates using the method of Vansteelandt when the mediator is infrequently measured. On the left, the true IE (solid line) and the estimated IE (dashed line) are plotted over time. A similar plot for the true DE (solid line) and estimated DE (dashed line) is shown on the right. Estimates and true values for scenario F1 (
) are in blue and for F2 (
) in orange. Exposure occurs at time 0.To better understand the source of this bias, we looked at estimates of the effect of the mediator on the hazard (
in equation 7) and the exposure on the hazard (
in equation 7) using the method of Aalen, which estimates these parameters directly. Figure 6 displays the results for scenario F1–DE+IE. In nearly all of the 1,000 simulation runs, the estimated effect of the mediator on the hazard was lower than the true value but the estimated effect of the exposure on the hazard was greater than the true value. Because
is positive in scenario F1 and the estimate of
is less than the true value, the estimated indirect effect is greater than the true value (see equation 9). In scenario F2, the estimate of
was again less than the true value but because
is negative, the indirect effect estimate was biased downwards.
Figure 6.
Results from 1000 simulation datasets of scenario F1–DE+IE using the method of Aalen to estimate the parameters from equation 7. On the left, the true value of
is shown as a dark green line and estimated values of this parameter in lighter green. On the right, the true value of
is shown as a brown line with estimated values in beige.
Results from 1000 simulation datasets of scenario F1–DE+IE using the method of Aalen to estimate the parameters from equation 7. On the left, the true value of
is shown as a dark green line and estimated values of this parameter in lighter green. On the right, the true value of
is shown as a brown line with estimated values in beige.When the IE was overestimated (underestimated), the corresponding DE was underestimated (overestimated) resulting in an approximately unbiased estimated total effect. Supplemental Information Tables 5 and 6 provide complete results.
4.2.3 Time-varying confounders present
Scenarios L1, L2 and L3 include a time-varying confounder
with mean random slope of 0.5, 5.0 and 15.0, respectively. Both the method of Vansteelandt and the method of Aalen produced approximately unbiased effect estimates for these scenarios. Detailed results are available in the Supplemental Information, Tables 7 to 9. Scenarios L4 and L5 simulated a time-varying confounder with moderate and large variance of the random slope, respectively. The method of Vansteelandt produced unbiased results in these scenarios (Tables 10 and 11 in the Supplemental Information). Some bias was seen in the estimates of DE and IE using the method of Aalen. For scenario L5, the estimate of IE using the method of Aalen when 50% of events had occurred and both a direct and indirect effect were present was underestimated by 2%; IE was underestimated by 8% when 80% of events had occurred. In the final time-varying confounding scenario (L6), the exposure affected the values of
,
and
in the simulated data. In both the DE+IE and NoIE sub-scenarios, bias was seen in the estimates of DE and IE using the method of Aalen (Table 2). By the time 80% of events had occurred, this bias exceeded 10%. Effect estimates from the method of Vansteelandt were approximately unbiased as were effect estimates using the method of Aalen in the NoDE sub-scenario. Full results for scenario L6 are available in Supplemental Information Table 12.
Table 2.
Bias of effect estimates for scenario L6 with a time-varying covariate that is affected by the exposure. Percent bias is shown beneath the absolute bias in parentheses. Results are shown at times corresponding to the 20th, 50th (median) and 80th percentile of event occurrence for the DE+IE scenario. The Monte Carlo Standard Error was
for all estimates of absolute bias.
Absolute bias (percent bias)
Truth
Aalen
Vansteelandtadd
Events
Time
TE
DE
IE
TE
DE
IE
TE
DE
IE
Both Direct and Indirect Effects
20%
1.27
0.94
0.98
0.96
0.00
−0.02
0.02
0.00
0.00
0.00
(0%)
(−2%)
(2%)
(0%)
(0%)
(0%)
50%
1.81
0.82
0.93
0.88
0.00
−0.05
0.05
0.00
0.00
0.00
(0%)
(−5%)
(6%)
(0%)
(0%)
(0%)
80%
2.71
0.66
0.86
0.76
0.00
−0.09
0.09
0.00
0.01
0.01
(0%)
(−11%)
(12%)
(0%)
(1%)
(1%)
Bias of effect estimates for scenario L6 with a time-varying covariate that is affected by the exposure. Percent bias is shown beneath the absolute bias in parentheses. Results are shown at times corresponding to the 20th, 50th (median) and 80th percentile of event occurrence for the DE+IE scenario. The Monte Carlo Standard Error was
for all estimates of absolute bias.
5 Application to CF-related diabetes
5.1 Methods
Data were obtained from annual review records from the UK CF Registry between 2008 and 2017, inclusive. The study population consisted of all individuals in the UK CF Registry aged 18–60, with known genotype and at least two measurements of forced expiratory volume in 1 second (FEV1%), a key predictor of survival. From this group of 6374 individuals, we further excluded people who were pancreatic insufficient (to ensure positivity) and people who had been diagnosed with CFRD prior to the beginning of follow-up. We excluded these prevalent cases to avoid bias due to unknown duration of disease and focus only on incident cases of CFRD.[35,36] The resulting cohort consisted of 3708 individuals with 18,693 annual review records.The exposure was diagnosis of CFRD (Y/N) and the outcome was the composite of death from any cause or lung transplantation. The mediator studied was lung function, measured by FEV1%, a continuous variable. Five baseline confounders were adjusted for in all analyses: gender (M/F), genotype (F508del homozygous Y/N), calendar year, baseline FEV1% and baseline body mass index (BMI). To ensure proper temporal ordering of the data, baseline measurements were taken from the annual review prior to the one where the exposure was assessed and the first mediator measurement was taken from the annual review after exposure assessment. We also adjusted for time-updated measures of BMI (continuous) and respiratory infections (proxied by the number of days in hospital receiving IV antibiotics) when using the method of Vansteelandt. Hospital IV days was categorised into six categories as: 0, 1–7, 8–14, 15–21, 21–28 and >28 days as IV antibiotics are frequently given in week-long courses.To create the analysis dataset, we assumed that measurements of the exposure, mediator and time-varying covariates were taken at integer-valued ages. For each age, 18–50 years, an age-specific dataset was created comprising all individuals at risk at that age who were either not diagnosed with CFRD or diagnosed with CFRD within the past year. In each age-specific dataset, time was reset to zero when CFRD was or was not diagnosed and age was included as a covariate. In this structure, each individual contributed data as an unexposed person at multiple ages (each age that they were at risk but not diagnosed) but only contributed data as an exposed person at the one age they were first diagnosed, if ever diagnosed. This allows us to make the best use of the longitudinal data in our situation where there is no natural time zero for an unexposed person. More details on the construction of the analysis dataset are available in the Supplemental Information.Estimates are presented for IE using the same relative survival scale described in the simulation study. Estimates were computed every 0.1 years starting at time
0.05 in the method of Vansteelandt analysis. We used a Cox proportional hazards survival model with a linear predictor containing main effects only. Quasi-binomial models with a logit link containing main effects only were used to model the other terms in equation (5). Method of Aalen regressions (also main effects only) were performed at each event time in the study population. Non-parametric bootstrap was used to compute 95% confidence intervals using the percentile method with 500 bootstrap samples and resampling done at the individual level.
5.2 Results
Both mediation analysis methods estimated the indirect effect of CFRD on survival via lung function to be modest in size. Figure 7 shows the estimated IE for each mediation method, as a function of time. Using the method of Vansteelandt, the estimated indirect effect increases in magnitude over time, reaching 0.996 at time
4 years post-evaluation of CFRD. The interpretation is as follows. Suppose that a random set of individuals in the population had been assigned to have CFRD. The estimand compares (using a ratio) what their survival probability at year 4 would have been had FEV1%, BMI and IV days been set to the levels they would be if they had CFRD versus what it would have been had BMI and IV days been set to the levels they would be if they had CFRD but FEV1% had been set to the level it would have taken had they not had CFRD. Bootstrap confidence intervals for the estimated IE contain 1.0 at all visit times suggesting that there may be no mediation via lung function. The estimated proportion mediated at time
4 is 4.5% [95% CI:
4.3%, 12.8%] with a total effect estimate of 0.918 [95% CI: 0.889, 0.953]. The method of Aalen analysis estimated a slightly greater indirect effect of 0.993 at 4 years post-evaluation of CFRD. Bootstrap confidence intervals of the estimates of IE do not contain 1.0 at times greater than 2 years post-evaluation. This corresponds to an estimated proportion mediated of 7.9% [95% CI: 3.6%, 14.0%] at
4 with a total effect estimate of 0.919 [95% CI: 0.888, 0.945].
Figure 7.
Results from the method of Vansteelandt and the method of Aalen. The indirect effect of CFRD on time to death or lung transplant via FEV1%, the mediator, is shown. Vertical bars at visit times illustrate 95% bootstrap confidence intervals.
Results from the method of Vansteelandt and the method of Aalen. The indirect effect of CFRD on time to death or lung transplant via FEV1%, the mediator, is shown. Vertical bars at visit times illustrate 95% bootstrap confidence intervals.
6 Discussion
In this study, we explored two recently proposed methods for mediation analysis in a setting with a time-to-event outcome and a time-updated mediator using a simulation study and in our motivating example of CFRD. Both methods produced approximately unbiased estimates of TE, DE and IE in simulated scenarios consistent with their stated assumptions. When the mediator was measured infrequently or confounding was not controlled for, however, bias was seen in the effect estimates for both methods.The presence of time-varying confounding of the mediator and outcome is likely in many settings where the mediator is repeatedly-measured over time. An important feature of the method of Vansteelandt et al.
is the ability to identify indirect and direct effects even when time-varying confounding exists. In six simulation scenarios with a time-varying covariate that influenced the mediator process, the method of Vansteelandt returned approximately unbiased effect estimates. Using the method of Aalen, which explicitly assumes that control for confounding can be accomplished with baseline covariates, there is no mechanism to adjust for values of
post-exposure. When the baseline value
was a good representation of post-exposure values
,
and
, the bias seen was minimal. Greater bias was seen when the post-exposure values of
were affected by the exposure or when the variance of the random slope used to generate values of
was larger. This highlights the importance of understanding whether time-varying confounding exists in a particular setting and, if so, implementing a method that explicitly accounts for it.Substantial bias was found in the estimates of DE and IE for scenarios with infrequent mediator measurement. For example, this could occur in observational datasets where the mediator is a continuous biomarker measured periodically at infrequent visits. In this case, the analysis incorporates snapshots of the trajectory of the continuous biomarker but survival is affected in continuous time. This is a type of measurement error and it resulted in an attenuation of the effect estimates in our simulation study. Further, the bias accumulated over time. Strohmaier et al.
reported similar results but found that increasing the frequency of the mediator measurements did not necessarily improve estimation of the IE. Rather, less bias was seen when mediator measurements better represented the underlying biological process. Infrequently measured confounders may also contribute to the bias and this bias could be in either direction. Interesting avenues for future research would be to explore methods for mitigating the impacts of measurement error in mediators and infrequent measurements. Correcting for measurement error would require external information about the error, and Aalen et al.
have suggested a calibration approach based on work in VanderWeele
that may be useful when such information is available. A possible solution to the second issue could involve using mixed models or joint models to impute unmeasured longitudinal values of the mediators and covariates.In the analysis of the UK CF Registry dataset, both methods produced similar effect estimates and found only a small indirect effect of CFRD on survival via lung function. The conclusion is that the majority of the total effect acts through pathways where the exposure does not first affect the mediator process. Because the primary cause of death in CF is respiratory failure and previous studies have shown that CFRD is associated with both increased mortality and decreased lung function, we hypothesised that the evidence for mediation would be greater. From the simulation study, we learned that indirect effect estimates may be attenuated if the mediator is measured infrequently. It is possible that annual measurements of lung function were not frequent enough for this process and that the estimated indirect effect was biased. Another potential source of bias is uncertainty in the mediator measurements. We do not believe this to be a substantial problem with the lung function measurements because of the standard laboratory procedures used and pre-planned measurement times at annual well visits. A further limitation is that some unexposed people later became exposed but their change in exposure status was not incorporated into the analysis. We chose not to censor them at the time they were diagnosed with CFRD because this would violate the assumption of non-informative censoring. The changing exposure status of some individuals may have biased the effect estimates and future work could include the specification of estimands for these two methods when the exposure is time-varying.For the method of Aalen, we must posit an exposure separation. One possible mechanism for lung disease associated with diabetes is via a build-up of collagen in lung tissue leading to reduced elasticity.
As this aspect is unlikely to affect survival other than via lung function, we may envision a split of this aspect away from the other effects of glucose intolerance to obtain a theoretical exposure separation. The further assumption of the method of Aalen that only the most recent measurement of lung function affects the hazard seems unlikely to hold as there is evidence that previous values of FEV1% are significant predictors of the hazard at a given time in addition to the most recent value.
A causal interpretation for these analyses is also reliant upon the assumption of no unmeasured confounding. We attempted to control for confounding of the exposure-outcome and exposure-mediator relationships by adjusting for five baseline covariates which are known predictors of survival in CF but, as with all causal analyses using observational data, it is impossible to verify that confounding has been completely addressed. Controlling for confounding of the mediator-outcome relationship was via measurements of the two time-varying confounders (BMI and IV days) in the method of Vansteelandt. In the method of Aalen however, it was necessary to assume that sufficient control for confounding could be achieved using the baseline confounders. Because respiratory infections can be a time-varying common cause of both lung function and survival, this assumption is likely not valid and the method of Aalen results may be biased. In the method of Vansteelandt, a specific causal ordering of covariates and mediators was assumed where IV days and BMI measured at visit
could affect the FEV1% measured at visit
but not the reverse. This seems plausible for both because IV days quantifies the prior year’s days in hospital and there is evidence BMI affects lung function via its effect on respiratory musculature.
In the analyses presented here, we adjusted for the baseline measurement of lung function to control for exposure-outcome and exposure-mediator confounding. We wish to stress the importance of adjusting for the baseline mediator measurement as an unadjusted analysis can produce very different results. For example, in a repeat analysis that differed only by not adjusting for baseline FEV1%, the indirect effect was estimated to be 0.96 at time
5 as opposed to 0.99 with adjustment. This equates to a tripling of the estimated proportion mediated.Both of the methods studied here are valuable tools for mediation analysis in the setting of a survival outcome with a time-updated mediator. The method of Vansteelandt allows time-varying mediator-outcome confounding and can also be extended to multiple longitudinal mediators, a situation that will be common in clinical studies. The sample size required by the method of Vansteelandt to achieve the desired precision may be greater, particularly as the number of visit times increases because each regression is performed on the history of all of the covariates. At later visit times when there are more covariates in the model, there may be fewer observations due to fewer people having survived to that visit. An alternative evaluation technique for the method of Vansteelandt equation (5) is Monte Carlo integration but it requires jointly modelling the distribution of all variables. The method of Aalen offers fast computation times but is limited to the setting where time-varying confounding is believed not to exist. For some settings, it may be preferable to assume that the treatment or exposure can be split into separate biological components, as in the method of Aalen, than to contemplate a hypothetical intervention on the mediator. Also, further research into methods for assessing fit of these models would be helpful.We have shown that both the method of Vansteelandt and the method of Aalen produce approximately unbiased results in a reference scenario consistent with both of their assumptions. Further, the method of Vansteelandt returned approximately unbiased effect estimates in a variety of scenarios where time-varying confounding was introduced. Both techniques rely on a number of assumptions to make causal statements and care should be taken with the interpretation of any analysis. The importance of discussions with experts on the clinical aspects of the data cannot be overstated.Click here for additional data file.Supplemental material, sj-pdf-1-smm-10.1177_09622802221107104 for Methods of analysis for survival outcomes with time-updated mediators, with application to longitudinal disease registry data by Kamaryn T Tanner, Linda D Sharples, Rhian M Daniel and Ruth H Keogh in Statistical Methods in Medical Research
Authors: Dario Pitocco; Leonello Fuso; Emanuele G Conte; Francesco Zaccardi; Carola Condoluci; Giuseppe Scavone; Raffaele A Incalzi; Giovanni Ghirlanda Journal: Rev Diabet Stud Date: 2012-05-10
Authors: Connor Lewis; Scott M Blackman; Amanda Nelson; Ewa Oberdorfer; Daniel Wells; Jordan Dunitz; William Thomas; Antoinette Moran Journal: Am J Respir Crit Care Med Date: 2015-01-15 Impact factor: 21.405
Authors: John B Buse; Stephen C Bain; Johannes F E Mann; Michael A Nauck; Steven E Nissen; Stuart Pocock; Neil R Poulter; Richard E Pratley; Martin Linder; Tea Monk Fries; David D Ørsted; Bernard Zinman Journal: Diabetes Care Date: 2020-05-04 Impact factor: 17.152