
Understanding Marginal Structural Models for Time-Varying Exposures: Pitfalls and Tips.

Tomohiro Shinozaki, Etsuji Suzuki.

Abstract

Epidemiologists are increasingly encountering complex longitudinal data, in which exposures and their confounders vary during follow-up. When a prior exposure affects the confounders of the subsequent exposures, estimating the effects of the time-varying exposures requires special statistical techniques, possibly with structural (ie, counterfactual) models for targeted effects, even if all confounders are accurately measured. Among the methods used to estimate such effects, which can be cast as a marginal structural model in a straightforward way, one popular approach is inverse probability weighting. Despite the seemingly intuitive theory and easy-to-implement software, misunderstandings (or "pitfalls") remain. For example, one may mistakenly equate marginal structural models with inverse probability weighting, failing to distinguish a marginal structural model encoding the causal parameters of interest from a nuisance model for exposure probability, and thereby failing to separate the problems of variable selection and model specification for these distinct models. Assuming the causal parameters of interest are identified given the study design and measurements, we provide a step-by-step illustration of generalized computation of standardization (called the g-formula) and inverse probability weighting, as well as the specification of marginal structural models, particularly for time-varying exposures. We use a novel hypothetical example, which allows us access to typically hidden potential outcomes. This illustration provides steppingstones (or "tips") to understand more concretely the estimation of the effects of complex time-varying exposures.


Keywords:  causal inference; g-formula; inverse probability weighting; marginal structural model; time-varying exposure


Year:  2020        PMID: 32684529      PMCID: PMC7429147          DOI: 10.2188/jea.JE20200226

Source DB:  PubMed          Journal:  J Epidemiol        ISSN: 0917-5040            Impact factor:   3.211


BACKGROUND ON THE TOPIC

When we try to say something meaningful about a specific exposure–outcome causal relationship, counterfactual models are among the most popular and widely accepted approaches in the epidemiologic community.[1]–[6] A counterfactual approach not only formalizes the language of cause and effect,[7]–[13] but has also triggered the explosive development of novel analytic methods, including propensity scores (ie, the probability of exposure conditional on measured confounders)[14]–[19] and regression model-based estimation methods (ie, multivariable-adjusted outcome modeling, possibly followed by averaging predicted risks under distinct exposure statuses),[20],[21] which have evolved into doubly robust estimation.[22]–[28] More importantly, a counterfactual approach has spurred extensive discussion on the assumptions for inferring causality from data and the conditions for specific statistical methods to work, using, for example, causal diagrams.[2]–[4],[6],[29]–[35] Yet, the most striking illustration brought about by the counterfactual approach may be that it offers an elegant solution to the controversy surrounding the definition and estimability of the effects of exposures that vary over time. For example, antiretroviral therapy (exposure) initiated for acquired immunodeficiency syndrome may be interrupted upon observing symptoms of pneumonia, which is a predictor of clinical outcomes (eg, death) but is also affected by the prior exposure, and is thus considered part of the exposure's effects. While no theory existing in the statistics literature at the time offered clear guidance on whether to adjust for such intermediate variables when estimating the effects of time-varying exposures, new causal methodologies emerged in the 1980s.
These include Robins' unified approach, which comprises the generalized computational algorithm formula (abbreviated as g-formula) and estimation methods (ie, inverse probability weighting and g-estimation) for two classes of counterfactual, or structural, models.[36]–[42] In 2000, marginal structural models were introduced as a tool to make the effects of such time-varying exposures easily estimable.[43]–[45] Specifically, a marginal structural model is an equation that encodes prespecified assumptions about the causal effects to be estimated (ie, causal estimands). Thanks to the series of seminal works by Robins and Hernán,[46]–[51] as well as others' tutorials on the topic with intuitive theory and easy-to-implement software,[52]–[59] marginal structural models have been widely applied to longitudinal data. Herein, we illustrate the use of marginal structural models, whose parameters can be estimated in a comparable way using inverse probability weighting and the g-formula in certain situations, featuring hypothetical data with a time-varying exposure to point out common pitfalls and to serve as a stepping stone to better understanding the use of these methods.

CONCEPTUAL PITFALLS

If readers feel confused by the following statements, they may be trapped by the pitfalls surrounding the methodology considered in this paper. Marginal structural models should be distinguished from inverse probability weighting: a marginal structural model is an equation expressing prespecified assumptions about causal estimands, while an exposure probability model for inverse probability weighting is a restriction imposed on the observed distribution for estimation. Because a marginal structural model and an exposure probability model (for inverse probability weighting) serve different purposes, misspecification of these models leads to bias in different ways. The principles of variable selection for marginal structural models are distinct from those for exposure probability models, and thus specifying each model raises different challenges. Inverse probability weighting shares identifiability assumptions with the g-formula and can be used to fit marginal structural models whenever those assumptions are met, whereas the g-formula can be used to fit them only when the models are saturated. Although some of these pitfalls have been appreciated previously,[57] we aim to discuss them from a different perspective. Before entering these subtleties, it is helpful to grasp the rationale of the specialized causal methods elaborated for time-varying exposures through simple worked examples, without relying on computerized packages. Unlike point-exposure settings,[1],[6],[60],[61] however, we rarely encounter such pedagogic examples for time-varying exposures, including counterfactual data that explicate the causal estimands and underlying conditions.
Although there are at least four excellent numerical examples appropriate for exercises, they rely either on external causal knowledge (ie, causal diagrams without explicit estimands[51],[59] or the "g-null" theorem implied by a causal diagram and observed data[6]) or on "true" parameters for simulated data.[54] In this paper, we provide a step-by-step illustration, or tips, using a novel hypothetical numerical example dataset that includes potential outcomes, and thereby directly incorporates the minimal information needed to explicitly define causal estimands and the conditions for their identification. One may consider that a causal diagram would be helpful for understanding the structure of the dataset. As noted later, however, causal diagrams typically encode more causal assumptions than the conditions sufficient to identify causal effects. That is why we do not start by drawing causal diagrams and use them only complementarily in our illustration, despite the fact that they are indeed useful tools for explicating our assumptions in real data analyses.[35] The following "tips" begin with two introductory subsections on the effects of point exposures and time-varying exposures. Then, we step into the main contents to understand the unique roles of, and distinctions between, inverse probability weighting, marginal structural models, and regression/exposure probability models.

TIPS TO UNDERSTAND WHAT, WHY, AND HOW OF MARGINAL STRUCTURAL MODELS

Prerequisite: identification of point-exposure effects

As many epidemiologists are familiar with the potential-outcome framework for a single time point, or a point-exposure setting, we only briefly review it here; readers unfamiliar with the basic concepts and notation may refer to Part 1 of Causal Inference: What If[6] or concise introductory papers.[61],[62] Suppose that exposure A (eg, antihypertensive drug), outcome Y (eg, the occurrence of cardiovascular disease), and a set of covariates L (eg, current/prior health conditions, unhealthy behaviors, and social support) are observed for individuals i = 1, …, n. Let Y^a_i denote the possibly unobserved potential outcome that would be observed if, possibly counterfactually, exposure A were set to level a = 0 (unexposed) or 1 (exposed) (hereafter, we may omit subscript i if no confusion occurs). Then, the average causal effect of exposure A on outcome Y may be defined as E[Y^1] − E[Y^0], which compares the counterfactual expectations (or risks, for a binary outcome) of Y^1 and Y^0 in the same population on the difference scale. Suppose a hypothetical cohort (Table 1) of 1,240 members in which E[Y^0] = 660/1,240 = 0.532 and E[Y^1] = 830/1,240 = 0.669, indicating a moderate risk increase (causal risk difference of 13.7%) due to exposure A. Note that in a counterfactual framework, because only one of Y^0 and Y^1 can be observed as Y according to the actual exposure status A, we can observe neither E[Y^0] nor E[Y^1] directly in the data. Thus, we need a set of assumptions to identify the causal effect[1],[6],[14],[61]: consistency (ie, if A = a then Y = Y^a, for all a), positivity (ie, 0 < P(A = a|L) almost everywhere, for all a), and conditional exchangeability given covariates, say, L, described below.
Table 1.

Hypothetical cohort data with potential outcomes under point-exposure

Stratum           Potential outcome^a                      Observed outcome
L   A   N         Y^0 = 1  Risk    Y^1 = 1  Risk           Y = 1   E[Y|A, L]
1   1   280       168      0.6     210      0.75           210     0.75
1   0   720       432      0.6     540      0.75           432     0.6
0   1   180       45       0.25    60       0.333          60      0.333
0   0   60        15       0.25    20       0.333          15      0.25

Total   1,240     660      0.532   830      0.669

^a Unobservable counterfactual distributions. Bold numbers are observed as Y = 1 (by consistency) in each stratum.

Table 1 also presents the observed distribution of (L, A, Y) in accordance with the potential outcomes Y^a (a = 0, 1) under consistency. In Table 1, the potential risk under a = 0 in the exposed, E[Y^0|A = 1] = 213/460 = 0.463, is not equal to that in the unexposed, E[Y^0|A = 0] = 447/780 = 0.573, and the same is true for the potential risk under a = 1, E[Y^1|A]. When the potential outcomes are associated with A in this way, marginal (unconditional) exchangeability is violated and the A–Y association (observed risk difference) is said to be confounded: E[Y|A = 1] − E[Y|A = 0] = 270/460 − 447/780 = 1.4%, indicating an almost null association. Fortunately, within every stratum of L, we can verify from Table 1 ("Risk" columns) that E[Y^a|A = 0, L = l] = E[Y^a|A = 1, L = l] (a = 0, 1 and l = 0, 1), and thus both equal E[Y^a|L = l]. This condition is called "conditional exchangeability given L," and sets of covariates that satisfy the condition are said to be confounders.[6],[36] Under the condition, the weighted mean, or standardized risk,[4],[52] Σ_l E[Y|A = a, L = l]P(L = l) is equal to E[Y^a][6]; that is, the causal effects are identifiable. In our data, the standardized risks for A = 0 and A = 1 are

0.6 × (1,000/1,240) + 0.25 × (240/1,240) = 0.532 and
0.75 × (1,000/1,240) + 0.333 × (240/1,240) = 0.669,

respectively. The next subsection extends the definitions of, and the conditions sufficient to identify, causal effects to time-varying settings. To focus on the complexity of conditional exchangeability in time-varying settings, we suppose throughout this paper that the consistency and positivity assumptions, as well as their time-varying versions,[51],[63],[64] are met in our data.
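The standardization above can be checked in a few lines of code. The following is a minimal Python sketch (the paper's own replication code is in SAS and Stata) using the counts transcribed from Table 1:

```python
# Standardized risks from the observed part of Table 1 (hypothetical cohort).
# Strata: (L, A) -> (N, number with Y = 1), transcribed from the table.
strata = {
    (1, 1): (280, 210),
    (1, 0): (720, 432),
    (0, 1): (180, 60),
    (0, 0): (60, 15),
}

n_total = sum(n for n, _ in strata.values())  # 1,240
# P(L = l), pooling over exposure groups
p_l = {l: sum(n for (l_, _), (n, _) in strata.items() if l_ == l) / n_total
       for l in (0, 1)}

def standardized_risk(a):
    """sum_l E[Y | A = a, L = l] * P(L = l)."""
    return sum((strata[(l, a)][1] / strata[(l, a)][0]) * p_l[l] for l in (0, 1))

print(round(standardized_risk(0), 3))  # 0.532 = E[Y^0]
print(round(standardized_risk(1), 3))  # 0.669 = E[Y^1]
```

Under conditional exchangeability given L, these reproduce the counterfactual risks E[Y^0] and E[Y^1] shown in the "Total" row of Table 1.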

Definition and identification of effects of time-varying exposures

Targeted effects of time-varying exposure

If an exposure varies over time, the aforementioned definition of effects should be extended. Consider a simple case with 2 time points. At time 1, baseline confounders L1 are measured and then exposure A1 is commenced; at time 2, the confounder set L2 is measured and the exposure is changed to A2; finally, outcome Y is measured. Thus, the observed data are (L1, A1, L2, A2, Y) for individuals i = 1, …, n. Note that A1 and A2 may represent the same exposure (eg, starting/stopping antihypertensive drugs) or different exposures introduced sequentially (eg, first-line and second-line chemotherapy for cancer patients). Likewise, L1 and L2 may consist of the same set of variables or (partly or entirely) distinct sets of variables. For a time-varying exposure, potential outcomes are defined by combinations of interventions on the joint exposure (A1, A2): let Y^{a1,a2} denote the potential outcome that would be observed if exposures A1 and A2 were set to levels a1 and a2, respectively. We assume that the exposure at each time takes the value 0 (unexposed) or 1 (exposed), leading to 4 different potential outcomes (Y^{0,0}, Y^{0,1}, Y^{1,0}, and Y^{1,1}) for each individual i. The average causal effect of exposure on outcome may be defined as any contrast between the counterfactual expectations E[Y^{a1,a2}]; eg, E[Y^{1,1}] − E[Y^{0,0}]. We can also consider E[Y^{1,0}] − E[Y^{0,0}], which is referred to as the "controlled direct effect of A1 with A2 set at 0."[65]–[67] Note that the joint exposure (A1, A2) can affect not only outcome Y but also L2 (through A1), which is measured after exposure initiation. Under the implausible assumption of no effect of (any part of) the exposure on (any part of) the subsequent confounders, the effect of (A1, A2) can simply be viewed as that of a multivalued exposure at a single time point; as shown earlier, Σ_{l1,l2} E[Y|A1 = a1, A2 = a2, L1 = l1, L2 = l2]P(L1 = l1, L2 = l2) is equal to E[Y^{a1,a2}] if the corresponding exchangeability assumptions for a point exposure hold.
In the following hypothetical data, however, there is no single set of confounders for the joint effects of (A1, A2). Rather, L1 is a sufficient set of confounders for A1, and (L1, A1, L2) is a sufficient set of confounders for A2. This condition enables us to identify E[Y^{a1,a2}], but the usual standardization formula, Σ_{l1,l2} E[Y|A1 = a1, A2 = a2, L1 = l1, L2 = l2]P(L1 = l1, L2 = l2), leads to biased estimates unless the aforementioned implausible assumption of no effect of past exposures on time-varying confounders holds.[6],[36],[43],[51]

A hypothetical cohort

For simplicity, consider a hypothetical cohort with empty L1. This situation would arise if A1 is randomized at baseline but non-adherence occurs or another exposure is introduced during follow-up, or if the cohort is restricted based on measured variables L1. In either case, the following illustration is unaffected by including diverse values of L1, so we ignore the adjustment for baseline confounders in our illustration.[6],[51],[54] Table 2 provides the data distribution of (A1, L2, A2, Y) augmented by the unobserved potential outcomes Y^{a1,a2} (a1, a2 = 0, 1) in the hypothetical cohort. As in Table 1, the observed outcome Y coincides with the Y^{a1,a2} such that (A1, A2) = (a1, a2), by consistency. We want to identify from the observational data the four expectations E[Y^{a1,a2}] ("Total" row of the "Risk" columns).
Table 2.

Hypothetical cohort data with potential outcomes under time-varying exposure

Stratum                  Potential outcome^a                                                          Observed outcome
A1  L2  A2   N        Y^{0,0}=1  Risk   Y^{0,1}=1  Risk   Y^{1,0}=1  Risk   Y^{1,1}=1  Risk          Y = 1   E[Y|A1, L2, A2]
1   1   1    720      648        0.9    648        0.9    432        0.6    576        0.8           576     0.8
1   1   0    180      162        0.9    162        0.9    108        0.6    144        0.8           108     0.6
1   0   1    1,800    1,080      0.6    990        0.55   900        0.5    720        0.4           720     0.4
1   0   0    1,800    1,080      0.6    990        0.55   900        0.5    720        0.4           900     0.5
0   1   1    5,670    5,103      0.9    4,536      0.8    2,835      0.5    3,402      0.6           4,536   0.8
0   1   0    630      567        0.9    504        0.8    315        0.5    378        0.6           567     0.9
0   0   1    840      252        0.3    294        0.35   462        0.55   252        0.3           294     0.35
0   0   0    3,360    1,008      0.3    1,176      0.35   1,848      0.55   1,008      0.3           1,008   0.3

Total        15,000   9,900      0.66   9,300      0.62   7,800      0.52   7,200      0.48

^a Unobservable counterfactual distributions. Bold numbers are observed as Y = 1 (by consistency) in each stratum.

We note that neither unconditional nor conditional (given L2) exchangeability holds for the joint exposure (A1, A2) in our data. For example, in the subgroups of (A1, A2) = (1, 1) and (0, 0), E[Y^{0,0}|A1 = 1, A2 = 1] = 1,728/2,520 = 0.686 differs from E[Y^{0,0}|A1 = 0, A2 = 0] = 1,575/3,990 = 0.395 (unconditional exchangeability fails). Likewise, E[Y^{0,0}|A1 = 1, L2 = 0, A2 = 1] = 0.6 ≠ E[Y^{0,0}|A1 = 0, L2 = 0, A2 = 0] = 0.3 (conditional exchangeability fails). Readers can see that the other potential outcomes also differ on average between distinct subgroups of (A1, A2). Next, let us see the bias in estimators that ignore, or solely stratify on, L2 as a "baseline" confounder.
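These subgroup contrasts can be verified directly from the counts in Table 2. The following Python sketch (counts transcribed from the Y^{0,0} columns) confirms the failure of exchangeability for the joint exposure, and also shows that Y^{0,0} is nonetheless mean-independent of A1, anticipating the sequential exchangeability conditions discussed later:

```python
# Counts from Table 2: (A1, L2, A2) -> (N, number with Y^{0,0} = 1).
y00 = {
    (1, 1, 1): (720, 648),   (1, 1, 0): (180, 162),
    (1, 0, 1): (1800, 1080), (1, 0, 0): (1800, 1080),
    (0, 1, 1): (5670, 5103), (0, 1, 0): (630, 567),
    (0, 0, 1): (840, 252),   (0, 0, 0): (3360, 1008),
}

def risk(keys):
    """Risk of Y^{0,0} = 1 pooled over the listed strata."""
    n = sum(y00[k][0] for k in keys)
    events = sum(y00[k][1] for k in keys)
    return events / n

# Unconditional exchangeability fails:
r11 = risk([k for k in y00 if k[0] == 1 and k[2] == 1])  # 1,728/2,520
r00 = risk([k for k in y00 if k[0] == 0 and k[2] == 0])  # 1,575/3,990
# Yet Y^{0,0} is mean-independent of A1 (first part of sequential exchangeability):
c1_1 = risk([k for k in y00 if k[0] == 1])  # 2,970/4,500
c1_0 = risk([k for k in y00 if k[0] == 0])  # 6,930/10,500
print(round(r11, 3), round(r00, 3), c1_1, c1_0)  # 0.686 0.395 0.66 0.66
```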

Naïve standardization vs the g-formula

Table 3 shows the observable part of Table 2 in a different layout, adding some candidate estimates from the observed data. "L2-collapsed" estimates are the risks in the subgroups of the joint exposure, E[Y|A1 = a1, A2 = a2], without considering L2. These deviate from the E[Y^{a1,a2}] in Table 2 because of the lack of unconditional exchangeability. On the other hand, "naïve standardization" uses the standardization formula for point-exposure settings, Σ_{l2} E[Y|A1 = a1, A2 = a2, L2 = l2]P(L2 = l2), where P(L2 = 1) = 0.48 and P(L2 = 0) = 0.52. For example, the standardized risk for (A1, A2) = (0, 0) is obtained as 0.9 × 0.48 + 0.3 × 0.52 = 0.59. However, this estimate (and the other estimates) is again biased from E[Y^{0,0}] = 0.66 (and the other E[Y^{a1,a2}] in Table 2) owing to the violation of conditional exchangeability given L2.
Table 3.

Estimates of effects of time-varying exposure from hypothetical cohort data

                          A1 = 0                                     A1 = 1
              A2 = 0          A2 = 1                     A2 = 0          A2 = 1
L2        N     Y=1   Risk    N     Y=1   Risk   p(L2|A1)   N     Y=1   Risk   N     Y=1   Risk   p(L2|A1)   p(L2)
1         630   567   0.9     5,670 4,536 0.8    0.6        180   108   0.6    720   576   0.8    0.2        0.48
0         3,360 1,008 0.3     840   294   0.35   0.4        1,800 900   0.5    1,800 720   0.4    0.8        0.52

Estimates of E[Y^{a1,a2}]^a
L2-collapsed^b           3,990 1,575 0.39   6,510 4,830 0.74          1,980 1,008 0.51   2,520 1,296 0.51
Naïve standardization^c        0.59               0.57                      0.55               0.59
G-formula^d                    0.66               0.62                      0.52               0.48

^a (a1, a2) corresponds to the value of (A1, A2).

^b Calculated as E[Y|A1, A2] using the N and Y = 1 data in the subgroup defined by (A1, A2).

^c Calculated as Σ_{l2} E[Y|A1, A2, L2 = l2]p(l2), where the data in the "Risk" and "p(L2)" columns in each L2 = l2 (0 or 1) row are used for E[Y|A1, A2, L2 = l2] and p(l2), respectively.

^d Calculated as Σ_{l2} E[Y|A1, A2, L2 = l2]p(l2|A1) as above, except that the probabilities in the "p(L2|A1)" columns are used instead of "p(L2)" for the corresponding L2 and A1 values.

Instead of using P(L2 = l2) in the standardization formula, the "g-formula" in Table 3 averages the stratified risks E[Y|A1 = a1, L2 = l2, A2 = a2] using the weights P(L2 = l2|A1 = a1):

Σ_{l2} E[Y|A1 = a1, L2 = l2, A2 = a2]P(L2 = l2|A1 = a1);

for example, 0.9 × 0.6 + 0.3 × 0.4 = 0.66 for (A1, A2) = (0, 0). Unlike the previous two naïve estimates, we can see that these values are equal to the E[Y^{a1,a2}] in Table 2. As elaborated in the next subsection, the g-formula is one expression of E[Y^{a1,a2}] in terms of the observed distribution, under a condition that differs from unconditional/conditional exchangeability.
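Both candidate estimators can be computed directly from the observed quantities in Table 3. A minimal Python sketch (risks and covariate distributions transcribed from the table) contrasts the biased naïve standardization with the g-formula:

```python
# Observed risks from Table 3: (a1, l2, a2) -> E[Y | A1=a1, L2=l2, A2=a2]
risk = {
    (0, 1, 0): 0.9, (0, 1, 1): 0.8, (0, 0, 0): 0.3, (0, 0, 1): 0.35,
    (1, 1, 0): 0.6, (1, 1, 1): 0.8, (1, 0, 0): 0.5, (1, 0, 1): 0.4,
}
p_l2 = {1: 0.48, 0: 0.52}  # marginal P(L2 = l2)
# P(L2 = l2 | A1 = a1), keyed by (l2, a1)
p_l2_given_a1 = {(1, 0): 0.6, (0, 0): 0.4, (1, 1): 0.2, (0, 1): 0.8}

def naive_standardization(a1, a2):
    # Point-exposure formula: sum_l2 E[Y|a1,l2,a2] P(L2=l2) -- biased here
    return sum(risk[(a1, l2, a2)] * p_l2[l2] for l2 in (0, 1))

def g_formula(a1, a2):
    # g-formula: sum_l2 E[Y|a1,l2,a2] P(L2=l2|A1=a1) -- unbiased here
    return sum(risk[(a1, l2, a2)] * p_l2_given_a1[(l2, a1)] for l2 in (0, 1))

print(round(naive_standardization(0, 0), 2))  # 0.59 (biased)
print(round(g_formula(0, 0), 2))              # 0.66 = E[Y^{0,0}]
```

The g-formula values for all four regimes reproduce the true counterfactual risks 0.66, 0.62, 0.52, and 0.48 of Table 2.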

Conditions for identification of the effects

Instead of conditional exchangeability E[Y^{a1,a2}|A1, L2, A2] = E[Y^{a1,a2}|L2] for the joint exposure, we can easily check the following conditions,

(C1) E[Y^{a1,a2}|A1 = 1] = E[Y^{a1,a2}|A1 = 0], and
(C2) E[Y^{a1,a2}|A1 = a1, L2, A2 = 1] = E[Y^{a1,a2}|A1 = a1, L2, A2 = 0],

for all a1 and a2, from the upper four rows vs the lower four rows (for (C1)) and every 2 rows within the same stratum of (A1, L2) (for (C2)) in Table 2. These conditions are collectively called sequential exchangeability for (A1, A2)[6],[36],[51]; they are typically easier to satisfy than joint conditional exchangeability but are neither necessary nor sufficient for it (see Appendix A for more technical notes on the conditions). Covariates whose stratification makes (C2) hold (ie, L2 here) are called time-varying confounders. In fact, the slightly stronger condition (C2′) E[Y^{a1,a2}|A1, L2, A2 = 1] = E[Y^{a1,a2}|A1, L2, A2 = 0] (which requires conditional independence for all values of A1, instead of only for the A1 = a1 compatible with the intervention on Y^{a1,a2}) also holds in our example, although it is not required for the g-formula to equal E[Y^{a1,a2}]. The g-formula equals E[Y^{a1,a2}] if sequential exchangeability (C1) and (C2) hold. It is helpful to depict the conditions in causal diagrams, namely causal directed acyclic graphs (DAGs)[2],[29] and single-world intervention graphs (SWIGs)[31],[32]; readers unfamiliar with this graphical terminology and its rules (eg, opening/blocking paths, d-separation, the backdoor criterion) may refer to introductory articles[30],[32],[35] or book chapters[6],[34] on the topic. Informally, variables are d-separated if they are not connected with each other, or are connected only through paths on which at least one unadjusted "collider" or adjusted "non-collider" exists. If a supposed exposure is d-separated from a supposed outcome by adjusting for non-descendant variables of the exposure (in the original graph) after deletion of the arrows emanating from the exposure, then the backdoor criterion is said to be satisfied.
Figure 1, which is adapted from Part 3 of Causal Inference: What If,[6] presents causal diagrams that imply (C1) and (C2). Note that the typical strategy for causal inference in practice starts by drawing a causal DAG (eg, Figure 1(a)) or a SWIG (eg, Figure 1(c)) assumed for the data-generating process. Then, (conditional) independences between potential and observed variables, such as (C1) and (C2), are deduced from the graph. Here, we go backward; we start with counterfactual data (Table 2) in which (C1) and (C2) hold and proceed to causal DAGs/SWIGs that are compatible with those conditions.
Figure 1.

Causal DAGs and SWIGs compatible with example data, where U and W are unobserved variables: (a) causal DAG without W, in which A1–L2, A1–Y, and A2–Y are (conditionally) unconfounded given observed data; (b) causal DAG with W, in which A1–Y and A2–Y are (conditionally) unconfounded but A1–L2 is confounded given observed data; (c) a “template” under intervention (a1, a2) of SWIG that corresponds to causal DAG (a); (d) a “template” under intervention (a1, a2) of SWIG that corresponds to causal DAG (b).

In Figure 1(a), there is no non-descendant variable set that blocks all backdoor paths from the collective nodes (A1, A2) to Y (ie, that satisfies the backdoor criterion). In contrast, the backdoor paths to Y from A1 and from A2 are separately blocked by distinct sets of variables: the empty set for A1, and (A1, L2) for A2. The arguments can be depicted more directly using potential variables in Figure 1(c), which is a "template" of the SWIG representing each intervention (a1, a2) on (A1, A2).[32] For example, A1 is d-separated from Y^{a1,a2}, and A2^{a1} is d-separated from Y^{a1,a2} given L2^{a1}. After additionally conditioning on A1 = a1 (which is automatically done in the "template"), A2 (= A2^{a1} by consistency) is still d-separated from Y^{a1,a2} given L2 (= L2^{a1} by consistency) and A1 = a1; thus, (C1) and (C2) are satisfied in this SWIG. The same arguments can be applied to Figure 1(b) and (d), where A1–L2 is confounded (ie, connected by a backdoor path) by the unobserved W. In other words, there are settings in which the joint effects of (A1, A2) on Y can be identified (via sequential exchangeability) even if the effects of A1 on L2 are not identifiable (owing to the unobserved W). Further implications of Figure 1 are detailed in Appendix B. The remainder of the paper does not require reference to causal diagrams.

Different view of the g-formula: inverse probability weighting

We have seen that under sequential exchangeability (C1) and (C2), the g-formula is equivalent to the averages of the potential outcomes. If baseline confounders L1 exist, the g-formula is

E[E[E[Y|L1, A1 = a1, L2, A2 = a2] | L1, A1 = a1]] = Σ_{l1} Σ_{l2} E[Y|L1 = l1, A1 = a1, L2 = l2, A2 = a2]P(L2 = l2|L1 = l1, A1 = a1)P(L1 = l1),   (1)

which is equivalent to E[Y^{a1,a2}] if (C1) and (C2) hold after additionally conditioning on L1. The left-hand side of equation (1) is the iterated conditional expectation representation of the g-formula. The alternative expression of E[Y^{a1,a2}] under (C1) and (C2) is inverse probability weighting[6],[42],[51],[64]:

E[I(A1 = a1, A2 = a2)Y / {p(A1|L1)p(A2|L1, A1, L2)}],   (2)

where I(A1 = a1, A2 = a2) is an indicator function that takes 1 if individual i has joint exposure level (a1, a2) and 0 otherwise, p(a1|l1) = P(A1 = a1|L1 = l1) is the conditional probability of the first exposure taking level a1, and p(a2|l1, a1, l2) = P(A2 = a2|L1 = l1, A1 = a1, L2 = l2) is the conditional probability of the second exposure taking level a2 given past exposure and covariates. Accordingly, p(A1|L1) and p(A2|L1, A1, L2) in formula (2) are functions of the individual data. These two expressions are equivalent forms of E[Y^{a1,a2}] under sequential exchangeability (C1) and (C2), as well as the time-varying versions of consistency and positivity. Despite the equivalence of these identification formulas, the estimator that plugs estimates into (1) is called a g-formula estimator, and that based on (2) an inverse probability weighted estimator. The arguments can be extended to "dynamic regimes" under stronger conditions (Appendix B).[51],[64] Now, let us obtain inverse probability weighted estimates from Table 2. First, we obtain the probability of the actually received exposure given past exposure and covariates, separately for A1 and A2. As L1 is empty in our example (sequential exchangeability holds without it), p(A1) and p(A2|A1, L2) for each combination of (A1, L2, A2) are provided in Table 4. Next, calculate the "inverse probability weights" 1/{p(A1)p(A2|A1, L2)} and multiply the numbers in each combination of (A1, L2, A2) by the weights.
Note that the sum of the weights I(A1 = a1, A2 = a2)/{p(A1)p(A2|A1, L2)} for each (a1, a2) equals the total sample size (ie, n = 15,000 in our data). Hence, formula (2) indicates that we only have to estimate the probability of Y = 1 for every combination of (a1, a2) in these multiplied numbers, or the inverse probability weighted population:

E[Y^{0,0}] = (8,100 + 1,800)/15,000 = 0.66, E[Y^{0,1}] = (7,200 + 2,100)/15,000 = 0.62,
E[Y^{1,0}] = (1,800 + 6,000)/15,000 = 0.52, and E[Y^{1,1}] = (2,400 + 4,800)/15,000 = 0.48.
Table 4.

Hypothetical cohort data weighted by inverse probability of exposures

                 Unweighted number                                      Number multiplied by IPW
A1  L2  A2       N      Y = 1     p(A1)   p(A2|A1, L2)   IPW           N        Y = 1
1   1   1        720    576       0.3     0.8            4.17          3,000    2,400
1   1   0        180    108       0.3     0.2            16.67         3,000    1,800
1   0   1        1,800  720       0.3     0.5            6.67          12,000   4,800
1   0   0        1,800  900       0.3     0.5            6.67          12,000   6,000
0   1   1        5,670  4,536     0.7     0.9            1.59          9,000    7,200
0   1   0        630    567       0.7     0.1            14.29         9,000    8,100
0   0   1        840    294       0.7     0.2            7.14          6,000    2,100
0   0   0        3,360  1,008     0.7     0.8            1.79          6,000    1,800

IPW, inverse probability weight.

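The whole weighting exercise of Table 4 can be reproduced programmatically. The following Python sketch estimates the (saturated) exposure probabilities from the observed counts of Table 2, forms the inverse probability weights, and computes the weighted risks:

```python
# Observed counts from Table 2: (A1, L2, A2) -> (N, number with Y = 1).
obs = {
    (1, 1, 1): (720, 576),   (1, 1, 0): (180, 108),
    (1, 0, 1): (1800, 720),  (1, 0, 0): (1800, 900),
    (0, 1, 1): (5670, 4536), (0, 1, 0): (630, 567),
    (0, 0, 1): (840, 294),   (0, 0, 0): (3360, 1008),
}
n = sum(N for N, _ in obs.values())  # 15,000

# Exposure probabilities estimated from the data themselves (saturated models).
p_a1 = {a1: sum(N for (a, _, _), (N, _) in obs.items() if a == a1) / n
        for a1 in (0, 1)}

def p_a2(a1, l2, a2):
    denom = sum(N for (a, l, _), (N, _) in obs.items() if (a, l) == (a1, l2))
    num = obs[(a1, l2, a2)][0]
    return num / denom

def ipw_risk(a1, a2):
    """Weighted risk of Y = 1 among those with (A1, A2) = (a1, a2)."""
    w_n = w_y = 0.0
    for l2 in (0, 1):
        N, y = obs[(a1, l2, a2)]
        w = 1.0 / (p_a1[a1] * p_a2(a1, l2, a2))  # inverse probability weight
        w_n += N * w
        w_y += y * w
    return w_y / w_n

for a1 in (0, 1):
    for a2 in (0, 1):
        print((a1, a2), round(ipw_risk(a1, a2), 2))
```

Each pseudo-population has total weight 15,000, and the four weighted risks reproduce E[Y^{0,0}] = 0.66, E[Y^{0,1}] = 0.62, E[Y^{1,0}] = 0.52, and E[Y^{1,1}] = 0.48, as in Table 4.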

Marginal structural models

We have estimated the four distinct E[Y^{a1,a2}] separately via g-formula (1) or inverse probability weighting (2). No approximation, or model, has been used. Now, carefully look at the true values of E[Y^{a1,a2}] in the last row of Table 2. We can see that E[Y^{1,0}] − E[Y^{0,0}] = 0.52 − 0.66 = −0.14 and E[Y^{1,1}] − E[Y^{0,1}] = 0.48 − 0.62 = −0.14; the difference between a1 = 1 vs a1 = 0 is −14%, irrespective of the value of a2. Likewise, E[Y^{0,1}] − E[Y^{0,0}] = 0.62 − 0.66 = −0.04 and E[Y^{1,1}] − E[Y^{1,0}] = 0.48 − 0.52 = −0.04, so the causal risk difference of a2 = 1 vs a2 = 0 is −4%. We can collectively write the counterfactual expectations as E[Y^{a1,a2}] = 0.66 − 0.14a1 − 0.04a2. More generally, we may describe the relation between E[Y^{a1,a2}] and (a1, a2) as

E[Y^{a1,a2}] = β0 + β1a1 + β2a2.   (3)

This is a correctly specified marginal structural model; if we had the data in Table 2, the parameters of marginal structural model (3) could be unbiasedly estimated by, for example, least-squares or maximum-likelihood methods. Marginal structural models are simplified expressions of E[Y^{a1,a2}] obtained by restricting its possible values.[42],[43],[51] In equation (3), the left-hand side can take four distinct values, but the right-hand side expresses them with only three parameters. Model (3) is marginal because the expectations are taken with respect to the marginal distribution of Y^{a1,a2}, unconditional on other observed variables (though this condition is relaxed later) and on potential outcomes other than Y^{a1,a2} (thus, we need not consider any cross-world joint distributions under different interventions).[42],[46] Model (3) is also structural because it imposes restrictions on potential outcomes rather than on observed distributions. There are other possibilities for the specification of marginal structural models.
For example, we can fit the simpler additive model

E[Y^{a1,a2}] = β0 + β1(a1 + a2),   (4)

which has only two parameters, assuming that A1 and A2 have the same effect (risk difference) on Y, or a multiplicative marginal structural model

E[Y^{a1,a2}] = exp(β0 + β1a1 + β2a2),   (5)

where exp(β1) and exp(β2) represent the (common) risk ratios E[Y^{1,a2}]/E[Y^{0,a2}] (a2 = 0, 1) and E[Y^{a1,1}]/E[Y^{a1,0}] (a1 = 0, 1), respectively. However, these are incorrectly specified, or misspecified, marginal structural models because no parameter values (β0, β1) or (β0, β1, β2) on the right-hand sides of (4) and (5) can exactly express the left-hand sides. A marginal structural model on the multiplicative scale can be made correctly specified by saturating it, for example, by including an interaction term of a1 and a2:

E[Y^{a1,a2}] = exp(β0 + β1a1 + β2a2 + β3a1a2).   (6)

We estimate these marginal structural models through inverse probability weighting from the observed data in Table 3, where sequential exchangeability (C1) and (C2) hold. Of course, models (4) and (5) are misspecified and necessarily result in biased estimates of E[Y^{a1,a2}]. Nevertheless, the estimates from misspecified marginal structural models may well approximate the true E[Y^{a1,a2}] unless the model forms differ significantly from the true relationship between E[Y^{a1,a2}] and (a1, a2). A typical estimation process is as follows: 1) calculate the inverse probability weight, 1/{p(A1)p(A2|A1, L2)}, for each variable pattern (A1, L2, A2) as in Table 4; 2) fit the regression model for E[Y|A1 = a1, A2 = a2], weighted by these inverse probability weights, with the same functional form as the marginal structural model; and 3) obtain confidence intervals by the sandwich estimator or bootstrap. The SAS and Stata code to create the dataset and replicate the results is provided in Appendix C and the Supplementary Material, respectively. Table 5 shows the parameter estimates of these models.
Expectations E[Y^{a1,a2}] are also estimated by linear combinations of these estimates in the corresponding models; eg, E[Y^{0,0}] = β0 (models (3) and (4)) or exp(β0) (models (5) and (6)), and E[Y^{1,1}] = β0 + β1 + β2 (model (3)), β0 + 2β1 (model (4)), exp(β0 + β1 + β2) (model (5)), or exp(β0 + β1 + β2 + β3) (model (6)).
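The point estimates of the correctly specified model (3) can be sketched in Python. Because the exposure probability model here is saturated and each joint-exposure cell of the pseudo-population carries the same total weight (15,000), an unweighted least-squares fit to the four pseudo-population risks coincides with the individual-level weighted fit (the paper's Appendix C code performs the latter in SAS/Stata):

```python
# Pseudo-population risks E_w[Y | A1=a1, A2=a2], ie, the IP-weighted
# estimates of E[Y^{a1,a2}] obtained from Table 4.
risks = {(0, 0): 0.66, (0, 1): 0.62, (1, 0): 0.52, (1, 1): 0.48}

# Least-squares fit of MSM (3): E[Y^{a1,a2}] = b0 + b1*a1 + b2*a2,
# via the 3x3 normal equations X'X b = X'y solved by Gauss-Jordan elimination.
X = [[1, a1, a2] for (a1, a2) in risks]
y = [risks[k] for k in risks]
XtX = [[sum(X[i][r] * X[i][c] for i in range(4)) for c in range(3)]
       for r in range(3)]
Xty = [sum(X[i][r] * y[i] for i in range(4)) for r in range(3)]

def solve(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    A = [row[:] + [bi] for row, bi in zip(A, b)]
    for j in range(3):
        pivot = max(range(j, 3), key=lambda r: abs(A[r][j]))
        A[j], A[pivot] = A[pivot], A[j]
        for r in range(3):
            if r != j:
                f = A[r][j] / A[j][j]
                A[r] = [x - f * z for x, z in zip(A[r], A[j])]
    return [A[r][3] / A[r][r] for r in range(3)]

b0, b1, b2 = solve(XtX, Xty)
print(round(b0, 2), round(b1, 2), round(b2, 2))  # 0.66 -0.14 -0.04
```

The fit is exact because model (3) is correctly specified for these data, matching the MSM (3) column of Table 5.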
Table 5.

Inverse probability weighted estimates of marginal structural models from observed hypothetical cohort data (Table 3)

                     MSM (3): Correct         MSM (4): Incorrect        MSM (5): Incorrect      MSM (6): Correct
                     Estimate^a  95% CI^b     Estimate^a  95% CI^b      Estimate^a  95% CI^b    Estimate^a  95% CI^b
Risk difference or ratio
 A1 (a1 = 1 vs 0)    −0.140  −0.160, −0.120   −0.090^c  −0.104, −0.076  0.781  0.753, 0.810     0.788^d  0.746, 0.832
 A2 (a2 = 1 vs 0)    −0.040  −0.060, −0.020   −0.090^c  −0.104, −0.076  0.932  0.900, 0.965     0.939^e  0.903, 0.978
 A1A2                                                                                           0.983^f  0.914, 1.057
Potential outcome mean
 E[Y^{0,0}]          0.660   0.643, 0.677     0.660  0.643, 0.677       0.663  0.645, 0.681     0.660  0.641, 0.680
 E[Y^{0,1}]          0.620   0.605, 0.635     0.570  0.560, 0.580       0.618  0.602, 0.634     0.620  0.604, 0.637
 E[Y^{1,0}]          0.520   0.501, 0.539     0.570  0.560, 0.580       0.518  0.499, 0.537     0.520  0.497, 0.544
 E[Y^{1,1}]          0.480   0.463, 0.497     0.480  0.463, 0.497       0.483  0.466, 0.499     0.480  0.461, 0.500

CI, confidence interval; MSM, marginal structural model.

^a Risk differences β (MSMs (3) and (4)) or risk ratios exp(β) (MSMs (5) and (6)) in the upper part.

^b Using the sandwich estimator.

^c Common risk difference for A1 and A2.

^d Risk ratio for A1 when controlling A2 at 0: E[Y^{1,0}]/E[Y^{0,0}].

^e Risk ratio for A2 when controlling A1 at 0: E[Y^{0,1}]/E[Y^{0,0}].

^f Interaction between A1 and A2 on the risk ratio scale: (E[Y^{1,1}]E[Y^{0,0}])/(E[Y^{1,0}]E[Y^{0,1}]).

Why do we need to model E[Y^{a1,a2}] at the risk of causing bias? Consider that the exposure can change at one additional time point. Without models, we would need to estimate 2^3 = 8 (double our case) distinct E[Y^{a1,a2,a3}]. With six time points, the task requires 2^6 = 64 estimates from a limited amount of data. Furthermore, for a continuous exposure, we have to rely on dose–response curves irrespective of the number of exposure time points. Given that we always have a limited amount of data, our estimation tasks must rely on reducing the dimension of the parameter space by imposing restrictions on the possible values of the counterfactual outcome means. In Table 5, although both models (3) and (6) are correctly specified and unbiasedly estimated, the estimates of E[Y^{a1,a2}] from model (3) (three parameters) have slightly narrower confidence intervals than those from model (6) (four parameters). The efficiency gain owing to dimension reduction becomes more substantial as the number of time points increases. Note that models (3)–(6) do not require covariate information, though marginal structural models can incorporate baseline confounders L1, for example to examine effect modification by certain variables on specific scales (eg, risk difference or ratio).[6],[68] A convenient choice commonly seen in practice may be the simplest model assuming a common exposure effect across time points and baseline-confounder strata:

E[Y^{a1,a2}|L1] = β0 + β1(a1 + a2) + β2^T L1,

which imposes more restrictions than marginal structural model (4), which is agnostic about (ie, does not assume) no effect modification by L1.
To assess effect modification by baseline confounders, the model can be modified as E[Ya1,a2|L1] = β0 + β1(a1 + a2) + β2L1 + β3(a1 + a2)L1, though this is still generally stricter than model (4) because the effect of exposure is restricted to be linearly modified by L1.
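To make the notion of "restriction" concrete, the following sketch (with hypothetical coefficient values) evaluates a common-effect model at every exposure history; it cannot represent different effects of A1 and A2, which is the restriction a model agnostic about the separate effects avoids:

```python
# Common-effect MSM: E[Y^(a1,a2)] = b0 + b1*(a1 + a2).
# The values b0 = 0.65 and b1 = -0.08 are hypothetical, for illustration only.
def msm_common(b0: float, b1: float, a1: int, a2: int) -> float:
    return b0 + b1 * (a1 + a2)

means = {(a1, a2): msm_common(0.65, -0.08, a1, a2)
         for a1 in (0, 1) for a2 in (0, 1)}

# The restriction: the single-exposure means are forced to coincide,
# so the model is misspecified whenever A1 and A2 have different effects.
assert means[(1, 0)] == means[(0, 1)]
```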

Dealing with high-dimensional covariates

In our example, we have no baseline confounders and only one time-varying binary confounder, L2, as well as two binary exposures, A1 and A2. As a result, we can estimate all conditional expectations and conditional probabilities in g-formula (1) and inverse probability weighting (2) by direct calculation of the mean/proportion in each stratum; in other words, we used saturated regression and exposure probability models. In practice, however, we have many variables in L1 or L2, or both, some of which may follow continuous or multinomial distributions. In such cases, we must rely on models for the observed distribution of (L1, A1, L2, A2, Y).[4],[20],[69],[70] For example, g-formula (1) can be estimated by fitting an outcome regression model for E[Y|L1, A1, L2, A2] together with covariate regression models for each E[L2k|L1, A1, L20, …, L2,k−1], where L2k is the kth variable in the arbitrarily ordered L2 = (L21, …, L2K)T with a constant L20. Note that, in general, we must numerically approximate the conditional distribution of L2 by simulating Monte Carlo samples from the model fit, whose conditional means follow the above regression models (the parametric g-formula estimator).[38],[47] Alternatively, we could iteratively model the left-hand side of g-formula (1) from the inside to the outside of the expectations by fitting outcome regression models to the predictions from the previous model fit (equivalent to the Q-learning estimator).[24],[71] Inverse probability weighting formula (2) can also be estimated by, for example, logistic models for the exposure probabilities, logit P(A1 = 1|L1) and logit P(A2 = 1|L1, A1, L2). We then calculate the weighted mean using the weights 1/{p(A1|L1)p(A2|L1, A1, L2)} obtained from the predicted values of these models. Note that neither of these approaches imposes any restriction on the values of E[]; we could use regression or exposure probability models without specifying marginal structural models, and vice versa (recall the calculation of Table 5).
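As a minimal sketch of the weighting step (hypothetical data and function names; in practice p1 and p2 would be predicted values from the fitted logistic models):

```python
def ip_weight(a1: int, a2: int, p1: float, p2: float) -> float:
    """Unstabilized weight 1/{p(A1|L1) p(A2|L1, A1, L2)} for one subject.

    p1 = fitted P(A1 = 1 | L1), p2 = fitted P(A2 = 1 | L1, A1, L2);
    the weight inverts the probability of the exposure history
    actually received."""
    pr_a1 = p1 if a1 == 1 else 1.0 - p1
    pr_a2 = p2 if a2 == 1 else 1.0 - p2
    return 1.0 / (pr_a1 * pr_a2)

# A subject with observed history (A1, A2) = (1, 0) and fitted
# probabilities p1 = 0.4, p2 = 0.5: weight = 1 / (0.4 * 0.5) = 5.0
assert abs(ip_weight(1, 0, 0.4, 0.5) - 5.0) < 1e-9
```

E[Ya1,a2] is then estimated by the weighted mean of Y among subjects whose observed history equals (a1, a2).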
Marginal structural models are causal assumptions about the relationship between E[] and the hypothetical intervention (a1, a2); by contrast, regression and exposure probability models are approximations of certain aspects of the observed distribution of (L1, A1, L2, A2, Y). In practice, however, we must rely on both marginal structural models and exposure probability models when using inverse probability weighting to estimate the effects of exposures with a moderate number of time points.[44]–[50] Table 6 shows the estimates of marginal structural models (3)–(6) using the fit of a misspecified exposure probability model: logit P(A2 = 1|A1 = a1, L2 = l2) = α0 + α1a1 + α2l2. As expected, all estimates of E[] are biased relative to Table 2 owing to the misspecification of the exposure probability model. Moreover, even the correctly specified marginal structural models (3) and (6) yield estimates that diverge from each other when a misspecified exposure probability model is used to estimate the inverse probability weights. Similar to the dimension reduction via marginal structural models, we would expect a greater efficiency gain (ie, variance reduction) in inverse probability weighting estimators when high-dimensional confounders must be conditioned on to achieve sequential exchangeability.
Table 6.

Inverse probability weighted estimates of marginal structural models using a misspecified exposure probability model

|  | MSM (3): Correct | MSM (4): Incorrect | MSM (5): Incorrect | MSM (6): Correct |
|---|---|---|---|---|
| **Risk difference or ratio**a **(95% CI**b**)** |  |  |  |  |
| A1 (a1 = 1 vs 0) | −0.119 (−0.145, −0.092) | −0.081c (−0.095, −0.068) | 0.813 (0.774, 0.855) | 0.886d (0.819, 0.958) |
| A2 (a2 = 1 vs 0) | −0.045 (−0.073, −0.017) | −0.081c (−0.095, −0.068) | 0.924 (0.880, 0.969) | 1.022e (0.983, 1.063) |
| A1A2 |  |  |  | 0.822f (0.749, 0.902) |
| **Potential outcome mean (95% CI**b**)** |  |  |  |  |
| E[Y0,0] | 0.655 (0.635, 0.674) | 0.649 (0.628, 0.671) | 0.658 (0.638, 0.679) | 0.625 (0.606, 0.644) |
| E[Y0,1] | 0.610 (0.593, 0.628) | 0.568 (0.552, 0.584) | 0.608 (0.590, 0.626) | 0.639 (0.624, 0.654) |
| E[Y1,0] | 0.536 (0.503, 0.570) | 0.568 (0.552, 0.584) | 0.535 (0.502, 0.569) | 0.554 (0.515, 0.595) |
| E[Y1,1] | 0.491 (0.472, 0.511) | 0.487 (0.466, 0.507) | 0.494 (0.475, 0.513) | 0.465 (0.445, 0.486) |

CI, confidence interval; MSM, marginal structural model.

aRisk differences β (MSMs (3) and (4)) or risk ratios exp(β) (MSMs (5) and (6)) in the upper part.

bUsing sandwich estimator.

cCommon risk difference for A1 and A2.

dRisk ratio for A1 when controlling A2 at 0: E[Y1,0]/E[Y0,0].

eRisk ratio for A2 when controlling A1 at 0: E[Y0,1]/E[Y0,0].

fInteraction between A1 and A2 in risk ratio scale: (E[Y1,1]E[Y0,0])/(E[Y1,0]E[Y0,1]).


Summary of pitfalls and tips

Our hypothetical dataset explicitly shows the estimands (ie, E[]) and minimally satisfies the counterfactual conditions (ie, sequential exchangeability) needed to estimate counterfactual means under joint intervention on the time-varying exposures (A1, A2). We have thus far illustrated tips (Box) for a formal understanding of marginal structural modeling and its estimation through inverse probability weighting (pitfall 1), as well as the required causal assumptions on unobservable data. Models are used to cope with the "curse of dimensionality." On one hand, marginal structural models reduce the dimension of the counterfactual outcome means across the huge number of combinations of time-varying exposures. On the other hand, exposure probability models must be adopted in practice to account for the large numbers of baseline and time-varying confounders, which usually has no implications for marginal structural modeling (pitfalls 2 and 4). We also showed, separately, the biases arising from misspecification of exposure probability models and from misspecification of marginal structural models (pitfall 3). Note that while both inverse probability weighting and the g-formula can estimate marginal counterfactual means (ie, saturated marginal structural models), only the former can estimate general, unsaturated marginal structural models (pitfall 5). Although running into these pitfalls may not necessarily lead to large biases in a practical analysis, failure to recognize these subtleties can encourage unprincipled and suboptimal strategies for causal inference. We conclude this section by emphasizing two of the pitfalls. First, variable selection and model specification are generally different tasks in modeling for causal inference. In inverse probability weighting, exposure probability models should include the confounders, stratification on which is sufficient to achieve sequential exchangeability.
In our example, all analyses with or without an exposure probability model include all confounder(s), L2. Even if the models include all confounders, however, they may be misspecified, as in the analysis in Table 6. The same is true for the regression models used in the g-formula. By contrast, marginal structural models need not include confounders; only covariates that may modify the exposure effect of interest (these need not be confounders, but they should be conditioned on in the propensity score[19]) may be included in marginal structural models.[6],[68] Second, doubly robust estimators can alleviate the bias from misspecification of regression and exposure probability models,[22]–[28] but not the bias owing to misspecification of marginal structural models or of other causal models (not introduced in this paper). For example, Table 6 shows biased estimates obtained by applying a misspecified exposure probability model to both correct and incorrect marginal structural models. Among them, the bias in the estimates of the correct marginal structural models (3) and (6) would be mitigated by doubly robust methods that incorporate outcome regression models via the iterative model-fitting algorithm of Bang and Robins,[24] while fitting the incorrect marginal structural models (4) and (5) must still result in biased estimates. Hence, even with doubly robust methods, careful consideration of marginal structural models is needed, especially in long-term follow-up studies with many time points at which exposure can change. Marginal structural models for dynamic regimes may also have to depend on strong modeling assumptions,[51],[64],[72]–[74] even when exposure is binary and changes at only several time points.

FUTURE DIRECTIONS

There is a relevant method other than the g-formula and inverse probability weighting that requires essentially the same assumptions to estimate the causal effects of time-varying exposures: g-estimation.[15],[18],[38]–[41],[51],[67] Just as inverse probability weighting is used to fit marginal structural models, g-estimation is a method for estimating the parameters of structural nested models. Structural nested models and g-estimation have attractive statistical properties (eg, robustness, efficiency, and flexible parameterization) and work successfully within Robins' causal "interventionism" framework under minimal conditions.[31],[41],[46],[63],[75] Despite this theoretical superiority, g-estimation has been underused in the epidemiologic literature, probably because of the complexity of the background theory and the interpretability of the parameters.[75] However, structural nested models are especially useful for dynamic regimes of time-varying exposures because they can model effect modification by time-varying covariates,[38],[41],[51] which cannot generally be included in marginal structural models.[46],[68] Besides the conceptual pitfalls considered in this paper, there are important pitfalls regarding the specification and estimation of marginal structural models that often lead to mistakes in practice: one should always use the independence working correlation in marginal structural models of repeated-measures outcomes;[47],[76],[77] if "stabilized" weights include covariates in their numerators,[43] those covariates should be conditioned on in the marginal structural models;[50] and "stabilization" of the weights is not always acceptable (eg, in dynamic-regime marginal structural models[72]–[74]).
It is always important to check the fit of the exposure probability models (eg, by checking calibration or model-diagnostic measures[78] and weight distributions[50]) and of the marginal structural models (eg, by comparing the estimating equation-based quasi-likelihood information criterion with that of less restricted models[79] or by testing the equivalence of the asymptotic values of parameter estimates obtained through different weighting options[80]). There are other practical concerns in real data analysis. For example, many follow-up studies compare time-to-event outcomes, which complicates the modeling and estimation of the effects of time-varying exposures. In these settings, time-dependent Cox models or risk-set switching Kaplan–Meier estimators would need unrealistic assumptions to yield causally interpretable estimates.[43],[81] In addition, censoring of the events must be handled with care by, for example, constructing inverse probability of censoring weights to prevent attrition bias.[44],[45],[51] Note that the idea of inverse probability of censoring weights appears across diverse areas of causal inference; eg, adjustment for treatment discontinuation in clinical trials,[82],[83] estimation of the effects of dynamic regimes,[72] and the effects of treatment duration on survival.[84]
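The cumulative structure of inverse probability of censoring weights can be sketched as follows (a hedged illustration with hypothetical probabilities; in practice, each period's probability of remaining uncensored would be predicted from a fitted censoring model):

```python
def ipc_weights(p_uncensored: list[float]) -> list[float]:
    """Time-specific weights 1 / prod_{s<=t} P(uncensored at s | history)."""
    weights, cum = [], 1.0
    for p in p_uncensored:
        cum *= p            # cumulative probability of remaining in follow-up
        weights.append(1.0 / cum)
    return weights

# Two periods with fitted probabilities 0.9 and 0.8 of staying uncensored:
w = ipc_weights([0.9, 0.8])
assert abs(w[0] - 1 / 0.9) < 1e-9 and abs(w[1] - 1 / 0.72) < 1e-9
```

Subjects who remain under observation thereby stand in for those with similar histories who were censored, preventing attrition bias under the assumed censoring model.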
Box.

Key messages for clear understanding of marginal structural modeling

Marginal structural models (MSMs) should be distinguished from inverse probability weighting
An MSM encodes prespecified assumptions about causal estimands, whereas an exposure probability model imposes restrictions on the observed distribution
Because MSMs and exposure probability models serve different purposes, misspecifying them leads to bias in different ways
Specifying MSMs and specifying exposure probability models raise different challenges in real data analysis
The g-formula, which shares its identifiability assumptions with inverse probability weighting, can be used to fit MSMs only when the models are saturated
