
Educational note: causal decomposition of population health differences using Monte Carlo integration and the g-formula.

Nikkil Sudharsanan, Maarten J Bijlsma.

Abstract

One key objective of the population health sciences is to understand why one social group has different levels of health and well-being compared with another. Whereas several methods have been developed in economics, sociology, demography, and epidemiology to answer these types of questions, a recent method introduced by Jackson and VanderWeele (2018) provided an update to decompositions by anchoring them within causal inference theory. In this paper, we demonstrate how to implement the causal decomposition using Monte Carlo integration and the parametric g-formula. Causal decomposition can help to identify the sources of differences across populations and provide researchers with a way to move beyond estimating inequalities to explaining them and determining what can be done to reduce health disparities. Our implementation approach can easily and flexibly be applied for different types of outcome and explanatory variables without having to derive decomposition equations. We describe the concepts of the approach and the practical steps and considerations needed to implement it. We then walk through a worked example in which we investigate the contribution of smoking to sex differences in mortality in South Korea. For this example, we provide both pseudocode and R code using our package, cfdecomp. Ultimately, we outline how to implement a very general decomposition algorithm that is grounded in counterfactual theory but still easy to apply to a wide range of situations.
© The Author(s) 2021. Published by Oxford University Press on behalf of the International Epidemiological Association.

Keywords:  Decomposition; Monte Carlo; causal inference; health disparities; parametric g-formula; population models

Year:  2021        PMID: 34999885      PMCID: PMC8743135          DOI: 10.1093/ije/dyab090

Source DB:  PubMed          Journal:  Int J Epidemiol        ISSN: 0300-5771            Impact factor:   7.196


Causal or counterfactual-based decomposition methods are of growing importance in epidemiology and the population health sciences. We develop and demonstrate a highly flexible implementation of the causal decomposition that is grounded in counterfactual theory but still easy to apply to a wide range of questions without having to derive specialized decomposition equations. We demonstrate how to use our decomposition algorithm to estimate the contribution of smoking to sex differences in the age-adjusted 1-year risk of mortality in South Korea, finding that smoking explains 27% of the male mortality disadvantage at ages ≥50 years.

Introduction

A central aim of the population health sciences is to understand why one social group has different levels of health and well-being compared with another. Recent examples of this question include understanding why Hispanics have worse congenital heart disease outcomes compared with non-Hispanics, why adult mortality is higher in urban compared with rural Indonesia, and why poorer individuals in Finland have higher mortality compared with more affluent individuals. By identifying the sources of differences across populations, these studies provide an important first step in determining what can be done to reduce health disparities.

Decomposition analyses are one of the key tools for understanding the sources of differences in an outcome between groups and can help to move researchers from estimating to explaining health inequalities. At their core, decomposition analyses seek to determine how much of an observed difference in an outcome between two groups is due to the differing distribution of specific causes of that outcome between the groups. For example, in the Finnish case above, researchers may ask: ‘How much of the mortality difference between rich and poor individuals is due to the higher prevalence of smoking among poorer compared with richer individuals?’

Although such questions may sound like mediation analysis, there is a key difference between mediation and decomposition. In a causal mediation analysis, we would first estimate the causal effect of poverty on mortality and then identify how much of this effect is driven through the causal effect of poverty on smoking. In a decomposition analysis, we are interested in how much smoking contributes to observed differences in mortality between poor and non-poor, and are agnostic to how much of the difference in smoking between poor and non-poor is due to the causal effect of poverty and how much is due to confounding causes. This crucial difference (depicted graphically using directed acyclic graphs in Figure 1) has consequences for the analytical approach to be taken and requires fewer confounding variables to be accounted for. Importantly, in a decomposition analysis, since we are not attempting to estimate the causal effect of the group variable (the exposure in a mediation analysis), we do not have to contend with the open issue of whether causal effects can be estimated for non-manipulable characteristics such as race.
Figure 1

Directed acyclic graphs showing conceptual differences between mediation (A) and decomposition (B). Solid lines represent causal effects, whereas two-way dotted lines represent associations.

Various methods have been developed across disciplines for conducting decomposition analyses. Regression decompositions, such as the Oaxaca-Blinder decomposition and its non-linear extensions, use individual-level data and are employed frequently in economics and sociology, whereas approaches using aggregate-level data are common in demography. Recent advances in epidemiology provide a new perspective to decompositions, situating them in causal inference and counterfactual theory. Among these, Jackson and VanderWeele (2018) provide an important advance by framing decomposition analyses around interventions to reduce disparities, where the importance of specific characteristics to differences between populations is evaluated through hypothetical intervention scenarios to equalize these characteristics between groups.

In this paper, we demonstrate a simple way to implement the counterfactual decomposition using parametric models and Monte Carlo integration. We focus on a worked example that asks ‘How much of the observed sex difference in mortality in South Korea is due to the higher prevalence of smoking among men compared with women?’ and demonstrates how to decompose the age-adjusted 1-year mortality risk ratio for men relative to women. Our approach is based on a straightforward algorithm for estimating counterfactual decompositions for different outcome distributions without having to derive decomposition equations and can be easily applied within common statistical packages or implemented with our R package, cfdecomp.

A counterfactual approach to decomposition

Concepts

We motivate and develop our approach through the question: ‘What is the contribution of smoking to sex differences in mortality in South Korea?’ We adopt a counterfactual perspective and define ‘contribution’ by asking ‘How large would the difference in mortality be if men and women counterfactually had an equal smoking prevalence?’ Our first main step is to specify exactly what level of smoking prevalence we are equalizing men and women to. When the relationship between an outcome (such as mortality) and a mediator (such as smoking) is non-linear, this choice can affect the contribution estimate. Therefore, the choice of the counterfactual mediator distribution should be informed by substantive concerns (e.g. what makes sense from a policy perspective?) and inferential concerns (e.g. certain values may be outside the range observed in the data and should therefore be avoided). We choose to set men to have the smoking prevalence of women, since this maps to a clear intervention that public health policymakers may seek to achieve. The second main step is to specify a summary population measure. This is the measure that we will use to compare the mortality of men and women in South Korea. For our example, we consider the age-adjusted 1-year risk of death. In theory, our approach can be extended to decompose more complicated summary measures, such as disability-adjusted life years lost or period life expectancy. However, decomposing such measures requires additional, often stronger, assumptions. For this reason, we do not cover the application of our method to those summary measures here and choose rather to focus on common summary measures with clear assumptions. Third, we need to specify contrasts of these summary measures between men and women (i.e. how are we going to compare the summary measure?). We consider the risk ratio for men relative to women (adjusted for age). 
Our method also allows us to decompose other contrasts, such as the risk difference, a point that we will return to when describing the decomposition algorithm below. Based on these three choices, we can construct our estimate of the ‘contribution’ of smoking by seeing how much the difference in the summary measure between men and women reduces when we set men to have the same smoking prevalence as women. For example, we would compare the mortality risk ratio between men and women in the observed data to the mortality risk ratio between men and women in a counterfactual world in which we set men to have the same smoking levels as women. We could then estimate the contribution of smoking as the percentage reduction in the male–female mortality disparity. Note that this contribution is not bounded between 0% and 100%: it can be negative or exceed 100%. This is not an issue, however; this situation occurs in both mediation and decomposition analyses when the indirect effect (the association via the mediators) and the direct effect (the association not via the mediators) are of opposite signs and hence partially cancel each other out in the total effect. Indeed, many recent papers using mediation and decomposition analyses have found contribution estimates of >100% or <0%. Contribution estimates of <0% or >100% could also occur due to imprecision in the underlying estimates. For this reason, it is important to present and interpret such estimates with their accompanying standard errors. In Supplementary Appendix 3, available as Supplementary data at IJE online, we provide a more general formal exposition of the causal decomposition.
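The percentage-reduction definition above can be written compactly. The notation is ours, not the authors': $\psi_{\mathrm{nc}}$ and $\psi_{\mathrm{cf}}$ denote the chosen contrast without and with the equalized mediator distribution, and $\psi_0$ is the null value of that contrast (1 for a ratio, 0 for a difference). Measuring the disparity as the distance from the null is consistent with the calculation reported in the Results, but it should be read as a sketch rather than the formal exposition of Supplementary Appendix 3.

```latex
\[
\text{contribution}
\;=\;
\frac{(\psi_{\mathrm{nc}}-\psi_0)-(\psi_{\mathrm{cf}}-\psi_0)}{\psi_{\mathrm{nc}}-\psi_0}
\;=\;
1-\frac{\psi_{\mathrm{cf}}-\psi_0}{\psi_{\mathrm{nc}}-\psi_0}
\]
```

Written this way, the contribution exceeds 100% when the counterfactual contrast crosses to the other side of the null, and it is negative when equalizing the mediator widens the disparity.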

Parametric modelling and Monte Carlo-based estimation

The core estimand in our decomposition is the counterfactual summary measure of mortality for men if they were set to have the same smoking distribution as women. Estimating this counterfactual requires (i) a way to match the smoking distribution between men and women, and (ii) a way to re-estimate mortality as a function of the new smoking distribution. Importantly, since we are interested in the effect of changing the level of smoking on mortality, our approach to re-estimating mortality needs to adjust for the confounders of the smoking–mortality relationship. Our solution to these two issues is to use the parametric g-formula and Monte Carlo integration. This entire approach can be estimated by following a straightforward algorithm.

Decomposition algorithm

Step 0: Specify a summary measure, contrast, and counterfactual distribution

Decide on a summary measure. Decide on a contrast. Decide on the counterfactual mediator distribution.

Step 1: Estimate relationships in the data (using regression models)

Fit regression model(s) for the mediator(s) of interest with confounders of the mediator–outcome relationship as covariates. Fit regression model(s) for the outcome with the mediator(s) of interest and the same confounders as the mediator model.

Step 2: Simulate the natural-course pseudo-population

Use the coefficients from the mediator model(s) with the observed confounder values to simulate mediator values for each individual in the data. Use the coefficients from the outcome model(s) together with the observed confounder values and the new simulated mediator values to simulate the outcome for each individual in the data. This is the natural-course pseudo-population. Within this natural-course pseudo-population, estimate the summary measure for both groups and then form the contrast of interest across groups.

Step 3: Simulate the counterfactual pseudo-population

Use the coefficients from the mediator model(s) with the observed confounder values to simulate mediator values that follow the counterfactual mediator distribution. Use the coefficients from the outcome model(s) together with the observed confounder values and simulated mediator values to simulate the outcome for each individual in the data. This is the counterfactual pseudo-population. Within this counterfactual pseudo-population, estimate the summary measure for both groups and then form the contrast of interest across groups.

Step 4: Compare the contrast of interest in the natural-course and counterfactual pseudo-populations
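The algorithm can be sketched end to end on synthetic data. The sketch below is not the authors' cfdecomp implementation: every rate in `make_data` is invented, there is a single binary confounder, the risk ratio is crude rather than age-adjusted, and the mediator and outcome "models" are saturated stratum proportions (equivalent to a logistic regression with all interactions) so that the example needs only the standard library. All function names are hypothetical.

```python
import random
from collections import defaultdict

rng = random.Random(2021)

def make_data(n):
    """Synthetic cohort of (male, c, smoke, died) tuples; all rates are invented."""
    rows = []
    for _ in range(n):
        male = rng.random() < 0.5
        c = rng.random() < 0.4                      # one binary confounder
        p_smoke = (0.6 if male else 0.1) + (0.1 if c else 0.0)
        smoke = rng.random() < p_smoke
        p_die = 0.05 * (1.5 if male else 1.0) * (2.0 if smoke else 1.0) * (1.5 if c else 1.0)
        rows.append((male, c, smoke, rng.random() < p_die))
    return rows

def fit_proportions(rows, key, event):
    """Step 1 stand-in: P(event | stratum) as the observed stratum proportion."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[key(r)] += 1
        hits[key(r)] += event(r)
    return {k: hits[k] / totals[k] for k in totals}

def simulate(rows, p_smoke, p_die, counterfactual):
    """Steps 2/3: one Monte Carlo draw; returns the male/female risk ratio."""
    deaths = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for male, c, _, _ in rows:
        # Counterfactual: men draw smoking from the women's model in their own stratum.
        source = False if (counterfactual and male) else male
        smoke = rng.random() < p_smoke[(source, c)]
        died = rng.random() < p_die[(male, smoke, c)]   # outcome model keeps own sex
        counts[male] += 1
        deaths[male] += died
    return (deaths[True] / counts[True]) / (deaths[False] / counts[False])

def decompose(rows, draws=20):
    p_smoke = fit_proportions(rows, lambda r: (r[0], r[1]), lambda r: r[2])
    p_die = fit_proportions(rows, lambda r: (r[0], r[2], r[1]), lambda r: r[3])
    # Average the contrast over several draws to damp Monte Carlo error.
    rr_nc = sum(simulate(rows, p_smoke, p_die, False) for _ in range(draws)) / draws
    rr_cf = sum(simulate(rows, p_smoke, p_die, True) for _ in range(draws)) / draws
    # Step 4: share of the excess risk ratio removed by equalizing smoking.
    return rr_nc, rr_cf, 1 - (rr_cf - 1) / (rr_nc - 1)

rr_nc, rr_cf, contribution = decompose(make_data(30000))
print(f"natural-course RR={rr_nc:.2f}, counterfactual RR={rr_cf:.2f}, "
      f"contribution={contribution:.0%}")
```

Replacing `fit_proportions` with fitted regression models, and the crude risk ratio with an age-adjusted one, would bring the sketch closer to the analysis described in the text.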

To estimate standard errors and to produce stable estimates of the contribution, we have to address two types of variability. First, since we are drawing values of the mediators and outcomes from probability distributions, the exact values assigned to individuals can change across multiple draws. This results in the estimate of the contribution also changing across draws (known as Monte Carlo error). To reduce this error, we conduct Steps 2 and 3 multiple times, each time drawing a new set of mediator and outcome values. We then construct the contrasts for each draw and average across all these draws to produce stable natural-course and counterfactual estimates, before calculating the contribution in Step 4. Second, because our results are based on a sample, we need to account for sampling variability. This is especially important for the construction of confidence intervals around the estimates. We use a bootstrap procedure to capture this uncertainty, drawing with replacement a fresh sample of size equal to the original data before Step 1, conducting the entire analysis k times, and then estimating the standard error of our decomposition estimates as the standard deviation of the estimates from the k bootstrap samples.

Our algorithm above treats the variables involved as time-fixed, which may not always be appropriate. The algorithm can be easily expanded, however, to allow for time-varying variables; we present a time-varying version of the decomposition algorithm above in Supplementary Appendix 2, available as Supplementary data at IJE online, based on Westreich et al. (2012).

A second important note is that the natural course is often used in g-formula analyses to validate the estimation models rather than as part of the estimand. In our algorithm, however, the natural course also forms part of the contribution estimate. We chose to use the natural-course estimate instead of the observed data in our estimand so that both the counterfactual and ‘as-is’ scenarios are based on the same underlying model. However, if the natural-course estimates do not approximate the data well, then that is evidence of model misspecification, which needs to be investigated further.

Both the size of and contribution of specific mediators to a health disparity are dependent on the scale that the disparity is measured on. For example, a difference in mortality between two populations and the contribution of smoking to this difference may vary based on whether the disparity is measured as a mortality risk ratio, a survival risk ratio or an absolute difference in mortality rates. A major strength of our decomposition algorithm is that the researcher is not limited to one scale and can estimate and explain the disparity using multiple measures. This is because the decomposition algorithm works by first generating pseudo-populations based around model-predicted values rather than by comparisons of model coefficients.
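The bootstrap step described above can be sketched generically. This is a minimal illustration, not the cfdecomp code: `bootstrap_se` and `crude_risk_ratio` are hypothetical names, the toy data are invented, and a crude risk ratio stands in for the full decomposition, which in practice would be rerun from Step 1 inside each resample.

```python
import random
import statistics

def bootstrap_se(data, estimator, k=200, seed=1):
    """Standard error of estimator(data) as the SD of k bootstrap re-estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        resample = rng.choices(data, k=len(data))   # n rows drawn with replacement
        estimates.append(estimator(resample))       # rerun the whole analysis
    return statistics.stdev(estimates), estimates

def crude_risk_ratio(rows):
    """Toy estimator: risk of death in men divided by risk of death in women."""
    men = [died for male, died in rows if male]
    women = [died for male, died in rows if not male]
    return (sum(men) / len(men)) / (sum(women) / len(women))

# Invented data: 400 men with 60 deaths, 400 women with 30 deaths.
toy = ([(True, 1)] * 60 + [(True, 0)] * 340 +
       [(False, 1)] * 30 + [(False, 0)] * 370)

se, estimates = bootstrap_se(toy, crude_risk_ratio)
print(f"point estimate={crude_risk_ratio(toy):.2f}, bootstrap SE={se:.2f}")
```

Passing the full decomposition as `estimator` makes each of the k iterations refit the models and repeat the Monte Carlo draws, which is why the procedure can be computationally heavy.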

Empirical example: the contribution of smoking to sex differences in mortality in South Korea

We now demonstrate the application of the approach that we outlined in the previous section to real data from the Korean Longitudinal Study of Aging. In the interest of providing a simple pedagogical example, we conduct a stylized analysis and thus the results should be interpreted cautiously. A more rigorous analysis that fully explores and accounts for the different sources of confounding and measurement error is outside the scope of this example. The simplified example also raises conceptual issues that we do not discuss, such as whether some of the confounders may instead mediate the relationship between ever smoking and mortality. However, to lend some credence to the analysis, we note that the results of our example are in line with other literature on the contribution of smoking to sex differences in mortality.

Data: Korean Longitudinal Study of Aging

We use data from the 2006–2012 waves of the Korean Longitudinal Study of Aging, a nationally representative survey of South Korean individuals aged ≥45 years. We use data on adults aged ≥50 years from the baseline 2006 wave, using the subsequent waves for mortality follow-up. Our total sample consists of 7615 individuals with 42 405 person-years of follow-up. We convert our data from a person format to a person-age format, with one observation for every age lived in the survey, along with a dichotomous indicator for whether an individual survived through or died at that age. Individuals leave the survey through death, through censoring from loss to follow-up before 2012, or through censoring at the end of the survey period in 2012.
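The person to person-age conversion can be illustrated with a short sketch. It assumes follow-up is counted in whole years and that a death is recorded at the last age observed; the function and field names are hypothetical and do not correspond to variables in the survey files.

```python
def to_person_age(person_id, age_at_baseline, years_followed, died):
    """Expand one person into one row per age lived under observation."""
    rows = []
    for offset in range(years_followed):
        is_last = offset == years_followed - 1
        rows.append({
            "id": person_id,
            "age": age_at_baseline + offset,
            # Only the final observed age can carry the death indicator.
            "died": int(died and is_last),
        })
    return rows

deceased = to_person_age("A", age_at_baseline=63, years_followed=3, died=True)
censored = to_person_age("B", age_at_baseline=70, years_followed=2, died=False)
for row in deceased + censored:
    print(row)
```

A person who is censored contributes only rows with `died` equal to 0, whereas a person who dies contributes survival rows followed by a single death row.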

Main variables: outcome, mediator, and confounders

Our outcome of interest is a dichotomous indicator for whether an individual died or survived to the next age and our primary mediator is a dichotomous indicator for whether an individual reported ever regularly smoking cigarettes. We adjust for the following potential confounders of the smoking–mortality relationship: age, how frequently an individual reported drinking alcohol, schooling, urbanicity, and marital status.

Step 0: Specify a summary measure, contrast, and counterfactual distribution

Our main summary measure is the age-adjusted 1-year risk of death (that is, of not surviving to the next age). For this summary measure, our contrast of interest is the risk ratio of mortality for men relative to women. We construct this contrast using a Poisson regression of the death indicator on sex, fitted to person-year observations and adjusted for age using indicator variables for 5-year age groups, where α1, the coefficient on sex, is our estimate of interest. We use the Poisson regression here only to estimate the summary contrast (the exponent of α1); we could alternatively have estimated an age-standardized risk ratio directly from the data. Importantly, because we are interested in the observed difference between men and women (adjusting for just age), we do not add any confounders to this model. For this analysis, we set the smoking levels among men to be equal to those among women as our counterfactual scenario.
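One plausible written form of this regression is given below. Only α1 is named in the text; the remaining symbols are our own assumptions: $D_{ia}$ is the death indicator for individual $i$ at age $a$, $\mathrm{male}_i$ is a sex indicator coded so that $\exp(\alpha_1)$ compares men with women, and $G_k(a)$ are the indicator variables for the 5-year age groups.

```latex
\[
\log \mathrm{E}\left[D_{ia}\right]
\;=\;
\alpha_0 + \alpha_1\,\mathrm{male}_i + \sum_{k} \gamma_k\, G_k(a),
\qquad
\mathrm{RR}_{\text{men vs women}} = \exp(\alpha_1)
\]
```

If sex were instead coded with a female indicator, the same risk ratio would be $\exp(-\alpha_1)$, so the coding should be checked against the fitted model before interpreting the coefficient.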

Step 1: Estimate relationships in the data (using regression models)

Mediator model

We model the probability of ever regularly smoking for men and women using a logistic-regression model. The dependent variable is a binary indicator for whether an individual self-reported ever regularly smoking; the covariates are an indicator variable for female, a continuous measurement of age and the confounders described previously. We use this model to estimate the group → causes association pathway in Figure 1B. We include the confounders in this model, not to adjust for confounding, but rather to allow us to predict and match the sex-specific smoking prevalence within confounder strata.

Outcome model

We model mortality as a function of smoking, sex and the confounders by fitting a logistic-regression model of the death indicator on these variables. We use this model to estimate the causes → outcome effect pathway in Figure 1B.
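The mediator and outcome models can be written out under assumed notation. None of the symbols below come from the source: $S_i$ is the ever-smoking indicator, $D_i$ the death indicator, and $\mathbf{C}_i$ the vector of remaining confounders (alcohol use, schooling, urbanicity and marital status). The sketch also assumes main effects only, since the text does not say whether interaction terms were included.

```latex
\begin{align}
\operatorname{logit}\Pr(S_i = 1)
  &= \beta_0 + \beta_1\,\mathrm{female}_i + \beta_2\,\mathrm{age}_i
     + \boldsymbol{\beta}_3^{\top}\mathbf{C}_i \\
\operatorname{logit}\Pr(D_i = 1)
  &= \delta_0 + \delta_1\,S_i + \delta_2\,\mathrm{female}_i + \delta_3\,\mathrm{age}_i
     + \boldsymbol{\delta}_4^{\top}\mathbf{C}_i
\end{align}
```

In the counterfactual scenario, a man's smoking status is drawn from the first model with $\mathrm{female}_i$ set to 1, while the second model is still evaluated with his own sex.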

Steps 2 and 3: simulation to form the natural-course and counterfactual pseudo-populations

Based on the results of the two models, we simulate the natural-course and counterfactual pseudo-populations for both men and women. In Figure 2, we provide a step-by-step example of how to use the regression estimates to form the simulated values for a single male individual in the data. The pseudocode in Figure 3 and R code in the Supplementary Material, available as Supplementary data at IJE online, demonstrate how to do this for all individuals in the data using common statistical software.
Figure 2

Flowchart for simulating the natural-course and counterfactual smoking and mortality values for a single male in the data. The regression estimates are based on the models described in the ‘Methods’ section.

Figure 3

Example code for estimating the contribution of smoking to sex differences in mortality in South Korea. For this example, we have a binomial mediator ‘smoke’ (ever-smoker) and a binomial outcome ‘died’ (death in a person-year), the contrast of interest is the age-adjusted mortality risk ratio and, for the counterfactual scenario, we assign men the smoking distribution of women. In the models, C represents covariates needed for exchangeability.
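The single-individual simulation described for Figure 2 can be sketched as follows. The coefficients are invented for illustration and are not the fitted estimates, confounders other than age are left out for brevity, and all names are hypothetical.

```python
import math
import random

def inv_logit(x):
    """Convert a linear predictor on the logit scale into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical coefficients, invented for illustration only.
MEDIATOR = {"intercept": 0.5, "female": -3.5, "age": -0.01}
OUTCOME = {"intercept": -8.0, "smoke": 0.5, "female": -0.6, "age": 0.08}

def p_smoke(female, age):
    return inv_logit(MEDIATOR["intercept"] + MEDIATOR["female"] * female
                     + MEDIATOR["age"] * age)

def p_died(smoke, female, age):
    return inv_logit(OUTCOME["intercept"] + OUTCOME["smoke"] * smoke
                     + OUTCOME["female"] * female + OUTCOME["age"] * age)

def simulate_male(age, rng):
    """Natural-course and counterfactual draws for one man of a given age."""
    result = {}
    for scenario, female_for_mediator in (("natural", 0), ("counterfactual", 1)):
        prob_smoke = p_smoke(female_for_mediator, age)
        smoke = int(rng.random() < prob_smoke)       # Bernoulli draw of the mediator
        # The outcome model always uses his own sex; only the mediator changes.
        prob_die = p_died(smoke, 0, age)
        died = int(rng.random() < prob_die)          # Bernoulli draw of the outcome
        result[scenario] = {"p_smoke": prob_smoke, "smoke": smoke,
                            "p_died": prob_die, "died": died}
    return result

example = simulate_male(age=65, rng=random.Random(7))
print(example)
```

Repeating these draws for every individual, and for women using their own mediator model in both scenarios, yields the two pseudo-populations.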


Step 4: Calculate and compare the contrasts of interest and determine the percent contribution of smoking

Once pseudo-populations have been created, the final step is to calculate the contrast of interest. We then estimate the contribution of smoking to sex differences in mortality by measuring how much the contrast changes between the natural-course and counterfactual worlds. All steps needed to estimate the decomposition are also shown as pseudocode in Figure 3. We also provide code for how to estimate the example in R using our function cfdecomp in the Supplementary Material, available as Supplementary data at IJE online.

Results

Descriptive characteristics.

The mean age was 66.2 years for men and 67.4 years for women (Table 1). A greater share of men were currently married compared with women (93% compared with 64%) due to a much higher proportion of widowhood among women (33% compared with 5%). There were important health and socio-economic differences between men and women. Men were far more likely to have ever smoked regularly (61% compared with 4%) and to drink regularly (proportion who reported drinking at least once a week: 41% compared with 4%). Men were also substantially more likely to have completed more than middle school (46% compared with 17%).
Table 1

Descriptive characteristics of the sample at baseline, in adults aged ≥50 years, Korean Longitudinal Study of Aging, 2006

                                  Men                     Women
                                  Mean        SD          Mean        SD
Age (years)                       66.2        9.0         67.4        9.9

                                  Proportion  n           Proportion  n
Marital status
 Never married                    0.01        105         0.00        100
 Married/partnered                0.93        17 147      0.64        15 350
 Separated/divorced               0.02        349         0.02        499
 Widowed                          0.05        893         0.33        7962
Completed schooling
 None                             0.09        1706        0.31        7299
 Elementary or middle             0.45        8249        0.53        12 574
 More than middle                 0.46        8539        0.17        4038
Rural                             0.27        4987        0.27        6534
Ever-smoker                       0.61        11 276      0.04        1015
Alcohol consumption
 None/less than once a month      0.43        7868        0.87        20 808
 One to several times a month     0.16        3040        0.08        2000
 One to several times a week      0.28        5119        0.04        906
 Most days of the week            0.05        935         0.00        113
 Every day of the week            0.08        1532        0.00        84

Decomposition of the age-adjusted 1-year risk of mortality.

Men were 1.89 times [95% confidence interval (CI): 1.65, 2.14] as likely as women to die within 1 year of an interview, after adjusting for age (Table 2). After setting men to have the same smoking distribution as women, this risk ratio reduced to 1.65 (95% CI: 1.38, 1.92). The resulting change corresponds to a (1 – 0.65/0.89) = 28% (95% CI: 8%, 47%) contribution of smoking to sex differences in the age-adjusted 1-year risk of mortality.
Table 2

Estimates of the contribution of smoking to the age-adjusted 1-year mortality risk ratio using the counterfactual decomposition method, Korean Longitudinal Study of Aging, 2006–2012

                                                 Natural-course RR (95% CI)   Counterfactual RR (95% CI)   Percent contribution (95% CI)
Mortality risk ratio for men relative to women   1.89 (1.65, 2.14)            1.65 (1.38, 1.92)            28% (8%, 47%)
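The arithmetic behind the percent contribution can be checked directly from the two reported risk ratios. The helper name is hypothetical, and the inputs are the rounded values from Table 2, so the result should be read as approximate.

```python
def percent_contribution(rr_natural, rr_counterfactual):
    """Share of the excess risk ratio (RR - 1) removed in the counterfactual scenario."""
    return 1 - (rr_counterfactual - 1) / (rr_natural - 1)

value = percent_contribution(1.89, 1.65)
print(f"{value:.3f}")
```

With the rounded inputs the result is close to 0.27. We assume that the reported 28% reflects unrounded estimates or the averaging over draws, which would also be consistent with the 27% quoted elsewhere in the text.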

Discussion

We introduce a general yet easily applied procedure for implementing counterfactual decompositions using the parametric g-formula and Monte Carlo integration. We demonstrate this approach by estimating the contribution of smoking to sex differences in mortality in South Korea by decomposing the age-adjusted mortality risk ratio for men relative to women. We find that the large smoking difference between men and women in South Korea explains 27% of the age-adjusted mortality risk ratio among adults aged ≥50 years.

The age-adjusted mortality risk could also be decomposed using closed-form decomposition equations. The algorithm we outline does not replace closed-form decomposition approaches, but rather provides an alternative using simulations, which provides two main advantages. First, we can decompose summary measures based on any outcome distribution in the generalized linear model family without having to derive or use separate decomposition equations depending on whether an outcome is binomially, Poisson, or normally distributed. Moving between outcome distributions simply requires changing the regression type used to model the outcome in the decomposition algorithm. The second advantage of the simulation algorithm is that we can easily switch between different contrasts, since we effectively regenerate entire micropopulations for the observed and counterfactual worlds. For example, once natural-course and counterfactual pseudo-populations have been generated, we decompose the risk ratio by estimating Poisson regressions of mortality on sex within both pseudo-populations and measuring how the risk ratio changes between the natural-course and counterfactual worlds. If we were instead interested in decomposing the odds ratio, we would simply switch from Poisson to logistic regressions and compare the odds ratios.

Despite these advantages, our algorithm comes with important trade-offs compared with existing decomposition implementations. Compared with the closed-form equations, our approach requires substantial computational power and time. This is not a trivial consideration: decompositions with large data sets may take hours or even days to complete, even when considerable computational power is available. Furthermore, as with any method seeking to provide causal explanations, the causal validity of the decomposition results hinges on assumptions of exchangeability (also known as no unmeasured confounding), common support (positivity), and consistency. We discuss these three issues in more detail in Supplementary Appendix 1, available as Supplementary data at IJE online, for interested readers.
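The point about switching contrasts can be illustrated with a small sketch. The risks below are invented, and the contrasts are computed from marginal risks rather than from age-adjusted Poisson or logistic regressions, which is a simplification of the procedure described above. All names are hypothetical.

```python
def contrasts(risk_men, risk_women):
    """Several disparity measures computed from the same pair of risks."""
    return {
        "risk_ratio": risk_men / risk_women,
        "risk_difference": risk_men - risk_women,
        "odds_ratio": (risk_men / (1 - risk_men)) / (risk_women / (1 - risk_women)),
        "survival_ratio": (1 - risk_men) / (1 - risk_women),
    }

# Value of each contrast when there is no disparity.
NULL = {"risk_ratio": 1.0, "risk_difference": 0.0,
        "odds_ratio": 1.0, "survival_ratio": 1.0}

# Invented 1-year risks of death in the two pseudo-populations.
natural = contrasts(risk_men=0.030, risk_women=0.016)
counterfactual = contrasts(risk_men=0.026, risk_women=0.016)

shares = {}
for name in natural:
    excess_nc = natural[name] - NULL[name]
    excess_cf = counterfactual[name] - NULL[name]
    shares[name] = 1 - excess_cf / excess_nc
    print(f"{name}: contribution = {shares[name]:.3f}")
```

In this toy case the risk-ratio and risk-difference contributions coincide because the women's risk is identical in both scenarios, whereas the odds-ratio contribution differs slightly. Once the pseudo-populations exist, changing scale only requires recomputing the contrast.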

Conclusions

Decomposing the sources of differences in health and other outcomes is a key research endeavour in epidemiology and other population health sciences. We describe an implementation of the counterfactual decomposition that builds on and generalizes the rich existing body of work on decomposition methods in the health and social sciences. The approach provides a highly flexible and easily implemented way of estimating decompositions that are grounded in potential outcomes and counterfactual theory, and applicable to a wide range of population health questions.

Supplementary data

Supplementary data are available at IJE online.

Ethics approval

This study uses publicly available and de-identified secondary data, and was exempt from institutional review-board approval.

Funding

N.S. receives funding from the Alexander von Humboldt Foundation.

Data availability

Data are freely available (after registration) at g2aging.org.

Conflicts of interest

None declared.
References (10 of 18 listed)

1.  A general approach to causal mediation analysis.

Authors:  Kosuke Imai; Luke Keele; Dustin Tingley
Journal:  Psychol Methods       Date:  2010-12

2.  Interventional Effects for Mediation Analysis with Multiple Mediators.

Authors:  Stijn Vansteelandt; Rhian M Daniel
Journal:  Epidemiology       Date:  2017-03       Impact factor: 4.822

3.  Decomposition Analysis to Identify Intervention Targets for Reducing Disparities.

Authors:  John W Jackson; Tyler J VanderWeele
Journal:  Epidemiology       Date:  2018-11       Impact factor: 4.822

4.  The parametric g-formula for time-to-event data: intuition and a worked example.

Authors:  Alexander P Keil; Jessie K Edwards; David B Richardson; Ashley I Naimi; Stephen R Cole
Journal:  Epidemiology       Date:  2014-11       Impact factor: 4.822

5.  Parametric Mediational g-Formula Approach to Mediation Analysis with Time-varying Exposures, Mediators, and Confounders.

Authors:  Sheng-Hsuan Lin; Jessica Young; Roger Logan; Eric J Tchetgen Tchetgen; Tyler J VanderWeele
Journal:  Epidemiology       Date:  2017-03       Impact factor: 4.822

6.  Measuring and explaining the change in life expectancies.

Authors:  E E Arriaga
Journal:  Demography       Date:  1984-02

7.  The parametric g-formula to estimate the effect of highly active antiretroviral therapy on incident AIDS or death.

Authors:  Daniel Westreich; Stephen R Cole; Jessica G Young; Frank Palella; Phyllis C Tien; Lawrence Kingsley; Stephen J Gange; Miguel A Hernán
Journal:  Stat Med       Date:  2012-04-11       Impact factor: 2.373

8.  A decomposition method based on a model of continuous change.

Authors:  Shiro Horiuchi; John R Wilmoth; Scott D Pletcher
Journal:  Demography       Date:  2008-11

9.  Socioeconomic Mediators of Racial and Ethnic Disparities in Congenital Heart Disease Outcomes: A Population-Based Study in California.

Authors:  Shabnam Peyvandi; Rebecca J Baer; Anita J Moon-Grady; Scott P Oltman; Christina D Chambers; Mary E Norton; Satish Rajagopal; Kelli K Ryckman; Laura L Jelliffe-Pawlowski; Martina A Steurer
Journal:  J Am Heart Assoc       Date:  2018-10-16       Impact factor: 5.501

10.  Rural-Urban Differences in Adult Life Expectancy in Indonesia: A Parametric g-formula-based Decomposition Approach.

Authors:  Nikkil Sudharsanan; Jessica Y Ho
Journal:  Epidemiology       Date:  2020-05       Impact factor: 4.860
