Literature DB >> 27549016

Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis.

Abstract

Propensity score methods are used to reduce the effects of observed confounding when using observational data to estimate the effects of treatments or exposures. A popular method of using the propensity score is inverse probability of treatment weighting (IPTW). When using this method, a weight is calculated for each subject that is equal to the inverse of the probability of receiving the treatment that was actually received. These weights are then incorporated into the analyses to minimize the effects of observed confounding. Previous research has found that these methods result in unbiased estimation when estimating the effect of treatment on survival outcomes. However, conventional methods of variance estimation were shown to result in biased estimates of standard error. In this study, we conducted an extensive set of Monte Carlo simulations to examine different methods of variance estimation when using a weighted Cox proportional hazards model to estimate the effect of treatment. We considered three variance estimation methods: (i) a naïve model-based variance estimator; (ii) a robust sandwich-type variance estimator; and (iii) a bootstrap variance estimator. We considered estimation of both the average treatment effect and the average treatment effect in the treated. We found that the use of a bootstrap estimator resulted in approximately correct estimates of standard errors and confidence intervals with the correct coverage rates. The other estimators resulted in biased estimates of standard errors and confidence intervals with incorrect coverage rates. Our simulations were informed by a case study examining the effect of statin prescribing on mortality.

Entities: Chemical Disease Gene Species

Keywords: Monte Carlo simulations; inverse probability of treatment weighting (IPTW); observational study; propensity score; survival analysis; variance estimation

Mesh：

Year: 2016 PMID： 27549016 PMCID： PMC5157758 DOI： 10.1002/sim.7084

Source DB: PubMed Journal: Stat Med ISSN： 0277-6715 Impact factor: 2.373

Introduction

Observational studies are increasingly being used to estimate the effects of treatments, interventions and exposures on outcomes. Confounding is an important issue in such studies as treated subjects often differ systematically from control subjects in terms of baseline covariates that are prognostically important. Statistical methods must be used to remove or minimize the effect of confounding due to measured covariates so that valid inferences on treatment effects can be drawn from observational studies. An important class of bias‐reduction methods are those that are based on the propensity score 1. The propensity score is the probability of treatment assignment conditional on measured baseline covariates. These methods are increasingly being used to reduce or minimize the confounding that occurs in observational studies. There are four ways of using the propensity score to reduce confounding: matching on the propensity score, stratification on the propensity score, inverse probability of treatment weighting (IPTW) using the propensity score and covariate adjustment using the propensity score 1, 2, 3. These methods are frequently used in the biomedical literature 4, 5. Survival or time‐to‐event outcomes occur frequently in the medical and epidemiological literature 6. Recent studies have examined the performance of different propensity score methods for estimating the effect of treatment on survival outcomes 7, 8. In one study, it was shown that both matching on the propensity score and IPTW using the propensity score resulted in unbiased estimation of marginal hazard ratios 8. However, IPTW using the propensity score with the conventional variance estimator resulted in biased estimation of standard errors and confidence intervals whose empirical coverage rates differed from the advertised rates. Variance estimation when matching on the propensity score 7, 9, 10, 11, 12 or when using IPTW with linear treatment effects 13 has received a moderate degree of attention. However, little attention has been focused on variance estimation when using IPTW with survival outcomes. The objective of the current study was to examine variance estimation when using IPTW using the propensity score to estimate the effect of treatment on survival or time‐to‐event outcomes when using a Cox proportional hazards model. The paper is structured as follows: In Section 2, we briefly describe different methods for variance estimation when using IPTW using the propensity score with survival outcomes. In Section 3, we introduce a brief case study illustrating the application of these methods. In Section 4, we describe an extensive series of Monte Carlo simulations that were performed to compare the performance of different variance estimators. In Section 5, we report the results of these simulations. Finally, in Section 6, we summarize our findings and place them in the context of the existing literature.

Using propensity score weighting to estimate the effect of treatment on survival outcomes

We use the following notation throughout this section. Let Z denote an indicator variable denoting treatment status (Z = 1 for active treatment of interest vs. Z = 0 for the control treatment), while e denotes the propensity score. The propensity score is typically estimated using a logistic regression model in which the binary indicator variable denoting treatment status is regressed on observed baseline covariates. Alternatively, methods from the machine learning literature, such as random forests or generalized boosting methods can be used 14, 15, 16. Variable selection for the propensity score model has been considered elsewhere 17. The inverse probability of treatment weights (IPTWs) are defined as 13. Thus, each subject is weighted by the reciprocal of the probability of receiving the treatment that the subject actually received. Weighting the sample using these weights results in a synthetic sample in which observed baseline covariates are not confounded with treatment assignment. The use of these weights allows one to estimate the average treatment effect (ATE). We refer to these weights as the conventional IPTW‐ATE weights. An alternative to the conventional IPTW‐ATE weights are the stabilized IPTW‐ATE weights: 18, 19. In defining the stabilized weights, the reciprocal of the probability of receiving a given treatment is multiplied by the marginal probability of receiving the given treatment. Stabilized weights are intended to reduce variability due to instability in estimation that can be induced by subjects with very large weights. Using weights equal to allows one to estimate the average treatment effect in the treated (ATT) 20. We refer to these weights as IPTW‐ATT weights. With these weights, treated subjects receive a weight of one, while the control subjects receive a weight of the odds of receiving the active treatment. Thus, the population of treated subjects serves as the reference population to which each of the treated and control populations are standardized. To estimate the effect of treatment on the hazard of the occurrence of the outcome, one can use a weighted Cox regression model to regress survival on an indicator variable denoting treatment status 8, 21. Three possible methods can be used to estimate the variance of the estimated treatment effect. First, one could use the naïve model‐based variance estimator from the maximum partial likelihood estimator for the Cox proportional hazards model. Second, one could use a robust sandwich‐type variance estimator of the type proposed by Lin 22. The use of this estimator with IPTW regression models was proposed by Joffe et al. 21. A rationale for the use of a robust variance estimator is provided by Hernan et al., who note that the use of weights induces a within‐subject correlation in outcomes as observations can have weights that are unequal to one another 23, 24. Xu et al. showed that the use of conventional IPTW‐ATE weights tended to result in a doubling of the number of subjects in the analytic sample 25. The use of a robust variance estimator accounts for the lack of independence in replications of subjects induced by weighting. The third approach to variance estimation is to use a bootstrap‐based method to estimate the variability of the estimated treatment effect. Of these three variance estimation methods, the use of a robust variance estimator appears to be the most frequent. However, the relative performance of the three different methods is not known. Statistical software code in R and SAS for fitting the weighted Cox regression model with robust standard errors is provided in the appendix.

Case study

We provide an empirical example to illustrate the application of the methods described in the previous section.

Data and analyses

We used data from a previously published tutorial on the use of propensity score methods with survival data 26, 27. The sample consisted of 9107 subjects patients discharged from hospital with acute myocardial infarction (AMI) in Ontario, Canada. Data on patient characteristics were obtained by retrospective chart review by trained cardiovascular research nurses. These data were collected as part of the Enhanced Feedback for Effective Cardiac Treatment Study 28. The exposure of interest was whether the patient received a prescription for a statin lipid lowering agent at hospital discharge. Three thousand and forty‐nine subjects (33.5%) received a statin prescription at hospital discharge. Patients were followed for up to eight years post‐discharge. Patients who survived to eight years post‐discharge had their survival times treated as censored observations. For the purposes of this case study, we considered 10 baseline characteristics: age, systolic blood pressure at admission, respiratory rate at admission, creatinine, history of previous cardiovascular revascularization procedure (percutaneous coronary intervention or coronary artery bypass graft surgery), diabetes, current smoker, history of hyperlipidemia, history of dementia and history of a previous AMI. The first four are continuous covariates while the last six are binary. Statin exposure was regressed on the set of 10 baseline covariates using a logistic regression model. All 10 covariates were statistically significantly associated with statin prescribing at hospital discharge (P < 0.032). The hazard of post‐discharge mortality was regressed on the 10 covariates using a Cox proportional hazards regression model. All 10 covariates were significantly associated with the hazard of death (P < 0.002). The corresponding odds ratios (for exposure) and hazard ratios (for death) are reported in Table 1.

Table 1

Odds ratios and hazard ratios for treatment selection and mortality.

Variable	Odds ratio for statin prescribing (95% CI)	Hazard ratio for death (95% CI)
Age (per year increase)	0.980 (0.976, 0.984)	1.075 (1.071, 1.078)
Systolic blood pressure (per mmHg increase)	1.002 (1.001, 1.004)	0.996 (0.995, 0.997)
Respiratory rate (per breath per minute)	0.985 (0.975, 0.995)	1.036 (1.031, 1.041)
Creatinine (per µmol/L)	0.999 (0.998, 1.000)	1.003 (1.003, 1.003)
Previous revascularization	1.323 (1.117, 1.568)	1.201 (1.074, 1.343)
Diabetes	0.849 (0.758, 0.951)	1.766 (1.647, 1.893)
Current smoker	0.875 (0.783, 0.977)	1.247 (1.149, 1.355)
Hyperlipidemia	5.307 (4.804, 5.863)	0.850 (0.785, 0.92)
Dementia	0.488 (0.331, 0.72)	1.727 (1.508, 1.978)
Previous AMI	1.196 (1.056, 1.354)	1.424 (1.321, 1.534)

Odds ratios and hazard ratios for treatment selection and mortality. The propensity score was estimated for each subject using the logistic regression model estimated above. The three sets of weights described in Section 2 were calculated (conventional IPTW‐ATE weights, stabilized IPTW‐ATE weights, IPTW‐ATT weights). The effect of statin prescribing on the hazard of post‐discharge mortality was estimated using a weighted Cox proportional hazards model. The three different variance estimators described above were used to construct 95% confidence intervals for the effect of treatment on the hazard of mortality. When using bootstrapping, 200 bootstrap samples were drawn from the study sample 29 (Efron and Tibshirani suggest that 200 bootstrap samples are generally sufficient for estimating a standard error (page 52)). In each bootstrap sample, the propensity score was estimated by regressing treatment status on the 10 baseline covariates using a logistic regression model. The hazard of death was regressed on an indicator variable denoting treatment status using a weighted Cox model fit to the bootstrap sample. This was done for the three different sets of weights. The standard deviation of the estimated log‐hazard ratios (i.e., the estimated regression coefficient for the treatment status indicator) across the 200 bootstrap samples was used as the bootstrap estimate of the standard error of the estimated regression coefficient obtained in the original analytic sample. Ninety‐five percent confidence intervals were constructed by as , where denotes the estimated effect in the original analytic sample, and denotes the estimated standard error of the estimated treatment effect (obtained using either the naïve variance estimator, the robust sandwich‐type variance estimator, or bootstrapping).

Results

The nine estimated hazard ratios with associated 95% confidence intervals (three different sets of weights × three methods for variance estimation) are reported in Table 2. Several observations warrant comment. First, the estimates of the ATE are further from the null than are the estimates of the ATT (hazard ratios of 0.76 vs. 0.90). An interpretation of this result is that the effect of statin prescribing on mortality is stronger in the entire AMI population than it is in those AMI patients who ultimately received a statin at hospital discharge. Second, when estimating the ATE, variance estimates were substantially smaller when a naïve variance estimator was used compared to when a robust variance estimator or a bootstrap estimator was used. Third, differences between the robust variance estimator and the bootstrap estimator were negligible. Fourth, when estimating the ATE, the choice of weight (IPTW‐ATE weight vs. the stabilized ATE weight) had essentially no impact upon the estimate of the point estimate of statin efficacy.

Table 2

Estimated treatment effects, variance estimates and 95% confidence intervals from case study.

Estimation method	Estimated log‐hazard ratio (standard error)	Estimated hazard ratio (95% CI)
Estimates of the ATE
IPTW weights – naïve variance estimate	−0.2729 (0.0248)	0.76 (0.73, 0.80)
IPTW weights – robust variance estimate	−0.2729 (0.0436)	0.76 (0.70, 0.83)
IPTW weights – bootstrap variance estimate	−0.2729 (0.0425)	0.76 (0.70, 0.83)
Stabilized weights – naïve variance estimate	−0.2722 (0.0378)	0.76 (0.71, 0.82)
Stabilized weights – robust variance estimate	−0.2722 (0.0435)	0.76 (0.70, 0.83)
Stabilized weights – bootstrap variance estimate	−0.2722 (0.0459)	0.76 (0.70, 0.83)
Estimates of the ATT
Naïve variance estimation	−0.1082 (0.0463)	0.90 (0.82, 0.98)
Robust variance estimation	−0.1082 (0.0461)	0.90 (0.82, 0.98)
Bootstrap variance estimation	−0.1082 (0.0488)	0.90 (0.82, 0.99)

Estimated treatment effects, variance estimates and 95% confidence intervals from case study.

Monte Carlo simulations—methods

We used an extensive series of Monte Carlo simulations to examine the performance of different variance estimators when using IPTW with survival outcomes. We based our simulations on the empirical analyses conducted in the previous section so that our simulations would reflect the empirical data on which the case study was based. We simulated data for a setting in which there were 10 baseline covariates (X 1 − X 10). The first four were continuous while the last six were binary (to reflect the setting of the case study). Using the data from the case study, we standardized the four continuous covariates so that they had mean zero and unit variance. We estimated the Pearson variance–covariance matrix of the 10 baseline covariates. We simulated 10 baseline covariates for each subject from a multivariate normal distribution with mean vector equal to zero and with variance–covariance matrix equal to that estimated above. Each of the last six simulated covariates was used to create a binary variable with a prevalence equal to that of one of the binary covariates in the case study. This was done by determining whether the simulated normally distributed variable lay above or below a specified quantile of the given normal distribution. As in the case study, we assumed that each of the 10 simulated baseline covariates influenced treatment assignment. For each subject, the probability of treatment selection was determined from the following logistic model: logit(p ) = α 0,treat + α 1 x 1 + α 2 x 2 + α 3 x 3 + α 4 x 4 + α 5 x 5 + α 6 x 6 + α 7 x 7 + α 8 x 8 + α 9 x 9 + α 10 x 10. A bisection approach, as described in previous publications, was used to determine the intercept of the treatment‐selection model (α0,treat) so that the proportion of subjects in the simulated sample that were treated was fixed at the desired proportion 8. The regression coefficients (α 1, …, α 10) were set to the values estimated using the logistic regression model in the case study (with the modification that the continuous variables had been standardized to mean zero and unit variance). For each subject, treatment status was generated from a Bernoulli distribution with subject‐specific parameter: Z ~ Be(p ). We then generated a time‐to‐event outcome for each subject using a data‐generating process for time‐to‐event outcomes described by Bender et al. 30. For each subject, the linear predictor was defined as As in the case study, the 10 baseline covariates all affected the hazard of the outcome. The regression coefficients (β 1, …, β 10) were set to the values estimated in the case study (with the modification that the continuous variables had been standardized to mean zero and unit variance). The value of the log‐hazard ratio for the treatment indicator (β treat) was allowed to vary in the Monte Carlo simulations. For each subject, we generated a random number from a standard uniform distribution: u ~ U(0,1). An event time was generated for each subject as follows: . We set λ and η to be equal to 0.00002 and 2, respectively, as was done in previous studies 8, 31. The use of this data‐generating process results in a conditional treatment effect, with a conditional hazard ratio of exp(βtreat). However, IPTW using the propensity score estimates the marginal or population‐average effect. To determine the true marginal hazard ratio (this was necessary to permit estimation of empirical coverage rates of estimated confidence intervals), we generated a very large dataset consisting of 1 000 000 subjects using the methods described in the previous paragraphs with one important modification. For each subject, we simulated two outcomes: the two potential outcomes under the treatment and control conditions. Thus, we simulated an outcome for each subject under the assumption that they were treated. We then simulated an outcome for each subject under the assumption that they were under the control condition. Then in the dataset consisting of both outcomes for each subject, we regressed the hazard of the outcome on the indicator variable denoting treatment status. The estimated log‐hazard ratio will serve as the true marginal ATE. We then repeated the above regression analysis, restricting the analysis sample to those subjects who were ultimately assigned to the treatment in the large simulated population. The estimated log‐hazard ratio will serve as the true marginal ATT. We allowed the following factors to vary in our Monte Carlo simulations: the percentage of subjects that were treated (5%, 10%, 20%, 30%, 40%, and 50%) and the true conditional hazard ratio (1, 2, 3, 4, and 5). We thus examined 30 scenarios (6 treatment prevalences × 5 conditional hazard ratios). In each scenario, we simulated 1000 datasets, each consisting of 1000 subjects. Within each simulated dataset we did the following: We estimated the propensity score using a logistic regression model to regress treatment status on the 10 baseline covariates. In each of the 1000 simulated datasets for each scenario, we estimated the log‐hazard ratio and its standard error using the methods described in Section 2. Thus, we used three different sets of weights: the conventional IPTW‐ATE weights, the stabilized IPTW‐ATE, and the IPTW‐ATT weights. We used three different variance estimators: the naïve model‐based estimates, the robust sandwich‐type variance estimators, and a bootstrap estimate based on 200 bootstrap samples. Let and denote the estimated log‐hazard ratio and its estimated standard error, respectively, obtained from the ith simulated dataset using a given method. The estimated 95% confidence interval for the estimated log‐hazard ratio was computed as . We examined the accuracy of variance estimation three different ways. First, we examined the accuracy with which the estimated standard error of the estimated log‐hazard ratio estimated the sampling variability of the estimated log‐hazard ratio. To do so, we compared two quantities. We determined the mean standard error of the log‐hazard ratio across the 1000 iterations: . We also determined the standard deviation of the estimated log‐hazard ratios across the 1000 simulated datasets: sd . The first quantity estimates the mean standard error, while the second quantity is the empirical estimate of the sampling variability of the log‐hazard ratio. We then determined the ratio of these two quantities: . If the ratio equals one, then the estimated standard error of the log‐hazard ratio is correctly estimating the sampling variability of the estimated log‐hazard ratio. Second, we determined the proportion of estimated 95% confidence intervals that covered the true log‐hazard ratio across the 1000 iterations of each scenario. Third, we determined the mean length of the estimated 95% confidence intervals across the 1000 iterations of each scenario.

Monte Carlo simulations—results

The results of the Monte Carlo simulations are reported in Figures 1 through 6. The first three figures summarize results for estimation of the ATE, while the last three figures summarize results for the estimation of the ATT. Each figure consists of six panels, with one panel for each of the true conditional hazard ratios used in the data‐generating process.

Figure 1

Ratio of mean estimated standard error to empirical standard deviation of sampling distribution: ATE.

Estimation of the ATE

The ratio of the mean estimated standard error to the empirical standard deviation of the sampling distribution of the log‐hazard ratio is reported in Figure 1. In general, the bootstrap variance estimator resulted in a ratio that was approximately equal to one, indicating that the bootstrap estimator correctly approximated the standard deviation of the empirical sampling distribution of the log‐hazard ratio. This was true regardless of whether the conventional IPTW‐ATE or the stabilized IPTW‐ATE weights were used. The naïve variance estimator with the conventional IPTW‐ATE weights tended to substantially under‐estimate the standard deviation of the sampling distribution of the log‐hazard ratio. The remaining estimators tended to over‐estimate the sampling variability, particularly when the prevalence of treatment was above 25%. It is important to note that, in general, the robust variance estimator over‐estimated the sampling variability of the log‐hazard ratio, regardless of which set of weights was used. The empirical coverage rates of the estimated 95% confidence intervals are reported in Figure 2. Due to our use of 1000 iterations within each scenario, any empirical coverage rate that exceeded 0.9635 or that was less than 0.9365 would be statistically significantly different from the nominal rate of 0.95 based on a standard normal‐theory test. Horizontal lines denoting these thresholds, along with the nominal rate of 0.95 have been superimposed on each panel. The use of the naïve variance estimator with the conventional IPTW‐ATE weights resulted in estimated 95% confidence intervals whose empirical coverage rates were substantially lower than the nominal level. The two bootstrap estimators (using conventional IPTW‐ATE weights or the stabilized IPTW‐ATE weights) tended to result in estimates with approximately correct coverage rates. The remaining three methods (robust estimator with conventional IPTW‐ATE weights, naïve estimator with stabilized IPTW‐ATE weights, robust estimator with stabilized IPTW‐ATE weights) frequently resulted in confidence intervals whose empirical coverage rates exceeded the nominal level.

Figure 2

Empirical coverage rates of estimated 95% confidence intervals: ATE.

Empirical coverage rates of estimated 95% confidence intervals: ATE. Because of the equivalence between confidence intervals covering a given value and hypothesis testing, the scenario in which the true hazard ratio was equal to one permits us to examine the type I error rate. The empirical type I error rate is equal to the proportion of estimated confidence that exclude the true value. In examining the upper left panel of Figure 2, one can see that the two bootstrap‐based estimators tended to have approximately correct type I error rates, while the other four methods tended to have incorrect type I error rates. In particular, the use of conventional IPTW‐ATE weights with a naïve variance estimator tended to result in substantially inflated type I error rates. The other three methods tended to result in type I error rates that were lower than the expected rate of 5%. The mean length of estimated 95% confidence intervals are reported in Figure 3. As was to be expected, based on the results reported above, the use of naïve variance estimator with the conventional IPTW‐ATE weights resulted in the narrowest confidence intervals. Among the remaining five estimators, the two bootstrap estimators (with the conventional IPTW‐ATE weights or with the stabilized IPTW‐ATE weights) tended to result in the narrowest 95% confidence intervals.

Figure 3

Mean length of estimated 95% confidence intervals: ATE.

Estimation of the ATT

The degree to which the estimated standard error correctly approximated the standard deviation of the empirical sampling distribution of the log‐hazard ratio is reported in Figure 4. The use of the bootstrap estimator tended to result in estimates of standard error that closely approximated the standard deviation of the empirical sampling distribution (Figure 4). In contrast to this, the naïve variance estimator and the robust variance estimator tended to result in inaccurate estimation of the standard error. Importantly, the robust variance estimator systematically over‐estimated the sampling variability of the estimated log‐hazard ratio. The use of the bootstrap estimator tended to result in estimated 95% confidence intervals with approximately correct coverage rates (Figure 5). The use of the other two variance estimators tended to result in 95% confidence intervals whose coverage rates differed from the nominal level. In examining the upper left panel of Figure 5, one observes that the bootstrap estimator tended to have an approximately correct type I error rate, while the other two methods tended to have incorrect type I error rates. Finally, the use of the bootstrap estimator tended to result in the narrowest 95% confidence intervals (Figure 6).

Figure 4

Ratio of mean estimated standard error to empirical standard of sampling distribution: ATT.

Figure 5

Empirical coverage rates of estimated 95% confidence intervals: ATT.

Figure 6

Mean length of estimated 95% confidence intervals: ATT.

Ratio of mean estimated standard error to empirical standard of sampling distribution: ATT. Empirical coverage rates of estimated 95% confidence intervals: ATT. Mean length of estimated 95% confidence intervals: ATT.

Discussion

We conducted an extensive series of Monte Carlo simulations to examine the performance of different variance estimators when using IPTW with the propensity score to estimate the effect of treatment on survival outcomes. We briefly summarize our findings and place them in the context of the existing literature. We found that, when estimating either the ATE or the ATT, a bootstrap‐based estimator accurately estimated the sampling variability of the log‐hazard ratio and tended to result in estimated confidence intervals that had approximately correct coverage rates. In contrast to this, the use of a robust variance estimator tended to result in estimated standard errors that were biased upwards, regardless of which set of weights was used. Furthermore, the use of naïve variance estimator with the conventional IPTW‐ATE weights resulted in estimates of standard error that were biased downwards. Finally, the bootstrap estimators tended to have an approximately correct type I error rates, while the other methods tended to have incorrect type I error rates. Based on the observations from our Monte Carlo simulations, we would encourage analysts always to use the bootstrap estimate of the standard error rather than using the naïve or robust variance estimator provided by statistical software packages. The naïve variance estimator with the conventional IPTW weights resulted in substantial bias in estimating the standard error across all simulation settings. Similarly, the robust variance estimator (with either the conventional IPTW weights or the stabilized weights) tended to result in greater bias compared to the use of the bootstrap. In some settings, the use of a naïve variance estimator with the stabilized weights resulted in estimates of standard error that were within 10% of the true standard deviation of the sampling distribution. This tended to occur when the hazard ratio was large (hazard ratio of 4 or 5) and the prevalence of treatment was moderate (0.2 to 0.5). However, even in these scenarios, it was preferable to use the bootstrap estimator. For these reasons, we recommend that the bootstrap estimator be used in all settings. Our findings regarding the use of the naïve variance estimator when using the IPTW‐ATE weights were not surprising. As noted above, Hernan et al. note that weighting induces a within‐subject correlation in outcomes in the weighted sample 23. Thus, the assumption of independence required by the maximum partial likelihood estimator is violated. In our simulation settings, ignoring the weighting induced an artificial inflation in the sample size which results in an artificially low estimate of the standard error of the estimated log‐hazard ratio. Thus, across all scenarios, we observed that the mean estimated standard error was substantially smaller than the empirical standard deviation of the sampling distribution of the estimated log‐hazard ratios. Methods for variance estimation when using propensity score methods have received only minor attention in the methodological literature. In the context of propensity score matching, studies using Monte Carlo simulations have found that variance estimation needs to account for the matched nature of the sample and that naïve estimates of variance are inappropriate 7, 8, 10, 11. When using propensity‐score matching with replacement, Abadie and Imbens found that the use of bootstrapping to estimate standard errors was inappropriate 9. However, when matching on the propensity score without replacement, Austin and Small found that bootstrapping performed well 12. When using IPTW with the propensity score to estimate linear treatment effects (i.e. differences in means or differences in proportions), Lunceford and Davidian derived formal variance estimators 13. When fitting a regression model in a sample weighted using the IPTW weights, Joffe et al. recommended using a robust variance estimator 21. The current study is, to the best of our knowledge, the first to comprehensively examine variance estimation when using IPTW to estimate the effect of treatment on survival outcomes. In a literature review reported in a recent article describing diagnostics for use with IPTW, it was observed that the frequency with which IPTW is being used is increasing rapidly over time 32. Given the increasing popularity of IPTW‐based methods and the frequency with which survival outcomes occur in the medical literature 6, the results described in the current study will likely be of interest to a wide body of researchers and applied analysts. A paper by Xu et al. is particularly relevant to the current study 25. They used Monte Carlo simulations to compare three estimation methods (conventional IPTW‐ATE weights with a naïve variance estimator vs. stabilized IPTW‐ATE weights with a naïve variance estimator vs. conventional IPTW‐ATE weights with a robust variance estimator) when using a logistic regression model to estimate the effect of treatment on a binary outcome. Their primary focus was on two issues: the size of the weighted sample and type I error rates. They found that the use of conventional IPTW‐ATE weights tended to result in a weighted sample that was approximately double the size of the unweighted sample, while the use of stabilized IPTW‐ATE weights tended to result in a weighted sample that was approximately the same size as the unweighted sample. Consequently, the use of conventional IPTW‐ATE weights with a naïve variance estimator resulted in inflated type I error rates. However, the use of stabilized IPTW‐ATE weights with a naïve variance estimator tended to result in approximately correct type I error rates. Finally, they found that the use of stabilized IPTW‐ATE weights with a robust variance estimator resulted in type I error below lower than the advertised rate of 5%. Their finding that the use of conventional IPTW‐ATE weights with a naïve variance estimator resulted in inflated type I error rates is similar to our finding that this estimator resulted in variance estimates that were too small (Figure 1) and that type I error rates were substantially inflated (top left pane of Figure 2). However, we also found that the use of stabilized IPTW‐ATE weights with a naïve variance estimator tended to result in biased estimation of variance, although to a lesser degree than was observed for the conventional IPTW‐ATE weights. These differences may be attributable to differences in the nature of the outcome (binary vs. time‐to‐event) or the simulation design (their simulation design used either a single binary covariate or a binary covariate and a continuous covariate while our design used 10 baseline covariates that were a mixture of continuous and dichotomous). A potential explanation for the sub‐optimal performance of the robust variance estimator is provided by a study by Williamson et al. that examined variance reduction in randomized trials by using IPTW using the propensity score 33. When considering continuous or binary outcomes, they derived a variance estimator for the appropriate measure of effect (difference of means for continuous outcomes; risk difference, relative risk and odds ratio for binary outcomes) that accounted for the fact that the weights had been estimated, rather than known with certainty. As mentioned above, Hernan et al. noted that weighting induced a within‐subject correlation in outcomes and that the use of robust variance estimator was encouraged to address lack of independence in the weighted sample 23. However, while the robust variance estimator accounts for the lack of independence, it does not account for the fact that the propensity score was estimated rather than known with certainty. In contrast to this, the use of bootstrapping implicitly accounts for the estimation of the propensity score as the propensity score is estimated within each bootstrap sample. Thus, the bootstrap estimate of the standard error incorporates sampling variability in the estimated propensity score. Unfortunately, the variance estimators proposed by Williamson et al. have not been extended to the case of survival outcomes. In subsequent research it would be important to extend the variance estimator of Williamson et al. for use with a weighted Cox regression model and to compare its performance to that of the methods considered in the current study. We suspect that its performance would be similar to that of bootstrapping, as both methods account for the estimation of the propensity score. However, such a derivation is beyond the scope of the current study. We began our analyses by conducting empirical analyses using patients discharged from hospital with a diagnosis of AMI. In this case study, we observed that the estimated standard error obtained using the naïve variance estimator with the conventional IPTW‐ATE weights was substantially smaller than those obtained using either the robust variance estimator or the bootstrap estimator. This observation reflects one of the findings of the subsequent Monte Carlo simulations in which we found that the naïve variance estimator with the conventional IPTW‐ATE weights resulted in substantial under‐estimation of the true sampling variability. While the case study was intended to illustrate the application of the different methods, it also served a secondary purpose. The analysis of the case study data provided parameter estimates that informed the design of the subsequent Monte Carlo simulations. Thus, the simulated data reflected the data used in the case study, leading the simulations to reflect a realistic clinical scenario. As such, our simulations were similar to the plasmode simulations described by Franklin et al. 34. In summary, the current study was motivated by prior research that observed that the use of a robust sandwich‐type variance estimator when using IPTW to estimate the effect of treatment on survival outcomes resulted in biased estimates of standard errors and confidence intervals with incorrect coverage rates 8. Based on the results of the current study, we recommend that the bootstrap be used to estimate standard errors and confidence intervals when fitting a Cox proportional hazards model in a sample weighted using the IPTW.

28 in total

Review 1. Principles for modeling propensity scores in medical research: a systematic literature review.

Authors: Sherry Weitzen; Kate L Lapane; Alicia Y Toledano; Anne L Hume; Vincent Mor
Journal: Pharmacoepidemiol Drug Saf Date: 2004-12 Impact factor: 2.890

2. Propensity score applied to survival data analysis through proportional hazards models: a Monte Carlo study.

Authors: Etienne Gayat; Matthieu Resche-Rigon; Jean-Yves Mary; Raphaël Porcher
Journal: Pharm Stat Date: 2012-03-12 Impact factor: 1.894

3. Type I error rates, coverage of confidence intervals, and variance estimation in propensity-score matched analyses.

Authors: Peter C Austin
Journal: Int J Biostat Date: 2009-04-14 Impact factor: 0.968

4. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: a Monte Carlo study.

Authors: Peter C Austin; Paul Grootendorst; Sharon-Lise T Normand; Geoffrey M Anderson
Journal: Stat Med Date: 2007-02-20 Impact factor: 2.373

5. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study.

Authors: Peter C Austin; Paul Grootendorst; Geoffrey M Anderson
Journal: Stat Med Date: 2007-02-20 Impact factor: 2.373

Review 6. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003.

Authors: Peter C Austin
Journal: Stat Med Date: 2008-05-30 Impact factor: 2.373

7. Evaluating uses of data mining techniques in propensity score estimation: a simulation study.

Authors: Soko Setoguchi; Sebastian Schneeweiss; M Alan Brookhart; Robert J Glynn; E Francis Cook
Journal: Pharmacoepidemiol Drug Saf Date: 2008-06 Impact factor: 2.890

8. Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases.

Authors: Jessica M Franklin; Sebastian Schneeweiss; Jennifer M Polinski; Jeremy A Rassen
Journal: Comput Stat Data Anal Date: 2014-04 Impact factor: 1.681

9. Constructing inverse probability weights for marginal structural models.

Authors: Stephen R Cole; Miguel A Hernán
Journal: Am J Epidemiol Date: 2008-08-05 Impact factor: 4.897

10. The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments.

Authors: Peter C Austin
Journal: Stat Med Date: 2013-09-30 Impact factor: 2.373

107 in total

1. A Comparative Study of Carvedilol Versus Metoprolol Initiation and 1-Year Mortality Among Individuals Receiving Maintenance Hemodialysis.

Authors: Magdalene M Assimon; M Alan Brookhart; Jason P Fine; Gerardo Heiss; J Bradley Layton; Jennifer E Flythe
Journal: Am J Kidney Dis Date: 2018-04-10 Impact factor: 8.860

2. 2-Year Outcomes After Complete or Staged Procedure for Tetralogy of Fallot in Neonates.

Authors: Jill J Savla; Jennifer A Faerber; Yuan-Shung V Huang; Theoklis Zaoutis; Elizabeth Goldmuntz; Steven M Kawut; Laura Mercer-Rosa
Journal: J Am Coll Cardiol Date: 2019-09-24 Impact factor: 24.094

3. Adjusted restricted mean survival times in observational studies.

Authors: Sarah C Conner; Lisa M Sullivan; Emelia J Benjamin; Michael P LaValley; Sandro Galea; Ludovic Trinquart
Journal: Stat Med Date: 2019-05-22 Impact factor: 2.373

4. Chemotherapy, Radiation, or Combination Therapy for Stage III Uterine Cancer.

Authors: Sbaa Syeda; Ling Chen; June Y Hou; Ana I Tergas; Fady Khoury-Collado; Alexander Melamed; Caryn M St Clair; Cande V Ananth; Alfred I Neugut; Dawn L Hershman; Jason D Wright
Journal: Obstet Gynecol Date: 2019-07 Impact factor: 7.661

5. Nursing Home Culture Change Practices and Survey Deficiencies: A National Longitudinal Panel Study.

Authors: Michael J Lepore; Julie C Lima; Susan C Miller
Journal: Gerontologist Date: 2020-11-23

6. Comparative risk of harm associated with trazodone or atypical antipsychotic use in older adults with dementia: a retrospective cohort study.

Authors: Jennifer A Watt; Tara Gomes; Susan E Bronskill; Anjie Huang; Peter C Austin; Joanne M Ho; Sharon E Straus
Journal: CMAJ Date: 2018-11-26 Impact factor: 8.262

7. Differences in coronary artery disease complexity and associations with mortality and hospital admissions among First Nations and non-First Nations patients undergoing angiography: a comparative retrospective matched cohort study.

Authors: Annette Schultz; Lindsey Dahl; Elizabeth McGibbon; Jarvis Brownlie; Catherine Cook; Basem Elbarouni; Alan Katz; Thang Nguyen; Jo-Ann V Sawatzky; Heather J Prior; Moneca Sinclaire; Karen Throndson; Randy Fransoo
Journal: CMAJ Open Date: 2020-11-02

8. Racial/Ethnic Disparities in Access and Outcomes of Simultaneous Liver-Kidney Transplant Among Liver Transplant Candidates With Renal Dysfunction in the United States.

Authors: Su-Hsin Chang; Mei Wang; Xiaoyan Liu; Tarek Alhamad; Krista L Lentine; Mark A Schnitzler; Graham A Colditz; Yikyung Park; William C Chapman
Journal: Transplantation Date: 2019-08 Impact factor: 4.939

9. Association Between Preoperative Metformin Exposure and Postoperative Outcomes in Adults With Type 2 Diabetes.

Authors: Katherine M Reitz; Oscar C Marroquin; Mazen S Zenati; Jason Kennedy; Mary Korytkowski; Edith Tzeng; Stephen Koscum; David Newhouse; Ricardo Martinez Garcia; Jennifer Vates; Timothy R Billiar; Brian S Zuckerbraun; Richard L Simmons; Stephen Shapiro; Christopher W Seymour; Derek C Angus; Matthew R Rosengart; Matthew D Neal
Journal: JAMA Surg Date: 2020-06-17 Impact factor: 14.766

10. Long-term effectiveness of catheter ablation in patients with atrial fibrillation and heart failure.

Authors: Michelle Samuel; Michal Abrahamowicz; Jacqueline Joza; Marie-Eve Beauchamp; Vidal Essebag; Louise Pilote
Journal: Europace Date: 2020-05-01 Impact factor: 5.214