
Combining multiple imputation and bootstrap in the analysis of cost-effectiveness trial data.

Jaap Brand¹, Stef van Buuren², Saskia le Cessie³,⁴, Wilbert van den Hout¹.

Abstract

In healthcare cost-effectiveness analysis, probability distributions are typically skewed and missing data are frequent. Bootstrap and multiple imputation are well-established resampling methods for handling skewed and missing data. However, it is not clear how these techniques should be combined. This paper addresses combining multiple imputation and bootstrap to obtain confidence intervals of the mean difference in outcome for two independent treatment groups. We assessed statistical validity and efficiency of 10 candidate methods and applied these methods to a clinical data set. Single imputation nested in the bootstrap percentile method (with added noise to reflect the uncertainty of the imputation) emerged as the method with the best statistical properties. However, this method can require extensive computation times and the lack of standard software makes this method not accessible for a larger group of researchers. Using a standard unpaired t-test with standard multiple imputation without bootstrap appears to be a robust alternative with acceptable statistical performance for which standard multiple imputation software is available.


Keywords: bootstrap; confidence interval; cost-effectiveness analysis; mean difference; multiple imputation


Year: 2018 PMID: 30207407 PMCID: PMC6585698 DOI: 10.1002/sim.7956

Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373

INTRODUCTION

The central goal in healthcare cost‐effectiveness analysis is to assess whether the additional positive health effect of a new treatment justifies its additional costs. In health economic trial data, probability distributions are typically skewed and missing data are frequent. Cost distributions in particular can be skewed because costs are nonnegative and a small number of patients incur much of the costs. Bootstrapping has long been advocated for the evaluation of skewed health economic data1, 2, 3 because it does not require specific distributional assumptions. Moreover, contrary to other approaches to skewness, which use transformed outcomes, the bootstrap allows the means to be analyzed explicitly, which is important for population‐level decision making. The concept of the bootstrap is to approximate the unknown distribution of test statistics under the sampling mechanism by the empirical distribution of these test statistics under resampling from the sample. P‐values or confidence intervals are then derived from this empirical distribution. More recently, multiple imputation has been advocated to account for missing data.4, 5, 6, 7 In cost‐effectiveness trials, patients typically report longitudinally on their health and healthcare use. Over a 1‐ or 2‐year follow‐up period, apart from incidentally missing questionnaires, trial participation may selectively decline by 30% or 50%. Multiple imputation is a flexible method that can properly account for the uncertainty and bias due to such missing data. In multiple imputation, new data sets are constructed in which the missing values are imputed. These imputed values vary over the data sets, reflecting uncertainty due to prediction errors of the imputed values, uncertainty about the imputation model parameters, and sampling variability of the imputed values.
The resulting completed data sets are then each analyzed by means of the complete data method of interest and the intermediate results are pooled into one final result according to the so‐called Rubin rules. Bootstrap and multiple imputation are well‐established resampling methods for handling skewed and missing data. Some papers have discussed relationships between bootstrapping and (multiple) imputation.8, 9, 10, 11, 12 Some papers have also compared the statistical performance of specific combined approaches, in settings somewhat similar to ours.13, 14 However, there have been no papers that compared the statistical performance of a systematic range of combined approaches, including different orders of nesting the computation loops of bootstrap and of multiple imputation. As a result, in cost‐effectiveness trials, the mean difference between treatment groups has been estimated using a variety of approaches, including bootstrap nested within multiple imputation15 and (single or multiple) imputation nested within bootstrap.16, 17 In this paper, we compare 10 candidate methods that account for missing observations and skewness of outcomes, using data simulation to assess the coverage of 95% confidence intervals, the bias of the point‐estimates, and the confidence interval width. We distinguish between methods where the bootstrap is nested within multiple imputation and methods where (single or multiple) imputation is nested within the bootstrap. In addition, we consider simpler alternatives like list‐wise deletion, single imputation, standard multiple imputation without bootstrap, and standard multiple imputation with a modified t‐test to remove the effect of skewness. In order to study their behavior in practice, we also applied all candidate methods to real‐life data from a clinical trial on Sciatica.

METHODS

Candidate methods for combining multiple imputation and bootstrap

We are interested in the mean difference in outcome between two treatment groups, denoted by $\theta$. Table 1 lists the 10 candidate methods to estimate $\theta$ and its 95% confidence interval. Some methods use double loops (methods that actually combine multiple imputation and bootstrap), others use a single loop (methods that use either bootstrapping or multiple imputation), or use no loop at all.

Table 1

Overview of the 10 candidate methods

Description	Code name
Benchmark methods
• List‐wise deletion	BENCH_LWD
• Single imputation using the predicted mean value	BENCH_prd
Multiple imputation without bootstrapping
• Standard multiple imputation using predictive mean matching and Rubin's rules for the computation of the confidence interval based on the normality assumption, without bootstrap	MW_S
• Multiple imputation using predictive mean matching with reduction of the effect of skewness by means of Edgeworth expansion, without bootstrap	MW_EDW
Bootstrapping nested in multiple imputation
• Multiple imputation using predictive mean matching in the outer loop and the bootstrap percentile method in the inner loop	MB_p
• Multiple imputation using predictive mean matching in the outer loop and the bootstrap‐t method in the inner loop	MB_t
Multiple imputation nested in bootstrapping
• The bootstrap percentile method in the outer loop and multiple imputation by means of predictive mean matching in the inner loop	BM_p
• The bootstrap‐t method in the outer loop and multiple imputation by means of predictive mean matching in the inner loop	BM_t
Single imputation nested in bootstrapping
• The bootstrap percentile method in the outer loop, encompassing single imputation by means of predictive mean matching	BS_p
• The bootstrap‐t method in the outer loop, encompassing single imputation by means of predictive mean matching	BS_t

Benchmark methods. The first two methods are list‐wise deletion (BENCH_LWD) and single imputation (BENCH_prd), two popular methods that are known to have potentially poor performance. They serve as a benchmark for the other eight candidate methods and use neither bootstrapping nor multiple imputation. In the BENCH_LWD method, all patients with any missing values are removed from the data. In BENCH_prd, each missing value is imputed once with the predicted mean value using linear regression, ie, without taking the uncertainty of the imputation into account.

Multiple imputation without bootstrapping. In the methods based on multiple imputation, the uncertainty of the imputations is incorporated by drawing from the predictive distribution of the missing values. Both uncertainty due to prediction errors of the imputed values and uncertainty about the imputation model parameters are reflected using multivariate imputation by chained equations (MICE),18 with predictive mean matching for robustness against nonnormality.19 The candidate method of multiple imputation without bootstrapping is standard multiple imputation, constructing $m$ new data sets with completed data point‐estimates $\hat{\theta}_i$ ($i = 1, \ldots, m$). The point‐estimates are pooled by computing the average $\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\theta}_i$, and squared standard errors are pooled as $T = \bar{U} + \left(1 + \frac{1}{m}\right)B$, where $\bar{U}$ is the average completed data variance of the point‐estimate and $B$ is the between imputation variance of the completed data point‐estimates. Pooled p‐values and confidence intervals are derived from the pooled point‐estimates and pooled standard errors under the assumption of normality.
In the standard MW_S method, the 95% confidence interval of the mean difference is given by $\bar{\theta} \pm t_{\nu,0.975}\sqrt{T}$, where $\bar{\theta}$ is the pooled point‐estimate, $T$ is the pooled squared standard error, and $t_{\nu,0.975}$ is the 97.5% percentile of the Student t distribution with the degrees of freedom $\nu$ computed by the method proposed by Barnard and Rubin.20 The second single loop method applies multiple imputation to a modified t‐test based on Edgeworth expansion,21, 22 which removes the effect of skewness (MW_EDW). We included this approach because bootstrapping in cost‐effectiveness analyses is advocated particularly because of the skewness of the cost data, and Edgeworth expansion could achieve the same goal without increasing the computational complexity. The 95% confidence interval for the mean difference from the MW_EDW method is constructed from the sum of the sample sizes in both groups, the 2.5% and 97.5% percentiles of the standard normal distribution, and the inverse transformation specified by Zhou22, with the complete data estimate of its skewness parameter replaced by the average of this parameter over the $m$ completed data estimates. This parameter is a function of the population variances and the population skewness of the first and second sample. It can be interpreted as the impact of skewness on the deviation of the ordinary t‐test statistic from the t‐distribution this statistic has under normality.

Bootstrapping nested in multiple imputation. In the approaches with multiple imputation in the outer loop (denoted by MB in Table 1), multiple imputation is used to generate $m$ completed data sets, bootstrapping is applied to each of the completed data sets, and the intermediate results per completed data set are then pooled.
For the bootstrap method, a distinction is made between the bootstrap percentile method (“_p” in Table 1) and the bootstrap‐t method (“_t” in Table 1).1 In the percentile method MB_p, the point‐estimate is the pooled mean difference $\bar{\theta}$ and the 95% confidence interval is of the shape $[\hat{\theta}^{*}_{0.025}, \hat{\theta}^{*}_{0.975}]$, where $\hat{\theta}^{*}_{0.025}$ and $\hat{\theta}^{*}_{0.975}$ are estimates of the 2.5% and 97.5% percentiles $\theta^{*}_{0.025}$ and $\theta^{*}_{0.975}$ of the bootstrap distribution of the estimated mean differences. When bootstrap is nested within multiple imputation, the percentiles are estimated by the averages of the completed data estimates $\hat{\theta}^{*}_{0.025,i}$ and $\hat{\theta}^{*}_{0.975,i}$ ($i = 1, \ldots, m$) of these percentiles resulting from bootstrap. In general, for the bootstrap‐t method the 95% confidence interval is based on the t‐test and is of the shape $[\hat{\theta} - \hat{t}^{*}_{0.975}\,\widehat{SE}, \hat{\theta} - \hat{t}^{*}_{0.025}\,\widehat{SE}]$, where $\hat{\theta}$ and $\widehat{SE}$ are a point‐estimate of the unknown mean difference and its associated standard error, both obtained without bootstrap, and $\hat{t}^{*}_{0.025}$ and $\hat{t}^{*}_{0.975}$ are estimates of the 2.5% and 97.5% percentiles $t^{*}_{0.025}$ and $t^{*}_{0.975}$ of the t‐test statistic obtained by means of bootstrap. When bootstrap is nested within multiple imputation (MB_t), the point‐estimate and its associated standard error are given by the pooled mean difference $\bar{\theta}$ and the pooled standard error $\sqrt{T}$, and the percentiles $t^{*}_{0.025}$ and $t^{*}_{0.975}$ are estimated by the averages of the corresponding completed data estimates $\hat{t}^{*}_{0.025,i}$ and $\hat{t}^{*}_{0.975,i}$ ($i = 1, \ldots, m$) of these percentiles.

Multiple imputation nested in bootstrapping. In the approach with multiple imputation in the inner loop (denoted by BM in Table 1), bootstrapping is used first to generate incomplete data sets and then, for each incomplete data set, $m$ completed data sets are generated. Computationally, this requires far more calls to the MICE procedure than when multiple imputation is in the outer loop. For the bootstrap methods, a distinction is again made between the bootstrap percentile method and the bootstrap‐t method. For the bootstrap percentile method BM_p, the percentiles $\theta^{*}_{0.025}$ and $\theta^{*}_{0.975}$ are estimated by the 2.5% and 97.5% percentiles of the bootstrapped pooled mean differences over the $m$ completed data sets.
For the bootstrap‐t method BM_t, the point‐estimate and its associated standard error are given by the pooled mean difference $\bar{\theta}$ and the pooled standard error $\sqrt{T}$, and the percentiles $t^{*}_{0.025}$ and $t^{*}_{0.975}$ are estimated by the 2.5% and 97.5% percentiles of the bootstrap pooled t‐test statistics.

Single imputation nested in bootstrapping. The methods in the previous section can be simplified by using only a single imputation ($m = 1$). With single imputation nested in bootstrapping (denoted by BS in Table 1), no pooling over the imputations is needed. The single imputation not only imputes the expected value of the missing data but adds “noise” to reflect the uncertainty of the imputation (using a single call to the MICE procedure per bootstrap resample). For the bootstrap percentile method BS_p, the 2.5% and 97.5% percentiles are estimated from the bootstrapped completed mean differences. For the bootstrap‐t method BS_t, the point‐estimate and its associated standard error are given by the completed mean difference and its associated standard error from the completed data set, and the percentiles $t^{*}_{0.025}$ and $t^{*}_{0.975}$ are estimated by the 2.5% and 97.5% percentiles of the bootstrap completed data t‐test statistic.
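To make the loop structure of BS_p concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes each patient is a pair (effectiveness, cost) with cost set to None when missing, and it replaces MICE with predictive mean matching by a much simpler stand-in: a linear regression of cost on effectiveness plus a residual drawn at random from the observed residuals, which plays the role of the added "noise". The function names `impute_with_noise` and `bs_p_interval` are hypothetical.

```python
import random
import statistics


def impute_with_noise(eff, cost, rng):
    """Single stochastic imputation (a stand-in for MICE, by assumption).

    Fits cost ~ effectiveness on the observed cases and fills each missing
    cost with the prediction plus a randomly drawn observed residual.
    Assumes the resample contains at least one observed cost.
    """
    obs = [(e, c) for e, c in zip(eff, cost) if c is not None]
    xs = [e for e, _ in obs]
    ys = [c for _, c in obs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in obs) / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in obs]
    return [
        c if c is not None else intercept + slope * e + rng.choice(residuals)
        for e, c in zip(eff, cost)
    ]


def bs_p_interval(group1, group2, n_boot=1000, seed=0):
    """Percentile interval for the mean cost difference (group1 - group2).

    Outer loop: bootstrap the INCOMPLETE data within each treatment group.
    Inner step: one noisy imputation per resample, then the completed mean.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        means = []
        for group in (group1, group2):
            sample = [rng.choice(group) for _ in group]  # rows with replacement
            eff = [e for e, _ in sample]
            cost = [c for _, c in sample]
            means.append(statistics.fmean(impute_with_noise(eff, cost, rng)))
        diffs.append(means[0] - means[1])
    cuts = statistics.quantiles(diffs, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]
```

Because the missing values are re-imputed inside every resample, the spread of the bootstrapped differences reflects both sampling variation and imputation uncertainty, which is why no pooling step appears in this sketch.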

Simulation study

In the data simulation study, the 10 candidate methods were compared with respect to statistical validity and efficiency. These were assessed on repeatedly simulated data sets, simulated according to 30 different, quite extreme data simulation models, varying both the complete data generating mechanism and the missing data mechanism (see Table 2). The 30 data simulation models are defined in comparison to a reference case model, varying six aspects of the model one at a time. All models represent cost‐effectiveness trial data, with independent patients in two equally sized treatment groups (reference case n = 2 × 200). Correlated bivariate cost‐effectiveness data were generated for each patient, similarly in both treatment groups. Throughout this simulation study, the effectiveness variable was generated using a beta(5,2) distribution. The cost variable was modeled as a mixture of either zero costs (reference case for 30% of the patients) or a gamma distribution with a mean fixed at 1000 euro and skewness γ (reference case γ = 2). Such semicontinuous mixtures of zero and positive values often occur in cost data.23 A prespecified rank correlation (reference case -0.8) between effectiveness and costs was generated using the NORTA (NORmal To Anything) algorithm.24 To prevent ties in the NORTA algorithm, the zero costs were modeled as a small uniform distribution between 0 and 1 euro. For both missing completely at random (MCAR) and missing at random (MAR) data mechanisms, missing data were generated in the cost variable only (reference case 40% missing). For the MAR missing data mechanism, the cost data were three times more likely to be missing in patients with effectiveness above or equal to the median than in patients with effectiveness below the median (reference case 60% versus 20% missing).
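The reference case marginals and the MAR mechanism described above can be sketched in a few lines of Python. This is an illustrative simplification, not the simulation code used in the study: it draws effectiveness and costs independently, so the NORTA step that induces the rank correlation is deliberately omitted, and it assumes that the mean of 1000 euro refers to the gamma component. The function name `simulate_group` and its parameters are hypothetical.

```python
import random
import statistics


def simulate_group(n=200, p_zero=0.30, skew=2.0, mean_pos=1000.0,
                   p_miss_high=0.60, p_miss_low=0.20, seed=1):
    """Return a list of (effectiveness, cost) pairs; cost is None if missing.

    Simplification: effectiveness and cost are independent here (no NORTA).
    """
    rng = random.Random(seed)
    shape = 4.0 / skew ** 2   # a gamma distribution has skewness 2 / sqrt(shape)
    scale = mean_pos / shape  # and mean shape * scale
    eff = [rng.betavariate(5, 2) for _ in range(n)]
    cost = [
        rng.uniform(0.0, 1.0) if rng.random() < p_zero  # "zero" costs in [0, 1] euro
        else rng.gammavariate(shape, scale)
        for _ in range(n)
    ]
    median_eff = statistics.median(eff)
    data = []
    for e, c in zip(eff, cost):
        # MAR: missingness depends only on the always-observed effectiveness
        p_miss = p_miss_high if e >= median_eff else p_miss_low
        data.append((e, None if rng.random() < p_miss else c))
    return data
```

With equal halves above and below the median, missingness probabilities of 60% and 20% average to the 40% overall rate of the reference case. Setting both probabilities to 0.40 gives the MCAR counterpart.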

Table 2

Varied Assumptions in Data Mechanism (for groups 1 and 2)	Plot Symbol used in Figure 1
% missing in the costs data (reference case 40% and 40%)
10% and 10%	Blue triangle point‐down
10% and 50%	Blue circle
50% and 50%	Blue triangle point‐up
Sample size (reference case n = 2 × 200)
n = 2 × 50	Green triangle point‐down
n = 2 × 200	Green circle
n = 2 × 500	Green triangle point‐up
% zeroes in cost data (reference case 30% and 30%)
5% and 5%	Orange triangle point‐down
5% and 40%	Orange circle
40% and 40%	Orange triangle point‐up
Skewness parameter γ in cost data (reference case 2 and 2)
0.5 and 0.5	Red triangle point‐down
0.5 and 3	Red circle
3 and 3	Red triangle point‐up
Rank correlation (reference case ‐0.8 and ‐0.8)
‐0.3 and ‐0.3	Purple triangle point‐down
‐0.3 and ‐0.9	Purple circle
‐0.9 and ‐0.9	Purple triangle point‐up

Assumptions in the data simulation models used to compare the candidate methods. Unspecified parameters follow the reference case assumptions (see text). The specified 15 assumptions were combined with both a missing completely at random mechanism (MCAR, open plot symbols) and a missing at random mechanism (MAR, filled plot symbols).

For each of the 30 data simulation models, we simulated 1000 incomplete data sets to assess the performance of the 10 candidate methods in estimating the mean cost difference between the treatment groups. The data sets included treatment, effectiveness, and costs, where costs were missing for part of the participants. Per treatment group, the effectiveness variable was used as predictor variable for costs in MICE. For all candidate methods involving multiple imputation, the number of imputations was m = 5 and the number of bootstrap resamples was 1000. Per candidate method, the number of data simulation models for which the method was statistically valid, ie, both unbiased and without significant under coverage,25 was counted and displayed at the top of Panel A in Figure 1. A method was considered unbiased for a particular simulation model if the bias‐validity criterion26 holds; this criterion is expressed in terms of the bias and the standard error of the corresponding point‐estimate, both estimated from the simulation study. A method was considered to have significant under coverage for a particular simulation model if the actual coverage over the 1000 simulated data sets was significantly less than 95%, ie, if the 95% confidence interval contained the true value in 935 or fewer of the simulated data sets (see Panel A in Figure 1). An actual coverage of these confidence intervals lower than 90% has been considered unacceptable.18 For a given simulation model, the efficiency of a method was defined as the average confidence interval width over all 1000 simulated incomplete data sets. The simulations were all performed in R, version 3.02.
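The text does not state which test defines "significantly less than 95%". One plausible reading, offered here only as an assumption, is an exact binomial test with a 2.5% lower tail: if a method truly had 95% coverage, the number of covering intervals in 1000 simulations would follow a Binomial(1000, 0.95) distribution. The hypothetical helper below finds the largest count whose lower-tail probability stays under a chosen level, so the stated cutoff of 935 can be checked against this reading.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


def undercoverage_cutoff(n_sim=1000, nominal=0.95, alpha=0.025):
    """Largest count k with P(X <= k) < alpha (returns -1 if there is none)."""
    cumulative, cutoff = 0.0, -1
    for k in range(n_sim + 1):
        cumulative += binom_pmf(k, n_sim, nominal)
        if cumulative >= alpha:
            break
        cutoff = k
    return cutoff
```

Calling `undercoverage_cutoff()` returns the count at or below which coverage would be flagged under this assumed test; a different test or tail level would shift the cutoff by a count or two.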

Figure 1

Results of the simulation study in which the performance of the 10 candidate methods for 30 different data simulation models was assessed for the actual confidence interval coverage (Panel A), bias (Panel B), and average confidence interval width (Panel C). The top row of Panel A indicates the number of data simulation models (out of 30) for which each method is considered valid (ie, unbiased and with coverage at least 93.6%). For legend of the symbols, see Table 2 [Colour figure can be viewed at wileyonlinelibrary.com]

Application – Sciatica trial

The 10 candidate methods were also applied to real‐life data from the Sciatica trial.27 The Sciatica trial was a randomized controlled clinical trial in which the cost effectiveness of a policy of early surgery (n = 142) was compared to a policy of prolonged conservative care (n = 141). In the early surgery policy, disc surgery was scheduled within two weeks of randomization and canceled only if spontaneous recovery occurred before the date of surgery. In the prolonged conservative care policy, disc surgery was offered if sciatica persisted after six months. Increasing leg pain, not responsive to drug, and progressive neurological deficit were reasons for performing surgery earlier than six months. The trial concluded that early surgery was cost effective from a societal perspective because the additional healthcare costs were compensated by improved patient outcome and a reduction in absenteeism from work. Outcome measures. Apart from the realistic nature, the primary difference between the data simulation models and the Sciatica data is in the complexity of the data structures. Typical for cost‐effectiveness trial data, the Sciatica trial had a longitudinal structure, with extensive quarterly patient questionnaires during a one‐year follow‐up. Moreover, the overall costs and effectiveness were constructed from a large number of underlying health and healthcare items. Outcome measures to which the 10 candidate methods were applied are four different health effects measured by means of quality‐adjusted life years (QALYs) and five different costs categories. The QALYs are computed over the one‐year period as the area under a utility function, which quantifies the value of the patient's health (anchored at 1 = perfect health and 0 = as poor as dead). 
The different QALYs in this example are QALYs based on four different utility functions, ie, the UK and US tariffs for the EuroQol (EQ‐5D),28, 29 the SF‐6D,30 and a visual analogue scale.27 The five costs categories were disc surgery costs, total healthcare costs, informal care costs, productivity costs in terms of absenteeism from work, and the total societal costs, all measured over one year of follow‐up. The total healthcare costs included costs from disc surgery, physical therapy, other admissions to hospital, neurologists, neurosurgeons, other specialists, general practitioners and other paramedical professionals, alternative care, home care, analgesics and other drugs, and aids. Generation of imputations. For the UK EQ‐5D, the US EQ‐5D, the SF‐6D, and the visual analogue scale, the percentages of missing data were 23%, 23%, 23%, and 21% in the prolonged conservative care treatment group and 28%, 28%, 24%, and 35% in the early surgery treatment group. For all five costs categories, the percentage of missing data was 18% in the prolonged conservative care treatment group and 26% in the early surgery treatment group. Missing effectiveness and healthcare data were imputed at the item level. Imputations were generated using MICE, with a large linear prediction model. Effectiveness and cost items were predicted by gender, age, treatment group, and all (other) effectiveness items. Dependencies within patients over time were taken into account by performing separate regression analyses for each separate time point, including the effectiveness measurements at other time points as predictors. From each completed data set, the QALYs and aggregate costs categories were calculated. Like for the data simulation models, the number of imputations was chosen equal to 5 and the number of bootstrap resamples was chosen equal to 1000.

RESULTS

The results for the 10 methods and 30 data simulation models are graphically summarized in Figure 1, measured by coverage (panel A), bias (panel B), and efficiency (panel C). A method is considered valid for a particular model if it is both unbiased (below the red line in panel B) and without significant under coverage (ie, with simulated coverage at least 93.6%, above the lowest dotted line in panel A). The number of data simulation models for which each method was valid is indicated in the top row of panel A. Concerning bias, all methods were unbiased for the 15 data simulation models with MCAR missing data mechanism. For the MAR missing data mechanism, cost data were three times more likely to be missing in patients with effectiveness above or equal to the median. The BENCH_LWD method was biased for 13 of these 15 MAR data simulation models. The BENCH_prd method was biased for two of the 15 MAR data simulation models, as imputing the predicted mean value is less robust to departures from linearity and normality than the nonbenchmark methods. The other eight candidate methods were unbiased for all data simulation models with MAR missing data mechanism. Therefore, the statistical validity of the nonbenchmark methods is determined by coverage. Concerning coverage, list‐wise deletion (BENCH_LWD) yielded significant under coverage for 12 data simulation models (all MAR). Single imputation (BENCH_prd) yielded under coverage for 13 MCAR and seven MAR models. The under coverage is due both to the bias and to the fact that imputing the predicted mean value does not properly reflect the uncertainty. Standard multiple imputation without bootstrap (MW_S) performed quite well with statistical validity for 19 data simulation models out of 30. For the other 11 simulation models, in 10 models, the coverage was between 93.6% and 90%; in one model, the coverage was slightly less than 90%. 
This latter model was the MCAR model with different percentages of missing data in the two samples, 10% and 50% (open blue circle). Moreover, for the small sample size of 50 (green triangles pointing down), the coverage was larger than 90%. Therefore, MW_S appears to be robust against skewness, even for small sample sizes. The method MW_EDW that corrects for skewness did not outperform the standard MW_S. The two methods with bootstrap nested within multiple imputation yielded contrasting results. The percentile‐based MB_p method performed poorly, with statistical validity for only three out of the 30 data simulation models and coverage below 90% for about half of the models. The t‐test based MB_t method performed better than the MB_p method, with statistical validity for 12 out of 30 data simulation models and coverage larger than 90% for all 30 models. Both methods in which multiple imputation is nested within bootstrap showed a coverage of at least 90% for all 30 models. The BM_p method was statistically valid for a considerable 23 out of 30 data simulation models, whereas the BM_t method was statistically valid for 12 out of 30 models. Finally, concerning coverage, the two methods with single imputation nested within bootstrap also yielded contrasting results. The percentile‐based BS_p method yielded the best statistical validity over all 10 candidate methods, with statistical validity for 29 out of 30 data simulation models. On the other hand, the BS_t method performed poorly, with statistical validity for only two out of 30 models and coverage below 90% for about half of the models. Concerning efficiency, of the methods with relatively poor statistical validity, some had relatively long confidence intervals (BENCH_LWD and BS_p) and some had relatively short confidence intervals (BENCH_prd and MB_p). Among the remaining methods, the confidence intervals were similar in length.

Sciatica trial

Figure 2 displays the estimated differences for the four QALY outcomes and the five cost outcomes between the randomization groups of the Sciatica trial. The top panels display the point estimates according to the different methods, with the estimated confidence intervals. The bottom panels show the lengths of those confidence intervals.

Figure 2

Estimated four quality‐adjusted life year (QALY) outcomes and five cost outcomes for the Sciatica trial. Top panels display the point estimates with upper and lower bound of the confidence intervals. Bottom panels show the lengths of those confidence intervals [Colour figure can be viewed at wileyonlinelibrary.com]

Except for list‐wise deletion, there was little difference between the candidate methods. Each point estimate is well within the confidence intervals of the other methods. As in the original trial, all candidate methods showed (marginally) significant QALY differences in favor of early surgery. Surgery costs, total health care costs, and informal care costs were significantly higher after early surgery, without a significant difference in productivity costs and total societal costs. Productivity costs, and consequently total societal costs, showed the largest differences between the methods, due to the larger variability and because patients without paid labor reduced the effective sample size. Table 3 gives information about the computation times for the different methods. The methods embedding multiple imputation in bootstrap have the largest computation time, more than 29 hours, due to the large number of MICE calls and the large imputation model. In contrast, the methods without bootstrapping in the outer loop require less than two minutes.

Table 3

Computation time to analyze data from the Sciatica study (total and for MICE calls). Time indicated by “xh ym zs” denotes x hours and y minutes and z seconds

Method	Total time	Number of MICE calls	Total MICE time	Time per MICE call	Percentage MICE time
BENCH_LWD	0.2 s	0			0%
BENCH_prd	23 s	1	23 s	23 s	100%
MW_S	1 m 52 s	5	1 m 52 s	22 s	100%
MW_EDW	1 m 52 s	5	1 m 52 s	22 s	100%
MB_p and MB_t	1 m 54 s	5	1 m 53 s	23 s	99.1%
BM_p and BM_t	29 h 25 m 21 s	5000	28 h 57 m 43 s	21 s	98.4%
BS_p and BS_t	5 h 53 m 34 s	1000	5 h 48 m 03 s	21 s	98.4%

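The numbers of MICE calls in Table 3 follow directly from the loop structure with 5 imputations and 1000 bootstrap resamples. The short Python sketch below reproduces these counts and gives a rough projection of the imputation time. The helper names are hypothetical, and the 21 seconds per call is the rounded figure from Table 3, so the projected totals are only approximate.

```python
def mice_calls(method, m=5, n_boot=1000):
    """Number of imputation-procedure calls implied by the loop structure."""
    calls = {
        "BENCH_LWD": 0,     # no imputation at all
        "BENCH_prd": 1,     # one single imputation
        "MW": m,            # multiple imputation only
        "MB": m,            # bootstrap runs inside each completed data set
        "BM": n_boot * m,   # m imputations inside every bootstrap resample
        "BS": n_boot,       # one imputation inside every bootstrap resample
    }
    return calls[method]


def format_hms(seconds):
    """Format a duration in whole seconds as 'x h y m z s'."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours} h {minutes} m {secs} s"
```

For example, `format_hms(mice_calls("BM") * 21)` projects roughly 29 hours of imputation time for the BM methods, against under 6 hours for `mice_calls("BS") * 21`, in line with the order of magnitude reported in Table 3.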

DISCUSSION

This paper evaluated 10 different candidate methods for estimating confidence intervals of the mean difference between two independent treatment groups from incomplete skewed data. The combined use of multiple imputation with bootstrap does not automatically yield statistically valid results, and thus should be applied with care. The bootstrap percentile method embedded in multiple imputation (MB_p) yielded a low coverage because the pooled confidence interval was obtained as the average of the completed data confidence intervals. In these completed data confidence intervals, the extra uncertainty due to missing data is not taken into account. This way, the variance between imputation sets (ie, the sampling variability of the missing values) is not fully taken into account. In contrast, the seemingly similar bootstrap‐t method embedded in multiple imputation (MB_t) performs considerably better because the resulting confidence interval does account for the extra uncertainty due to missing data through the total variance $T$ obtained from Rubin's rules. Yet, when single imputation is embedded in bootstrapping, it is the bootstrap percentile method (BS_p) that outperforms the bootstrap‐t method (BS_t). Moreover, except for list‐wise deletion, we found no patterns as to which aspects of the data models would be particularly problematic or would favor particular methods. In our study, the method BS_p embedding a single imputation within the bootstrap percentile method emerged as the method with the best statistical properties. At first sight, this may be a striking result, as usually multiple (and not single) imputations are needed to properly reflect uncertainty.
However, it has been described before that bootstrapping the incomplete data provides a mechanism that can properly account for both sampling and missing data uncertainty.8, 26 See chapter 5 in the book by Little and Rubin for a comparison of resampling methods and multiple imputation.26 Keep in mind that it is important for the validity of the BS_p method that the single imputation not only imputes the expected value of the missing data but also adds “noise” to reflect the uncertainty of the imputation to prevent under coverage. In contrast, the BS_t method embedding single imputation within the bootstrap‐t method yielded confidence intervals that were too narrow and resulted in considerable under coverage. Standard multiple imputation without bootstrap (MW_S) appears to be robust against skewness with acceptable performance across data simulation models, even when the sample size was small. This standard method also takes both missing data and sampling variation into account and was only outperformed by the computationally more intensive methods with imputation nested in percentile bootstrapping (BM_p and BS_p). Correction for skewness using a modified t‐test did not improve the performance.21 The robustness of MW_S against skewness was shown in earlier studies31 for sample sizes of 50 and it has also been shown that the sampling distribution of the sample mean from very skewed populations is close to normality for a sample size of 65.32, 33 In our study, for computational reasons, we chose relatively low numbers of imputations (5) and bootstrap resamples (1000). In practice, we may want to use higher numbers, in line with various recommendations.34 In addition, we may adopt more sophisticated prediction models to impute missing data or more sophisticated forms of bootstrapping, like bias‐corrected and accelerated bootstrap. While such changes may improve statistical performance, we do not expect that the main conclusions emanating from our study would change.
Under specific assumptions, other techniques to address missing data are equivalent to, or sometimes superior to, multiple imputation.35 Alternatively, multiple imputation can be the better option if additional information is available that can be used to inform the imputations, or when the missing data occur also in other parts of the data, eg, in the covariates. What is optimal in a particular application depends very much on the missing data pattern and on the plausibility of the assumptions associated with the approach to deal with the missing data. We restricted our analysis to the case where the missing data occur only in the outcome variables, which is the relevant case for cost‐effectiveness trial data. In our simulation study, the true parameters were known, which allowed for the assessment of statistical validity under quite extreme conditions. We also applied the candidate methods to real data from a clinical trial. For this application, the differences between the methods were small. This suggests that, under less extreme conditions, the differences between the methods may be limited.

CONCLUSION

The combination of multiple imputation and bootstrap should be used with care to prevent statistically invalid results. In particular, the popular practice of averaging bootstrapped intervals over multiple imputations leads to undercoverage and is thus too optimistic. We found that single imputation embedded in the bootstrap percentile method (with added noise to reflect the uncertainty of the imputation) had the best statistical properties, as resampling the incomplete data properly reflects both sampling and missing data variation. However, this method can require extensive computation times, and the lack of standard software limits its accessibility to a larger group of researchers. Using a standard unpaired t‐test with standard multiple imputation without bootstrap appears to be a robust alternative with acceptable statistical performance.
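For readers less familiar with the standard multiple imputation analysis, the pooling step can be written out as follows. This is a sketch that assumes "standard" refers to Rubin's classical combining rules; the notation is introduced here for illustration and is not taken from the paper: $m$ is the number of imputed data sets, $\hat{\Delta}_l$ is the estimated mean difference in imputed data set $l$, and $U_l$ is its squared standard error from the unpaired t‐test.

```latex
\begin{align*}
\bar{\Delta} &= \frac{1}{m}\sum_{l=1}^{m}\hat{\Delta}_l
  && \text{(pooled mean difference)}\\
\bar{U} &= \frac{1}{m}\sum_{l=1}^{m} U_l
  && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{l=1}^{m}\bigl(\hat{\Delta}_l-\bar{\Delta}\bigr)^{2}
  && \text{(between-imputation variance)}\\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}\\
\nu &= (m-1)\Bigl(1+\frac{\bar{U}}{(1+1/m)\,B}\Bigr)^{2}
  && \text{(degrees of freedom)}\\
\text{CI}_{1-\alpha} &= \bar{\Delta} \pm t_{\nu,\,1-\alpha/2}\,\sqrt{T}
\end{align*}
```

The within-imputation term captures sampling variation and the between-imputation term captures missing data uncertainty, which is why this approach accounts for both sources without any bootstrap step.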
