
Simulating Study Data to Support Expected Value of Sample Information Calculations: A Tutorial.

Anna Heath, Mark Strong, David Glynn, Natalia Kunst, Nicky J Welton, Jeremy D Goldhaber-Fiebert.

Abstract

The expected value of sample information (EVSI) can be used to prioritize avenues for future research and design studies that support medical decision making and offer value for money spent. EVSI is calculated based on 3 key elements. Two of these, a probabilistic model-based economic evaluation and updating model uncertainty based on simulated data, have been frequently discussed in the literature. By contrast, the third element, simulating data from the proposed studies, has received little attention. This tutorial contributes to bridging this gap by providing a step-by-step guide to simulating study data for EVSI calculations. We discuss a general-purpose algorithm for simulating data and demonstrate its use to simulate 3 different outcome types. We then discuss how to induce correlations in the generated data, how to adjust for common issues in study implementation such as missingness and censoring, and how individual patient data from previous studies can be leveraged to undertake EVSI calculations. For all examples, we provide comprehensive code written in the R language and, where possible, Excel spreadsheets in the supplementary materials. This tutorial facilitates practical EVSI calculations and allows EVSI to be used to prioritize research and design studies.


Keywords: R tutorial; expected value of sample information; research design methods; simulation methods; value of information


Year: 2021 PMID: 34388954 PMCID: PMC8793320 DOI: 10.1177/0272989X211026292

Source DB: PubMed Journal: Med Decis Making ISSN: 0272-989X Impact factor: 2.749

Introduction

What Is EVSI and Why Is It Not Used More Frequently?

The expected value of sample information (EVSI) measures the value of reducing decision uncertainty by undertaking a proposed study with a given design. Specifically, EVSI is the expected economic benefit of a study that collects additional information that aims to reduce uncertainty before making a decision. In medical decision making, EVSI can be applied to a wide range of study designs, including clinical trials, to inform the relative effectiveness of treatments or observational studies to estimate baseline event rates. The expected net benefit of sampling (ENBS) is defined as the costs of a study subtracted from its (population-level) EVSI. Studies with high ENBS efficiently trade off information value and data collection cost. ENBS can then be used to optimize study design and prioritize research investments that offer value for money.[3,4] EVSI and ENBS can also support reimbursement decision makers as small values for EVSI and ENBS indicate that treatment recommendations should be made using existing evidence, rather than recommending the collection of further evidence before making a treatment recommendation. Despite these benefits of EVSI and ENBS, their practical application has been restricted by the difficulty of the computations required and by the small number of analysts who are familiar with its use.

How Is EVSI Computed?

In model-based health economic evaluations, EVSI is usually calculated using a simulation-based approach based on 3 main elements, each of which can increase the barrier to its implementation. First, the model-based economic evaluation must be fully probabilistic (i.e., all relevant quantities must be parameterized and their uncertainty accurately characterized and encoded in probability distributions). In this setting, the optimum decision option is the one that maximizes expected net benefit, where expectation is taken over the parameter uncertainty. Second, we must simulate plausible values for the data that would be collected in the proposed future study. Third, we must update our parameter uncertainty using the simulated plausible study data from the previous step, potentially changing the optimum decision option. This final step has traditionally been highly computationally demanding because it requires a large number of simulations. The first and third elements of the process have been widely discussed. First, methods for developing probabilistic decision-analytic models are well established, since probabilistic analyses (PAs), also known as probabilistic sensitivity analyses, are required as part of health technology assessment (HTA) processes in many health systems.[8-12] Good practice guidelines and textbooks also guide the development of probabilistic decision-analytic models using evidence from the literature.[1,13-15] The third element has been facilitated by recently developed efficient approximation methods that have overcome the computational challenge of calculating EVSI using the simulated study data.[16-19] These approximation methods have recently been compared and evaluated.[20,21]

What Does This Tutorial Discuss?

This tutorial addresses the crucial second element, simulating plausible study data, which has not received sufficient attention in the literature to allow analysts to easily compute EVSI. Fortunately, simulating study data is a common task outside of HTA.[22,23] This tutorial highlights how these approaches[23-29] can be used to compute EVSI. We will present methods to simulate data using correlated and uncorrelated parametric distributions that incorporate real-world study challenges, such as loss to follow-up, and using a nonparametric approach with individual patient data (IPD) from previous studies. We aim to support the generation of realistic study data to improve the accuracy of EVSI calculations. Coupled with the recent advancements in EVSI computation, this tutorial will facilitate the use of EVSI in practice to guide research prioritization and study design.

Background and Notation

This section provides a brief introduction to EVSI and the notation used throughout this tutorial. A more complete introduction to EVSI is included in other sources.[1,7,21]

Model-Based Decision Analysis

We aim to decide between a set of $D$ interventions. We have a decision-analytic model that estimates the net benefit for each option $d = 1, \dots, D$, given a vector of input parameters $\theta$. We consider the model to be a function that maps inputs $\theta$ to strategy-specific net benefits, denoted $\mathrm{NB}_d(\theta)$. The inputs represent real-world quantities (e.g., costs, relative treatment effects, disease progression on standard care, utilities, and disease prevalence), which are not known with certainty. Through a PA, we represent knowledge about these quantities via the joint probability distribution $p(\theta)$, which can be considered as describing the joint prior distribution for $\theta$. The expected net benefit of the optimum decision given current knowledge is $\max_d \mathbb{E}_{\theta}[\mathrm{NB}_d(\theta)]$. This expectation is usually estimated using Monte Carlo simulation (i.e., values of $\theta$ are sampled from $p(\theta)$ and used to compute the average net benefit for each $d$) because it is usually not available analytically.

The Expected Value of Sample Information

Data that update information about $\theta$ have value if they might change the optimum treatment. If we were to collect new data and update our knowledge about $\theta$ and the net benefits, the optimal decision would be the option that maximizes the expected net benefit, $\mathbb{E}_{\theta \mid X}[\mathrm{NB}_d(\theta)]$, conditional on the new data. However, before conducting a study, the data have not been collected, and so we compute the expected value of collecting additional data, where the expectation is taken with respect to the distribution of all plausible realizations of the data that the proposed study may generate. Thus, the data from the proposed study are a random variable, denoted $X$, and are not yet observed. The expected value of the net benefit for the optimal decision given new information, averaged over the distribution of all possible datasets, $p(X)$, is $\mathbb{E}_{X}\left[\max_d \mathbb{E}_{\theta \mid X}[\mathrm{NB}_d(\theta)]\right]$, and EVSI is the difference between this quantity and the expected net benefit under current information,

$$\mathrm{EVSI} = \mathbb{E}_{X}\left[\max_d \mathbb{E}_{\theta \mid X}[\mathrm{NB}_d(\theta)]\right] - \max_d \mathbb{E}_{\theta}[\mathrm{NB}_d(\theta)].$$

The first and second terms in this equation are usually not available in closed form and must be estimated using simulation methods. $X$ is the complete set of quantities that would be collected during the study. In reality, this dataset may include mismeasured quantities, missing values, and measurements taken at times that deviate from the study design, which should be reflected in our distribution for $X$. Furthermore, a model parameter could be informed by different study designs (e.g., relative effectiveness can be estimated through a randomized controlled trial or through an observational study using suitable methods, which would result in different $X$).

Efficient Methods for Computing EVSI

The “standard” approach to EVSI estimation uses a nested Monte Carlo scheme that requires a large number of samples from the posterior distribution of the model parameters given sampled data, $p(\theta \mid X)$ (an “inner loop”), nested within an “outer loop” that samples a large number of simulated datasets $X$. The decision-analytic model must be evaluated once for every inner-loop sample within every outer-loop sample (i.e., the product of the inner-loop and outer-loop sample sizes), requiring days or even months to complete the required computation. However, recent methods for computing EVSI decrease this time to seconds via approximations that either reduce the number of simulated datasets required or avoid the inner loop altogether.[16-20]
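To make the cost of the nested scheme concrete, the following is a minimal sketch of nested Monte Carlo EVSI for a toy problem. Everything in it is an assumption made for illustration: the two-option net-benefit function, the Beta(3, 17) prior, the study size, and the loop sizes are hypothetical, and a conjugate beta-binomial update stands in for the general posterior sampler.

```r
set.seed(42)
n_outer <- 200     # simulated datasets (outer loop)
n_inner <- 500     # posterior samples per dataset (inner loop)
M <- 100           # hypothetical study size
a <- 3; b <- 17    # illustrative Beta(3, 17) prior on an event probability

# Toy net-benefit model (hypothetical): option 1 is the reference,
# option 2 is preferred only when the event probability exceeds 0.15
net_benefit <- function(theta) cbind(0, 10000 * (theta - 0.15))

# Value under current information: max over options of the prior mean NB
prior_value <- max(colMeans(net_benefit(rbeta(n_outer * n_inner, a, b))))

posterior_value <- numeric(n_outer)
for (i in seq_len(n_outer)) {
  theta_i <- rbeta(1, a, b)                      # parameter from the prior
  x_i <- rbinom(1, size = M, prob = theta_i)     # simulated study result
  post <- rbeta(n_inner, a + x_i, b + M - x_i)   # conjugate posterior draws
  posterior_value[i] <- max(colMeans(net_benefit(post)))
}

evsi <- mean(posterior_value) - prior_value
model_evaluations <- n_outer * n_inner   # model runs used by the nested loops
```

Even in this toy setting the model is evaluated `n_outer * n_inner` times, which is what becomes prohibitive when a single model run is slow.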

Approaches to Simulating Study Datasets

We now discuss how to simulate plausible study datasets. For some EVSI computation methods, only a summary statistic of the data (e.g., mean, sum) is required. Because simulating the summary statistic directly can decrease the computational burden of the study data simulation, we discuss methods for generating it directly in some simple settings. However, for many studies (e.g., those collecting censored survival data), it will not be possible to simulate the summary statistic directly, and we will only discuss the individual-level simulation method.

Simulating Study Outcomes Using Parametric Distributions

Plausible study data can be generated by specifying a parametric data-generating process $p(X \mid \theta)$. The exact parametric data-generating process will change depending on the proposed study design, as it must reflect which model parameters the study will inform and what data should be collected to update these parameters. For example, a randomized controlled trial can be proposed to inform the log odds ratio of a given health event between the current standard and novel treatment, while a cohort study would inform the baseline event rate, and a study analyzing administrative claims data would inform costs. Studies can also be proposed to update multiple model parameters, and the parametric data-generating process can be specified in an arbitrarily complex manner to design increasingly realistic studies.

Irrespective of the complexity of $p(X \mid \theta)$, plausible datasets can be generated from $p(X)$ by first simulating $\theta$ from the marginal distribution of the parameters $p(\theta)$ and then simulating $X$ from the sampling distribution of the data $p(X \mid \theta)$ based on the sampled parameter values. This generates samples from the joint distribution of $X$ and $\theta$ as $p(X, \theta) = p(\theta)\,p(X \mid \theta)$. By generating samples from the joint distribution of $X$ and $\theta$ and “ignoring” the samples of $\theta$, we generate datasets from the distribution of the data, $p(X)$, that include both first-order (i.e., individual-level) uncertainty and second-order (i.e., parametric) uncertainty. In practice, samples of $\theta$ from $p(\theta)$ are required in PA and are thus available as part of standard cost-effectiveness analyses that compute the net benefit for each decision option $d$. To present the data-generating algorithm, the first 2 columns of Table 1 represent this standard PA, where the parameter samples and net benefits are indexed with a bracketed superscript.
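The two-step procedure can be motivated by writing the marginal distribution of the data as an integral over the prior. The identity below is a standard result, stated here as an aid to the reader rather than quoted from the article:

```latex
p(X) \;=\; \int p(X \mid \theta)\, p(\theta)\, d\theta
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} p\!\left(X \mid \theta^{(s)}\right),
\qquad \theta^{(s)} \sim p(\theta), \quad s = 1, \dots, S.
```

Drawing one dataset $X^{(s)}$ from $p(X \mid \theta^{(s)})$ for each PA sample $\theta^{(s)}$ therefore yields $S$ draws from $p(X)$, each paired with the parameter values and net benefits of the same PA row.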

Table 1

Representation of a Probabilistic Analysis (PA) Sample with $S$ Samples for a Set of $P$ Parameters and $D$ Decision Options

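Probabilistic Analysis Sample						Simulated Datasets
Parameters			Net Benefits			Simulated Datasets
$\theta_1^{(1)}$	…	$\theta_P^{(1)}$	$\mathrm{NB}_1^{(1)}$	…	$\mathrm{NB}_D^{(1)}$	$x_1^{(1)}$	…	$x_{O \times M}^{(1)}$
$\theta_1^{(2)}$	…	$\theta_P^{(2)}$	$\mathrm{NB}_1^{(2)}$	…	$\mathrm{NB}_D^{(2)}$	$x_1^{(2)}$	…	$x_{O \times M}^{(2)}$
⋮	⋱	⋮	⋮	⋱	⋮	⋮	⋱	⋮
$\theta_1^{(S)}$	…	$\theta_P^{(S)}$	$\mathrm{NB}_1^{(S)}$	…	$\mathrm{NB}_D^{(S)}$	$x_1^{(S)}$	…	$x_{O \times M}^{(S)}$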

The bracketed superscript indexes the parameter samples, corresponding net benefits, and simulated datasets.

We assume that our study aims to record $O$ quantities (study outcomes) on $M$ participants, resulting in $O \times M$ measurements in the study. For example, a study could recruit 100 people ($M = 100$) to measure their blood pressure and quality of life ($O = 2$). Thus, a single study dataset is denoted as the vector $X = (x_1, \dots, x_{O \times M})$. The third column of Table 1 demonstrates that each PA parameter sample $\theta^{(s)}$ is used to sample from the conditional distribution of the data, $p(X \mid \theta^{(s)})$, to generate the samples $X^{(s)}$ that follow the marginal distribution of the data $p(X)$. We can also consider studies (e.g., cohort or registry studies) that propose collecting the individual-level quantities at different time points. Again, these studies can be generated using the same algorithm, but each simulated dataset will contain $O \times M$ measurements for every time point.

Univariate Data Simulation for Complete Datasets

Initially, we consider studies that collect a single outcome at a single time point for each participant (i.e., $O = 1$).

Generating binary outcome data

Assume that our decision-analytic model has a parameter, $\theta_1$, that is the proportion of individuals in a population who experience an event (e.g., a stroke) under the current standard treatment. Our current knowledge about this proportion is represented by a prior distribution $p(\theta_1)$, informed from a previous study or a literature search. In our PA, we have $S$ samples drawn from $p(\theta_1)$. Information about $\theta_1$ could be updated by extracting $M$ individuals from a patient registry and determining whether each individual has experienced the event, resulting in a binary outcome (event v. no event) that can be simulated from a Bernoulli distribution with parameter equal to the probability of an adverse event. To generate datasets from $p(X)$, we take each value of $\theta_1^{(s)}$ for $s = 1, \dots, S$, and sample $M$ binary outcomes with parameter $\theta_1^{(s)}$. Assuming $\theta_1 \sim \mathrm{Uniform}(0.1, 0.2)$ and $M = 100$, this dataset can be generated in R by looping over the $S$ PA samples and drawing $M$ Bernoulli outcomes for each.

Alternatively, the number of events in each simulated study (i.e., a summary of the study data) can be sampled from a binomial distribution with parameter $\theta_1^{(s)}$ and the number of “trials” (size) equal to $M$. This highlights the distinction between simulating individual-level data and simulating a summary statistic of the individual-level data. In this example, simulating the data summary is relatively simple and therefore recommended. However, if multiple outcomes will be simulated for each individual (see the multivariate data simulation section), then the individual-level binary outcomes will likely be required.
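The contrast between individual-level data and a directly simulated summary statistic can be sketched as follows. The Uniform(0.1, 0.2) prior, $S = 1000$, and $M = 100$ match the values used for $\theta_1$ elsewhere in this tutorial; the vectorized layout and the variable names are choices made for this sketch.

```r
set.seed(2021)
S <- 1000                        # number of PA samples
M <- 100                         # individuals extracted from the registry
theta_1 <- runif(S, 0.1, 0.2)    # prior samples for theta_1

# Individual-level data: S x M matrix, row s drawn with probability theta_1[s]
x <- matrix(rbinom(S * M, size = 1, prob = rep(theta_1, each = M)),
            nrow = S, byrow = TRUE)

# Summary statistic drawn directly: number of events in each simulated study
n_events <- rbinom(n = S, size = M, prob = theta_1)

# Summing the individual-level outcomes gives a statistic with the
# same distribution as n_events (but different random draws)
n_events_from_x <- rowSums(x)
```

The direct draw needs $S$ random numbers instead of $S \times M$, which is the computational saving referred to above.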

Generating normally distributed continuous data

Assume that the decision-analytic model has a parameter, $\theta_2$, that represents the mean systolic blood pressure in the population. The current prior uncertainty about $\theta_2$, obtained through a previous study, is modeled in $p(\theta_2)$. Additional information could be gathered in a cross-sectional study that measures the blood pressure in $M$ individuals. We assume that the individual-level systolic blood pressure follows a normal distribution from which we can simulate a dataset for $M$ study participants. To generate datasets from the marginal distribution of the data, we take each value of $\theta_2^{(s)}$ for $s = 1, \dots, S$ and sample from a normal distribution with mean $\theta_2^{(s)}$. The variance for the normal distribution represents the individual-level variance in blood pressure and can either be assumed known or assigned a probability distribution that represents our uncertainty in the individual-level variance of the systolic blood pressure. Crucially, this individual-level variance, which can be extracted from the literature or estimated from available individual-level data, is unlikely to be equal to the variance of $p(\theta_2)$, which represents the uncertainty in our knowledge about the parameter. Note that an estimate of the individual-level variance is required for standard sample size calculations, used to ensure that a hypothesis test undertaken with the trial data has sufficient power. Given a prior for $\theta_2$, a sample size $M$, and an individual-level variance ($\sigma^2$) of 80, these data can be simulated in R using rnorm.

Alternatively, if the study is aiming to estimate the mean systolic blood pressure, then the summary statistic (i.e., the study mean systolic blood pressure) can be simulated directly from the sampling distribution of the mean. In this case, the study-level mean blood pressure would be simulated from a normal distribution with mean $\theta_2^{(s)}$ and standard deviation equal to the square root of the individual-level variance divided by the sample size, $\sqrt{\sigma^2 / M}$ (i.e., the standard error of the mean).

Many summary statistics are approximately normal (e.g., the log odds ratio or log hazard ratio), allowing us to potentially adapt this simulation method for other summary statistics. However, the standard error for these alternative summary statistics must be specified correctly, which can be challenging, especially when considering variable sample sizes for the study. Thus, it may be more appropriate to generate individual-level data and then calculate the summary statistic from the simulated dataset by analyzing the simulated data as if it were collected during a study (see the data on relative effectiveness section below).
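A minimal sketch of both approaches for normally distributed data follows. The individual-level variance of 80 is taken from the text; the Normal(120, 5^2) prior for $\theta_2$ and the sample size $M = 100$ are assumptions chosen for illustration only.

```r
set.seed(2021)
S <- 1000                                  # number of PA samples
M <- 100                                   # assumed study size
sigma2 <- 80                               # individual-level variance (from the text)
theta_2 <- rnorm(S, mean = 120, sd = 5)    # assumed prior for mean systolic BP

# Individual-level data: one row of M blood pressure readings per PA sample
x <- matrix(NA_real_, nrow = S, ncol = M)
for (s in seq_len(S)) {
  x[s, ] <- rnorm(M, mean = theta_2[s], sd = sqrt(sigma2))
}

# Summary statistic drawn directly from the sampling distribution of the mean:
# the standard deviation is the standard error, sqrt(sigma2 / M)
x_bar <- rnorm(S, mean = theta_2, sd = sqrt(sigma2 / M))
```

Note that `sd = sqrt(sigma2)` describes variation between individuals, whereas `sd = 5` describes uncertainty about the population mean; the two are deliberately different.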

Generating time-to-event data

Assume that our decision-analytic model has a parameter, $\theta_3$, that represents the probability that a patient’s cancer progresses within a 1-month period on the current standard treatment. The prior distribution of this transition probability, potentially estimated from the control arm in a clinical trial or from administrative data, is represented by $p(\theta_3)$ and will be updated by measuring the time to cancer progression in $M$ individuals from a cancer registry. Assuming that the rate of progression is constant over time, we can simulate time-to-progression data from an exponential distribution with rate $-\log(1 - \theta_3)$. Thus, generating datasets takes each value of $\theta_3^{(s)}$ for $s = 1, \dots, S$ and samples $M$ time-to-progression data from an exponential distribution with parameter $-\log(1 - \theta_3^{(s)})$. Given a prior for $\theta_3$ and a sample size $M$, these data can be generated in R using rexp.

Alternative time-to-event distributions are also available (e.g., Weibull, Gamma) but have different parameterizations of the data-generating process. These distributions are more complex because they have more than 1 parameter. Assume that our decision-analytic model is a partitioned survival model with a Weibull distribution estimating progression-free survival times for the current standard treatment and parameterized in terms of $\theta_4$ and $\theta_5$. Uncertainty in $(\theta_4, \theta_5)$ is represented by the joint distribution $p(\theta_4, \theta_5)$ and will be updated by a study that collects time-to-progression data for $M$ individuals. To generate datasets, we take each pair of values $(\theta_4^{(s)}, \theta_5^{(s)})$ for $s = 1, \dots, S$ and sample $M$ time-to-progression data from a Weibull distribution with correlated parameters $(\theta_4^{(s)}, \theta_5^{(s)})$; in R, these data can be generated using rweibull.

Note that choosing the appropriate individual-level distribution for this data simulation can be challenging, and methods are currently being developed to adapt the EVSI calculation method itself when the survival distribution is unknown. However, these methods still need to simulate from a range of survival distributions and will thus require the methods presented here.
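Both time-to-event simulations can be sketched as below. The priors are assumptions made for illustration (a Beta(10, 90) prior for $\theta_3$ and correlated log-normal draws for the Weibull parameters), as is the choice to treat $\theta_4$ as the shape and $\theta_5$ as the scale.

```r
set.seed(2021)
S <- 1000    # number of PA samples
M <- 100     # assumed study size

# Exponential: theta_3 is the 1-month progression probability (assumed prior)
theta_3 <- rbeta(S, 10, 90)
rate <- -log(1 - theta_3)                  # constant monthly progression rate
t_exp <- matrix(rexp(S * M, rate = rep(rate, each = M)),
                nrow = S, byrow = TRUE)    # row s uses rate[s]

# Weibull: correlated parameter draws on the log scale (assumed joint prior).
# Treating theta_4 as the shape and theta_5 as the scale is an assumption.
rho <- -0.5
z1 <- rnorm(S)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(S)   # cor(z1, z2) is about rho
theta_4 <- exp(log(1.2) + 0.1 * z1)
theta_5 <- exp(log(10) + 0.2 * z2)

t_weib <- matrix(NA_real_, nrow = S, ncol = M)
for (s in seq_len(S)) {
  t_weib[s, ] <- rweibull(M, shape = theta_4[s], scale = theta_5[s])
}
```

In a real analysis the pairs `(theta_4[s], theta_5[s])` would come straight from the PA sample, so that any correlation in the joint prior is carried into the simulated datasets.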

Generating utility data

Next, assume that our health economic model has a parameter, $\theta_6$, that represents the mean utility for a specific health state (e.g., the preprogression state). Information about $\theta_6$ could arise from a previous utility elicitation exercise and is encoded in a beta prior distribution $p(\theta_6)$. Additional information on the utility could be gathered through a utility elicitation study among $M$ individuals in the given health state (e.g., through the use of a standard gamble method). We can assume that this utility score follows a beta distribution with a mean of $\theta_6$ and an individual-level variance obtained from a previous study. To simulate these data, the mean and variance must be translated into the parameters of the beta distribution, which we achieve using the function calculate_beta_parameters; datasets for a study collecting utility scores from $M$ individuals can then be generated using rbeta.

There is a wide range of study types (e.g., those that collect data on costs or resource use) that we are not able to address directly in this tutorial. However, the general-purpose algorithm can be adapted to simulate from the relevant distributions (e.g., log-normal distribution for costs).
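The translation from mean and variance to beta parameters can be sketched with the method of moments. Only the function name calculate_beta_parameters comes from the text; its body, the Beta(70, 30) prior for $\theta_6$, and the individual-level variance of 0.01 are assumptions made for this illustration.

```r
set.seed(2021)
S <- 1000    # number of PA samples
M <- 100     # assumed study size

# Method-of-moments conversion from (mean, variance) to beta shape parameters.
# The body of this helper is an assumption; the text only names the function.
calculate_beta_parameters <- function(mu, sigma2) {
  stopifnot(all(sigma2 < mu * (1 - mu)))   # needed for a valid beta distribution
  k <- mu * (1 - mu) / sigma2 - 1
  list(alpha = mu * k, beta = (1 - mu) * k)
}

theta_6 <- rbeta(S, 70, 30)   # assumed prior for the mean utility
sigma2_u <- 0.01              # assumed individual-level variance

x <- matrix(NA_real_, nrow = S, ncol = M)
for (s in seq_len(S)) {
  pars <- calculate_beta_parameters(theta_6[s], sigma2_u)
  x[s, ] <- rbeta(M, shape1 = pars$alpha, shape2 = pars$beta)
}
```

For example, a mean of 0.7 with variance 0.01 gives $k = 0.21 / 0.01 - 1 = 20$, so the shape parameters are 14 and 6.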

Multivariate Data Simulation for Complete Datasets

If the proposed study collects more than 1 outcome for each study participant, $O > 1$, and/or outcomes at more than 1 time point, alternative methods will be required. In this framework, any study where the individuals receive different interventions (e.g., a randomized controlled trial) is defined as a multivariate data collection exercise. This is because we specify the treatment that the individual receives as one of the quantities of interest. Thus, $O \geq 2$, as we record the treatment and at least 1 outcome, as demonstrated in the data on relative effectiveness section below.

Independent multivariate data simulation

If the $O$ quantities generated for each participant are assumed to be independent, conditional on $\theta$, a separate univariate data-generating process can be specified for each of the $O$ quantities of interest and then combined into a single dataset. Assuming the data are independent conditional on the parameters does not mean that the data are uncorrelated, as any correlations in the model parameters, embodied in $p(\theta)$, would generate correlated patient-level study data. A combined study that investigates $M$ participants and records whether they experience an adverse event and their times to progression can be generated in R by combining the binary and time-to-event simulations described above. Rather than the spreadsheet structure demonstrated in Table 1, the data can be stored in a 3-dimensional array with rows for each study participant, columns for each recorded quantity, and matrix slices (the third dimension) for each simulation. This structure makes it easier to analyze data separately for each simulation if this is required to estimate the summary statistics.
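The 3-dimensional storage described above can be sketched as follows. The Uniform(0.1, 0.2) prior for $\theta_1$ matches the binary example; the Beta(10, 90) prior for $\theta_3$, the array layout details, and the column names are assumptions for this sketch.

```r
set.seed(2021)
S <- 1000    # number of PA samples
M <- 100     # assumed study size
theta_1 <- runif(S, 0.1, 0.2)    # adverse event probability
theta_3 <- rbeta(S, 10, 90)      # assumed prior: 1-month progression probability

# Rows: participants; columns: recorded quantities; slices: simulations
x <- array(NA_real_, dim = c(M, 2, S),
           dimnames = list(NULL, c("adverse_event", "time_to_progression"), NULL))

for (s in seq_len(S)) {
  # Each quantity has its own univariate process, independent given theta
  x[, "adverse_event", s] <- rbinom(M, size = 1, prob = theta_1[s])
  x[, "time_to_progression", s] <- rexp(M, rate = -log(1 - theta_3[s]))
}

# One simulated study is a single slice, ready to analyze like real data
first_study <- as.data.frame(x[, , 1])
```

Extracting a slice with `x[, , s]` returns the whole simulated study for PA sample `s`, which is convenient when a summary statistic must be computed study by study.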

Dependent multivariate data simulation

Multivariate data simulation is more complex when the simulated quantities are correlated for each participant (e.g., if participants with shorter survival times are more likely to experience adverse events). This correlation must be specified when we generate multivariate data and can either be assumed fixed or assigned a probability distribution that represents our uncertainty about the correlation. If we ignore the correlation, we are implicitly assuming that it is zero, with certainty. Thus, even if evidence about the correlation structure is lacking, it is important to assess whether this assumption of zero correlation is valid. In general, the correlation can be informed 1) by the literature, although reporting on correlation is often lacking, and you may need to request this information from the authors; 2) by calculating the correlation in available data; or 3) through expert elicitation.

One method to generate correlated data initially generates uncorrelated data and then reorders the simulated dataset to achieve the required correlation.[35,36] These reordering methods are implemented in the R function postSimOpt, which generates correlated data with a given correlation matrix. If we are generating correlated data similar to the previous example recording adverse events and time-to-progression data from $M$ participants, then we can reorder the data from the previous example to have, for instance, a correlation of −0.2.

Correlated data can also be generated using regression to specify the dependencies between the quantities of interest. The regression method decomposes the joint distribution of these quantities into conditional and marginal distributions, where the conditional distributions are defined using regression models. This method can generate data for $O$ correlated quantities of interest by initially generating a value of the first quantity from its marginal distribution, before proceeding to generate the second quantity conditional on the first, with the relationship specified using regression. Following this, the third quantity can be generated based on the first two, and so on. If $O$ is small, then the required regression models may have been published, but as the number of outcomes increases, IPD will be required to fit these models. The data generation should consider uncertainty in the parameters of the regression model, specified either by fitting the regression models using Bayesian methods or sampling the regression coefficients from their sampling distribution. This sampling distribution is approximately multivariate normal with the variance-covariance matrix estimated when the regression models are fit in standard software. Thus, if published regression models are used, the variance of the regression parameters must also be extracted. The regression method can be applied to the previous example by treating its first simulated dataset as IPD recording adverse events and time-to-progression data, saved in a data frame called dat. These methods can be combined with the uncorrelated data generation processes to generate both dependent and independent data for the proposed study.
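The regression method can be sketched as below for two quantities. This is an illustration under stated assumptions, not the authors' implementation: the data frame dat is a hypothetical stand-in generated on the spot, the logistic model form, column names, and Beta(10, 90) prior for $\theta_3$ are all assumed, and coefficient uncertainty is drawn from the approximate multivariate normal sampling distribution using a Cholesky factor in base R.

```r
set.seed(2021)
S <- 500     # number of PA samples
M <- 100     # assumed study size

# Hypothetical stand-in for the IPD `dat` (values invented for illustration)
dat <- data.frame(time = rexp(300, rate = 0.1))
dat$ae <- rbinom(300, size = 1, prob = plogis(-1 - 0.05 * dat$time))

# Conditional model: adverse event given time to progression
fit <- glm(ae ~ time, family = binomial, data = dat)
beta_hat <- coef(fit)
R <- chol(vcov(fit))             # upper triangular, t(R) %*% R equals vcov(fit)

theta_3 <- rbeta(S, 10, 90)      # assumed prior: 1-month progression probability

x <- array(NA_real_, dim = c(M, 2, S),
           dimnames = list(NULL, c("time", "ae"), NULL))
for (s in seq_len(S)) {
  # Draw regression coefficients from their approximate sampling distribution
  beta_s <- beta_hat + drop(t(R) %*% rnorm(2))
  # Marginal distribution of the first quantity
  time_s <- rexp(M, rate = -log(1 - theta_3[s]))
  # Second quantity generated conditional on the first
  p_s <- plogis(beta_s[1] + beta_s[2] * time_s)
  x[, "time", s] <- time_s
  x[, "ae", s] <- rbinom(M, size = 1, prob = p_s)
}
```

Redrawing `beta_s` for every simulation is what propagates uncertainty in the dependence structure; fixing it at `beta_hat` would treat the fitted relationship as known.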

Data on relative effectiveness

Data from a proposed randomized controlled trial, which updates uncertainty in the log odds ratio of an event on a novel intervention compared to the current standard treatment ($\theta_7$), also require correlated multivariate data generation. The first quantity of interest is a treatment indicator, highlighting which treatment each participant receives. In an equally randomized 2-arm trial, this is generated from a Bernoulli distribution with probability 0.5, with a 1 representing that the participant has been randomized to receive the novel intervention. To calculate the patient-level probability of experiencing the outcome event of interest from this indicator, we must combine the $s$th simulated value of $\theta_7$ with the $s$th simulated value of the baseline probability of experiencing the event under the standard treatment, denoted $\theta_1$. (Note that information on the baseline probability of the event can, and often should, come from a different source than the information to inform $\theta_7$; e.g., the baseline event rate comes from administrative data, while a previous clinical trial would inform the relative effectiveness.) The individual-level log odds of experiencing the event can then be computed by adding $\theta_7^{(s)}$, multiplied by the treatment indicator, to $\mathrm{logit}(\theta_1^{(s)})$. The individual-level probability of the event is then calculated from the log odds using the inverse logit transformation, and the individual-level response can be generated from a Bernoulli distribution with these probabilities. The summary statistic (e.g., the observed log odds ratio) can then be estimated by fitting a generalized linear model to the $s$th dataset as though the simulated data were observed; this method applies to a study collecting data on any number of participants $M$.

This example uses binary outcomes and log odds ratios as a measure of relative effect. If an alternative outcome type and/or measure of relative effect is used, then this method must be adapted to translate the parameters to the additive scale and back to generate the data. We provide code to implement this method for survival outcomes and log hazard ratios in the supplementary material.

Finally, there are many methods for generating correlated data that are not discussed in this tutorial. Copulas are a class of statistical models that combine univariate marginal distributions and a multivariate correlation structure and can generate correlated data. Elsewhere, methods can ensure that simulated data preserve their rank (i.e., in situations where 1 outcome must be larger than another). Microsimulation models or discrete-event simulations can also generate interrelated individual event data in a highly flexible but more computationally intensive manner.[40,41]
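The trial simulation for a binary outcome can be sketched as follows. The Uniform(0.1, 0.2) prior for the baseline probability matches the binary example; the Normal(-0.5, 0.2^2) prior for the log odds ratio, the trial size, and the number of simulations are assumptions chosen for illustration.

```r
set.seed(2021)
S <- 200     # number of PA samples (kept small because a model is fit per sample)
M <- 500     # assumed trial size, equally randomized between 2 arms
theta_1 <- runif(S, 0.1, 0.2)                # baseline event probability
theta_7 <- rnorm(S, mean = -0.5, sd = 0.2)   # assumed prior for the log odds ratio

log_or_hat <- numeric(S)
for (s in seq_len(S)) {
  arm <- rbinom(M, size = 1, prob = 0.5)     # 1 = novel intervention
  # Log odds: baseline on the logit scale plus treatment effect for arm == 1
  log_odds <- qlogis(theta_1[s]) + theta_7[s] * arm
  y <- rbinom(M, size = 1, prob = plogis(log_odds))
  # Analyze the simulated trial as though the data were observed
  fit <- glm(y ~ arm, family = binomial)
  log_or_hat[s] <- coef(fit)["arm"]
}
```

The vector `log_or_hat` is the summary statistic, one observed log odds ratio per simulated trial. With small arms or rare events a simulated trial can contain an arm with no events, in which case `glm` returns an extreme estimate and a warning, so the trial size should be chosen with that in mind.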

Realistic Study Designs

Realistic studies can encounter issues with missing values, loss to follow-up, and censoring, which should be included in our data simulation procedure.

Missingness

Data that are not recorded during a study (i.e., missing data) are commonly accounted for in study design and analysis. Thus, simulating missing values based on knowledge about the potential rate of missingness will often be required. A “missingness indicator” equals 1 if the participant’s data are missing and 0 otherwise. This indicator can be simulated from a Bernoulli distribution with probability equal to the expected level of missingness, obtained from the literature or expert opinion. Once the missingness indicator has been generated, participants with a missingness indicator of 1 are then “deleted” from the simulated dataset. If the study collects multivariate outcomes, then missingness can be considered separately for each outcome. The simplest type of missingness (i.e., missing completely at random) generates the missingness indicator independently of the quantities of interest, for example, using a Bernoulli probability of 0.1 when 10% of the data are expected to be missing.

A correlation between the data and the missingness indicator (i.e., where participant outcomes or traits lead to higher levels of missingness) can also be assumed and would induce bias in estimates from the data and EVSI if it is not accounted for properly. If this type of missingness is used, then the method for updating the distribution of the model parameters, based on the data, would also need to be adjusted using common methods for addressing missing data.
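Missingness completely at random at a 10% rate can be sketched as below, applied to simulated blood pressure data. The individual-level variance of 80 and the 10% rate come from the text; the Normal(120, 5^2) prior and the study size are assumptions, and missing values are marked as NA rather than physically removed so that the matrix keeps its shape.

```r
set.seed(2021)
S <- 1000          # number of PA samples
M <- 100           # assumed number of recruited participants
p_missing <- 0.1   # expected proportion of missing data
theta_2 <- rnorm(S, mean = 120, sd = 5)    # assumed prior for mean systolic BP

# Complete data: row s holds M readings with mean theta_2[s]
x <- matrix(rnorm(S * M, mean = rep(theta_2, each = M), sd = sqrt(80)),
            nrow = S, byrow = TRUE)

# Missingness indicator, generated independently of the outcomes (MCAR)
missing <- matrix(rbinom(S * M, size = 1, prob = p_missing), nrow = S)

# "Delete" the missing observations by setting them to NA
x_obs <- x
x_obs[missing == 1] <- NA

# Each simulated study now has a random number of observed values
n_observed <- rowSums(!is.na(x_obs))
x_bar_obs <- rowMeans(x_obs, na.rm = TRUE)
```

Because the realized sample size `n_observed` varies between simulations, any summary statistic should be computed from the observed values of each study rather than assuming all `M` participants contribute.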

Censoring in time-to-event data

Censoring is commonly encountered when working with time-to-event data; for example, right-censored data include the information that a participant did not experience an event during the study but do not record when (or if) the event is experienced after the study’s observation period ended. Censoring is modeled by adding a “censoring indicator” to the dataset, which equals 0 if the data point is censored and 1 if it is not. To generate censored survival data, we first generate the event time for each participant from a suitable uncensored model (cf. generating time-to-event data). We then generate a potential “censoring time” for each participant; this can either be a fixed number (i.e., all patients are censored at the end of the study follow-up) or simulated from a different time-to-event distribution with parameters estimated to reflect patterns of dropout or loss to follow-up seen in similar studies. If the censoring event occurs before the event, we change the event time to the censoring time and the censoring indicator to 0. For example, time-to-progression data can be censored at a fixed follow-up time of 6 months.

This procedure implements right-censoring, commonly seen in randomized controlled trials, but a similar method could simulate left-censored data, where the event time is not observed if it occurs before the censoring time. Finally, interval censoring, where only the time interval in which the event occurs is known, requires a more complex specification.
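Right-censoring at a fixed follow-up of 6 months can be sketched as follows. The 6-month censoring time and the exponential model with rate $-\log(1 - \theta_3)$ follow the text; the Beta(10, 90) prior for $\theta_3$ and the study size are assumptions for illustration.

```r
set.seed(2021)
S <- 1000        # number of PA samples
M <- 100         # assumed study size
follow_up <- 6   # months of follow-up before administrative censoring
theta_3 <- rbeta(S, 10, 90)   # assumed prior: 1-month progression probability

# Uncensored time to progression: row s uses the rate implied by theta_3[s]
event_time <- matrix(rexp(S * M, rate = rep(-log(1 - theta_3), each = M)),
                     nrow = S, byrow = TRUE)

# Censoring indicator: 1 if the event is observed, 0 if censored
status <- (event_time <= follow_up) * 1L

# Observed time: the event time, or the censoring time if that comes first
time <- pmin(event_time, follow_up)
```

Replacing the scalar `follow_up` with a matrix of simulated dropout times of the same dimensions would give participant-specific censoring using the same two lines.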

Simulating Study Outcomes Using Nonparametric Resampling

If the decision-analytic model is based on IPD, we could investigate whether there is value in collecting additional data with the same (or a similar) study design. Given IPD are available, we could generate data in this setting by resampling the IPD and avoid specifying parametric distributions for the data. Resampling from IPD, denoted $y = (y_1, \dots, y_N)$, can characterize parameter uncertainty using bootstrap methods, but these methods must be extended to generate the range of plausible datasets from $p(X)$. Assume that a parameter for a decision-analytic model, $\theta_8$, can be estimated as a function of the IPD. The uncertainty in $\theta_8$ can be estimated by resampling $N$ times from $y$ with replacement to create multiple pseudo-datasets $y^{(s)}$, $s = 1, \dots, S$, before estimating the model parameter $\theta_8^{(s)}$ from each pseudo-dataset (Table 2).

Table 2

Representation of the Bootstrap Estimation Method for the Parameter $\theta_8$ Based on an Initial Sample of Size $N$

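Simulation	$y_1$	$y_2$	$y_3$	…	$y_N$	$\theta_8$
1	$y_1^{(1)}$	$y_2^{(1)}$	$y_3^{(1)}$	…	$y_N^{(1)}$	$\theta_8^{(1)}$
2	$y_1^{(2)}$	$y_2^{(2)}$	$y_3^{(2)}$	…	$y_N^{(2)}$	$\theta_8^{(2)}$
⋮	⋮	⋮	⋮	⋱	⋮	⋮
S	$y_1^{(S)}$	$y_2^{(S)}$	$y_3^{(S)}$	…	$y_N^{(S)}$	$\theta_8^{(S)}$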
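As a minimal sketch of the bootstrap procedure represented in Table 2, the R code below resamples a hypothetical IPD vector and re-estimates the parameter for each pseudo-dataset. The IPD values and the choice of the sample mean as the estimating function (here called `f`) are assumptions made purely for illustration; in practice, the function would be whatever estimator links the IPD to the model parameter.

```r
set.seed(1)                        # For reproducibility of this sketch
N <- 150                           # Size of the initial IPD sample
S <- 1000                          # Number of bootstrap simulations
y <- runif(N, 10, 30)              # Hypothetical IPD
f <- function(data) mean(data)     # Assumed estimator (illustration only)

theta_8 <- numeric(length = S)     # Set up empty vector
for (s in 1:S) {                   # One row of Table 2 per iteration
  y_s <- sample(y, N, replace = TRUE)  # Pseudo-dataset of size N
  theta_8[s] <- f(y_s)                 # s-th bootstrap estimate
}

# Summarize the bootstrap distribution of the parameter
quantile(theta_8, probs = c(0.025, 0.5, 0.975))
```

The vector `theta_8` then plays the role of the probabilistic analysis sample for this parameter.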

To simulate a dataset of $M$ participants for each row of the PA dataset, we resample $M$ values with replacement from each pseudo-dataset $y^{(s)}$, $s = 1, \dots, S$ (i.e., resample from each row of Table 2). This is equivalent to generating the data conditional on the $s$-th simulated value of the model parameters. The R code for this resampling algorithm is the final listing among the code examples that follow the Conclusion.

# Binary outcomes: simulate M Bernoulli observations for each of S studies
S <- 1000                              # Number of simulated datasets
M <- 100                               # Number of individuals extracted from the registry
x <- matrix(NA, nrow = S, ncol = M)    # Set up empty matrix
theta_1 <- runif(S, 0.1, 0.2)          # Distribution for theta_1
for (s in 1:S) {                       # Simulate s = 1,…,S studies
  p <- theta_1[s]                      # Set the Bernoulli parameter to the s-th value of theta_1
  x[s, ] <- rbinom(n = M, size = 1, prob = p)   # Sample M binary event outcomes
}

This resampling method can also generate datasets that are similar to the IPD. For example, if the proposed study targets younger participants than the previous study, we could perform a weighted resampling to sample the younger patients more frequently. We could also sample a subset of the quantities from the previous study to evaluate the value of a more targeted study or plan a study with a shorter follow-up.

Once we have generated our resampled datasets, the efficient EVSI estimation procedures require different adaptations to estimate EVSI. Methods that require Bayesian updating (e.g., the standard Monte Carlo method and the moment matching method) must use an adapted bootstrap algorithm, which we are currently developing, to approximate the Bayesian updating without specifying the prior and the sampling distribution of the data analytically. Methods that require a summary statistic (e.g., the regression-based method) can be used by calculating the parameter $\theta_8$ for each simulated dataset with the same function used to estimate it from the IPD. Note that one of the EVSI calculation methods is based on evaluating the likelihood function of the data and so cannot be used with this resampling method.
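The weighted resampling and the summary-statistic calculation described above could be sketched as follows. The IPD data frame, the age threshold of 60 years, the 2:1 sampling weights, and the use of the sample mean as the summary statistic are all hypothetical choices made for illustration, not part of the original tutorial.

```r
set.seed(1)                        # For reproducibility of this sketch
N <- 150                           # Size of the previous study
M <- 100                           # Size of the proposed study
S <- 1000                          # Number of simulated datasets
ipd <- data.frame(                 # Hypothetical IPD with one covariate
  age     = round(runif(N, 40, 80)),
  outcome = runif(N, 10, 30)
)

Wx       <- numeric(length = S)    # Summary statistic for each dataset
mean_age <- numeric(length = S)    # Mean age, kept to check the weighting
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  ipd_s <- ipd[sample(N, N, replace = TRUE), ]   # Bootstrap sample of the IPD
  w_s <- ifelse(ipd_s$age < 60, 2, 1)            # Assumed weights: younger
                                                 # patients twice as likely
  idx <- sample(N, M, replace = TRUE, prob = w_s)  # Weighted resample of M rows
  x_s <- ipd_s[idx, ]                            # s-th simulated dataset
  Wx[s]       <- mean(x_s$outcome)               # Summary statistic
  mean_age[s] <- mean(x_s$age)
}
```

The vector `Wx` could then be passed to a method that works with summary statistics, such as the regression-based method, while `mean_age` confirms that the simulated studies are younger on average than the original IPD.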

Discussion

EVSI can be used to optimize study designs to generate data to support decision making in HTA processes, which are often based on decision-analytic models. EVSI can formalize the decision to collect additional information before making policy decisions in health, thereby ensuring that effective and efficient treatments are available to patients.[48-50] This tutorial supports the increased use of EVSI by researchers, decision makers, and industry partners by presenting a range of methods to generate simulated datasets for EVSI calculation.

Recent research has allowed practical EVSI calculations through the development of efficient estimation methods, which generally require simulated datasets from a proposed future study. The methods presented in this tutorial can be used to simulate datasets from randomized trials and observational studies with a range of outcome types, including uni- and multivariate datasets. Furthermore, they support the modeling of imperfect study conduct and incomplete data collection. Finally, they are applicable with and without individual patient-level data. We demonstrate these methods using R code and, where appropriate, with Excel spreadsheets included in the supplementary material. Once we have simulated the datasets from the proposed study, the final computation of EVSI depends on the selected algorithm, as detailed in Kunst et al.

Accurate EVSI estimation requires realistic data simulation. These datasets should reflect our judgments about the data, encoded in our chosen parameter distributions and data-generating process. Thus, they do not need to reflect a dataset that has previously been collected, making it challenging to determine if the simulated datasets are “correct.” However, when developing the simulation method, biological plausibility can and should be checked (e.g., determine that all simulated survival times are within the life span of a human). It may also be worthwhile to check whether the simulated data reflect the specified inputs (e.g., calculate the individual-level variance for each simulation and check if it is approximately equal to the specified variance). As the number of simulated datasets is large, these checks may only be possible for a small number of the datasets and can be used for validation.

As studies can be designed with almost infinite complexities, many study designs that are relevant to health economic decision making could not be included in this tutorial. For example, simulating data on utilities is potentially more complex than the method presented in this tutorial as health states are often ranked, and the data simulation should take this into account, potentially through previously developed methods. Recent research has also proposed methods for EVSI calculation when the survival distribution is unknown and may change based on the future data. Furthermore, studies based on long-term longitudinal cohorts will require complex multivariate data generation and missing data patterns. Finally, the estimation of study costs to compute ENBS and optimize study design has received limited discussion in the literature despite its importance to ensure accurate research prioritization.
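The validation checks suggested above could be sketched as follows for simulated blood pressure data. The parameter distribution and variance mirror the hypothetical values used in the code listings, whereas the number of datasets checked (20) and the plausibility bounds of 50 and 250 are assumptions chosen only for illustration.

```r
set.seed(1)                        # For reproducibility of this sketch
S <- 1000                          # Number of simulated datasets
M <- 100                           # Participants per study
v <- 80                            # Specified individual-level variance
theta_2 <- runif(S, 120, 130)      # Hypothetical distribution for theta_2
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  x[s, ] <- rnorm(n = M, mean = theta_2[s], sd = sqrt(v))
}

# Check 1: the empirical variance of a random subset of datasets should be
# close to the specified variance (ratios near 1)
check   <- sample(S, 20)                   # Datasets selected for validation
sim_var <- apply(x[check, ], 1, var)       # Variance within each dataset
summary(sim_var / v)

# Check 2: the study means should be close to the simulated parameter values
summary(rowMeans(x[check, ]) - theta_2[check])

# Check 3: plausibility, using assumed bounds for a blood pressure measure
plausible <- all(x > 50 & x < 250)
plausible
```

If a check fails, this points to an error in the data-generating code or to parameter distributions that place weight on implausible values.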

Conclusion

This tutorial presents a general-purpose algorithm for generating simulated datasets from a probabilistic analysis and explores common correlated and uncorrelated data types. This method is demonstrated in several examples but can be extended to more complex study designs, as required. Hence, this tutorial facilitates practical EVSI calculations and allows research design and prioritization based on ENBS.

Supplemental material for Simulating Study Data to Support Expected Value of Sample Information Calculations: A Tutorial by Anna Heath, Mark Strong, David Glynn, Natalia Kunst, Nicky J. Welton and Jeremy D. Goldhaber-Fiebert in Medical Decision Making: sj-txt-5-mdm-10.1177_0272989X211026292, sj-txt-6-mdm-10.1177_0272989X211026292, sj-xlsx-1-mdm-10.1177_0272989X211026292, sj-xlsx-2-mdm-10.1177_0272989X211026292, sj-xlsx-3-mdm-10.1177_0272989X211026292, sj-xlsx-4-mdm-10.1177_0272989X211026292.

M <- 100
Wx <- numeric(length = S)                      # Set up empty vector
for (s in 1:S) {                               # Simulate s = 1,…,S studies
  p <- theta_1[s]                              # Set the Binomial parameter to the s-th value of theta_1
  Wx[s] <- rbinom(n = 1, size = M, prob = p)   # Sample count of the event outcomes
}

S <- 1000
M <- 100
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
theta_2 <- runif(S, 120, 130)      # Hypothetical distribution for theta_2
v <- 80
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  mu <- theta_2[s]                 # Set the Normal mean parameter to the s-th value of theta_2
  x[s, ] <- rnorm(n = M, mean = mu, sd = sqrt(v))   # Sample M blood pressure measures
}

M <- 100
v <- 80
Wx <- numeric(length = S)          # Set up empty vector
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  mu <- theta_2[s]                 # Set the Normal mean parameter to the s-th value of theta_2
  Wx[s] <- rnorm(n = 1, mean = mu, sd = sqrt(v / M))   # Sample study mean BP
}

S <- 1000
theta_3 <- runif(S, 0.2, 0.3)      # Hypothetical distribution for theta_3
M <- 100
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  r <- -log(1 - theta_3[s])        # Derive rate from s-th value of the transition probability
  x[s, ] <- rexp(n = M, rate = r)  # Sample M times-to-progression
}

S <- 1000
# Correlated joint distribution for theta_4 and theta_5
# (Column 1: theta_4, Column 2: theta_5)
theta_4_5 <- MASS::mvrnorm(S, c(5, 6), matrix(c(0.3, 0.1, 0.1, 0.5), nrow = 2))
M <- 100
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  shape <- theta_4_5[s, 1]         # Weibull shape parameter from s-th value of theta_4
  scale <- theta_4_5[s, 2]         # Weibull scale parameter from s-th value of theta_5
  x[s, ] <- rweibull(n = M, shape = shape, scale = scale)   # Sample M times-to-progression
}

S <- 1000
theta_6 <- rbeta(S, 70, 15)        # Hypothetical distribution for theta_6
M <- 100
v <- 0.04
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
calculate_beta_parameters <- function(mean, sd) {
  # Function to estimate beta parameters from mean and standard deviation
  shape1 <- ((1 - mean) / sd ^ 2 - 1 / mean) * mean ^ 2
  shape2 <- shape1 * (1 / mean - 1)
  # Return the calculated parameters.
  return(list(shape1 = shape1, shape2 = shape2))
}
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  # Derive beta parameters with iteration specific mean
  params <- calculate_beta_parameters(theta_6[s], sqrt(v))
  x[s, ] <- rbeta(n = M, shape1 = params$shape1,
                  shape2 = params$shape2)   # Sample M beta-distributed outcomes
}

S <- 1000
O <- 2
M <- 100
x <- array(dim = c(M, O, S))       # Set up empty array
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  p <- theta_1[s]                  # Set the Bernoulli parameter to the s-th value of theta_1
  r <- -log(1 - theta_3[s])        # Derive rate from s-th value of the transition probability
  x[ , 1, s] <- rbinom(n = M, size = 1, prob = p)   # Sample M binary adverse outcomes
  x[ , 2, s] <- rexp(n = M, rate = r)               # Sample M times-to-progression
}

library(SimJoint)                  # Package containing function to reorder data
S <- 1000
O <- 2
M <- 100
correlation <- matrix(c(1, -0.2, -0.2, 1), nrow = 2)   # Specify the correlation matrix
x <- array(dim = c(M, O, S))       # Set up empty array
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  p <- theta_1[s]                  # Set the Bernoulli parameter to the s-th value of theta_1
  r <- -log(1 - theta_3[s])        # Derive rate from s-th value of the transition probability
  x[ , 1, s] <- rbinom(n = M, size = 1, prob = p)   # Sample M binary adverse outcomes
  x[ , 2, s] <- rexp(n = M, rate = r)               # Sample M times-to-progression
  # Reorder the columns so they are correlated
  x[ , , s] <- postSimOpt(x[ , , s], correlation)$X
}

library(MASS)                      # Package to simulate from multivariate normal distribution
S <- 1000
M <- 100
O <- 2
dat <- as.data.frame(x[ , , 1])
names(dat) <- c("AE", "Time_Prog") # Name the columns used in the model formula
# Generalised Linear Model to predict adverse event
# probability from times-to-progression
mod <- glm(AE ~ Time_Prog, data = dat, family = "binomial")
theta_reg <- mvrnorm(S, coef(mod), vcov(mod))   # Sampling distribution of coefficients
x <- array(dim = c(M, O, S))       # Set up empty array
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  r <- -log(1 - theta_3[s])        # Derive rate from s-th value of the transition probability
  x[ , 2, s] <- rexp(n = M, rate = r)   # Sample M times-to-progression
  mod$coefficients <- theta_reg[s, ]    # Set the coefficients to their s-th value
  # Predict probability of an adverse event from the simulated
  # times-to-progression
  p.ind <- predict(mod, data.frame(Time_Prog = x[ , 2, s]), type = "response")
  x[ , 1, s] <- rbinom(n = M, size = 1, prob = p.ind)   # Sample M binary adverse outcomes
}

library(boot)                      # Package for logit and inv.logit
S <- 1000
M <- 100
O <- 2
theta_7 <- rnorm(S, 1.2, 0.1)      # Hypothetical distribution for log odds ratio
theta_8 <- runif(S, 0.2, 0.3)      # Hypothetical distribution for baseline risk
x <- array(dim = c(M, O, S))       # Set up empty array
Wx <- numeric(length = S)          # Set up empty vector for simulated summary statistic
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  # Sample M treatment indicators
  x[ , 1, s] <- rbinom(n = M, size = 1, prob = 0.5)
  # Calculate s-th baseline log odds
  baseline.logodds <- logit(theta_8[s])
  # Calculate individual log odds from baseline log odds
  # and the s-th log odds ratio
  individual.logodds <- baseline.logodds + theta_7[s] * x[ , 1, s]
  # Calculate probability from log odds
  individual.prob <- inv.logit(individual.logodds)
  # Sample M binary outcomes
  x[ , 2, s] <- rbinom(n = M, size = 1, prob = individual.prob)
  # Create a dataframe with the data
  data.complete <- data.frame(x[ , , s])
  names(data.complete) <- c("Treatment", "Outcome")
  # Generalised linear model to compute the log odds ratio for the s-th dataset
  Wx[s] <- coef(glm(Outcome ~ Treatment, data = data.complete,
                    family = "binomial"))[2]
}

S <- 1000
theta_2 <- runif(S, 120, 130)      # Hypothetical distribution for theta_2
M <- 100
v <- 80
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  mu <- theta_2[s]                 # Set the Normal mean parameter to the s-th value of theta_2
  x[s, ] <- rnorm(n = M, mean = mu, sd = sqrt(v))   # Sample M blood pressure measures
  missing <- rbinom(n = M, size = 1, prob = 0.1)    # Sample missingness indicator
  x[s, which(missing == 1)] <- NA                   # Knock out the missing observations
}

S <- 1000
theta_3 <- runif(S, 0.2, 0.3)      # Hypothetical distribution for theta_3
M <- 100
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
censoring_time <- 6
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  r <- -log(1 - theta_3[s])        # Derive rate from s-th value of the transition probability
  x[s, ] <- rexp(n = M, rate = r)  # Sample M times-to-progression
}
# Note: TRUE flags censored observations here; the 0/1 censoring indicator
# described in the text corresponds to as.integer(!censoring_indicator)
censoring_indicator <- (x > censoring_time)   # Set indicator for times > 6 months
x[censoring_indicator] <- censoring_time      # Set censored times to 6 months

S <- 1000
N <- 150
M <- 100
y <- runif(N, 10, 30)              # Hypothetical IPD
x <- matrix(nrow = S, ncol = M)    # Set up empty matrix
for (s in 1:S) {                   # Simulate s = 1,…,S studies
  y_s <- sample(y, N, replace = TRUE)      # Bootstrap sample from y
  x[s, ] <- sample(y_s, M, replace = TRUE) # Sample M IPD values from y_s
}
