
Combining heterogeneous subgroups with graph-structured variable selection priors for Cox regression.

Katrin Madjar1, Manuela Zucknick2, Katja Ickstadt3, Jörg Rahnenführer3.   

Abstract

BACKGROUND: Important objectives in cancer research are the prediction of a patient's risk based on molecular measurements such as gene expression data and the identification of new prognostic biomarkers (e.g. genes). In clinical practice, this is often challenging because patient cohorts are typically small and can be heterogeneous. In classical subgroup analysis, a separate prediction model is fitted using only the data of one specific cohort. However, this can lead to a loss of power when the sample size is small. Simple pooling of all cohorts, on the other hand, can lead to biased results, especially when the cohorts are heterogeneous.
RESULTS: We propose a new Bayesian approach suitable for continuous molecular measurements and survival outcome that identifies the important predictors and provides a separate risk prediction model for each cohort. It allows sharing information between cohorts to increase power by assuming a graph linking predictors within and across different cohorts. The graph helps to identify pathways of functionally related genes and genes that are simultaneously prognostic in different cohorts.
CONCLUSIONS: Results demonstrate that our proposed approach is superior to the standard approaches in terms of prediction performance and increased power in variable selection when the sample size is small.
© 2021. The Author(s).

Keywords:  Bayesian variable selection; Cox proportional hazards model; Gaussian graphical model; Heterogeneous cohorts; Markov random field prior; Subgroup analysis

Year:  2021        PMID: 34895139      PMCID: PMC8665528          DOI: 10.1186/s12859-021-04483-z

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

In clinical research, molecular measurements such as gene expression data play an important role in the diagnosis and prediction of a disease outcome, such as a time-to-event endpoint. In general, the number of molecular predictors is larger than the sample size (the “p ≫ n problem”) and typically only a small number of genes is associated with the outcome while the rest is noise. Thus, important objectives in statistical modeling are good prediction performance and variable selection to obtain a subset of prognostic predictors. In the Bayesian framework, different types of variable selection priors have been proposed, also with application to the Bayesian Cox model. One common choice is the use of shrinkage priors such as the Bayesian lasso as an analog to the frequentist penalized likelihood approach [20, 26, 41]. A popular alternative is the “spike-and-slab” prior that uses latent indicators for variable selection and a mixture distribution for the regression coefficients [14, 35]. In general, the regression coefficients are modeled independently. However, in applications to molecular data, it can be reasonable to consider structural information between covariates, since the effect on a clinical outcome is typically not caused by single genes acting in isolation, but rather by changes in a regulatory or functional pathway of interacting genes. Several authors have dealt with this problem by using a Markov random field (MRF) prior to incorporate structural information on the relationships among the covariates into variable selection [21, 28, 33, 34]. Alternatively, Chakraborty and Lozano [5] propose a Graph Laplacian prior for modeling the dependence structure between the regression coefficients through their precision matrix.
When the data are heterogeneous and consist of known subpopulations with possibly different dependence structures, estimating one joint graphical model would hide the underlying heterogeneity, while estimating separate models for each subpopulation would neglect common structure. For this situation, Danaher et al. [8] use an extension of the frequentist graphical lasso with either a group or fused lasso type penalty for joint structure learning. Saegusa and Shojaie [30] propose a weighted Laplacian shrinkage penalty where the weights represent the degree of similarity between subpopulations. Bayesian approaches for sharing common structure in the joint inference of multiple graphical models have also been developed [24, 27, 40]. Peterson et al. [27] use an MRF prior for the graph structures with pairwise similarities between different graphs. However, all these methods have in common that they focus on structure learning only and do not take into account the relationship between (structured) covariates and a clinical outcome, as in the context of regression modeling. We consider the situation that molecular measurements and a survival outcome are available for various, possibly heterogeneous patient subgroups or cohorts, such as in a multicenter study. In the following, we use the term “subgroup” for different pre-known patient cohorts or data sets. In classical subgroup analysis, only the data of the subgroup of interest is used to build a risk prediction model for this specific subgroup. This may lead to a loss of power or unstable results with high variance, especially in small subgroups. Thus, it is tempting to simply pool all data to increase the sample size. This approach, however, can result in biased estimates when the subgroups are heterogeneous regarding their effects, and subgroup-specific effects may get lost. We aim at sharing information between subgroups to increase power when this is supported by the data.
Our approach provides a separate risk prediction model for each subgroup that allows the identification of common as well as subgroup-specific effects, and it has improved prediction accuracy and variable selection power compared to the two standard approaches. Some frequentist approaches tackle this problem by suggesting a penalized Cox regression model with a weighted version of the partial likelihood that includes patients of all subgroups but assigns them (individual) weights. Weyer and Binder [37] propose the use of fixed weights. This idea is extended by Richter et al. [29] using model-based optimization for tuning of the weights to obtain the best combination of fixed weights regarding prediction accuracy. The approach in [23] estimates individual weights from the data such that they represent the probability of belonging to a specific subgroup. In this paper, we use a Bayesian approach and borrow information across subgroups through graph-structured selection priors instead of weights in the likelihood. We propose an extension of the Bayesian Cox model with “spike-and-slab” prior for variable selection by Treppmann et al. [35], in the sense that we incorporate graph information between covariates into variable selection via an MRF prior instead of modeling the regression coefficients independently. The graph is not known a priori and is inferred simultaneously with the important predictors. Its structure can be partitioned into subgraphs linking covariates within or across different subgroups. Thus, the graph represents conditional dependencies between genes (i.e. pathways) as well as similarities between subgroups, expressed by genes that are simultaneously prognostic in different subgroups.

Methods

First, we describe the general methods that are required for our proposed Bayesian model, which is introduced later in this section.

The Bayesian Cox proportional hazards model

Assume the observed data of patient m consist of the tuple (t_m, δ_m) and the covariate vector x_m = (x_{m1}, …, x_{mp})', m = 1, …, n. X denotes the n × p matrix of (genomic) covariates. t_m = min(T_m, C_m) denotes the observed time of patient m, with T_m the event time and C_m the censoring time. δ_m indicates whether a patient experienced an event (δ_m = 1) or was right-censored (δ_m = 0). The Cox proportional hazards model [7] models the hazard rate h(t | x_m) of an individual m at time t. It consists of two terms, the non-parametric baseline hazard rate h_0(t) and a parametric form of the covariate effect:

h(t | x_m) = h_0(t) · exp(β' x_m),

where β = (β_1, …, β_p)' is the unknown parameter vector that represents the strength of influence of the covariates on the hazard rate. Under the Cox model, the joint survival probability of n patients given X is

P(T > t | X, β) = exp( − Σ_{m=1}^{n} H_0(t_m) exp(β' x_m) ),

where t = (t_1, …, t_n)' is the vector of the individual observed times for all patients and T the vector of corresponding random variables, and H_0 denotes the cumulative baseline hazard function. One of the most popular choices for the cumulative baseline hazard function is a gamma process prior

H_0 ~ GP(c H*, c),

where H* is an increasing function with H*(0) = 0. H* can be considered as an initial guess of H_0, and c describes the weight that is given to H* [20]. Lee et al. [20] propose a Weibull distribution H*(t) = η t^κ with fixed hyperparameters η and κ. Following Zucknick et al. [41], we obtain estimates of η and κ from the training data by fitting a parametric Weibull model without covariates to the survival data. We choose c in accordance with the authors. In practice the presence of ties is very common, leading to the grouped data likelihood described in Ibrahim et al. [17, chapter 3.2.2]. A finite partition of the time axis, 0 = s_0 < s_1 < … < s_J, is constructed with s_J > t_m for all m = 1, …, n. The observed time of patient m falls in one of the J disjoint intervals I_g = (s_{g−1}, s_g], g = 1, …, J. Assume the observed data are grouped within D = {(R_g, D_g); g = 1, …, J}, where R_g and D_g are the risk and failure sets corresponding to interval g. Let h_g = H_0(s_g) − H_0(s_{g−1}) be the increment in the cumulative baseline hazard in interval I_g, g = 1, …, J. From the gamma process prior of H_0 it follows that the h_g's have independent gamma distributions

h_g ~ Gamma( c (H*(s_g) − H*(s_{g−1})), c ).

The conditional probability that the observed time of patient m falls in interval I_g is given by

P(t_m ∈ I_g | β, h) = exp( − exp(β' x_m) Σ_{j=1}^{g−1} h_j ) · [ 1 − exp( − h_g exp(β' x_m) ) ],

with h = (h_1, …, h_J)'.
The resulting grouped data likelihood is defined as

L(D | β, h) = ∏_{g=1}^{J} exp( − h_g Σ_{k ∈ R_g \ D_g} exp(β' x_k) ) ∏_{l ∈ D_g} [ 1 − exp( − h_g exp(β' x_l) ) ]

[17, chapter 3.2.2].
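For concreteness, the grouped data likelihood can be evaluated as in the following Python sketch. All function names, the interval encoding and the toy data are ours, not from the paper's implementation:

```python
import numpy as np

def grouped_cox_loglik(beta, h, X, interval_idx, event, J):
    """Grouped-data log-likelihood of the Cox model (after Ibrahim et al.).

    beta         : (p,) regression coefficients
    h            : (J,) increments of the cumulative baseline hazard
    X            : (n, p) covariate matrix
    interval_idx : (n,) index g in {0, ..., J-1} of the interval containing t_m
    event        : (n,) 1 if patient m had an event, 0 if right-censored
    """
    risk = np.exp(X @ beta)                    # exp(beta' x_m) for each patient
    loglik = 0.0
    for g in range(J):
        at_risk = interval_idx >= g                        # risk set R_g
        failed = (interval_idx == g) & (event == 1)        # failure set D_g
        survived = at_risk & ~failed                       # R_g \ D_g
        loglik += -h[g] * risk[survived].sum()             # survivors of interval g
        loglik += np.log1p(-np.exp(-h[g] * risk[failed])).sum()  # events in g
    return loglik

# toy example: n = 4 patients, p = 2 covariates, J = 3 intervals
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2))
ll = grouped_cox_loglik(np.array([0.5, -0.5]), np.array([0.1, 0.2, 0.3]),
                        X, np.array([0, 1, 1, 2]), np.array([1, 0, 1, 1]), J=3)
```

Each interval contributes a survival factor for patients passing through it and a failure factor for patients with an event in it, mirroring the two products in the likelihood above.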

Stochastic search variable selection

The stochastic search variable selection (SSVS) procedure by George and McCulloch [14] uses latent indicators for variable selection and models the regression coefficients as a mixture of two normal distributions with different variances

β_i | γ_i ~ (1 − γ_i) · N(0, τ_i²) + γ_i · N(0, c_i² τ_i²), i = 1, …, p.

This prior allows the β_i's to shrink towards zero. Due to the shape of the two-component mixture distribution, it is called spike-and-slab prior. The latent variable γ_i ∈ {0, 1} indicates the inclusion (γ_i = 1) or exclusion (γ_i = 0) of the i-th variable and specifies the variance of the normal distribution. τ_i is set small so that β_i is likely to be close to zero if γ_i = 0. c_i is chosen sufficiently large to inflate the coefficients of selected variables and to make their posterior mean values likely to be non-zero. In general, the variances of the regression coefficients are assumed to be constant: τ_i = τ and c_i = c for all i. The standard prior for γ = (γ_1, …, γ_p)' is a product of independent Bernoulli distributions

p(γ) = ∏_{i=1}^{p} π_i^{γ_i} (1 − π_i)^{1 − γ_i},

with prior inclusion probability π_i = P(γ_i = 1). Typically, these prior inclusion probabilities are chosen to be the same for all variables, π_i = π, often with π set to a fixed value.
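The two-component mixture implies a simple closed form for the conditional probability that a single coefficient belongs to the slab rather than the spike. A minimal Python sketch; the values of tau, c and pi are illustrative, not taken from the paper:

```python
import math

def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def inclusion_prob(beta_i, tau=0.0375, c=20.0, pi=0.5):
    """P(gamma_i = 1 | beta_i) under the spike-and-slab prior
    beta_i | gamma_i ~ (1 - gamma_i) N(0, tau^2) + gamma_i N(0, c^2 tau^2).
    Hyperparameter values are illustrative assumptions."""
    slab = pi * normal_pdf(beta_i, c * tau)        # gamma_i = 1 component
    spike = (1 - pi) * normal_pdf(beta_i, tau)     # gamma_i = 0 component
    return slab / (slab + spike)

# a coefficient near zero is attributed to the spike,
# a clearly non-zero one to the slab
p_small = inclusion_prob(0.01)
p_large = inclusion_prob(1.0)
```

This conditional probability is exactly what drives the stochastic search: variables whose coefficients escape the spike are visited with high inclusion probability.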

Graphical models

A graphical model is a statistical model that is associated with a graph summarizing the dependence structure in the data. The nodes of a graph represent the random variables of interest and the edges of a graph describe conditional dependencies among the variables. Structure learning implies the estimation of an unknown graph. Recent applications are mainly driven by biological problems that involve the reconstruction of gene regulatory networks and the identification of pathways of functionally related genes from their expression levels. A graph is called undirected when its edges are unordered pairs of nodes, instead of ordered pairs with edges pointing from one node to the other (directed graph). When the variables are continuous measurements and assumed to be multivariate normal, a common choice are Gaussian graphical models [11]. We assume that the vector of random variables for patient m, X_m = (X_{m1}, …, X_{mp})', follows a multivariate normal distribution with mean vector μ and covariance matrix Σ. The inverse of the covariance matrix is referred to as precision matrix Ω = Σ^{−1} = (ω_{ij}), with Ω symmetric and positive definite. Let X be the n × p data matrix consisting of n independent patients and S the sample covariance matrix. In graphical models, a graph is used to represent conditional dependence relationships among the random variables X_1, …, X_p. Let G = (V, E) be an undirected graph, where V = {1, …, p} is a set of nodes (e.g. genes) and E is a set of edges (e.g. relations between genes) with edge (i, j). E can be indexed by a set of binary variables g_{ij}, with g_{ij} = 1 or 0 when edge (i, j) belongs to E or not. The symmetric matrix G = (g_{ij}) is termed adjacency matrix representation of the graph. The graph structure implies constraints on the precision matrix such that g_{ij} = 0 implies ω_{ij} = 0, meaning that variables i and j are conditionally independent given all remaining variables [11, 36]. We use the approach for structure learning by Wang [36] that is based on continuous spike-and-slab priors for the elements of the precision matrix and latent indicators for the graph structure.
The approach induces sparsity and is efficient due to a block Gibbs sampler and no approximation of the normalizing constant. The corresponding hierarchical model is defined as

p(Ω | G, θ) = C(G, v_0, v_1, λ)^{−1} ∏_{i<j} N(ω_{ij} | 0, v_{g_{ij}}²) ∏_{i} Exp(ω_{ii} | λ/2) · 1(Ω ∈ M⁺),

p(G | θ) = C(θ)^{−1} ∏_{i<j} π^{g_{ij}} (1 − π)^{1 − g_{ij}},

where C(G, v_0, v_1, λ) and C(θ) are the normalizing constants, and θ = {v_0, v_1, λ, π} is the set of all parameters with v_0 > 0 small, v_1 > 0 large, λ > 0 and π ∈ (0, 1). The indicator 1(Ω ∈ M⁺) restricts the prior to the space of symmetric positive definite matrices. A small value v_0 (if g_{ij} = 0) means that ω_{ij} is small enough to be set to zero. A large value v_1 (if g_{ij} = 1) allows ω_{ij} to be substantially different from zero. The binary latent variables g_{ij} serve as edge inclusion indicators. Wang [36] proposes fixed values for the hyperparameters v_0, v_1, λ and π as resulting in good convergence and mixing.
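For a single off-diagonal element, the conditional odds of edge inclusion reduce to comparing the spike and slab densities at the current value of ω_ij. A Python sketch with illustrative hyperparameter values (not the ones used in the paper):

```python
import math

def edge_inclusion_prob(omega_ij, v0=0.02, v1=1.0, pi=0.02):
    """P(g_ij = 1 | omega_ij) in a continuous spike-and-slab graphical model:
    omega_ij | g_ij ~ N(0, v1^2) if g_ij = 1, else N(0, v0^2),
    with prior edge probability pi. Hyperparameters are illustrative."""
    def npdf(x, sd):
        return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    num = pi * npdf(omega_ij, v1)                  # slab: edge present
    den = num + (1 - pi) * npdf(omega_ij, v0)      # plus spike: edge absent
    return num / den

p_edge = edge_inclusion_prob(0.5)       # clearly non-zero precision element
p_noedge = edge_inclusion_prob(0.001)   # element indistinguishable from zero
```

This is the quantity the block Gibbs sampler draws each g_ij from, conditional on the current precision matrix.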

The proposed Bayesian subgroup model

We assume the entire data consist of S predefined subgroups of patients (different cohorts or data sets), where each patient belongs to exactly one subgroup and this subgroup affiliation is known from the data.

Likelihood

Let X_s be the n_s × p gene expression (covariate) matrix for subgroup s, s = 1, …, S, consisting of n_s independent and identically distributed observations. For patient m in subgroup s the vector of random variables X_{s,m} is assumed to follow a multivariate normal distribution with mean vector 0 and unknown precision matrix Ω_s, m = 1, …, n_s. We consider the outcome (t_{s,m}, δ_{s,m}), with t_{s,m} = min(T_{s,m}, C_{s,m}), as well as the predictors X_{s,m} to be random variables. Thus, the likelihood for subgroup s is the joint distribution p(D_s, X_s | β_s, h_s, Ω_s) = p(D_s | X_s, β_s, h_s) · p(X_s | Ω_s). The conditional distribution p(D_s | X_s, β_s, h_s) corresponds to the grouped data likelihood of the Bayesian Cox proportional hazards model at the beginning of this section [20] for subgroup s,

L(D_s | β_s, h_s) = ∏_{g=1}^{J} exp( − h_{s,g} Σ_{k ∈ R_{s,g} \ D_{s,g}} exp(β_s' x_{s,k}) ) ∏_{l ∈ D_{s,g}} [ 1 − exp( − h_{s,g} exp(β_s' x_{s,l}) ) ],

where D_s = {(R_{s,g}, D_{s,g}); g = 1, …, J} are the observed data in subgroup s, with R_{s,g} the risk and D_{s,g} the failure sets corresponding to interval I_g, g = 1, …, J. The increment in the cumulative baseline hazard for subgroup s in interval I_g is termed h_{s,g}. β_s is the p-dimensional vector of regression coefficients for subgroup s. The marginal distribution of X_s is multivariate normal with

p(X_s | Ω_s) ∝ |Ω_s|^{n_s/2} exp( − ½ tr(S_s Ω_s) ),

where S_s denotes the sample covariance matrix of subgroup s. The joint likelihood across all subgroups is the product of the subgroup likelihoods.

Prior specifications

Prior on the parameters h_s and β_s of the Cox model

The prior for the increment h_{s,g} in the cumulative baseline hazard in subgroup s follows independent gamma distributions

h_{s,g} ~ Gamma( c_s (H_s*(s_g) − H_s*(s_{g−1})), c_s ),

with a Weibull distribution H_s*(t) = η_s t^{κ_s}, g = 1, …, J, s = 1, …, S [20]. We choose the hyperparameters in accordance with Lee et al. [20] and Zucknick et al. [41]: we fix c_s and estimate the hyperparameters η_s and κ_s from the (training) data by fitting a parametric Weibull model without covariates to the survival data of subgroup s. We perform variable selection using the SSVS approach by George and McCulloch [14] described earlier in this section. The prior of the regression coefficients in subgroup s conditional on the latent indicators is defined as a mixture of two normal distributions with small (τ²) and large (c² τ²) variance

β_{s,i} | γ_{s,i} ~ (1 − γ_{s,i}) · N(0, τ²) + γ_{s,i} · N(0, c² τ²), i = 1, …, p.

The latent indicator variable γ_{s,i} indicates the inclusion (γ_{s,i} = 1) or exclusion (γ_{s,i} = 0) of variable i in the model for subgroup s. We assume equal variances for all regression coefficients. We set the hyperparameters τ and c to fixed values following Treppmann et al. [35]. This choice determines the standard deviation c·τ of the slab component and thus a 95% prior probability interval for the coefficients of selected variables.

Prior on γ linking variable and graph selection

The standard prior for the binary variable selection indicators is a product of independent Bernoulli distributions, as utilized by Treppmann et al. [35]. However, this does not consider information from other subgroups or relationships between covariates. For this situation, we propose a Markov random field (MRF) prior for the latent variable selection indicators that incorporates information on the relationships among the covariates as described by an undirected graph. This prior assumes that neighboring covariates in the graph are more likely to have a common effect and encourages their joint inclusion. The MRF prior for γ given G is defined as

p(γ | G) ∝ exp( a · 1' γ + b · γ' G γ ),

where γ is a pS-dimensional vector of variable inclusion indicators, 1 is the pS-dimensional unit vector, G is a symmetric adjacency matrix representation of the graph, and a, b are scalar hyperparameters. The hyperparameter a influences the overall variable inclusion probability and controls the sparsity of the model, with smaller values resulting in sparser models. Without loss of generality, a < 0. The hyperparameter b > 0 determines the prior belief in the strength of relatedness between pairs of neighboring variables in the graph and controls the probability of their joint inclusion. Higher values of b encourage the selection of variables with neighbors already selected into the model. The idea becomes more evident by looking at the conditional probability

P(γ_i = 1 | γ_{−i}, G) = exp( a + 2b Σ_{j ≠ i} g_{ij} γ_j ) / [ 1 + exp( a + 2b Σ_{j ≠ i} g_{ij} γ_j ) ],

where γ_{−i} denotes all elements of γ except γ_i. An MRF prior for variable selection has also been used by other authors [21, 28, 33, 34]. However, unlike us, they do not address the problem of borrowing information across subgroups by linking covariates in a graph. We propose a joint graph with possible edges between all pairs of covariates within each subgroup and edges between the same covariates in different subgroups. The elements of the adjacency matrix of the graph represent the presence (1) or absence (0) of an edge between nodes (genes) i and j in subgroups r and s. They can be viewed as latent binary indicator variables for edge inclusion.
The adjacency matrix G in the present model is defined blockwise: the diagonal blocks G_s are the p × p matrices of latent edge inclusion indicators within subgroup s, and the off-diagonal blocks G_{rs} are diagonal matrices of latent edge inclusion indicators between subgroups r and s, linking gene i in subgroup r to the same gene i in subgroup s, with r, s = 1, …, S, r ≠ s, and i, j = 1, …, p. Thus, within each subgroup s we assume a standard undirected graph with possible edges between all pairs of genes, representing conditional dependencies as in a functional or regulatory pathway. Between different subgroups we only allow for relations between the same gene in different subgroups (different genes in different subgroups are assumed to be unconnected). This allows sharing information between subgroups, and prognostic genes shared by different subgroups have a higher inclusion probability. To visualize this idea, Fig. 1 shows an example network consisting of two subgroups, each with five predictors.
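The block structure of the joint graph and its effect on the conditional inclusion probabilities can be sketched as follows. The matrix layout, function names and values of a and b are illustrative assumptions, not the paper's code:

```python
import numpy as np

def build_joint_adjacency(G_within, G_between):
    """Assemble the joint pS x pS adjacency matrix from within-subgroup
    p x p adjacency matrices G_within[s] and, for each pair r < s, a 0/1
    vector G_between[(r, s)] linking gene i in subgroup r to gene i in s."""
    S, p = len(G_within), G_within[0].shape[0]
    G = np.zeros((p * S, p * S), dtype=int)
    for s in range(S):
        G[s*p:(s+1)*p, s*p:(s+1)*p] = G_within[s]
    for (r, s), diag in G_between.items():
        block = np.diag(diag)          # only same-gene links across subgroups
        G[r*p:(r+1)*p, s*p:(s+1)*p] = block
        G[s*p:(s+1)*p, r*p:(r+1)*p] = block
    return G

def mrf_inclusion_prob(G, gamma, k, a=-3.0, b=1.0):
    """Conditional P(gamma_k = 1 | gamma_-k, G) under the MRF prior
    p(gamma | G) proportional to exp(a 1'gamma + b gamma' G gamma)."""
    eta = a + 2 * b * (G[k] @ gamma)
    return 1.0 / (1.0 + np.exp(-eta))

# two subgroups, three genes; genes 0 and 1 linked within each subgroup,
# gene 0 additionally linked across subgroups
Gw = [np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])] * 2
Gb = {(0, 1): np.array([1, 0, 0])}
G = build_joint_adjacency(Gw, Gb)
gamma = np.zeros(6)
gamma[3] = 1   # gene 0 already selected in subgroup 2
```

With gene 0 selected in subgroup 2, selecting gene 0 in subgroup 1 becomes more likely than selecting the unconnected gene 2, which is precisely the borrowing of information the MRF prior is meant to achieve.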
Fig. 1

Example graph. Illustration of the proposed graph for two subgroups, each with five genomic predictors (nodes). Possible edges between two nodes are marked by dashed lines

Graph selection prior on Ω_s and G

We infer the unknown graph and precision matrices using the structure learning approach for Gaussian graphical models by Wang [36]. The precision matrix of subgroup s corresponding to subgraph G_s is denoted by Ω_s. The corresponding prior is defined by

p(Ω_s | G_s, θ) ∝ ∏_{i<j} N(ω_{s,ij} | 0, v_{g_{s,ij}}²) ∏_{i} Exp(ω_{s,ii} | λ/2) · 1(Ω_s ∈ M⁺),

with fixed hyperparameters v_0 small, v_1 large and λ > 0. We assume the binary edge inclusion indicators within subgroup s (elements of G_s) as well as between subgroups r and s (diagonal elements of G_{rs}) to be independent Bernoulli a priori,

g ~ Bernoulli(π),

with fixed prior probability of edge inclusion π.

Posterior inference

The joint posterior distribution for the set of all parameters is proportional to the product of the joint likelihood and the prior distributions of the parameters in all subgroups,

p(β, γ, h, Ω, G | D, X) ∝ p(γ | G) p(G) ∏_{s=1}^{S} L(D_s | β_s, h_s) p(X_s | Ω_s) p(β_s | γ_s) p(h_s) p(Ω_s | G_s),

where p(G) collects the Bernoulli priors on all edge inclusion indicators.

Markov Chain Monte Carlo sampling

Markov Chain Monte Carlo (MCMC) simulations are required to obtain a posterior sample of the parameters. The different parameters are updated iteratively according to their conditional posterior distributions using a Gibbs sampler. A brief outline of the MCMC sampling scheme is given in the following; more details are provided in the Appendix. Starting with an arbitrary set of initial values for the parameters, the MCMC algorithm runs with a reasonably large number of iterations to obtain a representative sample from the posterior distribution. All subsequent results are based on single MCMC chains, each with 20,000 iterations in total and a burn-in period of 10,000 iterations. As starting values we choose an empty model, with all variable and edge inclusion indicators as well as all regression coefficients set to zero. We assessed convergence of each MCMC chain by looking at autocorrelations, trace plots and running mean plots of the regression coefficients. In addition, we ran several independent MCMC chains with different starting values to ensure that the chains and burn-in period were long enough to reach (approximate) convergence. One MCMC iteration proceeds as follows. For each subgroup s, update Ω_s with the block Gibbs sampler proposed by Wang [36]. Update all elements in G_s iteratively with a Gibbs sampler from their conditional distributions given Ω_s and γ, and update all elements in G_{rs} iteratively with a Gibbs sampler from their conditional distributions given γ. Update the regression coefficients β_s together with the variable inclusion indicators γ_s from their conditional distributions using a random walk Metropolis-Hastings algorithm with adaptive jumping rule as proposed by Lee et al. [20]. The conditional distribution for the update of h_{s,g} can be well approximated by the gamma distribution

h_{s,g} ~ Gamma( c_s (H_s*(s_g) − H_s*(s_{g−1})) + d_{s,g}, c_s + Σ_{k ∈ R_{s,g} \ D_{s,g}} exp(β_s' x_{s,k}) ),

where d_{s,g} is the number of events in interval g for subgroup s, g = 1, …, J, s = 1, …, S [17, chapter 3.2.2].
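The approximate gamma update for the baseline-hazard increments follows from combining the gamma prior with a linearized likelihood contribution. A Python sketch of our reading of the approximation in Ibrahim et al., with illustrative inputs:

```python
import numpy as np

def sample_h_g(rng, c, dH_star, d_g, risk_survivors):
    """Approximate Gibbs draw of the baseline-hazard increment h_g.
    Prior: h_g ~ Gamma(c * dH_star, c), dH_star = H*(s_g) - H*(s_{g-1}).
    Using 1 - exp(-h r) ~= h r for small h, the full conditional is
    approximately Gamma(c * dH_star + d_g, c + sum of exp(beta'x_k)
    over patients surviving interval g). Inputs are illustrative."""
    shape = c * dH_star + d_g
    rate = c + risk_survivors.sum()
    return rng.gamma(shape, 1.0 / rate)   # numpy parameterizes by scale = 1/rate

rng = np.random.default_rng(0)
draws = np.array([sample_h_g(rng, c=2.0, dH_star=0.05, d_g=3,
                             risk_survivors=np.ones(40))
                  for _ in range(5000)])
# theoretical mean of the full conditional: (2*0.05 + 3) / (2 + 40)
```

Intervals with many events and a small remaining risk set receive larger hazard increments, as expected.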

Posterior estimation and variable selection

We report the results of the Cox models in terms of marginal and conditional posterior means and standard deviations of the estimated regression coefficients, as well as posterior selection probabilities. After removal of the burn-in samples, the remaining MCMC samples serve as draws from the posterior distribution to calculate the empirical estimates. These estimates are then averaged across all training sets for each variable separately. The strategy for variable selection follows Treppmann et al. [35]. First, the mean model size is computed as the average number of included variables across all MCMC iterations after the burn-in. Then the variables with the highest posterior selection probability are considered as the most important variables and selected in the final model. We visually assess the inferred graph by the marginal posterior probabilities of the pairwise edge inclusion indicators. High probabilities suggest that an edge exists between two covariates (nodes). We consider the presence of an edge as a continuous parameter rather than choosing a cutoff for binary decision.
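The selection strategy, mean model size followed by ranking on posterior selection probabilities, can be sketched as follows (Python; array layout and toy draws are ours):

```python
import numpy as np

def select_variables(gamma_samples):
    """Variable selection in the spirit of Treppmann et al.:
    gamma_samples is an (iterations x p) 0/1 matrix of post-burn-in draws.
    Select the round(mean model size) variables with the highest
    marginal posterior selection probability."""
    sel_prob = gamma_samples.mean(axis=0)             # P(gamma_i = 1 | data)
    mean_size = int(round(gamma_samples.sum(axis=1).mean()))
    top = np.argsort(sel_prob)[::-1][:mean_size]      # highest-probability vars
    return sorted(top.tolist()), sel_prob

# toy posterior sample: variables 0 and 1 almost always included
rng = np.random.default_rng(7)
draws = (rng.random((1000, 5)) < np.array([0.9, 0.8, 0.1, 0.1, 0.1])).astype(int)
selected, prob = select_variables(draws)
```

The mean model size acts as a data-driven cutoff, avoiding an arbitrary fixed threshold on the selection probabilities.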

Prediction

We use training data for model fitting and posterior estimation and test data to assess model performance. We evaluate the prediction performance of the Cox models by the integrated Brier score. The expected Brier score can be interpreted as a mean squared error of prediction. It measures the inaccuracy by comparing the estimated survival probability of a patient m, Ŝ(t | x_m), with the observed survival status, and the squared residuals are weighted using inverse probability of censoring weights,

BS(t) = (1/n) Σ_{m=1}^{n} [ Ŝ(t | x_m)² · 1(t_m ≤ t, δ_m = 1) / Ĝ(t_m) + (1 − Ŝ(t | x_m))² · 1(t_m > t) / Ĝ(t) ],

to adjust for the bias caused by the presence of censoring in the data. Ĝ is the Kaplan-Meier estimator of the censoring times [3, 31]. The predictive performance of competing survival models can be compared by plotting the Brier score over time (prediction error curves). Alternatively, prediction error curves can be summarized in one value with the integrated Brier score as a measure of inaccuracy over a time interval rather than at single time points [15].
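The IPCW-weighted Brier score at a fixed time point can be sketched as follows. This is a simplified Python reading of the estimator that ignores ties; all names are ours:

```python
import numpy as np

def km_censoring(times, events):
    """Kaplan-Meier estimator of the censoring distribution G(t),
    treating censorings (event = 0) as the 'events'. Returns a step function."""
    order = np.argsort(times)
    t, d = times[order], 1 - events[order]
    n = len(t)
    surv, s = [], 1.0
    for i in range(n):
        s *= 1.0 - d[i] / (n - i)          # n - i patients still at risk
        surv.append(s)
    def G(u):
        idx = np.searchsorted(t, u, side="right") - 1
        return 1.0 if idx < 0 else surv[idx]
    return G

def brier_ipcw(t, surv_pred, times, events):
    """IPCW Brier score at time t; surv_pred[m] estimates S(t | x_m)."""
    G = km_censoring(times, events)
    score = 0.0
    for m in range(len(times)):
        if times[m] <= t and events[m] == 1:      # event before t
            score += surv_pred[m] ** 2 / G(times[m])
        elif times[m] > t:                        # still event-free at t
            score += (1.0 - surv_pred[m]) ** 2 / G(t)
        # patients censored before t get weight zero
    return score / len(times)

# sanity check without censoring: weights reduce to 1
bs = brier_ipcw(2.5, np.array([0.1, 0.2, 0.8, 0.9]),
                np.array([1.0, 2.0, 3.0, 4.0]), np.ones(4, dtype=int))
```

Integrating this quantity over a grid of time points (e.g. by the trapezoidal rule) yields the integrated Brier score used in the paper.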

Median probability model and Bayesian model averaging

For the calculation of the prediction error, we account for the uncertainty in model selection by two different approaches: the Median Probability Model (MPM) [1] and an approximation to Bayesian Model Averaging (BMA) [16]. After removal of the burn-in samples, we compute the Brier score over the “best” selected models. According to the BMA approach we choose the top 100 models with the largest log-likelihood values to obtain the marginal posterior means of the regression coefficients, which in turn are required for the risk score. Our choice of the number of top models for BMA approximation is based on visual assessment of the MCMC frequencies of the different top-selected models. However, the number of models could be optimized. For the MPM approach we select all covariates with a mean posterior selection probability larger than 0.5. For these variables we calculate the marginal posterior means of the regression coefficients and the corresponding risk score.
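The MPM rule can be sketched in a few lines (Python; the array layout and toy numbers are ours):

```python
import numpy as np

def median_probability_model(gamma_samples, beta_samples):
    """Median Probability Model: keep covariates whose marginal posterior
    inclusion probability exceeds 0.5 and use the marginal posterior means
    of their coefficients for the risk score. Both inputs are
    (iterations x p) arrays of post-burn-in draws; a sketch."""
    incl = gamma_samples.mean(axis=0) > 0.5
    beta_hat = np.where(incl, beta_samples.mean(axis=0), 0.0)
    return incl, beta_hat

def risk_score(beta_hat, X):
    """Linear predictor beta' x_m for each test patient."""
    return X @ beta_hat

# toy draws: variable 1 is rarely included and is therefore dropped
gamma = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1]])
beta = np.array([[0.9, 0.0, 0.4], [1.1, 0.1, 0.0],
                 [1.0, -0.1, 0.5], [1.0, 0.0, 0.3]])
incl, beta_hat = median_probability_model(gamma, beta)
```

The resulting risk scores feed directly into the Brier score computation described above.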

Simulation study

In the following, we compare the performance of our proposed model, referred to as CoxBVS-SL (for Cox model with Bayesian Variable Selection and Structure Learning, as an extension of the model by Treppmann et al. [35]), to a standard subgroup model and a combined model. The combined model pools data from all subgroups and treats them as one homogeneous cohort, whereas the subgroup model only uses information in the subgroup of interest and ignores the other subgroups. Both standard approaches follow the Bayesian Cox model proposed by Treppmann et al. [35] with stochastic search variable selection and independent Bernoulli priors for the variable inclusion indicators. The priors for variable selection and structure learning are specified as follows. We set the hyperparameter π of the Bernoulli distribution so that it matches the prior probability of variable inclusion in the MRF prior of the CoxBVS-SL model. Based on a sensitivity analysis, we choose fixed values for the hyperparameters a and b of the MRF prior. When the graph contains no edges or b = 0, the prior variable inclusion probability is exp(a) / (1 + exp(a)). This probability increases when b > 0 is combined with a nonempty graph. For the sensitivity analysis of a and b we considered in total 36 combinations of hyperparameter values and simulated the data according to scenario I. Visual assessment of the results showed that they were relatively robust without major differences between the parameter combinations (Additional file 1: Fig. S1). Therefore, we selected the combination of values for a and b based on a compromise between variable selection accuracy (trade-off between a large probability of true positive and a small probability of false positive selections) and prediction performance.

Fig. 2

Mean posterior probabilities of variable selection in simulation I. Mean posterior selection probabilities of the first nine genes in subgroup 1 (averaged across the ten training sets). The colors represent the different models and the plot symbol indicates whether a gene is selected on average or not

Posterior effect estimates in simulation I. Conditional posterior means (conditional on inclusion) and standard deviations (SD) of the regression coefficients of the first nine genes in subgroup 1 (averaged across the ten training sets)

The remaining hyperparameters for Ω_s and G are chosen in accordance with Wang [36] and Peterson et al. [28]. Wang [36] extensively studied the impact of different parameter combinations on the structure learning results, reporting that the results were relatively insensitive to one of the hyperparameters and recommending ranges for the others as providing good convergence and mixing. Based on his recommendation, we performed a sensitivity analysis in previous simulations to confirm that the parameter range is also appropriate for our experiments. All tested parameter combinations provided reasonable variable selection results with only small differences, which led us to choose one of the best performing combinations in terms of variable selection accuracy. In the following simulations, we examine varying numbers of genomic covariates p and sample sizes n, with a focus on small sample sizes relative to the number of variables, which is characteristic for gene expression data. We standardize the genomic covariates before model fitting and evaluation to have zero mean and unit variance. Parameters of the training data (mean and standard deviation of each variable) are used to scale the training and test data. For the standard subgroup model and the proposed model we standardize each subgroup separately, whereas for the combined model we pool the training data of all subgroups. For Bayesian inference, typically one training data set is used for posterior estimation and an independent test data set for model evaluation. However, results have shown some variation due to the data draw. Therefore, in the following, simulation of training and test data is repeated ten times for each simulation scenario. In a second simulation setup we use two different hyperparameter values b for the subgraphs within subgroups (G_s) and between subgroups (G_{rs}) in the MRF prior of the CoxBVS-SL model and compare the prediction performance with the Sub-struct model. In the latter, the subgraph between subgroups is empty and only information from the within-subgroup subgraphs is included in the MRF prior. We use the same training and test data as before but only consider a subset of the simulation scenarios.

Data simulation

Training and test data, each consisting of n samples and p genomic covariates, are simulated from the same distribution as described in the following. We consider two subgroups that differ only in their relationship between genomic covariates and survival endpoint (β_1 ≠ β_2), and in the parameters for the simulation of survival data. We generate gene expression data from the same multivariate normal distribution with mean vector 0 and precision matrix Ω. The precision matrix is defined such that the variance of each gene is 1 and partial correlations exist only between the first nine prognostic genes. Within the three blocks of prognostic genes determined by the same effect (gene 1 to 3, gene 4 to 6, and gene 7 to 9) we assume pairwise partial correlations of 0.5. All remaining genes are assumed to be uncorrelated. We simulate survival data from a Weibull distribution according to Bender et al. [2], with scale and shape parameters estimated from two gene expression cancer cohorts [4, 10] to obtain realistic survival times. Therefore, we compute survival probabilities at 3 and 5 years using the Kaplan-Meier estimator, separately for both cohorts. The corresponding probabilities are 57% and 75% for 3-year survival, and 42% and 62% for 5-year survival, respectively. Event times for subgroup s are simulated by inversion,

T_{s,m} = ( − log(U) / (η_s exp(β_s' x_{s,m})) )^{1/κ_s}, U ~ Uniform[0, 1],

with true effects β_s, s = 1, 2, as stated in Table 1. We randomly draw non-informative censoring times from a Weibull distribution with the same Weibull parameters as for the event times, resulting in approximately 50% censoring rates in both subgroups. The individual observed event indicators and times until an event or censoring are defined as δ_{s,m} = 1(T_{s,m} ≤ C_{s,m}) and t_{s,m} = min(T_{s,m}, C_{s,m}), m = 1, …, n_s. Genes 1, 2, 3 and 7, 8, 9 are subgroup-specific, while genes 4, 5 and 6 have the same effect in both subgroups. All remaining genes represent noise and have no effect in either subgroup.
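The simulation of one subgroup can be sketched as follows (Python; the Weibull parameters eta and kappa are illustrative stand-ins for the cohort-estimated values, and censoring is drawn without covariate effects, which is our reading of "non-informative"):

```python
import numpy as np

def simulate_subgroup(rng, n, beta, Sigma_chol, eta, kappa):
    """Simulate one subgroup: multivariate normal gene expression and
    Weibull survival times via inversion (after Bender et al.).
    Sigma_chol is a Cholesky factor of the covariance matrix."""
    p = len(beta)
    X = rng.standard_normal((n, p)) @ Sigma_chol.T   # X ~ N(0, Sigma)
    lin = X @ beta                                   # linear predictor
    T = (-np.log(rng.random(n)) / (eta * np.exp(lin))) ** (1.0 / kappa)
    C = (-np.log(rng.random(n)) / eta) ** (1.0 / kappa)  # censoring times
    time = np.minimum(T, C)
    event = (T <= C).astype(int)
    return X, time, event

rng = np.random.default_rng(42)
beta = np.zeros(10)
beta[:3], beta[3:6] = 1.0, -1.0      # prognostic blocks as in Table 1
X, time, event = simulate_subgroup(rng, n=500, beta=beta,
                                   Sigma_chol=np.eye(10), eta=0.1, kappa=1.5)
```

Because event and censoring times share the same baseline Weibull parameters, the censoring rate stays moderate, in line with the roughly 50% rate reported in the paper.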
Table 1

Effects in simulation I

Gene    1    2    3    4    5    6    7    8    9    10 … p
β1      1    1    1   −1   −1   −1    0    0    0     0 … 0
β2      0    0    0   −1   −1   −1    1    1    1     0 … 0

True effects in both subgroups for the simulation of survival outcome


Simulation results I

We consider three low-dimensional settings and five high-dimensional settings, with varying numbers of genes and sample sizes in each subgroup. We also tested two intermediate sample sizes, but as expected, the results always lay between those of the smaller and larger sample sizes; for this reason, they are not shown here. We compare our proposed model (CoxBVS-SL) to the standard subgroup model (Subgroup) and the standard combined or pooled model (Pooled) regarding variable selection accuracy and prediction performance. Posterior selection probabilities for each gene are computed based on all iterations after the burn-in and averaged across all training data sets. The resulting mean posterior selection probabilities of the first nine genes in subgroup 1 are depicted in Fig. 2 (and in Additional file 1: Fig. S2, for subgroup 2). Across all simulation scenarios, the CoxBVS-SL model has more power for the selection of prognostic genes than the two standard approaches and, at the same time, does not erroneously select noise genes (false positives) as the Pooled model does. As expected, power and accuracy in variable selection increase with larger n for both the CoxBVS-SL and the Subgroup model. The Pooled model only correctly identifies the joint effects of genes 4, 5 and 6 but fails to detect subgroup-specific effects.
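The posterior selection probabilities and the Median Probability Model used here can be computed directly from the MCMC output; a minimal sketch, with random draws standing in for the output of a real sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
iters, burnin, p = 2000, 500, 9

# Stand-in for MCMC draws of the binary inclusion indicators gamma
# (shape: iterations x genes); a real run would come from the sampler.
gamma_draws = rng.integers(0, 2, size=(iters, p))

# Posterior selection probability per gene: average over post-burn-in draws.
post_incl = gamma_draws[burnin:].mean(axis=0)

# Median Probability Model: keep genes with posterior probability > 0.5.
mpm = post_incl > 0.5
```

Averaging these per-gene probabilities over the ten training data sets gives the mean posterior selection probabilities shown in Fig. 2.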
Fig. 2

Mean posterior probabilities of variable selection in simulation I. Mean posterior selection probabilities of the first nine genes in subgroup 1 (averaged across the ten training sets). The colors represent the different models and the plot symbol indicates whether a gene is selected on average or not

Posterior estimates of the regression coefficients of the first nine genes in subgroup 1 are shown in Fig. 3 for the conditional posterior means (conditional on inclusion) and in Additional file 1: Fig. S3 for the marginal posterior means, both along with standard deviations. The corresponding results for subgroup 2 are depicted in Additional file 1: Figs. S4 and S5. The conditional posterior means of the prognostic genes are less shrunk than the marginal posterior means. Results of the CoxBVS-SL model and the Subgroup model are very similar, whereas the Pooled model averages effects across subgroups, leading to biased subgroup-specific effects and more false positives. Surprisingly, the joint effects of genes 4, 5 and 6 are also more precisely estimated (less shrunk) by CoxBVS-SL and Subgroup compared to Pooled.
Fig. 3

Posterior effect estimates in simulation I. Conditional posterior means (conditional on inclusion) and standard deviations (SD) of the regression coefficients of the first nine genes in subgroup 1 (averaged across the ten training sets)

We assess prediction performance by the integrated Brier score (IBS), computed based on the Median Probability Model (MPM; Fig. 4 for subgroup 1 and Additional file 1: Fig. S7 for subgroup 2) and Bayesian Model Averaging (BMA; Additional file 1: Fig. S6 for subgroup 1 and Fig. S8 for subgroup 2). The Pooled model has the worst prediction accuracy. In the case of MPM, CoxBVS-SL performs clearly better than Subgroup; for BMA, both models are competitive.
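For illustration, a simplified integrated Brier score can be sketched as follows. Unlike the paper's evaluation, this version drops subjects censored before the evaluation time instead of applying inverse-probability-of-censoring weights, so it is only a rough stand-in:

```python
import numpy as np

def brier_score(t, surv_prob, time, event):
    """Brier score at time t for predicted survival probabilities surv_prob.
    Simplified: subjects censored before t are dropped, not reweighted."""
    known = (time > t) | ((time <= t) & (event == 1))
    y = (time > t).astype(float)        # 1 if still event-free at t
    return np.mean((surv_prob[known] - y[known]) ** 2)

def integrated_brier_score(times, surv_probs, time, event):
    """Trapezoidal integration of BS(t) over an evaluation grid,
    normalised by the grid range."""
    bs = np.array([brier_score(t, s, time, event)
                   for t, s in zip(times, surv_probs)])
    return np.sum((bs[1:] + bs[:-1]) / 2 * np.diff(times)) / (times[-1] - times[0])

# Tiny illustrative data: 5 subjects, predicted survival at 3 grid times.
times = np.array([1.0, 2.0, 3.0])
time = np.array([0.5, 1.5, 2.5, 3.5, 4.0])
event = np.array([1, 0, 1, 1, 0])
surv_probs = np.array([[0.20, 0.90, 0.80, 0.90, 0.95],
                       [0.10, 0.50, 0.70, 0.80, 0.90],
                       [0.05, 0.30, 0.40, 0.70, 0.80]])
ibs = integrated_brier_score(times, surv_probs, time, event)
```

A lower IBS indicates better calibrated survival predictions; 0.25 corresponds roughly to an uninformative constant prediction of 0.5.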
Fig. 4

Prediction performance in simulation I. Integrated Brier Scores (IBS) across all ten test sets for subgroup 1 (IBS based on the Median Probability Model). The black triangle within each boxplot represents the mean value

Inference of the graph showed relatively high accuracy for learning the conditional dependence structure among genes within subgroups and for detecting joint effects across different subgroups. The block correlation structure between the prognostic genes within each subgroup is correctly estimated by the precision matrix and the within-subgroup subgraphs in the CoxBVS-SL model (see Additional file 1: Fig. S9). Inference of the subgraph linking both subgroups improves with increasing sample size. The corresponding marginal posterior edge inclusion probabilities of the prognostic genes with joint effects (genes 4, 5 and 6) are larger than for the remaining genes, which becomes more evident with increasing n (see Additional file 1: Fig. S10). These findings support the assumption that incorporating network information into variable selection may increase power to detect associations with the survival outcome and improve prediction accuracy.

Simulation results II

Next, we study the effect of two different hyperparameters b in the MRF prior of the CoxBVS-SL model with respect to variable selection and prediction performance. One hyperparameter corresponds to the subgraphs within each subgroup and the other to the subgraph linking both subgroups. By choosing a larger value for the latter, we give more weight in the MRF prior to, and thus increase the prior variable inclusion probability for, genes that are simultaneously selected in both subgroups and have a link in the connecting subgraph. We compare the results of CoxBVS-SL with varying between-subgroup weight to the results of the Sub-struct model, in which only the information of the within-subgroup subgraphs is included in the MRF prior. In this comparison we investigate how much information is added by the subgraph linking both subgroups. For the other hyperparameters we use the same values as in the previous simulations. We apply all models to the same training and test data sets as before but only consider a subset of the simulation scenarios.
Figure 5 shows the mean posterior selection probabilities of the first nine genes in subgroup 1 (subgroup 2 is presented in Additional file 1: Fig. S11). The results of Sub-struct are similar to those of CoxBVS-SL with a small between-subgroup weight. Increasing this weight leads to larger posterior variable inclusion probabilities, however not only for the prognostic genes (see genes 7, 8 and 9 in subgroup 1). This means more power for the correct identification of prognostic genes, but on the other hand a tendency towards more false positives.
Fig. 5

Mean posterior probabilities of variable selection in simulation II. Mean posterior selection probabilities (averaged across the ten training sets) of the first nine genes in subgroup 1

Posterior estimates of the regression coefficients are very similar for all models. Figure 6 shows the conditional posterior means (conditional on inclusion) and Additional file 1: Fig. S12 the marginal posterior means, along with standard deviations of the first nine genes in subgroup 1. The corresponding results for subgroup 2 are depicted in Additional file 1: Figs. S13 and S14.
Fig. 6

Posterior effect estimates in simulation II. Conditional posterior means (conditional on inclusion) and standard deviations (SD) of the regression coefficients of the first nine genes in subgroup 1 (averaged across the ten training sets)

We assess prediction performance in terms of the integrated Brier score (IBS), computed based on the Median Probability Model (Fig. 7) and Bayesian Model Averaging (Additional file 1: Fig. S15). Larger values of the between-subgroup weight tend to lead to slightly better prediction performance of CoxBVS-SL compared to Sub-struct when the sample size is small. When the sample size is large, the prediction accuracy of all models is similarly good.
Fig. 7

Prediction performance in simulation II. Integrated Brier Scores (IBS) across all ten test sets for subgroup 1 (left) and 2 (right) (based on the Median Probability Model). The black triangle within each boxplot represents the mean value

Additional file 1: Fig. S16 compares the results for the subgraph linking both subgroups under varying between-subgroup weights in CoxBVS-SL. For larger weights the marginal posterior edge inclusion probabilities of the prognostic genes with joint effects (genes 4, 5 and 6) increase, as expected, since they are given a higher weight in the prior. However, for the largest weights we also notice a minor increase in the marginal posterior edge inclusion probabilities of the other six prognostic genes with subgroup-specific effects.

Case study based on Glioblastoma protein expression data

In this section we compare CoxBVS-SL with varying hyperparameter values to both standard models, Pooled and Subgroup. We use the Glioblastoma protein expression data from Peterson et al. [28], comprising 212 samples with survival data (159 events) and protein expression measurements. For reasons of computation time, we use only 20 proteins and standardize the protein expression data as described in the previous section. In contrast to the previous simulations, we do not draw the expression data from a multivariate normal distribution but instead use real protein expression data with a realistic correlation structure between all covariates, following the concept of plasmode simulations as described by Franklin et al. [12]. We still simulate the relationship between proteins and survival outcome by choosing artificial effects and simulating the survival data from a Weibull distribution. We randomly divide the complete data set into two equally large subsets to obtain two subgroups. For the survival endpoint we simulate the event and censoring times in each subgroup from a Weibull distribution, with scale and shape parameters estimated via the Kaplan-Meier estimator of the true event and censoring times in the specific subgroup. The individual observed event indicators and survival times until an event or censoring are defined as before, resulting in approximately 42% censoring rates in both subgroups. The effects assumed for the simulation of survival data in the two subgroups are depicted in Table 2.
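Weibull scale and shape can be recovered from two Kaplan-Meier survival probabilities by solving S(t) = exp(−λ·t^ν) at two time points; as an example, the sketch below uses the 3- and 5-year probabilities of 57% and 42% quoted for one cohort in simulation I:

```python
import math

def weibull_from_km(t1, s1, t2, s2):
    """Recover Weibull scale lam and shape nu from two survival
    probabilities s1 = S(t1), s2 = S(t2), where S(t) = exp(-lam * t**nu)."""
    nu = math.log(math.log(s1) / math.log(s2)) / math.log(t1 / t2)
    lam = -math.log(s1) / t1 ** nu
    return lam, nu

# 3- and 5-year Kaplan-Meier survival probabilities from one cohort
lam, nu = weibull_from_km(3, 0.57, 5, 0.42)
```

By construction, the fitted Weibull survival function reproduces both Kaplan-Meier probabilities exactly.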
Table 2

Effects in simulation II

Protein                     β1      β2
Akt                          2       0
Akt_pS473                    2       0
Akt_pT308                    2       0
EGFR                         0       2
EGFR_pY1068                  0       2
EGFR_pY1173                  0       2
AMPK_alpha                −1.5     1.5
Annexin.1                  1.5    −1.5
GSK3.alpha.beta             −2      −2
GSK3.alpha.beta_pS21_S9     −2      −2
GSK3_pS9                    −2      −2
X14.3.3_beta                 0       0
X14.3.3_epsilon              0       0
X14.3.3_zeta                 0       0
X4E.BP1                      0       0
X4E.BP1_pS65                 0       0
X4E.BP1_pT37T46              0       0
X4E.BP1_pT70                 0       0
X53BP1                       0       0
A.Raf_pS299                  0       0

Simulated effects in both subgroups. Groups of proteins with the same effect are defined by different phosphorylation sites (or isoforms) of the same protein so that they can learn from each other

We repeatedly randomly split the complete data into training (with proportion 0.8) and test sets, stratified by subgroup and event indicator. In total, we generate ten training data sets for model fitting and ten test data sets for evaluation of the prediction performance. We choose the hyperparameters in accordance with the case study in Peterson et al. [28] as follows. For the two standard models, a prior probability of variable inclusion of 0.2 is assumed. In the CoxBVS-SL model we set the hyperparameters of the precision matrix and graph as in the previous section. For the MRF prior, as in the previous section, we tried out two different values for each of the two hyperparameters b.
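The repeated stratified splitting can be sketched as follows; the paper does not give its splitting code, so the function below is only an illustration under stated assumptions:

```python
import numpy as np

def stratified_split(subgroup, event, train_frac=0.8, rng=None):
    """One random train/test split, stratified by subgroup and event
    indicator, so each stratum keeps roughly train_frac of its samples."""
    rng = rng if rng is not None else np.random.default_rng()
    train = np.zeros(len(subgroup), dtype=bool)
    for s in np.unique(subgroup):
        for e in np.unique(event):
            idx = np.where((subgroup == s) & (event == e))[0]
            idx = rng.permutation(idx)
            train[idx[: int(round(train_frac * len(idx)))]] = True
    return train

# ten repeated splits, as in the case study (illustrative indicators)
subgroup = np.repeat([1, 2], 106)   # two equally large subgroups, 212 samples
event = np.tile([0, 1], 106)
splits = [stratified_split(subgroup, event, rng=np.random.default_rng(k))
          for k in range(10)]
```

Stratifying by both subgroup and event indicator keeps the censoring rate and subgroup balance comparable across the ten training and test sets.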

Results of the case study

When either of the two MRF hyperparameters increases, the mean posterior selection probabilities of all proteins increase too (Fig. 8). The Subgroup model and the CoxBVS-SL model with small hyperparameter values perform similarly. They correctly identify the subgroup-specific effects of the first six proteins and do not falsely select any noise proteins. Interestingly, the effects of the proteins AMPK and Annexin (ID 7 and 8), which go in opposite directions in the two subgroups, as well as the joint effects of the GSK3 proteins, are not all identified; there are a few false negatives. The Pooled model, in contrast, shows a clear bias for the subgroup-specific and opposite effects. The effects are averaged across both subgroups, which also becomes evident when looking at the posterior estimates of the regression coefficients, shown for the conditional posterior means in Fig. 9 and for the marginal posterior means in Additional file 1: Fig. S17. The results of the Subgroup and CoxBVS-SL models are similar. In particular, the posterior means of the noise proteins with null effect are close to 0, also for large hyperparameter values.
Fig. 8

Mean posterior probabilities of variable selection in the case study. Mean posterior selection probabilities of all 20 proteins in both subgroups (averaged across all training sets). The different colors represent the models or the hyperparameter values in CoxBVS-SL (abbreviated by “C.”). The plot symbol indicates whether a protein is selected (triangle) or not (circular point)

Fig. 9

Posterior effect estimates in the case study. Conditional posterior means (conditional on inclusion) and standard deviations (SD) of the regression coefficients of all 20 proteins in both subgroups (averaged across all training sets). The different colors represent the models or the hyperparameter values in CoxBVS-SL (abbreviated by “C.”). The plot symbol indicates whether a protein is selected (triangle) or not (circular point)

When we compare all models with regard to prediction accuracy in Fig. 10 and Additional file 1: Fig. S18, we again see competitive performance for the Subgroup and CoxBVS-SL models, whereas Pooled is clearly worse. We observe a tendency towards slightly improved prediction accuracy for increasing hyperparameter values.
Fig. 10

Prediction performance in the case study. Integrated Brier Scores (IBS) across all ten test sets for both subgroups (based on the Median Probability Model). CoxBVS-SL is abbreviated by “C.”. The black triangle within each boxplot represents the mean value

Finally, we assess the impact of increasing values of the between-subgroup weight on the subgraph linking both subgroups. The corresponding marginal posterior edge selection probabilities are depicted in Additional file 1: Fig. S19. As this weight becomes larger, the posterior edge selection probabilities of proteins 8, 10 and 11, with opposite or joint effects in both subgroups, increase first, followed by the first six proteins with subgroup-specific effects and protein 9 with a joint effect. The posterior edge selection probabilities of the noise proteins in both subgroups remain at the prior mean and only start to increase slightly for the largest weights. Proteins 7 and 9 have much smaller posterior edge selection probabilities than the other proteins with opposite or joint effects, which fits the previous findings. When the within-subgroup weight becomes larger, the marginal posterior edge selection probabilities in the within-subgroup subgraphs show no visible changes. In the subgraph linking both subgroups they increase for some proteins, however to a much lesser extent than for larger between-subgroup weights.

Discussion

We consider the situation of different, possibly heterogeneous patient subgroups (pre-known cohorts or data sets) with survival endpoint and continuous molecular measurements such as gene expression data. When building a separate risk prediction model for each subgroup, it is important to account for heterogeneity, but at the same time it can be reasonable to share information across subgroups to increase power, in particular when the sample sizes are small. For this situation we propose a hierarchical Cox model with a stochastic search variable selection prior. To achieve higher power in variable selection and better prediction performance, we use an MRF prior instead of the standard Bernoulli prior for the latent variable selection indicators. The MRF prior leads to higher selection probabilities for genes that are related in an undirected graph. We use this graph to link genes across different subgroups and thereby borrow information between subgroups. Genes that are simultaneously prognostic in different subgroups have a higher probability of being selected into the respective subgroup Cox models. As a side aspect, the graph in the MRF prior also allows us to estimate a network between genes within each subgroup, providing indications of functionally related genes and pathways. Here, genes that are conditionally dependent have a higher selection probability. In the simulations and the case study we compared our proposed CoxBVS-SL model to the standard approach with an independent Bernoulli prior for the selection indicators, represented by the Subgroup and Pooled models. Simulations showed that the Pooled model performed worst in terms of variable selection and prediction accuracy. It averaged the effects across both subgroups and thus led to biased estimates. CoxBVS-SL had more power in variable selection and slightly better prediction performance than Subgroup when the sample size was small. For larger sample sizes, both models were competitive.
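The MRF prior discussed here is, in the form common to this literature (e.g. Peterson et al. [28]), p(γ | G) ∝ exp(a·1'γ + b·γ'Gγ). A small sketch of its unnormalised log-density; the values of a and b are placeholders, and the paper's exact parameterisation may differ:

```python
import numpy as np

def log_mrf_prior(gamma, G, a=-3.0, b=0.5):
    """Unnormalised log of the MRF variable-selection prior
    p(gamma | G) ∝ exp(a * sum(gamma) + b * gamma' G gamma).
    gamma: 0/1 inclusion indicators; G: symmetric 0/1 adjacency matrix.
    A negative a keeps the model sparse; b > 0 rewards jointly selecting
    genes that are linked in the graph."""
    gamma = np.asarray(gamma, dtype=float)
    return a * gamma.sum() + b * gamma @ G @ gamma

# Two linked genes: selecting both gains 2*b from the symmetric quadratic form.
G = np.array([[0, 1], [1, 0]])
both = log_mrf_prior([1, 1], G)   # -6.0 + 0.5 * 2 = -5.0
one = log_mrf_prior([1, 0], G)    # -3.0
```

The gap between the two log-prior values shows how an edge in G raises the prior odds of selecting linked genes together, which is the mechanism the model uses to borrow information across subgroups.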
However, in the case study the CoxBVS-SL and Subgroup models performed similarly well (Pooled was again clearly worse). The reason for this may be that the sample sizes in both subgroups were relatively large. In the MRF prior of our proposed CoxBVS-SL model we specify one unique hyperparameter b for both the connection levels of covariates within and between subgroups. Since this assumption may be inadequate, we considered further simulations in which we studied the effect of increasing values of the weight given to the subgraph linking both subgroups in the MRF prior of CoxBVS-SL, and compared the results to the Sub-struct model, in which this subgraph is omitted. When this weight was small, CoxBVS-SL and Sub-struct performed very similarly. Thus, the subgraph linking both subgroups had only a small influence on the results compared to the conditional dependencies among covariates within each subgroup (the within-subgroup subgraphs). For larger values of the weight, prediction performance slightly improved and power in variable selection increased, but on the other hand there was a tendency towards false positive variables. By using different hyperparameters for within- and between-subgroup connections, we can vary the strength of connection between pairs of covariates within and between subgroups. However, we still assume that all pairs of subgroups are equally likely to be linked a priori. In our situation this assumption is justified since we have no prior knowledge of the amount of shared, similar effects between subgroups. If prior information on the heterogeneity structure between subgroups (similar effects) is available, it can be incorporated into the MRF prior or the graph prior. The problem of different connection levels of covariates within and between subgroups can similarly be approached through a hyperparameter in the graph prior instead of the hyperparameter b in the MRF prior.
In previous simulations (data not shown) we increased the weight for the subgraph linking both subgroups by choosing a larger value for the prior probability of edge inclusion for the corresponding edge inclusion indicators. This led to larger posterior edge selection probabilities, however for all genes and not only the ones with joint effects. The variable selection results did not change remarkably. We could observe a small increase in power for all genes, which again implied a tendency towards false positives. Due to computation time, we have included only up to 200 variables so far, and the analysis of many thousands of genes is not (yet) feasible. An advantage of the CoxBVS-SL model is that it does not require prior knowledge of the graph among the covariates within and between subgroups. It accounts for uncertainty over both variable and graph selection. In situations where pathway information is available and the graph structure is known, it is possible to incorporate this structural information in the MRF prior via a fixed graph. We assume that subgroups are pre-specified, with the subgroup affiliation of each patient being unique and fixed. However, in situations with unknown subgroups the latent subgroup structure would first need to be determined using methods such as clustering. A wide variety of approaches have been proposed for the clustering of molecular data such as gene expression profiles [9, 13, 18, 25, 39], with extensions to sparse clustering [32, 38] and integrative clustering of multiple omics data types [6, 19].

Conclusions

To our knowledge, we propose the first completely Bayesian approach to combine different, possibly heterogeneous subgroups/cohorts in Cox regression with variable selection. We offer a solution for sharing information across the subgroups to increase power in variable selection and improve prediction performance. We were able to demonstrate the superiority of our proposed CoxBVS-SL model over the two standard approaches. The standard Pooled model always performed worst, whereas the CoxBVS-SL model outperformed the standard Subgroup model when the sample size was small and was otherwise competitive. This suggests that incorporating network information into variable selection can increase power to identify the prognostic covariates and improve prediction performance. We showed that a proper choice of the hyperparameter b (and a) in the MRF prior is crucial for the results of the graph and the Cox model. Our proposed model does not require prior knowledge of the dependency structure between covariates within subgroups or of the heterogeneity structure between subgroups (i.e., of the amount of shared, similar effects). In the absence of any prior structural information, we assume that all pairs of covariates within and between subgroups are equally likely to be linked a priori, and we allow inference of the corresponding unknown graphical structures. In situations where prior structural information is available, for example pathway information or the degree of heterogeneity between subgroups, this information can be incorporated into our model. We presented a way to assign different connection levels to covariates within and between subgroups by using different hyperparameters b in the MRF prior. Alternatively, one could use fixed edges in the graph or varying prior edge selection probabilities. The discovery of graphical structure is an additional benefit of our proposed model. However, our focus is on prediction performance, unbiased effect estimation and variable selection in the Cox model.
Our proposed CoxBVS-SL model showed improved results in the situation of small sample sizes which is an important problem, not only in clinical applications.

1.  Generating survival times to simulate Cox proportional hazards models.

Authors:  Ralf Bender; Thomas Augustin; Maria Blettner
Journal:  Stat Med       Date:  2005-06-15       Impact factor: 2.373

2.  Joint Bayesian variable and graph selection for regression models with network-structured predictors.

Authors:  Christine B Peterson; Francesco C Stingo; Marina Vannucci
Journal:  Stat Med       Date:  2015-10-29       Impact factor: 2.373

3.  Bayesian Inference of Multiple Gaussian Graphical Models.

Authors:  Christine B Peterson; Francesco C Stingo; Marina Vannucci
Journal:  J Am Stat Assoc       Date:  2015-03-01       Impact factor: 5.033

4.  An overview of techniques for linking high-dimensional molecular data to time-to-event endpoints by risk prediction models.

Authors:  Harald Binder; Christine Porzelius; Martin Schumacher
Journal:  Biom J       Date:  2011-02-17       Impact factor: 2.207

5.  Biomarker discovery in non-small cell lung cancer: integrating gene expression profiling, meta-analysis, and tissue microarray validation.

Authors:  Johan Botling; Karolina Edlund; Miriam Lohr; Birte Hellwig; Lars Holmberg; Mats Lambe; Anders Berglund; Simon Ekman; Michael Bergqvist; Fredrik Pontén; André König; Oswaldo Fernandes; Mats Karlsson; Gisela Helenius; Christina Karlsson; Jörg Rahnenführer; Jan G Hengstler; Patrick Micke
Journal:  Clin Cancer Res       Date:  2012-10-02       Impact factor: 12.531

6.  Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases.

Authors:  Jessica M Franklin; Sebastian Schneeweiss; Jennifer M Polinski; Jeremy A Rassen
Journal:  Comput Stat Data Anal       Date:  2014-04       Impact factor: 1.681

7.  Integrative clustering methods for high-dimensional molecular data.

Authors:  Prabhakar Chalise; Devin C Koestler; Milan Bimali; Qing Yu; Brooke L Fridley
Journal:  Transl Cancer Res       Date:  2014-06-01       Impact factor: 1.241

8.  Validation of a histology-independent prognostic gene signature for early-stage, non-small-cell lung cancer including stage IA patients.

Authors:  Sandy D Der; Jenna Sykes; Melania Pintilie; Chang-Qi Zhu; Dan Strumpf; Ni Liu; Igor Jurisica; Frances A Shepherd; Ming-Sound Tsao
Journal:  J Thorac Oncol       Date:  2014-01       Impact factor: 15.609

9.  Estimation of multiple networks in Gaussian mixture models.

Authors:  Chen Gao; Yunzhang Zhu; Xiaotong Shen; Wei Pan
Journal:  Electron J Stat       Date:  2016-05-02       Impact factor: 1.125

10.  Bayesian correlated clustering to integrate multiple datasets.

Authors:  Paul Kirk; Jim E Griffin; Richard S Savage; Zoubin Ghahramani; David L Wild
Journal:  Bioinformatics       Date:  2012-10-09       Impact factor: 6.937


1.  Correction to: Combining heterogeneous subgroups with graph-structured variable selection priors for Cox regression.

Authors:  Katrin Madjar; Manuela Zucknick; Katja Ickstadt; Jörg Rahnenführer
Journal:  BMC Bioinformatics       Date:  2022-01-11       Impact factor: 3.169

