Weighted Cox regression for the prediction of heterogeneous patient subgroups.

Abstract

BACKGROUND: An important task in clinical medicine is the construction of risk prediction models for specific subgroups of patients based on high-dimensional molecular measurements such as gene expression data. Major objectives in modeling high-dimensional data are good prediction performance and feature selection to find a subset of predictors that are truly associated with a clinical outcome such as a time-to-event endpoint. In clinical practice, this task is challenging since patient cohorts are typically small and can be heterogeneous with regard to their relationship between predictors and outcome. When data of several subgroups of patients with the same or similar disease are available, it is tempting to combine them to increase sample size, such as in multicenter studies. However, heterogeneity between subgroups can lead to biased results and subgroup-specific effects may remain undetected.
METHODS: For this situation, we propose a penalized Cox regression model with a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. The weights are estimated from the data such that patients who are likely to belong to the subgroup of interest obtain higher weights in the subgroup-specific model.
RESULTS: Our proposed approach is evaluated through simulations and application to real lung cancer cohorts, and compared to existing approaches. Simulation results demonstrate that our proposed model is superior to standard approaches in terms of prediction performance and variable selection accuracy when the sample size is small.
CONCLUSIONS: The results suggest that sharing information between subgroups by incorporating appropriate weights into the likelihood can increase power to identify the prognostic covariates and improve risk prediction.

Keywords: Cox proportional hazards model; Heterogeneous cohorts; High-dimensional data; Subgroup analysis; Weighted regression

Year: 2021 PMID: 34876106 PMCID: PMC8650299 DOI: 10.1186/s12911-021-01698-1

Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796

Background

Survival analysis is an important field of biomedical research, particularly cancer research. The main objectives are the prediction of a patient’s risk and the identification of new prognostic biomarkers to improve patients’ prognosis. In recent years, molecular data such as gene expression data have increasingly gained importance in diagnosis and prediction of disease outcome. Technologies for the measurement of gene expression have made rapid progress, and high-throughput technologies allow simultaneous measurement of genome-wide data for patients, resulting in a vast amount of data. A typical characteristic of this kind of high-dimensional data is that the number of genomic predictors p greatly exceeds the number of patients n ($p \gg n$). In this situation, the number of genes associated with a clinical outcome, here a time-to-event endpoint, is typically small. Important objectives in modeling high-dimensional data are good prediction performance and finding a subset of predictors that are truly relevant to the outcome. A sparse model solution may reduce noise in estimation and increase interpretability of the results. Another problem with high-dimensional data is that standard approaches for parameter estimation in regression models cannot handle such a large number of predictors; conventional regression techniques may not provide a unique solution to maximum likelihood problems or may result in an overfitted model. In recent years, different approaches have been proposed for handling this situation, often implying automatic variable selection, such as regularization [33, 38, 45] or boosting algorithms [4, 20, 21, 35]. In clinical practice, patient cohorts are typically small. However, when data of several patient cohorts or subgroups with the same or similar disease are available, it can be reasonable to use this information and appropriately combine the data. In multicenter studies, patients of all subgroups are often simply pooled. 
When subgroups are heterogeneous with regard to their relationship between predictors and outcome, this combined analysis may suffer from biased results and averaging of subgroup-specific effects. Standard subgroup analysis, on the other hand, includes only patients of the subgroup of interest and may lead to a loss of power when the sample size is small. We aim at providing a separate prediction model for each subgroup that allows for identifying common as well as subgroup-specific effects and has improved prediction accuracy over both standard approaches. Therefore, we propose a Cox proportional hazards model that allows sharing information between subgroups to increase power when this is supported by data. We use a lasso penalty for variable selection and a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. Patients who are likely to belong to the subgroup of interest obtain higher weights in the subgroup-specific model. We estimate individual weights for each patient from the training data following the idea of Bickel et al. [3]. We assume subgroups are pre-known and determined by multiple cancer studies or cohorts. However, our approach can be applied to any other type of subgroups, for example, defined by clinical covariates. Our proposed model is evaluated through simulations and application to real lung cancer cohorts, and compared to the standard subgroup model and the standard combined model.

Related work

Several approaches have been published recently that use weights in regression models to account for subgroups. Weyer and Binder [39] aim at improving stability and prediction quality of a Cox model for a specific subgroup by including one additional weighted subgroup. The authors use a weighted and stratified Cox regression model based on componentwise boosting for automatic variable selection. They study the effects of a set of different fixed weights for the additional subgroup, while all observations in the subgroup of interest obtain a weight of 1 in the stratum-/subgroup-specific likelihood. In this paper, we compare a set of different fixed weights as suggested by Weyer and Binder [39] to our more flexible approach with individual weights for each patient from each subgroup estimated from the (training) data. However, in contrast to the stratified Cox model by Weyer and Binder [39], we assume the same baseline hazard rate across all subgroups. Alternatively, subgroup weights can be considered as a tuning parameter in model-based optimization (MBO) to improve prediction performance in the Cox model. This approach by Richter et al. [29] is more flexible than the one by Weyer and Binder [39] since it allows different fixed weights for different subgroups in each subgroup model. However, it still requires all patients from the same subgroup to share the same weight, which differs in spirit from our proposed approach with individual weights for each patient from each subgroup. The main idea of the MBO method is instead to quickly find a good set of fixed weights for the other subgroups in terms of prediction performance. Despite this difference, a comparison with this alternative procedure would be an interesting topic for further studies. Bayesian approaches for the estimation of subgroup weights were proposed by Bogojeska and Lengauer [7] and Simon [31]. 
However, they are not designed for our high-dimensional situation since they do not perform variable selection. Weighted regression models are also used in local regression, although without predefined groups. For each individual, a local regression model is fitted based on its neighboring observations. The latter are weighted by their distances from the observation of interest. Penalized localized regression approaches for dealing with high-dimensional data exist [5, 34]. Instead of using distance in covariate space, our proposed weights correspond to the relationship between covariates and subgroup membership. A drawback of localized regression is that it does not provide global regression parameters, making interpretation difficult. Furthermore, only a small number of observations is used for each local fit, in contrast to our approach, where the weighted likelihood is based on all training data. We define subgroups by multiple cancer studies or cohorts and aim at appropriately combining them to increase power while simultaneously accounting for heterogeneity among the subgroups. This idea of combining data from different data sources is similar to integrative analysis. In high-dimensional settings with genomic predictors, different publications suggest the use of specific penalties in regularized regression for parameter estimation and variable selection across multiple data types. For example, Liu et al. [24] and Liu et al. [25] propose composite penalties with two-level gene selection. In the first selection level, represented by an outer penalty, the association of a specific gene in at least one study is determined. In the second level, inner penalties of ridge or lasso type are used to allow the selection of either the same set of genes or different sets of genes in all studies. Instead of aggregating multiple studies with the same type of (omics) data, Boulesteix et al. 
[8] perform an integrative analysis of multiple omics data types available for the same patient cohort. The authors use a lasso penalty with different penalty parameters for the different data types. Bergersen et al. [2] integrate external information provided by another genomic data type by using a weighted lasso that penalizes each covariate individually with weights inversely proportional to the external information. Gade et al. [15] use a bipartite graph to integrate miRNA and gene expression data from the same patient cohort into one prediction model to find a combined signature that improves the prediction. This graph is built by combining correlations between both data types and external information on target predictions.

Methods

Cox proportional hazards model

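Assume the observed data of patient $i$ consists of the tuple $(t_i, \delta_i)$, the covariate vector $x_i = (x_{i1}, \ldots, x_{ip})^\top \in \mathbb{R}^p$, and the subgroup membership $s_i \in \{1, \ldots, S\}$, with $S$ the number of subgroups in the complete data set, and $i = 1, \ldots, n$. $t_i = \min(T_i, C_i)$ denotes the observed time of patient $i$, with $T_i$ the event time and $C_i$ the censoring time. $\delta_i = \mathbb{1}(T_i \le C_i)$ indicates whether a patient experienced an event ($\delta_i = 1$) or was (right-)censored ($\delta_i = 0$). The most popular regression model in survival analysis is the Cox proportional hazards model [12]. It models the hazard rate $h(t \mid x_i)$ of an individual at time $t$ as

$$h(t \mid x_i) = h_0(t) \exp(\beta^\top x_i),$$

where $h_0(t)$ is the baseline hazard rate, and $\beta = (\beta_1, \ldots, \beta_p)^\top$ is the unknown parameter vector. The regression coefficients $\beta$ are estimated by maximizing a partial likelihood without having to specify the baseline hazard rate.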

Penalized Cox regression model

We consider high-dimensional settings where the number of covariates $p$ exceeds the sample size $n$. In this situation, the solution maximizing the Cox partial likelihood is not unique. One possibility to deal with this problem is to introduce a penalty term into the partial log-likelihood $l(\beta)$, referred to as regularization. This approach is also reasonable in settings with $p < n$ since it considers collinearity among the predictors and helps to prevent overfitting. We use a lasso penalty [32, 33] that performs variable selection and yields a sparse model solution. The resulting maximization problem of the penalized partial log-likelihood is given by

$$\hat{\beta} = \arg\max_{\beta} \left\{ l(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$

The parameter $\lambda > 0$ controls the strength of penalization and is optimized by tenfold cross-validation. For parameter estimation, we use the implementation in the R package glmnet [14].

Weighted Cox partial likelihood

In the standard unweighted partial likelihood, all patients contribute to the same extent to the estimation of the regression coefficients. This might not be desirable when the cohort is heterogeneous due to known subgroups that are associated with different prognosis. In this situation, it is reasonable to fit a separate Cox model for each subgroup. This can be done by using only the data from the subgroup of interest or by including information from the other subgroups. We include patients from all subgroups in the likelihood for one specific subgroup but assign them individual weights $w_i$, $i = 1, \ldots, n$, to account for the heterogeneity in the data. The size of each weight determines to which extent the corresponding patient contributes to the estimation. In accordance with Weyer and Binder [39], the weighted version of the partial log-likelihood is defined as

$$l_w(\beta) = \sum_{i=1}^{n} \delta_i w_i \left[ \beta^\top x_i - \log\left( \sum_{k:\, t_k \ge t_i} w_k \exp(\beta^\top x_k) \right) \right].$$

Weyer and Binder [39] propose the use of fixed weights. The idea is to focus on a specific subgroup $s$ of patients and assign each of these patients a weight of 1, while all other patients are down-weighted with a fixed weight $w \in (0, 1)$:

$$w_i = \begin{cases} 1, & s_i = s, \\ w, & s_i \ne s. \end{cases}$$

Standard subgroup analysis is based only on the patients in the subgroup of interest $s$, which corresponds to $w_i = 0$ for all patients not belonging to $s$. A combined model that pools patients from all subgroups corresponds to $w_i = 1$ for all patients. As an alternative to the idea of Weyer and Binder [39], we propose to estimate individual weights for each patient from the training data. This approach is described in the following section.
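The weighted partial log-likelihood and the fixed-weight scheme can be illustrated with a small sketch. This is a minimal pure-Python illustration, not the glmnet implementation used in the paper: the function names `weighted_cox_partial_loglik` and `fixed_weights` are hypothetical, the risk set is taken as all patients with observed time at least as large as the event time, and no special handling of tied event times is attempted.

```python
import math


def weighted_cox_partial_loglik(times, events, risk_scores, weights):
    """Weighted Cox partial log-likelihood (sketch, no tie correction).

    times:       observed times t_i
    events:      event indicators (1 = event, 0 = censored)
    risk_scores: linear predictors beta^T x_i
    weights:     non-negative patient weights w_i
    """
    n = len(times)
    loglik = 0.0
    for i in range(n):
        if events[i] != 1 or weights[i] == 0.0:
            continue  # censored or zero-weight patients add no event term
        # Weighted sum over the risk set {k : t_k >= t_i}
        denom = sum(
            weights[k] * math.exp(risk_scores[k])
            for k in range(n)
            if times[k] >= times[i]
        )
        loglik += weights[i] * (risk_scores[i] - math.log(denom))
    return loglik


def fixed_weights(subgroups, target, w):
    """Weight 1 in the subgroup of interest, constant weight w elsewhere."""
    return [1.0 if s == target else w for s in subgroups]
```

Under these assumptions, `w = 0` reproduces the likelihood of the standard subgroup model and `w = 1` that of the pooled model without a subgroup covariate, which matches the two limiting cases described above.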

Estimation of weights

Individual weights for each patient in each subgroup-specific likelihood can be estimated from the training data following the idea of Bickel et al. [3]. The weights match the joint distribution of all subgroups to the target distribution of a specific subgroup $s$, such that a patient who is likely to belong to the subgroup of interest receives a higher weight in the subgroup-specific model. Assume the entire training data from all subgroups are summarized in the covariates $x$ and a response $y$. In time-to-event settings, the response corresponds to the tuple $y = (t, \delta)$, with $t$ the observed time until an event or censoring and $\delta$ the event indicator. Let $\ell(\cdot, \cdot)$ be an arbitrary loss function and $f(x, s)$ the predicted response based on the observed covariates in subgroup $s$. $f(x, s)$ should correctly predict the true response $y$ and thus minimize the expected loss with respect to the unknown joint distribution $p(x, y \mid s)$ for each subgroup $s$, given by $E_{(x, y) \sim p(x, y \mid s)}[\ell(f(x, s), y)]$. The following equation shows that this expected loss for each subgroup equals the expected weighted loss with respect to the joint distribution of the pooled data from all subgroups, $p(x, y) = \sum_{s'} p(s')\, p(x, y \mid s')$:

$$E_{(x, y) \sim p(x, y \mid s)}[\ell(f(x, s), y)] = E_{(x, y) \sim p(x, y)}[w_s(x, y)\, \ell(f(x, s), y)].$$

The subgroup-specific weights for each patient are defined as

$$w_s(x, y) = \frac{p(x, y \mid s)}{p(x, y)} = \frac{p(s \mid x, y)}{p(s)}.$$

The last equation, which follows from Bayes' rule, shows that the weights can be expressed in terms of $p(s)$ and $p(s \mid x, y)$. $p(s)$ can be estimated by the relative frequency of subgroup $s$ in the overall training cohort, and estimating $p(s \mid x, y)$ can be considered as a multi-class classification problem [3]. We estimate $p(s \mid x, y)$ by multinomial logistic regression or by random forest, using the implementation in the R packages glmnet [14] and ranger [43], respectively. Unlike Bickel et al. [3], we use tenfold cross-validation to estimate $p(s \mid x, y)$ from the training data to prevent overfitting. As a result, for each subgroup, we obtain an n-dimensional vector of estimated individual weights. Unlike the fixed weights by Weyer and Binder [39], our proposed estimated weights are not constrained to (0, 1) as the ratio $p(s \mid x, y) / p(s)$ can take values larger than 1. 
The R package glmnet, which we use to fit the weighted penalized Cox model, internally rescales the weights so that they add up to the sample size (see the vignette “An Introduction to glmnet”). However, normalizing the weights to range from 0 to 1 is not necessary as all individuals contribute to the likelihood with a certain weight and rescaling all weights in the likelihood would not change the estimated Cox model.
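As a rough illustration of the weight formula, the sketch below turns predicted subgroup probabilities into weights by dividing by the relative subgroup frequency. It is written in plain Python and is not the authors' R code; the function names and the input layout (one dictionary of class probabilities per patient, assumed to come from a cross-validated classifier) are assumptions made for this example. The second helper mimics the rescaling to the sample size that the text attributes to glmnet.

```python
from collections import Counter


def subgroup_weights(probs, subgroups, target):
    """Estimated weights w_i = p(target | x_i, y_i) / p(target).

    probs:     list of dicts, probs[i][s] = predicted probability that
               patient i belongs to subgroup s (assumed cross-validated)
    subgroups: observed subgroup label of each patient
    target:    subgroup of interest
    """
    n = len(subgroups)
    p_target = Counter(subgroups)[target] / n  # relative frequency of the subgroup
    return [row[target] / p_target for row in probs]


def rescale_to_sample_size(weights):
    """Rescale weights so that they sum to the number of patients."""
    total = sum(weights)
    n = len(weights)
    return [w * n / total for w in weights]
```

With four equally sized subgroups and a classifier that cannot discriminate between them (all predicted probabilities equal to 0.25), every weight equals 1, so the weighted model coincides with a pooled fit. A confident prediction for the target subgroup (for example 0.9 with a subgroup frequency of 0.5) gives a weight above 1.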

Prediction performance

Prediction performance of all Cox models is evaluated by Harrell’s C-(concordance) index [17], implemented in the R package Hmisc [18]. The C-index is a measure of predictive discrimination and defined as the proportion of all usable pairs of patients with concordant predicted and observed survival times. For a concordant pair of patients, the survival time of the patient with larger risk score is known to be shorter than the survival time of the patient with lower risk score, such that the risk measure and the survival time lead to the same ordering of patients. Let $t_i$, $t_j$ be the observed survival times of patients $i$ and $j$, and $\hat{\eta}_i$, $\hat{\eta}_j$ the corresponding risk scores (with $\hat{\eta}_i = \hat{\beta}^\top x_i$, $x_i$ corresponding to the test data and $\hat{\beta}$ estimated from the training data). A pair is considered concordant if $t_i < t_j$ and $\hat{\eta}_i > \hat{\eta}_j$. The C-index is defined as

$$CI = \frac{1}{N} \sum_{i:\, \delta_i = 1} \; \sum_{j:\, t_j > t_i} \mathbb{1}(\hat{\eta}_i > \hat{\eta}_j),$$

where $N$ is the number of comparable pairs that standardizes CI to [0, 1]. A patient pair is considered unusable if both patients die at the same time, or both patients are censored, or if one is censored before the other one dies. A CI close to 1 stands for a very good prediction and values around 0.5 suggest a random prediction. While Harrell’s C-index is easy to interpret and compute as a measure of the accuracy of prognostic survival models, it depends on the censoring distribution. To overcome this shortcoming, Uno et al. [37] introduce inverse probability censoring weights to the C-index to adjust for right censoring. Instead of evaluating the “overall” prediction accuracy, it can be of interest to quantify the discriminative ability at each time point under consideration. In this situation, time-dependent ROC analysis can be used to distinguish at each time point t between patients having an event at or up to t and those having an event after t. The corresponding area under the time-dependent ROC curve provides an estimator of incidence/dynamic or cumulative/dynamic AUC for right-censored time-to-event data [19, 36].
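The pair counting behind Harrell's C-index can be sketched in a few lines. The following pure-Python function is an illustration only and is not the Hmisc implementation; the name `harrell_c_index` is hypothetical, and counting tied risk scores as half concordant is an assumption of this sketch that the text does not specify.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    A pair (i, j) is usable if patient i has an observed event and
    patient j has a strictly longer observed time. The pair is concordant
    if the patient with the earlier event has the larger risk score.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier member of a usable pair
        for j in range(n):
            if times[j] > times[i]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # assumption: tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

Risk scores that order patients exactly by decreasing survival time give a value of 1, the reversed ordering gives 0, and pairs in which both patients are censored, or in which one is censored before the other's event, never enter the count.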

Model fitting and evaluation

We compare our weighted approach with the standard (unweighted) models, i.e. the combined model and the subgroup model, as well as a weighted Cox model with fixed weights as proposed by Weyer and Binder [39]. In the latter, patients belonging to a certain subgroup are assigned a weight of 1 in the subgroup-specific likelihood, while all other observations are down-weighted with a constant weight $w \in (0, 1)$. For our proposed approach we compare three different classification methods for weights estimation with respect to prediction performance: multinomial logistic regression with lasso (lasso) or ridge (ridge) penalty, and random forest (rf). All Cox models include a lasso penalty for variable selection. We compare the following Cox models concerning prediction performance; the expressions in parentheses denote the abbreviations of the models in the following analyses: (i) weighted model with estimated weights (lasso, ridge, rf); (ii) weighted model with fixed weights ($w$); (iii) standard subgroup model (sub), using only patients of a specific subgroup; (iv) standard combined model (all), using patients of all subgroups, with the subgroup indicator included as additional covariate. Figure 1 provides a schematic representation of the analysis pipeline. First, we randomly generate training data sets for model fitting and test data sets for model evaluation and repeat this procedure 100 times. In the application example, we repeatedly randomly split the complete data into training (with proportion 0.632) and test sets. We perform subsampling stratified by subgroup and event indicator, to take different subgroup sizes and censoring proportions into account. In the simulation study, we repeatedly randomly generate independent training and test sets of the same size and with the same distribution parameters. 
Second, we estimate individual subgroup weights from the training data using different classification methods and 10-fold cross-validation (CV). Next, we fit the combined and weighted Cox models based on the training data of all subgroups, while the standard subgroup model is based on the training data of the respective subgroup only. Finally, we evaluate the prediction performance of the estimated Cox models with respect to a certain subgroup using only the test data of this particular subgroup. The R package batchtools [23] is used for parallelization and the R package mlr [6] is used as a framework for weights estimation, Cox model fitting and evaluation by the C-index.
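The stratified subsampling used for the application example can be sketched as follows. This is a minimal Python illustration rather than the authors' batchtools/mlr pipeline; the function name `stratified_split`, the rounding of the per-stratum training size, and the seed handling are assumptions of this example.

```python
import random
from collections import defaultdict


def stratified_split(subgroups, events, train_fraction=0.632, seed=0):
    """Split patient indices into training and test sets.

    Sampling is done without replacement and separately within each
    stratum defined by (subgroup, event indicator), so that subgroup
    sizes and censoring proportions are similar in both sets.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, key in enumerate(zip(subgroups, events)):
        strata[key].append(idx)

    train, test = [], []
    for key in sorted(strata):
        members = strata[key][:]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))  # assumption: simple rounding
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)
```

Repeating the call with different seeds gives the repeated random splits described above; the weights and Cox models would then be fitted on the training indices only and evaluated on the test indices of the subgroup of interest.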

Fig. 1

Simulation set-up. Analysis pipeline for the simulation study; Brighter regions in the training and test set indicate the observations of the subgroup

Results of the simulation study

Simulated data

We simulate four subgroups (1A, 1B, 2A, 2B) of equal size n from two differently distributed groups denoted by the index $g \in \{1, 2\}$: group 1 including subgroups 1A and 1B, and group 2 including subgroups 2A and 2B. Within each group we use the same parameters for the simulation of the data. We simulate the survival data from a Weibull distribution according to Bender et al. [1], with scale parameter $\lambda_g$ and shape parameter $\nu_g$ estimated from two independent lung cancer cohorts (GSE37745 and GSE50081). For this purpose, we compute survival probabilities at 3 and 5 years using the Kaplan-Meier estimator for both lung cohorts separately. The corresponding probabilities are 57% and 75% for 3-year survival, and 42% and 62% for 5-year survival, respectively. Individual event times in group $g$ are simulated as

$$T_i = \left( \frac{-\log(U_i)}{\lambda_g \exp(\beta_g^\top x_i)} \right)^{1/\nu_g}, \quad U_i \sim \mathcal{U}(0, 1),$$

with true effects $\beta_g \in \mathbb{R}^p$, $g = 1, 2$. We randomly draw noninformative censoring times $C_i$ from a Weibull distribution with the same parameters as for the event times, resulting in approximately 50% censoring rates in both groups. The individual observed event indicators and times until an event or censoring are defined as $\delta_i = \mathbb{1}(T_i \le C_i)$ and $t_i = \min(T_i, C_i)$. For each subgroup we simulate p uncorrelated (genetic) covariates from a multivariate normal distribution with mean vector $\mu_g$ and diagonal covariance matrix $\Sigma$. In previous simulation studies we compared the results of different covariance structures, including realistic dependence structures estimated from real gene expression data, but found no remarkable differences [26]. Elements of $\mu_g$ are defined by a linear function with parameter $\epsilon$ that reflects the degree of similarity between the two groups (larger values of $\epsilon$ mean larger differences). The mean level assigned to a gene depends on the size of its effect on the outcome: distinct levels are used for genes with a strong effect, genes with a moderate effect, and genes with a weak or no effect. This choice relies on the assumption that prognostic genes have a higher expression level than noise genes. 
The magnitude of $\mu_g$ is chosen following real gene expression data, where the expression values typically range from 4 to 12 after transformation to $\log_2$ scale. In all simulated scenarios, we assume the first 12 genes to be prognostic in at least one of the two groups, with corresponding effects given in Table 1. We include subgroup-specific effects (genes 1 to 4), opposite effects (genes 5 and 6), effects in the same direction but of different size (genes 7 and 8), and joint effects of varying sizes (genes 9 to 12). We choose these effects with alternate signs so that they sum up to zero, resulting in reasonable simulated survival times. In settings with $p > 12$, we assume all remaining genes to represent noise and to be unrelated to the survival times in both groups ($\beta_{gj} = 0$ for $j > 12$).
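The simulation of Weibull survival times can be illustrated with a short sketch. It assumes the parameterization $S(t) = \exp(-\text{scale} \cdot t^{\text{shape}})$ of Bender et al. and obtains scale and shape by solving this equation exactly at the 3-year and 5-year survival probabilities; whether the authors derived their parameters in exactly this way is not stated in the text. All function names are hypothetical, and drawing censoring times from the baseline distribution (linear predictor 0) is a further assumption.

```python
import math
import random


def weibull_params(s3, s5, t3=3.0, t5=5.0):
    """Solve S(t) = exp(-scale * t**shape) for (scale, shape) from two survival probabilities."""
    shape = math.log(math.log(s5) / math.log(s3)) / math.log(t5 / t3)
    scale = -math.log(s3) / t3 ** shape
    return scale, shape


def simulate_weibull_time(eta, scale, shape, rng):
    """Inverse-transform sampling: T = (-log(U) / (scale * exp(eta)))**(1/shape)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return (-math.log(u) / (scale * math.exp(eta))) ** (1.0 / shape)


def simulate_patient(eta, scale, shape, rng):
    """Return (observed time, event indicator) for one patient."""
    event_time = simulate_weibull_time(eta, scale, shape, rng)
    # Assumption: censoring times use the baseline parameters only (eta = 0)
    censor_time = simulate_weibull_time(0.0, scale, shape, rng)
    return min(event_time, censor_time), int(event_time <= censor_time)


# Survival probabilities reported above for the first cohort: 57% (3 years), 42% (5 years)
scale_1, shape_1 = weibull_params(0.57, 0.42)
rng = random.Random(2021)
sample = [simulate_patient(0.0, scale_1, shape_1, rng) for _ in range(5)]
```

Here `eta` stands for the linear predictor $\beta_g^\top x_i$ of a patient; with effects that sum to zero, as in Table 1, the simulated times stay on a scale comparable to the cohorts used for calibration.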

Table 1

Effects in the simulation study

Gene	1	2	3	4	5	6	7	8	9	10	11	12
$\beta_1$	1	1	0	0	-0.5	0.5	0.75	0.25	-1	-1	-0.75	-0.25
$\beta_2$	0	0	1	1	0.5	-0.5	0.25	0.75	-1	-1	-0.75	-0.25

Effects of the first 12 genes for the simulation of survival outcome

In our simulation study we focus on high-dimensional settings where the sample size n is small compared to the number of covariates (genes) p, a typical characteristic of gene expression data. Table 2 shows all parameters tested in the simulation study with their respective values, resulting in 252 different combinations in total.

Table 2

Parameter combinations in the simulation study

Parameter	Values (per subgroup)
n	20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000
p	12, 100, 200
$\epsilon$	0, 0.1, 0.2, 0.3, 0.4, 0.5, 1

All parameters tested in the simulation study with their respective values, resulting in 252 different combinations in total
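The number of scenarios follows directly from Table 2: 12 sample sizes, 3 numbers of covariates and 7 values of $\epsilon$ give 12 × 3 × 7 = 252 combinations. A short Python check, with variable names chosen only for this illustration:

```python
from itertools import product

n_values = [20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000]  # per subgroup
p_values = [12, 100, 200]
epsilon_values = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 1]

# Full factorial grid of simulation scenarios
scenarios = list(product(n_values, p_values, epsilon_values))
print(len(scenarios))  # 252
```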

Weights estimation

Our proposed subgroup model uses patients from all subgroups for training but assigns them individual weights in the Cox partial likelihood based on their subgroup membership. Weights for a specific subgroup are estimated by the individual predicted probabilities of belonging to this subgroup, obtained by classification, divided by the subgroup proportion. Thus, a patient who is likely to belong to the subgroup of interest receives a higher weight in the subgroup-specific likelihood. We compare three different classification methods that are appropriate for multi-class problems and high-dimensional covariates with respect to their predictive quality and their ability to discriminate between differing subgroups. Figure 2 displays boxplots of the estimated weights for subgroup 1A across all training sets in two selected simulation scenarios. The x-axis represents the true subgroup membership of each observation, and the y-axis the individual weights estimated by random forest (rf) for subgroup 1A. Results of all three classification methods (lasso, ridge, rf) are relatively similar, although rf tends to perform best for small $\epsilon$ and n, whereas for large sample size the discriminative ability of lasso and ridge is slightly better. The largest difference in results is obtained for different values of $\epsilon$. When all subgroups are very similar ($\epsilon = 0$), multi-class classification fails to distinguish the two differing groups. All observations are assigned a weight of approximately 1 in all subgroup models, similar to the standard combined Cox model. The corresponding area under the ROC curve (AUC) for the distinction between group 1 and 2 (computed based on test data and cross-validated training data) is approximately 0.5, indicating that prediction performance is not much better than random (see Additional file 1: Figure S1). 
Increasing values of $\epsilon$, meaning larger differences between the two groups, lead to improved prediction performance (see Additional file 1: Figure S1), and for the largest values of $\epsilon$ classification provides an almost perfect separation between both groups, with an AUC close to 1. Larger sample size n and smaller number of covariates p also result in better prediction performance.

Fig. 2

Estimated weights in the simulation. Estimated weights for subgroup 1A obtained by random forest based on simulated training data

Parameter estimation and prediction performance

Weighted Cox models, including fixed or estimated weights (with different classification methods for weights estimation), are compared to the standard combined and subgroup model, first by estimated regression coefficients and second by prediction performance. Figure 3 shows scatterplots of the mean estimated regression coefficients of the first 12 prognostic genes in group 1 (mean across all training sets and subgroups 1A and 1B) for selected simulation scenarios. For $\epsilon = 0$, the combined model and the weighted model with estimated weights provide very similar results, as expected. They identify joint effects better than the subgroup model when the sample size is small and otherwise equally well. However, the subgroup model estimates subgroup-specific effects better, especially for increasing sample size, whereas the other two model approaches tend to average effects across all subgroups. For larger values of $\epsilon$, the estimated weights model detects subgroup-specific effects increasingly better than the combined model, and similarly well or even better than the standard subgroup model when the sample size is small. Results for fixed weights lie between the subgroup model and the combined model.

Fig. 3

Estimated effects in the simulation. Mean estimated regression coefficients of the first 12 prognostic genes in all Cox models for group 1 (averaged across all simulated training data sets, regardless of whether a gene is selected in individual splits or not, and subgroups 1A and 1B) for and . The symbol ‘x’ represents the true simulated effects of both groups: in black the group of interest (here group 1) and in grey the other group

These findings agree with the corresponding mean inclusion frequencies (MIFs), defined as the proportion of training data sets in which a specific covariate j is included in the model (). For small sample size, the MIFs of the standard combined model and the estimated weights approach are larger than the MIFs of the standard subgroup model. This has a positive impact on the detection of joint effects, but subgroup-specific effects that are present in only one group may be more often erroneously selected in the other group. For increasing sample size the MIFs of all models also increase. For larger values of , the MIFs of the estimated weights model move closer to the MIFs of the subgroup model regarding subgroup-specific effects and are still similar to the combined model for joint effects.

Finally, we assess the prediction performance of all Cox models in terms of the C-index. High values of the C-index (close to 1) indicate good predictive performance, whereas 0.5 corresponds to random prediction. Figure 4 displays the mean C-index (averaged across all test sets and subgroups). For the combined model and the weighted model with estimated weights exhibit a very similar predictive ability, which is better than that of the subgroup model when the sample size is small. However, when the sample size increases, the subgroup model outperforms the other methods. For larger values of , the estimated weights approach performs best when the sample size is small and otherwise as well as the subgroup model. Estimated weights by lasso and ridge improve in comparison to rf (random forest) for larger n. Unsurprisingly, the prediction performance of fixed weights lies between the standard combined model and the subgroup model. Mean C-index values for all 252 simulation scenarios and all 14 Cox model types can be found in Additional file 1: Table S2.
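The mean inclusion frequency defined above, and the averaging of coefficients "regardless of whether a gene is selected", can be illustrated with a minimal sketch. The function names, the dictionary layout of the fitted coefficients, and the numbers are assumptions made for illustration and are not taken from the study.

```python
from typing import Dict, List


def mean_inclusion_frequency(coefs_per_split: List[Dict[str, float]],
                             covariates: List[str]) -> Dict[str, float]:
    """Share of training sets in which each covariate has a nonzero coefficient."""
    n_splits = len(coefs_per_split)
    if n_splits == 0:
        raise ValueError("at least one fitted model is required")
    return {
        g: sum(1 for coefs in coefs_per_split if coefs.get(g, 0.0) != 0.0) / n_splits
        for g in covariates
    }


def mean_coefficient(coefs_per_split: List[Dict[str, float]],
                     covariates: List[str]) -> Dict[str, float]:
    """Average coefficient over all splits; unselected covariates count as zero."""
    n_splits = len(coefs_per_split)
    if n_splits == 0:
        raise ValueError("at least one fitted model is required")
    return {
        g: sum(coefs.get(g, 0.0) for coefs in coefs_per_split) / n_splits
        for g in covariates
    }


# Hypothetical lasso fits from four training sets (illustrative numbers only).
fits = [
    {"gene_1": 0.8, "gene_2": 0.0},
    {"gene_1": 0.5},
    {"gene_1": 0.0, "gene_2": -0.3},
    {"gene_1": 0.7},
]
genes = ["gene_1", "gene_2"]

mif = mean_inclusion_frequency(fits, genes)
mean_beta = mean_coefficient(fits, genes)

# Genes that are selected in at least half of the training sets.
stable = [g for g in genes if mif[g] >= 0.5]
```

Because unselected genes enter the average as zero, a gene with a strong effect but a low MIF still shows a mean coefficient close to zero, which is why the two summaries are read together.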

Fig. 4

Prediction performance in the simulation. Mean C-index, averaged across all test data sets and subgroups, for
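As a reading aid for the C-index values, the following sketch computes a simple concordance index of the Harrell type for right-censored data. This is an assumption made for illustration: the estimator used in the study may differ (for example, a censoring-adjusted version), and the function name and inputs are hypothetical.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risk_scores: Sequence[float]) -> float:
    """Fraction of comparable pairs in which the patient with the earlier
    observed event has the higher predicted risk (ties in risk count 0.5)."""
    n = len(times)
    concordant = 0.0
    comparable = 0
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical test set: the third patient is censored.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]

perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
reversed_order = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])
uninformative = concordance_index(times, events, [0.0, 0.0, 0.0, 0.0])
```

A risk score that orders patients exactly by event time gives a value of 1, a constant score gives 0.5, and a reversed ordering gives 0, matching the interpretation given in the text.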

Results of the application to NSCLC cohorts

We apply all methods presented in the previous section to the following four non-small cell lung cancer (NSCLC) cohorts comprising in total patients with available overall survival endpoint and Affymetrix microarray gene expression data: GSE29013 (, 18 events), GSE31210 (, 35 events), GSE37745 (, 143 events), and GSE50081 (, 65 events). For the analysis, we use the total number of genetic covariates measured in each cohort, as well as two preselected reduced gene sets. One gene filter is defined by the features with the highest variance in gene expression values across all four cohorts, referred to as top-1000-variance genes. The second gene filter is a literature-based selection of prognostic genes. More details on the data description and preprocessing can be found in Additional file 1.

In the following, we consider the four lung cancer cohorts as subgroups. We compare the estimated weights using three classification methods (lasso, ridge, rf) and three different pre-specified sets of genes (gene filters): all available genes (), top-1000-variance genes (), and a literature-based selection of prognostic genes (). Since all results are very similar, we show them only for the top-1000-variance genes and rf, as an example, in Fig. 5. Boxplots of the estimated weights suggest that the subgroups are very different from each other. Patients belonging to the subgroup of interest receive a relatively large weight in the respective subgroup-specific model, while the contribution of all other subgroups is close to zero. This resembles the standard subgroup model.

Fig. 5

Estimated weights in the application. Estimated weights for all lung cancer cohorts using random forest and the top-1000-variance genes as gene filter

The estimated weights for patients from GSE29013 are the highest in the corresponding subgroup model for GSE29013, and much higher compared to the other cohorts. The reason is that GSE29013 is by far the smallest subgroup and when the estimated probabilities of belonging to are divided by the very small relative frequency , the resulting probability ratio corresponding to the weights gets very large.

All analyses are based on probe set level of gene expression data, but for the illustration of the parameter estimates in the Cox models, probe set IDs are translated into gene symbols using the R/Bioconductor annotation packages hgu133plus2.db [10] and AnnotationDbi [28]. In case of missing gene symbols, original probe set IDs are retained. Corresponding gene annotation is retrieved from the Ensembl website [44] to obtain gene-specific information on encoded proteins, related pathways, Gene Ontology (GO) annotations, associated diseases, and related articles in PubMed. This information is retrieved from the NCBI Gene [9] and GeneCards [16] databases.

Figure 6 shows, separately for each subgroup, the mean estimated regression coefficients of the most frequently selected top-1000-variance genes (genes with a mean inclusion frequency (MIF) greater than or equal to 0.5 in any model type). Eight genes are in the overlap of all subgroups, among them immune-related genes (DEFB1, AOC1, JCHAIN) as well as genes (215780_s_at/SET, SPP1) that were reported in the literature to be associated with different types of cancer. Often they are most frequently selected by the combined model and the weighted model with large fixed weights. Subgroup-specific genes with strong effects on overall survival and high MIFs in the proposed weighted model involve the following cancer-related genes: ADH1C, BMP5, LCN2 and PLOD2 in GSE31210, CST1 in GSE37745, as well as AREG and COL4A3 in GSE29013.
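The effect of subgroup size on the weights described above for GSE29013 can be made concrete with a small numerical sketch. It assumes, as stated in the text, that a weight is the predicted probability of belonging to the subgroup divided by that subgroup's relative frequency; the function name, cohort labels, and sizes are hypothetical and do not correspond to the NSCLC cohorts.

```python
from typing import Dict, List


def subgroup_weights(pred_probs: List[float],
                     subgroup_sizes: Dict[str, int],
                     subgroup: str) -> List[float]:
    """Weights for the model of `subgroup`: predicted membership probability
    divided by the relative frequency of that subgroup in the pooled data."""
    n_total = sum(subgroup_sizes.values())
    rel_freq = subgroup_sizes[subgroup] / n_total
    return [p / rel_freq for p in pred_probs]


# Hypothetical pooled data set with one small and one large cohort.
sizes = {"small_cohort": 25, "large_cohort": 375}

# The same predicted probability (0.75) yields very different weights.
w_small = subgroup_weights([0.75], sizes, "small_cohort")[0]
w_large = subgroup_weights([0.75], sizes, "large_cohort")[0]
```

With these illustrative sizes the relative frequencies are 0.0625 and 0.9375, so an identical predicted probability produces a weight of 12 in the small-cohort model but only 0.8 in the large-cohort model.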

Fig. 6

Estimated effects in the application. Different types of Cox models including the top-1000-variance genes as covariates. Mean estimated regression coefficients of selected genes (averaged across all training sets regardless of whether a gene is selected in individual splits or not). For each subgroup, genes with a mean inclusion frequency greater than or equal to 0.5 in any model type are selected

For the other two gene filters (prognostic genes and all genes), parameter estimates of the most stable genes in all Cox models are displayed in Additional file 1: Figures S2 and S3. Cox models including all genes identify fewer genes compared to the other gene filters, which is likely caused by the large number of noise genes. There are two cancer-related genes most frequently selected across all subgroups by the combined model and the weighted model with large fixed weights: CPNE8 and SPP1. MIFs and estimated regression coefficients of the subgroup model and the proposed weighted model are mainly close to zero, except for PTGER3 in GSE31210. PTGER3 induces tumor progression in different cancer types including adenocarcinoma of the lung. This may explain the specific association with GSE31210, which is the only subgroup comprising exclusively adenocarcinoma.

Interestingly, almost all selected genes are either in the overlap of all subgroups or specific for only one subgroup. There are hardly any genes selected by two or three subgroups, which may be due to the fact that these lung cancer studies are heterogeneous (see Additional file 1: Figure S4). There is one gene (SPP1) that is in the overlap of all four subgroups and all three gene sets. SPP1, also known as osteopontin (OPN), is involved in inflammatory response, osteoblast differentiation for bone formation and attachment of osteoclasts to the mineralized bone matrix for bone resorption. Further, SPP1 is associated with several malignant diseases and prognosis in NSCLC.

Finally, all Cox models are compared with regard to prediction performance. In Fig. 7, results of the C-index across all test sets are shown for the top-1000-variance genes. The combined model and fixed weights of increasing size tend to have the highest predictive accuracy, while the estimated weights approach and the standard subgroup model perform similarly. Particularly in the subgroup model for GSE29013, the performance of the estimated weights differs from the fixed weights because the estimated weights for GSE29013 are much higher compared to those for all other subgroups, which is similar to the standard subgroup model. The corresponding boxplots of the C-index for the prognostic gene filter and all genes are shown in Additional file 1: Figures S5 and S6. Random forest tends to be the best classification method in combination with prognostic genes and all genes, whereas ridge tends to perform slightly better than the other classification methods along with top-1000-variance genes. However, overall prediction performance is mostly moderate and not much better than random.

Fig. 7

Prediction performance in the application. Different types of Cox models including the top-1000-variance genes as covariates. Boxplots of C-index based on all test sets for the prediction of each subgroup

Discussion

We have focused on three major objectives: prediction of a patient’s survival, selection of important covariates, and consideration of heterogeneity in data due to subgroups of patients that are known in advance. Specifically, we have aimed at estimating a separate risk prediction model for each subgroup using patient-level training data from all available subgroups and individually weighting patients according to their similarity to the subgroup of interest. Our approach should correctly identify common as well as subgroup-specific effects and have improved prediction accuracy over standard approaches. As standard approaches, we consider standard subgroup analysis, including only patients from the subgroup of interest, and standard combined analysis that simply pools patients from all subgroups.

We have proposed a Cox model with lasso penalty for variable selection and a weighted version of the partial likelihood that includes patients from all subgroups but with individual weights. This allows sharing information between subgroups to increase power when this is supported by the data, meaning that subgroups are similar in their covariates and survival outcome. Weights for a specific subgroup are estimated by classification and cross-validation on the training data from all subgroups, such that they represent the probability of belonging to that subgroup given the observed covariates and survival outcome. These predicted conditional probabilities are divided by the a priori probability of the respective subgroup to obtain the subgroup-specific weights for each patient. Patients who fit well into the subgroup of interest receive higher weights in the subgroup-specific model. The estimated subgroup-specific model can then be applied to the test data from the corresponding subgroup to obtain predictions for that subgroup. As an alternative to our individual weights, one could restrict the model to the case where all weights for the patients from a subgroup must be the same [29, 39].

We have considered three different classification methods for weights estimation (multinomial logistic regression with lasso or ridge penalty and random forest), and, based on simulated data and on real data, we have compared our proposed weighted Cox model to both standard Cox models (combined and subgroup), as well as a weighted Cox model with different fixed weights as proposed by Weyer and Binder [39]. Observations belonging to a certain subgroup were assigned a weight of 1 in the subgroup-specific likelihood, while all other observations were down-weighted with a constant weight .

Simulation results have shown that when subgroups were very similar and hardly distinguishable from each other in terms of their covariate values and only had a few different subgroup-specific effects, classification methods failed to discriminate between distinct subgroups and all observations were assigned a weight around one corresponding to the standard combined model. In this situation, results of the combined model and the proposed weighted model were very similar as intended. Both models had better prediction performance and larger power to correctly identify joint effects than the standard subgroup model when the sample size was small (). The potential bias introduced in the estimation of subgroup-specific effects (tendency to average subgroup-specific effects across subgroups) is, however, not very likely in the situation of very similar subgroups. For increasing sample size, the standard subgroup model outperformed the other models regarding prediction and selection accuracy, in particular in terms of unbiased estimation of subgroup-specific effects. When differences between subgroups became larger, classification succeeded in discriminating between different subgroups, and our proposed weighted model improved over the combined model in correctly identifying subgroup-specific effects and resulted in higher prediction accuracy. It clearly outperformed the standard subgroup model when the sample size was low, and otherwise performed similarly well. Results with fixed weights, as expected, always lay between the standard subgroup model and the combined model. However, they cannot flexibly adapt to different degrees of heterogeneity between subgroups as our proposed estimated weights do.

In the application example, we considered four lung cancer studies as subgroups comprising overall survival outcome, and gene expression data as covariates. Three different gene filters were used: all available genes, top-1000-variance genes, and a literature-based selection of prognostic genes. The real data application demonstrated the case of strongly differing subgroups where adding data from other subgroups is not appropriate, as reflected by the small estimated weights. Our proposed weighted approach resembled the standard subgroup model, where only the subgroup of interest is assigned a high weight and all other subgroups have weights close to zero. The results of all three classification methods were similar. Prediction performance of Cox models indicated that logistic regression with ridge penalty and top-1000-variance genes outperformed the other two classification methods, while random forest tended to perform best in combination with all genes and with prognostic genes. However, the prediction performance of all Cox models was mainly moderate and not much better than random prediction. The combined model and the weighted model with fixed weights of increasing size tended to have slightly higher predictive accuracy, while the estimated weights approach and the standard subgroup model performed similarly. Genes identified most frequently by the former models were often present in all subgroups and some of them were reported in the literature to be associated with prognosis in various cancers.
However, the corresponding estimated regression coefficients were often relatively small, suggesting weak effects on survival outcome. Few candidate genes with reported cancer relation and relatively strong subgroup-specific effects were selected most frequently by either the subgroup model or the proposed weighted model.

A major reason for the overall moderate prediction accuracy in the application example may be that the present lung cancer studies are too heterogeneous. On the one hand, they comprise different histological subtypes that are known to be associated with a different prognosis. One could think of using only patients belonging to the same histological subtype such as adenocarcinoma. However, this would make the sizes of the patient subgroups even smaller. On the other hand, tissue processing and RNA extraction for generating gene expression data as well as patient inclusion criteria vary between studies. In GSE29013, genome-wide expression profiling was based on formalin-fixed paraffin-embedded (FFPE) tissues rather than fresh frozen tissues as in GSE37745 and GSE50081, which might influence expression levels. GSE31210 and GSE50081 include only patients with stage I and II, and GSE31210 is additionally restricted to lung adenocarcinomas.

In Madjar [26] we studied the influence of further parameters for weights estimation on prediction performance: the inclusion of interactions between genomic covariates and survival time in the classification model, as well as replacement of the survival time by the Nelson–Aalen estimator of the cumulative hazard rate in the set of covariates in the classification model. The latter was proposed by White and Royston [40] in the context of multiple imputation. We also considered a simulation with uneven sample sizes across subgroups and compared standard classification without sampling techniques with two oversampling techniques (random oversampling and synthetic minority oversampling technique). Oversampling increases the sample size of the small subgroup so that it is balanced with respect to the other subgroups. However, we found no considerable influence of the further parameters for weights estimation on prediction performance, and oversampling also seemed to have no effect. Simulations with uneven sample sizes showed that the predicted probabilities of belonging to a specific subgroup were smaller for the subgroup with smaller sample size compared to the other subgroups having the same large sample size. However, this effect was compensated for when the predicted probability was divided by the relative frequency of each subgroup to obtain the weights ratio. This resulted in similar prediction accuracies for all subgroups, whereas the standard subgroup model clearly showed a worse prediction performance for the small subgroup.

We make the important assumption that subgroups are known in advance, with the subgroup affiliation of each patient being unique and fixed, which is generally the case when patients from different clinical centers are considered. However, in situations with unknown subgroups the latent subgroup structure would first need to be determined using methods such as clustering. A wide variety of approaches have been proposed for the clustering of molecular data [13, 27, 42] with extensions to sparse clustering [30, 41] and integrative clustering of multiple omics data types [11, 22].
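The relation between the weighted model and the two standard models discussed above can be sketched with one common form of a weighted Cox partial log-likelihood, in which each patient's weight scales both the event contribution and the contribution to the risk set. This form is an assumption made for illustration; the authors' exact definition is given in the Methods section. All function names and data are hypothetical, and tied event times are not treated specially.

```python
import math
from typing import List, Sequence


def weighted_cox_log_partial_likelihood(times: Sequence[float],
                                        events: Sequence[int],
                                        linear_predictors: Sequence[float],
                                        weights: Sequence[float]) -> float:
    """Sum over events of w_i * (eta_i - log(sum_{j in risk set} w_j * exp(eta_j)))."""
    n = len(times)
    loglik = 0.0
    for i in range(n):
        if not events[i] or weights[i] == 0.0:
            continue  # censored or zero-weight patients add no event term
        risk_sum = sum(
            weights[j] * math.exp(linear_predictors[j])
            for j in range(n)
            if times[j] >= times[i]
        )
        loglik += weights[i] * (linear_predictors[i] - math.log(risk_sum))
    return loglik


def fixed_weights(labels: Sequence[str], subgroup: str, w: float) -> List[float]:
    """Weight 1 for the subgroup of interest, constant weight w for all others."""
    return [1.0 if s == subgroup else w for s in labels]


# Hypothetical pooled data from two subgroups, A and B.
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
eta = [0.0, 0.0, 0.0]  # linear predictors x_i' beta at beta = 0
labels = ["A", "B", "A"]

# w = 1 reproduces the combined model, w = 0 the subgroup model for A.
combined = weighted_cox_log_partial_likelihood(
    times, events, eta, fixed_weights(labels, "A", 1.0))
subgroup_only = weighted_cox_log_partial_likelihood(
    times, events, eta, fixed_weights(labels, "A", 0.0))
unweighted_a = weighted_cox_log_partial_likelihood(
    [1.0, 3.0], [1, 1], [0.0, 0.0], [1.0, 1.0])
```

Under this form, fixed weights strictly between 0 and 1 interpolate between the two standard models, whereas estimated weights let each patient's contribution vary individually.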

Conclusions

Predicting cancer survival risk based on high-dimensional molecular measurements for patients combined from heterogeneous subgroups/cohorts is an important problem. The central idea of our proposed approach is to improve the prediction for a specific subgroup when data from other subgroups are also available but it is not clear a priori which of these subgroups can help to improve the prediction for the subgroup of interest. By adding data from other subgroups in a penalized weighted Cox model, we aim at increasing the power through larger sample size compared to the classical subgroup analysis that ignores the information from all other individuals. Weights are based on the probability of belonging to the subgroup of interest and are estimated from the (training) data instead of having to be determined a priori. In the situation of small sample sizes, simulation results clearly demonstrated the benefit of our proposed approach, suggesting that incorporating information from other subgroups in the estimation of a subgroup-specific risk model can improve the prediction performance and variable selection accuracy over standard approaches.

Additional file 1: Additional supporting information referenced in the Results sections (description of the NSCLC data preprocessing, Supplementary Figures 1-6, Supplementary Tables 1-2).
