Literature DB >> 27990267

Two-step feature selection for predicting survival time of patients with metastatic castrate resistant prostate cancer.

Abstract

Metastatic castrate resistant prostate cancer (mCRPC) is the major cause of death in prostate cancer patients. Even though some options for treatment of mCRPC have been developed, the most effective therapies remain unclear. Thus finding key patient clinical variables related with mCRPC is an important issue for understanding the disease progression mechanism of mCRPC and clinical decision making for these patients. The Prostate Cancer DREAM Challenge is a crowd-based competition to tackle this essential challenge using new large clinical datasets. This paper proposes an effective procedure for predicting global risks and survival times of these patients, aimed at sub-challenge 1a and 1b of the Prostate Cancer DREAM challenge. The procedure implements a two-step feature selection procedure, which first implements sparse feature selection for numerical clinical variables and statistical hypothesis testing of differences between survival curves caused by categorical clinical variables, and then implements a forward feature selection to narrow the list of informative features. Using Cox's proportional hazards model with these selected features, this method predicted global risk and survival time of patients using a linear model whose input is a median time computed from the hazard model. The challenge results demonstrated that the proposed procedure outperforms the state of the art model by correctly selecting more informative features on both the global risk prediction and the survival time prediction.

Entities: Chemical Disease Gene Species

Keywords: Cox-proportional hazards model; Survival analysis; feature selection

Year: 2016 PMID： 27990267 PMCID： PMC5130083 DOI： 10.12688/f1000research.8201.1

Source DB: PubMed Journal: F1000Res ISSN： 2046-1402

Introduction

Prostate cancer is the most common malignant tumor among men and ranks third in terms of mortality after lung cancer and colorectal cancer. The major clinical treatment against prostate cancer is an anti-androgen therapy to inhibit male hormones providing to prostate cancer cells. However, the therapy cannot inhibit the cancer cell growth for long because these cells can develop the resistance against the androgen absence condition. This developed prostate cancer is called metastatic castrate resistant prostate cancer (mCRPC), which is the major cause of death in prostate cancer patients [1, 2]. Even though some options for treatment of mCRPC have been developed, the most effective therapies remain unclear [3]. Finding key clinical variables related with mCRPC is an important first step for understanding the disease progression mechanism and clinical decision making for these patients. Halabi et al. [4] identified key factors of mCRPC from a lot of clinical variables by feature selection based on a Cox’s proportional hazards model with a L 1 penalty, i.e. a variant of Lasso for survival analysis [6, 7] and built a mCRPC prognostic model. This data-driven approach is important to correctly predict patient health status for treatment choices. To validate and improve such prediction models of mCRPC patients, larger scale clinical datasets collected from several clinical institutes are useful. The Prostate Cancer DREAM challenge in DREAM 9.5 ( https://www.synapse.org/ProstateCancerChallenge) provided such datasets and an opportunity to tackle this essential challenge using the wisdom of the crowd, in which participating teams were required to submit prediction models based on clinical variables from the comparator arms of four phase III clinical trials with over 2,000 mCRPC patients treated with first-line docetaxel. My method for this challenge consists of a two-step feature selection procedure, which first performs both sparse feature selection [7] and statistical hypothesis testing [8], and then performs a forward feature selection [9] to screen out non-informative features. Selected clinical variables were used to build a prognostic model to predict global risks of patients. For a survival time prediction, my method further used a linear model fitting with median survival time [5] computed by the established progression model. The final result of this DREAM challenge demonstrated that, in the sub-challenge 1a, the proposed procedure outperforms Halabi’s model [4] by correctly selecting more informative features on global risk prediction. In sub-challenge 1b, my method using these selected features predicted the survival time more correctly and outperforms most of the other team’s methods.

Methods

Dataset and pre-process

Data across comparator arms of four phase III clinical trials have been compiled, annotated, cleaned and were made available through the Challenge and remain available on the web site [7]. These datasets include over 150 clinical variables and over 2,000 mCRPC patients treated with first-line docetaxel. The output value to be predicted for unknown new patients is the survival time. The survival times of patients are not always observed because some patients are still alive when they are lost to follow-up or when the study ends. Thus the observed survival times are right censoring. For the training dataset, three of the clinical trial cohorts were provided, which includes data for 476, 598, and 526 patients from clinical trial ASCENT-2 (Novacea, provided by Memorial Sloan Kettering Cancer Center) [10], VENICE (Sanofi) [11], and MAINSAIL (Celgene) [12], respectively. For the test dataset, 470 patients’ data were provided from clinical trial ENTHUSE-33 (AstraZeneca) [13]. The goal of this challenge was to correctly predict global risk of death and survival time of patients in the test dataset. In these datasets, clinical variables for some patients were missing. These missing values were imputed by the median of each variable for numerical values and by the most frequent value for each categorical variable.

Hazard model

A Cox proportional hazards model is assumed for the relationship between clinical variables (input variables) of a patient and the survival time (a output variable) [5]. Let be clinical variables of a patient. The hazard function of the patient at time t is given by where h 0( t) is a baseline hazard function and is a weight vector to be optimized from training data. When the weight value of the d-th clinical variable β is large, the clinical variable is informative to predict the survival time. On the other hand, when β =0, the d-th clinical variable is independent with the survival time. Thus the correctly estimating is the most important task in survival analysis. A common estimation is performed by maximizing a partial log likelihood function of N patients given by where is a vector of clinical variables of the n-th patient, δ is a binary variable. δ = 1 for died patients and δ = 0 for right-censored patients at time t when is the survival time of the n-th patient. R is the risk set at time t . This estimation is of course affected by non-informative clinical variables (noise variables) because the size of the training data is limited, where the number of clinical variable is large but the number of patients is small. Before estimating weight vector in the hazard function, my method implemented a two-step feature selection to screen out non-informative clinical variables.

Feature selection

The goal of feature selection is to divide the set of all clinical variables into a set of informative variables and non-informative variables by optimizing the final scoring metric. However, this optimization is NP-hard, i.e. intractable in general. Thus my procedure implemented this task in a heuristic manner; 1) screening numerical features by a L 1 sparse penalized regression and categorical features by a statistical hypothesis testing, and then 2) a forward sequential feature selection to narrow the list of informative selected features by optimizing the final scoring metric. For the first procedure, my procedure used a variant of LASSO for a Cox’s proportional hazards model [7] provided by R package glmpath [11]. This approach should choose the weight of the L 1 penalty term. My method automatically chose it by minimizing an information criterion (AIC), which is a criterion to estimate the generalized error. Because the computational cost of this implementation with a lot of clinical variables is expensive, my procedure used this sparse feature selection for only numerical variables to reduce the computational cost. Categorical variables were evaluated using rank statistical hypothesis testing [5, 8]. This method tests if there is a significant difference between two or more survival curves with different values of a categorical variable. If the difference of curves is statistically significant, the categorical variable might be related with survival times of patients. Therefore, such variables should be selected for a survival time prediction model. Among selected features described above, my procedure further implemented a forward feature selection [9] to narrow the list of clinical variables. In my procedure, the most useful feature that maximally increases an integrated time-dependent AUC (iAUC) [14], which is the final scoring metric in sub-challenge 1a, is sequentially added one by one until all variables are selected. After that, the optimal set of clinical variables is selected by maximizing iAUC. iAUCs were estimated by cross-validation (CV), which was performed by randomly splitting all training data into 90% training data and 10% test data. iAUC was estimated as the median among ten calculated iAUC values.

Prediction of global risk of death and survival time

After selecting informative features, parameter in the Cox proportional hazard function was optimized using only the selected clinical variables. Next, the hazard function was used to predict the global risk of death for each patient [5]. The survival time of each patient can be predicted based on the median time when an estimated survival probability is equal to 0.5, computed from the hazard function [5]. However the root mean squared error of this prediction method was still large and an estimation bias was included because of the right censoring setting, which will be experimentally demonstrated later. Against this problem, my method used a linear model fitting from computed median times to observed survival times in the training dataset. Survival time was predicted by the liner regression model whose input is the estimated median time of each patient.

Results

Selected clinical variables

My method removed clinical variables having a lot of missing values and then it used only 14 numerical clinical variables and 56 categorical clinical variables with less number of missing values. Feature selection for numerical clinical variables was first implemented using the L 1 penalized approach [7] by function coxpath in R package glmpath ( https://cran.r-project.org/web/packages/glmpath/glmpath.pdf). This function can compute the entire regularization path for the L 1 penalized model by increasing the weight of the penalty and check only steps of the path when a weight parameter of a clinical variable becomes greater than zero. Table 1 shows the first 20 steps and the sequence of added clinical variables. Figure 1 shows computed AIC scores of these steps. The best feature set (step) was selected by minimizing an AIC score. This procedure chose the 14th step and then selected nine clinical covariates (ENTRTPC, ALP, HB, AST, ECOGC, NEU, PLT, PSA and LDH) as informative clinical variables.

Table 1.

Selected clinical variables at each step of the regularization path.

Step	Clinical variable
1	ENTRT_PC
2	ALP
3	HB
6	AST
8	ECOG_C
11	NEU
12	PLT
13	PSA
14	LDH
15	CA
16	CREAT
18	ALT
19	WBC
20	TBILI

Figure 1.

AICs of steps in the L 1 regularization path.

On the other hand, differences of survival curves by categorical clinical variables were statistically tested using function survdiff in R package survival ( https://cran.r-project.org/web/packages/survival/survival.pdf). Table 2 shows the ranking result of clinical variables with p-values. The threshold of a significance level was set to 0.05 and then the procedure selected categorical features ANALGESICS, MHGEN, MI, TURP, MHCARD, ACE_INHIBITORS, MHPSYCH and PROSTATECTOMY.

Table 2.

p-value of statistical hypothesis testing for categorical clinical variables.

Rank	Clinical variable	p-value
1	ANALGESICS	9.8e-08
2	MHGEN	8.5e-03
3	MI	1.0e-02
4	TURP	1.2e-02
5	MHCARD	1.3e-02
6	ACE_INHIBITORS	2.6e-02
7	MHPSYCH	3.9e-02
8	PROSTATECTOMY	4.3e-02

For these 17 selected clinical variables by two feature selections, we further implemented the forward feature selection described in the previous section. Figure 2 shows iAUC at each step of the forward feature selection. This figure shows that the step maximizing AUC is the sixth step which includes six clinical variables ALP, AST, ECOG_C, HB, MI and PLT. These clinical variables were finally selected to predict global risks and survival times of patients.

Figure 2.

iAUC at each step of the forward feature selection.

Prediction performance

The parameter vector of a Cox-proportional hazards model with six selected clinical variables was optimized by maximizing the partial log-likelihood function. Then the global risks of death of patients in the test dataset were predicted from the optimized model. Prediction performance iAUC by the proposed method is 0.7671 although iAUC by Halabi’s model is 0.7429, which can be found in the ranking result of sub-challenge 1a in the web site of Prostate Cancer DREAM Challenge ( https://www.synapse.org/ProstateCancerChallenge). This result demonstrated that the proposed prediction outperforms Halabi’s method by correctly selecting informative features. Furthermore, survival times of patients were predicted using median times computed from the optimized hazard model. Figure 3(a) shows predicted values and observed values in the training dataset. This result demonstrates that the estimation of variance is large and the center of plotted data is located to the upper-left from the diagonal line, meaning that predicted values are biased. To improve these prediction errors, the median survival times were transformed by a linear model. Figure 3(b) shows the prediction result after this transformation. These figures demonstrate that the proposed prediction reduces both the estimation bias and variance. As a result, the root mean square error (RMSE) between true values and predictions is drastically improved, from 281.3 by median survival times to 198.7 by the proposed method. This prediction result in sub-challenge 1-b in the Prostate Cancer DREAM Challenge was ranked in the group of top-performers even though the global risk prediction result in sub-challenge 1a was worse than the best 10 performers.

Figure 3.

Predicted survival times in the training dataset.

Conclusions

This paper outlines a prediction method of global risks of mCRPC patients for sub-challenge 1a and that of survival time for sub-challenge 1b in the Prostate Cancer DREAM Challenge. The challenge result in sub-challenge 1b demonstrated that this procedure, which is based on the two-step feature selection and the correction of naïve survival time predictions from the optimized hazard model, outperformed the other teams’ methods. Especially, for survival time prediction, this correction method based on centering and reducing estimation variance works well to improve RMSE, the scoring metric of sub-challenge 1b. This analysis demonstrates that a naïve prediction from a basic model (Cox’s proportional hazards model) is not always optimal for an evaluation metric. Thus a suitable transformation is necessary to optimize the metric. This paper also provides a two-step feature selection procedure because using only a single feature selection method leaves a lot of non-informative features. By carefully selecting features by this two-step procedure, the global risk prediction outperformed Halabi’s model [4] in sub-challenge 1a. This result demonstrated that multiple feature selection procedures are necessary to screen out non-informative features. Future work includes the validation of informative clinical variables selected by not only of the method proposed here, but also other top-performing methods. Table 3 shows the comparison of our selected clinical variables with Halabi’s selected variables [4]. Both models selected ALP, ECOG_C and HB but neither our model nor Halabi’s model selected the other eight clinical variables. Although selection results depend on the datasets used, we should further investigate the importance of these clinical variables using knowledge in clinical and biological research areas.

Table 3.

Selected clinical variables by the proposed model and Halabi’s model [4].

Clinical Variables	Proposed Model	Halabi’s Model
ALB	×	○
ALP	○	○
ANALGESICS	×	○
AST	○	×
ECOG_C	○	○
HB	○	○
LDH	×	○
LIVER	×	○
MI	○	×
PLT	○	×
PSA	×	○

Data availability

The data referenced by this article are under copyright with the following copyright statement: Copyright: © 2016 Shiga M Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication). The Challenge datasets can be accessed at: https://www.projectdatasphere.org/projectdatasphere/html/pcdc Challenge documentation, including the detailed description of the Challenge design, overall results, scoring scripts, and the clinical trials data dictionary can be found at: https://www.synapse.org/ProstateCancerChallenge The code and documentation underlying the method presented in this paper can be found at: http://dx.doi.org/10.7303/syn4229266 [15] This paper is generally well-written, with a clear and concise description of the problem and challenge. The author adopted a two-step feature selection procedure: a penalized L 1 regression for Cox PH model (R package "glmpath") in the first step, and forward selection in the second step. Features are selected to optimize the iAUC (integrated time-dependent AUC) in 10-fold cross validation. Major comments: I am confused about how the two-step feature selection procedure works. The author mentioned the following Therefore, I assume the second step starts with the features selected from the first step. However, the features shown in Table 2 don't appear to be a subset of the features shown in Table 1. Also, the feature "MI" doesn't appear to be in Table 1. " Among selected features described above, my procedure further implemented a forward feature selection to narrow the list of clinical variables." " This figure shows that the step maximizing AUC is the sixth step which includes six clinical variables ALP, AST, ECOG_C, HB, MI and PLT. These clinical variables were finally selected to predict global risks and survival times of patients." Minor comments: The difference between sub-challenge 1a and sub-challenge 1b is not documented in the Introduction. Please explain that in sub-challenge 1a, the submissions consist of the risk scores, while in sub-challenge 1b, the submissions consist of the predicted survival time. Under Results and "Selected clinical variables", the author mentioned that " My method removed clinical variables having a lot of missing values and then it used only 14 numerical clinical variables and 56 categorical clinical variables with less number of missing values.". What are the exact criteria for filtering clinical variables given that there are 150+ clinical variables to start with? Please explain what the clinical variables mean (e.g. ENTRTPC, ALP, HB, AST, ECOGC, NEU, PLT, PSA and LDH in Table 1). Please expand the captions for Table 1 and Table 2 to put these tables in the context of the 2-step feature selection procedure. In Table 3, I assume the circle means "yes" and the cross means "no". Please add a legend to the caption. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. This paper offers methods to calculate patient risk scores and predict survival times from proportional hazard models in the context of the Prostate Cancer DREAM Challenge. The author used a two-step feature selection procedure by first using a combination of the LASSO and significance testing and then using a forward selection method. The challenge consisted of two parts. In one part the contestants were to assign global risk scores to patients and in the other they were to predict survival times. The author states that the results of the methods in question for the former outcome did not make it into the top-10 of the challenge. However, the paper seems to conclude that the two-step feature selection is superior to one-step feature selection. This is possibly based on a comparison with the DREAM benchmark model only. In this case, the paper would benefit from a more specific statement. For the feature selection it seems unclear if the LASSO variable selection was done conditionally on the categorical predictors (without penalizing their coefficients) or marginally on only the continuous predictors. Cross-validation seems to have been carried out incorrectly in the sense that only the second step (the forward selection) and not the first step was cross-validated. Whether this has consequences for the quality of the selection is unclear, but the estimated iAUC-values reported in Figure 2 are suspiciously large – and they definitely overestimate the validation iAUC. For predicting survival times, the author first used a fitted proportional hazards model to estimate median survival times. Then observed survival times were regressed linearly on the predicted medians. This estimated a linear transformation, which could be used to transform predicted medians to means. The paper would benefit from a brief discussion of the motivation behind this approach. It is stated that the linear transformation “reduces both the estimation bias and variance”, which is unclear as it is not stated what we’re aiming to estimate. Arguably, estimating the means from the medians should improve the performance as the RMSE is used to score the predictions. Minor comments: p. 5: 1-b→1b Table 3: Please add information to the caption about what the symbols mean. It is clear from reading the paper that “open circle” means “selected”, but that is not self-evident. We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

8 in total

1. Updated prognostic model for predicting overall survival in first-line chemotherapy for patients with metastatic castration-resistant prostate cancer.

Authors: Susan Halabi; Chen-Yen Lin; W Kevin Kelly; Karim S Fizazi; Judd W Moul; Ellen B Kaplan; Michael J Morris; Eric J Small
Journal: J Clin Oncol Date: 2014-01-21 Impact factor: 44.544

2. Randomized, open-label phase III trial of docetaxel plus high-dose calcitriol versus docetaxel plus prednisone for patients with castration-resistant prostate cancer.

Authors: Howard I Scher; Xiaoyu Jia; Kim Chi; Ronald de Wit; William R Berry; Peter Albers; Brian Henick; David Waterhouse; Dean J Ruether; Peter J Rosen; Anthony A Meluch; Luke T Nordquist; Peter M Venner; Axel Heidenreich; Luis Chu; Glenn Heller
Journal: J Clin Oncol Date: 2011-04-11 Impact factor: 44.544

3. Docetaxel and prednisone with or without lenalidomide in chemotherapy-naive patients with metastatic castration-resistant prostate cancer (MAINSAIL): a randomised, double-blind, placebo-controlled phase 3 trial.

Authors: Daniel P Petrylak; Nicholas J Vogelzang; Nikolay Budnik; Pawel Jan Wiechno; Cora N Sternberg; Kevin Doner; Joaquim Bellmunt; John M Burke; Maria Ochoa de Olza; Ananya Choudhury; Juergen E Gschwend; Evgeny Kopyltsov; Aude Flechon; Nicolas Van As; Nadine Houede; Debora Barton; Abderrahim Fandi; Ulf Jungnelius; Shaoyi Li; Ronald de Wit; Karim Fizazi
Journal: Lancet Oncol Date: 2015-03-03 Impact factor: 41.316

4. No improvement noted in overall or cause-specific survival for men presenting with metastatic prostate cancer over a 20-year period.

Authors: Jennifer N Wu; Kari M Fish; Christopher P Evans; Ralph W Devere White; Marc A Dall'Era
Journal: Cancer Date: 2013-11-20 Impact factor: 6.860

5. Aflibercept versus placebo in combination with docetaxel and prednisone for treatment of men with metastatic castration-resistant prostate cancer (VENICE): a phase 3, double-blind randomised trial.

Authors: Ian F Tannock; Karim Fizazi; Sergey Ivanov; Camilla Thellenberg Karlsson; Aude Fléchon; Iwona Skoneczna; Francisco Orlandi; Gwenaelle Gravis; Vsevolod Matveev; Sevil Bavbek; Thierry Gil; Luciano Viana; Osvaldo Arén; Oleg Karyakin; Tony Elliott; Alison Birtle; Emmanuelle Magherini; Laurence Hatteville; Daniel Petrylak; Bertrand Tombal; Mark Rosenthal
Journal: Lancet Oncol Date: 2013-06-04 Impact factor: 41.316

6. Abiraterone in metastatic prostate cancer without previous chemotherapy.

Authors: Charles J Ryan; Matthew R Smith; Johann S de Bono; Arturo Molina; Christopher J Logothetis; Paul de Souza; Karim Fizazi; Paul Mainwaring; Josep M Piulats; Siobhan Ng; Joan Carles; Peter F A Mulders; Ethan Basch; Eric J Small; Fred Saad; Dirk Schrijvers; Hendrik Van Poppel; Som D Mukherjee; Henrik Suttmann; Winald R Gerritsen; Thomas W Flaig; Daniel J George; Evan Y Yu; Eleni Efstathiou; Allan Pantuck; Eric Winquist; Celestia S Higano; Mary-Ellen Taplin; Youn Park; Thian Kheoh; Thomas Griffin; Howard I Scher; Dana E Rathkopf
Journal: N Engl J Med Date: 2012-12-10 Impact factor: 91.245

7. Cancer statistics, 2009.

Authors: Ahmedin Jemal; Rebecca Siegel; Elizabeth Ward; Yongping Hao; Jiaquan Xu; Michael J Thun
Journal: CA Cancer J Clin Date: 2009-05-27 Impact factor: 508.702

8. Phase III, randomized, placebo-controlled study of docetaxel in combination with zibotentan in patients with metastatic castration-resistant prostate cancer.

Authors: Karim Fizazi; Karim S Fizazi; Celestia S Higano; Joel B Nelson; Martin Gleave; Kurt Miller; Thomas Morris; Faith E Nathan; Stuart McIntosh; Kristine Pemberton; Judd W Moul
Journal: J Clin Oncol Date: 2013-04-08 Impact factor: 44.544

8 in total

1 in total

1. DWCox: A density-weighted Cox model for outlier-robust prediction of prostate cancer survival.

Authors: Jinfeng Xiao; Sheng Wang; Jingbo Shang; Henry Lin; Doris Xin; Xiang Ren; Jiawei Han; Jian Peng
Journal: F1000Res Date: 2016-12-01

1 in total