Literature DB >> 35758796

BITES: balanced individual treatment effect for survival data.

S Schrod¹, A Schäfer², S Solbrig², R Lohmayer³, W Gronwald⁴, P J Oefner⁴, T Beißbarth¹, R Spang⁵, H U Zacharias^6,7, M Altenbuchinger¹.

Abstract

MOTIVATION: Estimating the effects of interventions on patient outcome is one of the key aspects of personalized medicine. Their inference is often challenged by the fact that the training data comprises only the outcome for the administered treatment, and not for alternative treatments (the so-called counterfactual outcomes). Several methods were suggested for this scenario based on observational data, i.e. data where the intervention was not applied randomly, for both continuous and binary outcome variables. However, patient outcome is often recorded in terms of time-to-event data, comprising right-censored event times if an event does not occur within the observation period. Albeit their enormous importance, time-to-event data are rarely used for treatment optimization. We suggest an approach named BITES (Balanced Individual Treatment Effect for Survival data), which combines a treatment-specific semi-parametric Cox loss with a treatment-balanced deep neural network; i.e. we regularize differences between treated and non-treated patients using Integral Probability Metrics (IPM).
RESULTS: We show in simulation studies that this approach outperforms the state of the art. Furthermore, we demonstrate in an application to a cohort of breast cancer patients that hormone treatment can be optimized based on six routine parameters. We successfully validated this finding in an independent cohort.
AVAILABILITY AND IMPLEMENTATION: We provide BITES as an easy-to-use python implementation including scheduled hyper-parameter optimization (https://github.com/sschrod/BITES). The data underlying this article are available in the CRAN repository at https://rdrr.io/cran/survival/man/gbsg.html and https://rdrr.io/cran/survival/man/rotterdam.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35758796 PMCID： PMC9235492 DOI： 10.1093/bioinformatics/btac221

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

Inferring the effect of interventions on outcomes is relevant in diverse domains, comprising precision medicine and epidemiology (Frieden, 2017) or marketing (Bottou ; Kohavi ). A fundamental issue of causal reasoning is that potential outcomes are observed only for the applied intervention but not for its alternatives (the counterfactuals). This is particularly true in medicine, where patient’s outcome is only known for the applied (the factual) treatment and not for its alternatives, i.e. the counterfactual outcomes remain hidden. Therapeutic interventions, such as drug treatments or surgeries, are typically made by physicians on the basis of expert consensus guidelines. This process has to take into account both the expected treatment benefit, but also the potential side effects. Estimates of the former can be difficult. For instance, the success of drug treatments in cancer strongly depends on multiple characteristics of the tumor and the patient, and, consequently, the Average Treatment Effect (ATE) estimated from controlled trials does not necessarily hold on the level of individual patients. Thus, an estimate of the Individual Treatment Effect (ITE) is necessary, which has to be inferred from data (Holland, 1986). Solving the latter ‘missing data problem’ was attempted repeatedly in the literature using machine learning methods in combination with counterfactual reasoning. There are two naive approaches to this issue: the treatment can be included as a covariate or it can be used to stratify the model development, i.e. individual treatment-specific models are learned (also called T-learner). Potential outcomes can then be estimated by changing the respective treatment covariate or model. These naive approaches are occasionally discussed in performance comparisons, e.g. in Chapfuwa and Curth . An alternative approach is to match similar patients between treated and non-treated populations using, e.g. propensity scores (Rosenbaum and Rubin, 1983). This directly provides estimates of counterfactual outcomes. However, a central issue in this context is to define appropriate similarity measures, which should ideally also be valid in a high-dimensional variable space (King and Nielsen, 2019). Further alternatives are Causal Forests (Athey ; Athey and Wager, 2019; Wager and Athey, 2017) or deep architectures such as the Treatment-Agnostic Representation Network (TARNet) (Johansson ; Shalit ). Both methods do not account for treatment selection biases and thus will be biased toward treatment-specific distributions. This issue was recently approached by several groups which balanced the treated and non-treated distributions using model regularization via representations of Integral Probability Metrics (IPM) (Müller, 1991). Suggested methods are, e.g. balanced propensity score matching (Diamond and Sekhon, 2013; Li and Fu, 2017), deep implementations such as the Counterfactual regression Network (CFRNet) (Johansson ; Shalit ) or the auto-encoder based Deep-Treat (Atan ). Recently, balancing was incorporated in a Generative Adversarial Net for inference of Individualized Treatment Effects (GANITE) (Yoon ). Note, learning balanced representations involves a trade-off between predictive power and bias since biased information can be also highly predictive. All aforementioned approaches deal with continuous or binary response variables. In medicine, however, patient outcome is often recorded as time-to-event data, i.e. the time until an event occurs. The patient is (right-)censored at the last known follow-up if the event was not observed within the observation period. A plethora of statistical approaches deal with the analysis of time-to-event data (Martinussen and Scheike, 2006), of which one of the most popular methods is Cox’s Proportional Hazards (PH) model (Cox, 1972). The Cox PH model is a semi-parametric approach for time-to-event data, which models the influence of variables on the baseline hazard. Here, the PH assumption implies an equal baseline hazard for all observations. In fact, the influence of variables can be estimated without any consideration of the baseline hazard function (Breslow, 1972; Cox, 1972). The Cox PH model is also highly relevant in the context of machine learning. It was adapted to the high-dimensional setting using l1 and l2 regularization terms (Tibshirani, 1997), with applications ranging from the prediction of adverse events in patients with chronic kidney disease (Zacharias ) to the risk prediction in cancer entities (Jachimowicz ; Staiger ). The Cox PH model can be also adapted to deep learning architectures, as proposed by (Katzman ). Alternative machine-learning approaches to model time-to-event data include discrete-time Cox models built on multi-outcome feedforward architectures (Gensheimer and Narasimhan, 2019; Kvamme and Borgan, 2019; Lee ) and random survival forests (RSF) (Athey and Wager, 2019; Ishwaran ). The prediction of ITEs from time-to-event data has received little attention in the machine learning community, which is surprising considering the enormous practical relevance of the topic. Seminal works are (Chapfuwa ) and (Curth ). Most recently, Curth suggested to learn discrete-time treatment-specific conditional hazard functions, which were estimated using a deep learning approach. Treatment and control distributions were balanced analogously to Shalit using the p-Wasserstein distance (Kantorovitch, 1958; Ramdas ). This approach, named SurvITE, was shown to outperform the current state of the art in simulation studies. We propose to combine the loss of the Cox PH model with an IPM regularized deep neural network architecture to balance generating distributions of treated and non-treated patients. We named this approach ‘Balanced Individual Treatment Effect for Survival data’ (BITES). We show that this approach—albeit its apparent simplicity—outcompetes SurvITE as well as alternative state-of-the-art methods. First, we demonstrate the superior performance of BITES in simulation studies where we focus on biased treatment assignments and small sample sizes. Second, we used training data from the Rotterdam Tumour Bank (Foekens ) to show that BITES can optimize hormone treatment in patients with breast cancer. We validated the latter model using data from a controlled randomized trial of the German Breast Cancer Study Group (GBSG) (Schumacher ) and analyzed feature importance using SHAP (SHapley Additive exPlanations) values (Lundberg and Lee, 2017). We further provide an easy-to-use python implementation of BITES including scheduled hyper-parameter optimization (https://github.com/sschrod/BITES).

2 Materials and methods

Patient outcome can be recorded as (right-)censored time-to-event data. First, we will introduce models for such data, i.e. the Cox PH model and recent non-linear adaptations. Second, we will discuss the potential outcome model and how it can be used to model survival. Third, we introduce regularization techniques to account for unbalanced distributions and, finally, we will combine these methods in a deep neural network approach termed BITES to learn treatment recommender systems based on patient survival.

2.1 Survival data

Let be the space of covariates and the space of available treatments. Furthermore, let be the observed survival times and the corresponding event indicator. Denote sample data of patient i by the triplet . If the patient experiences the event within the observation period, is the time until the event of interest occurs, otherwise is the censoring time. Let the survival times y be distributed according to f(y) with the corresponding cumulated event distribution . The survival probability at time y is then given by . The hazard function is and corresponds to the risk of dying at time y (Cox, 1972), i.e. a greater hazard corresponds to greater risk of failure. Here, the model parameters are given by and the baseline hazard function is . Note that . According to Cox’s PH assumption, all patients share the same baseline hazard function and, importantly, the baseline hazard cancels in maximum likelihood estimates of . Thus, time dependence can be eliminated from the individual hazard prediction and rather than learning the exact time to event, Cox regression learns an ordering of hazard rates. At every event time , the set of patients at risk is given by . The partial log-likelihood of the Cox model (Breslow, 1972; Cox, 1972) is given by: Faraggi and Simon (1995) suggested to replace the ordinary linear predictor function, , by a feedforward neural network with a single outcome node and network parameters θ. Following this idea, Katzman introduced DeepSurv, which showed improved performance compared to the linear case, particularly if non-linear covariate dependencies are present.

2.2 The counterfactual problem

The outcome space for multiple treatment options k is given by . For simplicity, we will restrict the discussion to the binary case, k = 2, with a treated group, T = 1, and a control group, T = 0. We consider the problem where only a single factual outcome is observed per patient, i.e. the outcomes for all other interventions, also known as the counterfactuals, are missing. Hence, the individual treatment effect (ITE), defined as can only be inferred based on potential outcome estimates (Rubin, 1974). We will build a recommendation model that assigns treatments to patients with predictions . Following recent work (Alaa and van der Schaar, 2017; Athey and Wager, 2019; Johansson , 2020; Shalit ; Wager and Athey, 2017; Yao ; Yoon ), we make the standard strong ignorability assumption, which has been shown to be a sufficient condition to make the ITE identifiable (Pearl, 2017; Shalit ), i.e. it guarantees proper causal dependencies on the interventions. The strong ignorability assumption contains the unconfoundedness and overlap assumptions: Theorem 1 (Unconfoundedness). Covariates X do not simultaneously influence the treatment T and potential outcomes , i.e. This assumption ensures that the causal effect is not influenced by non-observable causal substructures such as confounding (Pearl, 2009). Correcting for confounding bias requires structural causal models, which are ambiguous in general and need to be justified based on domain knowledge (Pearl, 2008). Theorem 2 (Overlap). There is a non-zero probability for each patient i to receive each of the treatments :

2.3 Balancing distributions

Strong ignorability only removes confounding artifacts. Imbalances of the generating distributions due to biased treatment administration might still be present. Balancing the generating distributions of treated and control group has been shown to be effective both on the covariate space (Imai and Ratkovic, 2014) and on latent representations (D’Amour ; Huang ; Johansson , 2020; Li and Fu, 2017; Lu ; Shalit ; Yao ). This is either achieved by multi-task models or IPMs. The latter quantify the difference of probability measures and defined on a measurable space S by finding a function that maximizes (Müller, 1991) Most commonly used are the Maximum Mean Discrepancy (MMD), restricting the function space to reproducing kernel-Hilbert spaces (Gretton ), or the p-Wasserstein distance (Ramdas ). Both have appealing properties and can be empirically estimated (Sriperumbudur ). MMD has low sample complexity with a fast rate of convergence, which comes with low computational costs. A potential issue is that gradients vanish for overlapping means (Feydy ). The p-Wasserstein distance, on the other hand, offers more stable gradients even for overlapping means, which comes with higher computational costs, i.e. by solving a linear program. The computational burden can be reduced by entropically smoothing the latter and by using the Sinkhorn divergence, where is the smoothed Optimal Transport (OT) loss defined in the following (Feydy ; Ramdas ). Definition 1 (Smoothed Optimal Transport loss). For and Borel probability measures on the entropically smoothed OT loss is with the set of all joint probability measures whose marginals are on , i.e. for all subsets , we have and . Here, ϵ mediates the strength of the Kullback-Leibler divergence. The Sinkhorn divergence can be efficiently calculated for (Cuturi, 2013). For p = 2 and ϵ = 0 we can retrieve the quadratic Wasserstein distance and in the limit it becomes the MMD (Genevay ). BITES tunes ϵ to take advantage of the more stable OT gradients to improve the overlap while remaining computationally efficient. In the following, we denote it by to highlight the possibility to use any representation of the IPM. A thorough discussion of the Sinkhorn divergence, its theoretical properties, as well as one- and two-dimensional examples can be found in Feydy .

2.4 BITES

BITES model architecture: BITES combines survival modeling with counterfactual reasoning, i.e. it facilitates the development of treatment recommender systems using time-to-event data. BITES uses the network architecture shown in Figure 1 with loss function where q is the fraction of patients in the control cohort (patients with T = 0) and is given by the negative Cox partial log-likelihood of Equation 2, where we parametrize the hazard function according to the network architecture illustrated in Figure 1. The latent representation is regularized by an IPM term to reduce differences between treatment and control distributions of non-confounding variables. Throughout the article, we used the Sinkhorn divergence of the smoothed OT loss with p = 2 as IPM term. Hence, the parameter ϵ in Equation 9 calibrates between the quadratic-Wasserstein distance (ϵ = 0) and MMD (). The total strength of the IPM regularization is adjusted by the hyper-parameter α. Models with α = 0 do not balance treatment effects and therefore we denote this method as ‘Individual Treatment Effects for Survival’ (ITES). Models with will be denoted as ‘Balanced Individual Treatment Effects for Survival’ (BITES). Large α values enforce balanced distributions between treatment and control population. Note, there is a trade-off between balancing distributions and model performance since outcome relevant information can be predictive for the treatment. (B)ITES uses the time-dependent ITE for treatment decisions. For the studies shown in this article, we assigned treatments based on the ITE evaluated for a survival probability of 50%, i.e. .

Fig. 1.

The BITES network architecture. BITES uses shared deeply connected layers for both treatment options, which are mapped on a latent representation . This is regularized by a Sinkhorn divergence to account for imbalances between treatment and control distributions. The factual and counterfactual proportional hazard rates are modeled by two different outcome heads (h1 and h0), respectively. These are used to predict the ITE together with the corresponding baseline hazard function. The latter is individually estimated for treatment and control patients Implementation: BITES uses a deep architecture of dense-connected layers which are each followed by a dropout (Srivastava ) and a batch normalization layer (Ioffe and Szegedy, 2015). It uses ReLU activation functions (Nair and Hinton, 2010) and is trained using the Adam optimizer (Kingma and Ba, 2014). Further, early stopping based on non-decreasing validation loss and weight decay regularizations (Krogh and Hertz, 1992) are used to improve generalization. Our implementation is based on the PyTorch machine learning library (Paszke ) and the pycox package (Kvamme and Borgan, 2019). The Sinkhorn divergence is implemented using the GeomLoss package (Feydy ). We provide an easy-to-use python implementation which includes a hyperparameter optimization using the ray[tune] package (Liaw ) to efficiently distribute model training.

2.5 Treatment recommender systems

For comparison, we evaluated several strategies to build treatment recommender systems. Cox regression model: We implemented the Cox regression as T-learner with treatment-specific survival models using lifelines (Davidson-Pilon ). Note, an ordinary Cox regression model which uses both the covariates and the treatment variable as predictor variables generally recommends the treatment with the better ATE; a treatment-specific term adds to and thus the treatment which reduces the hazard most will be recommended. Therefore, we did not include the latter approach and focus on the Cox T-learner in our analysis. DeepSurv: Katzman suggested to provide individual recommendations based on single model predictions using and as covariates based on . Hence, it uses a treatment independent baseline hazard which could compromise the performance (Bellera ; Xue ). Treatment-specific DeepSurv models: To account for treatment-specific differences of baseline hazard functions, we also estimated DeepSurv as a T-learner (T-DeepSurv), i.e. we learned models stratified for treatments. We then evaluated the time-dependent ITE based on the survival function . Treatment-specific Random Survival Forests: Analogously to the previous approach, we learned treatment-specific RSF (Athey ; Ishwaran ) using the implementation of scikit-survival (Pölsterl, 2020) to estimate the time-dependent ITE. SurvITE: Curth suggested to learn discrete-time treatment-specific conditional hazard functions, which were estimated using an individual outcome head for each time interval. (We employed their python implementation available under https://github.com/chl8856/survITE.) We evaluated the time-dependent ITE to assign treatments, as for the latter two methods.

2.6 Performance measures

We used different measures to assess the performance of treatment recommendation systems. This comprises both measures for the quantification of prediction performance and of treatment assignment. Discriminative performance was assessed using a time-dependent extension of Harrell’s C-index (Harrell, 1982) to account for differing baseline hazards, which evaluates for all samples i and j at all event times (Antolini ). This reduces to Harrell’s C-index for strictly ordered survival curves. To quantify the performance of treatment recommendations, we used the Precision in Estimation of Heterogeneous Effect (PEHE) score (Hill, 2011), which is defined as the difference in residuals between factual and counterfactual outcome: Note, the PEHE score can only be calculated if both the factual and counterfactual outcomes are known, which is usually only the case in simulation studies. Therefore, we restricted its application to the latter. There, we further quantified the proportion of correctly assigned ‘best treatments’.

3 Results

3.1 Simulation studies

We performed three exemplary simulation studies. First, we simulated a scenario where covariates affect survival only linearly. Second, we simulated data with additional non-linear dependencies, and, finally, we performed a simulation where the treatment assignments were biased by the covariates. Linear simulation study: In analogy to (Alaa and van der Schaar, 2017) and (Lee ), we simulated a 20-dimensional covariate vector consisting of two 10-dimensional vectors and , with corresponding survival times given by We set the parameters and . The first term in the exponent is treatment dependent while the second term affects survival under both treatments identically. This simulation gives an overall positive ATE in of the patients. Survival times exceeding 10 years were censored to resemble common censoring at the end of a study. Of the remaining samples, 50% were censored at a randomly drawn fraction of the true unobserved survival time. Samples were assigned randomly to the treated, T = 1, and control group, T = 0, without treatment administration bias. Finally, we added an error to all covariates. Detailed information about hyper-parameter selection is given in the Supplementary Section S1. Figure 2a shows the distributions of Harrell’s C-index evaluated on 1000 test samples for 50 consecutive simulation runs. We observed that across all investigated sample sizes (x-axis) the T-learner Cox regression showed superior performance, closely followed by ITES and BITES. These three methods performed equally well for the larger sample sizes n = 1800 and n = 2400. We further investigated the proportion of correctly assigned treatments, Figure 2a, and PEHE scores, Supplementary Figure S1, where we obtained qualitatively similar trends. RSF, DeepSurv and T-DeepSurv showed inferior performance with respect to C-Indices, correctly assigned treatments, and PEHE scores. Results for lower sample sizes can be found in Supplementary Section S2. Consistent with previous findings, Cox regression outperformed all competitors closely followed by (B)ITES.

Fig. 2.

Harrell’s C-index and the fraction of correctly predicted treatments for the linear (a, d), non-linear (b, e), and treatment biased non-linear (c, f) simulations. The boxplots give the distribution for 50 consecutive simulation runs, i.e. for different model initializations, based on the best set of hyper-parameter determined by the validation C-index. Results are shown for different training sample sizes with 1000 fixed test samples for each of the simulations. The dashed horizontal line represents the fraction of patients that benefits for 100% treatment administration Non-linear simulation study: Next, we simulated non-linear treatment-outcome dependencies using the model where we set the parameters and . Note, the first term imposes sizable non-linear effects which differ between both treatments. We further scaled the polynomials by c = 0.01 to yield realistic survival times up to 10 years. This setting gives an overall positive ATE in of the patients. Figure 2b gives the performance of the evaluated methods in terms of Harrell’s C-index. We observed that the ordinary Cox regression with linear predictor variables performs worst across all sample sizes, followed by RSF, and SurvITE. Approximately equal performance was observed for the DeepSurv approaches, ITES, and BITES. Among these methods, the treatment-specific DeepSurv models (T-DeepSurv) showed a higher variance across the simulation runs, in particular for the low sample sizes. Next, we studied the corresponding PEHE scores (Supplementary Fig. S1) and the proportion of correctly assigned treatments (Fig. 2e). We observed, although DeepSurv performed well in terms of C-Indices, that the performance was highly compromised in the latter two measures. In fact, it was not able to outperform the recommendation based on the ATE, i.e. always assigning T = 1, which corresponds to the dashed horizontal line. We further observed that SurvITE performed worst in this scenario with both substantially lower proportions of correctly assigned treatments and higher PEHE scores compared to the other methods. Here, T-DeepSurv, ITES, and BITES performed best, however, the results of the former are inferior compared to ITES and BITES for sample sizes of n = 600 and n = 1200. Additional simulations for smaller sample sizes can be found in Supplementary Section S2, where we observed that none of the methods is able to outperform the ATE-based recommendation. Non-linear simulation study with treatment bias: Finally, we repeated the non-linear simulation study but now took into account a treatment assignment bias, i.e. the value of one or more covariates is indicative of the applied treatment. To simulate this effect, we assigned the treatment with a 90% probability if the fifths entry of or was larger than zero. To ensure that the unconfoundedness assumption holds, we set the corresponding entries and to zero. This simulation study yields a positive treatment effect in of the patients (dashed horizontal line in Fig. 2f). Figure 2c and f and Supplementary Figure S1 show the results in terms of C-index, correctly assigned treatments, and PEHE scores, respectively. Similar to the previous studies, the best performing methods with respect to C-Indices were the two DeepSurv models, ITES and BITES. With respect to correctly assigned treatments and PEHE scores, however, BITES consistently outperformed the other methods for reasonable sample sizes starting from n = 1200. For n = 600, none of the methods was able to outperform a model where the treatment is always recommended (dashed line in Fig. 2f). This was further confirmed for lower sample sizes (Supplementary Section S2).

3.2 Bites optimizes hormone treatment in patients with breast cancer

We retrieved data of node-positive breast cancer patients from the Rotterdam Tumour Bank (Foekens ) as provided by Katzman . The latter data were preprocessed according to (Royston and Parmar, 2013). We used recurrence-free survival (RFS) time, defined as the time from primary surgery to the earlier of disease recurrence or death from any cause, as outcome for the further analysis. The available patient characteristics are age, menopausal status (pre/post), number of cancerous lymph nodes, tumor grade and progesterone and estrogen receptor status. Of these patients, 339 were treated by a combination of chemotherapy and hormone therapy. The remaining patients were treated by chemotherapy only. Note, in this study, the application of hormone treatment was not randomized. In total of the patients were censored. We used these data to learn treatment recommender systems in order to predict the ITE of adding hormone therapy to chemotherapy. We performed hyper-parameter tuning as outlined in Supplementary Section S3, and selected the models with the lowest validation loss, respectively. Next, we evaluated the performance using test data from the GBSG Trial 2 (Schmoor ). Excluding cases with missing covariates, it contains 686 individual patients, with randomized hormone treatment assignments. The obtained C-indices are summarized in Table 1. Note, since only the factual outcomes are observable, we could not evaluate the performance with respect to correctly assigned ‘best treatments’ or PEHE scores. However, to substantiate our findings, we stratified our patients into two groups; the group ‘recommended treatment’ contains samples where the recommended treatment coincides with the applied treatment, while the group ‘anti-recommended treatment’ contains the samples where the recommended treatment does not coincide with the applied treatment (following Katzman ). The corresponding Kaplan–Meier (KM) curves of BITES are shown in Figure 3 with recommended treatment in green and anti-recommended treatment in red. Corresponding results for the other methods are shown in Supplementary Figure S3. For comparison, KM curves for the treated and control group are shown in blue and orange in Figure 3. Interestingly, BITES recommends hormone treatment only in 83.4% which resulted in the largest difference in survival based on the recommendations made by BITES (P = 0.000016). On the other hand, DeepSurv and Cox regression suggest to treat all patients with hormone therapy, closely followed by SurvITE (treatment recommended for 98.1% of patients). The results for all models are summarized in Table 1. Note, the group with BITES recommendation showed a superior survival compared to the treated group and the group with BITES anti-recommendation showed an inferior performance compared to the control group. Both comparisons, however, were not significant in a log-rank test.

Table 1.

Predictive outcomes on the controlled randomized test set of the RGBSG data obtained by each of the discussed models with minimum validation loss found in a hyper-parameter grid search

Method	C-index	P-value	Fraction T = 1
Cox reg.	0.471	0.0034	100%
DeepSurv	0.671	0.0034	100%
T-DeepSurv	0.652	0.2023	92.9%
RSF	0.675	0.0013	82.5%
SurvITE	0.631	0.0039	98.1%
ITES	0.676	0.000198	75.8%
BITES	0.666	0.000016	83.4%

Values in boldface indicate the best performing model with respect to C-index and P-value, respectively.

Fig. 3.

Recurrence-free survival probability for patients grouped according to the respective treatment recommendations of BITES, based on the test data from the GBSG Trial 2. For comparison, we show the KM curves for all hormone treated and untreated (control) patients in blue and orange, respectively (shown without error bars for better visibility) Predictive outcomes on the controlled randomized test set of the RGBSG data obtained by each of the discussed models with minimum validation loss found in a hyper-parameter grid search Values in boldface indicate the best performing model with respect to C-index and P-value, respectively. Finally, we explored feature importance of the BITES model using SHAP values (Lundberg and Lee, 2017) with results shown in Figure 4 which correspond to treatment option T = 0 (no hormone treatment) and T = 1 (hormone treatment), respectively. Here, points correspond to patients and positive (negative) SHAP values on the x-axis indicate an increased (decreased) risk of failure. Further, the feature value is illustrated in colors ranging from red to blue, where high values are shown in red and low values in blue. We observed that the number of positive lymph nodes has the strongest impact on survival with SHAP values ranging from to in the group with and without hormone treatment, where more positive lymph nodes (shown in red) indicate a worse survival. Considering menopausal status, we observed an increased risk of death and recurrence in postmenopausal breast cancer patients that had not received adjuvant hormone treatment (T = 0). This effect was substantially mitigated in the hormone-treated group, which is in line with the observation that postmenopausal, more than premenopausal breast cancer patients draw a disease-free survival benefit from extended adjuvant endocrine treatment (Li ). However, this does not preclude a survival benefit from hormone treatment for certain premenopausal breast cancer patients as revealed by a comparison of Figure 4a and b. It is also noteworthy, that high tumor grade (grade 3, shown in red) yielded increased SHAP values of up to 0.5 in the tamoxifen-treated group. This effect was substantially mitigated in the group without hormone treatment. This finding is in line with a recent study by Dar , which found a significant tamoxifen treatment benefit only among patients suffering from lower grade tumors, while no benefit was observed for grade 3 tumors. In summary, we observed strong hints that hormone treatment alleviates the negative effect of menopause, and increases the negative effect of high tumor grade on patient survival.

Fig. 4.

SHAP (SHapley Additive exPlanations) values for the best selected BITES model on the controlled randomized test samples of the RGBSG data. Red points correspond to high and blue points to low feature values. A positive SHAP value indicates an increased hazard and hence decreased survival chances and vice versa (A color version of this figure appears in the online version of this article)

4 Conclusion

We presented BITES, which is a machine learning framework to optimize individual treatment decisions based on time-to-event data. It combines Deep Neural Network counterfactual reasoning with Cox’s PH model. It further enables balancing of treated and non-treated patients using IPM on a latent layer data representation. We demonstrated in simulation studies that BITES outcompetes state-of-the-art methods with respect to prediction performance (Harrell’s C-index), correctly assigned treatments, and PEHE scores. We observed that BITES can effectively capture both linear and non-linear covariate outcome dependencies on both small and large scale observational studies. Moreover, we showed that BITES can be used to optimize hormone treatment in breast cancer patients. Using independent data from the GBSG Trial 2, we observed that BITES treatment recommendations might improve patients’ RFS. In this context, SHAP values were demonstrated to enhance the interpretability and transparency of treatment recommendations. Like most recently developed counterfactual tools, BITES depends on the strong ignorability assumption. Hence, caution is necessary when analyzing heavily confounded observational data. Future work needs to address more specialized time-to-event models, such as competing event models, and the generalization to multiple treatments and combinations thereof. Both could substantially broaden the scope of applications for BITES. In summary, BITES facilitates treatment optimization from time-to-event data. In combination with SHAP values, BITES models can be easily interpreted on the level of individual patients, making them a versatile backbone for treatment recommender systems.

Funding

This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the e:Med research and funding concept (grants 01ZX1912A, 01ZX1912C). Conflict of Interest: none declared. Click here for additional data file.

19 in total

1. A time-dependent discrimination index for survival data.

Authors: Laura Antolini; Patrizia Boracchi; Elia Biganzoli
Journal: Stat Med Date: 2005-12-30 Impact factor: 2.373

Review 2. Evidence for Health Decision Making - Beyond Randomized, Controlled Trials.

Authors: Thomas R Frieden
Journal: N Engl J Med Date: 2017-08-03 Impact factor: 91.245

3. A neural network model for survival data.

Authors: D Faraggi; R Simon
Journal: Stat Med Date: 1995-01-15 Impact factor: 2.373

4. Evaluating the yield of medical tests.

Authors: F E Harrell; R M Califf; D B Pryor; K L Lee; R A Rosati
Journal: JAMA Date: 1982-05-14 Impact factor: 56.272

5. The urokinase system of plasminogen activation and prognosis in 2780 breast cancer patients.

Authors: J A Foekens; H A Peters; M P Look; H Portengen; M Schmitt; M D Kramer; N Brünner; F Jänicke; M E Meijer-van Gelder; S C Henzen-Logmans; W L van Putten; J G Klijn
Journal: Cancer Res Date: 2000-02-01 Impact factor: 12.701

6. Variables with time-varying effects and the Cox model: some statistical concepts illustrated with a prognostic factor study in breast cancer.

Authors: Carine A Bellera; Gaëtan MacGrogan; Marc Debled; Christine Tunon de Lara; Véronique Brouste; Simone Mathoulin-Pélissier
Journal: BMC Med Res Methodol Date: 2010-03-16 Impact factor: 4.615

7. Testing the proportional hazards assumption in case-cohort analysis.

Authors: Xiaonan Xue; Xianhong Xie; Marc Gunter; Thomas E Rohan; Sylvia Wassertheil-Smoller; Gloria Y F Ho; Dominic Cirillo; Herbert Yu; Howard D Strickler
Journal: BMC Med Res Methodol Date: 2013-07-09 Impact factor: 4.615

Review 8. Continuous and discrete-time survival prediction with neural networks.

Authors: Håvard Kvamme; Ørnulf Borgan
Journal: Lifetime Data Anal Date: 2021-10-07 Impact factor: 1.588

9. Assessment of 25-Year Survival of Women With Estrogen Receptor-Positive/ERBB2-Negative Breast Cancer Treated With and Without Tamoxifen Therapy: A Secondary Analysis of Data From the Stockholm Tamoxifen Randomized Clinical Trial.

Authors: Huma Dar; Annelie Johansson; Anna Nordenskjöld; Adina Iftimi; Christina Yau; Gizeh Perez-Tenorio; Christopher Benz; Bo Nordenskjöld; Olle Stål; Laura J Esserman; Tommy Fornander; Linda S Lindström
Journal: JAMA Netw Open Date: 2021-06-01

10. Gene expression-based outcome prediction in advanced stage classical Hodgkin lymphoma treated with BEACOPP.

Authors: Ron D Jachimowicz; Wolfram Klapper; Gunther Glehr; Horst Müller; Heinz Haverkamp; Christoph Thorns; Martin Leo Hansmann; Peter Möller; Harald Stein; Thorsten Rehberg; Bastian von Tresckow; H C Reinhardt; Peter Borchmann; Fong Chun Chan; Rainer Spang; David W Scott; Andreas Engert; Christian Steidl; Michael Altenbuchinger; Andreas Rosenwald
Journal: Leukemia Date: 2021-06-10 Impact factor: 11.528