Han Fu1, Deedra Nicolet2,3, Krzysztof Mrózek2, Richard M Stone4, Ann-Kathrin Eisfeld2, John C Byrd5, Kellie J Archer1. 1. Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, Ohio, USA. 2. Clara D. Bloomfield Center for Leukemia Outcomes Research, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio, USA. 3. Alliance Statistics and Data Management Center, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio, USA. 4. Dana-Farber/Partners CancerCare, Harvard University, Boston, Massachusetts, USA. 5. Department of Internal Medicine, University of Cincinnati, Cincinnati, Ohio, USA.
Abstract
Medical breakthroughs in recent years have led to cures for many diseases. The mixture cure model (MCM) is a type of survival model that is often used when a cured fraction exists. Many have sought to identify genomic features associated with a time-to-event outcome, which requires variable selection strategies for high-dimensional spaces. Unfortunately, few variable selection methods currently exist for MCMs, especially when the number of predictors exceeds the number of samples. This study develops high-dimensional penalized Weibull MCMs, which allow for the identification of prognostic factors associated with cure status and/or survival. We demonstrate how such models can be estimated using two different iterative algorithms. The model-X knockoffs method is combined with these algorithms to control the false discovery rate (FDR) in variable selection. In extensive simulation studies, our penalized MCMs outperformed alternative methods on multiple metrics and achieved high statistical power with the FDR controlled. In an application to gene expression data from patients with acute myeloid leukemia (AML), our proposed approach identified 14 genes associated with potential cure and 12 genes associated with time to relapse, which may help inform treatment decisions for AML patients.
Medical breakthroughs in recent years have led to cures for various diseases, including cancer. For example, outcomes have improved for younger adults with acute myeloid leukemia (AML) during the past decades. Approximately 50% to 75% of adult AML patients treated with chemotherapy achieve complete remission (CR), and approximately 20% to 30% of patients enjoy long-term disease-free survival.
Yanada et al
suggested that AML patients with 3-year relapse-free survival from the CR date can be considered "potentially cured." In practice, a long plateau in the Kaplan-Meier (K-M) estimated survival distribution, together with sufficient follow-up, may indicate the presence of a cure fraction in the sample studied. When a cure fraction is present, one assumption of standard survival models such as the Cox proportional hazards (PH) model, namely that all subjects will eventually experience the event of interest, is violated. In this case, a Cox PH model tends to underestimate the hazard and overestimate the survival of the susceptible subjects.
A special type of survival model, namely, the mixture cure model (MCM), is frequently used in the presence of a cure fraction.
MCMs postulate that a fraction of the patients are cured, with a survival probability of one, while the remaining patients are susceptible to the event of interest, with failure times following a proper survival distribution referred to as the latency distribution. The latency distribution can be depicted using parametric models
and semi‐parametric PH models.
Parametric latency models have straightforward distributional assumptions and are easy to fit using various optimization methods for maximum likelihood estimation. To better understand disease mechanisms, high-throughput genomic studies have been conducted to identify important biomarkers associated with time-to-event outcomes. When the number of covariates exceeds the sample size, many traditional model-fitting approaches are not applicable. Effective variable selection procedures for high-dimensional data are needed to select the set of variables that are truly associated with the event of interest. Various penalization methods have been proposed and widely used during the past decades for variable selection in survival analysis, such as the least-absolute shrinkage and selection operator (LASSO)
and the adaptive LASSO.
However, only a few papers have focused on penalized MCMs in the presence of a cure fraction. Liu et al
studied variable selection for the semi‐parametric PH MCM, where both the LASSO and the smoothly clipped absolute deviation (SCAD) penalties were considered. The expectation‐maximization (E‐M) algorithm
was used to maximize the penalized likelihood. A recent paper
followed the work of Liu et al and extended the method to mixture and promotion time cure models
based on LASSO and adaptive LASSO.
Beretta and Heuchenne
generalized the PH MCM to accommodate time‐varying covariates, using the SCAD penalty. Another approach by Scolas et al
adopted a parametric MCM with an accelerated failure time (AFT) regression model for latency and adaptive LASSO for variable selection on interval‐censored data. However, a common limitation of these papers
is that only low-dimensional data, where the number of covariates is much smaller than the sample size, were considered. Fan et al
proposed a penalization method for estimating the MCM that explicitly considers the structural effects of covariates. That is, they postulated that the covariate effects on the cure probability and those on the survival function of susceptibles are potentially linearly related. Although the method can analyze high-dimensional data, its strong proportionality assumption may constrain its applicability. They later proposed a less restrictive strategy that imposes a sign-based penalty to promote similarity in the signs of the two parts (referred to as SCinCRM in the remainder of this article).
Bussy et al
introduced a more general mixture model (C-Mix) that considers subgroups of patients with different prognoses and risks. The C-Mix model includes the MCM as a special case and applies to high-dimensional data, but it only allows the subgroup membership, not the latency portion, to be driven by covariates. Additionally, the variable sets selected by the aforementioned methods are likely to contain redundant or noise variables, also known as false positives. Effectively controlling the false discovery rate (FDR), the expected proportion of false positives among all selected variables, without sacrificing power has been a major challenge in variable selection research. This study aims to develop penalized parametric MCMs for high-dimensional datasets, which allow for the identification of prognostic factors associated with cure status and/or survival of susceptibles. Two different iterative algorithms, the generalized monotone incremental forward stagewise (GMIFS)
and the E‐M algorithm,
are adopted for model estimation. These algorithms are further combined with the model‐X knockoffs
which is a flexible selection framework that allows strict FDR control. The remainder of the article is organized as follows. Section 2 introduces the statistical models and algorithms, and presents a brief introduction of the model‐X knockoffs framework. A simulation study designed to empirically compare our proposed methods with alternative approaches (including SCinCRM, C‐Mix, and traditional survival models that do not consider cure) is reported in Section 3. In Section 4, we apply different methods to a high‐throughput gene expression dataset for AML patients and present a list of genes that were identified as being associated with either the probability of cure or survival. Section 5 concludes the article with a discussion.
STATISTICAL MODELS AND ALGORITHMS
Weibull MCMs
Let $T$ be a non-negative continuous random variable representing the lifetime of interest (e.g., relapse-free survival time), with cumulative distribution function (CDF) $F(t)$. A proportion of subjects are susceptible ($Y = 1$) to the event of interest, while the rest are cured ($Y = 0$) and will not experience the event even after an extended follow-up, that is, their survival probability equals one for all $t$. Note that the cure status is unknown for censored subjects. Writing $\pi(x) = P(Y = 1 \mid x)$ for the probability of being susceptible and $S_u(t \mid z)$ for the survival function of the susceptibles, the survival function for the entire population is given by

$$S_{pop}(t \mid x, z) = \pi(x) S_u(t \mid z) + 1 - \pi(x), \qquad (1)$$

where $x$ and $z$ represent the covariates associated with incidence (whether one is cured) and latency (the survival distribution given that one is uncured), respectively. Therefore, $F_{pop}(t \mid x, z) = \pi(x) F_u(t \mid z)$ and, differentiating with respect to $t$, the density is $f_{pop}(t \mid x, z) = \pi(x) f_u(t \mid z)$. The likelihood for right-censored survival data with a cured fraction is

$$L = \prod_{i=1}^{n} \left[\pi(x_i) f_u(t_i \mid z_i)\right]^{\delta_i} \left[1 - \pi(x_i) + \pi(x_i) S_u(t_i \mid z_i)\right]^{1 - \delta_i}, \qquad (2)$$

where $\delta_i = 1$ indicates the failure time was observed and $\delta_i = 0$ indicates the observed time was censored. The log-likelihood is then proportional to

$$\ell = \sum_{i=1}^{n} \left\{ \delta_i \log\left[\pi(x_i) f_u(t_i \mid z_i)\right] + (1 - \delta_i) \log\left[1 - \pi(x_i) + \pi(x_i) S_u(t_i \mid z_i)\right] \right\}. \qquad (3)$$

In this article, a fully parametric likelihood function is considered. Specifically, the Weibull distribution, a popular parametric survival distribution that can be formulated as both an AFT and a PH model, is assumed for the latency of the susceptibles. The Weibull density is $f(t) = \alpha \lambda t^{\alpha - 1} \exp(-\lambda t^{\alpha})$, where the shape parameter $\alpha$ and the scale parameter $\lambda$ are both positive. We introduce the effects of the covariates $z$ by replacing $\lambda$ with $\lambda \exp(\beta^{\top} z)$, such that

$$f_u(t \mid z) = \alpha \lambda \exp(\beta^{\top} z)\, t^{\alpha - 1} \exp\left[-\lambda \exp(\beta^{\top} z)\, t^{\alpha}\right]. \qquad (4)$$
The probability of being susceptible is most frequently modeled with logistic regression,
in which case we replace $\pi(x)$ with $\exp(b^{\top} x) / [1 + \exp(b^{\top} x)]$. Substituting these two expressions into Equation (3) yields the log-likelihood for the Weibull MCM. With high-dimensional covariate spaces, penalization is desired for both coefficient sets, $b$ for incidence and $\beta$ for latency. That being said, it is sometimes useful to coerce covariates that are known risk factors, such as baseline demographic and clinical characteristics, into the model without penalty. Hence, we further partition the two sets of predictors as $x = (x_u, x_p)$ and $z = (z_u, z_p)$, where the subscript $u$ denotes the unpenalized predictors that we wish to force into the model, while the subscript $p$ denotes the penalized predictors, such as genomic features, for which we seek a parsimonious model. The parameters corresponding to the incidence portion of the model are given by $b = (b_0, b_u, b_p)$, where $b_0$ is the intercept, while the parameters corresponding to the latency portion are $\beta = (\beta_u, \beta_p)$. The unknown parameters are collected in $\theta = (\alpha, \lambda, b, \beta)$, and the observed data include $(t_i, \delta_i, x_i, z_i)$ for $i = 1, \ldots, n$.
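To make the model concrete, the observed-data log-likelihood of a Weibull MCM with logistic incidence can be coded directly. The following is an illustrative NumPy sketch, not the authors' R implementation; the parameter packing and function name are our own choices:

```python
import numpy as np

def weibull_mcm_loglik(theta, t, delta, x, z):
    """Observed-data log-likelihood of a Weibull mixture cure model.

    theta packs (log_alpha, log_lambda, b0, b, beta): shape/scale on the
    log scale for unconstrained optimization, logistic intercept/slopes
    for incidence, and Weibull regression slopes for latency.
    """
    log_alpha, log_lam = theta[0], theta[1]
    p_x = x.shape[1]
    b0, b, beta = theta[2], theta[3:3 + p_x], theta[3 + p_x:]
    alpha, lam = np.exp(log_alpha), np.exp(log_lam)

    # Incidence: P(susceptible | x) via logistic regression
    pi = 1.0 / (1.0 + np.exp(-(b0 + x @ b)))

    # Latency: Weibull PH with rate lam * exp(z beta)
    rate = lam * np.exp(z @ beta)
    log_f = np.log(alpha) + np.log(rate) + (alpha - 1) * np.log(t) - rate * t**alpha
    S = np.exp(-rate * t**alpha)

    # delta = 1: event observed (subject must be susceptible);
    # delta = 0: censored (either cured or a susceptible survivor)
    ll_event = np.log(pi) + log_f
    ll_cens = np.log(1.0 - pi + pi * S)
    return np.sum(delta * ll_event + (1 - delta) * ll_cens)
```

Optimizing this function over `theta` (for example with BFGS) recovers the unpenalized maximum likelihood fit; the algorithms below add penalization on top of the same likelihood.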
First, we used the GMIFS method
to estimate the penalized Weibull MCM for high-dimensional data. In the linear regression setting, the incremental forward stagewise (FS) method is a version of boosting which produces a coefficient path strikingly similar to the $\ell_1$-penalized regression (LASSO) path as the penalization level varies.
FS works by incrementing the coefficient of the variable most correlated with the current residuals by a small amount $\epsilon$ at each step. When the incremental amount $\epsilon \to 0$, the algorithm (called FS$_0$) produces an identical path to the LASSO path under certain conditions.
FS$_0$ was later characterized as a monotone version of the LASSO with much smoother regularization paths.
The GMIFS method generalizes this characterization to problems with loss functions other than squared error, such as the negative log-likelihood of the logistic regression model.
As long as the gradient functions can be derived, the GMIFS is theoretically applicable to any parametric models. In fact, it has been proven useful in a wide variety of high‐dimensional settings for modeling discrete survival time,
ordinal,
or count responses. The GMIFS algorithm proceeds in an iterative fashion and updates one of the penalized coefficients by a small incremental amount at each iteration step. To determine which penalized covariate is to be updated, the algorithm adopts the method of steepest ascent, that is, it updates the coefficient associated with the largest gradient. In our case, one penalized incidence coefficient in $b_p$ and one penalized latency coefficient in $\beta_p$ were selected and updated at each step. To determine the direction of the update at each step, the expanded penalized design matrices $[x_p : -x_p]$ and $[z_p : -z_p]$ were used as input. Both expanded matrices were centered and scaled before entering the algorithm. The corresponding coefficients were then expanded to $(b_p^+, b_p^-)$ and $(\beta_p^+, \beta_p^-)$, and the Karush-Kuhn-Tucker condition ensures that at most one of $b_{pj}^+$ and $b_{pj}^-$ associated with the same covariate (likewise $\beta_{pj}^+$ and $\beta_{pj}^-$) is greater than zero at the same time.
At each iteration, the selected coefficients were updated with a small incremental amount (set to be 0.001), so that the coefficient paths for both the positive and negative parts are constrained to be monotonically nondecreasing. At the end of the algorithm, the solution paths for the original coefficients $b_p$ and $\beta_p$ were obtained by subtracting the coefficient estimates for the negative versions of the variables from those for the positive counterparts. The GMIFS algorithm for the penalized Weibull MCM is summarized in Algorithm 1. The algorithm starts with $b_p = 0$ and $\beta_p = 0$. The unpenalized parameters $\alpha$, $\lambda$, $b_u$ (including the intercept $b_0$), and $\beta_u$ are initialized using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization method, with the method of moments (MOM) estimates of $\alpha$ and $\lambda$ as starting values. The algorithm then iterates between updating the penalized parameters and updating the unpenalized parameters. To update the unpenalized parameters, the BFGS method is applied with the penalized parameters $b_p$ and $\beta_p$ fixed at their current estimates. If the difference between the log-likelihoods of two successive steps is less than a prespecified tolerance, the iterative procedure is considered to have converged. Otherwise, the algorithm stops after 10 000 iterations. To prevent over-fitting, cross-validation can be used to select the optimal stopping step, which yields the final estimates. The C-statistic designed for MCMs
(described in Section 3.3) is used as the cross‐validation metric for the GMIFS as well as the E‐M algorithm discussed in the next section.
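The core GMIFS update is simple enough to sketch in a few lines. The following Python fragment is an illustrative re-implementation, not the authors' R code: the logistic log-likelihood stands in for the full Weibull MCM likelihood, and all names and default values are our own. It shows the two key ingredients, the expanded design $[X : -X]$ and the steepest-ascent increment:

```python
import numpy as np

def logistic_gradient(coef, Xe, y):
    """Gradient of the logistic log-likelihood on the expanded design Xe."""
    p = 1.0 / (1.0 + np.exp(-(Xe @ coef)))
    return Xe.T @ (y - p)

def gmifs_path(X, y, eps=0.01, steps=300):
    """Monotone GMIFS-style path: expand X to [X, -X], then repeatedly
    increment the expanded coefficient with the largest gradient."""
    n, p = X.shape
    Xe = np.hstack([X, -X])          # expanded design doubles the columns
    coef = np.zeros(2 * p)
    for _ in range(steps):
        g = logistic_gradient(coef, Xe, y)
        coef[np.argmax(g)] += eps    # monotone nondecreasing update
    return coef[:p] - coef[p:]       # positive part minus negative part
```

In the full algorithm this update alternates with BFGS refits of the unpenalized parameters, and the stopping step is chosen by the cross-validated C-statistic rather than a fixed `steps` count.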
Expectation‐maximization algorithm
The E‐M algorithm is a natural choice for problems with hidden variables, such as the cure status in the MCMs.
In this section, we introduce the E-M algorithm and describe how it can be applied to the penalized MCMs. If we had observed the cure status $y_i$ for $i = 1, \ldots, n$, the complete-data likelihood function for the MCM would be

$$L_c = \prod_{i=1}^{n} \left[1 - \pi(x_i)\right]^{1 - y_i} \pi(x_i)^{y_i} \left[h_u(t_i \mid z_i)^{\delta_i}\, S_u(t_i \mid z_i)\right]^{y_i},$$

where $h_u(t \mid z) = f_u(t \mid z)/S_u(t \mid z)$ is the hazard function of an uncured patient. In a Weibull MCM, the penalized complete-data log-likelihood with the LASSO penalty for both $b_p$ and $\beta_p$ can be written as

$$\ell_c^{pen} = \log L_c - \lambda_1 \sum_{j=1}^{q_1} |b_{pj}| - \lambda_2 \sum_{j=1}^{q_2} |\beta_{pj}|, \qquad (7)$$

where $q_1$ denotes the number of predictors in $x_p$, $q_2$ denotes the number of predictors in $z_p$, and $(\lambda_1, \lambda_2)$ is a vector of tuning parameters tuned by cross-validation. For the sake of simplicity, the same value was used for all tuning parameters. In the E-step, the algorithm calculated the expectation of $\ell_c^{pen}$ with respect to the conditional distribution of the unobserved $y_i$'s given the current parameter estimates and the observed data. Since the $y_i$'s enter $\ell_c^{pen}$ linearly, we only need to compute the expected value of $y_i$ given the current estimates and the observed data, denoted by $w_i$. When subject $i$ was censored with $\delta_i = 0$,

$$w_i = \frac{\pi(x_i)\, S_u(t_i \mid z_i)}{1 - \pi(x_i) + \pi(x_i)\, S_u(t_i \mid z_i)}$$

according to Bayes' theorem. When an event was observed for subject $i$ with $\delta_i = 1$, $w_i$ equals 1. We can integrate the two cases into one expression, given by

$$w_i = \delta_i + (1 - \delta_i)\, \frac{\pi(x_i)\, S_u(t_i \mid z_i)}{1 - \pi(x_i) + \pi(x_i)\, S_u(t_i \mid z_i)}.$$
To obtain the expectation of $\ell_c^{pen}$, the E-step replaces $y_i$ in Equation (7) with the weight $w_i$ given above. The M-step maximized this expectation, which is equivalent to maximizing the logistic portion and the survival portion separately to obtain updated parameters. A trick was adopted to convert the penalization into a constrained optimization problem with a differentiable objective, by doubling the number of penalized parameters.
Specifically, for a real number $\theta$, we can write $\theta = \theta^+ - \theta^-$ and $|\theta| = \theta^+ + \theta^-$, where $\theta^+$ and $\theta^-$ correspond to the positive and negative parts of $\theta$, respectively, satisfying $\theta^+ \ge 0$ and $\theta^- \ge 0$. Then the limited-memory BFGS with bound constraints (L-BFGS-B) can be applied to efficiently solve the optimization problems in the M-step. The unpenalized parameters were updated using the regular BFGS method in each M-step. The E-M algorithm is summarized in Algorithm 2. As in the GMIFS, $\alpha$ and $\lambda$ were initialized using the MOM estimates, the latency coefficients were initialized using estimates from a low-dimensional Cox model fitted with the unpenalized covariates, and the incidence coefficients were initially set equal to those estimates for simplicity. The algorithm then iterated between the E-step and the M-step until convergence. The E-M usually converged much more quickly than the GMIFS thanks to the L-BFGS-B method, so a maximum of 100 iterations was used. The optimal tuning parameter value was selected using the cross-validated C-statistic for MCMs.
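The E-step weight above is the workhorse of the algorithm and is easy to verify numerically. Here is a minimal Python sketch (our own illustrative code, with hypothetical argument names, not the authors' implementation):

```python
import numpy as np

def e_step_weights(pi, S_u, delta):
    """Expected susceptibility indicators w_i for the E-M algorithm.

    pi:    current estimate of P(susceptible | x_i)
    S_u:   current latency survival estimate S_u(t_i | z_i)
    delta: event indicator (1 = event observed, 0 = censored)

    An observed event implies susceptibility (w_i = 1); for censored
    subjects, the posterior probability of being uncured follows from
    Bayes' theorem.
    """
    w_cens = pi * S_u / (1.0 - pi + pi * S_u)
    return delta + (1 - delta) * w_cens
```

The M-step then maximizes the $w_i$-weighted logistic log-likelihood and the $w_i$-weighted Weibull log-likelihood separately, which is what makes the two-part structure computationally convenient.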
Model‐X knockoffs
The model‐X knockoffs approach
was recently developed as a flexible variable selection framework with exact finite-sample FDR control. Because the method places no restriction on the dimensionality of the data or on the conditional distribution of the response given the covariates, it applies to arbitrary response types, including time-to-event data. In this framework, a set of "knockoff" variables $\tilde{x}$ is constructed to mimic the covariance structure of the original covariates $x$. Formally, for any subset $S$ of the covariates, when the entries $x_j$ and $\tilde{x}_j$ are swapped for each $j \in S$, the joint distribution of $(x, \tilde{x})$ remains invariant. Another property of the knockoff variables is that they are independent of the response conditional on the original covariates. In this way, the knockoffs can be used as negative controls for the real covariates, so that true signals can be teased apart from noise variables. Practically, the knockoff variables can be constructed using the second-order approximate approach
with the knockoff R package,
assuming that the covariates follow a multivariate normal distribution. When the covariates cannot be depicted by Gaussian distributions, deep generative models such as the deep knockoff machine
can be used to relax the distributional assumption and generate valid knockoffs in general settings. Since we have two sets of covariates, $x$ and $z$, in our model, we constructed a group of knockoff variables for each set. The second-order approximate approach was used for knockoff construction in the simulation studies in Section 3, and the deep knockoff machine was used in the application in Section 4, where the covariates were not normally distributed. The knockoffs $\tilde{x}_p$ and $\tilde{z}_p$ were then augmented with the corresponding original covariates $x_p$ and $z_p$, respectively, to form two extended matrices, $[x_p, \tilde{x}_p]$ and $[z_p, \tilde{z}_p]$, with the number of predictors doubled. These extended matrices were used as new input matrices for the GMIFS or E-M algorithm. The absolute values of the estimated penalized coefficients for the extended sets of covariates were used as measures of variable importance in explaining the cure status or the latency. For example, the importance measure for an incidence covariate $x_j$ is denoted by $Z_j = |\hat{b}_{pj}|$, with $\tilde{Z}_j$ the corresponding measure for its knockoff. We then calculated the difference between the importance measure of each original predictor and that of its knockoff as the statistic used for variable selection, denoted by $W_j = Z_j - \tilde{Z}_j$. A data-dependent threshold for $W_j$ is given by

$$T = \min\left\{ t > 0 : \frac{1 + \#\{j : W_j \le -t\}}{\max\left(\#\{j : W_j \ge t\},\, 1\right)} \le q \right\},$$

where $q$ denotes the target FDR level. The variables in $x_p$ whose statistics exceed the threshold ($W_j \ge T$) are then selected. In this way, strict FDR control in variable selection is guaranteed,
meaning that, on average, at least 80% of the variables selected by the framework are expected to be true signals if $q$ is set to 20%. A similar procedure is followed to select important variables from the latency covariates $z_p$.
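The selection rule above can be sketched compactly. The following Python fragment is our own illustrative version of the knockoffs+ filter (the in-text threshold), taking the fitted coefficient magnitudes for the original variables and their knockoffs as input; it is not the authors' code and not the `knockoff` R package:

```python
import numpy as np

def knockoff_select(beta_orig, beta_knock, q=0.2):
    """Knockoffs+ selection from penalized coefficient magnitudes.

    W_j = |beta_j| - |beta_j_knockoff|; the data-dependent threshold is
    the smallest t at which the conservative FDP estimate
    (1 + #{W_j <= -t}) / #{W_j >= t} drops to the target level q.
    """
    W = np.abs(beta_orig) - np.abs(beta_knock)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)   # indices of selected variables
    return np.array([], dtype=int)          # no threshold meets the target
```

Large positive $W_j$ means the original variable carries signal its knockoff cannot mimic; a strict target $q$ can leave the selection empty, which is the intended conservative behavior.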
SIMULATION STUDIES
Competing approaches
Simulation studies were conducted to compare the performance of our methods with that of competing approaches. The SCinCRM
and C‐Mix
are both relevant alternative methods with publicly available code and are thus included in the comparison. The SCinCRM aims to promote sign consistency between the incidence coefficients and the latency coefficients by imposing a sign-based penalty. The sign penalty term is governed by a tuning parameter that can be tuned to zero when the sign consistency assumption is not supported by the data; in that case, the method reduces to a penalized MCM. Besides the sign constraint, there are a few other differences between our method and the SCinCRM. They used the Cox PH model for the survival function of susceptible subjects, while we considered the Weibull model, which is also a PH model. Instead of the $\ell_1$ penalty, the minimax concave penalty (MCP)
was adopted for the regression coefficients. They also applied the E-M algorithm to estimate the model but used coordinate descent in each M-step. The C-Mix
method considers a general mixture model where different subgroups of patients have different prognoses and risks. The provided code supports the estimation of an MCM where one of the subgroups has zero risk. Compared with our method, they used a similar E-M procedure with the L-BFGS-B algorithm, but there are some distinctions between their model and ours. The most obvious distinction is that they only consider regression on the incidence part, while we incorporate regression on both incidence and latency. They assume a simple geometric distribution in their code, which allows for closed-form updates of a latency parameter that is identical for all subjects. Moreover, the elastic-net penalty
is used for the coefficient regularization instead of the $\ell_1$ penalty. Other than the aforementioned approaches, we would also like to investigate how a survival model that ignores the cure fraction performs when a non-negligible subset of treated patients is cured. We thus included the penalized Weibull survival model (referred to as "non-cure Weibull" in the remainder of this article) and the penalized Cox PH model (referred to as "non-cure Cox") for comparison. Since, to the best of our knowledge, no efficient implementation of the penalized Weibull model for high-dimensional data is available (there is a Bayesian implementation
which is expected to be slow), we implemented the non‐cure Weibull model via the GMIFS algorithm. The glmnet R package
was used to implement the penalized Cox model. These non-cure models were tuned using the cross-validated C-statistic based on only the latency coefficients (see Section 3.3 for a detailed description of the metric). The R code for implementing the proposed estimation algorithms and conducting the simulation studies is available at https://github.com/hanfu-bios/curemodels.
Simulation settings
In the simulations, we used the same set of covariates for incidence ($x$) and latency ($z$), with $n$ observations, $p$ penalized predictors, and two unpenalized predictors. We randomly allocated 3/4 of the data to a training set and 1/4 to a testing set. The penalized covariates were generated from a $p$-dimensional Gaussian distribution MVN$(0, \Sigma)$, where $\Sigma$ was a block-diagonal matrix. Each block had an autoregressive structure, with the $(j, k)$ element equal to $\rho^{|j - k|}$ for a small $\rho$, to reflect weak correlations among predictors. The entries of the unpenalized covariate matrices were generated i.i.d. from a fixed distribution. There were $s$ penalized predictors having nonzero effects (called signals) in each regression part. One signal predictor was randomly selected from each block and assigned a coefficient value $A$ (called the signal amplitude) with a random sign, that is, $\pm A$ with equal probability. We varied the value of $A$ from 0.4 to 1.8 in our simulations to cover common effect sizes of interest for gene expression data. All other predictors within each block were simply "null" noise variables with zero coefficients. The unpenalized coefficients were sampled from a normal distribution. In the default simulation setting, the incidence and latency coefficients were sampled independently. In order not to favor our methods over SCinCRM because of its sign consistency assumption, we also performed a simulation in which the corresponding coefficients from the two parts, $b$ and $\beta$, had the same signs. Next, the cure status $y_i$ was generated from a Bernoulli distribution with mean $\pi(x_i)$. The intercept $b_0$ was sampled from a normal distribution; to control the cure rate in the simulated data, its expected value was varied, with two values chosen to reflect cure rates of approximately 40% and 25%, respectively. For susceptible subjects ($y_i = 1$), the event time (in years) followed a Weibull distribution with shape parameter $\alpha$ and scale parameter $\lambda \exp(\beta^{\top} z_i)$. The censoring time followed a uniform distribution on a fixed interval.
If the event occurred later than the censoring time ($t_i^* > c_i$) or the subject was cured ($y_i = 0$), then the censoring indicator $\delta_i = 0$; otherwise, $\delta_i = 1$, indicating the event was observed. For uncured subjects, the observed time was $t_i = \min(t_i^*, c_i)$, and for cured subjects, $t_i = c_i$. The upper bound of the censoring distribution was set so that the censoring rate was approximately 45% when the cure rate was around 40%, and 31% when the cure rate was around 25%. To test the robustness of our methods, we first applied an alternative data-generating distribution for the event time of susceptible subjects. The generalized gamma (GG) family is an extensive family containing many survival distributions as special cases, including the Weibull. It has three parameters: a location parameter, a scale parameter, and a shape parameter. When the shape parameter takes a particular reference value, the GG density reduces to the Weibull density we used in Equation (4). In the alternative data-generating process, the event time for uncured subjects followed a GG distribution with the location and scale parameters described above but a different shape parameter, so that the density was no longer Weibull and, further, violated the PH assumption. We therefore investigated the performance of the different methods when the latency model was misspecified. Further, we examined the robustness of our methods in the case where the underlying data-generating process was not an MCM at all but, specifically, a promotion time cure model.
The promotion time cure model was constructed in the context of cancer recurrence, which is assumed to be promoted by carcinogenic cells that remain active after treatment. The unobserved number of carcinogenic cells, $N_i$, is incorporated through a Poisson model. In the simulation, we let $N_i$ follow a Poisson distribution with a mean depending on the incidence covariates. For each carcinogenic cell, the time to activation followed a standard exponential distribution. If $N_i > 0$ for subject $i$, the event time was the time at which the first carcinogenic cell became activated; if $N_i = 0$, subject $i$ was cured and never experienced the event. The covariates, the coefficients, and the censoring times were generated in the same way as in our default simulation setting. The signal amplitude was set to 1. Under this simulation setting, we assessed the performance of our penalized MCMs when the fitted model was completely different from the data-generating process.
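As an illustration of the default data-generating scheme, the following Python sketch draws right-censored data from a Weibull MCM. It is our own simplified version with made-up default parameter values, not the paper's R code; in particular it plants one incidence signal and one latency signal rather than one per correlation block:

```python
import numpy as np

def simulate_mcm(n=300, p=10, amp=1.0, alpha=1.5, lam=0.5, cmax=10.0, seed=1):
    """Generate right-censored data from a Weibull mixture cure model:
    logistic incidence, Weibull latency with rate lam * exp(z beta),
    and uniform censoring on (0, cmax). Defaults are illustrative."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, p))
    z = x                                  # same covariates for both parts
    b = np.zeros(p); b[0] = amp            # one incidence signal
    beta = np.zeros(p); beta[1] = amp      # one latency signal
    pi = 1.0 / (1.0 + np.exp(-(0.5 + x @ b)))
    y = rng.binomial(1, pi)                # 1 = susceptible, 0 = cured
    rate = lam * np.exp(z @ beta)
    # inverse-CDF draw from S(t) = exp(-rate * t^alpha)
    event = (-np.log(rng.uniform(size=n)) / rate) ** (1.0 / alpha)
    cens = rng.uniform(0, cmax, size=n)
    t = np.where(y == 1, np.minimum(event, cens), cens)
    delta = ((y == 1) & (event <= cens)).astype(int)
    return t, delta, x, z, y
```

Cured subjects are always censored at their censoring time, which is exactly the identifiability challenge the MCM is designed to handle.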
Metrics for performance evaluation
In this section, we describe the metrics used for performance evaluation and method comparison. Since variable selection is our primary objective, the most important metrics in this study are the false discovery proportion (FDP) and the power of variable selection. The FDP is the realized counterpart of the FDR, calculated as the proportion of false positives among all selected variables, that is, among variables with nonzero estimated coefficients in the regularized models. Power can be estimated by the proportion of true signals that are identified. A low FDP and a high power are usually desired, but there is a tradeoff between the two. When the knockoff framework is applied so that the FDR is controlled at a target level, the model with higher power is preferred. Besides variable selection, we also assessed performance in terms of prediction and estimation bias. The concordance index (C-index or C-statistic) is a frequently used metric for censored data which measures the probability of concordance between predictions and observations. Given a predicted risk score $\eta_i$ ($\beta^{\top} z_i$ in our model), the C-statistic for a standard survival model is the number of concordant pairs divided by the total number of evaluable pairs, given by

$$C = \frac{\sum_{i \ne j} \delta_i\, I(t_i < t_j)\, I(\eta_i > \eta_j)}{\sum_{i \ne j} \delta_i\, I(t_i < t_j)}, \qquad (12)$$

where $I(\cdot)$ denotes the indicator function. In this formula, the cured patients are not differentiated from those who are uncured but censored. A refined version was proposed
to take the cure status into account. Specifically, they applied a prespecified cutoff to decide whether a censored subject should be presumed cured: if the censoring time was beyond the cutoff time point, the subject was assumed to be cured; otherwise, the cure status was unknown. In their cure-status weighting approach, they assigned a weight $\nu_i = 1$ to subjects who experienced the event ($\delta_i = 1$), $\nu_i = 0$ to presumptively cured subjects, and the estimated probability of non-cure $\hat{\pi}(x_i)$ to the other censored subjects with unknown cure status. The weighted C-statistic is defined as

$$C_w = \frac{\sum_{i \ne j} \nu_j\, \delta_i\, I(t_i < t_j)\, I(\eta_i > \eta_j)}{\sum_{i \ne j} \nu_j\, \delta_i\, I(t_i < t_j)}, \qquad (13)$$

where $\nu_j$ is the cure-status weight defined above. In this way, the estimated coefficients from both regression parts are incorporated into the metric, through $\eta$ and $\hat{\pi}$. For approaches with both regression parts (our methods and SCinCRM), we used the definition in Equation (13), while Equation (12) was used for approaches with only one regression part (C-Mix, non-cure Weibull, and non-cure Cox models), with $\eta$ calculated using the corresponding regression coefficients. In an earlier paper,
Asano et al proposed an AUC metric for susceptibility prediction based on a similar weighting scheme, which they called the mean score imputation method. Given the estimated probability of non-cure and a cutoff value for the predicted probability, the true positive rate (TPR) and false positive rate (FPR) for susceptibility prediction were estimated by weighting each subject's contribution according to the same imputation scheme.
Having the TPR and FPR values, the AUC can be estimated using the trapezoidal method. In addition, we assessed susceptibility prediction in terms of prediction accuracy. Specifically, we used the K-M estimated survival probability at the last observed event as an estimate of the cured proportion. The subjects in the corresponding top percentile of the estimated susceptibility probability were then predicted to be susceptible ($\hat{y}_i = 1$) and the others to be cured ($\hat{y}_i = 0$). Since the true susceptibility status was available in the simulation studies, we calculated the susceptibility prediction accuracy as the proportion of subjects with $\hat{y}_i = y_i$. The AUC and accuracy metrics assess the predictive performance for the cure status and thus apply only to the approaches that include the incidence regression part. A common limitation of the C-statistic with cure-status weighting and the AUC with mean score imputation is the requirement of a prespecified cutoff for cure, which may be considered subjective or arbitrary. The cutoff point is used to produce a proxy for the unobserved real cure status, based on which the predictive performance of the incidence portion can be measured. In this article, a cutoff of 5 years was used, since it is a common time point in cancer prognosis for indicating potential cure after a patient achieves complete remission. Under the default simulation setting, the probability of being cured given that one's observed time exceeded the cutoff was around 95.3%, suggesting that the 5-year cutoff was reasonable for the simulations presented here. Researchers are advised to select a cutoff value tailored to their own applications and data. Two additional metrics, the relative model error (RME) and the estimation error (ERR), were applied to measure estimation bias, as described in the SCinCRM paper.
They are defined as

RME(β̂) = (β̂ − β)ᵀ Σ (β̂ − β) / (β̂₀ − β)ᵀ Σ (β̂₀ − β),   ERR(β̂) = ‖β̂ − β‖² / ‖β̂₀ − β‖²,

where Σ is the covariance matrix of the covariates and β̂₀ represents the oracle estimates, derived from models in which only the true signals were included and the coefficients of all other covariates were forced to be zero. These two metrics essentially compare the distance between the estimates and the true values with the distance between the oracle estimates and the true values. They were calculated to assess estimation of both the penalized incidence coefficients and the penalized latency coefficients.
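As a concrete illustration of the imputation-weighted metrics above, the following sketch computes mean-score-imputation-style TPR/FPR values over a grid of cutoffs and integrates them with the trapezoidal rule. The weight construction (weight one for observed events, estimated probability of being uncured for censored subjects) and all function names are illustrative assumptions, not the implementation used in this article.

```python
import numpy as np

def weighted_tpr_fpr(p_hat, w, cutoff):
    """Weighted TPR/FPR for susceptibility prediction.

    p_hat  : estimated probability of being susceptible (non-cured)
    w      : weight in [0, 1]; 1 for observed events and, for censored
             subjects, an estimate of the probability of being uncured
             (an illustrative stand-in for mean score imputation weights)
    cutoff : classification threshold on p_hat
    """
    pos = p_hat > cutoff
    tpr = np.sum(w * pos) / np.sum(w)
    fpr = np.sum((1 - w) * pos) / np.sum(1 - w)
    return tpr, fpr

def auc_trapezoid(p_hat, w, n_grid=201):
    """AUC via the trapezoidal rule applied to the weighted ROC points."""
    cuts = np.linspace(0.0, 1.0, n_grid)
    pts = np.array([weighted_tpr_fpr(p_hat, w, c) for c in cuts])
    tpr, fpr = pts[:, 0], pts[:, 1]
    order = np.lexsort((tpr, fpr))  # sort ROC points by FPR, then TPR
    t, f = tpr[order], fpr[order]
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2))
```

With perfectly separating probabilities the sketch returns an AUC of 1, and with uninformative (constant) probabilities it returns 0.5, as expected of any ROC integration.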
Simulation results
In this section, simulation results are reported for the different approaches using the metrics previously described. Due to space limitations, we present here only the results under the default simulation setting, in which the intercept corresponds to a cure rate of roughly 40%, the incidence and latency coefficients were independently simulated, and a Weibull distribution was used for latency. The results under the alternative settings (a cure rate of roughly 25%, the same signs for the incidence and latency coefficients, or a GG distribution for latency) were similar and are reported in the Appendix.

Figure 1 presents the cure prediction accuracy and C-statistic for the different methods on training and testing data. The x-axis is the signal amplitude in the data-generating process. Each point in the plots represents the average over 100 replications. From the figure, we can observe that our methods (MCMs estimated by GMIFS or E-M) generally achieved better performance than the competing approaches on these two metrics. The C-statistic for the non-cure Cox model had a slight advantage on the training data but a noticeable disadvantage on the testing data in comparison to our MCMs. Further, cure prediction cannot be assessed with the non-cure models because they lack an incidence regression component. C-Mix performed the worst due to its simplified model assumption. As the signal became stronger, most methods tended to perform better, with the exceptions of C-Mix and the non-cure Weibull model. As observed, testing performance was generally worse than training performance, but again, this did not hold for C-Mix and the non-cure Weibull model.
FIGURE 1
Cure prediction accuracy (top row) and C‐statistic (bottom row) plotted by signal amplitude for the training (right) and testing (left) datasets for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull and a non‐cure Cox model. For our MCMs and SCinCRM, the C‐statistic with cure status weighting was calculated, while the regular C‐statistic was calculated for C‐Mix and non‐cure models
Figure 2 shows the changes in RME (left panels) and ERR (right panels) over signal amplitude for both the incidence and latency parts. The incidence error for C-Mix and the latency error for MCM (E-M) and the non-cure Cox model were extremely high when the signal amplitude was small and were thus omitted from the figure for better presentation. In most cases, our GMIFS approach outperformed SCinCRM, even though SCinCRM uses the MCP penalty, which is known to enjoy the oracle property, while the LASSO penalty does not. Because the covariates were independently simulated, the variance-covariance matrix was close to the identity, and the differences between the RME and ERR values were unnoticeable.
FIGURE 2
Relative model error (left) and estimation error (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model
To compare performance in variable selection, we applied the model-X knockoffs framework to each of the methods with the target FDR level set at 20%. The FDP and power results for the different methods are presented in Figure 3. The FDP of all methods was well controlled below the target FDR level, except for the non-cure Cox model. Meanwhile, power increased as the amplitude increased, and the incidence power was generally lower than the latency power, but our GMIFS and E-M methods achieved higher power than the other approaches in both regression parts. The power for SCinCRM was consistently low, and the non-cure Weibull model performed the worst in latency variable selection, with power of almost zero.
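The knockoff selection step referenced here can be sketched as the standard knockoff+ filter: given per-variable knockoff statistics W (large positive values indicating that the original variable, rather than its knockoff copy, carries signal), variables are selected above a data-driven threshold that bounds the estimated FDR. The function below is a generic sketch of that filter, not our specific implementation.

```python
import numpy as np

def knockoff_select(W, q=0.2):
    """Knockoff+ selection at target FDR level q.

    Returns indices j with W_j >= tau, where tau is the smallest t with
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q.
    """
    ts = np.sort(np.abs(W[W != 0]))  # candidate thresholds
    for t in ts:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)   # nothing selectable at level q
```

For example, with five strongly positive statistics and one mildly negative one, the filter selects exactly the five positives at q = 0.2; with only negative statistics it selects nothing.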
FIGURE 3
False discovery proportion (left) and power (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model
The results under alternative simulation settings are presented in the Appendix. Figures A1, A2, and A3 show the simulation results for a cure rate of roughly 25%. The results were very similar to those in Figures 1, 2, and 3. From Figure A2, SCinCRM achieved slightly better estimation error than our GMIFS approach when the signals were strong. The results when the latency followed a GG distribution are presented in Figures A4, A5, and A6. With the misspecified latency distribution, the latency power of our Weibull MCMs was lower than in the default simulation scenario, but still higher than that of the competitors. The other metrics were not noticeably hampered, which suggests our methods may be robust to latency model misspecification. Under the simulation setting where the event times were generated from a promotion cure model, only the regular C-statistic in Equation (13) was assessed, as it was the only metric still meaningful when fitting a two-component MCM to data from a one-component data-generating process. Across 50 replications with a signal amplitude of 1, our MCM using GMIFS achieved an average C-statistic of 0.778 on training sets and 0.572 on testing sets. In contrast, the non-cure Weibull model had an average C-statistic of 0.547 on training sets and 0.539 on testing sets. These results indicate that our MCM still achieves better predictive performance than the non-cure model even under severe deviations from the model assumptions, as long as a cure fraction exists.
FIGURE A1
For a cure fraction of roughly 25%, cure prediction accuracy (top row) and C‐statistic (bottom row) plotted by signal amplitude for the training (right) and testing (left) datasets for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull and a non‐cure Cox model. For our MCMs and SCinCRM, the C‐statistic with cure status weighting was calculated, while the regular C‐statistic was calculated for C‐Mix and non‐cure models
FIGURE A2
For a cure fraction of roughly 25%, relative model error (left) and estimation error (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull and a non‐cure Cox model
FIGURE A3
For a cure fraction of roughly 25%, false discovery proportion (left) and power (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model
FIGURE A4
Results when the latency portion of the model was simulated from a generalized gamma distribution. Cure prediction accuracy (top row) and C‐statistic (bottom row) plotted by signal amplitude for the training (right) and testing (left) datasets for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model. For our MCMs and SCinCRM, the C‐statistic with cure status weighting was calculated, while the regular C‐statistic was calculated for C‐Mix and non‐cure models
FIGURE A5
Results when the latency portion of the model was simulated from a generalized gamma distribution. Relative model error (left) and estimation error (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model
FIGURE A6
Results when the latency portion of the model was simulated from a generalized gamma distribution. False discovery proportion (left) and power (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model
Figures A7, A8, and A9 display the results when the true penalized incidence and latency coefficients were identical, and the unpenalized incidence and latency coefficients had the same signs. The power for SCinCRM was higher under the same-sign settings, but still lower than that of our GMIFS and E-M methods. Due to the interconnection between the two parts, the incidence power of our methods was boosted, as was the power for C-Mix and the non-cure Cox model, which have only one regression component. It is worth noting that the non-cure Cox model performed best on all metrics when the signal was strong enough under this simulation scenario. These results indicate that the non-cure Cox model may still be a good choice even when a cured fraction is present, provided the true signals in the two regression components are the same, with identical or at least strongly correlated coefficients. This is a rather limited scenario, though: in most cases, we would not expect the incidence and latency components to comprise exactly the same variables with the same effect sizes. Therefore, our two-component MCMs have clear advantages when different variables influence the cure probability and the survival of the susceptibles.
FIGURE A7
Simulation scenarios when the incidence coefficients and latency coefficients have the same signs. Cure prediction accuracy (top row) and C‐statistic (bottom row) plotted by signal amplitude for the training (right) and testing (left) datasets for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model. For our MCMs and SCinCRM, the C‐statistic with cure status weighting was calculated, while the regular C‐statistic was calculated for C‐Mix and non‐cure models
FIGURE A8
Simulation scenarios when the incidence coefficients and latency coefficients have the same signs. Relative model error (left) and estimation error (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model
FIGURE A9
Simulation scenarios when the incidence coefficients and latency coefficients have the same signs. False discovery proportion (left) and power (right) plotted by signal amplitude for incidence (top row) and latency (bottom row) for our mixture cure models (MCM) using the GMIFS and EM algorithm in comparison to SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model
APPLICATION ON AML
Data
We applied the methods to a dataset containing 816 adult AML patients who were treated on frontline CALGB/Alliance protocols. Almost all of these patients received intensive cytarabine and daunorubicin- or idarubicin-based induction treatment on CALGB/Alliance trials between 1986 and 2016. Institutional review board approval of all CALGB/Alliance protocols was obtained before any research was performed. In accordance with the Declaration of Helsinki, patients provided study-specific written informed consent to participate in treatment and companion cytogenetic (CALGB 8461), leukemia tissue bank (CALGB 9665), and molecular (CALGB 20202) studies, which involved collection of pretreatment bone marrow aspirates and blood samples. No patient received allogeneic stem cell transplantation (allo-SCT) in first complete remission (CR) on study protocols, and off-study patients who received an allo-SCT were excluded from the outcome analyses due to missing follow-up data. Only patients younger than 60 years at enrollment were included in the analysis, because for them the treatment may be effective enough to lead to cures. We also limited the sample to patients with complete baseline and demographic data.

In the context of cure, relapse-free survival (RFS) is more relevant than overall survival (OS) because, obviously, one cannot be cured of death. We thus considered the patients who had attained a CR, which left us with 452 AML patients, and defined RFS as the duration between the date of CR and relapse or death, whichever occurred first. We used this traditional definition of RFS because patients who died before relapse was documented may in fact have relapsed, so it yields a conservative estimate. The censoring rate was 34.5%. Of the 296 events observed, most patients (87.1%) experienced AML relapse, indicating that our event definition is a good proxy for relapse. We performed a hypothesis test to detect whether there was a significant nonzero cure fraction for RFS,
which was highly significant. Figure 4 depicts the RFS estimated using the K-M method. The long plateau in the RFS distribution provides empirical evidence of sufficient follow-up as well as a fraction of long-term survivors of roughly 30%. The median follow-up among those censored for RFS was 8.95 years. Besides identifying sufficient follow-up from a plateau of the K-M curve, formal statistical tests can be performed. Identifying a significant cure fraction and sufficient follow-up is the recommended first step before applying cure models to survival data; in fact, cure models may yield biased estimates when these assumptions are violated.
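As a minimal illustration of the K-M plateau diagnostic described above, the sketch below computes the product-limit estimator by hand; under sufficient follow-up, the height of the terminal plateau gives a crude estimate of the cured fraction. This toy is not the formal cure fraction or sufficient follow-up tests cited above.

```python
import numpy as np

def km_estimate(time, event):
    """Kaplan-Meier product-limit estimator.

    time  : observed times (event or censoring)
    event : 1 if the event was observed, 0 if censored
    Returns the distinct event times and the survival estimate at each.
    """
    time, event = np.asarray(time, float), np.asarray(event, int)
    uniq = np.unique(time[event == 1])        # distinct event times
    s, surv = 1.0, []
    for t in uniq:
        at_risk = np.sum(time >= t)           # risk set just before t
        d = np.sum((time == t) & (event == 1))
        s *= 1.0 - d / at_risk
        surv.append(s)
    return uniq, np.array(surv)

# With a long plateau, surv[-1] is a crude estimate of the cured fraction.
```

For instance, with events at times 1 and 2 and three later censored subjects, the curve steps to 0.8 and then 0.6 and plateaus there, so 0.6 would be the crude cured-fraction estimate.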
FIGURE 4
Relapse‐free survival for 452 AML patients younger than 60 years at enrolment treated on frontline CALGB/Alliance protocols
Along with the time-to-event response, the dataset contains gene expression levels of 35 226 RNA transcripts captured using ribosomal RNA-depleted protocols, allowing quantification of both mRNA and non-coding RNAs. We filtered out transcripts with low expression (mean value below 5), leaving the remaining transcripts for the analyses. Some baseline/demographic characteristics were also available in the dataset, including age, race, sex, European LeukemiaNet (ELN) risk group, cytogenetic abnormality (binary), white blood cell (WBC) count, hemoglobin, platelet count, percent of blasts in bone marrow and in peripheral blood, as well as mutation status for 18 known AML-associated genes. In a preliminary analysis, we performed stepwise variable selection in an MCM
among the baseline/demographic variables. A cutoff of 0.05 was used to add or remove covariates based on P-values from likelihood ratio tests. The selected baseline variables for incidence and latency were treated as unpenalized covariates, and their descriptive statistics are displayed in Table 1. The gene expression data were used as penalized covariates in the subsequent analyses. We included this initial screening among the baseline/demographic variables because we were interested in discovering genes associated with the outcome of AML patients after controlling for commonly measured variables. Generally, researchers should appeal to existing literature, prior knowledge, or expert clinical opinion when deciding whether to include unpenalized covariates versus penalizing all predictors.
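The stepwise screening described above can be sketched generically as a forward-backward loop driven by likelihood-ratio-test P-values at the 0.05 threshold. The `pval` callable below is a hypothetical stand-in for refitting the mixture cure model with and without a candidate covariate; it is not part of our actual code.

```python
def stepwise_select(candidates, pval, alpha=0.05):
    """Forward-backward stepwise selection sketch.

    candidates : list of candidate variable names
    pval       : pval(current, var, adding) -> likelihood-ratio-test
                 P-value for adding (adding=True) or removing
                 (adding=False) `var` given the `current` set; this is a
                 hypothetical stand-in for refitting the model.
    """
    selected = []
    changed = True
    while changed:
        changed = False
        # forward step: add the single best candidate below alpha
        trial = [(pval(selected, v, True), v)
                 for v in candidates if v not in selected]
        if trial:
            p, v = min(trial)
            if p < alpha:
                selected.append(v)
                changed = True
        # backward step: drop any selected variable at or above alpha
        for v in list(selected):
            rest = [u for u in selected if u != v]
            if pval(rest, v, False) >= alpha:
                selected.remove(v)
                changed = True
    return selected
```

With a toy `pval` that returns a fixed P-value per variable, the loop keeps exactly the variables below the threshold.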
TABLE 1
Descriptive statistics of unpenalized covariates in the Alliance data
Variable                         Total, N = 452
Incidence covariates
  ELN 2017, no. (%)
    Favorable                    279 (61.7%)
    Intermediate                 99 (21.9%)
    Adverse                      74 (16.4%)
  WT1 mutation, no. (%)          35 (7.7%)
Latency covariates
  Platelets, mean (SD)           73.1 (63.6)
  WBC, mean (SD)                 41.7 (49.0)
  FLT3-ITD, no. (%)              104 (23.0%)
  TET2 mutation, no. (%)         36 (8.0%)
  NRAS mutation, no. (%)         77 (17.0%)
Abbreviations: ELN 2017, European LeukemiaNet prognostic group;
FLT3‐ITD, presence of FLT3‐internal tandem duplication; SD, standard deviation; WBC, white blood cell count.
Method comparison
A data-splitting approach was used to evaluate model performance on the real data. We split the dataset into a training set and a testing set, fit the different models on the training data, and evaluated their performance on the testing data. Specifically, the C-statistic and the imputation-based AUC
described in Section 3.3 were used for evaluation, since these two metrics do not rely on the underlying true parameter values (which are unavailable in real data). Again, the C-statistic with cure status weighting in Equation (12) was calculated for our MCMs and SCinCRM, while the regular C-statistic in Equation (13) was calculated for C-Mix and the non-cure models, which have only one regression component. The data-splitting process was repeated 10 times to reflect variability.

Figure 5 displays the C-statistic and AUC results for the different methods. Our GMIFS and E-M methods achieved better results than SCinCRM and C-Mix on this application dataset. The non-cure models obtained a high C-statistic but were unable to provide the incidence-based information needed to calculate the AUC for cure prediction. We also recorded the running time for the different approaches. The non-cure Cox model was the fastest (less than 1 minute per repetition) due to the efficient implementation of the glmnet R package. C-Mix was also relatively fast (30 minutes per repetition) because of its simple model assumptions. SCinCRM was quite slow for such high-dimensional data, spending around two days on a single repetition. The other three methods (GMIFS, E-M, and the non-cure Weibull model) took a few hours each, among which the E-M algorithm was the fastest.
FIGURE 5
Boxplots of C‐statistics (left) and AUC (right) from the testing datasets from 10 repeats of a data‐splitting approach to evaluate the model performance on the AML dataset. Performance was compared among our mixture cure models (MCM) using the GMIFS and EM algorithm, SCinCRM, C‐Mix, a non‐cure Weibull, and a non‐cure Cox model. For our MCMs and SCinCRM, the C‐statistic with cure status weighting was calculated, while the regular C‐statistic was calculated for C‐Mix and non‐cure models
Gene discovery
In this section, we fit the models using the full filtered dataset. Before combining with the model-X knockoffs, the estimates for the unpenalized coefficients in the model were obtained using GMIFS or E-M and are reported in Table A1. The estimates from GMIFS and E-M were very similar, and their magnitudes and signs were mostly consistent with previous findings. For example, considering the coefficients for the ELN risk group, which stratifies AML patients into genetic-risk categories using selected cytogenetic abnormalities and gene mutations, subjects in the Intermediate and Adverse risk groups had higher probabilities of being susceptible than those in the Favorable (reference) group.
TABLE A1
Estimated unpenalized covariates using GMIFS and E‐M
Incidence                                    Latency
Covariate                GMIFS    E-M        Covariate    GMIFS    E-M
ELN 2017 Intermediate     0.62   0.64        Platelets    −0.26   −0.36
ELN 2017 Adverse          0.92   0.94        WBC           0.16    0.18
WT1                       0.29   0.29        FLT3-ITD      0.23    0.22
                                             TET2         −0.12   −0.12
                                             NRAS          0.16    0.16
Abbreviations: ELN 2017, European LeukemiaNet prognostic group;
FLT3‐ITD, presence of FLT3‐internal tandem duplication; SD, standard deviation; WBC, white blood cell count.
The model‐X knockoffs framework was then combined with different methods for controlled variable selection. Since the gene expression values typically do not strictly follow a Gaussian distribution, we used the deep knockoff machine
to generate the knockoff copies. With the target FDR level set at 20% for both incidence and latency coefficients, the numbers of selected genes for the different methods are displayed in the first two rows of Table 2. The specific selected genes are presented as Venn diagrams in Figures A10 and A11. Since GMIFS was the only method that successfully identified important genes in both regression parts, we focus on the GMIFS findings. Tables 3 and 4 report the genes selected by GMIFS for incidence and latency, respectively, along with brief descriptions of known associations with AML or oncology in general. Some of these genes are known oncogenes or tumor suppressor genes, and some have been shown to be differentially expressed in AML or other cancers.
TABLE 2
Number of selected genes for models fit using the Alliance data and validation performance using three other publicly available AML datasets
                           MCM (GMIFS)  MCM (E-M)  SCinCRM  C-Mix  Non-cure Weibull  Non-cure Cox
# selected in incidence         14          0         0       29          -                -
# selected in latency           12          9        16        -          0               39
GSE37642  AUC                0.944          -         -    0.961          -                -
GSE37642  C-index            0.679      0.660     0.606    0.712          -            0.849
GSE12417  AUC                    1          -         -        1          -                -
GSE12417  C-index            0.844      0.668     0.849    0.783          -            0.889
TCGA      AUC                0.917          -         -        1          -                -
TCGA      C-index            0.751      0.691     0.715    0.744          -            0.808
FIGURE A10
Venn diagram displaying the overlap with respect to genes identified as associated with incidence when using our mixture cure GMIFS algorithm and C‐mix
FIGURE A11
Venn diagram displaying the overlap with respect to genes identified as associated with latency when using our mixture cure GMIFS algorithm, mixture cure EM algorithm, SCinCRM, and a non‐cure Cox model
TABLE 3
Fourteen selected genes in the incidence regression part using GMIFS combined with the model‐X knockoffs framework
Gene      Description
COQ8A     SNP associated with leukocyte count in GWAS studies52
EIF4E3    Tumor suppressor, reduced expression in AML specimens53
FAM30A    High expression was an adverse risk factor in AML54
TNPO1     Potential direct target of mixed lineage leukemia fusion proteins55
PEX2      Overexpressed in hepatocellular carcinoma (HCC) tissues56
KDM2B     Oncogene in diverse cancers, inducing cell proliferation57
SRSF2     Mutations found in AML, associated with inferior RFS58
RUNX2     Homolog of RUNX1, a central player in hematopoiesis59
ADA       Increased levels associated with short survival in AML60
MXD1      Tumor suppressor, expression affected AML survival61
IDS       Mutations lead to Hunter syndrome52
STK17B    High expression associated with apoptosis and poor OS62, 63
PFDN5     Candidate for a tumor suppressor gene64
TAF9B     Essential for cell viability, associated with apoptosis52
Abbreviation: GWAS, genome‐wide association study.
TABLE 4
Twelve selected genes in the latency regression part using GMIFS combined with the model‐X knockoffs framework
Gene       Description
CALCOCO2   Up-regulated in high-risk myelodysplastic syndrome65
TAX1BP1    Protects from liver cancer development66
LMO2       Oncogene of T-cell acute lymphoblastic leukemia67
RNU5B-1    Differentially expressed in various cancers68, 69
H2BC21     Expression associated with AML relapse and prognosis70
EREG       Promotes progression of various cancers71
ABCC4      Regulates leukemia cell proliferation and differentiation72
CAMK2G     Suppresses differentiation and stimulates proliferation73
PLXNC1     Part of immunological signature predictive of prognosis74
PITRM1     Regulating mitochondrial function in AML75
TMEM50A    Highly up-regulated in late stage cervical cancer76
TMCO3      Up-regulated in chronic lymphocytic leukemia77
Validation of the identified genes
Independent datasets were employed to validate our selected genes, including two datasets from Gene Expression Omnibus (GEO) and one from The Cancer Genome Atlas (TCGA).
It is worth noting that there are two major disparities between our dataset and the validation datasets. First, relapse-free survival data were not available in these independent datasets, so overall survival was used as a proxy. Second, the validation datasets included patients who received allo-SCT, while our dataset included only patients treated with intensive chemotherapy. The first GEO dataset (GSE37642) includes gene expression for 136 AML patients treated in the German AMLCG 1999 trial. The gene expression levels were measured using the Affymetrix HG-U133Plus2 GeneChip, so each observation corresponded to a probe set rather than an RNA transcript as in the Alliance data. Thus, the predictors in the original and validation datasets are distinct, and the coefficients cannot be applied directly. As a workaround, we linked RNA transcripts and probe sets through their corresponding genes and included in the Weibull MCM all probe sets that mapped to our selected transcripts in Tables 3 and 4.

Based on the estimated coefficients, we assigned the 136 subjects to subgroups: cured vs susceptible, and low vs high risk of death. The predicted cure status was determined by the estimated incidence coefficients (see Section 3.3), and the high/low-risk group was determined by the estimated latency coefficients, with a cutoff of 0 applied to the linear predictor computed from centered covariates. The left panel of Figure 6 presents the K-M estimates for the predicted cured (N = 32) vs uncured (N = 104) groups, and the middle panel presents the K-M estimates for subjects with lower risk (N = 52) vs higher risk (N = 52) among those predicted to be uncured. As expected, the predicted cured group had a survival probability of 1, while the predicted uncured group had an estimated survival function that descended toward 0. The two risk groups among those predicted to be susceptible were well separated according to a log-rank test.
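The subgroup assignment described above can be sketched as follows; the centering convention, cutoffs, and function name are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def assign_risk_groups(X, b_latency, p_uncured, p_cut):
    """Assign validation subjects to subgroups (illustrative helper).

    X         : covariate matrix (columns are centered before scoring)
    b_latency : estimated latency coefficients
    p_uncured : estimated probability of being susceptible, per subject
    p_cut     : cured-proportion cutoff applied to p_uncured
    """
    Xc = X - X.mean(axis=0)            # center covariates
    lp = Xc @ b_latency                # latency linear predictor
    susceptible = p_uncured > p_cut    # predicted uncured vs cured
    high_risk = lp > 0                 # risk split at a cutoff of 0
    return susceptible, high_risk
```

Subjects with a positive centered linear predictor fall in the higher-risk group; the susceptibility split is applied independently via the incidence probabilities.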
FIGURE 6
Using GSE37642 as a validation dataset and extracting Affymetrix probe sets that mapped to our selected genes from our training phase to assign subjects into groups based on our MCM, Kaplan‐Meier estimates for predicted cured (N = 32) vs uncured (N = 104) groups (left panel), Kaplan‐Meier estimates for subjects with lower risk (N = 52) vs higher risk (N = 52) among those that were predicted to be uncured (middle panel), and re‐scaled latency survival functions for all 136 subjects with lower (N = 71) vs higher (N = 65) risk (right panel)
The above estimation method for latency survival probabilities includes only the subjects predicted to be uncured and thus relies on accurate prediction of each individual's cure status. A recent book on cure models proposed an alternative way to estimate latency. In this method, the latency is estimated using information from all subjects and depends only on the cured proportion, which can be estimated easily; as a result, the latency regression part can be evaluated independently of the incidence part. Specifically, the K-M estimates are rescaled as (S(t) − π)/(1 − π), where S(t) is the original K-M value and π is the estimated cured proportion, so that the latency survival probabilities range from 0 to 1. The right panel of Figure 6 displays the rescaled latency survival functions for all 136 subjects with lower (N = 71) vs higher (N = 65) risk. The survival curves were almost identical to those in the middle panel of Figure 6, which included only the predicted uncured subjects, supporting the relevance of our latency predictors as well as the accuracy of our cure status prediction.

Similar validation results for the other two datasets (GSE12417 and TCGA) are shown in Figures A12 and A13. Due to the small sample sizes and potentially insufficient follow-up of these two datasets, the predicted cure status was not as accurate as in GSE37642, but well-separated risk groups could still be observed in both datasets. Given the small sample sizes of the validation datasets, we did not further split the data into training and testing sets, and thus the predictive performance may have been overestimated due to overfitting. However, since the main focus of the validation process is gene selection rather than model fitting, we believe the current process serves its purpose well. We validated the relevance and importance of our concise gene list by using only the expression levels of these genes to predict cure status and survival probabilities in the new datasets. The results indicate that our identified gene list is useful in differentiating patient subgroups and may serve as a prognostic signature to better inform treatment decisions for AML patients with different genomic characteristics.
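The latency rescaling described above is a one-liner in practice; the sketch below (with a clip to [0, 1] added for numerical safety, an assumption on our part) maps a marginal K-M curve to a latency survival curve given an estimated cured proportion π.

```python
import numpy as np

def rescale_latency(km_surv, cured_prop):
    """Rescale a marginal K-M survival curve to a latency survival curve.

    Applies S_u(t) = (S(t) - pi) / (1 - pi), where pi is the estimated
    cured proportion, so the rescaled curve runs from 1 toward 0 among
    the susceptible subjects; values are clipped to [0, 1] for safety.
    """
    s = (np.asarray(km_surv, float) - cured_prop) / (1.0 - cured_prop)
    return np.clip(s, 0.0, 1.0)
```

For a K-M curve plateauing at the cured proportion, the rescaled curve starts at 1 and reaches 0 at the plateau, as the text describes.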
FIGURE A12
Using GSE12417 as a validation dataset and extracting Affymetrix probe sets that mapped to our selected genes from our training phase to assign subjects into groups based on our MCM, Kaplan‐Meier estimates for predicted cured (N = 29) vs uncured (N = 50) groups (left panel), Kaplan‐Meier estimates for subjects with lower risk (N = 28) vs higher risk (N = 22) among those that were predicted to be uncured (middle panel), and rescaled latency survival functions for all 79 subjects with lower (N = 39) vs higher (N = 40) risk (right panel)
FIGURE A13
Using TCGA as a validation dataset, Kaplan-Meier estimates for predicted cured (N = 25) vs uncured (N = 70) groups (left panel), Kaplan-Meier estimates for subjects with lower risk (N = 34) vs higher risk (N = 36) among those that were predicted to be uncured (middle panel), and rescaled latency survival functions for all 95 subjects with lower (N = 45) vs higher (N = 50) risk (right panel)
We also performed the same validation procedure for the genes selected by the other approaches. For each method, we fit the corresponding model on each validation dataset, including only the genes selected by that method, and calculated the AUC and C-index for the fitted model; these are summarized in Table 2. Because the non-cure Weibull model did not select any features with the FDR controlled, it was excluded from the validation. Our E-M approach and SCinCRM both selected genes only in the latency part, so the incidence-based AUC could not be calculated. C-Mix achieved fairly good AUC and C-index scores, which is partly due to the long list of genes it selected in the incidence part; the same applies to the latency part of the non-cure Cox model. Overall, our GMIFS method is strongly recommended, as it selected succinct lists of important genes for both regression components and had good predictive performance on independent datasets.
DISCUSSION AND CONCLUSION
In this article, we proposed penalized Weibull MCMs to model censored survival outcomes in the presence of a cure fraction for high‐dimensional data. The models allow us to estimate the effects of covariates on both the probability of cure and the time‐to‐event of the susceptibles simultaneously. Two estimation algorithms (GMIFS and E‐M) have been adapted to fit these models and combined with the model‐X knockoffs framework for FDR control. The simulation results demonstrate that our proposed methods outperformed competing approaches in terms of variable selection, estimation, and prediction. In variable selection, we achieved high power with the FDR controlled after adopting the knockoffs framework. In the AML application, important genes were identified to be associated with cure and/or survival, which may have important implications for AML research or clinical practice paradigms.

One shortcoming of our study is that the incidence power is relatively low according to our simulation results. Although the incidence part has the form of a logistic regression, it is essentially a more difficult problem than a typical logistic regression: unlike a regular binary problem, the response is hidden in the data and not explicitly observed. Moreover, due to censoring and a low cure rate, the effective sample size can be relatively small compared with the number of predictors, resulting in a poor signal‐to‐noise ratio. Any method that effectively boosts the incidence power would be interesting to explore in the future. One potential weakness in our application study is that we assumed all AML patients who died prior to relapse died due to unobserved relapse, and we used the death time as a surrogate for the relapse time in this case. This assumption may not hold in all cases, because some patients may have died of unrelated causes, such as accidents.
If we strictly consider relapse as the event of interest, then death is a competing risk for relapse, and competing risk models may be of interest. Two semi‐parametric regression models for competing risks data with a cured fraction based on finite‐mixture models were introduced by Peng and Yu. In the future, we are interested in generalizing the proposed methods to penalized semi‐parametric PH MCMs and to other parametric MCMs incorporating distributions in the generalized gamma and generalized F families. Penalties other than the LASSO, including the MCP and the elastic‐net penalty, are interesting directions to pursue. Beyond penalized regression models, machine learning techniques, including random survival forests and gradient boosting machines, in combination with the model‐X knockoffs framework are promising approaches to high‐dimensional variable selection in cure models. These techniques have been shown to be useful for controlled variable selection in other outcome types, including continuous, binary, and ordinal responses. A previous paper applied bagging survival trees to cure models based on the promotion time cure framework, but it provided variable importance scores only for latency covariates. How to extract variable importance and perform variable selection for both incidence and latency using machine learning techniques warrants further investigation.
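The knockoff selection step that underlies the FDR control discussed above reduces, once a feature statistic W_j has been computed for each variable, to a simple data‐dependent threshold. A minimal sketch of that final step (knockoff+ version with offset 1; function names ours):

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t such that
    (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_est = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_est <= q:
            return t
    return np.inf  # no threshold achieves the target FDR

def knockoff_select(W, q=0.1, offset=1):
    """Indices of variables whose statistics clear the threshold."""
    return np.where(W >= knockoff_threshold(W, q, offset))[0]
```

Large positive W_j (original feature beats its knockoff) drives selection, while negative W_j values estimate the number of false discoveries at each candidate threshold.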
FUNDING INFORMATION
Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM013879. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Research reported in this publication was also supported in part by the National Cancer Institute of the National Institutes of Health under Award Numbers U10CA180821, U10CA180882, U24CA196171, UG1CA233180, UG1CA233331, UG1CA233338, R35CA197734, P30CA016058; the Coleman Leukemia Research Foundation; ASH Junior Faculty Scholar Award and ASH Bridge Grant (A‐KE), Leukemia Research Foundation (A‐KE), Leukemia & Lymphoma Society (A‐KE); The D Warren Brown Foundation; and by an allocation of computing resources from The Ohio Supercomputer Center and Shared Resources (Leukemia Tissue Bank). Support to Alliance for Clinical Trials in Oncology and Alliance Foundation Trials programs is listed at https://acknowledgments.alliancefound.org.