Kyle Ferber1, Kellie J Archer1. 1. Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA.
Abstract
Researchers have recently shown that penalized models perform well when applied to high-throughput genomic data. Previous researchers introduced the generalized monotone incremental forward stagewise (GMIFS) method for fitting overparameterized logistic regression models. The GMIFS method was subsequently extended by others for fitting several different logit link ordinal response models to high-throughput genomic data. In this study, we further extended the GMIFS method for ordinal response modeling using a complementary log-log link, which allows one to model discrete survival data. We applied our extension to a publicly available microarray gene expression dataset (GSE53733) with a discrete survival outcome. The dataset included 70 primary glioblastoma samples from patients of the German Glioma Network with long-, intermediate-, and short-term overall survival. We tested the performance of our method by examining the prediction accuracy of the fitted model. The method has been implemented as an addition to the ordinalgmifs package in the R programming environment.
Researchers have recently shown that penalized models perform well when applied to high-throughput genomic data. Previous researchers introduced the generalized monotone incremental forward stagewise (GMIFS) method for fitting overparameterized logistic regression models. The GMIFS method was subsequently extended by others for fitting several different logit link ordinal response models to high-throughput genomic data. In this study, we further extended the GMIFS method for ordinal response modeling using a complementary log-log link, which allows one to model discrete survival data. We applied our extension to a publicly available microarray gene expression dataset (GSE53733) with a discrete survival outcome. The dataset included 70 primary glioblastoma samples from patients of the German Glioma Network with long-, intermediate-, and short-term overall survival. We tested the performance of our method by examining the prediction accuracy of the fitted model. The method has been implemented as an addition to the ordinalgmifs package in the R programming environment.
Frequently in high-throughput genomic research, we want to fit a statistical model using gene expression data in order to predict a future outcome. This has become a modern challenge for statisticians because there are far more features or variables (p) than samples (n). This is an obstacle in two regards. First, the design matrix will not be full-rank. Thus, there is an infinite number of solutions to the system of equations. Even small perturbations in the data will lead to large fluctuations in the coefficient estimates. Second, given the vast interrelatedness of genes, collinearity is likely to be a problem. Collinear predictors further contribute to the instability of the parameter estimates. Recently, penalization (also referred to as regularization) has stood out as an effective method to combat these two issues. There are several popular penalization methods, but the defining characteristic of them all is that they introduce bias into the parameter estimates in exchange for a reduction in variance. In many cases, penalization improves the model’s predictive accuracy and, relatedly, reduces the mean squared error (MSE) of the parameter estimates.1 In cases where model parsimony and interpretability are important, the least absolute shrinkage and selection operator (LASSO) penalization method is effective as it shrinks many parameter estimates to be exactly zero.2 The generalized monotone incremental forward stagewise (GMIFS) method is an algorithm that can be used in logistic regression to produce a monotone LASSO solution.3 The GMIFS method was subsequently extended by Archer et al for fitting several different logit link ordinal response models to high-throughput genomic data4 including the cumulative logit, forward continuation ratio (CR), backward CR, stereotype logit, and adjacent category models. Herein we describe the GMIFS algorithm for ordinal response modeling using a complementary log-log (cloglog) link, which is useful for discrete survival Modeling. Therefore, in the Discrete Survival Analysis Section, we describe the model formulation for modeling a discrete survival outcome. In the GMIFS Method for Ordinal Response Modeling section, we present the GMIFS method for the forward CR model using the cloglog link. Next, in the Application section, we discuss the motivating dataset that examined survival in glioblastoma (GBM) patients. The Results section examines the model performance in terms of parsimony, resubstitution error, and cross-validation (CV) error. Finally, in the Conclusion we provide concluding remarks, including limitations of the study.
Discrete Survival Analysis
Survival analysis encompasses methods in which the outcome variable is time to event (eg, time to death, disease relapse, etc.). The particular method used in the analysis will depend on the scale of the survival times collected. Ideally, these will be measured on a continuous scale, but sometimes for a variety of reasons, researchers only collect times on a discrete scale. For instance, for many diseases, it is impossible to record the precise date and time of relapse (ie, a continuous measurement) because the needed data are often only collected at a physician visit. Thus, we are forced to work with discrete times. Furthermore, discrete times are used when the latent scale of the response times is discrete.
High dimensional discrete survival data
Assume there are n independent subjects (i = l, 2, 3,…, n) and p features per subject, where p >> n. Because this design matrix will be singular, traditional statistical methods (eg, OLS) are not applicable. The data are often presented as follows:Let Y represent the discrete survival time response variable that takes on the values (j = 1, 2…, K), where K is the largest value of Y observed.To facilitate the formation of the likelihood, we define an n × K response matrix as follows:A p × 1 vector of covariates, , is observed for each subject.
The forward CR model with a complementary log-log link function
With discrete survival data, we are generally interested in modeling the discrete hazard rate defined asThis is also the form of a probability modeled by a forward CR model. Furthermore, if it is reasonable to assume that the data were generated by a continuous-time proportional hazards model, then we use the complementary log-log (cloglog) link function,5Here α represents the intercept, or threshold, for the jth class. Notice that α is the only component of the model that depends on time. Thus, the functions for the K time points are parallel, which implies we are assuming proportional hazards.
Likelihood
We define the likelihood as a product of n conditionally independent Bernoulli random variables,6 where π is the discrete hazard rate and (1 − π) is the conditional complement of π given by
for the forward CR model.Now define π(π1
π2
πn). When using the cloglog link, the derivative of the log-likelihood is then given byWe use the generalized monotone incremental forward stagewise algorithm to solve for the penalized solution:The tuning parameter, λ, controls the amount of shrinkage. As λ increases, the number of parameter estimates that will be shrunk to zero also increases. Using these coefficient estimates and the estimates for the α’s (described later), we can recursively estimate the probability that subject i belongs to class j whereSubject i is then classified to the class that corresponds to the maximum class-specific probability.
GMIFS Method for Ordinal Response Modeling
The incremental forward stagewise (IFS) method is an iterative algorithm that produces a penalized solution for a linear regression model.3 The GMIFS method is an extension of IFS capable of fitting overparameterized logistic regression models.3 The GMIFS algorithm was extended by Archer et al (2014) for fitting several different logit link ordinal response models to high-throughput genomic data.4 We updated this method to allow for the use of a complementary log-log link function. The steps of the GMIFS algorithm for ordinal response modeling are as follows4:Enlarge the predictor space as
where represents the standardized predictors.Initialize the α’s to their empirical values. For the forward CR model with a cloglog link, these are initialized asFor step s = 0, initialize the components ofFind
at the current estimate
.Update
.Estimate the α’s by maximum likelihood, treating
(from step 5) as fixed.Repeat steps 4 to 6 until the difference between two successive log-likelihoods is smaller than a pre-specified tolerance, τ.The rationale for enlarging the predictor space is that it allows us to avoid taking the second derivative of the log-likelihood. Once the algorithm has converged, we can obtain the penalized solution by
Furthermore, in step 5, is a small incremental value; we used 0.001 in our analysis.
Application
Glioblastoma
Glioblastomas (GBMs) are highly malignant and aggressive tumors that arise from the supportive tissue of the brain. Among all primary brain and central nervous system (CNS) tumors, they are the second most common after meningiomas, which are predominantly benign, and the five-year survival rate for GBM patients is less than 4%.7 Aside from the aggressiveness of the tumors, one possible explanation for the low survival rate is that GBMs are rare in young people; the median age at diagnosis is 64, and the age group with the highest incidence rate is 75–84 year olds.7 Treatment involves surgical removal of as much of the tumor as is safely possible followed by radiotherapy and/or chemotherapy.8 The Cancer Genome Atlas (TCGA) Research Network revealed a subtype of GBM related to the mRNA expression and methylation of a set of genes that affects young adults and has an increased survival rate. Researchers also discovered four molecular subtypes of GBM that have unique responses to treatment and gene mutations that could lead GBMs to become resistant to therapy after a standard chemotherapy treatment.9,10 These findings highlight the importance of genomic research in the study and treatment of GBM.
Data
We downloaded the raw CEL files for GSE53733 from Gene Expression Omnibus.11 The investigators used Affymetrix HG-U133 v2.0 GeneChips to measure gene expression from patients’ tumor samples taken from their initial operation. In the dataset, there were n = 70 GBM patients, of which 16 had an overall survival (OS) of less than 12 months, 31 patients had an OS between 12 and 36 months, and 23 patients had an OS greater than 36 months.12 The patients’ survival times were reported by the investigator as short-, intermediate-, and long-term OS. There were p = 54,613 features per subject in the CEL files after excluding control probe sets. However, after processing the data to remove probe sets with MAS5 present calls in <30% of the subjects,13 31,744 features remained. Furthermore, a 3′:5′ ratio much different from 1 for the housekeeping gene glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is associated with poor cDNA and cRNA quality.14 Thus, we removed one subject with a 3′:5′ GAPDH ratio greater than 3, leaving us with 69 subjects. We then used the RMA method to obtain probe set expression summaries for our statistical analysis.15 Afterward, we fit a forward CR model using the cloglog link with ε = 0.001 and τ = 0.00001.
Results
After the GMIFS algorithm converged, we examined two models: (a) the model selected my minimizing the AIC criterion and (b) the model resulting from the convergence of the GMIFS algorithm (Fig. 1). Using the full dataset, the AIC-selected model misclassified 10 of the 69 patients, while the converged model only misclassified one patient (Tables 1 and 2). However, the AIC-selected model was more parsimonious with 25 non-zero coefficients, while the converged model contained 46 non-zero coefficients. The 25 probe sets that had non-zero coefficient estimates in the AIC-selected model are shown in Table 3. Furthermore, for each model, we examined the sensitivity and specificity for diagnosing short-term survival as well as the sensitivity and specificity for diagnosing short- or intermediate-term survival (Tables 4 and 5). Among the probe sets with non-zero coefficient estimates, the one with the largest absolute coefficient estimate in both models (among probe sets with known gene symbols) was designed to interrogate HD Domain Containing 2 (HDDC2). Long-term survivors had higher HDDC2 expression levels than short-and intermediate-term survivors (Fig. 2). This result agrees with another GBM study that showed that HDDC2 was significantly downregulated in short-term survivors compared to long-term survivors.12 There was also a clear positive relationship between Nucleoside-Triphosphatase, Cancer-Related (NTPCR) expression and survival time (Fig. 3). Interestingly, researchers have shown that NTPCR is overexpressed in neuroblastomas,16 but no study has associated NTPCR with GBM. Additionally, one of the probe sets that had a non-zero coefficient estimate was designed to interrogate the gene PDZ and LIM domain 4 (PDLIM4), which has previously been studied in association with gliomas. Researchers examined the expression of the gene at the protein level for patients with gliomas and discovered that the median OS for patients with high levels of the protein (PDLI4) was significantly shorter than for patients with low protein levels.17 We compared the mean log2 expression levels of PDLIM4 for patients with short-, intermediate-, and long-term OS using Welch’s T-test. Patients with short-term OS had significantly higher expression levels than patients with long-term OS at the Bonferroni adjusted
significance level (P = 0.0019), and patients with intermediate-term OS also had significantly higher expression levels than patients with long-term OS (P < 0.0001). The difference in mean expression levels between those with short-term OS and those with intermediate-term OS was not significant. Thus, it appears that both the gene expression levels and the protein levels of the gene are lower for patients who survive longer.
Figure 1
Coefficient paths for our forward CR model using a complementary log-log link.
Notes: The first vertical dashed line signifies the step in the algorithm when the AIC was minimized. The second vertical dashed line marks the step when the algorithm converged.
Table 1
AIC-selected model cross-tabulation of the observed versus the predicted class using the full dataset.
Observed
Short
Intermediate
Long
Short
10
0
0
Predicted
Intermediate
6
31
4
Long
0
0
18
Table 2
Converged model cross-tabulation of the observed versus the predicted class using the full dataset.
Observed
Short
Intermediate
Long
Short
16
0
0
Predicted
Intermediate
0
31
1
Long
0
0
21
Table 3
Probe sets with non-zero coefficient estimates in the AIC and converged models.
Glioma,17 acute myeloid leukemia,26 Prostate cancer,27 breast cancer28
218669_at
57826
RAP2C
X
0.019
0.036
Acute lymphoblastic leukemia29
1561759_at
645513
LOC645513
4
0.049
0.062
1559283_a_at
285888
CNPY1
7
0.062
0.160
221900_at
1296
COL8A2
1
0.064
0.109
203184_at
2201
FBN2
5 9
0.089
0.179
Colorectal cancer30
234547_at
<NA>
<NA>
<NA>
0.221
0.231
229146_at
136895
C7orf31
7
0.242
0.295
Table 4
AIC-selected model sensitivity and specificity for predicting short-term survival and for predicting short- or intermediate-term survival.
OUTCOME
SENSITIVITY
SPECIFICITY
Short-term survival
63
100
Short- or intermediate-term survival
100
82
Table 5
Converged model sensitivity and specificity for predicting short-term survival and for predicting short- or intermediate-term survival.
OUTCOME
SENSITIVITY
SPECIFICITY
Short-term survival
100
100
Short- or intermediate-term survival
100
95
Figure 2
Boxplot of 203260_at (HDDC2) log2 expression by discrete OS outcome (short-term, intermediate, long-term survival).
Note: *Significant at the Bonferroni adjusted
significance level.
Figure 3
Boxplot of 226813_at (NTPCR) log2 expression by discrete OS outcome (short-term, intermediate, long-term survival).
Note: *Significant at the Bonferroni adjusted
significance level.
A common critique of a model fitted from high-dimensional data is that the final model, even if selected by minimizing AIC, is not parsimonious. In this example, critics may say that given a sample size of 69 subjects, including 25 coefficients in the model is overfitting, and that the model performance is likely a result of chance. In response, we fit two additional models whose performances will be a result of chance alone. First, we fit a model with the same gene-expression data used in our example, but we randomly permuted the response vector. Next, we fit a model using our original response vector, but instead of using the gene expression data, we used a design matrix filled with 31,744 × 69 = 2,190,336 random variables generated from a Gaussian distribution with a mean and standard deviation equal to the corresponding sample statistics of the gene expression data. If we exclude regions of underfitting and overfitting, the model fit with the gene expression data and the original response vector had better performance than the other two models whose performances are a result of chance rather than a relationship between the features and the response (Fig. 4).
Figure 4
Plot of model misclassification rate by number of variables included in the model for the original GSE53733 data (red line), GSE53733 data with a permuted response (blue line), and Gaussian random variables (green line).
We also performed N-fold (or leave-one-out) CV to assess the generalizability of our models (where N = 69). Both the AIC-selected model and the converged model had an N-fold CV error rate of about 44.9% (Tables 6 and 7). Thus, it appears that the AIC-selected model and the model that satisfied the GMIFS convergence criterion predict discrete survival time equally well. We chose the AIC-selected model as our final model as it is more parsimonious and therefore more interpretable.
Table 6
N-fold CV: AIC-selected model cross-tabulation of the observed versus the predicted class.
Observed
Short
Intermediate
Long
Short
1
1
0
Predicted
Intermediate
14
30
15
Long
1
0
7
Table 7
N-fold CV: converged model cross-tabulation of the observed versus the predicted class.
Observed
Short
Intermediate
Long
Short
4
3
1
Predicted
Intermediate
12
27
14
Long
0
1
7
Conclusion
GBM is a particularly dangerous tumor with a low survival rate. A specific and accurate prognosis would be very useful to both the patient and the oncologist. Thus, we were interested in predicting survival time based on a patient’s genomic feature data. We used discrete times because the investigators of this particular GBM study reported discrete times. Another case when discrete survival times would be used is when the outcome of interest (eg, disease relapse) can only be assessed at physician visits. The GMIFS algorithm is an effective method for building a classifier for an ordinal response outcome given a high-dimensional covariate space. In this case, we fit a forward CR model with a complementary loglog link function to model discrete survival time. The model resulting from the convergence of the algorithm had only a 1.4% resubstitution error. Using N-fold CV, the model had a 44.9% misclassification rate, significantly better than chance (66% misclassification rate for a three-class outcome), but there is room for improvement. For example, although our method performs automatic variable selection, improvement gains in classification accuracy may be achieved by reducing the dimensionality of the feature set in a meaningful way prior to model fitting. We plan to explore this topic in a follow-up paper. Furthermore, a more accurate classifier could be built with more information. For instance, the five-year survival rate for patients diagnosed between the ages of 0 and 19 is around 19%, while the five-year survival rate for patients diagnosed between the ages of 45 and 54 is only about 3.3%.7 Additionally, age was significantly different across the three outcome classes in this study12 but was not made available in the data. Thus, including age as an unpenalized predictor in our model would likely improve its predictive accuracy (the ordinalgmifs R package allows the user to select a subset of predictors that will not be penalized in the GMIFS algorithm31). Also, Karnofsky performance status and extent of surgical resection are known prognostic factors for GBM,32 so they could have been effective unpenalized predictors as well (despite the fact that these two variables were not significantly different across the three classes in this study12). Additionally, a specific month of death would have provided more information than a range of months. However, for each discrete time value, we would need enough subjects with that response to fit a reliable model. As the price of microarray experiments decreases, we will have greater access to datasets with a larger number of subjects, making this a reasonable expectation for the future.
Authors: Catherine I Dumur; Suhail Nasim; Al M Best; Kellie J Archer; Amy C Ladd; Valeria R Mas; David S Wilkinson; Carleton T Garrett; Andrea Ferreira-Gonzalez Journal: Clin Chem Date: 2004-09-13 Impact factor: 8.327
Authors: Seok Jin Kim; Insuk Sohn; In-Gu Do; Sin Ho Jung; Young Hyeh Ko; Hae Yong Yoo; Soonmyung Paik; Won Seog Kim Journal: Ann Hematol Date: 2013-08-24 Impact factor: 3.673
Authors: Cameron W Brennan; Roel G W Verhaak; Aaron McKenna; Benito Campos; Houtan Noushmehr; Sofie R Salama; Siyuan Zheng; Debyani Chakravarty; J Zachary Sanborn; Samuel H Berman; Rameen Beroukhim; Brady Bernard; Chang-Jiun Wu; Giannicola Genovese; Ilya Shmulevich; Jill Barnholtz-Sloan; Lihua Zou; Rahulsimham Vegesna; Sachet A Shukla; Giovanni Ciriello; W K Yung; Wei Zhang; Carrie Sougnez; Tom Mikkelsen; Kenneth Aldape; Darell D Bigner; Erwin G Van Meir; Michael Prados; Andrew Sloan; Keith L Black; Jennifer Eschbacher; Gaetano Finocchiaro; William Friedman; David W Andrews; Abhijit Guha; Mary Iacocca; Brian P O'Neill; Greg Foltz; Jerome Myers; Daniel J Weisenberger; Robert Penny; Raju Kucherlapati; Charles M Perou; D Neil Hayes; Richard Gibbs; Marco Marra; Gordon B Mills; Eric Lander; Paul Spellman; Richard Wilson; Chris Sander; John Weinstein; Matthew Meyerson; Stacey Gabriel; Peter W Laird; David Haussler; Gad Getz; Lynda Chin Journal: Cell Date: 2013-10-10 Impact factor: 41.582
Authors: Donkena Krishna Vanaja; Michael E Grossmann; John C Cheville; Mozammel H Gazi; Aiyu Gong; Jin San Zhang; Katalin Ajtai; Thomas P Burghardt; Charles Y F Young Journal: Cancer Invest Date: 2009-03 Impact factor: 2.176
Authors: H Lilljebjörn; M Heidenblad; B Nilsson; C Lassen; A Horvat; J Heldrup; M Behrendtz; B Johansson; A Andersson; T Fioretos Journal: Leukemia Date: 2007-08-09 Impact factor: 11.528
Authors: Han Fu; Deedra Nicolet; Krzysztof Mrózek; Richard M Stone; Ann-Kathrin Eisfeld; John C Byrd; Kellie J Archer Journal: Stat Med Date: 2022-07-06 Impact factor: 2.497