Literature DB >> 25861216

Modeling discrete survival time using genomic feature data.

Abstract

Researchers have recently shown that penalized models perform well when applied to high-throughput genomic data. Previous researchers introduced the generalized monotone incremental forward stagewise (GMIFS) method for fitting overparameterized logistic regression models. The GMIFS method was subsequently extended by others for fitting several different logit link ordinal response models to high-throughput genomic data. In this study, we further extended the GMIFS method for ordinal response modeling using a complementary log-log link, which allows one to model discrete survival data. We applied our extension to a publicly available microarray gene expression dataset (GSE53733) with a discrete survival outcome. The dataset included 70 primary glioblastoma samples from patients of the German Glioma Network with long-, intermediate-, and short-term overall survival. We tested the performance of our method by examining the prediction accuracy of the fitted model. The method has been implemented as an addition to the ordinalgmifs package in the R programming environment.

Entities: CellLine Chemical Disease Gene Species

Keywords: R; classification; gene expression; ordinal response; survival analysis

Year: 2015 PMID： 25861216 PMCID： PMC4360712 DOI： 10.4137/CIN.S17275

Source DB: PubMed Journal: Cancer Inform ISSN： 1176-9351

Introduction

Frequently in high-throughput genomic research, we want to fit a statistical model using gene expression data in order to predict a future outcome. This has become a modern challenge for statisticians because there are far more features or variables (p) than samples (n). This is an obstacle in two regards. First, the design matrix will not be full-rank. Thus, there is an infinite number of solutions to the system of equations. Even small perturbations in the data will lead to large fluctuations in the coefficient estimates. Second, given the vast interrelatedness of genes, collinearity is likely to be a problem. Collinear predictors further contribute to the instability of the parameter estimates. Recently, penalization (also referred to as regularization) has stood out as an effective method to combat these two issues. There are several popular penalization methods, but the defining characteristic of them all is that they introduce bias into the parameter estimates in exchange for a reduction in variance. In many cases, penalization improves the model’s predictive accuracy and, relatedly, reduces the mean squared error (MSE) of the parameter estimates.1 In cases where model parsimony and interpretability are important, the least absolute shrinkage and selection operator (LASSO) penalization method is effective as it shrinks many parameter estimates to be exactly zero.2 The generalized monotone incremental forward stagewise (GMIFS) method is an algorithm that can be used in logistic regression to produce a monotone LASSO solution.3 The GMIFS method was subsequently extended by Archer et al for fitting several different logit link ordinal response models to high-throughput genomic data4 including the cumulative logit, forward continuation ratio (CR), backward CR, stereotype logit, and adjacent category models. Herein we describe the GMIFS algorithm for ordinal response modeling using a complementary log-log (cloglog) link, which is useful for discrete survival Modeling. Therefore, in the Discrete Survival Analysis Section, we describe the model formulation for modeling a discrete survival outcome. In the GMIFS Method for Ordinal Response Modeling section, we present the GMIFS method for the forward CR model using the cloglog link. Next, in the Application section, we discuss the motivating dataset that examined survival in glioblastoma (GBM) patients. The Results section examines the model performance in terms of parsimony, resubstitution error, and cross-validation (CV) error. Finally, in the Conclusion we provide concluding remarks, including limitations of the study.

Discrete Survival Analysis

Survival analysis encompasses methods in which the outcome variable is time to event (eg, time to death, disease relapse, etc.). The particular method used in the analysis will depend on the scale of the survival times collected. Ideally, these will be measured on a continuous scale, but sometimes for a variety of reasons, researchers only collect times on a discrete scale. For instance, for many diseases, it is impossible to record the precise date and time of relapse (ie, a continuous measurement) because the needed data are often only collected at a physician visit. Thus, we are forced to work with discrete times. Furthermore, discrete times are used when the latent scale of the response times is discrete.

High dimensional discrete survival data

Assume there are n independent subjects (i = l, 2, 3,…, n) and p features per subject, where p >> n. Because this design matrix will be singular, traditional statistical methods (eg, OLS) are not applicable. The data are often presented as follows: Let Y represent the discrete survival time response variable that takes on the values (j = 1, 2…, K), where K is the largest value of Y observed. To facilitate the formation of the likelihood, we define an n × K response matrix as follows: A p × 1 vector of covariates, , is observed for each subject.

The forward CR model with a complementary log-log link function

With discrete survival data, we are generally interested in modeling the discrete hazard rate defined as This is also the form of a probability modeled by a forward CR model. Furthermore, if it is reasonable to assume that the data were generated by a continuous-time proportional hazards model, then we use the complementary log-log (cloglog) link function,5 Here α represents the intercept, or threshold, for the jth class. Notice that α is the only component of the model that depends on time. Thus, the functions for the K time points are parallel, which implies we are assuming proportional hazards.

Likelihood

We define the likelihood as a product of n conditionally independent Bernoulli random variables,6 where π is the discrete hazard rate and (1 − π) is the conditional complement of π given by for the forward CR model. Now define π(π1 π2 πn). When using the cloglog link, the derivative of the log-likelihood is then given by We use the generalized monotone incremental forward stagewise algorithm to solve for the penalized solution: The tuning parameter, λ, controls the amount of shrinkage. As λ increases, the number of parameter estimates that will be shrunk to zero also increases. Using these coefficient estimates and the estimates for the α’s (described later), we can recursively estimate the probability that subject i belongs to class j where Subject i is then classified to the class that corresponds to the maximum class-specific probability.

GMIFS Method for Ordinal Response Modeling

The incremental forward stagewise (IFS) method is an iterative algorithm that produces a penalized solution for a linear regression model.3 The GMIFS method is an extension of IFS capable of fitting overparameterized logistic regression models.3 The GMIFS algorithm was extended by Archer et al (2014) for fitting several different logit link ordinal response models to high-throughput genomic data.4 We updated this method to allow for the use of a complementary log-log link function. The steps of the GMIFS algorithm for ordinal response modeling are as follows4: Enlarge the predictor space as where represents the standardized predictors. Initialize the α’s to their empirical values. For the forward CR model with a cloglog link, these are initialized as For step s = 0, initialize the components of Find at the current estimate . Update . Estimate the α’s by maximum likelihood, treating (from step 5) as fixed. Repeat steps 4 to 6 until the difference between two successive log-likelihoods is smaller than a pre-specified tolerance, τ. The rationale for enlarging the predictor space is that it allows us to avoid taking the second derivative of the log-likelihood. Once the algorithm has converged, we can obtain the penalized solution by Furthermore, in step 5, is a small incremental value; we used 0.001 in our analysis.

Application

Glioblastoma

Glioblastomas (GBMs) are highly malignant and aggressive tumors that arise from the supportive tissue of the brain. Among all primary brain and central nervous system (CNS) tumors, they are the second most common after meningiomas, which are predominantly benign, and the five-year survival rate for GBM patients is less than 4%.7 Aside from the aggressiveness of the tumors, one possible explanation for the low survival rate is that GBMs are rare in young people; the median age at diagnosis is 64, and the age group with the highest incidence rate is 75–84 year olds.7 Treatment involves surgical removal of as much of the tumor as is safely possible followed by radiotherapy and/or chemotherapy.8 The Cancer Genome Atlas (TCGA) Research Network revealed a subtype of GBM related to the mRNA expression and methylation of a set of genes that affects young adults and has an increased survival rate. Researchers also discovered four molecular subtypes of GBM that have unique responses to treatment and gene mutations that could lead GBMs to become resistant to therapy after a standard chemotherapy treatment.9,10 These findings highlight the importance of genomic research in the study and treatment of GBM.

Data

We downloaded the raw CEL files for GSE53733 from Gene Expression Omnibus.11 The investigators used Affymetrix HG-U133 v2.0 GeneChips to measure gene expression from patients’ tumor samples taken from their initial operation. In the dataset, there were n = 70 GBM patients, of which 16 had an overall survival (OS) of less than 12 months, 31 patients had an OS between 12 and 36 months, and 23 patients had an OS greater than 36 months.12 The patients’ survival times were reported by the investigator as short-, intermediate-, and long-term OS. There were p = 54,613 features per subject in the CEL files after excluding control probe sets. However, after processing the data to remove probe sets with MAS5 present calls in <30% of the subjects,13 31,744 features remained. Furthermore, a 3′:5′ ratio much different from 1 for the housekeeping gene glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is associated with poor cDNA and cRNA quality.14 Thus, we removed one subject with a 3′:5′ GAPDH ratio greater than 3, leaving us with 69 subjects. We then used the RMA method to obtain probe set expression summaries for our statistical analysis.15 Afterward, we fit a forward CR model using the cloglog link with ε = 0.001 and τ = 0.00001.

Results

After the GMIFS algorithm converged, we examined two models: (a) the model selected my minimizing the AIC criterion and (b) the model resulting from the convergence of the GMIFS algorithm (Fig. 1). Using the full dataset, the AIC-selected model misclassified 10 of the 69 patients, while the converged model only misclassified one patient (Tables 1 and 2). However, the AIC-selected model was more parsimonious with 25 non-zero coefficients, while the converged model contained 46 non-zero coefficients. The 25 probe sets that had non-zero coefficient estimates in the AIC-selected model are shown in Table 3. Furthermore, for each model, we examined the sensitivity and specificity for diagnosing short-term survival as well as the sensitivity and specificity for diagnosing short- or intermediate-term survival (Tables 4 and 5). Among the probe sets with non-zero coefficient estimates, the one with the largest absolute coefficient estimate in both models (among probe sets with known gene symbols) was designed to interrogate HD Domain Containing 2 (HDDC2). Long-term survivors had higher HDDC2 expression levels than short-and intermediate-term survivors (Fig. 2). This result agrees with another GBM study that showed that HDDC2 was significantly downregulated in short-term survivors compared to long-term survivors.12 There was also a clear positive relationship between Nucleoside-Triphosphatase, Cancer-Related (NTPCR) expression and survival time (Fig. 3). Interestingly, researchers have shown that NTPCR is overexpressed in neuroblastomas,16 but no study has associated NTPCR with GBM. Additionally, one of the probe sets that had a non-zero coefficient estimate was designed to interrogate the gene PDZ and LIM domain 4 (PDLIM4), which has previously been studied in association with gliomas. Researchers examined the expression of the gene at the protein level for patients with gliomas and discovered that the median OS for patients with high levels of the protein (PDLI4) was significantly shorter than for patients with low protein levels.17 We compared the mean log2 expression levels of PDLIM4 for patients with short-, intermediate-, and long-term OS using Welch’s T-test. Patients with short-term OS had significantly higher expression levels than patients with long-term OS at the Bonferroni adjusted significance level (P = 0.0019), and patients with intermediate-term OS also had significantly higher expression levels than patients with long-term OS (P < 0.0001). The difference in mean expression levels between those with short-term OS and those with intermediate-term OS was not significant. Thus, it appears that both the gene expression levels and the protein levels of the gene are lower for patients who survive longer.

Figure 1

Coefficient paths for our forward CR model using a complementary log-log link.

Notes: The first vertical dashed line signifies the step in the algorithm when the AIC was minimized. The second vertical dashed line marks the step when the algorithm converged.

Table 1

AIC-selected model cross-tabulation of the observed versus the predicted class using the full dataset.

		Observed
		Short	Intermediate	Long
	Short	10	0	0
Predicted	Intermediate	6	31	4
	Long	0	0	18

Table 2

Converged model cross-tabulation of the observed versus the predicted class using the full dataset.

		Observed
		Short	Intermediate	Long
	Short	16	0	0
Predicted	Intermediate	0	31	1
	Long	0	0	21

Table 3

Probe sets with non-zero coefficient estimates in the AIC and converged models.

PROBE SET	ENTREZ ID	GENE SYMBOL	CHROMOSOME	β^AIC	β^CONVERGED	CANCER ASSOCIATIONS
203260_at	51020	HDDC2	6	−0.268	−0.309	Glioblastoma12
1557883_a_at	<NA>	<NA>	<NA>	−0.203	−0.269
206565_x_at	11039	SMA4	5	−0.186	−0.267
1558723_at	284014	LOC284014	17	−0.178	−0.226
202447_at	1666	DECR1	8	−0.142	−0.163	Breast cancer18
226813_at	84284	NTPCR	1	−0.142	−0.194	Neuroblastomas16
209078_s_at	25828	TXN2	22	−0.081	−0.133	Breast cancer19
230581_at	<NA>	<NA>	<NA>	−0.069	−0.069
215962_at	<NA>	<NA>	<NA>	−0.063	−0.072
1557100_s_at	25831	HECTD1	14	−0.032	−0.032	Breast cancer20
242333_at	<NA>	<NA>	<NA>	−0.032	−0.032
206697_s_at	3240	HP	16	−0.029	−0.061	Non-small cell lung cancer,21 Hepatocellular carcinoma22
222992_s_at	4715	NDUFB9	8	−0.028	−0.028
219221_at	253461	ZBTB38	3	−0.014	−0.048	Involved in DNA replication and stability23
230353_at	284112	LOC284112	17	−0.013	−0.039
243957_at	400464	LOC400464	15	0.005	0.013	Diffuse large cell B lymphoma24
231773_at	9068	ANGPTL1	1	0.016	0.017	Prostate cancer25
211564_s_at	8572	PDLIM4	5	0.017	0.017	Glioma,17 acute myeloid leukemia,26 Prostate cancer,27 breast cancer28
218669_at	57826	RAP2C	X	0.019	0.036	Acute lymphoblastic leukemia29
1561759_at	645513	LOC645513	4	0.049	0.062
1559283_a_at	285888	CNPY1	7	0.062	0.160
221900_at	1296	COL8A2	1	0.064	0.109
203184_at	2201	FBN2	5 9	0.089	0.179	Colorectal cancer30
234547_at	<NA>	<NA>	<NA>	0.221	0.231
229146_at	136895	C7orf31	7	0.242	0.295

Table 4

AIC-selected model sensitivity and specificity for predicting short-term survival and for predicting short- or intermediate-term survival.

OUTCOME	SENSITIVITY	SPECIFICITY
Short-term survival	63	100
Short- or intermediate-term survival	100	82

Table 5

Converged model sensitivity and specificity for predicting short-term survival and for predicting short- or intermediate-term survival.

OUTCOME	SENSITIVITY	SPECIFICITY
Short-term survival	100	100
Short- or intermediate-term survival	100	95

Figure 2

Boxplot of 203260_at (HDDC2) log2 expression by discrete OS outcome (short-term, intermediate, long-term survival).

Note: *Significant at the Bonferroni adjusted significance level.

Figure 3

Boxplot of 226813_at (NTPCR) log2 expression by discrete OS outcome (short-term, intermediate, long-term survival).

Note: *Significant at the Bonferroni adjusted significance level.

A common critique of a model fitted from high-dimensional data is that the final model, even if selected by minimizing AIC, is not parsimonious. In this example, critics may say that given a sample size of 69 subjects, including 25 coefficients in the model is overfitting, and that the model performance is likely a result of chance. In response, we fit two additional models whose performances will be a result of chance alone. First, we fit a model with the same gene-expression data used in our example, but we randomly permuted the response vector. Next, we fit a model using our original response vector, but instead of using the gene expression data, we used a design matrix filled with 31,744 × 69 = 2,190,336 random variables generated from a Gaussian distribution with a mean and standard deviation equal to the corresponding sample statistics of the gene expression data. If we exclude regions of underfitting and overfitting, the model fit with the gene expression data and the original response vector had better performance than the other two models whose performances are a result of chance rather than a relationship between the features and the response (Fig. 4).

Figure 4

Plot of model misclassification rate by number of variables included in the model for the original GSE53733 data (red line), GSE53733 data with a permuted response (blue line), and Gaussian random variables (green line).

We also performed N-fold (or leave-one-out) CV to assess the generalizability of our models (where N = 69). Both the AIC-selected model and the converged model had an N-fold CV error rate of about 44.9% (Tables 6 and 7). Thus, it appears that the AIC-selected model and the model that satisfied the GMIFS convergence criterion predict discrete survival time equally well. We chose the AIC-selected model as our final model as it is more parsimonious and therefore more interpretable.

Table 6

N-fold CV: AIC-selected model cross-tabulation of the observed versus the predicted class.

		Observed
		Short	Intermediate	Long
	Short	1	1	0
Predicted	Intermediate	14	30	15
	Long	1	0	7

Table 7

N-fold CV: converged model cross-tabulation of the observed versus the predicted class.

		Observed
		Short	Intermediate	Long
	Short	4	3	1
Predicted	Intermediate	12	27	14
	Long	0	1	7

Conclusion

GBM is a particularly dangerous tumor with a low survival rate. A specific and accurate prognosis would be very useful to both the patient and the oncologist. Thus, we were interested in predicting survival time based on a patient’s genomic feature data. We used discrete times because the investigators of this particular GBM study reported discrete times. Another case when discrete survival times would be used is when the outcome of interest (eg, disease relapse) can only be assessed at physician visits. The GMIFS algorithm is an effective method for building a classifier for an ordinal response outcome given a high-dimensional covariate space. In this case, we fit a forward CR model with a complementary loglog link function to model discrete survival time. The model resulting from the convergence of the algorithm had only a 1.4% resubstitution error. Using N-fold CV, the model had a 44.9% misclassification rate, significantly better than chance (66% misclassification rate for a three-class outcome), but there is room for improvement. For example, although our method performs automatic variable selection, improvement gains in classification accuracy may be achieved by reducing the dimensionality of the feature set in a meaningful way prior to model fitting. We plan to explore this topic in a follow-up paper. Furthermore, a more accurate classifier could be built with more information. For instance, the five-year survival rate for patients diagnosed between the ages of 0 and 19 is around 19%, while the five-year survival rate for patients diagnosed between the ages of 45 and 54 is only about 3.3%.7 Additionally, age was significantly different across the three outcome classes in this study12 but was not made available in the data. Thus, including age as an unpenalized predictor in our model would likely improve its predictive accuracy (the ordinalgmifs R package allows the user to select a subset of predictors that will not be penalized in the GMIFS algorithm31). Also, Karnofsky performance status and extent of surgical resection are known prognostic factors for GBM,32 so they could have been effective unpenalized predictors as well (despite the fact that these two variables were not significantly different across the three classes in this study12). Additionally, a specific month of death would have provided more information than a range of months. However, for each discrete time value, we would need enough subjects with that response to fit a reliable model. As the price of microarray experiments decreases, we will have greater access to datasets with a larger number of subjects, making this a reasonable expectation for the future.

27 in total

1. Evaluation of quality-control criteria for microarray gene expression analysis.

Authors: Catherine I Dumur; Suhail Nasim; Al M Best; Kellie J Archer; Amy C Ladd; Valeria R Mas; David S Wilkinson; Carleton T Garrett; Andrea Ferreira-Gonzalez
Journal: Clin Chem Date: 2004-09-13 Impact factor: 8.327

2. Gene expression profiles for the prediction of progression-free survival in diffuse large B cell lymphoma: results of a DASL assay.

Authors: Seok Jin Kim; Insuk Sohn; In-Gu Do; Sin Ho Jung; Young Hyeh Ko; Hae Yong Yoo; Soonmyung Paik; Won Seog Kim
Journal: Ann Hematol Date: 2013-08-24 Impact factor: 3.673

3. The somatic genomic landscape of glioblastoma.

Authors: Cameron W Brennan; Roel G W Verhaak; Aaron McKenna; Benito Campos; Houtan Noushmehr; Sofie R Salama; Siyuan Zheng; Debyani Chakravarty; J Zachary Sanborn; Samuel H Berman; Rameen Beroukhim; Brady Bernard; Chang-Jiun Wu; Giannicola Genovese; Ilya Shmulevich; Jill Barnholtz-Sloan; Lihua Zou; Rahulsimham Vegesna; Sachet A Shukla; Giovanni Ciriello; W K Yung; Wei Zhang; Carrie Sougnez; Tom Mikkelsen; Kenneth Aldape; Darell D Bigner; Erwin G Van Meir; Michael Prados; Andrew Sloan; Keith L Black; Jennifer Eschbacher; Gaetano Finocchiaro; William Friedman; David W Andrews; Abhijit Guha; Mary Iacocca; Brian P O'Neill; Greg Foltz; Jerome Myers; Daniel J Weisenberger; Robert Penny; Raju Kucherlapati; Charles M Perou; D Neil Hayes; Richard Gibbs; Marco Marra; Gordon B Mills; Eric Lander; Paul Spellman; Richard Wilson; Chris Sander; John Weinstein; Matthew Meyerson; Stacey Gabriel; Peter W Laird; David Haussler; Gad Getz; Lynda Chin
Journal: Cell Date: 2013-10-10 Impact factor: 41.582

Review 4. CBTRUS statistical report: primary brain and central nervous system tumors diagnosed in the United States in 2005-2009.

Authors: Therese A Dolecek; Jennifer M Propp; Nancy E Stroup; Carol Kruchko
Journal: Neuro Oncol Date: 2012-11 Impact factor: 12.300

5. Polymorphisms in oxidative stress-related genes and postmenopausal breast cancer risk.

Authors: Petra Seibold; Rebecca Hein; Peter Schmezer; Per Hall; Jianjun Liu; Norbert Dahmen; Dieter Flesch-Janys; Odilia Popanda; Jenny Chang-Claude
Journal: Int J Cancer Date: 2011-01-06 Impact factor: 7.396

6. PDLIM4, an actin binding protein, suppresses prostate cancer cell growth.

Authors: Donkena Krishna Vanaja; Michael E Grossmann; John C Cheville; Mozammel H Gazi; Aiyu Gong; Jin San Zhang; Katalin Ajtai; Thomas P Burghardt; Charles Y F Young
Journal: Cancer Invest Date: 2009-03 Impact factor: 2.176

7. Combined high-resolution array-based comparative genomic hybridization and expression profiling of ETV6/RUNX1-positive acute lymphoblastic leukemias reveal a high incidence of cryptic Xq duplications and identify several putative target genes within the commonly gained region.

Authors: H Lilljebjörn; M Heidenblad; B Nilsson; C Lassen; A Horvat; J Heldrup; M Behrendtz; B Johansson; A Andersson; T Fioretos
Journal: Leukemia Date: 2007-08-09 Impact factor: 11.528

8. Elevated expression of DecR1 impairs ErbB2/Neu-induced mammary tumor development.

Authors: Josie Ursini-Siegel; Ashish B Rajput; Huiling Lu; Virginie Sanguin-Gendreau; Dongmei Zuo; Vasilios Papavasiliou; Cynthia Lavoie; Jason Turpin; Katherine Cianflone; David G Huntsman; William J Muller
Journal: Mol Cell Biol Date: 2007-07-16 Impact factor: 4.272

9. Angiopoietin-like protein 2 induces androgen-independent and malignant behavior in human prostate cancer cells.

Authors: Ryuta Sato; Mutsushi Yamasaki; Kenichi Hirai; Takanori Matsubara; Takeo Nomura; Fuminori Sato; Hiromitsu Mimata
Journal: Oncol Rep Date: 2014-11-03 Impact factor: 3.906

10. ordinalgmifs: An R Package for Ordinal Regression in High-dimensional Data Settings.

Authors: Kellie J Archer; Jiayi Hou; Qing Zhou; Kyle Ferber; John G Layne; Amanda E Gentry
Journal: Cancer Inform Date: 2014-12-10

2 in total

1. Penalized negative binomial models for modeling an overdispersed count outcome with a high-dimensional predictor space: Application predicting micronuclei frequency.

Authors: Rebecca R Lehman; Kellie J Archer
Journal: PLoS One Date: 2019-01-08 Impact factor: 3.240

2. Controlled variable selection in Weibull mixture cure models for high-dimensional data.

Authors: Han Fu; Deedra Nicolet; Krzysztof Mrózek; Richard M Stone; Ann-Kathrin Eisfeld; John C Byrd; Kellie J Archer
Journal: Stat Med Date: 2022-07-06 Impact factor: 2.497

2 in total