Analysis of microarray right-censored data through fused sliced inverse regression.

Abstract

Sufficient dimension reduction (SDR) for a regression pursues a replacement of the original p-dimensional predictors with their lower-dimensional linear projection. Sliced inverse regression (SIR; [5]) arguably has the longest history among SDR methodologies, and it is still one of the most popular. SIR is known to be easily affected by the number of slices, which is one of its critical deficits. Recently, a fused approach for SIR was proposed to relieve this weakness; it fuses the kernel matrices computed by applying SIR with various numbers of slices. In this paper, the fused SIR is applied to a large-p-small-n regression of high-dimensional microarray right-censored data to show its practical advantage over the usual SIR application. Through model validation, it is confirmed that the fused SIR outperforms the SIR with any number of slices under consideration.

Year: 2019 PMID: 31641157 PMCID: PMC6806006 DOI: 10.1038/s41598-019-51441-0

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

Introduction

Sufficient dimension reduction (SDR) in a regression of $Y \mid X$, with $X = (X_1, \ldots, X_p)^T \in \mathbb{R}^p$, pursues a replacement of the original p-dimensional predictors X with a lower-dimensional linear projection without loss of information about the conditional distribution of $Y \mid X$. Equivalently, SDR seeks $M \in \mathbb{R}^{p \times q}$ such that $Y \perp\!\!\!\perp X \mid M^T X$ (1), where $\perp\!\!\!\perp$ denotes statistical independence and $q \le p$. The conditional independence statement (1) indicates that the two conditional distributions of $Y \mid X$ and $Y \mid M^T X$ are equivalent, so X can be replaced by $M^T X$ without loss of information about $Y \mid X$. A subspace spanned by the columns of M satisfying (1) is called a dimension reduction subspace. If the intersection of all possible dimension reduction subspaces is itself a dimension reduction subspace, it is defined as the central subspace $S_{Y|X}$ [1]. The central subspace is minimal and unique, and its recovery is the main purpose of the SDR literature. Hereafter, d and $\eta \in \mathbb{R}^{p \times d}$ denote the true dimension and an orthonormal basis matrix of $S_{Y|X}$, respectively. The dimension-reduced predictors $\eta^T X$ are called sufficient predictors. Data whose sample size n is smaller than p, such as microarray data and high-throughput data, are quite common these days. In such data, the so-called curse of dimensionality usually occurs, so proper model-building is often problematic in practice. The SDR of X through $S_{Y|X}$ can then facilitate model specification, so it is practically useful for such data. One of the most popular SDR methods is sliced inverse regression (SIR [2]). Implementation of SIR requires a categorization of the response variable Y, called slicing, and the selection of the number of slices is often critical to the results. So far, no ideal or recommended guidelines for choosing the number of slices are known.
To overcome this, a fused approach is proposed in [3] that combines sample kernel matrices of SIR constructed with varying numbers of slices. The combining approach in [3] is called fused sliced inverse regression (FSIR). According to [3], FSIR yields basis estimates of $S_{Y|X}$ that are robust to the number of slices. The purpose of this paper is to analyze microarray right-censored survival data by implementing the FSIR of [3]. The performance of FSIR is compared with the usual SIR applications with different numbers of slices. The paper is organized as follows. SIR and FSIR, along with their applicability to survival regression, are discussed in section 2. The permutation dimension test is discussed in the same section. Diffuse large-B-cell lymphoma data are analyzed through SIR and FSIR, and the results are compared in section 3. We summarize our work in section 4. The following notation is used frequently throughout the rest of the paper: S(B) stands for the subspace spanned by the columns of B, and Σ = cov(X).

Material and Methods

Sliced inverse regression and fused sliced inverse regression

Before explaining sliced inverse regression [2], the predictor X is normalized to $Z = \Sigma^{-1/2}(X - E(X))$. Letting $S_{Y|Z}$ be the central subspace for the regression of $Y \mid Z$, the relationship $S_{Y|X} = \Sigma^{-1/2} S_{Y|Z}$ holds. Define $\eta_z$ to be a p × d orthonormal basis matrix for $S_{Y|Z}$. Consider the so-called linearity condition: (C1) $E(Z \mid \eta_z^T Z)$ is linear in $\eta_z^T Z$. According to [2], a subspace of $S_{Y|Z}$ can be constructed under the linearity condition: $S(E(Z \mid Y)) \subseteq S_{Y|Z}$. To estimate $S_{Y|X}$ completely, it is typically assumed that $S(\Sigma^{-1}\{E(X \mid Y) - E(X)\}) = S_{Y|X}$. Sliced inverse regression is a method to recover $S_{Y|X}$ by computing $E(X \mid Y)$. In the population, the quantity $E(Z \mid Y)$ should be computed without any specific assumptions on $Y \mid Z$. If Y is discrete with h levels, $E(Z \mid Y = s)$ is the average of Z within the sth category of Y. Following this idea, if Y is continuous or many-valued, Y is transformed to a categorized response $\tilde{Y}$ with h levels. Then $E(Z \mid \tilde{Y} = s)$ becomes the average of Z within the sth category of $\tilde{Y}$ for s = 1, …, h. This categorization of Y is called slicing, and it is done so that each category has an equal number of observations. SIR constructs the kernel matrix $M_{SIR} = \mathrm{cov}\{E(Z \mid \tilde{Y})\}$.

In the sample, the SIR algorithm is as follows:

1. Construct $\tilde{Y}$ by dividing the range of Y into h non-overlapping intervals. Let $n_s$ be the number of observations in the sth category of $\tilde{Y}$ for s = 1, …, h.
2. Compute $\hat{E}(\hat{Z} \mid \tilde{Y} = s) = \frac{1}{n_s} \sum_{i:\, \tilde{Y}_i = s} \hat{Z}_i$ for s = 1, …, h, where $\hat{Z}_i = \hat{\Sigma}^{-1/2}(X_i - \bar{X})$.
3. Construct $\hat{M}_{SIR} = \sum_{s=1}^{h} \frac{n_s}{n} \hat{E}(\hat{Z} \mid \tilde{Y} = s)\, \hat{E}(\hat{Z} \mid \tilde{Y} = s)^T$.
4. Spectral-decompose $\hat{M}_{SIR} = \sum_{i=1}^{p} \hat{\lambda}_i \hat{\gamma}_i \hat{\gamma}_i^T$, where $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_p \ge 0$.
5. Determine the structural dimension d. Let $\hat{d}$ denote an estimate of d.
6. The eigenvectors $\hat{\eta}_z = (\hat{\gamma}_1, \ldots, \hat{\gamma}_{\hat{d}})$ corresponding to the $\hat{d}$ largest eigenvalues are the estimate of an orthonormal basis for $S_{Y|Z}$.
7. Back-transform $\hat{\eta} = \hat{\Sigma}^{-1/2} \hat{\eta}_z$ to obtain the estimate of a basis of $S_{Y|X}$.

As the implementation shows, the results of SIR in practice may vary critically depending on the selection of h. This is discussed in [3]. Define $M_{FSIR}^{(h)} = (M_{SIR}^{(1)}, \ldots, M_{SIR}^{(h)})$, where $M_{SIR}^{(k)}$ stands for the kernel matrix of SIR with k slices. Since $S(M_{SIR}^{(k)}) = S_{Y|Z}$ for k = 2, 3, …, h, we have $S(M_{FSIR}^{(h)}) = S_{Y|Z}$.
In [3], the matrix $M_{FSIR}^{(h)}$ is proposed as another kernel matrix to estimate $S_{Y|Z}$, and this approach is called fused sliced inverse regression (FSIR). Various numerical studies in [3] confirm that $M_{FSIR}^{(h)}$ is robust to the choice of h. Inference on $S_{Y|Z}$ is done through the spectral decomposition of $\hat{M}_{FSIR}^{(h)} \hat{M}_{FSIR}^{(h)T}$. The eigenvectors of $\hat{M}_{FSIR}^{(h)} \hat{M}_{FSIR}^{(h)T}$ corresponding to its non-zero eigenvalues form an estimate of an orthonormal basis of $S_{Y|Z}$.
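The SIR steps and the fused kernel described above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the paper: all function names are hypothetical, the sample covariance is assumed to be full rank (so n > p), slices are formed by rank with nearly equal counts, and the fused kernel column-binds the SIR kernels for k = 2, …, h slices, with the basis taken from the leading eigenvectors of the kernel times its transpose.

```python
import numpy as np

def standardize(X):
    """Return Z = (X - mean) Sigma^{-1/2} and Sigma^{-1/2} (assumes full-rank covariance)."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return Xc @ sigma_inv_sqrt, sigma_inv_sqrt

def slice_labels(y, h):
    """Assign observations to h slices of (nearly) equal size, ordered by y."""
    n = len(y)
    labels = np.empty(n, dtype=int)
    labels[np.argsort(y, kind="stable")] = np.arange(n) * h // n
    return labels

def sir_kernel(Z, labels):
    """Sample SIR kernel: sum over slices of (n_s / n) * slice_mean slice_mean^T."""
    n, p = Z.shape
    M = np.zeros((p, p))
    for s in np.unique(labels):
        Zs = Z[labels == s]
        slice_mean = Zs.mean(axis=0)
        M += (len(Zs) / n) * np.outer(slice_mean, slice_mean)
    return M

def fsir_kernel(Z, y, h):
    """Fused kernel: SIR kernels for k = 2, ..., h slices bound column-wise (assumed range)."""
    return np.hstack([sir_kernel(Z, slice_labels(y, k)) for k in range(2, h + 1)])

def basis_from_kernel(M, d, sigma_inv_sqrt):
    """Leading d eigenvectors of M M^T, back-transformed to the original X scale."""
    evals, evecs = np.linalg.eigh(M @ M.T)
    leading = evecs[:, np.argsort(evals)[::-1][:d]]
    return sigma_inv_sqrt @ leading
```

A call such as `basis_from_kernel(fsir_kernel(Z, y, 10), 1, sigma_inv_sqrt)` would then give a one-dimensional basis estimate; passing `sir_kernel(Z, slice_labels(y, h))` instead gives the usual SIR estimate for a single h.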

Permutation dimension test

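The true structural dimension d is determined by a sequence of hypothesis tests [4]. Starting with m = 0, test $H_0: d = m$ versus $H_1: d = m + 1$. If $H_0: d = m$ is rejected, increment m by 1 and redo the test, stopping the first time $H_0$ is not rejected and setting $\hat{d} = m$. This dimension test is equivalent to testing the rank of $M_{FSIR}^{(h)}$. So, the proposed test statistic is $\hat{\Lambda}_m = n \sum_{i=m+1}^{p} \hat{\lambda}_i$, where $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_p$ are the ordered eigenvalues of $\hat{M}_{FSIR}^{(h)} \hat{M}_{FSIR}^{(h)T}$, with corresponding eigenvectors $\hat{\gamma}_1, \ldots, \hat{\gamma}_p$. Here a permutation approach is adopted to implement the dimension estimation. An advantage of the permutation test is that it does not require the asymptotic distribution of $\hat{\Lambda}_m$. The permutation test algorithm is as follows:

1. Construct $\hat{M}_{FSIR}^{(h)}$.
2. Under $H_0: d = m$, compute $\hat{\Lambda}_m$ and partition the eigenvectors: $\hat{\Gamma}_1 = (\hat{\gamma}_1, \ldots, \hat{\gamma}_m)$ and $\hat{\Gamma}_2 = (\hat{\gamma}_{m+1}, \ldots, \hat{\gamma}_p)$.
3. Construct two sets of vectors from the standardized predictors $\hat{Z}_i$: $\hat{V}_i = \hat{\Gamma}_1^T \hat{Z}_i$ and $\hat{U}_i = \hat{\Gamma}_2^T \hat{Z}_i$. Randomly permute the index i of the $\hat{U}_i$ to obtain the permuted set $\hat{U}_i^*$.
4. Construct the test statistic $\hat{\Lambda}_m^*$ based on a regression of $Y \mid (\hat{V}_i, \hat{U}_i^*)$.
5. Repeat steps (3-4) N times, where N is the total number of permutations.

The p-value of the hypothesis test is the fraction of the $\hat{\Lambda}_m^*$ that exceed $\hat{\Lambda}_m$. The setting N = 1000 is a widely used choice.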
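The permutation procedure can be illustrated with a short Python sketch. It rests on several assumptions that the text does not fix: the statistic is taken as n times the sum of the smallest p − m eigenvalues of the kernel times its transpose, the kernel is supplied as a hypothetical callable `kernel(Z, y)`, a plain SIR kernel stands in for the fused one to keep the example short, and the permuted predictors are not re-standardized.

```python
import numpy as np

def sir_kernel(Z, y, h):
    """Plain SIR kernel with h rank-based slices (stand-in for any kernel)."""
    n, p = Z.shape
    labels = np.empty(n, dtype=int)
    labels[np.argsort(y, kind="stable")] = np.arange(n) * h // n
    M = np.zeros((p, p))
    for s in range(h):
        Zs = Z[labels == s]
        slice_mean = Zs.mean(axis=0)
        M += (len(Zs) / n) * np.outer(slice_mean, slice_mean)
    return M

def eig_desc(M):
    """Eigenvalues and eigenvectors of M M^T in decreasing eigenvalue order."""
    evals, evecs = np.linalg.eigh(M @ M.T)
    idx = np.argsort(evals)[::-1]
    return evals[idx], evecs[:, idx]

def permutation_pvalue(Z, y, m, kernel, n_perm=1000, seed=0):
    """Permutation p-value for H0: d = m."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    evals, evecs = eig_desc(kernel(Z, y))
    stat = n * evals[m:].sum()                      # observed statistic
    V, U = Z @ evecs[:, :m], Z @ evecs[:, m:]       # two sets of vectors
    exceed = 0
    for _ in range(n_perm):
        U_star = U[rng.permutation(n)]              # permute the index of U only
        evals_star, _ = eig_desc(kernel(np.hstack([V, U_star]), y))
        exceed += int(n * evals_star[m:].sum() > stat)
    return exceed / n_perm

def estimate_dimension(Z, y, kernel, alpha=0.05, n_perm=1000, seed=0):
    """Sequential testing: return the first m whose H0: d = m is not rejected."""
    p = Z.shape[1]
    for m in range(p):
        if permutation_pvalue(Z, y, m, kernel, n_perm, seed) >= alpha:
            return m
    return p
```

For example, `estimate_dimension(Z, y, lambda Z, y: sir_kernel(Z, y, 6))` would run the sequential test for SIR with six slices; a fused kernel can be plugged in through the same callable.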

Application to survival regression

Survival regression is the study of the conditional distribution of the survival time T given a set of predictors X. Naturally, SDR in survival regression seeks to recover the central subspace $S_{T|X}$: $T \perp\!\!\!\perp X \mid \eta^T X$ (2). However, since the true survival time T cannot be completely observed due to censoring, T|X usually cannot be studied directly. Instead, the data $(Y_i, \delta_i, X_i)$, i = 1, …, n, are collected from n independent and identically distributed realizations of (T, C, X), where $Y = T\delta + C(1 - \delta)$, C stands for a censoring time, and δ is an indicator variable equal to 1 if C > T and 0 otherwise. This type of censoring is called right-censoring. Using (Y, δ, X), the regression of T|X is replaced as follows. The first step is to consider the regression of (T, C)|X. The construction of (T, C)|X directly implies that $S_{T|X} \subseteq S_{(T,C)|X}$. According to [5], the central subspace $S_{(Y,\delta)|X}$ from the bivariate regression of (Y, δ)|X is informative about $S_{T|X}$, because $S_{(Y,\delta)|X} \subseteq S_{(T,C)|X}$. Since (Y, δ, X) are collected for survival analysis, $S_{(Y,\delta)|X}$ can be estimated. The two regressions of T|X and (Y, δ)|X are connected in [3] under the condition: (C2) $C \perp\!\!\!\perp X \mid (\eta^T X, T)$. Condition C2 is weaker than $C \perp\!\!\!\perp (T, X)$, which is normally assumed in survival analysis. Condition C2 guarantees that statement (2) is equivalent to $(T, C) \perp\!\!\!\perp X \mid \eta^T X$, so we have $S_{(T,C)|X} = S_{T|X}$. Therefore, the following relation is directly implied: $S_{(Y,\delta)|X} \subseteq S_{(T,C)|X} = S_{T|X}$. According to [5,6], the equality would normally hold, because proper containment requires carefully balanced conditions. SIR and FSIR are then directly applicable with bivariate slicing of Y and δ to recover $S_{T|X}$. A similar discussion can be found in section 4.2 of [6].
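The construction of the observed response and one possible form of bivariate slicing can be sketched as follows. The text only states that Y and δ are sliced jointly; the scheme below, which slices Y separately within the censored and uncensored groups, is an assumption made for illustration, and the function names are hypothetical.

```python
import numpy as np

def observed_response(T, C):
    """Right-censored observation: delta = 1 if C > T, Y = T*delta + C*(1 - delta)."""
    delta = (C > T).astype(int)
    Y = T * delta + C * (1 - delta)
    return Y, delta

def bivariate_slices(Y, delta, h_per_group):
    """Slice Y into near-equal slices separately within delta = 0 and delta = 1."""
    labels = np.empty(len(Y), dtype=int)
    offset = 0
    for status in (0, 1):
        idx = np.flatnonzero(delta == status)
        if idx.size == 0:
            continue
        order = idx[np.argsort(Y[idx], kind="stable")]
        k = min(h_per_group, idx.size)          # no more slices than observations
        labels[order] = offset + np.arange(idx.size) * k // idx.size
        offset += k                             # keep labels of the two groups disjoint
    return labels
```

The resulting labels can be used in place of the univariate slice labels when forming the slice means, so the rest of the SIR or FSIR computation is unchanged.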

Results

Analysis of diffuse large-B-cell lymphoma data

The diffuse large-B-cell lymphoma dataset (DLBCL [7]) contains measurements of 7399 genes from 240 patients obtained from customized cDNA microarrays. For each patient, the survival time was recorded, ranging from 0 to 21.8 years. Of the 240 patients, 138 are uncensored (deceased). A more detailed description of the data is found in [6-8]. We follow the approach in [9] to analyze the DLBCL data. The data are randomly divided into a training set of 160 and a test set of 80. As usual, the training set is used for model-building, and the test set is used for model-validation. First, the 7399 genes in the training set, denoted $X_{tr}$, are initially reduced to their 40 principal components through principal component analysis, that is, by applying the estimated rotation matrix to $X_{tr}$. Second, SIR is employed for additional dimension reduction of the 40 principal components, with the observed survival time and censoring status as bivariate responses, which yields an estimated basis matrix. According to [9], the dimension d is estimated to be one. The final estimated sufficient predictors from this two-step dimension reduction are the projections of the 40 principal components onto the estimated basis. For model-building, the Cox proportional hazards model was fitted with these sufficient predictors. For model-validation, the predicted scores and the corresponding areas under the ROC curves for prediction of survival time from 1 to 10 years were computed for both the training and test sets. For the test set, the dimension-reduced predictors are obtained by applying the rotation matrix and the estimated basis from the training set to $X_{te}$, the predictors in the test set. An area closer to one indicates better prediction. One potentially arguable issue in this analysis is the selection of the number of slices h in the SIR application. As discussed in the previous section, the performance of SIR inevitably depends on h. To investigate how seriously the choice of h affects model-validation, we consider h = 4, 6, 8 and 10 for SIR, along with FSIR.
Following the guidance of [3], 10 slices are used in FSIR. The areas under the ROC curves for the training and test sets are reported in Fig. 1.
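The two-step reduction, and in particular the reuse of training-set quantities on the test set, can be sketched in Python. This is an illustrative sketch rather than the analysis code: the function names are hypothetical, the principal components are assumed to be computed from mean-centered training data via a singular value decomposition, and `basis` stands for whichever SIR or FSIR basis estimate was obtained on the 40 training principal components.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Return the training mean and the p x n_components rotation matrix."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def sufficient_predictors(X, mean, rotation, basis):
    """Project onto the principal components, then onto the estimated SDR basis."""
    return (X - mean) @ rotation @ basis
```

Calling `sufficient_predictors` on both the training and the test predictors with the same `mean`, `rotation` and `basis` mirrors the validation scheme described above, in which nothing is re-estimated on the test set.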

Figure 1

Area under ROC curves at time 1 to 10 years for DLBCL data in Section 3: h = 4, 6, 8, 10, sliced inverse regression with the corresponding number of slices; Fused, fused sliced inverse regression with h = 10.

First, consider the areas under the ROC curves for the training set in Fig. 1(a). Larger areas indicate better prediction performance. For the SIR applications, smaller numbers of slices show better performance. FSIR is not the best among all the SIR applications considered here, but it does not differ notably from the best result, which is with h = 2, among all the SIR applications. Therefore, for the training set, FSIR is no cause for concern. For the test set in Fig. 1(b), FSIR shows better prediction performance than any of the SIR applications. The prediction results of FSIR are consistent across the training and test sets, while the usual SIR applications are very sensitive to the choice of h, as expected. The application of FSIR to the data is concluded to be successful.

Discussion

According to Fig. 1(a,b), the ordering of the areas under the ROC curves with respect to h is reversed between the training and test sets in the SIR applications. In the training set, smaller numbers of slices have larger areas, while in the test set the areas for smaller h become smaller, even falling below 0.5. An area equal to 0.5 is often used as the cut-off. For SIR, only the application with h = 10 is above 0.5 in both the training and test sets, although its performance is the worst in the training set. FSIR, however, shows reliable and consistently good performance in both the training and test sets. The best selections of h in the training set and the test set are different, and this selection bias in h can cause the ironic results in SIR. This bias also affects the estimation of d in the analysis. At the 5% level, the SIR applications with h = 4 and 8 determine that $\hat{d} = 0$, with p-values of 0.139 and 0.244 for $H_0: d = 0$, respectively. However, SIR with h = 6 and 10 determines that $\hat{d} = 1$ (h = 6: 0.009 for $H_0: d = 0$ and 0.097 for $H_0: d = 1$; h = 10: 0.007 for $H_0: d = 0$ and 0.10 for $H_0: d = 1$). This confirms the severe sensitivity of SIR to the selection of h in high-dimensional data analysis. FSIR determines that $\hat{d} = 1$, with p-values of 0.014 for $H_0: d = 0$ and 0.115 for $H_0: d = 1$. This shows that FSIR has potential advantages over SIR in high-dimensional data analysis in practice.
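The sequential decision rule behind these dimension estimates can be checked with a few lines of Python applied to the p-values reported above. The helper name `estimate_d` is hypothetical; only the p-values and the 5% level come from the text.

```python
def estimate_d(p_values, alpha=0.05):
    """p_values[m] is the p-value for H0: d = m; stop at the first non-rejection."""
    for m, p_value in enumerate(p_values):
        if p_value >= alpha:
            return m
    return len(p_values)

# p-values reported in the Discussion, ordered as H0: d = 0, H0: d = 1.
reported = {
    "SIR h=4": [0.139],
    "SIR h=6": [0.009, 0.097],
    "SIR h=8": [0.244],
    "SIR h=10": [0.007, 0.10],
    "FSIR": [0.014, 0.115],
}
estimates = {name: estimate_d(p) for name, p in reported.items()}
```

With these inputs, `estimates` gives 0 for SIR with h = 4 and 8, and 1 for SIR with h = 6 and 10 and for FSIR, matching the values stated in the text.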

Conclusion

Fused sliced inverse regression (FSIR), proposed by [3], addresses the sensitivity of sliced inverse regression (SIR [2]) to the number of slices by combining SIR kernel matrices. In this paper, FSIR is applied to high-dimensional microarray right-censored data to show its potential advantage over the usual SIR application for large-p-small-n data. The predictors are initially reduced through principal component analysis, and then SIR and FSIR are implemented with 40 principal components. According to model-validation, SIR reveals its sensitivity to the number of slices. Moreover, ironic validation results are observed in the training and test sets. For SIR, the numbers of slices that perform better in the training set perform worse in the test set. This may be because good slicing schemes in the training set do not coincide with those in the test set. This is confirmed again through the estimation of the true structural dimension. FSIR, however, performs comparably to the best SIR application in the training set and better than all SIR applications under consideration in the test set. This indicates a practical advantage of FSIR over SIR. The use of FSIR can improve accuracy in high-dimensional data analysis, which arises in many scientific fields including the biological sciences, so it can contribute to new findings in many areas of science.

3 in total

1. Dimension reduction and graphical exploration in regression including survival analysis.

Authors: R Dennis Cook
Journal: Stat Med Date: 2003-05-15 Impact factor: 2.373

2. Dimension reduction methods for microarrays with application to censored survival data.

Authors: Lexin Li; Hongzhe Li
Journal: Bioinformatics Date: 2004-07-15 Impact factor: 6.937

3. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma.

Authors: Andreas Rosenwald; George Wright; Wing C Chan; Joseph M Connors; Elias Campo; Richard I Fisher; Randy D Gascoyne; H Konrad Muller-Hermelink; Erlend B Smeland; Jena M Giltnane; Elaine M Hurt; Hong Zhao; Lauren Averett; Liming Yang; Wyndham H Wilson; Elaine S Jaffe; Richard Simon; Richard D Klausner; John Powell; Patricia L Duffey; Dan L Longo; Timothy C Greiner; Dennis D Weisenburger; Warren G Sanger; Bhavana J Dave; James C Lynch; Julie Vose; James O Armitage; Emilio Montserrat; Armando López-Guillermo; Thomas M Grogan; Thomas P Miller; Michel LeBlanc; German Ott; Stein Kvaloy; Jan Delabie; Harald Holte; Peter Krajci; Trond Stokke; Louis M Staudt
Journal: N Engl J Med Date: 2002-06-20 Impact factor: 91.245
