
Predicting phenotypes from microarrays using amplified, initially marginal, eigenvector regression.

Lei Ding, Daniel J. McDonald.

Abstract

MOTIVATION: The discovery of relationships between gene expression measurements and phenotypic responses is hampered by both computational and statistical impediments. Conventional statistical methods are less than ideal because they either fail to select relevant genes, predict poorly, ignore the unknown interaction structure between genes, or are computationally intractable. Thus, the creation of new methods which can handle many expression measurements on relatively small numbers of patients while also uncovering gene-gene relationships and predicting well is desirable.
RESULTS: We develop a new technique for using the marginal relationship between gene expression measurements and patient survival outcomes to identify a small subset of genes which appear highly relevant for predicting survival, produce a low-dimensional embedding based on this small subset, and amplify this embedding with information from the remaining genes. We motivate our methodology by using gene expression measurements to predict survival time for patients with diffuse large B-cell lymphoma, illustrate the behavior of our methodology on carefully constructed synthetic examples, and test it on a number of other gene expression datasets. Our technique is computationally tractable, generally outperforms other methods, is extensible to other phenotypes, and also identifies different genes (relative to existing methods) for possible future study.
AVAILABILITY AND IMPLEMENTATION: All of the code and data are available at http://mypage.iu.edu/~dajmcdon/research/. CONTACT: dajmcdon@indiana.edu. SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com


Year:  2017        PMID: 28881997      PMCID: PMC5870707          DOI: 10.1093/bioinformatics/btx265

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

A typical scenario in genomics is to obtain expression measurements for thousands of genes from microarrays or RNA-Seq which may be relevant for predicting a particular phenotype. Such studies have been useful in relating specific genetic variations to a wide variety of outcomes such as disease-specific indicators (Lesage and Brice, 2009; Barrett et al.; Burton et al.; Sladek et al.); drug or vaccine response (Saito et al.; Kennedy et al.); and individual traits like motion sickness (Hromatka et al.) or age at menarche (Elks et al.; Perry et al.). In these scenarios, researchers are interested in the accurate prediction of the phenotype and the identification of a handful of relevant genes at a reasonable computational expense. With these goals in mind, supervised linear regression techniques such as ridge regression (Hoerl and Kennard, 1970), the lasso (Tibshirani, 1996), the Dantzig selector (Candes and Tao, 2007) or other penalized methods are often employed. However, because phenotypes tend to be the result of groups of genes, which perhaps together describe more complicated biomechanical processes, rather than individual polymorphisms, recent approaches have tried to account for this group structure. Techniques such as the group lasso (Yuan and Lin, 2006) can predict the response with sparse groupings of coefficients as long as the groups are at least partially understood ahead of time. In contrast, unsupervised methods such as principal components analysis (Pearson, 1901; Hotelling, 1957; Jolliffe, 2002) are often used directly on the genes when no phenotype is being examined (Alter et al.; Sladek et al.; Wall et al.). Finally, modern approaches developed specifically for the genomics context, such as supervised gene shaving (Hastie et al.), tree harvesting (Hastie et al.) and supervised principal components (Bair and Tibshirani, 2004; Bair et al.), have sought to combine the presence of a response with the structure-estimation properties of eigendecompositions from unsupervised techniques to obtain the best of both.
It is this last set of techniques that most closely resembles the approach we present here. We give a more detailed discussion of supervised principal components next, before motivating our method with an example. Notation: we use bolded letters M to indicate matrices, capital letters with a single subscript to denote column vectors (so that M_j is the jth column of the matrix M), and lower-case letters to denote row vectors (a single subscript, m_i) or scalars (m_ij being the (i, j) element of M). We use the notation M_A to mean the columns of M whose indices are in the set A, and A^c to denote the complement of A. Finally, for a matrix M, we write the singular value decomposition (SVD) as M = U_M D_M V_M^T and define M^+ = V_M D_M^+ U_M^T to be the Moore-Penrose inverse of M. In the case only of the design matrix X discussed below, we will use the more compact decomposition X = UDV^T.

1.1 Supervised eigenstructure techniques

The first technique for extending unsupervised principal components analysis to the case where a response is available is principal components regression (PCR; Hotelling, 1957; Kendall, 1965). Instead of regressing the response on all the available covariates as in ordinary least squares (OLS), PCR first performs an eigendecomposition of the empirical covariance matrix and then regresses the response on the subset of principal components corresponding to the largest variances. Defining Y to be the centered response vector, and X to be the n × p centered design matrix, write the (reduced) SVD of X as X = UDV^T. For some integer d ≤ min(n, p), the principal components regression estimator is given as the solution to

    α̂ = argmin_α || Y − U_d D_d α ||²,

which has the closed-form representation α̂ = D_d^+ U_d^T Y. Since this solution is in the space spanned by the principal components, it is easy to rotate the estimate back onto the span of X: β̂ = V_d α̂. Then any elements of β̂ which are identically zero imply the irrelevance of those genes for predicting the phenotype, while the columns of V_d can be interpreted as indicating groupings of individual genes. Principal components regression performs well under certain conditions when we believe that there are natural groupings of covariates (linear combinations) which are useful for predicting the response. However, Lu (2002) and Johnstone and Lu (2009) show that the empirical singular vectors are poor estimates of the associated population quantity (the left singular vectors of the expected value of X) unless p/n → 0 as n → ∞. In particular, when p ≫ n, as is common in genomics where the number of gene expression measurements is much larger than the number of patients, PCR will suffer. To avoid this flaw in PCR, various approaches have been proposed. Hastie et al. proposed a method called 'gene shaving' that is applicable to both supervised (given a phenotype) and unsupervised (only gene expressions) settings.
In the supervised setting, it works by computing the first principal component and ranking the genes using a combined measure that balances the principal component scores and the marginal relationship with the response. Those genes with the lowest combined scores are removed and the process is repeated until only one gene remains, resulting in a nested sequence of clusters containing fewer and fewer genes. One then chooses a cluster along this sequence, orthogonalizes the data with respect to the genes in that cluster, and repeats the entire process, iterating until the desired number of clusters has been recovered. This procedure is somewhat computationally expensive and requires both the cluster sizes and the number of clusters to be chosen. An alternative with somewhat similar behavior is supervised principal components (SPC; Bair and Tibshirani, 2004; Bair et al.). SPC avoids the high-dimensional regression problem by first selecting a much smaller subset of useful genes which have high marginal correlation with the phenotype (in contrast to gene shaving, which uses both the marginal correlation and the covariance between genes). By screening out most of the hopefully irrelevant genes, we can return to the scenario where p < n. In follow-up work, Paul et al. show that, if a small marginal correlation with the response implies irrelevance for prediction, then SPC will find any truly relevant genes and predict the phenotype accurately. They also suggest using the lasso or forward stepwise selection after SPC to further reduce the number of genes. However, if some genes have a small marginal relationship with the response but a large conditional relationship, they will be erroneously ignored by SPC. It is this last property that our method attempts to correct. We now illustrate that the screening step of SPC is likely to remove important genes in typical applications before discussing how our procedure avoids suffering the same fate.
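The PCR estimator described at the start of this subsection can be sketched in a few lines of NumPy. This is a generic illustration (the function and variable names are ours, not the authors'); with d equal to the rank of X, it reduces to ordinary least squares:

```python
import numpy as np

def pcr(X, Y, d):
    """PCR: regress Y on the first d principal component scores of centered X,
    then rotate the coefficients back onto the span of the covariates."""
    U, D, Vt = np.linalg.svd(X, full_matrices=False)
    alpha = (U[:, :d].T @ Y) / D[:d]   # closed-form coefficients in PC space
    return Vt[:d].T @ alpha            # rotate back: beta = V_d alpha

rng = np.random.default_rng(0)
n, p, d = 50, 10, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
Y = X @ rng.standard_normal(p); Y -= Y.mean()
beta_pcr = pcr(X, Y, d)
```

Keeping only d components discards the low-variance directions, which is exactly what makes PCR fragile when the empirical singular vectors are poorly estimated.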

1.2 A motivating example

To motivate our methodology in relation to previous approaches, we examine a dataset consisting of 240 patients with diffuse large B-cell lymphoma (DLBCL; Rosenwald et al.) in some detail. Each patient is measured on 7399 genes, and her survival time is recorded. Previous approaches rely on the assumption that a small marginal correlation between the response variable, in this case patient survival time, and the vector of expression measurements for a particular gene is sufficient for guaranteeing the irrelevance of that particular gene for prediction. To make this assumption mathematically precise, suppose y = x^T β + ϵ, where y is the response, x is a vector of gene expression measurements, and ϵ is a mean-zero error. Then, the assumption can be stated mathematically as: Cov(x_j, y) = 0 implies β_j = 0. While reasonable under some conditions, this assumption is perhaps too strong for many gene expression datasets. Very often, individual gene expressions are only predictive of phenotype in the presence of other genes. We can rewrite this assumption using the population covariance matrix between genes, Σ = Cov(x), and the vector-valued covariance between gene expressions and phenotype, σ_xy = Cov(x, y). Then, using the population equation β = Σ^{-1} σ_xy allows us to rewrite the assumption as

    (σ_xy)_j = 0  ⟹  (Σ^{-1} σ_xy)_j = 0.   (1)

In words, we are assuming that the dot product of the jth row of the inverse covariance matrix with the covariance between x and y is zero whenever the jth element of σ_xy is zero. To examine whether this assumption holds, we can estimate both σ_xy and Σ^{-1} using the DLBCL data and imagine that these estimates are the population quantities for illustration. To estimate σ_xy, we use the standard covariance estimate, but set all but the largest 120 values equal to zero, corresponding to a sparse solution. For the case of Σ^{-1}, estimating large inverse covariance matrices accurately is impossible when p > n unless we assume some additional structure.
If most of the entries of Σ^{-1} are 0 [a necessary condition for (1) to hold], methods like the graphical lasso (glasso; Friedman et al.) or graph estimation (Meinshausen and Bühlmann, 2006) have been shown to work well. We use the graph estimation technique for all 7399 genes in the dataset at 10 different sparsity levels ranging from 100% to 99.2%. For visualization purposes, Figure 1 shows the first 250 genes for one estimate of the inverse covariance that is 97.5% sparse.
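The logic of this check can be illustrated on a toy example: given a covariance matrix Σ and marginal covariances σ_xy, compute β = Σ^{-1} σ_xy and count the genes with zero marginal covariance but non-zero coefficient (the false negatives under assumption (1)). The numbers below are made up purely for illustration:

```python
import numpy as np

# Toy population quantities: genes 1 and 2 are correlated; only gene 1 has a
# non-zero marginal covariance with the response.
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
sigma_xy = np.array([0.5, 0.0, 0.0])

beta = np.linalg.solve(Sigma, sigma_xy)     # population coefficients
zero_marginal = sigma_xy == 0
nonzero_beta = np.abs(beta) > 1e-10
false_negatives = zero_marginal & nonzero_beta
fnr = false_negatives.sum() / max(nonzero_beta.sum(), 1)
```

Here gene 2 has zero marginal covariance yet a non-zero coefficient (it matters conditionally on gene 1), so the screening assumption misses it and the false negative rate is 1/2.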
Fig. 1

A sparse estimate of the inverse covariance of gene expression measurements for the first 250 genes from the DLBCL dataset. The estimate has 97.5% of the off-diagonal elements equal to 0. Darker colors represent inverse covariances of larger magnitude

To assess the validity of (1), Table 1 shows the sparsity of the full inverse covariance matrix, the percentage of non-zero regression coefficients, and the percentage of non-zero regression coefficients which are incorrectly ignored by the assumption (the false negative rate). In all cases, σ_xy is about 98% sparse. Even with an extremely sparse inverse covariance matrix, the false negative rate is at least 25%, meaning that 25% of possibly relevant genes are ignored by the analysis. If the sparsity of Σ^{-1} is allowed to decrease only slightly, the false negative rate increases to over 95%.
Table 1

This table shows properties of the coefficients of the linear model corresponding to 10 different estimates of the inverse covariance matrix, from complete sparsity on the left (a diagonal matrix) to still more than 99% sparsity on the right

Sparsity of Σ_xx^{-1}   1.0000  0.9999  0.9998  0.9995  0.9991  0.9984  0.9975  0.9963  0.9946  0.9922
% non-zero β's          0.0162  0.0216  0.0287  0.0418  0.0618  0.0843  0.1193  0.1803  0.2645  0.3699
False negative rate     0.0000  0.2500  0.4340  0.6117  0.7374  0.8077  0.8641  0.9100  0.9387  0.9562

Note: The second row is the proportion of non-zero population regression coefficients corresponding to each inverse covariance matrix. The bottom row shows the percentage of non-zero regression coefficients which are incorrectly ignored under the assumption on the relationship between marginal correlations and regression coefficients.


1.3 Our contribution

For a similar computational budget, our method outperforms existing approaches by taking advantage of all the data. In particular, our method does not require that the set of non-zero regression coefficients be a subset of the non-zero marginal correlations. Suppose that M ∈ R^{p×p} is a symmetric, non-negative definite matrix; that is, M = M^T and v^T M v ≥ 0 for all vectors v ∈ R^p. To approximate the matrix M, we fix an integer q < p and form a sketching matrix Q ∈ R^{p×q}. Then, we report the following approximation:

    L(M) = (MQ)(Q^T M Q)^+ (MQ)^T.

The details behind the formation of the matrix Q control the type of approximation. In the simplest case, which we employ here, we take Q = PT, where P is a permutation of the identity matrix and T = [I_q, 0]^T is a truncation matrix. While many alternative sketching matrices, mostly based on random projections, have been proposed, this method is the only one necessary to develop our results. Without loss of generality (reordering so that P is the identity), divide the matrix M into blocks

    M = [ A   B
          B^T C ],

with A ∈ R^{q×q}, so that we can (implicitly) construct the approximation as L(M) = [A; B^T] A^+ [A, B]. Because the block column N = [A; B^T] determines this approximation, we can approximate the eigendecomposition of L(M) using the SVD of N. If we decompose N = U_N D_N V_N^T, where we have suppressed the dependence of the decomposition on M for clarity, then the resulting approximation to the eigenvectors of M is U_N. Likewise, the approximate eigenvalues of M are given by the singular values in D_N. Homrighausen and McDonald (2016) show that this approximation is more accurate than the one based on the block A alone for performing a principal components analysis. As previous techniques for principal components regression (like SPC) are based on A rather than N, it is possible that by using N, we will have better results. As we will see, this intuition turns out to be true under some conditions which were suggested in Section 1.2. In particular, for essentially the same computational budget, our procedure outperforms previous procedures if some genes have small marginal correlations with the phenotype but are, nonetheless, important for predicting the phenotype conditional on the presence of other genes.
Furthermore, even if the assumption in (1) is true, our procedure is not much worse than existing approaches. In Section 2, we discuss exactly how to implement our methodology. We examine the behavior of our procedure in Section 3. In Section 3.1, we state an explicit model for the data-generating mechanism in order to be clear about the conditions under which our procedure works well. Section 3.2 uses a number of carefully constructed simulations to show when our technique works well, and when it doesn’t. In Section 4, we examine our procedure on four genetics datasets, including the one discussed above. We find that our methods slightly outperform existing techniques on three of them, suggesting that the motivation is sound. Finally, in Section 5, we give conclusions and discuss some avenues for future work.
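To make the approximation in Section 1.3 concrete, here is a small numerical sketch of the column-based (Nyström-type) construction L(M) = [A; B^T] A^+ [A, B]. All names are illustrative, and this is a generic demonstration rather than the authors' implementation; the approximation is exact whenever rank(M) ≤ q and the leading block captures the range of M, which holds generically in this example:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, r = 8, 3, 3                 # matrix size, block size, true rank
W = rng.standard_normal((p, r))
M = W @ W.T                       # symmetric PSD with rank r = q

A = M[:q, :q]                     # upper-left block
N = M[:, :q]                      # the block column [A; B^T]
M_tilde = N @ np.linalg.pinv(A) @ N.T   # L(M) = [A; B^T] A^+ [A, B]
```

Since rank(M) = q here, the reconstruction recovers M exactly; for higher-rank M it is a low-rank surrogate whose eigenstructure can be read off from the SVD of N.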

2 Methods and computations

We now give the details of our methodology. For clarity, we assume that the design matrix X and the response Y are already centered. Let T be a p-dimensional vector of standardized regression coefficient estimates; i.e., for any j, t_j is the coefficient estimate of the standardized univariate regression between the response Y and covariate X_j. We use standardized regression so that the coefficient estimates are comparable across disparate covariates. Note that t_j is also the marginal correlation between the response Y and covariate X_j. For some threshold τ, we separate X into two matrices X_A and X_{A^c}, where A = {j : |t_j| > τ}. We assume |A| < n. The hope is that X_A contains many of the genes that are most predictive of the phenotype under study. Ideally, high marginal correlations will suggest relevant predictors to be emphasized in the decomposition, but unlike other methods, we will also use those genes not in the set A. We now focus on X_B = [X_A, X_{A^c}] and note that it has the same range as X. Therefore, we will use the approximation technique discussed in Section 1.3 to try to estimate the eigendecomposition of X_B^T X_B using sample quantities. Because X_A^T X_A is symmetric and positive definite, write N = X_B^T X_A and decompose N = U_N D_N V_N^T. For some integer d, we define Ṽ_d = (U_N)_d, the first d approximate eigenvectors. Now, we have estimates for the principal component scores, F̂ = X_B Ṽ_d. Therefore, just as with principal components regression, we can regress Y on the estimated principal components to produce estimated coefficients in principal component space: α̂ = (F̂^T F̂)^{-1} F̂^T Y. Then the coefficient estimates for linear regression in the space spanned by X_B are given by

    β̂ = Ṽ_d α̂.   (2)

Because our methodology uses marginal regression to select a small number of hopefully relevant predictors before 'amplifying' their eigenstructure information with the remaining genes, we refer to our technique as 'Amplified, Initially Marginal, Eigenvector Regression' (AIMER). Unlike previous approaches, the solution given by (2) is not sparse: with probability 1, every entry of β̂ is non-zero. However, most of the coefficients will be small.
We therefore threshold the estimates to produce our final estimator:

    β̂_j(b) = β̂_j 1{|β̂_j| > b},   (3)

where b ≥ 0 is a threshold and 1{·} is the indicator function, which returns the value one for every element of β̂ larger than b in magnitude and zero otherwise. We summarize this procedure in Algorithm 1. As with SPC, the computational burden of our method is dominated by the SVD. We use an SVD of X_B^T X_A while SPC uses the SVD of X_A. However, since the SVD is cubic only in the smaller dimension |A|, in both cases the computation is tractable and, to leading order, both methods require the same amount of computation.

Algorithm 1: Amplified, Initially Marginal, Eigenvector Regression (AIMER)
Input: centered design matrix X, centered response Y, thresholds τ and b, integer d
1. Compute the marginal correlation t_j between X_j and Y for all j;
2. Set A = {j : |t_j| > τ};
3. Set X_B = [X_A, X_{A^c}];
4. Define N = X_B^T X_A;
5. Decompose N = U_N D_N V_N^T;
6. Set Ṽ_d = (U_N)_d;
7. Set F̂ = X_B Ṽ_d;
8. Set α̂ = (F̂^T F̂)^{-1} F̂^T Y;
9. Calculate β̂ = Ṽ_d α̂;
10. Set β̂_j(b) = β̂_j 1{|β̂_j| > b};
Output: coefficient estimates β̂(b)

To make predictions given a new observation x_0, we simply center it using the mean of the original data, reorder its entries to conform to X_B, multiply by the coefficient vector in (3), and add the mean of the original response vector.
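The following is a hedged sketch of Algorithm 1 in NumPy, under one plausible reading of the reconstruction above (screen by marginal correlation, reorder, take approximate eigenvectors from the SVD of X_B^T X_A, regress on the resulting scores, then threshold). Function and variable names are ours; this is not the authors' reference code, which is available at the URL in the abstract:

```python
import numpy as np

def aimer(X, Y, tau, b, d):
    """Illustrative AIMER sketch; X and Y assumed centered."""
    Xs = X / np.linalg.norm(X, axis=0)
    t = Xs.T @ (Y / np.linalg.norm(Y))            # step 1: marginal correlations
    order = np.argsort(-np.abs(t))                # step 3: put the set A first
    nA = max(int((np.abs(t) > tau).sum()), d)     # step 2, keeping at least d genes
    XB = X[:, order]
    N = XB.T @ XB[:, :nA]                         # step 4: N = X_B^T X_A
    U1, D1, V1t = np.linalg.svd(N, full_matrices=False)   # step 5
    V_tilde = U1[:, :d]                           # step 6: approximate eigenvectors
    F = XB @ V_tilde                              # step 7: estimated PC scores
    alpha, *_ = np.linalg.lstsq(F, Y, rcond=None) # step 8
    beta_B = V_tilde @ alpha                      # step 9
    beta_B[np.abs(beta_B) <= b] = 0.0             # step 10: threshold
    beta = np.empty_like(beta_B)
    beta[order] = beta_B                          # undo the reordering
    return beta

# Toy data with one spiked direction supported on the first 10 genes.
rng = np.random.default_rng(3)
n, p = 100, 300
v = np.zeros(p); v[:10] = 1 / np.sqrt(10)
u = rng.standard_normal(n)
X = 5 * np.outer(u, v) + rng.standard_normal((n, p))
X -= X.mean(axis=0)
Y = u + 0.1 * rng.standard_normal(n); Y -= Y.mean()
beta_hat = aimer(X, Y, tau=0.2, b=0.01, d=1)
```

Note that the SVD is taken of the p × |A| matrix N, so the remaining genes contribute to the estimated eigenvectors even though only the screened genes define the small dimension.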

3 Experimental analysis

To examine the performance of our method, we set up a number of carefully constructed simulations under various conditions. We first discuss the generic data model we assume, a latent factor model, which is amenable to analysis via SPC or AIMER.

3.1 Data model

Consider the multivariate Gaussian linear regression model

    y = x^T β* + ϵ,   (4)

with y the response, x a column vector of p gene expression measurements, β* the coefficients, ϵ a random Gaussian distributed error with zero mean and variance 1, and ϵ independent of x. We further assume that x has a Gaussian distribution with mean vector 0 and covariance matrix Σ. We will assume that β* is sparse, in that most of its elements are exactly 0, indicating no linear relationship between the associated gene and the response. Finally, the design matrix X and the response vector Y include n independent observations of x and y, respectively.

Model for X. As Σ is symmetric and positive (semi-) definite, we can decompose it as Σ = Σ_{k=1}^p λ_k v_k v_k^T, where v_1, …, v_p are orthonormal eigenvectors and λ_1 ≥ … ≥ λ_p ≥ 0 are eigenvalues. We assume that there is some G ≪ p such that the eigenvalues can be separated into two groups, one of which includes relatively large eigenvalues and the other relatively small eigenvalues; that is, λ_k = 1 + θ_k for k ≤ G and λ_k = 1 for k > G, where θ_1 ≥ … ≥ θ_G > 0. Then, because X is multivariate Gaussian, we can write X as

    X = Σ_{k=1}^G θ_k^{1/2} U_k v_k^T + E,

where the latent factors U_1, …, U_G are independent and identically distributed (i.i.d.) N(0, I_n) vectors, and the noise matrix E is n × p with i.i.d. N(0, 1) entries independent of the U_k.

Model for Y. We assume that Y is a linear function of the first K ≤ G latent factors in X plus additive Gaussian noise:

    Y = Σ_{k=1}^K θ_k^L α_k U_k + σZ,

where α is the coefficient vector, L is a constant, and Z is distributed N(0, I_n), independent of X. Note that the expectation of Y is zero and that this is a specific form of (4).

Implication of the model. Under this model for X and Y, the population marginal covariance between each gene x_j and the response y can be written as

    Cov(x_j, y) = Σ_{k=1}^K α_k θ_k^{L + 1/2} v_{kj}.   (5)

Therefore, the population ordinary least squares coefficients of regressing Y on X (β* in (4)) can be written in terms of the same quantities. We will define the set of genes with non-zero marginal covariance and the set of genes with non-zero regression coefficients. We note that for K = 1, it is always the case that the second set is contained in the first. By manipulating the parameters θ, L and α, we can create a number of scenarios for testing AIMER against alternative methods.
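As a concrete illustration, the following snippet simulates from a simplified version of the latent factor model above (a spiked covariance with G = 3 factors, a response driven by the first K = 2 factors, and the constant L set to zero for simplicity). All constants are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, G, K = 200, 1000, 3, 2
theta = np.array([40.0, 20.0, 10.0])       # spiked eigenvalue parameters

# Orthonormal eigenvectors with disjoint supports of five genes each.
V = np.zeros((p, G))
for k in range(G):
    V[5 * k:5 * (k + 1), k] = 1 / np.sqrt(5)

U = rng.standard_normal((n, G))            # latent factors
E = rng.standard_normal((n, p))            # isotropic noise
X = U * np.sqrt(theta) @ V.T + E           # X = sum_k sqrt(theta_k) U_k v_k^T + E

alpha = np.array([1.0, -1.0])              # response loads on the first K factors
Y = U[:, :K] @ alpha + 0.5 * rng.standard_normal(n)
```

Only 15 genes carry factor signal here, and only the first 10 (those in the first K factors) have non-zero marginal covariance with Y, mirroring the distinction the model is designed to probe.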

3.2 Experiments

We present results under five different experiments. For each of the simulations which follow, we generate datasets with n = 200 and p = 1000. We use half (n = 100) to estimate the model and test our predictions on the other half. We repeat this process 100 times for each combination of parameters. Throughout, we use fixed values of the remaining model parameters. The matrix U is generated with i.i.d. standard Gaussian entries, while the matrix V is constructed by hand to have the correct number of orthogonal components. The first experiment is designed to be favorable to AIMER. The second is designed to be favorable to SPC. The third examines the extent to which the condition in (1) is beneficial to SPC over AIMER. The fourth examines the impact of using incorrect numbers of components, while the fifth uses cross-validation on all the tuning parameters.

Simulation 1: Favorable conditions for AIMER. In this simulation, we create data which is amenable to AIMER at the expense of the conditions for SPC; that is, we violate the condition in (1). We use three factors in the data model and solve (5) for θ3 so that some corresponding elements of σ_xy will be zero. We make the first 15 elements of β* non-zero, five corresponding to each of the three principal components. Thus, the first 10 genes have non-zero population marginal correlation and the remaining 990 have zero marginal correlation. In this scenario, SPC should find the first 10 important genes, but AIMER will find the remaining five important genes as well. In order to focus on the relationship between performance and the condition in (1), we examine the methods for a fixed computational budget and choose τ to select the same 50 most predictive genes. We examine SPC, SPC with lasso, AIMER(b = 0) and AIMER. We use the first three principal components for regression in all the methods. For SPC with lasso and AIMER, we choose the remaining tuning parameters via 10-fold cross-validation. We also give results for OLS on the first 15 genes. This is the oracle estimator, the best one could hope to do with foreknowledge of the predictive genes. Figure 2 shows the classification performance using a receiver operating characteristic (ROC) curve for SPC with lasso and AIMER in the left panel (the remaining panels are for the next two simulations). Examining the figure, it is easy to see that SPC + lasso identifies the first 10 genes easily, but AIMER is able to capture all 15 predictive genes at a low cost of false positive identifications. A more detailed analysis is given in the first row of Figure 3. Panel 1a shows the ability of each method to estimate the coefficients of the three different factors. Coefficient estimates for the five genes in factor 1 by AIMER are slightly more accurate, and no more variable, than those of SPC + lasso. Furthermore, AIMER is better at estimating those β's associated with factor 2, and much better at those associated with factor 3 (these are assumed zero in SPC). Panel 1b examines the mean squared error (MSE) of estimation, the average squared difference between the true coefficients and their estimates for all 1000 genes. The overall estimation accuracy of AIMER(b = 0) is worse because of the inclusion of so many useless genes (it estimates all 1000); however, by thresholding with AIMER, accuracy is improved and exceeds that of SPC with and without lasso. In panel 1c, we show the MSE for prediction, the average squared difference between predicted values and the actual observations, for a test set. This MSE is smaller for AIMER than for SPC much of the time, but the variance across simulations is large.
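The gene-selection ROC points summarized in Figure 2 can be computed by sweeping a threshold over the estimated coefficient magnitudes and comparing the selected set to the known support; a minimal sketch, with made-up numbers, is:

```python
import numpy as np

def roc_points(beta_hat, support, thresholds):
    """(FPR, TPR) pairs for gene selection at each magnitude threshold."""
    truth = np.zeros(len(beta_hat), dtype=bool)
    truth[np.asarray(support)] = True
    pts = []
    for thr in thresholds:
        sel = np.abs(beta_hat) > thr
        tpr = (sel & truth).sum() / truth.sum()       # true positive rate
        fpr = (sel & ~truth).sum() / (~truth).sum()   # false positive rate
        pts.append((fpr, tpr))
    return pts

beta_hat = np.array([0.9, 0.5, 0.0, 0.1, 0.0])   # illustrative estimates
pts = roc_points(beta_hat, support=[0, 1], thresholds=[0.05, 0.3, 0.7])
```

Averaging such curves across the 100 replications, as the figure does, smooths out the replication-to-replication variability in which genes clear each threshold.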
Fig. 2

Receiver operating characteristic (ROC) Curve for Simulations 1–3. The x-axis is the false positive rate while the y-axis is the true positive rate. The curves present averages across 100 replications. SPC is limited to only 50 selected genes, and so its false positive rate is bounded. The dashed line indicates its best case theoretical performance were it allowed to continue to select further genes

Fig. 3

Estimation and prediction performance of SPC and AIMER in the first three simulations. The left panel shows the estimates of the regression coefficients, the middle panel shows the mean squared error (MSE) of estimation for all 1000 genes, and the right panel shows prediction MSE on the held-out data. The boxes indicate variability across 100 replications. The dashed black horizontal lines indicate the true values of β

Simulation 2: Favorable conditions for SPC. This simulation compares the performance of SPC and AIMER under conditions which are more favorable to SPC. In particular, we choose parameters such that the condition in (1) holds. While AIMER is likely to perform worse because it will tend to include irrelevant genes, it is not too much worse. Most of the parameters are the same as in Simulation 1, except that only two factors drive the response, and we use the first two principal components to do regression. Therefore, 10 out of 1000 genes are truly predictive of the response, and all 10 have non-zero marginal correlation with the response (the rest have zero marginal correlation). Looking again at Figure 2, both SPC + lasso and AIMER can identify all 10 predictive genes at a small price in false positives. Examining Figure 3, we see that the estimation accuracy of SPC/SPC + lasso is better than that of AIMER, as expected, and the MSE of prediction for AIMER is about twice that of SPC/SPC + lasso. The estimation MSE (panel 2b) of AIMER is comparable to that of SPC.

Simulation 3: Slight perturbations. In this simulation, we make only a slight adjustment, shrinking one coefficient rather than setting it to 1 as in Simulation 2, thereby maintaining the condition in (1).
However, in this case AIMER works much better than SPC/SPC + lasso. Figures 2 and 3 show that AIMER can easily identify all the predictive genes, has more precise coefficient estimates, and has much smaller MSE for prediction. The reason is that, even though the condition in (1) holds, the marginal correlations for some predictive genes are very small. Therefore, those genes are more difficult for SPC to identify, but AIMER can compensate. For one further comparison, Table 2 shows the average (standard deviation in parentheses) number of predictive genes selected in each of the first three simulations. AIMER selects the smallest number of coefficients in most cases.
Table 2

Average final number of predictive genes in Simulations 1, 2 and 3

Simulation       1            2            3
True #           15           10           10
SPC              50 (0)       50 (0)       50 (0)
SPC + lasso      31 (9.011)   39 (3.636)   46 (2.665)
AIMER (b = 0)    1000 (0)     1000 (0)     1000 (0)
AIMER            39 (9.225)   21 (12.750)  16 (7.558)

Note: The standard deviation is shown in parentheses.

Simulation 4: Choosing the number of components. In the previous simulations, we used the correct number of principal components, though such a choice is unlikely to be possible given real data. In this simulation, we examine the impact that choosing the number of components has on estimation accuracy. We use similar parameter settings as in Simulation 1, except with two components driving the response rather than three (we maintain the same condition on the marginal correlations as in Simulation 1). We then use all the methods with 1, 2 and 3 components. We also adjust the values of λ1 in a range from 5 to 50. As we can see in Figure 4, using two components reduces MSE for AIMER(b = 0) and AIMER across all values of λ1 relative to using only one component, while using more than two components has little impact. With only one component, SPC performs better than AIMER, likely due to smaller variance for a similar bias, but using two or three components leads to large gains for AIMER. In practice, it is worthwhile to try several numbers of components and use cross-validation to decide which works best.
Fig. 4

Prediction MSE averaged across 100 replications for each method for different numbers of components (Simulation 4). We also allow λ1 to vary between 5 and 50

Simulation 5: The screening threshold. In the previous simulations, we chose τ so that variable screening by the marginal correlation would always select exactly 50 genes. Thus, we could compare methods based on their ability to use the same amount of information. In reality, it may be better to choose the threshold using cross-validation. In this simulation, we use the same conditions as in the previous simulation. It is still not appropriate to have more genes than patients, so we allow the number of selected genes to be anything less than the number of patients (100). We further use 10-fold cross-validation to choose the best threshold τ. As shown in Figure 5, allowing τ to be chosen rather than fixed leads to improved results for AIMER relative to SPC/SPC + lasso. The prediction MSE decreases and fewer genes are selected.
Fig. 5

Performance of each method when we allow τ to be chosen by cross-validation rather than fixed to choose 50 genes (Simulation 5)


4 Performance on real data

We now illustrate our methods on four empirical datasets in genomics that record the censored survival time and gene expression measurements from DNA microarrays of patients with four different types of cancer. The first dataset comes from Rosenwald et al. and contains 240 patients with diffuse large B-cell lymphoma (DLBCL) and 7399 genes. The second dataset has 4751 gene expression measurements on 78 breast cancer patients (Van't Veer et al.). The third consists of 86 lung cancer patients measured on 7129 genes (Beer et al.), and finally, we analyze a dataset consisting of 116 patients with acute myeloid leukemia (AML; Bullinger et al.) and 6283 genes. Since the survival times for some patients are censored and the times are right-skewed, we use the log survival time as the response. A Cox model would be more appropriate, but this transformation is enough to illustrate our methodology. In order to assess our method using limited data, we randomly select half of the data as the training set and let the rest be the testing set, then estimate each model using the training half and predict the held-out data. We repeat this procedure for 10 random splits and report the average error. We use 10-fold cross-validation on the training set to choose all tuning parameters (τ, b, d and λ where appropriate), mimicking the procedure of a real data analysis. We apply seven methods to each dataset: (i) PCR; (ii) the lasso; (iii) ridge regression; (iv) SPC; (v) SPC + lasso; (vi) AIMER(b = 0) and (vii) AIMER. We use the R packages pls (Mevik and Wehrens, 2007) to perform PCR and glmnet (Friedman et al.) to perform the lasso and ridge. For PCR, SPC, SPC + lasso, AIMER(b = 0) and AIMER, we allow the number of components d to be chosen between 1 and 5. Our results are shown in Table 3. For each dataset, we show the MSE on the testing set, the number of selected genes, and the number of principal components used (if relevant), averaged across the 10 random training-testing splits. We do not show results for PCR because it is uniformly awful.
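The split-and-average evaluation protocol described above can be sketched as follows, with ridge regression standing in for the seven methods; all names and constants are illustrative:

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Ridge coefficients via the normal equations (stand-in method)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def split_mse(X, Y, lam, n_splits=10, seed=0):
    """Average test MSE over repeated random half-splits."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        tr, te = idx[: n // 2], idx[n // 2:]
        beta = ridge_fit(X[tr], Y[tr], lam)
        errs.append(np.mean((Y[te] - X[te] @ beta) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(5)
n, p = 120, 300
X = rng.standard_normal((n, p))
Y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)
mse = split_mse(X, Y, lam=10.0)
```

In the actual analysis, the tuning parameters would additionally be chosen by 10-fold cross-validation within each training half, so that the test half is never used for model selection.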
The results in Table 3 are largely consistent with the conclusions we derive from simulations. AIMER and SPC + lasso tend to select a similar number of genes, though AIMER has better prediction error on three of the four datasets. Interestingly, the genes selected by SPC + lasso, lasso and AIMER rarely overlap, suggesting that to identify genes for further study, one should try all three methods. The online Supplementary Material lists the genes identified by AIMER for each dataset. In the case of DLBCL, we also list any previous research relating the selected genes to lymphoma.
Table 3

The MSE on the test set, the number of selected genes, and the number of principal components used (d, if relevant), each averaged across the 10 random training-testing splits

Method        |   DLBCL           |  Breast cancer    |  Lung cancer      |  AML
              | MSE     genes  d  | MSE     genes  d  | MSE     genes  d  | MSE      genes  d
lasso         | 0.6805     20  -  | 0.6285*     9  -  | 0.8159     22  -  | 1.9564       6  -
ridge         | 0.6485*  7399  -  | 0.6407   4751  -  | 0.7713*  7129  -  | 1.9234*   6283  -
SPC           | 0.6828     41  3  | 0.6066     16  2  | 0.8344*    19  3  | 2.4214      24  2
SPC+lasso     | 0.6780     31  3  | 0.6029     14  2  | 0.8436      9  4  | 2.3980      22  2
AIMER(b = 0)  | 1.1896   7399  2  | 2.6531   4751  1  | 0.9444   7129  1  | 12.4014   6283  1
AIMER         | 0.6518*    28  4  | 0.6004*    31  3  | 1.0203     13  1  | 1.8746*     36  4

Note: Starred values indicate the best predictive performance for each type of method (with and without structure learning) for each dataset.

The lung cancer data are rather odd in that AIMER(b = 0) outperforms AIMER. This anomaly likely arises because, in contrast with the other datasets, the lung cancer expression measurements were not scaled relative to a control group. We tried two transformations using only the treatment group to approximate such a scaling; while the performance of our method becomes comparable to SPC following these transformations, it remains slightly worse. Without a control group, it is difficult to explain this outcome with any certainty. A comparison of these alternative transformations with our results in Table 3 is contained in the online Supplementary Material. As the table shows, ridge regression is sometimes the best of all the methods. Previous experience suggests that ridge regression dominates when the genes are highly correlated or when there is no particularly predictive subset of genes. However, the fact that ridge does not screen out unimportant genes is a barrier to its application in genomics. AIMER, on the other hand, approaches or exceeds the small prediction error of ridge regression while also selecting a small number of predictive genes, making it a better candidate for solving these types of problems.
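The screen-then-embed idea underlying SPC and AIMER can be sketched compactly (a simplified illustration on synthetic data: marginal screening followed by eigenvector regression. This is our own code, omits the amplification step, and is not the authors' implementation):

```python
# Keep the q genes most marginally correlated with the response, take a
# d-dimensional eigenvector (PCA) embedding of that submatrix, and
# regress the response on the embedding.
import numpy as np

def screened_eigen_regression(X, y, q=50, d=3):
    """Marginal screening + rank-d PCA embedding + least squares."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # marginal (univariate) correlation of each gene with the response
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    keep = np.argsort(-np.abs(corr))[:q]           # top-q screened genes
    U, s, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    Z = U[:, :d] * s[:d]                           # d principal components
    beta, *_ = np.linalg.lstsq(Z, yc, rcond=None)  # regress y on components
    return keep, Z @ beta + y.mean()

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 1000))                # 80 patients, 1000 genes
y = X[:, :10].sum(axis=1) + 0.3 * rng.standard_normal(80)
keep, yhat = screened_eigen_regression(X, y, q=40, d=2)
print("in-sample MSE:", np.mean((yhat - y) ** 2))
```

Screening before the eigendecomposition is what keeps the computation tractable: the SVD is taken on an n x q submatrix rather than the full n x p expression matrix.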

5 Discussion

High-dimensional regression methods help predict future survival times and identify potentially predictive genes for disease. However, the large number of genes, the limited access to patients, and the complex covariance structure among genes make the problem both computationally and statistically difficult. In both simulations and analyses of actual gene expression datasets, AIMER has comparable or slightly better prediction accuracy relative to existing methods and finds small numbers of genuinely predictive genes, all while having a similar computational burden. On the other hand, some issues warrant further exploration. A major benefit of SPC is that it comes with theoretical guarantees under certain assumptions. While our methodology is intended to work when these assumptions do not hold, we do not yet have comparable guarantees; however, the simulated experiments in this paper suggest how we might derive such results in a more general setting. For the real data examples in this paper, we applied a simple monotonic transformation to the response variable; extending our methods to Cox models, which are more appropriate, and to other generalized linear models for predicting discrete traits is highly desirable. It may also be useful to examine other eigenstructure techniques, such as Locally Linear Embeddings or Laplacian Eigenmaps, to produce non-linear predictors. Finally, other matrix approximation techniques may yield improved performance or be more amenable to theoretical analysis.
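The nonlinear-embedding extension floated above might look like the following (a hypothetical sketch assuming scikit-learn's manifold-learning implementations; the regress-on-embedding pipeline is our own illustration, not part of the paper):

```python
# Replace the linear eigenvector embedding with Locally Linear Embedding
# or a spectral (Laplacian eigenmap) embedding of the screened genes,
# then regress the response on the low-dimensional coordinates.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding, SpectralEmbedding
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 200))        # patients x (screened) genes
y = np.tanh(X[:, 0]) + 0.1 * rng.standard_normal(100)

for Embed in (LocallyLinearEmbedding, SpectralEmbedding):
    Z = Embed(n_components=3).fit_transform(X)   # nonlinear 3-d embedding
    mse = np.mean((LinearRegression().fit(Z, y).predict(Z) - y) ** 2)
    print(Embed.__name__, "in-sample MSE:", round(mse, 3))
```

Whether such embeddings help depends on the geometry of the expression data; unlike the SVD-based embedding, they do not come with an out-of-sample extension for free.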

Funding

This work is supported by the National Science Foundation [grant number DMS-14-07439 to D.J.M.]. Conflict of Interest: none declared.
References (20 in total)

1.  Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease.

Authors:  Jeffrey C Barrett; Sarah Hansoul; Dan L Nicolae; Judy H Cho; Richard H Duerr; John D Rioux; Steven R Brant; Mark S Silverberg; Kent D Taylor; M Michael Barmada; Alain Bitton; Themistocles Dassopoulos; Lisa Wu Datta; Todd Green; Anne M Griffiths; Emily O Kistner; Michael T Murtha; Miguel D Regueiro; Jerome I Rotter; L Philip Schumm; A Hillary Steinhart; Stephan R Targan; Ramnik J Xavier; Cécile Libioulle; Cynthia Sandor; Mark Lathrop; Jacques Belaiche; Olivier Dewit; Ivo Gut; Simon Heath; Debby Laukens; Myriam Mni; Paul Rutgeerts; André Van Gossum; Diana Zelenika; Denis Franchimont; Jean-Pierre Hugot; Martine de Vos; Severine Vermeire; Edouard Louis; Lon R Cardon; Carl A Anderson; Hazel Drummond; Elaine Nimmo; Tariq Ahmad; Natalie J Prescott; Clive M Onnie; Sheila A Fisher; Jonathan Marchini; Jilur Ghori; Suzannah Bumpstead; Rhian Gwilliam; Mark Tremelling; Panos Deloukas; John Mansfield; Derek Jewell; Jack Satsangi; Christopher G Mathew; Miles Parkes; Michel Georges; Mark J Daly
Journal:  Nat Genet       Date:  2008-06-29       Impact factor: 38.330

2.  Gene expression profiling predicts clinical outcome of breast cancer.

Authors:  Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal:  Nature       Date:  2002-01-31       Impact factor: 49.962

3.  Pharmacogenomic Study of Clozapine-Induced Agranulocytosis/Granulocytopenia in a Japanese Population.

Authors:  Takeo Saito; Masashi Ikeda; Taisei Mushiroda; Takeshi Ozeki; Kenji Kondo; Ayu Shimasaki; Kohei Kawase; Shuji Hashimoto; Hidenaga Yamamori; Yuka Yasuda; Michiko Fujimoto; Kazutaka Ohi; Masatoshi Takeda; Yoichiro Kamatani; Shusuke Numata; Tetsuro Ohmori; Shu-Ichi Ueno; Manabu Makinodan; Yosuke Nishihata; Masaharu Kubota; Takemi Kimura; Nobuhisa Kanahara; Naoki Hashimoto; Kiyoshi Fujita; Kiyotaka Nemoto; Taku Fukao; Taro Suwa; Tetsuro Noda; Yuji Yada; Manabu Takaki; Naoya Kida; Taku Otsuru; Masaru Murakami; Atsushi Takahashi; Michiaki Kubo; Ryota Hashimoto; Nakao Iwata
Journal:  Biol Psychiatry       Date:  2016-02-11       Impact factor: 13.382

4.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

5.  Gene-expression profiles predict survival of patients with lung adenocarcinoma.

Authors:  David G Beer; Sharon L R Kardia; Chiang-Ching Huang; Thomas J Giordano; Albert M Levin; David E Misek; Lin Lin; Guoan Chen; Tarek G Gharib; Dafydd G Thomas; Michelle L Lizyness; Rork Kuick; Satoru Hayasaka; Jeremy M G Taylor; Mark D Iannettoni; Mark B Orringer; Samir Hanash
Journal:  Nat Med       Date:  2002-07-15       Impact factor: 53.440

6.  Thirty new loci for age at menarche identified by a meta-analysis of genome-wide association studies.

Authors:  Cathy E Elks; John R B Perry; Patrick Sulem; Daniel I Chasman; Nora Franceschini; Chunyan He; Kathryn L Lunetta; Jenny A Visser; Enda M Byrne; Diana L Cousminer; Daniel F Gudbjartsson; Tõnu Esko; Bjarke Feenstra; Jouke-Jan Hottenga; Daniel L Koller; Zoltán Kutalik; Peng Lin; Massimo Mangino; Mara Marongiu; Patrick F McArdle; Albert V Smith; Lisette Stolk; Sophie H van Wingerden; Jing Hua Zhao; Eva Albrecht; Tanguy Corre; Erik Ingelsson; Caroline Hayward; Patrik K E Magnusson; Erin N Smith; Shelia Ulivi; Nicole M Warrington; Lina Zgaga; Helen Alavere; Najaf Amin; Thor Aspelund; Stefania Bandinelli; Inês Barroso; Gerald S Berenson; Sven Bergmann; Hannah Blackburn; Eric Boerwinkle; Julie E Buring; Fabio Busonero; Harry Campbell; Stephen J Chanock; Wei Chen; Marilyn C Cornelis; David Couper; Andrea D Coviello; Pio d'Adamo; Ulf de Faire; Eco J C de Geus; Panos Deloukas; Angela Döring; George Davey Smith; Douglas F Easton; Gudny Eiriksdottir; Valur Emilsson; Johan Eriksson; Luigi Ferrucci; Aaron R Folsom; Tatiana Foroud; Melissa Garcia; Paolo Gasparini; Frank Geller; Christian Gieger; Vilmundur Gudnason; Per Hall; Susan E Hankinson; Liana Ferreli; Andrew C Heath; Dena G Hernandez; Albert Hofman; Frank B Hu; Thomas Illig; Marjo-Riitta Järvelin; Andrew D Johnson; David Karasik; Kay-Tee Khaw; Douglas P Kiel; Tuomas O Kilpeläinen; Ivana Kolcic; Peter Kraft; Lenore J Launer; Joop S E Laven; Shengxu Li; Jianjun Liu; Daniel Levy; Nicholas G Martin; Wendy L McArdle; Mads Melbye; Vincent Mooser; Jeffrey C Murray; Sarah S Murray; Michael A Nalls; Pau Navarro; Mari Nelis; Andrew R Ness; Kate Northstone; Ben A Oostra; Munro Peacock; Lyle J Palmer; Aarno Palotie; Guillaume Paré; Alex N Parker; Nancy L Pedersen; Leena Peltonen; Craig E Pennell; Paul Pharoah; Ozren Polasek; Andrew S Plump; Anneli Pouta; Eleonora Porcu; Thorunn Rafnar; John P Rice; Susan M Ring; Fernando Rivadeneira; Igor Rudan; Cinzia Sala; Veikko Salomaa; Serena Sanna; David Schlessinger; Nicholas J Schork; 
Angelo Scuteri; Ayellet V Segrè; Alan R Shuldiner; Nicole Soranzo; Ulla Sovio; Sathanur R Srinivasan; David P Strachan; Mar-Liis Tammesoo; Emmi Tikkanen; Daniela Toniolo; Kim Tsui; Laufey Tryggvadottir; Jonathon Tyrer; Manuela Uda; Rob M van Dam; Joyce B J van Meurs; Peter Vollenweider; Gerard Waeber; Nicholas J Wareham; Dawn M Waterworth; Michael N Weedon; H Erich Wichmann; Gonneke Willemsen; James F Wilson; Alan F Wright; Lauren Young; Guangju Zhai; Wei Vivian Zhuang; Laura J Bierut; Dorret I Boomsma; Heather A Boyd; Laura Crisponi; Ellen W Demerath; Cornelia M van Duijn; Michael J Econs; Tamara B Harris; David J Hunter; Ruth J F Loos; Andres Metspalu; Grant W Montgomery; Paul M Ridker; Tim D Spector; Elizabeth A Streeten; Kari Stefansson; Unnur Thorsteinsdottir; André G Uitterlinden; Elisabeth Widen; Joanne M Murabito; Ken K Ong; Anna Murray
Journal:  Nat Genet       Date:  2010-12       Impact factor: 38.330

7.  Semi-supervised methods to predict patient survival from gene expression data.

Authors:  Eric Bair; Robert Tibshirani
Journal:  PLoS Biol       Date:  2004-04-13       Impact factor: 8.029

8.  Supervised harvesting of expression trees.

Authors:  T Hastie; R Tibshirani; D Botstein; P Brown
Journal:  Genome Biol       Date:  2001-01-10       Impact factor: 13.583

9.  Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche.

Authors:  John Rb Perry; Felix Day; Cathy E Elks; Patrick Sulem; Kari Stefansson; Joanne M Murabito; Ken K Ong; Deborah J Thompson; Teresa Ferreira; Chunyan He; Daniel I Chasman; Tõnu Esko; Gudmar Thorleifsson; Eva Albrecht; Wei Q Ang; Tanguy Corre; Diana L Cousminer; Bjarke Feenstra; Nora Franceschini; Andrea Ganna; Andrew D Johnson; Sanela Kjellqvist; Kathryn L Lunetta; George McMahon; Ilja M Nolte; Lavinia Paternoster; Eleonora Porcu; Albert V Smith; Lisette Stolk; Alexander Teumer; Natalia Tšernikova; Emmi Tikkanen; Sheila Ulivi; Erin K Wagner; Najaf Amin; Laura J Bierut; Enda M Byrne; Jouke-Jan Hottenga; Daniel L Koller; Massimo Mangino; Tune H Pers; Laura M Yerges-Armstrong; Jing Hua Zhao; Irene L Andrulis; Hoda Anton-Culver; Femke Atsma; Stefania Bandinelli; Matthias W Beckmann; Javier Benitez; Carl Blomqvist; Stig E Bojesen; Manjeet K Bolla; Bernardo Bonanni; Hiltrud Brauch; Hermann Brenner; Julie E Buring; Jenny Chang-Claude; Stephen Chanock; Jinhui Chen; Georgia Chenevix-Trench; J Margriet Collée; Fergus J Couch; David Couper; Andrea D Coveillo; Angela Cox; Kamila Czene; Adamo Pio D'adamo; George Davey Smith; Immaculata De Vivo; Ellen W Demerath; Joe Dennis; Peter Devilee; Aida K Dieffenbach; Alison M Dunning; Gudny Eiriksdottir; Johan G Eriksson; Peter A Fasching; Luigi Ferrucci; Dieter Flesch-Janys; Henrik Flyger; Tatiana Foroud; Lude Franke; Melissa E Garcia; Montserrat García-Closas; Frank Geller; Eco Ej de Geus; Graham G Giles; Daniel F Gudbjartsson; Vilmundur Gudnason; Pascal Guénel; Suiqun Guo; Per Hall; Ute Hamann; Robin Haring; Catharina A Hartman; Andrew C Heath; Albert Hofman; Maartje J Hooning; John L Hopper; Frank B Hu; David J Hunter; David Karasik; Douglas P Kiel; Julia A Knight; Veli-Matti Kosma; Zoltan Kutalik; Sandra Lai; Diether Lambrechts; Annika Lindblom; Reedik Mägi; Patrik K Magnusson; Arto Mannermaa; Nicholas G Martin; Gisli Masson; Patrick F McArdle; Wendy L McArdle; Mads Melbye; Kyriaki Michailidou; Evelin Mihailov; Lili Milani; 
Roger L Milne; Heli Nevanlinna; Patrick Neven; Ellen A Nohr; Albertine J Oldehinkel; Ben A Oostra; Aarno Palotie; Munro Peacock; Nancy L Pedersen; Paolo Peterlongo; Julian Peto; Paul Dp Pharoah; Dirkje S Postma; Anneli Pouta; Katri Pylkäs; Paolo Radice; Susan Ring; Fernando Rivadeneira; Antonietta Robino; Lynda M Rose; Anja Rudolph; Veikko Salomaa; Serena Sanna; David Schlessinger; Marjanka K Schmidt; Mellissa C Southey; Ulla Sovio; Meir J Stampfer; Doris Stöckl; Anna M Storniolo; Nicholas J Timpson; Jonathan Tyrer; Jenny A Visser; Peter Vollenweider; Henry Völzke; Gerard Waeber; Melanie Waldenberger; Henri Wallaschofski; Qin Wang; Gonneke Willemsen; Robert Winqvist; Bruce Hr Wolffenbuttel; Margaret J Wright; Dorret I Boomsma; Michael J Econs; Kay-Tee Khaw; Ruth Jf Loos; Mark I McCarthy; Grant W Montgomery; John P Rice; Elizabeth A Streeten; Unnur Thorsteinsdottir; Cornelia M van Duijn; Behrooz Z Alizadeh; Sven Bergmann; Eric Boerwinkle; Heather A Boyd; Laura Crisponi; Paolo Gasparini; Christian Gieger; Tamara B Harris; Erik Ingelsson; Marjo-Riitta Järvelin; Peter Kraft; Debbie Lawlor; Andres Metspalu; Craig E Pennell; Paul M Ridker; Harold Snieder; Thorkild Ia Sørensen; Tim D Spector; David P Strachan; André G Uitterlinden; Nicholas J Wareham; Elisabeth Widen; Marek Zygmunt; Anna Murray; Douglas F Easton
Journal:  Nature       Date:  2014-07-23       Impact factor: 49.962

10.  Genetic variants associated with motion sickness point to roles for inner ear development, neurological processes and glucose homeostasis.

Authors:  Bethann S Hromatka; Joyce Y Tung; Amy K Kiefer; Chuong B Do; David A Hinds; Nicholas Eriksson
Journal:  Hum Mol Genet       Date:  2015-01-26       Impact factor: 6.150

