Yingdong Zhao, Richard Simon.
Abstract
There have been relatively few publications using linear regression models to predict a continuous response based on microarray expression profiles. Standard linear regression methods are problematic when the number of predictor variables exceeds the number of cases. We have evaluated three linear regression algorithms that can be used for the prediction of a continuous response based on high-dimensional gene expression data. The three algorithms are least angle regression (LAR), the least absolute shrinkage and selection operator (LASSO), and the averaged linear regression method (ALM). All methods were tested with simulations based on a real gene expression dataset and with analyses of two real gene expression datasets, in each case using an unbiased complete cross-validation approach. Our results show that the LASSO algorithm often provides a model with somewhat lower prediction error than the LAR method, but both of them perform more efficiently than the ALM predictor. We have developed a plug-in for BRB-ArrayTools that implements the LAR and the LASSO algorithms with complete cross-validation.
Keywords: continuous outcome; gene expression; regression model
Year: 2010 PMID: 20523915 PMCID: PMC2879606 DOI: 10.4137/cin.s3805
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Flow chart of complete cross validation.
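The flow chart itself is not reproduced here, so the following is only a minimal sketch of what complete cross validation is commonly taken to mean: every model-building step, including the choice of the penalty, is repeated inside each outer fold, so the held-out cases never influence the model that predicts them. The coordinate-descent LASSO solver, the penalty grid `lams`, the fold counts, and all function names are illustrative assumptions, not the procedure of the BRB-ArrayTools plug-in.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    return z - t if z > t else z + t if z < -t else 0.0

def fit_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    b0, beta = sum(y) / n, [0.0] * p
    resid = [yi - b0 for yi in y]
    col_ss = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
        shift = sum(resid) / n  # move any residual mean into the intercept
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, beta

def predict(model, row):
    b0, beta = model
    return b0 + sum(b * x for b, x in zip(beta, row))

def cv_mse(X, y, k, fit):
    """k-fold CV error; `fit(X_train, y_train)` must do ALL model building."""
    n = len(y)
    sq_err = 0.0
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        model = fit([X[i] for i in train], [y[i] for i in train])
        sq_err += sum((y[i] - predict(model, X[i])) ** 2 for i in test)
    return sq_err / n

def fit_with_tuning(X, y, lams=(0.01, 0.1, 0.5), inner_k=3):
    """Pick the penalty by an inner CV on the training cases only, then refit."""
    best = min(lams, key=lambda lam: cv_mse(X, y, inner_k,
                                            lambda A, b: fit_lasso(A, b, lam)))
    return fit_lasso(X, y, best)

def complete_cv_mse(X, y, outer_k=5):
    """Outer CV wrapped around the whole tuning-plus-fitting procedure."""
    return cv_mse(X, y, outer_k, fit_with_tuning)

# Hypothetical toy data: 30 cases, 8 predictors, two of them informative.
rng = random.Random(0)
X_demo = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(30)]
y_demo = [row[0] - row[1] + rng.gauss(0, 0.1) for row in X_demo]
demo_error = complete_cv_mse(X_demo, y_demo)
```

The design point is that `cv_mse` accepts the entire fitting procedure as a function, so tuning cannot leak information from the outer test fold into the model.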
Figure 2. Simulated data with no noise using LASSO. Part A shows the relationship between the cross-validated estimate of prediction error and model size for the corresponding models. The confidence bars are output by the R function ‘cv.lars’. The x-axis shows the fraction, that is, the ratio of the L1 norm of the coefficient vector to the L1 norm at the full LS solution for the model with the maximum number of steps used. Part B shows the relationship between the predicted and observed responses for the optimal model.
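As an illustration of the x-axis quantity described in the caption, the fraction can be computed as the ratio of two L1 norms. This is a small sketch only; the function names and the example coefficient vectors are hypothetical and are not taken from the paper.

```python
def l1_norm(coefs):
    """Sum of absolute coefficient values."""
    return sum(abs(c) for c in coefs)

def l1_fraction(coefs, full_coefs):
    """Ratio of the L1 norm of `coefs` to that of the full LS solution."""
    denom = l1_norm(full_coefs)
    if denom == 0:
        raise ValueError("full solution has zero L1 norm")
    return l1_norm(coefs) / denom

# Hypothetical coefficient vectors, for illustration only.
full_solution = [1.0, -1.0, 0.5]
shrunken = [0.5, -0.25, 0.0]
fraction = l1_fraction(shrunken, full_solution)
```

A fraction of 0 corresponds to the empty model and a fraction of 1 to the full solution along the path.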
List of coefficients of the variables (genes) in each model for the simulated data with no noise and with added noise. When generating the simulated continuous response data, we use 1 for the coefficients of the first five variables and −1 for the next five variables. For ALM, the selected and unselected genes in the final model are marked as “Yes” and “No”.
| Gene | True coefficient | No noise: LAR | No noise: LASSO | No noise: ALM | 1 SD noise: LAR | 1 SD noise: LASSO | 1 SD noise: ALM | 2 SD noise: LAR | 2 SD noise: LASSO | 2 SD noise: ALM |
|---|---|---|---|---|---|---|---|---|---|---|
| Gene 1 | 1 | 1 | 0.977 | No | 0.836 | 0.654 | No | 0.673 | 0.449 | No |
| Gene 2 | 1 | 1 | 0.991 | Yes | 0.833 | 0.886 | Yes | 0.870 | 0.948 | Yes |
| Gene 3 | 1 | 1 | 0.989 | Yes | 0.565 | 0.657 | No | 0.394 | 0.439 | No |
| Gene 4 | 1 | 1 | 0.993 | Yes | 0.926 | 0.857 | Yes | 0.641 | 0.572 | Yes |
| Gene 5 | 1 | 1 | 0.987 | Yes | 0.468 | 0.512 | Yes | 0.357 | 0.438 | Yes |
| Gene 6 | −1 | −1 | −0.979 | No | −0.243 | −0.453 | No | 0 | 0 | No |
| Gene 7 | −1 | −1 | −0.976 | No | 0 | 0 | No | 0 | 0 | No |
| Gene 8 | −1 | −1 | −0.989 | No | −1.139 | −0.931 | No | −0.715 | −1.060 | No |
| Gene 9 | −1 | −1 | −0.975 | No | 0 | −0.183 | No | 0 | 0 | No |
| Gene 10 | −1 | −1 | −0.987 | No | −1.035 | −1.016 | No | −1.009 | −1.100 | No |
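The "True coefficient" column above implies a simple generating model for the simulated response, which can be sketched as follows. The source does not spell out how the "1 SD" and "2 SD" noise was scaled; this sketch assumes Gaussian noise whose standard deviation is `noise_level` times the standard deviation of the noise-free response. That scaling, the function name, and the toy expression matrix are all assumptions.

```python
import random
import statistics

# True coefficients from the table: +1 for genes 1-5, -1 for genes 6-10.
TRUE_COEFS = [1.0] * 5 + [-1.0] * 5

def simulate_response(expr, noise_level=0.0, seed=0):
    """Build a continuous response from the first ten genes of each case.

    expr        : list of cases, each a list of expression values (>= 10 genes)
    noise_level : assumed multiple of the noise-free response SD (0, 1 or 2)
    """
    rng = random.Random(seed)
    # zip stops after ten terms, so genes beyond the tenth contribute nothing.
    signal = [sum(c * x for c, x in zip(TRUE_COEFS, row)) for row in expr]
    sd = statistics.pstdev(signal)
    return [s + rng.gauss(0.0, noise_level * sd) for s in signal]

# Hypothetical expression matrix: 20 cases, 15 genes.
_rng = random.Random(1)
expr_demo = [[_rng.gauss(0, 1) for _ in range(15)] for _ in range(20)]
y_clean = simulate_response(expr_demo, noise_level=0.0)
y_noisy = simulate_response(expr_demo, noise_level=1.0)
```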
Figure 3. Comparison of LAR, LASSO, and ALM on simulated and real data sets. Data Set 1: simulated data with no noise; Data Set 2: simulated data with 1 SD noise; Data Set 3: simulated data with 2 SD noise; Data Set 4: real data [22]. Part (A) shows the cross-validated global minimum estimated squared prediction errors. Part (B) shows the association between observed and predicted responses (R²).
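The two quantities plotted in this figure can be sketched as below. The caption does not define R² precisely; the sketch assumes it is the squared Pearson correlation between observed and cross-validated predicted responses. The function names and the example numbers are hypothetical.

```python
def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted responses."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """Squared Pearson correlation (an assumed reading of the caption's R^2)."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov * cov / (var_o * var_p)

# Hypothetical observed and cross-validated predicted values.
obs_demo = [1.0, 2.0, 3.0, 4.0]
pred_demo = [1.1, 1.9, 3.2, 3.8]
mse_demo = mean_squared_error(obs_demo, pred_demo)
r2_demo = r_squared(obs_demo, pred_demo)
```

Note that the squared correlation is insensitive to a systematic shift or rescaling of the predictions, which is one reason to report it alongside the squared prediction error rather than instead of it.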
Figure 4. Comparison of the performance of models including main effects only and models including nonlinear terms on the lung cancer cytotoxicity data set [22]. The x-axis is the strength of the two-way interaction added to the true response. The y-axis is the cross-validated estimate of prediction error. The line with squares represents the models with main effects only, while the line with triangles represents the models including main effects and interactions.