Abstract
To date, numerous genetic variants have been identified as associated with diverse phenotypic traits. However, identified associations generally explain only a small proportion of trait heritability, and the predictive power of models incorporating only known-associated variants has been small. Multiple regression is a popular framework in which to consider the joint effect of many genetic variants simultaneously. Ordinary multiple regression is seldom appropriate in the context of genetic data, due to the high dimensionality of the data and the correlation structure among the predictors. There has been a resurgence of interest in the use of penalised regression techniques to circumvent these difficulties. In this paper, we focus on ridge regression, a penalised regression approach that has been shown to offer good performance in multivariate prediction problems. One challenge in the application of ridge regression is the choice of the ridge parameter that controls the amount of shrinkage of the regression coefficients. We present a method to determine the ridge parameter based on the data, with the aim of good performance in high-dimensional prediction problems. We establish a theoretical justification for our approach, and demonstrate its performance on simulated genetic data and on a real data example. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously presents computational challenges. We have developed an R package, ridge, which addresses these issues. The package implements the automatic choice of ridge parameter presented in this paper, and is freely available from CRAN.
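To illustrate how the ridge parameter controls shrinkage, here is a minimal sketch in pure Python for two correlated, centred predictors. It is not the ridge package and not the authors' method for choosing the parameter: the function name `ridge_two_predictors`, the symbol `k` for the ridge parameter, and the toy data are all assumptions made for illustration. It solves the standard ridge system (X'X + kI)b = X'y in closed form.

```python
def ridge_two_predictors(x1, x2, y, k):
    """Ridge estimate for two centred predictors (hypothetical helper).

    Solves (X'X + k*I) b = X'y for the 2x2 case, where k >= 0 is the
    ridge parameter. k = 0 gives ordinary least squares.
    """
    # Entries of X'X, with k added on the diagonal.
    a = sum(u * u for u in x1) + k
    d = sum(v * v for v in x2) + k
    c = sum(u * v for u, v in zip(x1, x2))
    # Entries of X'y.
    g1 = sum(u * t for u, t in zip(x1, y))
    g2 = sum(v * t for v, t in zip(x2, y))
    # Invert the symmetric 2x2 matrix [[a, c], [c, d]] directly.
    det = a * d - c * c
    return ((d * g1 - c * g2) / det, (a * g2 - c * g1) / det)


def squared_norm(b):
    """Sum of squared coefficients, used to measure shrinkage."""
    return sum(v * v for v in b)


# Toy data (illustrative only): two highly correlated predictors.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.1, 0.9, 2.0]
y = [-4.2, -1.8, 0.1, 2.1, 3.8]

for k in (0.0, 1.0, 10.0):
    b = ridge_two_predictors(x1, x2, y, k)
    print(k, b, squared_norm(b))
```

Larger values of `k` pull the coefficient vector towards zero. The difficulty the abstract describes is choosing how large `k` should be.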
Keywords: GWAS; penalised regression; prediction; shrinkage methods; software
Year: 2013 PMID: 23893343 PMCID: PMC4377081 DOI: 10.1002/gepi.21750
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
Prediction squared error (PSE) or classification squared error (CSE) in out-of-sample prediction using r with different proportions of variance explained
| Outcome | 10 | 50 | 70 | 90 | MAX | CV | DF |
|---|---|---|---|---|---|---|---|
| Continuous outcomes (mean PSE) | 1.24 | 1.23 | 1.23 | 1.27 | 3.20 | 1.24 | 1.23 |
| (sd) | (0.06) | (0.05) | (0.05) | (0.06) | (0.87) | (0.05) | (0.05) |
| Binary outcomes (mean CE) | 0.46 | 0.47 | 0.47 | 0.47 | 0.47 | 0.47 | 0.46 |
| (sd) | (0.03) | (0.03) | (0.03) | (0.04) | (0.03) | (0.03) | (0.03) |
10, 50, 70, and 90: proportion of variance explained (%) by the PCs used to compute the RR parameter; MAX, r is the maximum number of PCs where the corresponding eigenvalues are non-zero; CV, r chosen using cross-validated PRESS statistic; DF, r chosen based on degrees of freedom (see main text); SD, standard deviation.
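The choices of r by proportion of variance explained and the MAX rule can be sketched as follows. This is an illustrative sketch only: the function names, the tolerance `tol` used to decide that an eigenvalue is non-zero, and the example eigenvalues are assumptions, not taken from the paper or the ridge package.

```python
def n_components_for_variance(eigenvalues, proportion, tol=1e-12):
    """Smallest r such that the first r PCs explain at least `proportion`
    (a fraction in [0, 1]) of the total variance. Hypothetical helper."""
    # Keep only non-zero eigenvalues, largest first.
    eig = sorted((e for e in eigenvalues if e > tol), reverse=True)
    total = sum(eig)
    running = 0.0
    for r, e in enumerate(eig, start=1):
        running += e
        if running / total >= proportion:
            return r
    return len(eig)


def max_components(eigenvalues, tol=1e-12):
    """MAX rule: the number of PCs whose eigenvalues are non-zero."""
    return sum(1 for e in eigenvalues if e > tol)


# Made-up eigenvalues of a predictor correlation matrix.
eigenvalues = [5.0, 3.0, 1.5, 0.5, 0.0]
for p in (0.10, 0.50, 0.70, 0.90):
    print(p, n_components_for_variance(eigenvalues, p))
print("MAX", max_components(eigenvalues))
```

In practice the eigenvalues would come from a principal component decomposition of the genotype matrix. A numerical tolerance is needed because eigenvalues that are zero in theory are rarely exactly zero in floating point.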
Figure 1. Bias in PCR and RR in regression scenarios (1), (2), (3), and (4) (Supplementary Table S1), at different values of r.
Figure 2. Ridge trace showing regression coefficients estimated using the ridge parameter computed from different numbers of PCs. The x-axis shows the number of PCs used to compute the ridge parameter. The vertical dotted line indicates that our proposed method of choosing the number of components chooses a ridge parameter in the region where ridge estimates stabilise. The black line indicates a causal variant. Plotted are the first 100 SNPs of the 20,000 in one simulation replicate, with continuous outcomes.
Performance in out-of-sample prediction in simulated data using different methods to fit prediction models
| Outcome | Univariate 0.1% | Univariate 0.5% | Univariate 1% | Univariate 3% | Univariate 4% | RR-CV | HL | EN | RR (proposed) |
|---|---|---|---|---|---|---|---|---|---|
| Continuous outcomes (mean PSE) | 1.51 | 1.55 | 1.54 | 2.21 | 3.93 | 1.22 | 2.41 | 3.26 | 1.23 |
| (sd) | (0.10) | (0.11) | (0.11) | (0.54) | (1.34) | (0.05) | (0.31) | (0.21) | (0.05) |
| Binary outcomes (mean CE) | 0.49 | 0.48 | 0.49 | 0.49 | 0.49 | 0.46 | 0.50 | 0.48 | 0.46 |
| (sd) | (0.03) | (0.03) | (0.03) | (0.03) | (0.01) | (0.03) | (0.03) | (0.04) | (0.03) |
| Binary outcomes (Brier score) | 0.26 | 0.28 | 0.31 | 0.37 | 0.41 | 0.25 | 0.30 | 0.27 | 0.25 |
| (sd) | (0.01) | (0.02) | (0.06) | (0.06) | (0.05) | (0.005) | (0.06) | (0.04) | (0.003) |
Univariate columns are headed by the % of SNPs ranked by univariate association; RR-CV, RR with the shrinkage parameter chosen using 10-fold cross-validation; HL, HyperLasso; EN, Elastic Net; RR (proposed), RR with the shrinkage parameter chosen by the method proposed in this paper.
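The three performance measures reported in these tables can be sketched in a few lines. The definitions below are assumptions: PSE is taken as the mean squared difference between observed and predicted outcomes, and classification error uses a probability cutoff of 0.5. The exact definitions used in the paper are given in its main text, not here. The Brier score follows its standard definition as the mean squared difference between the predicted probability and the 0/1 outcome.

```python
def prediction_squared_error(y_true, y_pred):
    """Mean squared difference between observed and predicted values
    (assumed definition of PSE)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def classification_error(labels, probs, cutoff=0.5):
    """Fraction of cases misclassified when predicting 1 if prob >= cutoff
    (the 0.5 cutoff is an assumption)."""
    wrong = sum(1 for t, p in zip(labels, probs) if (p >= cutoff) != (t == 1))
    return wrong / len(labels)


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(labels, probs)) / len(labels)


# Made-up example: four cases with 0/1 labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print(classification_error(labels, probs))
print(brier_score(labels, probs))
print(prediction_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

A model that always predicts a probability of 0.5 has a Brier score of exactly 0.25, which is a useful reference point when reading the Brier score rows.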
Performance in out-of-sample prediction using Bipolar Disorder data
| Measure | Univariate (threshold 10⁻⁴) | Univariate (threshold 10⁻⁵) | Univariate (threshold 10⁻⁷) | Univariate (threshold 10⁻¹⁰) | HL | RR (proposed) |
|---|---|---|---|---|---|---|
| SNPs reaching threshold | 346 | 58 | 3 | 2 | | |
| Mean classification error | 0.510 | 0.489 | 0.491 | 0.490 | 0.492 | 0.465 |
| Brier score | 0.35 | 0.29 | 0.28 | 0.26 | 0.38 | 0.29 |
Logistic RR models were fitted on WTCCC-BD data. Mean classification error and Brier score were computed using the GAIN-BD data.