Patrik Waldmann, Gábor Mészáros, Birgit Gredler, Christian Fuerst, Johann Sölkner.
Abstract
The number of publications performing genome-wide association studies (GWAS) has increased dramatically. Penalized regression approaches have been developed to overcome the challenges caused by high-dimensional data, but these methods are relatively new in the GWAS field. In this study we compared the statistical performance of two methods (the least absolute shrinkage and selection operator, or lasso, and the elastic net) on two simulated data sets and one real data set from a 50 K genome-wide single nucleotide polymorphism (SNP) panel of 5570 Fleckvieh bulls. The first simulated data set displays moderate to high linkage disequilibrium (LD) between SNPs, whereas the second simulated data set, from the QTLMAS 2010 workshop, is biologically more complex. We used cross-validation to find the optimal value of the regularization parameter λ at both the minimum MSE (minMSE) and the minimum MSE plus one standard error of the minimum (minMSE + 1SE). The optimal λ values were used for variable selection. Based on the first simulated data, we found that minMSE generally selected too many SNPs. At minMSE + 1SE, the lasso did not select any false positives, but it selected too few correct SNPs. The elastic net provided the best compromise between few false positives and many correct selections when the penalty weight α was around 0.1. However, in our simulation setting, this α value did not result in the lowest minMSE + 1SE. After correction for population structure, the number of SNPs selected from the QTLMAS 2010 data was 82 for the lasso and 161 for the elastic net. In the Fleckvieh data set, after population structure correction, the lasso and the elastic net identified between 1291 and 1966 important SNPs for milk fat content, with major peaks on chromosomes 5, 14, 15, and 20. We conclude that it is important to analyze GWAS data with both the lasso and the elastic net, and that an alternative tuning criterion to minimum MSE is needed for variable selection.
Keywords: GWAS; cattle; elastic net; lasso; population structure; simulation
Year: 2013 PMID: 24363662 PMCID: PMC3850240 DOI: 10.3389/fgene.2013.00270
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Plot of the mean-squared error (MSE) and the number of SNPs in the model as functions of -log(λ) for the 10-fold cross-validation analyses with the EN01 penalty of one of the mixed LD simulated data sets. The red dots are the means from the cross-validation, and the bars indicate mean + 1SE and mean − 1SE. The dashed vertical lines indicate minMSE + 1SE (left) and minMSE (right).
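The two dashed lines in Figure 1 correspond to two ways of reading a cross-validation curve. The following is a minimal sketch in plain Python, assuming the glmnet-style convention in which the minMSE + 1SE choice is the largest λ whose mean cross-validated MSE lies within one standard error of the minimum. The function name `select_lambdas` and the toy numbers are illustrative and are not taken from the study.

```python
def select_lambdas(lambdas, mean_mse, se_mse):
    """Return (lambda_min, lambda_1se) from a cross-validation curve.

    lambdas  : candidate penalty values
    mean_mse : mean cross-validated MSE for each lambda
    se_mse   : standard error of that mean for each lambda
    """
    if not lambdas or not (len(lambdas) == len(mean_mse) == len(se_mse)):
        raise ValueError("inputs must be non-empty and of equal length")

    # Index of the lowest mean MSE (the minMSE criterion).
    i_min = min(range(len(lambdas)), key=lambda i: mean_mse[i])
    threshold = mean_mse[i_min] + se_mse[i_min]

    # Largest lambda (sparsest model) whose mean MSE stays within the threshold.
    eligible = [lambdas[i] for i in range(len(lambdas)) if mean_mse[i] <= threshold]
    return lambdas[i_min], max(eligible)


# Toy curve (hypothetical numbers, not from the paper).
lambdas = [1.0, 0.5, 0.25, 0.125, 0.0625]
mean_mse = [3.60, 3.10, 2.95, 2.90, 2.93]
se_mse = [0.12, 0.11, 0.10, 0.10, 0.10]

lam_min, lam_1se = select_lambdas(lambdas, mean_mse, se_mse)
print(lam_min, lam_1se)
```

Because a larger λ means a stronger penalty, the minMSE + 1SE choice always yields a model that is at least as sparse as the minMSE model, which matches the smaller number of selected SNPs reported at that criterion.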
Results from the first simulation where three correlation settings (High, Mixture and Low) were considered, all with 25 significant predictor variables out of a total of 50,000 predictor variables.
At minMSE:
| High LD | Correct | 3 (0.99) | 6 (1.69) | 10 (2.36) | 13 (2.42) | 16 (2.43) | 21 (1.99) | 23 (1.89) | 25 (0.66) | 25 (0.00) | 25 (0.00) |
| | False positive | 5 (13.51) | 5 (15.48) | 8 (24.31) | 8 (18.60) | 7 (19.27) | 10 (25.19) | 10 (27.33) | 15 (34.78) | 35 (72.89) | 689 (335) |
| | MSE | 2.89 (0.11) | 2.90 (0.11) | 2.91 (0.11) | 2.88 (0.10) | 2.89 (0.11) | 2.91 (0.11) | 2.97 (0.08) | 2.98 (0.08) | 3.01 (0.08) | 3.12 (0.09) |
| Mixed LD | Correct | 3 (1.04) | 5 (1.38) | 9 (1.71) | 11 (1.94) | 14 (1.87) | 17 (1.66) | 18 (1.41) | 20 (1.52) | 24 (1.15) | 25 (0.00) |
| | False positive | 3 (23.47) | 4 (16.78) | 6 (19.96) | 9 (21.46) | 7 (20.20) | 11 (32.30) | 17 (30.23) | 25 (43.00) | 59 (87.53) | 898 (369) |
| | MSE | 2.92 (0.10) | 2.93 (0.10) | 2.94 (0.10) | 2.94 (0.10) | 2.96 (0.10) | 2.97 (0.10) | 2.98 (0.11) | 2.99 (0.11) | 3.02 (0.11) | 3.16 (0.11) |
| Low LD | Correct | 18 (1.87) | 19 (1.88) | 20 (1.78) | 20 (1.44) | 21 (1.33) | 23 (1.23) | 23 (0.91) | 24 (0.66) | 25 (0.24) | 25 (0.00) |
| | False positive | 7 (16.73) | 8 (20.85) | 10 (28.81) | 9 (30.58) | 10 (29.86) | 13 (30.76) | 27 (32.62) | 30 (54.56) | 73 (81.20) | 1227 (406) |
| | MSE | 3.10 (0.12) | 3.10 (0.12) | 3.09 (0.12) | 3.08 (0.10) | 3.09 (0.10) | 3.09 (0.10) | 3.10 (0.10) | 3.11 (0.10) | 3.14 (0.10) | 3.31 (0.10) |
At minMSE + 1SE:
| High LD | Correct | 3 (0.91) | 7 (1.80) | 12 (2.38) | 16 (2.32) | 20 (2.05) | 24 (1.22) | 25 (0.81) | 25 (0.10) | 25 (0.00) | 25 (0.00) |
| | False positive | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.71) | 0 (0.35) | 0 (1.61) | 0 (0.00) | 0 (0.10) | 0 (0.74) | 35 (68) |
| | MSE | 2.99 (0.12) | 3.00 (0.11) | 3.01 (0.12) | 3.00 (0.12) | 3.00 (0.12) | 3.01 (0.10) | 2.97 (0.08) | 2.98 (0.09) | 3.01 (0.08) | 3.12 (0.10) |
| Mixed LD | Correct | 3 (0.75) | 6 (1.38) | 10 (1.56) | 12 (1.59) | 14 (1.23) | 16 (1.06) | 17 (1.25) | 18 (1.47) | 24 (1.06) | 25 (0.00) |
| | False positive | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.31) | 0 (0.14) | 0 (0.74) | 0 (0.72) | 0 (1.19) | 0 (2.24) | 38 (64.72) |
| | MSE | 3.03 (0.11) | 3.04 (0.11) | 3.06 (0.11) | 3.06 (0.11) | 3.06 (0.11) | 3.08 (0.11) | 3.11 (0.11) | 3.10 (0.12) | 3.13 (0.12) | 3.26 (0.11) |
| Low LD | Correct | 17 (2.12) | 18 (1.89) | 19 (1.88) | 20 (1.49) | 21 (1.36) | 23 (1.22) | 24 (0.78) | 25 (0.51) | 25 (0.00) | 25 (0.00) |
| | False positive | 0 (0.39) | 0 (0.65) | 0 (0.48) | 0 (0.56) | 0 (0.80) | 0 (2.45) | 0 (1.21) | 0 (2.06) | 0 (1.91) | 106 (109) |
| | MSE | 3.20 (0.13) | 3.21 (0.13) | 3.20 (0.13) | 3.20 (0.12) | 3.19 (0.11) | 3.19 (0.12) | 3.20 (0.11) | 3.22 (0.10) | 3.25 (0.10) | 3.42 (0.11) |
Reported values are medians and SD (within parentheses) over 100 replicates.
The values of the elastic net (EN) refer to the penalty weight α.
The stopping criteria for λ were obtained with 10-fold cross-validation both at minimum MSE and minimum MSE plus 1 SE.
The median and standard deviation of the MSE at each stopping criterion are also reported.
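The "Correct" and "False positive" rows can be read as set comparisons between the SNPs a model selects and the SNPs simulated with a true effect, summarized over replicates. The sketch below assumes an exact-match rule, in which a selected SNP counts as correct only if it is itself a simulated causal SNP. This excerpt does not state whether SNPs in LD with a causal SNP were also credited, so the matching rule, the function names, and the toy data are all illustrative assumptions.

```python
from statistics import median, stdev


def count_selection(selected, causal):
    """Return (correct, false_positive) counts for one replicate."""
    selected, causal = set(selected), set(causal)
    correct = len(selected & causal)          # selected and truly causal
    false_positive = len(selected - causal)   # selected but not causal
    return correct, false_positive


def summarize(counts):
    """Median and sample SD of a list of per-replicate counts."""
    sd = stdev(counts) if len(counts) > 1 else 0.0
    return median(counts), sd


# Hypothetical example: 5 causal SNPs, three replicates of selected SNP indices.
causal = {10, 20, 30, 40, 50}
replicates = [
    [10, 20, 30, 99],        # 3 correct, 1 false positive
    [10, 20, 30, 40, 50],    # 5 correct, 0 false positives
    [10, 77, 88],            # 1 correct, 2 false positives
]

results = [count_selection(sel, causal) for sel in replicates]
correct_counts = [c for c, _ in results]
fp_counts = [fp for _, fp in results]

print("correct:", summarize(correct_counts))
print("false positives:", summarize(fp_counts))
```

Reporting the median together with the SD, as the table does, is useful here because false-positive counts can be strongly skewed: a median of 0 can coexist with a non-zero SD when only a few replicates select spurious SNPs.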
Results from the analysis of the simulated QTLMAS 2010 workshop data with and without correction for population structure (using eigenvectors from spectral graph analyses).
| No pop. struct. corr. | Selected SNPs | 161 | 176 | 168 | 219 | 232 | 326 | 454 | 78 |
| | minMSE + 1SE | 0.2825 | 0.3082 | 0.3822 | 0.5331 | 0.9087 | 2.6208 | 4.8283 | – |
| Pop. struct. corr. | Selected SNPs | 82 | 87 | 87 | 92 | 98 | 161 | 240 | 134 |
| | minMSE + 1SE | 0.2421 | 0.2594 | 0.3114 | 0.4673 | 0.7751 | 2.1707 | 4.0467 | – |
The simulated pedigree consists of 3226 individuals from 5 generations.
The continuous trait was controlled by 37 QTLs that had 364 SNPs with .
The values of the elastic net (EN) refer to the penalty weight α (e.g., EN005 is elastic net with α = 0.05).
fdr refers to the SNPs selected by the single marker regression local false discovery rate method (Efron, 2010).
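For orientation, the penalty weight α can be read as follows, assuming the parameterization used by the glmnet software (an assumption, since this excerpt does not state the exact objective). Here n is the number of individuals, y_i the phenotype, x_i the vector of SNP genotypes for individual i, and β the SNP effects:

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
+ \lambda\left[\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
```

Under this convention α = 1 gives the lasso, and values of α close to 0 approach ridge regression, which tends to retain groups of correlated SNPs rather than a single representative from each group.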
Figure 2. Plots of positions and regression coefficients of the selected SNPs from the elastic net (EN01) analysis of the QTLMAS 2010 data in relation to the positions of the 37 simulated QTLs (X). The red, blue, and green colors of the QTLs indicate additive positive, additive negative, and epistatic (including imprinted) effects, respectively. The Manhattan plot for the single marker regression shows the significance [-log10(p)-value] for all SNPs. (A) Without population structure correction. (B) With population structure correction using eigenvectors from spectral graph analysis. (C) Single marker regression with population structure correction. Highlighted markers (in magenta) are the important SNPs picked by the local false discovery rate method (Efron, 2010).
Results from the analysis of the deregressed breeding value evaluation for fat content in Fleckvieh bulls.
| No pop. struct. corr. | Selected SNPs | 1439 | 1451 | 1452 | 1556 | 1603 | 2142 | 2689 | 251 |
| | minMSE + 1SE | 0.0029 | 0.0033 | 0.0039 | 0.0057 | 0.0092 | 0.0240 | 0.0438 | – |
| Pop. struct. corr. | Selected SNPs | 1291 | 1291 | 1297 | 1400 | 1460 | 1966 | 2504 | 160 |
| | minMSE + 1SE | 0.0028 | 0.0031 | 0.0038 | 0.0055 | 0.0090 | 0.0236 | 0.0433 | – |
34,373 SNPs were analyzed on 5570 individuals with and without correction for population structure (using eigenvectors from spectral graph analyses).
The stopping criteria for λ were obtained as the average of ten 10-fold cross-validation runs at minimum MSE plus 1 standard error.
The values of the elastic net refer to the penalty weight α (e.g., EN005 is elastic net with α = 0.05).
fdr refers to the SNPs selected by the single marker regression local false discovery rate method (Efron, 2010).
Figure 3. Plots of positions and regression coefficients of the selected SNPs from (A) lasso, (B) elastic net with α = 0.1, and (C) single marker regression with population structure correction for fat content in Fleckvieh bulls. The Manhattan plot shows the significance [-log10(p)-value] for all SNPs. Highlighted markers (in magenta) are the important SNPs picked by the local false discovery rate method (Efron, 2010).
Figure 4. Venn diagram (BioVenn, Hulsen et al., . The area of overlap is proportional to the number of commonly selected markers.
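The counts behind a diagram of this kind are plain set operations on the SNPs each method selects. The sketch below computes the seven region sizes of a three-set Venn diagram; the three method names and the SNP identifiers are hypothetical placeholders, since the caption as extracted does not list which selections were compared.

```python
def venn_counts(a, b, c):
    """Sizes of the seven disjoint regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }


# Hypothetical selections from three methods (placeholder SNP names).
method_a = {"snp1", "snp2", "snp3", "snp4"}
method_b = {"snp2", "snp3", "snp4", "snp5", "snp6"}
method_c = {"snp3", "snp7"}

counts = venn_counts(method_a, method_b, method_c)
print(counts)
```

The seven regions are disjoint, so their sizes sum to the number of distinct SNPs selected by at least one method, which gives a quick consistency check on any such diagram.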