Jaron Arbet, Matt McGue, Snigdhansu Chatterjee, Saonli Basu.
Abstract
BACKGROUND: Genome-wide association studies (GWAS) test for association between millions of genetic variants and a trait, typically using univariate regression to test each variant against the phenotype separately. Alternatively, Lasso penalized regression allows one to jointly model the relationship between all genetic variants and the phenotype. However, it is unclear how best to conduct inference on the individual Lasso coefficients, especially in high-dimensional settings.
Keywords: Bootstrap; GWAS; Lasso; Permutation; Resampling; Testing
Year: 2017 PMID: 28738830 PMCID: PMC5525347 DOI: 10.1186/s12863-017-0533-3
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Fig. 4. Diagnostic plots for Lasso-PL with the alcohol consumption quantitative trait. The left panel shows the total number of selected SNPs for a given number of permutations (B) used in the estimator. Ideally, as the number of permutations increases, the number of selected SNPs should converge to a constant. The right panel shows the number of discrepant SNPs between models using B and B−1 permutations in the estimator. Ideally, as B increases, the number of discrepant SNPs should converge to 0.
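The two diagnostics in this caption can be computed from the selected SNP sets alone. The sketch below is a minimal illustration, not the authors' code: it assumes the model has already been refit for B = 1, 2, ... permutations, that `selected_by_b[i]` holds the SNPs selected with B = i + 1, and that "discrepant" means the symmetric difference between consecutive sets. The function name and the toy SNP labels are hypothetical.

```python
def permutation_diagnostics(selected_by_b):
    """Return (counts, discrepant) for a sequence of selected-SNP sets.

    selected_by_b[i] is assumed to be the set of SNPs selected when
    B = i + 1 permutations are used in the estimator.
    """
    # Left panel: total number of selected SNPs for each B.
    counts = [len(s) for s in selected_by_b]
    # Right panel: SNPs present in exactly one of the models for B and B - 1
    # (symmetric difference; this reading of "discrepant" is an assumption).
    discrepant = [
        len(selected_by_b[i] ^ selected_by_b[i - 1])
        for i in range(1, len(selected_by_b))
    ]
    return counts, discrepant


# Hypothetical toy input: the selection stabilises from B = 3 onward.
toy = [{"snp1", "snp2", "snp3"}, {"snp1", "snp2"}, {"snp1", "snp4"}, {"snp1", "snp4"}]
counts, discrepant = permutation_diagnostics(toy)
print(counts)
print(discrepant)
```

A flat tail in `counts` together with zeros at the end of `discrepant` corresponds to the "ideal" behaviour the caption describes.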
Comparison of methods given fixed sample size and increasing number of null SNPs, α=0.01
| Model | 0 Null SNPs | 45 Null SNPs | 300 Null SNPs | 900 Null SNPs | 20,000 Null SNPs |
|---|---|---|---|---|---|
| Lasso-PL | 0.840 | 0.831 (0.008) | 0.841 (0.0081) | 0.857 (0.0083) | 0.841 (0.0095) |
| Lasso-AL | 0.850 | 0.850 (0.0102) | 0.851 (0.0089) | 0.854 (0.0077) | 0.828 (0.0079) |
| Lasso-Ayers | 0.819 | 0.808 (0.0121) | 0.836 (0.0096) | 0.846 (0.0086) | 0.838 (0.0095) |
| SMA | 0.828 | 0.828 (0.0106) | 0.828 (0.0101) | 0.828 (0.01) | 0.828 (0.01) |
| Lasso-PT | 0.829 | 0.832 (0.0078) | 0.827 (0.0073) | 0.818 (0.0076) | 0.560 (0.0013) |
| Lasso-RB | 0.869 | 0.859 (0.0137) | 0.847 (0.0106) | 0.833 (0.0089) | 0.555 (0.0012) |
| Lasso-MRB(t=0.001) | 0.869 | 0.865 (0.0138) | 0.850 (0.0105) | 0.838 (0.0089) | 0.556 (0.0012) |
| Lasso-MRB(t=0.005) | 0.869 | 0.863 (0.0137) | 0.849 (0.0105) | 0.836 (0.009) | 0.555 (0.0012) |
| Lasso-MRB(t=0.01) | 0.869 | 0.864 (0.0139) | 0.849 (0.0105) | 0.837 (0.009) | 0.556 (0.0012) |
| Lasso-MRB(t=0.03) | 0.869 | 0.864 (0.0136) | 0.849 (0.0102) | 0.839 (0.0092) | 0.558 (0.0013) |
| Lasso-MRB(t=0.05) | 0.869 | 0.859 (0.0124) | 0.841 (0.0099) | 0.833 (0.0089) | 0.559 (0.0013) |
Models are compared by their true positive rate and false positive rate (in parentheses), across 300 simulated datasets, using a significance level of α=0.01. Each column represents a scenario with a different number of null SNPs (0, 45, 300, 900, or 20,000).
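As a reading aid for the table, the sketch below shows one common way to compute per-dataset true and false positive rates and average them over simulated datasets. It assumes per-SNP definitions (TPR = selected causal SNPs / causal SNPs, FPR = selected null SNPs / null SNPs), which may differ in detail from the definitions used in the paper; the function names and toy data are hypothetical.

```python
def tpr_fpr(selected, causal, n_snps):
    """Return (TPR, FPR) for one dataset.

    selected: set of SNP indices declared significant
    causal:   set of truly associated SNP indices
    n_snps:   total number of SNPs tested
    """
    n_null = n_snps - len(causal)
    tp = len(selected & causal)   # causal SNPs that were selected
    fp = len(selected - causal)   # null SNPs that were selected
    tpr = tp / len(causal) if causal else float("nan")
    fpr = fp / n_null if n_null else float("nan")  # undefined with 0 null SNPs
    return tpr, fpr


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def average_rates(results):
    """Average (TPR, FPR) pairs over datasets, skipping undefined (NaN) entries."""
    tprs = [t for t, _ in results if t == t]  # NaN is the only value != itself
    fprs = [f for _, f in results if f == f]
    return _mean(tprs), _mean(fprs)


# Toy example: 10 SNPs, SNPs 0-4 causal, two simulated datasets.
causal = {0, 1, 2, 3, 4}
runs = [tpr_fpr({0, 1, 2, 3, 7}, causal, 10), tpr_fpr({0, 1, 2, 9}, causal, 10)]
print(average_rates(runs))
```

Under these definitions the FPR is undefined when there are no null SNPs, which is consistent with the "0 Null SNPs" column reporting a true positive rate only.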
Comparison of the selected λ between Lasso-Ayers, Lasso-PL, and Lasso-AL
| Model | 0 Null SNPs | 45 Null SNPs | 300 Null SNPs | 900 Null SNPs | 20,000 Null SNPs |
|---|---|---|---|---|---|
| Lasso-Ayers | 0.112 (0.0168) | 0.112 (0.0165) | 0.108 (0.0099) | 0.105 (0.0065) | 0.074 (0.0031) |
| Lasso-PL | 0.110 (0.0042) | 0.110 (0.0041) | 0.109 (0.0034) | 0.105 (0.0033) | 0.074 (0.0022) |
| Lasso-AL | 0.108 (0.0033) | 0.107 (0.0033) | 0.107 (0.0033) | 0.106 (0.0033) | 0.081 (0.0039) |
The average selected value of λ that controls the type-1-error rate at level α=0.01 is compared between three different methods across 300 simulated datasets. Standard deviations are reported in parentheses
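To make the idea of "a λ that controls the type-1-error rate" concrete, here is a generic permutation sketch. It is not the Lasso-PL, Lasso-AL, or Lasso-Ayers estimator. It assumes the Lasso objective (1/(2n))‖y − Xβ‖² + λ‖β‖₁ with standardized and mutually orthogonal SNP columns, in which case SNP j is selected exactly when |x_jᵀy|/n > λ. Permuting the phenotype breaks any association, so the (1 − α) quantile of the permuted statistics is a penalty that admits roughly a fraction α of null SNPs. Real SNPs are correlated, so this is only an approximation, and all names below are hypothetical.

```python
import random
import statistics


def standardize(values):
    """Centre to mean 0 and scale to (population) standard deviation 1."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s for v in values]


def null_statistics(columns, y, n_perm=50, seed=0):
    """|x_j' y*| / n for every SNP column and every permuted phenotype y*."""
    rng = random.Random(seed)
    n = len(y)
    cols = [standardize(c) for c in columns]
    y_perm = standardize(y)
    stats = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permuting y removes any SNP-phenotype association
        for c in cols:
            stats.append(abs(sum(a * b for a, b in zip(c, y_perm))) / n)
    return stats


def upper_quantile(values, alpha):
    """Empirical (1 - alpha) quantile: about a fraction alpha of values exceed it."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, int((1 - alpha) * len(ordered)))
    return ordered[k]


# Toy data: 200 subjects, 20 SNPs coded 0/1/2, phenotype unrelated to any SNP.
rng = random.Random(1)
snps = [[rng.choice((0, 1, 2)) for _ in range(200)] for _ in range(20)]
pheno = [rng.gauss(0.0, 1.0) for _ in range(200)]
lam = upper_quantile(null_statistics(snps, pheno), alpha=0.01)
print(lam)
```

Pooling statistics across SNPs and permutations is a design choice made here for simplicity; other summaries of the permutation distribution are possible.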
Fig. 1. Comparison of the true Lasso distribution with the residual bootstrap approximation of the Lasso distribution. The black curve represents the empirical “true” distribution of the Lasso estimate over 300 simulated datasets. The blue curve combines the residual bootstrap distributions of the estimate from all 300 datasets.
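For readers unfamiliar with the residual bootstrap, the following sketch shows the generic procedure in the simplest possible setting: a single predictor with no intercept, where the Lasso has a closed-form soft-thresholding solution. This is an illustration under those assumptions, not the Lasso-RB implementation used in the paper; the function names, λ value, and simulated data are hypothetical.

```python
import random
import statistics


def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_1d(x, y, lam):
    """Minimise (1/(2n)) * sum((y - x*b)^2) + lam * |b| for one predictor."""
    n = len(y)
    xy = sum(a * b for a, b in zip(x, y)) / n
    xx = sum(a * a for a in x) / n
    return soft_threshold(xy, lam) / xx


def residual_bootstrap(x, y, lam, n_boot=200, seed=0):
    """Return the Lasso estimate and its residual-bootstrap replicates."""
    rng = random.Random(seed)
    b_hat = lasso_1d(x, y, lam)
    resid = [yi - xi * b_hat for xi, yi in zip(x, y)]
    centre = statistics.fmean(resid)
    resid = [r - centre for r in resid]  # centre residuals before resampling
    draws = []
    for _ in range(n_boot):
        e_star = rng.choices(resid, k=len(resid))  # resample with replacement
        y_star = [xi * b_hat + e for xi, e in zip(x, e_star)]
        draws.append(lasso_1d(x, y_star, lam))     # refit on the bootstrap data
    return b_hat, draws


# Toy data: true coefficient 0.5, standard normal noise.
rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
b_hat, draws = residual_bootstrap(x, y, lam=0.05)
print(b_hat, statistics.fmean(draws), statistics.stdev(draws))
```

The spread of `draws` plays the role of the blue curve in the figure: it is the bootstrap approximation to the sampling distribution of the Lasso estimate.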
Fig. 2. Comparison of the true Lasso distribution with the modified residual bootstrap approximation as the number of null SNPs increases. The black curve represents the empirical “true” distribution of the Lasso estimate over 300 simulated datasets. The other curves combine the modified residual bootstrap distributions of the estimate from all 300 datasets.
Average λ selected by 10-fold cross validation
| Num. of Null SNPs | Avg. λ (SD) |
|---|---|
| 0 | 0.0007 (0.0013) |
| 45 | 0.048 (0.011) |
| 300 | 0.076 (0.013) |
| 900 | 0.089 (0.015) |
| 20,000 | 0.136 (0.024) |
For Lasso-RB, Lasso-MRB, and Lasso-PT: the average λ selected by 10-fold CV across 300 simulated datasets is reported for simulation scenarios with varying number of null SNPs. Standard deviations are listed in parentheses
Comparison of TPR, LTPR, and FPR across 300 simulated datasets, each with five causal regions
| Model | TPR(t=0.3) | LTPR(t=0.3) | FPR(t=0.3) | TPR(t=0.5) | LTPR(t=0.5) | FPR(t=0.5) |
|---|---|---|---|---|---|---|
| Lasso-PL | 0.780 | 0.166 | 0.0087 | 0.762 | 0.232 | 0.0091 |
| Lasso-Ayers | 0.775 | 0.166 | 0.0092 | 0.755 | 0.232 | 0.0096 |
| Lasso-AL | 0.761 | 0.159 | 0.0075 | 0.741 | 0.222 | 0.0079 |
| SMA | 0.753 | 0.336 | 0.0101 | 0.740 | 0.448 | 0.0115 |
Each dataset contains n=600 subjects, 1000 SNPs and five causal regions. All testing was done using significance level α=0.01. See Section “Simulation 2” for a definition of “causal region”, TPR, LTPR, and FPR
Fig. 3. Comparison of the number of selected SNPs in a GWAS with two different quantitative traits: alcohol consumption (top) and non-substance behavioral disinhibition (bottom). A total of 3853 subjects were used with 507,541 SNPs. All testing used significance level α = 1.18×10⁻⁵. Venn diagrams were created using [39]
Alcohol Consumption GWAS
| SNP | Chr. | Gene | Distance from Gene (bp) | SMA | Lasso-PL | Lasso-AL | Lasso-Ayers |
|---|---|---|---|---|---|---|---|
| rs7574612 | 2 | LHCGR | -24421 | 9.4×10⁻⁶ (S) | S | S | S |
| rs4836266 | 5 | GRAMD3 | -235 | 1.3×10⁻⁵ (N) | S | N | N |
| rs211598 | 6 | EYA4 | -29329 | 9.3×10⁻⁶ (S) | S | S | S |
| rs4385434 | 8 | hCG_1814486 | -129857 | 5.5×10⁻⁶ (S) | S | S | S |
| rs7136989 | 12 | GOLGA3 | -244 | 1.7×10⁻⁵ (N) | S | N | N |
| rs6072694 | 20 | PTPRT | -1180 | 2.1×10⁻⁶ (S) | S | S | S |
| rs233278 | 21 | KRTAP10-4 | -1677 | 1.3×10⁻⁶ (N) | S | N | N |
For Tables 5 and 6: “S” means “significant” and “N” means “not significant” using significance level α = 1.18×10⁻⁵. Note that Lasso-PL, Lasso-AL, and Lasso-Ayers cannot provide exact p-values, but select significant SNPs while attempting to control the type-1-error rate at level α. One could fit multiple penalized regression models and estimate the λ that controls the type-1-error rate at various orders of magnitude (e.g. 10⁻⁵, 10⁻⁶, etc.) to get a better idea of the significance of each selected SNP (not done here).
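The multi-level idea mentioned in this note (and not carried out in the paper) could be summarised as follows. The sketch assumes that a penalized model has already been fit at each target type-1-error level and that the selected SNP set is available per level; the function name, the levels, and the SNP labels are hypothetical placeholders, not results.

```python
def smallest_significant_level(selected_at):
    """Map each SNP to the smallest alpha at which it was still selected.

    selected_at: dict mapping a type-1-error level alpha to the set of SNPs
    selected by the model whose penalty was calibrated to that level.
    """
    best = {}
    for alpha, snps in selected_at.items():
        for snp in snps:
            if snp not in best or alpha < best[snp]:
                best[snp] = alpha
    return best


# Hypothetical selections at three levels (placeholder SNP names, not real data).
selected_at = {
    1e-5: {"snpA", "snpB", "snpC"},
    1e-6: {"snpA", "snpB"},
    1e-7: {"snpA"},
}
for snp, level in sorted(smallest_significant_level(selected_at).items()):
    print(snp, level)
```

The resulting level is only an order-of-magnitude bound on significance, not an exact p-value.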
Non-substance behavioral disinhibition GWAS
| SNP | Chr. | Gene | Distance from Gene (bp) | SMA | Lasso-PL | Lasso-AL | Lasso-Ayers |
|---|---|---|---|---|---|---|---|
| rs831750 | 1 | LOC440706 | -1546 | 8.1×10⁻⁶ (S) | N | N | N |
| rs1007227 | 1 | LOC440706 | -698 | 7.5×10⁻⁷ (S) | S | S | S |
| rs17045125 | 2 | ASB3 | -3399 | 1.3×10⁻⁵ (N) | S | N | N |
| rs1384394 | 2 | IKZF2 | -32475 | 7.9×10⁻⁶ (S) | S | S | S |
| rs4527483 | 4 | TSPAN5 | -7088 | 6.9×10⁻⁶ (S) | S | S | S |
| rs3017726 | 4 | LOC728847 | -88224 | 1.3×10⁻⁵ (N) | S | N | N |
| rs6923361 | 6 | MCHR2 | -58727 | 5.0×10⁻⁶ (S) | S | S | S |
| rs2215987 | 7 | THSD7A | -177262 | 1.4×10⁻⁵ (N) | S | N | N |
| rs10504658 | 8 | PXMP3 | -718715 | 1.3×10⁻⁶ (S) | S | S | S |
| rs7314533 | 12 | KCNC2 | -81891 | 9.7×10⁻⁶ (S) | S | N | S |
Computation time in minutes for alcohol consumption and behavioral disinhibition GWASs
| Method | Cores | Alcohol consumption (min) | Behavioral disinhibition (min) |
|---|---|---|---|
| SMA | 20 | 8.0 | 7.9 |
| Lasso-Ayers | 1 | 10.5 | 19.9 |
| Lasso-AL | 1 | 29.2 | 29.4 |
| Lasso-PL( | 20 | 57.1 | 62.0 |
Computation time in minutes for the models fit to the alcohol consumption and behavioral disinhibition quantitative traits, using 507,541 SNPs from 3853 subjects