Hon-Cheong So, Pak C Sham.
Abstract
Polygenic risk scores (PRS) from genome-wide association studies (GWAS) are increasingly used to predict disease risks. However, some included variants could be false positives, and the raw estimates of effect sizes from them may be subject to selection bias. In addition, the standard PRS approach requires testing over a range of p-value thresholds, which are often chosen arbitrarily. The prediction error estimated from the optimized threshold may also be subject to an optimistic bias. To improve genomic risk prediction, we proposed new empirical Bayes approaches to recover the underlying effect sizes and used them as weights to construct PRS. We applied the new PRS to twelve cardio-metabolic traits in the Northern Finland Birth Cohort and demonstrated improvements in predictive power (in R2) when compared to standard PRS at the best p-value threshold. Importantly, for eleven out of the twelve traits studied, the predictive performance from the entire set of genome-wide markers outperformed the best R2 from standard PRS at optimal p-value thresholds. Our proposed methodology essentially enables an automatic PRS weighting scheme without the need to choose tuning parameters. The new method also performed satisfactorily in simulations. It is computationally simple and does not require assumptions on the effect size distributions.
Year: 2017 PMID: 28145530 PMCID: PMC5286518 DOI: 10.1038/srep41262
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
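The "Tweedie" weighting named in the tables below refers to an empirical Bayes correction of selection-biased effect estimates. As a minimal sketch (not the authors' implementation), Tweedie's formula for a z-statistic with unit variance is E[mu | z] = z + d/dz log f(z), where f is the marginal density of the observed z-statistics. The function name `tweedie_correct`, the kernel density estimate used for f, and the finite-difference step `eps` are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import gaussian_kde


def tweedie_correct(z, eps=1e-3):
    """Shrink z-statistics with Tweedie's formula: z + d/dz log f(z).

    Assumes each z is approximately N(mu, 1). The marginal density f is
    estimated with a Gaussian KDE (an assumption of this sketch).
    """
    z = np.asarray(z, dtype=float)
    kde = gaussian_kde(z)
    # Central finite difference of log f at each observed z.
    log_f_plus = np.log(kde(z + eps))
    log_f_minus = np.log(kde(z - eps))
    score = (log_f_plus - log_f_minus) / (2.0 * eps)
    return z + score


# Usage on purely null statistics: corrected values shrink toward zero.
rng = np.random.default_rng(0)
z_obs = rng.normal(size=2000)
z_corrected = tweedie_correct(z_obs)
```

No prior on the effect sizes is specified in this sketch; only the marginal density of the observed statistics is estimated, which is in line with the abstract's statement that no assumptions on the effect size distribution are required.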
The predictive performances (in R2 for linear traits and AUC for binary traits) at the optimal p-value threshold for standard PRS and three other weighting schemes in simulations for N = 5000 and 10000 using independent SNPs.
| N | h | Linear: Standard | Linear: Tdr | Linear: Tweedie | Linear: Tweedie*tdr | Binary: Standard | Binary: Tdr | Binary: Tweedie | Binary: Tweedie*tdr |
|---|---|---|---|---|---|---|---|---|---|
| 5000 | 0.15 | 0.048 | 0.054 | 0.052 | 0.047 | 0.546 | 0.544 | 0.544 | 0.536 |
| 5000 | 0.35 | 0.204 | 0.216 | 0.205 | 0.200 | 0.608 | 0.621 | 0.616 | 0.605 |
| 5000 | 0.55 | 0.399 | 0.410 | 0.398 | 0.397 | 0.691 | 0.705 | 0.702 | 0.690 |
| 10000 | 0.15 | 0.081 | 0.086 | 0.084 | 0.081 | 0.57 | 0.576 | 0.574 | 0.567 |
| 10000 | 0.35 | 0.274 | 0.279 | 0.267 | 0.266 | 0.679 | 0.688 | 0.687 | 0.682 |
| 10000 | 0.55 | 0.468 | 0.473 | 0.457 | 0.457 | 0.763 | 0.771 | 0.771 | 0.768 |
We first applied LD-clumping with an r2 threshold of 0.25 to all SNPs, followed by p-value thresholding in the testing set. The results were derived from testing over a range of p-value thresholds and picking the threshold that gave the best predictive performance.
N denotes the total sample size. For binary traits, an equal number of cases and controls are simulated (e.g. for N = 5000, there are 2500 cases and 2500 controls). Tdr: True discovery rate; h: total heritability explained. The rest of the simulation results are presented in Supplementary Table 1.
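The threshold search described in the notes above (score the testing set at several p-value thresholds, then keep the best one) can be sketched as follows. This is an illustrative sketch only: the function name `prs_r2_by_threshold`, its arguments, and the toy data are hypothetical, and LD-clumping is assumed to have been done beforehand.

```python
import numpy as np


def prs_r2_by_threshold(genotypes, phenotype, betas, pvalues, thresholds):
    """Return {threshold: R^2} for a standard PRS using SNPs with p <= threshold.

    genotypes: (n_samples, n_snps) allele counts for already clumped SNPs.
    betas, pvalues: per-SNP effect estimates and p-values from the training GWAS.
    """
    results = {}
    for t in thresholds:
        keep = pvalues <= t
        if not keep.any():
            results[t] = 0.0  # no SNP passes, so the score carries no information
            continue
        score = genotypes[:, keep] @ betas[keep]
        if np.std(score) == 0:
            results[t] = 0.0
            continue
        r = np.corrcoef(score, phenotype)[0, 1]
        results[t] = r ** 2
    return results


# Toy usage with simulated data (all values here are made up).
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(500, 50)).astype(float)
beta_true = np.zeros(50)
beta_true[:5] = 0.5
y = G @ beta_true + rng.normal(size=500)
beta_hat = beta_true + rng.normal(0, 0.2, size=50)
p = np.where(np.arange(50) < 5, 1e-8, 0.5)
r2 = prs_r2_by_threshold(G, y, beta_hat, p, thresholds=[0.01, 1.0])
best_threshold = max(r2, key=r2.get)
```

Because `best_threshold` is chosen on the same samples used to report R2, the selected value is the quantity the abstract describes as subject to an optimistic bias.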
The predictive performances (in R2 for linear traits and AUC for binary traits) when all markers are included in PRS in simulations for N = 5000 and 10000 using independent SNPs.
| N | h | Linear: Standard | Linear: Tdr | Linear: Tweedie | Linear: Tweedie*tdr | Linear: Standard best | Binary: Standard | Binary: Tdr | Binary: Tweedie | Binary: Tweedie*tdr | Binary: Standard best |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5000 | 0.15 | 0.012 | 0.053 | 0.040 | 0.047 | 0.048 | 0.527 | 0.542 | 0.530 | 0.535 | 0.546 |
| 5000 | 0.35 | 0.051 | 0.214 | 0.187 | 0.200 | 0.204 | 0.564 | 0.620 | 0.595 | 0.605 | 0.608 |
| 5000 | 0.55 | 0.118 | 0.407 | 0.377 | 0.397 | 0.399 | 0.599 | 0.705 | 0.680 | 0.690 | 0.691 |
| 10000 | 0.15 | 0.019 | 0.085 | 0.074 | 0.081 | 0.081 | 0.537 | 0.576 | 0.559 | 0.567 | 0.570 |
| 10000 | 0.35 | 0.087 | 0.275 | 0.255 | 0.266 | 0.274 | 0.585 | 0.688 | 0.671 | 0.682 | 0.679 |
| 10000 | 0.55 | 0.187 | 0.461 | 0.444 | 0.457 | 0.468 | 0.632 | 0.770 | 0.757 | 0.768 | 0.763 |
For the columns labelled “Standard”, “Tdr”, “Tweedie” and “Tweedie*tdr”, we first applied LD-clumping with an r2 threshold of 0.25 to all SNPs, then PRS was derived using all SNPs that remained. There was no selection of p-value thresholds.
The best predictive performance obtained from optimal p-value thresholds using standard PRS is also shown for comparison (under the column “Standard best”). N denotes the total sample size. For binary traits, an equal number of cases and controls are simulated. Tdr: True discovery rate; h: total heritability explained. The rest of the simulation results are presented in Supplementary Table 2.
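The "Tdr" columns weight every marker by its true discovery rate, so no p-value threshold is selected. A minimal sketch of such a weighting is given here under stated assumptions: tdr is taken as 1 minus the local false discovery rate, the null density is standard normal, the marginal density is a Gaussian KDE, and the null proportion `pi0` defaults to the conservative value 1. The function name `tdr_weights` is hypothetical and none of these choices is confirmed as the authors' estimator.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm


def tdr_weights(z, pi0=1.0):
    """Return per-SNP true discovery rates, taken as 1 - local fdr.

    local fdr(z) = pi0 * f0(z) / f(z), with f0 the N(0, 1) null density and
    f the marginal density of all z-statistics (KDE, an assumption here).
    """
    z = np.asarray(z, dtype=float)
    f = gaussian_kde(z)(z)   # estimated marginal density at each z
    f0 = norm.pdf(z)         # theoretical null density
    local_fdr = np.clip(pi0 * f0 / f, 0.0, 1.0)
    return 1.0 - local_fdr


# Usage: down-weight likely false positives when using all markers.
rng = np.random.default_rng(0)
z_obs = np.concatenate([rng.normal(size=1900), rng.normal(5.0, 1.0, size=100)])
beta_hat = z_obs * 0.01  # hypothetical effect estimates on some common scale
weighted_beta = beta_hat * tdr_weights(z_obs)
```

With weights of this kind, markers that look like noise contribute almost nothing to the score, which is one way a PRS built from all markers can avoid being swamped by null variants.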
Predictive performances (prediction R2 in %) of the standard PRS and four other PRS schemes in simulations using real genotype data (a mixture of small and large effects simulated).
| % causal | Type | Standard | Tdr | Tweedie | Tweedie*tdr | LDpred | |
|---|---|---|---|---|---|---|---|
| 0.1% | 10% | max | 0.700 | 0.951 | 0.925 | 0.705 | |
| all SNPs | 0.044 | 0.644 | 0.165 | 0.055 | |||
| 0.1% | 20% | max | 2.497 | 2.712 | 2.669 | 2.922 | |
| all SNPs | 0.072 | 2.025 | 0.178 | 0.041 | |||
| 0.1% | 30% | max | 6.622 | 6.800 | 6.627 | 7.079 | |
| all SNPs | 0.411 | 4.342 | 2.852 | 0.489 | |||
| 0.1% | 40% | max | 7.327 | 8.289 | 8.522 | 7.743 | |
| all SNPs | 0.265 | 5.048 | 2.634 | 0.817 | |||
| 0.25% | 10% | max | 0.921 | 1.154 | 1.269 | 0.986 | |
| all SNPs | 0.075 | 0.864 | 0.237 | 0.162 | |||
| 0.25% | 20% | max | 2.381 | 2.342 | 2.432 | 0.913 | |
| all SNPs | 0.038 | 2.099 | 0.320 | 0.038 | |||
| 0.25% | 30% | max | 2.015 | 2.469 | 2.328 | 2.571 | |
| all SNPs | 0.730 | 1.899 | 1.268 | 0.653 | |||
| 0.25% | 40% | max | 4.762 | 5.328 | 5.218 | 5.333 | |
| all SNPs | 0.910 | 3.707 | 2.882 | 1.136 | |||
| 1.0% | 10% | max | 0.695 | 0.773 | 0.766 | 0.374 | |
| all SNPs | 0.069 | 0.498 | 0.102 | 0.056 | |||
| 1.0% | 20% | max | 1.439 | 1.350 | 1.206 | 1.253 | |
| all SNPs | 0.042 | 0.915 | 0.132 | 0.040 | |||
| 1.0% | 30% | max | 3.018 | 3.189 | 3.274 | 2.656 | |
| all SNPs | 0.274 | 3.189 | 1.059 | 0.730 | |||
| 1.0% | 40% | max | 2.906 | 3.104 | 3.011 | 3.404 | |
| all SNPs | 0.599 | 2.147 | 1.565 | 1.178 | |||
| 2.5% | 10% | max | 0.730 | 0.934 | 0.982 | 0.960 | |
| all SNPs | 0.035 | 0.387 | 0.112 | 0.046 | |||
| 2.5% | 20% | max | 1.912 | 2.106 | 2.096 | 2.045 | |
| all SNPs | 0.352 | 1.452 | 0.797 | 0.338 | |||
| 2.5% | 30% | max | 1.269 | 1.398 | 1.440 | 1.482 | |
| all SNPs | 0.353 | 0.972 | 0.648 | 0.324 | |||
| 2.5% | 40% | max | 2.691 | 2.861 | 2.787 | 2.925 | |
| all SNPs | 0.706 | 2.043 | 1.554 | 0.605 |
A mixture of small and large effects was simulated as described in the main text. The best performing PRS weighting method in each scenario is in bold. % causal: percentage of causal markers; h2: total heritability explained by panel markers. For all methods except LDpred, we first applied LD-clumping with an r2 threshold of 0.25 to all SNPs. “Max” refers to the maximum prediction R2 achieved after optimizing over a range of p-value thresholds or fractions of causal variants. “All.SNPs” refers to the predictive performance using all SNPs after LD-clumping, except for LDpred where no clumping was performed. All predictive performances were measured by R2 in %.
Predictive performances (prediction R2 in %) of the standard PRS and four other PRS schemes, applied to twelve cardio-metabolic traits in the Northern Finland Birth Cohort.
| Standard | Tdr | Tweedie | Tweedie*tdr | LDpred | ||
|---|---|---|---|---|---|---|
| LDL | max | 3.731 | 3.956 | 4.349 | 4.070 | |
| all SNPs | 0.166 | 1.873 | 1.295 | 1.295 | ||
| HDL | max | 10.050 | 10.436 | 10.325 | 10.438 | |
| all SNPs | 0.311 | 6.788 | 3.548 | 0.472 | ||
| TG | max | 2.256 | 2.627 | 2.784 | 2.508 | |
| all SNPs | 0.056 | 1.128 | 0.311 | 0.130 | ||
| FG | max | 1.234 | 1.239 | 1.775 | 1.708 | |
| all SNPs | 0.180 | 0.800 | 0.519 | 0.162 | ||
| BMI | max | 0.481 | 0.682 | 0.984 | 1.116 | |
| all SNPs | 0.207 | 0.640 | 0.442 | 0.339 | ||
| WHR | max | 18.764 | 19.449 | 20.051 | 20.231 | |
| all SNPs | 0.044 | 18.916 | 7.139 | 0.019 | ||
| INS | max | 0.167 | 0.183 | 0.136 | 0.048 | |
| all SNPs | 0.044 | 0.117 | 0.028 | 0.026 | ||
| CRP | max | 0.644 | 0.700 | 0.787 | 0.740 | |
| all SNPs | 0.166 | 0.140 | 0.318 | 0.102 | ||
| SBP | max | 7.641 | 8.124 | 8.519 | 8.545 | |
| all SNPs | 0.132 | 4.853 | 2.565 | 0.291 | ||
| DBP | max | 2.270 | 2.842 | 3.229 | 3.369 | |
| all SNPs | 0.359 | 1.684 | 1.636 | 0.340 | ||
| Height | max | 29.907 | 30.258 | 31.065 | NA^ | |
| all SNPs | 0.168 | 20.153 | 12.318 | NA^ | ||
| Weight | max | 12.708 | 13.354 | 13.973 | 14.093 | |
| all SNPs | 0.106 | 12.244 | 6.881 | 0.049 |
The best performing PRS weighting method in each scenario is in bold. Max: maximum prediction R2 achieved after optimizing over a range of p-value thresholds or fractions of causal variants; all SNPs: predictive performance using all SNPs. Note that for all methods except LDpred, the original set of genome-wide SNPs was LD-clumped by PLINK before construction of PRS.
LDL, low density lipoprotein; HDL, high density lipoprotein; TG, triglyceride; FG, fasting glucose; BMI, body mass index; WHR, waist-hip ratio; INS, fasting insulin; CRP, C-reactive protein; SBP, systolic blood pressure; DBP, diastolic blood pressure.
^The LDpred program failed to run properly for the prediction of height despite repeated trials; hence it is listed as NA in this table.
Figure 1. Predictive performance of standard PRS and four other PRS weighting schemes on lipids, fasting glucose, body-mass index and waist-hip ratio in the Northern Finland Birth Cohort (Orange: standard PRS; blue: weighting by ; yellow, weighting by ; green, weighting by ; purple, weighting by LDpred).
For all methods except LDpred, we first applied LD-clumping with an r2 threshold of 0.25 to all SNPs. “Max” refers to the maximum prediction R2 achieved after optimizing over a range of p-value thresholds or fractions of causal variants. “All.SNPs” refers to the predictive performance using all SNPs after LD-clumping, except for LDpred where no clumping was performed. All predictive performances were measured by R2 in %.
Figure 2. Predictive performance of standard PRS and four other PRS weighting schemes on fasting insulin, C-reactive protein, systolic and diastolic blood pressure, height and weight in the Northern Finland Birth Cohort (Orange: standard PRS; blue: weighting by ; yellow, weighting by ; green, weighting by ; purple, weighting by LDpred).
For all methods except LDpred, we first applied LD-clumping with an r2 threshold of 0.25 to all SNPs. “Max” refers to the maximum prediction R2 achieved after optimizing over a range of p-value thresholds or fractions of causal variants. “All.SNPs” refers to the predictive performance using all SNPs after LD-clumping, except for LDpred where no clumping was performed. All predictive performances were measured by R2 in %.
Computational speed of different algorithms.
| Method | Average time taken for 1 simulation | Time taken for 20 simulations |
|---|---|---|
| Tdr | 0.14 s | 2.73 s |
| Tweedie | 1.34 s | 26.74 s |
| Tweedie*tdr | 2.05 s | 40.98 s |
| LDpred | 5 h 37 min 44 s | 112 h 34 min 32 s |
We compared the time taken for computing the corrected effect sizes based on one simulation scenario. Names of methods are defined as in previous tables; h: hours; min: minutes; s: seconds.
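To put the per-simulation timings in the table above on a common scale, the short calculation below converts them to seconds and expresses LDpred's time as a multiple of each empirical Bayes method. The helper `to_seconds` and the dictionary names are hypothetical; the input values are taken directly from the table.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an hours/minutes/seconds duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Average time for one simulation, as reported in the table.
per_simulation = {
    "Tdr": 0.14,
    "Tweedie": 1.34,
    "Tweedie*tdr": 2.05,
    "LDpred": to_seconds(5, 37, 44),  # 20264 seconds
}

# How many times longer LDpred took than each method.
ldpred_ratio = {
    method: per_simulation["LDpred"] / t
    for method, t in per_simulation.items()
    if method != "LDpred"
}
```

On these figures, LDpred's single-simulation time is several thousand times that of any of the three empirical Bayes weightings.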