Clara Albiñana, Jakob Grove, John J McGrath, Esben Agerbo, Naomi R Wray, Cynthia M Bulik, Merete Nordentoft, David M Hougaard, Thomas Werge, Anders D Børglum, Preben Bo Mortensen, Florian Privé, Bjarni J Vilhjálmsson.
Abstract
The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWASs). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using 12 real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (meta-PRS) is both a simple alternative to meta-GWAS and often more accurate.
Keywords: PRS; complex traits; genetic prediction; meta-analysis; polygenic risk scores; psychiatric disorders
Year: 2021 PMID: 33964208 PMCID: PMC8206385 DOI: 10.1016/j.ajhg.2021.04.014
Source DB: PubMed Journal: Am J Hum Genet ISSN: 0002-9297 Impact factor: 11.043
Overview of the compared data-combining approaches and data utilization
| Meta-GWAS | GWAS | – | select PRS parameters | assess PRS prediction accuracy | |
| SCT | penalized regression of C+T scores | grid C+T scores | not used | ||
| Meta-PRS | derive | derive | select PRS parameters |
Abbreviations: M, number of SNPs; Z, SNP effect size; x, SNP effect allele count; n, effective sample size; int, internal data; ext, external data; k, number of PRSs in grid; w, weights (either regression coefficients or square root of training sample size).
When the weights for meta-PRS were obtained with linear regression, the validation dataset was also used to train the regression parameters. When the weights were obtained from the training sample size, the validation set was not used.
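The sample-size weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the PRS values are toy numbers, and standardizing each PRS and normalizing the weights to sum to 1 are assumptions made here for readability. Regression coefficients estimated on the validation set could be passed as `weights` instead.

```python
import math

def sqrt_n_weights(n_eff):
    """Weights proportional to sqrt(effective training sample size).

    Normalizing to sum to 1 is an assumption of this sketch.
    """
    roots = [math.sqrt(n) for n in n_eff]
    total = sum(roots)
    return [r / total for r in roots]

def standardize(scores):
    """Center and scale one PRS to mean 0 and variance 1 (assumed step)."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    sd = math.sqrt(var)
    return [(s - mean) / sd for s in scores]

def meta_prs(prs_list, weights):
    """Weighted linear combination of PRSs, one value per individual."""
    std = [standardize(p) for p in prs_list]
    n_ind = len(std[0])
    return [sum(w * p[i] for w, p in zip(weights, std)) for i in range(n_ind)]

# Effective sample sizes for anorexia nervosa from the dataset table
# (internal, external); the PRS values below are made-up toy numbers.
weights = sqrt_n_weights([7713, 35274])
prs_int = [0.10, 0.20, 0.30, 0.15]
prs_ext = [1.00, 3.00, 2.00, 2.50]
combined = meta_prs([prs_int, prs_ext], weights)
```

With these inputs the external PRS receives the larger weight, because it was trained on the larger effective sample.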
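Summary of real datasets
| Trait | Effective sample size (internal) | Effective sample size (external) | Ratio (internal:external) | |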
| Anorexia nervosa (AN) | 7,713 | 35,274 | 1:5 | 0.8147 (0.0945) |
| Bipolar disorder (BD) | 8,436 | 48,609 | 1:6 | 0.7855 (0.0804) |
| Schizophrenia (SCZ) | 15,421 | 48,307 | 1:3 | 0.6175 (0.0677) |
| Autism spectrum disorder (ASD) | 39,068 | 10,610 | 4:1 | 0.6241 (0.0671) |
| Attention deficit hyperactivity disorder (ADHD) | 43,405 | 12,214 | 4:1 | 1.3137 (0.1216) |
| Major depressive disorder (MDD) | 49,234 | 646,483 | 1:13 | 0.8115 (0.0477) |
| Coronary artery disease (CAD) | 35,457 | 162,973 | 1:5 | 0.8644 (0.0672) |
| Breast cancer (BC) | 35,707 | 227,688 | 1:6 | 0.9378 (0.085) |
| Type 2 diabetes (T2D) | 57,086 | 88,825 | 1:2 | 0.9567 (0.0595) |
| Major depressive disorder (MDD) | 83,900 | 123,796 | 1:2 | 0.8156 (0.0632) |
| Body mass index (BMI) | 269,106 | 339,224 | 1:1 | 0.9536 (0.0347) |
| Height | 269,407 | 253,288 | 1:1 | 0.9389 (0.0417) |
Effective sample sizes of the six psychiatric disorders in iPSYCH 2015 and ANGI, four diseases and two continuous traits in the UK Biobank, along with the effective sample sizes of the corresponding external GWAS summary statistics. The table reflects sizes of European, unrelated samples (see material and methods).
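As a rough illustration of the quantities in this table, the sketch below uses a common definition of effective sample size for case-control studies, 4 / (1/n_cases + 1/n_controls). This section does not spell out the formula, so treat that definition as an assumption. The `ratio_label` helper is hypothetical and simply rounds the internal-to-external ratio; the rounding rule used in the table is not stated in the source.

```python
def effective_sample_size(n_cases, n_controls):
    """Common case-control definition (assumed, not taken from the source)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)

def ratio_label(n_int, n_ext):
    """Rounded internal:external ratio, expressed as 'a:1' or '1:b'."""
    if n_int >= n_ext:
        return f"{round(n_int / n_ext)}:1"
    return f"1:{round(n_ext / n_int)}"

# A balanced study with 1,000 cases and 1,000 controls.
balanced = effective_sample_size(1000, 1000)

# Effective sample sizes taken from the AN and ASD rows of the table.
an_ratio = ratio_label(7713, 35274)
asd_ratio = ratio_label(39068, 10610)
```

For a balanced design the effective sample size equals the total sample size, and it shrinks as the case-control ratio becomes more unequal.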
Figure 1. Prediction accuracy of the PRSs in the simulation study
Each panel displays the mean and 95% CI of the PRS prediction (y axis) for each data combining approach. The traits were simulated from a liability threshold model with 10,000 (10k) and 100,000 (100k) causal SNPs and heritability of 0.5, and case-control status was inferred from a disease prevalence of 0.2. Mean and 95% CI of prediction were obtained from 10k non-parametric bootstrap samples of 5 independent replicates.
(A) Effect of training sample size on PRS prediction accuracy. The x axis indicates the percentage of individuals from the total training set (n = 303,728) used as individual-level data for BOLT-LMM or as GWAS summary statistics for C+T and LDpred.
(B) Effect of the ratio between internal and external data in the combining approaches. The x axis indicates the relative amount of external versus internal data, e.g., 3:1 indicates a scenario where the external data was 75% and the internal data was 25% of the total sample. Figure 1 is a simplified version of Figure S3, selecting a single method per combining approach between C+T and LDpred, where the method maximizing mean prediction was selected.
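The liability threshold model named in the caption can be sketched in a few lines. This is a simplified illustration under stated assumptions: the genetic component is drawn directly from a normal distribution with variance equal to the heritability instead of being built from 10k or 100k causal SNPs, and the function name and sample size are hypothetical.

```python
import random
from statistics import NormalDist

def simulate_liability_threshold(n, h2=0.5, prevalence=0.2, seed=0):
    """Simulate case-control status under a liability threshold model.

    Liability = genetic part (variance h2) + environmental part
    (variance 1 - h2). Individuals above the threshold are cases.
    """
    rng = random.Random(seed)
    # Threshold on the standard normal liability scale so that a
    # fraction `prevalence` of the population exceeds it.
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    genetic = [rng.gauss(0.0, h2 ** 0.5) for _ in range(n)]
    status = []
    for g in genetic:
        liability = g + rng.gauss(0.0, (1.0 - h2) ** 0.5)
        status.append(1 if liability > threshold else 0)
    return genetic, status

# Heritability 0.5 and prevalence 0.2, as in the caption.
genetic, status = simulate_liability_threshold(20000)
observed_prevalence = sum(status) / len(status)
```

The observed prevalence fluctuates around 0.2 from seed to seed, since case status is determined by a random draw for each individual.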
Figure 2. Prediction accuracy of the combining approaches in 12 complex traits from iPSYCH 2015 and UK Biobank
Each panel displays the mean and 95% CI of the PRS prediction (y axis) for each data-combining approach, for PRSs trained on individual-level data (int), GWAS summary statistics (ext), or both (ext+int) (x axis). The prediction was transformed to the liability scale using a population prevalence of 0.01 (ASD), 0.05 (ADHD), 0.15 (MDD UK Biobank), 0.05 (T2D), 0.01 (AN), 0.03 (CAD), 0.01 (SCZ), 0.07 (BC), 0.01 (BD), and 0.08 (MDD iPSYCH). The methods noted as int and ext were fitted using BOLT-LMM with individual-level data and LDpred or C+T with GWAS summary statistics, respectively. For simplicity, only the ext PRS with the larger mean prediction is shown; the full results are available in Figure S8. Mean and 95% CI of the prediction were obtained from 10k non-parametric bootstrap samples of the 5 cross-validation subsets.
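The bootstrap summary mentioned in the caption can be illustrated with a short sketch. It assumes a percentile interval over resampled means, which is one common non-parametric choice; the caption does not say which interval type was used. The function name and the five accuracy values are hypothetical.

```python
import random

def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Mean and percentile bootstrap CI (assumed interval type)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the values with replacement and record each resampled mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(values) / n, lower, upper

# Toy prediction accuracies for 5 cross-validation subsets (made up).
fold_accuracy = [0.10, 0.12, 0.11, 0.09, 0.13]
mean, lower, upper = bootstrap_mean_ci(fold_accuracy)
```

With only five values per trait, the resulting interval reflects the spread between subsets and cannot extend beyond the smallest and largest observed values.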