Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction.

Clara Albiñana¹, Jakob Grove², John J McGrath³, Esben Agerbo⁴, Naomi R Wray⁵, Cynthia M Bulik⁶, Merete Nordentoft⁷, David M Hougaard⁸, Thomas Werge⁹, Anders D Børglum¹⁰, Preben Bo Mortensen⁴, Florian Privé⁴, Bjarni J Vilhjálmsson¹¹.

Abstract

The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWASs). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using 12 real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (meta-PRS) is both a simple alternative to meta-GWAS and often more accurate.

Keywords: PRS; complex traits; genetic prediction; meta-analysis; polygenic risk scores; psychiatric disorders

Year: 2021; PMID: 33964208; PMCID: PMC8206385; DOI: 10.1016/j.ajhg.2021.04.014

Source DB: PubMed; Journal: Am J Hum Genet; ISSN: 0002-9297; Impact factor: 11.043

Introduction

Polygenic risk scores (PRSs) are a powerful approach to summarize the individual genetic liability to develop a specific disease. They are particularly useful for complex traits and diseases, such as psychiatric disorders, as these are often highly polygenic. This is because PRSs aggregate the small risk contributions from thousands of variants into a single score, summarizing their overall risk contribution. Broadly, the existing polygenic prediction methods differ in the type of data they use for training, i.e., individual-level genotypes/dosages or GWAS summary statistics. Today, GWAS summary statistics are widely available for a broad range of diseases and traits in public databases, e.g., the GWAS Catalog contains more than 1,400 summary statistics. For psychiatric disorders, the Psychiatric Genomics Consortium (PGC) provides GWAS summary statistics based on ever larger sample sizes, as a result of meta-analyzing the individual efforts of many research groups worldwide. Furthermore, many GWAS summary statistics-based PRS methods are broadly used, such as clumping and thresholding (C+T) [5, 6, 7], LDpred, and more recent methods [9, 10, 11, 12, 13], and have proven successful in identifying individuals with significantly increased risk of complex diseases such as coronary artery disease. Interestingly, many of these external GWAS summary statistics-based PRS methods approximate the results of the internal individual-level data approaches, making some assumptions in the process (e.g., LDpred-inf and sBLUP approximate the genomic BLUP, assuming that linkage disequilibrium [LD] patterns in the external data from which the GWAS summary statistics were derived can be captured using an LD reference).
Furthermore, phenotype definition, genetic architecture, and/or technical artifacts may affect the prediction accuracy of the derived PRSs. Using methods that fit prediction effect sizes jointly on internal individual-level data for training PRSs makes some of these assumptions unnecessary, which can lead to improved prediction accuracy (e.g., Privé et al. found that prediction of height using penalized linear regression provides more accurate PRSs compared to C+T [LD clumping and p value thresholding] when trained on individual-level data). Indeed, a number of powerful alternatives exist for deriving PRSs using individual-level data [20, 21, 22, 23, 24, 25]. Until recently, most individual-level datasets have been small, especially in comparison to sample sizes achieved in GWAS meta-analyses, but cheaper genotyping has led to the generation of large genetic datasets (e.g., iPSYCH for psychiatric disorders, and UK Biobank for a multitude of complex traits). Therefore, researchers often have access to large individual-level genetic data as well as large published GWAS summary statistics. However, most PRS methods train on either of these data types separately but not directly on both (although many methods do require individual-level data for hyper-parameter optimization). SCT is the only exception that we are aware of, as it does train directly on both types of data. By combining and leveraging data, we aim to increase the training sample size of PRSs and, ultimately, their prediction accuracy. In the current paper, we explore and compare different approaches of combining internal individual-level data and external GWAS summary statistics for polygenic prediction. Currently, the most widespread approach is combining the data at the level of GWAS summary statistics by meta-analyzing the marginal effect estimates of different studies, prior to training the PRS (meta-GWAS).
We believe this approach is reasonable when the individual-level dataset is small, but may discard its potential for training when larger sample sizes are available. Alternatively, SCT generates a range of C+T PRSs from the external GWAS summary statistics over a grid of hyper-parameters (e.g., LD clumping parameters and p value thresholds) and then stacks these PRSs by fitting a penalized regression model using individual-level data. This results in a more accurate PRS compared to C+T, provided the training sample size is sufficient. Based on weighted average PRSs, we propose a model with two independently generated PRSs (meta-PRS): an internal PRS, derived from the individual-level data, and an external PRS, derived from the GWAS summary statistics. The weights are trained using linear regression on a validation dataset. We derive the PRSs with methods that work well for highly polygenic traits: BOLT-LMM for the internal PRS and LDpred for the external PRS. We compare the prediction accuracy of the three approaches presented above (meta-GWAS, SCT, and meta-PRS) through simulations and application to real data of psychiatric disorders and other complex diseases and traits, using individual-level data from two large cohorts (iPSYCH and UK Biobank) as well as large GWAS summary statistics that excluded these cohorts. We show that meta-PRS often outperforms the other compared data-combining approaches in terms of prediction accuracy, while being a simpler approach. We also show that, with larger individual-level datasets, the performance of meta-PRS is expected to increase. Finally, we provide recommendations for selecting a PRS approach when GWAS summary statistics and large individual-level data are available for training.

Material and methods

Approaches for combining internal and external data

We investigated the difference in prediction performance of PRSs that are trained using both external GWAS summary statistics and internal individual-level genetic data, but combined through three different approaches (Table 1). In the first approach (meta-GWAS), the internal individual-level data were used to derive GWAS summary statistics that were subsequently meta-analyzed with the external GWAS summary statistics and finally used for deriving PRSs. For the second approach (SCT), we used the external summary statistics to derive a large number of C+T scores and the individual-level data to fit a penalized regression to linearly combine these C+T scores. In the third approach (meta-PRS), the individual-level data and GWAS summary statistics were used for deriving two independent PRSs. We obtained a weighted average of the two PRSs by fitting a linear regression model.

Table 1

Overview of the compared data-combining approaches and data utilization

Combining approach	Individual-level data	GWAS summary statistics	Combining strategy	Validation	Test
Meta-GWAS	GWAS	–	$\mathrm{PRS} = \sum_{i=1}^{M} Z_i \cdot x_i$, with $Z_i = \frac{n_{int} \cdot z_{int} + n_{ext} \cdot z_{ext}}{n_{int} + n_{ext}}$	select PRS parameters	assess PRS prediction accuracy
SCT	penalized regression of C+T scores	grid C+T scores	$\mathrm{PRS} = \sum_{j=1}^{k} w_j \cdot \mathrm{PRS}_j$	not used	assess PRS prediction accuracy
Meta-PRS	derive $\mathrm{PRS}_{int}$	derive $\mathrm{PRS}_{ext}$	$\mathrm{PRS} = w_{int} \cdot \mathrm{PRS}_{int} + w_{ext} \cdot \mathrm{PRS}_{ext}$	select PRS parameters (a)	assess PRS prediction accuracy

Abbreviations: M, number of SNPs; Z, SNP effect size; x, SNP effect allele count; n, effective sample size; int, internal data; ext, external data; k, number of PRSs in grid; w, weights (either regression coefficients or square root of training sample size).

(a) When the weights for meta-PRS were obtained with linear regression, the validation dataset was also used to train the regression parameters. When the weights were obtained from the training sample size, the validation set was not used.

In the three approaches, the individual-level data were split into training, validation, and test subsets following a 5-fold cross-validation scheme (4-0.5-0.5; 80% training, 10% validation, 10% testing). The selection criterion for all method parameters was the parameter maximizing prediction accuracy in terms of prediction R² in the validation data. Consequently, we obtained five estimates of PRS prediction performance for each method in the test subset and reported the mean. The standard error of the mean prediction accuracy was estimated through 10K bootstrap replicates of this mean.
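
The split and the bootstrap described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the random seed, and the five per-fold R² values are hypothetical, and the way the held-out fold is halved into validation and test sets is an assumption about how the 4-0.5-0.5 scheme is applied.

```python
import random
import statistics


def five_fold_splits(n_individuals, seed=0):
    """Yield (train, validation, test) index lists: 4 folds train, 0.5 validate, 0.5 test."""
    rng = random.Random(seed)
    idx = list(range(n_individuals))
    rng.shuffle(idx)
    folds = [idx[k::5] for k in range(5)]
    for k in range(5):
        held_out = folds[k]
        half = len(held_out) // 2  # assumption: held-out fold is cut in two halves
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, held_out[:half], held_out[half:]


def bootstrap_se_of_mean(estimates, n_boot=10_000, seed=0):
    """Standard error of the mean, from resampling the per-fold estimates with replacement."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(estimates, k=len(estimates)))
        for _ in range(n_boot)
    ]
    return statistics.stdev(means)


splits = list(five_fold_splits(1000))
r2_per_fold = [0.051, 0.048, 0.055, 0.050, 0.046]  # made-up test-set R2 values
mean_r2 = statistics.fmean(r2_per_fold)
se_r2 = bootstrap_se_of_mean(r2_per_fold)
print(f"mean R2 = {mean_r2:.3f}, bootstrap SE = {se_r2:.4f}")
```

With only five estimates per method, the bootstrap standard error is coarse, which is worth keeping in mind when reading overlapping confidence intervals.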

Computing PRSs

Meta-GWAS

We obtained internal GWAS summary statistics for the individual-level data using linear regression for the simulations and continuous phenotypes and logistic regression for the case-control real phenotypes. For the GWAS, we used the functions big_univLinReg and big_univLogReg from the R package bigstatsr. We used sex, age, genotyping batch, and the first 20 principal components (PCs) of each dataset as covariates in the GWAS. We performed an inverse variance-based meta-analysis with the external GWAS summary statistics using the software METAL.

We computed PRSs using LDpred v.1.0.10 (note that this version already implements some of the improvements made in LDpred2), using the infinitesimal model and 7 priors assuming a proportion of causal variants (p = 1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001). To compute the LD reference panel, we used an LD radius of 500 variants and a random sample of 5k unrelated individuals of European ancestry from each individual-level dataset. We then selected the LDpred PRS with p maximizing the prediction R² in the validation set.

We also computed PRSs with LD clumping and p value thresholding (C+T), selecting the score from a set of C+T PRSs that maximized the prediction R² in the validation set. The C+T PRSs were generated from a grid of parameters: LD pairwise correlation values (0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95), base window sizes (50, 100, 200, 500), and 50 p value thresholds (depending on max and min p value in summary statistics, on a log-log scale). For LD clumping, the SNP p values were used as a selection variable, i.e., for a pair of correlated SNPs, the SNP with the lowest p value was kept. A total of 1,400 C+T PRSs were derived for each chromosome.

We performed logistic regression followed by an inverse variance-based meta-analysis, as this is common practice for GWASs and all the analyzed case-control real traits had GWAS summary statistics from logistic regression. Nevertheless, we observed a slight increase in mean prediction accuracy of the PRSs from linear regression and sample size-based meta-analysis versus logistic regression and inverse variance-based meta-analysis (Figure S1), although with highly overlapping CIs. We also note that some of the variation was expected due to randomness in the cross-validation subsets.
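
For readers unfamiliar with the inverse variance-based scheme, the following Python sketch shows the standard fixed-effect calculation for a single SNP. It is an illustration of the general technique rather than METAL itself, and the effect sizes and standard errors in the example are made up.

```python
import math


def inverse_variance_meta(beta_int, se_int, beta_ext, se_ext):
    """Fixed-effect meta-analysis of one SNP from an internal and an external GWAS."""
    w_int = 1.0 / se_int ** 2  # each study is weighted by its precision
    w_ext = 1.0 / se_ext ** 2
    beta = (w_int * beta_int + w_ext * beta_ext) / (w_int + w_ext)
    se = math.sqrt(1.0 / (w_int + w_ext))
    return beta, se, beta / se  # combined effect, its SE, and the z-score


# Hypothetical per-SNP estimates (log odds ratios and standard errors).
beta, se, z = inverse_variance_meta(0.10, 0.02, 0.06, 0.01)
print(f"beta = {beta:.4f}, se = {se:.4f}, z = {z:.2f}")
```

Because the external study in this example has the smaller standard error, the combined estimate sits closer to its effect size, which is how a large external GWAS can dominate the meta-analyzed statistics.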

SCT

We computed C+T PRSs using the external GWAS summary statistics and the same grid of parameters as in the section meta-GWAS. The final PRS was computed using the function snp_grid_stacking from the R package bigsnpr, which performs penalized logistic regression, with the 1400 × 22 C+T scores as predictors and phenotypes as outcomes in the training set.
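
The size of the stacking problem follows directly from the grid. The short Python sketch below only enumerates the hyper-parameter combinations listed in the section meta-GWAS; the 50 p value thresholds are represented by their index because their exact values depend on the summary statistics.

```python
from itertools import product

r2_thresholds = [0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95]  # LD pairwise correlation values
base_windows = [50, 100, 200, 500]                       # base window sizes
n_p_thresholds = 50                                      # p value thresholds (by index)

grid = list(product(r2_thresholds, base_windows, range(n_p_thresholds)))
n_per_chromosome = len(grid)
n_predictors = n_per_chromosome * 22  # one set of scores per autosome

print(n_per_chromosome, n_predictors)
```

The penalized regression therefore has tens of thousands of correlated predictors, which is one reason SCT needs a reasonably large individual-level training set.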

Meta-PRS

To obtain the meta-PRS, we first computed two independent PRSs: $\mathrm{PRS}_{int}$ and $\mathrm{PRS}_{ext}$. For $\mathrm{PRS}_{int}$, we obtained per-SNP prediction betas with BOLT-LMM (using the flag --predBetasFile) and computed the PRS as $\mathrm{PRS}_{int} = \sum_{i=1}^{M} \hat{\beta}_i \cdot x_i$, where $M$ is the number of SNPs in the model, $\hat{\beta}_i$ is the prediction beta of SNP $i$, and $x_i$ is its effect allele count. For each sample and trait, we ran BOLT-LMM v.2.3.4 using sex, age, genotyping batch, and the first 20 PCs of the dataset as covariates. Depending on the polygenicity of the trait, BOLT-LMM uses a mixture-of-Gaussians prior on SNP effect sizes or the single-Gaussian BOLT-LMM-inf model, equivalent to best linear unbiased prediction (BLUP). $\mathrm{PRS}_{ext}$ was computed with LDpred or C+T, as described in the section meta-GWAS. Finally, we defined the meta-PRS with weights $w_{int}$ and $w_{ext}$ as the linear combination of the two PRSs, $\mathrm{PRS} = w_{int} \cdot \mathrm{PRS}_{int} + w_{ext} \cdot \mathrm{PRS}_{ext}$. To avoid overfitting, we trained the weights in a linear regression model in the validation dataset (lm function in R). For the linear combination, we also used as weights the square root of the respective PRS training data sample size. In these cases, PRSs were standardized prior to being combined. The latter use of weights is highlighted in the text; otherwise, the weights in the meta-PRS came from the linear regression model.
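
The two weighting options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline (which used lm in R): the function names are hypothetical, the two scores and the phenotype are simulated, and the sample sizes passed to the square-root rule are made up.

```python
import math
import random
import statistics


def standardize(values):
    """Scale a score to mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def regression_weights(prs_int, prs_ext, phenotype):
    """OLS weights for phenotype ~ PRS_int + PRS_ext (intercept removed by centring)."""
    m_a, m_b, m_y = (statistics.fmean(v) for v in (prs_int, prs_ext, phenotype))
    a = [v - m_a for v in prs_int]
    b = [v - m_b for v in prs_ext]
    y = [v - m_y for v in phenotype]
    s_aa = sum(u * u for u in a)
    s_bb = sum(u * u for u in b)
    s_ab = sum(u * v for u, v in zip(a, b))
    s_ay = sum(u * v for u, v in zip(a, y))
    s_by = sum(u * v for u, v in zip(b, y))
    det = s_aa * s_bb - s_ab ** 2  # determinant of the 2x2 normal equations
    return (s_ay * s_bb - s_by * s_ab) / det, (s_by * s_aa - s_ay * s_ab) / det


def sqrt_n_weights(n_int, n_ext):
    """Validation-free alternative: weight each standardized PRS by sqrt(training n)."""
    return math.sqrt(n_int), math.sqrt(n_ext)


def meta_prs(prs_int, prs_ext, w_int, w_ext):
    return [w_int * a + w_ext * b for a, b in zip(prs_int, prs_ext)]


# Simulated validation set in which the internal PRS is the stronger predictor.
rng = random.Random(42)
prs_int = standardize([rng.gauss(0, 1) for _ in range(2000)])
prs_ext = standardize([rng.gauss(0, 1) for _ in range(2000)])
phenotype = [0.7 * a + 0.3 * b + rng.gauss(0, 0.5) for a, b in zip(prs_int, prs_ext)]

w_int, w_ext = regression_weights(prs_int, prs_ext, phenotype)
w_int_n, w_ext_n = sqrt_n_weights(40_000, 90_000)  # hypothetical sample sizes
combined = meta_prs(prs_int, prs_ext, w_int, w_ext)
```

In practice the weights would be fitted on the validation subset only and then applied unchanged to the test subset, so that the reported accuracy is not inflated by the fitting step.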

Data and quality control

iPSYCH data

We used genotype and phenotype data from the iPSYCH 2012 and iPSYCH 2015 case-cohort samples. iPSYCH 2015 is an expansion of the iPSYCH 2012 data and includes the samples of the latter. Both datasets were analyzed separately to show the effect of increasing the sample sizes in the method comparison. The iPSYCH 2015 case-cohort sample is nested within the entire Danish population born between 1981 and 2008, including 1,657,449 persons. Cases were identified as persons with schizophrenia (SCZ), autism (ASD), attention-deficit/hyperactivity disorder (ADHD), and major depressive disorder (MDD); we identified control subjects as persons from the randomly selected cohort that were not diagnosed with any of the previous disorders. We also included the anorexia nervosa (AN; ANGI-DK) samples from the Anorexia Nervosa Genetics Initiative (ANGI). The genetics dataset consists of 134,677 individuals and 8,785,478 SNPs imputed following the RICOPILI pipeline. We computed the KING-robust relatedness coefficient and excluded at random one individual from each pair with >3rd degree relatedness, resulting in 14,789 individuals excluded. We performed principal component analysis (PCA) following Privé et al. and obtained 20 PCs. We also identified 122,197 genetically homogeneous individuals based on these 20 PCs. We define homogeneous individuals as those <4.8 log(dist) units from the center of the 20 PCs, calculated using the function dist_ogk from the R package bigutilsr. This resulted in a subset of 108,623 unrelated individuals of homogeneous ancestry. After removing SNPs with minor allele frequency (MAF) < 0.01 and SNPs failing the Hardy-Weinberg equilibrium filter ($\chi^2$ (df = 1) test statistic p value, $p_{HWE}$), we restricted to the HapMap3 variants. The final dataset was composed of 108,623 individuals and 1,184,443 SNPs.

UK Biobank data

We used genotype and phenotype data from the full release of the UK Biobank, consisting of 488,377 individuals with genetic information. Specifically, we imported dosage data from BGEN files using the function snp_readBGEN from the R package bigsnpr. We identified individuals with either self-reported or ICD-10 diagnosis for breast cancer (BC), coronary artery disease (CAD), type 2 diabetes (T2D), and major depressive disorder (MDD), setting the undiagnosed individuals as control subjects and restricting to women for breast cancer. We also identified individuals with standing height and body mass index (BMI) measurements to use as quantitative traits. We restricted the analysis to unrelated (as described in the section iPSYCH data) and “white British” genetic ancestry individuals. We removed SNPs with MAF < 0.01 and restricted to HapMap3 variants. The final dataset was composed of 337,475 individuals and 1,194,574 SNPs.

Simulations

We simulated case-control phenotypes using 1,194,574 HapMap3 SNPs and the subset of 337,475 unrelated European-ancestry individuals from the UK Biobank. The phenotypes were simulated with two different numbers of causal variants: 10k and 100k, representing polygenic traits. We also used two different total sample sizes: n = 337,475 (large simulations) and n = 50,000 (small simulations) individuals. Each causal variant was assigned an effect size drawn from $N(0, h^2/M_{causal})$, where $M_{causal}$ is the number of causal variants and the heritability $h^2$ = 0.5. The case-control status was assigned under a genetic liability model, with a simulated prevalence of 0.2. Each simulation scenario was repeated 5 times. From the sample of individuals, 90% were used as the training set, 5% as the validation set, and 5% as the test set. To represent scenarios with different sample sizes of the individual-level data and GWAS summary statistics, the training set was further split randomly according to the following partitions: 90%–10%, 75%–25%, 50%–50%, 25%–75%, and 10%–90%. One part was used to derive summary statistics and act as the external summary data, while the other part was used as individual-level data. The labels 9:1, 3:1, 1:1, 1:3, 1:9 used in the results reflect the sample size ratio of GWAS summary statistics (left) and individual-level data (right).
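
A toy version of a liability threshold simulation is sketched below in Python. It is only meant to make the idea concrete: the sample size, the number of causal variants, the allele frequency range, the normal effect-size distribution, and the rescaling of the genetic component to variance h² are all assumptions of this sketch, and real genotypes are replaced by independently simulated allele counts.

```python
import random
import statistics


def simulate_case_control(n_individuals=2000, n_causal=50, h2=0.5,
                          prevalence=0.2, seed=1):
    """Return (liability, status) under a simple liability threshold model."""
    rng = random.Random(seed)
    mafs = [rng.uniform(0.05, 0.5) for _ in range(n_causal)]
    effects = [rng.gauss(0.0, (h2 / n_causal) ** 0.5) for _ in range(n_causal)]

    genetic = []
    for _ in range(n_individuals):
        g = 0.0
        for maf, beta in zip(mafs, effects):
            # Allele count 0/1/2 as the sum of two Bernoulli draws.
            count = (rng.random() < maf) + (rng.random() < maf)
            g += beta * count
        genetic.append(g)

    # Rescale the genetic value to variance h2, then add environmental noise
    # with variance 1 - h2 so that the liability has unit variance.
    mu, sd = statistics.fmean(genetic), statistics.pstdev(genetic)
    liability = [
        (h2 ** 0.5) * (g - mu) / sd + rng.gauss(0.0, (1.0 - h2) ** 0.5)
        for g in genetic
    ]

    # Individuals above the (1 - prevalence) quantile of liability are cases.
    cutoff = sorted(liability)[int(round((1.0 - prevalence) * n_individuals))]
    status = [int(value >= cutoff) for value in liability]
    return liability, status


liability, status = simulate_case_control()
print(f"cases: {sum(status)} of {len(status)}")
```

Using the empirical quantile as the threshold fixes the in-sample prevalence; a theoretical normal quantile could be used instead when the prevalence should hold only in expectation.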

Prediction accuracy

The prediction accuracy of the PRSs was assessed in terms of squared correlation (R²) and area under the curve (AUC). The PRS prediction R² was reported as the squared partial correlation (using sex, age, genotyping batch, and the first 20 PCs as covariates) for the quantitative traits and transformed to the liability scale for the case-control data. Additionally, the AUC was reported for the case-control data.
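
The two metrics can be computed as in the following Python sketch. It covers the plain squared correlation and the rank-based AUC only; the covariate adjustment (partial correlation) and the liability-scale transformation used in the paper are deliberately left out, and the function names and example values are illustrative.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between a PRS and a phenotype."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied scores share their mean rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)


example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
example_r2 = squared_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(example_auc, example_r2)
```

The AUC depends only on the ranking of the scores, so it is unaffected by the scale of the PRS, whereas R² on the observed scale depends on the case fraction in the sample, which is why a liability-scale transformation is applied for case-control traits.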

LDSC regression

We obtained estimates of the genetic correlation and intercept from a bivariate LD score regression (LDSC), between the internal and external GWAS summary statistics of the traits in Tables 2 and S1. We used the R package GenomicSEM.

Table 2

Summary of real datasets

Traits	Individual-level sample size	GWAS sample size	Ratio int:ext	$r_g$ internal-external (SE)
iPSYCH dataset

Anorexia nervosa (AN)⁴⁵	7,713	35,274	1:5	0.8147 (0.0945)
Bipolar disorder (BD)⁴⁶	8,436	48,609	1:6	0.7855 (0.0804)
Schizophrenia (SCZ)⁴⁷	15,421	48,307	1:3	0.6175 (0.0677)
Autism spectrum disorder (ASD)⁴⁸	39,068	10,610	4:1	0.6241 (0.0671)
Attention deficit hyperactivity disorder (ADHD)⁴⁹	43,405	12,214	4:1	1.3137 (0.1216)
Major depressive disorder (MDD)⁵⁰	49,234	646,483	1:13	0.8115 (0.0477)

UK Biobank dataset

Coronary artery disease (CAD)⁵¹	35,457	162,973	1:5	0.8644 (0.0672)
Breast cancer (BC)⁵²	35,707	227,688	1:6	0.9378 (0.085)
Type 2 diabetes (T2D)⁵³	57,086	88,825	1:2	0.9567 (0.0595)
Major depressive disorder (MDD)⁵⁴	83,900	123,796	1:2	0.8156 (0.0632)
Body mass index (BMI)⁵⁵	269,106	339,224	1:1	0.9536 (0.0347)
Height⁵⁶	269,407	253,288	1:1	0.9389 (0.0417)

Effective sample sizes of the six psychiatric disorders in iPSYCH 2015 and ANGI, four diseases and two continuous traits in the UK Biobank, along with the effective sample sizes of the corresponding external GWAS summary statistics. The table reflects sizes of European, unrelated samples (see material and methods).

Results

Performance on simulated data

We evaluated the prediction accuracy of the PRSs using simulated data to explore the relationship between the combining approaches and the training sample size. Using the UK Biobank genetic data, we simulated traits with 10,000 (10k) and 100,000 (100k) causal SNPs, aiming at representing the polygenicity range of complex traits, and different sizes of training sample (10%, 25%, 50%, 75%, and 90% of n = 303,728 and 45,000 individuals) of individual-level data (internal) and GWAS summary statistics (external). First, we compared the prediction accuracy of PRSs trained only on internal data (using BOLT-LMM) or external data (using C+T or LDpred) in terms of mean prediction R² (Figure 1A) and AUC (Figure S2A). For all simulated scenarios, BOLT-LMM outperformed the other methods, with a larger relative improvement in the simulations with 10k causal SNPs. Among the GWAS summary statistics-based methods, C+T was generally more accurate in the simulations with 10k causal SNPs and LDpred in the ones with 100k causal SNPs. These results highlight the benefits of using the individual-level data for training PRSs over the derived GWAS summary statistics.

Figure 1

Prediction accuracy of the PRSs in the simulation study

Each panel displays the mean and 95% CI of the PRS prediction (y axis) for each data combining approach. The traits were simulated from a liability threshold model with 10,000 (10k) and 100,000 (100k) causal SNPs and heritability of 0.5, and case-control status was inferred from a disease prevalence of 0.2. Mean and 95% CI of prediction were obtained from 10k non-parametric bootstrap samples of 5 independent replicates.

(A) Effect of training sample size in the PRSs prediction accuracy. The x axis indicates the percentage of individuals from the total training set (n = 303,728) used as individual-level data for BOLT-LMM or GWAS summary statistics for C+T and LDpred.

(B) Effect of the ratio between internal and external data in the combining approaches. The x axis indicates the relative amount of external versus internal data, e.g., 3:1 indicates a scenario where the external data was 75% and the internal data was 25% of the total sample. Figure 1 is a simplified version of Figure S3, selecting a single method per combining approach between C+T and LDpred, where the method maximizing mean prediction was selected.

We also compared the prediction accuracy of PRSs using different data-combining approaches (SCT, meta-GWAS, and meta-PRS) in the simulated traits (Figures 1B, S2B, and S3). The external and internal datasets were matched to create combinations with different ratios of each data type (9:1, 3:1, 1:1, 1:3, 1:9; e.g., 3:1 indicates a scenario where the external data was 75% and the internal data was 25% of the total N ∼300k individuals in the training set). For meta-PRS, we observed a positive relation between the size of the internal data and the mean prediction R². The opposite was observed for SCT, where larger external datasets provided larger mean predictions. The ratio of data showed no effect for meta-GWAS, with constant prediction R² along the simulated ratios (Figure 1B).
These results indicated that it was possible to optimize PRS prediction accuracy by selecting a data-combining approach depending on the sample size ratio between the available internal and external data. While the classical meta-GWAS was a valid strategy at ratios of 1:1, scenarios with a more skewed ratio benefited from approaches like meta-PRS (for larger individual-level data) and SCT (for larger GWAS summary statistics), which use the individual-level data for training. We also performed simulations with smaller effective sample sizes for both individual-level data and GWAS summary statistics (Figure S4). Using a total sample size of 50k individuals, these simulations correspond better to the sample sizes used in the real data analysis. We observed similar mean prediction R² for both meta-PRS and meta-GWAS in these simulations (Figure S4B). The method-specific differences only showed an increase in mean prediction of the BOLT-LMM PRS over the LDpred PRS when the training sample was 90% of the total, i.e., when the effective sample size was 40.5k individuals (Figure S4A). To better understand the relationship between the sample size and the difference in mean prediction R² between meta-PRS and meta-GWAS, we plotted it as a function of the ratio $n_{int} h^2 / p$, where $n_{int}$ is the effective sample size in the individual-level data, $h^2$ is the heritability, and $p$ is the fraction of causal variants, i.e., 0.1 and 0.01 for the simulations with 100k and 10k causal SNPs, respectively. We note that this ratio is related to the expected prediction accuracy by Daetwyler et al., i.e., the larger it is, the more accurate the predictions we can expect. We found that the observed benefit from applying meta-PRS over meta-GWAS increased as a function of this quantity (Figure S5). Interestingly, we also found that the effective sample size of the external GWAS summary statistics did not influence this relative performance.
Aiming to simplify the construction of the meta-PRS, we attempted to use the square root of the effective sample size ($\sqrt{n}$) to weight the internal and external PRSs. This simplified version of meta-PRS is faster and does not need a validation dataset. In the previously described simulated scenarios, we compared the mean prediction R² of PRSs weighted by $\sqrt{n}$ and PRSs weighted by linear regression effect sizes (using a validation dataset). We only observed a small increase in mean prediction R² in the scenarios with large individual-level data (ratios 1:3 and 1:9), with the others remaining the same. We also compared to a meta-PRS between the meta-GWAS and the internally trained PRS with BOLT-LMM, observing no increase in mean prediction R² (Figure S6).

Performance on real data

We investigated the prediction accuracy of the data-combining approaches (meta-PRS, SCT, and meta-GWAS) in real complex traits using internal individual-level data from large genotype cohorts (iPSYCH and the UK Biobank) and external GWAS summary statistics. The set of traits selected included the six major psychiatric disorders (ASD, ADHD, MDD, BD, SCZ, and AN), three other complex diseases (BC, T2D, and CAD), and two continuous complex traits (height and BMI) (Table 2). The external GWAS summary statistics were selected to not have sample overlap with the individual-level datasets used. This was confirmed by checking the intercept of a bivariate LDSC regression between the internal and external data (Table S1). Of all traits, only height showed an intercept different from 0 (0.099, SE: 0.0265). Large sample sizes in GWASs (specifically for height) have been reported to cause this effect in the bivariate LDSC regression intercept. The set of SNPs used for each trait was the intersection between the SNPs in the individual-level data, the GWAS summary statistics, and the 1,440,616 HapMap3 SNPs. No single combining approach provided the largest mean prediction R² (Figure 2) or AUC (Figure S7) for all traits. In the cases where the sample size of the individual-level data was larger than that of the summary statistics (int > ext), meta-PRS increased mean prediction R² over SCT and meta-GWAS for height, while both meta-GWAS and meta-PRS had similar results for ASD and ADHD, with large and overlapping CIs. In the cases with equal training sample sizes (1:1), meta-PRS increased prediction accuracy over meta-GWAS and SCT for BMI and T2D, while the results for meta-GWAS and meta-PRS were similar for MDD UKB. Finally, in the cases where the sample size of the GWAS summary statistics was larger than that of the individual-level data (ext > int), the results were also diverse. For AN, CAD, SCZ, BD, and MDD iPSYCH, there was no major difference between meta-GWAS and meta-PRS. However, for BC, the data-combining approach with the largest mean prediction R² was SCT.

Figure 2

Prediction accuracy of the combining approaches in 12 complex traits from iPSYCH 2015 and UK Biobank

Each panel displays the mean and 95% CI of the PRS prediction (y axis) for each data-combining approach, for PRSs trained on individual-level data (int), GWAS summary statistics (ext), or both (ext+int) (x axis). The prediction was transformed to the liability scale using a population prevalence of 0.01 (ASD), 0.05 (ADHD), 0.15 (MDD UK Biobank), 0.05 (T2D), 0.01 (AN), 0.03 (CAD), 0.01 (SCZ), 0.07 (BC), 0.01 (BD), and 0.08 (MDD iPSYCH). The methods noted as int and ext were fitted using BOLT-LMM with individual-level data and LDpred or C+T with GWAS summary statistics, respectively. For simplification, only the ext PRS with the larger mean prediction is shown; the full results are available in Figure S8. Mean and 95% CI of the prediction were obtained from 10k non-parametric bootstrap samples of the 5 cross-validation subsets.

Generally, meta-GWAS resulted in a mean prediction R² similar to that of meta-PRS for the psychiatric disorders, with large and overlapping CIs. This was independent of the sample size ratio of internal versus external data. Results using either iPSYCH 2012 or 2015 were similar, despite the iPSYCH 2015 data having almost twice as many individuals (Figure S9, Table S1). For most outcomes validated in the UK Biobank data, the most accurate approach was meta-PRS, where the largest improvement was for height, BMI, and T2D. For these outcomes, the internal effective sample size was larger than for most of the other outcomes. BC was the only trait where SCT led to the most predictive PRS, even though the ratio internal:external was similar to other traits like CAD. The difference in mean prediction R² between meta-PRS and meta-GWAS was plotted as a function of the internal effective sample size ($n_{int}$), SNP-heritability ($h^2$), and proportion of causal variants ($p$) (Figure S10). We observed a trend similar to the one seen earlier in our simulations (Figure S5).
While all of the psychiatric disorders showed small values of , all the other disorders and traits showed an increase in mean prediction R2 from using meta-PRS as the data-combining approach over meta-GWAS. We also compared the meta-PRS constructed with linear regression weights to the one weighed by effective sample sizes () of training data (Figure S11). As in the simulations, we only observed an increase in mean prediction R2 in the traits with large individual-level data (height and BMI). In the rest of the traits, there was no preference for a specific weight type. The use of as weights is therefore recommended for these traits, as it does not require a validation set. Additionally, we constructed a meta-PRS between the meta-GWAS PRS and the BOLT-LMM PRS. As observed in the simulations, the mean prediction R2 of this PRS was similar to the one obtained from the linear regression meta-PRS, which combines the BOLT-LMM PRS to the .
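As an illustration of the linear regression weighting compared above, the following is a minimal sketch of how two PRSs might be combined with weights estimated in a validation set. It assumes ordinary least squares on standardized scores with a centered phenotype; the function names (`standardize`, `fit_meta_prs_weights`, `apply_meta_prs`) are hypothetical and do not come from the authors' software, and the exact regression model used in the study may differ.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Scale a list of scores to mean 0 and standard deviation 1."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]


def fit_meta_prs_weights(prs_int, prs_ext, phenotype):
    """Least-squares weights for y ~ w_int * PRS_int + w_ext * PRS_ext.

    Assumption: both scores are standardized and the phenotype is centered,
    so no intercept is needed and the 2x2 normal equations are solved directly.
    """
    x1, x2 = standardize(prs_int), standardize(prs_ext)
    y_mean = mean(phenotype)
    y = [v - y_mean for v in phenotype]
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * v for a, v in zip(x1, y))
    s2y = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("the two scores are perfectly collinear")
    w_int = (s22 * s1y - s12 * s2y) / det
    w_ext = (s11 * s2y - s12 * s1y) / det
    return w_int, w_ext


def apply_meta_prs(prs_int, prs_ext, w_int, w_ext):
    """Weighted sum of the two scores, standardized within the target sample."""
    x1, x2 = standardize(prs_int), standardize(prs_ext)
    return [w_int * a + w_ext * b for a, b in zip(x1, x2)]
```

In this sketch the weights would be fitted in a validation subset and then applied to a separate test subset, which is why this variant needs validation data while sample-size-based weights do not.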

Discussion

With genetic data now available to researchers as both large individual-level datasets and GWAS summary statistics, we want to understand how best to combine these two types of data to optimize polygenic prediction. With this aim, we have evaluated the predictive performance of PRSs generated with different data-combining approaches: meta-GWAS, SCT, and meta-PRS. We find that the simple approach of combining two different PRSs (meta-PRS), trained separately on individual-level data and GWAS summary statistics, may yield more accurate PRSs than a meta-GWAS, particularly when the individual-level dataset is sufficiently large. We observe this in simulated data, where meta-PRS consistently increases the mean prediction R2 over the widely used meta-GWAS approach, and in the real complex traits with a large individual-level dataset, e.g., height, BMI, and T2D.

Another advantage of meta-PRS is that it allows multiple pre-calculated PRSs to be combined, irrespective of prediction method. When validation data are not available, we show that one can use the square root of the training sample sizes as weights. The same approach could also be used to combine multiple PRSs (e.g., from the PGS Catalog), which would be standardized and averaged together using their corresponding training sample sizes. As an alternative approach, the scores in a meta-PRS could be weighted using MT-BLUP. Additionally, we tried using the meta-GWAS PRS as one of the variables for meta-PRS, which provided similar performance. In the case of BC, which has several large effects and relatively low polygenicity, the SCT PRS prediction is the most accurate, presumably because it relies more on variant thinning. For psychiatric disorders, we found that meta-GWAS and meta-PRS generally yielded similar results, despite these disorders being very polygenic and often having relatively large individual-level sample sizes.
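The sample-size weighting mentioned in the discussion (square root of the training sample sizes, for use when no validation data are available) could be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the weights are normalized to sum to 1 for convenience, and the effective sample size for case-control data is assumed to be 4 / (1/N_cases + 1/N_controls), a common convention that the text here does not spell out.

```python
from math import sqrt
from statistics import mean, pstdev


def effective_sample_size(n_cases, n_controls):
    """Assumed definition: N_eff = 4 / (1/N_cases + 1/N_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def standardize(scores):
    """Scale a list of scores to mean 0 and standard deviation 1."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]


def sample_size_weighted_meta_prs(score_sets, sample_sizes):
    """Combine standardized PRSs with weights proportional to sqrt(N).

    score_sets: one list of scores per PRS, all for the same individuals
    sample_sizes: training (effective) sample size behind each PRS
    """
    if len(score_sets) != len(sample_sizes):
        raise ValueError("need exactly one sample size per PRS")
    roots = [sqrt(n) for n in sample_sizes]
    total = sum(roots)
    weights = [r / total for r in roots]  # normalization is a convenience choice
    standardized = [standardize(s) for s in score_sets]
    n_individuals = len(standardized[0])
    return [
        sum(w * s[i] for w, s in zip(weights, standardized))
        for i in range(n_individuals)
    ]
```

Because the weights depend only on the training sample sizes, the same function applies to any number of pre-computed scores, regardless of the method that produced them.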
We also note that the expected relative improvement of meta-PRS over meta-GWAS is small when polygenicity is large. Our simulations and real data suggest that the relative prediction gain of meta-PRS over meta-GWAS increases as a function of the individual-level data sample size and seems to be independent of the external sample size. This is consistent with the observation that BMI and height display the largest benefit from using meta-PRS over meta-GWAS. As a general rule of thumb, we set the threshold for the internal effective sample size at 100k. However, our results also suggest that meta-PRS can be applied with smaller sample sizes without loss in prediction accuracy. Meta-PRS may be easier to construct in practice, as it does not require performing a meta-analysis of GWAS summary statistics. In addition, meta-PRS can be updated easily when new external data become available, as it only requires generating a new PRS from the new external GWAS summary statistics, or even taking one from a resource like the PGS Catalog.

Our simulations represent an idealized scenario where we assume that the genetic architecture is invariant between cohorts/samples (i.e., the genetic correlation is 1). Studies have shown that psychiatric disorders can be quite heterogeneous between cohorts. As previously shown by Schork et al., we have estimated the genetic correlation for psychiatric disorders between external and iPSYCH samples to be between 0.5 and 0.8, while the genetic correlation was larger (>0.8) for the rest of the analyzed complex traits. Similar to disease heterogeneity, differences in genetic ancestry between the training and testing data can also decrease the prediction accuracy of PRSs.
In the case of ancestry heterogeneity, the linear combination of PRSs trained independently on different ancestries improves prediction for admixed individuals, but the extent to which these sample heterogeneities affect the prediction accuracy of each of the compared data-combining approaches should be studied further. In meta-PRS we combined the BOLT-LMM and LDpred (or C+T) predictions, and therefore the results may not be fully generalizable to other methods, e.g., a more accurate method may lead to more accurate meta-GWAS scores. Nevertheless, given that LDpred generally performs well for polygenic traits in independent comparisons, we believe it acts as a good proxy for other similar methods, such as lasso regression, SBayesR, and PRS-CS. In the case of individual-level data and low polygenicity, L1-penalized regression may also provide more accurate PRSs than BOLT-LMM.

In summary, we found that a simple additive model of two polygenic scores (meta-PRS) was often more accurate than approaches that first meta-analyzed SNP effects (meta-GWAS) in highly polygenic traits. Fundamentally, the improvement in meta-PRS prediction accuracy stems from the fact that methods that train a polygenic prediction model on individual-level data have access to more training information than methods that only train on a summary of these data, and usually make fewer assumptions. However, meta-GWAS has the advantage that each effect estimate is updated separately, possibly making it more robust to small sample sizes and changes in genetic architecture.
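To make the contrast with meta-GWAS concrete, the sketch below shows how each SNP effect could be updated separately in a fixed-effect, inverse-variance-weighted meta-analysis. This weighting scheme is an assumption chosen for illustration (it is one common choice, and this section does not state which scheme the study used). The function names and the dictionary layout are hypothetical, and both inputs are assumed to be aligned to the same effect allele.

```python
from math import sqrt


def inverse_variance_meta(beta_int, se_int, beta_ext, se_ext):
    """Fixed-effect inverse-variance-weighted estimate for a single SNP."""
    w_int = 1.0 / se_int ** 2
    w_ext = 1.0 / se_ext ** 2
    beta = (w_int * beta_int + w_ext * beta_ext) / (w_int + w_ext)
    se = sqrt(1.0 / (w_int + w_ext))
    return beta, se


def meta_gwas(internal, external):
    """Meta-analyze two sets of summary statistics SNP by SNP.

    internal, external: dicts mapping a SNP identifier to (beta, se).
    Only SNPs present in both inputs are kept in this sketch.
    """
    combined = {}
    for snp, (beta_int, se_int) in internal.items():
        if snp in external:
            beta_ext, se_ext = external[snp]
            combined[snp] = inverse_variance_meta(
                beta_int, se_int, beta_ext, se_ext
            )
    return combined
```

Each SNP is handled independently of all others here, which is the property the discussion refers to. The meta-analyzed effects would still need to be passed to a summary-statistics PRS method such as LDpred or C+T to obtain a score.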

Declaration of interests

C.M.B. reports: Shire (grant recipient, Scientific Advisory Board member); Idorsia (consultant); Lundbeckfonden (grant recipient); Pearson (author, royalty recipient). The other authors declare no competing interests.
