Oliver Pain1,2, Kylie P Glanville1, Saskia P Hagenaars1, Saskia Selzam1, Anna E Fürtjes1, Héléna A Gaspar1, Jonathan R I Coleman1, Kaili Rimfeld1, Gerome Breen1,2, Robert Plomin1, Lasse Folkersen3, Cathryn M Lewis1,2,4.
Abstract
The predictive utility of polygenic scores is increasing, and many polygenic scoring methods are available, but it is unclear which method performs best. This study evaluates the predictive utility of polygenic scoring methods within a reference-standardized framework, which uses a common set of variants and reference-based estimates of linkage disequilibrium and allele frequencies to construct scores. Eight polygenic score methods were tested: p-value thresholding and clumping (pT+clump), SBLUP, lassosum, LDpred1, LDpred2, PRScs, DBSLMM and SBayesR, evaluating their performance in predicting outcomes in UK Biobank and the Twins Early Development Study (TEDS). Strategies to identify optimal p-value thresholds and shrinkage parameters were compared, including 10-fold cross-validation, pseudovalidation and infinitesimal models (with no validation sample), and multi-polygenic score elastic net models. LDpred2, lassosum and PRScs performed strongly when 10-fold cross-validation was used to identify the most predictive p-value threshold or shrinkage parameter, giving a relative improvement of 16-18% over pT+clump in the correlation between observed and predicted outcome values. Using pseudovalidation, the best methods were PRScs, DBSLMM and SBayesR. PRScs pseudovalidation was only 3% worse than the best polygenic score identified by 10-fold cross-validation. Elastic net models containing polygenic scores based on a range of parameters consistently improved prediction over any single polygenic score. Within a reference-standardized framework, the best polygenic prediction was achieved using LDpred2, lassosum and PRScs, modeling multiple polygenic scores derived using multiple parameters. This study will help researchers performing polygenic score studies to select the most powerful and predictive analysis methods.
Year: 2021 PMID: 33945532 PMCID: PMC8121285 DOI: 10.1371/journal.pgen.1009021
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Sample size of target sample phenotypes after quality control.
| UKB Phenotype | Description | Total sample size | No. of controls | No. of cases |
|---|---|---|---|---|
| Depression | Major depression | 50000 | 25000 | 25000 |
| Intelligence | Fluid intelligence | 50000 | NA | NA |
| BMI | Body Mass Index | 50000 | NA | NA |
| Height | Height | 50000 | NA | NA |
| T2D | Type-2 Diabetes | 50000 | 35112 | 14888 |
| CAD | Coronary Artery Disease | 50000 | 25000 | 25000 |
| IBD | Inflammatory Bowel Disease | 50000 | 46539 | 3461 |
| MultiScler | Multiple Sclerosis | 50000 | 48863 | 1137 |
| RheuArth | Rheumatoid Arthritis | 50000 | 46592 | 3408 |
| Prostate Cancer | Prostate Cancer | 50000 | 47073 | 2927 |
| Breast Cancer | Breast Cancer | 50000 | 41488 | 8512 |
| GCSE | Mean GCSE scores | 7296 | NA | NA |
| ADHD | ADHD symptoms | 7880 | NA | NA |
| BMI21 | Body Mass Index at age 21 | 5220 | NA | NA |
| Height21 | Height at age 21 | 5455 | NA | NA |
Fig 1. Schematic diagram of reference-standardized polygenic scoring.
1KG = 1000 Genomes; LDSC = Linkage Disequilibrium Score Regression; MAF = Minor Allele Frequency; Pre-imputed genotype data = the observed genotype data have already been imputed; Observed genome-wide genotype data = the observed genotype data have not been imputed and therefore require imputation.
Description of polygenic scoring approaches.
| Method | Multiple tuning parameters | Pseudo-validation/ infinitesimal option | Software | Description | Parameters | MHC region | LD-reference |
|---|---|---|---|---|---|---|---|
| pT+clump | Yes | No | PLINK | LD-based clumping and | 10 nested | Only top variant retained | EUR 1KG, EUR 10K UKB |
| lassosum | Yes | Pseudo-validation | lassosum | Lasso regression-based | 80 s and lambda combinations: s = 0.2, 0.5, 0.9, 1. lambda = exp(seq(log(0.001), log(0.1), length.out = 20)) | Not excluded | EUR 1KG, EUR 10K UKB |
| PRScs | Yes | Pseudo-validation | PRScs | Bayesian shrinkage | 5 global shrinkage parameters (phi) = 1e-6, 1e-4, 1e-2, 1, auto | Not excluded | PRScs-provided EUR 1KG |
| SBLUP | No | Infinitesimal (only option) | GCTA | Best Linear Unbiased Prediction | NA | Not excluded | EUR 1KG, EUR 10K UKB |
| SBayesR | No | Pseudo-validation (only option) | GCTB | Bayesian shrinkage | NA | Excluded (as recommended) | EUR 1KG, EUR 10K UKB, GCTB-provided |
| LDpred1 | Yes | Infinitesimal | LDpred | Bayesian shrinkage | Infinitesimal model and 7 non-zero effect fractions (p) = 3e-3, 1e-3, 3e-2, 1e-2, 3e-1, 1e-1, 1 | Not excluded | EUR 1KG, EUR 10K UKB |
| LDpred2 | Yes | Pseudo-validation and infinitesimal | bigsnpr | Bayesian shrinkage | Auto, infinitesimal, and grid modes. Grid includes 126 combinations of heritability and non-zero effect fractions (p). | Not excluded | EUR 1KG, EUR 10K UKB |
| DBSLMM | No | Yes (only option) | DBSLMM | Bayesian shrinkage | NA | Not excluded | EUR 1KG, EUR 10K UKB |
Note. Default or recommended parameters were used for all methods.
The lassosum lambda values are described using R code.
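The lassosum parameter grid in the table is given as an R expression. As a minimal sketch, the same 80 combinations can be enumerated in Python using only the standard library. The helper name `log_spaced` and the variable names are illustrative and are not part of lassosum; the sketch only reproduces the grid values, not the method itself.

```python
import math

def log_spaced(start, stop, n):
    """Return n values evenly spaced on the log scale from start to stop,
    mirroring R's exp(seq(log(start), log(stop), length.out = n))."""
    step = (math.log(stop) - math.log(start)) / (n - 1)
    return [math.exp(math.log(start) + i * step) for i in range(n)]

# s values and lambda range as listed in the table
s_values = [0.2, 0.5, 0.9, 1.0]
lambda_values = log_spaced(0.001, 0.1, 20)

# Every (s, lambda) pair: 4 s values x 20 lambda values
grid = [(s, lam) for s in s_values for lam in lambda_values]
print(len(grid))  # 80
```

Each lambda is a constant multiple of the previous one, so the grid is denser, in absolute terms, near the small end of the range.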
Average test-set correlation between predicted and observed values across phenotypes.
| Method | Model | CrossVal R (SE) | IndepVal R (SE) |
|---|---|---|---|
| pT+clump | 10FCVal | 0.155 (0.002) | 0.153 (0.004) |
| pT+clump | MultiPRS | 0.175 (0.002) | 0.174 (0.004) |
| lassosum | 10FCVal | 0.19 (0.002) | 0.183 (0.004) |
| lassosum | MultiPRS | 0.199 (0.002) | 0.194 (0.004) |
| lassosum | PseudoVal | 0.159 (0.002) | 0.157 (0.004) |
| PRScs | 10FCVal | 0.19 (0.002) | 0.183 (0.004) |
| PRScs | MultiPRS | 0.194 (0.002) | 0.187 (0.004) |
| PRScs | PseudoVal | 0.188 (0.002) | 0.182 (0.004) |
| SBLUP | Inf | 0.162 (0.002) | 0.156 (0.004) |
| SBayesR | PseudoVal | 0.17 (0.002) | 0.167 (0.004) |
| LDpred1 | 10FCVal | 0.178 (0.002) | 0.171 (0.004) |
| LDpred1 | MultiPRS | 0.181 (0.002) | 0.175 (0.004) |
| LDpred1 | Inf | 0.163 (0.002) | 0.156 (0.004) |
| LDpred2 | 10FCVal | 0.194 (0.002) | 0.187 (0.004) |
| LDpred2 | MultiPRS | 0.197 (0.002) | 0.191 (0.004) |
| LDpred2 | PseudoVal | 0.155 (0.002) | 0.151 (0.004) |
| LDpred2 | Inf | 0.161 (0.002) | 0.155 (0.004) |
| DBSLMM | PseudoVal | 0.182 (0.002) | 0.175 (0.004) |
| All | MultiPRS | 0.202 (0.002) | 0.197 (0.004) |
Note. This table shows results based on the UKB target sample and the 1000 Genomes reference. 10FCVal = Single polygenic score based on the optimal parameter as identified using 10-fold cross-validation. MultiPRS = Elastic net model containing polygenic scores based on a range of parameters, with elastic net shrinkage parameters derived using 10-fold cross-validation. PseudoVal = Single polygenic score based on the predicted optimal parameter as identified using pseudovalidation, which requires no tuning sample. Inf = Single polygenic score based on the infinitesimal model, which requires no tuning sample.
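The abstract reports that PRScs pseudovalidation was only 3% worse than the best polygenic score identified by 10-fold cross-validation. The sketch below shows how such a relative difference can be computed from the CrossVal R column of the table above. It assumes the figure is a simple relative difference between average correlations; the function and variable names are illustrative.

```python
def relative_difference(r, r_ref):
    """Relative difference of correlation r from reference r_ref, as a fraction."""
    return (r - r_ref) / r_ref

# CrossVal R values copied from the table above (UKB target sample, 1KG reference)
cross_val_r = {
    ("pT+clump", "10FCVal"): 0.155,
    ("lassosum", "10FCVal"): 0.190,
    ("PRScs", "10FCVal"): 0.190,
    ("LDpred1", "10FCVal"): 0.178,
    ("LDpred2", "10FCVal"): 0.194,
    ("PRScs", "PseudoVal"): 0.188,
}

# Best single score chosen by 10-fold cross-validation
best_key = max((k for k in cross_val_r if k[1] == "10FCVal"), key=cross_val_r.get)
best_10fcval = cross_val_r[best_key]

# PRScs pseudovalidation relative to that best score
prscs_gap = relative_difference(cross_val_r[("PRScs", "PseudoVal")], best_10fcval)
print(best_key, f"{prscs_gap:.1%}")  # ('LDpred2', '10FCVal') -3.1%
```

The same function can be applied to any pair of rows. Differences computed from averaged correlations need not equal the average of per-phenotype differences.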
Fig 2. Polygenic scoring methods comparison for UKB target sample with 1KG reference.
A) Average test-set correlation between predicted and observed values across phenotypes. B) Average difference between observed-prediction correlations for the best pT+clump polygenic score and all other methods. The average difference across phenotypes is shown as diamonds, and the difference for each phenotype is shown as transparent circles. Only results based on the UKB target sample with the 1KG reference are shown. Error bars indicate the standard error of correlations for each method. 10FCVal represents a single polygenic score based on the optimal parameter as identified using 10-fold cross-validation. Multi-PRS represents an elastic net model containing polygenic scores based on a range of parameters, with elastic net shrinkage parameters derived using 10-fold cross-validation. PseudoVal represents a single polygenic score based on the predicted optimal parameter as identified using pseudovalidation, which requires no tuning sample. Inf represents a single polygenic score based on the infinitesimal model, which requires no tuning sample.
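The 10FCVal approach picks, from several candidate polygenic scores, the one whose parameter predicts best under 10-fold cross-validation. The following is a minimal sketch of that selection step in standard-library Python, run on synthetic data. It is not the authors' pipeline: the function names, the score names `param_a` to `param_c`, and the simulated data are all hypothetical, and the elastic net (Multi-PRS) model is not shown.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def cv_select(scores, outcome, k=10, seed=1):
    """Return (best score name, mean held-out correlation per score) using k-fold CV."""
    idx = list(range(len(outcome)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    mean_r = {}
    for name, score in scores.items():
        fold_r = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            # Fit on the training folds, predict the held-out fold
            slope, intercept = fit_line([score[i] for i in train],
                                        [outcome[i] for i in train])
            pred = [intercept + slope * score[i] for i in held_out]
            fold_r.append(pearson(pred, [outcome[i] for i in held_out]))
        mean_r[name] = sum(fold_r) / k
    return max(mean_r, key=mean_r.get), mean_r

# Synthetic data: one underlying signal, scores that measure it with growing noise
rng = random.Random(0)
n = 2000
signal = [rng.gauss(0, 1) for _ in range(n)]
outcome = [0.5 * g + rng.gauss(0, 1) for g in signal]
scores = {
    "param_a": [g + rng.gauss(0, 0.2) for g in signal],
    "param_b": [g + rng.gauss(0, 1.5) for g in signal],
    "param_c": [g + rng.gauss(0, 4.0) for g in signal],
}

best, mean_r = cv_select(scores, outcome)
print(best, {name: round(r, 3) for name, r in mean_r.items()})
```

In this simulation the least noisy score is selected. In a real analysis, the selected score would then be evaluated in a separate test set, which corresponds to the test-set correlations reported here.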
Fig 3. Pairwise comparison between all methods, showing average test-set observed-expected correlation difference between all methods with significance value.
Correlation difference = Test correlation − Comparison correlation. Red/orange coloring indicates that the Test method (shown on the Y axis) performed better than the Comparison method (shown on the X axis). Only results based on the UKB target sample with the 1KG reference are shown. * = p<0.05. ** = p<1×10⁻³. *** = p<1×10⁻⁶. P-values are two-sided. 10FCVal represents a single polygenic score based on the optimal parameter as identified using 10-fold cross-validation. Multi-PRS represents an elastic net model containing polygenic scores based on a range of parameters, with elastic net shrinkage parameters derived using 10-fold cross-validation. PseudoVal represents a single polygenic score based on the predicted optimal parameter as identified using pseudovalidation, which requires no tuning sample. Inf represents a single polygenic score based on the infinitesimal model, which requires no tuning sample.