Paul J Newcombe, Christopher P Nelson, Nilesh J Samani, Frank Dudbridge.
Abstract
The heritability of most complex traits is driven by variants throughout the genome. Consequently, polygenic risk scores, which combine information on multiple variants genome-wide, have demonstrated improved accuracy in genetic risk prediction. We present a new two-step approach to constructing genome-wide polygenic risk scores from meta-GWAS summary statistics. Local linkage disequilibrium (LD) is adjusted for in Step 1, followed by, uniquely, long-range LD in Step 2. Our algorithm is highly parallelizable, since block-wise analyses in Step 1 can be distributed across a high-performance computing cluster, and flexible, since sparsity and heritability are estimated within each block. Inference is obtained through a formal Bayesian variable selection framework, meaning final risk predictions are averaged over competing models. We compared our method to two alternative approaches, LDPred and lassosum, using all seven traits in the Wellcome Trust Case Control Consortium as well as meta-GWAS summaries for type 1 diabetes (T1D), coronary artery disease, and schizophrenia. Performance was generally similar across methods, although our framework provided more accurate predictions for T1D, for which there are multiple heterogeneous signals in regions of both short- and long-range LD. With sufficient compute resources, our method also allows the fastest runtimes.
Keywords: Bayesian variable selection; meta-GWAS; polygenic risk scores; risk prediction; summary statistics
Year: 2019 PMID: 31328830 PMCID: PMC6764842 DOI: 10.1002/gepi.22245
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
Application of three summary statistics prediction methods in the Wellcome Trust Case Control Consortium under three‐fold cross‐validation
| Trait | lassosum AUC (95% CI) | lassosum r2 | LDPred AUC (95% CI) | LDPred r2 | JAM AUC (95% CI) | JAM r2 |
|---|---|---|---|---|---|---|
| Bipolar disorder | 0.67 (0.64, 0.69) | 0.09 | 0.66 (0.63, 0.69) | 0.08 | 0.70 (0.67, 0.73) | 0.13 |
| Coronary artery disease | 0.59 (0.56, 0.62) | 0.02 | 0.59 (0.56, 0.62) | 0.02 | 0.65 (0.62, 0.67) | 0.08 |
| Crohn's disease | 0.65 (0.62, 0.68) | 0.07 | 0.69 (0.66, 0.72) | 0.10 | 0.69 (0.66, 0.72) | 0.12 |
| Hypertension | 0.61 (0.58, 0.64) | 0.04 | 0.59 (0.56, 0.62) | 0.03 | 0.58 (0.55, 0.61) | 0.02 |
| Rheumatoid arthritis | 0.71 (0.68, 0.73) | 0.12 | 0.72 (0.69, 0.75) | 0.14 | 0.74 (0.71, 0.76) | 0.16 |
| Type 1 diabetes | 0.83 (0.80, 0.85) | 0.30 | 0.87 (0.85, 0.89) | 0.39 | 0.86 (0.84, 0.88) | 0.36 |
| Type 2 diabetes | 0.62 (0.60, 0.65) | 0.04 | 0.60 (0.57, 0.63) | 0.05 | 0.64 (0.61, 0.67) | 0.08 |
Note: ROC AUCs and predictive r2 are presented, with ROC AUC 95% confidence intervals calculated via 2,000 stratified bootstrap samples. For each method, performance is presented for the best performing sparsity.
Abbreviations: AUC, area under the curve; ROC, receiver operating characteristic.
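The table note reports ROC AUC confidence intervals from 2,000 stratified bootstrap samples. The following is a minimal sketch of that kind of calculation, not the authors' code: the source does not state which software or interval type was used, so the percentile interval, the function names `roc_auc` and `stratified_bootstrap_auc_ci`, and the simulated scores are all assumptions made for illustration.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic; tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def stratified_bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the AUC (assumed interval type).

    Cases and controls are resampled separately, so every replicate keeps
    the original numbers of cases and controls.
    """
    rng = random.Random(seed)
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    boot_labels = [1] * len(cases) + [0] * len(controls)
    aucs = []
    for _ in range(n_boot):
        sample = rng.choices(cases, k=len(cases)) + rng.choices(controls, k=len(controls))
        aucs.append(roc_auc(boot_labels, sample))
    aucs.sort()
    lower = aucs[round((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, round((1 - alpha / 2) * n_boot))]
    return roc_auc(labels, scores), lower, upper


# Simulated risk scores (not data from the study): 200 cases, 200 controls.
sim = random.Random(1)
labels = [1] * 200 + [0] * 200
scores = [sim.gauss(1.0, 1.0) for _ in range(200)] + [sim.gauss(0.0, 1.0) for _ in range(200)]
auc, lower, upper = stratified_bootstrap_auc_ci(labels, scores)
print(f"AUC = {auc:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

Resampling within each outcome group, rather than from the pooled sample, keeps the case-to-control ratio constant across replicates, which is what makes the bootstrap stratified.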
Figure 1. Receiver operating characteristic areas under the curve (ROC AUCs) and predictive r2 for various predictive methods in three case studies training polygenic predictive models using meta‐GWAS summaries. For type 1 diabetes, the T1DGC (n = 8,005) was used for training and the Wellcome Trust Case Control Consortium (WTCCC) for validation. For coronary artery disease (CAD), CARDIoGRAM (n = 137,535) was used for training and the WTCCC for testing. For schizophrenia, the Psychiatric Genomics Consortium (n = 74,511) was used for training and the MGS study for testing. For each analysis and method, results are presented for the best performing sparsity. ROC AUC 95% confidence intervals were calculated using 2,000 stratified bootstrap replicates.
Figure 2. Block‐specific posterior mean numbers of selected single nucleotide polymorphisms by the JAM method in each of the three meta‐GWAS case studies. The vertical spread indicates the variation in block‐specific adapted sparsities from step one of our proposed framework. Note that the global average varies across the case studies owing to different optimal λs, which control the global sparsity. For CAD and schizophrenia, summary statistics were available from considerably larger training datasets, allowing the estimation of many more small effects (see Table 2). CAD, coronary artery disease.
Computational aspects of JAM and runtimes (in minutes) of the different methods applied to the meta‐GWAS case studies
| Case study | Total SNPs | JAM optimal λ | JAM SNPs selected | JAM runtime (min) | lassosum runtime (min) | LDPred runtime (min) |
|---|---|---|---|---|---|---|
| T1D | 231,510 | 0.01 | 712 | 4.6 | 10.3 | 61.5 |
| CAD | 211,263 | 0.001 | 7,233 | 5.1 | 7.5 | 27.5 |
| Schizophrenia | 385,474 | 1E-04 | 30,544 | 16.6 | 17.4 | 157.2 |
Note: The total number of SNPs analyzed (i.e., after QC), for all methods, is shown in the first column. The next two columns correspond to the optimal value of λ for use with JAM (smaller values encourage more sparsity) and the posterior average number of SNPs selected into the corresponding optimal JAM model.
Abbreviations: CAD, coronary artery disease; SNP, single‐nucleotide polymorphism; T1D, type 1 diabetes.