Abstract
Polygenic scores have recently been used to summarise genetic effects among an ensemble of markers that do not individually achieve significance in a large-scale association study. Markers are selected using an initial training sample and used to construct a score in an independent replication sample by forming the weighted sum of associated alleles within each subject. Association between a trait and this composite score implies that a genetic signal is present among the selected markers, and the score can then be used for prediction of individual trait values. This approach has been used to obtain evidence of a genetic effect when no single markers are significant, to establish a common genetic basis for related disorders, and to construct risk prediction models. In some cases, however, the desired association or prediction has not been achieved. Here, the power and predictive accuracy of a polygenic score are derived from a quantitative genetics model as a function of the sizes of the two samples, explained genetic variance, selection thresholds for including a marker in the score, and methods for weighting effect sizes in the score. Expressions are derived for quantitative and discrete traits, the latter allowing for case/control sampling. A novel approach to estimating the variance explained by a marker panel is also proposed. It is shown that published studies with significant association of polygenic scores have been well powered, whereas those with negative results can be explained by low sample size. It is also shown that useful levels of prediction may only be approached when predictors are estimated from very large samples, up to an order of magnitude greater than currently available. Therefore, polygenic scores currently have more utility for association testing than predicting complex traits, but prediction will become more feasible as sample sizes continue to grow.
Year: 2013 PMID: 23555274 PMCID: PMC3605113 DOI: 10.1371/journal.pgen.1003348
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
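The score construction described in the abstract (select markers in the training sample, then form a weighted sum of alleles in each replication subject) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name, argument names, and the default threshold of 0.05 are hypothetical, and treating the unweighted variant as a signed allele count is one plausible reading of the "allele count" method named in the tables.

```python
def polygenic_score(allele_counts, effects, p_values,
                    p_upper=0.05, p_lower=0.0, weighted=True):
    """Score one replication-sample subject (illustrative sketch).

    allele_counts: copies (0, 1 or 2) of the tested allele at each marker.
    effects, p_values: per-marker estimates taken from the training sample.
    p_lower, p_upper: bounds on the training P-value for selecting a marker.
    """
    score = 0.0
    for count, beta, p in zip(allele_counts, effects, p_values):
        if not (p_lower <= p <= p_upper):
            continue  # marker is not selected into the score
        if weighted:
            weight = beta  # weight by the estimated effect size
        else:
            # Assumed unweighted variant: keep only the direction of effect.
            weight = float((beta > 0) - (beta < 0))
        score += weight * count
    return score


# Toy data (made up): three markers, one subject.
subject = [0, 1, 2]
betas = [0.5, -0.2, 0.1]
pvals = [0.01, 0.2, 0.04]
print(polygenic_score(subject, betas, pvals))                   # markers 1 and 3 selected
print(polygenic_score(subject, betas, pvals, weighted=False))   # signed allele count
```

Raising `p_upper` admits more markers, which adds both true signal and noise; the figures and tables below concern how that trade-off affects power and accuracy.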
Parameters and notation of polygenic model.
| Training sample size |
| Replication sample size |
| Number of markers in genotyping panel |
| Variance of marker effects in training sample |
| Variance of marker effects in replication sample |
| Covariance of marker effects between training and replication samples |
| Proportion of markers with no effect in either sample |
| Lower bound on |
| Upper bound on |
| Prevalence of binary trait in training sample |
| Prevalence of binary trait in replication sample |
| Sampling proportion of cases of binary trait in training sample |
| Sampling proportion of cases of binary trait in replication sample |
Figure 1. Expected −log10(P) of linear regression estimate as a function of P-value threshold for selecting markers into the polygenic score.
Training sample, 3322 cases and 3587 controls; replication sample, 2687 cases and 2656 controls. Marker panel of 74,062 independent SNPs. Variance explained by markers, 28.7%. π0, proportion of markers with no effect on disease.
Figure 2. Expected −log10(P) of allele score estimate as a function of P-value threshold for selecting markers into the polygenic score.
Training sample, 3322 cases and 3587 controls; replication sample, 2687 cases and 2656 controls. Marker panel of 74,062 independent SNPs. Variance explained by markers, 28.7%. π0, proportion of markers with no effect on disease.
AUC calculated by Evans et al [19] compared to analytic values when the marker panel explains half the heritability or the full heritability.
| | | Bipolar disorder | Coronary artery disease | Crohn's disease | Hypertension | Type-2 diabetes |
| K | | 0.01 | 0.056 | 0.001 | 0.3 | 0.03 |
| h² | | 0.69 | 0.72 | 0.76 | 1 | 0.6 |
| Linear regression | Evans | 0.668 | 0.595 | 0.614 | 0.61 | 0.601 |
| | Panel explains half of h² | 0.570 (0.890) | 0.547 (0.843) | 0.620 (0.948) | 0.539 (0.841) | 0.545 (0.832) |
| | Panel explains full h² | 0.638 (0.974) | 0.592 (0.948) | 0.727 (0.995) | 0.577 (0.971) | 0.588 (0.934) |
| Allele count | Evans | 0.653 | 0.599 | 0.617 | 0.602 | 0.589 |
| | Panel explains half of h² | 0.561 (0.827) | 0.540 (0.780) | 0.604 (0.894) | 0.533 (0.770) | 0.539 (0.772) |
| | Panel explains full h² | 0.620 (0.922) | 0.580 (0.880) | 0.698 (0.970) | 0.567 (0.885) | 0.576 (0.868) |
K, population prevalence, and h², heritability of liability, taken from Wray et al [21], except for hypertension, which is assumed fully heritable for illustration. In parentheses, AUC achieved by an infinite sample.
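The infinite-sample AUC values in parentheses depend only on the prevalence K and the variance explained on the liability scale. One way to approximate them is the liability-threshold expression commonly attributed to Wray et al (2010), sketched below. Using that expression here is an assumption; the paper's own derivation is not reproduced in this excerpt, and the function and variable names are illustrative. For bipolar disorder (K = 0.01, h² = 0.69) the sketch gives values close to the linear-regression entries 0.890 and 0.974.

```python
from math import sqrt
from statistics import NormalDist


def max_auc(prevalence, variance_explained):
    """Approximate AUC of a score explaining `variance_explained` of liability.

    Assumes a liability-threshold model with unit-variance liability and a
    score that is normally distributed within cases and within controls.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    density = std_normal.pdf(threshold)
    mean_cases = density / prevalence              # mean liability of cases
    mean_controls = -density / (1.0 - prevalence)  # mean liability of controls
    v = variance_explained
    # Variance of the score within cases and within controls.
    var_cases = v * (1.0 - v * mean_cases * (mean_cases - threshold))
    var_controls = v * (1.0 - v * mean_controls * (mean_controls - threshold))
    separation = (mean_cases - mean_controls) * v
    return std_normal.cdf(separation / sqrt(var_cases + var_controls))


# Bipolar disorder: panel explains half, then all, of h² = 0.69.
print(round(max_auc(0.01, 0.345), 3))
print(round(max_auc(0.01, 0.69), 3))
```

The finite-sample entries outside the parentheses additionally depend on the training sample size, the number of markers and the selection threshold, which this sketch does not model.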
R² reported for complex diseases compared to analytic values when the marker panel explains one quarter, one half or the full heritability.
| | | Schiz | MS | BrCa | PrCa | RA | Celiac | MI/CAD | T2D |
| K | | .01 | .001 | .036 | .024 | .0075 | .0075 | .056 | .03 |
| h² | | .8 | .5 | .44 | .44 | .55 | .55 | .72 | .6 |
| π0 | | 0 | 0 | 0 | 0 | .97 | .98 | .98 | .96 |
| Reported R² | | .013 | .012 | .001 | .001 | .003 | .007 | .007 | .013 |
| Panel explains one quarter of h² | R² | .006 (.2) | .002 (.125) | .0002 (.11) | .0002 (.11) | .001 (.1375) | .0008 (.1375) | .001 (.18) | .003 (.15) |
| | AUC | .56 (.81) | .54 (.81) | .51 (.71) | .51 (.72) | .52 (.52) | .52 (.77) | .52 (.75) | .53 (.75) |
| Panel explains one half of h² | R² | .024 (.4) | .008 (.25) | .0008 (.22) | .0009 (.22) | .006 (.275) | .003 (.275) | .004 (.36) | .010 (.3) |
| | AUC | .62 (.91) | .58 (.90) | .52 (.79) | .52 (.80) | .56 (.87) | .55 (.87) | .54 (.84) | .57 (.94) |
| Panel explains full h² | R² | .089 (.8) | .03 (.5) | .003 (.44) | .003 (.44) | .025 (.55) | .013 (.55) | .017 (.72) | .013 (.6) |
| | AUC | .72 (.99) | .66 (.97) | .54 (.89) | .54 (.90) | .62 (.95) | .59 (.96) | .58 (.95) | .63 (.94) |
| Reported variance explained | | .3 | na | na | na | .18 | .44 | .48 | .49 |
| Estimated variance explained | | .29 | .31 | .30 | .28 | .21 | .40 | .47 | .34 |
Schiz, schizophrenia. MS, multiple sclerosis. BrCa, breast cancer. PrCa, prostate cancer. RA, rheumatoid arthritis. Celiac, celiac disease. MI/CAD, early-onset myocardial infarction or coronary artery disease. T2D, type-2 diabetes. K, population prevalence, and h², heritability of liability, taken from Visscher et al [1] and Wray et al [21], except for celiac, assumed equal to RA. π0, proportion of markers assumed to have no effects. Reported R², highest R² reported in cited publication, transformed to the liability scale. In parentheses, values achieved by an infinite training sample. Reported variance explained, variance explained by markers as estimated in cited publication. Estimated variance explained, variance explained as estimated using the method proposed herein.
Figure 3. AUC as a function of sample size, using a panel of 100,000 markers that explains half the heritability of liability.
n, number of cases and of controls in training sample. Heritability of liability, 76% for Crohn's disease and 44% for breast cancer. Line annotations are the proportion of markers with no effect on disease.
Numbers of cases and controls (in 1000s of each, rounded up) required to attain a specified AUC using a panel of 100,000 markers that explains half the heritability of liability.
| | AUC | π0 = 0.99 | π0 = 0.90 | π0 = 0.75 | π0 = 0 |
| Crohn's disease | 0.75 | 2 (0.0004) | 9 (0.02) | 12 (0.5) | 12 (1) |
| | 0.855 = 0.9*Max | 3 (0.0004) | 19 (0.01) | 34 (0.06) | 42 (1) |
| | 0.9025 = 0.95*Max | 6 (0.0004) | 35 (0.008) | 68 (0.04) | 100 (1) |
| | 0.9405 = 0.99*Max | 23 (0.0003) | 165 (0.004) | 349 (0.02) | 690 (1) |
| Breast cancer | 0.75 | 23 (0.0004) | 157 (0.008) | 311 (0.03) | 476 (1) |
| | 0.711 = 0.9*Max | 12 (0.0005) | 77 (0.01) | 144 (0.05) | 183 (1) |
| | 0.7505 = 0.95*Max | 23 (0.0005) | 159 (0.01) | 315 (0.05) | 484 (1) |
| | 0.7821 = 0.99*Max | 100 (0.00024) | 755 (0.00389) | 1610 (0.0147) | 3281 (1) |
π0, proportion of SNPs having no effect on disease. Max, maximum AUC achievable given the genetic variance of the marker panel. In parentheses, P-value threshold that maximises the AUC.
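The row labels of the form "0.855 = 0.9*Max" are targets set as a fixed fraction of the maximum achievable AUC. A small sketch of that arithmetic follows; the function name is hypothetical, and the value Max = 0.95 for Crohn's disease is inferred from the tabulated targets (0.855 / 0.9) rather than stated in this excerpt.

```python
def auc_targets(max_auc, fractions=(0.9, 0.95, 0.99)):
    """Target AUCs expressed as fractions of the maximum achievable AUC."""
    return {fraction: round(fraction * max_auc, 4) for fraction in fractions}


# Crohn's disease, panel explaining half the heritability (Max inferred as 0.95).
for fraction, target in auc_targets(0.95).items():
    print(f"{target} = {fraction}*Max")
```

Reading the table this way makes clear that the last few percent of the maximum are the most expensive: the required sample size grows several-fold between the 0.95*Max and 0.99*Max rows.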
Figure 4. AUC as a function of sample size, using a panel of 1,000,000 markers that explains the full heritability.
n, number of cases and of controls in training sample. Heritability of liability, 76% for Crohn's disease and 44% for breast cancer. Line annotations are the proportion of markers with no effect on disease.
Numbers of cases and controls (in 1000s of each, rounded up) required to attain a specified AUC using a panel of 1,000,000 markers that explains the full heritability.
| | AUC | π0 = 0.999 | π0 = 0.99 | π0 = 0.90 | π0 = 0.75 | π0 = 0 |
| Crohn's disease | 0.75 | 1 (0.00007) | 5 (0.0004) | 25 (0.08) | 27 (1) | 27 (1) |
| | 0.9 = 0.9*Max | 2 (0.00007) | 10 (0.0004) | 62 (0.01) | 107 (0.1) | 117 (1) |
| | 0.95 = 0.95*Max | 3 (0.00007) | 16 (0.0005) | 103 (0.01) | 190 (0.05) | 243 (1) |
| | 0.99 = 0.99*Max | 8 (0.00007) | 58 (0.0003) | 413 (0.006) | 847 (0.02) | 1487 (1) |
| Breast cancer | 0.75 | 6 (0.00007) | 41 (0.0004) | 256 (0.01) | 448 (0.09) | 505 (1) |
| | 0.801 = 0.9*Max | 9 (0.00007) | 65 (0.0005) | 428 (0.009) | 806 (0.05) | 1062 (1) |
| | 0.8455 = 0.95*Max | 17 (0.00007) | 124 (0.0004) | 857 (0.007) | 1702 (0.03) | 2656 (1) |
| | 0.8811 = 0.99*Max | 77 (0.00007) | 566 (0.0002) | 4305 (0.004) | 9223 (0.01) | 19191 (1) |
π0, proportion of SNPs having no effect on disease. Max, maximum AUC achievable given the genetic variance of the marker panel. In parentheses, P-value threshold that maximises the AUC.
Numbers of subjects (in 1000s, rounded up) required to attain a specified correlation with a normal trait using a panel of 1,000,000 markers that explains the full heritability.
| | Correlation | π0 = 0.999 | π0 = 0.99 | π0 = 0.90 | π0 = 0.75 | π0 = 0 |
| Max = 0.894 | 0.8046 = 0.9*Max | 31 (0.00007) | 227 (0.0004) | 1601 (0.007) | 3231 (0.03) | 5329 (1) |
| | 0.8493 = 0.95*Max | 55 (0.00007) | 411 (0.0003) | 3004 (0.005) | 6250 (0.02) | 11571 (1) |
| | 0.88506 = 0.99*Max | 213 (0.00007) | 1546 (0.0002) | 12171 (0.003) | 26724 (0.01) | 61565 (1) |
| Max = 0.632 | 0.5688 = 0.9*Max | 61 (0.00007) | 453 (0.0004) | 3201 (0.007) | 6461 (0.03) | 10658 (1) |
| | 0.6004 = 0.95*Max | 109 (0.00007) | 821 (0.0003) | 6007 (0.005) | 12500 (0.02) | 23141 (1) |
| | 0.62568 = 0.99*Max | 426 (0.00007) | 3092 (0.0002) | 24341 (0.003) | 53448 (0.01) | 123128 (1) |
π0, proportion of SNPs having no effect on the trait. Max, maximum correlation achievable given the genetic variance of the marker panel. In parentheses, P-value threshold that maximises the correlation.
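For the π0 = 0 column, where every marker enters the score (threshold 1), the tabulated sample sizes can be approximated with a common large-sample expression for a quantitative trait, R² ≈ v · nv / (nv + m), where v is the variance explained by the panel, m the number of markers and n the training sample size. This expression is an assumption brought in for illustration, not the paper's own derivation. The values v = 0.8 and v = 0.4 are also inferred, from the tabulated targets (0.8046 / 0.9 ≈ 0.894 ≈ √0.8 and 0.5688 / 0.9 ≈ 0.632 ≈ √0.4), and the function names are hypothetical.

```python
def expected_r2(n_training, n_markers, variance_explained):
    """Approximate R² of a score using all markers (assumed approximation)."""
    signal = n_training * variance_explained
    return variance_explained * signal / (signal + n_markers)


def required_training_size(fraction_of_max, n_markers, variance_explained):
    """Training size at which the correlation reaches a fraction of its maximum.

    The maximum correlation is sqrt(variance_explained), so the target is
    R² = fraction_of_max² * variance_explained; solve expected_r2 for n.
    """
    f2 = fraction_of_max ** 2
    return f2 / (1.0 - f2) * n_markers / variance_explained


# Panel of 1,000,000 markers, inferred variance explained of 0.8 and 0.4.
for v in (0.8, 0.4):
    for fraction in (0.9, 0.95, 0.99):
        n = required_training_size(fraction, 1_000_000, v)
        print(f"v={v}, {fraction}*Max: about {n / 1000:.0f} thousand subjects")
```

Under these assumptions the results fall within about one thousand subjects of the tabulated π0 = 0 entries. The other columns involve marker selection by P-value threshold, which this approximation does not cover.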