Nilanjan Chatterjee, Bill Wheeler, Joshua Sampson, Patricia Hartge, Stephen J Chanock, Ju-Hyun Park.
Abstract
We report a new method to estimate the predictive performance of polygenic models for risk prediction and assess predictive performance for ten complex traits or common diseases. Using estimates of effect-size distribution and heritability derived from current studies, we project that although 45% of the variance of height has been attributed to SNPs, a model trained on one million people may only explain 33.4% of the variance of the trait. Models based on current studies allow for identification of 3.0%, 1.1% and 7.0% of the populations at twofold or higher than average risk for type 2 diabetes, coronary artery disease and prostate cancer, respectively. Tripling of sample sizes could elevate these percentages to 18.8%, 6.1% and 12.2%, respectively. The utility of polygenic models for risk prediction will depend on achievable sample sizes for the training data set, the underlying genetic architecture and the inclusion of information on other risk factors, including family history.
Year: 2013 PMID: 23455638 PMCID: PMC3729116 DOI: 10.1038/ng.2579
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Figure 1. Predictive correlation coefficient (PCC) for polygenic models and corresponding optimal significance level for SNP selection under three models for polygenic architectures for adult height.
Each model assumes that a total of 45% of the phenotypic variance of adult height can be explained by common SNPs included in standard GWAS platforms involving M=200,000 independent SNPs. The effect-size distribution for susceptibility SNPs is assumed to follow an exponential distribution (black line), a mixture of two exponential distributions (red line) or a mixture of three exponential distributions (blue line). Panels (a) and (b) show the expected value of squared PCC and the corresponding optimal significance level (αopt), respectively, as a function of sample size (N). Panel (c) compares PCC values reported in a predictive analysis of the GIANT study (dashed line) with the corresponding theoretical expected values under the three models.
Table 1. Characteristics of ten complex traits and associated GWAS used in the reported analysis.
| Trait | HT | BMI | TC | HDL | LDL | CD | T1D | T2D | PrCA | CAD |
|---|---|---|---|---|---|---|---|---|---|---|
| Narrow sense heritability | 0.45 | 0.14 | - | 0.12 | - | 0.22 | 0.30 | 0.51 | 0.22 | - |
| | 133K | 162K | 100K | 100K | 95K | 25K | 22K | 36K | 28K | 73K |
| | 108 | 31 | 45 | 35 | 36 | 64 | 30 | 22 | 20 | 21 |
| | 0.066 | 0.014 | 0.063 | 0.046 | 0.059 | 0.066 | 0.053 | 0.034 | 0.061 | 0.024 |
HT, height; BMI, body mass index; TC, total cholesterol; HDL, high-density lipoprotein cholesterol; LDL, low-density lipoprotein cholesterol; CD, Crohn’s disease; T1D, Type 1 diabetes; T2D, Type 2 diabetes; PrCA, prostate cancer; CAD, coronary artery disease.
Estimates of narrow sense heritability, i.e. phenotypic variability due to the total additive effects of common SNPs, for HT, BMI, HDL, CD, T1D and T2D are taken from published studies[20,21,35], and that for PrCA is obtained from an internal analysis of a new NCI GWAS involving approximately 5000 cases and 5000 controls genotyped on the Illumina Omni 2.5M platform. For qualitative traits, estimates are shown on the liability-threshold scale.
Figure 2. Expected predictive correlation coefficient (PCC) for polygenic models at the optimal significance level for SNP selection for four quantitative traits.
For HDL and BMI, the range of performance is shown corresponding to the estimate of narrow sense heritability (yellow line) and the associated 95% confidence interval (dark blue region). For LDL and TC, for which a direct estimate of heritability is not available, a range of values is chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of three exponential distributions, which together with the heritability is calibrated to explain observed discoveries from the largest GWAS (see Methods).
Figure 3. Expected AUC statistics at the optimal significance level for SNP selection for five disease traits.
For all diseases except CAD, the range of performance is shown corresponding to the estimate of narrow sense heritability (yellow line) and the associated 95% confidence interval (dark blue region). For CAD, for which a direct estimate of heritability is not available, a range of values is chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of two or three exponential distributions, which together with the heritability is calibrated to explain observed discoveries from the largest GWAS (see Methods).
Table 2. Projected discriminatory performance (AUC statistic) for polygenic risk models including SNPs at the genome-wide significance level (α=10⁻⁷) and at the optimized significance threshold (αOPT). Results for T1D are shown with or without (in parentheses) the contribution of the MHC region. For all diseases except CAD, AUC values are shown corresponding to the point estimates of narrow sense heritability shown in Table 1. For CAD, for which a direct estimate of heritability is not available, a range of values is chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of two or three exponential distributions, which together with the heritability is calibrated to explain observed discoveries from the largest GWAS to date.
| Trait | AUC with FH alone | Current Sample Size (N) | Model | α=10⁻⁷ | αOPT | α=10⁻⁷ | αOPT | α=10⁻⁷ | αOPT | α=10⁻⁷ | αOPT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.612 | 17K | SNPs | 0.71 | 0.74 | 0.77 | 0.82 | 0.81 | 0.84 | 0.84 | 0.86 |
| | | | SNPs+FH | 0.79 | 0.81 | 0.83 | 0.87 | 0.86 | 0.89 | 0.89 | 0.90 |
| T1D | 0.533 | 16K | SNPs | 0.84 (0.67) | 0.84 (0.69) | 0.85 (0.71) | 0.86 (0.73) | 0.86 (0.73) | 0.86 (0.75) | 0.86 (0.75) | 0.87 (0.75) |
| | | | SNPs+FH | 0.94 (0.70) | 0.94 (0.71) | 0.95 (0.74) | 0.96 (0.76) | 0.96 (0.76) | 0.96 (0.77) | 0.96 (0.77) | 0.96 (0.78) |
| | 0.595 | 22K | SNPs | 0.57 | 0.60 | 0.62 | 0.71 | 0.67 | 0.76 | 0.74 | 0.79 |
| | | | SNPs+FH | 0.63 | 0.66 | 0.67 | 0.74 | 0.71 | 0.78 | 0.77 | 0.81 |
| | 0.552 | 24K | SNPs | 0.63 | 0.63 | 0.64 | 0.66 | 0.66 | 0.69 | 0.69 | 0.71 |
| | | | SNPs+FH | 0.65 | 0.66 | 0.66 | 0.68 | 0.68 | 0.71 | 0.71 | 0.73 |
| CAD | 0.601 | 57K | SNPs | 0.58 | 0.59 | 0.59–0.60 | 0.62–0.64 | 0.61–0.62 | 0.64–0.67 | 0.64–0.66 | 0.67–0.69 |
| | | | SNPs+FH | 0.65 | 0.65 | 0.66 | 0.67–0.69 | 0.66–0.68 | 0.69–0.71 | 0.68–0.71 | 0.71–0.73 |
FH, presence of any family history in first-degree relatives. Prevalences of FH for CAD, PrCA and T2D are 0.14 (ref 40), 0.07 (ref 41) and 0.143 (ref 42), respectively. Prevalences of FH for T1D and CD are taken to be 0.005 and 0.01, which are the same as the disease prevalences[35].
For all diseases except PrCA, the current sample size is shown for the first stage of the respective largest GWAS. For PrCA, where a large number of SNPs were followed to stage 2, an effective sample size is shown for stage 1 and stage 2 combined.
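As a rough illustration of how an "AUC with FH alone" figure can arise from a yes/no risk factor, the sketch below computes the AUC of a binary marker from its prevalence and an assumed relative risk. The function name, the relative-risk input and the rare-disease approximation (controls treated as the general population) are assumptions made for illustration; the record above does not state the relative risks or the exact calculation the authors used.

```python
def binary_marker_auc(prevalence: float, relative_risk: float) -> float:
    """AUC of a binary risk factor such as family history (FH).

    prevalence    -- assumed frequency of the factor in the population
                     (used as the false-positive rate, which presumes a
                     rare disease so that controls resemble the population)
    relative_risk -- assumed risk ratio for carriers vs. non-carriers
                     (hypothetical input, not given in the source)
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")

    # Frequency of the factor among cases, by Bayes' rule.
    true_positive_rate = (prevalence * relative_risk) / (
        1.0 - prevalence + prevalence * relative_risk
    )
    false_positive_rate = prevalence

    # The ROC curve of a binary marker has a single interior point, so the
    # area under it is (1 + TPR - FPR) / 2.
    return 0.5 * (1.0 + true_positive_rate - false_positive_rate)


if __name__ == "__main__":
    # Hypothetical example: a factor carried by 14% of people with RR = 3.
    print(round(binary_marker_auc(0.14, 3.0), 3))
```

A relative risk of 1 gives an AUC of exactly 0.5, and the AUC grows with both the relative risk and, up to a point, the prevalence of the factor.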
Table 3. Proportion of cases followed (PCF) among the 20% of subjects with the highest polygenic risk, for models including SNPs at the genome-wide significance level (α=10⁻⁷) and at the optimized significance threshold (αOPT). Results for T1D are shown with or without (in parentheses) the contribution of the MHC region. For all diseases except CAD, PCF values are shown corresponding to point estimates of narrow sense heritability available from GWAS. For CAD, for which a direct estimate of heritability is not available, a range of values is chosen based on constraints imposed by observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of two or three exponential distributions, which together with the heritability is calibrated to explain observed discoveries from the largest GWAS to date.
| Trait | Current Sample Size (N) | Model | α=10⁻⁷ | αOPT | α=10⁻⁷ | αOPT | α=10⁻⁷ | αOPT | α=10⁻⁷ | αOPT |
|---|---|---|---|---|---|---|---|---|---|---|
| | 17K | SNPs | 0.48 | 0.52 | 0.58 | 0.65 | 0.62 | 0.72 | 0.72 | 0.75 |
| | | SNPs+FH | 0.61 | 0.65 | 0.70 | 0.77 | 0.75 | 0.80 | 0.81 | 0.83 |
| T1D | 16K | SNPs | 0.71 (0.42) | 0.71 (0.44) | 0.73 (0.48) | 0.75 (0.51) | 0.75 (0.51) | 0.76 (0.54) | 0.76 (0.54) | 0.77 (0.55) |
| | | SNPs+FH | 0.91 (0.46) | 0.92 (0.48) | 0.94 (0.52) | 0.95 (0.56) | 0.95 (0.56) | 0.95 (0.58) | 0.95 (0.59) | 0.96 (0.60) |
| | 22K | SNPs | 0.28 | 0.32 | 0.34 | 0.48 | 0.41 | 0.55 | 0.52 | 0.63 |
| | | SNPs+FH | 0.40 | 0.42 | 0.43 | 0.54 | 0.48 | 0.60 | 0.57 | 0.66 |
| | 24K | SNPs | 0.35 | 0.35 | 0.37 | 0.40 | 0.39 | 0.44 | 0.44 | 0.48 |
| | | SNPs+FH | 0.40 | 0.40 | 0.41 | 0.44 | 0.43 | 0.47 | 0.47 | 0.51 |
| CAD | 57K | SNPs | 0.29 | 0.30 | 0.31 | 0.34–0.37 | 0.32–0.34 | 0.38–0.41 | 0.36–0.40 | 0.42–0.45 |
| | | SNPs+FH | 0.42 | 0.42 | 0.42–0.43 | 0.44–0.46 | 0.43–0.44 | 0.46–0.49 | 0.46–0.48 | 0.49–0.52 |
FH, presence of any family history in first-degree relatives. Prevalences of FH for CAD, PrCA and T2D are 0.14 (ref 40), 0.07 (ref 41) and 0.143 (ref 42), respectively. Prevalences of FH for T1D and CD are taken to be 0.005 and 0.01, which are the same as the disease prevalences[35].
For all diseases except PrCA, the current sample size is shown for the first stage of the respective largest GWAS. For PrCA, where a large number of SNPs were followed to stage 2, an effective sample size is shown for stage 1 and stage 2 combined.
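To make the link between the AUC and PCF summaries more concrete, the sketch below converts one into the other under a simple working model. It assumes that log relative risk is normally distributed with standard deviation sigma in the population, that the disease is rare enough for controls to resemble the population, and that risk acts multiplicatively, so that cases have the same spread with a mean shifted by sigma squared. These assumptions and all function names are illustrative choices; the authors' exact calculation is not described in this record.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def sigma_from_auc(auc: float) -> float:
    """Standard deviation of log relative risk implied by an AUC.

    Under the assumed log-normal model, AUC = Phi(sigma / sqrt(2)).
    """
    if not 0.5 <= auc < 1.0:
        raise ValueError("auc must lie in [0.5, 1)")
    return math.sqrt(2.0) * _STD_NORMAL.inv_cdf(auc)


def auc_from_sigma(sigma: float) -> float:
    """AUC implied by the standard deviation of log relative risk."""
    return _STD_NORMAL.cdf(sigma / math.sqrt(2.0))


def pcf_from_sigma(sigma: float, top_fraction: float = 0.20) -> float:
    """Proportion of cases found in the top `top_fraction` of the
    population ranked by risk, under the same assumed model."""
    if not 0.0 < top_fraction < 1.0:
        raise ValueError("top_fraction must lie strictly between 0 and 1")
    # Cut-off in standard-deviation units for the population distribution.
    cutoff = _STD_NORMAL.inv_cdf(1.0 - top_fraction)
    # The case distribution is shifted upward by sigma in those units.
    return 1.0 - _STD_NORMAL.cdf(cutoff - sigma)


if __name__ == "__main__":
    sigma = sigma_from_auc(0.58)
    # Under these assumptions an AUC of 0.58 maps to a PCF of about 0.29.
    print(round(pcf_from_sigma(sigma), 2))
```

With no discrimination (sigma of 0, AUC of 0.5) the top 20% of the population contains exactly 20% of cases. The AUC 0.58 example is in line with the paired AUC and PCF entries tabulated above, although agreement under this simplified model should not be read as a reconstruction of the authors' method.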