Literature DB >> 23455638

Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies.

Nilanjan Chatterjee¹, Bill Wheeler, Joshua Sampson, Patricia Hartge, Stephen J Chanock, Ju-Hyun Park.

Abstract

We report a new method to estimate the predictive performance of polygenic models for risk prediction and assess predictive performance for ten complex traits or common diseases. Using estimates of effect-size distribution and heritability derived from current studies, we project that although 45% of the variance of height has been attributed to SNPs, a model trained on one million people may only explain 33.4% of variance of the trait. Models based on current studies allow for identification of 3.0%, 1.1% and 7.0% of the populations at twofold or higher than average risk for type 2 diabetes, coronary artery disease and prostate cancer, respectively. Tripling of sample sizes could elevate these percentages to 18.8%, 6.1% and 12.2%, respectively. The utility of polygenic models for risk prediction will depend on achievable sample sizes for the training data set, the underlying genetic architecture and the inclusion of information on other risk factors, including family history.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2013 PMID： 23455638 PMCID： PMC3729116 DOI： 10.1038/ng.2579

Source DB: PubMed Journal: Nat Genet ISSN： 1061-4036 Impact factor: 38.330

Introduction

For quite some time, many have predicted that the identification of heritable disease susceptibility markers, such as common genetic variants, could eventually lead to stable models for risk-prediction with important individual and public health implications[1]. Even for a trait such as breast cancer, which manifests a modest degree of familial aggregation, a polygenic model based on a comprehensive set of genetic variants could achieve sufficient discriminatory power and thus be applied in targeted screening programs[2]. To date, genome-wide association studies (GWAS) now have identified thousands of common susceptibility variants for a wide spectrum of complex traits. Recent studies, however, indicate that for most individual traits the loci discovered so far, explain only a small fraction of heritability and thus, collectively have low predictive power[3-11]. While the phenomenon of “missing heritability”[12,13] can be due to many factors such as an overestimation of heritability itself, lack of knowledge of gene-gene and gene-environment interactions and contributions from rare variants, there is increasing recognition that a significant part of the heritability comes from a large number of common SNPs, each of which individually has too small of an effect to be detected at the stringent genome-wide significance level with current sample sizes[14-18]. Recent studies, for example, have indicated that while about 200 loci identified through a large GWAS involving more than 100,000 subjects can explain only approximately 10% of the variance of adult height[6], a set of common SNPs included in existing GWAS platforms can explain up to 45% of the variance of the same trait[16]. Similar studies for a number of other complex traits have indicated the presence of significant “hidden heritability” in GWAS[17,19-21]. The gap between estimates of heritability based on known loci and those estimated due to the comprehensive set of common susceptibility variants raises the possibility of substantially improving prediction performance of risk models by using a “polygenic” approach, one that includes many SNPs that do not reach the stringent threshold for genome-wide significance. A major factor that determines how well such a model can perform in predicting a trait value in an independent sample will be the sample size of the “training” dataset based on which the prediction model can be built. Intuitively, as the sample size for the training dataset increases, the effects of the underlying SNPs can be more precisely estimated. Correspondingly, the underlying true polygenic model, which harnesses the full predictive power associated with total heritability associated with the SNPs, will be more accurately approximated. In this report, we measure the ability of models based on current as well as future GWAS to improve the prediction of individual traits. We develop a new theoretical framework that characterizes the relationship between sample-size and predictive performance of a polygenic model based on the number and distribution of effect-sizes for the underlying susceptibility SNPs and the optimal balance of type-I and type-II error associated with the underlying criterion of SNP selection. Based on this, we provide a realistic assessment of the predictive performance of a polygenic model for each of ten complex traits, namely, the quantitative traits height (HT), body mass index (BMI), total cholesterol (TC), HDL and LDL and the disease traits, Crohn’s disease (CD), Type 1 diabetes (T1D), Type 2 diabetes (T2D), coronary artery disease (CAD) and prostate cancer (PrCA). We use a range of effect-size distributions that are consistent with both known discoveries, 412 in total, reported from the largest GWAS of these traits and recent estimates of the “narrow-sense” heritability, i.e. the total heritability of the traits attributable to additive effects of common SNPs. The results disclose several insights into the predictive ability of existing GWAS, the marginal utility of further increase in sample size, the sample-size threshold beyond which the predictive ability of the models may reach a plateau, the optimal threshold for SNP selection and the joint utility of family history information and polygenic risks. Furthermore, the general theoretical framework we provide can be used to make projections for the predictive utility of different polygenic model building strategies that may utilize alternative statistical algorithms or/and could incorporate other types of effects, such as those due to gene-gene interactions and rare variants.

Results

Throughout, we assess the predictive performance of a model based on its “predictive correlation coefficient” (PCC), which, for a continuous outcome, is equivalent to the Pearson’s correlation coefficient between true and predicted outcomes for the underlying population of subjects. For a binary disease outcome, we show that PCC has a one-to-one mathematical correspondence to the area under the curve (AUC) statistics and other standard measures for discriminatory performance of risk models. In derivation of this formula, we assume a simple but popularly used[22] model building algorithm in which SNPs are first selected for inclusion in the model depending on whether the corresponding individual tests of association achieve a specified significance threshold (α) and then a polygenic score is built by weighing the selected SNPs based on their estimated regression coefficients. The details of the underlying models and assumptions can be found in Methods section. The relationship between predictive performance of the model and the sample size (N) for the training data set is shown in the formula (1), which forms the basis of our analytical calculations (Methods). Simulation studies confirm the accuracy of the expression (1) (Supplementary Figure 1). According to this formula, the predictive performance of a model depends upon: (i) the number of true susceptibility SNPs (M1) compared to the total number of SNPs under study (M), (ii) the true effect-sizes (βs) for the underlying susceptibility SNPs, (iii) the chosen significance level (α) for SNP selection, (iv) the power of the underlying association test to reach that significance level, and (v) the expected value of the estimated regression coefficients and their squared values for the selected SNPs. The sample-size of the training dataset (N) influences both the power of the association test-statistics and the deviations of the estimated regression coefficients from their true values (see Methods for more details). Given an effect-size distribution, since the number of underlying susceptibility SNPs (M1) determine the total variability of the trait explainable by the underlying model, expression (1) can be also re-written in terms of “narrow sense heritability” ( ), which is defined for the purpose of this report to be the heritability of a trait due to additive effects of common tagging SNPs included on current, commercially available SNP microarrays (see formula 2). In all our subsequent analyses, we assume that genotyping platforms based on which most current GWA studies have been conducted to contain approximately on average M=200,000 independent SNPs. As for a model for complex trait, we first investigated the predictive performance of polygenic models for adult height. Figure 1 shows that the predictive accuracy of polygenic models greatly depends upon the distribution of effect sizes even when all distributions result in a total heritability of 45%[16]. The predictive performance of the model for all sample sizes is the highest when an exponential distribution underlies the effect-sizes. The performance of the model decreases substantially under a two-component, exponential-mixture model, which, compared to the exponential model, provides a much better fit to the observed effect sizes of the known SNPs by allowing for the presence of a larger number of SNPs, each with smaller effect (Supplementary Table 1). Finally, the performance of the model is the lowest under a three component exponential-mixture distribution, which allows an even larger number of SNPs with smaller effects and produces results that are most consistent with the observed discoveries in the GIANT study[6] (Supplementary Table 1). Our methods successfully reproduced results from a predictive analysis reported in the GIANT study in which distinct polygenic models were built with different significance thresholds for SNP selection and their predictive performance were empirically assessed using independently held out assessed. Our method, when applied to the three-component mixture exponential distribution at the given sample size of the GIANT study (N=130,000), provided an accurate approximation for the entire profile of the observed predictive performance of these polygenic models (Figure 1).

Figure 1

Predictive correlation coefficient (PCC) for polygenic models and corresponding optimal significance level for SNP selection under three models for polygenic architectures for adult height

Each model assumes a total of 45% of phenotypic variance of adult height can be explained by common SNPs included in standard GWAS platforms involving M=200,000 independent SNPs. The effect size distribution for susceptibility SNPs are assumed to follow an exponential distribution (black line), a mixture of two exponential distributions (red line) or a mixture of three exponential distributions (blue line). Panel (a) and (b) show expected value of squared PCC and corresponding optimal significance level (αopt), respectively, as a function of sample size (N). Panel (c) compares PCC values reported in a predictive analysis of the GIANT study (dashed line) with corresponding theoretical expected values under the three different models.

Expression (1) illustrates the trade-off between specificity and sensitivity of the SNP selection criterion on the predictive performance of the model. When a more liberal significance threshold (α) is chosen, then the value of the predictive correlation coefficient will increase through the power of the association tests, but will decrease as a function of the underlying type-I error (α). Figure 1 illustrates the optimal threshold for SNP selection that would maximize predictive performance of a model for adult height. Under both the two- and three- component mixture distributions for effect sizes, the optimal significance level initially increases, with sample size, it reaches a plateau and then remains constant or decreases slightly. In contrast, under the single exponential distribution that corresponds to stronger effect sizes, the optimal significance level becomes more stringent as sample size increases. We next examined the potential predictive performance of polygenic models for a variety of traits that include both quantitative (BMI, TC, HDL, LDL) and qualitative phenotypes (CD, T1D, T2D, CAD, PrCA) that together demonstrate a spectrum of estimated heritability (Table 1). For most traits, we consider a range for the underlying effect-size distributions that are in accord with both reported discoveries from the largest GWAS and recent estimates of narrow-sense heritability (Methods, Supplementary Tables 2 and 3). For a few traits for which external estimates of are not available, we considered a range of its values within the limits of total heritability and effect-size distributions that can produce results consistent with the observed discoveries in the largest GWAS.

Table 1

Characteristics of ten complex traits and associated GWAS used in reported analysis.

Trait	HT	BMI	TC	HDL	LDL	CD	T1D	T2D	PrCA	CAD
Narrow sense heritability ( hg2)	0.45	0.14	-	0.12	-	0.22	0.30	0.51	0.22	-
Effective sample-size for the largest GWAS	133K	162K	100K	100K	95K	25K	22K	36K	28K	73K
No. of detected SNPs	108	31	45	35	36	64	30	22	20	21
Heritability explained by detected SNPs	0.066	0.014	0.063	0.046	0.059	0.066	0.053	0.034	0.061	0.024

HT, height; BMI, body mass index; TC, total cholesterol; CD, Crohn’s disease; T1D, Type 1 diabetes; T2D, Type 2 diabetes; PrCA, prostate cancer; CAD, coronary artery disease.

Estimates of narrow sense heritability ( ), i.e. phenotype variability due total additive effects of common SNPs, for HT, BMI, HDL, CD, T1D and T2D are taken from published studies[20,21,35] and that for PrCA is obtained based on internal analysis of a new NCI GWAS involving approximately 5000 cases and 5000 controls genotyped on Illumina Omni 2.5M platform. For qualitative traits, estimates are shown in the liability-threshold scale.

For all traits, the expected performance of the polygenic models built based on current GWAS (sample size=N) can be predicted fairly accurately (Figure 2 and Figure 3). Although it may be possible to improve the performance of these models by inclusion of SNPs that do not achieve strict genome-wide significance levels, the models are expected to have low to modest predictive power even after optimization of the SNP selection criterion (Table 2). As sample sizes of the future studies increase, the projected performances of the models will have a wider range reflecting the uncertainty associated with estimates of heritability. Nevertheless it is evident that only very large sample sizes can substantially improve the performance of the models, even in some of the best case scenarios. For prostate cancer (PrCA), for example, while a polygenic model built based on the current largest GWAS can be expected to achieve an AUC statistics of about 63%, in the future, a model built based on as many as triple that sample size is expected to increase the AUC statistic only in the range 64–70% (Figure 3). For all disease traits except coronary artery disease (CAD), it appears that the marginal utility of additional sample can be quite small after the size of GWAS reaches 100,000–200,000 subjects. In contrast, for CAD, BMI, and the lipid traits TC and LDL, the performance of predictive models may continue to improve gradually over a much wider range of sample sizes, reaching as high as 500,000 to one million subjects.

Figure 2

Expected predictive correlation coefficient (PCC) for polygenic models at optimal significance level for SNP selection for four quantitative traits.

For HDL and BMI, range of performance is shown corresponding to estimate of (yellow line) and associated 95% confidence interval (dark blue region). For LDL and TC, for which direct estimate of is not available, a range of values are chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of three exponential distributions, which together with is calibrated to explain observed discoveries from the largest GWAS (see Methods).

Figure 3

Expected AUC statistics at optimal significance level for SNP selection for five disease traits.

For all diseases except CAD, range of performance is shown corresponding to estimate of (yellow line) and associated 95% confidence intervals (dark blue region). For CAD, for which direct estimate of is not available, a range of its values are chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of two or three exponential distribution, which together with is calibrated to explain observed discoveries from the largest GWAS (see Methods).

Table 2

Projected discriminatory performance (AUC statistic) for polygenic risk models including SNPs at genome-wide significance level (α=10−7) and at optimized significance threshold (αOPT). Results for T1D are shown with or without (in parenthesis) contribution of the MHC region. For all diseases except CAD, AUC values are shown corresponding to point estimates of shown in Table 1. For CAD, for which direct estimate of is not available, a range of values are chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of two or three exponential distribution, which together with is appropriately calibrated to explain observed discoveries from the largest GWAS to date.

Trait	AUC with FH alone	Current Sample Size (N)	Model	N		3×N		5×N		10×N
Trait	AUC with FH alone	Current Sample Size (N)	Model	α=10⁻⁷	α_OPT	α=10⁻⁷	α_OPT	α=10⁻⁷	α_OPT	α=10⁻⁷	α_OPT
CD	0.612	17K	SNPs	0.71	0.74	0.77	0.82	0.81	0.84	0.84	0.86
CD	0.612	17K	SNPs+FH	0.79	0.81	0.83	0.87	0.86	0.89	0.89	0.90

T1D	0.533	16K	SNPs	0.84 (0.67)	0.84 (0.69)	0.85 (0.71)	0.86 (0.73)	0.86 (0.73)	0.86 (0.75)	0.86 (0.75)	0.87 (0.75)
T1D	0.533	16K	SNPs+FH	0.94 (0.70)	0.94 (0.71)	0.95 (0.74)	0.96 (0.76)	0.96 (0.76)	0.96 (0.77)	0.96 (0.77)	0.96 (0.78)

T2D	0.595	22K	SNPs	0.57	0.60	0.62	0.71	0.67	0.76	0.74	0.79
T2D	0.595	22K	SNPs+FH	0.63	0.66	0.67	0.74	0.71	0.78	0.77	0.81

PrCA	0.552	24K	SNPs	0.63	0.63	0.64	0.66	0.66	0.69	0.69	0.71
PrCA	0.552	24K	SNPs+FH	0.65	0.66	0.66	0.68	0.68	0.71	0.71	0.73

CAD	0.601	57K	SNPs	0.58	0.59	0.59–0.60	0.62–0.64	0.61–0.62	0.64–0.67	0.64–0.66	0.67–0.69
CAD	0.601	57K	SNPs+FH	0.65	0.65	0.66	0.67–0.69	0.66–0.68	0.69–0.71	0.68–0.71	0.71–0.73

FH, presence of any family history in first-degree relatives. Prevalences of FH for CAD, PrCA and T2D are 0.14 (ref 40), 0.07 (ref 41), and 0.143 (ref 42), respectively. Prevalence of FH for T1D and CD are taken to be 0.005 and 0.01 which are the same as the disease prevalence[35].

For all diseases, except PrCA the current sample size is shown for the first-stage of the respective largest GWAS. For PrCA, where a large number of SNPs were followed to stage-2, an effective sample size is shown for stage-1 and stage-2 combined.

The predictive performance of a model strongly depends on the degree of heritability. For any given sample size, more accurate prediction is possible for more heritable traits, such as CD and T1D, than for relatively less heritable traits such as CAD, PrCA and T2D, which is in accord with classical estimates of heritability based on sibling and twin studies. Accordingly, the ability of the models to identify future cases in “high-risk” group varies (Table 3). For example, using models based on current GWAS, the proportion of future cases that could be identified among top 20 percentile of subjects with highest polygenic risk is 71% for T1D and about 32% for T2D. If the sample size for a future GWAS is tripled, then the corresponding proportions would be expected to increase to 75% and 48%, respectively. Among the three common chronic diseases, the proportion of the population that can be identified to have two-fold or higher risk than an average person ranged from 1.1% (CAD) to 7.0% (PrCA) for models built based on current sample sizes (Supplementary Table 4). If the sample size for future studies could be tripled, then these proportions could be elevate to 6.1% (CAD) and 18.8% (T2D).

Table 3

Proportion of cases followed (PCF) among 20% of subjects with highest polygenic risk including SNPs at genome-wide significance level (α=10−7) and at optimized significance threshold (αOPT). Results for T1D are shown with or without (in parenthesis) contribution of the MHC region. For all diseases except CAD, AUC values are shown corresponding to point estimates of available from GWAS studies. For CAD, for which direct estimate of is not available, a range of values are chosen based on constraints imposed by observed discoveries. For all traits, the underlying effect-size distribution is assumed to follow a mixture of two or three exponential distribution, which together with is appropriately calibrated to explain observed discoveries from the largest GWAS to date.

Trait	Current Sample Size (N)	Model	N		3×N		5×N		10×N
Trait	Current Sample Size (N)	Model	α=10⁻⁷	α_OPT	α=10⁻⁷	α_OPT	α=10⁻⁷	α_OPT	α=10⁻⁷	α_OPT
CD	17K	SNPs	0.48	0.52	0.58	0.65	0.62	0.72	0.72	0.75
CD	17K	SNPs+FH	0.61	0.65	0.70	0.77	0.75	0.80	0.81	0.83

T1D	16K	SNPs	0.71 (0.42)	0.71 (0.44)	0.73 (0.48)	0.75 (0.51)	0.75 (0.51)	0.76 (0.54)	0.76 (0.54)	0.77 (0.55)
T1D	16K	SNPs+FH	0.91 (0.46)	0.92 (0.48)	0.94 (0.52)	0.95 (0.56)	0.95 (0.56)	0.95 (0.58)	0.95 (0.59)	0.96 (0.60)

T2D	22K	SNPs	0.28	0.32	0.34	0.48	0.41	0.55	0.52	0.63
T2D	22K	SNPs+FH	0.40	0.42	0.43	0.54	0.48	0.60	0.57	0.66

PrCA	24K	SNPs	0.35	0.35	0.37	0.40	0.39	0.44	0.44	0.48
PrCA	24K	SNPs+FH	0.40	0.40	0.41	0.44	0.43	0.47	0.47	0.51

CAD	57K	SNPs	0.29	0.30	0.31	0.34–0.37	0.32–0.34	0.38–0.41	0.36–0.40	0.42–0.45
CAD	57K	SNPs+FH	0.42	0.42	0.42–0.43	0.44–0.46	0.43–0.44	0.46–0.49	0.46–0.48	0.49–0.52

For all diseases, family history (FH) information alone has low discriminatory ability. However, models including both FH and polygenic scores can perform substantially better than models using polygenic scores alone especially for rare highly familial conditional such as CD and T1D. Even if polygenic scores could be built in the future based on very large sample sizes (e.g. sample-size=5×N), FH is expected to remain an important variable for identifying high-risk subjects (Table 2 and 3).

Discussion

In summary, our analysis demonstrates that the predictive ability of polygenic models depends not only on total heritability, but also on the underlying effect-size distributions (Figure 1). The emerging effect size distributions from large GWAS suggest that although risk prediction models will continue to improve in the future as total sample size accumulates, the improvement will be slow and modest even when common SNPs account for a large proportion of heritability of the underlying traits (Figures 2 and 3). Our analysis also reveals that under the most likely effect-size distributions, the optimal significance threshold for selecting SNPs for prediction models in large GWAS can be more liberal than threshold standard (e.g. p < 5 × 10−8) used for discovery. We observed that for less common, highly familial conditions, like T1D and CD, risk models including FH and optimal polygenic scores based on current GWAS can identify a large majority of cases by targeting a small group of high-risk individuals (e.g., subjects who fall in the highest quintile of risk). In contrast, for more common conditions with modest familial components, such as T2D, CAD and PrCA, risk models based on GWAS with current (N) or foreseeable sample sizes in the near future (e.g., triple in size, 3×N) can miss a large proportion (>50%) of cases by targeting a small group of high-risk individuals. For these common diseases, polygenic models using current GWAS can identify a small minority of the population with elevated risk. Based on our model, we suggest that it is necessary to augment sample size of current GWAS by at least three times to substantially increase the proportion of high-risk populations identified by polygenic models. Perhaps one day GWAS or sequencing would be carried out as part of standard clinical care and then such information together with electronic medical records (EMR) could be used to build polygenic models based on sufficiently large studies. Consistent with a previous report[23], our analysis of T1D with and without contribution of the MHC region highlights the limited incremental discriminatory ability of polygenic scores for diseases that have established common and strong risk factors (Tables 2 and 3). Nevertheless, for most diseases, polygenic scores are expected to contribute substantially in addition to family history. One could also expect that in the foreseeable future, even crude family history information such as the presence or absence of the disease in any first degree relative, will remain an important contributing factor for predicting disease-risk in the general population. More detailed information on extended family history, including age-at-onset information, could further enhance predictive utility of these models especially for applications in high-risk family settings. Our analysis extends beyond prior reports[24-27] to project the predictive performance of polygenic models most of which relied on simulation studies. A previous report[25] had noted that predictive performance of models that include all GWAS SNPs in a polygenic score without SNP selection depends only on the sample size of the training dataset and . More general theory shows that an algorithm which includes all SNPs in a model, i.e. uses the significance level of α=1, could be poor and the predictive performance of more efficient algorithms is expected to depend on the underlying effect-size distribution. Previous simulation studies often have relied on hypothetical effect size distributions. Here, we use the effect-size distributions that are implied by constraints imposed by both known discoveries reported from some of the largest GWAS to date and recent estimates of heritability to provide a realistic depiction of the future of genetic risk prediction. Our results are generally consistent with a recent analysis[28] that used information on risk in monozygotic twins to examine the absolute limits of “personalized medicine” achievable by genome sequencing under the assumption that such technology can ultimately lead to an ideal model that can capture the full spectrum of genetic risk without possibility of any error. In this report, we provide much sharper bounds for what can be achieved in practice using current or future GWAS by taking into account the likely error associated with estimation of underlying risk that is inevitable because of constraints on sample sizes. Emerging effect-size distributions suggest that GWAS will require huge sample size to approach the ideal predictive power associated with additive effects of common SNPs. Using a metric used by this report together with the assumption of independent susceptibility alleles across traits, for example, we can predict that while GWAS in principle can identify 55.1% of the population who might have two-fold or higher risk than average for at least one of the three common diseases, CAD, T2D and PrCA, the actual proportion which is achievable using current GWAS studies is only 10.7% and for future studies that triple e the sample size is 33.1%. If the susceptibility alleles across these traits are related, however, these proportions could become higher. In this report, we have made projections based on a simple GWAS polygenic model building algorithm[6,22] after its optimization with respect to the criteria for SNP inclusion. The general framework we constructed (Supplementary Note), however, can be used to assess the likely performance of other, possibly even more efficient, model building strategies. Using this framework, for example, we project that an algorithm that utilizes LASSO-type [29] thresholds and can analyze all SNPs simultaneously, may outperform the standard GWAS polygenic model building algorithm. This may be particularly interesting for large sample sizes and highly heritable traits like height, but we also note that the gains are generally modest in scope (Supplementary Figure 2). Simultaneous modeling of correlated SNPs within small genomic regions can unmask allelic heterogeneity possibly adding to the overall predictive strength of the models[8,30]. Other strategies may include linear mixed modeling[16] and Bayesian methods[31,32] that can construct polygenic scores based on shrinkage estimates for SNP coefficients utilizing specific priors for the effect-size distribution. Although the absolute performance of different algorithms could be somewhat different across settings, the main results we highlight regarding the order of sample sizes required for improvement of risk prediction is intrinsically related to the underlying effect-sizes and are likely to be observed with other algorithms as well. Our proposed theoretical framework can be used to speculate on the predictive performance of polygenic models that could be built based on rare variants. In an additional illustration (Supplementary Figure 3), under a model that allows large number of susceptibility loci each containing sets of low-penetrant rare variants, we assessed how polygenic models might perform if variants are included in a model as individual cofactors versus using a gene-collapsing strategy that has been advocated for improving power for association tests[33]. We observed that up to a certain range of sample sizes for the training dataset, models based on collapsed variables often can perform better, apparently due to the improved power for detection of the underlying susceptibility loci. For larger sample sizes, however, their performance might fall short compared to models based on individual variants as collapsed variables possibly including neutral variants can cause substantial dilution of effects for the susceptibility loci and the magnitude of such dilution may not diminish with increasing sample size for naive collapsing methods. In the future, it will be of great importance to determine the sample sizes at which such inflection point would occur for different traits depending on the underlying genetic architecture. In this report, we use a flexible class of mixture-exponential models to specify effect-size distributions. One could specify effect-size distributions using alternative parametric models such as Weibull, Gamma or Beta distributions all of which can generate L-shaped distributions that appear to be natural for specification of effect-sizes of common SNPs. Although the performance of polygenic models could differ widely in principle under different effect-size distributions, additional analyses (data not shown) indicate that when such alternative models were restricted so that they can also explain discoveries and estimates of heritabilities reported from current GWAS, each produced results that are qualitatively similar to what we report using the mixture of exponential distributions. For future studies of rare variants, however, the range of plausible models for effect-size distributions is substantial and thus evaluating the likely performance of polygenic models based on such variants remains challenging (Supplementary Figure 3). In conclusion, we have used novel theory together with empirical observations from large GWAS to provide a comprehensive evaluation of future of polygenic risk models using common susceptibility SNPs. Although our analysis points toward the challenges for achieving high-discriminatory[34] power for polygenic risk models especially for common diseases, it is noteworthy that even models with modest discriminatory power can provide important stratification for absolute risk, thus providing a rationale for potential public health applications such as for weighing risks and benefits for a treatment or an intervention[34]. For most common disease, existing models based on established environmental risk factors, if any, also has modest discriminatory power and faces additional challenges for long-term risk prediction as risk-factor history, unlike susceptibility status, can change over lifetime of an individual. In the future, development of robust prediction models will need to integrate a spectrum of alleles, from rare to common variants, and other risk factors as well. The framework outlined in this paper could be used to identify challenges and opportunities for public health application as well as the required resources needed for development of such models.

Methods

Underlying polygenic model

We assume Y is the outcome variable of and X1,…,XM are a set of independent covariates that are potentially predictive of Y. Without loss of generality, we will assume all variables are standardized, so that E(Y) = 0 and Var(Y) = 1 and similarly E(X) = 0 and Var(X) = 1 for each m. We assume that the true relationship between outcome and the set of covariates can be described by the underlying model ( ) where M1 out the M covariates are truly predictive of Y. We also assume ε, the residual term, to be independently distributed of X=(X1,…,XM).

Measure of predictive performance of a model

Now suppose an “estimated” prediction model ( ) is built base on “training” dataset of sample size N to predict Y using the formula where γ is indicator of whether the variable is selected (γ = 1) or not (γ = 0) and β̂ is the estimate of β for selected variables. We will denote λ to be a generic threshold parameter for the underlying model selection algorithm. We will define the predictive correlation for the model to be where the subscript X and ε signify that the correlation coefficient is computed with respect to the distribution of X and ε in the the underlying population for which prediction is desired while the estimated model and its associated parameter estimates (β̂s and γs), are held fixed. The only source of variation of R( ) is due to the randomness of the original training dataset based on which is built. For any fixed N and λ, the expected value of R( ) can be approximated as (see Supplementary Note) where em(N, λ) = EN,λ(β̂m|γm = 1), pm(N, λ) = PrN,λ(γm = 1) and .

GWAS polygenic model building algorithm

Suppose in a GWAS study, independent SNPs are included in a prediction model depending on whether the corresponding marginal trend-test for association achieves a specified significance level α or not. Let Zm denote the association test-statistics for the m-th SNP and Cα/2 denote the critical level for a two-sided test at level α. For any SNP that achieves the required significance level, i.e. γ = 1, its corresponding coefficient in the prediction model could be taken as β̂, i.e. the estimated regression coefficient from the marginal analysis of the SNP. Based on general theory developed in Supplementary Note, we show that in the above setting the expected value of predictive correlation coefficient of the above polygenic model building algorithm over different GWAS datasets of sample size N can be written as where pow(N, β, α) denotes the power of the study of size N for detecting an effect-size of β at level α, e(β) = E(β̂||Z| > C/2), and . Based on the formula for e(β) and ν(β) given in Suppementary Note, it is easy to see that as N → ∞, e(β) → β and . Thus, it follows that as N → ∞, Since is the variance of the trait due to the total additive effects of all susceptibility SNPs, , where is the total heritability in narrow sense.

Evaluation of AUC statistics and other performance measures for binary disease outcomes

Previously, several reports[2,43,44] have established the relationship between measures of discrminatory ability of risk models and the genetic variance explained by the true underlying polygenic score associated with a set of SNPs. To generalize such results when the polygenic score associated with a set of SNPs may be estimated with error, we assume that the true relationship between the risk of a binary disease outcome D and a set of covariates X1,…,XM is given by an underlying logistic model of the form We assume that a risk-prediction model is built base on a training dataset of sample-size N using the formula where γ is indicator of whether the variable is selected (γ = 1) or not (γ = 0) and β̂ is the estimate of β for selected variables. Let be the estimated risk for a person with covariate profile X in the underlying logistic scale. Without loss of generality, we assume each covariate Xm has been standardized with respect to its mean and variance of disease free population so that E(X|D = 0) = 0 and Var(X|D = 0) = 1. In the Supplementary Note, we show that the distribution of Û in controls (D=0) and cases (D=1) for large M, M1, and N can be approximated by normal distributions as where and . It is noteworthy that while the characterization of the distributions of true risk (U) for cases and controls requires a single parameter, namely the variance of U[2,43,44], the characterizations for the corresponding distributions for estimated risk (Û) requires two parameters, namely the variance of Û and its covariance with the true risk U. Now, the area under the curve (AUC), i.e., the probability that value of risk-score will be greater for a randomly selected case than that of a randomly selected control, can be approximated as where R = C/S is the predictive correlation measure defined earlier for continous outcome. Similarly, using above results, other measures of discreminatory performance of models, such as proportion of cases followed (PCF)[2], can be also characterized in terms of R. In the Supplementary Note, we further show that the distribution of estimated risk Û for subjects conditional on both his/her own disease status D and that of a relative DR can be approximately characterized as: where k = 2− is the coefficient of relationship. Based on these distributions, we further derive discriminatory ability of risk models that include both polygenic risk scores and family-history.

Estimation of effect-size distribution

We extend our previous methods[14,15,45] to obtain realistic estimates of effect-size distribution for all underlying susceptibility SNPs for individual traits by combining information from both known discoveries from largest GWAS and estimates of that have recently become available for most of the traits we studied. The major steps are: (1) identify the largest GWAS, termed the “current study”, for each of the traits and list “observed susceptibility SNPs” that are discovered through these studies; (2) following the design of the discovery studies (Supplementary Table 2), compute the power to detect SNPs with given effect-sizes; (3) obtain an estimate effect size distribution by fitting parametric mixture-exponential distribution to observed susceptibility SNPs after accounting for statistical power for their discovery and (4) incorporate an additional mixture component to the effect-size distribution that can allow a larger number of SNPs with very small effects so that the overall distribution can explain both estimate of heritability due to common variants ( ) and the number of observed discoveries and genetic variances explained in current studies. Below we describe the details for each step. In step 1, for each trait, we identified the largest GWAS to date (Supplementary Table 2) and constructed a list of observed susceptibility SNPs that could be considered to have been “detected” from this study. All independent SNPs that reach genome-wide significance according to specified criteria for these studies are included in the list of known susceptibility SNPs. Some studies used multistage designs and did not follow-up previously established susceptibility SNPs beyond the first stage. We included such previously established SNPs in our list if they reached the required threshold for follow-up in the first stage of the current study, on the assumption that these SNPs would have reached genome-wide significance had they been followed up like all other SNPs meeting the same criterion. For each observed susceptibility SNP, we obtained the effect-size as es = ψ2 × 2f(1 − f) where ψ is linear or logistic regression coefficient depending on quantitative or qualitative traits and f is the allele frequency. In the GWAS context, a covariate X in a polygenic model is the number of risk alleles associated with a SNP and thus following the notation in the main text where a covariate X is assumed to be standardized, it follows that and es = β2. To minimize bias from the winner’s curse, we estimated effect-sizes by excluding discovery stage data whenever replication phase data were available. Otherwise, we corrected for possible bias using statistical techniques[46]. In step 2, we evaluated power for detection for each susceptibility SNP at their observed effect-sizes following the exact design of the original discovery studies (see Supplementary Table 2). In step 3, we obtained estimate of effect-size distribution by fitting a parametric model to the effect-sizes for observed susceptibility SNPs. In our previous work[14,15,45], we have described non-parametric methods for estimating effect-size distribution within the range of effect-sizes for observed susceptibility SNPs. In this report, we considered use of parametric models that can be used to describe distribution of effect-sizes beyond the range of known discoveries. Specifically, we used the class of mixture of exponential distributions that allows specification of effect-size distribution in a flexible, weakly parametric fashion. The model is very natural as it allows for increasingly large number of susceptibility SNPs with decreasingly smaller effects, a common pattern that is emerging from GWAS. Mathematically, we assumed that the distribution of effect-sizes for all underlying susceptibility SNPs are given by where =(p, …, p, λ, …, λ) with p being the mixture weight for the h-th component, h = 1, …, H and g(es|λ) is an exponential distribution with mean 1/λ. Noting that the set of K observed susceptibility SNPs can be viewed as a random sample from the set of all underlying susceptibility SNPs with probability of sampling for each SNP is proportional to its power for discovery, we constructed a likelihood as where pow(es|N, α) is the power to detect a SNP with effect size es in the current GWAS of size N at a significance level of α. We use Bayesian methods to estimate the parameters of the mixture model based on the above likelihood and non-informative priors for the parameter vectors p=(p, …, p) and λ =(λ, …, λ). Specifically, we assumed a discrete Dirichlet distribution for p that leads to uniform prior for each of the p, h = 1, .., H marginally. We assumed λ, h = 1, …, H to be independently distributed each following a Gamma distribution with shape and scale parameters a = 0.5 and b = 2× 104, respectively. Posterior means for all parameters were obtained based on MCMC algorithms. For each trait, among several fitted mixture models with varying H (up to 3), we selected the best mixture model on the basis of the DIC model selection criterion[47]. For all traits except PrCA and CAD, a two-component (H = 2) mixture model was found to be the best fitted distribution. For PrCA and CAD, a single exponential distribution (H = 1) was found to be adequate. In step 4, we incorporated an additional mixture component to the effect-size distribution estimated in step 3 so that the overall distribution can be used to describe the effect-sizes for all SNPs that contribute to narrow-sense heritability . We observed that if we had assumed that the parametric effect-size distribution estimated based on known loci can be extrapolated to describe the effect-sizes for all susceptibility loci explaining , then the expected number of discoveries and the corresponding heritabilities explained in the current GWAS will be substantially larger than those empirically observed in these studies (Supplementary Table 1). Thus it is very likely that the true effect-size distribution for all susceptibility SNPs contributing to narrow-sense heritability is more skewed toward smaller effects. To obtain a properly calibrated effect-size distribution for all susceptibility SNPs, we thus added an additional mixture component to the fitted effect-size distribution that we estimated based on known loci. We assumed where the summation in the right side corresponds to the fitted mixture model based on known loci. For any given value of , we found the value of parameters p and λ for the additional component by equating the expected and observed number of discoveries and the corresponding heritability explained in the current largest GWAS by solving the equations and where α is the genome-wide significance level used for discovery and M1 is defined by We solved for p and λ by performing a grid-search within the ranges 0.01≤ p≤0.99 and λ̂ ≤ λ+1 ≤ 20 × λ̂ where the latter constraint is imposed to allow the mean of the new component to be smaller than that of the smallest component of the fitted distribution by a factor of up to 20-fold. For traits for which estimates of and associated confidence intervals were available, values of were chosen to be at their point estimates (Tables 2 and 3) or varied within the range of their CIs (Figures 2 and 3) and for each such value of a corresponding effect-size distribution was obtained by solving the above equations. For TC, LDL and CAD, for which direct estimates of were not available, we varied the value of to be within 20–80% of the range of total heritability of these traits that are available from family studies. For CAD, however, the range of for which solutions could be found for the equations (3) and (4) were severely restricted. In particular, it appears that the limited number of findings (21 SNPs) from the very large existing GWAS (N=75,000) of this trait automatically imposes major constraint on the upper bound of , at least for the class of effect-size distributions we considered. Characteristics of largest GWAS and associated discoveries are obtained from published reports[6-8,10,36-39]. For each trait, an effect-size sample size is calculated for a single-stage study that has equivalent power as the original study taking into accounting multi-stage genotyping and selective sampling by family history for PrCA. For HT, sample size and reported discoveries correspond to only first-stage of the GIANT study. The number of discoveries reported takes into account any genomic control adjustment used in the original study.

42 in total

Review 1. The future of genetic counselling: an international perspective.

Authors: B Bowles Biesecker; T M Marteau
Journal: Nat Genet Date: 1999-06 Impact factor: 38.330

Review 2. Genetic evaluation for coronary artery disease.

Authors: Maren T Scheuner
Journal: Genet Med Date: 2003 Jul-Aug Impact factor: 8.822

3. The mystery of missing heritability: Genetic interactions create phantom heritability.

Authors: Or Zuk; Eliana Hechter; Shamil R Sunyaev; Eric S Lander
Journal: Proc Natl Acad Sci U S A Date: 2012-01-05 Impact factor: 11.205

4. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants.

Authors: Ju-Hyun Park; Mitchell H Gail; Clarice R Weinberg; Raymond J Carroll; Charles C Chung; Zhaoming Wang; Stephen J Chanock; Joseph F Fraumeni; Nilanjan Chatterjee
Journal: Proc Natl Acad Sci U S A Date: 2011-10-14 Impact factor: 11.205

5. Prevalence of family history of breast, colorectal, prostate, and lung cancer in a population-based study.

Authors: P L Mai; L Wideroff; M H Greene; B I Graubard
Journal: Public Health Genomics Date: 2010-04-09 Impact factor: 2.000

6. Common SNPs explain a large proportion of the heritability for human height.

Authors: Jian Yang; Beben Benyamin; Brian P McEvoy; Scott Gordon; Anjali K Henders; Dale R Nyholt; Pamela A Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2010-06-20 Impact factor: 38.330

7. Potential usefulness of single nucleotide polymorphisms to identify persons at high cancer risk: an evaluation of seven common cancers.

Authors: Ju-Hyun Park; Mitchell H Gail; Mark H Greene; Nilanjan Chatterjee
Journal: J Clin Oncol Date: 2012-05-14 Impact factor: 44.544

Review 8. Genetic risk prediction in complex disease.

Authors: Luke Jostins; Jeffrey C Barrett
Journal: Hum Mol Genet Date: 2011-08-25 Impact factor: 6.150

9. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs.

Authors: S Hong Lee; Teresa R DeCandia; Stephan Ripke; Jian Yang; Patrick F Sullivan; Michael E Goddard; Matthew C Keller; Peter M Visscher; Naomi R Wray
Journal: Nat Genet Date: 2012-02-19 Impact factor: 38.330

10. Family history, diabetes, and other demographic and risk factors among participants of the National Health and Nutrition Examination Survey 1999-2002.

Authors: Ann M Annis; Mark S Caulder; Michelle L Cook; Debra Duquette
Journal: Prev Chronic Dis Date: 2005-03-15 Impact factor: 2.830

164 in total

1. Tests for Gene-Environment Interactions and Joint Effects With Exposure Misclassification.

Authors: Philip S Boonstra; Bhramar Mukherjee; Stephen B Gruber; Jaeil Ahn; Stephanie L Schmit; Nilanjan Chatterjee
Journal: Am J Epidemiol Date: 2016-01-10 Impact factor: 4.897

2. Screening Human Embryos for Polygenic Traits Has Limited Utility.

Authors: Ehud Karavani; Or Zuk; Danny Zeevi; Nir Barzilai; Nikos C Stefanis; Alex Hatzimanolis; Nikolaos Smyrnis; Dimitrios Avramopoulos; Leonid Kruglyak; Gil Atzmon; Max Lam; Todd Lencz; Shai Carmi
Journal: Cell Date: 2019-11-21 Impact factor: 41.582

3. Inference of the genetic architecture underlying BMI and height with the use of 20,240 sibling pairs.

Authors: Gibran Hemani; Jian Yang; Anna Vinkhuyzen; Joseph E Powell; Gonneke Willemsen; Jouke-Jan Hottenga; Abdel Abdellaoui; Massimo Mangino; Ana M Valdes; Sarah E Medland; Pamela A Madden; Andrew C Heath; Anjali K Henders; Dale R Nyholt; Eco J C de Geus; Patrik K E Magnusson; Erik Ingelsson; Grant W Montgomery; Timothy D Spector; Dorret I Boomsma; Nancy L Pedersen; Nicholas G Martin; Peter M Visscher
Journal: Am J Hum Genet Date: 2013-10-31 Impact factor: 11.025

4. Genetic risk prediction using a spatial autoregressive model with adaptive lasso.

Authors: Yalu Wen; Xiaoxi Shen; Qing Lu
Journal: Stat Med Date: 2018-05-31 Impact factor: 2.373

5. Genomic Risk Stratification Predicts All-Cause Mortality After Cardiac Catheterization.

Authors: Michael G Levin; Rachel L Kember; Renae Judy; David Birtwell; Heather Williams; Zolt Arany; Jay Giri; Marie Guerraty; Tom Cappola; Jinbo Chen; Daniel J Rader; Scott M Damrauer
Journal: Circ Genom Precis Med Date: 2018-11

6. The mathematical limits of genetic prediction for complex chronic disease.

Authors: Katherine M Keyes; George Davey Smith; Karestan C Koenen; Sandro Galea
Journal: J Epidemiol Community Health Date: 2015-02-03 Impact factor: 3.710

7. Leveraging family history in population-based case-control association studies.

Authors: Arpita Ghosh; Patricia Hartge; Peter Kraft; Amit D Joshi; Regina G Ziegler; Myrto Barrdahl; Stephen J Chanock; Sholom Wacholder; Nilanjan Chatterjee
Journal: Genet Epidemiol Date: 2014-01-09 Impact factor: 2.135

Review 8. The emerging molecular architecture of schizophrenia, polygenic risk scores and the clinical implications for GxE research.

Authors: Conrad Iyegbe; Desmond Campbell; Amy Butler; Olesya Ajnakina; Pak Sham
Journal: Soc Psychiatry Psychiatr Epidemiol Date: 2014-01-17 Impact factor: 4.328

9. Polygenic risk, stressful life events and depressive symptoms in older adults: a polygenic score analysis.

Authors: K L Musliner; F Seifuddin; J A Judy; M Pirooznia; F S Goes; P P Zandi
Journal: Psychol Med Date: 2014-12-09 Impact factor: 7.723

Review 10. Routes for breaching and protecting genetic privacy.

Authors: Yaniv Erlich; Arvind Narayanan
Journal: Nat Rev Genet Date: 2014-05-08 Impact factor: 53.242