Amit V Khera, Mark Chaffin, Krishna G Aragam, Mary E Haas, Carolina Roselli, Seung Hoan Choi, Pradeep Natarajan, Eric S Lander, Steven A Lubitz, Patrick T Ellinor, Sekar Kathiresan.
Abstract
A key public health need is to identify individuals at high risk for a given disease to enable enhanced screening or preventive therapies. Because most common diseases have a genetic component, one important approach is to stratify individuals based on inherited DNA variation.[1] Proposed clinical applications have largely focused on finding carriers of rare monogenic mutations at several-fold increased risk. Although most disease risk is polygenic in nature,[2-5] it has not yet been possible to use polygenic predictors to identify individuals at risk comparable to monogenic mutations. Here, we develop and validate genome-wide polygenic scores for five common diseases. The approach identifies 8.0, 6.1, 3.5, 3.2, and 1.5% of the population at greater than threefold increased risk for coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer, respectively. For coronary artery disease, this prevalence is 20-fold higher than the carrier frequency of rare monogenic mutations conferring comparable risk.[6] We propose that it is time to contemplate the inclusion of polygenic risk prediction in clinical care, and discuss relevant issues.
Year: 2018 PMID: 30104762 PMCID: PMC6128408 DOI: 10.1038/s41588-018-0183-z
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
GPS derivation and testing for five common, complex diseases
| Disease | Discovery GWAS | Prevalence in validation dataset | Prevalence in testing dataset | Polymorphisms in GPS | Tuning parameter | AUC (95% CI) in validation dataset | AUC (95% CI) in testing dataset |
|---|---|---|---|---|---|---|---|
| CAD | 60,801 cases; 123,504 controls | 3,963/120,280 (3.4%) | 8,676/288,978 (3.0%) | 6,630,150 | LDPred ( | 0.81 (0.80–0.81) | 0.81 (0.81–0.81) |
| Atrial fibrillation | 17,931 cases; 115,142 controls | 2,024/120,280 (1.7%) | 4,576/288,978 (1.6%) | 6,730,541 | LDPred ( | 0.77 (0.76–0.78) | 0.77 (0.76–0.77) |
| Type 2 diabetes | 26,676 cases; 132,532 controls | 2,785/120,280 (2.4%) | 5,853/288,978 (2.0%) | 6,917,436 | LDPred ( | 0.72 (0.72–0.73) | 0.73 (0.72–0.73) |
| Inflammatory bowel disease | 12,882 cases; 21,770 controls | 1,360/120,280 (1.1%) | 3,102/288,978 (1.1%) | 6,907,112 | LDPred ( | 0.63 (0.62–0.65) | 0.63 (0.62–0.64) |
| Breast cancer | 122,977 cases; 105,974 controls | 2,576/63,347 (4.1%) | 6,586/157,895 (4.2%) | 5,218 | Pruning and thresholding ( | 0.68 (0.67–0.69) | 0.69 (0.68–0.69) |
AUC was determined using a logistic regression model adjusted for age, sex, genotyping array, and the first four principal components of ancestry. The breast cancer analysis was restricted to female participants. For the LDPred algorithm, the tuning parameter ρ reflects the proportion of polymorphisms assumed to be causal for the disease. For the pruning and thresholding strategy, r² reflects the degree of independence from other variants in the linkage disequilibrium reference panel, and P reflects the P value noted for a given variant in the discovery GWAS. CI, confidence interval.
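As a rough illustration of what an AUC measures, the sketch below computes it for a single score as the probability that a randomly chosen case outranks a randomly chosen control, with ties counted as one half. This is a simplification: the AUCs in the table come from a covariate-adjusted logistic regression model, whereas this example uses one raw score and a handful of made-up values. The function name `auc` and the toy data are assumptions for illustration only.

```python
def auc(scores, labels):
    """Rank-based AUC: P(case score > control score), ties counted as 0.5.

    Uses a simple O(n_cases * n_controls) comparison, which is fine for a
    small illustration but not for biobank-scale data.
    """
    cases = [s for s, y in zip(scores, labels) if y]
    controls = [s for s, y in zip(scores, labels) if not y]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Synthetic example (not data from the study): 2 cases, 3 controls.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, and 1.0 to perfect separation.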
Figure 1. Study design and workflow
A genome-wide polygenic score (GPS) for each disease was derived by combining summary association statistics from a recent large GWAS with a linkage disequilibrium reference panel of 503 Europeans.[34] Thirty-one candidate GPSs were derived using two strategies: (1) ‘pruning and thresholding’, the aggregation of independent polymorphisms that exceed a specified level of significance in the discovery GWAS; and (2) the LDPred computational algorithm,[13] a Bayesian approach that calculates a posterior mean effect for all variants based on a prior (the effect size in the prior GWAS) and subsequent shrinkage based on linkage disequilibrium. The seven candidate LDPred scores vary with respect to the tuning parameter ρ, the proportion of variants assumed to be causal, as previously recommended.[13] The optimal GPS for each disease was chosen based on the area under the receiver-operator curve (AUC) in the UK Biobank Phase I validation dataset (N = 120,280 Europeans) and subsequently calculated in an independent UK Biobank Phase II testing dataset (N = 288,978 Europeans).
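The thresholding half of the first strategy, and the scoring step common to both, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes linkage-disequilibrium pruning has already been done upstream, treats a missing genotype as zero copies, and uses invented variant identifiers, weights, and P values. The names `Variant`, `threshold_variants`, and `polygenic_score` are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    variant_id: str  # hypothetical identifier
    weight: float    # per-allele effect estimate from the discovery GWAS
    p_value: float   # association P value from the discovery GWAS


def threshold_variants(variants, p_threshold):
    """Keep variants whose discovery P value is below the threshold.

    Assumption: the input list has already been LD-pruned, so only the
    significance filter is applied here.
    """
    return [v for v in variants if v.p_value < p_threshold]


def polygenic_score(dosages, variants):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies).

    Assumption: a variant missing from `dosages` contributes 0.
    """
    return sum(v.weight * dosages.get(v.variant_id, 0) for v in variants)


# Invented values for illustration only.
variants = [
    Variant("rsA", 0.10, 1e-9),
    Variant("rsB", 0.05, 1e-3),
    Variant("rsC", -0.02, 0.20),
]
person = {"rsA": 2, "rsB": 1, "rsC": 1}

selected = threshold_variants(variants, p_threshold=0.05)
score = polygenic_score(person, selected)
```

With LDPred, the same weighted sum would be taken over all variants, using shrunken posterior mean effects as the weights instead of the raw GWAS estimates.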
Figure 2. Risk for coronary artery disease according to genome-wide polygenic score.
(a) Distribution of the genome-wide polygenic score for CAD (GPS_CAD) in the UK Biobank testing dataset (N = 288,978). The x-axis represents GPS_CAD, with values scaled to a mean of 0 and a standard deviation of 1 to facilitate interpretation. Shading reflects the proportion of the population with three-, four- and fivefold increased risk versus the remainder of the population. Odds ratios were assessed in a logistic regression model adjusted for age, sex, genotyping array, and the first four principal components of ancestry; (b) GPS_CAD percentile among CAD cases versus controls in the UK Biobank validation cohort. Within each boxplot, the horizontal line reflects the median, the top and bottom of the box reflect the interquartile range, and the whiskers reflect the maximum and minimum values within each grouping; (c) prevalence of CAD according to 100 groups of the validation cohort binned according to percentile of the GPS_CAD.
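The scaling in panel (a) and the percentile binning in panel (c) can be illustrated with a short sketch. It assumes a plain list of scores and case flags, breaks ties by input order, and uses the population standard deviation; the function names and the four-person example are hypothetical and not taken from the study.

```python
import statistics


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]


def percentile_bins(scores, n_bins=100):
    """Assign each score to a bin 1..n_bins by rank (ties: input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(scores) + 1
    return bins


def prevalence_by_bin(bins, is_case, n_bins=100):
    """Fraction of cases within each non-empty bin."""
    totals = [0] * (n_bins + 1)
    cases = [0] * (n_bins + 1)
    for b, c in zip(bins, is_case):
        totals[b] += 1
        cases[b] += int(c)
    return {b: cases[b] / totals[b] for b in range(1, n_bins + 1) if totals[b]}


# Invented example with two bins instead of 100.
raw = [0.2, 1.5, -0.7, 0.9]
flags = [False, True, False, True]
z = standardize(raw)
bins = percentile_bins(z, n_bins=2)
prev = prevalence_by_bin(bins, flags, n_bins=2)
```

Because standardization preserves rank order, binning the raw or the standardized scores gives the same groups.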
Proportion of the population at three-, four- and fivefold increased risk for each of the five common diseases
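| High GPS definition | Disease | Individuals in testing dataset (n/N) | % of individuals |
|---|---|---|---|
| Threefold increased risk | CAD | 23,119/288,978 | 8.0 |
| Threefold increased risk | Atrial fibrillation | 17,627/288,978 | 6.1 |
| Threefold increased risk | Type 2 diabetes | 10,099/288,978 | 3.5 |
| Threefold increased risk | Inflammatory bowel disease | 9,209/288,978 | 3.2 |
| Threefold increased risk | Breast cancer | 2,369/157,895 | 1.5 |
| Threefold increased risk | Any of the five diseases | 57,115/288,978 | 19.8 |
| Fourfold increased risk | CAD | 6,631/288,978 | 2.3 |
| Fourfold increased risk | Atrial fibrillation | 4,335/288,978 | 1.5 |
| Fourfold increased risk | Type 2 diabetes | 578/288,978 | 0.2 |
| Fourfold increased risk | Inflammatory bowel disease | 2,297/288,978 | 0.8 |
| Fourfold increased risk | Breast cancer | 474/157,895 | 0.3 |
| Fourfold increased risk | Any of the five diseases | 14,029/288,978 | 4.9 |
| Fivefold increased risk | CAD | 1,443/288,978 | 0.5 |
| Fivefold increased risk | Atrial fibrillation | 2,020/288,978 | 0.7 |
| Fivefold increased risk | Type 2 diabetes | 144/288,978 | 0.05 |
| Fivefold increased risk | Inflammatory bowel disease | 571/288,978 | 0.2 |
| Fivefold increased risk | Breast cancer | 158/157,895 | 0.1 |
| Fivefold increased risk | Any of the five diseases | 4,305/288,978 | 1.5 |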
For each disease, progressively more extreme tails of the GPS distribution were compared with the remainder of the population in a logistic regression model with disease status as the outcome, and age, sex, the first four principal components of ancestry, and genotyping array as predictors. The breast cancer analysis was restricted to female participants.
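As a quick arithmetic check, the percentages in the first six rows of the table (the threefold threshold, which match the figures quoted in the abstract) can be recomputed from the listed counts. The counts and reported percentages below are copied from the table; the variable and function names are illustrative.

```python
# (count with high GPS, individuals in testing dataset, reported %)
threefold = {
    "CAD": (23_119, 288_978, 8.0),
    "Atrial fibrillation": (17_627, 288_978, 6.1),
    "Type 2 diabetes": (10_099, 288_978, 3.5),
    "Inflammatory bowel disease": (9_209, 288_978, 3.2),
    "Breast cancer": (2_369, 157_895, 1.5),
    "Any of the five diseases": (57_115, 288_978, 19.8),
}


def percent(n, total):
    """Share of `total` represented by `n`, in percent."""
    return 100 * n / total


# Recompute each percentage to one decimal place and compare with the table.
recomputed = {d: round(percent(n, t), 1) for d, (n, t, _) in threefold.items()}
reported = {d: pct for d, (_, _, pct) in threefold.items()}
mismatches = [d for d in threefold if recomputed[d] != reported[d]]
```

The breast cancer denominator is smaller because that analysis was restricted to female participants.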
Figure 3. Risk gradient for disease according to genome-wide polygenic score percentile
The validation cohort was divided into 100 groups according to percentile of the disease-specific GPS. Prevalence of disease is displayed according to GPS percentile for (a) atrial fibrillation, (b) type 2 diabetes, (c) inflammatory bowel disease, and (d) breast cancer.
Prevalence and clinical impact of a high GPS
| Disease | High GPS definition | Reference group | Odds ratio | 95% CI | P value |
|---|---|---|---|---|---|
| CAD | Top 20% of distribution | Remaining 80% | 2.55 | 2.43–2.67 | <1 × 10^-300 |
| CAD | Top 10% of distribution | Remaining 90% | 2.89 | 2.74–3.05 | <1 × 10^-300 |
| CAD | Top 5% of distribution | Remaining 95% | 3.34 | 3.12–3.58 | 6.5 × 10^-264 |
| CAD | Top 1% of distribution | Remaining 99% | 4.83 | 4.25–5.46 | 1.0 × 10^-132 |
| CAD | Top 0.5% of distribution | Remaining 99.5% | 5.17 | 4.34–6.12 | 7.9 × 10^-78 |
| Atrial fibrillation | Top 20% of distribution | Remaining 80% | 2.43 | 2.29–2.59 | 2.1 × 10^-177 |
| Atrial fibrillation | Top 10% of distribution | Remaining 90% | 2.74 | 2.55–2.94 | 7.0 × 10^-169 |
| Atrial fibrillation | Top 5% of distribution | Remaining 95% | 3.22 | 2.95–3.51 | 1.1 × 10^-152 |
| Atrial fibrillation | Top 1% of distribution | Remaining 99% | 4.63 | 3.96–5.39 | 2.9 × 10^-84 |
| Atrial fibrillation | Top 0.5% of distribution | Remaining 99.5% | 5.23 | 4.24–6.39 | 3.5 × 10^-56 |
| Type 2 diabetes | Top 20% of distribution | Remaining 80% | 2.33 | 2.20–2.46 | 3.1 × 10^-201 |
| Type 2 diabetes | Top 10% of distribution | Remaining 90% | 2.49 | 2.34–2.66 | 1.2 × 10^-167 |
| Type 2 diabetes | Top 5% of distribution | Remaining 95% | 2.75 | 2.53–2.98 | 1.7 × 10^-130 |
| Type 2 diabetes | Top 1% of distribution | Remaining 99% | 3.30 | 2.81–3.85 | 1.4 × 10^-49 |
| Type 2 diabetes | Top 0.5% of distribution | Remaining 99.5% | 3.48 | 2.79–4.29 | 4.3 × 10^-30 |
| Inflammatory bowel disease | Top 20% of distribution | Remaining 80% | 2.19 | 2.03–2.36 | 7.7 × 10^-95 |
| Inflammatory bowel disease | Top 10% of distribution | Remaining 90% | 2.43 | 2.22–2.65 | 8.8 × 10^-88 |
| Inflammatory bowel disease | Top 5% of distribution | Remaining 95% | 2.66 | 2.38–2.96 | 3.0 × 10^-68 |
| Inflammatory bowel disease | Top 1% of distribution | Remaining 99% | 3.87 | 3.18–4.66 | 1.4 × 10^-43 |
| Inflammatory bowel disease | Top 0.5% of distribution | Remaining 99.5% | 4.81 | 3.74–6.08 | 9.0 × 10^-37 |
| Breast cancer | Top 20% of distribution | Remaining 80% | 2.07 | 1.97–2.19 | 3.4 × 10^-159 |
| Breast cancer | Top 10% of distribution | Remaining 90% | 2.32 | 2.18–2.48 | 2.3 × 10^-148 |
| Breast cancer | Top 5% of distribution | Remaining 95% | 2.55 | 2.35–2.76 | 2.1 × 10^-112 |
| Breast cancer | Top 1% of distribution | Remaining 99% | 3.36 | 2.88–3.91 | 1.3 × 10^-54 |
| Breast cancer | Top 0.5% of distribution | Remaining 99.5% | 3.83 | 3.11–4.68 | 8.2 × 10^-38 |
Odds ratios were calculated by comparing those with high GPS with the remainder of the population in a logistic regression model adjusted for age, sex, genotyping array, and the first four principal components of ancestry. The breast cancer analysis was restricted to female participants. CI, confidence interval.
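To make the comparison of a high-GPS group with the remainder concrete, the sketch below computes a crude odds ratio and a Wald 95% confidence interval from a 2×2 table. It is deliberately simpler than the analysis described above: it is unadjusted, whereas the reported odds ratios come from a logistic regression with covariates, so it would not reproduce the tabulated values. The counts are synthetic and the function name `odds_ratio_wald` is hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    a: cases in the high-GPS group      b: non-cases in the high-GPS group
    c: cases in the remainder           d: non-cases in the remainder
    z: normal quantile (1.96 for an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Synthetic counts for illustration only (not from the study):
# 1,000 people in the top decile, 9,000 in the remainder.
or_est, ci_low, ci_high = odds_ratio_wald(a=100, b=900, c=400, d=8600)
```

An interval that excludes 1 indicates that the difference in odds between the two groups is unlikely to be due to chance alone at the chosen level.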