Louis Lello, Timothy G Raben, Soke Yuen Yong, Laurent C A M Tellier, Stephen D H Hsu.
Abstract
We construct risk predictors using polygenic scores (PGS) computed from common Single Nucleotide Polymorphisms (SNPs) for a number of complex disease conditions, using L1-penalized regression (also known as LASSO) on case-control data from UK Biobank. Among the disease conditions studied are Hypothyroidism, (Resistant) Hypertension, Type 1 and 2 Diabetes, Breast Cancer, Prostate Cancer, Testicular Cancer, Gallstones, Glaucoma, Gout, Atrial Fibrillation, High Cholesterol, Asthma, Basal Cell Carcinoma, Malignant Melanoma, and Heart Attack. We obtain values for the area under the receiver operating characteristic curve (AUC) in the range ~0.58-0.71 using SNP data alone. Substantially higher predictor AUCs are obtained when incorporating additional variables such as age and sex. Some SNP predictors alone are sufficient to identify outliers (e.g., in the 99th percentile of PGS) with 3-8 times higher risk than typical individuals. We validate predictors out-of-sample using the eMERGE dataset, and also with different ancestry subgroups within the UK Biobank population. Our results indicate that substantial improvements in predictive power are attainable using training sets with larger case populations. We anticipate rapid improvement in genomic prediction as more case-control data become available for analysis.
Year: 2019 PMID: 31653892 PMCID: PMC6814833 DOI: 10.1038/s41598-019-51258-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Table of genetic AUCs using SNPs only - no age or sex.
| Condition | Training Set | Test Set | AUC | Active SNPs | λ* |
|---|---|---|---|---|---|
| Hypothyroidism | impute | UKBB | 0.705 (0.009) | 3704 (41) | 1.406e-06 (1.33e-7) |
| Hypothyroidism | impute | eMERGE | 0.630 (0.006) | ||
| Type 2 Diabetes | impute | UKBB | 0.640 (0.015) | 4168 (61) | 6.93e-06 (1.73e-6) |
| Type 2 Diabetes | impute | eMERGE | 0.633 (0.006) | ||
| Hypertension | impute | UKBB | 0.667 (0.012) | 9674 (55) | 4.46e-6 (4.86e-7) |
| Hypertension | impute | eMERGE | 0.651 (0.007) | ||
| Resistant Hypertension | impute | eMERGE | 0.6861 (0.001) | ||
| Asthma | calls | AA | 0.632 (0.006) | 3215 (16) | 2.37e-6 (0.35e-6) |
| Type 1 Diabetes | calls | AA | 0.647 (0.006) | 50 (7) | 7.9e-7 (0.1e-7) |
| Breast Cancer | calls | AA | 0.582 (0.006) | 480 (62) | 3.38e-6 (0.05e-6) |
| Prostate Cancer | calls | AA | 0.6399 (0.0077) | 448 (347) | 3.07e-6 (0.08e-8) |
| Testicular Cancer | calls | AA | 0.65 (0.02) | 19 (7) | 1.42e-6 (0.04e-6) |
| Glaucoma | calls | AA | 0.606 (0.006) | 610 (114) | 8.69e-7 (0.71e-7) |
| Gout | calls | AA | 0.682 (0.007) | 1010 (35) | 9.41e-7 (0.03e-7) |
| Atrial Fibrillation | calls | AA | 0.643 (0.006) | 181 (39) | 8.61e-7 (0.94e-7) |
| Gallstones | calls | AA | 0.625 (0.006) | 981 (163) | 1.01e-7 (0.02e-7) |
| Heart Attack | calls | AA | 0.591 (0.006) | 1364 (49) | 1.181e-6 (0.002e-7) |
| High Cholesterol | calls | AA | 0.628 (0.006) | 3543 (36) | 2.4e-6 (0.2e-6) |
| Malignant Melanoma | calls | AA | 0.580 (0.006) | 26 (15) | 9.5e-7 (0.8e-7) |
| Basal Cell Carcinoma | calls | AA | 0.631 (0.006) | 76 (22) | 9.9e-7 (0.3e-7) |
Training and validation are done using UKBB data, from either direct calls or imputed data (the latter to match eMERGE). Testing is done with UKBB, eMERGE, or AA as described in Sec. 2 and Supplementary Information Sec. D. Numbers in parentheses are the larger of a standard deviation from the central value or the numerical precision, as described in Sec. 2. λ* refers to the LASSO λ value used to compute AUC, as described in Sec. 2.
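The "Active SNPs" column counts the SNPs that keep a nonzero weight at the chosen penalty λ*. As a rough illustration of how an L1 penalty yields such sparse predictors, the sketch below fits a LASSO by cyclic coordinate descent on toy data. It is not the authors' pipeline: the objective scaling, the function names, and the toy matrix are assumptions made for illustration, and both features and phenotype are assumed to be centered so that no intercept is needed.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the dead zone."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows with p features each; y is a list of n phenotype values.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def count_active(beta):
    """Number of features with a nonzero weight (the 'active' set)."""
    return sum(1 for b in beta if b != 0.0)


def polygenic_score(genotype_row, beta):
    """Weighted sum of (centered) genotype values for one individual."""
    return sum(g * b for g, b in zip(genotype_row, beta))


# Toy, hypothetical data: two centered features, four individuals.
X_toy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_toy = [2.0, -2.0, 0.1, -0.1]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
```

Raising `lam` drives more weights exactly to zero, which is why the number of active SNPs depends on the λ value at which the predictor is evaluated.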
Figure 1. Top plots are histograms of controls (blue) and cases (gold). The bar heights are the averages over 5 AA testing runs, and the error bars are standard deviations. On the bottom, the same average case and control points are plotted on separate lines (1/0) for cases and controls. The height of the bars (gold and blue) represents the relative density of data points in that bin. Note that on the bottom, the gold and blue bars have been normalized using the same scale; the gold density looks small because most of the individuals in the data set are controls. The red dashed lines mark the 4th and 96th percentiles of the data, i.e., 92% of the data lies between those points. The x-axes are the same for top and bottom graphs: z-scores, or number of standard deviations from the control mean. Linear (yellow) and logistic (black) curves are plotted over this range. The difference between the linear and logistic curves is negligible in the region where the data are concentrated.
Comparison of the best known odds ratios in the literature (Literature) to the odds ratios calculated from the UK Biobank data presented here (New).
| Condition | PGS % | Odds Ratio (Literature) | Odds Ratio (New) | Odds Ratio (99% Predicted) |
|---|---|---|---|---|
| Asthma | >96% | — | ||
| Atrial Fibrillation | >90% | |||
| Basal Cell Carcinoma | >96% | — | ||
| Breast Cancer | >96% | |||
| Gallstones | >96% | — | ||
| Glaucoma | >96% | — | ||
| Gout | >90%/<10% | |||
| Heart Attack | >96% | — | ||
| High Cholesterol | >96% | — | ||
| Hypertension | >90% | |||
| Hypothyroidism | >96% | — | ||
| Malignant Melanoma | 1 | |||
| Prostate Cancer | >75%/<25% | |||
| Testicular Cancer | >96% | — | ||
| Type 1 Diabetes | >95% | 22.8* [ | ||
| Type 2 Diabetes | >90% | |||
Comparison was made either at the largest possible PGS common to the two sets, or using whatever definition of odds ratio was used in the literature (PGS %). Additionally, we indicate what we predict the odds ratio will be for those with scores at or above the 99th percentile (99% Predicted column). These predictions are found by assuming the data were drawn from Gaussian distributions. We confine our references to the literature to specifically genetic or polygenic risk score determinations of odds ratios. Other biological risk factors could, in the future, be combined with genetic risk to generate even better prediction. Further details about the literature are found in Section E. We focus here on purely genetic predictors. For many traits we were unaware of previous odds ratio estimates based on a purely polygenic score. For those we were aware of, we list the largest odds ratio in the table above. *These predictors include a regression on non-genetic biological information. †This article appeared on bioRxiv shortly before our manuscript and we were originally unaware of the results.
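To make the Gaussian assumption behind the "99% Predicted" column concrete, the sketch below computes an odds ratio for individuals above a given PGS percentile when case and control scores are each modeled as a normal distribution. The exact definition used here (odds of being a case above the threshold, divided by the odds in the whole sample), the function name, and the example inputs are assumptions for illustration; the source does not spell out its formula in this section.

```python
from statistics import NormalDist


def predicted_odds_ratio(mu_case, sd_case, mu_ctrl, sd_ctrl, prevalence, percentile=0.99):
    """Odds ratio above a PGS percentile under a two-Gaussian model (illustrative)."""
    case = NormalDist(mu_case, sd_case)
    ctrl = NormalDist(mu_ctrl, sd_ctrl)

    def mixture_cdf(x):
        # Score distribution of the whole sample: a prevalence-weighted mixture.
        return prevalence * case.cdf(x) + (1.0 - prevalence) * ctrl.cdf(x)

    # Bisection for the score at the requested percentile of the mixture.
    spread = 10.0 * max(sd_case, sd_ctrl)
    lo = min(mu_case, mu_ctrl) - spread
    hi = max(mu_case, mu_ctrl) + spread
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid) < percentile:
            lo = mid
        else:
            hi = mid
    threshold = 0.5 * (lo + hi)

    frac_case_above = 1.0 - case.cdf(threshold)
    frac_ctrl_above = 1.0 - ctrl.cdf(threshold)
    odds_above = (prevalence * frac_case_above) / ((1.0 - prevalence) * frac_ctrl_above)
    odds_population = prevalence / (1.0 - prevalence)
    return odds_above / odds_population


# Hypothetical inputs in z-score units: cases shifted up by 0.5 SD, 10% prevalence.
example_or = predicted_odds_ratio(0.5, 1.0, 0.0, 1.0, prevalence=0.10)
```

Under this definition the prevalence cancels out of the final ratio and only affects where the percentile threshold falls, so the result is driven mainly by how far apart the case and control distributions sit.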
Figure 2. Probability of developing breast cancer or hypothyroidism given a specific polygenic score, shown in SD units and percentile. The lifetime population prevalence of both breast cancer and hypothyroidism is set to 12%. Deviation from the red line, particularly at large and small PGS percentile, is likely an artifact of low statistics in these regions.
Figure 3. The receiver operating characteristic curve for case-control data on Hypothyroidism. This example includes sex and age as covariates.
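The AUC values quoted throughout summarize a receiver operating characteristic curve as a single number: the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half. The following minimal sketch computes it directly from that definition; the function name and toy scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc_from_scores(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case scores higher.

    Ties contribute one half. The cost is O(n_cases * n_controls).
    """
    wins = 0.0
    for case_score in case_scores:
        for control_score in control_scores:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy, hypothetical scores: 11 of the 12 case-control pairs are ordered correctly.
toy_cases = [0.9, 0.4, 0.7]
toy_controls = [0.1, 0.5, 0.3, 0.2]
toy_auc = auc_from_scores(toy_cases, toy_controls)
```

An AUC of 0.5 corresponds to a score that carries no information about case status, and 1.0 to perfect separation of cases from controls.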
Figure 4. AUC computed on 5 holdback sets (1,000 each of cases and controls) for Hypertension, as a function of λ. (A) UK Biobank and (B) eMERGE.
Figure 5. Distribution of PGS for cases and controls for Hypertension in the eMERGE dataset, using SNPs alone and including sex and age as regressors.
Figure 6. Distribution of PGS for cases and controls for Resistant Hypertension in the eMERGE dataset, using SNPs alone and including sex and age as regressors.
Figure 7. Distribution of PGS for cases and controls for Hypothyroidism in the eMERGE dataset, using SNPs alone and including sex and age as regressors.
Figure 8. Distribution of PGS for cases and controls for Type 2 Diabetes in the eMERGE dataset, using SNPs alone and including sex and age as regressors.
AUCs obtained using sex and age alone, SNPs alone, and all three together.
| Condition | Test set | Age + Sex | Genetic Only | Age + Sex + genetic |
|---|---|---|---|---|
| Hypertension | UKBB | 0.638 (0.018) | 0.667 (0.012) | 0.717 (0.007) |
| Hypothyroidism | UKBB | 0.695 (0.007) | 0.705 (0.009) | 0.783 (0.008) |
| Type 2 Diabetes | UKBB | 0.672 (0.009) | 0.640 (0.015) | 0.651 (0.013) |
| Hypertension | eMERGE | 0.818 (0.008) | 0.651 (0.007) | 0.851 (0.009) |
| Resistant Hypertension | eMERGE | 0.817 (0.008) | 0.686 (0.007) | 0.864 (0.009) |
| Hypothyroidism | eMERGE | 0.643 (0.006) | 0.630 (0.006) | 0.697 (0.007) |
| Type 2 Diabetes | eMERGE | 0.565 (0.006) | 0.633 (0.006) | 0.651 (0.007) |
AUCs with and without SNPs from the sex chromosomes.
| Condition | With Sex Chr | No Sex Chr |
|---|---|---|
| Hypothyroidism | 0.6302 (0.0012) | 0.6300 (0.0012) |
| Type 2 Diabetes | 0.6377 (0.0018) | 0.6327 (0.0018) |
| Hypertension | 0.6499 (0.0008) | 0.6510 (0.0008) |
| Resistant Hypertension | 0.6845 (0.001) | 0.6861 (0.001) |
All tested on eMERGE using SNPs as the only covariate.
Mean and standard deviation for PGS distributions for cases and controls, using predictors built from SNPs only and trained on case-control status alone.
| Statistic | Hypothyroidism | Type 2 Diabetes | Hypertension | Res HT |
|---|---|---|---|---|
| Mean (cases) | 0.0093 | 0.0271 | 0.0240 | 0.0392 |
| Mean (controls) | −0.0038 | −0.0141 | −0.0470 | −0.0448 |
| SD (cases) | 0.0284 | 0.0901 | 0.1343 | 0.1270 |
| SD (controls) | 0.0276 | 0.0866 | 0.1281 | 0.1219 |
| N cases/controls | 1,084/3,171 | 1,921/4,369 | 2,035/1,202 | 1,358/1,202 |
| AUC (predicted) | 0.630 (0.006) | 0.629 (0.006) | 0.649 (0.006) | 0.683 (0.007) |
| AUC (actual) | 0.630 (0.006) | 0.633 (0.006) | 0.651 (0.007) | 0.686 (0.006) |
Predicted AUC from assumption of displaced normal distributions and actual AUC are also given.
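For two normal score distributions, the AUC has a closed form: Φ((μ_case − μ_control) / √(σ_case² + σ_control²)), where Φ is the standard normal CDF. The sketch below applies this standard result to the table above, reading the first four rows of each column as case mean, control mean, case SD, and control SD. That row reading is an assumption, as is the function name, and the source does not confirm that this is the exact expression the authors used.

```python
from math import sqrt
from statistics import NormalDist


def predicted_auc(mu_case, mu_ctrl, sd_case, sd_ctrl):
    """AUC for two normal score distributions: P(case score > control score)."""
    separation = (mu_case - mu_ctrl) / sqrt(sd_case ** 2 + sd_ctrl ** 2)
    return NormalDist().cdf(separation)


# Inputs taken from the Hypothyroidism and Type 2 Diabetes columns of the table,
# under the assumed row order (case mean, control mean, case SD, control SD).
auc_hypothyroidism = predicted_auc(0.0093, -0.0038, 0.0284, 0.0276)
auc_type2_diabetes = predicted_auc(0.0271, -0.0141, 0.0901, 0.0866)
```

Both results land close to the first of the two AUC rows in the table (0.630 and 0.629), which suggests that row is the Gaussian prediction and the second row the directly measured AUC.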
Mean and standard deviation for PGS distributions of cases and controls, using predictors built from SNPs, sex, and age, and trained on case-control status alone.
| Statistic | Hypothyroidism | Type 2 Diabetes | Hypertension | Res HT |
|---|---|---|---|---|
| Mean (cases) | 0.1516 | 0.1431 | 0.7377 | 0.7525 |
| Mean (controls) | 0.1185 | 0.0924 | 0.4375 | 0.4366 |
| SD (cases) | 0.0437 | 0.0948 | 0.1829 | 0.1830 |
| SD (controls) | 0.0474 | 0.0943 | 0.2250 | 0.2258 |
| N cases/controls | 1,035/3,047 | 1,921/4,369 | 2,000/1,196 | 1,331/1,196 |
| AUC (predicted) | 0.696 (0.007) | 0.648 (0.006) | 0.850 (0.009) | 0.862 (0.009) |
| AUC (actual) | 0.697 (0.007) | 0.651 (0.007) | 0.852 (0.009) | 0.864 (0.009) |
Predicted AUC from assumption of displaced normal distributions and actual AUC are also given.
Figure 9. Odds ratio between upper percentile in PGS and total population prevalence in eMERGE for Hypothyroidism with and without using age and sex as covariates.
Figure 10. Odds ratio between upper percentile in PGS and total population prevalence in eMERGE for Hypertension with and without using age and sex as covariates.
Figure 11. Odds ratio between upper percentile in PGS and total population prevalence in eMERGE for Resistant Hypertension with and without using age and sex as covariates.
Figure 12. Odds ratio between upper percentile in PGS and total population prevalence in eMERGE for Type 2 Diabetes with and without using age and sex as covariates.
Table of prediction results using three types of regressands.
| Condition | CC Status | Mod 1 | Mod 2 |
|---|---|---|---|
| Hypothyroidism | |||
| SNPs alone | 0.6300 (0.0012) | 0.6046 (0.0025) | 0.6177 (0.0042) |
| Age/Sex Alone | 0.6430 | ||
| With Age/Sex | 0.6966 (0.0009) | 0.6489 (0.0173) | 0.6884 (0.0021) |
| Type 2 Diabetes | |||
| SNPs alone | 0.6327 (0.0018) | 0.6378 (0.0018) | 0.6327 (0.0018) |
| Age/Sex Alone | 0.5654 | ||
| With Age/Sex | 0.651 (0.0014) | 0.6283 (0.0039) | 0.651 (0.0014) |
| Hypertension | |||
| SNPs alone | 0.651 (0.0008) | 0.6495 (0.0004) | 0.6497 (0.0005) |
| Age/Sex Alone | 0.8180 | ||
| With Age/Sex | 0.8518 (0.0003) | 0.8519 (0.0003) | 0.8516 (0.0001) |
All results are evaluated on eMERGE, using SNPs, age, and sex alone or in combination.
Figure 13. Maximum AUC on the out-of-sample testing set (eMERGE) as a function of the number of cases (in thousands) included in training. Shown for Type 2 Diabetes, Hypothyroidism, and Hypertension.