Michael Elgart, Genevieve Lyons, Santiago Romero-Brufau, Nuzulul Kurniansyah, Jennifer A Brody, Xiuqing Guo, Henry J Lin, Laura Raffield, Yan Gao, Han Chen, Paul de Vries, Donald M Lloyd-Jones, Leslie A Lange, Gina M Peloso, Myriam Fornage, Jerome I Rotter, Stephen S Rich, Alanna C Morrison, Bruce M Psaty, Daniel Levy, Susan Redline, Tamar Sofer.
Abstract
Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Combining a PRS as a feature in an XGBoost model results in a relative increase in the percentage variance explained compared to the standard linear PRS model by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
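The relative increases quoted in the abstract compare the percentage variance explained (PVE) by two models. The arithmetic can be sketched as follows; the function name and the example PVE values are hypothetical and are not results from the paper.

```python
def relative_increase(pve_baseline: float, pve_new: float) -> float:
    """Relative change of pve_new over pve_baseline, returned as a fraction."""
    if pve_baseline <= 0:
        raise ValueError("baseline PVE must be positive")
    return (pve_new - pve_baseline) / pve_baseline


# Hypothetical values for illustration only (not taken from the paper):
# a linear PRS explaining 2% of the variance and a combined model explaining 4%.
gain = relative_increase(0.02, 0.04)
print(f"{gain:.0%}")  # doubling the PVE corresponds to a 100% relative increase
```

A relative increase is therefore sensitive to the baseline: the same absolute gain in PVE gives a larger percentage when the linear PRS explains little variance to begin with.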
Year: 2022 PMID: 35995843 PMCID: PMC9395509 DOI: 10.1038/s42003-022-03812-z
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Summary statistics of phenotypes used in the training dataset.
| Characteristic | Black | Hispanic/Latino | White | Overall |
|---|---|---|---|---|
| Sex | ||||
| Male | 3066 (40.3%) | 3088 (42.2%) | 6432 (45.5%) | 12586 (43.3%) |
| Female | 4535 (59.7%) | 4232 (57.8%) | 7710 (54.5%) | 16477 (56.7%) |
| Age | ||||
| Mean (SD) | 50.6 (16.9) | 48.2 (14.3) | 50.2 (16.4) | 49.8 (16.1) |
| Median [Min, Max] | 52.0 [2.00, 93.0] | 49.0 [5.00, 86.0] | 51.0 [3.00, 98.0] | 51.0 [2.00, 98.0] |
| Triglycerides | ||||
| Mean (SD) | 106 (69.1) | 135 (96.0) | 125 (82.2) | 124 (84.2) |
| Median [Min, Max] | 90.0 [16.0, 1930] | 113 [20.0, 1670] | 106 [17.0, 1600] | 103 [16.0, 1930] |
| Missing | 2598 (34.2%) | 1316 (18.0%) | 3073 (21.7%) | 6987 (24.0%) |
| Total cholesterol | ||||
| Mean (SD) | 198 (41.8) | 200 (43.2) | 205 (39.2) | 202 (41.0) |
| Median [Min, Max] | 196 [74.0, 450] | 197 [62.0, 526] | 202 [77.8, 594] | 199 [62.0, 594] |
| Missing | 2598 (34.2%) | 1316 (18.0%) | 3073 (21.7%) | 6987 (24.0%) |
| Systolic blood pressure | ||||
| Mean (SD) | 127 (20.9) | 121 (17.2) | 118 (17.1) | 121 (18.5) |
| Median [Min, Max] | 123 [73.0, 246] | 119 [77.0, 218] | 116 [67.0, 227] | 118 [67.0, 246] |
| Missing | 1944 (25.6%) | 1589 (21.7%) | 2972 (21.0%) | 6505 (22.4%) |
| Sleep duration | ||||
| Mean (SD) | 6.50 (1.51) | 7.73 (1.52) | 7.09 (1.16) | 7.15 (1.44) |
| Median [Min, Max] | 6.00 [1.00, 16.5] | 7.79 [2.00, 13.4] | 7.00 [1.00, 16.0] | 7.00 [1.00, 16.5] |
| Missing | 2352 (30.9%) | 411 (5.6%) | 4468 (31.6%) | 7231 (24.9%) |
| Height | ||||
| Mean (SD) | 168 (10.4) | 163 (9.24) | 168 (10.3) | 167 (10.3) |
| Median [Min, Max] | 168 [85.7, 207] | 162 [116, 194] | 168 [94.0, 203] | 166 [85.7, 207] |
| Diastolic blood pressure | ||||
| Mean (SD) | 109 (44.2) | 90.5 (36.7) | 88.4 (36.2) | 94.3 (39.5) |
| Median [Min, Max] | 85.5 [18.0, 267] | 76.0 [40.0, 256] | 74.7 [18.0, 246] | 77.0 [18.0, 267] |
| Missing | 236 (3.1%) | 9 (0.1%) | 308 (2.2%) | 553 (1.9%) |
| HDL cholesterol | ||||
| Mean (SD) | 52.4 (14.9) | 49.1 (13.3) | 52.1 (16.0) | 51.4 (15.1) |
| Median [Min, Max] | 50.0 [15.4, 162] | 47.0 [13.0, 141] | 50.0 [9.63, 143] | 49.0 [9.63, 162] |
| Missing | 328 (4.3%) | 7 (0.1%) | 710 (5.0%) | 1045 (3.6%) |
| LDL cholesterol | ||||
| Mean (SD) | 123 (38.1) | 122 (36.7) | 125 (36.1) | 124 (36.8) |
| Median [Min, Max] | 120 [11.6, 435] | 120 [23.8, 417] | 123 [13.8, 505] | 121 [11.6, 505] |
| Missing | 376 (4.9%) | 143 (2.0%) | 877 (6.2%) | 1396 (4.8%) |
| BMI | ||||
| Mean (SD) | 30.0 (7.19) | 30.1 (6.29) | 26.3 (4.99) | 28.2 (6.25) |
| Median [Min, Max] | 28.9 [12.7, 91.8] | 29.1 [14.9, 70.3] | 25.6 [11.6, 66.6] | 27.2 [11.6, 91.8] |
| Missing | 6 (0.1%) | 9 (0.1%) | 8 (0.1%) | 23 (0.1%) |
Mean, median, and percentage of missing data for the phenotypes and covariates (sex and age) used in this study. Most missing values for systolic blood pressure, total cholesterol, and triglycerides are due to medication use. All phenotypes are presented for the whole dataset as well as stratified by race/ethnicity (Black, White, and Hispanic/Latino). Summary statistics for the test dataset are provided in Supplementary Table 3.
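A table of this kind can be produced with a small per-group summary routine. The sketch below is illustrative only: the record layout, the key names, and the toy values are assumptions, and it uses the sample standard deviation, which may differ from the convention used for the table above.

```python
import statistics
from collections import defaultdict


def summarize(records, value_key, group_key):
    """Per-group mean, SD, median, min, max, and missingness for one variable."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec.get(value_key))
        groups["Overall"].append(rec.get(value_key))
    summary = {}
    for name, values in groups.items():
        observed = [v for v in values if v is not None]  # None marks a missing value
        n_missing = len(values) - len(observed)
        summary[name] = {
            "n": len(values),
            "mean": statistics.mean(observed),
            "sd": statistics.stdev(observed),  # sample SD (n - 1 denominator)
            "median": statistics.median(observed),
            "min": min(observed),
            "max": max(observed),
            "missing": n_missing,
            "missing_pct": 100 * n_missing / len(values),
        }
    return summary


# Toy records (hypothetical values, not study data).
records = [
    {"group": "A", "bmi": 25.0},
    {"group": "A", "bmi": 31.0},
    {"group": "A", "bmi": None},
    {"group": "B", "bmi": 22.0},
    {"group": "B", "bmi": 28.0},
]
table = summarize(records, "bmi", "group")
```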
Fig. 1 PRSice, LDpred2, and Lassosum2 linear PRS results.
The best-performing PRSice model (gray) compared with the best-performing LDpred2 (orange) and Lassosum2 (brown) models, with hyperparameters tuned on the training data.
Fig. 2 Flow chart of ensemble model structure.
The model relies on jointly training the LASSO and XGBoost models to identify the optimal values of the L1 regularization parameter and the number of boosting steps. CV indicates cross-validation, α is the regularization parameter, and Θ is the number of boosted trees for XGBoost. The optimal values of these hyperparameters were selected by threefold CV on the mean squared error of the XGBoost model.
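The joint search described in this caption can be sketched as a grid search in which feature selection and model fitting are both repeated inside each cross-validation fold, and each pair of hyperparameters is scored by the mean squared error of the downstream model. To keep the sketch dependency-free, it uses two plainly named stand-ins that are not the authors' methods: covariance thresholding in place of LASSO, and componentwise least-squares boosting in place of XGBoost trees. All function names and the toy data are hypothetical.

```python
import random
from itertools import product


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_by_covariance(X, y, alpha):
    """Stand-in for LASSO selection: keep columns with |cov(x_j, y)| > alpha."""
    n = len(y)
    y_bar = sum(y) / n
    keep = []
    for j in range(len(X[0])):
        x_bar = sum(row[j] for row in X) / n
        cov = sum((row[j] - x_bar) * (v - y_bar) for row, v in zip(X, y)) / n
        if abs(cov) > alpha:
            keep.append(j)
    return keep


def boost_fit_predict(X_tr, y_tr, X_te, n_steps, lr=0.3):
    """Stand-in for boosted trees: componentwise least-squares boosting."""
    means = [sum(row[j] for row in X_tr) / len(X_tr) for j in range(len(X_tr[0]))]
    X_tr = [[v - m for v, m in zip(row, means)] for row in X_tr]
    X_te = [[v - m for v, m in zip(row, means)] for row in X_te]
    base = sum(y_tr) / len(y_tr)
    resid = [v - base for v in y_tr]
    preds = [base] * len(X_te)
    for _ in range(n_steps):
        best = None  # (gain, column, coefficient)
        for j in range(len(means)):
            sxx = sum(row[j] ** 2 for row in X_tr)
            if sxx == 0:
                continue
            sxy = sum(row[j] * e for row, e in zip(X_tr, resid))
            if best is None or sxy * sxy / sxx > best[0]:
                best = (sxy * sxy / sxx, j, sxy / sxx)
        if best is None:  # no usable columns were selected
            break
        _, j, b = best
        resid = [e - lr * b * row[j] for row, e in zip(X_tr, resid)]
        preds = [p + lr * b * row[j] for row, p in zip(X_te, preds)]
    return preds


def joint_cv_search(X, y, alphas, thetas, k=3):
    """Grid-search (alpha, theta) jointly; score each pair by mean CV MSE."""
    folds = kfold_indices(len(y), k)
    results = {}
    for alpha, theta in product(alphas, thetas):
        fold_mse = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            X_tr = [X[i] for i in train_idx]
            y_tr = [y[i] for i in train_idx]
            cols = select_by_covariance(X_tr, y_tr, alpha)  # selection inside the fold
            preds = boost_fit_predict(
                [[row[j] for j in cols] for row in X_tr], y_tr,
                [[X[i][j] for j in cols] for i in test_idx], theta)
            sq_err = sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx))
            fold_mse.append(sq_err / len(test_idx))
        results[(alpha, theta)] = sum(fold_mse) / k
    best = min(results, key=results.get)
    return best, results


# Toy data (hypothetical): 60 individuals, 5 SNP dosages, one interaction term.
rng = random.Random(1)
X = [[rng.choice([0, 1, 2]) for _ in range(5)] for _ in range(60)]
y = [row[0] + 0.5 * row[1] * row[2] + rng.gauss(0, 0.1) for row in X]
best, results = joint_cv_search(X, y, alphas=[0.0, 0.2], thetas=[1, 20])
```

Running the selection step inside each training fold, rather than once on the full data, keeps the held-out fold from influencing which features are chosen.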
Number of SNPs selected through cross-validation for the PRS and XGBoost models.
| Phenotype | XGBoost alone | LASSO | XGBoost with PRS | PRS | Lassosum2 |
|---|---|---|---|---|---|
| Sleep duration | 35 | 35 | 36 | 1M | 140,507 |
| Diastolic blood pressure | 297 | 297 | 298 | 1M | 197,039 |
| Systolic blood pressure | 38 | 38 | 27 | 1M | 249,700 |
| Triglycerides | 186 | 186 | 109 | 6799 | 8035 |
| LDL cholesterol | 727 | 727 | 429 | 5825 | 2056 |
| HDL cholesterol | 6746 | 6746 | 168 | 7694 | 10,651 |
| Total cholesterol | 1181 | 1181 | 84 | 6258 | 4340 |
| Body mass index | 44 | 44 | 51 | 1M | 51,133 |
| Height | 4807 | 4807 | 559 | 1M | 59,714 |
Displayed is the number of SNPs selected for each phenotype in the four models in this study: PRS (best-performing PRS from PRSice, LDpred2, or lassosum2), XGBoost alone, LASSO (which has the same number of variants as the XGBoost alone model, because the LASSO selected the variants used by XGBoost), and XGBoost with PRS.
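The idea that a LASSO fit selects the variants passed on to a second model can be illustrated with a minimal coordinate-descent LASSO that returns the indices of non-zero coefficients. This is a sketch under stated assumptions: the objective scaling, (1/2n)·RSS + α·‖β‖₁, the function names, and the toy genotypes are all chosen for illustration and are not taken from the paper.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_select(X, y, alpha, n_iter=50):
    """Coordinate-descent LASSO on centered data; returns (selected columns, beta)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_bar = sum(y) / n
    resid = [v - y_bar for v in y]  # residual of the all-zero model
    beta = [0.0] * p
    z = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0:  # constant column carries no information
                continue
            # Correlation of column j with the residual that excludes its own effect.
            rho = sum(row[j] * (e + row[j] * beta[j]) for row, e in zip(Xc, resid)) / n
            new = soft_threshold(rho, alpha) / z[j]
            if new != beta[j]:
                delta = new - beta[j]
                resid = [e - delta * row[j] for row, e in zip(Xc, resid)]
                beta[j] = new
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    return selected, beta


# Toy dosages (hypothetical): a strong effect on SNP 0, a weak one on SNP 1, none on SNP 2.
X = [[0, 0, 1], [1, 0, 0], [2, 0, 1], [0, 2, 1], [1, 2, 0], [2, 2, 1]]
y = [2 * row[0] + 0.05 * row[1] for row in X]
strict, _ = lasso_select(X, y, alpha=0.1)   # heavier penalty drops the weak SNP
loose, _ = lasso_select(X, y, alpha=0.01)   # lighter penalty keeps it
```

The selected indices would then define the feature columns given to the downstream model, which is why the LASSO and XGBoost alone columns in the table report identical counts.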
Fig. 3 Nonlinear models consistently outperform linear ones for prediction of multiple complex phenotypes in a multi-ethnic dataset.
Linear (PRS, pink), linear-regularized (LASSO, teal), and nonlinear (XGBoost, gray and purple) models were used to predict the harmonized phenotypes from TOPMed SNP data after adjustment for covariates. Two versions of the XGBoost model are shown: the first used only the SNPs as features (gray; XGBoost alone), and the second also included the PRS as one of the features (purple; XGBoost with PRS). The LASSO model (teal) was trained on the same set of SNPs as XGBoost. The inset (gray) shows the heritability of the same phenotypes estimated in the same dataset by restricted maximum likelihood (REML); error bars are 95% confidence intervals.
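The difference between the two XGBoost variants is the feature matrix: one uses SNP dosages only, the other adds the PRS as an extra column. A linear PRS is a weighted sum of allele dosages, so the construction can be sketched as below. The function names, weights, and genotypes are hypothetical, and the real weights would come from the GWAS summary statistics.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


def add_prs_feature(genotypes, prs_weights, selected):
    """Build one feature row per individual: selected SNP dosages, then the PRS."""
    rows = []
    for dosages in genotypes:
        prs = polygenic_risk_score(dosages, prs_weights)
        rows.append([dosages[j] for j in selected] + [prs])
    return rows


# Hypothetical inputs: two individuals, three SNPs, two SNPs kept as direct features.
genotypes = [[0, 1, 2], [2, 0, 1]]
prs_weights = [0.1, -0.2, 0.3]
features = add_prs_feature(genotypes, prs_weights, selected=[0, 2])
```

The PRS can draw on many more variants than the selected SNP columns, so it can carry aggregate linear signal while the tree model captures non-linear effects among the selected SNPs.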
Fig. 4 Model performance differs by group, with XGBoost consistently outperforming PRS.
Performance of the PRS (pink) and XGBoost with PRS (purple) models, trained on the combined dataset, when applied to the prediction of the 5 phenotypes within each racial/ethnic group. Panels a, b, and c refer to the White, Black, and Hispanic/Latino groups, respectively.
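Evaluating one pooled model separately within each group amounts to computing the performance metric on each group's subset of the predictions. The sketch below assumes that variance explained is computed as 1 − SS_res/SS_tot within each group, which may differ from the exact definition used in the paper; the names and toy values are hypothetical.

```python
def pve(y_true, y_pred):
    """Proportion of variance explained: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((v - mean_y) ** 2 for v in y_true)
    ss_res = sum((v - p) ** 2 for v, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


def pve_by_group(y_true, y_pred, groups):
    """Apply pve() separately to the individuals in each group."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, label in enumerate(groups) if label == g]
        out[g] = pve([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return out


# Toy predictions from a single pooled model (hypothetical values).
y_true = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
y_pred = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
groups = ["A", "A", "A", "B", "B", "B"]
scores = pve_by_group(y_true, y_pred, groups)
```

Because SS_tot is computed within each group, a group with less phenotypic variance can show a lower score for the same absolute prediction error.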
Fig. 5 Multi-ethnic XGBoost model performs on par with the race/ethnic-specific models.
XGBoost with PRS models were trained either on the combined dataset containing all participants (pink) or on each racial/ethnic group separately (teal, gray, and purple). The models were then evaluated on each of the groups (a Black, b Hispanic/Latino, and c White).
Description of published GWAS used to identify summary statistics.
| Phenotype | GWAS | Study population | Participants |
|---|---|---|---|
| Height, BMI | Meta-analysis of genome-wide association studies for height and body mass index in ~700,000 individuals of European ancestry | UK Biobank and the GIANT consortium | 693,529 (European ancestry) |
| Total cholesterol, LDL cholesterol, HDL cholesterol, triglycerides | Genetics of blood lipids among ~300,000 Multi-ethnic Participants of the Million Veteran Program | Million Veteran Program | 297,626 (72.4% non-Hispanic Whites, 19.3% non-Hispanic Blacks, 8.3% Hispanics) |
| Systolic blood pressure, diastolic blood pressure | Trans-ethnic association study of blood pressure determinants in over 750,000 individuals | Million Veteran Program | 318,891 (69.1% non-Hispanic Whites, 18.8% non-Hispanic Blacks, 6.7% Hispanics, 0.77% non-Hispanic Asians and 0.85% non-Hispanic Native Americans) |
| Sleep duration | Genome-wide association study identifies genetic loci for self-reported habitual sleep duration supported by accelerometer-derived estimates | UK Biobank | 446,118 (European ancestry) |
GWAS source, study population as reported by the manuscript reporting the GWAS, and number of participants used to generate summary statistics for a given phenotype.