Yu Xu1,2,3, Dragana Vuckovic4,5, Scott C Ritchie1,2,3,6, Parsa Akbari3,5, Tao Jiang3, Jason Grealey2,7, Adam S Butterworth3,5,6,8, Willem H Ouwehand4,6,9,10, David J Roberts5,9,11, Emanuele Di Angelantonio3,5,6,12,8, John Danesh3,4,5,6,8, Nicole Soranzo4,5,6, Michael Inouye1,2,3,6,8,13.
Abstract
Genetic association studies for blood cell traits, which are key indicators of health and immune function, have identified several hundred associations and defined a complex polygenic architecture. Polygenic scores (PGSs) for blood cell traits have potential clinical utility in disease risk prediction and prevention, but designing PGSs remains challenging and the optimal methods are unclear. To address this, we evaluated the relative performance of 6 methods to develop PGSs for 26 blood cell traits, including a standard method of pruning and thresholding (P + T) and 5 learning methods: LDpred2, elastic net (EN), Bayesian ridge (BR), multilayer perceptron (MLP) and convolutional neural network (CNN). We evaluated these optimized PGSs on blood cell trait data from UK Biobank and INTERVAL. We find that PGSs designed using the common machine learning methods EN and BR show improved prediction of blood cell traits and consistently outperform other methods. Our analyses suggest EN/BR as the top choices for PGS construction, showing improved performance for 25 blood cell traits in the external validation, with correlations with the directly measured traits increasing by 10%-23%. Ten PGSs showed significant statistical interaction with sex, and sex-specific PGS stratification showed that all of them had substantial variation in the trajectories of blood cell traits with age. Genetic correlations between the PGSs for blood cell traits and common human diseases identified well-known as well as new associations. We develop machine learning-optimized PGSs for blood cell traits, demonstrate their relationships with sex, age, and disease, and make these publicly available as a resource.
Keywords: Blood cell trait; Disease associations; Machine learning; Method; Polygenic score; Population stratification
Year: 2022 PMID: 35072137 PMCID: PMC8758502 DOI: 10.1016/j.xgen.2021.100086
Source DB: PubMed Journal: Cell Genom ISSN: 2666-979X
Figure 1. PGS construction of blood cell traits using 6 different methods
Six PGS methods were evaluated in this study: pruning and thresholding (P + T) and 5 learning methods: LDpred2, elastic net (EN), Bayesian ridge (BR), multilayer perceptron (MLP), and convolutional neural network (CNN).
Figure 2. Performance comparison of 5 learning methods with the P + T method
The Pearson r performance of the P + T method for PGS construction of 26 blood cell traits is presented for testing on UKB or INTERVAL. Relative to the P + T method, the performance of the 5 learning methods (EN, BR, LDpred2, MLP, and CNN) is presented for each blood cell trait in descending order, left to right, according to EN (largest Pearson r increases on the left). For a given method, trait, and cohort, the Pearson r averaged over the 5 trained models, corresponding to the 5 different training-testing data partitions, is shown.
A detailed comparison between variant effect sizes estimated using EN/BR and P + T is presented in Figure S1.
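The averaging described in this caption can be illustrated with a minimal sketch. It assumes that performance "relative to P + T" means the relative change in averaged Pearson r, which is one plausible reading of the caption; the per-partition r values and the helper names `pearson_r` and `relative_gain` are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between a PGS and the measured trait."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def relative_gain(r_method, r_baseline):
    """Relative change in Pearson r, each averaged over the data partitions."""
    return (mean(r_method) - mean(r_baseline)) / mean(r_baseline)


# Hypothetical Pearson r values from 5 training-testing partitions
# (illustrative numbers only, not results from the paper).
r_baseline = [0.40, 0.40, 0.40, 0.40, 0.40]  # e.g. a P + T score
r_method = [0.44, 0.46, 0.45, 0.43, 0.47]    # e.g. an EN score

gain = relative_gain(r_method, r_baseline)
print(f"relative gain in averaged r: {gain:.1%}")
```

With these made-up inputs the averaged r rises from 0.40 to 0.45, a relative gain of 12.5%.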
Figure 3. Performance of P + T, EN, and LDpred2 methods on different variant sets in INTERVAL
Using conditional analysis (CA) variants as a base set, we added in the selected variant sets with LD thinning and p value thresholding to form different sizes of expanded variant sets for each trait. We used the CA variant set as the starting point and then observed the performance of P + T, EN, and LDpred2 on these expanded variant sets. Note that in this figure, P + T refers to the method that directly applies the weighted sum on a given variant set with effect sizes from GWAS.
See Figure S2 for a similar performance comparison in UKB.
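The weighted sum mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming each variant is coded as an effect-allele dosage of 0, 1, or 2; the function name `polygenic_score`, the effect sizes, and the genotypes are hypothetical.

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages: sum over variants of beta_j * g_j."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is needed per variant")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Hypothetical per-variant effect sizes, standing in for GWAS estimates.
betas = [0.5, -0.25, 1.0]

# Hypothetical effect-allele dosages (0, 1, or 2) for two individuals.
genotypes = [
    [0, 1, 2],
    [2, 2, 0],
]

scores = [polygenic_score(g, betas) for g in genotypes]
print(scores)  # [1.75, 0.5]
```

Under this reading, the learning methods differ from P + T only in how the per-variant weights are estimated; the scoring step itself stays a weighted sum.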
Figure 4. Trait levels by quintiles of EN-trained trait PGS in men and women for mean corpuscular volume (MCV), white blood cell count (WBC#), and neutrophil count (NEUT#) in INTERVAL
The y axis shows the observed measurements of each blood cell trait, adjusted only for technical artifacts and season. A generalized additive model (GAM) was used to fit the data across INTERVAL samples, and the shaded areas represent 95% confidence intervals.
See Figure S3 for results of all other traits.
Summary statistics of PGS-sex interaction tests for blood cell traits on INTERVAL
| Trait abbreviation | Trait name | Effect size: Sex (male) | Effect size: PGS (per SD) | Effect size: Interaction | P: Sex | P: PGS | P: Interaction |
|---|---|---|---|---|---|---|---|
| EO% | eosinophil percentage of white cells | 0.41 | 1.30 | 0.32 | <2.2E−16 | <2.2E−16 | 9.60E−11 |
| EO# | eosinophil count | 0.013 | 0.091 | 0.012 | <2.2E−16 | <2.2E−16 | 2.20E−4 |
| HCT | hematocrit | 3.68 | 1.70 | 0.51 | <2.2E−16 | <2.2E−16 | 3.50E−9 |
| HGB | hemoglobin concentration | 1.48 | 0.56 | 0.24 | <2.2E−16 | <2.2E−16 | <2.2E−16 |
| HLSR# | high light scatter reticulocyte count | 0.00061 | 0.0019 | 0.00029 | <2.2E−16 | <2.2E−16 | 2.03E−5 |
| MCHC | mean corpuscular hemoglobin concentration | 0.70 | 0.74 | 0.14 | <2.2E−16 | <2.2E−16 | 2.63E−5 |
| MONO% | monocyte percentage of white blood cells | 0.90 | 1.73 | 0.19 | <2.2E−16 | <2.2E−16 | 1.37E−5 |
| PCT | plateletcrit | −0.033 | 0.051 | −0.0057 | <2.2E−16 | <2.2E−16 | 4.59E−7 |
| PLT# | platelet count | −29.10 | 56.20 | −7.53 | <2.2E−16 | <2.2E−16 | 1.71E−12 |
| RET% | reticulocyte fraction of red blood cells | −0.0010 | 0.30 | −0.024 | 7.34E−1 | <2.2E−16 | 8.84E−4 |
Interactions between PGS and sex were tested for all of the traits on the INTERVAL cohort by using the multivariate linear regression y = β0 + β1 × PGS + β2 × Sex + β3 × PGS × Sex, where y is the actual trait level adjusted for technical artifacts, season, age, and the first 10 genetic principal components; PGSs were constructed using EN (p value threshold = 1) on UKB samples and standardized in the model. The 10 traits whose interaction term p values passed the Bonferroni significance threshold of 10⁻³ are listed in the table. SD, standard deviation.
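The interaction model in this table note can be illustrated with a small sketch. It assumes Sex is coded 0/1 (1 = male); with a binary Sex, the least-squares fit of the full interaction model coincides with fitting one straight line per sex, which is what the code uses. The helper names and the toy data are hypothetical, and the sketch returns coefficient estimates only, without standard errors or p values.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def interaction_coefficients(pgs, sex, y):
    """Estimate (b0, b1, b2, b3) in y = b0 + b1*PGS + b2*Sex + b3*PGS*Sex.

    Assumes sex is coded 0/1, so the fit reduces to one line per sex:
    the sex = 0 line gives (b0, b1); the sex = 1 line gives (b0 + b2, b1 + b3).
    """
    a0, s0 = simple_ols([p for p, s in zip(pgs, sex) if s == 0],
                        [v for v, s in zip(y, sex) if s == 0])
    a1, s1 = simple_ols([p for p, s in zip(pgs, sex) if s == 1],
                        [v for v, s in zip(y, sex) if s == 1])
    return a0, s0, a1 - a0, s1 - s0


# Toy data generated exactly from b0=10, b1=2, b2=3, b3=0.5 (illustrative only).
pgs = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]   # standardized PGS
sex = [0, 0, 0, 1, 1, 1]                 # hypothetical coding: 1 = male
y = [8.0, 10.0, 12.0, 10.5, 13.0, 15.5]  # adjusted trait levels

b0, b1, b2, b3 = interaction_coefficients(pgs, sex, y)
print(b0, b1, b2, b3)
```

A nonzero b3 means the per-SD effect of the PGS differs between the sexes, which is the quantity reported in the Interaction columns.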
Figure 5. Correlation between PGS for blood cell traits and PGS for 6 common diseases in INTERVAL
PGSs for blood cell traits and diseases were adjusted for the first 10 genetic principal components before the correlation analysis. Pearson correlation analysis was performed between the blood cell trait PGSs and disease PGSs across INTERVAL samples, and correlations with p values below the threshold of p = 10⁻⁴ (Bonferroni adjusted for all trait-disease tests) were deemed significant.
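The adjust-then-correlate procedure in this caption can be sketched in simplified form. The example below removes the linear effect of a single covariate rather than 10 principal components, purely to stay short; that simplification, the helper names, and the toy values are all assumptions for illustration. It also omits the p value calculation.

```python
from math import sqrt
from statistics import mean


def residualize(values, covariate):
    """Least-squares residuals after regressing values on one covariate."""
    mv, mc = mean(values), mean(covariate)
    slope = (sum((c - mc) * (v - mv) for c, v in zip(covariate, values))
             / sum((c - mc) ** 2 for c in covariate))
    return [v - mv - slope * (c - mc) for v, c in zip(values, covariate)]


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy values: both scores share a dependence on one principal component.
pc = [-2.0, -1.0, 0.0, 1.0, 2.0]
trait_pgs = [-1.0, -3.0, 2.0, -1.0, 3.0]
disease_pgs = [-3.0, -2.0, -2.0, 2.0, 5.0]

raw_r = pearson_r(trait_pgs, disease_pgs)
adjusted_r = pearson_r(residualize(trait_pgs, pc),
                       residualize(disease_pgs, pc))
print(f"raw r = {raw_r:.3f}, adjusted r = {adjusted_r:.3f}")
```

In this toy case the raw correlation is positive while the adjusted one is negative, which shows why shared population structure is removed before the scores are compared.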
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| UKB summary statistics of blood cell traits | | GWAS Catalog: GCST90002379-GCST90002407 |
| CAD meta-GRS | | PGS Catalog: PGS000018 |
| GWAS summary statistics for schizophrenia | | |
| GWAS summary statistics for Crohn’s disease | | GWAS Catalog: GCST003044 |
| GWAS summary statistics for rheumatoid arthritis | | GWAS Catalog: GCST002318 |
| GWAS summary statistics for allergic disease | | GWAS Catalog: GCST005038 |
| GWAS summary statistics for asthma | | GWAS Catalog: GCST006862 |
| PGS models | This manuscript | PGS Catalog: PGS000088 - PGS000113 |
| R 3.6.3 | R Core Team | |
| Python 3.6.8 | Python Software Foundation | |
| scikit-learn 0.21.2 | ||
| Keras 2.1.6 | N/A | |
| SNPNET | ||
| LDpred2 | ||
| PLINK 2.0 | PLINK Working Group | |
| PLINK 1.9 | PLINK Working Group | |
| Bcftools 1.9 | N/A | |