Genome-wide association studies (GWASs) have successfully identified many genetic variants and risk loci for complex traits and common diseases in the last 15 years. However, these identified variants, in general, can explain only a small to moderate proportion of the heritability, so the task of improving GWAS power for more discoveries remains both critical and challenging. In addition to the usual but costly or even infeasible route of continuing to increase the sample size, many approaches have been proposed to incorporate functional annotations to prioritize SNPs but with only limited success. Here, by taking advantage of the increasing availability of various types of omics data, we propose a new and orthogonal approach by integrating individual-level omics data with GWASs. The premise is that since omics data reflect both genetic and environmental (such as diet and other lifestyle) effects on individuals, they can be used to account for (otherwise unexplained) variations among individuals in GWAS analysis, leading to more precise/efficient estimation and thus higher power. As a concrete example, we propose boosting GWAS power by adjusting for metabolomics data in GWAS analysis. We applied the method to the UK Biobank subcohort of n = 90,000 individuals with both GWAS and metabolomics data. The analysis of 7 quantitative traits and one binary trait demonstrated clear power gains. For example, the new method (after adjusting for metabolomics data) identified 13 new loci for diastolic blood pressure that were all missed by the standard GWAS, and most or all of the 13 new signals were validated in two much larger GWAS datasets (n = 340,000 and 700,000); for body mass index, the improved estimation efficiency was equivalent to a 38.4% gain of GWAS sample size. The proposed method is both simple and promising and broadly applicable to integrating GWASs with other omics data.
Genome-wide association studies (GWASs) have become a powerful and common tool to identify genetic variants and loci associated with complex traits and complex diseases. As of April 7, 2022, the GWAS Catalog had collected 5,690 publications and 372,752 associations, and these findings have provided much insight into the genetic architecture of complex diseases/traits with potential clinical applications. For example, GWAS findings can inform new drug development, drug repurposing, and the prevention of adverse effects. Risk variants can be used to identify individuals at high risk of certain diseases for whom early prevention or treatment can be implemented. Despite the unprecedented success of GWASs, the detected GWAS variants usually cannot explain more than a modest proportion of the trait heritability, largely due to polygenicity and small effect sizes, which limit the power of even biobank-scale GWASs with sample sizes in the hundreds of thousands. Hence, lack of sufficient GWAS power for weakly associated variants is still a major bottleneck for new discoveries, and increasing the power of GWASs remains both extremely important and challenging. The most obvious and common route is to continue increasing the sample size, which, however, is not only costly and difficult but also ultimately limited by a finite or even small number of patients with disease. Alternatively, a rich class of approaches has been proposed to incorporate various functional annotations of the genome into GWASs to prioritize single-nucleotide polymorphisms (SNPs) and thus boost the power (refs. 7–21); however, their applications in practice are not common, presumably due to their limited success.
In this article, instead of (or in addition to) prioritizing SNPs, we take a completely new route to improve the statistical power of GWASs by using omics data to reduce unexplained/residual variations among individuals.
Specifically, we investigate whether adjusting for individual-level metabolomics data in GWASs can improve statistical power. Metabolites are small molecules that are intermediate or end products of metabolism. An individual’s metabolic phenotypes are influenced not only by one’s genotypes but also by numerous environmental factors such as diet and other lifestyle variables. For example, metabolites are responsive to one’s diet, and the concentrations of many metabolites change after exercise. On the other hand, standard GWAS analysis only considers marginal SNP-trait associations with adjustment for a few covariates, leading to possible confounding and/or large amounts of unexplained variations among the GWAS individuals, ultimately masking SNP-trait associations. We reason that since metabolites reflect both genetic and environmental effects, adjusting for them can account for some otherwise unexplained/residual variations among individuals in GWAS analysis and thus improve the power. This is analogous to using multiple regression to account for confounding and extra variations due to covariates, thus improving power over simple regression, though it may not work in some cases, as discussed in the discussion section. A common and plausible scenario is that the individuals are independently exposed to various environmental risk factors, which guarantees that adjusting for these risk factors will improve the statistical estimation efficiency and thus power, as in the well-known situation of adjusting for prognostic factors in randomized clinical trials. However, as most environmental exposures are often either difficult to measure or unknown, metabolomics (or other omics) data can be used as their proxy to improve GWAS power, motivating our proposed method.
Through an application to 7 quantitative traits and a binary trait (HTN) in the UK Biobank subcohort of n = 90,000 with both GWAS and metabolomics data available, many new signals were identified after adjustment for metabolomics data, which would be missed by a standard GWAS analysis. Importantly, the majority of these new loci were validated by other larger GWASs. For example, for body mass index (BMI), our new method identified 48 significant loci, including 31 novel ones, compared with 23 from the standard GWAS; our new method yielded much more precise estimates of the SNP-BMI association effect sizes, equivalent to a 38.4% increase in GWAS sample size over the standard GWAS. These results demonstrate that our proposed method, though simple, is promising with broad applications to other GWAS and metabolomics data.
So far, most GWASs with metabolomics data have focused on uncovering genetic associations with metabolites (refs. 27–30), while others investigated the relationship between metabolites and diseases such as cardiovascular and kidney diseases (refs. 31–35). However, to the best of our knowledge, no previous research has leveraged metabolomics data to boost GWAS power, which is the goal of this work. Furthermore, using the same argument, we foresee that our proposed method can be applied to integrate GWASs with other individual-level omics data, such as proteomics, which substantially broadens the applicability of the proposed method and also opens a door for new and important applications of omics data. Finally, the proposed method is orthogonal to increasing sample size and to many existing approaches of integrating functional annotations of the genome. Hence, these approaches can be combined, though we will not pursue that here.
Material and methods
Discovery sample: UK Biobank subcohort
UK Biobank (UKB) is a long-term population-based cohort study that started in 2006. It consists of about 500,000 individuals genotyped with around 800,000 SNPs and with a variety of phenotypic information. In this study, we analyze 8 traits, including seven quantitative ones and a binary disease, self-reported hypertension (HTN). The 7 quantitative traits are BMI, standing height (Height), waist circumference (WC), hip circumference (HC), waist-to-hip ratio (WHR), systolic blood pressure (SBP), and diastolic blood pressure (DBP). In addition, 249 nuclear magnetic resonance (NMR) metabolic biomarkers are available for a subcohort of around 120,000 randomly selected individuals, which will be used in our analysis. Targeted metabolic profiling was undertaken in baseline non-fasting plasma samples using high-throughput NMR spectroscopy. It simultaneously measured 249 metabolic biomarkers in a single assay, spanning multiple metabolic pathways, including lipoprotein lipids in 14 subclasses, fatty acids, and fatty acid compositions, as well as various low-molecular weight metabolites, such as amino acids, ketone bodies, and glycolysis metabolites. This dataset has been used by others, and further details have been described previously.
Validation GWAS summary data
For all of the 8 traits except for WHR, we used the UKB GWAS round 2 results published by the Neale lab in 2018 (http://www.nealelab.is/uk-biobank/) for (partial) validation, which used a larger sample of 361,194 UKB individuals in the analysis. For WHR, we used the GWAS of around 700,000 samples from a meta-analysis of the GIANT and UKB data. In addition, for BMI and Height, we also used the summary statistics of around 700,000 samples from a meta-analysis of the GIANT and UKB data. For WC and HC, we used the results published in 2015 from the GIANT consortium, with the sample size around 140,000. For SBP and DBP, we used the results published in 2018 of around 1 million individuals from the UKB and ICBP consortium.
Quality control and data pre-processing
In the quality control (QC) process, we included self-reported White and unrelated individuals with the metabolomics data and a non-missing phenotype in the corresponding GWAS analysis. We further filtered out individuals who were outliers for heterozygosity or missing rate, those who had an abnormal number of sex chromosomes, and those whose self-reported sex and genetic sex did not match. We also filtered out genetic variants with minor allele frequency less than 0.01, missing genotype rates larger than 0.05, or failing the Hardy-Weinberg equilibrium test at a p value threshold of 1e−6. The top 10 genotype principal components (PCs) were calculated based on the remaining set of SNPs and individuals after QC. For missing metabolomics data after QC, we used mean imputation. Then, the top 20 metabolomics PCs were used for dimension reduction, as many of the 249 metabolites were highly correlated. Finally, the quantitative phenotypes were inverse-rank normalized.
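The phenotype-side pre-processing steps above (mean imputation, PCA-based dimension reduction, and inverse-rank normalization) can be illustrated with a minimal sketch on toy data. This is not the pipeline used for the UKB analysis: the function names, the toy dimensions, and the rank offset of 0.5 in the normalization are all illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist


def mean_impute(X):
    """Replace NaNs in each column with that column's observed mean."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def top_pcs(X, k):
    """Return the top-k PC scores and the fraction of variance each explains."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each column
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return U[:, :k] * s[:k], explained[:k]


def inverse_rank_normalize(y):
    """Map values to standard normal quantiles by rank (assumes no ties)."""
    y = np.asarray(y, dtype=float)
    ranks = np.argsort(np.argsort(y)) + 1          # ranks 1..n
    probs = (ranks - 0.5) / len(y)                 # 0.5 offset is an assumed convention
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])


# Toy data standing in for the metabolite matrix and a skewed trait.
rng = np.random.default_rng(0)
metab = rng.normal(size=(500, 30)) + rng.normal(size=(500, 1))  # correlated columns
metab[rng.random(metab.shape) < 0.02] = np.nan                  # ~2% missing values
metab = mean_impute(metab)
mpcs, explained = top_pcs(metab, k=5)
trait = inverse_rank_normalize(rng.exponential(size=500))
print("cumulative variance explained by top 5 PCs:", explained.sum().round(3))
```

In practice the cumulative explained variance is what guides the choice of the number of PCs to retain.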
Genome-wide association testing
For each of the 8 traits, we used two different GWAS models for single SNP-trait association analysis. First, we performed a standard linear regression between each of the 7 transformed quantitative traits (Y) and each SNP (G), adjusting for covariates (C) including sex, age, and the top 10 genotype PCs:
$$M_0:\quad Y = \beta_0 + G\beta + C\gamma + \epsilon, \qquad (1)$$
where $\beta_0$ is the intercept, $\beta$ is the SNP effect, $\gamma$ is the vector of covariate effects, and $\epsilon$ is the random error term. For the second model, as our new method, we additionally adjusted for the top 20 metabolite PCs (M):
$$M_1:\quad Y = \beta_0 + G\beta + C\gamma + M\delta + \epsilon, \qquad (2)$$
where $\delta$ is the vector of metabolite PC effects and $\epsilon$ is the random error term. Here, as usual, we chose the top 20 metabolite PCs to balance explaining a large proportion of the total variation against keeping the number of PCs small relative to the sample size; other choices may be more suitable for other data. We call the new method metabolomics-assisted GWAS (Ma-GWAS).
As for HTN, all steps described above for the quantitative traits remained the same except that logistic regression, instead of linear regression, was used. All GWAS analyses were performed in PLINK 2.0. Significant SNPs were identified at the usual p value threshold of 5e−8 and mapped to 1,703 independent linkage disequilibrium (LD) blocks, each of which was treated as an independent genomic locus.
One potential issue of mapping SNPs to the 1,703 independent LD blocks is that SNPs located near and across the boundaries of two neighboring blocks might be in (weak) LD. Hence, as an alternative, we also used FUMA to define independent genomic loci containing significant SNP(s) uncovered by the proposed new method ($M_1$). Then, for each identified locus, if there was no significant SNP located in this region by the standard GWAS model $M_0$, we say that this locus was a new signal. To validate the new signals identified in $M_1$, we compared them with a validation GWAS dataset. For each new signal, if there were one or more significant SNP(s) located in this region in the validation GWAS, we say that this new signal was validated.
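To make the contrast between the standard model and the metabolite-adjusted model concrete, the following is a minimal sketch for a single SNP on simulated toy data. It uses ordinary least squares in NumPy rather than PLINK 2.0; the function name `ols_snp_test`, the toy dimensions, and all effect sizes are illustrative assumptions, not values from the UKB analysis.

```python
import numpy as np


def ols_snp_test(y, g, covars):
    """OLS of y on [intercept, g, covars]; return (estimate, SE) for the SNP term."""
    X = np.column_stack([np.ones_like(y), g, covars])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], np.sqrt(cov[1, 1])


rng = np.random.default_rng(1)
n = 5000
g = rng.binomial(2, 0.3, size=n).astype(float)   # toy SNP genotype coded 0/1/2
C = rng.normal(size=(n, 3))                      # stand-in for sex, age, genotype PCs
M = rng.normal(size=(n, 4))                      # stand-in for metabolite PCs
y = (0.05 * g + C @ np.array([0.2, 0.1, 0.1])
     + M @ np.array([0.5, 0.4, 0.3, 0.2]) + rng.normal(size=n))

b0, se0 = ols_snp_test(y, g, C)                         # standard model
b1, se1 = ols_snp_test(y, g, np.column_stack([C, M]))   # metabolite-adjusted model
print(f"standard: beta={b0:.3f}, SE={se0:.4f}")
print(f"adjusted: beta={b1:.3f}, SE={se1:.4f}")
```

Because the toy metabolite PCs are generated independently of the SNP, adjusting for them leaves the SNP estimate essentially unchanged while shrinking its SE, which is the mechanism the method relies on.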
Removing the genetic components from metabolite PCs
An analysis with adjustment for metabolites takes account of both genetic and environmental effects on metabolites. With genome-wide genotypic data, in principle, we can adjust for genetic effects directly through joint modeling of genotypes (or their PCs), though technical challenges remain. On the other hand, it is often difficult to measure all relevant environmental factors, some of which may not even be known. Hence, it would be interesting to account for only environmental effects through metabolites. For this purpose, we propose removing the genetic component from metabolomics data. Specifically, for each trait, we first conducted GWASs for each of the 20 metabolite PCs (mPCs) adjusting for sex, age, and the top 10 genotype PCs. Both the mPCs and covariates were standardized (as in the GWAS for the traits). Then, we obtained the genetic component of each (standardized) mPC by fitting a polygenic risk score with continuous shrinkage prior (PRS-CS) model with the corresponding mPC as the trait: we applied PRS-CS with the LD reference panel constructed using the UKB EUR samples and the default parameters provided by the software (https://github.com/getian107/PRScs/). The genetic component was calculated as the sum of the genetic markers across the genome, weighted by the posterior SNP effect sizes calculated by PRS-CS. We obtained the environmental component as the residual after subtracting the genetic component estimated by PRS-CS from the (standardized) mPC. Finally, we conducted GWASs for each trait, adjusting for sex, age, the top 10 genotype PCs, and the environmental components of the top 20 mPCs, denoted as model $M_2$.
It is noted that the genetic components estimated by PRS-CS (or any other modeling method) may not completely capture all genetic effects while at the same time partially reflecting some genetically correlated environmental effects. Hence, our proposed method is just an approximation.
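The subtraction step described above can be sketched as follows. This sketch does not run PRS-CS: it assumes a vector of per-SNP weights is already available (here, `snp_weights` is a hypothetical stand-in for the posterior effect sizes) and that genotypes and weights are on the same standardized scale. The toy data are constructed so that the recovered environmental component can be checked against a known truth.

```python
import numpy as np


def environmental_component(mpc, genotypes, snp_weights):
    """Subtract a polygenic-score estimate of the genetic component from a standardized mPC.

    genotypes: (n, p) genotype matrix; snp_weights: (p,) per-SNP weights,
    assumed to be on the scale of the standardized mPC.
    """
    mpc_std = (mpc - mpc.mean()) / mpc.std()
    genetic = genotypes @ snp_weights          # weighted sum of genetic markers
    return mpc_std - genetic, genetic


# Toy example: an mPC built from a known genetic part plus an "environmental" part.
rng = np.random.default_rng(2)
n, p = 2000, 50
G = rng.binomial(2, 0.4, size=(n, p)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)       # standardized genotypes
true_w = rng.normal(scale=0.05, size=p)
env_true = rng.normal(size=n)
mpc = G @ true_w + env_true

# Pretend the weights were estimated perfectly, rescaled to the standardized mPC.
w_hat = true_w / mpc.std()
env_hat, genetic_hat = environmental_component(mpc, G, w_hat)
print("corr(recovered env, true env):", np.corrcoef(env_hat, env_true)[0, 1].round(4))
```

With real data the weights are estimates, so the residual would retain some genetic signal, consistent with the caveat that the procedure is only an approximation.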
Simulation setups
In the simulation, we assumed that the metabolites only captured environmental effects, and we simulated the traits based on the UKB real data and some fitted model for BMI, SBP, and HTN, respectively, corresponding to three different scenarios. As shown later in Table 4, adjusting for the mPCs (with genetic components removed) in model $M_2$ additionally explained a relatively large proportion of variance in BMI but only a moderate one for SBP. We also considered HTN as an example of a binary trait. We chose three SNPs that were genome-wide significantly associated with BMI, SBP, and HTN, respectively, after adjustment for the mPCs but not in the standard GWAS ($M_0$). Specifically, for quantitative traits, we generated the traits as follows:
$$Y = g\beta + C_1 b_1 + C_2 b_2 + \epsilon,$$
where $g$ was sampled from the UKB individuals’ genotypes and standardized, $C_1$ was sampled from the UKB standardized covariates adjusted in $M_0$ (including the first 10 genotype PCs, sex, and age), $C_2$ was sampled from the UKB top 20 standardized mPCs (with the genetic components removed), and $b_1$ and $b_2$ were the standardized estimates from the null model of $M_2$, i.e., Y = sex + age + genotype PCs + mPCs(no genetic comp) with both the outcome and predictors standardized. The error term $\epsilon$ was generated from $N(0, \sigma^2)$, where $\sigma^2$ denotes the residual variance. For the binary trait, we generated the binary outcome as follows:
$$\eta = g\beta + C_1 b_1 + C_2 b_2$$
and
$$p = \frac{\exp(\eta)}{1 + \exp(\eta)},$$
where $g$, $b_1$, $b_2$, $C_1$, and $C_2$ were generated the same way as before, $p$ is the vector of probabilities, and the outcome was generated from the Bernoulli distribution with the corresponding probability $p$. The GWAS association effect $\beta$ was assigned a zero value and a non-zero value, where the non-zero value was determined by the corresponding standardized estimates in GWASs for the three traits. The sample sizes were also assigned according to the real data analysis, which were 90,000 for BMI and HTN and 85,000 for SBP.
Table 4
$R^2$ of the (null) standard GWAS model and our proposed new GWAS models with top 20 metabolite PCs (before and after removing the genetic components)
| Model | BMI | Height | SBP | DBP | WC | HC | WHR | HTN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0. Y = sex + age + genotype PCs | 0.011 | 0.528 | 0.123 | 0.029 | 0.225 | 0.001 | 0.451 | 0.092 |
| 1. Y = sex + age + genotype PCs + mPCs | 0.293 (∗∗∗)^a | 0.537 (∗∗∗) | 0.178 (∗∗∗) | 0.109 (∗∗∗) | 0.464 (∗∗∗) | 0.169 (∗∗∗) | 0.587 (∗∗∗) | 0.194 (∗∗∗) |
| 2. Y = sex + age + genotype PCs + mPCs(no genetic comp) | 0.305 | 0.536 | 0.180 | 0.112 | 0.473 | 0.178 | 0.590 | 0.197 |
An adjusted $R^2$ for each quantitative trait and Nagelkerke’s $R^2$ for the binary trait were used.
^a The p value from the likelihood ratio test comparing the full model 1 (including mPCs) with the reduced model 0 (without mPCs) was highly significant (smaller than 2.2e−16).
For each setup, the simulation was replicated 1,000 times. We report the mean and standard deviation (SD) of the β estimates and the mean of the estimated standard errors (SEs).
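A scaled-down version of the quantitative-trait simulation can be sketched as below. Unlike the setup described above, this sketch draws genotypes, covariates, and mPC stand-ins from simple random generators instead of resampling UKB data, and the coefficients, sample size, and number of replicates are illustrative assumptions. It reports the same three summaries: the mean and SD of the effect estimates and the mean of the estimated SEs.

```python
import numpy as np


def snp_estimate(y, g, covars):
    """Return (estimate, SE) of the SNP effect from OLS with an intercept."""
    X = np.column_stack([np.ones(len(y)), g, covars])
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return coef[1], np.sqrt(sigma2 * XtX_inv[1, 1])


def simulate(beta, n=2000, reps=200, seed=0):
    """Compare the unadjusted and adjusted models over repeated toy datasets."""
    rng = np.random.default_rng(seed)
    b1 = np.full(3, 0.1)        # assumed effects of the usual covariates
    b2 = np.full(5, 0.3)        # assumed effects of the mPC stand-ins
    results = {"unadjusted": [], "adjusted": []}
    for _ in range(reps):
        g = rng.binomial(2, 0.3, size=n).astype(float)
        g = (g - g.mean()) / g.std()                 # standardized genotype
        C1 = rng.normal(size=(n, 3))
        C2 = rng.normal(size=(n, 5))
        y = g * beta + C1 @ b1 + C2 @ b2 + rng.normal(size=n)
        results["unadjusted"].append(snp_estimate(y, g, C1))
        results["adjusted"].append(snp_estimate(y, g, np.column_stack([C1, C2])))
    summary = {}
    for name, res in results.items():
        est, se = np.array(res).T
        summary[name] = {"mean": est.mean(), "sd": est.std(ddof=1), "mean_se": se.mean()}
    return summary


for beta in (0.0, 0.05):
    print(beta, simulate(beta))
```

If the estimated SEs are well calibrated, the mean SE should be close to the empirical SD of the estimates within each model, and the adjusted model should show the smaller of the two.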
Results
After QC, there were around 90,000 individuals and 580,000 SNPs left in the analysis. Table 1 shows the sample sizes for the discovery and validation GWAS datasets.
Table 1
The sample sizes of the discovery and validation data for 8 traits
| Trait | Discovery data | Validation data – UKB Neale lab | Validation data – other |
| --- | --- | --- | --- |
| BMI | 91,337 | 359,983 | ∼700,000 |
| Height | 91,438 | 360,388 | ∼700,000 |
| WC | 91,476 | 360,564 | ∼200,000 |
| HC | 91,469 | 360,521 | ∼200,000 |
| WHR | 91,460 | NA | ∼500,000 |
| SBP | 85,950 | 340,159 | ∼700,000 |
| DBP | 85,951 | 340,162 | ∼700,000 |
| HTN | 91,620 | 361,141 | NA |
For each of the 8 traits, we performed two genome-wide association analyses without or with adjustment for the top 20 mPCs, corresponding to the standard GWAS and our proposed new Ma-GWAS, denoted as models $M_0$ and $M_1$, respectively. We identified significant SNPs with p < 5e−8 and mapped them to independent loci. We validated the new signals using other validation GWAS summary data.
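The bookkeeping behind "new" and "validated" loci can be sketched as simple set operations. The data layout here is hypothetical: each SNP is a `(chromosome, position, p_value)` tuple, and `block_starts` maps each chromosome to a sorted list of block start positions standing in for the independent LD blocks. Neither format comes from PLINK or FUMA.

```python
from bisect import bisect_right

P_THRESHOLD = 5e-8  # genome-wide significance threshold


def significant_loci(snps, block_starts):
    """Return the set of (chrom, block_index) containing at least one significant SNP."""
    loci = set()
    for chrom, pos, p in snps:
        if p < P_THRESHOLD:
            idx = bisect_right(block_starts[chrom], pos) - 1   # block containing pos
            if idx >= 0:
                loci.add((chrom, idx))
    return loci


def classify(loci_standard, loci_adjusted, loci_validation):
    """Split loci into new signals, validated new signals, and loci lost after adjustment."""
    new = loci_adjusted - loci_standard
    return {
        "new": new,
        "validated": new & loci_validation,
        "lost": loci_standard - loci_adjusted,
    }


# Toy example with two chromosomes and a handful of SNPs.
blocks = {1: [0, 1000, 2000], 2: [0, 5000]}
standard = [(1, 150, 1e-9), (1, 1500, 1e-3), (2, 100, 1e-10)]
adjusted = [(1, 160, 1e-12), (1, 1500, 1e-9), (2, 6000, 2e-8), (2, 100, 1e-6)]
validation = [(1, 1700, 1e-20), (2, 100, 1e-30)]

result = classify(
    significant_loci(standard, blocks),
    significant_loci(adjusted, blocks),
    significant_loci(validation, blocks),
)
print(result)
```

Note that validation in this scheme only requires some significant SNP in the same locus of the validation GWAS, not the same SNP.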
The new method identified more trait-associated loci
Table 2 shows the main results for the 8 traits. The first and second rows are the numbers of significant loci identified by the standard GWAS ($M_0$) and our new Ma-GWAS ($M_1$), respectively, and the third row is the number of significant loci identified by the new method ($M_1$) but missed by the standard GWAS ($M_0$). The fourth and fifth rows are the numbers of those new signals (in the third row) that could be validated in the UKB Neale lab data and other larger GWAS data, respectively. For example, for BMI, the standard GWAS identified 23 loci, while the new method ($M_1$) with additional adjustment for metabolomics data identified 48 loci. There were 31 new loci identified only by $M_1$ but missed by $M_0$, compared with 6 loci identified by $M_0$ but not by $M_1$. Furthermore, 17 of the 31 new loci were validated in the BMI GWAS results from the Neale lab. In addition, with a larger sample size of around 700,000, 26 of the 31 new loci were validated by another validation GWAS. The Manhattan plots of the 4 sets of GWAS results for BMI are shown in Figure 1, while those for the other traits are given in the supplemental information (Figures S1–S7).
Table 2
Results for the 8 traits
|  | BMI | Height | SBP | DBP | WC | HC | WHR | HTN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # Significant loci in subcohort UKB (M0) | 23 | 235 | 16 | 13 | 16 | 23 | 15 | 15 |
| # Significant loci in subcohort UKB + metabolite (M1) | 48 | 226 | 19 | 26 | 36 | 50 | 30 | 19 |
| # New loci (M1 versus M0) | 31 | 6 | 4 | 13 | 27 | 32 | 22 | 7 |
| # Validated new loci in Neale lab UKB | 17 | 6 | 3 | 11 | 16 | 28 | NA | 6 |
| # Validated new loci in other dataset | 26 | 6 | 4 | 13 | 4 | 8 | 17 | NA |
From the top row to the bottom: the number of significant loci identified by the standard GWAS $M_0$, by the proposed new method $M_1$ (after adjusting for metabolites), by $M_1$ but not by $M_0$, and the numbers of new loci validated by the Neale lab results and by other GWAS results.
Figure 1
Manhattan plots for BMI GWAS results
Top 2: the standard GWAS and proposed new method (without and with adjustment for metabolomics data) for the discovery sample (i.e., UKB sub-cohort). Bottom 2: two validation GWAS datasets.
It is noted that most of the new loci were validated in the larger GWAS. In particular, for Height, SBP, and DBP, all new loci were validated. For DBP, while all 13 loci identified by the standard GWAS $M_0$ were also identified by the new method $M_1$, the new method identified 13 more new loci, which were all validated by the other larger GWAS.
Table 3 shows the numbers of new loci and validated ones for the 8 traits based on the new definition of genomic loci given by FUMA. It is confirmed again that the proposed new method, by adjusting for metabolomics data in GWAS analysis, identified additional new loci for all 8 traits, and many of them were replicated in the validation GWAS data, though the numbers of validated new loci were slightly smaller than those given by the other definition of genomic loci as shown in Table 2.
Table 3
Results for the 8 traits (with loci defined by FUMA)
|  | BMI | Height | SBP | DBP | WC | HC | WHR | HTN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # New loci | 31 | 6 | 4 | 13 | 26 | 32 | 22 | 7 |
| # Validated new loci in Neale lab UKB | 13 | 5 | 3 | 10 | 10 | 25 | NA | 5 |
| # Validated new loci in other dataset | 21 | 6 | 3 | 11 | 4 | 7 | 11 | NA |
From the top row to the bottom: the number of new loci identified by the proposed new method $M_1$, and the numbers of new loci validated by the Neale lab results and by other GWAS results, respectively.
The new method improved estimation efficiency
The main motivation for adjusting for metabolites in GWASs is to reduce the residual variance, which yields more efficient estimation and thus higher power. To investigate this, we fitted models with only the usual covariates including sex, age, and top genotype PCs and the new models with sex, age, genotype PCs, and mPCs as predictors. Table 4 shows the coefficients of determination, R-squared ($R^2$), of the two models for each of the 8 traits. It is confirmed that additionally adjusting for mPCs led to a larger $R^2$, thus accounting for more variation of a trait and implying more efficient estimation by the new method.
Next, we directly compared the estimated effect sizes of SNPs and their corresponding SEs from the standard GWAS and our proposed new GWAS models as shown in Equations 1 and 2. To save space, we only present the representative results for BMI here, while the detailed results for the other 7 traits are in the supplemental information (see Figures S8–S13). The top panel in Figure 2 compares the SEs: each point represents one SNP, and we only kept SNPs with p values less than 1e−4 in either one of the two models. It is clear that the new method (adjusting for metabolomic PCs) yielded more precise estimates with smaller SEs.
Specifically, the SEs from the new method were 0.85 of those from the standard GWAS, a reduction of 15%, equivalent to a $1/0.85^2 - 1 = 38.4\%$ gain in sample size by the new method over the standard GWAS. For the six other quantitative traits, i.e., Height, SBP, DBP, WC, HC, and WHR, the reductions of their SEs by the new method were 1%, 3%, 4%, 17%, 9%, and 13%, equivalent to GWAS sample size gains of 2%, 6.3%, 8.5%, 45.2%, 20.8%, and 32.1%, respectively. It is notable that the magnitudes of the reduced SEs by the new method tracked the increase in $R^2$ explained by the mPCs for the quantitative traits as shown in Table 4.
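The conversion from an SE ratio to an equivalent sample size gain follows from the SE of a regression coefficient scaling as $1/\sqrt{n}$. The short sketch below reproduces the reported gains from the reported SE reductions. It also compares them with a rough prediction from the $R^2$ values in Table 4, under the added assumption (ours, for illustration) that the SNP is approximately uncorrelated with the mPCs, so that the SE ratio is about $\sqrt{(1 - R^2_1)/(1 - R^2_0)}$.

```python
import math


def sample_size_gain(se_ratio):
    """Relative increase in n needed to shrink SEs by the given ratio (SE ~ 1/sqrt(n))."""
    return 1.0 / se_ratio**2 - 1.0


def approx_se_ratio(r2_reduced, r2_full):
    """Rough SE ratio implied by two R^2 values, assuming SNP is uncorrelated with added covariates."""
    return math.sqrt((1.0 - r2_full) / (1.0 - r2_reduced))


# SE reductions reported in the text, and (model 0, model 1) R^2 values from Table 4.
se_reduction = {"BMI": 0.15, "Height": 0.01, "SBP": 0.03, "DBP": 0.04,
                "WC": 0.17, "HC": 0.09, "WHR": 0.13}
r2 = {"BMI": (0.011, 0.293), "Height": (0.528, 0.537), "SBP": (0.123, 0.178),
      "DBP": (0.029, 0.109), "WC": (0.225, 0.464), "HC": (0.001, 0.169),
      "WHR": (0.451, 0.587)}

for trait, reduction in se_reduction.items():
    gain = 100 * sample_size_gain(1 - reduction)
    predicted = approx_se_ratio(*r2[trait])
    print(f"{trait}: gain = {gain:.1f}%, observed SE ratio = {1 - reduction:.2f}, "
          f"R^2-implied ratio = {predicted:.3f}")
```

The $R^2$-implied ratios agree with the observed SE ratios to within rounding for all seven quantitative traits, which is one way to read the link between reduced SEs and increased $R^2$.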
Figure 2
Results comparison between models $M_0$ and $M_1$ for BMI
Comparison of the estimated standard errors (SEs) (A) and the estimated effect sizes of significant SNPs (B) for BMI between the standard GWAS and the proposed new method (after adjusting for the top 20 metabolite PCs).
The bottom panel in Figure 2 compares the estimated effect sizes for SNPs that were significant in either one of the two models (p < 5e−8). The colors of the points correspond to the SNPs that were significant only in $M_0$ (red), significant only in $M_1$ (green), and significant in both $M_0$ and $M_1$ (blue). The triangles correspond to the SNPs that were validated in either one of the two validation GWAS datasets, and the circles correspond to the SNPs that were not validated or did not appear in the validation GWAS data. Two patterns are observed for the new signals (i.e., the green points that were only significant in $M_1$). The first is the presence of a subset of the points close to the identity line (i.e., the 45° diagonal line), meaning that the two estimates from the two models were similar. These new signals were only identified by $M_1$ largely because of its more precise estimation, and most of these new signals were also validated by at least one larger GWAS dataset. The second pattern corresponds to the remaining (green) points farther away from the identity line, suggesting that their statistical significance was (partly) due to their larger effect sizes estimated by $M_1$. Similar observations held for the other quantitative traits except for Height. For Height, almost all points were aligned on or around the diagonal line, i.e., the estimated effect sizes were similar from the two models, which, along with their closer SEs, suggests that height is likely less affected by factors measured or manifested by metabolites.
For more details on the other quantitative traits, see Figures S10–S13.
For the binary trait HTN, both the estimated effect sizes and their SEs from the new method were slightly larger than their counterparts from the standard GWAS. While the larger SEs for (binary) HTN may be surprising, given that our method always yielded smaller SEs for the quantitative traits, this finding is due to the non-linearity of logistic regression (compared with linear regression). This is in agreement with the literature (refs. 49–52): adjusting for important covariates such as prognostic factors (corresponding to metabolomics data here) for a binary outcome with an independent and non-null treatment (corresponding to an SNP here) in logistic regression for a randomized clinical trial would always lead to both a larger treatment effect estimate and a larger SE, but in the end, the estimation efficiency, and thus the power, will always be improved, as confirmed by the larger number of genetic loci identified by the new method shown earlier.
By default, we used the sample of self-reported White individuals, for which, as suggested by a reviewer, population stratification might still exist after adjusting for genotype PCs. We provide the QQ-plots for the GWAS scans on the 8 traits in the supplemental information (Figures S14 and S15). Only Height had a highly inflated inflation factor of 1.56, while the other 7 traits had inflation factors between 1.1 and 1.22. We also note that the Neale lab UKB GWAS data for Height had a highly inflated inflation factor of 1.8. To alleviate the issue of population stratification, we further restricted the sample to those self-identified as “White British” and having a similar genetic ancestry (based on UKB data field 22006) and conducted the same analysis. The inflation factors were reduced for all traits, in particular from 1.56 to 1.35 for Height.
Similarly, the analysis using the more restricted sample of individuals also supported our findings that novel loci were identified after adjusting for mPCs. The QQ-plots and other detailed results are given in Figures S16 and S17.
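For readers unfamiliar with the inflation factor quoted above, a common definition is the median of the observed 1-degree-of-freedom chi-square test statistics divided by the expected median under the null (about 0.455). The sketch below assumes this standard definition; the text does not state which software or exact formula was used, so this is illustrative only. It uses the fact that a 1-df chi-square statistic is the square of a standard normal quantile, so only the Python standard library is needed.

```python
from statistics import NormalDist, median

_NORM = NormalDist()
# Median of a chi-square distribution with 1 df, i.e. (z_{0.75})^2, about 0.455.
CHI2_1DF_MEDIAN = _NORM.inv_cdf(0.75) ** 2


def inflation_factor(p_values):
    """Genomic inflation factor from two-sided p values in (0, 1]."""
    # A two-sided p value corresponds to |z| = -inv_cdf(p / 2), and chi-square = z^2.
    chi2 = [_NORM.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p values mimic a well-calibrated null scan.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
# Squaring them makes every p value smaller, mimicking inflated test statistics.
inflated_p = [p**2 for p in null_p]

print("null scan:", round(inflation_factor(null_p), 3))
print("inflated scan:", round(inflation_factor(inflated_p), 3))
```

A value near 1 indicates well-calibrated test statistics, while values well above 1 can reflect population stratification or, for highly polygenic traits such as height, genuine widespread association.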
Why top 20 mPCs?
We chose to include the top 20 mPCs in the proposed model mainly because they explained a high proportion of the total variation (about 95%) in the 249 metabolite biomarkers, while the number of top mPCs, 20 here, was small compared with the sample size of GWASs. We also performed the same analysis but with the top 5 and top 10 mPCs, which explained around 81% and 89% of the total variance, respectively. We found that the results were similar to those with the top 20 mPCs: multiple novel loci were identified compared with the standard GWAS, and most of the novel loci were also validated by other GWAS data. Detailed results are given in the supplemental information (Tables S1–S3). In addition, we looked at the top 10 metabolites contributing most to each of the top 20 mPCs. We found that out of the 18 metabolite classes present in the UKB data, the top 10 metabolites of the top 5, top 10, and top 20 mPCs covered 12, 16, and 18 metabolite classes, respectively. This suggests that including the top 20 PCs covered all classes of metabolites that might be able to explain more background genetic and environmental effects, while including only the top 5 or top 10 PCs would miss some of them.
Removing the genetic components from mPCs
To focus on adjusting for only the environmental effects reflected in metabolites, we removed the genetic component from each of the top 20 mPCs, then conducted the GWAS on each trait. As shown in Table 5, we can see a similar pattern (as in Table 2): adjusting for the environmental components of metabolites in $M_2$ identified new loci beyond those by the standard GWAS model $M_0$, though the number of new loci identified by $M_2$ was smaller than that by $M_1$. Furthermore, most of the new loci identified by $M_2$ were validated as well.
Table 5
Results for the 8 traits comparing standard GWAS ($M_0$) and GWAS with adjustment for only the environmental components of top 20 metabolite PCs ($M_2$)

| | BMI | Height | SBP | DBP | WC | HC | WHR | HTN |
|---|---|---|---|---|---|---|---|---|
| # Significant loci in subcohort UKB ($M_0$) | 23 | 235 | 16 | 13 | 16 | 23 | 15 | 15 |
| # Significant loci in subcohort UKB + environmental component ($M_2$) | 37 | 230 | 17 | 17 | 28 | 47 | 17 | 16 |
| # New loci ($M_2$ versus $M_0$) | 20 | 5 | 2 | 5 | 15 | 26 | 10 | 3 |
| # Validated new loci in Neale lab UKB | 12 | 5 | 2 | 5 | 11 | 25 | NA | 2 |
| # Validated new loci in other dataset | 19 | 5 | 2 | 5 | 2 | 8 | 8 | NA |
As shown in Table 4, interestingly, the values for model 2 (after removing the genetic components from the top mPCs) were quite close to, though all except one were slightly larger than, those for model 1 (before removing the genetic components). This suggests that the mPCs largely, and perhaps even mostly, represented environmental effects on the traits, offering one explanation for why adjusting for mPCs improved power.

As before (i.e., without removing the genetic components from mPCs), we compared the estimated effect sizes of SNPs and their corresponding SEs from the standard GWAS and our proposed new GWAS for BMI; the detailed results for the other 7 traits are in the supplemental information (Figures S18–S23). As shown in the top panel in Figure 3, the new method (adjusting for the environmental components of mPCs) yielded more precise estimates with smaller SEs. Specifically, the SEs from the new method were about 0.84 times those from the standard GWAS. For the other six quantitative traits, the results were similar to before: the reductions of their SEs by the new method ranged from 1% to 18%. On the other hand, as shown in the bottom panel in Figure 3, the estimated effect sizes for SNPs by the new method were more similar to those from the standard GWAS. For the binary trait HTN, both the SEs and the estimated effect sizes by the new method were slightly larger than those from the standard GWAS, as observed and explained before.
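An SE ratio can be translated into an equivalent gain in sample size under the usual approximation that the SE of a regression coefficient scales as $1/\sqrt{n}$. The sketch below applies that approximation; the function name `sample_size_gain` is hypothetical, and whether the gains reported in this paper were computed in exactly this way is not stated in this section.

```python
def sample_size_gain(se_ratio):
    """Relative sample-size gain equivalent to an SE ratio (new / standard).

    Assumes SE is proportional to 1 / sqrt(n), so shrinking the SE by a
    factor r has the same effect as multiplying n by 1 / r**2.
    """
    if se_ratio <= 0:
        raise ValueError("se_ratio must be positive")
    return 1.0 / se_ratio ** 2 - 1.0

# SE ratio of about 0.84 (BMI, environmental components of mPCs).
gain_bmi = sample_size_gain(0.84)
# A 1% reduction in SE corresponds to a ratio of 0.99.
gain_small = sample_size_gain(0.99)
print(f"{gain_bmi:.1%}, {gain_small:.1%}")
```

Under this approximation a ratio of 0.84 corresponds to roughly a 42% larger sample, and a 1% reduction in SE to roughly a 2% larger sample. A ratio above 1, as seen for the binary trait, gives a negative value.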
Figure 3
Results comparison between models $M_0$ and $M_2$ for BMI
Comparison of the estimated effect sizes of significant SNPs (B) and their SEs (A) for BMI between the standard GWAS and the proposed new method (after adjusting for only the environmental components of the top 20 metabolite PCs).
Simulation results
On each simulated dataset, we applied the standard GWAS, which adjusted only for the usual covariates including genotype PCs, sex, and age, and our proposed Ma-GWAS, which adjusted for mPCs in addition to the usual covariates (see Equation 3). The left column in Figure 4 shows the box plots of the estimated GWAS associations across 1,000 replicates under different simulation setups, together with the corresponding type I error (T1E) and power. The standard GWAS gave biased estimates because it ignored the mPCs as confounders (the mPCs were correlated with both the genotypes and the traits) and sometimes had inflated T1E, while Ma-GWAS gave unbiased estimates with well-controlled T1E under the true model. Importantly, Ma-GWAS was more powerful than the standard GWAS. The right column in Figure 4 compares the estimated SEs of the SNP effect estimates between the two methods. Here, we only show the plots for a non-zero β for each trait, but the results were similar when β = 0 (see Table S5). In agreement with our real data analysis, Ma-GWAS gave more precise estimates with smaller SEs for the two quantitative traits. For BMI, the metabolites could capture a larger portion of environmental effects, so the ratio of the estimated SEs between the two methods was smaller, around 84%; in comparison, for SBP, the ratio was larger, around 97%. For (binary) HTN, similar to the real data analysis results, we again observed that both the SEs and the estimated effect sizes for non-zero β by Ma-GWAS were larger than those from the standard GWAS. In summary, the simulation results confirmed that adjusting for metabolites gave more precise/efficient estimation and thus higher power, as we observed in the real data analysis.
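The comparison described above can be mimicked with a toy simulation for a quantitative trait. This is a sketch under stated assumptions, not the simulation design of the paper: it uses a single SNP, a single mPC-like covariate `m` that depends weakly on the SNP, and hypothetical parameter values (`beta`, `gamma`, `alpha`, `n`); the other covariates (genotype PCs, sex, age) are omitted.

```python
import numpy as np

def ols(y, X):
    """OLS fit; returns coefficient estimates and their standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))

def simulate_once(rng, n=2000, beta=0.1, gamma=0.8, alpha=0.05):
    """One replicate: SNP effect estimate and SE without and with adjustment."""
    g = rng.binomial(2, 0.3, size=n).astype(float)  # SNP dosage
    m = alpha * g + rng.normal(size=n)              # mPC-like covariate
    y = beta * g + gamma * m + rng.normal(size=n)   # quantitative trait
    ones = np.ones(n)
    b0, se0 = ols(y, np.column_stack([ones, g]))     # standard model
    b1, se1 = ols(y, np.column_stack([ones, g, m]))  # adjusted for m
    return b0[1], se0[1], b1[1], se1[1]

rng = np.random.default_rng(2)
results = np.array([simulate_once(rng) for _ in range(200)])
mean_unadjusted, mean_adjusted = results[:, 0].mean(), results[:, 2].mean()
mean_se_ratio = results[:, 3].mean() / results[:, 1].mean()
print(mean_unadjusted, mean_adjusted, mean_se_ratio)
```

In this setup the unadjusted estimate targets `beta + gamma * alpha` rather than `beta`, while the adjusted estimate targets `beta` with a smaller SE, because the covariate absorbs part of the residual variance.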
Figure 4
Simulation results across 1,000 replicates for BMI, SBP, and HTN (from the top to bottom rows)
The left column shows the box plots of the estimated GWAS effects with empirical type I error (T1E) rates (at the nominal significance level of 0.05) and power (at the genome-wide significance level of 5e−8) indicated. The right column compares the estimated standard errors between the two methods.
Discussion
We have investigated and confirmed that adjusting for metabolomics data in GWAS analysis can boost its power. In an application to 7 quantitative traits and one binary trait of UKB samples with metabolomics data, we found that adjusting for the top 20 metabolomics PCs in GWASs substantially improved the power over the standard GWAS; most of the newly detected trait-associated loci were validated by larger GWASs. The main reason is that metabolomics (and other omics) data reflect both genetic and environmental influences, so using them to account for additional variation among individuals is expected to reduce the residual variance in GWAS, leading to more precise/efficient estimation of the effect sizes of the SNPs and thus improved power. In particular, across the 7 quantitative traits, the reduction of the SEs of the estimated SNP effect sizes ranged from 1% to 13%, equivalent to GWAS sample size gains from 2.0% (for Height) to 45.2% (for WC). Furthermore, after regressing out the genetic components from the mPCs, we obtained similar GWAS results, suggesting that the estimation and power gains from adjusting for mPCs perhaps came mainly from accounting for environmental effects reflected in metabolomics data.

There are several directions for future study. First, since some metabolites may mediate the effects of some SNPs on a trait, adjusting for these metabolites may reduce power in some situations and bias inference in others. Our analysis results suggest that the potential gain outweighs the possible loss in power, especially if we take the proposed approach as complementary to the standard GWAS approach. Alternatively, we may regress out genetic components from metabolomics data (e.g., mPCs, as done in our analysis) before adjusting for mPCs. Nevertheless, to better deal with the problem, it is important to have a better understanding of the causal mechanism underlying genetic variants, metabolites, and a trait of interest.
Relatedly, perhaps certain types of metabolites are more suitable to be adjusted for with a given GWAS trait. Similarly, certain categories of traits might be more likely to benefit from the adjustment of (certain types of) metabolites in GWASs. This could be because some traits are more largely and directly influenced by genetic and/or environmental factors (e.g., those involved in or influencing metabolic pathways) that are also better captured by the available metabolomics data, as partially reflected by the larger proportions of their variance explained by metabolic data. Hence, the proportion of a trait's variance explained by metabolic data may be used to predict how useful the given metabolic data are for the trait, thus possibly informing the choice of the metabolic data and/or the trait. More studies are needed.

Second, our proposed method is orthogonal to many existing approaches that incorporate functional annotations of the genome and thus can be applied on top of these methods. It would be interesting to see how such a combined approach would perform in real data applications.

Third, in this study, to adjust for the hidden genetic and environmental effects manifested by metabolomics data, we simply included the top 20 PCs obtained from the 249 metabolites as fixed effects in the GWAS regression model. There could be other, better approaches to integrating metabolomics (or other omics) data in GWAS.

Lastly, more real data applications, possibly with various omics data, are warranted.