Literature DB >> 29218908

Analyzing metabolomics data for association with genotypes using two-component Gaussian mixture distributions.

Jason Westra1, Nicholas Hartman, Bethany Lake, Gregory Shearer, Nathan Tintle.   

Abstract

Standard approaches to evaluate the impact of single nucleotide polymorphisms (SNP) on quantitative phenotypes use linear models. However, these normal-based approaches may not optimally model phenotypes which are better represented by Gaussian mixture distributions (e.g., some metabolomics data). We develop a likelihood ratio test on the mixing proportions of two-component Gaussian mixture distributions and consider more restrictive models to increase power in light of a priori biological knowledge. Data were simulated to validate the improved power of the likelihood ratio test and the restricted likelihood ratio test over a linear model and a log transformed linear model. Then, using real data from the Framingham Heart Study, we analyzed 20,315 SNPs on chromosome 11, demonstrating that the proposed likelihood ratio test identifies SNPs well known to participate in the desaturation of certain fatty acids. Our study both validates the approach of increasing power by using the likelihood ratio test that leverages Gaussian mixture models, and creates a model with improved sensitivity and interpretability.

Entities:  

Mesh:

Substances:

Year:  2018        PMID: 29218908      PMCID: PMC5757879     

Source DB:  PubMed          Journal:  Pac Symp Biocomput        ISSN: 2335-6928


1. Introduction

Genome-wide association studies (GWAS) continue to be viewed as a standard approach to evaluating the genetic component of a variety of diseases and other phenotypes of interest [1]. Standard approaches to the analysis of genotype associations with quantitative phenotypes use linear models. As suggested in Tintle et al. [2], bimodal distributions are frequently observed in continuous phenotype samples of metabolites, challenging the normality assumption needed in many existing GWAS analysis approaches. For example, red blood cell fatty acid levels have been found to contribute to coronary heart disease [3]. As outlined in Tintle et al. [2], it is biologically reasonable to consider one’s fatty acid levels as coming from a mixture of Gaussian distributions, with each of the two or three mean fatty acid levels determined by genetics, and variation around the mean level determined by other factors (e.g., diet; lifestyle). While the standard way of analyzing fatty acids follows the typical GWAS linear model approach, in cases where the distribution does not appear to be normally distributed, a log transformation is sometimes used [4]. However, this log transformation may fail to accurately capture the true distribution of the genotypic and phenotypic data since it ignores the biological reasoning for observing a non-normal distribution. It may be more powerful to directly model the normal mixture distribution and then test for genotype-phenotype association. Recently, Kim et al. proposed a likelihood ratio test to test for association between copy number polymorphisms (CNP) with quantitative phenotypes and case control outcomes which followed a mixture of Gaussian distributions [5]. The likelihood ratio test evaluates possible differences in the mixing proportions of the Gaussian components by different copy number. Kim et al. showed that the likelihood ratio test was more powerful than a 2 x d chi-squared test with d equal to the number of CNP categories when the underlying data was from a mixture distribution. We propose adapting the Kim et al. likelihood ratio test to the standard genotype-phenotype testing situation for phenotypes which are distributed as a mixture of Gaussian distributions, like some metabolomics data (e.g., fatty acid levels). We will provide a theoretical framework for the likelihood ratio test, evaluate its performance on simulated data and then apply it to a real set of fatty acid data from the Framingham Heart Study.

2. Methods

2.1. Notation

Let X be a quantitative phenotype that follows a two-component Gaussian mixture distribution. Thus, X ~ πN (μ1, σ2) + (1 − π) N (μ2, σ2) where π is the mixing parameter of the Gaussian components. Let μ1 and μ2 be the mean parameters such that μ1 ≠ μ2, and we assume a common variance σ2 for both components. We assume π = p01(n0/N)+ p11(n1/N)+ p21(n2/N) where p1 (t = 0, 1, 2) is the proportion of genotype t in the first component of the mixture distribution, n (t = 0, 1, 2) is the number of individuals with genotype t, and N is the total number of individuals. We consider the null hypothesis H0 : p01 = p11 = p21 and the alternative Ha: at least one is not equal. Let pϕ = p0 = p1 = p2 (i =1,2) (see Figure 1 for a visual representation). Let x (b = 1, 2, … n) and (t = 0, 1, 2) be a random variable representing the phenotype for individual b who has genotype t, and let w be a vector of all x. Across all the components, the mixing proportion for genotype t must sum to 1 such that p1 + p2 =1 ( t= 0,1,2).
Figure 1

visually illustrates the null and alternative models. The black, light grey, and dark grey two-component mixture distributions are the phenotype distributions for the less common homozygote, the heterozygote and the more common homozygote, respectively. In the null model, 75% of the observations in each genotype are in the component with the smaller mean. In the alternative model, the mixing proportion for the component density with the smaller mean varies across genotypes.

2.2. Likelihood functions

2.2.1. Null and alternative likelihood function

The likelihood function under the null hypothesis is: The likelihood function under the unrestricted alternative hypothesis is:

2.2.2. Restricted likelihood function

When there is a biological understanding of the phenotype-genotype relationship, we recommend restricting the mixing proportions of the test to fit the biological model. We demonstrate two possible models, but our general method easily extends to other models. The first model (LRTpro; Table 1) we consider is that the proportion of change between genotypes 0 and 1 is equal to the change between genotypes 1 and 2. Therefore, we can restrict our parameters of interest to , and . The second restricted model (LRTadd; Table 2) that we demonstrate describes an equal difference in proportions between groups 0 and 1 and groups 1 and 2. We can restrict our parameters of interest to , and . Therefore, the likelihood function under these restrictions is:
Table 1

LRTpro

GenotypeComponent 1 of Mixture DistributionComponent 2 of Mixture Distribution
0p011 − p01
1p01q1 − (p01q)
2p01q21 − (p01q2)
Table 2

LRTadd

GenotypeComponent 1 of Mixture DistributionComponent 2 of Mixture Distribution
0p011 − p01
1p01q1 − (p01q)
2p01 − 2q1 − (p01 − 2q)

2.2.3. Test statistics

Because p2 =1 − p1 for all t, we can express each likelihood as a function of the parameters μ1, μ2, σ2, and the mixing proportion(s) associated with the N(μ1, σ2) distribution. The resulting likelihood ratio test statistics are given by: Extending the argument provided by Kim et al. the LRTS under the null hypothesis follows a central chi-squared distribution with the degrees of freedom equal to the difference in parameters of the null and alternative models [5]. Therefore, under the null hypothesis, the LRTS has a central chi-squared distribution with 2 degrees of freedom, and the LRTSres follows a central chi-squared distribution with 1 degree of freedom.

2.3. Simulation

Using R software, we simulated 1000 datasets with 10,000 individuals per data set. For each, individual, the genotype for a single SNP was generated by assuming Hardy-Weinberg equilibrium and minor allele frequency of either 0.05, 0.10, or 0.25. Trait values for individuals were simulated from two component Gaussian mixture distributions with centers one unit apart and equal variance of the components σ2 = 0.5 or 0.75. For the mixing proportions of individuals with genotype 0, we used p01 = 0.9 or p01 = 0.75. We used two different biological models to simulate. In the proportional model we set q equal to 1, 0.9, or 0.75 so that the other mixing proportions were p11 = p01q and p21 = p01q2. In the additive model we set q equal to 0.1 or 0.2 so that the mixing proportions were p11 = p01 − q and p21 = p01 −2q. Simulations were performed on all combinations of the parameters.

2.4. Statistical analysis

To evaluate the performance of these tests in direct comparison to the standard procedure of linear and log-linear models, all tests were run on each simulated SNP and phenotype. Each test produced a p-value, test statistic and parameter estimates. Type I error rates and power estimates were calculated by dividing the number of observations less than a significance level (Type I error 0.01, power 0.0001) by the total number of simulations. We used an Expectation Maximization (EM) algorithm to find the global maximums of equations (4) and (5). One hundred random start points (RSP) were used for the null likelihood, and 50 RSP and one start point from the maximum of the null were used in the alternative [5]. The EM algorithm ran until a tolerance of 10−5 was reached or until 600 and 300 iterations were performed for the null and alternative models respectively.

2.5. Real data application

We analyzed 20315 SNPs on chromosome 11 for 5936 individuals from the Framingham Heart Study using the proposed LRTpro test. We looked exclusively at members in the offspring and generation 3 cohorts, all of whom are of European descent. Detailed descriptions of the sample are available elsewhere [6] [9]. We looked at the red blood cell fatty acid level ratio of arachidonic acid (AA) to dihomo-gamma-linoleneic acid (DGLA). These fatty acid levels were analyzed by gas chromatography as previously described [6]. The desaturation of AA to DGLA occurs primarily via enzymatic activity in the FADS gene complex on chromosome 11. We will use a Bonferroni correction to control the probability of type I errors at 2.47x10−6 (0.05/20315).

3. Results

3.1. Verifying the null distribution and type I error rate

To confirm that the null distribution of the unrestricted model is a chi-square distribution with two degrees of freedom and that the null distribution of the restricted model is a chi-square distribution with one degree of freedom, we examined simulations when q = 1. In addition to examining the novel tests proposed here (LRTpro, LRTadd) we also explored the type I error rates of the linear model, log-linear model, and LRT across these same simulations. As shown in Table 3 the type I error rate was controlled by all tests.
Table 3

Type I Error Estimates

Nominal Significance Level

SD0.050.010.001Kolmogorov-Smirnov test p-value1
LRTpro0.50.04970.0110.00120.6846
0.750.05150.00970.00100.8832

LRTadd0.50.04720.01080.00120.7277
0.750.04950.00850.00080.7091

LRT0.50.05570.01080.00120.2269
0.750.04780.00780.00130.7435

Linear Model0.50.05380.01070.0007
0.750.04580.00700.0005

Log Linear Model0.50.05230.01080.0007
0.750.04600.00830.0007

As compared to a chi-square distribution.

3.2. Power estimates

There were 48 simulations where the alternative hypothesis was true. As summarized in Table 4 (full detailed results are in Supplemental Table 1), the LRTpro has empirical power equal to or greater than all the other tests in all situations. LRTadd was the second most powerful test in all 48 simulations. When comparing a linear model to the unconstrained LRT test directly there were 21 simulations where they had different power. In two-thirds of these cases (14 out of 21), LRT had higher power than the linear model. The log-linear model never had an empirical power higher than any other test.
Table 4

Power Estimates

modelqmafp01Linear ModelLog Linear ModelLRTproLRTaddLRT
add0.10.050.750.3430.260.4030.390.295
0.90.440.3160.6310.6240.508

0.10.750.8240.7360.8790.8710.798
0.90.8980.7930.9670.9660.938

0.250.750.9990.9970.9990.9990.999
0.90.9990.999111

pro0.90.050.750.120.0950.1560.1530.105
0.90.3250.2120.4780.4670.362

0.10.750.3880.310.460.4510.342
0.90.750.6220.8910.8870.831

0.250.750.9040.8440.9360.9320.892
0.90.9980.975111

Power estimates for standard deviation of .75 for alpha = 0.0001

The choice of 0.0001 as a cutoff for our power estimates is arbitrary as Figure 2 demonstrates. The LRTpro tends to have a smaller p-value than the linear model for all thresholds since almost all of the points are above the gray line.
Figure 2

P-value comparison between LRTpro and the linear model.

3.3. Robustness of model selection

Since choosing a restriction based on prior knowledge as is done in both LRTpro and LRTadd may not be possible in every circumstance, it may not be necessary to choose the exact model. Table 4 shows that LRTpro and LRTadd were the most powerful tests even when the other model was simulated. These two restrictions are of similar patterns, but the increase of power is substantial. Therefore, choosing a model at least similar to the true model can increase the power of the test.

3.4. Parameter estimation

In order to conduct the LRT, estimates of the underlying parameters of the two-component distribution are obtained. Table 5 illustrates the accuracy and precision of the resulting estimates across a range of simulation settings for the LRTpro approach, with full results for all tests in supplemental tables 2 and 3. In general, LRTpro and LRTadd yielded unbiased and accurate estimates across settings. In Table 5, one can see that LRTpro accurately predicted the means of the components both across a wide range of settings and with low variation of the estimate. LRTpro estimated well even when the data was simulated from the additive model. Similar results are obtained when estimating the mixing proportion (see Table 6) and the standard deviation of the components (see supplemental table 4).
Table 5

Estimates of Means for LRTpro

True modelTrue p01qμ1Standard deviation of μ1μ2Standard deviation of μ2
Add0.750.10.00050.029361.00220.0401
0.90.1−0.00210.02401.00360.0740
0.750.2−0.00070.02401.00110.0349
0.90.2−0.00200.02061.00050.0547

Pro0.750.750.00050.026781.00050.0356
0.90.75−0.00140.01101.00080.0518
0.750.90.00020.02931.00300.0400
0.90.9−0.00250.024961.00280.0781

Estimates aggregated across all settings with these parameters and all simulations within each setting, with the true value of μ1 =0 and μ2 =1.

Table 6

Estimates of Mixing Proportions for LRTpro

modelTrue p01qp01sdTrue p11p11sdTrue p21p21sd
Add0.750.10.75090.02800.650.51970.15370.550.52900.0625
0.90.10.89730.02650.80.61340.27860.70.60590.1627
0.750.20.74960.02470.550.50970.05820.350.50720.1731
0.90.20.89850.02200.70.65590.12200.50.55230.0744

Pro0.750.750.75040.02480.56250.53900.05970.42190.47070.1188
0.90.750.89830.02050.67500.66270.06910.50630.51390.0677
0.750.90.75070.02840.67500.53850.17540.60750.53580.1037
0.90.90.89660.02780.81000.62900.28230.7290.61700.1814

Estimates aggregated across all settings with these parameters and all simulations within each setting.

3.5. Real data results

After analyzing 20321 SNPs on Chromosome 11 in relation to the AA/DGLA ratio, the LRTpro test identified 28 SNPs as significantly associated after applying a Bonferonni multiple testing correction. These 28 SNPs came from 5 different regions on chromosome 11, all of which validated previous GWAS findings. Nineteen significant SNPs are in the well documented [10] [12]FADS region (bp = 61622896– 61978819). Genes in this region that contain significant SNPs include DAGLA, MYRF, FADS1, FADS2, FADS3, and RAB3IL1 all of which have strong biological basis for desaturation activity [10]. As an example interpretation of the results in Table 7, we first note that the significant tests all show similar estimates of the two components of the AA/DGLA ratio (mean of component one between 0.16 and 0.18; mean of component two between 0.097 and 0.101; SD of each component between 0.023 and 0.024). When an individual is genotyped and is the common homozygote at rs174549, they have a 3.6% chance of having their AA/DGLA ratio in the first component. However, if the individual has one less common allele, his chance increases to 18.3%, and with a second copy of the minor allele, it will increase to 93.7%.
Table 7

Most significant SNPs in each region

rs## of SNPs 1MAFPosGeneLRTpro p-valuep01p11p21μ1μ2σ
rs107511240.34685432084DLG22.50x10−80.0620.1140.1620.1740.1000.023
rs1122065810.35099618283CNTN54.52x10−70.1100.0750.0510.1790.1010.024
rs712901550.1981107724851.86x10−70.1050.0590.0340.1790.1010.024
rs1121775310.1671201814152.94x10−90.1080.0520.0250.1800.1010.024
rs174549190.29061803910FADS15.32x10−3120.0360.1830.9370.1600.0970.024

4. Discussion

GWAS typically utilize linear models, thus making an assumption about the underlying normality of the data. When data is not normal, a Gaussian mixture distribution may represent a statistically justified and biologically interpretable model of the data. We proposed a constrained likelihood ratio test, which across many simulation settings, was more powerful than the standard linear model and gave accurate parameter estimates. When applied to a real dataset, the method identified biologically relevant SNPs in the well understood FADS region, along with parameter estimates to aid in biological interpretability of the impact of the SNP. The general LRT framework proposed here shows reasonably good performance compared to the additive linear model, but can be improved upon by further constraining the model and ‘saving’ a degree of freedom. Our simulations suggest relatively robust performance of the constrained methods (LRTpro and LRTadd) to misspecification of the true model though additional simulations across a wider range of misspecifications are needed. We note that, due to the use of the EM algorithm to generate parameter estimates for use in the LRT, computational time for our proposed methods (3 minutes per test on a single processor with a sample size of 10,000) are much greater than that of the traditional linear model. Nevertheless, with the increasing computational power and the limited number of high minor allele frequency SNPs, it is plausible to run GWAS with this method and is a reasonable option for candidate gene approaches. Further work is necessary to investigate potential areas of computational improvement. Numerous areas of future work and extension are possible. First, extensions of this work are needed to incorporate covariates and family structure into the method. Standard methods (e.g., first modeling the phenotype by covariates and/or family structure and then modeling the residuals) make normality assumptions and, so, may not be optimal candidates for extension in this Gaussian mixture modeling framework. Imputed data often provides dosages instead of discrete genotypes. Work is needed to extend this framework to allow for dosages in this testing framework. Further applications to genome wide data is necessary to fully understand the impact of this new method. Finally, extensions for multiple-marker testing and relaxing the equal variance assumption are also targets for further exploration. We have developed a likelihood ratio test that analyzes the differences in mixing proportions between genotypes. The method and null distribution were validated through simulation. There was notable power increase over the more commonly used linear model, especially when we further increased power by restricting the model to incorporate prior biological belief. We have shown that this method is able to accurately predict model parameters. The model was applied to real data, and it replicated many previous findings while also providing more interpretable results. Further work is necessary to apply the model to a wider range of real metabolomics data and to investigate extensions of the model to handle covariates and imputed genotypes.
  11 in total

1.  Clinical correlates and heritability of erythrocyte eicosapentaenoic and docosahexaenoic acid content in the Framingham Heart Study.

Authors:  William S Harris; James V Pottala; Sean M Lacey; Ramachandran S Vasan; Martin G Larson; Sander J Robins
Journal:  Atherosclerosis       Date:  2012-06-07       Impact factor: 5.162

Review 2.  Genetics of the Framingham Heart Study population.

Authors:  Diddahally R Govindaraju; L Adrienne Cupples; William B Kannel; Christopher J O'Donnell; Larry D Atwood; Ralph B D'Agostino; Caroline S Fox; Marty Larson; Daniel Levy; Joanne Murabito; Ramachandran S Vasan; Greta Lee Splansky; Philip A Wolf; Emelia J Benjamin
Journal:  Adv Genet       Date:  2008       Impact factor: 1.944

3.  Human metabolic individuality in biomedical and pharmaceutical research.

Authors:  So-Youn Shin; Ann-Kristin Petersen; Nicole Soranzo; Christian Gieger; Karsten Suhre; Robert P Mohney; David Meredith; Brigitte Wägele; Elisabeth Altmaier; Panos Deloukas; Jeanette Erdmann; Elin Grundberg; Christopher J Hammond; Martin Hrabé de Angelis; Gabi Kastenmüller; Anna Köttgen; Florian Kronenberg; Massimo Mangino; Christa Meisinger; Thomas Meitinger; Hans-Werner Mewes; Michael V Milburn; Cornelia Prehn; Johannes Raffler; Janina S Ried; Werner Römisch-Margl; Nilesh J Samani; Kerrin S Small; H-Erich Wichmann; Guangju Zhai; Thomas Illig; Tim D Spector; Jerzy Adamski
Journal:  Nature       Date:  2011-08-31       Impact factor: 49.962

4.  Plasma concentrations of trans fatty acids in persons with type 2 diabetes between September 2002 and April 2004.

Authors:  Dawn C Schwenke; John P Foreyt; Edgar R Miller; Rebecca S Reeves; Mara Z Vitolins
Journal:  Am J Clin Nutr       Date:  2013-02-27       Impact factor: 7.045

5.  Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium: Design of prospective meta-analyses of genome-wide association studies from 5 cohorts.

Authors:  Bruce M Psaty; Christopher J O'Donnell; Vilmundur Gudnason; Kathryn L Lunetta; Aaron R Folsom; Jerome I Rotter; André G Uitterlinden; Tamara B Harris; Jacqueline C M Witteman; Eric Boerwinkle
Journal:  Circ Cardiovasc Genet       Date:  2009-02

Review 6.  10 Years of GWAS Discovery: Biology, Function, and Translation.

Authors:  Peter M Visscher; Naomi R Wray; Qian Zhang; Pamela Sklar; Mark I McCarthy; Matthew A Brown; Jian Yang
Journal:  Am J Hum Genet       Date:  2017-07-06       Impact factor: 11.025

7.  Genetic loci associated with plasma phospholipid n-3 fatty acids: a meta-analysis of genome-wide association studies from the CHARGE Consortium.

Authors:  Rozenn N Lemaitre; Toshiko Tanaka; Weihong Tang; Ani Manichaikul; Millennia Foy; Edmond K Kabagambe; Jennifer A Nettleton; Irena B King; Lu-Chen Weng; Sayanti Bhattacharya; Stefania Bandinelli; Joshua C Bis; Stephen S Rich; David R Jacobs; Antonio Cherubini; Barbara McKnight; Shuang Liang; Xiangjun Gu; Kenneth Rice; Cathy C Laurie; Thomas Lumley; Brian L Browning; Bruce M Psaty; Yii-Der I Chen; Yechiel Friedlander; Luc Djousse; Jason H Y Wu; David S Siscovick; André G Uitterlinden; Donna K Arnett; Luigi Ferrucci; Myriam Fornage; Michael Y Tsai; Dariush Mozaffarian; Lyn M Steffen
Journal:  PLoS Genet       Date:  2011-07-28       Impact factor: 5.917

8.  Red blood cell fatty acid patterns and acute coronary syndrome.

Authors:  Gregory C Shearer; James V Pottala; John A Spertus; William S Harris
Journal:  PLoS One       Date:  2009-05-06       Impact factor: 3.240

9.  The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports.

Authors:  L Adrienne Cupples; Heather T Arruda; Emelia J Benjamin; Ralph B D'Agostino; Serkalem Demissie; Anita L DeStefano; Josée Dupuis; Kathleen M Falls; Caroline S Fox; Daniel J Gottlieb; Diddahally R Govindaraju; Chao-Yu Guo; Nancy L Heard-Costa; Shih-Jen Hwang; Sekar Kathiresan; Douglas P Kiel; Jason M Laramie; Martin G Larson; Daniel Levy; Chun-Yu Liu; Kathryn L Lunetta; Matthew D Mailman; Alisa K Manning; James B Meigs; Joanne M Murabito; Christopher Newton-Cheh; George T O'Connor; Christopher J O'Donnell; Mona Pandey; Sudha Seshadri; Ramachandran S Vasan; Zhen Y Wang; Jemma B Wilk; Philip A Wolf; Qiong Yang; Larry D Atwood
Journal:  BMC Med Genet       Date:  2007       Impact factor: 2.103

10.  Computing power and sample size for case-control association studies with copy number polymorphism: application of mixture-based likelihood ratio test.

Authors:  Wonkuk Kim; Derek Gordon; Jonathan Sebat; Kenny Q Ye; Stephen J Finch
Journal:  PLoS One       Date:  2008-10-22       Impact factor: 3.240

View more
  1 in total

1.  PRECISION MEDICINE: FROM DIPLOTYPES TO DISPARITIES TOWARDS IMPROVED HEALTH AND THERAPIES.

Authors:  Dana C Crawford; Alexander A Morgan; Joshua C Denny; Bruce J Aronow; Steven E Brenner
Journal:  Pac Symp Biocomput       Date:  2018
  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.