Accounting for relatedness in family-based association studies: application to Genetic Analysis Workshop 18 data.

Jakris Eu-Ahsunthornwattana¹, Richard A J Howey², Heather J Cordell².

Abstract

In the last few years, a bewildering variety of methods/software packages that use linear mixed models to account for sample relatedness on the basis of genome-wide genomic information have been proposed. We compared these approaches as implemented in the programs EMMAX, FaST-LMM, Gemma, and GenABEL (FASTA/GRAMMAR-Gamma) on the Genetic Analysis Workshop 18 data. All methods performed quite similarly and were successful in reducing the genomic control inflation factor to reasonable levels, particularly when the mean values of the observations were used, although more variation was observed when data from each time point were used individually. From a practical point of view, we conclude that it makes little difference to the results which method/software package is used, and the user can make the choice of package on the basis of personal taste or computational speed/convenience.

Year: 2014 PMID: 25519407 PMCID: PMC4143672 DOI: 10.1186/1753-6561-8-S1-S79

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

A number of different methods/software packages have been proposed in the last few years that implement linear mixed-model approaches to account for population structure and relatedness among samples in genome-wide association studies (GWAS), but no detailed comparisons among them have been made before our effort. Indeed, when a new method/package is developed, it is often quite unclear whether or how it differs substantially from those already available. To address this question, we explored the performance of various implementations of such methods in the longitudinal Genetic Analysis Workshop 18 (GAW18) data set.

Methods

We analyzed the GAW18 GWAS data [1] using the real phenotypes and the first set of simulated phenotypes. This analysis was performed without knowledge of the underlying simulating model. The genotype data were cleaned using standard procedures [2]. This resulted in 4 individuals being excluded because they had no genotype data at all, and another individual being excluded because of outlying ethnicity (Chinese [CHB] or Japanese [JPT]), leaving 954 individuals whose genotype data were used. We removed 43,987 monomorphic or low-frequency (minor allele frequency [MAF] <1%) single-nucleotide polymorphisms (SNPs), 109 SNPs with a missing rate above 10% (this criterion took into account the apparently high missing rate in some SNPs, which is likely to have been caused by differences in the genotyping technology used across samples), and 1 SNP that failed Hardy-Weinberg equilibrium testing in the control founder population. A total of 427,952 SNPs were retained for analysis.

We conducted linear regression of the real and simulated systolic blood pressure and the simulated diastolic blood pressure at each time point on age, medication, and smoking status. For the real diastolic blood pressure, which seemed to have a nonlinear relationship with age (as could be expected physiologically), we used a quadratic regression that included age and age squared as predictors. The phenotype data from all individuals were used for these regressions. Residuals from these regressions in subjects who also had genotype data were then used for the genome-wide analyses.

Genome-wide association analyses, adjusting for familial relatedness using genomic data, were performed using a variety of linear mixed model approaches. All approaches attempt to fit the model $y = X\beta + Q + \varepsilon$, where $y = (y_1, \ldots, y_n)^T$ is a vector of responses on $n$ subjects; $X = (x_{ik})$ is the $n \times K$ matrix of predictor values for variables to be modeled as fixed effects (including covariates and genotypes at any SNPs currently under test); $\beta = (\beta_1, \ldots, \beta_K)^T$ are regression coefficients (to be estimated) representing the linear effects of the predictors on the response; $Q$ are random effects, $Q \sim N(0, 2\sigma_g^2 \Phi)$; and $\varepsilon$ are random errors, $\varepsilon \sim N(0, \sigma_e^2 I)$. Here $\sigma_g^2$ and $\sigma_e^2$ are parameters (to be estimated) representing the genetic and environmental components of variance, respectively; $\Phi$ is the $n \times n$ matrix of pairwise kinship coefficients; and $I$ is the $n \times n$ identity matrix. The approaches vary with respect to the precise details of the calculation of kinship or "relatedness" and with respect to whether an exact method or a fast approximation is used (for more details, see the descriptions in references [3-9]).

In each case we used a subset of 21,153 SNPs to perform the relatedness calculations, namely SNPs with MAF >0.4 and <5% missing data, "pruned" to be in approximate linkage equilibrium via the PLINK command "--indep 50 5 2". In analyses of other data sets we have found little difference between the results obtained when using such a pruned set of SNPs for calculating relatedness and those obtained when using the full set of SNPs (data not shown).

The methods considered were: (a) EMMAX [3], which implements 2 methods for relatedness calculations: one based on identity-by-state (IBS) sharing and one based on the Balding-Nichols method [4]; (b) FaST-LMM [5], which also implements 2 methods to adjust for relatedness: one using a standard covariance matrix and one using the realized relationship matrix; (c) the polygenic/mmscore functions in GenABEL [6], which implement the FASTA method [7]; (d) the polygenic/grammar functions in GenABEL, which implement the GRAMMAR-Gamma approximation [8]; and (e) Gemma [9], which uses an efficient exact method. Simple linear regression without any relatedness adjustment was also performed in FaST-LMM.

All analyses were performed in two ways: using the residual from each individual observation, modeled without regard to its true longitudinal nature (referred to as "longitudinal"), and using the mean of the residuals for each subject (referred to as "mean").
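
As an illustration of what a genotype-based relatedness calculation involves, the following is a minimal sketch of one widely used estimator: a relationship matrix built from allele-frequency-standardized genotypes. It is not the code of any of the packages compared here, each of which has its own definition; the function name `realized_relationship_matrix`, the 0/1/2 genotype coding, and the absence of missing data are assumptions made for this example. Under the usual convention, a matrix of this kind estimates $2\Phi$ rather than $\Phi$.

```python
import math


def realized_relationship_matrix(genotypes):
    """Estimate an n x n genomic relationship matrix (hypothetical helper).

    genotypes: list of n rows (one per individual), each a list of m allele
    counts coded 0, 1 or 2. This sketch assumes there are no missing values.
    """
    n = len(genotypes)
    standardized_columns = []
    for snp in zip(*genotypes):  # iterate over SNPs (columns)
        p = sum(snp) / (2.0 * n)  # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNPs carry no information
        scale = math.sqrt(2.0 * p * (1.0 - p))
        # Center by the expected count 2p and scale by its binomial s.d.
        standardized_columns.append([(g - 2.0 * p) / scale for g in snp])
    if not standardized_columns:
        raise ValueError("no polymorphic SNPs available")
    m = len(standardized_columns)
    # Average the products of standardized genotypes over SNPs.
    return [
        [sum(col[i] * col[k] for col in standardized_columns) / m
         for k in range(n)]
        for i in range(n)
    ]


# Toy data: individuals 0 and 1 have identical genotypes.
toy_genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 1],
    [1, 2, 1, 1],
]
K = realized_relationship_matrix(toy_genotypes)
```

In the toy data, the two individuals with identical genotypes receive the same estimated relationship to each other as each has with itself, whereas individuals with largely opposite genotypes receive a negative value.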
Genomic inflation factors (λ) were calculated as proposed by Devlin and Roeder [10]. We also assessed the genomic inflation factors for unadjusted χ2 and Cochran-Armitage trend tests of hypertension status at each time point as calculated using PLINK [11].
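
The commonly used median-based form of the genomic inflation factor can be sketched as follows: convert each p value to a 1-degree-of-freedom χ2 statistic and divide the median of these statistics by the median of the χ2 distribution with 1 degree of freedom (approximately 0.455). This is a sketch under stated assumptions rather than the exact code used in this study; the function name `genomic_inflation_factor` is hypothetical, and the input is assumed to be two-sided p values in the interval (0, 1].

```python
from statistics import NormalDist, median

_STANDARD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.455), obtained as the
# square of the 75th percentile of the standard normal distribution.
CHI2_1DF_MEDIAN = _STANDARD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation_factor(p_values):
    """Median-based genomic inflation factor (hypothetical helper).

    p_values: iterable of two-sided p values, each in (0, 1].
    """
    # A two-sided p value corresponds to the squared normal quantile at p / 2,
    # which is a chi-square statistic with 1 degree of freedom.
    chi2_stats = [_STANDARD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Evenly spaced p values mimic a well-calibrated (null) analysis.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation_factor(null_p)

# Squaring the p values makes them systematically too small (inflation).
inflated_p = [p ** 2 for p in null_p]
lambda_inflated = genomic_inflation_factor(inflated_p)
```

For the evenly spaced p values the estimate is close to 1, and for the artificially small p values it is well above 1, which is the pattern expected when relatedness is left unmodeled.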

Results and discussion

Figure 1 shows the Q-Q plots and genomic inflation factors for the different methods. It is well known that population substructure and relatedness will cause an inflated distribution of genome-wide association test statistics (λ >1.00) if not appropriately modeled. All methods performed reasonably well for the mean residuals, keeping λ between 0.99 and 1.03. For longitudinal data, most methods also performed well, with λ in the range of 0.95 to 1.05, except perhaps for GRAMMAR-Gamma, which gave λ values of approximately 1.08 to 1.09 for the simulated phenotypes. However, even these values were much less inflated than the λ values of 1.22 to 1.68 (mean) and 2.04 to 3.41 (longitudinal) seen in the unadjusted analyses. The higher inflation in the longitudinal analyses (even when adjusting for relatedness) could be expected because additional (nongenetic) within-subject correlation was not allowed for in these analyses; indeed, one could argue that this behavior is statistically the "correct" behavior, with GRAMMAR-Gamma (which gave the highest inflation) showing the "most correct" behavior. Interestingly, EMMAX using the IBS matrix seemed to show the opposite behavior, for reasons we are currently unable to determine.

Figure 1

Q-Q plots and genomic inflation factors for different methods. These were calculated for each phenotype (real diastolic blood pressure [DBP], real systolic blood pressure [SBP], simulated DBP, and simulated SBP), using either longitudinal ("long") or average ("mean") residuals. EM_BN, EMMAX using Balding-Nichols matrix; EM_IBS, EMMAX using IBS matrix; FLM_C, FaST-LMM using standard covariance matrix; FLM_R, FaST-LMM using realized relationship matrix; GA_FA, GenABEL/FASTA; GA_GRG, GenABEL/GRAMMAR-Gamma; GMA_C, Gemma using centralized covariance matrix; GMA_S, Gemma using standardized covariance matrix. The diagonal line represents the identity line in each panel.

For the analyses using hypertension status, the unadjusted genomic inflation factors were between 1.21 and 1.55 for the Cochran-Armitage trend test and between 1.01 and 1.27 for the χ2 test. Figure 2 compares the individual −log10 p values from different methods. Most methods gave highly concordant results, particularly EMMAX (BN) and Gemma, whereas the 2 GenABEL methods were similar but less concordant. This is analogous to findings on single-observation data by Zhou and Stephens [9]. FaST-LMM tended to perform slightly differently from the other methods at SNPs with lower significance, although the results overall were still quite similar.
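
One simple numeric summary of this kind of concordance is the Pearson correlation between the −log10 p values that two methods assign to the same SNPs. The sketch below is illustrative only and is not necessarily how the comparison in Figure 2 was produced; the function names and the toy p values are assumptions made for the example.

```python
import math


def neg_log10(p_values):
    """Transform p values (each in (0, 1]) to the -log10 scale."""
    return [-math.log10(p) for p in p_values]


def pearson_correlation(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical p values for the same four SNPs from two methods.
p_method_a = [0.5, 0.01, 0.2, 1e-4]
p_method_b = [0.4, 0.02, 0.25, 5e-4]

concordance = pearson_correlation(neg_log10(p_method_a), neg_log10(p_method_b))
```

A value near 1 indicates that the two methods rank and scale the SNPs almost identically on the −log10 scale.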

Figure 2

Comparison of −log10 p values from different methods. The upper triangles show the values based on mean residuals, while the lower triangles show the values calculated using longitudinal data. DBP, diastolic blood pressure; EM_BN, EMMAX using Balding-Nichols matrix; EM_IBS, EMMAX using IBS matrix; FLM_C, FaST-LMM using standard covariance matrix; FLM_R, FaST-LMM using realized relationship matrix; GA_FA, GenABEL/FASTA; GA_GRG, GenABEL/GRAMMAR-Gamma; GMA_C, Gemma using centralized covariance matrix; GMA_S, Gemma using standardized covariance matrix; SBP, systolic blood pressure.

Figure 3 shows a selection of Manhattan plots. For each phenotype, the results from all methods were quite similar, although the longitudinal data tended to show stronger signals. No clearly significant SNP was found for any phenotype. This is not surprising given the relatively small size of the GAW18 data set, which is underpowered for detecting (at genome-wide levels of significance) anything other than strong genetic effects. The high concordance in significance levels (at any given SNP) achieved by the different software packages (see Figure 2) indicates that no package is substantially more powerful than another, as expected from the fact that all packages implement slightly different versions of essentially the same statistical model.

Figure 3

A selection of Manhattan plots. DBP, diastolic blood pressure; EM_BN, EMMAX using Balding-Nichols matrix; FLM_R, FaST-LMM using realized relationship matrix; GA_FA, GenABEL/FASTA; SBP, systolic blood pressure.

Although the results from all packages considered here were similar, the implementations did vary in speed. All packages performed the analysis in reasonable time (less than 1 day) on our system. Precise timings will depend on the computer resources and architecture available, but as a rule of thumb we found FaST-LMM and GRAMMAR-Gamma to be the fastest (taking just a few hours), followed by EMMAX and Gemma, which took 12 to 16 hours, and GenABEL/FASTA, which took 18 to 20 hours.

Conclusions

All methods performed well and results were similar, particularly at the most significant SNPs. We conclude that (at least for nonlongitudinal traits) it makes little difference to the results which method/software package is used, and the user can make the choice of package on the basis of personal taste, speed, or computational convenience. For longitudinal traits (modeled without regard to their longitudinal nature) the slight differences seen between the methods would be an interesting topic for further investigation, but it is beyond the scope of the current article.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JE conducted the statistical analyses and drafted the manuscript. RAJH prepared the data and conducted statistical analyses. HJC conceived the overall study and critically revised the manuscript. All authors read and approved the final manuscript.

11 in total

1. Genomic control for association studies.

Authors: B Devlin; K Roeder
Journal: Biometrics Date: 1999-12 Impact factor: 2.571

2. Improved linear mixed models for genome-wide association studies.

Authors: Jennifer Listgarten; Christoph Lippert; Carl M Kadie; Robert I Davidson; Eleazar Eskin; David Heckerman
Journal: Nat Methods Date: 2012-05-30 Impact factor: 28.547

3. GenABEL: an R library for genome-wide association analysis.

Authors: Yurii S Aulchenko; Stephan Ripke; Aaron Isaacs; Cornelia M van Duijn
Journal: Bioinformatics Date: 2007-03-23 Impact factor: 6.937

4. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

5. Family-based association tests for genomewide association scans.

Authors: Wei-Min Chen; Goncalo R Abecasis
Journal: Am J Hum Genet Date: 2007-09-18 Impact factor: 11.025

6. Rapid variance components-based method for whole-genome association analysis.

Authors: Gulnara R Svishcheva; Tatiana I Axenovich; Nadezhda M Belonogova; Cornelia M van Duijn; Yurii S Aulchenko
Journal: Nat Genet Date: 2012-09-16 Impact factor: 38.330

7. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity.

Authors: D J Balding; R A Nichols
Journal: Genetica Date: 1995 Impact factor: 1.082

8. Data quality control in genetic case-control association studies.

Authors: Carl A Anderson; Fredrik H Pettersson; Geraldine M Clarke; Lon R Cardon; Andrew P Morris; Krina T Zondervan
Journal: Nat Protoc Date: 2010-08-26 Impact factor: 13.491

9. Genome-wide efficient mixed-model analysis for association studies.

Authors: Xiang Zhou; Matthew Stephens
Journal: Nat Genet Date: 2012-06-17 Impact factor: 38.330

10. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees.

Authors: Laura Almasy; Thomas D Dyer; Juan M Peralta; Goo Jun; Andrew R Wood; Christian Fuchsberger; Marcio A Almeida; Jack W Kent; Sharon Fowler; Tom W Blackwell; Sobha Puppala; Satish Kumar; Joanne E Curran; Donna Lehman; Goncalo Abecasis; Ravindranath Duggirala; John Blangero
Journal: BMC Proc Date: 2014-06-17
