
Comparison of several sequence-based association methods in pedigrees.

George Mathew¹, Varghese George², Hongyan Xu².

Abstract

Genome-wide association studies are very powerful in determining the genetic variants affecting complex diseases. Most of the available methods are well suited to detecting association between common variants and complex diseases. Recently, methods to detect rare variants associated with complex diseases have been developed as data from next-generation sequencing have become increasingly available. In this paper, we evaluate and compare several of these recent methods for performing statistical association using whole genome sequencing data in pedigrees. Specifically, functional principal component analysis (FPCA), the extended combined multivariate and collapsing (CMC) method for families, a generalized T2 method, and the chi-square minimum approach were compared by analyzing all the genetic variants, common and rare, of both the real data set and the simulated data set provided as part of Genetic Analysis Workshop 18.


Year: 2014 PMID: 25519329 PMCID: PMC4143807 DOI: 10.1186/1753-6561-8-S1-S48

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

With advances in genotyping technologies, genome-wide association studies (GWAS) became a very popular procedure for identifying genes underlying diseases and other traits by conducting statistical tests on many thousands of single-nucleotide polymorphisms (SNPs). The procedure has great potential for discovering genetic variants influencing complex diseases. However, these procedures have discovered loci that account for only a small percentage of phenotypic variance [1]. One of the reasons for this difficulty may be that rare variants might explain disease susceptibility [2-4]. Recently, several methods have been developed to determine the influence of rare variants on complex diseases. These methods differ from the traditional methods of testing, where the focus has been on individual common variants. By convention, variants with a population frequency greater than 5% are considered common variants, those with less than 1% population frequency rare variants, and the rest low-frequency variants [4]. The common variants are believed to be from distant ancestors, whereas rare variants are from recent ancestors [5]. Most of these methods assume the individuals are independent and are designed for population-based data. Only recently have several methods been developed that can perform statistical association of sequence data in pedigrees. In this paper, we used functional principal component analysis (FPCA) [4], the generalized T2 approach [4], the combined multivariate and collapsing (CMC) test for family data [2,4], and the chi-square minimum approach for family data [4] to analyze association of the dichotomous hypertension trait with all genetic variants, common and rare, of the real data set and all replicates of chromosome 3 of the simulated data set provided by Genetic Analysis Workshop 18 (GAW 18) [6]. We compared the results to assess the merits of these methods.

Methods

An extension of the generalized T2 test [7] to family-based association studies is provided by Zhu and Xiong [4]. The test statistic is given by $T^2_F = T^2 / P_{corr}$, where $T^2$ is the generalized T2 statistic [7], and $P_{corr}$ [4, p. 1030] is the correction factor that accounts for the familial correlation in the pedigree data. A similar extension of the CMC test is provided by equation (15) in Zhu and Xiong [4]. The test statistic is given by $T_{FCMC} = T_{CMC} / P_{corr}$, where $T_{CMC}$ is the CMC statistic for the population-based association test, and $P_{corr}$ is the correction factor that adjusts the $T_{CMC}$ statistic so that it is valid for pedigree data. The FPCA statistic for the population-based association test in Luo et al [8] has a similar extension for family data [4], given by $T_{FFPCA} = T_{FPCA} / P_{corr}$, where $T_{FPCA}$ and $P_{corr}$ are defined as in the previous test statistics for pedigree data. Also, $T^2_F$, $T_{FCMC}$, and $T_{FFPCA}$ have chi-square distributions [4]. The chi-square minimum statistic chooses the minimum of the p-values from the individual chi-square tests for each genetic variant in a genomic region. The chi-square minimum statistic (Chi_min) adjusts for relatedness of pedigree members using $P_{corr}$ [4].

We applied the above 4 methods to analyze the real data set from all odd-numbered chromosomes using hypertension status at exam 1 as the phenotype. The genotypes at each variant are coded as 0, 1, or 2 for aa, Aa, and AA, where allele A is the minor allele. The start and end boundaries of all the human genes were obtained from the hg19 genome assembly at NCBI. The genetic variants within 1 gene were analyzed together because each gene is 1 natural functional unit. For the FPCA method, if there are too few genetic variants, that is, fewer than 3, the estimate of the functional relation of the allele counts across the genetic variants will be far off. Consequently, genes with fewer than 3 genetic variants were not analyzed. The significant genes from the 4 methods were then compared with the findings from previous GWAS for genes associated with blood pressure.

To examine the type I error rate and power of the 4 methods, we applied these methods to the 200 replicates of the simulated data set and analyzed the data from chromosome 3. As in the real data analysis, we chose the hypertension status at exam 1 as the phenotype.
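The data preparation described above (minor-allele coding, grouping variants by gene boundaries, and skipping genes with fewer than 3 variants) can be sketched as follows. This is only an illustrative sketch and not the authors' code: the function names, the tuple layouts, and the toy coordinates are hypothetical, and inclusive gene boundaries are an assumption.

```python
from collections import defaultdict

MIN_VARIANTS = 3  # genes with fewer than 3 variants were not analyzed


def code_genotype(allele_pair, minor):
    """Count copies of the minor allele: 0 (aa), 1 (Aa), or 2 (AA)."""
    return sum(1 for allele in allele_pair if allele == minor)


def group_variants_by_gene(variants, genes):
    """Map each gene name to the variant positions inside its boundaries.

    variants: iterable of (chromosome, position) tuples
    genes: iterable of (name, chromosome, start, end), bounds assumed inclusive
    """
    by_gene = defaultdict(list)
    for name, chrom, start, end in genes:
        for v_chrom, pos in variants:
            if v_chrom == chrom and start <= pos <= end:
                by_gene[name].append(pos)
    return dict(by_gene)


def analyzable_genes(by_gene, min_variants=MIN_VARIANTS):
    """Keep only genes with enough variants for a gene-level test."""
    return {g: v for g, v in by_gene.items() if len(v) >= min_variants}


# Toy data with hypothetical coordinates (not taken from GAW 18).
genes = [("GENE_A", "3", 100, 200), ("GENE_B", "3", 300, 400)]
variants = [("3", 110), ("3", 150), ("3", 190), ("3", 310), ("3", 500)]

grouped = group_variants_by_gene(variants, genes)
kept = analyzable_genes(grouped)
print(kept)  # GENE_B has a single variant and is dropped
```

The nested loop is quadratic and is kept only for readability; with whole-genome data an interval index or sorted positions would be the natural choice.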

Results

Table 1 gives the number of significant genes at several α levels from the real data analysis for a total of 10,580 genes from the odd-numbered human chromosomes. The FPCA method finds the fewest significant genes of the 4 methods. Chi_min finds the highest number of significant genes at the 0.05 and 0.01 levels. However, T2 finds more significant genes at lower significance levels (0.001, 0.0001, and 4.7 × 10⁻⁶). The numbers of genes significant at the 4.7 × 10⁻⁶ level by the FPCA, Chi_min, CMC, and T2 methods are 0, 15, 598, and 1794, respectively. Figure 1 is a Venn diagram showing overlaps of the significant genes from Chi_min, CMC, and T2 at the 4.7 × 10⁻⁶ level. It is interesting to note that all 598 significant genes found by CMC overlap with those found by T2.

Table 1

Number of significant genes out of 10,580 genes in the odd-numbered human chromosomes of the real data set at various significance levels

Method	Significance level
	0.05	0.01	0.001	0.0001	4.7 × 10⁻⁶
FPCA	158	33	3	1	0
Chi_min	8321	5123	1402	172	15
T²	3902	3079	2436	2050	1794
CMC	2083	1329	907	717	598
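The most stringent threshold in Table 1 is consistent with a Bonferroni correction over the 10,580 genes tested. The short check below assumes a family-wise level of 0.05, which is not stated explicitly alongside the table, and the variable names are illustrative only.

```python
N_GENES = 10_580          # genes tested on the odd-numbered chromosomes
FAMILY_WISE_ALPHA = 0.05  # assumed family-wise error level

# Bonferroni: divide the family-wise level by the number of tests.
threshold = FAMILY_WISE_ALPHA / N_GENES
print(f"{threshold:.1e}")  # 4.7e-06

# If no gene were associated, a valid test at the uncorrected 0.05 level
# would still be expected to flag about this many genes by chance.
expected_null_hits = round(FAMILY_WISE_ALPHA * N_GENES)
print(expected_null_hits)  # 529
```

Comparing the 0.05 column of Table 1 against the chance expectation of roughly 529 genes gives a rough sense of how far each method departs from what a well-calibrated test would produce under no association.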

Figure 1

Venn diagram showing overlaps of the significant genes from Chi_min, CMC, and T2 at the 4.7 × 10⁻⁶ level from the analysis of all odd-numbered chromosomes of the real data set.

The number of significant genes presented in Table 1 will contain false positives, as with any statistical test. To get an idea of the number of "true findings," we compared our results with GWAS findings for blood-pressure-associated genes. We performed a comprehensive literature review, and 84 genes were identified as being associated with blood pressure from GWAS. Table 2 shows the number of overlapped genes between our analysis and the GWAS findings.

Table 2

Number of overlapped genes associated with blood pressure from GWAS findings at various significance levels

Method	0.05	0.01	0.001	0.0001	4.7 × 10⁻⁶
FPCA	0	0	0	0	0
Chi_min	36	23	9	1	0
T²	20	18	14	12	12
CMC	12	10	7	5	4

We analyzed chromosome 3 of the simulated data set. There are a total of 1120 genes on chromosome 3, of which 30 were used for causal variants of hypertension in the simulation model. The remaining 1090 were assumed to be unrelated to the disease and are used only for calculating the type I error rate. The linkage disequilibrium (LD) between the genetic variants from these groups of 1090 genes and 30 genes was analyzed with Haploview [9], and no significant LD was found. Table 3 lists the type I error rates from the analysis of all 200 replicates by all 4 methods at various significance levels.

Table 3

Type I error probability estimates by FPCA, Chi_min, T2, and CMC methods from all 200 replicates of chromosome 3 of the simulated data set

α	FPCA	Chi_min	T²	CMC
0.05	0.02567	0.86265	0.05061	0.04763
0.01	0.00656	0.61023	0.01202	0.00908
0.001	0.00096	0.25719	0.00093	0.00136
0.0001	0.00016	0.07272	0.00013	0.00011

The analysis of the 30 positive genes is used to calculate the power of the various methods. Table 4 lists the estimates of the power by the various methods.

Table 4

Estimates of power by FPCA, Chi_min, T2, and CMC methods from all 200 replicates of chromosome 3 of the simulated data set

α	FPCA	Chi_min	T²	CMC
0.05	0.045	0.95433	0.6585	0.338
0.01	0.01883	0.72117	0.57717	0.24583
0.001	0.00483	0.33233	0.50117	0.18083
0.0001	0.00117	0.09667	0.448	0.14233
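The entries in Tables 3 and 4 are empirical rejection rates. A minimal sketch of how such rates can be computed is given below, assuming that every gene is tested in every replicate, that rejection means p < α (the strictness of the inequality is an assumption), and that results are pooled over genes and replicates. The function name and the toy p-values are hypothetical.

```python
def empirical_rate(p_values, alpha):
    """Fraction of tests whose p-value falls below alpha."""
    p_values = list(p_values)
    if not p_values:
        raise ValueError("need at least one p-value")
    return sum(p < alpha for p in p_values) / len(p_values)


N_REPLICATES = 200
N_CAUSAL_GENES = 30    # used for power
N_NULL_GENES = 1090    # used for type I error

power_tests = N_REPLICATES * N_CAUSAL_GENES  # 6000 tests behind each power entry
null_tests = N_REPLICATES * N_NULL_GENES     # 218000 tests behind each type I entry
print(power_tests, null_tests)

# Toy p-values (hypothetical): two of the four fall below 0.05.
toy_p = [0.001, 0.02, 0.2, 0.6]
print(empirical_rate(toy_p, 0.05))  # 0.5

# Under the pooling assumption, Table 4 entries are multiples of 1/6000,
# for example 3951/6000 and 113/6000.
print(round(3951 / power_tests, 5))  # 0.6585
print(round(113 / power_tests, 5))   # 0.01883
```

Applied to the p-values of the 1090 null genes, the same function gives a type I error estimate; applied to the 30 causal genes, it gives a power estimate.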


Discussion

With sequence data from next-generation sequencing technologies becoming increasingly available, it is important for a statistical association method to handle both common and rare genetic variants. It is also important for these methods to handle data from pedigrees because rare genetic variants are enriched in families with multiple affected individuals, which could confer more statistical power. From our analysis of the real data, T2 seems to be a better method than the other 3 methods because it finds more significant genes at low significance levels. At the Bonferroni-corrected p-value of 4.7 × 10⁻⁶, T2 identified the genes CASZ1, ADAMTS8, NUCB2, ABCC8, SLC4A7, MAP4, CASR, EBF1, PLEKHA7, SOX6, ULK4, and MECOM. The last 4 genes were also identified by the CMC method. All the genes mentioned above have been reported to be associated with blood pressure, in particular ULK4 and PLEKHA7 by Levy et al [10], and MAP4 by Wain et al [11]. As with GWAS, we need to keep a low significance level to account for multiple testing.

We note from the analysis of the simulated data that FPCA has an empirical type I error rate much lower than the nominal value, making it very conservative. The Chi_min method has an inflated type I error rate. The type I error rates of T2 and CMC are close to the nominal value. Also, T2 has better power than CMC, which is consistent with the result from the real data.

From the analysis of the data sets, we find that T2 is a better method, which differs from the findings of Zhu and Xiong [4], who suggest that FPCA is a better procedure. There are 2 possible reasons why FPCA performs less well here. One reason may be that the SNPs in the genes are sparse. If there are too few SNPs in 1 gene, FPCA may not perform well because the number of SNPs is not enough to estimate the function describing the allele counts across the SNPs in the gene. A second reason may be that the assumption of a smooth function of the allele counts across the SNPs, on which FPCA relies, may not hold for the GAW 18 data sets.

We observed a large overlap between the results of CMC and T2. This mainly comes from the fact that CMC uses the T2 approach with common variants. Both the CMC and T2 methods also tend to pick up genes that contain more variants.
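One possible contributor to the inflated type I error of a minimum-p-value statistic can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that the smallest of m independent per-variant p-values is compared directly with α without any adjustment for m. Neither assumption is confirmed for the Chi_min implementation used here, and variants within a gene are typically correlated, so the numbers are indicative only.

```python
def min_p_type1(alpha, m):
    """Chance that the smallest of m independent null p-values is below alpha."""
    return 1.0 - (1.0 - alpha) ** m


# Hypothetical numbers of independent variants per gene.
for m in (1, 5, 10, 40):
    print(m, round(min_p_type1(0.05, m), 3))
```

Under these assumptions the false positive rate grows quickly with the number of variants in a gene, which is at least qualitatively in line with the pattern in the Chi_min column of Table 3.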

Conclusions

From the analysis results of both real and simulated data, T2 is a preferable method for pedigree-based association studies with whole-genome sequencing data because it controls the false positive rate and is more powerful than the other two methods with similar type I error rates.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HX conceived of the study, performed the analysis, and helped to draft the manuscript. GM performed the analysis and helped to draft the manuscript. VG participated in the design and coordination of the study. All authors read and approved the manuscript.

11 in total

1. Generalized T2 test for genome association studies.

Authors: Momiao Xiong; Jinying Zhao; Eric Boerwinkle
Journal: Am J Hum Genet Date: 2002-03-29 Impact factor: 11.025

2. Haploview: analysis and visualization of LD and haplotype maps.

Authors: J C Barrett; B Fry; J Maller; M J Daly
Journal: Bioinformatics Date: 2004-08-05 Impact factor: 6.937

3. Clan genomics and the complex architecture of human disease.

Authors: James R Lupski; John W Belmont; Eric Boerwinkle; Richard A Gibbs
Journal: Cell Date: 2011-09-30 Impact factor: 41.582

4. Family-based association studies for next-generation sequencing.

Authors: Yun Zhu; Momiao Xiong
Journal: Am J Hum Genet Date: 2012-06-08 Impact factor: 11.025

5. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms.

Authors: Ivan P Gorlov; Olga Y Gorlova; Shamil R Sunyaev; Margaret R Spitz; Christopher I Amos
Journal: Am J Hum Genet Date: 2008-01 Impact factor: 11.025

6. Association studies for next-generation sequencing.

Authors: Li Luo; Eric Boerwinkle; Momiao Xiong
Journal: Genome Res Date: 2011-04-26 Impact factor: 9.043

Review 7. Common vs. rare allele hypotheses for complex diseases.

Authors: Nicholas J Schork; Sarah S Murray; Kelly A Frazer; Eric J Topol
Journal: Curr Opin Genet Dev Date: 2009-05-28 Impact factor: 5.578

8. Genome-wide association study of blood pressure and hypertension.

Authors: Daniel Levy; Georg B Ehret; Kenneth Rice; Germaine C Verwoert; Lenore J Launer; Abbas Dehghan; Nicole L Glazer; Alanna C Morrison; Andrew D Johnson; Thor Aspelund; Yurii Aulchenko; Thomas Lumley; Anna Köttgen; Ramachandran S Vasan; Fernando Rivadeneira; Gudny Eiriksdottir; Xiuqing Guo; Dan E Arking; Gary F Mitchell; Francesco U S Mattace-Raso; Albert V Smith; Kent Taylor; Robert B Scharpf; Shih-Jen Hwang; Eric J G Sijbrands; Joshua Bis; Tamara B Harris; Santhi K Ganesh; Christopher J O'Donnell; Albert Hofman; Jerome I Rotter; Josef Coresh; Emelia J Benjamin; André G Uitterlinden; Gerardo Heiss; Caroline S Fox; Jacqueline C M Witteman; Eric Boerwinkle; Thomas J Wang; Vilmundur Gudnason; Martin G Larson; Aravinda Chakravarti; Bruce M Psaty; Cornelia M van Duijn
Journal: Nat Genet Date: 2009-05-10 Impact factor: 38.330

9. Genome-wide association study identifies six new loci influencing pulse pressure and mean arterial pressure.

Authors: Louise V Wain; Germaine C Verwoert; Paul F O'Reilly; Gang Shi; Toby Johnson; Andrew D Johnson; Murielle Bochud; Kenneth M Rice; Peter Henneman; Albert V Smith; Georg B Ehret; Najaf Amin; Martin G Larson; Vincent Mooser; David Hadley; Marcus Dörr; Joshua C Bis; Thor Aspelund; Tõnu Esko; A Cecile J W Janssens; Jing Hua Zhao; Simon Heath; Maris Laan; Jingyuan Fu; Giorgio Pistis; Jian'an Luan; Pankaj Arora; Gavin Lucas; Nicola Pirastu; Irene Pichler; Anne U Jackson; Rebecca J Webster; Feng Zhang; John F Peden; Helena Schmidt; Toshiko Tanaka; Harry Campbell; Wilmar Igl; Yuri Milaneschi; Jouke-Jan Hottenga; Veronique Vitart; Daniel I Chasman; Stella Trompet; Jennifer L Bragg-Gresham; Behrooz Z Alizadeh; John C Chambers; Xiuqing Guo; Terho Lehtimäki; Brigitte Kühnel; Lorna M Lopez; Ozren Polašek; Mladen Boban; Christopher P Nelson; Alanna C Morrison; Vasyl Pihur; Santhi K Ganesh; Albert Hofman; Suman Kundu; Francesco U S Mattace-Raso; Fernando Rivadeneira; Eric J G Sijbrands; Andre G Uitterlinden; Shih-Jen Hwang; Ramachandran S Vasan; Thomas J Wang; Sven Bergmann; Peter Vollenweider; Gérard Waeber; Jaana Laitinen; Anneli Pouta; Paavo Zitting; Wendy L McArdle; Heyo K Kroemer; Uwe Völker; Henry Völzke; Nicole L Glazer; Kent D Taylor; Tamara B Harris; Helene Alavere; Toomas Haller; Aime Keis; Mari-Liis Tammesoo; Yurii Aulchenko; Inês Barroso; Kay-Tee Khaw; Pilar Galan; Serge Hercberg; Mark Lathrop; Susana Eyheramendy; Elin Org; Siim Sõber; Xiaowen Lu; Ilja M Nolte; Brenda W Penninx; Tanguy Corre; Corrado Masciullo; Cinzia Sala; Leif Groop; Benjamin F Voight; Olle Melander; Christopher J O'Donnell; Veikko Salomaa; Adamo Pio d'Adamo; Antonella Fabretto; Flavio Faletra; Sheila Ulivi; Fabiola M Del Greco; Maurizio Facheris; Francis S Collins; Richard N Bergman; John P Beilby; Joseph Hung; A William Musk; Massimo Mangino; So-Youn Shin; Nicole Soranzo; Hugh Watkins; Anuj Goel; Anders Hamsten; Pierre Gider; Marisa Loitfelder; Marion Zeginigg; Dena Hernandez; Samer S 
Najjar; Pau Navarro; Sarah H Wild; Anna Maria Corsi; Andrew Singleton; Eco J C de Geus; Gonneke Willemsen; Alex N Parker; Lynda M Rose; Brendan Buckley; David Stott; Marco Orru; Manuela Uda; Melanie M van der Klauw; Weihua Zhang; Xinzhong Li; James Scott; Yii-Der Ida Chen; Gregory L Burke; Mika Kähönen; Jorma Viikari; Angela Döring; Thomas Meitinger; Gail Davies; John M Starr; Valur Emilsson; Andrew Plump; Jan H Lindeman; Peter A C 't Hoen; Inke R König; Janine F Felix; Robert Clarke; Jemma C Hopewell; Halit Ongen; Monique Breteler; Stéphanie Debette; Anita L Destefano; Myriam Fornage; Gary F Mitchell; Nicholas L Smith; Hilma Holm; Kari Stefansson; Gudmar Thorleifsson; Unnur Thorsteinsdottir; Nilesh J Samani; Michael Preuss; Igor Rudan; Caroline Hayward; Ian J Deary; H-Erich Wichmann; Olli T Raitakari; Walter Palmas; Jaspal S Kooner; Ronald P Stolk; J Wouter Jukema; Alan F Wright; Dorret I Boomsma; Stefania Bandinelli; Ulf B Gyllensten; James F Wilson; Luigi Ferrucci; Reinhold Schmidt; Martin Farrall; Tim D Spector; Lyle J Palmer; Jaakko Tuomilehto; Arne Pfeufer; Paolo Gasparini; David Siscovick; David Altshuler; Ruth J F Loos; Daniela Toniolo; Harold Snieder; Christian Gieger; Pierre Meneton; Nicholas J Wareham; Ben A Oostra; Andres Metspalu; Lenore Launer; Rainer Rettig; David P Strachan; Jacques S Beckmann; Jacqueline C M Witteman; Jeanette Erdmann; Ko Willems van Dijk; Eric Boerwinkle; Michael Boehnke; Paul M Ridker; Marjo-Riitta Jarvelin; Aravinda Chakravarti; Goncalo R Abecasis; Vilmundur Gudnason; Christopher Newton-Cheh; Daniel Levy; Patricia B Munroe; Bruce M Psaty; Mark J Caulfield; Dabeeru C Rao; Martin D Tobin; Paul Elliott; Cornelia M van Duijn
Journal: Nat Genet Date: 2011-09-11 Impact factor: 38.330

10. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees.

Authors: Laura Almasy; Thomas D Dyer; Juan M Peralta; Goo Jun; Andrew R Wood; Christian Fuchsberger; Marcio A Almeida; Jack W Kent; Sharon Fowler; Tom W Blackwell; Sobha Puppala; Satish Kumar; Joanne E Curran; Donna Lehman; Goncalo Abecasis; Ravindranath Duggirala; John Blangero
Journal: BMC Proc Date: 2014-06-17
