Robust joint analysis with data fusion in two-stage quantitative trait genome-wide association studies.

Dong-Dong Pan, Wen-Jun Xiong, Ji-Yuan Zhou, Ying Pan, Guo-Li Zhou, Wing-Kam Fung.

Abstract

Genome-wide association studies (GWASs) have proved to be a pioneering approach for identifying disease-associated genetic variants. Two-stage design and analysis are often adopted in GWASs. To account for genetic model uncertainty, many robust procedures have been proposed and applied in GWASs. However, the existing approaches mostly focus on binary traits, and little work has been done on continuous (quantitative) traits, since the statistical significance of these robust tests is difficult to calculate. In this paper, we develop a powerful F-statistic-based robust joint analysis method for quantitative traits using the combined raw data from both stages in the framework of two-stage GWASs. Explicit expressions are obtained to calculate the statistical significance and power. We show using simulations that the proposed method is substantially more robust than the F-test based on the additive model when the underlying genetic model is unknown. An example for rheumatoid arthritis (RA) is used for illustration.

Year: 2013 PMID: 24288575 PMCID: PMC3832968 DOI: 10.1155/2013/843563

Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238

1. Introduction

Genome-wide association studies (GWASs) have identified a large number of genomic regions (especially single-nucleotide polymorphisms (SNPs)) associated with a wide variety of complex traits and diseases. In a GWAS, the two most common types of data, qualitative (or binary) and quantitative (or continuous) traits, are analyzed, and two contentious points are often faced: one is how to construct the test statistic in view of genetic model uncertainty, and the other is how to evaluate the statistical significance so as to control the false positive rate efficiently (e.g., [1, 2]). Considering these issues, a lot of work has been done on binary traits in the past 10 years (e.g., [3-7]). Computer algorithms have also been developed to calculate the significance level of robust tests in GWASs, taking into account the genetic model uncertainty [8]. However, little work has been done on continuous traits; only recently So and Sham [9] proposed a MAX3 based on score test statistics, and Li et al. [10] gave a MAX3 based on F-test statistics. Note that these tests focus only on single-marker analysis in a one-stage design. Although the cost of whole-genome genotyping is decreasing with high-throughput biological technology, the total cost of a GWAS is still very high because of the thousands of sampling units and the huge number of SNPs. To save costs, the two-stage design and the corresponding statistical analysis are often adopted in practice (e.g., [11-15]): all the SNPs are genotyped in Stage 1 on a portion of the samples, and the promising SNPs with small P-values (e.g., <0.001) based on some efficient tests are further screened on the remaining subjects. In genetic association studies, especially GWASs, genetic markers are routinely tested under the assumption of additive effects. Although convenient to use, those tests are optimal only when the true underlying genetic model is additive, so they are not robust against genetic model misspecification. To the best of our knowledge, little work has been done on the two-stage joint analysis for quantitative trait GWASs allowing for genetic model uncertainty. Here, we attempt to develop a joint analysis method with data fusion in the two-stage design using the F-statistic, since the F-test is commonly employed with the linear regression model for quantitative traits, and Li et al. [10] showed by extensive numerical simulation that MAX3 based on F-statistics is more powerful than So and Sham's method. The content of this paper is organized as follows. In Section 2, we give some notation and the proposed robust joint test statistics. Further, we derive the asymptotic distribution of the test statistics under the null and the alternative hypotheses. In Section 3, we show from the numerical results of a power comparison that the proposed joint analysis method is substantially more robust than the additive-model-based F-test when the real genetic model is unknown. After that, an illustrative example for rheumatoid arthritis (RA) is presented. Finally, we give some discussion in Section 4.

2. Methods

2.1. Notations

Assume that $n$ individuals are randomly selected to be genotyped in a two-stage GWAS for a certain quantitative trait and that $\pi$ is the sampling proportion in Stage 1. Let $n_1 = n\pi$ and $n_2 = n(1 - \pi)$ be the sample sizes for Stages 1 and 2, respectively. Consider a biallelic marker with two alleles $G$ and $g$. Without loss of generality, we assume that $G$ is the minor or high-risk allele. We suppose that a total of $m$ SNPs are genotyped on the samples of Stage 1, and SNPs with P-values less than $\gamma$ in Stage 1 will be further genotyped and tested in Stage 2. Let the significance level be $\alpha$; then the genome-wide significance level per SNP is $\alpha/m$ with the Bonferroni adjustment. Let $Y_1 = (y_1, y_2, \ldots, y_{n_1})'$ and $Y_2 = (y_{n_1+1}, y_{n_1+2}, \ldots, y_n)'$ be the observed quantitative outcome vectors for Stage 1 and Stage 2, respectively. Without loss of generality, we assume that the first $n_{10}$ individuals in Stage 1 have the genotype $gg$, the second $n_{11}$ individuals in Stage 1 have the genotype $Gg$, and the last $n_{12}$ subjects in Stage 1 possess the genotype $GG$. Similarly, the first $n_{20}$ subjects in Stage 2 have the genotype $gg$, the second $n_{21}$ individuals in Stage 2 have the genotype $Gg$, and the last $n_{22}$ subjects in Stage 2 possess the genotype $GG$. Let $0_k = (0, 0, \ldots, 0)'$ and $1_k = (1, 1, \ldots, 1)'$ be vectors of length $k$, let $O_{k \times j}$ be the $k \times j$ matrix with all its entries being zero, and let $I_n$ be the $n \times n$ identity matrix.

2.2. F-Statistic-Based Robust Joint Analysis

We first briefly introduce the F-statistic-based MAX3 of Li et al. [10], using only the data from Stage 1. Consider the following linear regression model:

$$y_i = \beta_0 + \beta_1 g_i + \varepsilon_i, \quad i = 1, 2, \ldots, n_1,$$

where $\beta_0$ is the nuisance parameter for the intercept, $\beta_1$ is the parameter of interest for the genetic effect, $\varepsilon_i$ is the random error, and $g_i$ is the genotype value, which takes 0, 1, or 2 corresponding to the count of $G$ at a marker locus for the $i$th subject. The hypotheses of interest are

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0.$$

The variable $g_i$ in the model above is coded differently for the three common genetic models. Let $X_1^R = (1_{n_1}, G_1^R)$, $X_1^A = (1_{n_1}, G_1^A)$, and $X_1^D = (1_{n_1}, G_1^D)$ be the design matrices under the three commonly used genetic models, where $G_1^R = (0'_{n_{10}+n_{11}}, 1'_{n_{12}})'$ corresponds to the recessive model, $G_1^A = (g_1, g_2, \ldots, g_{n_1})'$ corresponds to the additive model, and $G_1^D = (0'_{n_{10}}, 1'_{n_{11}+n_{12}})'$ is for the dominant model. Denote $X_1 = (1_{n_1}, x_{11}, x_{12})$, where $x_{11} = (0'_{n_{10}}, 1'_{n_{11}}, 0'_{n_{12}})'$ and $x_{12} = (0'_{n_{10}}, 0'_{n_{11}}, 1'_{n_{12}})'$. The modified F-test statistics under the recessive, additive, and dominant models for Stage 1, denoted by $F_1^R$, $F_1^A$, and $F_1^D$, are given by [equation missing in source], where [definitions missing in source]. The robust test statistic in Stage 1 is

$$F_{1\mathrm{MAX}} = \max(F_1^R, F_1^A, F_1^D).$$

We now give the proposed robust joint analysis. In the framework of a two-stage GWAS of quantitative traits, the SNPs with P-values less than $\gamma$ will be genotyped on the remaining $n_2$ subjects in Stage 2. Following the previous notation for Stage 1, corresponding to the recessive, additive, and dominant models, the genotype data in Stage 2 are denoted by $G_2^R = (0'_{n_{20}+n_{21}}, 1'_{n_{22}})'$, $G_2^A = (g_{n_1+1}, g_{n_1+2}, \ldots, g_n)'$, and $G_2^D = (0'_{n_{20}}, 1'_{n_{21}+n_{22}})'$, respectively, and the design matrices are $X_2^R = (1_{n_2}, G_2^R)$, $X_2^A = (1_{n_2}, G_2^A)$, and $X_2^D = (1_{n_2}, G_2^D)$, respectively. Denote $X_2 = (1_{n_2}, x_{21}, x_{22})$, where $x_{21} = (0'_{n_{20}}, 1'_{n_{21}}, 0'_{n_{22}})'$ and $x_{22} = (0'_{n_{20}}, 0'_{n_{21}}, 1'_{n_{22}})'$. Then, we can similarly obtain three modified F-test statistics under the recessive, additive, and dominant models for Stage 2, and we denote them by $F_2^R$, $F_2^A$, and $F_2^D$. Let $Y = (Y_1', Y_2')'$, $G^R = (G_1^{R\prime}, G_2^{R\prime})'$, $G^A = (G_1^{A\prime}, G_2^{A\prime})'$, and $G^D = (G_1^{D\prime}, G_2^{D\prime})'$. Denote $N_0 = n_{10} + n_{20}$, $N_1 = n_{11} + n_{21}$, and $N_2 = n_{12} + n_{22}$ for the combined sample sizes from the two stages, corresponding to the three genotypes. Then the proposed F-test statistics $F^R$, $F^A$, and $F^D$ under the three genetic models on the basis of the combined data are [equation missing in source], where $X^R = (1_n, G^R)$, $X^A = (1_n, G^A)$, $X^D = (1_n, G^D)$, and [definitions missing in source]. Furthermore, we propose the joint test statistic

$$F_{\mathrm{MAX}} = \max(F^R, F^A, F^D).$$

In order to calculate the power of the proposed joint analysis, we have to obtain the thresholds, which are determined by the significance level. Denote the threshold for choosing the promising SNPs in Stage 1 by $u_1$, which is the solution of

$$\Pr\nolimits_{H_0}(F_{1\mathrm{MAX}} > u_1) = \gamma.$$

Since the genome-wide significance level is $\alpha/m$, in order to control the false positive rate, we have

$$\Pr\nolimits_{H_0}(F_{1\mathrm{MAX}} > u_1, F_{\mathrm{MAX}} > u) = \frac{\alpha}{m}, \qquad (10)$$

where $u$ is the cut-off point for the joint statistic. Once we have $u_1$ and $u$, the power is calculated by

$$\text{Power} = \Pr\nolimits_{H_1}(F_{1\mathrm{MAX}} > u_1, F_{\mathrm{MAX}} > u).$$

We now give the details for calculating the cut-off point and the power above. The left side of (10) can be further expressed as

$$\Pr(F_{1\mathrm{MAX}} > u_1, F_{\mathrm{MAX}} > u) = 1 - \Pr(F_{1\mathrm{MAX}} \le u_1) - \Pr(F_{\mathrm{MAX}} \le u) + \Pr(F_{1\mathrm{MAX}} \le u_1, F_{\mathrm{MAX}} \le u).$$

For controlling the type I error rate and calculating the power, we need to know the distribution or the asymptotic distribution of $(Z_1^R, Z_1^A, Z_1^D, Z^R, Z^A, Z^D, \mathrm{RSS}_1, \mathrm{RSS})'$ under both $H_0$ and $H_1$. Note that whether $H_0$ or $H_1$ holds, $\mathrm{RSS}_1$, $\mathrm{RSS}$, and $(Z_1^R, Z_1^A, Z_1^D, Z^R, Z^A, Z^D)'$ are mutually independent (the proof is given in Appendix A). Denote the correlation matrix of $(Z_1^R, Z_1^A, Z_1^D)'$ by $V_1 = (v_{ij})_{3 \times 3}$, whose diagonal entries are $v_{11} = v_{22} = v_{33} = 1$ and whose off-diagonal entries are [expressions missing in source]. Similarly, let $V = (v_{ij}^*)_{3 \times 3}$ be the correlation matrix of $(Z^R, Z^A, Z^D)'$ with $v_{11}^* = v_{22}^* = v_{33}^* = 1$ and off-diagonal entries [expressions missing in source]. Then, we can derive that $\mathrm{RSS}_1/\sigma^2$ and $\mathrm{RSS}/\sigma^2$ each follow a chi-square distribution [degrees of freedom missing in source], and [equation missing in source], where $\rho = (\rho_{ij})_{3 \times 3}$ is the correlation matrix between $(Z_1^R, Z_1^A, Z_1^D)'$ and $(Z^R, Z^A, Z^D)'$, with [expressions missing in source].

Under $H_1$, for a given odds ratio $\mathrm{OR} = \exp(\beta_1)$ for subjects with two copies of the risk allele (recessive model) or one copy of the risk allele (additive or dominant model), we have the following. When the true genetic model is recessive, [equation missing in source], where $\mu = (\mu_1^R, \mu_1^A, \mu_1^D, \mu^R, \mu^A, \mu^D)'$ with [expressions missing in source]. When the true genetic model is additive, [equation missing in source], where $\mu = (\mu_1^R, \mu_1^A, \mu_1^D, \mu^R, \mu^A, \mu^D)'$ with [expressions missing in source]. When the true genetic model is dominant, [equation missing in source], where $\mu = (\mu_1^R, \mu_1^A, \mu_1^D, \mu^R, \mu^A, \mu^D)'$ with [expressions missing in source].

We develop a method for simplifying the calculations of $\Pr(F_{1\mathrm{MAX}} \le u_1)$, $\Pr(F_{\mathrm{MAX}} \le u)$, and $\Pr(F_{1\mathrm{MAX}} \le u_1, F_{\mathrm{MAX}} \le u)$. The details are included in Appendix B; the calculations of the three probabilities are essentially similar.
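The three genotype codings and the maximum over the model-specific F-statistics can be illustrated with a short Python sketch. It is not the authors' implementation: the exact form of the modified F-statistics is not reproduced in this text, so the sketch assumes that each numerator is the regression sum of squares under one coding and that the common denominator is the residual mean square of the saturated genotype model (one mean per genotype), in line with the remark in the Discussion that the denominator is constructed without assuming any genetic model. The names `CODINGS`, `regression_ss`, `genotypic_rss`, and `max_f` are hypothetical.

```python
# Covariate value assigned to each genotype (count of allele G) under each model.
CODINGS = {
    "REC": {0: 0, 1: 0, 2: 1},  # effect only for GG
    "ADD": {0: 0, 1: 1, 2: 2},  # effect proportional to the count of G
    "DOM": {0: 0, 1: 1, 2: 1},  # effect for Gg and GG
}

def regression_ss(x, y):
    """Sum of squares explained by a simple linear regression of y on x."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy ** 2 / sxx if sxx > 0 else 0.0

def genotypic_rss(g, y):
    """Residual sum of squares when each genotype group gets its own mean."""
    rss = 0.0
    for k in (0, 1, 2):
        group = [b for a, b in zip(g, y) if a == k]
        if group:
            mean = sum(group) / len(group)
            rss += sum((b - mean) ** 2 for b in group)
    return rss

def max_f(g, y):
    """Return the model-specific F statistics and their maximum (MAX3-type)."""
    if len(y) <= 3:
        raise ValueError("need more than 3 subjects")
    # ASSUMPTION: common denominator = residual mean square of the
    # saturated genotype model with n - 3 degrees of freedom.
    denom = genotypic_rss(g, y) / (len(y) - 3)
    if denom == 0:
        raise ValueError("zero residual variance")
    f = {m: regression_ss([c[a] for a in g], y) / denom
         for m, c in CODINGS.items()}
    return f, max(f.values())

# Toy data: only GG subjects differ, so the recessive coding fits best.
toy_g = [0, 0, 1, 1, 2, 2]
toy_y = [1.0, 2.0, 1.0, 2.0, 5.0, 6.0]
toy_f, toy_f_max = max_f(toy_g, toy_y)
print(toy_f, toy_f_max)
```

Under these assumptions, the Stage 1 statistic would come from calling `max_f` on the first $n_1$ subjects and the joint statistic from calling it on the combined data of both stages. Converting the statistics into thresholds or P-values requires the joint distribution derived in the paper and is not attempted here.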

3. Results

3.1. Power Comparison

We conduct simulation studies to evaluate the performance of the proposed method under three commonly used genetic models (recessive, additive, and dominant). We mainly compare the power of two approaches: one is the proposed method, and the other is the joint analysis based on the additive-model F-test statistics $F_1^A$ and $F^A$. For convenience, we refer to the proposed method as MAXFJ and to the other one as AFJ. We choose the sample size $n = 2000$ and $m = 5 \times 10^5$. The proportion of subjects genotyped in Stage 1 has three levels, $\pi = 0.3, 0.4, 0.5$. We set the genome-wide significance level to $\alpha = 0.05$, so that the significance level per SNP is $\alpha/m = 1 \times 10^{-7}$. In Stage 1, the P-value threshold for SNPs selected for follow-up is set to $1 \times 10^{-4}$ and $2 \times 10^{-4}$. We assume that Hardy-Weinberg equilibrium holds in the general sample population, so that there are on average $n \times (1 - \mathrm{MAF})^2$, $2n \times \mathrm{MAF} \times (1 - \mathrm{MAF})$, and $n \times \mathrm{MAF}^2$ individuals with genotypes $gg$, $Gg$, and $GG$, respectively, where the minor allele frequency (MAF) is set to 0.15, 0.30, and 0.45. To make the power comparison more distinct, we specify different genetic effect parameters $\beta_1$ under the three genetic models as follows: $\beta_1 = 0.5$ for the recessive model, $\beta_1 = 0.3$ for the additive model, and $\beta_1 = 0.4$ for the dominant model. The power results are displayed in Tables 1 and 2 for $\gamma = 1 \times 10^{-4}$ and $\gamma = 2 \times 10^{-4}$, respectively. They indicate that MAXFJ has greater efficiency robustness than AFJ across the various inheritance models. As expected, AFJ is more powerful than MAXFJ under the additive model. However, MAXFJ is much more powerful than AFJ when the true genetic model is recessive. For instance, in Table 2, with $\pi = 0.4$ and MAF = 0.3, the powers of AFJ and MAXFJ are 0.101 and 0.529, respectively. In summary, MAXFJ is substantially more powerful than AFJ in two-stage GWASs of quantitative traits when the model for AFJ is misspecified.
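The arithmetic behind this simulation design can be checked with a few lines of Python. The sketch below only evaluates the quantities stated above (expected genotype counts under Hardy-Weinberg equilibrium, the Bonferroni-adjusted level, and the number of null SNPs expected to pass Stage 1); the function name `hwe_counts` is hypothetical, and the sketch does not reproduce the power calculation itself.

```python
def hwe_counts(n, maf):
    """Expected numbers of gg, Gg, and GG subjects under Hardy-Weinberg equilibrium."""
    return n * (1 - maf) ** 2, 2 * n * maf * (1 - maf), n * maf ** 2

n, m, alpha = 2000, 5 * 10**5, 0.05
per_snp_level = alpha / m  # Bonferroni-adjusted significance level per SNP

for maf in (0.15, 0.30, 0.45):
    n_gg, n_Gg, n_GG = hwe_counts(n, maf)
    print(f"MAF={maf:.2f}: gg={n_gg:.1f}, Gg={n_Gg:.1f}, GG={n_GG:.1f}")

for gamma in (1e-4, 2e-4):
    # If no SNP is associated, about m * gamma SNPs pass Stage 1 by chance.
    print(f"gamma={gamma:g}: expected null SNPs carried to Stage 2 = {m * gamma:.0f}")
```

With MAF = 0.15, only about 45 of the 2000 subjects are expected to carry genotype $GG$, which is consistent with the very low power of both methods under the recessive model in the first row of each table.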

Table 1

Power comparison ($n = 2000$, $\gamma = 1 \times 10^{-4}$, $\alpha = 0.05$, and $m = 5 \times 10^5$).

		REC		ADD		DOM
π	MAF	AFJ	MAXFJ	AFJ	MAXFJ	AFJ	MAXFJ
0.30	0.15	7.5e-5	0.005	0.426	0.365	0.610	0.618
	0.30	0.052	0.285	0.811	0.759	0.698	0.784
	0.45	0.487	0.785	0.893	0.854	0.449	0.647

0.40	0.15	1.1e-4	0.009	0.651	0.589	0.826	0.837
	0.30	0.086	0.470	0.945	0.922	0.887	0.938
	0.45	0.711	0.938	0.979	0.968	0.677	0.859

0.50	0.15	1.0e-4	0.010	0.802	0.751	0.933	0.941
	0.30	0.121	0.639	0.987	0.980	0.965	0.986
	0.45	0.856	0.987	0.997	0.995	0.826	0.953

Table 2

Power comparison ($n = 2000$, $\gamma = 2 \times 10^{-4}$, $\alpha = 0.05$, and $m = 5 \times 10^5$).

		REC		ADD		DOM
π	MAF	AFJ	MAXFJ	AFJ	MAXFJ	AFJ	MAXFJ
0.30	0.15	1.3e-4	0.006	0.489	0.426	0.676	0.681
	0.30	0.066	0.340	0.852	0.806	0.754	0.828
	0.45	0.556	0.833	0.922	0.891	0.516	0.706

0.40	0.15	1.2e-4	0.011	0.709	0.651	0.866	0.876
	0.30	0.101	0.529	0.961	0.943	0.916	0.956
	0.45	0.765	0.957	0.987	0.979	0.732	0.892

0.50	0.15	1.7e-4	0.012	0.838	0.793	0.951	0.958
	0.30	0.133	0.683	0.992	0.987	0.975	0.991
	0.45	0.888	0.992	0.998	0.997	0.860	0.967

3.2. An Illustrative Example: Rheumatoid Arthritis

Rheumatoid arthritis (RA) is an autoimmune disease (resulting in a chronic systemic inflammatory disorder) which mainly attacks synovial joints. About 1% of the general adult population worldwide is affected by RA [16]. It has been pointed out that genetic variants might play a major role in RA susceptibility [17]. The Genetic Analysis Workshop 16 (GAW16) data, based on the North American Rheumatoid Arthritis Consortium (NARAC), come from a GWAS testing association with RA using about $5 \times 10^5$ SNPs [18-20]. The study included 868 individuals who were RA positive (cases) and had measurements of the continuous trait anti-cyclic citrullinated peptide (anti-CCP), and 1194 controls without RA, sampled from the New York Cancer Project (NYCP), who had no anti-CCP measurements. Huizinga et al. [21] pointed out that higher anti-CCP levels are linked to better prediction of an increased risk of developing RA. Chen et al. [22] showed that SNP rs2476601, located in PTPN22, had the most significant association with RA. Here, we focus only on SNP rs2476601 and apply the two joint analysis methods (AFJ and MAXFJ) to evaluate its statistical significance. The minimum anti-CCP value among the 868 cases was assigned to each control, and a log transformation of anti-CCP was applied in the analysis. We then considered three simulation settings, $\pi = 0.3, 0.4, 0.5$. For $\pi = 0.3$, thirty percent of the individuals were randomly sampled from all cases and controls and used as the Stage 1 data, and the remaining individuals were treated as the Stage 2 data. The P-values of AFJ and MAXFJ were then calculated. We repeated this procedure 1,000 times and saved the corresponding P-values. Each P-value was transformed to $-\log_{10} P$, and the histogram and density of the transformed values were obtained (Figure 1). Similarly, we conducted the simulation and calculation for $\pi = 0.4$ and $0.5$, and the corresponding histograms and densities are presented in Figures 2 and 3. Examination of Figures 1–3 shows that the P-values of MAXFJ are more stable than those of AFJ: the estimated density curves of MAXFJ are closer to a symmetric normal distribution, while those of AFJ are rather skewed. This indicates that MAXFJ performs more robustly when the real genetic model is unknown.
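The repeated random splitting described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: `split_stages` and `resample_neg_log10_p` are hypothetical names, and `p_value_fn` stands for any user-supplied function that returns the joint-analysis P-value (for AFJ or MAXFJ) given the Stage 1 and Stage 2 subject indices. The analytic P-value calculation itself is not implemented here.

```python
import math
import random

def split_stages(n, pi, rng):
    """Randomly assign round(n * pi) of n subject indices to Stage 1, the rest to Stage 2."""
    idx = list(range(n))
    rng.shuffle(idx)
    n1 = round(n * pi)
    return idx[:n1], idx[n1:]

def resample_neg_log10_p(n, pi, p_value_fn, reps=1000, seed=0):
    """Repeat the random stage split and collect -log10 of each P-value.

    p_value_fn(stage1_idx, stage2_idx) is a placeholder supplied by the caller;
    it must return a P-value in (0, 1].
    """
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        stage1, stage2 = split_stages(n, pi, rng)
        values.append(-math.log10(p_value_fn(stage1, stage2)))
    return values

# Stage sizes for the 868 cases plus 1194 controls at each sampling proportion.
n_total = 868 + 1194
for pi in (0.3, 0.4, 0.5):
    s1, s2 = split_stages(n_total, pi, random.Random(0))
    print(pi, len(s1), len(s2))
```

The returned list of transformed values is what would be passed to a histogram or density estimator to produce plots of the kind summarized in Figures 1–3.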

Figure 1

The histogram and density of $-\log_{10} P$ when $\pi = 0.3$ (the left panel corresponds to MAXFJ and the right panel to AFJ).

Figure 2

The histogram and density of $-\log_{10} P$ when $\pi = 0.4$ (the left panel corresponds to MAXFJ and the right panel to AFJ).

Figure 3

The histogram and density of $-\log_{10} P$ when $\pi = 0.5$ (the left panel corresponds to MAXFJ and the right panel to AFJ).

4. Discussion

We have developed a feasible two-stage design and the corresponding robust joint analysis approach for quantitative trait GWASs. The method is based on the F-statistics over three different genetic models. The denominator of the F-statistic used here, which is constructed without assuming any genetic model, differs from the commonly used one. This choice reduces the computational intensity. Taking advantage of a suitably constructed design matrix, we obtain a common denominator for the three F-test statistics in the joint analysis with combined raw data from both stages. The statistical significance (P-value) of the proposed joint analysis method can be calculated from the derived analytic expressions on the basis of the asymptotic distributions, which greatly reduces the complexity and computational intensity compared with resampling-type permutation and bootstrap procedures. Our numerical results demonstrate that this approach has greater efficiency robustness against genetic model uncertainty than the F-statistic-based joint analysis which assumes the additive genetic model. In this work, we did not investigate the power of joint analysis based on other existing robust association methods for quantitative traits, such as So and Sham's method. We find it very difficult to extend So and Sham's method (score test-based MAX3) to two-stage GWASs with quantitative outcomes, since it is almost impossible to derive the joint distribution of the score tests from the two stages. For simplicity, we do not take into account the effects of covariates in the two-stage design considered here. However, in real applications, the proposed method can easily be applied to situations including one or more covariates, as shown for the original MAXF by Li et al. [10]. It is important to stress that we combine the raw data from the two stages to construct the joint statistic, unlike the joint analysis for binary traits, which uses the weighted sum of the two statistics from Stages 1 and 2 [12]. Furthermore, one basic assumption in this paper is that the effect sizes of genetic variants are identical between the two stages (i.e., no heterogeneity exists), which is a natural and reasonable precondition for the data fusion strategy. In addition, population-based genetic association studies may be affected by population stratification, which needs to be examined in future research.

20 in total

1. Characterizing the quantitative genetic contribution to rheumatoid arthritis using data from twins.

Authors: A J MacGregor; H Snieder; A S Rigby; M Koskenvuo; J Kaprio; K Aho; A J Silman
Journal: Arthritis Rheum Date: 2000-01

2. Joint analysis of binary and quantitative traits with data sharing and outcome-dependent sampling.

Authors: Gang Zheng; Colin O Wu; Minjung Kwak; Wenhua Jiang; Jungnam Joo; Joao A C Lima
Journal: Genet Epidemiol Date: 2012-03-28 Impact factor: 2.135

3. A genome-wide association study identifies novel risk loci for type 2 diabetes.

Authors: Robert Sladek; Ghislain Rocheleau; Johan Rung; Christian Dina; Lishuang Shen; David Serre; Philippe Boutin; Daniel Vincent; Alexandre Belisle; Samy Hadjadj; Beverley Balkau; Barbara Heude; Guillaume Charpentier; Thomas J Hudson; Alexandre Montpetit; Alexey V Pshezhetsky; Marc Prentki; Barry I Posner; David J Balding; David Meyre; Constantin Polychronakos; Philippe Froguel
Journal: Nature Date: 2007-02-11 Impact factor: 49.962

4. A powerful approach for association analysis incorporating imprinting effects.

Authors: Fan Xia; Ji-Yuan Zhou; Wing Kam Fung
Journal: Bioinformatics Date: 2011-07-28 Impact factor: 6.937

5. Refining the complex rheumatoid arthritis phenotype based on specificity of the HLA-DRB1 shared epitope for antibodies to citrullinated proteins.

Authors: Tom W J Huizinga; Christopher I Amos; Annette H M van der Helm-van Mil; Wei Chen; Floris A van Gaalen; Damini Jawaheer; Geziena M T Schreuder; Mark Wener; Ferdinand C Breedveld; Naila Ahmad; Raymond F Lum; Rene R P de Vries; Peter K Gregersen; Rene E M Toes; Lindsey A Criswell
Journal: Arthritis Rheum Date: 2005-11