An enhanced version of Cochran-Armitage trend test for genome-wide association studies.

Mansi Ghodsi¹, Saeid Amiri², Hossein Hassani³, Zara Ghodsi¹.

Abstract

In genome-wide association studies, the association between a candidate gene and disease status is widely evaluated using the Cochran-Armitage trend test. However, only a small number of research papers have evaluated the distribution of p-values for the Cochran-Armitage trend test. In this paper, an enhanced version of the Cochran-Armitage trend test based on the bootstrap approach is introduced. The results confirm that the distribution of p-values of the proposed approach fits the uniform distribution better, and it is thus concluded that the proposed method, which needs fewer assumptions than the conventional method, can be successfully used to test genetic association.

Keywords: Bootstrap method; Chi-squared test; Contingency table; Genetic association; Monte Carlo simulation; p-values

Year: 2016 PMID: 27617223 PMCID: PMC5006094 DOI: 10.1016/j.mgene.2016.07.001

Source DB: PubMed Journal: Meta Gene ISSN: 2214-5400

Introduction

A central goal of genome-wide association studies (GWAS) is to identify genetic risk factors for complex disorders. To find genetic risk factors for disease in a population, GWAS measure DNA sequence variation across the human genome (Bush and Moore, 2012). Practitioners in the medical sciences and bioinformatics use GWAS to investigate associations in a range of disorders, for example different cancers (Easton and Eeles, 2008) and pancreatic cancer (Amundadottir et al., 2009). The idea is that genetic variants whose alleles are common in the population may explain much of the heritability of common diseases; see (Reich and Lander, 2001) and (Schork et al., 2009). Reviews of GWAS can be found in several texts and papers; see (Moore et al., 2010) among others. In the simplest form of association mapping, a set of markers is genotyped in a sample of cases and a sample of unrelated controls, and association tests are then used to study allele frequency or genotype frequency differences at each marker (Pritchard and Donnelly, 2001). The main idea behind GWAS is that if a mutation is positively correlated with susceptibility to a disease, then that mutation is expected to be more frequent among affected individuals than among unaffected individuals (Pritchard and Donnelly, 2001). Hence, given linkage disequilibrium (LD) between the marker locus and the susceptibility mutation, a marker close to the disease mutation may also show a frequency difference between the case and control groups (Pritchard and Donnelly, 2001). Case-control traits can be analysed using either logistic regression or contingency table techniques (Bush and Moore, 2012).
Contingency table methods examine the deviation from independence that is expected under the null hypothesis of no association between the disease under study and the measured allele or genotype frequencies (Bush and Moore, 2012). The Pearson chi-squared test and the related Fisher's exact test are the most widely used tests for independence of the rows and columns of a contingency table (Bush and Moore, 2012). Association tests are performed separately for each marker and, depending on the aim of the study, the data for each marker with minor allele a and major allele A can be represented either as genotype counts (e.g., a/a, A/a and A/A) or as allele counts (e.g., a and A) (Clarke et al., 2011). It is widely believed that the allelic association test with 1 degree of freedom (df) is more reliable than the genotypic test with 2 df. However, this superior performance holds only when the penetrance of the heterozygote genotype lies between the penetrances of the two homozygote genotypes (Clarke et al., 2011). When the distribution of genotypes in the population deviates from Hardy-Weinberg proportions (HWE), of which additive, dominant and recessive models are all examples (Clarke et al., 2011), the genotype frequencies rather than the allele frequencies should be compared using the Cochran-Armitage test for trend (Sasieni, 1997). For more information on the different models see (Clarke et al., 2011). Thus, the advantage of the Cochran-Armitage trend test over Pearson's chi-squared test is that it is more conservative and does not depend on the HWE assumption (Sasieni, 1997). A number of authors have therefore recommended using the Cochran-Armitage trend test as the genotype-based test for association (Sasieni, 1997, Corcoran et al., 2000, Li, 2008, Risch and Merikangas, 1996, Risch, 2000).
It should also be noted that the allelic and trend statistics are equivalent when the combined sample is in HWE (Sasieni, 1997). However, a major drawback of model-based methods is that their statistical properties depend on the choice of weights. Thus, model misspecification reduces the power of the test (Sasieni, 1997, Corcoran et al., 2000, Li, 2008, Risch and Merikangas, 1996, Risch, 2000). Furthermore, Escott-Price et al. (2013) showed that, although in most scenarios the Cochran-Armitage trend test is more powerful than the chi-squared test of genotype counts, the advantage is not substantial. Moreover, when the disease locus deviates strongly from the additive model, the chi-squared test of genotype counts can be more powerful than the Cochran-Armitage trend test because of the choice of scores for each genotype in the trend test (Escott-Price et al., 2013). Although there are many studies of the advantages and disadvantages of the Cochran-Armitage trend test, to the best of our knowledge only a small number of studies have evaluated the distribution of p-values for this association test. In this paper, the distribution of the p-values obtained from the Cochran-Armitage trend test is studied, and it is shown that, contrary to the usual assumption, these p-values are not uniformly distributed. To overcome this issue, we introduce a new method, based on the bootstrap technique, for computing the p-value of the Cochran-Armitage trend test. The bootstrap method has become a standard tool in statistical analysis and is an indispensable tool for testing statistical hypotheses. Using resampling, the bootstrap approximates the sampling distribution of a statistic under the null (or the alternative) hypothesis. The bootstrap provides a practical complement to asymptotic parametric inference and has hence attracted much attention in applied work.
The efficiency of the nonparametric bootstrap method has also been shown by Amiri and von Rosen (2011), where, for example, a remarkable improvement was achieved over the Pearson chi-squared statistic with Yates' correction and over Fisher's exact test. The Pearson chi-squared statistic with Yates' correction and Fisher's exact test are quite conservative, tend not to reject the null hypothesis, and cannot be recommended for testing independence with small sample sizes. The remainder of this paper is organized as follows. The concept of the Cochran-Armitage trend test is explained in Section 2. Section 3 studies the alternative approach to inference, including the bootstrap version of the Cochran-Armitage trend test. Section 4 investigates the proposed methods using Monte Carlo simulation, which shows that they are accurate tests in terms of significance level and statistical power. Section 4 also demonstrates the improvement in goodness of fit achieved by the introduced bootstrap approach. The paper concludes with a concise summary in Section 5.

Cochran-Armitage trend test

The Cochran-Armitage trend test is a widely used test for trend among binomial proportions, and it uses the genotype contingency table (Table 1) in a different manner from Pearson's test. Power is very often improved as long as the probability of disease increases with the number of disease-associated alleles. In genetic association studies in which the underlying genetic model is unknown, the additive version of this test is most commonly used. To measure the effect of genotype i and to detect particular types of association, we introduce a weight w_i. The special choice (w0, w1, w2) = (0, 1, 2) represents the additive effect of allele A. (See Table 2.)

Table 1

Genotype counts distribution for the case-control studies.

	w₀ = 0	w₁ = 1	w₂ = 2	Total
Case	n₀	n₁	n₂	n
Control	m₀	m₁	m₂	m
Total	N₀	N₁	N₂	N

Table 2

Frequency table.

score
w₀	w₁	…	w_{J−1}	total
n₀	n₁	…	n_{J−1}	n
m₀	m₁	…	m_{J−1}	m
N₀	N₁	…	N_{J−1}	N

Let us consider a single-marker locus with two possible alleles, commonly denoted by A and a. Each individual thus has one of three possible genotypes: AA, Aa and aa. In the following we denote the two alleles by 0 and 1 instead of A and a, and the genotypes by 0, 1, 2, the sum of the two allele indices involved. We assume a random sample of n cases and m unrelated controls, so that the total sample size is N = n + m. The case-control data can then be summarized by genotype as shown in Table 1. Here, (n0, n1, n2) are the genotype counts in cases, (m0, m1, m2) are the genotype counts in controls, and (N0, N1, N2) are the genotype counts in the combined case-control sample. As cases and controls are sampled independently, the genotype counts for cases and controls follow independent multinomial distributions with parameters (p0, p1, p2) and (p0′, p1′, p2′), respectively, where p_i and p_i′, i = 0, 1, 2, are the genotype probabilities in cases and controls. Under the null hypothesis of no association, H0: p_i = p_i′ for i = 0, 1, 2. The Cochran-Armitage trend test statistic for the data in Table 1 is given by

$$T = \frac{N\left[\sum_{i=0}^{2} w_i \,(N n_i - n N_i)\right]^2}{n\, m \left[N \sum_{i=0}^{2} w_i^2 N_i - \left(\sum_{i=0}^{2} w_i N_i\right)^2\right]}. \qquad (1)$$

Under the null hypothesis, the statistic in Eq. (1) asymptotically follows the chi-square distribution with one degree of freedom (df); see (Armitage, 1955). In the rest of this work we denote the Cochran-Armitage trend test by CA.

Agresti (2007) states CA in terms of the Pearson chi-squared statistic. Consider a 2 × J contingency table with ordered columns; see Table 2. Let $n_j \sim \mathrm{bin}(N_j, p_j)$, j = 0, …, J − 1. It is of interest to test the null hypothesis

$$H_0: p_0 = p_1 = \dots = p_{J-1}.$$

This can be carried out using the linear probability model

$$p_j = \alpha + \beta w_j .$$

One can use the ordinary least squares approach for testing β. Let $\hat p_j = n_j / N_j$, $\bar p = n / N$ and $\bar w = \sum_j N_j w_j / N$. The prediction equation is

$$\tilde p_j = \bar p + b\,(w_j - \bar w),$$

where

$$b = \frac{\sum_j N_j (\hat p_j - \bar p)(w_j - \bar w)}{\sum_j N_j (w_j - \bar w)^2}.$$

The Pearson chi-squared statistic for independence decomposes as

$$X^2 = Z^2 + X^2(L),$$

where

$$Z^2 = \frac{b^2}{\bar p (1 - \bar p)} \sum_j N_j (w_j - \bar w)^2, \qquad X^2(L) = \frac{1}{\bar p (1 - \bar p)} \sum_j N_j (\hat p_j - \tilde p_j)^2 .$$

Under the linear probability model, $X^2(L) \sim \chi^2_{J-2}$ and, by Cochran's theorem, $Z^2 \sim \chi^2_1$. The statistic $Z^2$ can be used to test H0: β = 0 for the linear trend; the test of independence using $Z^2$ is called the Cochran-Armitage (CA) trend test.
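As an illustration of the two formulations in this section, the following minimal Python sketch computes the trend statistic from the genotype counts and, separately, the least-squares decomposition of the Pearson statistic into a trend part and a lack-of-fit part. The function names and the example counts are hypothetical and do not come from the paper; the sketch assumes every column total is positive.

```python
import math

def ca_trend_statistic(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic T for a 2 x J table of counts."""
    n, m = sum(cases), sum(controls)
    N = n + m
    totals = [a + b for a, b in zip(cases, controls)]  # column totals N_j
    num = sum(w * (N * a - n * c) for w, a, c in zip(weights, cases, totals))
    den = (N * sum(w * w * c for w, c in zip(weights, totals))
           - sum(w * c for w, c in zip(weights, totals)) ** 2)
    return N * num ** 2 / (n * m * den)

def pearson_decomposition(cases, controls, weights=(0, 1, 2)):
    """Return (Z2, X2_L, X2), where X2 = Z2 + X2_L (needs every N_j > 0)."""
    n, N = sum(cases), sum(cases) + sum(controls)
    totals = [a + b for a, b in zip(cases, controls)]
    p_bar = n / N                                        # overall case proportion
    w_bar = sum(c * w for c, w in zip(totals, weights)) / N
    p_hat = [a / c for a, c in zip(cases, totals)]       # per-column proportions
    sxx = sum(c * (w - w_bar) ** 2 for c, w in zip(totals, weights))
    sxy = sum(c * (p - p_bar) * (w - w_bar)
              for c, p, w in zip(totals, p_hat, weights))
    b = sxy / sxx                                        # least-squares slope
    scale = p_bar * (1 - p_bar)
    z2 = b ** 2 * sxx / scale                            # trend component
    x2_l = sum(c * (p - (p_bar + b * (w - w_bar))) ** 2
               for c, p, w in zip(totals, p_hat, weights)) / scale
    x2 = sum(c * (p - p_bar) ** 2 for c, p in zip(totals, p_hat)) / scale
    return z2, x2_l, x2

def chi2_1_upper_tail(t):
    """P(X > t) for X ~ chi-squared with 1 df, i.e. erfc(sqrt(t / 2))."""
    return math.erfc(math.sqrt(t / 2.0))

cases, controls = [12, 25, 13], [20, 22, 8]  # hypothetical genotype counts
t = ca_trend_statistic(cases, controls)
z2, x2_l, x2 = pearson_decomposition(cases, controls)
print(t, z2, x2_l, x2, chi2_1_upper_tail(t))
```

On any table with positive column totals, the first two printed values should agree up to rounding, which reflects that the trend statistic and the trend component of the Pearson statistic are the same quantity.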

Bootstrap Cochran-Armitage trend test

The bootstrap method has brought a vast new body of statistics in the form of nonparametric approaches to modelling uncertainty, in which not only the individual parameters of the probability distribution but also the entire distribution are sought (Amiri, 2013). This has led to a versatile tool for data analysis, in particular in the field of statistical hypothesis tests. Two monographs on the bootstrap method, written by Efron and Tibshirani (1994) and Davison and Hinkley (1997), are very useful in this regard in that they focus more on applications than on the theoretical approach. The idea of the bootstrap method is to approximate the sampling distribution of the proposed statistic by resampling, which provides a practical complement to asymptotic parametric methods. The flexibility and robustness of this technique, especially when assumptions are violated, can be counted as two of its advantages (Good, 2013). Amiri and von Rosen (2011) use the bootstrap to test contingency tables. To test the association using the bootstrap method, the resampling should be performed on $E_{ij}$, the expected value in the (i, j)th cell. The principle of the bootstrap test is to perform the bootstrap resampling under the null hypothesis, as explained in (Efron and Tibshirani, 1994) and (Davison and Hinkley, 1997). The null hypothesis of no association in the contingency table is $H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j}$, which leads to $H_0: E_{ij} = O_{i\cdot}\, O_{\cdot j} / O_{\cdot\cdot}$, and therefore resampling under the null hypothesis is resampling on $E_{ij}$ rather than $O_{ij}$, where $O_{ij}$ denotes the observed value. A dot in the subscript denotes summation over that index. Let $X^{2*}$ be the statistic computed from a table resampled under the null hypothesis, which has a $\chi^2_{J-1}$ distribution. Since $X^{2*} = Z^{2*} + X^{2*}(L)$, $Z^{2*}$ has a $\chi^2_1$ distribution and can be used to test $H_0: p_0 = p_1 = \dots = p_{J-1}$.
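The expected cell values under the null hypothesis of no association are the product of the row and column totals divided by the grand total. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def expected_counts(table):
    """Expected counts under independence: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

observed = [[10, 20, 20],   # cases (hypothetical counts)
            [20, 20, 10]]   # controls
print(expected_counts(observed))
```

Because the two row totals are equal in this example, both rows of the expected table are identical, which is what independence of row and column implies.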

First approach: NBCA

The test can be carried out using the following steps:

1. Calculate T or $Z^2$.
2. Resample the data under $E_{ij}$ and obtain the contingency table $N^* = \{n_0^*, \dots, n_{J-1}^*, m_0^*, \dots, m_{J-1}^*\}$.
3. Repeat the second step B times, and calculate $T_b^*$ or $Z_b^{2*}$, b = 1, …, B.
4. Estimate the p-value using

$$p\text{-value} = \frac{\#\{T_b^* \ge T\}}{B}.$$

Let us denote the above approach as NBCA.
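The resampling steps above can be sketched in Python as follows. This is only one possible reading: it assumes that "resampling under E" means drawing all N observations from a multinomial distribution whose cell probabilities are the expected counts divided by N (so the row totals may vary between resamples), and that a resampled statistic counts towards the p-value when it is at least as large as the observed one. The function names, the default B and the example counts are hypothetical.

```python
import random

def ca_stat(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic; returns 0.0 for degenerate tables."""
    n, m = sum(cases), sum(controls)
    N = n + m
    totals = [a + b for a, b in zip(cases, controls)]
    spread = (N * sum(w * w * c for w, c in zip(weights, totals))
              - sum(w * c for w, c in zip(weights, totals)) ** 2)
    if n == 0 or m == 0 or spread == 0:
        return 0.0
    num = sum(w * (N * a - n * c) for w, a, c in zip(weights, cases, totals))
    return N * num ** 2 / (n * m * spread)

def nbca_pvalue(cases, controls, B=999, weights=(0, 1, 2), seed=0):
    """Bootstrap p-value, resampling cells with probabilities E_ij / N (assumption)."""
    rng = random.Random(seed)
    n, m = sum(cases), sum(controls)
    N = n + m
    J = len(cases)
    totals = [a + b for a, b in zip(cases, controls)]
    # Cell probabilities under independence: case cells first, then control cells.
    probs = [n * c / N ** 2 for c in totals] + [m * c / N ** 2 for c in totals]
    t_obs = ca_stat(cases, controls, weights)
    exceed = 0
    for _ in range(B):
        counts = [0] * (2 * J)
        for cell in rng.choices(range(2 * J), weights=probs, k=N):
            counts[cell] += 1
        exceed += ca_stat(counts[:J], counts[J:], weights) >= t_obs
    return exceed / B

print(nbca_pvalue([10, 20, 20], [20, 20, 10], B=499, seed=1))
```

Keeping the row totals fixed at n and m during resampling would be an equally plausible variant; the choice made here is for brevity only.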

Second approach: PBCA

Another approach is to consider a parametric bootstrap. To this end, consider each allele or column as being generated from an independent pdf. Under the null hypothesis, the joint pdf is a product of four binomial pdfs. To estimate p, its maximum likelihood estimate (MLE) can be used. The test procedure is the same as above, except that in step 3 the resampled contingency tables are generated from the parametric model with p replaced by its MLE. This test is referred to as PBCA in the rest of this work.
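A parametric bootstrap of this kind might be sketched as below. The sketch assumes that, under the null hypothesis, the number of cases in column j is binomial with size N_j and a common probability p, and that p is estimated by the pooled proportion n/N; both are assumptions made for illustration, as are the function names and example counts.

```python
import random

def ca_stat(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic; returns 0.0 for degenerate tables."""
    n, m = sum(cases), sum(controls)
    N = n + m
    totals = [a + b for a, b in zip(cases, controls)]
    spread = (N * sum(w * w * c for w, c in zip(weights, totals))
              - sum(w * c for w, c in zip(weights, totals)) ** 2)
    if n == 0 or m == 0 or spread == 0:
        return 0.0
    num = sum(w * (N * a - n * c) for w, a, c in zip(weights, cases, totals))
    return N * num ** 2 / (n * m * spread)

def pbca_pvalue(cases, controls, B=999, weights=(0, 1, 2), seed=0):
    """Parametric bootstrap p-value with n_j* ~ bin(N_j, p_hat) (assumed model)."""
    rng = random.Random(seed)
    totals = [a + b for a, b in zip(cases, controls)]
    p_hat = sum(cases) / sum(totals)          # pooled estimate of p under H0
    t_obs = ca_stat(cases, controls, weights)
    exceed = 0
    for _ in range(B):
        # Binomial draw per column as a sum of Bernoulli trials.
        cases_star = [sum(rng.random() < p_hat for _trial in range(c))
                      for c in totals]
        controls_star = [c - a for c, a in zip(totals, cases_star)]
        exceed += ca_stat(cases_star, controls_star, weights) >= t_obs
    return exceed / B

print(pbca_pvalue([10, 20, 20], [20, 20, 10], B=499, seed=1))
```

In this variant the column totals stay fixed and the row totals vary between resamples.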

Numerical studies

This section demonstrates the validity of the proposed methods for inference with the Cochran-Armitage trend test. Monte Carlo experiments are used to study the finite-sample properties of the proposed approaches. To provide a meaningful comparison, all methods are applied to the same simulated data. In total, 5000 simulations were performed. To compare the procedures, we look at desirable features such as the actual significance level. The data are generated using (n0, n1, n2) ~ Multi(n, (0.2, 0.4, 0.2)) and (m0, m1, m2) ~ Multi(m, (0.2, 0.4, 0.2)). The Q-Q plots of the p-values of the proposed tests are given in Fig. 1, which shows that the p-values from the bootstrap tests fit the uniform distribution better, suggesting that the bootstrap can be used to draw the inference.

Fig. 1

The Q-Q plot of the p-value for the proposed tests, n = m = 50.
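A simulation of this type can be sketched for the asymptotic test as follows. The sketch estimates the actual significance level as the fraction of simulated null tables whose p-value falls below α. Note that the probability vector as printed sums to 0.8; `random.choices` treats its weights as relative, so the sketch effectively normalizes them, which is an assumption about the intended distribution. The function names and default arguments are hypothetical.

```python
import math
import random

def ca_stat(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic; returns 0.0 for degenerate tables."""
    n, m = sum(cases), sum(controls)
    N = n + m
    totals = [a + b for a, b in zip(cases, controls)]
    spread = (N * sum(w * w * c for w, c in zip(weights, totals))
              - sum(w * c for w, c in zip(weights, totals)) ** 2)
    if n == 0 or m == 0 or spread == 0:
        return 0.0
    num = sum(w * (N * a - n * c) for w, a, c in zip(weights, cases, totals))
    return N * num ** 2 / (n * m * spread)

def asymptotic_pvalue(cases, controls):
    """Upper-tail chi-squared (1 df) probability of the trend statistic."""
    return math.erfc(math.sqrt(ca_stat(cases, controls) / 2.0))

def draw_genotypes(rng, size, probs):
    """Multinomial genotype counts; probs are used as relative weights."""
    counts = [0] * len(probs)
    for g in rng.choices(range(len(probs)), weights=probs, k=size):
        counts[g] += 1
    return counts

def empirical_size(pvalue_fn, n=50, m=50, probs=(0.2, 0.4, 0.2),
                   sims=5000, alpha=0.05, seed=0):
    """Fraction of null tables rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        cases = draw_genotypes(rng, n, probs)
        controls = draw_genotypes(rng, m, probs)
        rejections += pvalue_fn(cases, controls) < alpha
    return rejections / sims

print(empirical_size(asymptotic_pvalue))
```

Any function that maps a pair of count vectors to a p-value, such as a bootstrap version, can be passed as `pvalue_fn`, at the cost of a much longer run time.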

Racine and Mackinnon (2007) suggest

$$p\text{-value} = \frac{\#\{T_b^* \ge T\} + U}{B + 1},$$

where U ~ Unif(0, 1). Under the null hypothesis, P(p-value < α) = α for any finite B, which is especially useful when the number of bootstrap replications is not large. To study power, the simulated data are generated using (n0, n1, n2) ~ Multi(n, (0.2, 0.4, 0.2)) and (m0, m1, m2) ~ Multi(m, (0.4, 0.2, 0.2)). The violin plot of the simulated power is given in Fig. 2. The violin plot is a combination of a box plot and a kernel density plot: it starts with a box plot and then adds a rotated kernel density plot to each side of the box plot, which gives a better indication of the shape of the distribution and a summary of the data.
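The effect of adding a uniform random number to the bootstrap count can be checked numerically. The sketch below assumes the randomized p-value has the form (number of resampled statistics at least as large as the observed one, plus U) divided by (B + 1), and uses a continuous toy statistic (a squared standard normal) rather than the trend statistic; the function names are hypothetical.

```python
import random

def randomized_pvalue(t_obs, t_star, u):
    """(count of t* >= t_obs, plus a Unif(0, 1) draw u) divided by (B + 1)."""
    exceed = sum(t >= t_obs for t in t_star)
    return (exceed + u) / (len(t_star) + 1)

def rejection_rate(B=9, alpha=0.05, sims=20000, seed=0):
    """Null rejection frequency when the observed and resampled statistics
    are independent draws from the same continuous distribution."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        t_obs = rng.gauss(0, 1) ** 2
        t_star = [rng.gauss(0, 1) ** 2 for _draw in range(B)]
        rejections += randomized_pvalue(t_obs, t_star, rng.random()) < alpha
    return rejections / sims

print(rejection_rate())
```

With B = 9 the quantity α(B + 1) is not an integer, a case in which the usual bootstrap p-value cannot attain level 0.05 exactly; the randomized version should still reject at a rate close to 0.05 in this toy setting.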

Fig. 2

Violin plot of the simulated p-values for the proposed tests when the null hypothesis is not correct.

Fig. 3 illustrates the Q-Q plot of 100,000 random numbers generated from $\chi^2_1$. The results confirm that a statistic with the $\chi^2_1$ distribution suffers from a lack of goodness of fit in the right tail. This is also quite evident for the Cochran-Armitage trend test.

Fig. 3

Q-Q plot of random numbers generated from $\chi^2_1$.
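The comparison behind such a Q-Q plot can be sketched without plotting by tabulating a few sample quantiles against the theoretical ones. The sketch assumes the random numbers are generated as squares of standard normal draws, and it uses the fact that the q-quantile of the chi-squared distribution with 1 df is the square of the standard normal quantile at (1 + q)/2. The function names and the chosen quantile levels are hypothetical.

```python
import random
from statistics import NormalDist

def chi2_1_quantile(q):
    """Theoretical q-quantile of the chi-squared distribution with 1 df."""
    return NormalDist().inv_cdf((1 + q) / 2) ** 2

def sample_quantiles(size=100_000, probs=(0.5, 0.9, 0.99, 0.999), seed=0):
    """Empirical quantiles of `size` squared standard normal draws."""
    rng = random.Random(seed)
    draws = sorted(rng.gauss(0, 1) ** 2 for _ in range(size))
    return {q: draws[min(size - 1, int(q * size))] for q in probs}

for q, sample_value in sample_quantiles().items():
    print(q, round(sample_value, 3), round(chi2_1_quantile(q), 3))
```

Changing the seed gives a sense of how much more the extreme quantiles vary from sample to sample than the central ones.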

To study the efficiency of the proposed approaches, we generate 2000 tables with (n0, n1, n2) ~ Multi(n, (0.2, 0.4, 0.2)) and (m0, m1, m2) ~ Multi(m, (0.2, 0.4, 0.2)), where n = m = 20. The Q-Q plots from $\chi^2_1$ and the bootstrap approach are given in Fig. 4, which clearly confirms the superiority of the proposed approaches. Note also that both proposed bootstrap approaches perform similarly here.

Fig. 4

The Q-Q plot in terms of the expected values from $\chi^2_1$ and the bootstrap.

Conclusion

This article explores the genetic association study for the case-control design, in which inference is drawn on the equality of genotype frequencies. GWAS present important challenges and opportunities in bioinformatics, as they enable the modelling of complex genotype-phenotype relationships using mathematical and statistical approaches. Such models help us to understand and interpret genetic association studies and promote the development of powerful algorithms to examine genotype-phenotype relationships. In this paper, we explored the Cochran-Armitage trend test and its bootstrap versions. It was shown that the proposed bootstrap tests can be used to test genetic association. The results confirm that the p-values of the proposed approaches fit the uniform distribution better, especially in the right tail. Another advantage is that the proposed tests require fewer assumptions than the conventional method. The results also support the conclusion that the proposed approaches can be successfully employed to test genetic association. Extending the idea proposed in this paper to obtain a test that is more robust to the choice of weights for the Cochran-Armitage trend test is our plan for future research.

15 in total

1. Power comparisons for tests of trend in dose-response studies.

Authors: C Corcoran; C Mehta; P Senchaudhuri
Journal: Stat Med Date: 2000-11-30 Impact factor: 2.373

Review 2. Three lectures on case-control genetic association analysis.

Authors: Wentian Li
Journal: Brief Bioinform Date: 2007-12-14 Impact factor: 11.622

3. On the efficiency of bootstrap method into the analysis contingency table.

Authors: Saeid Amiri; Dietrich von Rosen
Journal: Comput Methods Programs Biomed Date: 2011-04-02 Impact factor: 5.428

4. The future of genetic studies of complex human diseases.

Authors: N Risch; K Merikangas
Journal: Science Date: 1996-09-13 Impact factor: 47.728

Review 5. Searching for genetic determinants in the new millennium.

Authors: N J Risch
Journal: Nature Date: 2000-06-15 Impact factor: 49.962

6. Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer.

Authors: Laufey Amundadottir; Peter Kraft; Rachael Z Stolzenberg-Solomon; Charles S Fuchs; Gloria M Petersen; Alan A Arslan; H Bas Bueno-de-Mesquita; Myron Gross; Kathy Helzlsouer; Eric J Jacobs; Andrea LaCroix; Wei Zheng; Demetrius Albanes; William Bamlet; Christine D Berg; Franco Berrino; Sheila Bingham; Julie E Buring; Paige M Bracci; Federico Canzian; Françoise Clavel-Chapelon; Sandra Clipp; Michelle Cotterchio; Mariza de Andrade; Eric J Duell; John W Fox; Steven Gallinger; J Michael Gaziano; Edward L Giovannucci; Michael Goggins; Carlos A González; Göran Hallmans; Susan E Hankinson; Manal Hassan; Elizabeth A Holly; David J Hunter; Amy Hutchinson; Rebecca Jackson; Kevin B Jacobs; Mazda Jenab; Rudolf Kaaks; Alison P Klein; Charles Kooperberg; Robert C Kurtz; Donghui Li; Shannon M Lynch; Margaret Mandelson; Robert R McWilliams; Julie B Mendelsohn; Dominique S Michaud; Sara H Olson; Kim Overvad; Alpa V Patel; Petra H M Peeters; Aleksandar Rajkovic; Elio Riboli; Harvey A Risch; Xiao-Ou Shu; Gilles Thomas; Geoffrey S Tobias; Dimitrios Trichopoulos; Stephen K Van Den Eeden; Jarmo Virtamo; Jean Wactawski-Wende; Brian M Wolpin; Herbert Yu; Kai Yu; Anne Zeleniuch-Jacquotte; Stephen J Chanock; Patricia Hartge; Robert N Hoover
Journal: Nat Genet Date: 2009-08-02 Impact factor: 38.330