Literature DB >> 25519401

Genetic association analysis for common variants in the Genetic Analysis Workshop 18 data: a Dirichlet regression approach.

Osvaldo Espin-Garcia¹, Xiaowei Shen¹, Xin Qiu², Yonathan Brhane³, Geoffrey Liu⁴, Wei Xu⁵.

Abstract

We propose a genetic association analysis using Dirichlet regression to analyze the Genetic Analysis Workshop 18 data. Clinical variables, arranged in a longitudinal data structure, are employed to fit a multistate transition model in which the transition probabilities are served as a response in the proposed analysis. Furthermore, a gene-based association analysis via penalized regression is implemented using the markers at a single-nucleotide polymorphism level that we previously identified via nonpenalized Dirichlet regression.

Entities: CellLine Chemical Disease Gene Mutation Species

Year: 2014 PMID： 25519401 PMCID： PMC4143809 DOI： 10.1186/1753-6561-8-S1-S70

Source DB: PubMed Journal: BMC Proc ISSN： 1753-6561

Background

Genetic association analyses have had tremendous successes in recent years; however, most of these analyses were based on binary or continuous responses. Thus we propose a multivariate response vector indicating probabilities of transitions to predefined hypertensive states. This enables us to reflect the inherent uncertainty involved in the probability that a patient will transfer to a given state. An important feature of our approach is the incorporation of prehypertension as an intermediate state. As Winegarden argues, prehypertension blood pressure in young patients helps predict the development of hypertension [1].

Methods

Definition of response

We defined a response summarizing the phenotype information into a vector that will be used in a genetic association analysis. The response is defined as a 3-dimensional vector of probabilities , such that each element measures the probability of a transition to a blood pressure level (normotensive, prehypertensive, or hypertensive) given the previous level. The analysis was done without the knowledge of the underlying simulation model and we used the real phenotype data only.

Data quality control

Data quality control was performed in PLINK [2]. We only considered the data from chromosome 3 for analysis. We used a call rate for individuals of 95%, a Hardy-Weinberg disequilibrium test at a significance level of 1 × 10−6, and a missing rate of 95% for each marker. Markers with a minor allele frequency of at least 5% were retained for analysis. Additionally, all individuals' time points with at least 1 missing clinical variable were excluded.

Multistate transition model

We describe hypertension, our trait of interest, using a 3-state model based on recorded blood pressure levels for each individual at each examination. The states are defined as follows: normal blood pressure (state 1) when the systolic blood pressure is less than 120 mm Hg and diastolic blood pressure is less than 80 mm Hg; prehypertension (state 2) when the blood pressure level is not in state 1, the systolic blood pressure is less than 140 mm Hg, and the diastolic blood pressure is less than 90 mm Hg; and hypertension (state 3) for all other cases. Also, if a patient used antihypertensive medication, the state assigned at that examination is hypertension (state 3) regardless of the recorded blood pressure levels. Once the states are defined, we consider a multistate transition model; it is important to note that all 9 transitions are possible. Our interest in transition models lies in estimating of the transition probabilities as defined in Kalbfleisch and Lawless [3] which are given by where and denote the observed state and the covariates for subject i at the examination respectively. This model takes advantage of the longitudinal data structure and the definition of the response follows naturally. To estimate the transition probabilities, we fit a multinomial regression model, based on covariates (gender, smoking status and age) and the state at the previous examination. To get expressions for we consider a generalized logit model of the form besides, , where is the observed vector of covariates for subject i plus a categorical variable denoting the effect of examination time in the model (and possible interactions), and is the vector of coefficients for the corresponding multinomial regression model. Thus, a transition probability matrix (TPM) is defined for each individual as follows Therefore, the response for subject i is a row taken from and is determined by conditioning on the patient's last available observed state and covariates.

Dirichlet regression

Once the response is modeled our objective is to determine whether there is an association between it and the genotypes. We assess this association using Dirichlet regression [4], which suits this response structure. The advantage of this approach lies in its tractability in dealing with the proposed response. It also allows a more comprehensive understanding of the genetic effect on the expression of hypertension, and therefore in its possible interpretation. For instance, if a signal was detected for a marker, it would suggest an association between the marker and the transition of blood pressure states jointly rather than a single level. Therefore, the Dirichlet approach is more informative in the sense of explaining the plausibility of each defined state. To relate the genetic information and the defined response under a Dirichlet regression approach, the likelihood given each individual's vector of covariates, , is where and is the gamma function. The parameters, , are defined in terms of a linear predictor using a logarithm link, where M is the number of covariates included in the model and is the vector of regression coefficients that explains the effects (in log scale) of the covariates on the jth component. Considering the above, 2 models are analyzed: Model 1 (M1): (base model) Model 2 (M2): (adjusted model) Here represents the number of copies of the minor allele on the kth single-nucleotide polymorphism (SNP) for the ith individual under an additive genetic model; is the ith row of contrast matrix for the pedigree number considered as a categorical variable and is the vector of regression coefficients on the jth component. (Note . Our interest in these models lies in the potential genetic effect of each marker on the proposed response. To assess this, Wald statistics were used to test the null hypothesis of no association between each SNP and the response, (vs.), .

Gene-based association

Once we identify significant SNPs through the genetic association analysis as described above, we proceed to perform the analysis at a gene level. To achieve this, we propose a penalized regression. Including all the markers simultaneously, this penalized method aims to select those SNPs with higher association. The analysis is done on those candidate genes that contain at least 1 significant marker that has already been determined. Variable selection on the SNPs is assessed via a penalized likelihood of the form where represents the log-likelihood of a dirichlet distributed sample with response matrix (or for M2) is the design matrix, p denotes the number of markers considered for variable selection; k is the number of states; η is the regression coefficients vector; c and λ are parameters for the penalized regression; and and are the penalty norms. It is important to note that when we have a ridge regression penalty, whereas when we have a lasso penalty. We implement the variable selection for the penalized dirichlet regression using R code provided on the Statistical Genetics and Genomics Laboratory at the University of Pennsylvania webpage [5].

Results

Data quality results

The Genetic Analysis Workshop 18 (GAW18) data consists of 855 individuals with genotype and phenotype information. As a result of missing data, transition probabilities are estimated for only 835 individuals. Of these, 43 are removed because of low call rate. The overall call rate for the remaining 792 individuals is 99.82%. The Genome Wide Association Study (GWAS) data includes 65,519 SNPs for chromosome 3, of which 59 are excluded because it is not possible to reliably obtain position information for these markers. The remaining 65,460 SNPs are considered for data quality control. Because of a low genotyping rate, 114 markers are removed; none are excluded by the Hardy-Weinberg equilibrium test; and 13,011 markers are removed because of low minor allele frequency. The remaining 52,357 markers are considered for analysis.

Analysis results

The parameter estimates for the transition models are obtained using R [6]. To test examination time effect; likelihood ratio tests are performed in which the null model considers only the available clinical variables. Table 1 presents the final transition models.

Table 1

Selected transition models

Transition	Model
1→j	logy1j/y11=γ1j0+γ1j1xsex+γ1j2xsmoke+γ1j3xage*timej=2,3
2→j	logy2j/y22=γ2j0+γ2j1xsex+γ2j2xsmoke+γ2j3xage+timej=1,3
3→j	logy3j/y33=γ3l0+γ3j1xsex+γ3j2xsmoke+γ3j3xagej=1,2

Selected transition models After the response is estimated, models M1 and M2 are fit using R [7] for each available SNP. Figure 1 displays the Manhattan plots for the p values that result from testing the null hypotheses of no association between the markers and the response. The graphs show that only 1 marker under M2 is significant at the standard significance level for GWAS (5 × 10−8). Interestingly, the same marker is the most significant marker under M1, although it is not significant at the standard threshold. This suggests that the adjustment for family incorporated in M2 accounts for the family structure in the data. Also, the proposed methodology demonstrates consistency in that the same marker proves to be the most significant under both models. Table 2 summarizes these findings.

Figure 1

Manhattan plots for genetic association analysis.

Table 2

Association analysis results

SNP	Gene	MA	MAF (%)	p Value (M1)	p Value (M2)
rs12492830	PCCB	C	7.22	1.1 × 10⁻⁷	3.9 × 10⁻⁸

MA, minor allele MAF, minor allele frequency.

Manhattan plots for genetic association analysis. Association analysis results MA, minor allele MAF, minor allele frequency. Once significant markers were identified, a gene-level association analysis is performed using the penalized regression described above for different levels of c (0, 0.3, 0.5, 0.7, and 1). The analysis is conducted utilizing both the GWAS and the dosage imputed genotypes (GENO) information as the explanatory variables. Figure 2 shows the penalized regression results for the gene containing the significant SNP (rs12492830) for only. This level of c is a blended penalty function, equally weighting the ridge and lasso penalties. Table 3 shows the results for different levels of c under M2 for gene PCCB.

Figure 2

Penalized regression results for M2 only.

Table 3

Comparison of penalized regression under different levels of c

			No. of parameters selected (iterations for convergence)
Data	# SNPs	c=	0	0.3	0.5	0.7	1

GWAS	22		10 (260)	10 (148)	8 (135)	7 (163)	7 (174)

GENO	607		42 (427)	35 (199)	24 (176)	25 (136)	19 (115)

Penalized regression results for M2 only. Comparison of penalized regression under different levels of c

Discussion

The present work implements a multistate transition model that conveniently accommodates the longitudinal data structure. Whether the information contained by the available clinical variables is sufficient for predicting the hypertensive state is debatable, however. Although the adjusted model (M2) is an improvement over the base model (M1), neither of the described models accounts for correlation between individuals nor heteroscedasticity. One way to possibly overcome this is to incorporate a latent variable into the model. Such an extension follows. Model 3 (M3): where is the ith element of a vector u that follows a distribution; here K is twice the estimated kinship matrix. In this case, however, the estimation of the parameters of interest, , is not straightforward. Further research of this methodology is warranted. With respect to the penalized regression, to avoid an arbitrary selection of c,a cross-validation method could be implemented.

Conclusions

We propose a methodology that conveniently uses the longitudinal data structure to define a probabilistic outcome, which, we believe, explains hypertension in a more suitable way. Dirichlet regression provides an interesting approach that, along with other more common responses, can be successfully used in genetic association analysis. Our model finds a statistically significant marker at the standard significant level for GWAS, which is noteworthy, considering that it is often difficult to find association. Moreover, when the penalized method is used on the GENO data we are able to find significant markers in addition to those have already found using GWAS data.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

OEG and WX designed the overall study; OEG conceived the study, conducted statistical analyses, and wrote the manuscript; XS, XQ, and YB helped develop the study; GL revised the clinical aspects of the study. All authors read and approved the final manuscript.

3 in total

1 in total

1. Genetic Analysis Workshop 18 single-nucleotide variant prioritization based on protein impact, sequence conservation, and gene annotation.

Authors: Thomas Nalpathamkalam; Andriy Derkach; Andrew D Paterson; Daniele Merico
Journal: BMC Proc Date: 2014-06-17