Comparing baseline and longitudinal measures in association studies.

Shuai Wang1, Wei Gao1, Julius Ngwa1, Catherine Allard2, Ching-Ti Liu1, L Adrienne Cupples1.   

Abstract

In recent years, longitudinal family-based studies have had success in identifying genetic variants that influence complex traits in genome-wide association studies. In this paper, we suggest that longitudinal analyses may contain valuable information that can enable identification of additional associations compared to baseline analyses. Using Genetic Analysis Workshop 18 data, consisting of whole genome sequence data in a pedigree-based sample, we compared 3 methods for the genetic analysis of longitudinal data to an analysis that used baseline data only. These longitudinal methods were (a) a longitudinal mixed-effects model; (b) analysis of the mean trait over time; and (c) a 2-stage analysis, with estimation of a random intercept in the first stage and regression of the random intercept on a single-nucleotide polymorphism in the second stage. All methods accounted for the familial correlation among subjects within a pedigree. The analyses considered common variants with minor allele frequency above 5% on chromosome 3. Analyses were performed without knowledge of the simulation model. The 3 longitudinal methods showed consistent results, which were generally different from those found by using only the baseline observation. The gene CACNA2D3, identified by both longitudinal and baseline approaches, had a stronger signal in the longitudinal analysis (p = 2.65 × 10^-7) than in the baseline analysis (p = 2.76 × 10^-5). The effect sizes estimated by the longitudinal mixed-effects model and the mean trait analysis were larger than those from the 2-stage approach. The longitudinal methods gave stable results that differed from those obtained using 1 observation at baseline and generally had lower p values.

Year: 2014; PMID: 25519412; PMCID: PMC4143666; DOI: 10.1186/1753-6561-8-S1-S84

Source DB: PubMed; Journal: BMC Proc; ISSN: 1753-6561


Background

Longitudinal data analyses are widely used in genome-wide association studies to assess genetic and environmental risk factors and their association with phenotypes of interest [1-3]. They are more complicated than analyses using only baseline measures because subjects are followed over time and change is measured during follow-up. Standard linear regression techniques are not applicable in this setting because of the correlation that exists among the repeated measures per subject. Methods for longitudinal study designs have enabled the investigation of genetic variation influencing trait values over time [3]. In Genetic Analysis Workshop 13, Gauderman et al [4] provided an overview of a wide range of methods for the genetic analysis of longitudinal data in families. They summarized these methods into 2 groups: (a) 2-stage approaches, in which a summary statistic is obtained and used in genetic analysis, and (b) joint modeling, in which the genetic and longitudinal data are analyzed simultaneously in a single analysis. They argued that the use of a mean-type statistic could provide greater power compared to a slope-type statistic for detecting a gene effect. Zhu et al [1] performed a genome-wide association analysis in which they identified genes and gene-environment interactions associated with longitudinal traits. They implemented a multivariate adaptive spline for the analysis of the longitudinal data. In this paper, our main objective is to compare existing methods of longitudinal data analysis with those that use only 1 baseline measure in association studies. We explore the following longitudinal methods: (a) a longitudinal mixed-effects model; (b) analysis of the mean trait over time; and (c) a 2-stage analysis, with estimation of a random intercept in the first stage and regression of the random intercept on a single-nucleotide polymorphism (SNP) in the second stage.
These longitudinal methods use statistics that capture the level of a trait, such as a mean, to detect genetic associations, as opposed to methods that focus on the change in the trait over time, such as a slope. Despite the strengths and integrated approach of a longitudinal mixed model, fitting it is computationally intensive because of its complex structure. Therefore, the main motivation for trying some "simpler" alternative longitudinal models, such as analysis of the mean trait over time and a 2-stage analysis, is to see whether they can serve as good substitutes with equally good performance.

Methods

Study subjects and phenotype

We used real phenotype data collected in the San Antonio Family Heart Study, including sex, age, year of examination, systolic and diastolic blood pressure, use of antihypertensive medications, and tobacco smoking at up to 4 time points for 939 subjects in 20 pedigrees. Of the 939 participants, 244 attended only 1 exam; for the remaining subjects, the median follow-up time was 11 years with a median gap time between assessments of 5 years. We analyzed 2 continuous traits: systolic blood pressure (SBP) and diastolic blood pressure (DBP). For participants on medication, we imputed both SBP and DBP to mimic what their unmedicated values would be. If a subject was on medication at an exam, we replaced the blood pressure at that exam with the average blood pressure of all observations that had higher values among subjects of the same gender and within ±10 years of the subject's age. We performed a preliminary analysis to select covariates for both SBP and DBP. Variables significantly associated (p < 0.05) with SBP or DBP were selected. For SBP, we adjusted for age, sex, and tobacco smoking. For DBP, we adjusted for age, sex, tobacco smoking, and centered age squared.
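The imputation rule for medicated exams can be read in more than one way, so the following is only a minimal sketch under stated assumptions: the comparison pool contains every other observation (medicated or not), original rather than already-imputed values are compared, and a value is left unchanged when no higher observation exists. The field names (`sex`, `age`, `on_medication`, `sbp`) and the function name are hypothetical.

```python
from statistics import mean


def impute_medicated_bp(records, trait="sbp", age_window=10):
    """Return trait values with medicated exams replaced by a peer average.

    For a medicated exam, the replacement is the mean of all other
    observations of the same sex, within +/- age_window years of age,
    whose trait value is higher than the observed one (assumed reading).
    """
    imputed = []
    for rec in records:
        value = rec[trait]
        if rec["on_medication"]:
            higher = [
                other[trait]
                for other in records
                if other is not rec
                and other["sex"] == rec["sex"]
                and abs(other["age"] - rec["age"]) <= age_window
                and other[trait] > value
            ]
            if higher:  # assumption: keep the observed value if no peer is higher
                value = mean(higher)
        imputed.append(value)
    return imputed


# Hypothetical exam records, one dict per observation.
exams = [
    {"sex": "F", "age": 50, "on_medication": True, "sbp": 130},
    {"sex": "F", "age": 55, "on_medication": False, "sbp": 150},
    {"sex": "F", "age": 45, "on_medication": False, "sbp": 140},
    {"sex": "F", "age": 58, "on_medication": False, "sbp": 120},
    {"sex": "M", "age": 50, "on_medication": False, "sbp": 170},
    {"sex": "F", "age": 70, "on_medication": False, "sbp": 180},
]
imputed_sbp = impute_medicated_bp(exams)
```

In this toy data only the first exam is medicated; its peers with higher SBP are the two women aged 55 and 45, so its value becomes their average.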

Genetic data

The genetic data from Genetic Analysis Workshop 18 (GAW18) consisted of whole genome sequence data in a pedigree-based sample with longitudinal phenotype data for hypertension and related traits. A total of 26.8 million SNPs were identified in the 483 individuals. After eliminating 19 outlier individuals who failed to meet SNP quality control criteria such as fractions and ratio of homogeneous and heterogeneous sites and fraction of novel SNPs, 24 million SNPs passed support vector machine and indel proximity filters. Genotype calls cleaned of mendelian errors and dosages were provided for 959 individuals (464 directly sequenced and the rest imputed) for 8,348,674 locations in the genome. A majority of the SNPs were rare variants; 51% had a minor allele frequency (MAF) below 1%. As suggested by GAW18 leaders, all analyses for this current paper were based on 402,985 common variants (MAF ≥5%) of chromosome 3 only, accounting for around one-third of the total number of variants on the chromosome.

Statistical analyses

Baseline association analysis

For comparison with the methods that used the longitudinal data, we applied a baseline association analysis that considered only the first observation (baseline) for each person. In addition to adjusting for covariates, we incorporated a familial correlation structure (kinship coefficient matrix) into the model as $Y_{ij} = \beta_0 + X_{ij}\beta_c + \beta_g \mathrm{SNP}_{ij} + b_{ij} + \epsilon_{ij}$, where i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $Y_{ij}$ denotes the phenotype at baseline, $X_{ij}$ denotes the covariates at baseline, and $\mathrm{SNP}_{ij}$ denotes the SNP dosage. $\beta_0$ is the fixed intercept, $\beta_c$ is a vector of regression coefficients for the m covariates, and $\beta_g$ is the SNP effect size; $b_{ij}$ is the random intercept for the (i,j)th person. Within each pedigree, the vector $b_i$ is normally distributed with a mean of 0 and a covariance matrix of $2\sigma_b^2\Phi_i$ ($\Phi_i$ being the kinship matrix), contributing a diagonal block for each pedigree to the overall covariance matrix; $\epsilon_{ij}$ is an error term with a mean of 0 and a variance of $\sigma_\epsilon^2$. This model was implemented using the lmekin function in the R (version 2.9.2) package "kinship" [5], which employed maximum likelihood methods to estimate parameters. The notation used in this baseline model applies to the following models where applicable. To compare with the baseline approach, we considered 3 approaches for longitudinal analyses of these data: (a) longitudinal mixed-effects association analysis, (b) mean measure in longitudinal association analysis, and (c) 2-stage longitudinal association analysis.
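The familial correlation structure described here has one diagonal block per pedigree, each built from that pedigree's kinship matrix. As an illustration only (the paper relied on the R packages it cites, not on this code), the sketch below computes kinship coefficients with the standard recursive rule and stacks two hypothetical pedigrees into a block-diagonal matrix. It assumes parents are listed before their children and treats anyone with a missing parent as an unrelated founder.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients for one pedigree.

    `pedigree` is a list of (person, father, mother) tuples in which parents
    appear before their children; founders have father = mother = None.
    """
    index = {person: k for k, (person, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    phi = [[0.0] * n for _ in range(n)]
    for k, (_, father, mother) in enumerate(pedigree):
        if father is None or mother is None:
            # Founder: not inbred and unrelated to everyone listed earlier.
            phi[k][k] = 0.5
            continue
        f, m = index[father], index[mother]
        phi[k][k] = 0.5 * (1.0 + phi[f][m])
        for j in range(k):
            # Kinship with an earlier person is the average over k's parents.
            phi[k][j] = phi[j][k] = 0.5 * (phi[j][f] + phi[j][m])
    return phi


def block_diagonal(blocks):
    """Stack square matrices along the diagonal; all other entries are 0."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, value in enumerate(row):
                out[offset + r][offset + c] = value
        offset += len(b)
    return out


# Two hypothetical pedigrees: a family with two children and a trio.
family_1 = [("F1", None, None), ("M1", None, None),
            ("C1", "F1", "M1"), ("C2", "F1", "M1")]
family_2 = [("F2", None, None), ("M2", None, None), ("C3", "F2", "M2")]

K = block_diagonal([kinship_matrix(family_1), kinship_matrix(family_2)])
```

Entries linking members of different pedigrees stay at zero, which is what lets each pedigree contribute an independent block to the overall covariance matrix.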

Longitudinal mixed-effects association analysis

We used a random-intercept mixed-effects model with familial correlation structure [7]. The model is: $Y_{ijt} = \beta_0 + X_{ijt}\beta_c + \beta_g \mathrm{SNP}_{ij} + b_{ij} + \epsilon_{ijt}$. Here i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $Y_{ijt}$ denotes the trait at time point t; $X_{ijt}$ denotes the covariates at time t, including time-dependent covariates. The remaining terms are as defined for the baseline model. This model was implemented in the R (version 2.15.1) package "pedigreemm" [6], which used the method of restricted maximum likelihood for parameter estimation.

Mean measure in longitudinal association analysis

We also considered the mean across all time points as the trait, with correspondingly averaged covariates, as one alternative for longitudinal association analysis. This model is: $\bar{Y}_{ij} = \beta_0 + \bar{X}_{ij}\beta_c + \beta_g \mathrm{SNP}_{ij} + b_{ij} + \epsilon_{ij}$. Here i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $\bar{Y}_{ij}$ denotes the mean trait across time, and $\bar{X}_{ij}$ denotes the covariates, which for time-dependent covariates is the average measure across time. This model was implemented using the function lmekin in the R (version 2.9.2) package "kinship" [5], using maximum likelihood methods to estimate parameters.
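The data preparation behind the mean measure approach amounts to collapsing each subject's repeated exams into one record. A minimal sketch, assuming a long-format list of exam records with hypothetical field names (`subject`, `dbp`, `age`, `smoker`):

```python
from collections import defaultdict
from statistics import mean


def subject_means(rows, fields):
    """Average the given fields over all exams of each subject."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["subject"]].append(row)
    return {
        subject: {field: mean(visit[field] for visit in visits) for field in fields}
        for subject, visits in by_subject.items()
    }


# Hypothetical long-format data: subject A has 3 exams, subject B has 1.
exam_rows = [
    {"subject": "A", "dbp": 80, "age": 40, "smoker": 1},
    {"subject": "A", "dbp": 84, "age": 45, "smoker": 1},
    {"subject": "A", "dbp": 88, "age": 50, "smoker": 0},
    {"subject": "B", "dbp": 70, "age": 30, "smoker": 0},
]
means = subject_means(exam_rows, ["dbp", "age", "smoker"])
```

A subject with a single exam simply keeps that exam's values, and an averaged binary covariate such as smoking becomes the fraction of exams at which it was present. The resulting one-row-per-subject data can then be analyzed with the same kind of model as the baseline analysis.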

Two-stage longitudinal association analysis

Another longitudinal approach employs a 2-stage strategy [4]. In the first stage, a random intercept, $a_{ij}$, representing the level of the trait for each person, was generated from a growth curve model: $Y_{ijt} = \alpha_0 + X_{ijt}\alpha_c + a_{ij} + e_{ijt}$. Here i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $Y_{ijt}$ denotes the trait at time point t, and $X_{ijt}$ denotes the covariates, including time-dependent covariates. $\alpha_0$ is the fixed intercept of the first stage; $a_{ij}$ is the random intercept; $e_{ijt}$ is the residual error. As above, the covariance structure of the vector $a_i$ is $2\sigma_a^2\Phi_i$, which contributes a diagonal block for each pedigree to the overall covariance matrix. In the second stage, the random intercept is treated as the "new" trait and regressed on a SNP as follows: $a_{ij} = \gamma_0 + \beta_g \mathrm{SNP}_{ij} + g_{ij} + \epsilon_{ij}$. Here $\mathrm{SNP}_{ij}$ denotes the SNP dosage. $\gamma_0$ is the intercept of the second stage; $\beta_g$ is the SNP effect size; $\epsilon_{ij}$ is an error term with a mean of 0 and a variance of $\sigma_\epsilon^2$; $g_{ij}$ is the random intercept that adjusts for the familial correlation of $a_{ij}$; and, similarly, the vector $g_i$ is normally distributed with a mean of 0 and a covariance matrix of $2\sigma_g^2\Phi_i$, contributing a diagonal block for each pedigree to the overall covariance matrix. Gauderman et al [4] pointed out that a mean-based statistic is more powerful to detect a genetic association than a slope-based statistic (eg, a random slope). We therefore adopted the random intercept of the first stage, rather than the random slope, as the "trait" in the second stage. The first-stage model was implemented using lmekin of the R (version 2.15.1) package "coxme" [6], which could handle more than 1 random effect; the second-stage model was implemented using lmekin of the R (version 2.9.2) package "kinship" [5], which adopted a faster computing algorithm. Both packages used maximum likelihood in parameter estimation.
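To make the 2-stage idea concrete, here is a deliberately simplified sketch, not the lmekin-based analysis described above. It assumes no covariates, no kinship adjustment, known variance components (the hypothetical `var_b` and `var_e`), and the pooled mean as the fixed intercept. Stage 1 computes a shrunken per-subject intercept; stage 2 regresses those intercepts on SNP dosage by ordinary least squares.

```python
from statistics import mean


def shrunken_intercepts(observations, var_b, var_e):
    """Stage 1: per-subject random-intercept predictions.

    With n repeated measures, the deviation of a subject's mean from the
    overall mean is shrunk by n*var_b / (n*var_b + var_e).
    """
    overall = mean(v for values in observations.values() for v in values)
    intercepts = {}
    for subject, values in observations.items():
        n = len(values)
        weight = n * var_b / (n * var_b + var_e)
        intercepts[subject] = weight * (mean(values) - overall)
    return intercepts


def ols_slope(x, y):
    """Stage 2: least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


# Hypothetical data: 3 subjects, 2 exams each, SNP dosage 0, 1, 2.
traits = {"s1": [10, 12], "s2": [20, 22], "s3": [30, 32]}
dosage = {"s1": 0, "s2": 1, "s3": 2}

stage1 = shrunken_intercepts(traits, var_b=3.0, var_e=2.0)
subjects = sorted(traits)
slope_two_stage = ols_slope([dosage[s] for s in subjects],
                            [stage1[s] for s in subjects])
slope_raw_means = ols_slope([dosage[s] for s in subjects],
                            [mean(traits[s]) for s in subjects])
```

In this toy example the shrinkage weight is 0.75, so the second-stage slope is 0.75 times the slope obtained from the raw subject means. Shrinkage of this kind makes second-stage effect sizes smaller in magnitude than those from an analysis of unshrunken means.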

Power and type I error

We conducted power calculations for all 4 methods and evaluated type I error by means of the genomic control value. We chose the variant (chromosome 3: 47956424) on gene MAP4, the top variant influencing simulated SBP and DBP, as the functional variant for power calculations. To determine power, we tested the null hypothesis that the trait SBP was not associated with the functional variant, versus the alternative hypothesis that it is associated. Therefore, results would be considered statistically significant if the p value obtained using the analysis methods fell below a predetermined threshold. Here we divided the significance level 0.05 by the approximate number (25,676) of independent SNPs on chromosome 3 to adjust for multiple testing. We used PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) [8] to prune out SNPs on chromosome 3 where the pairwise linkage disequilibrium was 0.2 or greater, and 25,676 SNPs remained. For each of the 4 methods, the estimated power was the proportion of replicates in which the method detected a significant association between the trait and the functional variant. For each of the 4 methods, genomic control value was used to assess the extent of the inflation of type I error, based on the p value of common variants on chromosome 3.
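The quantities in this paragraph can be sketched in a few lines. The block below is an illustration under assumptions, not the authors' code: it computes the multiple-testing threshold from the numbers given above, estimates power as the proportion of replicates below that threshold, and uses the common median-based definition of the genomic control value (median 1-df chi-square statistic divided by its expected median under the null, about 0.4549). The function names are hypothetical.

```python
from statistics import NormalDist, median

ALPHA = 0.05
N_INDEPENDENT_SNPS = 25_676          # SNPs remaining after LD pruning
THRESHOLD = ALPHA / N_INDEPENDENT_SNPS

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def estimated_power(replicate_p_values, threshold=THRESHOLD):
    """Proportion of replicates whose p value falls below the threshold."""
    hits = sum(p < threshold for p in replicate_p_values)
    return hits / len(replicate_p_values)


def genomic_control_lambda(p_values):
    """Median-based genomic control value from two-sided p values in (0, 1]."""
    normal = NormalDist()
    # A two-sided p value maps to a 1-df chi-square statistic via z squared.
    chi2 = [normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical inputs: four replicate p values and evenly spread null p values.
toy_power = estimated_power([1e-7, 1e-3, 5e-7, 0.2])
null_like = [(k + 0.5) / 1000 for k in range(1000)]
toy_lambda = genomic_control_lambda(null_like)
```

A genomic control value close to 1 indicates that the distribution of test statistics matches the null expectation; values well above 1 would suggest inflated type I error.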

Results

Association analysis of real data

For SBP, there were no shared results in the top 10 hits between the baseline approach and the 3 longitudinal methods (Table 1). Some shared genes identified by the longitudinal methods were FGF12 and FHIT. The mean measure and 2-stage methods yielded similar results. For DBP, the 3 longitudinal methods yielded consistent results (as shown in Figure 1, right side): the top 10 hits came from the same gene (CACNA2D3 in Table 2; eg, SNP 3_54748234 has a p value of 2.65 × 10^-7), with SNPs nearly reaching a Bonferroni significance threshold. This gene was also found using the baseline method but was less significant (rank = 2, p = 2.76 × 10^-5 in Table 2).
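Table 1

Top 10 hits of SBP on chromosome 3 across the baseline method and the 3 longitudinal methods

Baseline

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_149871159 | 5.31 | 1.22 | 1.63E-05 | LOC646903, |
| 2 | 3_133160911 | 4.02 | 0.93 | 1.83E-05 | **BFSP2** |
| 3 | 3_149894219 | 5.10 | 1.22 | 3.41E-05 | LOC646903, |
| 4 | 3_122390279 | 5.24 | 1.26 | 3.65E-05 | PARP14 |
| 5 | 3_57173021 | 3.87 | 0.93 | 3.75E-05 | **IL17RD** |
| 6 | 3_120020489 | 3.73 | 0.90 | 3.95E-05 | LRRC58 |
| 7 | 3_120023242 | 3.73 | 0.90 | 3.95E-05 | LRRC58 |
| 8 | 3_108188993 | −3.77 | 0.91 | 4.00E-05 | **MYH15** |
| 9 | 3_140640076 | 7.59 | 1.84 | 4.08E-05 | SLC25A36 |
| 10 | 3_158228266 | −4.87 | 1.18 | 4.31E-05 | **RSRC1** |

Longitudinal

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_106220130 | −4.36 | 1.02 | 2.16E-05 | LOC100302640 |
| 2 | 3_113652027 | 3.77 | 0.89 | 2.34E-05 | **GRAMD1C** |
| 3 | 3_106217172 | −4.33 | 1.03 | 2.57E-05 | LOC100302640 |
| 4 | 3_165046920 | 4.21 | 1.00 | 2.82E-05 | SLITRK3 |
| 5 | 3_106220437 | −4.29 | 1.02 | 2.89E-05 | LOC100302640 |
| 6 | 3_59966975 | 3.58 | 0.86 | 3.15E-05 | **FHIT** |
| 7 | 3_192239815 | −5.47 | 1.31 | 3.20E-05 | **FGF12** |
| 8 | 3_192240010 | −5.28 | 1.27 | 3.27E-05 | **FGF12** |
| 9 | 3_72678387 | −4.30 | 1.04 | 3.38E-05 | SHQ1 |
| 10 | 3_106218053 | −4.28 | 1.04 | 4.20E-05 | LOC100302640 |

Mean measure

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_106220130 | −4.68 | 1.05 | 9.16E-06 | BCHE |
| 2 | 3_106220437 | −4.65 | 1.05 | 1.06E-05 | **FGF12** |
| 3 | 3_106217172 | −4.66 | 1.05 | 1.07E-05 | **FGF12** |
| 4 | 3_106218053 | −4.62 | 1.07 | 1.64E-05 | **FHIT** |
| 5 | 3_106219390 | −4.62 | 1.07 | 1.64E-05 | **DOCK3** |
| 6 | 3_106231571 | −4.54 | 1.05 | 1.71E-05 | BCHE |
| 7 | 3_106232849 | −4.51 | 1.05 | 1.86E-05 | BCHE |
| 8 | 3_106220258 | −4.39 | 1.02 | 1.86E-05 | **ZNF385D** |
| 9 | 3_106220368 | −4.45 | 1.05 | 2.41E-05 | **MYH15** |
| 10 | 3_72678387 | −4.53 | 1.07 | 2.56E-05 | BCHE |

Two-stage

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_165046920 | 2.96 | 0.68 | 1.33E-05 | SLITRK3 |
| 2 | 3_59966975 | 2.52 | 0.58 | 1.59E-05 | **FHIT** |
| 3 | 3_192240010 | −3.67 | 0.86 | 2.14E-05 | **FGF12** |
| 4 | 3_192239815 | −3.81 | 0.90 | 2.31E-05 | **FGF12** |
| 5 | 3_50996289 | 4.18 | 0.99 | 2.55E-05 | **DOCK3** |
| 6 | 3_165049274 | 2.75 | 0.66 | 3.35E-05 | SLITRK3 |
| 7 | 3_165046402 | 2.71 | 0.66 | 4.21E-05 | SLITRK3 |
| 8 | 3_165053404 | 2.63 | 0.66 | 6.83E-05 | SLITRK3 |
| 9 | 3_21520730 | 4.60 | 1.16 | 7.82E-05 | **ZNF385D** |
| 10 | 3_113652027 | 2.40 | 0.60 | 7.94E-05 | **GRAMD1C** |

*In the field "closest genes", a bold gene name indicates that the SNP on the same row is right on that gene.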

Figure 1

Manhattan plots on chromosome 3 using both the baseline and 3 longitudinal methods for SBP and DBP.

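Table 2

Top 10 hits of DBP on chromosome 3 across the baseline method and the 3 longitudinal methods

Baseline

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_164797024 | 4.98 | 1.17 | 2.48E-05 | **SI** |
| 2 | 3_54748368 | −2.15 | 0.51 | 2.76E-05 | **CACNA2D3** |
| 3 | 3_124142019 | 2.85 | 0.69 | 4.62E-05 | **KALRN** |
| 4 | 3_186144694 | −3.17 | 0.78 | 5.83E-05 | LOC253573 |
| 5 | 3_38845381 | −3.68 | 0.91 | 6.06E-05 | SCN10A |
| 6 | 3_186209848 | −3.05 | 0.78 | 1.07E-04 | LOC253573 |
| 7 | 3_87619500 | 2.32 | 0.60 | 1.09E-04 | POU1F1 |
| 8 | 3_177961323 | −2.05 | 0.53 | 1.15E-04 | KCNMB2-IT1 |
| 9 | 3_72651668 | 2.05 | 0.53 | 1.32E-04 | SHQ1 |
| 10 | 3_186149493 | −3.04 | 0.79 | 1.34E-04 | LOC253573 |

Longitudinal

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_54748234 | 2.20 | 0.43 | 2.65E-07 | **CACNA2D3** |
| 2 | 3_54757032 | 2.21 | 0.43 | 3.21E-07 | **CACNA2D3** |
| 3 | 3_54748368 | −2.15 | 0.42 | 3.84E-07 | **CACNA2D3** |
| 4 | 3_54784952 | 2.11 | 0.43 | 7.90E-07 | **CACNA2D3** |
| 5 | 3_54793253 | 2.08 | 0.42 | 9.59E-07 | **CACNA2D3** |
| 6 | 3_54779240 | 2.08 | 0.43 | 1.04E-06 | **CACNA2D3** |
| 7 | 3_54756448 | −2.10 | 0.43 | 1.07E-06 | **CACNA2D3** |
| 8 | 3_54756196 | 2.06 | 0.43 | 1.46E-06 | **CACNA2D3** |
| 9 | 3_54793450 | −2.07 | 0.43 | 1.57E-06 | **CACNA2D3** |
| 10 | 3_54740011 | 2.05 | 0.43 | 1.75E-06 | **CACNA2D3** |

Mean measure

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_54748234 | 2.23 | 0.44 | 3.73E-07 | **CACNA2D3** |
| 2 | 3_54757032 | 2.22 | 0.44 | 5.77E-07 | **CACNA2D3** |
| 3 | 3_54748368 | −2.17 | 0.43 | 6.80E-07 | **CACNA2D3** |
| 4 | 3_54793253 | 2.09 | 0.43 | 1.67E-06 | **CACNA2D3** |
| 5 | 3_54784952 | 2.09 | 0.44 | 2.05E-06 | **CACNA2D3** |
| 6 | 3_54799449 | 2.09 | 0.44 | 2.31E-06 | **CACNA2D3** |
| 7 | 3_54756448 | −2.09 | 0.44 | 2.35E-06 | **CACNA2D3** |
| 8 | 3_54756196 | 2.08 | 0.44 | 2.48E-06 | **CACNA2D3** |
| 9 | 3_54747244 | 2.07 | 0.44 | 2.65E-06 | **CACNA2D3** |
| 10 | 3_54740011 | 2.07 | 0.44 | 2.67E-06 | **CACNA2D3** |

Two-stage

| Rank | SNP | Effect size | SE | P | Closest genes* |
|---|---|---|---|---|---|
| 1 | 3_54757032 | 1.10 | 0.21 | 1.83E-07 | **CACNA2D3** |
| 2 | 3_54748234 | 1.11 | 0.21 | 2.39E-07 | **CACNA2D3** |
| 3 | 3_54793253 | 1.05 | 0.21 | 6.86E-07 | **CACNA2D3** |
| 4 | 3_54799449 | 1.05 | 0.21 | 7.79E-07 | **CACNA2D3** |
| 5 | 3_54784952 | 1.05 | 0.21 | 7.90E-07 | **CACNA2D3** |
| 6 | 3_54748368 | −1.05 | 0.21 | 8.15E-07 | **CACNA2D3** |
| 7 | 3_54779240 | 1.03 | 0.21 | 9.61E-07 | **CACNA2D3** |
| 8 | 3_54807320 | 1.03 | 0.21 | 9.67E-07 | **CACNA2D3** |
| 9 | 3_54756196 | 1.03 | 0.21 | 1.26E-06 | **CACNA2D3** |
| 10 | 3_54799706 | 1.02 | 0.21 | 1.27E-06 | **CACNA2D3** |

*In the field "closest genes", a bold gene name indicates that the SNP on the same row is right on that gene.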

Power was computed to assess the baseline method and the 3 longitudinal methods (Table 3). Each of the 3 longitudinal methods had power at least 10.5 percentage points higher than the baseline method. Among the longitudinal methods, the power of the mean measure and 2-stage methods was comparable (41% and 40.5%, respectively) and substantially higher than that of the linear mixed-effects (LME) method (32.5%). None of the 4 methods showed elevated type I error: the genomic control values ranged from about 0.98 to 1.034.
Table 3

Power calculation of all 4 methods (based on the 200 simulations)

| Method | Baseline | Mean measure | Two-stage | LME |
|---|---|---|---|---|
| Power | 22% | 41% | 40.5% | 32.5% |
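As a quick arithmetic check on Table 3, the reported percentages can be converted back into replicate counts out of 200 and compared with the baseline. This is a small sketch using only the reported values; the variable names are illustrative.

```python
N_REPLICATES = 200

# Power (%) as reported in Table 3.
power_pct = {"Baseline": 22.0, "Mean measure": 41.0, "Two-stage": 40.5, "LME": 32.5}

# Number of replicates in which each method detected the functional variant.
significant_replicates = {
    method: round(pct / 100 * N_REPLICATES) for method, pct in power_pct.items()
}

# Gain over the baseline method, in percentage points.
gain_over_baseline = {
    method: pct - power_pct["Baseline"]
    for method, pct in power_pct.items()
    if method != "Baseline"
}
smallest_gain = min(gain_over_baseline.values())
```

The percentages correspond to 44, 82, 81, and 65 significant replicates, and the smallest gain over baseline (for LME) is 10.5 percentage points.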

Discussion and conclusions

For both traits, the genes identified by the 3 longitudinal methods were consistent with one another but different from those found with the baseline approach. In terms of computation time, the mean measure and 2-stage methods were more efficient than the LME method. Furthermore, these 2 longitudinal methods were more powerful than the LME method, so they can act as efficient and powerful "substitutes" for LME. The mean measure method worked as well as the 2-stage method, identifying the same genes. The signals found with the 2-stage method (third row of Manhattan plot in Figure 1) were almost identical to those found with the LME method, for both SBP and DBP. Therefore, we concluded that the mean measure and 2-stage methods were 2 efficient ways to analyze longitudinal data when the goal is to examine the level of a trait. Only the longitudinal approach can evaluate associations with trends over time.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SW, WG, JN, CA, CTL, and LAC designed the overall study; SW, WG, JN, and CA conducted statistical analyses; and SW, WG, and JN drafted the manuscript. All authors read and approved the final manuscript.

References

W James Gauderman; Stuart Macgregor; Laurent Briollais; Katrina Scurrah; Martin Tobin; Taesung Park; Dai Wang; Shaoqi Rao; Sally John; Shelley Bull. Longitudinal data analysis in pedigree studies. Genet Epidemiol. 2003.

Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007-07-25.

N M Laird; J H Ware. Random-effects models for longitudinal data. Biometrics. 1982-12.

Erin N Smith; Wei Chen; Mika Kähönen; Johannes Kettunen; Terho Lehtimäki; Leena Peltonen; Olli T Raitakari; Rany M Salem; Nicholas J Schork; Marian Shaw; Sathanur R Srinivasan; Eric J Topol; Jorma S Viikari; Gerald S Berenson; Sarah S Murray. Longitudinal genome-wide association of cardiovascular disease risk factors in the Bogalusa heart study. PLoS Genet. 2010-09-09.

Wensheng Zhu; Kelly Cho; Xiang Chen; Meizhuo Zhang; Minghui Wang; Heping Zhang. A genome-wide association analysis of Framingham Heart Study longitudinal data using multivariate adaptive splines. BMC Proc. 2009-12-15.

Jian'an Luan; Berit Kerner; Jing-Hua Zhao; Ruth Jf Loos; Stephen J Sharp; Bengt O Muthén; Nicholas J Wareham. A multilevel linear mixed model of the association between candidate genes and weight and body mass index using the Framingham longitudinal family data. BMC Proc. 2009-12-15.