
Genetic association tests in family samples for multi-category phenotypes.

Shuai Wang¹, James B Meigs²,³,⁴, Josée Dupuis⁵.

Abstract

BACKGROUND: Advancements in statistical methods and sequencing technology have led to numerous novel discoveries in human genetics in the past two decades. Among phenotypes of interest, most attention has been given to studying genetic associations with continuous or binary traits. Efficient statistical methods have been proposed and are available for both types of traits under different study designs. However, for multinomial categorical traits in related samples, there is a lack of efficient statistical methods and software.
RESULTS: We propose an efficient score test to analyze a multinomial trait in family samples, in the context of genome-wide association/sequencing studies. An alternative Wald statistic is also proposed. We also extend the methodology to ordinal traits. We performed extensive simulation studies to evaluate the type-I error of the score and Wald tests, compared with multinomial logistic regression for unrelated samples, under different allele frequencies and study designs. We also evaluated the power of these methods. Results show that both the score and Wald tests have a well-controlled type-I error rate, but the multinomial logistic regression has an inflated type-I error rate when applied to family samples. We illustrate the score test with an application to the Framingham Heart Study to uncover genetic variants associated with diabesity, a multi-category phenotype.
CONCLUSION: Both proposed tests have correct type-I error rates and similar power. However, because the Wald statistic relies on computationally intensive estimation, it is less efficient than the score test for large-scale genetic association studies. We provide a computer implementation for both multinomial and ordinal traits.


Keywords: Categorical; EGEE; Family samples; Framingham heart study; GWAS; Multinomial; Ordinal; Score test; Sequencing; Wald test


Year: 2021 PMID: 34863089 PMCID: PMC8642939 DOI: 10.1186/s12864-021-08107-x

Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547

Background

Genetic association tests for continuous or binary phenotypes have uncovered many susceptibility genes or variants related to diseases. Various methods and efficient software have been developed and used for continuous and binary traits. For family samples, because the correlation between relatives violates the independence assumption of ordinary linear regression, alternative approaches have been proposed. For example, Therneau and colleagues developed an R package (coxme) implementing a linear mixed effects model to evaluate the association between a genetic variant and a continuous trait or survival outcome, accounting for the correlation present in family samples. Similar extensions to account for familial correlation using mixed effects models have been proposed for gene-based association tests [1]. Progress in family sample designs has been restricted mostly to quantitative or binary traits. However, methods are needed to study categorical traits with more than two categories in family samples. For example, the phenotype diabesity has been defined as a four-category variable (diabetes & obesity, diabetes but no obesity, obesity but no diabetes and no diabetes & no obesity) constructed jointly from type 2 diabetes and obesity.

Currently, approaches for genetic association analysis of such multinomial traits are limited. Zhang and colleagues [2] proposed a proportional odds logistic model which allows for the inclusion of covariates. However, it has a few limitations. First, this approach is restricted to nuclear families and cannot handle complex family structures. Second, no software implementation has been made publicly available. Diao and Lin [3] proposed a general framework for linkage and association tests for ordinal traits. Their method used adaptive Gaussian quadrature to approximate the maximum log-likelihood, and a likelihood ratio test was proposed to test the hypothesis of no association between a genetic variant and an ordinal trait of interest. Again, this approach has not been widely used, owing to the lack of computationally efficient software and because the likelihood ratio test is computationally intensive. Another possible option is to use the SAS generalized linear mixed models (GLMM) procedure, which can incorporate a kinship matrix. However, in real applications, the current implementation of the GLMM cannot handle extended families because of the computational burden. More recently, Wang and colleagues [4] proposed a Bayesian framework incorporating the kinship matrix as a random effect, which, however, cannot be applied to large-scale genetic studies because it lacks computational efficiency. Bi and colleagues [5] proposed a computationally efficient framework (POLMM) specifically for ordinal traits. Because it does not allow a user-provided kinship matrix, such as one estimated from a pedigree or by typical genetic software, it is of limited use for family-based cohort studies with known relationships. Our proposed method is complementary to these two approaches, as it can be applied to family samples without the genome-wide data needed to compute a genetic relationship matrix (GRM), and without the proportional odds assumption.

In this paper, we propose a computationally efficient score test based on extended generalized estimating equations (EGEE) for large-scale genetic studies of multi-category phenotypes accounting for familial correlation. We evaluate our approach using simulations and apply it to a genome-wide scan to identify genetic variants associated with diabesity, a four-category phenotype, with the healthy referent category being no diabetes and no obesity and the unhealthiest category, “diabese” (diabetes and obesity), having a prevalence of at least 25% in several countries [6].

Results

Type-I error

The results for family-based and unrelated samples are summarized in Tables 1 and 2, respectively. Both the score and Wald tests have well-controlled type-I error rates across all MAF scenarios except for rare variants. This conclusion applies to both family-based and unrelated designs. The multinomial logistic regression, which ignores familial correlation, returns an inflated type-I error rate in the presence of related individuals, although its type-I error rate for the unrelated study design is well controlled. In the application to an ordinal trait (Table 3), the robust score test preserves the type-I error in all MAF scenarios even though the simulated phenotype distribution is highly unbalanced. The Wald test is only very slightly inflated for very rare variants when evaluated at α = 0.0001. We have also generated QQ-plots (Additional File 3) for the robust score test and the simplified score test for results from all MAF scenarios for both multinomial and ordinal traits when applied to family-based samples. The QQ-plots are consistent with the empirical type-I error summarized in the tables below.

Table 1

Simulation results of type-I error for family-based samples

MAF	Robust Score test			Wald test			Logistic regression(LRT)
MAF	α = 0.01	α = 0.001	α = 0.0001	α = 0.01	α = 0.001	α = 0.0001	α = 0.01	α = 0.001	α = 0.0001
0.01	0.014	0.0020	0.0003	0.012	0.0024	0.00058	0.023	0.0023	0.0006
0.02	0.013	0.0020	0.0003	0.010	0.0011	0.0003	0.021	0.0027	0.0004
0.03	0.012	0.0017	0.0002	0.009	0.0012	0.0002	0.022	0.0025	0.0004
0.04	0.011	0.0014	0.0002	0.007	0.0010	0.0002	0.022	0.0025	0.0006
0.05	0.011	0.0013	0.0002	0.011	0.0008	0.0002	0.021	0.0026	0.0002
0.1	0.011	0.0010	0.0001	0.009	0.0008	0.0001	0.021	0.0024	0.0003
0.2	0.010	0.0010	0.0001	0.010	0.0010	0.0001	0.019	0.0033	0.0004
0.3	0.010	0.0010	0.0001	0.011	0.0013	0.0001	0.021	0.0033	0.0011

Table 2

Simulation results of type-I error for unrelated samples

MAF	Score test		Wald test		Logistic regression(LRT)
MAF	α = 0.01	α = 0.001	α = 0.01	α = 0.001	α = 0.01	α = 0.001
0.01	0.011	0.0010	0.008	0.0006	0.011	0.0011
0.02	0.010	0.0016	0.010	0.0016	0.012	0.0014
0.03	0.012	0.0012	0.011	0.0010	0.011	0.0014
0.04	0.011	0.0010	0.010	0.0010	0.011	0.0010
0.05	0.010	0.0010	0.006	0.0004	0.010	0.0005
0.1	0.010	0.0010	0.010	0.0008	0.010	0.0010
0.2	0.010	0.0010	0.009	0.0004	0.010	0.0010
0.3	0.009	0.0011	0.009	0.0006	0.010	0.0010

Table 3

Simulation results of type-I error for family-based samples for ordinal traits

MAF	Robust Score test			Wald test
MAF	α = 0.01	α = 0.001	α = 0.0001	α = 0.01	α = 0.001	α = 0.0001
0.01	0.010	0.0008	0.00009	0.012	0.0013	0.00019
0.02	0.009	0.0008	0.00008	0.011	0.0011	0.00013
0.03	0.010	0.0010	0.00009	0.011	0.0012	0.00012
0.04	0.009	0.0009	0.00008	0.010	0.0011	0.00011
0.05	0.010	0.0009	0.00010	0.011	0.0011	0.00014
0.1	0.010	0.0010	0.00009	0.010	0.0011	0.00009
0.2	0.010	0.0009	0.00009	0.010	0.0010	0.00012
0.3	0.009	0.0009	0.00010	0.010	0.0010	0.00012


Power evaluation

The results for family-based and unrelated samples are summarized in Tables 4 and 5, respectively. Because multinomial logistic regression leads to inflated type-I error rates in family samples, its power is not evaluated for family-based samples (Table 4). The score and Wald tests have approximately the same power in each scenario (MAF, study design). The logistic regression using LRT has approximately the same power as the other two approaches in unrelated samples.

Table 4

Power results for family-based samples

MAF	Test	α = 0.01	α = 0.001	α = 5 × 10⁻⁸
0.01	Score	97.2	92.4	42.5
0.01	Wald	96.7	90.2	29.8
0.02	Score	96.5	89.1	33.0
0.02	Wald	96.6	86.4	24.4
0.03	Score	95.5	87.5	25.6
0.03	Wald	95.1	84.6	20.5
0.04	Score	94.9	85.4	23.4
0.04	Wald	94.6	82.4	17.7
0.05	Score	94.3	83.6	20.9
0.05	Wald	93.5	81.2	15.8
0.1	Score	93.0	78.6	13.8
0.1	Wald	94.3	79.6	11.6
0.2	Score	89.4	71.7	7.6
0.2	Wald	91.1	74.3	8.0
0.3	Score	87.4	68.2	6.4
0.3	Wald	89.0	71.0	6.5

Table 5

Power results for unrelated samples

MAF	Test	α = 0.01	α = 0.001	α = 5 × 10⁻⁸
0.01	Score	95.0	85.8	26.5
0.01	Wald	94.1	82.8	15.6
0.01	Logistic (LRT)	92.9	79.6	11.4
0.02	Score	93.3	82.3	20.0
0.02	Wald	92.8	80.6	14.6
0.02	Logistic (LRT)	91.6	77.4	10.6
0.03	Score	92.7	81.1	15.9
0.03	Wald	92.4	79.5	12.7
0.03	Logistic (LRT)	91.2	76.8	9.5
0.04	Score	92.4	79.2	14.1
0.04	Wald	92.0	78.2	11.3
0.04	Logistic (LRT)	90.8	75.4	8.6
0.05	Score	92.0	77.9	13.1
0.05	Wald	91.9	77.3	10.9
0.05	Logistic (LRT)	90.9	74.7	8.2
0.1	Score	91.3	75.7	10.4
0.1	Wald	91.0	74.9	9.5
0.1	Logistic (LRT)	90.3	73.2	7.8
0.2	Score	89.9	73.2	8.1
0.2	Wald	89.7	73.0	7.5
0.2	Logistic (LRT)	89.3	71.9	6.8
0.3	Score	89.2	72.2	7.0
0.3	Wald	89.3	71.8	6.5
0.3	Logistic (LRT)	89.0	71.5	6.3


Data analysis

Low-frequency variants (MAF < 0.01) and poorly imputed variants (imputation ratio < 0.3) were excluded to avoid spurious results. All results are presented in the Manhattan plot (Fig. 1; Manhattan plots for diabetes and obesity are in Additional File 3) and the QQ-plot (Fig. 2). The variants that reached the genome-wide significance threshold of 5 × 10⁻⁸ or a suggestive threshold of 4 × 10⁻⁷ (calculated as 1/number of tests = 1/2542166) are summarized in Table 6. All variants in Table 6 are located within the AP3B1, CYP3A43 and LOC105370246 genes. is known to have variants associated with fasting insulin and HOMA-IR in African Americans without diabetes [7]. The direct association between and diabetes or obesity is not known in the literature. gene encodes a member of the cytochrome P450 superfamily of liver enzymes. Although the direct relationship between and diabetes/obesity is not well known, some variants located in have been identified in previous studies as associated with relevant metabolic traits. For instance, one study in 2011 [8] indicated that diabetes is associated with a significant decrease in hepatic enzymatic activity and protein level. Several studies have demonstrated that nonalcoholic fatty liver disease and diabetes are associated with decreased expression of the protein encoded by this gene in human livers [9, 10]. Two variants on were identified as associated with ticagrelor levels in individuals with acute coronary syndromes treated with ticagrelor [11] and with serum metabolite measurements [12], respectively. Because this gene might have clinical value for treating chronic metabolic diseases such as nonalcoholic fatty liver disease [13], future research efforts targeting this gene area are worthwhile. Additional information about this region might be discovered with targeted sequencing.

Fig. 1

Manhattan plot of diabesity using the FHS data and Hapmap imputed genotypes

Fig. 2

QQ-plot of diabesity using the FHS data and Hapmap imputed genotypes

Table 6

Top SNPs and the closest genes

Chr	Lead SNP	p-value	bp (GRCh38)	Loci (closest gene)
5	rs16875172	1.17 × 10⁻⁷ to 5.34 × 10⁻⁷	5:77422013–5:77559950	AP3B1
7	rs528144	2.99 × 10⁻⁷ to 5.56 × 10⁻⁶	7:99257162–7:99918674	CYP3A43
13	rs1925751	7.34 × 10⁻⁷ to 5.97 × 10⁻⁶	13:66763957–13:66795551	LOC105370246

For target validation purposes, we performed two additional GWAS of diabetes and obesity, respectively, using our approach (Manhattan plots in Additional File 3). The plots confirm that all signals are observed from the combined phenotype and are not driven by a single binary trait (diabetes or obesity). We applied our ordinal approach to the secondary outcome (“ordinal” diabesity) and compared the results with those obtained from POLMM, an approach for ordinal traits. The results are similar, with small differences (Figs. 3 and 4). Compared with the results obtained from the multinomial trait (Fig. 1), the ordinal trait highlights one region near on Chromosome 1 and one region near on Chromosome 4 of potential interest in the search for genes associated with “ordinal” diabesity.

Fig. 3

Manhattan plot of “ordinal” diabesity using the FHS data and Hapmap imputed genotypes (proposed ordinal approach)

Fig. 4

Manhattan plot of “ordinal” diabesity using the FHS data and Hapmap imputed genotypes (POLMM approach)


Discussion

The proposed score test offers advantages over the Wald test and the multinomial logistic regression in the following respects. First, it is more computationally efficient, especially for large-scale genetic studies such as GWAS or sequencing studies, because the iterative Fisher’s scoring algorithm is applied only once, under the null hypothesis, whereas the iterative algorithm must be run for each variant when computing the Wald test statistic. Therefore, for a large-scale genetic study, the Wald test will be less computationally efficient than the score test. Table 7 summarizes the computing time for the score and Wald tests for different sample sizes, as implemented in R functions using a 3-category multinomial phenotype on an i7-8565U processor with 16GB RAM. Second, the simulation studies show that the type-I error of both the score and Wald tests is well controlled for most scenarios. In contrast, the multinomial logistic regression results in a very inflated type-I error rate for the family-based design when the familial correlation is ignored, and it is therefore not recommended for family-based studies. When the phenotypes are extremely unbalanced, e.g. when the allocation ratio of the 4 categories is approximately 2.5:1:10:23, both the score and Wald tests can show a slightly inflated type-I error for rare variants in the simulation studies. This behavior has been noted for most approaches [5]. However, when the phenotype distribution is more balanced, the tests return valid type-I error rates for all MAF scenarios, as demonstrated in the QQ-plot of the FHS data analysis (Fig. 2). We have observed that the type-I error for ordinal traits is very robust to an unbalanced distribution of the phenotype for all MAF scenarios (Table 3), as indicated by the type-I error rate calculated from 500,000 simulations by treating the simulated phenotype as an ordinal trait. Lastly, the score test has approximately the same power as the Wald test under the scenarios we evaluated.

Table 7

Computing time of robust score and Wald tests on an i7-8565U processor with 16GB RAM

Sample size	Robust score test	Wald test
5000 (182 families)	3.09 s (initial) + 1.17 s per SNP	3.44 s per SNP
10,000 (364 families)	4.47 s (initial) + 2.45 s per SNP	6.92 s per SNP
20,000 (728 families)	5.47 s (initial) + 5.25 s per SNP	10.95 s per SNP
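To put the per-SNP timings of Table 7 in genome-wide terms, the following minimal sketch extrapolates total run time for a given number of variants. It assumes single-threaded, strictly linear scaling in the number of SNPs, which is a simplification and not something measured in the study; the function name `total_seconds` is hypothetical.

```python
# Timings copied from Table 7: (score initial s, score s per SNP, Wald s per SNP).
TIMINGS = {
    5000: (3.09, 1.17, 3.44),
    10000: (4.47, 2.45, 6.92),
    20000: (5.47, 5.25, 10.95),
}

def total_seconds(sample_size, n_snps):
    """Return (score_total, wald_total) in seconds, assuming linear scaling."""
    initial, score_per_snp, wald_per_snp = TIMINGS[sample_size]
    score_total = initial + score_per_snp * n_snps  # null model fitted once
    wald_total = wald_per_snp * n_snps              # full model refitted per SNP
    return score_total, wald_total

# Number of tests reported for the FHS scan (1/2542166 suggestive threshold).
score_s, wald_s = total_seconds(5000, 2542166)
print(f"score: {score_s / 86400:.1f} days, Wald: {wald_s / 86400:.1f} days")
```

The one-off null-model fit is negligible at this scale, so the ratio of the two totals is essentially the ratio of the per-SNP costs.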

It is worth noting that the EGEE reduce to the score equations of generalized linear models for a multinomial variable when applied to unrelated samples. Because the same iteratively reweighted least squares method is employed in this case, the parameter estimates are identical to those obtained using a generalized linear model function for multinomial variables. This equivalence extends the applicability of the approach to a general population, regardless of the underlying study design.

The score test can be readily extended to ordinal traits (i.e. categorical traits for which the values are ordered) in family samples. Due to the nature of the ordinal regression model, fewer regression parameters are estimated. Because applications to ordinal traits are a special case of the general framework with reduced complexity, the validity of the simulation results should hold when applied to ordinal traits. When K = 2, i.e. a trait with only two categories, the estimates are the same using either the multinomial or the ordinal function, i.e. they are the estimates of a binary logistic regression accounting for familial correlation.

Our proposed approaches have enabled the identification of a few loci associated with diabesity. As discussed, none of the signals were driven solely by one of the two binary traits (diabetes or obesity). Targeted sequencing might reveal more information by providing a more comprehensive overview of rare and low-frequency variants in those specific regions. We also compared our ordinal approach with POLMM for an ordinal trait and found that both approaches revealed similar regions of association.

Conclusions

Score tests should be considered for large-scale genetic association testing due to their computational advantage. Because the Wald test also has valid type-I error rates and its computational efficiency is comparable to that of the score test (Table 7), the Wald test can also be applied to large-scale genetic studies if computing resources allow. As illustrated using Framingham Heart Study data, the proposed score test has enabled the identification of several loci associated with diabesity. One of the drawbacks of the score test is the lack of effect estimates. When only a handful of associated variants are identified from a genetic association study, the effect size and statistical significance of each variant can be estimated using the Wald test. In addition to the multinomial application, we have also provided a computer implementation for ordinal traits in Additional File 2. Although we presented association results from additively coded genetic variants, the application and implementation are not restricted to SNPs but are also applicable to a genetic risk score, a weighted-sum gene test [14], and other genetic summary measures.

Methods

Assume that there are N independent families (i = 1, …, N), with $n_i$ individuals in family i and a total of $\sum_{i=1}^{N} n_i$ subjects. The basic model for a K-category (multinomial) trait, with the Kth level chosen as the reference level, is written as

$$g\{P(Y_i = k)\} = \alpha_k + \beta_k G_i + X_i \gamma_k, \qquad k = 1, \dots, K-1.$$

The $n_i \times 1$ response variable $Y_i$ has K unordered levels, i.e. k = 1, …, K, resulting in (K-1) equations; $G_i$ is the genotype vector of size $n_i \times 1$; $X_i$ is the $n_i \times q$ covariate matrix; $\alpha = (\alpha_1, \dots, \alpha_k, \dots, \alpha_{K-1})$ is the intercept vector for the (K-1) equations; $\beta = (\beta_1, \dots, \beta_k, \dots, \beta_{K-1})$ is the effect size vector of the genotype in the (K-1) equations; and $\gamma = (\gamma_1, \dots, \gamma_k, \dots, \gamma_{K-1})$ are the parameters of the covariates $X_i$ for the (K-1) equations, with a dimension of q × 1 for each $\gamma_k$. Although there are a variety of choices for the link function g, here we demonstrate with the canonical link function, the generalized logit, i.e.

$$g\{P(Y_i = k)\} = \log\frac{P(Y_i = k)}{P(Y_i = K)}.$$
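As an illustration of the generalized-logit model described above, the sketch below turns a set of intercepts, genotype effects and covariate effects into the K category probabilities for one individual. The function name and the numeric values are hypothetical and chosen only for demonstration; this is not the implementation supplied in the Additional Files.

```python
import math

def multinomial_probs(alpha, beta, gamma, genotype, covariates):
    """Category probabilities under a baseline-category (generalized) logit.

    alpha, beta: lists of length K-1; gamma: K-1 lists of length q.
    Category K is the reference, so its linear predictor is fixed at 0.
    """
    etas = [
        a + b * genotype + sum(c * x for c, x in zip(g, covariates))
        for a, b, g in zip(alpha, beta, gamma)
    ]
    etas.append(0.0)                        # reference category K
    shift = max(etas)                       # subtract max for numerical stability
    weights = [math.exp(e - shift) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameter values for a 4-category trait with covariates (age, sex).
probs = multinomial_probs(
    alpha=[-2.0, -1.5, -0.5],
    beta=[0.2, 0.1, 0.0],
    gamma=[[0.01, 0.3], [0.02, -0.1], [0.0, 0.2]],
    genotype=1,
    covariates=[50.0, 1.0],
)
print(probs)
```

Under the null hypothesis every element of `beta` is zero, so the probabilities no longer depend on the genotype.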

Extended generalized estimating equations (EGEE)

We adopt the idea of EGEE previously proposed [15, 16] to approximate the likelihood using quasi-likelihood, to handle correlated observations. The variance of the response variable Y, is defined using (K-1) indicator variables as follows: = [I(Y = 1), …, I(Y = (K − 1))] ’. The expected value of is E[] = [P(Y = 1), …, P(Y = (K − 1))]′ and the variance of can be derived as: Let = r where is a matrix of ones with a dimension of (K-1) by (K-1), and r is an unknown correlation parameter to be estimated with value between −1 and 1. The implementation of the approach provided in Additional File 1 can also accommodate two-parameter with diagonal elements set to r1 and all off-diagonal elements set to r2. The matrix is used to model the correlation between any two individuals in the same family along with the use of relationship matrix, such that = Φ ⨂ ( Φ is the relationship matrix of the i-th family defined as twice the kinship matrix), similar to how the familial correlation was handled in previous publications [17, 18]. , the overall variance matrix of for the i-th independent family is constructed as sd()sd() with the variance of each subject var() (j = 1, …, n), as derived above, where and The following score equations of EGEE [15, 16, 19] are used to estimate the regression parameters = (α1, β1, , …, α, β, ) and the correlation parameter r. where the n(K − 1) × (2 + q)(K − 1) matrix is stacked vertically from (j = 1, …, n) and defined as ; is the vectorized with a dimension of by 1, is an identity matrix with a size of and is the vectorized version of . Similarly, is vectorized version derived from the following: where (j = 1, …, n) and (k = 1, …, (K-1)) is defined as I(Y = k) − P(Y = k). Therefore, E[] = . Fisher’s scoring algorithm is used to update both and r from m-th iteration to (m + 1)-th iteration, written as where and D stands for the first-order derivative with respect to (, r), until the pre-specified convergence criterion is met. 
Estimates of multinomial logistic regression and r = 0 or 0.5 usually work well in terms of starting values. Note the score equations will be reduced to the following GEE form [20] when applied to N unrelated samples. The coefficients estimation will follow the same iteratively reweighted least square method of generalized linear model [21] for multinomial outcome until a pre-specified convergence criterion is met.
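Two building blocks named in this section can be illustrated numerically: the covariance of the (K-1) indicator variables for one subject, which for a multinomial outcome is diag(p) − pp′, and the Kronecker product of a relationship matrix Φ with the (K-1) × (K-1) matrix r·J. The sketch below is an assumption-laden illustration only: the helper names are hypothetical, and it does not reproduce how the authors assemble the full family variance matrix (in particular, the treatment of the within-person diagonal blocks is not recoverable from the text here).

```python
def indicator_covariance(p):
    """Covariance of [I(Y=1), ..., I(Y=K-1)] given p = [P(Y=1), ..., P(Y=K-1)]."""
    k = len(p)
    return [
        [(p[a] if a == b else 0.0) - p[a] * p[b] for b in range(k)]
        for a in range(k)
    ]

def kronecker(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]

# Hypothetical 4-category trait: probabilities of the three non-reference levels.
p = [0.05, 0.05, 0.30]
cov = indicator_covariance(p)

# Relationship matrix (twice the kinship) for a parent-offspring pair.
phi = [[1.0, 0.5], [0.5, 1.0]]
r = 0.2                                   # illustrative correlation parameter
R = [[r] * 3 for _ in range(3)]           # r times a matrix of ones
between_person = kronecker(phi, R)        # 6 x 6 template
```

The off-diagonal 3 × 3 blocks of `between_person` equal 0.5·r·J, so the correlation assigned to a pair of relatives scales with their degree of relatedness.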

Robust score test

To determine if a genetic variant is associated with a multi-category phenotype, the following null hypothesis is tested H0 : = 0. We first define the score vectors , , U(3) = U. The score statistic is proposed as follows: Where are parameter estimates under H0 : = 0. , (, r) = (−2111−1, ) with subscript 2 denoting rows/columns that correspond to β, subscript 1 denoting rows/columns that correspond to γ …γ, and is an identity matrix of size (K-1). The score statistic follows a asymptotically according to the derivation for bivariate association testing in family samples [17, 22]. One of the major advantages is its robustness to incorrect variance specification. If the variance (i = 1, … N) is pre-specified correctly, then var((, r)) will equal to ∗ restricted to β, γ …γ, and the score statistic will be simplified to where (The subscript 2 denotes the (K-1) row/columns corresponding to β1, …, β(; “-“denotes excluding these rows/columns) and .
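The final step of a score test of this kind is a quadratic form in the score vector for β, referred to a chi-square distribution. The sketch below shows only that last step, assuming the (K-1)-dimensional score vector `u` and its variance estimate `v` have already been computed under the null; the degrees of freedom K-1 are inferred from the dimension of β. It uses a generic linear solver and a closed-form chi-square tail, and it is not the authors' R implementation.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square with integer df (closed form)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        return math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))
    tail = math.erfc(math.sqrt(h))
    tail += math.exp(-h) * sum(
        h ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1)
    )
    return tail

def quadratic_form_test(u, v):
    """Return (statistic, p-value) for u' v^{-1} u with len(u) degrees of freedom."""
    stat = sum(ui * xi for ui, xi in zip(u, solve(v, u)))
    return stat, chi2_sf(stat, len(u))

# Hypothetical score vector and variance for a 4-category trait (K - 1 = 3).
stat, p_value = quadratic_form_test(
    [1.0, 2.0, 2.0],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
)
```

Because `u` and `v` depend on the variant only through quantities evaluated at the null fit, the iterative estimation does not need to be repeated per SNP.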

Wald test

The Wald test is an alternative test with lower computational efficiency when applied to a large-scale genetic study. The Wald test statistic is proposed as follows: This test statistic follows a asymptotically. The parameters are obtained from the score equations with no constraints (i.e. H0 ∪ H) until the pre-specified convergence criterion is met. The full variance matrix of all parameters is derived as . is extracted from , a sandwich-type variance estimator [19], with rows and columns corresponding to .
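For orientation, a Wald statistic of this type has the generic textbook form below. The notation is an assumption made for illustration, since the symbols of the original display are not recoverable here: $\hat\beta$ denotes the unconstrained estimate of the (K-1) genotype effects and $\widehat{\mathrm{Var}}(\hat\beta)$ the corresponding block of the sandwich variance estimator. The K-1 degrees of freedom are inferred from the dimension of β.

```latex
W = \hat\beta^{\top}\,\bigl[\widehat{\mathrm{Var}}(\hat\beta)\bigr]^{-1}\,\hat\beta
\;\xrightarrow{d}\; \chi^2_{K-1}
\quad \text{under } H_0:\ \beta_1 = \dots = \beta_{K-1} = 0 .
```

Unlike the score statistic, $\hat\beta$ must be re-estimated for every variant, which is the source of the additional computing cost.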

Ordinal traits

Under the same framework, using the statistical theory of ordinal regression, the above score and Wald tests can be easily extended to test the association of a genetic variant with an ordinal trait in a family-based design. More specifically, because P(Y = k) can be derived from P(Y = k) = P(Y ≤ k) − P(Y ≤ k − 1) using proportional cumulative logit models, the same EGEE equations are used for parameter estimation. However, the dimensions of the EGEE equations are reduced, and the mathematical formulas of the matrix elements are derived differently because of the use of proportional cumulative logit models. A computer implementation for both multinomial and ordinal phenotypes is provided in Additional Files 1 and 2.
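The identity P(Y = k) = P(Y ≤ k) − P(Y ≤ k − 1) can be sketched as follows for a proportional cumulative logit model. The sign convention logit P(Y ≤ k) = θ_k + η, the function names and the cutpoint values are assumptions for illustration; the paper's own parameterization may differ.

```python
import math

def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def ordinal_probs(cutpoints, eta):
    """P(Y = k), k = 1..K, from K-1 increasing cutpoints and one linear predictor.

    Assumes logit P(Y <= k) = cutpoints[k-1] + eta, with the same eta for every k
    (the proportional odds assumption).
    """
    cumulative = [expit(theta + eta) for theta in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

# Hypothetical cutpoints for a 4-level ordinal trait.
probs = ordinal_probs([-1.0, 0.0, 1.0], eta=0.0)
print(probs)
```

Only one genotype effect enters `eta`, compared with K-1 effects in the multinomial model, which is why fewer regression parameters are estimated for ordinal traits.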

Simulations

We conduct type-I error and power simulation studies to evaluate the validity of our score test in assessing the association between single-nucleotide variants (SNVs) with different minor allele frequencies (MAF) and a categorical trait with four categories (“multinomial” trait), and compare the score test to the Wald test and the multinomial logistic regression which does not account for related samples. We then conduct simulations to assess the power of all three approaches.

Type-I error

We compare the type-I error rate of the robust score test with that of the Wald test and of multinomial logistic regression (without accounting for related samples) in both family-based and unrelated designs. We simulate a 4-category trait under the null hypothesis that there is no genetic association with the trait, i.e. H0 : β1 = … = β3 = 0. Eight SNV scenarios with MAF ranging from 0.01 to 0.3 are explored. For each SNV scenario and sample design, 500,000 replicates are simulated, and the type-I error rate is defined as the proportion of simulations significant at the thresholds of 0.01, 0.001, and 0.0001. For family-based samples, we have also conducted simulations to evaluate the type-I error of the robust score and Wald tests when applied to ordinal traits, based on 500,000 replicates for each MAF scenario.

Family design: In each replicate, a total of 1000 independent 3-generation families with 2 grandparents who have one son and one daughter (Fig. 5) are simulated. The number of grandchildren (3rd generation) is randomly drawn from a discrete uniform distribution ranging from 1 to 4. Within each of the 1000 families, we simulate additively coded genotypes (0, 1, or 2 minor alleles) of the grandparents under Hardy-Weinberg equilibrium, and the 2nd and 3rd generations’ genotypes are then simulated using random allele dropping. Two covariates (age and sex) are simulated. The sex of the 3rd generation is randomly assigned, and the covariate of age is simulated in the following way [17]: we start by simulating the age of the female offspring (2nd generation) from a continuous uniform distribution ranging from 25 to 50. Her spouse’s age is set to be within 5 years of her age. The male offspring’s age (2nd generation) is set to be within 5 years of his sister’s, with at least a 1-year gap to exclude twins. Then we simulate the age of the grandparents (1st generation). The grandmother is assumed to be 20 to 45 years older than both offspring (2nd generation), and the grandfather’s age is set to be within 5 years of the grandmother’s age, and he must be at least 20 years older than his older offspring. Finally, we simulate the age of the 3rd generation in such a way that everyone in the 3rd generation is 20 to 45 years younger than the mother (2nd generation) and at least 20 years younger than the father (2nd generation). Two continuous traits are simulated from age and sex, based on the following two equations, i.e. age and sex explain around 3 and 0.002% of the total variance of the latent variable u1 versus 0.8 and 0.01% of the latent variable u2: where , the additive covariance matrix is and the environmental covariance matrix is . Φ is the relationship matrix, which is the kinship matrix multiplied by 2.
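The genotype simulation described above (founders drawn under Hardy-Weinberg equilibrium, descendants by random allele dropping) can be sketched as follows. This is a simplified illustration rather than the authors' simulation code: the function names are hypothetical, and it assumes that the grandchildren descend from the daughter and a married-in founder spouse, which the age description suggests but the text does not state explicitly.

```python
import random

def founder_genotype(maf, rng):
    """Two alleles drawn independently (Hardy-Weinberg); 1 = minor allele."""
    return (int(rng.random() < maf), int(rng.random() < maf))

def drop_alleles(mother, father, rng):
    """Child inherits one randomly chosen allele from each parent."""
    return (rng.choice(mother), rng.choice(father))

def simulate_family(maf, rng):
    """Additively coded genotypes (0, 1 or 2) for one 3-generation family."""
    grandfather = founder_genotype(maf, rng)
    grandmother = founder_genotype(maf, rng)
    daughter = drop_alleles(grandmother, grandfather, rng)
    son = drop_alleles(grandmother, grandfather, rng)
    spouse = founder_genotype(maf, rng)        # married-in founder (assumption)
    n_grandchildren = rng.randint(1, 4)        # discrete uniform on 1..4
    grandchildren = [
        drop_alleles(daughter, spouse, rng) for _ in range(n_grandchildren)
    ]
    members = [grandfather, grandmother, daughter, son, spouse] + grandchildren
    return [sum(alleles) for alleles in members]

rng = random.Random(2021)                      # fixed seed for reproducibility
dosages = simulate_family(0.05, rng)
```

Repeating `simulate_family` 1000 times would give the genotypes for one replicate of the family design; phenotypes and covariates would be layered on separately.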

Fig. 5

All possible family structures

We transform u1 and u2 into two binary traits using a threshold model with disease prevalences of 10% and 35%, representing a disease with a moderate prevalence such as type 2 diabetes (T2D) and one with a high prevalence such as obesity. The multinomial trait is then defined by these two binary traits as follows: diabetes & obesity, diabetes but no obesity, obesity but no diabetes and no diabetes & no obesity, in adults.

Unrelated design: In each replicate, we simulate a total of 5000 independent subjects with ages ranging from 18 to 90. A total of 5000 independent additively coded genotypes are simulated. The sex is randomly assigned (1 = male; 2 = female). We then simulate two continuous traits influenced by age and sex only, based on the following two equations, so that age and sex explain around 3.2 and 0.8% of the total variance of u1 versus 0.94 and 0.01% of u2 respectively: where with . We transform u1 and u2 as described in the family design section. We evaluate the type-I error of the proposed score test and Wald test, and then compare them with the multinomial logistic regression assuming independence among observations (using the likelihood ratio test (LRT)).
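The threshold step that converts two latent traits into the four-category phenotype can be sketched as below. Independent standard normal draws stand in for u1 and u2 purely for illustration (the simulated latent traits in the study depend on age, sex and familial correlation), the thresholds are taken as empirical quantiles, and the category codes 1 to 4 are an assumed labeling that follows the order in which the text lists the categories.

```python
import random

def top_fraction(values, prevalence):
    """Flag the highest `prevalence` share of latent values as affected."""
    n_affected = round(prevalence * len(values))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    affected = set(ranked[:n_affected])
    return [i in affected for i in range(len(values))]

def diabesity(diabetes, obesity):
    """Assumed coding: 1 = both, 2 = diabetes only, 3 = obesity only, 4 = neither."""
    if diabetes and obesity:
        return 1
    if diabetes:
        return 2
    if obesity:
        return 3
    return 4

rng = random.Random(1)
u1 = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in latent trait for T2D
u2 = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in latent trait for obesity
t2d = top_fraction(u1, 0.10)                      # 10% prevalence
obese = top_fraction(u2, 0.35)                    # 35% prevalence
trait = [diabesity(d, o) for d, o in zip(t2d, obese)]
```

With a 10% and a 35% prevalence the resulting categories are strongly unbalanced, which is the setting in which slight type-I error inflation for rare variants was reported.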

Power evaluation

We compare the power of the score test with that of the Wald test and multinomial logistic regression under the same allele scenarios and with the same family/unrelated structure as described above. In addition to the effects of age and sex, we also include an additively coded genetic variant g which explains approximately 0.5% of the variance of each continuous trait, i.e. With this phenotype generation model, both traits are simulated under the alternative hypothesis that there is an association between the trait and the genetic variant. For each MAF scenario, a total of 5000 replicates are generated. The power of each method is then evaluated at 3 different significance thresholds, including the commonly used GWAS threshold.
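Both the type-I error rates and the power figures in the simulation tables are empirical rejection proportions: the share of replicates whose p-value falls below the threshold. A minimal sketch follows, together with the binomial Monte Carlo standard error that indicates how much sampling noise to expect at a given number of replicates. The function names are hypothetical, and the standard-error formula is the usual binomial approximation rather than something reported in the paper.

```python
import math

def rejection_rate(p_values, alpha):
    """Proportion of replicates significant at threshold alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

def monte_carlo_se(rate, n_replicates):
    """Binomial standard error of an empirical rejection proportion."""
    return math.sqrt(rate * (1.0 - rate) / n_replicates)

# Toy p-values; under the null a valid test rejects at a rate close to alpha.
toy_p = [0.0004, 0.02, 0.5, 0.003]
print(rejection_rate(toy_p, 0.01))

# Expected noise around a nominal 0.01 level with 500,000 null replicates.
print(monte_carlo_se(0.01, 500000))
```

Comparing a tabulated rate with the nominal level plus or minus about two such standard errors gives a rough sense of whether an apparent inflation exceeds simulation noise.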

Framingham heart study

The motivation for developing this efficient score test is to make application to large-scale genetic studies computationally feasible, especially now that the cost of whole-genome sequencing has fallen greatly in recent years. We apply the robust score test to the Framingham Heart Study (FHS) [17, 23]. A total of 7564 participants from 1315 families are analyzed, after excluding observations with missing values for body mass index (BMI), age, sex, the first 10 principal components (PCs) or T2D status. The primary outcome is diabesity with four categories as defined above. Diabesity is considered a modern epidemic and the largest in human history [24]. However, there are very few papers available regarding genetic association studies of this trait. We analyze the association between diabesity and genotypes from the Framingham SNP Health Association Resource (SHARe) project sponsored by the National Heart, Lung and Blood Institute (NHLBI), adjusting for age, sex, and the first 10 PCs. Genotypes from Affymetrix 550 K genotyping arrays (Affymetrix, Santa Clara, CA, USA), supplemented by the Affymetrix MIPS array, are available on 8481 participants after exclusion for low call rate (< 97%), heterozygosity rate outside of 5 SDs from the mean, or excess Mendelian errors (> 1000). Additional SNVs are imputed with the software MACH (Markov Chain-based haplotyper) using the HapMap 2 reference haplotypes [25]. To help interpret the GWAS results for diabesity, and given that diabesity is jointly constructed from obesity and diabetes, we perform two additional family-based logistic regression analyses using our approach to study the association between diabetes and genotypes and between obesity and genotypes, respectively. A secondary outcome treats diabesity as an ordinal variable with 4 levels of increasing severity. We apply both our ordinal approach and POLMM, with a derived sparse GRM, to the secondary outcome and compare the results.

19 in total

1. Detection of genes for ordinal traits in nuclear families and a unified approach for association studies.

Authors: Heping Zhang; Xueqin Wang; Yuanqing Ye
Journal: Genetics Date: 2005-10-11 Impact factor: 4.562

2. Effect of genetic variations on ticagrelor plasma levels and clinical outcomes.

Authors: Christoph Varenhorst; Niclas Eriksson; Åsa Johansson; Bryan J Barratt; Emil Hagström; Axel Åkerblom; Ann-Christine Syvänen; Richard C Becker; Stefan K James; Hugo A Katus; Steen Husted; Ph Gabriel Steg; Agneta Siegbahn; Deepak Voora; Renli Teng; Robert F Storey; Lars Wallentin
Journal: Eur Heart J Date: 2015-05-02 Impact factor: 29.983

3. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.

Authors: Yun Li; Cristen J Willer; Jun Ding; Paul Scheet; Gonçalo R Abecasis
Journal: Genet Epidemiol Date: 2010-12 Impact factor: 2.135

4. Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations.

Authors: Jianfeng Liu; Yufang Pei; Chris J Papasian; Hong-Wen Deng
Journal: Genet Epidemiol Date: 2009-04 Impact factor: 2.135

5. Association between nonalcoholic hepatic steatosis and hepatic cytochrome P-450 3A activity.

Authors: Dhanashri Kolwankar; Raj Vuppalanchi; Brian Ethell; David R Jones; Steven A Wrighton; Stephen D Hall; Naga Chalasani
Journal: Clin Gastroenterol Hepatol Date: 2007-03 Impact factor: 11.382

6. GEE-based SNP set association test for continuous and discrete traits in family-based association studies.

Authors: Xuefeng Wang; Seunggeun Lee; Xiaofeng Zhu; Susan Redline; Xihong Lin
Journal: Genet Epidemiol Date: 2013-10-25 Impact factor: 2.135

7. Variance-components methods for linkage and association analysis of ordinal traits in general pedigrees.

Authors: G Diao; D Y Lin
Journal: Genet Epidemiol Date: 2010-04 Impact factor: 2.135

8. Genome-wide detection of allele specific copy number variation associated with insulin resistance in African Americans from the HyperGEN study.

Authors: Marguerite R Irvin; Nathan E Wineinger; Treva K Rice; Nicholas M Pajewski; Edmond K Kabagambe; Charles C Gu; Jim Pankow; Kari E North; Jemma B Wilk; Barry I Freedman; Nora Franceschini; Uli Broeckel; Hemant K Tiwari; Donna K Arnett
Journal: PLoS One Date: 2011-08-25 Impact factor: 3.240

9. Diabetes and its drivers: the largest epidemic in human history?

Authors: Paul Z Zimmet
Journal: Clin Diabetes Endocrinol Date: 2017-01-18

10. A groupwise association test for rare mutations using a weighted sum statistic.

Authors: Bo Eskerod Madsen; Sharon R Browning
Journal: PLoS Genet Date: 2009-02-13 Impact factor: 5.917