Literature DB >> 28303589

Secondary phenotype analysis in ascertained family designs: application to the Leiden longevity study.

Renaud Tissier¹, Roula Tsonaka¹, Simon P Mooijaart², Eline Slagboom³, Jeanine J Houwing-Duistermaat^1,4.

Abstract

The case-control design is often used to test associations between the case-control status and genetic variants. In addition to this primary phenotype, a number of additional traits, known as secondary phenotypes, are routinely recorded, and typically, associations between genetic factors and these secondary traits are studied too. Analysing secondary phenotypes in case-control studies may lead to biased genetic effect estimates, especially when the marker tested is associated with the primary phenotype and when the primary and secondary phenotypes tested are correlated. Several methods have been proposed in the literature to overcome the problem, but they are limited to case-control studies and not directly applicable to more complex designs, such as the multiple-cases family studies. A proper secondary phenotype analysis, in this case, is complicated by the within families correlations on top of the biased sampling design. We propose a novel approach to accommodate the ascertainment process while explicitly modelling the familial relationships. Our approach pairs existing methods for mixed-effects models with the retrospective likelihood framework and uses a multivariate probit model to capture the association between the mixed type primary and secondary phenotypes. To examine the efficiency and bias of the estimates, we performed simulations under several scenarios for the association between the primary phenotype, secondary phenotype and genetic markers. We will illustrate the method by analysing the association between triglyceride levels and glucose (secondary phenotypes) and genetic markers from the Leiden Longevity Study, a multiple-cases family study that investigates longevity.

Entities: Chemical Disease Gene Species

Keywords: ascertainment; family data; genetic association and heritability; mixed models; multivariate probit model

Mesh：

Substances：

Year: 2017 PMID： 28303589 PMCID： PMC5485037 DOI： 10.1002/sim.7281

Source DB: PubMed Journal: Stat Med ISSN： 0277-6715 Impact factor: 2.373

Introduction

In order to understand biological mechanisms underlying disease and health, epidemiological studies measure genetic markers, classical variables, and novel omics datasets and model the relationship between these variables and the phenotype of interest. Here, we consider outcome‐dependent sampling designs with binary outcome variables. In addition to studying these binary (primary) phenotypes, the classical or omics variables are typically also analysed as outcome variables (secondary phenotypes). For example, modelling of associations between these traits and genetic factors, such as single‐nucleotide polymorphisms (SNPs) or polygenic risk scores (sumscores based on SNPs) 1. However, an important complication that is often ignored is that a proper analysis of the secondary traits should correct for the sampling mechanism on the primary phenotype (Figure 1). Note that we assume that the secondary phenotype has an effect on the primary phenotype. The reverse situation will not be treated because of reverse causality challenges 2. In our motivating case study, the Leiden Longevity Study (LLS) 3 families with at least two long‐lived siblings are recruited. Obviously, these families do not represent a random sample from the population and inferences cannot be generalized to the whole population, unless the sampling mechanism is properly modelled. Several datasets are measured in the offspring of the long‐lived siblings, namely, lipidomics, glycomics, metabolomics and imaging. These offspring share a part of their genetic variation with the long‐lived parent and therefore are expected to represent a healthy subpopulation while the partners represent the population. As data example, we will model the effect of genetic factors on the secondary traits glucose and triglyceride levels in the offspring (cases) and their partners (controls). To be able to extrapolate results to the general population, we need to account for the over sampling of long‐lived subjects in the families of the LLS. There are several multiple‐case family studies. For human longevity, Genetics of Healthy Ageing 4 used the same study design as the LLS. Other examples are Genetics in Familial Thrombosis (with at least two cases with thrombosis) 5, 6 and the ongoing study from Leiden Family Lab (famlab: https://www.leidenfamilylab.nl), which recruits families with at least two cases with social anxiety disorder. The novel methods presented in this paper will also be essential for modelling secondary phenotypes in these studies.

Figure 1

Directed acyclic graph representing the case where bias is expected when estimating the association between the genetic marker and the secondary phenotype. Arrows represent existing association between each node of the graph. A secondary phenotype analysis investigates whether there is an association between the genetic factor and the secondary phenotype. In the context of case‐control studies, Monsees et al. 2 showed that bias can occur when estimating the SNP effect on secondary phenotypes if the primary and secondary phenotypes are associated. This is often the case because both outcomes are measured on the same subjects and secondary phenotypes are typically chosen for their potential associations with the primary phenotype. They also showed that the amount of bias is dependent on the prevalence of the primary phenotype, the strength of the association between the primary and secondary phenotypes, and the association between the tested marker and the primary trait (Figure 1). To deal with the bias problem, investigators first used ad hoc methods, that is, using controls only, cases only, combined data of cases and controls or joint analysis of cases and controls adjusting for the case‐control status. However, several authors showed that these simple approaches can lead to false positive results 2, 7, 8. This is due to the sampling design, namely, the secondary phenotype data are not sampled according to the case‐control design as the primary phenotype. Several sophisticated methodologies have been developed to correct for the sampling mechanisms and provide unbiased genetic effect estimates: (i) inverse‐probability‐of‐sampling‐weighting approaches 2, 9, 10 that correct for the sampling mechanism by weighting appropriately individuals in case‐control studies; (ii) retrospective likelihood‐based approaches that indirectly adjust for ascertainment 8, 11; and (iii) a weighted combination of two estimates obtained with the retrospective likelihood approach in the presence or not of an interaction between SNPs and primary phenotypes 12. Even though these approaches can successfully correct for the biased design used to collect the data, they are not directly applicable to more complex designs such as the LLS that motivates this work. In particular, inverse probability weighting approaches require knowledge of the sampling weights for each family. These weights are not available for the LLS because it is unknown what the prevalence of families with at least two nonagenarians is in the population. In addition, the correlations between the family members cannot be ignored, and therefore, it is evident that statistical methodology for proper secondary phenotypes analysis in this context is needed. To this end, under the retrospective likelihood framework, we develop a multivariate probit regression model inspired by the work of Najita et al. 13 to model jointly the distribution of the primary and secondary phenotype. This approach allows us to deal with the ascertainment issue while taking into account the individual relatedness and the genetic and environmental variations. The paper is organized as follows: in Section 2, we present the retrospective likelihood approach to correct for the over sampling of long‐lived subjects and the multivariate probit regression model for the joint modelling of the mixed type primary and secondary phenotypes. In Section 3, we evaluate empirically the performance of the method in terms of bias and efficiency and contrast it with the naive approach that ignores the sampling mechanism. Finally, in Section 4, we illustrate the potential of our proposed method in the analysis of triglyceride levels and glucose in the LLS.

Methods

Retrospective likelihood approach

Let N be the total number of families in the study. For the family i(i=1…N) of size n , let Y , X and G be the n ×1 vectors for the case‐control status, the secondary phenotype and the genotype, respectively. Motivated by the LLS, we will work under the retrospective likelihood approach to correct for the ascertainment of the families. Such an approach is attractive when modelling the ascertainment mechanism is not straightforward, as in the LLS where sampling depends on the previous generation (an example of a pedigree in LLS is shown in Figure 2). In fact the retrospective likelihood approach implicitly corrects for the ascertainment mechanism, under the assumption that the ascertainment depends only on the primary phenotype Y. In particular, for the ith family it holds with A s c the ascertainment process. By applying Bayes' rule, we obtain To fully specify (2), we need to model properly the conditional joint distribution of the primary and the secondary phenotypes given the genotype P(X ,Y |G ), the marginal probability of the primary phenotype P(Y |G ) and the genotype probability of the ith family P(G ). Each one of these elements are described in Sections 2.2 and 2.3.

Figure 2

Example of a family pedigree from the Leiden Longevity Study. Squares and circles represent men and women, respectively; crossed symbols represent deceased individuals. In black are the long‐lived individuals on whom the ascertainment is based; in grey are the cases of the study (offsprings of long‐lived siblings), and in white are the controls.

Mixed‐effects models for the analysis of family data

To model the correlation of the phenotypes Y and X within families, a common choice is to use random effects. For the binary primary phenotype, we propose to use a multivariate probit model with random effects. The advantage of this model is that it involves only the integrals of the multivariate normal cumulative distribution function for which efficient algorithms have been developed. In contrast, for the more commonly used logistic regression model, the integrals have to be approximated, for example, by using Gauss–Hermite quadrature, which might be computationally intensive for large pedigrees. Let be a set of family‐specific random effects designed to handle familial genetic correlation and be the vector of genotypes for family i. For the probit model, the observed response Y is viewed as a censored observation from an underlying continuous latent variable Y ∗ with where −∞=γ 0<γ 1<γ 2=+∞ are suitable threshold parameters. For the underlying latent variable Y ∗, we assume the mixed‐effects regression model where is independent of . Here, α=(α 0,α 1) denotes the regression coefficient vector with α 0 the intercept and α 1 the parameter representing the effect of the genotype on Y. At the family level, we assume , with R the coefficient of relationships matrix with elements with d denoting the genetic distance between subjects l and m in the family. The parameter represents the residual additive genetic variation not explained by g . Note that models the polygenic inheritance in a family. For identifiability reasons, restrictions are required on both the scale and location of Y ∗, namely, we set σ 2=1 and γ 1=0. Thus, in the mixed‐effects probit regression, the disease risk conditional on the random‐effects and genotypic information g is modelled as follows with Φ(z) the cumulative distribution function of the standard normal distribution. The marginal density under the probit model takes the form: To model the secondary phenotype X , we use a linear mixed model: where β=(β 0,β 1) denotes the regression coefficient vector with β 0 the intercept and β 1 the parameter representing the effect of the genotype on X, is the random parameter used to model the genetic correlation structure within each family for the secondary trait, and σ is the residual standard deviation. To model jointly X and Y using the model specifications ((3) and (4)), we introduce a shared random effect u ∼N(0,1) and propose the following model: where u is assumed to be independent of and . We introduce a coefficient δ in order to have different phenotypic variances for the random effect u . In case of small datasets or small family sizes, it can be better to constrain δ to be equal to 1 for a simpler model. Let and denote the corresponding variance–covariance matrices of the marginal distributions of X and , and let be their covariance. The joint distribution of Y ∗ and X is then . In the special case for n =2, the variance–covariance matrix becomes Using the properties of the multivariate normal distribution, the joint distribution for the observed primary and secondary phenotypes takes the form Thus by using the probit regression model for the primary trait, we have developed an efficient approach to model the correlation between the primary and secondary trait. From model (5) and the variance–covariance matrix (6), several marginal correlations between and within family members can be deduced: where ρ represents the association between the primary and secondary phenotype. We can also derive the closed form for the heritability estimates of the secondary phenotype that quantifies the percentage of genetic variation in the total variance: Note that when genetic factors are included in the model formula (7) gives the residual heritability.

Genotype probability

Finally, another key component in the formulation of the retrospective likelihood (2) is the computation of the genotype probability for each family i. Let G and G denote the genotypes of the mother and father of an individual j if this individual is a nonfounder member of family i. Under the assumption of random mating and Mendelian inheritance, the genotype probabilities can be written as presented by Duncan Thoms 14: The probabilities P(g |G ,G ) are the transmission probabilities that can be modelled using Mendelian inheritance. Finally, P(G ), P(G ) and P(g ) can be modelled by assuming Hardy–Weinberg proportions (1−q)2, 2q(1−q), q 2 that depend on q, the minor allele frequency. Here, we propose to use external information for q or to estimate q from the control sample before maximizing the likelihood. Note that when genotypes of the parents are missing, the probability can be obtained by summing over the possible parental genotypes. In case of a more complex pedigree, a recursive algorithm known as peeling can be used 15. For the LLS where families are sibships, the probability is as follows: where is the model parameters vector.

Estimation and statistical testing

To estimate the parameters of the joint model, we maximize the logarithm of the likelihood described in (8). This involves a combination of numerical optimization and integration. For the evaluation of the integral in the multivariate normal distribution, we use the deterministic algorithm Miwa described in 16. For the optimization, we use the Broyden–Fletcher–Goldfarb–Shanno algorithm implemented in the function optim(.) in R. The Broyden–Fletcher–Goldfarb–Shanno algorithm is a quasi‐Newton method, which means that the Hessian matrix does not need to be evaluated directly but is approximated by using specified gradient evaluations. To test for the presence of an effect of the SNPs on the secondary phenotype, we use the likelihood ratio test. Note that when the interest of a researcher is solely testing for genetic association, a score statistic is an alternative to the likelihood ratio statistic.

Continuous polygenic score

Our approach can also be applied in the case of modelling the association between continuous covariates and secondary phenotypes. For example, polygenic scores have been used to summarize genetic effects among an ensemble of SNPs that have been identified in large genome‐wide association studies (GWASes) 17, 18, 19. Polygenic scores are typically linear combinations of SNPs: , where δ =1 or δ is obtained from previous GWASes. For genetic scores, we need to integrate over the distribution of the polygenic score instead of summing over the genotypes in the denominator of (2). For the distribution of the polygenic score, we use a multivariate normal distribution , with μ the mean value of the genetic score, σ the standard deviation of the genetic score and R the relationship matrix of family i. The likelihood contribution for family i is given by Computation of the integral can be quite intensive and challenging. In order to gain efficiency, we write the marginal model of Y ∗ (5) as , with . Now follows the following multivariate normal distribution: . Note that when a polygenic risk score is included in the model for the secondary phenotype, the parameter represents the residual polygenic inheritance.

Inclusion of covariates in the model

Often, researchers want to adjust for covariates such as age, sex, treatment and so on in the model. Let Z be such a covariate. To estimate the effect Z on the secondary phenotype, we propose to maximize the joint likelihood of X and G conditionally on the primary phenotype Y and Z. Thereby, we avoid modeling of the distribution of Z within the families. Indeed, under the assumption of independence between genotype and Z, we obtain

Simulation study

A simulation study was set up to evaluate the performance of our proposed method for the estimation of the association between a genetic factor and the secondary phenotype and the estimation of the heritability of the secondary phenotype. We compare the proposed method with the naive approach that is typically followed in practice, namely, analysis of the secondary trait without correcting for the sampling mechanism. In particular, in this case, we fit the standard linear mixed‐effects model for the secondary phenotype and explicitly model the familial relationships as described in ((4)). The two methods are compared in terms of bias, root mean square error (RMSE) and 95% coverage probabilities. We consider SNPs (discrete variables) and polygenic scores (continuous variables). Several settings are considered for the disease prevalence, the strength of the association between the genetic factor and the primary phenotype, the strength of the ascertainment mechanism and the number of sibships. We simulated sibships of size 5. With respect to the familial relationships, we consider only sibships such that our simulation resembles the LLS design. For the prevalence of the primary phenotype, we consider two settings, namely, a disease prevalence of 1% that corresponds to α 0≈−2.32 and of 5% that corresponds to α 0≈−1.64. In addition, the variance parameters have been chosen such that they correspond to a heritability of 50%. Specifically, we use =2, , and . This corresponds to a correlation of 0.78 between the primary and the secondary phenotypes. To speed up computations, we assume that when fitting the models to the simulated datasets. For each scenario, 500 datasets are simulated using model (5).

Simulation results for a single‐nucleotide polymorphism

The genotypes of the SNPs are simulated assuming a minor allele frequency of 0.3 in the population. For the secondary phenotype model, the following fixed effects values are used: β 0=3.5 and β 1=0.2, whereas for the primary phenotype model, the effect sizes are α 1= 0.1 or 0.5. Finally, for each of the four scenarios (rare or common disease, and weak and strong SNP effect on the primary phenotype), we consider two ascertainment mechanisms, namely, the sampled sibships of size five have at least one affected or at least two affected members. Figure 3 presents the estimates and 95% confidence intervals for the scenario of 400 sibships. Figure 3 shows that ignoring the sampling mechanism (naive method) leads to biased estimates of the SNP effect, and the size of this bias increases with the strength of the ascertainment mechanism and the association between the SNP and the primary phenotype. Overall, we observe that the proposed method gives unbiased estimates of the SNP effect on the secondary phenotype. The coverage probabilities reach the nominal level (see section A of the Supporting Information). Regarding the prevalence of the primary phenotype, we observe that for the naive method bias increases with lower prevalence, while the proposed method remains robust to the lower amount of information because of the rare primary phenotype. In general, the proposed method leads to smaller root mean square error than the naive approach and better coverage probabilities.

Figure 3

Estimates and 95% confidence intervals for the single‐nucleotide polymorphism (SNP) effect on the secondary phenotype for the retrospective likelihood approach and the naive method. Results are obtained from 500 simulated datasets of 400 sibships for two ascertainment schedules. The top and bottom panel correspond to a rare or common primary phenotype with a prevalence around 1% and 5%, respectively. In black and red are represented results for small (α 1=0.1) and large (α 1=0.5) effect sizes of the SNP on the primary phenotype, respectively. The horizontal line corresponds to the true SNP effect on the secondary phenotype. In Table 1, we present the heritability estimates of the secondary phenotype for a common disease, under the various ascertainment mechanisms and the two values of α 1. It is obvious that the heritability estimates are influenced by the ascertainment mechanisms when using the naive approach. Indeed, the naive method tends to underestimate the heritability for each mechanism and this underestimation increases as the ascertainment mechanisms become more stringent. The heritability estimates are 25–27% for sibships with at least one affected sibling and drop to 13–14% for sibships with at least two affected siblings. On the contrary, the proposed method is robust to the stringency of the ascertainment mechanism.

Table 1

		SNP model		Polygenic score model
Ascertainment	α ₁	Retrospective	Naive	Retrospective	Naive
1. At least two cases
	0.10	0.48 (0.07) (0.22)	0.13 (0.07) (0.37)	0.50 (0.03) (0.13)	0.14 (0.03) (0.36)
	0.50	0.48 (0.07) (0.22)	0.14 (0.07) (0.36)	0.52 (0.03) 0.12)	0.15 (0.03) (0.34)
2. At least one case
	0.10	0.50 (0.08) (0.17)	0.25 (0.08) (0.25)	0.48 (0.04) (0.12)	0.25 (0.03) (0.24)
	0.50	0.50 (0.08) (0.17)	0.27 (0.08) (0.24)	0.50 (0.04) (0.10)	0.26 (0.04) (0.23)

The heritability value is 50% under the generating model. Datasets consist of 400 sibships of size 5. Results are based on 500 replicates. SNP, single‐nucleotide polymorphism.

Heritability results of the simulation studies for a SNP and a polygenic score: estimates with standard deviations and root mean square error (in brackets) for the heritability of the secondary phenotype for a common disease (prevalence ≈5%), when sibships with at least one or at least two cases are sampled and for two values of α 1, that is, SNP or polygenic score effect on primary phenotype. The heritability value is 50% under the generating model. Datasets consist of 400 sibships of size 5. Results are based on 500 replicates. SNP, single‐nucleotide polymorphism. Next, we study the robustness of our approach to one violation of the model assumptions, namely, we simulated under a logit link for the primary phenotype and used the probit link for modelling. Results for the SNP effect and the heritability are presented in Table 2. These results show that even though our approach gives biased estimates for the primary phenotype model, the parameters estimates for the secondary phenotype model are not affected. All the results are presented in Section A of the Supporting Information.

Table 2

Ascertainment	α ₁	β ₁	Heritability
0. True value		0.200	0.500
1. At least two cases
	0.100	0.199 (0.104) (0.104) (0.948)	0.509 (0.017) (0.110)
	0.500	0.197 (0.106) (0.110) (0.945)	0.516 (0.014) (0.108)
2. At least one case
	0.100	0.200 (0.104) (0.107) (0.961)	0.510 (0.012) (0.096)
	0.500	0.199 (0.107) (0.111) (0.960)	0.513 (0.010) (0.087)

In brackets are standard deviations, root mean square error and coverage probability (for the effect size only). Datasets consist of 400 sibships of size 5. Results are based on 500 replicates.

Robustness of the retrospective likelihood methods to violation of the probit model assumption for the primary phenotype: estimates of the effect size of the single‐nucleotide polymorphism on the secondary phenotype (β 1) and heritability of the secondary phenotype are given for a common disease (prevalence ≈5%), for the two ascertainment mechanisms and two values of α 1. In brackets are standard deviations, root mean square error and coverage probability (for the effect size only). Datasets consist of 400 sibships of size 5. Results are based on 500 replicates. Although we focus on parameter estimation, model fitting and heritability estimation for genetic association with a secondary phenotype, we also investigate the performance of the likelihood ratio test under the null hypothesis of no genetic association with a secondary phenotype at two levels of genetic association with the primary phenotype. In each of the four considered scenarios, we simulate 10 000 replicates. In Table 3, the empirical type I error rates are given for the rare disease scenario (i.e. prevalence 1%). We observe that while our approach preserves the type I error rate at a nominal level, the naive approach has, systematically, an inflated type I error rate. The type I error rate for the naive method increases with stronger ascertainment and larger SNP effect on the primary phenotype.

Table 3

Empirical type I errors rates for testing for association between a genetic marker and a secondary phenotype using the likelihood ratio test for four scenarios.

	Nominal level (α)	Retrospective likelihood	Naive method
At least two cases
α ₁=0.1
	0.05	0.0509	0.0580
	0.01	0.0118	0.0152
	0.001	0.0017	0.0025
α ₁=0.5
	0.05	0.0505	0.0878
	0.01	0.0113	0.0222
	0.001	0.0013	0.0043
At least one case
α ₁=0.1
	0.05	0.0524	0.0514
	0.01	0.0102	0.0098
	0.001	0.0018	0.0014
α ₁=0.5
	0.05	0.0522	0.0558
	0.01	0.0098	0.0097
	0.001	0.0009	0.0016

Sibships with at least one or with at least two cases are considered. Two values for the association between the single‐nucleotide polymorphism and the primary phenotype, namely, α 1 = 0.1 and α 1 = 0.5, are used. Datasets consist of 400 sibships of size 5. Results are based on 10 000 replicates.

Empirical type I errors rates for testing for association between a genetic marker and a secondary phenotype using the likelihood ratio test for four scenarios. Sibships with at least one or with at least two cases are considered. Two values for the association between the single‐nucleotide polymorphism and the primary phenotype, namely, α 1 = 0.1 and α 1 = 0.5, are used. Datasets consist of 400 sibships of size 5. Results are based on 10 000 replicates.

Simulation results for a polygenic score

To study the performance of the proposed method for polygenic score, we simulated centred and standardized scores. The parameters of the secondary phenotype model were chosen as for the SNP simulations: β 0=3.5 and β 1=0.2, whereas for the primary phenotype model, effect sizes of α 1= 0.1 or 0.5 were used. Figure 4 presents the estimates and confidence intervals for datasets with 400 sibships. Our approach provides unbiased estimates of the effect of the polygenic score on the secondary phenotype. In contrast, the naive approach provides biased estimates and the bias increases when the ascertainment process is more stringent or when α 1 is larger.

Figure 4

Estimates and 95% confidence intervals for the polygenic score effect on the secondary phenotype for the retrospective likelihood approach and the naive method. Results are obtained from 500 simulated datasets of 400 sibships for two ascertainment schedules. The top and bottom panel correspond to a rare or common primary phenotype with a prevalence around 1% and 5%, respectively. In black and red are represented results for small (α 1=0.1) and large (α 1=0.5) effect sizes of the polygenic score on the primary phenotype, respectively. The horizontal line corresponds to the true polygenic score effect on the secondary phenotype. The results of the residual heritability estimates after adjustment for polygenic scores agree with the results obtained when a SNP is included in the model (Table 1). The naive approach did not perform well: estimates between 25–26% and 14–15% for an ascertainment process of at least one affected sibling and at least two affected siblings, respectively, instead of 50%.

Application: analysis of the Leiden Longevity Study

In this Section, we will exemplify our proposed method in the analysis of the LLS briefly introduced in Section 1. The LLS is a family‐based study set up to identify mechanisms that contribute to healthy ageing and longevity. The inclusion criteria of the study are sibships with at least two nonagenarian siblings, that is, the selection takes place at Generation II (Figure 2). Several secondary phenotypes and GWAS data have been measured for the offspring of these siblings (Generation III in Figure 2) and their partners. Because the offspring have at least one nonagenarian parent, they are also likely to become long‐lived. Therefore, the set of offspring and their partners corresponds to a case‐control design with related subjects where the offspring in Generation III are considered as cases and their partners as controls. Overall, 421 families with 1671 offspring (cases) and 744 partners (controls) have been included in the study. Because the families are relatively small, we use the model that assumes an equal variance for the shared effect for the two traits. Here, we model the association between genetic factors and the secondary phenotypes triglyceride and glucose levels. For both traits, there is evidence of an association with human longevity and both traits are normally distributed. For the sake of comparison in addition to our proposed method, we will present results using the naive approach, that is, standard linear mixed model. Analyses using the linear mixed model that conditions also on the case‐control status will not be presented because the parameters do not have a comparable interpretation between the two approaches. The p‐values presented subsequently are obtained using the likelihood ratio test.

Triglyceride levels analysis

Triglyceride levels have been found to be associated with the primary trait longevity ( p‐value = 0.0005 for women and p‐value = 0.04 for men), and the size of association is sex dependent. Therefore, a sex‐stratified analysis has been considered further. For the purposes of our illustration, we restricted our analysis to seven genes on chromosome 11 that are known to be associated with triglyceride levels. These genes are APOA1, APOA4, APOA5, APOC3, ZNF259, BUD13 and DSCAML1. The selection of the genes was performed using the NHGRI‐EBI GWAS catalog 20. For these genes, we have genotypes of 41 SNPs that have no missing values in our datasets. Triglyceride levels were standardized, and we included age as a covariate in the analysis. We ran the analysis with the constrained approach, that is, δ=1. We observe that none of the SNPs analysed is significantly associated with triglyceride levels either in men or in women; hence, for most SNPs, the estimates of the effect sizes agree between the two approaches. The SNPs showing the largest differences are, in men, SNP 22: for our Retrospective Approach (RA) and for the Naive Approach (NA) and SNP 26: and . For women, more SNPs give different estimates between the two approaches, that is, SNP 1 ( , ), SNP 2 ( = 7.2e‐06 ), SNP 13 ( , ) and SNP 19 ( , ) showed the biggest differences. Results for the SNPs are presented in Section B of the Supporting Information. We verified whether the assumption of equal variances for the primary and secondary phenotype for the shared effects is justified. We fitted also the model with non‐constrained δ. We noticed that for some of the SNPs, the model parameters are hard to estimate and the estimates of the variances of the shared and residual random effects in the model for the second phenotype are swapped. Overall, the estimates of the effect of the SNP on the secondary phenotype are very similar to the model that assumes equal variances. Results of these analyses are presented in Section B of the Supporting Information.

Glucose levels analysis

In previous analysis of glucose levels in the offspring and partners of the LSS, Mooijart 21 studied the association between glucose and a polygenic score. The genetic score was defined as the total number of risk alleles across 15 SNPs that are known to be associated with Type II diabetes. The generalized estimating equation method was applied to take into account the familial relationships. The paper showed that a higher number of Type II diabetes risk alleles is associated with a higher serum concentration of glucose ( p‐value = 0.016). A statistically significant association was found between the glucose level and case‐control status ( p‐value < 0.001). However, the sampling process was not taken into account in the analysis and thus the results might be biased. We applied the proposed method to estimate the heritability of glucose levels and to test for the presence of an association between the glucose levels and the polygenic score. In addition, we applied the naive approach that did not correct for case‐control status. We did not stratify according to sex in these analyses. For this analysis, the polygenic score was standardized. Using the RA, the association between the genetic score and the glucose level is estimated by with a standard error of s t E=0.023(p‐value =0.015). The naive approach also yields a significant association between the genetic score and glucose levels ( , s t E=0.026, p−v a l u e=0.020). By using the NA, we obtained for the glucose levels a genetic variance of and a total variance of , which corresponds to a residual heritability of . Our RA yields a genetic variance of and a total variance of that corresponds to a residual heritability of .

Discussion

In this paper, we developed a new method for the proper analysis of secondary traits for multiple‐cases family designs. A key component in our proposed method is the joint modelling of the primary and secondary phenotypes. We developed a multivariate probit model that can also capture the within families dependencies. A retrospective likelihood approach has been followed to correct for the ascertainment process. Thereby unbiased estimates of the association between genetic factors and secondary traits can be obtained. Simulation results showed that our approach preserves the type I error at nominal level and provides accurate estimates irrespective of the disease prevalence, the strength of the association between the genetic variants and the primary phenotype, and the ascertainment mechanism. Another important empirical finding is that the heritability estimates for the secondary traits can be severely underestimated unless the sampling mechanism is taken into account. With respect to the analysis of the motivating case study, for the SNPs, the differences between the effect sizes obtained by our proposed method and the naive approach were small. The small differences obtained between the naive and the retrospective approach are mainly because of the small effect sizes of the genetic markers selected on the primary phenotype. Indeed, the three main factors influencing the magnitude of the bias when using the naive approach are the correlation between the secondary phenotype and the primary phenotype, the strength of the ascertainment, and the strength of the association between the genetic marker and the primary phenotype. Heritability is one of the properties that a trait needs to possess to be declared an endophenotype for a specific disease. The other criteria are the trait is associated with the disease status in the population, the trait manifests whether illness is active or in remission (state independent) and the trait and the disease status co‐segregate within a family 22. The Leiden Family Lab (https://www.leidenfamilylab.nl) aims to identify endophenotypes for social anxiety disorder. The study comprises families with at least two cases with social anxiety. The methods presented in this paper will be used for the analyses of this study to identify endophenotypes and are relevant for other family studies, as well. In this paper, we proposed to include additional covariates in the model by using the likelihood conditional on these covariates. Alternatively, the joint likelihood of the secondary phenotype, genotype and covariate conditionally on the primary phenotype can be used. This alternative approach might be more efficient 23. However, this likelihood requires distributional assumptions for the covariates within families, which can be complex for related individuals. Moreover, maximization of the likelihood might become time consuming. Ghosh et al. 24 propose a pseudo‐likelihood and a profile approach to include covariates in a secondary phenotype analysis for case‐control data. This work needs to be extended to family data. A Monte Carlo approach might be considered to compute the integrals (Tsonaka et al. 25). Typically, there are missing genotypes. In unrelated individuals, genotypes can be imputed based on the haplotype structure obtained from a reference panel. For family data, the imputation should also take into account the genotypes of other family members. Software exists that can perform such analysis, for example, the Genotype Imputation Given Inheritance program 26. However, for the computation of the denominator in Equation (2), these imputed genotype probabilities have to be taken into account. Because of the computational intensity of the proposed method, it is not yet possible to run full GWAs analyses of secondary phenotypes. However, the proposed method can be used on a set of pre‐selected variants, for example, after an initial screening with the naive approach to the primary and secondary phenotypes or when investigating pleiotropic effects. To reduce the computation time of the multivariate integrals in the numerator and the denominator, a faster algorithm can be used than the one used in this paper. The randomized quasi‐Monte Carlo procedure, developed by Genz 27, is less accurate but faster especially for large pedigrees. Development of less computational intensive methods is one of the topics for future research. With regard to pleiotropic effects, a criticism of probit random‐effects models is that in the presence of high dimensional random effects we cannot move from the subject‐specific interpretation for the fixed effects parameters to the population‐level interpretation as in the random‐intercepts case. When the outcome is binary and families are relatively small, estimation of the intercept and variances terms can be difficult, and consequently, coverage probabilities can be poor. Tsonaka et al. 6 showed efficiency gains by using information on disease prevalence. Their methods need to be adapted to our setting of the analysis of two phenotypes. When the parameters of the primary phenotype model are not of interest and this model is only used to correct for the ascertainment mechanism that is driven by the primary phenotype, we showed that secondary phenotype analyses with the proposed method are robust to using the probit instead of the logit link function. Future directions in the LSS and Leiden Family Lab Study will address the pending availability of multiple omics and functional MRI data, respectively, and joint modelling of several glycans or voxels is of interest. Extending our approach, in this case, is algebraically straightforward, but practical implementation may be challenging because of computational intensity especially with a large number of secondary phenotypes. The use of composite likelihood approaches might be a solution and is our current research topic. Finally, an attractive alternative approach to properly analyse secondary traits is to apply inverse probability weighting. However, it is crucial to correctly specify the weights. Currently, we do not have sufficient information to be able to estimate these weights for our studies. However, with access to electronic records for research, such as information from general practitioners to estimate the weights, it is likely that inverse probability weighting approaches can be developed. Table S1: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a rare disease (prevalence around 1%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S2: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a rare disease (prevalence around 1%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S3: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a rare disease (prevalence around 1%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S4: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a rare disease (prevalence around 1%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S5: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a common disease (prevalence around 5%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S6: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a common disease (prevalence around 5%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S7: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a common disease (prevalence around 5%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S8: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a common disease (prevalence around 5%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S9: Estimates with standard deviations and RMSE of the heritability of the secondary phenotype are given for a common disease (prevalence ≈ 5%), for the four ascertainment mechanisms and two values of α1. Table S10: Estimates with standard deviations and RMSE of the heritability of the secondary phenotype are given for a rare disease (prevalence ≈ 1%), for the four ascertainment mechanisms and two values of α1. Figure S1: Estimates and 95% confidence intervals for the SNP effect on the secondary phenotype for the retrospective likelihood approach and the naive method. Results are obtained from 500 simulated datasets of 100 families for 2 ascertainment schedules. The top and bottom panel correspond to a rare or common primary phenotype with a prevalence around 1% and 5% respectively. In black and red are represented results for small (α1=0.5) and large (α1=1) effect sizes of the SNP on the primary phenotype respectively. The horizontal line corresponds to the true SNP effect on the secondary phenotype. Table S11: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a rare disease (prevalence around 1%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S12: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a rare disease (prevalence around 1%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S13: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a rare disease (prevalence around 1%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S14: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a rare disease (prevalence around 1%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S15: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a common disease (prevalence around 5%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S16: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a common disease (prevalence around 5%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S17: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a common disease (prevalence around 5%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S18: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a common disease (prevalence around 5%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S19: Estimates with standard deviations and RMSE of the heritability of the secondary phenotype are given for a common disease (prevalence ≈ 5%), for the four ascertainment mechanisms and two values of α1. Table S20: Estimates with standard deviations and RMSE of the heritability of the secondary phenotype are given for a rare disease (prevalence ≈ 1%), for the four ascertainment mechanisms and two values of α1. Table S21: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a rare disease (prevalence around 1%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S22: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a rare disease (prevalence around 1%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S23: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a rare disease (prevalence around 1%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S24: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a rare disease (prevalence around 1%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S25: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a common disease (prevalence around 5%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S26: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 2 cases of a common disease (prevalence around 5%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S27: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a common disease (prevalence around 5%) and α1=0.1. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S28: Simulations results obtained on 500 datasets with 400 families of size 5 with at least 1 case of a common disease (prevalence around 5%) and α1=0.5. Into brackets are respectively standard deviations, root mean square errors and the coverage probabilities. Table S29: Estimates with standard deviations and RMSE of the heritability of the secondary phenotype are given for a rare disease (prevalence ≈ 1%), for the four ascertainment mechanisms and two values of α1. Figure S2: Estimates for the association of 41 SNPs with triglyceride in the LLS. In the top and bottom are the estimates of the 41 SNPs for women and for men respectively. The black line represents no SNP effect on triglyceride. Table S30: Leiden Longevity Study: Estimates of the association between the 41 selected SNPS and triglyceride levels for womend and for three different approaches. The retrospective likelihood approach with same variance assumed for the shared random effect, with different variances, and the naive approach. Are also presented the absolute difference between the estimates of the two last approaches with the first one. Into brackets are the standard errors. Table S31: Leiden Longevity Study: Estimates of the association between the 41 selected SNPS and triglyceride levels for men and for three different approaches. The retrospective likelihood approach with same variance assumed for the shared random effect, with different variances, and the naive approach. Are also presented the absolute difference between the estimates of the two last approaches with the first one. Into brackets are the standard errors. Click here for additional data file.

23 in total

1. A Novel Application of a Bivariate Regression Model for Binary and Continuous Outcomes to Studies of Fetal Toxicity.

Authors: Julie S Najita; Yi Li; Paul J Catalano
Journal: J R Stat Soc Ser C Appl Stat Date: 2009-09-01 Impact factor: 1.864

2. Unified Analysis of Secondary Traits in Case-Control Association Studies.

Authors: Arpita Ghosh; Fred A Wright; Fei Zou
Journal: J Am Stat Assoc Date: 2013 Impact factor: 5.033

3. A Gaussian copula approach for the analysis of secondary phenotypes in case-control genetic association studies.

Authors: Jing He; Hongzhe Li; Andrew C Edmondson; Daniel J Rader; Mingyao Li
Journal: Biostatistics Date: 2011-09-19 Impact factor: 5.899

4. Re-using data from case-control studies.

Authors: A J Lee; L McMurchy; A J Scott
Journal: Stat Med Date: 1997-06-30 Impact factor: 2.373

5. A retrospective likelihood approach for efficient integration of multiple omics factors in case-control association studies.

Authors: Brunilda Balliu; Roula Tsonaka; Stefan Boehringer; Jeanine Houwing-Duistermaat
Journal: Genet Epidemiol Date: 2015-01-24 Impact factor: 2.135

6. Genome-wide linkage scan in affected sibling pairs identifies novel susceptibility region for venous thromboembolism: Genetics In Familial Thrombosis study.

Authors: M C H de Visser; R van Minkelen; V van Marion; M den Heijer; J Eikenboom; H L Vos; P E Slagboom; J J Houwing-Duistermaat; F R Rosendaal; R M Bertina
Journal: J Thromb Haemost Date: 2013-08 Impact factor: 5.824

7. Design, recruitment, logistics, and data management of the GEHA (Genetics of Healthy Ageing) project.

Authors: A Skytthe; S Valensin; B Jeune; E Cevenini; F Balard; M Beekman; V Bezrukov; H Blanche; L Bolund; K Broczek; C Carru; K Christensen; L Christiansen; J C Collerton; R Cotichini; A J M de Craen; S Dato; K Davies; G De Benedictis; L Deiana; F Flachsbart; J Gampe; C Gilbault; E S Gonos; E Haimes; A Hervonen; M A Hurme; D Janiszewska; M Jylhä; T B L Kirkwood; P Kristensen; P Laiho; A Leon; A Marchisio; R Masciulli; A Nebel; G Passarino; G Pelicci; L Peltonen; M Perola; M Poulain; I M Rea; J Remacle; J M Robine; S Schreiber; M Scurti; F Sevini; E Sikora; A Skouteri; P E Slagboom; L Spazzafumo; M A Stazi; V Toccaceli; O Toussaint; O Törnwall; J W Vaupel; K Voutetakis; C Franceschi
Journal: Exp Gerontol Date: 2011-08-16 Impact factor: 4.032

8. Genome-wide association scans for secondary traits using case-control samples.

Authors: Genevieve M Monsees; Rulla M Tamimi; Peter Kraft
Journal: Genet Epidemiol Date: 2009-12 Impact factor: 2.135

9. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.

Authors: Danielle Welter; Jacqueline MacArthur; Joannella Morales; Tony Burdett; Peggy Hall; Heather Junkins; Alan Klemm; Paul Flicek; Teri Manolio; Lucia Hindorff; Helen Parkinson
Journal: Nucleic Acids Res Date: 2013-12-06 Impact factor: 16.971

10. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder.

Authors: Shaun M Purcell; Naomi R Wray; Jennifer L Stone; Peter M Visscher; Michael C O'Donovan; Patrick F Sullivan; Pamela Sklar
Journal: Nature Date: 2009-07-01 Impact factor: 49.962

8 in total

1. The Leiden Family Lab study on Social Anxiety Disorder: A multiplex, multigenerational family study on neurocognitive endophenotypes.

Authors: Janna Marie Bas-Hoogendam; Anita Harrewijn; Renaud L M Tissier; Melle J W van der Molen; Henk van Steenbergen; Irene M van Vliet; Catrien G Reichart; Jeanine J Houwing-Duistermaat; P Eline Slagboom; Nic J A van der Wee; P Michiel Westenberg
Journal: Int J Methods Psychiatr Res Date: 2018-04-26 Impact factor: 4.035

2. An atlas of evidence-based phenotypic associations across the mouse phenome.

Authors: Nobuhiko Tanaka; Hiroshi Masuya
Journal: Sci Rep Date: 2020-03-03 Impact factor: 4.379

3. Impaired neural habituation to neutral faces in families genetically enriched for social anxiety disorder.

Authors: Janna M Bas-Hoogendam; Henk van Steenbergen; Jennifer Urbano Blackford; Renaud L M Tissier; Nic J A van der Wee; P Michiel Westenberg
Journal: Depress Anxiety Date: 2019-10-10 Impact factor: 6.505

4. Investigating microstructure of white matter tracts as candidate endophenotypes of Social Anxiety Disorder - Findings from the Leiden Family Lab study on Social Anxiety Disorder (LFLSAD).

Authors: Eline F Roelofs; Janna Marie Bas-Hoogendam; Hanneke van Ewijk; Habib Ganjgahi; Steven J A van der Werff; Marjolein E A Barendse; P Michiel Westenberg; Robert R J M Vermeiren; Nic J A van der Wee
Journal: Neuroimage Clin Date: 2020-11-05 Impact factor: 4.881

5. Statistical integration of two omics datasets using GO2PLS.

Authors: Zhujie Gu; Said El Bouhaddani; Jiayi Pei; Jeanine Houwing-Duistermaat; Hae-Won Uh
Journal: BMC Bioinformatics Date: 2021-03-18 Impact factor: 3.169

6. Genome-wide analysis in multiple-case families: assessing the relationship between triglyceride and methylation.

Authors: Angga M Fuady; Renaud L M Tissier; Jeanine J Houwing-Duistermaat
Journal: BMC Proc Date: 2018-09-17

7. Application of novel and existing methods to identify genes with evidence of epigenetic association: results from GAW20.

Authors: Angga M Fuady; Samantha Lent; Chloé Sarnowski; Nathan L Tintle
Journal: BMC Genet Date: 2018-09-17 Impact factor: 2.797

8. Amygdala hyperreactivity to faces conditioned with a social-evaluative meaning- a multiplex, multigenerational fMRI study on social anxiety endophenotypes.

Authors: Janna Marie Bas-Hoogendam; Henk van Steenbergen; Nic J A van der Wee; P Michiel Westenberg
Journal: Neuroimage Clin Date: 2020-03-16 Impact factor: 4.881

8 in total