Literature DB >> 18343883

A likelihood-based approach to mixed modeling with ambiguity in cluster identifiers.

Andrea S Foulkes¹, Recai Yucel, Xiaohong Li.

Abstract

This manuscript describes a novel, linear mixed-effects model-fitting technique for the setting in which correlated data indicators are not completely observed. Mixed modeling is a useful analytical tool for characterizing genotype-phenotype associations among multiple potentially informative genetic loci. This approach involves grouping individuals into genetic clusters, where individuals in the same cluster have similar or identical multilocus genotypes. In haplotype-based investigations of unrelated individuals, corresponding cluster assignments are unobservable since the alignment of alleles within chromosomal copies is not generally observed. We derive an expectation conditional maximization approach to estimation in the mixed modeling setting, where cluster assignments are ambiguous. The approach has broad relevance to the analysis of data with missing correlated data identifiers. An example is provided based on data arising from a cohort of human immunodeficiency virus type-1-infected individuals at risk for antiretroviral therapy-associated dyslipidemia.

Entities: Chemical Disease Gene Mutation Species

Keywords: Expectation conditional maximization; Genotype; HIV-1; Haplotype; Lipids; Missing identifiers; Mixed-effects models; Phenotype; Population-based genetic association studies

Mesh：

Year: 2008 PMID： 18343883 PMCID： PMC2536727 DOI： 10.1093/biostatistics/kxm055

Source DB: PubMed Journal: Biostatistics ISSN： 1465-4644 Impact factor: 5.899

INTRODUCTION

Mixed-effects modeling is a well-established method for the analysis of correlated data where correlation among observations can arise from repeated measures or clustering. Since the landmark paper of Laird and Ware (1982), an extensive literature has developed that spans a range of model-fitting techniques and applications, including Diggle , Vonesh and Chinchilli (1997), Pinheiro and Bates (2000), Verbeke and Molenberghs (2000), McCulloch and Searle (2001), Demidenko (2004), and Fitzmaurice , among others. Together, these provide a clear and comprehensive discussion of state-of-the-art methods for estimation, testing, and prediction in the context of linear, generalized linear, and nonlinear mixed-effects modeling. In addition, a broad array of applications are presented with complete discussion of available software tools for implementation of existing methods. To our knowledge, a fully likelihood–based method that specifically addresses unobservable correlated data indicators, that is, missing individual or cluster identifiers, has not been described. The data settings motivating our research are population-based genetic association studies of unrelated individuals for whom haplotypic phase, that is, the alignment of alleles on a single chromosome, is unobservable. In a recent manuscript, we describe a multistage approach for this setting that involves (1) estimating haplotype frequencies using only the available genetic information, (2) multiply imputing cluster membership identifiers, (3) for each of these imputations, fitting a mixed-effects model for the outcome of interest, such as a measure of disease progression, using existing analytical tools, and (4) combining the results across imputations to make inference (Foulkes ). While this approach is straightforward to implement, it does not provide knowledge about the outcome to inform the haplotype frequency estimation. That is, estimation of haplotype frequencies (step 1 above) is done independently of the mixed-effects model–fitting procedure (step 3). In the present manuscript, we derive a novel, likelihood-based approach that incorporates the haplotype estimation component into the model-fitting procedure. Our approach has the marked advantage of drawing strength from a clinical measure (outcome in the model framework) to update the haplotype frequency estimates. Specifically, we derive an expectation conditional maximization (ECM) algorithm for this missing data setting. Expectation–maximization (EM)-type algorithms have been described for fitting mixed-effects models (Laird and Ware, 1982), (Jennrich and Schluchter, 1986), (Laird ), (Jamshidian and Jennrich, 1993). In its original formulation, model random effects are treated as missing data (McCulloch and Searle, 2001, p. 264). We extend this for our setting by letting both the random effects and the correlated data indicators together constitute the missing component. We also distinguish our setting from the more common missing data settings in which covariate or response data are missing and/or there is imbalance in the design, that is, unevenly spaced measurements over time. Methods for these settings are well described as noted in Fitzmaurice and others (p. 375 2004) and McCulloch and Searle (p. 94 2001). The ECM approach originally proposed by Meng and Rubin (1993) extends the EM algorithm of Dempster to reduce complexities in the maximization step by partitioning the set of parameters into disjoint and exhaustive subsets with likelihood functions that are easier to maximize. Two alternative maximization algorithms are well described in the context of fitting mixed models, Newton–Raphson and Fisher scoring (FS) (Lindstrom and Bates, 1988), (Wolfinger ), (Pinheiro and Bates, 2000), (Demidenko, 2004), and combinations of each with EM-type algorithms provide both efficiency and stability. A combination of FS and the EM was recently proposed for missing covariate and response data by Schafer and Yucel (2002). While further extensions for missing cluster identifiers are tenable, the ECM algorithm is efficient, provides simple interpretable solutions at each step, and converges reliably to maximum likelihood (ML) estimates by guaranteeing an increase in the likelihood function at each iteration (Little and Rubin, 1987). Unobservable cluster identifiers can arise in a variety of settings. For example, hospital records may have incomplete information on patients’ local area identifiers such as ZIP codes, which may be desirable in modeling treatment patterns (Chiu and others, 2005). Alternatively, clusters may define underlying biological constructs that are not observable. In general, investigators can identify a subset of clusters that are consistent with the observed data. For example, additional information available from either census records or hospital records may identify a set of possible ZIP codes. In the context of characterizing biological states, genetic indicators can inform us about the set of possible groupings of individuals (Foulkes and DeGruttola, 2002). The data motivating our research arise from a cohort of human immunodeficiency virus type-1 (HIV-1)–infected individuals on highly active antiretroviral therapy (HAART). Long-term exposure to HAART has be associated with an array of lipid abnormalities that can lead to early onset of cardiovascular disease in this population. Our investigation aims to characterize the associations among genetic polymorphisms and lipids, controlling for the effects of drug exposures and other relevant clinical and demographic factors. Ultimately, understanding the pharmacogenomic underpinnings to complex diseases, such as cardiovascular disease, will have broad implications for tailoring therapy decisions to patient-specific characteristics. In general, the pair of single nucleotide polymorphisms (SNPs) at each locus within a gene is observed; however, the alignment of these nucleotides across loci for a given chromosomal copy is unobservable. This unobservable information, commonly referred to as haplotypic phase, can be biologically and clinically informative and ignoring it may lead to a loss of power to detect associations. In this manuscript, we describe how the ECM approach accounting for uncertainty in cluster identifiers can be applied to the setting of ambiguous-phase haplotype data to discover clinically relevant biological associations. This approach represents a contribution to existing methodology since it addresses simultaneously the need to consider multiple genetic indicators and the unobservable aspect of haplotypic phase using a fully likelihood–based approach. Recently, Foulkes proposed applying mixed-effects models to data arising from genetic association studies of unrelated individuals. A primary strength of this approach is that it allows for assessing overall variability across combinations of multiple genetic polymorphisms using a single, omnibus test while controlling for potential confounding by environmental and clinical characteristics. Empirical Bayes estimates of multilocus genotype effects and corresponding prediction intervals lend additional insight into the specific polymorphisms contributing to measures of disease progression. The method proposed herein extends this approach to handle the setting in which genetic information is unobservable. The proposed method also extends the generalized linear modeling approach of Lake and Lin and Zeng (2006) that both describe implementation of an EM algorithm for unobservable haplotype data. Notably, both the mixed modeling approach and the methods given in Lin and Zeng (2006) can accommodate specific departures from Hardy–Weinberg equilibrium (HWE). The primary difference between the approaches is that the mixed modeling approach we present assumes that haplotype effects are random, arising from an underlying probability distribution. This provides a flexible analytic framework for characterizing a large number of genetic indicators and may offer a solution to the degrees-of-freedom problem inherent in tests of haplotype–trait associations as described by Tzeng . Finally, mixed-effects models have been described as a special case of structural equation models (SEMs) or latent class models (Sanchez ). Here, we introduce a doubly latent class structure since there are latent cluster random effects as well as unobservable cluster identifiers. Notably, the inclusion of both latent class indicators and latent random effects has been described in the SEM literature. For example in Muthen and Shedden (1999), alcohol dependency classes have latent indicators while person-specific random effects are included to account for repeated measures over time. In our setting, clusters similarly have latent indicators but it is the same clusters (and not individuals) that are assumed to have random effects. This renders our setting distinct. The heterogeneity model of Verbeke and Lesaffre (1996), on the other hand, assumes an unobservable mixture distribution on the random effects, that is, that the random cluster effects are themselves clustered. This is again different from our setting since we assume that the cluster effects arise from a single distribution while the membership to these clusters is potentially unobservable. These are subtle distinctions but important ones requiring novel associated methods. We begin in Section 2 by outlining our notation, the assumed underlying model, and a brief summary of estimation via the EM algorithm in the usual linear mixed modeling setting in which cluster assignments are fully observed. We then describe a novel estimation approach in the general context of cluster ambiguity in Section 3. Extensions for investigation of genetic associations are provided in Section 4. Finally, in Section 5 we present a summary of results from a simulation study and from applying this approach to a study of HAART-associated dyslipidemia in HIV-1-infected individuals. A comprehensive discussion of the simulation approach and corresponding findings is provided in Appendix 3.

BACKGROUND

Notation and model

Consider the linear mixed-effects model given in (2.1) for i =1,…,M, where Y is an n×1 vector with jth element equal to the response for the jth observation in cluster i, n is the number of observations in cluster i, X is a corresponding matrix of covariates, and Z is the design matrix for random cluster effects. We assume and . In the general mixed modeling setting, Z is observed and an EM approach is used to estimate β and θ, where β is the vector of mean parameters and θ =(D, σϵ2) is a vector of variance components. This approach is described in Laird and Ware (1982) and summarized in Section 2.2 below: Now, suppose Z = blkdiag[Z], where N = ∑n. In the ambiguous cluster setting, both Z (the indicator for cluster membership) and n are potentially unobserved. In addition, the elements of Y and X will vary depending on cluster assignments. Let the observed data relevant to cluster assignments be G. We define 𝒮 to be the set of all design matrices Z that are consistent with these observed data. For simplicity of notation in subsequent sections, we let Y = (Y1,Y2,…,Y), X = [X1,X2,…,X], 𝒟 = blkdiag[D], b = (b1,…,b), and ϵ = (ϵ1,…,ϵ). The model in (2.1) can be rewritten in complete matrix notation as described in (2.2). The variance of Y is given by W = Z𝒟Z + σ2I:

Estimation in the fully observed cluster setting

First, consider the usual linear mixed modeling setting in which cluster assignments are fully observed and the traditional EM approach to estimation of Laird and Ware (1982). This approach proceeds by first calculating the ML estimate of β assuming the current estimate of θ. This calculation is straightforward since a closed-form solution exists. Second, we update the estimate of θ assuming the current estimates of β. Estimation at this step proceeds using an EM algorithm, which involves first determining the conditional expectation of the complete-data log-likelihood (E-step) and then maximizing this to arrive at new parameter estimates (M-step). This process is then repeated iteratively until a convergence criterion is met. If the variance parameters are known, the ML estimate of β is given by in (2.3). In general, θ is not known and we replace W in this equation with its ML estimate given by . Based on the complete-data likelihood, where the complete data consist of Y, b, and ϵ, the sufficient statistics for θ = [σ2,D] are given by t1 = ∑ϵϵ and t2 = ∑bb. The M-step of the EM algorithm is composed of arriving at ML estimates and assuming the current estimate of Ω = (β,θ): The E-step involves setting the sufficient statistics t1( and t2( equal to their expectation conditional on the observed data Y, as summarized in (2.4). It is straightforward to show and . Restricted maximum likelihood (REML) estimates are obtained by adding and to each equation, respectively, where . These additional terms account for estimation of the mean parameter β:

METHODS

In this section, we extend the methods described in Section 2.2 to handle ambiguity in the correlated data indicators. That is, we assume that the i of (2.1) is not observed. Since estimation of β requires knowledge of the unobserved cluster assignments, an additional implementation of the EM algorithm is required at the first step. Specifically, estimation of β will depend on weights equal to the estimated posterior probabilities of each potential cluster assignment given the observed data. This requires first assuming a distribution for the number of observations in each cluster as described in Section 3.1 below. We let this distribution be a function of the parameter vector α = (α1,…,α), where α is the population frequency of cluster i. We then proceed similarly to the unambiguous setting by first estimating the mean components Φ = [β,α] assuming θ is known and, second, estimating the variance parameters θ given the current estimate of Φ.

Defining a distribution for cluster counts

We assume that the probability of a particular configuration of cluster assignments (represented by the design matrix Z) follows a multinomial distribution. This probability density is given explicitly in (3.1), where n is the number of observations in cluster i for the given Z, α = (α1,…,α), α is the population frequency of cluster i, and ∑α = 1 or, equivalently, α = 1 − ∑α. Note that the usual constant term (N!/n1!…n!) is not included in this formula since the probability is for a single configuration Z: The number of clusters, given by M, is assumed to be known. This is a reasonable assumption in most data settings. For example, in Section 4 clusters are formulated based on pairs of haplotypes; the number of possible pairs is a fixed number that depends on the number of SNPs under investigation within a gene. Alternatively, clusters may represent hospitals or schools and the number of such units is generally fixed at the onset of a study.

Estimating mean parameters, conditional on θ

The complete data consist of Y, Z, b, and ϵ and are denoted Ycomplete. In estimating the mean parameters, we treat b and ϵ as known and write the complete-data likelihood for Φ = (β,α) given θ as Lc(Φ|Ycomplete,θ) in (3.2). Here, Pr(Y|Z,β,θ) is the marginal conditional density for the observed data Y and is given by (3.3). Note that the particular configuration of cluster assignments will contribute to W and we therefore include an additional Z subscript in our notation: The E-step involves calculating the conditional expectation of the complete-data log-likelihood. This conditional expectation is given in (3.4), where p(Ω) is the posterior probability of the combination of cluster assignments (again denoted by the design matrix Z) given Y and Ω = (Φ,θ). Recall that 𝒮 is the set of all design matrices Z that are consistent with the observed data. A formulation of this posterior probability is given in (3.5). At this step, we update our estimate of p(Ω) assuming the current estimate of Ω. That is, we calculate , where is the vector of current parameter estimates: The M-step involves maximizing the conditional expectation of the complete-data log-likelihood conditional on the current estimate of the posterior probabilities calculated in the E-step. Maxima can be obtained by maximizing the system of equations given in (3.6) and (3.7). Here, we use the relationship that the derivative of the conditional expectation is equal to the conditional expectation of the score function. Resulting closed-form solutions for and are given in (3.8) and (3.9): In the case that the variance parameters are known, we iterate between updating our estimates of p(Ω) and updating our estimates of Φ. The EM algorithm ensures that we will increase the likelihood at each iteration. In general, the variance components θ are not known; however, we can obtain ML estimates of θ and condition on these estimates. This amounts to substituting these ML estimates into the above equations. In the following paragraphs, we describe a modification of the EM algorithm for estimation of θ that additionally incorporates posterior probabilities associated with each combination of cluster assignments.

Estimating variance components, conditional on Φ

For the purpose of estimating variance parameters, we define the complete-data log-likelihood by Lc(θ|Ycomplete,Φ) in (3.10). Note f(Y|b,ϵ,β) is a dirac function ( = 1 under model) and only depends on β so is ignored in estimation of θ. Using the same approach as described in Section 2.2, we set the sufficient statistics for θ equal to their expectation. Here, we sum additionally over the set of all design matrices Z that are consistent with the observed data and weight by corresponding posterior probabilities . Again the maximization step involves setting , where and M is the number of clusters corresponding to the specific configuration given by Z. Adjustments to these equations to arrive at REML estimates proceed as described in Section 2.2:

Summary of approach

In summary, ML estimation proceeds by iterating between 2 EM algorithms: (1) estimation of Φ and (2) estimation of θ. For computational efficiency, we implement one iteration of the first EM algorithm conditional on the current estimate of θ and then one iteration of the second EM algorithm conditional on the current estimate of Φ. This is then repeated until a convergence criterion is met. Initial values for the parameter estimates are arrived at by randomly assigning clusters in the case of ambiguity and fitting the usual mixed-effects model. This approach is summarized in the following step-by-step procedure: Initialize k = 0. Determine initial values for by randomly assigning cluster membership in the case of ambiguity and fitting usual mixed effects model. Calculate based on Equation 3.5 using the current estimate of . Update and using Equations 3.8 and 3.9, assuming the current estimates of pZ(Ω) = pZ() and . Denote the new estimate for Φ by . Update and using Equation 3.11, assuming the current estimates of Φ = Φ(, pZ(Ω) = pZ() and . Denote the new estimate for θ by . Let k = k + 1 and repeat steps (2)–(4) until a convergence criterion is met.

CONSIDERATIONS FOR GENETIC ASSOCIATION STUDIES

Data arising from genetic association studies of unrelated individuals are generally composed of 3 components: (1) one or more outcomes (commonly referred to as phenotypes, these can be continuous or a binary indicator for case–control status), (2) covariates and potential confounders, including clinical and demographic factors, and (3) genotypes, consisting of the pair of nucleotides present at each locus within and across the candidate genes under consideration. In general, the alignment of nucleotides on a single chromosomal copy, commonly referred to as haplotypic phase, is not observable. For example, if an individual is heterozygous at 2 loci within a gene so that their observed genotype is (Aa,Bb), then the corresponding possible haplotype pairs for this individual are (AB,ab) or (Ab,aB).

Defining clusters

The mixed modeling approach to the analysis of genetic association studies begins by grouping individuals into clusters so that individuals within the same cluster have similar or identical underlying genetic compositions. For example, in Foulkes individuals with identical multilocus genotypes (i.e. the same pattern of SNPs across multiple loci within or across a gene) are deterministically assigned to a corresponding cluster. Once these clusters are defined, analysis proceeds using the mixed-effects modeling framework just as it would in a typical clustered data setting. In the context of studying haplotypic variations, such a grouping based on genetic compositions is generally unobservable. That is, just as haplotypic phase is unobservable, cluster assignments based on this information must also be unobservable. More formally, suppose ℋ = (h1,…,h) represents the set of all haplotype pairs (diplotypes) consistent with the observed genotypes for a given gene. We define clusters C1,…,C such that an individual with haplotype combination contained in ℋ belongs to cluster C, where ℋ⊂ℋ. In the most general case, we assume that the number of clusters equals the number of haplotype pairs, that is, K = M and ℋ = h so that all individuals within the same cluster have identical diplotypes. For example, in the case of 2 SNPs within a gene, there are 4 possible haplotypes and K = 10 possible diplotypes. In the general case, we define 10 corresponding clusters. Several alternative formulations of the clusters are tenable, and these can reflect the biological hypothesis under investigation. For example, returning to the simple case above, we can group all diplotypes that contain at least one copy of the rare haplotype into a single cluster. This would result in 7 clusters and is consistent with a dominant genetic model in which one copy of the disease allele results in an altered phenotype. A visual representation of these 2 approaches to defining clusters is given in Figure 1.

Fig. 1.

Sample approaches to defining clusters. For the 2 SNP example in which the observed genotypes are (AA, Aa,or aa) and (BB, Bb, or bb), there are 4 possible haplotypes, AB, Ab, aB, and ab, and 10 possible diplotypes. The most general approach to defining clusters results in 10 clusters consisting of all these possible combinations of 2 haplotypes. These are indicated by shaded rectangles. An alternative approach groups all diplotypes with at least one copy for the rare ab haplotype into a single cluster. This is indicated by the dashed rectangle that combines 4 of the previously defined clusters into a single cluster. In this case, there are a total of 7 clusters.

Estimation

Returning to the model in (2.1), Z again indicates membership to cluster i (or equivalently for the general case, presence of haplotype pair h) and is potentially unobservable. The observed data G that inform the cluster memberships are the observed genotypes. Recall that the population frequency of each cluster is given by α for i = 1,…,M. Under the assumption of HWE, the probability of a pair of haplotypes is equal to the product of the corresponding marginal frequencies. In our setting, the HWE assumption is not required since we estimate cluster-level (diplotype) probabilities. This results from the fact that we allow for 2 components of the data to inform our estimation of cluster frequencies. The first is those individuals whose cluster membership is unambiguous and the second is the phenotype (Y). In the special case that we are estimating the frequencies of clusters that are completely ambiguous, that is, all individuals within the clusters are ambiguous, then we rely solely on Y for this purpose, unless we make an additional assumption such as HWE. In this extreme case, while we are able to estimate cluster frequencies and calculate corresponding empirical Bayes estimates of random effects, it is not possible to distinguish which values correspond to which clusters without additional assumptions. Notably, the omnibus test for overall variability in the random cluster effects is still valid. If the HWE assumption is reasonable, the proposed method can be refined further to define cluster probabilities as the product of the corresponding 2 haplotype probabilities. That is, (3.1) can be reexpressed as described in (4.1), where H represents a single haplotype, M* is the number of unique haplotypes, and is the number of copies of H observed across all clusters for the configuration given in Z. The ML estimate of δ is as given in (3.9), where n is replaced with . For the simple example described in Figure 1, under the HWE assumption the number of frequency parameters reduces from M = 10 to M* = 4. Note, however, that in both cases, the number of random effects is 10 since there are 10 clusters. Sensitivity of this approach to violations of HWE is described in Section 5.1 and Appendix 3:

Testing and prediction

For the case in which we are interested in overall genetic effects and not interactions between genes and other covariates, b reduces to a scalar and Z = 1 is an n×1 vector of 1s. In the application of mixed modeling to genetic data, interest may lie in testing for significant variability across random effects (e.g. H0:σ2 = 0). A likelihood ratio test comparing the expected complete-data log-likelihood for the full model (with random cluster effects) to the reduced model (without random cluster effects) can be applied. Finally, empirical Bayes estimates of the random effects inform us about the cluster-specific effects on the phenotype under consideration. These are calculated in the usual way with additional weights equal to the posterior probabilities of cluster assignments and given in (4.2):

Computational and identifiability considerations

Suppose heterozygosity is observed in our sample at exactly r sites for the gene under investigation. In this case, the number of possible haplotypes is 2 and the number of haplotype pairs (clusters) is R = . Notably, some clusters consist only of individuals whose haplotypic phase is fully determined. For example, consider the clusters illustrated in Figure 1. The cluster consisting of the haplotype pair (AB,Ab) corresponds uniquely to individuals with the genotype (AA,Bb). Since this genotype has heterozygosity at only a single site, the corresponding haplotypic phase is known. That is, the haplotypic phase of individuals within the (AB,Ab) cluster is completely observed. On the other hand, for the example provided, the true cluster assignment for individuals with uncertainty in phase will be either (AB,ab) or (Ab,aB). In general, R* = of the R clusters consists of individuals for whom haplotypic phase is ambiguous. If there are K individuals with ambiguous phase, then the number of possible cluster assignment configurations is |𝒮| = (R*). This is the number of elements Z of the set 𝒮 in (3.4)–(4.2). For the simple case in which r = 2, we have R* = 2 and |𝒮| = 2. Thus, the computational burden of the proposed modeling approach is clearly quite large; however, a few matrix identities help to reduce the computational intensity. Specifically, suppose each individual has a single random effect so that Z = 1 and 𝒟 = σ2I. In this case, we can write W − 1 as in (4.3), where J = 11. A formal derivation of this identity is given in Appendix 1(a). Note that W− 1 depends only on the number of individuals within each cluster and not the specific configuration of the individuals. Since multiple elements of 𝒮 yield the same numbers of observations per cluster, use of (4.3) reduces the number of calculations of W − 1 from 2 to K + 1: If we further assume X = 1 so that the model given in (2.1) consists of an intercept and no covariates, then reduces to (4.4). A detailed derivation is provided in Appendix 1(b). The sum over ∈ S now represents a sum over all combinations of cluster sizes and is the sum of PZ () for all Z consistent with the configuration . Note that ∑Y does depend on the particular configuration of cluster assignments and thus must be determined for each Z∈𝒮. Again calculation of the inverse (the first term in the product in (4.4)) depends only on the number of individuals per cluster, thus reducing the number of computations substantially: Gains in computational efficiency can also be achieved by partitioning Y, X, and Z into their ambiguous and unambiguous components. Let Y = [Ya|Yu], X = [Xa|Xu], and , where the subscripts a and u indicate subsets of the observed data corresponding to individuals whose cluster assignments are ambiguous and unambiguous, respectively. Since W is block diagonal (clusters are assumed independent), W−1 can be partitioned similarly into , where Wa − 1 = (Za𝒟Za + σ2I) − 1 and Wu − 1 = (Zu𝒟Zu + σ2I) − 1. We can now write (3.8), (3.9), and (3.11) in terms of sums of ambiguous and unambiguous components, as described in Appendix 2(a)–(c). The unambiguous components need only be calculated once while the ambiguous components depend on Z ∈ 𝒮. The most general approach to defining clusters in the haplotype setting, as described in Figure 1, results in all individuals with ambiguity in phase belonging to a subset of clusters while no fully observed individuals belong to clusters in this subset. As noted in Section 4.2, estimation of cluster-level frequencies in this setting relies solely on the response variable and is not identifiable in the sense that we cannot distinguish which frequencies and empirical Bayes estimates correspond to which particular clusters. While the omnibus test for overall variability is valid in this setting, estimation of haplotype frequencies under the HWE assumption may be more relevant. In the general setting of missing correlated data identifiers, this would represent an extreme case.

DATA RESULTS

Simulation study

In order to evaluate the mixed modeling approach for characterizing haplotype–trait associations, we conducted a simulation study that includes the following components: (1) a sensitivity analysis of the mixed modeling approach to the number of clusters (haplotypes) and model misspecification. Both founder effect models (assuming dominant and recessive traits) and departures from HWE are considered. (2) A comparison to alternative methods, including a traditional analysis of variance (ANOVA) approach and a 2-stage multiple imputation (MI) approach (Foulkes ). (3) Detailed simulation findings, including power, coverage rates (CRs), and false-discovery rates (FDRs) for varying effect sizes (ratios of standard deviations) and degrees of ambiguity in cluster assignments. Details of the simulation study and corresponding results are provided in Appendix 3. Briefly, we found that the mixed modeling approach has reasonable power for a sample size of n = 200 and between 10 and 36 clusters (4–8 haplotypes). Performance is relatively poor, however, under misspecification of the random-effects distribution. The reduction in power is especially pronounced for the recessive founder model in which only a single cluster effect arises from a normal distribution with > 0 variability and the remaining cluster effects have 0 variability. On the other hand, power appears stable under moderate deviations from HWE when this is assumed. For a ratio of standard deviations (σ/σ) of 0.4 and sample size of n = 200, power for the ANOVA and mixed modeling approaches is comparable while the number of clusters is less than 20. For more than 20 clusters, power of the mixed modeling approach (based on the single degree of freedom test) is greater. Power for the 2-stage MI approach is comparable to the fully likelihood–based approach described herein for one data example. Modest reductions in power are observed for increasing ambiguity in cluster identifiers while corresponding CRs for variance parameters are lower. Finally, FDRs increase from 5% to 7–9% with an increase in cluster ambiguity up to 20%.

Example

Recent studies indicate that long-term exposure to certain combination antiretroviral therapies (ARTs) may lead to a host of lipid abnormalities including increases in triglycerides and total cholesterol and a reduction in high-density lipoprotein cholesterol (HDL-C). In turn, this can lead to accelerated onset of cardiovascular disease and death, presenting a grave concern for HIV-1-infected individuals receiving continuous long-term therapy. However, the large number of available ARTs provides a great potential to tailor treatments to individual-level characteristics. Furthermore, understanding the characteristics of individuals at greatest risk for cardiovascular complications will provide clinicians the opportunity to target interventions, such as administration of lipid-lowering therapies. The data motivating our research arise from a cohort of N = 626 HIV-1-infected individuals at risk for ART-associated dyslipidemia. These data were collected as part of multiple AIDS Clinical Trials Group studies and combined under New Work Concept Sheet 224. The primary aim of this study is to identify genetic factors that predict lipid abnormalities after controlling for traditional risk factors and other clinical parameters, including age, sex, use of lipid-lowering therapy, and current ART exposure. First-stage analysis results and general descriptive information on the cohort are provided in Foulkes . This analysis revealed potential effect modification by race/ethnicity, and so for the purpose of illustration, we describe here application of the above method within Hispanics (N = 109). The effects of haplotypic variation in endothelial lipase (EL) on HDL-C are considered. The SNPs chosen for analysis are rs12970066, Asn396Ser, and rs3829632 (-1309A/G) and were determined based on prior knowledge of association with plasma lipoproteins and for capturing genetic variability within this gene. A haplotype-based analysis can be advantageous if the observed SNPs are in linkage disequilibrium with the disease-causing variant and are not themselves functional; in general and in this setting, the functionality of specific SNPs is not fully characterized, and thus, a haplotype-based analysis can provide new insight. We assume HWE within the single race/ethnicity group and apply the ECM approach described in Section 4.2. A summary of genotype frequencies is given in Table 1. In this sample, N = 13 individuals have uncertainty in haplotypic phase due to heterozygosity at rs12970066 (AG) and Asn396Ser (CG). Notably, variability is not observed in the third SNP (rs3829632) within Hispanics; however, we include this SNP in our presentation for completeness. Covariates included in model fitting are age, gender, CD4 count, current ART exposure, use of lipid-lowering therapy, and study. N = 100 individuals with complete data are included in the analysis.

Table 1.

	EL genotypes			Count (%)
	rs12970066	Asn396Ser	rs3829632 (-1309A/G)
1	AA	CC	AA	23 (0.21)
2	AA	CG	AA	24 (0.22)
3	AA	GG	AA	4 (0.04)
4	AG	CC	AA	31 (0.28)
5	AG	CG	AA	13 (0.12)
6	GG	CC	AA	14 (0.13)
				Total: 109

EL genotype within Hispanics. Genotype counts for combination of 3 SNPs in EL. Although variability in rs3829632 is not observed within the subset of Hispanics, this SNP is included in the presentation for completeness A convergence criterion of a maximum absolute percentage change in parameter estimates from one iteration to the next of < 1×10 − 5 is used. Convergence is met after 20 iterations. Resulting haplotype frequency estimates are given in Table 2. The estimated variance of the random haplotype effects is = 0.013 with a corresponding likelihood ratio test statistic of χ1,02 = 6.69 (p < 0.05). The estimated error variance is = 0.057. Empirical Bayes estimates of the random haplotype pair effects are given in Figure 2. These results suggest overall variability in the haplotypic effects of EL on HDL-C. The cluster (ACA,AGA) has the largest absolute estimated effect, suggesting that individuals with this pair of haplotypes will have a lower predicted HDL-C level. Since HDL-C is considered as the good cholesterol, these individuals may be at greatest risk for ART associated lipid complications and candidates for targeted intervention.

Fig. 2.

Empirical Bayes predictions of random EL cluster effects. Asterisk indicates cluster membership ambiguity.

Empirical Bayes predictions of random EL cluster effects. Asterisk indicates cluster membership ambiguity. Estimated haplotype frequencies within Hispanics. Estimated haplotype frequencies based on application of mixed model–fitting procedure assuming HWE

DISCUSSION

In this manuscript, we describe an ECM approach to finding ML parameter estimates in the linear mixed-effects model setting when the correlated data indicators are ambiguous. This research was motivated by interest in characterizing genetic effects on a phenotype when haplotypic phase is unobservable. The proposed approach, however, has broader relevance to other settings in which cluster identifiers are not known with certainty. Notably, a similar approach can be applied to missing genotype data, where multilocus genotype group identifiers are treated as ambiguous. We focused on the simple linear mixed-effects model with a single random cluster effect. Alternative formulation of the design matrix for the random effects allows for assessing interactions between patient-specific characteristics and haplotypes. For example, inclusion of an ART drug indicator in the Z matrix of (2.1) would allow us to investigate drug-by-gene effects in a pharmacogenomic study. Extensions to settings with alternative, noncontinuous outcomes and semiparametric mixed models that relax the normality assumption on the random effects require additional consideration. As expected, our simulation study suggests relatively poor performance in the context of a recessive, founder model in which the normality assumption of the random effects is severely violated. Further investigations of performance under alternative model formulations, as well as explorations of the utility of semiparametric procedures and model diagnostics in these settings, would be interesting. In another recent manuscript, we describe an MI approach for this setting in which the haplotype frequencies are estimated independently of the outcome (Foulkes ). An EM algorithm as described by Excoffier and Slatkin (1995) can be applied for haplotype reconstruction and then multiple imputed data sets derived by repeated weighted sampling based on the estimated posterior probabilities. The primary advantage of the joint approach we describe herein is that it incorporates information about the phenotype in the estimation procedure. For example, if ambiguity rests between 2 clusters with effects b and b, where b > b, then this approach will tend to assign individuals with higher observed phenotypes to C and individuals with lower phenotypes to C. Notably, in one simple simulation study, the 2 approaches yielded similar results while the computational burden associated with the joint approach is much greater. In light of the theoretical advantages, however, further and extensive consideration of alternative settings and the extent to which incorporating this additional layer of information indeed results in greater efficiency is warranted. A primary limitation of the mixed modeling approach for haplotype–trait association studies is that as the number of SNPs increases, the number of haplotypes (and therefore clusters) can quickly approach the number of individuals under investigation. In genome-wide association studies, taking a random sample of SNPs within a known disease pathway or genetic region may be tenable and appropriate for the random-effects modeling framework. Paradoxically, increasing the number of variables (SNPs) can also lead to greater phase ambiguity in the data, suggesting an important trade-off between information gained from more accurate haplotype reconstruction and potential power loss associated with increasing ambiguity. As mentioned in Section 1, the proposed approach represents an extension of SEMs with a doubly latent class structure defined by both latent random effects and unobservable class identifiers. Further extensions that draw on the literature of SEMs may provide additional tools for incorporating known biological function, such as gene-specific pathways to disease. For example, multiple random effects based on sets of genes with similar, known functionality may provide additional insight into the determinants of complex diseases. The framework we describe allows for this multivariable investigation while accounting for the unobservable nature of haplotypic phase in association studies.

FUNDING

National Institute of Allergy and Infection Diseases (NIAID) (AI056983); National Institute of Diabetes and Digestive and Kidney Diseases (DK021224); Adult AIDS Clinical Trials Group funded by the NIAID (AI38858); CRI: Computational Biology Facility for Western Massachusetts (CNS 0551500) to computing cluster.

Table 2.

Estimated haplotype frequencies within Hispanics. Estimated haplotype frequencies based on application of mixed model–fitting procedure assuming HWE

	EL haplotypes			Estimated frequency
	rs12970066	Asn396Ser	rs3829632(-1309A/G)
1	A	C	A	0.470
2	A	G	A	0.205
3	G	C	A	0.325
4	G	G	A	< 0.001

Table 3.

Simulation results for differing percents ambiguity and variance ratios

Ambiguity (%)	σ_b/σ_ϵ	Power (%)	Bias ()†					CR‡
			β₀	_a	_u	σ_ϵ²	σ_b²	β₀	_a	_u	σ_ϵ²	σ_b²
0*	0.2	32	0.0039(0.084)	—	0.00083(0.015)	0.0021(0.10)	0.011(0.043)	0.95	—	0.97	0.95	0.94
	0.4	88	0.00046(0.12)	—	0.00024(0.015)	0.0073(0.11)	0.017(0.082)	0.94	—	0.97	0.96	0.97
	0.6	99	0.0061(0.16)	—	0.00048(0.015)	0.011(0.11)	0.037(0.16)	0.95	—	0.97	0.96	0.95
	0.8	100	0.0019(0.19)	—	0.00048(0.015)	0.0086(0.11)	0.014(0.25)	0.96	—	0.96	0.95	0.95
	1.0	100	0.0036(0.24)	—	0.00048(0.015)	0.013(0.11)	0.025(0.36)	0.96	—	0.97	0.94	0.96
5*	0.2	34	0.011(0.088)	0.00085(0.020)	0.00026(0.015)	0.010(0.11)	0.0020(0.056)	0.96	0.96	0.97	0.95	0.94
	0.4	89	0.0039(0.12)	0.0016(0.020)	0.00026(0.014)	0.0053(0.11)	0.0066(0.095)	0.96	0.99	0.96	0.93	0.96
	0.6	100	0.0046(0.16)	0.00031(0.018)	0.00026(0.014)	0.00071(0.11)	0.0091(0.16)	0.96	0.98	0.97	0.94	0.98
	0.8	100	0.032(0.20)	0.00073(0.019)	0.00026(0.014)	0.0096(0.11)	0.012(0.26)	0.93	0.96	0.97	0.97	0.97
	1.0	100	0.011(0.23)	0.0010(0.018)	0.00039(0.015)	0.0038(0.11)	0.034(0.36)	0.97	0.98	0.97	0.91	0.94
10**	0.2	33	0.0027(0.091)	0.00049(0.019)	0.00084(0.014)	0.00015(0.10)	0.021(0.049)	0.96	0.97	0.96	0.97	0.93
	0.4	87	0.0099(0.14)	0.0011(0.019)	0.00057(0.014)	0.0051(0.11)	0.025(0.093)	0.95	0.97	0.97	0.94	0.95
	0.6	100	0.0011(0.17)	0.00095(0.019)	0.00048(0.014)	0.026(0.11)	0.048(0.17)	0.96	0.97	0.97	0.94	0.91
	0.8	100	0.026(0.19)	0.0017(0.018)	0.00066(0.014)	0.0073(0.10)	0.061(0.31)	0.93	0.97	0.97	0.96	0.94
	1.0	100	0.016(0.22)	0.00068(0.018)	0.00080(0.014)	0.0059(0.12)	0.097(0.39)	0.95	0.97	0.97	0.95	0.93
20**	0.2	30	0.0037(0.089)	0.0022(0.020)	0.00019(0.013)	0.0011(0.10)	0.044(0.071)	0.95	0.97	0.96	0.95	0.90
	0.4	88	0.0076(0.11)	0.0022(0.019)	0.00077(0.013)	0.0083(0.11)	0.083(0.12)	0.97	0.98	0.96	0.95	0.88
	0.6	99	0.013(0.15)	0.0019(0.020)	0.00058(0.013)	0.0042(0.11)	0.13(0.21)	0.95	0.98	0.97	0.95	0.91
	0.8	100	0.017(0.20)	0.0011(0.020)	0.00077(0.013)	0.010(0.11)	0.13(0.29)	0.95	0.98	0.96	0.97	0.92
	1.0	100	0.0014(0.23)	0.0012(0.020)	0.00077(0.013)	0.0029(0.11)	0.27(0.44)	0.94	0.97	0.97	0.95	0.89

Results are based on *400 and **200 simulations per condition (σ/σϵ) with samples of size n = 200 and m = 21 clusters.

†Bias is defined as the absolute difference between the median of the estimate over the simulations and the true parameter value. a and u are the average bias across the ambiguous and unambiguous clusters, respectively. Standard errors () are calculated based on all simulations within a condition.

‡CR is defined as the proportion of simulations for which the true parameter value is within the corresponding 95% confidence interval.

11 in total

1. Characterizing the relationship between HIV-1 genotype and phenotype: prediction-based classification.

Authors: A S Foulkes; Gruttola V De
Journal: Biometrics Date: 2002-03 Impact factor: 2.571

2. Finite mixture modeling with mixture outcomes using the EM algorithm.

Authors: B Muthén; K Shedden
Journal: Biometrics Date: 1999-06 Impact factor: 2.571

3. Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous.

Authors: S L Lake; H Lyon; K Tantisira; E K Silverman; S T Weiss; N M Laird; D J Schaid
Journal: Hum Hered Date: 2003 Impact factor: 0.444

4. Comparison of prospective and retrospective methods for haplotype inference in case-control studies.

Authors: Glen A Satten; Michael P Epstein
Journal: Genet Epidemiol Date: 2004-11 Impact factor: 2.135

5. Mixed modelling to characterize genotype-phenotype associations.

Authors: A S Foulkes; M Reilly; L Zhou; M Wolfe; D J Rader
Journal: Stat Med Date: 2005-03-15 Impact factor: 2.373

6. Regression-based association analysis with clustered haplotypes through use of genotypes.

Authors: Jung-Ying Tzeng; Chih-Hao Wang; Jau-Tsuen Kao; Chuhsing Kate Hsiao
Journal: Am J Hum Genet Date: 2005-12-19 Impact factor: 11.025

7. Unbalanced repeated-measures models with structured covariance matrices.

Authors: R I Jennrich; M D Schluchter
Journal: Biometrics Date: 1986-12 Impact factor: 2.571

8. Random-effects models for longitudinal data.

Authors: N M Laird; J H Ware
Journal: Biometrics Date: 1982-12 Impact factor: 2.571

9. Mixed modeling and multiple imputation for unobservable genotype clusters.

Authors: A S Foulkes; R Yucel; M P Reilly
Journal: Stat Med Date: 2008-07-10 Impact factor: 2.373

10. Associations among race/ethnicity, ApoC-III genotypes, and lipids in HIV-1-infected individuals on antiretroviral therapy.

Authors: Andrea S Foulkes; David A Wohl; Ian Frank; Elaine Puleo; Stephanie Restine; Megan L Wolfe; Michael P Dube; Pablo Tebas; Muredach P Reilly
Journal: PLoS Med Date: 2006-03 Impact factor: 11.069

7 in total

1. State of the Multiple Imputation Software.

Authors: Recai M Yucel
Journal: J Stat Softw Date: 2011-12 Impact factor: 6.440

2. Mixture modelling as an exploratory framework for genotype-trait associations.

Authors: Kinman Au; Rongheng Lin; Andrea S Foulkes
Journal: J R Stat Soc Ser C Appl Stat Date: 2011-05 Impact factor: 1.864

3. Latent variable modeling paradigms for genotype-trait association studies.

Authors: Yan Liu; Andrea S Foulkes
Journal: Biom J Date: 2011-09 Impact factor: 2.207

4. Double-blinded, randomized, and controlled study on the effects of canagliflozin after bariatric surgery: A pilot study.

Authors: Sangeeta R Kashyap; Karim Kheniser; Ali Aminian; Philip Schauer; Carel Le Roux; Bartolome Burguera
Journal: Obes Sci Pract Date: 2020-03-17

5. Estimating and testing haplotype-trait associations in non-diploid populations.

Authors: X Li; B N Thomas; S M Rich; D Ecker; J K Tumwine; A S Foulkes
Journal: J R Stat Soc Ser C Appl Stat Date: 2009-12 Impact factor: 1.864

6. Mixed modeling of meta-analysis P-values (MixMAP) suggests multiple novel gene loci for low density lipoprotein cholesterol.

Authors: Andrea S Foulkes; Gregory J Matthews; Ujjwal Das; Jane F Ferguson; Rongheng Lin; Muredach P Reilly
Journal: PLoS One Date: 2013-02-06 Impact factor: 3.240

7. Multiple imputation inference for multivariate multilevel continuous data with ignorable non-response.

Authors: Recai M Yucel
Journal: Philos Trans A Math Phys Eng Sci Date: 2008-07-13 Impact factor: 4.226

7 in total