Andrea S Foulkes, Recai Yucel, Xiaohong Li.
Abstract
This manuscript describes a novel, linear mixed-effects model-fitting technique for the setting in which correlated data indicators are not completely observed. Mixed modeling is a useful analytical tool for characterizing genotype-phenotype associations among multiple potentially informative genetic loci. This approach involves grouping individuals into genetic clusters, where individuals in the same cluster have similar or identical multilocus genotypes. In haplotype-based investigations of unrelated individuals, corresponding cluster assignments are unobservable since the alignment of alleles within chromosomal copies is not generally observed. We derive an expectation conditional maximization approach to estimation in the mixed modeling setting, where cluster assignments are ambiguous. The approach has broad relevance to the analysis of data with missing correlated data identifiers. An example is provided based on data arising from a cohort of human immunodeficiency virus type-1-infected individuals at risk for antiretroviral therapy-associated dyslipidemia.
Keywords: Expectation conditional maximization; Genotype; HIV-1; Haplotype; Lipids; Missing identifiers; Mixed-effects models; Phenotype; Population-based genetic association studies
Year: 2008 PMID: 18343883 PMCID: PMC2536727 DOI: 10.1093/biostatistics/kxm055
Source DB: PubMed Journal: Biostatistics ISSN: 1465-4644 Impact factor: 5.899
Fig. 1. Sample approaches to defining clusters. For the 2-SNP example in which the observed genotypes are (AA, Aa, or aa) and (BB, Bb, or bb), there are 4 possible haplotypes, AB, Ab, aB, and ab, and 10 possible diplotypes. The most general approach to defining clusters results in 10 clusters, one for each possible combination of 2 haplotypes. These are indicated by shaded rectangles. An alternative approach groups all diplotypes with at least one copy of the rare ab haplotype into a single cluster. This is indicated by the dashed rectangle that combines 4 of the previously defined clusters into a single cluster. In this case, there are a total of 7 clusters.
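The counts in the Fig. 1 caption can be checked by direct enumeration. The following is a minimal sketch, not code from the paper; the function name `cluster_label` and the label `rare-carrier` are hypothetical names chosen for illustration.

```python
from itertools import combinations_with_replacement, product

# Two biallelic SNPs with alleles (A, a) and (B, b), as in Fig. 1.
haplotypes = ["".join(h) for h in product("Aa", "Bb")]  # AB, Ab, aB, ab

# A diplotype is an unordered pair of haplotypes, one per chromosomal copy.
diplotypes = list(combinations_with_replacement(haplotypes, 2))


def cluster_label(diplotype, rare="ab"):
    """Hypothetical helper: collapse carriers of the rare haplotype into one cluster."""
    return "rare-carrier" if rare in diplotype else "/".join(diplotype)


# Most general definition: every diplotype is its own cluster.
general_clusters = {"/".join(d) for d in diplotypes}

# Alternative definition: all diplotypes with at least one ab copy share a cluster.
collapsed_clusters = {cluster_label(d) for d in diplotypes}

print(len(haplotypes), len(general_clusters), len(collapsed_clusters))  # 4 10 7
```

Four of the 10 diplotypes carry at least one ab copy, so collapsing them leaves 6 + 1 = 7 clusters, matching the caption.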
EL genotypes within Hispanics. Genotype counts for combinations of 3 SNPs in EL. Although variability in rs3829632 is not observed within the subset of Hispanics, this SNP is included in the presentation for completeness.
| EL genotype | rs12970066 | Asn396Ser | rs3829632 (-1309A/G) | Count (proportion) |
| 1 | AA | CC | AA | 23 (0.21) |
| 2 | AA | CG | AA | 24 (0.22) |
| 3 | AA | GG | AA | 4 (0.04) |
| 4 | AG | CC | AA | 31 (0.28) |
| 5 | AG | CG | AA | 13 (0.12) |
| 6 | GG | CC | AA | 14 (0.13) |
| Total | | | | 109 |
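To see which of the observed genotypes leave the haplotype pair (and hence the cluster) unresolved, one can enumerate every phase assignment consistent with each row of the table. This is an illustrative sketch only; the helper `compatible_diplotypes` is a hypothetical name, and haplotypes are written as three-letter strings in the column order rs12970066, Asn396Ser, rs3829632.

```python
from itertools import product

# Observed multilocus genotypes, copied from the table above.
genotypes = {
    1: ("AA", "CC", "AA"),
    2: ("AA", "CG", "AA"),
    3: ("AA", "GG", "AA"),
    4: ("AG", "CC", "AA"),
    5: ("AG", "CG", "AA"),
    6: ("GG", "CC", "AA"),
}


def compatible_diplotypes(genotype):
    """Hypothetical helper: all unordered haplotype pairs consistent with a genotype."""
    pairs = set()
    # At each SNP, either of the two alleles may sit on the first chromosomal copy.
    for orientation in product((0, 1), repeat=len(genotype)):
        first = "".join(g[o] for g, o in zip(genotype, orientation))
        second = "".join(g[1 - o] for g, o in zip(genotype, orientation))
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


for gid, genotype in genotypes.items():
    pairs = compatible_diplotypes(genotype)
    status = "ambiguous" if len(pairs) > 1 else "unambiguous"
    print(gid, pairs, status)
```

Only genotype 5, which is heterozygous at two SNPs, is compatible with more than one haplotype pair: (ACA, GGA) or (AGA, GCA). The other five genotypes determine their haplotype pair uniquely.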
Fig. 2. Empirical Bayes predictions of random EL cluster effects. Asterisk indicates cluster membership ambiguity.
Estimated haplotype frequencies within Hispanics. Estimated haplotype frequencies based on application of mixed model–fitting procedure assuming HWE.
| EL haplotype | rs12970066 | Asn396Ser | rs3829632 (-1309A/G) | Estimated frequency |
| 1 | A | C | A | 0.470 |
| 2 | A | G | A | 0.205 |
| 3 | G | C | A | 0.325 |
| 4 | G | G | A | < 0.001 |
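As a rough illustration of what HWE implies for an ambiguous genotype, the sketch below weights the two haplotype pairs compatible with the double heterozygote (AG, CG, AA) by their HWE probabilities. Two assumptions are made here and are not taken from the paper: the frequency of haplotype G-G-A is reported only as < 0.001, so the value 0.001 is used as a placeholder upper bound, and the weights use genotype information only, ignoring any phenotype information that the model-fitting procedure may also condition on.

```python
# Estimated haplotype frequencies from the table above.
# ASSUMPTION: GGA is reported only as "< 0.001"; 0.001 is a placeholder upper bound.
freq = {"ACA": 0.470, "AGA": 0.205, "GCA": 0.325, "GGA": 0.001}


def hwe_prob(h1, h2, freq):
    """Diplotype probability under Hardy-Weinberg equilibrium."""
    p = freq[h1] * freq[h2]
    return p if h1 == h2 else 2 * p  # two orderings for distinct haplotypes


# The two haplotype pairs compatible with genotype (AG, CG, AA).
candidates = [("ACA", "GGA"), ("AGA", "GCA")]

weights = [hwe_prob(h1, h2, freq) for h1, h2 in candidates]
total = sum(weights)
posterior = {pair: w / total for pair, w in zip(candidates, weights)}

for pair, p in posterior.items():
    print(pair, round(p, 3))
```

Under these assumptions nearly all of the weight falls on the pair (AGA, GCA), because the alternative requires the very rare G-G-A haplotype. A smaller placeholder for that frequency would shift the weight even further in the same direction.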
Simulation results for differing percentages of ambiguity and variance ratios.
| Ambiguity (%) | Variance ratio (σ/σϵ) | Power (%) | Bias† (SE) | | | | | CR‡ | | | | |
| 0* | 0.2 | 32 | 0.0039(0.084) | — | 0.00083(0.015) | 0.0021(0.10) | 0.011(0.043) | 0.95 | — | 0.97 | 0.95 | 0.94 |
| | 0.4 | 88 | 0.00046(0.12) | — | 0.00024(0.015) | 0.0073(0.11) | 0.017(0.082) | 0.94 | — | 0.97 | 0.96 | 0.97 |
| | 0.6 | 99 | 0.0061(0.16) | — | 0.00048(0.015) | 0.011(0.11) | 0.037(0.16) | 0.95 | — | 0.97 | 0.96 | 0.95 |
| | 0.8 | 100 | 0.0019(0.19) | — | 0.00048(0.015) | 0.0086(0.11) | 0.014(0.25) | 0.96 | — | 0.96 | 0.95 | 0.95 |
| | 1.0 | 100 | 0.0036(0.24) | — | 0.00048(0.015) | 0.013(0.11) | 0.025(0.36) | 0.96 | — | 0.97 | 0.94 | 0.96 |
| 5* | 0.2 | 34 | 0.011(0.088) | 0.00085(0.020) | 0.00026(0.015) | 0.010(0.11) | 0.0020(0.056) | 0.96 | 0.96 | 0.97 | 0.95 | 0.94 |
| | 0.4 | 89 | 0.0039(0.12) | 0.0016(0.020) | 0.00026(0.014) | 0.0053(0.11) | 0.0066(0.095) | 0.96 | 0.99 | 0.96 | 0.93 | 0.96 |
| | 0.6 | 100 | 0.0046(0.16) | 0.00031(0.018) | 0.00026(0.014) | 0.00071(0.11) | 0.0091(0.16) | 0.96 | 0.98 | 0.97 | 0.94 | 0.98 |
| | 0.8 | 100 | 0.032(0.20) | 0.00073(0.019) | 0.00026(0.014) | 0.0096(0.11) | 0.012(0.26) | 0.93 | 0.96 | 0.97 | 0.97 | 0.97 |
| | 1.0 | 100 | 0.011(0.23) | 0.0010(0.018) | 0.00039(0.015) | 0.0038(0.11) | 0.034(0.36) | 0.97 | 0.98 | 0.97 | 0.91 | 0.94 |
| 10** | 0.2 | 33 | 0.0027(0.091) | 0.00049(0.019) | 0.00084(0.014) | 0.00015(0.10) | 0.021(0.049) | 0.96 | 0.97 | 0.96 | 0.97 | 0.93 |
| | 0.4 | 87 | 0.0099(0.14) | 0.0011(0.019) | 0.00057(0.014) | 0.0051(0.11) | 0.025(0.093) | 0.95 | 0.97 | 0.97 | 0.94 | 0.95 |
| | 0.6 | 100 | 0.0011(0.17) | 0.00095(0.019) | 0.00048(0.014) | 0.026(0.11) | 0.048(0.17) | 0.96 | 0.97 | 0.97 | 0.94 | 0.91 |
| | 0.8 | 100 | 0.026(0.19) | 0.0017(0.018) | 0.00066(0.014) | 0.0073(0.10) | 0.061(0.31) | 0.93 | 0.97 | 0.97 | 0.96 | 0.94 |
| | 1.0 | 100 | 0.016(0.22) | 0.00068(0.018) | 0.00080(0.014) | 0.0059(0.12) | 0.097(0.39) | 0.95 | 0.97 | 0.97 | 0.95 | 0.93 |
| 20** | 0.2 | 30 | 0.0037(0.089) | 0.0022(0.020) | 0.00019(0.013) | 0.0011(0.10) | 0.044(0.071) | 0.95 | 0.97 | 0.96 | 0.95 | 0.90 |
| | 0.4 | 88 | 0.0076(0.11) | 0.0022(0.019) | 0.00077(0.013) | 0.0083(0.11) | 0.083(0.12) | 0.97 | 0.98 | 0.96 | 0.95 | 0.88 |
| | 0.6 | 99 | 0.013(0.15) | 0.0019(0.020) | 0.00058(0.013) | 0.0042(0.11) | 0.13(0.21) | 0.95 | 0.98 | 0.97 | 0.95 | 0.91 |
| | 0.8 | 100 | 0.017(0.20) | 0.0011(0.020) | 0.00077(0.013) | 0.010(0.11) | 0.13(0.29) | 0.95 | 0.98 | 0.96 | 0.97 | 0.92 |
| | 1.0 | 100 | 0.0014(0.23) | 0.0012(0.020) | 0.00077(0.013) | 0.0029(0.11) | 0.27(0.44) | 0.94 | 0.97 | 0.97 | 0.95 | 0.89 |
Results are based on *400 and **200 simulations per condition (σ/σϵ) with samples of size n = 200 and m = 21 clusters.
†Bias is defined as the absolute difference between the median of the estimate over the simulations and the true parameter value. The quantities a and u are the average bias across the ambiguous and unambiguous clusters, respectively. Standard errors (in parentheses) are calculated based on all simulations within a condition.
‡CR is defined as the proportion of simulations for which the true parameter value is within the corresponding 95% confidence interval.
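The two summary measures defined in the footnotes can be written down in a few lines. This is a minimal sketch of those definitions, not the authors' simulation code; the function names and the small input lists are made up purely for illustration and are not results from the paper.

```python
from statistics import median


def bias(estimates, truth):
    """Absolute difference between the median estimate and the true value."""
    return abs(median(estimates) - truth)


def coverage_rate(intervals, truth):
    """Proportion of simulations whose confidence interval contains the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)


# Illustrative inputs only (hypothetical numbers, one entry per simulated data set).
truth = 1.0
estimates = [0.9, 1.1, 1.05, 0.98]
intervals = [(0.7, 1.1), (0.9, 1.3), (0.85, 1.25), (1.01, 1.2)]

print(bias(estimates, truth))
print(coverage_rate(intervals, truth))
```

In this toy input, three of the four intervals contain the true value, so the coverage rate is 0.75. With a nominal 95% interval and many simulations, a well-calibrated procedure would be expected to give a CR near 0.95.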