Sheng Zhong, Duo Jiang, Mary Sara McPeek.
Abstract
We consider the problem of genetic association testing of a binary trait in a sample that contains related individuals, where we adjust for relevant covariates and allow for missing data. We propose CERAMIC, an estimating equation approach that can be viewed as a hybrid of logistic regression and linear mixed-effects model (LMM) approaches. CERAMIC extends the recently proposed CARAT method to allow samples with related individuals and to incorporate partially missing data. In simulations, we show that CERAMIC outperforms existing LMM and generalized LMM approaches, maintaining high power and correct type 1 error across a wider range of scenarios. CERAMIC results in a particularly large power increase over existing methods when the sample includes related individuals with some missing data (e.g., when some individuals with phenotype and covariate information have missing genotype), because CERAMIC is able to make use of the relationship information to incorporate partially missing data in the analysis while correcting for dependence. Because CERAMIC is based on a retrospective analysis, it is robust to misspecification of the phenotype model, resulting in better control of type 1 error and higher power than that of prospective methods, such as GMMAT, when the phenotype model is misspecified. CERAMIC is computationally efficient for genomewide analysis in samples of related individuals of almost any configuration, including small families, unrelated individuals and even large, complex pedigrees. We apply CERAMIC to data on type 2 diabetes (T2D) from the Framingham Heart Study. In a genome scan, 9 of the 10 smallest CERAMIC p-values occur in or near either known T2D susceptibility loci or plausible candidates, verifying that CERAMIC is able to home in on the important loci in a genome scan.
Year: 2016 PMID: 27695091 PMCID: PMC5047592 DOI: 10.1371/journal.pgen.1006329
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Some Relevant Features of the Methods Compared in Simulations.
| Method | Mean Model | Additive Polygenic VCs? | Analysis Type | Sophisticated Use of Missing Data? |
|---|---|---|---|---|
| CERAMIC | logistic | yes | retrospective | yes |
| GMMAT | logistic | yes | prospective | no |
| CARAT | logistic | yes | retrospective | no |
| MQLS-LOG | logistic | no | retrospective | yes |
| EMMAX, GEMMA | linear | yes | prospective | no |
| MASTOR | linear | yes | retrospective | yes |
| MQLS-LIN | linear | no | retrospective | yes |
“Sophisticated Use of Missing Data” refers to methods that do something more sophisticated than plugging in the mean genotype value for missing genotype values or removing missing values.
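For contrast with the "sophisticated" approaches in the table, the naive baseline mentioned in this note (plugging in the mean genotype value for missing genotypes) can be sketched as follows. This is only an illustration of that baseline, not CERAMIC's procedure; the function name `impute_mean_genotype` and the use of `None` to mark a missing genotype are assumptions made for the example.

```python
from statistics import mean

def impute_mean_genotype(genotypes):
    """Replace missing genotypes (None) with the mean of the observed ones.

    Genotypes are assumed to be coded as minor-allele counts (0, 1, 2);
    this coding is an assumption of the sketch, not taken from the source.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("no observed genotypes to average")
    fill = mean(observed)
    return [fill if g is None else g for g in genotypes]

# Hypothetical genotype vector with two missing entries.
genotypes = [0, 1, None, 2, 1, None]
imputed = impute_mean_genotype(genotypes)
```

Note that this fill value ignores pedigree structure entirely, which is the kind of information the methods marked "yes" in the last column are described as exploiting.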
Fig 1. Three-Generation Pedigree Used in Simulations.
In the simulations, each sampled family is assumed to have a 3-generation, 16-person pedigree of this form.
Empirical Type 1 Error of CERAMIC, MQLS-LOG and MQLS-LIN, Based on 25,000 Simulated Replicates.
| Trait Model | Subset Used | MAF | Nominal Level | CERAMIC | MQLS-LOG | MQLS-LIN |
|---|---|---|---|---|---|---|
| Logistic | All | 0.2 | .05 | .049 | .050 | .049 |
| Logistic | All | 0.1 | .05 | .051 | .050 | .050 |
| Logistic | MX | 0.2 | .05 | .049 | .049 | .049 |
| Logistic | MX | 0.1 | .05 | .051 | .051 | .050 |
| Logistic | All | 0.2 | .001 | .0011 | .0010 | .0011 |
| Logistic | All | 0.1 | .001 | .0007 | .0008 | .0007 |
| Logistic | MX | 0.2 | .001 | .0010 | .0010 | .0010 |
| Logistic | MX | 0.1 | .001 | .0009 | .0009 | .001 |
| Liability | All | 0.2 | .05 | .051 | .050 | .050 |
| Liability | All | 0.1 | .05 | .048 | .0474 | .048 |
| Liability | MX | 0.2 | .05 | .052 | .052 | .051 |
| Liability | MX | 0.1 | .05 | .050 | .050 | .051 |
| Liability | All | 0.2 | .001 | .0011 | .0011 | .0010 |
| Liability | All | 0.1 | .001 | .0011 | .0011 | .0012 |
| Liability | MX | 0.2 | .001 | .0008 | .0008 | .0008 |
| Liability | MX | 0.1 | .001 | .0009 | .0009 | .0007 |
“Logistic” denotes the mixed-effects logistic regression model of Eq 26 with (θ, θ) = (.6, .4). “Liability” denotes the liability threshold model of Eq 27 with (π, π) = (.4, .4). “All” indicates that all individuals, including those with partially missing data, are included in the analyses for all three statistics. “MX” (for “missing excluded”) indicates that only individuals with complete data are included in the analyses for all three statistics. “MAF” denotes the minor allele frequency of the tested SNP. The radius of the 95% confidence interval is .0027 for nominal level .05 and .0004 for nominal level .001.
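The two confidence-interval radii quoted above are consistent with the usual normal approximation to a binomial proportion over 25,000 replicates. The source does not state how the radii were computed, so the formula, the function name `ci_radius`, and the multiplier z = 1.96 in the sketch below are assumptions.

```python
import math

def ci_radius(nominal_level, n_replicates, z=1.96):
    """Half-width of a normal-approximation 95% CI for an empirical rate.

    Assumes each replicate rejects independently with probability equal
    to the nominal level (i.e., the test is correctly calibrated).
    """
    return z * math.sqrt(nominal_level * (1 - nominal_level) / n_replicates)

r05 = ci_radius(0.05, 25_000)    # radius at nominal level .05
r001 = ci_radius(0.001, 25_000)  # radius at nominal level .001
print(round(r05, 4), round(r001, 4))  # prints 0.0027 0.0004
```

Under this reading, an empirical type 1 error at nominal level .05 is consistent with correct calibration when it falls between roughly .0473 and .0527.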
Type 1 Error when the Trait Model Is Misspecified.
| Partially Missing Data? | Model | Setting | MQLS-LOG | CERAMIC | GMMAT |
|---|---|---|---|---|---|
| Yes | Threshold (60,20) | A | .050 | .050 | |
| Yes | Threshold (40,40) | A | .048 | .048 | |
| Yes | Threshold (20,60) | A | .050 | .050 | |
| Yes | Threshold (0,80) | A | .049 | .049 | |
| Yes | Threshold (20,60) | B | .051 | .051 | |
| Yes | Threshold (20,60) | .01 | .051 | .051 | |
| Yes | Logistic (80,20) | A | .051 | .052 | |
| Yes | Logistic (60,40) | A | .050 | .050 | |
| Yes | Logistic (40,60) | A | .050 | .050 | |
| Yes | Logistic (20,80) | A | .048 | .048 | |
| Yes | Logistic (0,100) | A | .052 | .052 | .052 |
| No | Threshold (60,20) | A | .048 | .048 | |
| No | Threshold (40,40) | A | .051 | .051 | .048 |
| No | Threshold (20,60) | A | .051 | .051 | |
| No | Threshold (0,80) | A | .051 | .051 | |
| No | Threshold (20,60) | B | .050 | .050 | |
| No | Threshold (20,60) | .01 | .049 | .049 | |
| No | Logistic (80,20) | A | .049 | .049 | .050 |
| No | Logistic (60,40) | A | .051 | .051 | .050 |
| No | Logistic (40,60) | A | .050 | .050 | |
| No | Logistic (20,80) | A | .048 | .048 | |
| No | Logistic (0,100) | A | .049 | .049 | |
“Trait Model is Misspecified” refers to the fact that the relevant covariates are left out of the fitted model. Under “Model”, for example, “Threshold (60,20)” refers to the mixed-effects liability threshold trait model with (π, π) = (60, 20) and “Logistic (80,20)” refers to the mixed-effects logistic trait model with (θ, θ) = (80, 20). Under “Setting”, “A” refers to ascertainment setting A, “B” refers to the setting in which a shared environment random effect is included in the trait model and ascertainment setting B is used, and “.01” refers to the setting in which the prevalence is set to .01 and ascertainment setting A is used. In all scenarios in which partially missing data are included, the number of individuals sampled in each simulated replicate is 1,200, while in all complete data settings, the number of individuals sampled in each simulated replicate is 600. Empirical type 1 error is assessed based on 25,000 simulations. The radius of the 95% confidence interval for nominal level .05 is .0027. Bold indicates a type 1 error rate that is outside the 95% confidence interval.
Fig 2. Empirical Power of CERAMIC and Other Methods.
Empirical power is based on 10,000 replicates. The error bars indicate 95% confidence intervals. Panels A and B are for the case in which all relevant covariates are included in the fitted model. Panels C and D are for the case in which the relevant covariates are not included in the fitted model. In Panels A and C, the trait is simulated by the mixed-effects logistic regression model, and in Panels B and D, it is simulated by the liability threshold model. The horizontal scale in the plots indicates the relative impact of covariates versus additive polygenic effects on the phenotype, with the far left corresponding to no polygenic effects and strong effects of covariates and the far right corresponding to no effect of covariates and strong polygenic effects. In all cases, partially missing data are simulated and ascertainment setting A is used. In the settings of Panels C and D, the MQLS-LOG and MQLS-LIN methods give identical results, so only the MQLS-LOG results are depicted; similarly, the CERAMIC and MASTOR methods give identical results, so only the CERAMIC results are depicted.
Fig 3. Comparison of Extent of Power Recovery with Missing Genotypes for CERAMIC, CARAT and GMMAT.
Empirical power is based on 10,000 replicates. The error bars indicate 95% confidence intervals. The trait is simulated according to the liability threshold trait model with (π, π) = (40, 40) with 1,200 individuals in each simulated replicate, under ascertainment setting A with missing data. In the “Remove Missing Genotypes” setting, individuals with missing genotypes are removed from the input files before the methods are run. In the “Use Missing Genotypes” setting, individuals with missing genotypes remain in the input files: GMMAT is run with the option to impute the mean genotype value for the missing genotypes, CERAMIC is run with default settings, and CARAT is run with the mean genotype value plugged in for the missing genotypes in the input file. In the “Complete Genotype Data” setting, the missing genotype values are “unmasked” and included in the input files for all methods.
Parameter Estimates in the Null Phenotypic Model of CERAMIC, for Type 2 Diabetes in the Framingham Heart Study.
| Parameter | Estimate | SE |
|---|---|---|
|  | .41 | − |
| Intercept | -6.3 | .41 |
| Coefficient of sex | -.75 | .1 |
| Coefficient of BMI | .24 | .01 |
Sex is coded as female = 2, male = 1.
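As a purely arithmetic illustration of how the fixed-effect estimates in this table combine, the sketch below plugs them into an inverse-logit mean function. It assumes the fitted mean is the plain inverse logit of the linear predictor (the comparison table lists CERAMIC's mean model as logistic) and that BMI enters untransformed in kg/m², which the source does not specify. The function name `fitted_mean` and the example covariate values are hypothetical.

```python
import math

# Estimates copied from the table above (CERAMIC null model for T2D).
INTERCEPT = -6.3
COEF_SEX = -0.75   # sex coded female = 2, male = 1
COEF_BMI = 0.24

def fitted_mean(sex_code, bmi):
    """Inverse logit of the linear predictor (assumed form of the mean model)."""
    eta = INTERCEPT + COEF_SEX * sex_code + COEF_BMI * bmi
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical individuals: a male and a female, both with BMI 30.
p_male = fitted_mean(sex_code=1, bmi=30.0)
p_female = fitted_mean(sex_code=2, bmi=30.0)
```

With the female = 2, male = 1 coding, the negative sex coefficient gives a lower fitted mean for females than for males at the same BMI, and the positive BMI coefficient raises the fitted mean as BMI increases.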
Parameter Estimates in MASTOR’s Null Phenotypic Model for Type 2 Diabetes in the Framingham Heart Study.
| Parameter | MLE (SE) |
|---|---|
|  | .45 (.08) |
|  | .074 (.01) |
|  | .090 (.01) |
|  | .16 (.005) |
| Intercept | -.64 (.06) |
| Coefficient of sex | -.12 (.02) |
| Coefficient of BMI | .041 (.002) |
There are only 2 independently specified VC parameters in the model; the 4 VC parameters in the table are related by two equations. Sex is coded as female = 2, male = 1.
SNPs with Strongest Association with Type 2 Diabetes in the Framingham Heart Study.
| SNP | Chr | Position | Nearest Gene | CERAMIC P-value | MASTOR P-value | EMMAX P-value | MQLS-LOG P-value | MQLS-LIN P-value |
|---|---|---|---|---|---|---|---|---|
| rs13116548 | 4 | 169874620 | 2.6e-05 | 7.8e-05 | 2.7e-05 | 5.2e-05 | ||
| rs1548315 | 4 | 169841518 | 2.5e-05 | 5.8e-05 | 2.9e-05 | 5.2e-05 | ||
| rs6817551 | 4 | 169829345 | 3.6e-05 | 4.7e-05 | 4.7e-05 | 8.0e-05 | ||
| rs11733251 | 4 | 169881917 | 2.7e-05 | 4.7e-05 | 2.9e-05 | 4.9e-05 | ||
| rs2331450 | 4 | 169901225 | 2.8e-05 | 5.9e-05 | 3.2e-05 | 5.2e-05 | ||
| rs10518037 | 4 | 169899023 | 2.9e-05 | 5.9e-05 | 3.4e-05 | 5.6e-05 | ||
| rs1531254 | 4 | 169881245 | 3.2e-05 | 5.6e-05 | 3.1e-05 | 5.4e-05 | ||
| rs1870306 | 4 | 169923672 | 3.1e-05 | 8.9e-05 | 3.0e-05 | 4.6e-05 | ||
| rs17083935 | 9 | 83326808 | 1.7e-05 | 2.3e-05 | 3.9e-05 | 5.8e-05 | ||
| rs12004598 | 9 | 83354770 | 2.0e-05 | 4.9e-05 | 3.4e-05 | 5.0e-05 | ||
| rs17083941 | 9 | 83327608 | 2.6e-05 | 5.7e-05 | 5.3e-05 | 8.0e-05 | ||
| rs4506565 | 10 | 114756041 | 1.7e-07 | 1.3e-07 | 3.0e-07 | 1.0e-07 | ||
| rs7901695 | 10 | 114754088 | 3.1e-07 | 5.4e-07 | 4.5e-07 | 1.6e-07 | ||
| rs12243326 | 10 | 114788815 | 1.0e-06 | 3.8e-06 | 2.2e-06 | 1.2e-06 | ||
| rs4132670 | 10 | 114767771 | 1.3e-06 | 1.8e-06 | 2.6e-06 | 8.8e-07 | ||
| rs7488766 | 12 | 132665596 | 1.5e-05 | 6.3e-06 | 2.6e-05 | 1.6e-05 | ||
| rs11874767 | 18 | 3952917 | 1.8e-05 | 3.3e-05 | 3.5e-05 | 6.3e-05 | ||
Bold indicates the smallest p-value for each SNP. MIM numbers of genes not mentioned in the text: PALLD (MIM 608092), CBR4 (No MIM number), DLGAP1 (MIM 605445).