Brian J Edwards, Chad Haynes, Mark A Levenstien, Stephen J Finch, Derek Gordon.
Abstract
BACKGROUND: Phenotype error causes reduction in power to detect genetic association. We present a quantification of the effect of phenotype error, also known as diagnostic error, on power and sample size calculations for case-control genetic association studies between a marker locus and a disease phenotype. We consider the classic Pearson chi-square test for independence as our test of genetic association. To determine asymptotic power analytically, we compute the distribution's non-centrality parameter, which is a function of the case and control sample sizes, genotype frequencies, disease prevalence, and phenotype misclassification probabilities. We derive the non-centrality parameter in the presence of phenotype errors and equivalent formulas for misclassification cost (the percentage increase in minimum sample size needed to maintain constant asymptotic power at a fixed significance level for each percentage increase in a given misclassification parameter). We use a linear Taylor series approximation for the cost of phenotype misclassification to determine lower bounds for the relative costs of misclassifying a true affected (respectively, unaffected) as a control (respectively, case). Power is verified by computer simulation.
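As a rough illustration of how the non-centrality parameter can depend on sample sizes, genotype frequencies, prevalence, and misclassification probabilities, the Python sketch below assumes a simple mixture model for the observed case and control groups and the standard non-centrality formula for the Pearson chi-square test on a 2 x k table. The function names, the reading of `theta` as Pr(affected classified as control) and `phi` as Pr(unaffected classified as case), and the genotype frequencies are illustrative assumptions and are not taken from the paper's own equations.

```python
def observed_frequencies(p_aff, p_unaff, K, theta, phi):
    """Genotype frequencies among individuals *classified* as cases and controls.

    Assumed mixture model: theta = Pr(affected labelled control),
    phi = Pr(unaffected labelled case), K = disease prevalence.
    """
    w_case = K * (1 - theta) + (1 - K) * phi      # Pr(classified as case)
    w_ctrl = K * theta + (1 - K) * (1 - phi)      # Pr(classified as control)
    case = [(K * (1 - theta) * a + (1 - K) * phi * u) / w_case
            for a, u in zip(p_aff, p_unaff)]
    ctrl = [(K * theta * a + (1 - K) * (1 - phi) * u) / w_ctrl
            for a, u in zip(p_aff, p_unaff)]
    return case, ctrl


def noncentrality(n_cases, n_controls, case_freq, ctrl_freq):
    """Non-centrality parameter of the Pearson chi-square test on a 2 x k table."""
    return n_cases * n_controls * sum(
        (c - u) ** 2 / (n_cases * c + n_controls * u)
        for c, u in zip(case_freq, ctrl_freq)
        if n_cases * c + n_controls * u > 0
    )


# Hypothetical genotype frequencies for a di-allelic locus (not from the paper).
p_aff = [0.64, 0.32, 0.04]
p_unaff = [0.81, 0.18, 0.01]

for theta, phi in [(0.0, 0.0), (0.05, 0.05), (0.15, 0.15)]:
    case, ctrl = observed_frequencies(p_aff, p_unaff, K=0.05, theta=theta, phi=phi)
    lam = noncentrality(500, 500, case, ctrl)
    print(f"theta={theta:.2f} phi={phi:.2f} lambda={lam:.3f}")
```

Under these assumptions, misclassification pulls the observed case and control frequencies toward each other, which shrinks the non-centrality parameter and hence the asymptotic power.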
Year: 2005 PMID: 15819990 PMCID: PMC1131899 DOI: 10.1186/1471-2156-6-18
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Parameter settings for null and power simulations with di-allelic and tetra-allelic loci
| Parameter | Low | High |
| True case and control genotype frequencies | | |
| Pr(affected misclassified as a control) | 0.05 | 0.15 |
| Pr(unaffected misclassified as a case) | 0.05 | 0.15 |
| Disease prevalence | 0.005 | 0.05 |
| Number of cases | 500 | 1000 |
| Number of controls | 500 | 1000 |
| Significance level | 5% | 1% |
| Genotype frequency parameter for tetra-allelic loci (power simulations) | | |
| d | 1 | 2 |
This table presents the low and high parameter settings we consider for null and power simulation calculations for di-allelic and tetra-allelic loci. As per the 2^7 factorial design, null and power simulations are performed on 128 distinct sets of parameter settings. Each simulation uses 100,000 iterations to determine the empirical significance level (null) or simulation power. For di-allelic loci, case and control genotype frequencies are determined by the parameter p (see Methods – Design of simulation program – Power calculations for a fixed sample size). For tetra-allelic loci, genotype frequencies are determined by the parameter d (see the same Methods subsection).
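The factorial design described here can be enumerated in a few lines. The following is a minimal sketch, assuming the seven two-level factors are the ones listed in the table; the dictionary keys are hypothetical names, and the genotype-frequency factor is kept as a label because its numeric levels are not given in this excerpt.

```python
from itertools import product

# Low/high levels for each of the seven factors (key names are illustrative).
factors = {
    "genotype_frequencies": ("low", "high"),
    "pr_affected_as_control": (0.05, 0.15),
    "pr_unaffected_as_case": (0.05, 0.15),
    "prevalence": (0.005, 0.05),
    "n_cases": (500, 1000),
    "n_controls": (500, 1000),
    "significance_level": (0.05, 0.01),
}

# Every combination of one level per factor: 2 ** 7 parameter settings.
settings = [dict(zip(factors, levels)) for levels in product(*factors.values())]

print(len(settings))   # 128
print(settings[0])
```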
Percentiles for absolute difference between asymptotic power and simulation power
| Percentile | 5% significance level | 1% significance level |
| Di-allelic locus | ||
| Minimum | 0.0000 | 0.0000 |
| 10th percentile | 0.0002 | 0.0002 |
| 25th percentile | 0.0005 | 0.0004 |
| 50th percentile | 0.0010 | 0.0011 |
| 75th percentile | 0.0028 | 0.0026 |
| 90th percentile | 0.0065 | 0.0057 |
| Maximum | 0.0099 | 0.0119 |
| Tetra-allelic locus | ||
| Minimum | 0.0000 | 0.0000 |
| 10th percentile | 0.0000 | 0.0000 |
| 25th percentile | 0.0007 | 0.0008 |
| 50th percentile | 0.0012 | 0.0014 |
| 75th percentile | 0.0028 | 0.0032 |
| 90th percentile | 0.0072 | 0.0081 |
| Maximum | 0.0102 | 0.0111 |
Power simulations are performed at 100,000 iterations for each set of parameter specifications in the Methods section. Here we report various percentiles of the absolute difference |simulation power - asymptotic power| for our simulations. For each locus type (di-allelic, tetra-allelic), percentiles are computed using the 2^7 = 128 settings documented in Table 1.
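The comparison between asymptotic and simulation power can be sketched in pure Python for a di-allelic locus (three genotypes, 2 degrees of freedom). This is an illustration under stated assumptions, not the authors' simulation program: the function names and observed genotype frequencies are hypothetical, the iteration count is far below the 100,000 used in the study, and the non-central chi-square tail is computed with a Poisson-mixture series that only applies to even degrees of freedom.

```python
import math
import random
from collections import Counter


def noncentrality(n_cases, n_controls, case_freq, ctrl_freq):
    """Non-centrality parameter of the Pearson chi-square test on a 2 x k table."""
    return n_cases * n_controls * sum(
        (c - u) ** 2 / (n_cases * c + n_controls * u)
        for c, u in zip(case_freq, ctrl_freq)
        if n_cases * c + n_controls * u > 0
    )


def ncx2_sf_even_df(crit, df, lam, terms=200):
    """P(X > crit) for a non-central chi-square with even df (Poisson mixture)."""
    half_c, half_lam, m = crit / 2.0, lam / 2.0, df // 2
    term = math.exp(-half_c)          # exp(-c/2) * (c/2)^k / k! at k = 0
    tail = 0.0
    for k in range(m):                # central tail for 2m degrees of freedom
        tail += term
        term *= half_c / (k + 1)
    pois, total = math.exp(-half_lam), 0.0
    for i in range(terms):            # weight central tails by Poisson(lam / 2)
        total += pois * tail
        tail += term                  # tail for two more degrees of freedom
        term *= half_c / (m + i + 1)
        pois *= half_lam / (i + 1)
    return total


def pearson_chi2(case_counts, ctrl_counts):
    """Pearson chi-square statistic for a 2 x k table of genotype counts."""
    n1, n2 = sum(case_counts), sum(ctrl_counts)
    stat = 0.0
    for a, b in zip(case_counts, ctrl_counts):
        col = a + b
        if col == 0:
            continue                  # skip genotypes never observed
        e1, e2 = n1 * col / (n1 + n2), n2 * col / (n1 + n2)
        stat += (a - e1) ** 2 / e1 + (b - e2) ** 2 / e2
    return stat


def simulated_power(n_cases, n_controls, case_freq, ctrl_freq, crit,
                    iterations=2000, seed=1):
    """Fraction of simulated data sets whose statistic exceeds the critical value."""
    rng = random.Random(seed)
    k = len(case_freq)
    rejections = 0
    for _ in range(iterations):
        a = Counter(rng.choices(range(k), weights=case_freq, k=n_cases))
        b = Counter(rng.choices(range(k), weights=ctrl_freq, k=n_controls))
        stat = pearson_chi2([a[j] for j in range(k)], [b[j] for j in range(k)])
        rejections += stat > crit
    return rejections / iterations


alpha = 0.05
crit = -2.0 * math.log(alpha)         # exact central chi-square quantile for 2 df
case_freq = [0.45, 0.45, 0.10]        # hypothetical observed frequencies
ctrl_freq = [0.50, 0.42, 0.08]
lam = noncentrality(500, 500, case_freq, ctrl_freq)
asymptotic = ncx2_sf_even_df(crit, 2, lam)
simulated = simulated_power(500, 500, case_freq, ctrl_freq, crit)
print(f"asymptotic={asymptotic:.4f} simulated={simulated:.4f} "
      f"abs diff={abs(asymptotic - simulated):.4f}")
```

With only a few thousand iterations the Monte Carlo error is much larger than the differences reported in the table, so the sketch shows the mechanics of the comparison rather than its precision.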
Cost coefficients for different types of misclassification
| K | R* | p | C_θ | C_φ |
| 0.005 | 0.5 | 0.05 | 0.01 | 540.29 |
| 0.005 | 0.5 | 0.15 | 0.01 | 458.99 |
| 0.005 | 1 | 0.05 | 0.01 | 478.32 |
| 0.005 | 1 | 0.15 | 0.01 | 432.67 |
| 0.005 | 2 | 0.05 | 0.01 | 440.18 |
| 0.005 | 2 | 0.15 | 0.01 | 415.60 |
| 0.05 | 0.5 | 0.05 | 0.09 | 51.59 |
| 0.05 | 0.5 | 0.15 | 0.10 | 43.82 |
| 0.05 | 1 | 0.05 | 0.08 | 45.67 |
| 0.05 | 1 | 0.15 | 0.10 | 41.31 |
| 0.05 | 2 | 0.05 | 0.08 | 42.03 |
| 0.05 | 2 | 0.15 | 0.10 | 39.68 |
The column headings for this table are as follows: K = prevalence; R* = ratio of controls to cases; p = SNP minor allele frequency in the affected population; C_θ = cost coefficient corresponding to misclassification parameter θ, a lower bound on the percent increase in sample size necessary to maintain constant asymptotic power for every 1% increase in θ; C_φ = cost coefficient corresponding to misclassification parameter φ, a lower bound on the percent increase in sample size necessary to maintain constant asymptotic power for every 1% increase in φ. The cost coefficients are computed using equation (1).
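To give a feel for why the two cost coefficients differ by orders of magnitude, the sketch below approximates each one by a finite difference at zero misclassification. It is not equation (1): it assumes a simple mixture model in which θ is Pr(affected classified as control) and φ is Pr(unaffected classified as case), it uses hypothetical genotype frequencies, and the function names are illustrative. Because the non-centrality parameter is proportional to the number of cases at a fixed control-to-case ratio, the sample size needed for constant asymptotic power scales with the inverse of the per-case non-centrality.

```python
def lambda_per_case(p_aff, p_unaff, K, theta, phi, ratio=1.0):
    """Non-centrality parameter per case, with `ratio` controls per case."""
    w_case = K * (1 - theta) + (1 - K) * phi      # Pr(classified as case)
    w_ctrl = K * theta + (1 - K) * (1 - phi)      # Pr(classified as control)
    total = 0.0
    for a, u in zip(p_aff, p_unaff):
        c = (K * (1 - theta) * a + (1 - K) * phi * u) / w_case   # observed cases
        d = (K * theta * a + (1 - K) * (1 - phi) * u) / w_ctrl   # observed controls
        if c + ratio * d > 0:
            total += ratio * (c - d) ** 2 / (c + ratio * d)
    return total


def cost_coefficient(param, p_aff, p_unaff, K, ratio=1.0, h=1e-5):
    """Approximate % increase in sample size per 1% increase in `param`,
    evaluated at zero misclassification by a forward finite difference."""
    base = lambda_per_case(p_aff, p_unaff, K, 0.0, 0.0, ratio)
    theta, phi = (h, 0.0) if param == "theta" else (0.0, h)
    perturbed = lambda_per_case(p_aff, p_unaff, K, theta, phi, ratio)
    return (base / perturbed - 1.0) / h


# Hypothetical genotype frequencies for a di-allelic locus (not from the paper).
p_aff = [0.64, 0.32, 0.04]
p_unaff = [0.81, 0.18, 0.01]

for K in (0.005, 0.05):
    c_theta = cost_coefficient("theta", p_aff, p_unaff, K)
    c_phi = cost_coefficient("phi", p_aff, p_unaff, K)
    print(f"K={K}: C_theta ~ {c_theta:.3f}, C_phi ~ {c_phi:.1f}")
```

Under these assumptions the cost of labelling unaffected individuals as cases grows roughly like (1 - K)/K, which is why it dominates when the disease is rare.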
Figure 1. Contour plot of minimum number of cases needed to maintain constant asymptotic power of 95% at a 5% significance level in the presence of phenotype misclassification for the Alzheimer's disease ApoE example. We compute the increase in the minimum number of cases needed to maintain constant 95% asymptotic power at the 5% significance level (using a central χ² distribution with 5 degrees of freedom) in the presence of errors. Sample sizes are computed using equation (3). The affected and unaffected genotype frequencies are taken from a previous publication [9, 14]. In that work, the marker locus considered was ApoE and the disease phenotype was Alzheimer's disease. We use the LRTae estimates from Table 5 of that work [9]. Six genotypes are observed in most populations. The frequencies we use to perform the sample size calculations in Figure 1 are presented in the Methods section (Minimum sample size requirements in presence of phenotype misclassification – Alzheimer's Disease ApoE example). We assume that equal numbers of cases and controls are collected. We also specify a prevalence K = 0.02, which is consistent with recent published reports for Alzheimer's disease in the U.S. [32]. Sample sizes are calculated for each misclassification parameter θ, φ ranging from 0.0 to 0.15 in increments of 0.01. The number of cases ranges from 484 when θ = φ = 0 to 10,187 when θ = φ = 0.15. In this figure, each (approximately) horizontal line represents a constant sample size as a function of the misclassification parameters θ and φ. For two consecutive horizontal lines, the values in between those lines (represented by different colors) have sample sizes that are between the sample sizes indicated by the two horizontal lines.
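A grid like the one behind this contour plot can be sketched as follows. This is not equation (3): the six genotype frequencies are hypothetical placeholders rather than the ApoE estimates, the mixture model for misclassification (θ as Pr(affected classified as control), φ as Pr(unaffected classified as case)) is an assumption, and only the baseline of 484 cases and the prevalence K = 0.02 come from the caption. The sketch relies on the fact that, for equal numbers of cases and controls, the non-centrality parameter is proportional to the number of cases, so the required number scales with the inverse of the per-case non-centrality.

```python
import math

BASELINE_CASES = 484   # cases needed when theta = phi = 0 (from the caption)
K = 0.02               # prevalence (from the caption)

# Hypothetical six-genotype frequencies; NOT the ApoE estimates used in the paper.
P_AFF = [0.01, 0.04, 0.05, 0.35, 0.40, 0.15]
P_UNAFF = [0.01, 0.12, 0.03, 0.60, 0.21, 0.03]


def lambda_per_case(theta, phi):
    """Non-centrality parameter per case with equal numbers of cases and controls."""
    w_case = K * (1 - theta) + (1 - K) * phi
    w_ctrl = K * theta + (1 - K) * (1 - phi)
    total = 0.0
    for a, u in zip(P_AFF, P_UNAFF):
        c = (K * (1 - theta) * a + (1 - K) * phi * u) / w_case   # observed cases
        d = (K * theta * a + (1 - K) * (1 - phi) * u) / w_ctrl   # observed controls
        if c + d > 0:
            total += (c - d) ** 2 / (c + d)
    return total


def cases_needed(theta, phi):
    """Scale the error-free sample size by the loss in non-centrality."""
    return math.ceil(BASELINE_CASES * lambda_per_case(0.0, 0.0)
                     / lambda_per_case(theta, phi))


# Keys are (theta, phi) in hundredths: 0.00 to 0.15 in steps of 0.01.
grid = {(i, j): cases_needed(i / 100, j / 100)
        for i in range(16) for j in range(16)}

for i, j in [(0, 0), (15, 0), (0, 15), (15, 15)]:
    print(f"theta={i / 100:.2f} phi={j / 100:.2f} cases={grid[(i, j)]}")
```

Because the frequencies are placeholders, the printed sample sizes will not match the figure; the point is the shape of the calculation and the much stronger dependence on φ than on θ at low prevalence.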
Figure 2. Power to detect association for two different settings of prevalence when only one phenotype misclassification parameter is non-zero. In this figure, the horizontal axis refers to the misclassification probability for one parameter when the second parameter is 0. For example, the graphs labeled "φ = 0" provide power calculations at two settings of disease prevalence (K = 0.05, K = 0.01) as a function of θ values ranging from 0.0 to 0.15 on the horizontal axis. Similarly, the graphs labeled "θ = 0" provide power calculations at two settings of disease prevalence (K = 0.05, K = 0.01) as a function of φ ranging from 0.0 to 0.15 on the horizontal axis.