Yuehua Cui, Wenjiang Fu, Kelian Sun, Roberto Romero, Rongling Wu.
Abstract
Detecting the patterns of DNA sequence variants across the human genome is a crucial step for unraveling the genetic basis of complex human diseases. The human HapMap constructed by single nucleotide polymorphisms (SNPs) provides efficient sequence variation information that can speed up the discovery of genes related to common diseases. In this article, we present a generalized linear model for identifying specific nucleotide variants that encode complex human diseases. A novel approach is derived to group haplotypes to form composite diplotypes, which substantially reduces the model's degrees of freedom for an association test and hence increases the power when multiple SNP markers are involved. An efficient two-stage estimation procedure based on the expectation-maximization (EM) algorithm is derived to estimate parameters. Non-genetic environmental or clinical risk factors can also be fitted into the model. Computer simulations show that our model has reasonable power and type I error rate with appropriate sample size. The simulations also suggest that a balanced design with approximately equal numbers of cases and controls should be preferred to maintain small estimation bias and reasonable testing power. To illustrate its utility, we apply the method to a genetic association study of large for gestational age (LGA) neonates. The model provides a powerful tool for elucidating the genetic basis of complex binary diseases.
Keywords: EM algorithm; nucleotide sequence; complex disease; haplotype; logistic regression
Year: 2007 PMID: 19384427 PMCID: PMC2652402 DOI: 10.2174/138920207782446188
Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236
Table 1. Possible Diplotype and Composite Diplotype Configurations of Nine Genotypes at Two SNPs and Their Haplotype Composition Frequencies
| Genotype | Diplotype Configuration | Diplotype Frequency | Diplotype Relative Frequency | Composite Diplotype Symbol | Composite Diplotype Function | No. of Observations |
|---|---|---|---|---|---|---|
| 11/11 | [11][11] | | 1 | [11][11] | | |
| 11/12 | [11][12] | | 1 | [11] | | |
| 11/22 | [12][12] | | 1 | | | |
| 12/11 | [11][21] | | 1 | [11] | | |
| 12/12 | | | | | | |
| 12/22 | [12][22] | | 1 | | | |
| 22/11 | [21][21] | | 1 | | | |
| 22/12 | [21][22] | | 1 | | | |
| 22/22 | [22][22] | | 1 | | | |
Here p11, p12, p21 and p22 are the frequencies of haplotypes [11], [12], [21] and [22], respectively. The relative frequency is the probability that a specific diplotype is observed given the genotype. For an unambiguous genotype (phase known), the relative frequency is 1. For the double heterozygous genotype 12/12, the probability of observing diplotype [11][22] is φ, and the probability of observing diplotype [12][21] is 1 - φ.
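The phase ambiguity described in this footnote is what an EM algorithm resolves when estimating haplotype frequencies. The following is a minimal sketch, not the authors' two-stage procedure: it assumes Hardy-Weinberg equilibrium, so that φ = p11·p22 / (p11·p22 + p12·p21), and estimates only the four haplotype frequencies from genotype counts. The function name `em_two_snp` and the dictionary layout are illustrative choices, not taken from the source.

```python
def em_two_snp(counts, max_iter=1000, tol=1e-12):
    """Estimate two-SNP haplotype frequencies and phi by EM.

    counts maps genotype labels such as "11/12" (SNP 1 genotype / SNP 2
    genotype, as in the table) to the number of subjects observed.
    Assumes Hardy-Weinberg equilibrium (an assumption of this sketch).
    """
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no observations")

    def g(key):
        return counts.get(key, 0)

    # Haplotype copies contributed by the eight phase-known genotypes.
    known = {
        "11": 2 * g("11/11") + g("11/12") + g("12/11"),
        "12": 2 * g("11/22") + g("11/12") + g("12/22"),
        "21": 2 * g("22/11") + g("12/11") + g("22/12"),
        "22": 2 * g("22/22") + g("12/22") + g("22/12"),
    }
    n_dh = g("12/12")  # double heterozygotes: phase unknown

    p = dict.fromkeys(known, 0.25)  # start from equal frequencies
    phi = 0.5
    for _ in range(max_iter):
        # E-step: probability that a 12/12 subject carries [11][22]
        # rather than [12][21].
        cis = p["11"] * p["22"]
        trans = p["12"] * p["21"]
        phi = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: split the double heterozygotes according to phi.
        new = {
            "11": (known["11"] + phi * n_dh) / (2 * n),
            "22": (known["22"] + phi * n_dh) / (2 * n),
            "12": (known["12"] + (1 - phi) * n_dh) / (2 * n),
            "21": (known["21"] + (1 - phi) * n_dh) / (2 * n),
        }
        change = max(abs(new[h] - p[h]) for h in p)
        p = new
        if change < tol:
            break
    return p, phi
```

When no 12/12 subjects are present, the loop reduces to direct haplotype counting, which is a quick way to sanity-check the bookkeeping.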
Table 2. Type I Error Rates Estimated from 1000 Simulation Replicates Under the 2- and 3-SNP Models at the Nominal Level 0.05
| n | 2-SNP Model | 3-SNP Model |
|---|---|---|
| 100 | 0.073 | 0.06 |
| 200 | 0.056 | 0.055 |
| 300 | 0.058 | 0.054 |
| 400 | 0.045 | 0.049 |
| 500 | 0.047 | 0.048 |
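To judge how close these empirical rates are to the nominal level, it helps to account for Monte Carlo error: with 1000 replicates, an estimated rejection rate is itself a binomial proportion. The sketch below uses a normal approximation (an assumption of this example, not a method stated in the source) to compute the range of rates consistent with a true level of 0.05, and flags table entries that fall outside it.

```python
import math

NOMINAL = 0.05
N_REP = 1000

# Empirical type I error rates copied from the table above.
observed = {
    100: {"2-SNP": 0.073, "3-SNP": 0.06},
    200: {"2-SNP": 0.056, "3-SNP": 0.055},
    300: {"2-SNP": 0.058, "3-SNP": 0.054},
    400: {"2-SNP": 0.045, "3-SNP": 0.049},
    500: {"2-SNP": 0.047, "3-SNP": 0.048},
}

def mc_band(alpha=NOMINAL, n_rep=N_REP, z=1.96):
    """Approximate 95% band for an estimated rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1 - alpha) / n_rep)
    return alpha - z * se, alpha + z * se

low, high = mc_band()
outside = [
    (n, model)
    for n, row in observed.items()
    for model, rate in row.items()
    if not low <= rate <= high
]
print(f"band: ({low:.4f}, {high:.4f}); outside: {outside}")
```

Under this approximation only the 2-SNP model at n = 100 lies outside the band, which agrees with the abstract's remark that the error rate is reasonable given an appropriate sample size.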
Table 3. The Mean MLEs with Their Root Mean Square Errors (RMSEs, in Parentheses) of Population and Quantitative Parameters of the BTNs Estimated from 1000 Simulation Replicates Under the 2-SNP Model
| n | α0 | α1 | α2 | γ | Allele frequency, SNP 1 | Allele frequency, SNP 2 | D | Power (%) |
|---|---|---|---|---|---|---|---|---|
| 100 | 0.582 (0.695) | 1.082 (0.686) | -0.067 (0.818) | 1.618 (0.403) | 0.698 (0.033) | 0.701 (0.032) | 0.021 (0.021) | 64 |
| 200 | 0.505 (0.278) | 1.051 (0.314) | 0.015 (0.407) | 1.565 (0.269) | 0.701 (0.024) | 0.699 (0.024) | 0.02 (0.016) | 93.3 |
| 500 | 0.510 (0.176) | 1.007 (0.188) | -0.006 (0.252) | 1.521 (0.160) | 0.7 (0.015) | 0.7 (0.014) | 0.02 (0.009) | 100 |
| 100 | 0.586 (0.718) | 1.089 (0.710) | 1.008 (0.887) | 1.643 (0.468) | 0.698 (0.033) | 0.701 (0.033) | 0.021 (0.021) | 74.8 |
| 200 | 0.523 (0.279) | 1.038 (0.308) | 1.013 (0.423) | 1.557 (0.284) | 0.701 (0.023) | 0.7 (0.023) | 0.02 (0.015) | 96.4 |
| 500 | 0.506 (0.169) | 1.009 (0.191) | 1.002 (0.265) | 1.518 (0.166) | 0.7 (0.015) | 0.7 (0.014) | 0.02 (0.009) | 100 |
| 100 | 0.581 (0.789) | 1.098 (0.798) | -1.119 (0.919) | 1.622 (0.418) | 0.699 (0.032) | 0.699 (0.034) | 0.019 (0.022) | 88 |
| 200 | 0.519 (0.287) | 1.040 (0.316) | -1.036 (0.421) | 1.572 (0.269) | 0.7 (0.023) | 0.699 (0.023) | 0.02 (0.015) | 97.6 |
| 500 | 0.504 (0.186) | 1.016 (0.182) | -1.013 (0.261) | 1.530 (0.164) | 0.7 (0.015) | 0.699 (0.015) | 0.02 (0.009) | 100 |
The two allele-frequency columns and D give the allele frequencies at the two SNPs and their linkage disequilibrium, respectively. α0 is the intercept, and α1 and α2 are the additive and dominant effects, respectively, assuming that one haplotype (the risk haplotype) differs from the remaining haplotypes. γ is the covariate effect. The first row contains the given values of all parameters. Power is calculated as the percentage of simulations in which the true disease-gene association is detected.
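The parameters described in this footnote can be read as coefficients of a logistic regression on the composite diplotype. The sketch below is one common way to code such a model and is an assumption of this example: the additive term is taken as the number of risk-haplotype copies minus one (1, 0, -1), and the dominant term as an indicator for carrying exactly one copy. The source does not state its coding here, and the function name `disease_probability` is hypothetical.

```python
import math

def disease_probability(n_risk, covariate, alpha0, alpha1, alpha2, gamma):
    """P(affected) under a logistic model with additive and dominant effects.

    n_risk: copies of the risk haplotype in the composite diplotype (0, 1, 2).
    The coding of the additive and dominant terms is assumed, not sourced.
    """
    if n_risk not in (0, 1, 2):
        raise ValueError("n_risk must be 0, 1 or 2")
    additive = n_risk - 1                   # 1, 0, -1 for 2, 1, 0 copies
    dominant = 1 if n_risk == 1 else 0      # heterozygous composite diplotype
    eta = alpha0 + alpha1 * additive + alpha2 * dominant + gamma * covariate
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative parameter values only (alpha2 = 0 means no dominance).
for copies in (2, 1, 0):
    prob = disease_probability(copies, covariate=0.0,
                               alpha0=0.5, alpha1=1.0, alpha2=0.0, gamma=1.5)
    print(copies, round(prob, 3))
```

With α2 = 0 the log-odds change by α1 for each additional copy of the risk haplotype; a non-zero α2 moves the heterozygote away from the midpoint of the two homozygotes.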
Table 4. The Mean MLEs with Their Root Mean Square Errors (RMSEs, in Parentheses) of Population and Genetic Parameters of the BTNs Estimated from 1000 Simulation Replicates Under the 3-SNP Model
| n | α0 | α1 | α2 | γ | Allele frequency, SNP 1 | Allele frequency, SNP 2 | Allele frequency, SNP 3 | D12 | D23 | D13 | D123 | Power (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True* | 0.5 | 1.0 | | 1.5 | 0.7 | 0.7 | 0.7 | 0.04 | 0.025 | 0.025 | 0.02 | |
| 100 | 0.622 (0.998) | 1.156 (0.992) | -0.117 (1.076) | 1.614 (0.405) | 0.698 (0.033) | 0.699 (0.033) | 0.699 (0.033) | 0.04 (0.020) | 0.025 (0.021) | 0.024 (0.021) | 0.02 (0.013) | 68.8 |
| 200 | 0.542 (0.307) | 1.038 (0.311) | -0.028 (0.413) | 1.547 (0.245) | 0.699 (0.023) | 0.700 (0.023) | 0.699 (0.024) | 0.04 (0.014) | 0.026 (0.015) | 0.024 (0.015) | 0.02 (0.009) | 93.1 |
| 500 | 0.504 (0.179) | 1.013 (0.186) | -0.004 (0.256) | 1.522 (0.159) | 0.700 (0.015) | 0.700 (0.015) | 0.699 (0.016) | 0.04 (0.009) | 0.025 (0.009) | 0.024 (0.009) | 0.02 (0.006) | 100 |
| 100 | 0.677 (1.276) | 1.201 (1.302) | 0.975 (1.566) | 1.654 (0.480) | 0.700 (0.033) | 0.699 (0.033) | 0.699 (0.033) | 0.04 (0.020) | 0.025 (0.021) | 0.024 (0.020) | 0.02 (0.013) | 83.2 |
| 200 | 0.544 (0.305) | 1.036 (0.318) | 0.997 (0.459) | 1.547 (0.274) | 0.699 (0.023) | 0.700 (0.023) | 0.699 (0.024) | 0.04 (0.014) | 0.026 (0.015) | 0.024 (0.015) | 0.02 (0.009) | 99.2 |
| 500 | 0.506 (0.178) | 1.012 (0.190) | 1.013 (0.281) | 1.521 (0.169) | 0.700 (0.015) | 0.700 (0.015) | 0.699 (0.015) | 0.04 (0.009) | 0.025 (0.009) | 0.024 (0.009) | 0.02 (0.006) | 100 |
| 100 | 0.637 (1.128) | 1.18 (1.15) | -1.151 (1.217) | 1.631 (0.429) | 0.702 (0.033) | 0.701 (0.033) | 0.702 (0.032) | 0.04 (0.021) | 0.025 (0.021) | 0.024 (0.021) | 0.02 (0.013) | 89.3 |
| 200 | 0.52 (0.299) | 1.032 (0.296) | -1.025 (0.438) | 1.552 (0.262) | 0.699 (0.024) | 0.7 (0.023) | 0.7 (0.023) | 0.04 (0.014) | 0.026 (0.015) | 0.024 (0.015) | 0.02 (0.009) | 97.8 |
| 500 | 0.502 (0.183) | 1.01 (0.193) | -1.01 (0.262) | 1.521 (0.153) | 0.700 (0.015) | 0.700 (0.015) | 0.699 (0.015) | 0.04 (0.009) | 0.025 (0.009) | 0.024 (0.009) | 0.02 (0.006) | 100 |
The three allele-frequency columns give the allele frequencies at the three SNPs, and D12, D23, D13 and D123 are their linkage disequilibria. The genetic effects are defined by assuming that one haplotype (the risk haplotype) differs from the remaining haplotypes. See Table 3 for explanations of other parameters.
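For reference, the two summary measures reported in the simulation tables are conventionally computed as follows. This is the standard definition and is stated here as an assumption, since the source does not spell out its formulas. In this notation, R = 1000 is the number of replicates, θ is the true parameter value, θ̂_r is its estimate in replicate r, and I_r equals 1 when the association test rejects in replicate r and 0 otherwise.

```latex
\mathrm{RMSE}(\hat{\theta})
  = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\bigl(\hat{\theta}_r - \theta\bigr)^{2}},
\qquad
\mathrm{Power}\,(\%)
  = \frac{100}{R}\sum_{r=1}^{R} I_r .
```

Because the squared RMSE is the sum of the squared bias and the variance of the estimator, the large RMSEs at n = 100 reflect both the upward shift of the mean estimates and their wide spread across replicates.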
Table 5. The Maximum Likelihood Estimates (MLEs) of the Population and Quantitative Parameters for Significant BTNs Associated with LGA Detected Within the APOC3 Gene. The Standard Errors of the Quantitative Parameters Are Given in Parentheses
| | | AIC | LR1 | P-value |
|---|---|---|---|---|
| Risk haplotype | [TC] | | | |
| | [TG] | 565.29 | 0.336 | 0.845 |
| | [CG] | - | - | - |
| | [CC] | 562.68 | 2.948 | 0.229 |
| Haplotype frequencies | 0.6196 | | | |
| | 0.1259 | | | |
| | 0 | | | |
| | 0.2545 | | | |
| Allele frequencies and LD | 0.7455 | | | |
| | 0.3804 | | | |
| | -0.1577 | | | |
| Intercept | -3.235 (0.851) | | | |
| Additive effect | -0.606 (0.547) | | | |
| Dominant effect | -0.092 (0.612) | | | |
| MA | 0.036 (0.016) | | | |
| MW | 0.043 (0.022) | | | |
| PTD | -0.328 (0.281) | | | |
| BS | -0.547 (0.215) | | | |
| MBMI | -0.076 (0.056) | | | |
LR1 is the likelihood ratio test statistic based on hypothesis (12). The risk haplotype detected on the basis of the AIC value and LR test is indicated in boldface. ** and * refer to significance at the 0.01 and 0.05 levels, respectively.
MA=maternal age; MW=maternal weight; PTD=number of preterm deliveries; BS=baby sex; MBMI=maternal body mass index.
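The numbers in the APOC3 table can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that the source does not state: that the value reported next to each LR1 statistic is a chi-square p-value with 2 degrees of freedom, and that the four haplotype frequencies are listed in the order [11], [12], [21], [22], with the second allele frequency referring to allele 2 at the second SNP. Both readings reproduce the tabulated values, but they remain inferences.

```python
import math

def lr_pvalue_2df(lr):
    """Upper-tail probability of a chi-square(2) variable: exp(-x / 2)."""
    return math.exp(-lr / 2.0)

def allele_freqs_and_ld(h11, h12, h21, h22):
    """Allele frequencies and LD from two-SNP haplotype frequencies.

    Returns (p, q2, d): p is allele 1 at SNP 1, q2 is allele 2 at SNP 2,
    and d is the disequilibrium between those two alleles (assumed reading).
    """
    p = h11 + h12
    q2 = h12 + h22
    d = h12 - p * q2
    return p, q2, d

# LR1 statistics copied from the table.
for lr in (0.336, 2.948):
    print(lr, round(lr_pvalue_2df(lr), 3))

# Haplotype frequencies copied from the table, in listed order.
p, q2, d = allele_freqs_and_ld(0.6196, 0.1259, 0.0, 0.2545)
print(round(p, 4), round(q2, 4), round(d, 4))
```

The closed form exp(-x/2) holds only for 2 degrees of freedom; other degrees of freedom would need a general chi-square survival function.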