James Y Dai, Michael Leblanc, Nicholas L Smith, Bruce Psaty, Charles Kooperberg.
Abstract
Association studies have been widely used to identify genetic liability variants for complex diseases. While scanning a chromosomal region 1 single nucleotide polymorphism (SNP) at a time may not fully explore linkage disequilibrium, haplotype analyses tend to require a fairly large number of parameters and can therefore lose power. Clustering algorithms, such as the cladistic approach, have been proposed to reduce the dimensionality, yet they have important limitations. We propose a SNP-Haplotype Adaptive REgression (SHARE) algorithm that seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes 1 SNP at a time in a stepwise fashion, and comparing the prediction errors of different models via cross-validation. Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is effectively accounted for by treating haplotype reconstruction as a part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.
Year: 2009 PMID: 19605740 PMCID: PMC2742496 DOI: 10.1093/biostatistics/kxp023
Source DB: PubMed Journal: Biostatistics ISSN: 1465-4644 Impact factor: 5.899
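The stepwise search described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes phase is known, treats each chromosome as an independent observation, estimates risk as a smoothed case proportion within each haplotype class instead of fitting a regression, and uses cross-validated deviance at every step. The function names `cv_deviance` and `grow_and_prune` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def cv_deviance(haps, y, snps, folds=5):
    """Cross-validated binomial deviance when risk is estimated per
    haplotype class, a class being the allele pattern at `snps`."""
    n = len(y)
    dev = 0.0
    for k in range(folds):
        counts = defaultdict(lambda: [0, 0])  # class -> [controls, cases]
        for i in range(n):
            if i % folds != k:  # training part
                counts[tuple(haps[i][s] for s in snps)][y[i]] += 1
        for i in range(n):
            if i % folds == k:  # held-out part
                c0, c1 = counts[tuple(haps[i][s] for s in snps)]
                p = (c1 + 0.5) / (c0 + c1 + 1.0)  # smoothing avoids log(0)
                dev += -2.0 * math.log(p if y[i] == 1 else 1.0 - p)
    return dev

def grow_and_prune(haps, y, n_snps):
    """Grow the SNP set one SNP at a time, then prune one SNP at a time,
    and return the lowest-deviance set seen along the path."""
    score = lambda c: cv_deviance(haps, y, c)
    path, current = [], []
    while len(current) < n_snps:  # growing
        current = min(([*current, s] for s in range(n_snps) if s not in current),
                      key=score)
        path.append((sorted(current), score(current)))
    while len(current) > 1:  # pruning
        current = min(([t for t in current if t != s] for s in current),
                      key=score)
        path.append((sorted(current), score(current)))
    return min(path, key=lambda step: step[1]), path

# Toy data (hypothetical): 3 SNPs, risk carried only by allele 1 at SNPs 0 and 2.
haps = [(i % 2, (i // 2) % 2, (i // 4) % 2) for i in range(400)]
y = [1 if h[0] == 1 and h[2] == 1 else 0 for h in haps]
(best_set, best_dev), path = grow_and_prune(haps, y, n_snps=3)
print(best_set, round(best_dev, 2))
```

On this toy data the selected set contains SNPs 0 and 2 only, because adding the uninformative SNP 1 splits the classes without improving prediction.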
Fig. 1. An example showing that there generally exists an optimal set of SNPs for association analysis. The order of SNPs in a haplotype is ABCDE(X). (a) The disease-causal locus X occurs before A in the lineage. The optimal set for genetic association is just A. (b) The disease-causal locus X occurs after A and in parallel to D, E. The optimal set for genetic association is A, D, E. (c) The disease-causal locus X occurs after E in the lineage. There is recombination between haplotypes 6 and 3, generating a recombinant 8. The vertical arrow on top of haplotype 8 points to the recombination break point. The optimal set for genetic association contains A, E or B, E.
Fig. 2. Tree illustration of the sequential partition of haplotypes in Figure 1(b). The left panel shows the growing set of SNPs used in the analysis and the right panel shows the partitions of the haplotypes induced by the current set of SNPs. The minimal set of SNPs that captures the genetic association is (A, D, E), with the disease risk concentrated on the haplotype 100. The path leading to its discovery could be 1 → 10 → 100. The corresponding order of SNPs in the haplotypes is A → AD → ADE.
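The partitioning idea in this caption can be shown with a short sketch: restricting full haplotypes to a growing SNP set (A, then AD, then ADE) splits them into progressively finer groups. The haplotypes below are hypothetical and are not the ones drawn in Figure 1; `partition` is an illustrative helper name.

```python
from collections import defaultdict

LOCI = "ABCDE"
# Hypothetical 5-SNP haplotypes over loci A..E (not taken from the figure).
HAPLOTYPES = ["00000", "10000", "11000", "10100", "10010", "10011", "11011"]

def partition(haplotypes, snps):
    """Group full haplotypes by their allele pattern at the chosen SNPs."""
    groups = defaultdict(list)
    for h in haplotypes:
        key = "".join(h[LOCI.index(s)] for s in snps)
        groups[key].append(h)
    return dict(groups)

# Each added SNP can only refine the previous partition.
for snps in ("A", "AD", "ADE"):
    print(snps, partition(HAPLOTYPES, snps))
```

With the set ADE, the group keyed `100` collects every full haplotype that carries allele 1 at A and allele 0 at D and E, regardless of its alleles at B and C.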
The first model in the simulations based on empirical data: 2 tagSNPs display strong LD with the unscored causal locus. TagSNPs 3 and 5 are genotyped and SNP 12 is the unscored functional locus. Haplotype 11 at SNPs 3 and 5 perfectly tags SNP 12. The haplotype frequencies are estimated from 23 Americans of European descent in the PGA database.
| Haplotype/SNP | 3 | 5 | 12 | Haplotype frequency |
| 1 | 0 | 0 | 0 | 0.847 |
| 2 | 0 | 1 | 0 | 0.087 |
| 3 | 1 | 0 | 0 | 0.022 |
| 4 | 1 | 1 | 1 | 0.044 |
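The tagging claim can be checked directly from the table. The sketch below computes the squared correlation (r²) between the causal allele at SNP 12 and three candidate markers: SNP 3 alone, SNP 5 alone, and the indicator of haplotype 11 at SNPs 3 and 5. The use of r² as the LD measure is an assumption made for illustration, and `r_squared` is a hypothetical helper.

```python
# Haplotype table from the text: alleles at SNPs 3, 5, 12 and frequency.
HAPS = [
    ((0, 0, 0), 0.847),
    ((0, 1, 0), 0.087),
    ((1, 0, 0), 0.022),
    ((1, 1, 1), 0.044),
]

def r_squared(marker, target):
    """Squared correlation between two 0/1 functions of a haplotype."""
    pm = sum(f for h, f in HAPS if marker(h))
    pt = sum(f for h, f in HAPS if target(h))
    pmt = sum(f for h, f in HAPS if marker(h) and target(h))
    d = pmt - pm * pt  # covariance of the two indicators
    return d * d / (pm * (1 - pm) * pt * (1 - pt))

causal = lambda h: h[2] == 1
r2_snp3 = r_squared(lambda h: h[0] == 1, causal)
r2_snp5 = r_squared(lambda h: h[1] == 1, causal)
r2_hap11 = r_squared(lambda h: h[0] == 1 and h[1] == 1, causal)
print(round(r2_snp3, 3), round(r2_snp5, 3), round(r2_hap11, 3))
```

Under these frequencies each tagSNP alone is only partially correlated with SNP 12 (r² of roughly 0.65 and 0.31), while the two-SNP haplotype 11 matches it exactly.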
Simulations based on empirical data: a comparison of type I errors and power for various methods under 2 disease models in 500 simulations. Standard errors are given in parentheses. For the first model, we generate data for 800 cases and 800 controls with ORs 1.5, 1.75, and 2; for the second model, we generate data for 400 cases and 400 controls with ORs 1.25, 1.5, and 1.75.
| Model | Phase | Method | Type I error | Power (OR = 1.5) | Power (OR = 1.75) | Power (OR = 2) |
| Model 1 | | Single-locus scan | 0.048 (0.010) | 0.318 (0.015) | 0.648 (0.015) | 0.914 (0.013) |
| Model 1 | Phase known | Full haplotype | 0.052 (0.010) | 0.284 (0.020) | 0.590 (0.022) | 0.870 (0.015) |
| Model 1 | Phase known | CLADHC | 0.056 (0.010) | 0.256 (0.020) | 0.546 (0.022) | 0.854 (0.016) |
| Model 1 | Phase known | SHARE | 0.050 (0.010) | 0.336 (0.021) | 0.654 (0.021) | 0.928 (0.012) |
| Model 1 | Phase unknown | Haplotype score | 0.035 (0.008) | 0.288 (0.020) | 0.544 (0.022) | 0.863 (0.015) |
| Model 1 | Phase unknown | SHARE | 0.054 (0.010) | 0.326 (0.021) | 0.650 (0.021) | 0.900 (0.013) |

| Model | Phase | Method | Type I error | Power (OR = 1.25) | Power (OR = 1.5) | Power (OR = 1.75) |
| Model 2 | | Single-locus scan | 0.046 (0.007) | 0.176 (0.017) | 0.608 (0.015) | 0.916 (0.012) |
| Model 2 | Phase known | Full haplotype | 0.046 (0.009) | 0.184 (0.017) | 0.616 (0.022) | 0.920 (0.012) |
| Model 2 | Phase known | CLADHC | 0.062 (0.011) | 0.138 (0.015) | 0.548 (0.022) | 0.882 (0.014) |
| Model 2 | Phase known | SHARE | 0.046 (0.009) | 0.182 (0.017) | 0.678 (0.021) | 0.952 (0.010) |
| Model 2 | Phase unknown | Haplotype score | 0.050 (0.010) | 0.158 (0.016) | 0.586 (0.022) | 0.900 (0.013) |
| Model 2 | Phase unknown | SHARE | 0.044 (0.009) | 0.190 (0.018) | 0.666 (0.021) | 0.942 (0.010) |
Model 1: the unscored disease-causing locus is best captured by haplotypes based on 2 tagSNPs.
Model 2: two widely separated tagSNPs carry disease risk additively.
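The source does not say how the standard errors in parentheses were obtained. One plausible reading, assumed here, is the binomial standard error of a proportion estimated from 500 independent simulations; `binomial_se` is a hypothetical helper name.

```python
import math

def binomial_se(p_hat, n_sim=500):
    """Standard error of a rejection rate estimated from n_sim simulations,
    assuming the simulations are independent Bernoulli trials."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_sim)

# Two entries from the table: type I error 0.050 and power 0.336.
print(round(binomial_se(0.050), 3), round(binomial_se(0.336), 3))
```

This reproduces entries such as 0.050 (0.010) and 0.336 (0.021). A few entries, such as 0.318 (0.015), do not fit the formula, so it should be read as an assumption and not as the authors' stated method.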
Simulations based on the coalescent, generated with ms: a comparison of type I errors and power for various methods in 500 simulations. Standard errors are given in parentheses. High and low LD correspond to recombination rates per site per generation of 10^-9 and 10^-7, respectively. The sample consists of 1000 cases and 1000 controls.
| LD [no. of tagSNPs, no. of common haplotypes] | Phase | Method | Type I error | Power (OR = 1.5) | Power (OR = 1.75) | Power (OR = 2) |
| High [15, 16] | | Single-locus scan | 0.052 (0.010) | 0.254 (0.019) | 0.416 (0.022) | 0.592 (0.022) |
| High [15, 16] | Phase known | Full haplotype | 0.050 (0.010) | 0.242 (0.019) | 0.456 (0.022) | 0.648 (0.021) |
| High [15, 16] | Phase known | CLADHC | 0.042 (0.009) | 0.246 (0.019) | 0.464 (0.022) | 0.678 (0.021) |
| High [15, 16] | Phase known | SHARE | 0.050 (0.010) | 0.312 (0.021) | 0.548 (0.022) | 0.728 (0.020) |
| High [15, 16] | Phase unknown | Haplotype score | 0.034 (0.008) | 0.216 (0.018) | 0.462 (0.022) | 0.664 (0.021) |
| High [15, 16] | Phase unknown | SHARE | 0.050 (0.010) | 0.308 (0.021) | 0.528 (0.022) | 0.738 (0.020) |
| Low [15, 30] | | Single-locus scan | 0.038 (0.009) | 0.154 (0.016) | 0.316 (0.021) | 0.420 (0.022) |
| Low [15, 30] | Phase known | Full haplotype | 0.040 (0.009) | 0.116 (0.014) | 0.232 (0.019) | 0.370 (0.022) |
| Low [15, 30] | Phase known | CLADHC | 0.044 (0.009) | 0.146 (0.016) | 0.328 (0.021) | 0.510 (0.022) |
| Low [15, 30] | Phase known | SHARE | 0.042 (0.009) | 0.174 (0.017) | 0.382 (0.022) | 0.516 (0.022) |
| Low [15, 30] | Phase unknown | Haplotype score | 0.032 (0.009) | 0.084 (0.012) | 0.148 (0.016) | 0.322 (0.021) |
| Low [15, 30] | Phase unknown | SHARE | 0.048 (0.010) | 0.160 (0.016) | 0.354 (0.021) | 0.498 (0.022) |
The median number of tagSNPs in 500 simulated data sets.
The median number of common haplotypes in 500 simulations. The common haplotypes are defined as those with frequencies larger than 1%.
Fig. 3. Prediction deviances of different models for the TFPI gene. The horizontal axis is the number of SNPs included in the sequence of best subsets during model growing and pruning. The horizontal dashed line at the top represents the deviance of a null model with no genetic effect. The vertical dashed line indicates the switch from model growing to pruning. Each deviance is calculated from a model with haplotypes constructed from the SNPs in the set. A lower deviance indicates better model prediction.
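The caption does not give a formula for the prediction deviance. For a binary case-control outcome, a common definition is the binomial deviance below; its use here, along with the symbols $y_i$, $\hat{p}_i$, and $n$, is an assumption and not the authors' stated definition.

```latex
D = -2 \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \right]
```

Here $y_i$ is 1 for a case and 0 for a control, and $\hat{p}_i$ is the predicted probability of being a case for subject $i$, evaluated on data held out from model fitting. Under this reading, the null model assigns the same $\hat{p}_i$ to every subject, which gives the horizontal reference line.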
The results of a SHARE analysis on 5 SNPs in the TFPI gene. SNPs 1–5 are rs2192824, rs2300412, rs8176597, rs8176612, and rs3771059, respectively. The left table shows 10 haplotypes using all 5 SNPs in the analysis. The right table shows that using SHARE, the association was narrowed down to haplotypes based on rs2192824 and rs2300412
| Haplotype | 1 | 2 | 3 | 4 | 5 | Frequency |
| 1 | 0 | 0 | 0 | 0 | 0 | 0.2254 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0.0018 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0.0002 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0.0593 |
| 5 | 0 | 1 | 0 | 0 | 1 | 0.2644 |
| 6 | 0 | 1 | 0 | 1 | 0 | 0.0003 |
| 7 | 0 | 1 | 1 | 0 | 0 | 0.0515 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0.2989 |
| 9 | 1 | 0 | 0 | 0 | 1 | 0.0376 |
| 10 | 1 | 0 | 0 | 1 | 0 | 0.0605 |
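To see what restricting the analysis to rs2192824 and rs2300412 (SNPs 1 and 2) does to the haplotype space, the frequencies in the table above can be summed over SNPs 3 to 5. This is plain marginalization of the listed values and is not a reproduction of the article's right-hand table, whose estimates may differ; `collapse` is a hypothetical helper name.

```python
from collections import defaultdict

# Table from the text: alleles at SNPs 1-5 and estimated frequency.
TFPI_HAPS = {
    "00000": 0.2254, "00001": 0.0018, "00010": 0.0002, "01000": 0.0593,
    "01001": 0.2644, "01010": 0.0003, "01100": 0.0515, "10000": 0.2989,
    "10001": 0.0376, "10010": 0.0605,
}

def collapse(haps, positions):
    """Sum haplotype frequencies over all SNPs not listed in `positions`
    (0-based indices into the haplotype string)."""
    out = defaultdict(float)
    for h, f in haps.items():
        out["".join(h[p] for p in positions)] += f
    return dict(out)

two_snp = collapse(TFPI_HAPS, (0, 1))  # keep SNPs 1 and 2 only
for pattern, freq in sorted(two_snp.items()):
    print(pattern, round(freq, 4))
```

The 10 five-SNP haplotypes collapse to 3 two-SNP haplotypes (00, 01, and 10); no listed haplotype carries allele 1 at both SNPs.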