Joe M Butler, D Timothy Bishop, Jennifer H Barrett.
Abstract
In genetic association studies, linkage disequilibrium (LD) within a region can be exploited to select a subset of single-nucleotide polymorphisms (SNPs) to genotype with minimal loss of information. A novel entropy-based method for selecting SNPs is proposed and compared to an existing method based on the coefficient of determination (R²) using simulated data from Genetic Analysis Workshop 14. The effect of the size of the sample used to investigate LD (by estimating haplotype frequencies), and hence to select the SNPs, is also investigated for both measures. The novel and established methods select SNP subsets that do not differ greatly. The entropy-based measure may thus have value because it is easier to compute than R². Increasing the sample size used to estimate haplotype frequencies improves the predictive power of the subset of SNPs selected. A smaller subset of SNPs chosen using a large initial sample to estimate LD can in some instances be more informative than a larger subset chosen based on poor estimates of LD (using a small initial sample). An initial sample size of 50 individuals is sufficient in most situations investigated, which involved selection from a set of 7 SNPs, although a larger initial sample may be required to select from a larger number of SNPs.
Year: 2005 PMID: 16451686 PMCID: PMC1866709 DOI: 10.1186/1471-2156-6-S1-S72
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
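The entropy-based rating mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the standardized entropy of a SNP subset is the Shannon entropy of the haplotype distribution restricted to that subset, divided by the entropy of the full haplotype distribution; the exact definition used in the paper may differ. The haplotype frequencies and function names below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import log2


def shannon_entropy(freqs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in freqs if p > 0)


def marginal_haplotype_freqs(hap_freqs, subset):
    """Collapse full haplotypes onto the SNP positions in `subset` (0-based)."""
    marginal = defaultdict(float)
    for hap, p in hap_freqs.items():
        key = "".join(hap[i] for i in subset)
        marginal[key] += p
    return dict(marginal)


def standardized_entropy(hap_freqs, subset):
    """Assumed definition: subset entropy divided by full-haplotype entropy."""
    full = shannon_entropy(hap_freqs.values())
    sub = shannon_entropy(marginal_haplotype_freqs(hap_freqs, subset).values())
    return sub / full


# Hypothetical haplotype frequencies over 3 biallelic SNPs.
# The second and third SNPs always carry the same allele (perfect LD).
hap_freqs = {"000": 0.5, "011": 0.25, "111": 0.25}

for subset in [(0,), (1,), (0, 1), (1, 2), (0, 1, 2)]:
    print(subset, round(standardized_entropy(hap_freqs, subset), 4))
```

In this toy example the second and third SNPs are in perfect LD, so adding the third SNP to a subset that already contains the second does not raise the rating, while the pair made of the first and second SNPs already captures all of the haplotype diversity.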
Figure 1. Matrices indicating the extent of LD at loci D2 and D4.
Optimal subsets of locus D2 as identified by the two measures.
| No. of SNPs | Standardized entropy: optimal subset | Standardized entropy: rating | Chapman's R²: optimal subset | Chapman's R²: rating |
|---|---|---|---|---|
| 1 | | 0.1932 | 6 | 0.189 |
| 2 | | 0.3848 | 4, 6 | 0.2553 |
| 3 | 1, 4 | 0.5718 | 1, 4, 6 | 0.3452 |
| 4 | 1, 2, 4 | 0.7195 | 1, 2, 4, 6 | 0.4335 |
| 5 | 1, 2, 3, 4 | 0.8353 | 1, 2, 3, 4, 6 | 0.5794 |
| 6 | 1, 2, 3, 4, 5, 7 | 0.9493 | 1, 2, 3, 4, 5, 7 | 0.7399 |
| 7 | 1, 2, 3, 4, 5, 6, 7 | 1 | 1, 2, 3, 4, 5, 6, 7 | 1 |
a. SNPs highlighted in red differ between the two uniquely optimal subsets of equal size.
Optimal subsets of locus D4 as identified by the two measures.
| No. of SNPs | Standardized entropy: optimal subset | Standardized entropy: rating | Chapman's R²: optimal subset | Chapman's R²: rating |
|---|---|---|---|---|
| 1 | 3 | 0.2311 | 3 | 0.1539 |
| 2 | 3, 5a | 0.4333 | 3, 6 | 0.3416 |
| 3 | 3, 5, 6 | 0.6021 | 3, 5, 6 | 0.4264 |
| 4 | 1, 2, 5, 6 | 0.7544 | 1, 3, 5, 6 | 0.5652 |
| 5 | 1, 3, 5, 6, 7 | 0.8941 | 1, 3, 5, 6, 7 | 0.7839 |
| 6 | 1, 2, 4, 5, 6, 7 | 0.9827 | 1, 2, 4, 5, 6, 7 | 0.9424 |
| 7 | 1, 2, 3, 4, 5, 6, 7 | 1 | 1, 2, 3, 4, 5, 6, 7 | 1 |
a. SNPs highlighted in red differ between the two uniquely optimal subsets of equal size.
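Tables of this kind, listing one optimal subset per subset size, could in principle be produced by exhaustive search, which is feasible for 7 SNPs (127 non-empty subsets). The sketch below is an illustration only, not the authors' implementation: it assumes the rating is the subset's haplotype entropy divided by the entropy of the full haplotype distribution, uses hypothetical haplotype frequencies, and breaks ties by taking the first maximal subset in lexicographic order.

```python
from itertools import combinations
from math import log2


def entropy(freqs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in freqs if p > 0)


def subset_entropy(hap_freqs, subset):
    """Entropy of the haplotype distribution restricted to `subset` (0-based)."""
    marginal = {}
    for hap, p in hap_freqs.items():
        key = tuple(hap[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + p
    return entropy(marginal.values())


def best_subsets(hap_freqs, n_snps):
    """For each size k, return (1-based SNP labels, rating) of a best subset.

    The rating is an assumed form of standardized entropy:
    subset entropy divided by the entropy of all n_snps SNPs.
    """
    full = subset_entropy(hap_freqs, range(n_snps))
    best = {}
    for k in range(1, n_snps + 1):
        candidates = combinations(range(n_snps), k)
        # max() keeps the first subset that attains the highest entropy.
        subset = max(candidates, key=lambda s: subset_entropy(hap_freqs, s))
        rating = subset_entropy(hap_freqs, subset) / full
        best[k] = ([i + 1 for i in subset], rating)
    return best


# Hypothetical haplotype frequencies over 4 biallelic SNPs:
# SNPs 1 and 2 are in perfect LD, as are SNPs 3 and 4.
hap_freqs = {"0000": 0.5, "0011": 0.25, "1100": 0.125, "1111": 0.125}

for k, (snps, rating) in best_subsets(hap_freqs, 4).items():
    print(k, snps, round(rating, 4))
```

With these toy frequencies, a two-SNP subset taking one SNP from each perfectly correlated pair already reaches a rating of 1, which mirrors the idea that LD allows a smaller subset to be genotyped with little loss of information.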
Figure 2. The effect of sample size on optimal subset selection at D2 and D4. (a, b) The optimal subsets are identified from a sample using R² and rated using the R² results obtained from 5,000 individuals. (c, d) The optimal subsets are identified from a sample using Sε and rated using the R² results obtained from 5,000 individuals. The R² graphs are shown as broken lines for comparison.