Pascal Croiseau, Claire Bardel, Emmanuelle Génin.
Abstract
Missing data are an important problem in association studies, particularly with high-density single-nucleotide polymorphism (SNP) maps, because the probability that at least one genotype is missing increases dramatically with the number of markers. One possible strategy is simply to ignore the missing data and use only the complete observations, at the cost of a substantial decrease in sample size. Using Genetic Analysis Workshop 15 simulated data from which we removed some genotypes to generate different levels of missing data, we show that this strategy can lead not only to an important loss of power to detect association but also to false conclusions regarding the most likely susceptibility site if another marker is in linkage disequilibrium with the disease susceptibility site. We propose a multiple imputation approach to deal with missing data in case-parent trios and evaluate its performance on the same simulated data. We found that our multiple imputation approach has high power to detect association with the susceptibility site even with a large amount of missing data, and can identify the susceptibility sites among a set of sites in linkage disequilibrium.
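To illustrate why the chance of an incomplete observation grows so quickly with the number of markers, here is a minimal sketch. It assumes that every genotype is missing independently with the same rate, which is a simplifying assumption made for illustration and not a statement about how genotypes were removed in the study. The function name `prob_incomplete` and its parameters are hypothetical.

```python
def prob_incomplete(missing_rate: float, n_markers: int, n_individuals: int = 3) -> float:
    """Probability that a family has at least one missing genotype.

    Assumption (not from the source): each genotype is missing independently
    with probability `missing_rate`. The default of 3 individuals corresponds
    to a case-parent trio (two parents and one affected child).
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    if n_markers < 0 or n_individuals < 0:
        raise ValueError("counts must be non-negative")
    n_genotypes = n_markers * n_individuals
    # Complement of the event "every genotype is observed".
    return 1.0 - (1.0 - missing_rate) ** n_genotypes


if __name__ == "__main__":
    # Illustrative per-genotype missing rate of 2% (a made-up value).
    for n_markers in (1, 10, 100):
        print(n_markers, prob_incomplete(0.02, n_markers))
```

Under this assumption, discarding every trio with any missing genotype removes a rapidly growing share of the sample as markers are added, which is the loss of sample size the abstract refers to.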
Year: 2007 PMID: 18466521 PMCID: PMC2367517 DOI: 10.1186/1753-6561-1-s1-s24
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Proportion of replicates in which each marker gives a significant association test
| Locus | Association test | Association test conditional on DR |
| --- | --- | --- |
| 1 | 0.19 | 0.05 |
| 2 | 0.6 | 0.03 |
| 3 | 1 | 0.18 |
| 4 | 1 | 0.33 |
| C | 1 | 0.83 |
| 6 | 1 | 0.2 |
| 7 | 0.97 | 0.09 |
| 8 | 0.09 | 0.06 |
| 9 | 0.16 | 0.06 |
| DR | 1 | X |
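Each entry in the table above is a proportion of replicates in which the test was significant. A minimal sketch of that calculation follows. The significance threshold is not stated in this record, so the default `alpha=0.05` is an assumption, and the p-values in the usage example are made-up toy values, not results from the study. The function name `empirical_power` is hypothetical.

```python
from typing import Sequence


def empirical_power(p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Fraction of replicates whose test p-value falls below `alpha`.

    `alpha` is an assumed threshold; the source record does not give one.
    """
    if len(p_values) == 0:
        raise ValueError("at least one replicate is required")
    n_significant = sum(1 for p in p_values if p < alpha)
    return n_significant / len(p_values)


if __name__ == "__main__":
    # Toy p-values for four hypothetical replicates (not study data).
    toy_p_values = [0.001, 0.20, 0.03, 0.60]
    print(empirical_power(toy_p_values))  # 2 of 4 below 0.05 -> 0.5
```

In the study the same kind of proportion would be taken over the 100 replicates, once per marker and once per test (unconditional or conditional on DR).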
Figure 1. Power to detect the effect of locus C in disease susceptibility. Comparison of the power to detect the C locus effect with and without multiple imputation (MI) as a function of the percentage of missing data at locus C. Power of the test accounting for the DR locus effect is computed over the 100 replicates using the first 500 families.
Figure 2. Number of informative families at locus C as a function of the percentage of missing data.
Figure 3. Number of times each marker gives the best score for the association test. Association tests were performed given the effect of DR for different percentages of missing data at locus C. For each percentage of missing data, the number of replicates among the 100 replicates in which each of the ten markers gives the best association score is reported. Each bar represents a different marker and the black bar represents SNP C. Left: results obtained when trios with missing data are discarded. Right: results obtained using the MI approach.