Zhenming Zhao, Nadia Timofeev, Stephen W Hartley, David HK Chui, Supan Fucharoen, Thomas T Perls, Martin H Steinberg, Clinton T Baldwin, Paola Sebastiani.
Abstract
BACKGROUND: Imputation of missing genotypes is becoming a very popular solution for synchronizing genotype data collected with different microarray platforms, but the effects of ethnic background, subject ascertainment, and amount of missing data on the accuracy of imputation are not well understood.
Year: 2008 PMID: 19077279 PMCID: PMC2636842 DOI: 10.1186/1471-2156-9-85
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Summary of the accuracies of IMPUTE using data from chromosome 21 in the NNC set
| Complete missing | Overall | 97.42% | 97.42% | 97.05% | 95.20% | 91.88% |
| Complete missing | 0.95.P.P | 99.24% | 99.24% | 99.22% | 99.06% | 98.86% |
| Complete missing | Percentage | 82.30% | 82.31% | 80.38% | 71.16% | 59.39% |
| 80% missing | Overall | 97.24% | 97.70% | 97.24% | 95.39% | 91.71% |
| 80% missing | 0.95.P.P | 99.38% | 99.42% | 99.27% | 99.04% | 98.95% |
| 80% missing | Percentage | 81.92% | 82.08% | 80.53% | 71.39% | 59.12% |
The columns report the accuracy of imputation when different proportions of SNPs ranging from 0.1% to 60% were imputed. The first three rows labelled as "Complete missing" summarize the accuracy when the genotype data were completely removed, while the last three rows labelled "80% missing" summarize the accuracy when 80% of the genotype data were randomly removed. The row labelled "Overall" reports the median accuracy and the minimum, 1st quartile, 3rd quartile, and maximum accuracy value within brackets. The row labelled "0.95.P.P" reports the median accuracy of the imputed genotypes when a minimum posterior probability of 0.95 was required for an imputed genotype to be acceptable. The row labelled "Percentage" reports the percentage of imputed genotype data that were acceptable by using the minimum posterior probability of 0.95 as a requirement.
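The three row types described in this caption can be illustrated with a minimal Python sketch. It assumes each imputed genotype comes with three posterior probabilities (one per genotype class) and that the true genotype is known because it was deliberately masked. The function name `summarize_accuracy` and the 0/1/2 genotype coding are illustrative assumptions, not taken from the paper or from IMPUTE. The sketch summarizes a single set of genotypes, whereas the table reports medians of such values.

```python
import numpy as np

def summarize_accuracy(posteriors, truth, min_pp=0.95):
    """Summarize imputation accuracy for one set of masked genotypes.

    posteriors: (n, 3) array of posterior probabilities per genotype class
                (hypothetical coding 0, 1, 2).
    truth:      (n,) array of the true (masked) genotype codes.
    Returns (overall accuracy, accuracy among accepted calls,
             fraction of calls accepted).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    truth = np.asarray(truth)

    calls = posteriors.argmax(axis=1)   # best-guess genotype
    max_pp = posteriors.max(axis=1)     # confidence of that guess
    correct = calls == truth
    accepted = max_pp >= min_pp         # passes the 0.95 requirement

    overall = correct.mean()
    thresholded = correct[accepted].mean() if accepted.any() else float("nan")
    fraction_accepted = accepted.mean()
    return overall, thresholded, fraction_accepted
```

Raising `min_pp` trades coverage for accuracy, which is the pattern the "0.95.P.P" and "Percentage" rows show together.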
Impact on imputation accuracy of splitting chromosomes into chunks
| Complete missing | Overall | 97.42% | 97.42% | 88.29% | 88.29% |
| Complete missing | 0.95.P.P | 99.23% | 99.23% | 97.30% | 97.30% |
| 80% missing | Overall | 97.24% | 97.70% | 88.76% | 88.76% |
| 80% missing | 0.95.P.P | 99.28% | 99.30% | 97.44% | 97.44% |
Splitting chromosome 2 into small chunks of 10 Mb had no obvious impact on imputation accuracy in the NNC and SCA sets. In all tests, 10% of the SNPs on chromosome 2 were randomly selected and their genotype data were either completely removed (Complete missing) or 80% randomly removed (80% missing).
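As an illustration of the chunking step, the following Python sketch groups SNP positions into non-overlapping 10 Mb windows. Whether the authors used overlapping or buffered chunks is not stated here, so the non-overlapping windows and the function name `assign_chunks` are assumptions made for this example.

```python
def assign_chunks(positions_bp, chunk_size_bp=10_000_000):
    """Group SNP base-pair positions into fixed-size chunks.

    Returns a dict mapping chunk index (0-based) to the list of
    positions that fall inside that chunk.
    """
    chunks = {}
    for pos in positions_bp:
        # Integer division gives the index of the window containing pos.
        chunks.setdefault(pos // chunk_size_bp, []).append(pos)
    return chunks
```

Each chunk could then be imputed separately and the per-SNP accuracies pooled for comparison with the whole-chromosome run.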
Comparison of the accuracies of the imputed genotypes in different populations
| Complete missing | Overall | 85.66% | 87.39% | 94.23% | 97.05% | 97.06% | 96.43% |
| Complete missing | 0.95.P.P | 96.70% | 97.22% | 98.06% | 99.22% | 99.24% | 99.15% |
| Complete missing | Percentage | 59.00% | 60.77% | 72.56% | 80.38% | 80.40% | 77.43% |
| 80% missing | Overall | 85.92% | 87.64% | 93.98% | 97.24% | 97.25% | 96.43% |
| 80% missing | 0.95.P.P | 96.79% | 97.37% | 98.46% | 99.27% | 99.08% | 99.08% |
| 80% missing | Percentage | 59.03% | 61.27% | 72.82% | 80.53% | 80.44% | 77.43% |
The columns report the accuracy of imputation when 10% of SNPs were imputed. As in Table 1, the first three rows labelled as "Complete missing" summarize the accuracy when the genotype data were completely removed, while the last three rows labelled "80% missing" summarize the accuracy when 80% of the genotype data were randomly removed. The row labelled "Overall" reports the median accuracy and the minimum, 1st quartile, 3rd quartile, and maximum accuracy within brackets. The row labelled "0.95.P.P" reports the median accuracy of the imputed genotypes when a minimum posterior probability of 0.95 was required for an imputed genotype to be acceptable. The row labelled "Percentage" reports the percentage of imputed genotype data that were acceptable by using the minimum posterior probability of 0.95 as a requirement.
Figure 1. Distribution of imputation accuracies when 1% of the SNPs were randomly selected from chromosome 21 and their genotype data completely removed in the NNC set. The results for other proportions of missing SNPs are in the supplementary material. In each of the 1,000 simulations we randomly selected 1% of the SNPs to be removed from the data and their genotype data to be imputed. The chromosome is tagged by approximately 5,900 SNPs, so that 59 SNPs were removed in each run, and 59,000 SNPs had to be imputed across all 1,000 simulations. The x-axis reports the accuracy of each of the 59,000 SNPs that were imputed in the 1,000 simulations. The y-axis reports the frequency of different imputation accuracies.
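The counts in this caption follow from simple arithmetic: 1% of roughly 5,900 SNPs is 59 per run, and 1,000 runs give 59,000 imputed SNPs. A minimal Python sketch of the random selection step is shown below. The function name `pick_snps_to_mask`, the fixed seed, and the use of index positions to stand in for SNPs are assumptions for illustration only.

```python
import random

def pick_snps_to_mask(n_snps, fraction, rng):
    """Randomly choose which SNP indices to remove in one simulation run."""
    n_mask = round(n_snps * fraction)
    return sorted(rng.sample(range(n_snps), n_mask))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
runs = [pick_snps_to_mask(5900, 0.01, rng) for _ in range(1000)]
total_imputed = sum(len(run) for run in runs)
```

Because each run samples independently, the same SNP can be selected in more than one run, so the 59,000 imputed SNPs are not all distinct.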
Figure 2. Accuracies versus minor allele frequency (MAF) when 1% of the SNPs on Chr21 were randomly selected and their genotype data were completely removed in the NNC set. The cluster of 10 points corresponds to SNPs that are in recombination hotspots.
Figure 3. Accuracies of imputed genotypes in 59,000 SNPs (y-axis) versus a summary of the LD patterns surrounding them (x-axis). The summary of LD is a weighted average of the pairwise D' between each SNP to be imputed and all other SNPs on the same chromosome, with weights that are calculated as . In the formula, d is the physical distance between the SNP to be imputed and the ith SNP, in units of 100 kb, and d' is the estimate of LD between the same two SNPs.
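The weighted average described in this caption can be sketched in Python as follows. The weight formula itself is not reproduced in this text, so the sketch takes the weights as an input rather than computing them from distance. The function name `weighted_ld_summary` is hypothetical.

```python
def weighted_ld_summary(d_primes, weights):
    """Weighted average of pairwise D' values for one imputed SNP.

    d_primes: D' between the imputed SNP and each other SNP.
    weights:  one non-negative weight per pair. How the paper derives
              them from physical distance is not specified here.
    """
    if len(d_primes) != len(weights):
        raise ValueError("d_primes and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * d for w, d in zip(weights, d_primes)) / total_weight
```

For example, `weighted_ld_summary([1.0, 0.5], [3.0, 1.0])` gives more influence to the first pair because its weight is larger.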
Figure 4. Results for the principal components analysis (PCA) assessing the degree of stratification between the samples used for imputation and the four HapMap populations. The two panels plot the top two principal components for CEU (purple), YRI (red), NNC (black), NECS (blue), AA (orange), and SCA (green). The left panel shows that the African Americans (orange) are more admixed than the SCA (green) in the right panel.
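To show what the top two principal components in this figure represent, here is a minimal Python sketch using a singular value decomposition of a centered genotype matrix. It is not the authors' pipeline; the software they used for PCA is not stated here, and the function name `top_two_pcs` and the 0/1/2 genotype coding are assumptions.

```python
import numpy as np

def top_two_pcs(genotypes):
    """Project samples onto their top two principal components.

    genotypes: (n_samples, n_snps) array of genotype codes
               (hypothetical 0/1/2 minor-allele counts).
    Returns an (n_samples, 2) array of PC scores.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)  # center each SNP column
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :2] * s[:2]  # sample scores on PC1 and PC2
```

Plotting the two returned columns against each other, colored by population label, would give a scatter of the kind the caption describes.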
Accuracy of imputation in samples from African Americans
| Complete missing | Overall | 85.66% | 87.00% | 97.14% |
| Complete missing | 0.95.P.P | 96.70% | 96.97% | 100.00% |
| 80% missing | Overall | 85.92% | 87.08% | 96.43% |
| 80% missing | 0.95.P.P | 96.79% | 97.06% | 100.00% |
Impact of splitting samples from African Americans into groups based on their similarity to the Yorubans and Caucasians on the imputation accuracy. The 1st column reports imputation accuracy when YRI haplotypes are used on the whole set, the 2nd column reports imputation accuracy when YRI haplotypes are used on a cluster of subjects close to YRIs, the 3rd column reports imputation accuracy when CEU haplotypes are used on a cluster of subjects close to CEUs.