Daniel F Schwarz, Silke Szymczak, Andreas Ziegler, Inke R König.
Abstract
Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of genome-wide association studies is to impute the untyped SNPs using more complete genotype data from a reference population. Random forests (RF) provide an internal method for replacing missing genotypes. A forest of classification trees is used to determine similarities between probands with regard to their genotypes. These proximities are then used to impute the genotypes of untyped SNPs. We evaluated this approach using genotype data from the Framingham Heart Study, provided as Problem 2 for Genetic Analysis Workshop 16, with the Caucasian HapMap samples as the reference population. Our results indicate that RFs are faster but less accurate than alternative approaches for imputing untyped SNPs.
Year: 2009 PMID: 20018059 PMCID: PMC2795966 DOI: 10.1186/1753-6561-3-s7-s65
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Figure 1. Flow chart of algorithm 1. Algorithm 1 proceeds as follows: 1) enrich data; 2) mark missing and undefined genotypes; 3) roughly replace missing values; 4) grow forest; 5) calculate sample proximities; 6) update former missing values using proximities; 7) repeat Steps 4-6 several times; 8) extract imputed original data.
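Steps 3, 5 and 6 of the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes genotypes coded 0/1/2, a mode-based rough fill, and the usual random-forest proximity (the fraction of trees in which two samples fall into the same terminal node). Growing the forest (Step 4) is left out, so the terminal-node table `leaf_ids` is hypothetical input, as are all function names.

```python
from collections import Counter

MISSING = None  # marker for a missing or untyped genotype (assumed coding)


def rough_fill(column):
    """Step 3: replace missing entries with the most frequent observed genotype."""
    observed = [g for g in column if g is not MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if g is MISSING else g for g in column]


def proximities(leaf_ids):
    """Step 5: share of trees in which two samples land in the same terminal node.

    leaf_ids[i][t] is the terminal-node index of sample i in tree t.
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
            prox[i][j] = shared / n_trees
    return prox


def proximity_update(column, prox):
    """Step 6: re-impute each missing entry by a proximity-weighted vote
    among the samples whose genotype at this SNP was actually observed."""
    updated = list(column)
    for i, genotype in enumerate(column):
        if genotype is not MISSING:
            continue
        votes = Counter()
        for j, other in enumerate(column):
            if other is not MISSING:
                votes[other] += prox[i][j]
        updated[i] = votes.most_common(1)[0][0]
    return updated


# Toy data: five samples, one SNP, sample 3 untyped; three hypothetical trees.
snp = [0, 0, 2, MISSING, 0]
leaf_ids = [[1, 1, 1], [1, 1, 2], [3, 3, 3], [3, 3, 1], [1, 2, 2]]

print(rough_fill(snp))                               # [0, 0, 2, 0, 0]
print(proximity_update(snp, proximities(leaf_ids)))  # [0, 0, 2, 2, 0]
```

In the toy data the rough fill assigns the overall mode (0) to sample 3, whereas the proximity-weighted vote assigns 2, because sample 3 shares terminal nodes mostly with sample 2. In the full procedure the forest would be regrown on the updated data and Steps 4-6 repeated, as the caption describes.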
Figure 2. Correlation between imputation accuracy and MAF. Blue and gray dots denote 3,775 SNPs that were imputed by RF and IMPUTE, respectively. SNPs are plotted according to accuracy and MAF. Black lines denote the three genotype frequencies of a SNP in Hardy-Weinberg equilibrium given its MAF.
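The three black reference lines follow from the standard Hardy-Weinberg proportions: for a minor allele frequency q and p = 1 - q, the expected genotype frequencies are p², 2pq and q². A small sketch, with a hypothetical function name and dictionary keys chosen for illustration:

```python
def hwe_genotype_frequencies(maf):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    maf is the minor allele frequency q, with 0 <= q <= 0.5.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    q = maf
    p = 1.0 - q
    return {
        "major_homozygote": p * p,
        "heterozygote": 2.0 * p * q,
        "minor_homozygote": q * q,
    }


for maf in (0.05, 0.2, 0.5):
    print(maf, hwe_genotype_frequencies(maf))
```

At low MAF the major homozygote dominates (about 0.90 at a MAF of 0.05), so always guessing that genotype already matches most samples. One plausible reading of the reference lines is that they give the accuracy of such constant guesses.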