Pouya Khankhanian, Lennox Din, Stacy J Caillier, Pierre-Antoine Gourraud, Sergio E Baranzini.
Abstract
Imputation is a commonly used technique that exploits linkage disequilibrium to infer missing genotypes in genetic datasets, using a well-characterized reference population. While there is agreement that the reference population has to match the ethnicity of the query dataset, it is common practice to use the same reference to impute genotypes for a wide variety of phenotypes. We hypothesized that using a reference composed of samples with a different phenotype than the query dataset would introduce imputation bias. To test this hypothesis we used GWAS datasets from Amyotrophic Lateral Sclerosis (ALS), Parkinson's Disease (PD), and Crohn's Disease (CD). First, we masked and then imputed 100 disease-associated markers and 100 non-associated markers from each study. Two references for imputation were used in parallel: one consisting of healthy controls and another consisting of patients with the same disease. We assessed the discordance (imprecision) and bias (inaccuracy) of imputation by comparing predicted genotypes to those assayed by SNP-chip. We also assessed the bias on the observed effect size when the predicted genotypes were used in a GWAS. When healthy controls were used as reference for imputation, a significant bias was observed, particularly in the disease-associated markers. Using cases as reference significantly attenuated this bias. For nearly all markers, the direction of the bias favored the non-risk allele. In GWAS of the three diseases (with healthy controls from the 1000 Genomes as reference), the mean odds ratio (OR) for disease-associated markers obtained by imputation was lower than that obtained using the original assayed genotypes. We found that the bias is inherent to imputation, as using different methods did not alter the results. In conclusion, imputation is a powerful method to predict genotypes and estimate genetic risk for GWAS. However, a careful choice of reference population is needed to minimize biases inherent to this approach.
Keywords: SNP imputation; genome-wide association study; genomics; haplotype estimation
Year: 2015 PMID: 25709616 PMCID: PMC4321633 DOI: 10.3389/fgene.2015.00030
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Discordance of imputation.
| Markers (n) | Imputed cases (n) | Whole-number imputation, control reference | Whole-number imputation, case reference | Fractional imputation, control reference | Fractional imputation, case reference |
| --- | --- | --- | --- | --- | --- |
| NAM (100) | ALS (137) | 17.40% [16.30, 18.50] | 17.60% [16.41, 18.80] | 0.2094 [0.1970, 0.2218] | 0.2108 [0.1977, 0.2240] |
| NAM (100) | PD (335) | 15.74% [14.53, 16.96] | 15.89% [14.69, 17.10] | 0.1927 [0.1787, 0.2067] | 0.1937 [0.1803, 0.2072] |
| NAM (100) | CD (406) | 15.73% [14.64, 16.81] | 16.13% [15.02, 17.26] | 0.1935 [0.1818, 0.2051] | 0.1978 [0.1857, 0.2099] |
| DAM (100) | ALS (137) | 19.65% [18.50, 20.80] | 18.04% [16.90, 19.18] | 0.2311 [0.2188, 0.243] | 0.2156 [0.2031, 0.2280] |
| DAM (100) | PD (335) | 19.12% [17.84, 20.41] | 18.22% [16.83, 19.61] | 0.2274 [0.2128, 0.2421] | 0.218 [0.2022, 0.2337] |
| DAM (100) | CD (406) | 15.77% [14.68, 16.87] | 15.45% [14.34, 16.56] | 0.1918 [0.1795, 0.2040] | 0.1883 [0.1759, 0.2007] |
NAM, non-associated markers; DAM, disease-associated markers.
Cases were imputed from ALS, Amyotrophic Lateral Sclerosis; PD, Parkinson's Disease; CD, Crohn's Disease.
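The page does not give the exact formula behind the discordance values, so the following is only a minimal sketch of one plausible reading. It assumes genotypes are coded as the number of minor alleles (0, 1, or 2), that whole-number discordance is the fraction of calls where the imputed genotype differs from the assayed one, and that fractional discordance is the mean absolute difference between the imputed dosage and the assayed genotype. The function names are hypothetical and are not taken from the study.

```python
from typing import Sequence


def _check_lengths(assayed: Sequence[float], imputed: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing any rate."""
    if len(assayed) == 0 or len(assayed) != len(imputed):
        raise ValueError("inputs must be non-empty and of equal length")


def whole_number_discordance(assayed: Sequence[int], imputed: Sequence[int]) -> float:
    """Fraction of calls where the imputed genotype differs from the assayed one.

    Assumption: genotypes are minor-allele counts (0, 1, 2).
    """
    _check_lengths(assayed, imputed)
    mismatches = sum(1 for a, i in zip(assayed, imputed) if a != i)
    return mismatches / len(assayed)


def fractional_discordance(assayed: Sequence[int], dosages: Sequence[float]) -> float:
    """Mean absolute difference between imputed dosage and assayed genotype.

    Assumption: dosages are expected minor-allele counts in the range [0, 2].
    """
    _check_lengths(assayed, dosages)
    return sum(abs(a - d) for a, d in zip(assayed, dosages)) / len(assayed)


if __name__ == "__main__":
    assayed = [0, 1, 2, 1, 0]            # genotypes assayed by SNP-chip (toy data)
    hard_calls = [0, 1, 1, 1, 0]         # whole-number imputed genotypes (toy data)
    dosages = [0.1, 0.9, 1.4, 1.0, 0.0]  # fractional imputed dosages (toy data)
    print(whole_number_discordance(assayed, hard_calls))
    print(fractional_discordance(assayed, dosages))
```

With the toy data above, one of five hard calls is wrong, so the whole-number discordance is 0.2. Discordance is unsigned: it measures how often imputation is wrong, not in which direction.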
Bias of imputation.
| Markers (n) | Risk allele | Imputed cases (n) | Whole-number imputation, control reference | Whole-number imputation, case reference | Fractional imputation, control reference | Fractional imputation, case reference |
| --- | --- | --- | --- | --- | --- | --- |
| NAM (100) | – | ALS (137) | 1.02% [0.29, 1.75] | 1.64% [0.77, 2.51] | −0.0019 [−0.009, 0.0053] | 0.0071 [−0.002, 0.0161] |
| NAM (100) | – | PD (335) | 1.67% [1.04, 2.30] | 1.93% [1.17, 2.68] | 0.0005 [−0.0056, 0.0067] | 0.0039 [−0.0035, 0.0113] |
| NAM (100) | – | CD (406) | 1.55% [1.05, 2.05] | 1.94% [1.40, 2.49] | 0.0017 [−0.0025, 0.0059] | 0.0069 [0.0017, 0.012] |
| DAM (54) | Major | ALS (137) | −10.67% [−12.59, −8.75] | −4.32% [−5.57, −3.07] | −0.126 [−0.147, −0.1051] | −0.0638 [−0.0762, −0.0514] |
| DAM (52) | Major | PD (335) | −7.21% [−8.30, −6.12] | −2.51% [−3.81, −1.22] | −0.0934 [−0.1033, −0.0835] | −0.0394 [−0.0522, −0.0266] |
| DAM (47) | Major | CD (406) | −4.55% [−5.87, −3.23] | −0.90% [−1.58, −0.21] | −0.0615 [−0.0762, −0.0467] | −0.0232 [−0.0302, −0.0162] |
| DAM (46) | Minor | ALS (137) | 12.51% [10.85, 14.18] | 6.00% [3.98, 8.01] | 0.1253 [0.1065, 0.1441] | 0.0571 [0.0352, 0.0789] |
| DAM (48) | Minor | PD (335) | 10.60% [9.24, 12.0] | 5.98% [4.62, 7.35] | 0.1018 [0.0875, 0.1161] | 0.0476 [0.0328, 0.0625] |
| DAM (53) | Minor | CD (406) | 8.97% [8.20, 9.73] | 5.55% [4.37, 6.37] | 0.0812 [0.0722, 0.0901] | 0.047 [0.0378, 0.0561] |
NAM, non-associated markers; DAM, disease-associated markers.
Cases were imputed from ALS, Amyotrophic Lateral Sclerosis; PD, Parkinson's Disease; CD, Crohn's Disease.
Positive values indicate preference for major allele.
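The sign convention above can be illustrated with a short sketch. The exact bias formula is not stated on this page, so the definitions here are assumptions: genotypes are coded as minor-allele counts (0, 1, 2); whole-number bias is the share of calls that err toward the major allele minus the share that err toward the minor allele; fractional bias is the mean of (assayed genotype minus imputed dosage). Both are positive when imputation prefers the major allele. The function names are hypothetical.

```python
from typing import Sequence


def whole_number_bias(assayed: Sequence[int], imputed: Sequence[int]) -> float:
    """Signed error rate of hard calls; positive means preference for the major allele.

    Assumption: genotypes are minor-allele counts, so an imputed call lower
    than the assayed call has replaced a minor allele with a major allele.
    """
    if len(assayed) == 0 or len(assayed) != len(imputed):
        raise ValueError("inputs must be non-empty and of equal length")
    toward_major = sum(1 for a, i in zip(assayed, imputed) if i < a)
    toward_minor = sum(1 for a, i in zip(assayed, imputed) if i > a)
    return (toward_major - toward_minor) / len(assayed)


def fractional_bias(assayed: Sequence[int], dosages: Sequence[float]) -> float:
    """Mean signed difference (assayed - dosage); positive means preference for the major allele."""
    if len(assayed) == 0 or len(assayed) != len(dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a - d for a, d in zip(assayed, dosages)) / len(assayed)


if __name__ == "__main__":
    assayed = [2, 1, 1, 0]     # toy assayed genotypes
    hard_calls = [1, 1, 0, 0]  # two calls lose a minor allele
    print(whole_number_bias(assayed, hard_calls))
    print(fractional_bias(assayed, [1.5, 1.0, 0.5, 0.0]))
```

In the toy data, two of four calls err toward the major allele and none toward the minor allele, giving a whole-number bias of 0.5. Under this reading, a marker whose risk allele is the minor allele and whose bias is positive is being imputed against the risk allele, which matches the direction reported in the table.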
Figure 1. Imputation bias vs. odds ratio of association in ALS. Each circle represents one of the 100 DAM in ALS. For each SNP, the odds ratio (OR) of association (x-axis) indicates whether the minor allele (OR > 1) or the major allele (OR < 1) is the susceptibility allele (the allele more prevalent in cases than controls). The imputation bias (y-axis) indicates whether imputation error favors the major allele (positive values) or the minor allele (negative values). When controls were used as the reference for imputation, imputation is biased against the susceptibility allele. When an independent set of cases was used as the reference for imputation, the bias is significantly decreased. For reference, the 100 NAM (OR ≈ 1) are shown as boxes. Points are shaded by the log10 p-value of association with disease. The odds ratios of NAM are exaggerated for visual clarity.
Figure 2. The distribution of (imputed OR/true OR) for 100 DAM in each dataset. In each of three datasets, 100 DAM were selected and the odds ratio of association with disease was estimated using both genotyped (true) data and imputed data. The ratio of imputed odds ratio to true odds ratio (x-axis) takes a similar distribution across the 100 SNPs in each disease. The odds ratio (OR) of association was generally lower for imputed data than for true data (imputed OR/true OR < 1). This holds true whether we use whole number imputation by Mach (top), fractional imputation by Mach (middle), or whole number imputation by Beagle (bottom). In contrast to Mach imputation, the magnitude of the mean imputed OR was closer to the magnitude of the true OR for all three datasets when using Beagle imputation. ALS, Amyotrophic Lateral Sclerosis; PD, Parkinson's Disease; CD, Crohn's Disease.
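The ratio plotted in Figure 2 can be sketched for a single SNP as follows. This is an illustration under stated assumptions, not the association test used in the study: it uses a simple allelic odds ratio for the minor allele (so OR > 1 means the minor allele is more prevalent in cases, as in Figure 1), genotypes are coded as minor-allele counts, and only the case genotypes are swapped between assayed and imputed values. All function names are hypothetical.

```python
from typing import Sequence, Tuple


def allele_counts(genotypes: Sequence[float]) -> Tuple[float, float]:
    """Return (minor, major) allele counts from minor-allele-count genotypes or dosages."""
    minor = sum(genotypes)
    major = 2 * len(genotypes) - minor
    return minor, major


def allelic_odds_ratio(cases: Sequence[float], controls: Sequence[float]) -> float:
    """Odds of carrying the minor allele in cases divided by the odds in controls."""
    case_minor, case_major = allele_counts(cases)
    ctrl_minor, ctrl_major = allele_counts(controls)
    if min(case_minor, case_major, ctrl_minor, ctrl_major) <= 0:
        raise ValueError("odds ratio is undefined when an allele count is zero")
    return (case_minor / case_major) / (ctrl_minor / ctrl_major)


def imputed_to_true_or_ratio(
    assayed_cases: Sequence[float],
    imputed_cases: Sequence[float],
    controls: Sequence[float],
) -> float:
    """Imputed OR divided by true OR; values below 1 mean the imputed OR is lower."""
    imputed_or = allelic_odds_ratio(imputed_cases, controls)
    true_or = allelic_odds_ratio(assayed_cases, controls)
    return imputed_or / true_or


if __name__ == "__main__":
    controls = [1, 0, 0, 0]       # toy control genotypes
    assayed_cases = [2, 1, 1, 0]  # toy assayed case genotypes
    imputed_cases = [1, 1, 1, 0]  # one minor (risk) allele lost to imputation
    print(imputed_to_true_or_ratio(assayed_cases, imputed_cases, controls))
```

In this toy example the minor allele is the risk allele, and losing one copy of it in the imputed cases lowers the imputed OR relative to the true OR, giving a ratio below 1.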