Lorraine Southam, Kalliope Panoutsopoulou, N William Rayner, Kay Chapman, Caroline Durrant, Teresa Ferreira, Nigel Arden, Andrew Carr, Panos Deloukas, Michael Doherty, John Loughlin, Andrew McCaskie, William E R Ollier, Stuart Ralston, Timothy D Spector, Ana M Valdes, Gillian A Wallis, J Mark Wilkinson, Jonathan Marchini, Eleftheria Zeggini.
Abstract
Imputation is an extremely valuable tool in conducting and synthesising genome-wide association studies (GWASs). Directly typed SNP quality control (QC) is thought to affect imputation quality. It is, therefore, common practice to use quality-controlled (QCed) data as an input for imputing genotypes. This study aims to determine the effect of commonly applied QC steps on imputation outcomes. We performed several iterations of imputing SNPs across chromosome 22 in a dataset consisting of 3177 samples with Illumina 610k (Illumina, San Diego, CA, USA) GWAS data, applying different QC steps each time. The imputed genotypes were compared with the directly typed genotypes. In addition, we investigated the correlation between alternatively QCed data. We also applied a series of post-imputation QC steps, balancing elimination of poorly imputed SNPs against information loss. We found that the difference between the unQCed data and the fully QCed data on imputation outcome was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low-frequency and rare variants.
Year: 2011 PMID: 21267008 PMCID: PMC3083623 DOI: 10.1038/ejhg.2010.242
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246
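The abstract's comparison of imputed genotypes with directly typed genotypes can be illustrated with a minimal sketch. This record does not say how the comparison was scored, so the best-guess calling, the 0.9 calling threshold, the function names, and the toy data below are all assumptions made for illustration, not the authors' procedure.

```python
def best_guess(probs, threshold=0.9):
    """Return the most probable genotype (0, 1 or 2 copies of one allele),
    or None when its posterior probability is below the calling threshold."""
    call = max(range(3), key=lambda g: probs[g])
    return call if probs[call] >= threshold else None


def concordance(typed, imputed_probs, threshold=0.9):
    """Fraction of samples whose best-guess imputed genotype equals the typed one.

    Samples with a missing typed genotype or an uncalled imputed genotype
    are skipped rather than counted as discordant.
    """
    compared = matches = 0
    for typed_call, probs in zip(typed, imputed_probs):
        imputed_call = best_guess(probs, threshold)
        if typed_call is None or imputed_call is None:
            continue
        compared += 1
        matches += imputed_call == typed_call
    return matches / compared if compared else float("nan")


# Toy data for one SNP in five samples (illustrative values only).
typed = [0, 1, 2, 1, None]
imputed = [
    (0.98, 0.02, 0.00),
    (0.05, 0.93, 0.02),
    (0.01, 0.04, 0.95),
    (0.92, 0.07, 0.01),  # confidently imputed but discordant with the typed call
    (0.10, 0.85, 0.05),  # typed genotype missing, so this sample is skipped
]
print(concordance(typed, imputed))
```

Skipping uncalled genotypes rather than counting them as errors is a design choice; a stricter score would count them as discordant.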
Summary of QC steps and related SNP number breakdown
| Pre-imputation QC | Total SNPs (no post-imputation QC) | NS (no post-imputation QC) | S (no post-imputation QC) | NS (post-imputation QC) | S (post-imputation QC) |
|---|---|---|---|---|---|
| None (‘unQCed’ dataset) | 8064 | 7689 | 375 | 6498 | 77 |
| Typical GWAS QC (‘QCed’ dataset) | 7910 | 7585 | 325 | 6446 | 67 |
| As above plus 14 significant SNPs removed with poor cluster plots | 7896 | 7592 | 304 | 6449 | 61 |
| As above plus 36 additional SNPs removed with poor cluster plots | 7860 | 7557 | 303 | 6419 | 58 |
| Typical GWAS QC plus MAF <5% | 7554 | 7269 | 285 | 6434 | 65 |
| Typical GWAS QC plus MAF <10% | 6544 | 6287 | 257 | 5569 | 53 |
Abbreviations: GWAS, genome-wide association study; MAF, minor allele frequency; NS, not significant; QC, quality control; QCed, quality controlled; S, significant.
Filtering is based on removal of SNPs with an IMPUTE-info score of <0.8 and MAF <5%.
There were 8082 SNPs in the unQCed data, of which 18 were monomorphic in the arcOGEN cases but polymorphic in HapMap; these SNPs were removed by IMPUTE.
Typical GWAS QC excluded SNPs with MAF ≥5% and call rate <95%, SNPs with MAF <5% and call rate <99%, and SNPs with Hardy–Weinberg equilibrium P<1 × 10⁻⁴. GC and AT allele SNPs and MAF <1% SNPs were also excluded, as an additional post-association analysis and pre-imputation QC step.
Significant SNPs with poor cluster plots removed.
Those SNPs flanking the significant SNPs with poor cluster plots removed.
arcOGEN data for chromosome 22 detailing the different pre-imputation QC steps. A breakdown of the SNP number for each QC threshold is indicated both with and without the post-imputation QC.
NS, P≥1 × 10⁻⁶; significant SNPs, P<1 × 10⁻⁶.
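The QC thresholds in the table notes can be written as two small predicates. This is a sketch under stated assumptions, not the authors' pipeline: the `Snp` record and its field names are hypothetical, the call-rate rule is read as a 95% minimum for SNPs with MAF of at least 5% and a 99% minimum for rarer SNPs, and the post-imputation filter is read as removing a SNP that fails either the IMPUTE-info or the MAF threshold.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    # Hypothetical per-SNP summary record; field names are illustrative.
    name: str
    maf: float                    # minor allele frequency
    call_rate: float = 1.0        # fraction of samples with a genotype call
    hwe_p: float = 1.0            # Hardy-Weinberg equilibrium P value
    info: float = 1.0             # IMPUTE-info score
    alleles: tuple = ("A", "G")


def passes_pre_imputation_qc(snp: Snp) -> bool:
    """Apply the 'typical GWAS QC' thresholds as read from the table notes."""
    # Assumed reading: common SNPs need a 95% call rate, rarer SNPs need 99%.
    min_call_rate = 0.95 if snp.maf >= 0.05 else 0.99
    if snp.call_rate < min_call_rate:
        return False
    if snp.hwe_p < 1e-4:
        return False
    if snp.maf < 0.01:
        return False
    # GC and AT allele SNPs are strand-ambiguous and are excluded.
    if set(snp.alleles) in ({"A", "T"}, {"G", "C"}):
        return False
    return True


def passes_post_imputation_qc(snp: Snp, min_info: float = 0.8,
                              min_maf: float = 0.05) -> bool:
    """Keep a SNP only if IMPUTE-info >= 0.8 and MAF >= 5%."""
    return snp.info >= min_info and snp.maf >= min_maf


# Made-up SNPs, for illustration only.
snps = [
    Snp("snp_a", maf=0.20, call_rate=0.96),
    Snp("snp_b", maf=0.03, call_rate=0.96),         # fails the 99% call-rate rule
    Snp("snp_c", maf=0.20, hwe_p=1e-5),             # fails Hardy-Weinberg
    Snp("snp_d", maf=0.20, alleles=("A", "T")),     # AT allele SNP
    Snp("snp_e", maf=0.20, info=0.75),              # poorly imputed
]
print([s.name for s in snps if passes_pre_imputation_qc(s)])
print([s.name for s in snps if passes_post_imputation_qc(s)])
```

Keeping the two predicates separate mirrors the table, which reports SNP counts for each pre-imputation QC step both with and without the post-imputation filter.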
Figure 1. (a) Imputation results for the QCed data indicating the total number of SNPs filtered for different QC thresholds using the IMPUTE-info and freq-add-proper-info scores. The SNPs remaining after the filter (red bar) have been subdivided into SNPs that are significant (green bar) and not significant (yellow bar). (b) The same data as percentage of significant and nonsignificant SNPs removed for each threshold. Both methods of filtering appear to be equivalent, but the freq-add-proper-info is shifted to the right for the same numerical threshold; we chose the IMPUTE-info <0.8 for further analysis (similar to a freq-add-proper-info <0.9).
Figure 2. Correlation plots and the associated R² for (a) the unQCed and the QCed data with and without post-imputation QC filtering (IMPUTE-info <0.8 and MAF <5%), and (b) the imputed-only markers in the unQCed and fully QCed data (QCed data with all poorly clustered markers removed) without post-imputation QC filtering.
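The R² in the correlation plots can be computed as a squared Pearson correlation over the SNPs shared by two datasets. This record does not state which per-SNP quantity is plotted, so the sketch below assumes a per-SNP association statistic such as −log10 P. The function names and the toy values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


def paired_r_squared(stats_a, stats_b):
    """R^2 between two {snp_name: statistic} mappings over their shared SNPs."""
    shared = sorted(set(stats_a) & set(stats_b))
    return r_squared([stats_a[s] for s in shared], [stats_b[s] for s in shared])


# Toy per-SNP statistics for two alternatively QCed datasets (illustrative only).
unqced = {"snp_a": 1.0, "snp_b": 2.0, "snp_c": 3.0, "snp_d": 9.0}
qced = {"snp_a": 1.0, "snp_b": 3.0, "snp_c": 2.0}  # snp_d was removed by QC
print(paired_r_squared(unqced, qced))
```

Restricting the comparison to shared SNPs matters because different QC steps leave different SNP sets, as the table's SNP counts show.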