Nab Raj Roshyara, Holger Kirsten, Katrin Horn, Peter Ahnert, Markus Scholz.
Abstract
BACKGROUND: Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, research on and understanding of the impact of initial SNP-data quality control on imputation results are still limited. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE.
Year: 2014 PMID: 25112433 PMCID: PMC4236550 DOI: 10.1186/s12863-014-0088-5
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Figure 1. Venn diagram describing the intersection of SNP datasets filtered by different quality criteria. Note that, by definition, HQ is contained in every subset.
Description of scenarios of pre-imputation SNP filtering. Note that the datasets contain different numbers of SNPs.
| 4658 | High quality: MAF ≥ 0.1, | |
| 7923 | Normal quality: MAF ≥ 0.01, | |
| 8472 | Low quality: MAF ≥ 0.005, | |
| 8310 | MAF ≥ 0.01 | |
| 9547 | p(HWE) ≥ 10⁻⁶ | |
| 9194 | ||
| 6344 | MAF ≥ 0.1 | |
| 9450 | p(HWE) ≥ 10⁻² | |
| 7148 | ||
| 8255 | MAF ≥ 0.01, p(HWE) ≥ 10⁻⁶ | |
| 6261 | MAF ≥ 0.1, p(HWE) ≥ 10⁻² | |
| 8520 | MAF ≥ 0.005 | |
| 9574 | p(HWE) ≥ 10⁻¹² | |
| 8492 | MAF ≥ 0.005, p(HWE) ≥ 10⁻¹² | |
| 6337 | This data subset contains SNPs which fail NQ criterion and HQ | |
| 9602 | This data subset contains all available SNPs. |
We focus on the scenarios in bold. Results of all scenarios can be found in the supplementary material.
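The filtering criteria listed above combine a minor allele frequency (MAF) threshold with a Hardy-Weinberg equilibrium (HWE) p-value threshold. The following is a minimal sketch of such a filter, not the authors' pipeline: the function names are hypothetical, genotype counts per SNP are assumed as input, and a 1-df chi-square test stands in for whichever HWE test was actually used.

```python
import math

def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts of one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    return min(p, 1 - p)

def hwe_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg equilibrium.

    Assumption: a simple chi-square test; an exact test may behave
    differently for rare alleles.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: test undefined, treated as no deviation
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_filter(counts, min_maf=0.0, min_hwe_p=0.0):
    """True if a SNP meets both thresholds (e.g. MAF >= 0.1, p(HWE) >= 1e-2)."""
    return maf(*counts) >= min_maf and hwe_pvalue(*counts) >= min_hwe_p

# Hypothetical genotype counts (AA, AB, BB) for three SNPs.
snps = {"snp1": (25, 50, 25), "snp2": (50, 0, 50), "snp3": (98, 2, 0)}
kept = [name for name, c in snps.items()
        if passes_filter(c, min_maf=0.1, min_hwe_p=1e-2)]
```

Stricter thresholds shrink the retained set, which is why the scenarios in the table contain different numbers of SNPs.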
Imputation quality of the scenario “Hole-filling without external reference”: percentages of masked genotypes imputed with a Hellinger score ≥0.6 are presented
| ALL | 9602 | ||||||
| LQ | 8472 | ||||||
| NQ | 7923 | 87.48* | 92.29 | 85.64 | |||
| BQ | 6337 | 89.82* | 88.24* | 79.53* | 89.00 | 87.31 | 76.89 |
| HQ | 4658 | 89.47* | 87.71* | 78.56* | 88.72 | 86.95 | 75.82 |
Datasets with different pre-imputation quality filtering were considered, and different percentages of genotypes were masked. Results of the optimal imputation scenarios are marked with (+). Results of filtering scenarios that are not significantly inferior to the best scenario are shown in bold italics. An asterisk (*) marks the program, MaCH or IMPUTE2, that performed significantly better in the corresponding scenario.
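The tables report the percentage of masked genotypes whose Hellinger score reaches 0.6. The score is not defined in this excerpt, so the sketch below rests on an assumption: that it equals one minus the Hellinger distance between the posterior genotype probabilities and the true genotype distribution. The function names are hypothetical.

```python
import math

def hellinger_score(posterior, truth):
    """Assumed definition: 1 - Hellinger distance between two distributions.

    posterior: imputed probabilities (P(AA), P(AB), P(BB))
    truth:     true genotype as a distribution, e.g. (1, 0, 0) for AA
    """
    # Bhattacharyya coefficient: overlap between the two distributions.
    bc = sum(math.sqrt(p * q) for p, q in zip(posterior, truth))
    # max() guards against tiny negative values from rounding.
    return 1 - math.sqrt(max(0.0, 1 - bc))

def percent_good(scores, cutoff=0.6):
    """Percentage of scores at or above the cut-off."""
    return 100 * sum(s >= cutoff for s in scores) / len(scores)

# Hypothetical example: a confident, correct call scores close to 1.
score = hellinger_score((0.84, 0.15, 0.01), (1, 0, 0))
```

Under this assumed definition a perfect call scores 1 and a confidently wrong call scores 0, so the cut-off of 0.6 separates mostly correct posterior distributions from poor ones.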
Imputation quality of the scenario “Hole-filling with external HapMap reference”: percentages of overlapping masked genotypes imputed with a good Hellinger score (≥0.6) are presented
| ALL | 9602 | ||||||
| LQ | 8472 | ||||||
| NQ | 7923 | 93.83* | 93.15* | 90.61* | 92.90 | 92.11 | 89.32 |
| BQ | 6337 | 90.85* | 89.62* | 83.34* | 89.85 | 88.61 | 82.47 |
| HQ | 4658 | 90.47* | 89.05* | 81.84* | 89.00 | 87.55 | 80.57 |
Datasets with different pre-imputation quality filtering were considered, and different percentages of genotypes were masked. Results of the optimal imputation scenarios are marked with (+). Results of filtering scenarios that are not significantly inferior to the best scenario are shown in bold italics. An asterisk (*) marks the program, MaCH or IMPUTE2, that performed significantly better in the corresponding scenario.
Imputation quality of the scenario “Entire SNP imputation using external HapMap reference”: percentages of overlapping masked genotypes imputed with a good Hellinger score (≥0.6) are presented
| ALL | 9602 | ||||||
| LQ | 8472 | ||||||
| NQ | 7923 | 94.27 | 93.82* | 91.33 | 93.50 | 91.39 | |
| BQ | 6337 | 91.69* | 90.76* | 85.08 | 91.20 | 90.18 | 84.85 |
| HQ | 4658 | 91.22* | 90.2* | 83.64* | 90.21 | 88.97 | 82.42 |
Datasets with different pre-imputation quality filtering were considered, and different percentages of genotypes were masked. Results of the optimal imputation scenarios are marked with (+). Results of filtering scenarios that are not significantly inferior to the best scenario are shown in bold italics. An asterisk (*) marks the program, MaCH or IMPUTE2, that performed significantly better in the corresponding scenario.
Software-specific quality scores for the scenario “Entire SNP imputation using external HapMap reference”: percentages of SNPs above a quality cut-off of 0.3 for both MaCH-Rsq and IMPUTE-info score are provided
| ALL | 9602 | ||||||
| LQ | 8472 | ||||||
| NQ | 7923 | ||||||
| BQ | 6337 | 96.57 | 96.57 | 92.72 | 97.43 | ||
| HQ | 4658 | 96.36 | 96.36 | 91.65 | 96.36 | ||
Results of the optimal imputation scenarios are marked with (+). Results of filtering scenarios that are not significantly inferior to the best scenario are shown in bold italics. Note that MaCH-Rsq and IMPUTE-info scores are defined differently and cannot be compared directly.
Figure 2. Pairwise comparison of the analyzed measures of imputation quality. Distribution and pairwise correlation of SEN scores obtained from MaCH (MaCH_SEN) and IMPUTE (IMPUTE2_SEN), Hellinger scores obtained from MaCH (MaCH_HELLI) and IMPUTE (IMPUTE2_HELLI), MaCH Rsq score (MaCH_Rsq) and IMPUTE-info score (IMPUTE2_INFO) are shown. We present the results for the scenario “Entire SNP imputation” without pre-filtering (“ALL”) with 50% missing SNPs. Values refer to the squared Pearson correlation.
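The values in Figure 2 are squared Pearson correlations between pairs of quality measures. As a minimal illustration of that statistic, the sketch below computes it for two lists of per-SNP scores; the function name and the example values are hypothetical and are not taken from the study.

```python
def squared_pearson(x, y):
    """Squared Pearson correlation (r^2) between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical per-SNP scores from two quality measures.
measure_a = [1.0, 2.0, 3.0]
measure_b = [1.0, 3.0, 2.0]
r2 = squared_pearson(measure_a, measure_b)
```

Because the correlation is squared, the value lies between 0 and 1 and does not show whether two measures rise together or move in opposite directions.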