Mario Mitt, Mart Kals, Kalle Pärn, Stacey B Gabriel, Eric S Lander, Aarno Palotie, Samuli Ripatti, Andrew P Morris, Andres Metspalu, Tõnu Esko, Reedik Mägi, Priit Palta.
Abstract
Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we compare the imputation accuracy achieved with reference panels from diverse populations against that achieved with a population-specific, high-coverage (30×) whole-genome sequencing (WGS) based reference panel comprising 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants were significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies.
Year: 2017 PMID: 28401899 PMCID: PMC5520064 DOI: 10.1038/ejhg.2017.51
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246
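Phasing speed and accuracy for chromosome 20 of the EGCUT WGS data
| No. of cores | Switch errors, % (count) | Running time | Switch errors, % (count) | Total running time (phasing only) | Switch errors, % (count) | Running time |
|---|---|---|---|---|---|---|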
| 1 | 0.72 (257) | 179 | 0.70 (246) | 293 (169) | 0.81 (291) | 29 |
| 2 | 0.71 (255) | 98 | 0.71 (248) | 216 (92) | 0.81 (291) | 15 |
| 4 | 0.70 (250) | 51 | 0.70 (247) | 174 (50) | 0.81 (291) | 8 |
| 8 | 0.71 (254) | 28 | 0.71 (248) | 150 (26) | 0.81 (291) | 5 |
| 16 | 0.71 (254) | 16 | 0.70 (245) | 139 (15) | 0.81 (291) | 5 |
| 24 | 0.70 (251) | 12 | 0.71 (249) | 136 (12) | 0.81 (291) | 10 |
| 32 | 0.70 (253) | 11 | 0.69 (244) | 135 (11) | 0.81 (291) | 9 |
Abbreviations: EGCUT, Estonian Genome Center, University of Tartu; PIR, phase informative read.
Phasing errors (measured as the percentage and count of switch errors out of 35 780 haplotype switches) and running times for different numbers of processor cores (1, 2, 4, 8, 16, 24 and 32).
Total running time includes the extraction of PIRs from the raw sequencing data (BAM files); haplotype-phasing time (without PIR extraction) is given in parentheses.
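The switch error percentages and the effect of adding cores can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, the denominator of 35 780 haplotype switches is taken from the table note, and the running times are compared as unitless ratios because the time unit is not stated here. Some published percentages may differ from this calculation in the last digit.

```python
TOTAL_SWITCHES = 35_780  # haplotype switches evaluated, per the table note


def switch_error_pct(n_errors: int, total: int = TOTAL_SWITCHES) -> float:
    """Switch error rate as a percentage of the evaluated haplotype switches."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * n_errors / total


def speedup(time_one_core: float, time_n_cores: float) -> float:
    """Ratio of single-core running time to multi-core running time."""
    if time_n_cores <= 0:
        raise ValueError("time_n_cores must be positive")
    return time_one_core / time_n_cores


# Counts taken from the single-core row of the table (first and last method).
print(round(switch_error_pct(257), 2))
print(round(switch_error_pct(291), 2))

# Running time on 1 core versus 32 cores for the first method in the table.
gain = speedup(179, 11)
print(round(gain, 1), round(gain / 32, 2))  # speedup and parallel efficiency
```

The second printed pair illustrates that going from 1 to 32 cores gives a sub-linear gain, which matches the flattening of running times in the lower rows of the table.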
Description of compared IRPs
| Panel | 1000G | HRC | EGCUT | EGCUT+1000G | 1000G+EGCUT |
|---|---|---|---|---|---|
| Description | 26 cohorts worldwide | 20 cohorts of mostly European ancestry | Estonian diversity panel | 1+26 cohorts worldwide | 26+1 cohorts worldwide |
| Average sequencing coverage | 7.4 × | 4–8 × | 29.8 × | 29.8 × | 7.4 × |
| MAC filter | MAC≥1 | MAC≥5 | MAC≥3 | MAC≥1 | MAC≥1 |
| No. of haplotypes | 5008 | 64 976 | 4488 | 9496 | 9496 |
| No. of autosomal SNVs | 81 027 987 | 39 235 157 | 16 536 512 | 16 536 512 | 81 027 987 |
Abbreviations: HRC, Haplotype Reference Consortium; IRP, imputation reference panel; SNV, single-nucleotide variant.
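The haplotype counts in the table follow from simple arithmetic: each diploid individual contributes two haplotypes, and a merged panel holds the sum of its components' haplotypes. The Python sketch below reproduces that arithmetic. The function names are hypothetical, and the figure of 2244 Estonian individuals comes from the abstract. It also shows the lowest MAF that a minor allele count (MAC) filter admits for a given panel size.

```python
def haplotype_count(n_individuals: int) -> int:
    """Each diploid individual contributes two haplotypes."""
    return 2 * n_individuals


def min_maf_pct(mac_threshold: int, n_haplotypes: int) -> float:
    """Lowest MAF (in percent) that passes a MAC >= mac_threshold filter."""
    if n_haplotypes <= 0:
        raise ValueError("n_haplotypes must be positive")
    return 100.0 * mac_threshold / n_haplotypes


egcut_haplotypes = haplotype_count(2244)       # Estonian panel
combined_haplotypes = 5008 + egcut_haplotypes  # merged with the 5008-haplotype panel
egcut_floor = min_maf_pct(3, egcut_haplotypes)  # MAC >= 3 filter from the table

print(egcut_haplotypes, combined_haplotypes, round(egcut_floor, 3))
```

Under these numbers, the MAC≥3 filter means the Estonian panel only carries variants seen at a frequency of roughly 0.07% or more within the panel.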
Figure 1. Number of variants imputed from different IRPs. (a) Number of all shared and panel-specific variants in three distinct reference panels imputed with INFO-value >0.4 (in bold) and >0.8 (given in brackets); (b) Total number of imputed SNVs (bars); the number of SNVs imputed with imputation quality score (INFO-value)>0.4 (coloured) and INFO>0.8 (shaded areas).
Figure 2. Number of common (MAF≥5%), low-frequency (0.5≤MAF<5%) and rare (MAF<0.5%) variants imputed from different IRPs. (a) Number of well-imputed SNVs (imputed with imputation confidence INFO>0.4); and (b) number of confidently imputed SNVs (imputed with imputation confidence INFO>0.8).
Figure 3. Number of common (MAF≥5%), low-frequency (0.5≤MAF<5%) and rare (MAF<0.5%) LoF (a) and missense (b) variants imputed from different IRPs with INFO-value>0.4 (bars) and INFO-value>0.8 (shaded areas).
Figure 4. Imputation accuracy for common (MAF≥5%), low-frequency (0.5≤MAF<5%) and rare (MAF<0.5%) well-imputed variants (INFO>0.4) imputed from different IRPs. (a) Non-reference (NR) sensitivity: the proportion of whole-exome sequencing (WES) based NR variant calls that were also retrieved through the imputation process. (b) NR discordancy rate: the proportion of NR variants that were retrieved through the imputation process but had incorrect genotype calls as compared to the WES genotypes.
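The frequency classes and INFO thresholds used throughout the figure captions can be written down as two small classification functions. This is a minimal Python sketch with hypothetical function names and labels. It assumes MAF is given as a fraction between 0 and 0.5 and that the INFO thresholds are strict inequalities, as written in the captions.

```python
def maf_class(maf: float) -> str:
    """Assign a variant to the frequency class used in the captions."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf >= 0.05:
        return "common"
    if maf >= 0.005:
        return "low-frequency"
    return "rare"


def info_label(info: float) -> str:
    """Label a variant by imputation confidence (INFO-value)."""
    if info > 0.8:
        return "confidently imputed"
    if info > 0.4:
        return "well imputed"
    return "below threshold"


# Hypothetical variants given as (MAF, INFO) pairs, for illustration only.
for maf, info in [(0.12, 0.95), (0.01, 0.6), (0.002, 0.3)]:
    print(maf_class(maf), "/", info_label(info))
```

Note that under the strict thresholds a variant with INFO exactly 0.8 counts as well imputed but not as confidently imputed.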
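Genotype concordance of well-imputed SNVs (INFO>0.4)
| IRP | Common: NR sensitivity | Common: NR discordancy | Low-frequency: NR sensitivity | Low-frequency: NR discordancy | Rare: NR sensitivity | Rare: NR discordancy |
|---|---|---|---|---|---|---|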
| 1000G | 88.5% (24.3) | 3.4% (22.0) | 75.9% (2.4) | 14.0% (2.1) | 39.9% (0.7) | 24.7% (0.4) |
| HRC | 89.4% (24.1) | 2.1% (21.9) | 77.8% (2.4) | 8.2% (2.0) | 41.9% (0.7) | 17.0% (0.4) |
| EGCUT | 91.4% (24.3) | 1.9% (22.5) | 87.2% (2.4) | 6.1% (2.2) | 48.6% (0.7) | 14.1% (0.4) |
| EGCUT+1000G | 91.5% (24.3) | 2.1% (22.6) | 87.2% (2.4) | 6.3% (2.2) | 49.0% (0.7) | 13.6% (0.4) |
| 1000G+EGCUT | 92.4% (24.3) | 2.2% (22.8) | 87.1% (2.4) | 6.5% (2.2) | 49.9% (0.7) | 14.3% (0.4) |
Abbreviations: EGCUT, Estonian Genome Center, University of Tartu; HRC, Haplotype Reference Consortium; IRP, imputation reference panel; MAF, minor allele frequency; NR, non-reference; SNV, single-nucleotide variant; WES, whole-exome sequencing.
The ‘best guess’ genotype calls obtained with different IRPs were compared to the WES data, treating the WES-based genotype calls as the ‘gold standard’. Imputation sensitivity (the proportion of WES-based NR variant calls that were also obtained through the imputation process) and discordancy rate (the proportion of NR variant calls that were obtained through the imputation process but had incorrect genotype calls) were calculated.
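The two metrics defined above can be sketched in Python as follows. This is one reading of the definitions, not the authors' implementation. It assumes genotypes are coded as non-reference allele counts (0, 1 or 2), that a WES call is non-reference when its count is above 0, that a call is "obtained through imputation" when the imputed best-guess count is also above 0, and that a retrieved call is discordant when the two counts differ. The function name and the example data are hypothetical.

```python
from typing import Sequence, Tuple


def nr_metrics(wes: Sequence[int], imputed: Sequence[int]) -> Tuple[float, float]:
    """Return (NR sensitivity, NR discordancy rate) for paired genotype calls.

    Genotypes are non-reference allele counts: 0, 1 or 2 (an assumption).
    """
    if len(wes) != len(imputed):
        raise ValueError("wes and imputed must have the same length")

    nr_total = retrieved = discordant = 0
    for truth, call in zip(wes, imputed):
        if truth == 0:
            continue  # only WES non-reference calls are evaluated
        nr_total += 1
        if call > 0:
            retrieved += 1  # imputation also reports a non-reference call
            if call != truth:
                discordant += 1  # retrieved, but with the wrong genotype

    sensitivity = retrieved / nr_total if nr_total else float("nan")
    discordancy = discordant / retrieved if retrieved else float("nan")
    return sensitivity, discordancy


# Hypothetical calls at six sites for one individual.
wes_calls = [0, 1, 2, 1, 0, 1]
imputed_calls = [0, 1, 1, 0, 1, 1]
sens, disc = nr_metrics(wes_calls, imputed_calls)
print(sens, disc)
```

In this toy example, four sites are non-reference in WES, three of them are retrieved by imputation, and one of the three has a mismatching genotype.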