Quan Sun, Weifang Liu, Jonathan D Rosen, Le Huang, Rhonda G Pace, Hong Dang, Paul J Gallins, Elizabeth E Blue, Hua Ling, Harriet Corvol, Lisa J Strug, Michael J Bamshad, Ronald L Gibson, Elizabeth W Pugh, Scott M Blackman, Garry R Cutting, Wanda K O'Neal, Yi-Hui Zhou, Fred A Wright, Michael R Knowles, Jia Wen, Yun Li.
Abstract
Cystic fibrosis (CF) is a severe genetic disorder that can cause multiple comorbidities affecting the lungs, the pancreas, the luminal digestive system and beyond. In our previous genome-wide association studies (GWAS), we genotyped approximately 8,000 CF samples using a mixture of different genotyping platforms. More recently, the Cystic Fibrosis Genome Project (CFGP) performed deep (approximately 30×) whole genome sequencing (WGS) of 5,095 samples to better understand the genetic mechanisms underlying clinical heterogeneity among patients with CF. For mixtures of GWAS array and WGS data, genotype imputation has proven effective in increasing effective sample size. Therefore, we first performed imputation for the approximately 8,000 CF samples with GWAS array genotypes using the Trans-Omics for Precision Medicine (TOPMed) freeze 8 reference panel. Our results demonstrate that TOPMed can provide high-quality imputation for patients with CF, boosting genomic coverage from approximately 0.3-4.2 million genotyped markers to approximately 11-43 million well-imputed markers, and significantly improving polygenic risk score (PRS) prediction accuracy. Furthermore, we built a CF-specific CFGP reference panel based on WGS data of patients with CF. We demonstrate that despite having approximately 3% of the sample size of TOPMed, our CFGP reference panel can still outperform TOPMed when imputing some CF disease-causing variants, likely owing to allele and haplotype differences between patients with CF and general populations. We anticipate that our imputed data for 4,656 samples without WGS data will benefit our subsequent genetic association studies, and that the CFGP reference panel built from CF WGS samples will benefit other investigators studying CF.
Keywords: cystic fibrosis; genotype imputation; mendelian disease; polygenic risk score
Year: 2022 PMID: 35128485 PMCID: PMC8804187 DOI: 10.1016/j.xhgg.2022.100090
Source DB: PubMed Journal: HGG Adv ISSN: 2666-2477
Figure 4. Illustration of the impact of imputation on PRS construction. (A) Imputation performed in target cohorts. We started with four independent discovery cohorts (I–III are TOPMed imputed data, IV is WGS data), performed association analysis for each subset separately, and then meta-analyzed the association results. The meta-GWAS summary statistics were then used to construct a PRS using the P+T method. The constructed PRS was applied to the same 1992 target samples at four different marker densities (highlighted in yellow): array genotype, TOPMed imputed, reduced CFGP imputed, or WGS data, to compare the benefit of imputation in the target cohort. (B) Imputation performed in discovery cohorts. We started with the same first three discovery cohorts as in A, but adopted three different marker sets (again highlighted in yellow), together with a fourth independent WGS cohort. We then performed association analysis and meta-analysis for each marker set, and constructed three different PRSs from the three resulting sets of meta-GWAS summary statistics. The three PRSs were then applied to the same cohort to compare their performance.
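The P+T (pruning and thresholding) step named in the caption can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the thresholds `p_threshold` and `r2_threshold`, and the toy data are all assumptions, and the sketch omits the physical distance window that clumping tools normally apply.

```python
from typing import Dict, List, Sequence


def ld_r2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two dosage vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


def prune_and_threshold(
    pvalues: Dict[str, float],
    ld_reference: Dict[str, Sequence[float]],
    p_threshold: float = 0.01,   # hypothetical threshold, not from the paper
    r2_threshold: float = 0.1,   # hypothetical threshold, not from the paper
) -> List[str]:
    """Keep variants passing the p value threshold, then greedily drop any
    variant in LD (r2 >= r2_threshold) with a more significant kept variant."""
    candidates = sorted(
        (v for v, p in pvalues.items() if p <= p_threshold), key=lambda v: pvalues[v]
    )
    kept: List[str] = []
    for variant in candidates:
        if all(ld_r2(ld_reference[variant], ld_reference[k]) < r2_threshold for k in kept):
            kept.append(variant)
    return kept


def polygenic_score(dosages: Dict[str, float], betas: Dict[str, float], kept: List[str]) -> float:
    """Weighted sum of dosages over the retained variants; missing variants contribute 0."""
    return sum(betas[v] * dosages.get(v, 0.0) for v in kept)


# Toy data: B is in perfect LD with A, C is uncorrelated with A, D fails the threshold.
toy_pvalues = {"A": 1e-8, "B": 1e-6, "C": 1e-3, "D": 0.5}
toy_reference = {
    "A": [0, 1, 2, 1, 0, 2],
    "B": [0, 1, 2, 1, 0, 2],
    "C": [1, 2, 1, 0, 1, 1],
    "D": [2, 2, 1, 0, 0, 1],
}
toy_betas = {"A": 0.5, "B": 0.4, "C": -0.2, "D": 0.1}
toy_kept = prune_and_threshold(toy_pvalues, toy_reference)
toy_score = polygenic_score({"A": 2, "B": 2, "C": 1}, toy_betas, toy_kept)
```

Because a variant absent from the target data contributes nothing to the score, the marker density of the target cohort directly limits how much of the PRS can be evaluated, which is the comparison panel A sets up.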
Figure 3. Histograms of mean true R2 difference and proportion of variants better imputed by reduced CFGP than TOPMed, across 2,872 non-overlapping 1-Mb regions. For each variant, we calculated the true R2 difference between the two reference panels as reduced-CFGP true R2 minus TOPMed true R2. We then summarized the variant-level differences at the 1-Mb region level using two statistics: mean difference of true R2 (A) and proportion of variants better imputed by reduced CFGP (B).
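The region-level summary described in this caption can be sketched as below. It is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, regions are assumed to start at position 0 within one chromosome, and a variant counts as "better imputed" only when its reduced-CFGP true R2 is strictly larger.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_regions(
    variants: Iterable[Tuple[int, float, float]],
    region_size: int = 1_000_000,
) -> Dict[int, Tuple[float, float]]:
    """Map region index -> (mean R2 difference, proportion better with reduced CFGP).

    Each input tuple is (position, reduced-CFGP true R2, TOPMed true R2).
    """
    diffs_by_region = defaultdict(list)
    for position, r2_cfgp, r2_topmed in variants:
        diffs_by_region[position // region_size].append(r2_cfgp - r2_topmed)

    summary = {}
    for region, diffs in sorted(diffs_by_region.items()):
        mean_diff = sum(diffs) / len(diffs)
        prop_better = sum(1 for d in diffs if d > 0) / len(diffs)
        summary[region] = (mean_diff, prop_better)
    return summary


# Toy input: two variants in the first 1-Mb region, one in the second.
toy_summary = summarize_regions(
    [(100, 0.9, 0.8), (500_000, 0.5, 0.7), (1_200_000, 0.6, 0.6)]
)
```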
Numbers of well-imputed variants by different MAF categories for the seven GWAS arrays (genome wide)
| Illumina panel | Number of samples | Number of samples-by-site | Number (% of all imputed variants) | Number (% of all imputed variants) | Number (% of imputed variants with MAF <0.5%) | Number (% of imputed variants with MAF <5%) | Number (% of imputed variants with MAF ≥5%) |
|---|---|---|---|---|---|---|---|
| 300 K | 144 | FrGMC 1,300 | 17,603,215 (5.73%) | 12,248,616 (3.99%) | 3,897,584 (1.31%) | 6,738,025 (2.24%) | 5,510,591 (88.02%) |
| 370 K | 145 | | 14,471,514 (4.71%) | 11,156,390 (3.63%) | 2,533,058 (0.85%) | 5,519,937 (1.83%) | 5,636,453 (90.49%) |
| 660 K | 1,011 | | 30,661,930 (9.99%) | 20,830,921 (6.79%) | 11,883,847 (4.01%) | 15,138,988 (5.03%) | 5,691,933 (93.95%) |
| 610-Quad | 3,840 | CGS 1,533; GMS 1,467; TSS 840 | 58,672,809 (19.12%) | 43,095,581 (14.04%) | 33,399,492 (11.26%) | 37,276,108 (12.39%) | 5,819,473 (96.22%) |
| 660W-set1 | 2,012 | CGS 342; GMS 808; TSS 862 | 43,832,169 (14.28%) | 34,503,481 (11.24%) | 24,694,173 (8.33%) | 28,669,926 (9.53%) | 5,833,555 (96.33%) |
| 660W-set2 | 444 | TSS 444 | 23,814,328 (7.76%) | 20,792,798 (6.77%) | 10,176,358 (3.43%) | 14,916,691 (4.96%) | 5,876,107 (96.98%) |
| Omni5 | 374 | CGS 73; GMS 170; TSS 131 | 20,774,826 (6.83%) | 18,862,492 (6.20%) | 10,530,015 (3.55%) | 14,053,383 (4.68%) | 4,809,109 (97.65%) |
Corvol et al 2015.
Percentage taken over total number of imputed variants from TOPMed freeze 8 reference panel.
Percentage taken over imputed variants with MAF of <0.5%.
Percentage taken over imputed variants with MAF of <5%.
Percentage taken over imputed variants with MAF of ≥5%.
True R2 for the two arrays with the largest sample sizes (chr20)
| Illumina panel | MAC/MAF | Number of non-NA-R2 variants | Mean true R2 | Median true R2 | Total number of variants |
|---|---|---|---|---|---|
| 610-Quad (n = 1992) | MAC <10 | 311,625 | 0.93 | 1.00 | 377,397 |
| | MAF <0.5% | 440,489 | 0.93 | 1.00 | 508,198 |
| | MAF 0.5%–5% | 85,270 | 0.93 | 0.96 | 85,278 |
| | MAF >5% | 120,991 | 0.98 | 1.00 | 120,998 |
| 660W-set1 (n = 941) | MAC <10 | 229,286 | 0.96 | 1.00 | 299,329 |
| | MAF <0.5% | 356,643 | 0.95 | 1.00 | 430,073 |
| | MAF 0.5%–5% | 85,195 | 0.94 | 0.97 | 85,201 |
| | MAF >5% | 121,013 | 0.98 | 1.00 | 121,019 |
MAC, minor allele count; MAF, minor allele frequency.
True R2 is NA for variants that are monomorphic in either the true or the imputed genotypes, because the Pearson correlation between a constant and a vector is not defined. Some variants may be monomorphic in the 1992-sample subset but not in the full 3840 samples.
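The NA behavior described in this footnote can be illustrated with a short sketch. It assumes that true R2 is the squared Pearson correlation between sequenced (true) genotypes and imputed dosages, which the footnote implies but does not spell out; the function name `true_r2` is hypothetical.

```python
import math
from typing import Sequence


def true_r2(true_genotypes: Sequence[float], imputed_dosages: Sequence[float]) -> float:
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Returns NaN when either vector is constant (monomorphic), because the
    correlation is undefined in that case.
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_i = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_i) for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_i = sum((d - mean_i) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_i == 0:
        return math.nan
    return cov * cov / (var_t * var_i)


perfect = true_r2([0, 1, 2], [0, 1, 2])                    # identical vectors
noisy = true_r2([0, 1, 2, 1], [0.1, 0.9, 1.8, 1.1])        # close but imperfect dosages
monomorphic = true_r2([0, 0, 0, 0], [0.0, 0.1, 0.0, 0.2])  # no variation in the truth
```

Such NaN values would be excluded before averaging, which is consistent with the table reporting the number of non-NA-R2 variants separately from the total.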
Heterozygous concordance for extremely rare variants (chr20)
| Illumina panel | Number of samples | Number of non-NA het concordant variants | Mean het concordant (freq) | Median het concordant (freq) | Total number of variants |
|---|---|---|---|---|---|
| 610-Quad | 1992 | 212,759 | 0.98 | 1.00 | 296,088 |
| 660W-set1 | 941 | 289,811 | 0.97 | 1.00 | 374,166 |
Figure 1. Imputation concordance for F508del using the TOPMed and reduced CFGP reference panels. The true R2 values for the TOPMed and reduced CFGP imputed results are 0.835 and 0.926, and the sums of squared errors are 117.58 and 82.42, respectively. TOPMed performs slightly worse mainly because it tends to underestimate the deletion frequency.
Figure 2. Histograms of the difference between reduced CFGP true R2 and TOPMed true R2, comparing the imputation quality of the two reference panels. (A) Overall chr7. Almost all variants fall in the left half, meaning that TOPMed is predominantly better than the reduced CFGP reference panel. (B) CFTR region only. The advantage of the TOPMed reference panel over the reduced CFGP panel becomes less pronounced.
Examples of variants that are much better imputed with reduced CFGP.
| Variant (hg38) | chr7:117480621:T:C | chr7:117509047:G:T | chr7:117559471:T:C | chr7:117587738:G:A | chr7:117656113:C:T |
|---|---|---|---|---|---|
| rsIDs | rs1244070394 | rs77284892 | rs139573311 | rs76713772 | rs893051013 |
| CFGP true R2 | 0.9934 | 0.9968 | 0.9703 | 0.9837 | 0.9423 |
| TOPMed true R2 | 0.5490 | 0.3333 | 2.52 × 10⁻⁷ | 0.7799 | 0.5010 |
| CF5095 AC | 6 | 21 | 8 | 115 | 21 |
| CF5095 AF | 5.89 × 10⁻⁴ | 2.06 × 10⁻³ | 7.85 × 10⁻⁴ | 0.0113 | 2.06 × 10⁻³ |
| TOPMed8 AC | 3 | 3 | 2 | 20 | 6 |
| TOPMed8 AF | 1.13 × 10⁻⁵ | 1.13 × 10⁻⁵ | 7.56 × 10⁻⁶ | 7.56 × 10⁻⁵ | 2.27 × 10⁻⁵ |
| CADD phred score | 0.809 | 38 | 25.8 | 29.1 | 1.097 |
| VEP annotation | intron | stop gain | missense | splice acceptor | intron |
| CF-disease causing | no | yes | yes | yes | no |
| CFTR mutation | c.53+474T>C | c.178G>A | c.1400T>C | c.1585-1G>A | c.3963+3182C>T |
AC, allele count; AF, allele frequency.
The middle three variants have very high CADD phred scores and are disease-causing variants, yet their TOPMed imputation quality is unsatisfactory, which shows the value of our CF-specific reference panel.
According to cftr2.org.
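As a quick consistency check on the table above, the CF5095 AF row can be reproduced from the CF5095 AC row. This is a small sketch assuming each allele frequency is the allele count divided by twice the 5,095 sequenced samples (diploid autosomal genotypes with no missing calls); the function name is hypothetical.

```python
def allele_frequency(allele_count: int, n_samples: int, ploidy: int = 2) -> float:
    """Allele frequency as allele count over the total number of chromosomes."""
    return allele_count / (ploidy * n_samples)


N_CF_SAMPLES = 5095  # CFGP WGS sample size stated in the abstract

# CF5095 allele counts from the table, in column order.
cf_allele_counts = [6, 21, 8, 115, 21]
cf_allele_frequencies = [allele_frequency(ac, N_CF_SAMPLES) for ac in cf_allele_counts]

for ac, af in zip(cf_allele_counts, cf_allele_frequencies):
    print(f"AC={ac:>3}  AF={af:.3e}")
```

Under this assumption the computed values agree with the tabulated CF5095 AF entries to the precision shown, while the corresponding TOPMed8 frequencies are one to two orders of magnitude lower.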
PRS performance when applied to UW samples
| | Without imputation | TOPMed imputation | CFGP imputation |
|---|---|---|---|
| Correlation between PRS and KNoRMA | 0.0455 | 0.0779 | 0.0496 |
| p value for the correlation | 0.1191 | 0.0075 | 0.0890 |
| Two-sample t test p value | 0.7121 | 0.0380 | 0.0065 |
Two PRS formulae were applied to the 1397 UW samples. As detailed in Methods Section B, both PRS formulae were constructed from the same 6112 patients, one without imputation and the other aided by imputation. The two-sample t test p value comes from a two-sample t test comparing the true KNoRMA values of samples with the top and bottom 5% of PRS values, under either the PRS formula without imputation or the TOPMed/CFGP imputation-aided one, to assess how well each PRS separates samples by KNoRMA score. Our results show that the imputation-aided PRS yields better prediction (a higher and more significant correlation with KNoRMA) and a better ability to stratify patients.
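The two evaluation statistics described above can be sketched as follows. This is an illustration and not the authors' code: the function names are hypothetical, the data are synthetic, and the sketch uses the Welch (unequal variance) form of the t statistic, whereas the source does not say which variant of the two-sample t test was used. It returns only the t statistic; converting it to a p value requires the t distribution, for example via `scipy.stats.ttest_ind`.

```python
import math
from statistics import mean, variance
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between a PRS vector and a phenotype vector."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def tail_t_statistic(prs: Sequence[float], phenotype: Sequence[float], fraction: float = 0.05) -> float:
    """Welch t statistic comparing phenotypes of the top vs bottom `fraction` of PRS values."""
    n = len(prs)
    k = max(2, int(n * fraction))  # need at least 2 samples per tail for a variance
    if 2 * k > n:
        raise ValueError("not enough samples for two non-overlapping tails")
    order = sorted(range(n), key=lambda i: prs[i])
    bottom = [phenotype[i] for i in order[:k]]
    top = [phenotype[i] for i in order[-k:]]
    standard_error = math.sqrt(variance(top) / k + variance(bottom) / k)
    return (mean(top) - mean(bottom)) / standard_error


# Synthetic example: the phenotype tracks the PRS with alternating +/-1 noise.
toy_prs = [float(i) for i in range(100)]
toy_phenotype = [p + (1.0 if i % 2 else -1.0) for i, p in enumerate(toy_prs)]
toy_r = pearson_r(toy_prs, toy_phenotype)
toy_t = tail_t_statistic(toy_prs, toy_phenotype)
```

Comparing only the tails asks whether the PRS can separate extreme patients, which complements the overall correlation reported in the first row of the table.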