Linduni M Rodrigo, Dale R Nyholt.
Abstract
Given that improved imputation software and high-coverage whole genome sequence (WGS)-based haplotype reference panels now enable inexpensive approximation of WGS genotype data, we hypothesised that WGS-based imputation and analysis of existing ExomeChip-based genome-wide association (GWA) data would identify novel intronic and intergenic single nucleotide polymorphism (SNP) effects associated with complex disease risk. In this study, we reanalysed a Parkinson's disease (PD) dataset comprising 5540 cases and 5862 controls genotyped using the ExomeChip-based NeuroX array. After genotype imputation and extensive quality control, GWA analysis was performed using PLINK and a recently developed machine learning approach (GenEpi) to identify novel, conditional and joint genetic effects associated with PD. In addition to improved validation of previously reported loci, we identified five novel genome-wide significant loci associated with PD: three (rs137887044, rs78837976 and rs117672332) with 0.01 < MAF < 0.05, and two (rs187989831 and rs12100172) with MAF < 0.01. Conditional analysis within genome-wide significant loci revealed four loci (p < 1 × 10⁻⁵) with multiple independent risk variants, while GenEpi analysis identified SNP–SNP interactions in seven genes. In addition to identifying novel risk loci for PD, these results demonstrate that WGS-based imputation and analysis of existing exome genotype data can identify novel intronic and intergenic SNP effects associated with complex disease risk.
Keywords: GWAS; Parkinson’s disease; SNP–SNP interactions; genotype imputation; machine learning
Year: 2021 PMID: 34064523 PMCID: PMC8147919 DOI: 10.3390/genes12050689
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Composition of the genotype dataset before and after imputation.
| Region | MAF | SNPs before (n) | SNPs after (n) | Before (% of total) | After (% of total) |
|---|---|---|---|---|---|
| Exonic | <0.05 | 68,026 | 62,829 | 61.56 | 4.29 |
| | ≥0.05 | 12,581 | 38,282 | 11.39 | 2.61 |
| | >0 | 80,607 | 101,111 | 72.95 | 6.90 |
| Intronic | <0.05 | 2385 | 393,633 | 2.16 | 26.85 |
| | ≥0.05 | 12,085 | 378,058 | 10.94 | 25.79 |
| | >0 | 14,470 | 771,691 | 13.10 | 52.64 |
| Intergenic | <0.05 | 3181 | 275,900 | 2.88 | 18.82 |
| | ≥0.05 | 12,246 | 317,236 | 11.08 | 21.64 |
| | >0 | 15,427 | 593,136 | 13.96 | 40.46 |
| Total | <0.05 | 73,592 | 732,362 | 66.60 | 49.96 |
| | ≥0.05 | 36,912 | 733,576 | 33.40 | 50.04 |
| | >0 | 110,504 | 1,465,938 | 100 | 100 |
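The percentage columns in the composition table can be reproduced from the SNP counts: each count is divided by the total number of SNPs in the same dataset (110,504 before imputation, 1,465,938 after). The following Python sketch illustrates that arithmetic for the per-MAF rows; the function name `percent_of_total` and the dictionary layout are illustrative assumptions, not part of the original analysis.

```python
# SNP counts copied from the composition table: (before, after) imputation.
COUNTS = {
    ("Exonic", "<0.05"): (68_026, 62_829),
    ("Exonic", ">=0.05"): (12_581, 38_282),
    ("Intronic", "<0.05"): (2_385, 393_633),
    ("Intronic", ">=0.05"): (12_085, 378_058),
    ("Intergenic", "<0.05"): (3_181, 275_900),
    ("Intergenic", ">=0.05"): (12_246, 317_236),
}


def percent_of_total(counts):
    """Return {row: (pct_before, pct_after)} as unrounded percentages."""
    total_before = sum(before for before, _ in counts.values())
    total_after = sum(after for _, after in counts.values())
    return {
        row: (100 * before / total_before, 100 * after / total_after)
        for row, (before, after) in counts.items()
    }


shares = percent_of_total(COUNTS)
for (region, maf), (pct_before, pct_after) in shares.items():
    print(f"{region:<10} MAF {maf:<6} {pct_before:6.2f}% -> {pct_after:6.2f}%")
```

Read this way, the table shows the dataset shifting from mostly rare exonic variants before imputation to mostly intronic and intergenic variants afterwards.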
Summary of the additional genome-wide significant loci identified after imputation.
| SNP | CHR | BP | Nearest gene | EA | NEA | EAF | MAF | OR (95% CI) | p-value |
|---|---|---|---|---|---|---|---|---|---|
| rs187989831 | 2 | 95,560,505 | TEKT4 | C | G | 0.005 | 0.005 | 0.001 (3 × 10⁻⁵–0.004) | 7.56 × 10⁻¹⁰ |
| rs137887044 | 5 | 76,912,498 | WDR41 | C | T | 0.972 | 0.028 | 1.85 (1.49–2.27) | 2.41 × 10⁻⁸ |
| rs78837976 | 7 | 100,647,511 | MUC12 | C | T | 0.989 | 0.011 | 16.67 (9.09–33.33) | 2.98 × 10⁻¹⁸ |
| rs74125032 | 13 | 111,329,589 | CARS2 | T | C | 0.002 | 0.002 | 4.5 × 10⁻¹⁰ (6 × 10⁻¹³–4 × 10⁻⁷) | 2.36 × 10⁻¹⁰ |
| rs117672332 | 17 | 3,606,117 | ITGAE/HASPIN | T | C | 0.989 | 0.011 | 7.69 (4.76–12.5) | 2.20 × 10⁻¹⁵ |
| rs983361 | 4 | 90,761,944 | | T | G | 0.217 | 0.217 | 0.820 (0.77–0.88) | 6.29 × 10⁻⁹ |
| rs7221167 | 17 | 43,933,307 | | C | T | 0.396 | 0.396 | 0.848 (0.80–0.90) | 3.08 × 10⁻⁸ |
CHR = chromosome; BP = base position in GRCh37 (hg19); OR = odds ratio; EA = effect allele; NEA = non-effect allele; EAF = effect allele frequency; OR (95% CI) and p-value = odds ratio (95% confidence interval) and p-value from association analyses.
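The EAF and MAF columns are related by MAF = min(EAF, 1 − EAF); for example, rs137887044 has EAF 0.972 and MAF 0.028. When the effect allele is the major allele, the reported per-allele OR refers to that major allele, and its reciprocal gives the OR per copy of the minor allele. The Python sketch below illustrates this conversion; the function name `minor_allele_summary` is hypothetical, and the reciprocal rule assumes a per-allele (additive) odds ratio.

```python
def minor_allele_summary(eaf, odds_ratio):
    """Return (MAF, OR per copy of the minor allele).

    Assumes `odds_ratio` is a per-allele OR for the effect allele, so that
    switching the coded allele simply inverts the OR.
    """
    if eaf <= 0.5:
        # The effect allele is already the minor allele.
        return round(eaf, 3), odds_ratio
    # The effect allele is the major allele: flip frequency and OR.
    return round(1.0 - eaf, 3), 1.0 / odds_ratio


# rs137887044 from the table: EA = C, EAF = 0.972, OR = 1.85.
maf, or_minor = minor_allele_summary(0.972, 1.85)
print(f"MAF = {maf}, OR per minor allele = {or_minor:.2f}")
```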
Figure 1. (a) Manhattan plot and (b) Q–Q plot of the genome-wide association analysis. The Manhattan plot shows the −log₁₀ p-values against chromosome position. All genome-wide (GW) significant SNPs are depicted in red, and the nearest gene to the most significant variant in each locus is labelled. The Q–Q plot shows the expected −log₁₀ p-values under the null hypothesis on the x-axis and the observed −log₁₀ p-values on the y-axis. λ is a measure of genomic inflation (median observed χ² test statistic divided by the median expected χ² test statistic under the null hypothesis).
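The genomic inflation factor λ defined in the Figure 1 caption can be computed directly from a list of association p-values. The Python sketch below is a minimal illustration using only the standard library; it assumes each p-value comes from a 1-degree-of-freedom test, so that the χ² statistic is the square of the corresponding standard normal quantile. The function names are hypothetical and this is not the authors' code.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def chi2_1df_from_p(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile; avoids 1 - p rounding to 1
    return z * z


def genomic_inflation(p_values):
    """lambda = median observed chi-square / median expected chi-square under the null."""
    observed = median(chi2_1df_from_p(p) for p in p_values)
    expected = chi2_1df_from_p(0.5)  # median of a 1-df chi-square, about 0.455
    return observed / expected


# Evenly spaced p-values mimic the null distribution, so lambda is close to 1.
null_like = [(i + 0.5) / 1001 for i in range(1001)]
print(round(genomic_inflation(null_like), 3))
```

Values of λ well above 1 indicate that the observed test statistics are systematically larger than expected under the null.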
Figure 2. LocusZoom plots of novel genome-wide significant PD loci. (a) rs187989831 near TEKT4 on 2q11.1, (b) rs137887044 near WDR41 on 5q14.1, (c) rs74125032 in CARS2 on 13q34, (d) rs117672332 in ITGAE/HASPIN on 17p13.2, and (e) rs78837976 in MUC12 on 7q22.1. Association significance with PD is shown as −log₁₀ p-values on the left y-axis. The most significant SNP is represented by a purple diamond. All other SNPs are shown as circles, colour coded according to the strength of their LD with the most significant SNP (LD measured using the European 1000 Genomes data).
Secondary association signals from conditional analysis.
| Secondary SNP | CHR | BP | Nearest gene | EA | EAF | Index SNP | r | OR | p-value | ORcond | p-valuecond |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs112344141 | 1 | 154,983,036 | | G | 0.0491 | rs35749011 | 0.001 | 1.3142 | 7.93 × 10⁻⁴ | 1.3337 | 4.18 × 10⁻⁴ |
| rs113319394 | 2 | 95,555,635 | | C | 0.0045 | rs187989831 | 0.985 | 1.64 × 10⁻¹¹ | 2.94 × 10⁻⁵ | 7.76 × 10⁻¹² | 2.05 × 10⁻⁵ |
| rs181580861 | 4 | 958,812 | | G | 0.0013 | rs34311866 | 0.0003 | 6.4117 | 4.88 × 10⁻³ | 7.0166 | 3.43 × 10⁻³ |
| rs3806789 | 4 | 90,759,556 | | C | 0.4951 | rs356182 | 0.174 | 0.9373 | 2.10 × 10⁻² | 0.8265 | 9.13 × 10⁻¹⁰ |
| rs72765119 | 5 | 76,363,276 | | G | 0.2345 | rs137887044 | 0.0003 | 1.1172 | 2.90 × 10⁻³ | 1.1135 | 3.92 × 10⁻³ |
| rs28645997 | 7 | 100,352,470 | | G | 0.4134 | rs78837976 | 9.87 × 10⁻⁵ | 1.0935 | 2.06 × 10⁻³ | 1.0870 | 4.26 × 10⁻³ |
| rs74125084 | 13 | 111,372,680 | | T | 0.0058 | rs74125032 | 0.499 | 1.058 × 10⁻⁴ | 9.63 × 10⁻⁸ | 1.31 × 10⁻⁴ | 1.05 × 10⁻⁶ |
| rs11653889 | 17 | 3,627,456 | | A | 0.0072 | rs117672332 | 0.747 | 0.0499 | 1.13 × 10⁻¹⁴ | 0.0340 | 1.79 × 10⁻¹⁰ |
| rs3851784 | 17 | 45,040,117 | | A | 0.4376 | rs117300236 | 0.0115 | 0.8860 | 1.68 × 10⁻⁵ | 0.9081 | 6.78 × 10⁻⁴ |
Secondary SNP = secondary association single-nucleotide polymorphism; CHR = secondary SNP chromosome; BP = secondary SNP base position in GRCh37 (hg19); EA = secondary SNP effect allele; EAF = secondary SNP effect allele frequency; Index SNP = most significant SNP used to condition on; r = LD between the secondary and index SNP; OR = odds ratio and p-value = p-value for the secondary SNP from standard association analysis; ORcond = odds ratio and p-valuecond = p-value for the secondary SNP from conditional analyses.
Comparison of association results with Nalls et al. for SNPs not available in NeuroX dataset.
| SNP | CHR | BP | Nearest gene | EA | EAF | Nalls et al. discovery OR | Nalls et al. discovery p-value | Nalls et al. replication OR | Nalls et al. replication p-value | NeuroX reanalysis Imp_rsq | NeuroX reanalysis OR | NeuroX reanalysis p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs35749011 | 1 | 155,135,036 | | A | 0.017 | 1.762 | 6.09 × 10⁻²³ | 2.307 * | 7.48 × 10⁻⁹ * | 0.969 | 2.241 | 5.03 × 10⁻¹¹ |
| rs1474055 | 2 | 169,110,394 | | T | 0.128 | 1.213 | 7.12 × 10⁻¹⁶ | 1.218 * | 1.07 × 10⁻⁶ * | 0.961 | 1.241 | 2.82 × 10⁻⁷ |
| rs115185635 | 3 | 87,520,857 | | C | 0.035 | 1.789 | 2.18 × 10⁻⁸ | 0.931 * | 0.846 * | 0.999 | 0.983 | 0.802 |
| rs117896735 | 10 | 121,536,327 | | A | 0.014 | 1.767 | 1.21 × 10⁻¹¹ | 1.404 * | 1.10 × 10⁻³ * | 0.776 | 1.525 | 4.64 × 10⁻⁴ |
| rs3793947 | 11 | 83,544,472 | | A | 0.443 | 0.912 | 2.59 × 10⁻⁸ | 0.976 * | 0.201 * | 0.998 | 0.983 | 0.538 |
| rs11158026 | 14 | 55,348,869 | | T | 0.335 | 0.889 | 7.13 × 10⁻¹¹ | 0.948 | 0.039 | 0.999 | 1.048 | 0.186 |
| rs1555399 | 14 | 67,984,370 | | A | 0.468 | 0.872 | 5.53 × 10⁻¹⁶ | 0.971 * | 0.144 * | 0.902 | 1.033 | 0.239 |
| rs62120679 | 19 | 2,363,319 | | T | 0.314 | 1.141 | 2.53 × 10⁻⁹ | 0.999 * | 0.518 * | 0.919 | 1.074 | 0.031 |
| rs8118008 | 20 | 3,168,166 | | A | 0.657 | 1.111 | 2.32 × 10⁻⁸ | 1.113 * | 1.18 × 10⁻⁴ * | 0.955 | 1.120 * | 1.13 × 10⁻⁴ * |
CHR = chromosome; BP = base position in GRCh37 (hg19); EA = effect allele; EAF = effect allele frequency; OR = odds ratio and p-value = p-value of the association analysis; Imp_rsq = IMPUTE4 info score. In the replication phase of the Nalls et al. results, * indicates SNPs that failed assay design or quality control and for which a suitable proxy SNP was used (proxy rs71628662 for rs35749011; proxy rs1955337 for rs1474055; proxy rs62267708 for rs115185635; proxy rs118117788 for rs117896735; proxy rs12283611 for rs3793947; proxy rs1077989 for rs1555399; proxy rs10402629 for rs62120679; proxy rs55785911 for rs8118008). In the current study results, rs8118008 is not available in the HRC for imputation, so a perfect (r² = 1) proxy SNP, rs8125675, was selected. SNPs with divergent replication results are shown in bold.
GenEpi SNP–SNP interaction results.
| SNP1 rsID | SNP1 Chr:bp_Genotype | SNP2 rsID | SNP2 Chr:bp_Genotype | Genotype freq | Nearest gene | OR | p-value |
|---|---|---|---|---|---|---|---|
| rs11248057 | 4:906131_GG | rs11734449 | 4:921733_CC | 0.101 | | 1.412 | 4.70 × 10⁻⁷ |
| rs6599388 | 4:939087_TT | rs1051613 | 4:951179_GG | 0.096 | | 1.431 | 3.01 × 10⁻⁷ |
| rs356167 | 4:90673770_GG | rs34320254 | 4:90705606_TT | 0.478 | | 0.771 | 1.54 × 10⁻⁶ |
| rs2965400 | 7:21733475_GG | rs6461595 | 7:21758045_GG | 0.132 | | 0.750 | 3.77 × 10⁻⁶ |
| rs2521819 | 17:43543830_TC | rs7224890 | 17:43548778_GC | 0.299 | | 1.260 | 5.55 × 10⁻⁶ |
| rs34186148 | 17:43854655_CC | rs242941 | 17:43892520_CC | 0.120 | | 0.576 | 4.78 × 10⁻¹⁰ |
| rs1294776 | 17:44004442_TT | rs6503453 | 17:44062603_AA | 0.296 | | 0.798 | 9.25 × 10⁻⁶ |
| rs200403 | 17:44781143_CA | rs35937770 | 17:44808360_GG | 0.205 | | 0.752 | 1.57 × 10⁻⁷ |
Chr:bp_genotype = chromosome and base position (GRCh37 [hg19]) with the genotype of each SNP; Genotype freq = frequency in all individuals (cases and controls) for the combination of SNP1 genotype and SNP2 genotype; OR = odds ratio and p-value = p-value for the interaction for each genotype combination.
Figure 3. Genotype frequency for the combination of SNPs identified by GenEpi. (a) Frequency in cases and controls of each genotype combination of the most significant SNP–SNP interaction effect on PD; (b) frequency in cases and controls of each genotype combination of the SNP–SNP interaction effect identified in novel PD risk loci; (c–e) frequency differences in cases and controls for each genotype combination of the most significant SNP–SNP interactions in the other three independent loci. The dark-shaded cell in each panel represents the combination with the strongest effect.