| Literature DB >> 35581276 |
Zhiyu Yang1, Peristera Paschou2, Petros Drineas3.
Abstract
The emergence of genome-wide association studies (GWAS) has led to the creation of large repositories of human genetic variation, creating enormous opportunities for genetic research and worldwide collaboration. Methods that are based on GWAS summary statistics seek to leverage such records, overcoming barriers that often exist in individual-level data access while also offering significant computational savings. Such summary-statistics-based applications include GWAS meta-analysis, with and without sample overlap, and case-case GWAS. We compare performance of leading methods for summary-statistics-based genomic analysis and also introduce a novel framework that can unify usual summary-statistics-based implementations via the reconstruction of allelic and genotypic frequencies and counts (ReACt). First, we evaluate ASSET, METAL, and ReACt using both synthetic and real data for GWAS meta-analysis (with and without sample overlap) and find that, while all three methods are comparable in terms of power and error control, ReACt and METAL are faster than ASSET by a factor of at least hundred. We then proceed to evaluate performance of ReACt vs an existing method for case-case GWAS and show comparable performance, with ReACt requiring minimal underlying assumptions and being more user-friendly. Finally, ReACt allows us to evaluate, for the first time, an implementation for calculating polygenic risk score (PRS) for groups of cases and controls based on summary statistics. Our work demonstrates the power of GWAS summary-statistics-based methodologies and the proposed novel method provides a unifying framework and allows further extension of possibilities for researchers seeking to understand the genetics of complex disease.Entities:
Mesh:
Year: 2022 PMID: 35581276 PMCID: PMC9114146 DOI: 10.1038/s41598-022-12185-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1Power of fixed-effect meta-analysis with two input studies under different conditions. We compare the power of our method vs. ASSET/METAL for a significance threshold . METAL dev refers to the latest release in GitHub[28]. Two variants of ReACt are tested: Exact and Est, indicating whether the sample overlap was exactly known as part of the input or whether it was estimated from the Z-scores[28], respectively. Sample overlap indicates the number of cases and controls that were shared between two input studies, ie., a sample overlap equal to 100 means that there are 100 cases and 100 controls shared between two input studies. Total sample sizes for each input study, including the shared samples, are equal to 2000 when the sample overlap is equal to zero; 2400 when the sample overlap is equal to 100; and 4000 when the sample overlap is equal to 500. In each case, the sample is equally split to cases and controls.
Figure 2Type I error rate of fixed-effect meta-analysis with two input studies under different conditions. We compared the type I error rate of our method vs. ASSET/METAL for a significance threshold . METAL dev refers to the latest release in GitHub[28]. Two variants of ReACt are tested: Exact and Est, indicating whether the sample overlap was exactly known as part of the input or whether it was estimated from the Z-scores[28], respectively. Sample overlap indicates the number of cases and controls that were shared between two input studies, ie., a sample overlap equal to 100 means that there are 100 cases and 100 controls shared between two input studies. Total sample sizes for each input study, including the shared samples, are equal to 2000 when the sample overlap is equal to zero; 2400 when the sample overlap is equal to 100; and 4000 when the sample overlap is equal to 500. In each case, the sample is equally split to cases and controls.
Performance of fixed-effect meta-analysis on simulated data using the gcta model.
| Method | No sample overlap | 5000 sample overlap | 10,000 sample overlap | |||
|---|---|---|---|---|---|---|
| Power | Type I error | Power | Type I error | Power | Type I error | |
| ReACt (Exact) | 0.9738 | 0.8976 | 0.8757 | |||
| ReACt (Est.) | – | – | 0.9120 | 0.8794 | ||
| METAL/METAL dev | 0.9748 | 0.9111 | 0.8779 | |||
| ASSET | – | – | 0.8898 | 0.8660 | ||
Using the simulated phenotypes for UK biobank samples (50,000 cases and 250,000 controls), we compared the performance of our method vs. ASSET/METAL. We treated genome-wide significant SNPs (p-value ) as “true signals”, and reported average power and type I error rates on identifying those SNPs under the same genome-wide significance threshold for each method. METAL dev refers to the latest release in GitHub[28]. Two variants of ReACt are tested: Exact and Est, indicating whether the sample overlap was exactly known as part of the input or whether it was estimated, respectively. Sample overlap indicates the number of cases and controls that were shared between two input studies, i.e., 5000 sample overlap means that 5000 cases and 5000 controls were shared between the two studies when the split was carried out.
aWith 25,000 cases and 125,000 controls from each subset.
bOut of 27,500 cases and 127,500 controls from each subset.
cOut of 30,000 cases and 130,000 controls from each subset.
Performance of fixed-effect meta-analysis on real genotype data.
| SNP | P | Number of times the SNP had | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| No sample overlap | 500 sample overlap | 1000 sample overlap | |||||||||
| Exact | ASSET/METAL | Exact | Est. | METAL dev | ASSET | Exact | Est. | METAL dev | ASSET | ||
| rs60939828 | 2.77 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| rs17487484 | 2.61 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| rs62100766 | 1.55 | 10 | 10 | 9 | 9 | 8 | 9 | 9 | 4 | 4 | 9 |
| rs4510098 | 5.34 | 10 | 10 | 5 | 5 | 5 | 5 | 5 | 4 | 3 | 5 |
| rs1079232 | 6.69 | 2 | 2 | 5 | 4 | 3 | 5 | 3 | 2 | 2 | 3 |
| rs75056899 | 7.69 | 10 | 10 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 4 |
| rs12044988 | 7.75 | 10 | 10 | 5 | 1 | 1 | 5 | 6 | 4 | 3 | 6 |
| True positive per iteration | 6.2 | 6.2 | 4.7 | 4.2 | 4 | 4.7 | 4.7 | 3.8 | 3.6 | 4.7 | |
| False positive per iteration | 0.2 | 0.2 | 1.4 | 0.6 | 0.4 | 1.5 | 1.6 | 0.5 | 0.7 | 1.7 | |
We applied our method for fixed-effect meta-analysis to the depressive episode trait (ICD F32 Depressive episode) in UK biobank samples and compared the performance of our method vs. ASSET/METAL. SNPs with p-value strictly less than in the primary GWAS summary statistics using all samples were treated as “true signals”. In each iteration of an experiment, we split the dataset evenly into two, generated GWAS summary statistics for each subset, and meta-analyzed the summary statistics using our method and ASSET/METAL. We reported the number of times (out of ten iterations) that a “true signal” got captured using the “significance threshold” by each method under different sample overlap conditions. METAL dev refers to the latest release in GitHub[28]. Two variants of ReACt are tested: Exact and Est, indicating whether the sample overlap was exactly known as part of the input or whether it was estimated, respectively. Sample overlap indicates the number of cases and controls that were shared between two input studies, ie., 500 sample overlap means that 500 cases and 500 controls were shared between the two studies when the split was carried out. The variable P in the table indicates the p-value of the target SNP in the primary GWAS using all samples. True positive per iteration reports the average number of SNPs with p-value strictly less than in the primary GWAS that were captured in one iteration; and False positive per iteration reports the average number of extra SNPs being captured in one iteration.
aWith 9184 cases and 156,425 controls from each subset.
bOut of 9434 cases and 156,675 controls from each subset.
cOut of 9684 cases and 156,925 controls from each subset.
Performance of cc-GWAS as implemented in ReACt with different sample sizes.
| Risk | Fst | 2000 cases, 2000 controls | 5000 cases, 5000 controls | ||||
|---|---|---|---|---|---|---|---|
| Power | Type I err. | Type I err. | Power | Type I err. | Type I err. | ||
| 1.15 | 0.01 | 3.67 | 2.65 | 3.16 | 3.51 | 1.84 | 1.87 |
| 0.05 | 3.49 | 9.80 | 5.26 | 3.23 | 6.33 | 3.58 | |
| 0.1 | 2.81 | 2.43 | 5.02 | 2.85 | 1.94 | 5.21 | |
| 1.2 | 0.01 | 1.54 | 4.69 | 2.47 | 7.16 | 3.47 | 2.03 |
| 0.05 | 1.34 | 1.04 | 5.14 | 6.62 | 8.57 | 3.77 | |
| 0.1 | 1.23 | 2.33 | 5.83 | 6.03 | 1.65 | 5.27 | |
| 1.3 | 0.01 | 5.85 | 1.63 | 1.57 | 9.68 | 1.43 | 5.46 |
| 0.05 | 5.41 | 5.31 | 4.45 | 9.21 | 7.35 | 5.79 | |
| 0.1 | 4.85 | 2.63 | 6.18 | 8.71 | 1.67 | 6.84 | |
Three types of SNPs have been simulated: (i) trait differential SNPs; (ii) null SNPs; and (iii) stress SNPs. . Under each condition, we simulated individual level genotype with these three types of SNPs for N cases and N controls in each study ( and ) and generated GWAS summary statistics for each study. and generated GWAS summary statistics for each study respectively. We subsequently used the summary statistics to run cc-GWAS in ReACt. We reported the power for detecting type (i) SNPs, and false positive rates for picking up type (ii) SNPs (Type I err.) and type (iii) SNPs (Type I err.) under a significance threshold .
Comparison of genomic regions showing significant divergent genetic effects between BD and SCZ as detected by ReACt and ccGWAS by Peyrot et al.[16].
| Region | Our method (ReACt) | ccGWAS | ||||||
|---|---|---|---|---|---|---|---|---|
| CHR | Start | End | SNP | BP | SNP | BP | ||
| 1 | 50826176 | 51118253 | rs6682989 | 50826176 | – | – | 6.10 | |
| 1 | 98325796 | 98559093 | rs2660304 | 98512127 | – | – | ||
| rs6701877 | 174015259 | – | – | |||||
| 2 | 27498734 | 27752296 | rs113954968 | 27696207 | – | – | 1.10 | |
| 3 | 62563175 | 62583180 | rs1993149 | 62572944 | – | – | 8.10 | |
| 3 | 135807609 | 136597120 | rs9866687 | 94828190 | 6.55 | – | – | |
| 3 | 135807609 | 136597120 | rs7372313 | 135872958 | rs1278493 | 135814009 | ||
| 7 | 28453906 | 28484317 | rs2192303 | 28478332 | rs7790864 | 28478625 | ||
| 8 | 27406353 | 27453579 | rs11778040 | 27419807 | 5.39 | – | – | |
| 9 | 23345347 | 23362311 | rs12554512 | 23352293 | – | – | ||
| 9 | 36894685 | 36963222 | rs2039142 | 36963222 | – | – | 2.10 | |
| 10 | 353306 | 418676 | rs35198327 | 354301 | – | – | 1.10 | |
| 12 | 108596308 | 108633649 | rs3764002 | 108618630 | – | – | ||
| 12 | 110294902 | 111212762 | rs28637922 | 110819139 | – | – | ||
| 16 | 79386766 | 79463881 | rs6564668 | 79457393 | rs9319540 | 79458022 | ||
| 19 | 1812521 | 1866427 | rs1054972 | 1852582 | 6.43 | – | – | |
| rs6095394 | 47625544 | rs11696888 | 47753265 | |||||
We carried out cc-GWAS with ReACt using summary statistics of BD and SCZ and compared our results with the results from Peyrot et al. Only SNPs that are analyzed in both studies are included for the comparison. Genomic regions that are identified to show significant divergent genetic effects between BD and SCZ in either result are shown. CHR, Start and End are chromosomal and base-pair ranges for the region; SNP, BP and p-value (ordinary least squares p-values, , for ccGWAS by Peyrot et al.) are properties of the leading SNP (if the regions is reported genome-wide significant) or statistics for the matching SNP (if the region is not reported as genome-wide significant, but is detected by the other method); p-values in bold are leading SNPs that are reported genome-wide significant by each method; Regions with CHR, Start and End in bold are two loci that were also identified by the case-case GWAS using individual level data[34].
Estimated and real group mean and standard deviation of PRS for a synthetic target population.
| Risk | Group | Our Method (ReACt) | PRSice2 | ||
|---|---|---|---|---|---|
| Est. group mean | Est. group sd | Real group mean | Real group sd | ||
| 1.15 | Cases | 0.0009 | 0.0078 | 0.0009 | 0.0076 |
| Controls | − 0.0037 | 0.0078 | − 0.0036 | 0.0081 | |
| 1.2 | Cases | 0.0016 | 0.0060 | 0.0016 | 0.0059 |
| Controls | − 0.0065 | 0.0060 | − 0.0064 | 0.0061 | |
| 1.3 | Cases | 0.0021 | 0.0041 | 0.0021 | 0.0040 |
| Controls | − 0.0125 | 0.0041 | − 0.0125 | 0.0040 | |
We compared group mean and standard deviation of PRS estimated by ReACt from summary statistics of synthetic base and target studies to the real group mean and standard deviation of individual level PRS obtained using summary statistics of the base and individual level genotype of the target computed by PRSice2. Est stands for estimated. Note that the synthetic data is not subject to clumping since the simulation model does not generate LD structure.
Estimated and real group mean and standard deviation of PRS for depressive episode cases and controls in UK biobank population.
|
| #SNPs | Trait | Our method (ReACt) | PRSice2 | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Reg. w/o covatiate | Reg. w/top 5PCs | |||||||||
| Mean PRS | Mean PRS | |||||||||
| 0.1 | 4236 | Cases | 5.50 | |||||||
| Controls | ||||||||||
| 0.01 | 594 | Cases | 1.47 |
| 1.45 | 1.44 | ||||
| Controls | ||||||||||
| 0.001 | 82 | Cases | 0.0112 | 1.09 | 0.0147 | 1.53 | ||||
| Controls | 0.0112 | 0.0146 | ||||||||
|
| 10 | Cases | 9.36 | 1.13 | ||||||
| Controls | ||||||||||
We assessed the performance of our method using the summary statistics of an independent MDD GWAS as the base study, and the UK biobank samples, including 18,368 cases with depressive episode and 312,849 controls, as the target population. We generated summary statistics for the target populations and estimated group mean PRS and standard deviation of target PRS using ReACt. We computed the individual level PRS for the target study using PRSice2. For both methods, we computed PRS using independent SNPs from the base summary statistics with p-values below various thresholds (P-thres) and compared the performances under each threshold. For ReACt, mean PRS represents the estimated group mean PRS for cases and controls; p-val are the t-test p-values comparing PRS distribution in cases and in controls. For PRSice2, mean PRS represents real group mean PRS computed from individual level data and p-val are the t-test p-values comparing real PRS distribution in cases and in controls; reg. w/o covariate indicates regression results without covariates, which include the regression value (reg. ) and the p-value for the PRS predictor (p-val); reg. w/top 5PCs indicates the regression results including the top five PCs as covariates, which also included the regression value (reg. ) and the p-value for the PRS predictor (p-val).
Using our method to perform PRS comparisons across eight neuropsychiatric disorders.
We further applied our method to the summary statistics of eight neuropsychiatric disorders from PGC (see table 6 for details). For each disorder, we used PGC GWAS summary statistics to compute the group mean and standard deviation of PRS for the other seven disorders. All group PRS were estimated using independent SNPs with in the base summary statistics. We report p-values from a t-test comparing the group mean PRS of cases against controls in the target study, and cells with deeper blue colors correspond to lower p-values. The threshold of significance under multiple testing correction is .
Table of allele counts for SNP i () in the -th GWAS ().
| Number of alleles | |||
|---|---|---|---|
| Cases | |||
| Controls |
The total number of cases for the -th study is and the total number of controls for the -th study is . Clearly, the total number of cases and controls in a study is the same for all SNPs, which is why the variable N does not depend on i. The total number of alleles in cases and controls is equal to twice the number of cases and controls, respectively.
Notations and definitions of (effective or non-effective) allele frequencies in cases and controls.
| Frequency of the | |
| Frequency of the | |
| Frequency of the | |
| Frequency of the |
The subscripts i and indicate SNP number and study number, respectively.
Genotype counts for cases and controls for SNP i in study .
| Cases | |||
| Controls |
Using the above formulas, we can reconstruct the genotype counts for cases and controls for each of the three possible genotypes.
Number of overlapping cases and controls between the two studies.
| Overlapping | Study 2: case | Study 2: control |
|---|---|---|
| Study 1: case | ||
| Study 1: control |
For example, the first cell of the table indicates the number of shared cases between the two studies. In practice, the off-diagonal cells of this table are close to zero, since they indicate cases in one study that became controls in the other study and vice-versa. Large numbers in these off-diagonal cells would indicate high heterogeneity across the two studies, in which case a fixed effect meta-analysis is not recommended.
The probability distribution of for SNP i.
In this table, denotes the allele frequency of in cases and .
Information on summary statistics for the eight psychiatric disorders used in the experiments.
| Disorder | #Cases | #Controls | Total | #SNPs | Reference |
|---|---|---|---|---|---|
| Obsessive-compulsive disorder (OCD) | 2688 | 7037 | 9725 | 8,409,516 | [ |
| Tourette syndrome (TS) | 4819 | 9488 | 14,307 | 8,947,432 | [ |
| Eating disorder (ED) | 3495 | 10,982 | 14,477 | 10,641,224 | [ |
| Autism spectrum disorder (ASD) | 18,382 | 27,969 | 46,351 | 9,112,386 | [ |
| Bipolar disorder (BIP) | 20,352 | 31,358 | 51,710 | 13,413,244 | [ |
| Schizophrenia (SCZ) | 36,989 | 113,075 | 150,064 | 9,075,843 | [ |
| Attention-deficit/hyperactivity disorder (ADHD) | 19,099 | 34,194 | 53,293 | 8,094,094 | [ |
| Major depression (MD) | 69,232 | 161,009 | 230,241 | 9,874,289 | [ |
Note that we used summary statistics only for samples of European ancestry. For MD, we used the summary statistics generated by UK biobank, excluding the 23andMe samples; for BIP, we used the summary statistics including all three patient sub-types.