| Literature DB >> 36212134 |
Anastasia Gurinovich1, Mengze Li2, Anastasia Leshchyk2, Harold Bae3, Zeyuan Song4, Konstantin G Arbeev5, Marianne Nygaard6, Mary F Feitosa7, Thomas T Perls8, Paola Sebastiani1.
Abstract
Performing a genome-wide association study (GWAS) with a binary phenotype using family data is a challenging task. Using linear mixed effects models is typically unsuitable for binary traits, and numerical approximations of the likelihood function may not work well with rare genetic variants with small counts. Additionally, imbalance in the case-control ratios poses challenges as traditional statistical methods such as the Score test or Wald test perform poorly in this setting. In the last couple of years, several methods have been proposed to better approximate the likelihood function of a mixed effects logistic regression model that uses Saddle Point Approximation (SPA). SPA adjustment has recently been implemented in multiple software, including GENESIS, SAIGE, REGENIE and fastGWA-GLMM: four increasingly popular tools to perform GWAS of binary traits. We compare Score and SPA tests using real family data to evaluate computational efficiency and the agreement of the results. Additionally, we compare various ways to adjust for family relatedness, such as sparse and full genetic relationship matrices (GRM) and polygenic effect estimates. We use the New England Centenarian Study imputed genotype data and the Long Life Family Study whole-genome sequencing data and the binary phenotype of human extreme longevity to compare the agreement of the results and tools' computational performance. The evaluation suggests that REGENIE might not be a good choice when analyzing correlated data of a small size. fastGWA-GLMM is the most computationally efficient compared to the other three tools, but it appears to be overly conservative when applied to family-based data. GENESIS, SAIGE and fastGWA-GLMM produced similar, although not identical, results, with SPA adjustment performing better than Score tests. Our evaluation also demonstrates the importance of adjusting by full GRM in highly correlated datasets when using GENESIS or SAIGE.Entities:
Keywords: GENESIS; GWAS; REGENIE; SAIGE; SPA; binary phenotype; correlated data; fastGWA-GLMM
Year: 2022 PMID: 36212134 PMCID: PMC9544087 DOI: 10.3389/fgene.2022.897210
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
Metrics to compare GWAS methods.
| Metric | Description |
|---|---|
| Genomic inflation factor | The median of the observed chi-squared test statistics divided by the expected median of the corresponding chi-squared distribution |
| Agreement of the ranks of SNPs based on | Test whether the rankings of the results are significantly different using two measures of rank correlation (Spearman’s Rho, Kendall’s Tau). The rank correlation coefficients were computed by setting different significance thresholds. (GENESIS and SAIGE only) |
| Agreement of the score statistics | Visual comparison, descriptive statistics (GENESIS, SAIGE and fastGWA-GLMM only) |
| Agreement of the effect estimates | Visual comparison, descriptive statistics |
| Computational resources | CPU time taken to calculate GRM matrices, run the null model and the per-chromosome association tests in the same computational environment |
Correlations of the ranks of the p-values for NECS imputed genotype data.
|
| Number of SNPs | GENESIS SPA full GRM vs. GENESIS score full GRM | GENESIS SPA full GRM vs. GENESIS SPA sparse GRM | SAIGE full GRM vs. SAIGE sparse GRM | GENESIS SPA full GRM vs. SAIGE full GRM | GENESIS SPA sparse GRM vs. SAIGE sparse GRM | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Rho | Tau | Rho | Tau | Rho | Tau | Rho | Tau | Rho | Tau | ||
| 5.00E-03 | 86,107 | 0.997 | 0.972 | 0.928 | 0.774 | 0.923 | 0.765 | 0.98 | 0.883 | 0.981 | 0.891 |
| 5.00E-04 | 13,054 | 0.983 | 0.935 | 0.92 | 0.763 | 0.912 | 0.747 | 0.975 | 0.871 | 0.98 | 0.892 |
| 5.00E-05 | 2,409 | 0.967 | 0.916 | 0.898 | 0.747 | 0.881 | 0.722 | 0.979 | 0.884 | 0.974 | 0.881 |
| 5.00E-06 | 609 | 0.976 | 0.936 | 0.946 | 0.822 | 0.949 | 0.828 | 0.984 | 0.909 | 0.983 | 0.917 |
| 5.00E-07 | 269 | 0.983 | 0.949 | 0.941 | 0.831 | 0.952 | 0.844 | 0.987 | 0.922 | 0.984 | 0.933 |
| 5.00E-08 | 100 | 0.984 | 0.952 | 0.917 | 0.817 | 0.937 | 0.831 | 0.991 | 0.952 | 0.985 | 0.941 |
Rho, Spearman’s Rho; Tau, Kendall’s Tau.
FIGURE 1Manhattan and QQ plots of -log10(p-values) for the associations using imputed genotype data in the New England Centenarian Study (NECS) data. Panel (A): associations based on the score test and adjusted for the full genetic relation matrix (GRM) using GENESIS. Panel (B): associations based on the SPA and adjusted for the full GRM using GENESIS. Panel (C): associations based on the SPA and adjusted for the sparse GRM using GENESIS. Panel (D): associations based on the SPA and adjusted for the full GRM using SAIGE. Panel (E): associations based on the SPA and adjusted for the sparse GRM using SAIGE. Panel (F): associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE. Panel (G): associations based on the SPA and adjusted for the sparse GRM using fastGWA-GLMM. Lambda is a genomic inflation factor.
FIGURE 2Manhattan and QQ plots of -log10(p-values) for the associations using WGS data in the Long Life Family Study (LLFS) data. Panel (A): associations based on the score test and adjusted for the full genetic relation matrix (GRM) using GENESIS. Panel (B): associations based on the SPA and adjusted for the full GRM using GENESIS. Panel (C): associations based on the SPA and adjusted for the sparse GRM using GENESIS. Panel (D): associations based on the SPA and adjusted for the full GRM using SAIGE. Panel (E): associations based on the SPA and adjusted for the sparse GRM using SAIGE. Panel (F): associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE. Panel (G): associations based on the SPA and adjusted for the sparse GRM using fastGWA-GLMM. Lambda is a genomic inflation factor.
FIGURE 3Pairwise comparison plots of –log10(p values) for the associations using imputed genotype data in the NECS data. Panel (A): comparison of the associations based on the SPA and adjusted for the full genetic relation matrix (GRM) using GENESIS on the X axis versus associations based on the score test and adjusted for the full GRM using GENESIS on the Y axis. Panel (B): comparison of the associations based on the SPA and adjusted for the full GRM using GENESIS on the X axis versus associations based on the SPA and adjusted for the sparse GRM using GENESIS on the Y axis. Panel (C): comparison of the associations based on the SPA and adjusted for the full GRM using SAIGE on the X axis versus associations based on the SPA and adjusted for the sparse GRM using SAIGE on the Y axis. Panel (D): comparison of the associations based on the SPA and adjusted for the full GRM using GENESIS on the X axis versus associations based on the SPA and adjusted for the full GRM using SAIGE on the Y axis. Panel (E): comparison of the associations based on the SPA and adjusted for the sparse GRM using SAIGE on the X axis versus associations based on the SPA and adjusted for the sparse GRM using GENESIS on the Y axis. Panel (F): comparison of the associations based on the SPA and adjusted for the sparse GRM using GENESIS on the X axis versus associations based on the score test and adjusted for the full GRM using GENESIS on the Y axis. Panel (G): comparison of the associations based on the SPA and adjusted for the full GRM using GENESIS on the X axis versus associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE on the Y axis. Panel (H): comparison of the associations based on the SPA and adjusted for the full GRM using SAIGE on the X axis versus associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE on the Y axis. Panel (I): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the full GRM using GENESIS on the Y axis. Panel (J): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the sparse GRM using GENESIS on the Y axis. Panel (K): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the full GRM using SAIGE on the Y axis. Panel (L): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the sparse GRM using SAIGE on the Y axis. Panel (M): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE on the Y axis.
FIGURE 4Pairwise comparison plots of –log10(pvalues) for the associations using WGS data in the LLFS data. Panel (A): comparison of the associations based on the SPA and adjusted for the full genetic relation matrix (GRM) using GENESIS on the X axis versus associations based on the score test and adjusted for the full GRM using GENESIS on the Y axis. Panel (B): comparison of the associations based on the SPA and adjusted for the full GRM using GENESIS on the X axis versus associations based on the SPA and adjusted for the sparse GRM using GENESIS on the Y axis. Panel (C): comparison of the associations based on the SPA and adjusted for the full GRM using SAIGE on the X axis versus associations based on the SPA and adjusted for the sparse GRM using SAIGE on the Y axis. Panel (D): comparison of the associations based on the SPA and adjusted for the full GRM using GENESIS on the X axis versus associations based on the SPA and adjusted for the full GRM using SAIGE on the Y axis. Panel (E): comparison of the associations based on the SPA and adjusted for the sparse GRM using SAIGE on the X axis versus associations based on the SPA and adjusted for the sparse GRM using GENESIS on the Y axis. Panel (F): comparison of the associations based on the SPA and adjusted for the sparse GRM using GENESIS on the X axis versus associations based on the score test and adjusted for the full GRM using GENESIS on the Y axis. Panel (G): comparison of the associations based on the SPA and adjusted for the full GRM using GENESIS on the X axis versus associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE on the Y axis. Panel (H): comparison of the associations based on the SPA and adjusted for the full GRM using SAIGE on the X axis versus associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE on the Y axis. Panel (I): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the full GRM using GENESIS on the Y axis. Panel (J): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the sparse GRM using GENESIS on the Y axis. Panel (K): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the full GRM using SAIGE on the Y axis. Panel (L): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and adjusted for the sparse GRM using SAIGE on the Y axis. Panel (M): comparison of the associations based on the SPA and adjusted for the full GRM using fastGWA-GLMM on the X axis versus associations based on the SPA and polygenic effect estimates to control for relatedness using REGENIE on the Y axis.
Correlations of the ranks of the p-values for LLFS WGS data.
|
| Number of SNPs | GENESIS SPA full GRM vs. GENESIS score full GRM | GENESIS SPA full GRM vs. GENESIS SPA sparse GRM | SAIGE full GRM vs. SAIGE sparse GRM | GENESIS SPA full GRM vs. SAIGE full GRM | GENESIS SPA sparse GRM vs. SAIGE sparse GRM | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Rho | Tau | Rho | Tau | Rho | Tau | Rho | Tau | Rho | Tau | ||
| 5.00E-02 | 581,460 | 1 | 0.991 | 0.996 | 0.949 | 0.966 | 0.856 | 0.969 | 0.863 | 0.94 | 0.797 |
| 5.00E-03 | 68,524 | 0.996 | 0.958 | 0.995 | 0.94 | 0.912 | 0.771 | 0.954 | 0.83 | 0.877 | 0.707 |
| 5.00E-04 | 7,757 | 0.987 | 0.926 | 0.994 | 0.933 | 0.873 | 0.726 | 0.944 | 0.808 | 0.834 | 0.654 |
| 5.00E-05 | 840 | 0.986 | 0.917 | 0.994 | 0.938 | 0.791 | 0.637 | 0.949 | 0.816 | 0.767 | 0.589 |
| 5.00E-06 | 103 | 0.976 | 0.906 | 0.994 | 0.944 | 0.897 | 0.787 | 0.949 | 0.821 | 0.854 | 0.696 |
Rho, Spearman’s Rho; Tau, Kendall’s Tau.
Computation time used by GENESIS, SAIGE and REGENIE.
| Imputed dosages | WGS | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| GENESIS score | GENESIS SPA full / sparse | SAIGE full / sparse | REGENIE | fastGWA-GLMM | GENESIS score | GENESIS SPA full / sparse | SAIGE full / sparse | SAIGE full plink | REGENIE | fastGWA-GLMM | |
| GRM calculations | 1 h36 m32 s | 1 h36 m32 s / 1 h35 m53 s | - / 0 h01 m38 s | - | 0 h01 m50 s | 2 h48 m36 s | 2 h48 m36 s / 2 h48 m08 s | - / 0 h01 m04 s | - | - | 0 h00 m12 s |
| Null model (and GRM calculations for SAIGE full GRM only) | 0 h02 m57 s | 0 h02 m56 s / 0 h00 m01 s | 0 h20 m49 s / 0 h00 m10 s | 0 h10 m06 s | 0h 00m 04 s | 0 h00 m13 s | 0 h00 m06 s / 0 h00 m01 s | 0 h04 m46 s / 0 h00 m04 s | 0 h04 m46 s | 0 h03 m18 s | 0 h00 m02 s |
| Association tests chr1 | 1 h09 m40 s | 1 h26 m58 s / 0 h41 m43 s | 1 h00 m27 s / 1 h22 m59 s | 0 h33 m02 s | 0 h00 m59 s | 0 h21 m05 s | 0 h23 m46 s / 0 h20 m21 s | 1 h12 m35 s / 1 h23 m42 s | 0 h11 m19 s | 0 h47 m56 s | 0 h02 m00 s |
| Association tests chr21 | 0 h12 m59 s | 0h16m09s / 0h06m22s | 0 h13 m38 s / 0 h18 m23 s | 0 h04 m34 s | 0 h00 m10 s | 0 h03 m00 s | 0 h03 m22 s / 0 h03 m03 s | 0 h16 m54 s / 0 h18 m34 s | 0 h01 m06 s | 0 h08 m00 s | 0 h00 m21 s |
| Total time | 3 h02 m08 s | 3 h22 m35 s / 2 h23 m59 s | 1 h34 m54 s / 1 h43 m10 s | 0 h47 m42 s | 0 h03 m03 s | 3 h12 m54 s | 3 h15 m50 s / 3 h11 m33 s | 1 h34 m15 s / 1 h43 m24 s | 0 h17 m11 s | 0 h59 m14 s | 0 h02 m35 s |
GENESIS: Full and sparse GRM calculations are always done as a separate step. GDS file format is used as input for all the GWASs with GENESIS.
SAIGE: For the GWASs with full GRM, GRM calculation and null model fitting are combined in one step. For the GWASs with sparse GRM, the GRM is calculated in a separate step before fitting a null model. PLINK bed/bim/fam file format is used as input to GRM calculations and Null model steps for all the GWASs with SAIGE. SAIGE full / sparse GWASs use VCF file format as input for Association tests steps. SAIGE full plink GWAS uses PLINK bed/bim/fam file format as input for Association tests steps.
REGENIE: Null model step as referred to in here is fitting the whole-genome regression model to the phenotype in REGENIE GWASs. PLINK bed/bim/fam file format is used as input for both GWASs with REGENIE.
fastGWA-GLMM: Null model step as referred to in here is the estimation of variance component in fastGWA-GLMM. PLINK bed/bim/fam file format is used as input for both GWASs with fastGWA-GLMM.