
sumSTAAR: A flexible framework for gene-based association studies using GWAS summary statistics.

Nadezhda M Belonogova¹, Gulnara R Svishcheva¹,², Anatoly V Kirichenko¹, Irina V Zorkoltseva¹, Yakov A Tsepilov¹,³, Tatiana I Axenovich¹.

Abstract

Gene-based association analysis is an effective gene-mapping tool. Many gene-based methods have been proposed recently. However, their power depends on the underlying genetic architecture, which is rarely known for complex traits, so a combination of such methods is likely to serve as a universal approach. Several frameworks combining different gene-based methods have been developed. However, they all imply a fixed set of methods, weights and functional annotations. Moreover, most of them use individual phenotypes and genotypes as input data. Here, we introduce sumSTAAR, a framework for gene-based association analysis using summary statistics obtained from genome-wide association studies (GWAS). It is an extended and modified version of the STAAR framework proposed by Li and colleagues in 2020. The sumSTAAR framework offers a wider range of gene-based methods to combine. It allows the user to arbitrarily define a set of these methods, weighting functions and probabilities of genetic variants being causal. The methods used in the framework were adapted to analyse genes with a large number of SNPs in order to decrease the running time. The framework includes the polygene pruning procedure to guard against the influence of strong GWAS signals outside the gene. We also present new, improved matrices of correlations between the genotypes of variants within genes. These matrices, estimated on a sample of 265,000 individuals, are a state-of-the-art replacement for the widely used matrices based on the 1000 Genomes Project data.


Year: 2022 PMID: 35653402 PMCID: PMC9197066 DOI: 10.1371/journal.pcbi.1010172

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779

This is a PLOS Computational Biology Methods paper.

Introduction

Gene-based association analysis is an effective replacement of genome-wide association analysis (GWAS) in identification of rare genetic variants [1, 2]. Many gene-based methods have been proposed recently. Their power depends on the underlying genetic architecture that is rarely known in complex traits. Therefore, a combination of such methods could serve as a universal approach. Among popular combined tests, SKAT-O was the first, for which the distribution of test statistics was analytically described [3]. For other combined tests, p-values were estimated empirically at the cost of dramatically increased computation time. The task to analytically combine the p-values obtained by different methods has been solved in the aggregated Cauchy omnibus test, ACAT [4]. This gave impetus to create a range of frameworks in order to one-by-one calculate a number of gene-based tests and then combine them by ACAT [5-9]. The frameworks differ from one another by the task, input data, methods used, and ways to include functional biological annotations. All these frameworks have a disadvantage: they are not flexible. They use the fixed set of methods, weights and combinations of functional annotations. Moreover, the majority of existing frameworks use individual phenotypes and genotypes as input data. Such data cannot be deposited in open-access databases, and so they are unavailable to a wide range of investigators. Recently, it was demonstrated that all popular methods of gene-based association analysis based on the linear regression models can use summary statistics instead of individual data [10]. Previously, we presented formulas for the wide range of association tests and implemented them in the sumFREGAT package [11]. 
The framework named STAAR (variant-set test for association using annotation information) stands out among others as a comprehensive and powerful tool that effectively incorporates SNP-weighting by allele frequencies, variant categories and multiple complementary annotations [6]. Here we propose the extended and modified version of the STAAR framework, which we called sumSTAAR. The modification concerns the input data: STAAR uses raw genotypes and phenotypes, and sumSTAAR uses GWAS summary statistics (effect sizes, standard errors, sample sizes etc.). Extension relates to the gene-based association analysis methods used: STAAR uses a fixed set of three methods, and sumSTAAR uses an arbitrary set including up to six methods. The methods comprised by the sumSTAAR framework are modified in two ways compared with those previously described [11]. First, they involve a special algorithm for the analysis of large genes with >500 SNPs. Second, they use more efficient computational algorithms for matrix operations. An additional empowering feature of the framework is the use of new LD matrices estimated on an extended sample: 265K instead of 503 individuals from the 1000G project. For the first time, due to these high-coverage estimates, it became possible to include the large amount of rare variants when analyzing summary statistics with a wide range of powerful gene-based methods. We also present the procedure of polygene pruning to guard against the influence of strong association signals outside the gene on the results of gene-based association analysis [12].

Methods

Gene-based association analysis

The sumSTAAR framework combines (a set of) the following methods: burden test (BT), SKAT, SKAT-O, aggregated Cauchy association test (ACAT-V), the tests using functional linear regression model (FLM) and principal component analysis (PCA). Variant-specific weights can be applied to all of these methods. We also introduced the probabilities of genetic variants being causal estimated using different functional annotations in BT, SKAT, SKAT-O, FLM, PCA and ACAT-V. All these modified tests and the parameters of their distributions are presented in S1 Text. The sumSTAAR() function (Fig 1) allows the user to arbitrarily choose a set of tests that differ in method, weighting function and probabilities of genetic variants being causal, calculate these tests, and then combine them using the aggregated Cauchy omnibus test, ACAT.
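The final combination step can be illustrated with a minimal sketch of the Cauchy combination rule that underlies ACAT. This is not the sumFREGAT implementation: the function name `acat`, the equal default weights and the numerical cut-offs are assumptions made for illustration only.

```python
import math

def acat(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule (illustrative sketch)."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    if weights is None:
        weights = [1.0] * len(p_values)  # assumption: equal weights by default
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    total_w = sum(weights)
    stat = 0.0
    for p, w in zip(p_values, weights):
        if not 0.0 < p < 1.0:
            raise ValueError("p-values must lie strictly between 0 and 1")
        # Map each p-value to a standard Cauchy variate; for tiny p use the
        # asymptotic form 1 / (p * pi) to avoid loss of precision in tan().
        term = 1.0 / (p * math.pi) if p < 1e-15 else math.tan((0.5 - p) * math.pi)
        stat += (w / total_w) * term
    # Upper tail probability of the standard Cauchy distribution.
    if stat > 1e15:
        return 1.0 / (stat * math.pi)
    return 0.5 - math.atan(stat) / math.pi
```

In this sketch, identical input p-values are returned unchanged, while one small p-value among several large ones still yields a small combined p-value.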

Fig 1

Workflow schematic.

(A) Each set of SNPs (all, non-coding, exonic, nonsynonymous and others) is analyzed separately. (B) Input data for sumFREGAT include GWAS summary statistics (p-values and effect sizes), correlations between genotypes calculated using the same or reference sample, the matrix of weighting functions defined by the parameters of the beta distribution, the probabilities of SNPs being causal (e.g., estimated using different functional annotations http://favor.genohub.org/). The list of methods can comprise an arbitrary subset of BT, SKAT, SKAT-O, PCA, FLM, and ACAT-V. All methods use summary statistics as input. For each method, region-based association analysis is repeatedly performed using different combinations of the weighting functions (i ∈ [1, I]) and probabilities of SNPs being causal (j ∈ [0, J]). ACAT is used for combining the p-values obtained by each method under different weighting functions and probabilities, and then for combining the results obtained by various methods.
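The matrix of weighting functions described in the caption can be sketched as follows, assuming that each weight is the beta density evaluated at the variant's minor allele frequency. The parameter pairs (1, 25) and (1, 1) are an assumption taken from common practice in SKAT-type tests, and the helper names `beta_density` and `weight_matrix` are hypothetical.

```python
import math

def beta_density(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def weight_matrix(mafs, beta_params):
    """One row per variant, one column per (a, b) weighting function."""
    return [[beta_density(maf, a, b) for (a, b) in beta_params] for maf in mafs]

# Hypothetical minor allele frequencies and two assumed weighting functions.
mafs = [0.0005, 0.01, 0.2]
weights = weight_matrix(mafs, [(1, 25), (1, 1)])
```

With Beta(1, 25) rare variants receive much larger weights than common ones, whereas Beta(1, 1) weights all variants equally.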

Analysis of large genes

To decrease the running time for association analysis of genes with a large number of SNPs, we propose the following algorithm. Using a thresholding technique, we divide all SNPs within a gene into two groups by their p-values, which correspond to their weighted z-scores. Since multiple linear regression models include SNP-specific weights, we form the SNP groups taking these weights into account. We used a threshold of 0.8, which was selected empirically (see below). We apply the chosen gene-based test to the group with the smaller weighted p-values and calculate the simple mean of the weighted p-values for the other group. Then we combine the p-values obtained for the two groups by ACAT. This algorithm is an approximation to the chosen gene-based test; however, it proves to be effective for genes with more than 500 SNPs. We introduced it in the SKAT, SKAT-O, PCA and FLM methods.
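The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the weighted p-values are taken as given, `gene_test` is a hypothetical callable standing in for SKAT, SKAT-O, PCA or FLM, and the stand-in test used in the example is a simple Cauchy combination rather than any of those methods.

```python
import math

def cauchy_combine(p_values):
    """Equal-weight Cauchy (ACAT-style) combination; p-values must be in (0, 1)."""
    stat = sum(math.tan((0.5 - p) * math.pi) for p in p_values) / len(p_values)
    return 0.5 - math.atan(stat) / math.pi

def approximate_large_gene_test(weighted_p, gene_test, threshold=0.8):
    """Approximate a gene-based test by splitting SNPs at a p-value threshold.

    weighted_p : weighted p-values of the SNPs within the gene
    gene_test  : hypothetical callable taking a list of SNP indices and
                 returning a gene-based p-value for that subset
    """
    small = [i for i, p in enumerate(weighted_p) if p < threshold]
    large = [i for i, p in enumerate(weighted_p) if p >= threshold]
    parts = []
    if small:
        # The full gene-based test is run only on the smaller-p group.
        parts.append(gene_test(small))
    if large:
        # The remaining SNPs contribute the simple mean of their p-values.
        parts.append(sum(weighted_p[i] for i in large) / len(large))
    return cauchy_combine(parts)

# Usage with a stand-in test (not SKAT): Cauchy combination of the subset.
weighted_p = [0.001, 0.2, 0.85, 0.95]
stand_in = lambda idx: cauchy_combine([weighted_p[i] for i in idx])
p_gene = approximate_large_gene_test(weighted_p, stand_in)
```

The saving comes from running the expensive test on the first group only; the second group costs a single pass over its p-values.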

Selecting the threshold

To choose the threshold below which weighted p-values are considered small when analyzing large genes, we performed an empirical assessment of the approximation using summary statistics for neuroticism from the UK Biobank dataset [13]. We calculated the approximated SKAT statistics using a range of threshold values (from 0.05 to 0.95) on 2,103 genes having from 500 to 10,000 SNPs. For each threshold value, we measured the total elapsed time and calculated R² between the original and approximated SKAT statistics (log10(p-values)). Since the approximated SKAT p-values deviated in both directions from the original ones, we assessed both deviances (dev) separately. Here p_i^a and p_i^o are the approximated and original p-values for the i-th gene, respectively; dev for conservativeness and inflation of the approximated test statistics was calculated using the genes with p_i^a > p_i^o and p_i^a < p_i^o, respectively.

Real data analysis

To test and evaluate the performance of sumSTAAR, we used two real data sets from the UK Biobank project [13, 14]. Sociodemographic, physical, lifestyle and health-related characteristics of the UK Biobank cohort have been reported by [15]. UK Biobank whole exome sequencing data and phenotype of the chronic ischaemic heart disease (IHD, ICD-10 code I25) contained 153,379 unrelated individuals (12,931 cases / 140,448 controls) with European ancestry (project #59345). We analyzed 110,538 variants covering 1,927 genes from chromosome 1 after the following filters: call rate > 0.98, MAC ≥ 5 and MAF < 0.01. Sex, age and batch were used as covariates. These data were used for comparing the results of STAAR and sumSTAAR. GWAS summary statistics for neuroticism contain information about the association for 10,847,151 imputed SNPs (MAF > 0.001 and INFO > 0.9) from a sample of 380,506 individuals and are freely available at https://ctg.cncr.nl/software/summary_statistics. Neuroticism levels were measured using the Eysenck Personality Questionnaire, Revised Short Form [16], consisting of 12 dichotomous items (0, 1). The quantitative trait was defined as a sum of 12 items (for details, see Nagel et al., 2018 [17]). These data were used for testing the new algorithm for analysis of large genes and for estimating the efficiency of different weighting functions and functional annotations defined via eight integrative scores (aPCs) from FAVOR v.2 (http://favor.genohub.org/) [6].

LD matrices

The LD matrices of genotype correlations are required as input data for all packages using summary statistics. Using the UK Biobank resource under application #59345, we calculated Pearson correlation coefficients (r) between genotypes of variants within a gene for 19,426 genes using 265,000 participants of European ancestry from the UK Biobank cohort [14] and LDstore software v1.1 [18]. Only variants with MAF > 10⁻⁵ and imputation quality r² > 0.3 were used for the calculations.
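For illustration, the quantity stored in such a matrix can be computed on a toy example as below. The authors used LDstore; this pure-Python sketch, with the hypothetical helpers `pearson_r` and `ld_matrix`, only shows what a within-gene matrix of Pearson correlations between genotype dosages looks like.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a monomorphic variant")
    return sxy / math.sqrt(sxx * syy)

def ld_matrix(genotypes):
    """genotypes: one list of dosages (0, 1, 2) per variant, same individuals."""
    m = len(genotypes)
    r = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            r[i][j] = r[j][i] = pearson_r(genotypes[i], genotypes[j])
    return r

# Toy data: three variants genotyped in five individuals.
toy = [[0, 1, 2, 1, 0], [0, 1, 2, 1, 0], [2, 1, 0, 1, 2]]
ld = ld_matrix(toy)
```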

Results

All three gene-based methods implemented in STAAR (BT, SKAT and ACAT-V) are available in the sumSTAAR framework. We analytically showed the equivalence of these tests between the frameworks (see S1 Text). We also numerically compared the results obtained in STAAR and sumSTAAR using simulated data (S2 Text) as well as UK Biobank exome sequencing data and the phenotype of chronic ischaemic heart disease (S3 Text). STAAR implies an omnibus weighting scheme that combines multiple differently weighted tests (see S1 Fig). Using summary statistics, we reproduced this scheme in sumSTAAR and compared the results with those obtained by STAAR. As can be seen in S2 and S3 Figs, there is excellent agreement between the results obtained by the two packages.

Then, we tested the new algorithm developed for the analysis of large genes and chose the threshold separating the two groups of SNPs in accordance with their weighted p-values. We tried to find a reasonable compromise between approximation accuracy and computation time. We observed the highest correlation between the original and approximated test statistics for threshold values ranging from 0.6 to 0.8 (Fig 2A). There was no prominent change in total elapsed time among these runs. However, with increasing threshold values the approximated test proved to be more conservative and inflation became less frequent (Fig 2B). Therefore, to prevent an increase in the false positive rate, we selected a weighted p-value threshold of 0.8 to separate the SNPs into two groups.

Fig 2

Determination coefficient and deviances of approximated SKAT statistic related to the threshold value.

(A) Determination coefficient (R²) between −log10(p-value) of the original and approximated tests, shown in red. (B) Deviances indicating inflation and conservativeness of the approximated test statistics compared with the original, shown in red and blue, respectively.

Using the neuroticism summary statistics, we estimated the accuracy and efficiency of the modified methods implemented in the new version compared to the version of sumFREGAT without modifications. Fig 3 shows a good agreement between the p-values obtained by the two packages and a decrease in the running time when using the new modified version of the package. For SKAT, SKAT-O, PCA and FLM, the running time was decreased by 2.4, 10.5, 3.4 and 2.6 times, respectively. The most prominent effect was shown for the most popular SKAT-O method.

Fig 3

Accuracy and running time of four gene-based methods for association analysis under approximation.

Each point represents a gene: 7,990 genes for FLM (genes that passed the collinearity filter for 25 basis functions, see S1 Text for details) and 17,975 genes for the other methods. Left panels show −log10(p-values); red lines are regression lines and black lines represent one-to-one correspondence. On the right panels, lines represent the best-fitted polynomial functions.

Within the framework, we introduce the new LD matrices for 19,426 genes estimated using genotypes of 265K participants of the UK Biobank project. The matrices contain information about 21,155,091 SNPs, with 17,142,006 (81%) of them having MAF < 0.01. For comparison, the widely used matrices of SNP-SNP correlations estimated on 503 European participants of the 1000G project include 4,544,901 SNPs, with only 707,862 (16%) of them having MAF < 0.01. The UK Biobank matrices, therefore, provide 4.65 times higher SNP coverage and 24 times higher coverage for rare variants. The matrices can be used in our or any other software together with summary statistics from samples of European ancestry. If available, other matrices calculated, for example, on Asian or African populations can be used in our framework to analyze the corresponding samples. For the polygene pruning procedure, we now publish an R script to perform it step by step.
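The coverage figures quoted for the LD matrices follow directly from the SNP counts given in the text, as this short check shows (the variable names are chosen here for illustration):

```python
# SNP counts reported in the text.
ukb_total, ukb_rare = 21_155_091, 17_142_006   # UK Biobank matrices
kg_total, kg_rare = 4_544_901, 707_862         # 1000G-based matrices

# Share of rare variants (MAF < 0.01) in each resource, in percent.
ukb_rare_pct = round(100 * ukb_rare / ukb_total)
kg_rare_pct = round(100 * kg_rare / kg_total)

# Fold increase in overall and rare-variant coverage.
total_fold = round(ukb_total / kg_total, 2)
rare_fold = round(ukb_rare / kg_rare)
```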

Discussion

We developed a new framework for gene-based association analysis using summary statistics. This framework can include an arbitrary set of methods for gene-based association analysis, weighting functions and functional annotations used for the estimation of SNP probability being causal. Many of the methods used in the framework were adapted to the genes with large number of SNPs. This allows us to increase the computation speed of different methods by 2.4–10.5 times. We compared STAAR and sumSTAAR and demonstrated the strong agreement of the results obtained by BT, SKAT, ACAT-V and their ACAT combinations. Our sumSTAAR framework, however, provides an opportunity to expand the range of methods with the fixed-effects models (PCA and FLM). High statistical power of these methods was previously shown for different simulation scenarios and real data [12, 19–22]. In addition, a method with random-effects model, SKAT-O, can be used instead of ACAT to combine BT and SKAT within the framework. Being more computationally intensive, SKAT-O nevertheless provides an optimal kernel-based combination of the methods when using the same weighting functions. It is known that there is no universal or optimal method of the gene-based association analysis for any gene and trait (see for example, Wang et al., 2017 [23]). The more methods are used, the greater the chance of finding a causal gene. For example, in our previous study of neuroticism [12] the combination of the BT, SKAT, PCA and ACAT-V identified 190 genes, while the smaller set of BT, SKAT and ACAT-V would identify only 153 genes. In addition to an extended set of methods, the power of the gene-based analysis can be increased by introducing various weighting functions and functional annotations. Using the neuroticism summary statistics for protein-coding variants, we showed that use of functional annotations and both weighting functions allowed us to identify a new neuroticism gene, NARF (see S4 Text). 
In total, using the additional PCA method increased the number of identified genes by 37, and additional weighting functions and functional annotations for protein-coding variants yielded one more gene. So, the effectiveness of our framework in this case can be estimated as 25% (38 / 153 * 100%). Our framework allows researchers not only to increase the number of tests simultaneously included in the analysis, but also to form their own alternative subsets of methods, weighting functions and functional annotations. A large number of tests can be time-consuming to compute, but it insures against selecting a method or weighting function that is not appropriate for a particular study.

To efficiently analyze genes with a large number of variants, we proposed a simple algorithm that subdivides the variants within a gene into two groups using a predefined p-value threshold. The running time decreases because the selected gene-based method is applied only to the group of variants with lower p-values. For sequence kernel association tests, Lumley et al., 2018 [24] developed a fast approximation called fastSKAT. It does not directly limit the number of variants, but restricts the number of eigenvalues of the genotype covariance matrix. Only the 200 largest eigenvalues are included in the standard calculation of the combined p-value; for the remaining ones, the fast approximation is used. Due to the fixed size of the first group, fastSKAT running time for large genes grows with the square of m (the number of within-gene variants) instead of the cube of m for the original SKAT. Unfortunately, the fastSKAT algorithm is specific to SKAT and cannot be applied to other methods of the sumSTAAR framework. Our algorithm is universal, though less efficient than fastSKAT, because we fix the threshold for the p-values but not the size of the group. We analytically estimated that the computation time is roughly halved after applying our algorithm.
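The roughly twofold reduction can be written out as a short derivation. It is a sketch under two assumptions: running time T grows as the cube of the number of variants m, and p-values are uniformly distributed under the null hypothesis, so that a fraction t of the variants falls below a threshold t. The symbols T and t are introduced here for illustration; t = 0.8 is the threshold selected in the study.

```latex
\[
\frac{T(m)}{T(t\,m)} \;=\; \frac{m^{3}}{(t\,m)^{3}} \;=\; \frac{1}{t^{3}},
\qquad
t = 0.8 \;\Rightarrow\; \frac{1}{0.8^{3}} = \frac{1}{0.512} \approx 1.95 \approx 2 .
\]
```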
This estimate assumes that running time increases as the cube of m for all methods in the sumSTAAR framework and that the p-values are uniformly distributed under the null hypothesis. The expected time reduction factor was therefore calculated as 1 / 0.8³ ≈ 2, where 0.8 is the p-value threshold selected in our study. However, since we also updated some algorithms for matrix operations, the running time decreased by 2.5–3 times for all methods except SKAT-O, which showed the most prominent effect, a 10.5-fold speed-up (Fig 3, right panels). Using neuroticism as an example, we compared the p-values from the original and approximated methods for the large genes and showed a good agreement (Fig 3, left panels).

The sumSTAAR framework suggests using the polygene pruning procedure to guard against the influence of strong GWAS signals outside the gene. It has been shown that a substantial share of gene-based association signals is inflated by these GWAS signals [6, 12]. To guard against this influence, conditional GWAS summary statistics calculated using, for example, the GCTA-COJO package [25] can be used in sumFREGAT as input data. However, to calculate the conditional statistics, this type of analysis relies on simple multiple regression with all the attendant limitations. For example, conditional SNPs should be in complete linkage equilibrium with each other, and their number, therefore, cannot be large. The procedure called “polygene pruning” [12] represents an alternative way to reduce the effect of strong GWAS signals outside the gene. Polygene pruning excludes from the gene-based analysis those variants within the gene that are in LD with GWAS-identified variants outside it. In essence, this procedure is analogous to the extreme weighting of within-gene SNPs based on their LD with outer GWAS signals. The polygene pruning approach can be preferable when the set of within-gene variants is large or includes rare variants.
Moreover, the classical conditional analysis is impossible to perform when genotypes of top GWAS signals are not available, while the correlation structure sufficient for polygene pruning can be shared more easily.

Our framework can be applied to any summary statistics, including those obtained by exome or whole-genome sequencing association analysis. We demonstrated this possibility by comparing the results of STAAR and sumSTAAR obtained on real exome sequencing data (see S3 Text). In practice, the application of our framework to these data is limited by the properties of existing reference LD matrices and the quality of summary statistics. In principle, there are no restrictions on including variants with low MAC, even singletons with MAC = 1, in LD matrices, as we did in the simulation experiment (S2 Fig). However, ultra-rare variation is highly population-specific, and the robustness of the corresponding SNP-SNP correlations in the context of gene-based analysis has not yet been estimated. Cross-population use of LD matrices for ultra-rare variants might, therefore, potentially lead to some uncontrolled errors. To bypass these problems, we ask the scientific community to publish genotype correlation matrices along with GWAS summary statistics. This would allow accurate population-specific gene-based analyses of whole-genome and exome sequencing data, and would help address the problem of correlation robustness for ultra-rare variants.

Another limitation of using the framework for sequencing data is the quality of summary statistics. Many GWAS tools are not designed to ensure unbiased Z-scores at low MAFs. If there is uncontrolled inflation in rare-variant statistics, it will inflate the statistics of the gene-based analysis. For case-control association studies, we suggest using special tools that apply saddlepoint approximation correction for rare variants, for example SAIGE [26] or fastGWA-GLMM [27].
Moreover, if a binary trait is analyzed, we suggest not using variants with MAC < 5, since there is no robust algorithm to produce unbiased Z-scores, especially under an unbalanced case-control design. For quantitative traits, more attention should be paid to departures from normality [28].

To conclude, we present sumSTAAR, a flexible and comprehensive framework that allows researchers to perform state-of-the-art gene-based analyses using GWAS summary statistics.

Tests performed within the STAAR procedure.

Combined tests are shown in bold. (TIF)

Comparison of the results obtained by the STAAR and sumFREGAT packages.

Negative log10(p-values) were calculated in 10 simulations. The first three panels show the results for individual gene-based tests (Burden test, SKAT and ACAT) with two sets of parameters for the Beta distribution and 11 variants of annotation weighting. The last panel presents −log10(p-values) for all combined tests. The regression lines are shown in red (they overlap the lines of one-to-one correspondence). (TIF)

The −log10 transformed p-value of each gene is shown. The first three panels show the results for individual gene-based tests (Burden test, SKAT and ACAT) with two sets of parameters for the Beta distribution. The last panel presents results combined across all tests. The regression lines are shown in red (they overlap the black lines of one-to-one correspondence); ‘r’ is the correlation coefficient. (TIF)

Introducing the probabilities of genetic variants being causal and analytical equivalence of methods implemented in the sumFREGAT and STAAR package.

(DOCX)

Comparison of STAAR and sumSTAAR using simulated data.

(DOCX)

Comparison of STAAR and sumSTAAR using real exome sequencing data.

(DOCX)

SumSTAAR procedure in application to real data.

(DOCX)

8 Jan 2022

Dear Dr. Belonogova,

Thank you very much for submitting your manuscript "sumSTAAR: a flexible framework for gene-based association studies using GWAS summary statistics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Andrey Rzhetsky
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Inspired by the STAAR framework, the authors extended their previous work in sumFREGAT to sumSTAAR. This paper is concisely written and easy to read. While there are some innovations in combining the two, e.g., new LD matrices, new algorithm for large genes, this reviewer does not find the contributions as significant. In addition, there seemed to be a lack of interest in answering any genetic problem using this new tool. There are no new findings other than that the new tool agrees well with previously published ones and the speed is faster. The paper may strengthen if the authors can use this new tool to make significant genetic findings.

Reviewer #2: The manuscript describes a statistical framework in an attempt to extend the STAAR method to analyze summary statistics data instead of individual genotype data, and to expand its capacity to be able to incorporate additional models beyond what has been included in the STAAR framework. The extension to analyze summary statistics data is practical and will be useful when implemented properly. This reviewer appreciates the authors’ effort but has several major concerns in this study.

1. For gene-based and/or region-based rare variant analysis, the purpose is to analyze rare variants revealed in sequencing data, to increase power by aggregating multiple potentially causal rare variants that are not usually covered by GWAS data. The current study uses GWAS data for such statistical analysis, which is unlikely equally applicable to sequencing data.
In order to make it rigorous so that it can be used by the community, it requires extensive simulation and application to real sequencing data, both of which are severely lacking in the current study.

2. For GWAS, many statistical methods are being developed or extended to summary statistics, which is a huge benefit to the research community given the easy accessibility of summary statistics. The adaptation of such strategy for rare variants uncovered by sequencing, however, is not straightforward. One reason is that the estimates of the effect size and standard deviation are not as accurate for rare variants, and another reason is that the LD among rare variants is challenging to estimate from an external reference panel given that rare variants are often subtly different among sub-populations. Both of the two aspects pose great challenges to the summary statistics-based rare variants analysis. The current study does not address the challenges, and it is not clear how these challenges may affect the analysis results. Extensive investigations are required to evaluate the impact of these issues.

3. The inclusion of additional models (e.g. functional linear models) requires additional evaluation to show the benefits of incorporating the additional models. Without such investigation, it is unclear whether the included models are able to increase power or not.

4. The separation of the SNPs in long genes is arbitrary. Although theoretically sound, in practice real examples are needed to show that this is a sensible strategy.

Overall, as this tool is to be used by the researchers, it is critical that this tool is extensively evaluated to make sure that it delivers rigorous results under various scenarios. This is especially critical for rare variants, as it is well-known that there are unique challenges (e.g. point #1 and #2) and such rigorous evaluations are indispensable.
Reviewer #3: In this paper, the authors proposed sumSTAAR as a gene-based association analysis framework using summary statistics from genome-wide association studies (GWAS). The framework allows users to select candidate methods, weighting functions, and probabilities of genetic variants being causal as introduced in the original STAAR framework. Overall, the proposed framework and data interpretations are justified. I do have the following questions and concerns regarding the manuscript.

Main comments:

1. My main concern is on the “Analysis of large genes” and “Selecting the threshold” in the Methods section. It is known that the original SKAT method (Wu et al. AJHG 2011) is scalable to perform SNP-set test for variant number of several hundreds to thousands. The authors are expected to demonstrate the advantage of computation for variant number more than 10,000. See FastSKAT (Lumley et al. Gen. Epi. 2018) paper as an example.

2. Although the proposed sumSTAAR framework uses GWAS summary statistics instead of individual-level data as input, two main limitations should be discussed clearly by the authors in the revised manuscript: (1) sumSTAAR could not be directly applied to large-scale sequencing studies, where most of the variants observed are extremely rare variants (e.g. 46% of the observed variants are singletons, see Taliun et al. Nature 2021). The LD matrices for 19,726 genes estimated using the genotypes of 265K participants of UK Biobank would not recover those variants. (2) sumSTAAR could not be directly applied to multi-ethnic GWAS studies. As UK Biobank samples are mostly within European ancestry, the use of reference LD matrices calculated from UK Biobank participants may not be generalizable to other ancestries or multi-ethnic studies.

Minor comments:

3. In page 8 line 9, “aggregated Cauchy test, ACAT-O” should be “aggregated Cauchy association test, ACAT”.
In addition, all subsequent appearances of “ACAT-O” should be “ACAT”, including page 8 line 11; page 9 line 21; page 10, line 2; Figure 1.

4. Following #3, in page 13 line 4, suggest changing “… and their ACAT-O-based combinations” to “… and their combinations” for clarity.

5. Following #3, in page 13 line 6-8, ACAT-O is a standalone variant-set test that combines the burden test, SKAT, and ACAT-V (Liu et al. AJHG 2019). It has been shown that ACAT-O has better statistical power than SKAT-O. As such, it is recommended that the authors remove the sentence “In addition, …, compared with ACAT-O”.

6. In page 14 line 1-3, the conclusion sounds subjective to the broad readership of the journal. Suggest changing to “To our knowledge, sumSTAAR is a flexible and comprehensive framework that allows researchers to perform state-of-the-art gene-based analyses using GWAS summary statistics”.

7. In page 11 line 23-25, since the authors utilized the STAAR package/tutorial and adapted the scripts based on https://github.com/xihaoli/STAAR/blob/master/docs/STAAR_vignette.html to facilitate the sumSTAAR vs STAAR comparison (https://github.com/nbelon/sumSTAAR-vs-STAAR-comparison/blob/main/sumSTAAR.vs.STAAR.R), it is recommended that the authors acknowledge the STAAR package authors in the contributor list of the sumFREGAT package (https://cran.r-project.org/web/packages/sumFREGAT/index.html).

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository.
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: None
Reviewer #2: No: The R code provided is only to generate supplementary FigS2. No computational code or data were provided for the main results including Figure 2 or Figure 3.
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

11 Mar 2022

Submitted filename: Responses_10_03.docx

29 Mar 2022

Dear Dr. Belonogova,

Thank you very much for submitting your manuscript "sumSTAAR: a flexible framework for gene-based association studies using GWAS summary statistics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Andrey Rzhetsky
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have shown the immediate applicability of the sumSTAAR method using the neuroticism data. However, they were able to identify 1 additional associated gene, out of 190 previously found genes. This does not inspire confidence in the significance of the method presented, and there is no guarantee that it will always increase the number of genes identified. In particular, this reviewer agrees with Reviewer #2 that "it is critical that this tool is extensively evaluated to make sure that it delivers rigorous results under various scenarios." The emphasis may be on evaluating this method extensively. Based on the current revision, it is hard to argue that this method has been evaluated extensively under various scenarios.

Minor comment: In Figure 1, it seems all 6 methods, BT, SKAT, SKAT-O, PCA, FLM, and ACAT-V, take exactly the same input. Is that the case?
Since this paper emphasizes the use of summary statistics, it would be better, to avoid confusion, for the authors to highlight which methods use summary statistics and which do not.

Reviewer #2: The authors have addressed most of my initial concerns. Here are a few comments that could improve the quality further:

1) A QQ-plot (of all p-values) needs to be provided to make sure the type I error is well controlled in each analysis. For example, QQ-plots for FigS4 and for each of the 8 combination scenarios listed in Table S1 (using the p-values from all genes analyzed).

2) The URLs for the SNP-SNP correlation matrices of imputed genotype data of UKBB are not working and need to be fixed.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes
Reviewer #2: No:

**********

Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: No

11 Apr 2022

Submitted filename: Responses_11_04.docx

5 May 2022

Dear Dr. Belonogova,

We are pleased to inform you that your manuscript 'sumSTAAR: a flexible framework for gene-based association studies using GWAS summary statistics' has been provisionally accepted for publication in PLOS Computational Biology.
Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Andrey Rzhetsky
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all my concerns.

Reviewer #2: The authors have addressed all my concerns and I thank the authors for their revision efforts and recommend this manuscript for acceptance now.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No

27 May 2022

PCOMPBIOL-D-21-01971R2

sumSTAAR: a flexible framework for gene-based association studies using GWAS summary statistics

Dear Dr Belonogova,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Zsofia Freund PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

27 in total

1. ACAT: A Fast and Powerful p Value Combination Method for Rare-Variant Analysis in Sequencing Studies.

Authors: Yaowu Liu; Sixing Chen; Zilin Li; Alanna C Morrison; Eric Boerwinkle; Xihong Lin
Journal: Am J Hum Genet Date: 2019-03-07 Impact factor: 11.025

2. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies.

Authors: Seunggeun Lee; Mary J Emond; Michael J Bamshad; Kathleen C Barnes; Mark J Rieder; Deborah A Nickerson; David C Christiani; Mark M Wurfel; Xihong Lin
Journal: Am J Hum Genet Date: 2012-08-02 Impact factor: 11.025

3. COMBAT: A Combined Association Test for Genes Using Summary Statistics.

Authors: Minghui Wang; Jianfei Huang; Yiyuan Liu; Li Ma; James B Potash; Shizhong Han
Journal: Genetics Date: 2017-09-06 Impact factor: 4.562

4. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits.

Authors: Jian Yang; Teresa Ferreira; Andrew P Morris; Sarah E Medland; Pamela A F Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael N Weedon; Ruth J Loos; Timothy M Frayling; Mark I McCarthy; Joel N Hirschhorn; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2012-03-18 Impact factor: 38.330

5. Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies.

Authors: Christian Benner; Aki S Havulinna; Marjo-Riitta Järvelin; Veikko Salomaa; Samuli Ripatti; Matti Pirinen
Journal: Am J Hum Genet Date: 2017-09-21 Impact factor: 11.025

6. Item-level analyses reveal genetic heterogeneity in neuroticism.

Authors: Mats Nagel; Kyoko Watanabe; Sven Stringer; Danielle Posthuma; Sophie van der Sluis
Journal: Nat Commun Date: 2018-03-02 Impact factor: 14.919

7. A rank-based normalization method with the fully adjusted full-stage procedure in genetic association studies.

Authors: Li-Chu Chien
Journal: PLoS One Date: 2020-06-19 Impact factor: 3.240

8. Integrating comprehensive functional annotations to boost power and accuracy in gene-based association analysis.

Authors: Corbin Quick; Xiaoquan Wen; Gonçalo Abecasis; Michael Boehnke; Hyun Min Kang
Journal: PLoS Genet Date: 2020-12-15 Impact factor: 5.917

9. Gene-based association analysis identifies 190 genes affecting neuroticism.

Authors: Nadezhda M Belonogova; Irina V Zorkoltseva; Yakov A Tsepilov; Tatiana I Axenovich
Journal: Sci Rep Date: 2021-01-28 Impact factor: 4.379

10. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.

Authors: Wei Zhou; Jonas B Nielsen; Lars G Fritsche; Rounak Dey; Maiken E Gabrielsen; Brooke N Wolford; Jonathon LeFaive; Peter VandeHaar; Sarah A Gagliano; Aliya Gifford; Lisa A Bastarache; Wei-Qi Wei; Joshua C Denny; Maoxuan Lin; Kristian Hveem; Hyun Min Kang; Goncalo R Abecasis; Cristen J Willer; Seunggeun Lee
Journal: Nat Genet Date: 2018-08-13 Impact factor: 38.330