DOT: Gene-set analysis by combining decorrelated association statistics.

Olga A Vsevolozhskaya¹, Min Shi², Fengjiao Hu², Dmitri V Zaykin².

Abstract

Historically, the majority of statistical association methods have been designed assuming availability of SNP-level information. However, modern genetic and sequencing data present new challenges to access and sharing of genotype-phenotype datasets, including cost of management, difficulties in consolidation of records across research groups, etc. These issues make methods based on SNP-level summary statistics particularly appealing. The most common form of combining statistics is a sum of SNP-level squared scores, possibly weighted, as in burden tests for rare variants. The overall significance of the resulting statistic is evaluated using its distribution under the null hypothesis. Here, we demonstrate that this basic approach can be substantially improved by decorrelating scores prior to their addition, resulting in remarkable power gains in situations that are most commonly encountered in practice; namely, under heterogeneity of effect sizes and diversity between pairwise LD. In these situations, the power of the traditional test, based on the added squared scores, quickly reaches a ceiling, as the number of variants increases. Thus, the traditional approach does not benefit from information potentially contained in any additional SNPs, while our decorrelation by orthogonal transformation (DOT) method yields steady gain in power. We present theoretical and computational analyses of both approaches, and reveal causes behind sometimes dramatic difference in their respective powers. We showcase DOT by analyzing breast cancer and cleft lip data, in which our method strengthened levels of previously reported associations and implied the possibility of multiple new alleles that jointly confer disease risk.

Year: 2020 PMID: 32287273 PMCID: PMC7182280 DOI: 10.1371/journal.pcbi.1007819

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Methods paper.

Introduction

In recent years, genome-wide association studies (GWAS) have uncovered a wealth of genetic susceptibility variants. The emergence of new statistical approaches for the analysis of GWAS has contributed substantially to that success. The majority of these methods require access to individual-level data, yet methods that require only summary statistics have been developed as well. The rising popularity of summary-based methods for the analysis of genetic associations has been motivated by many factors, among which are the convenience and availability of summary statistics and high statistical power that can often match the power of analysis based on individual records [1-3]. Many types of association tests, including those originally developed for individual-level records, can be presented in terms of added summary statistics. For example, gene set analysis (GSA) tests, or burden and overdispersion tests for rare variants [2, 4, 5], can be written as a weighted sum of summary statistics. In GSA applications, methods based on combined summary statistics can be used to efficiently aggregate information across many potentially associated variants within individual genes, as well as over several genes that may represent a common etiological pathway. When within-gene association statistics (or equivalently, P-values) are being combined, linkage disequilibrium (LD) needs to be accounted for, because LD induces correlation among statistics. The correlation among association test statistics for individual SNPs without covariates is the same as the correlation between alleles at the corresponding SNPs, if the genotype-phenotype relationship is linear. This fact allows one to model a set of statistics using a multivariate normal (MVN) distribution with the correlation matrix equal to the matrix of LD correlations.
More generally, in the presence of covariates correlated with SNPs, MVN correlations among association statistics will depend not only on LD but also on other covariates in the model [6, 7]. When SNPs are coded as 0, 1, 2 values, reflecting the number of copies of the minor allele, the LD matrix of correlations can be obtained from SNP data as the sample correlation matrix. It can also be directly estimated from haplotype frequencies whenever those are available or reported. Specifically, the LD (i.e., the covariance between alleles at SNPs i and j, $D_{ij}$) is defined by the difference between the di-locus haplotype frequency, $P_{ij}$, and the product of the frequencies of the two alleles, $D_{ij} = P_{ij} - p_i p_j$. Then, the correlation between a pair of SNPs is defined as $r_{ij} = D_{ij} / \sqrt{p_i (1 - p_i) p_j (1 - p_j)}$. The di-locus frequency $P_{ij}$ is defined as the sum of frequencies of those haplotypes that carry both of the minor alleles for SNPs i and j. Similarly, the allele frequency $p_i$ is the sum of frequencies of haplotypes that carry the minor allele of SNP i. It is important to distinguish situations in which the LD matrix is estimated using the same data that was used to compute the association statistics from those in which the estimated LD matrix is obtained from a suitable population reference panel. The reference panel approach is implemented in popular web-based association analysis platforms, such as “VEGAS” [8] or “Pascal” [9]. Based on a user-provided list of L SNPs, with the corresponding association P-values, VEGAS queries an online reference panel resource to obtain the matrix of LD correlations. P-values are then transformed to normal scores, $P_i \to Z_i$, i = 1, …, L, and the vector Z is assumed to follow a zero-mean MVN distribution under the null hypothesis of no association.
The individual statistics in VEGAS are then combined as $TQ = \sum_{i=1}^{L} Z_i^2$ (where TQ stands for “Test by Quadratic form”), and the overall SNP-set P-value is derived empirically by simulating a large number (j = 1, …, B) of zero-mean MVN vectors, adding their squared values to obtain statistics $TQ^{(j)}$, and computing the proportion of times when $TQ^{(j)} > TQ$. Statistics similar to TQ are ubiquitous and appear in many proposed tests that aggregate association signals within a genetic region. As exemplified by VEGAS, the distribution of TQ must explicitly incorporate LD. However, an alternative approach that implicitly incorporates LD can be based on first decorrelating the association summary statistics, and then exploiting the resulting independence to evaluate the distribution of the sum of decorrelated statistics, which we call Decorrelation by Orthogonal Transformation (DOT). This general idea is straightforward and has been used in many contexts, including methods that utilize individual records [10]. For instance, Zaykin et al. suggested a variation of this approach for combining P-values (or summary statistics) but did not study the power properties of the method in detail [11]. Here, we propose a new decorrelation-based method for combining single-SNP summary association statistics. We derive theoretical properties of our method and explore the asymptotic power of both DOT and TQ types of statistics. To the best of our knowledge, we are the first to derive the asymptotic distributions of DOT and TQ under the alternative hypothesis. Our results show that decorrelation can provide a surprisingly large power boost in biologically realistic scenarios. However, high statistical power is not the only advantage of the proposed framework. Once statistics are decorrelated, one can tap into a wealth of powerful methods developed for combining independent statistics.
These methods, among others, include approaches that emphasize the strongest signals by combining the top-ranked results [11-16]. Our theoretical analyses also reveal an unexpected result, showing that in many practical settings tests based on the statistic TQ do not gain power with the increase in L (assuming the same pattern of effect sizes for different values of L), while the proposed method steadily gains power under the same conditions. Specifically, the proposed decorrelation method gains power when the effect sizes and/or pairwise LD values become increasingly more heterogeneous. The reasons behind the respective behaviors of tests based on TQ and DOT are explored here theoretically and confirmed via simulations. We further derive power approximations that are useful for understanding power properties of the studied methods. To showcase our method, we evaluate associations between breast cancer susceptibility and SNPs in estrogen receptor alpha (ESR1), fibroblast growth factor receptor 2 (FGFR2), RAD51 homolog B (RAD51B), and TOX high mobility group box family member 3 (TOX3) genes, without access to raw genotype data. We first test for a joint association between SNPs in those four genes and breast cancer risk by decorrelating summary statistics based on the overall LD gene structure. We then describe how to follow up on the joint association results and identify one or more SNPs that drive joint association with disease risk. To further validate the utility of DOT, we also applied it to summary statistics of a recent GWAS of cleft lip with and without cleft palate. Both of our real data analyses confirmed previous associations and revealed new associations, suggesting new potential breast cancer and cleft lip SNP markers.
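The empirical TQ P-value described in the Introduction (sum the squared normal scores, then compare with sums from simulated zero-mean MVN vectors) can be sketched as follows. This is a minimal illustration, not the VEGAS implementation: the function names, the default number of simulations, and the use of a Cholesky factor to generate correlated normal vectors are all assumptions of the sketch.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular C with C C^T = sigma (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i][j] - sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                c[i][j] = math.sqrt(s)
            else:
                c[i][j] = s / c[j][j]
    return c


def tq_statistic(z):
    """TQ: sum of squared SNP-level normal scores."""
    return sum(v * v for v in z)


def tq_empirical_pvalue(z, ld, n_sim=10000, seed=1):
    """Proportion of simulated null TQ values that exceed the observed TQ.

    z  : list of L normal scores (hypothetical input)
    ld : L x L LD correlation matrix as a list of lists (hypothetical input)
    """
    rng = random.Random(seed)
    c = cholesky(ld)
    n = len(z)
    observed = tq_statistic(z)
    exceed = 0
    for _ in range(n_sim):
        # Independent standard normals, then correlate them: sim = C e.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sim = [sum(c[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        if tq_statistic(sim) > observed:
            exceed += 1
    return exceed / n_sim
```

The resolution of the empirical P-value is limited to 1/n_sim, so very small significance levels would need a correspondingly large number of simulated vectors.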

Results

As an introductory example of power analysis, we considered two simulated SNPs and a linear regression model $Y = X\beta + \epsilon$, where X has a bivariate normal distribution, $\beta$ = {0.3, 0}, and $\epsilon$ has a Laplace distribution with unit variance. Thus, in this model Y does not have a normal distribution; however, we expect that the theoretical powers for the TQ and DOT tests, as derived in the “Materials and Methods” section, will match the empirical power. We assumed a sample size of 500. In the first simulation experiment with 10,000 simulated regressions, we assumed the bivariate correlation R = 0.99. Although the two β coefficients are distinct, the mean values of association statistics induced by this model are similar to each other and both are approximately equal to 0.29. These values can be obtained via Eq 2. Our noncentrality analysis in the “Materials and Methods” section suggests that similarity of the mean values may lead to a power advantage for the test TQ. The respective powers of TQ and DOT were 0.87 and 0.80 empirically, and 0.86 and 0.80 by the theoretical calculation. In the second simulation experiment, we lowered R to 0.5. This caused the mean values to become distinct (0.29 and 0.14), and this difference between the two means caused the order of power to change, in agreement with our theoretical analysis. Powers now became 0.72 and 0.80 for TQ and DOT, respectively. In this case, empirical and theoretical powers matched to two digits. There is still a difference in power at R = 0.2 (0.75 vs. 0.80), but in the case R = 0 the two methods are identical. The power of DOT here is constant, which reflects a special case in which only a single SNP has a non-zero effect size and, in addition, all correlations between SNPs are the same. We provide an R script that not only reproduces these results, but is also capable of power analysis with larger correlation matrices, i.e., cases with multiple SNPs.
Correlation matrices are generated as symmetric matrices of random numbers and then converted to positive definite ones using the package “Matrix” [17]. Using this script, we evaluated the type-I error of both methods, assuming α-level 0.05, 10 SNPs, and $\beta$ = 0. We found the type-I error to be close to the nominal level, using 100,000 simulations (0.04815 for DOT and 0.05002 for TQ). We note that the calculations are very fast and that the 100,000 simulation runs were completed in less than ten minutes on a typical laptop. Further, we conducted a different set of extensive simulation experiments to study the statistical power of the proposed method based on the decorrelation statistic DOT, and to compare it to the statistic TQ. We also included a recently proposed method, “ACAT” by Liu and colleagues [18], where association P-values for individual SNPs are transformed to Cauchy-distributed random variables, then added up to obtain the overall P-value. ACAT was included in the comparisons because it has robust power across different models of association. Specifically, Liu et al. found ACAT to be competitive against popular methods, including SKAT and burden tests for rare-variant associations [19-22]. A distinctive feature of ACAT is its good type-I error control in the presence of correlation between P-values, which, interestingly, improves as the α-level becomes smaller, due to its use of a transformation to a moment-free Cauchy distribution. Among other similar approaches is MAGMA [23]. MAGMA analyzes summary association statistics by considering the mean of the chi-square statistic for the SNPs in a gene or the largest statistic among the SNPs in a gene. The mean of statistics method is equivalent to Fisher’s method for combining dependent P-values [24, 25]. The method based on the top chi-square statistic among the SNPs in a gene is equivalent to the Bonferroni correction for dependent tests. There have been extensive studies comparing these two methods [26].
Note that TQ is very similar to the Fisher method. We used two distinct scenarios in our simulation experiments. First, we assumed that the summary statistics and the sample correlation matrix among statistics are estimated from the same data set. This allowed us to validate the power properties derived in “Materials and Methods.” Second, we assumed that the sample correlation LD matrix was obtained from an external reference panel. We included this scenario in our simulations due to the concern that the type-I error rate of the methods considered here may be inflated if the correlation matrix is computed from a separate data set.
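For orientation, the Cauchy combination behind ACAT and the Bonferroni-style min(P) × L adjustment used as comparators can be sketched as below. This is a minimal illustration rather than the reference ACAT implementation: equal weights are assumed by default, the function names are hypothetical, and no special handling is included for P-values extremely close to 0 or 1.

```python
import math


def acat_pvalue(pvalues, weights=None):
    """Combine P-values through a weighted sum of Cauchy-transformed values."""
    n = len(pvalues)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: equal weights
    total = sum(weights)
    # Each P-value is mapped to a standard Cauchy variate under the null.
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvalues)) / total
    # Upper-tail probability of a standard Cauchy variable at t.
    return 0.5 - math.atan(t) / math.pi


def bonferroni_pvalue(pvalues):
    """min(P) x L, truncated at 1."""
    return min(1.0, min(pvalues) * len(pvalues))
```

With a single P-value the Cauchy combination returns that P-value unchanged, which is a quick sanity check on the transformation.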

Simulations assuming that the LD matrix and the summary statistics are obtained from the same data

To compare methods with and without decorrelation of statistics, we considered several distinct settings. In Settings 1–4, the results in each row of the tables were based on one million simulations. Association statistics were simulated directly, namely, a $10^6$ by L matrix of MVN vectors was simulated first, and then each row of the matrix was analyzed by the competing methods. The empirical powers were obtained as the proportion of times that a particular statistic exceeded its critical value at α = 0.05. The decorrelation method (DOT) is expected to gain power as the number of SNPs increases in scenarios where effect sizes vary markedly from SNP to SNP. However, if effect sizes for all SNPs are in fact very close to each other, the power of DOT may decrease. To illustrate this property, our first, purposely contrived simulation setup is one where the induced effect sizes (mean values of statistics) were all non-zero but very close to each other in magnitude, varying uniformly from 2.3 to 2.4 (these are the values of the means of normally distributed standardized statistics). Table 1 shows the results of the simulation study under this setting, in which the decorrelation method was deliberately set up to fail. In the table, the columns labeled “Theor.” provide power calculated from the distribution of the test statistics under the alternative hypothesis that we derived. The columns labeled “Empiric.” provide results based on the empirical evaluation of power by computing P-values under the null. The column labeled “Approx.” provides power calculated from Eq (17). The column labeled γ̄ provides the average noncentrality value.

Table 1

Power comparison of TQ, DOT, and ACAT, assuming very similar effect sizes in magnitude and equicorrelation LD structure with ρ = 0.7.

Number of SNPs (L)	TQ Empiric.	TQ Theor.	TQ Approx.	DOT Empiric.	DOT Theor.	ACAT	γ̄
500	0.802	0.802	0.802	0.090	0.090	0.832	0.02
300	0.801	0.801	0.801	0.101	0.100	0.830	0.03
200	0.801	0.801	0.801	0.112	0.112	0.829	0.04
100	0.799	0.800	0.800	0.144	0.145	0.826	0.08
50	0.798	0.799	0.799	0.196	0.197	0.821	0.16
30	0.795	0.796	0.796	0.253	0.252	0.814	0.26
20	0.794	0.793	0.794	0.307	0.306	0.809	0.39

The table illustrates that our analytical calculations under the alternative hypothesis agree with simulation: the empirical power of both the TQ and DOT statistics matches the analytical calculations nearly exactly. The approximation based on Eq (17) also works well, emphasizing the fact that the distribution of the TQ statistic can be well approximated by a one-degree-of-freedom chi-square distribution. Further, the table confirms that the decorrelation method under-performs relative to TQ if there is very little heterogeneity among effect sizes. However, the power of all methods would increase under lower correlation. For example, for ρ = 0.3 and L = 20, the powers for TQ and DOT become 0.98 and 0.67, respectively. Additional insight into the power behavior of methods under this scenario can be gained by examining Eq (19). The asymptotic power for TQ can be simply computed in R as 1-pchisq(qchisq(1-0.05, df = 1), df = 1, ncp = 2.35^2/0.7). This gives a TQ power of 0.802 as L → ∞ for Table 1 and 0.99 for the situation when ρ is lowered to 0.3. This simple approximation is surprisingly precise and works well for the rest of the settings. Setting 1 is admittedly unrealistic in practice. Furthermore, the table also illustrates that as the average noncentrality value increases, the power of DOT increases as well, while the power of TQ is relatively constant at about 80%. Finally, Table 1 shows that the power of TQ (although higher than that of DOT) does not change with L, highlighting the ceiling property of this method and the fact that combining more SNPs would not lead to higher power of TQ. One of the features of the decorrelation method is that it benefits from heterogeneity in pairwise LD. To illustrate this property, we added jiggle to the equicorrelation matrix as described in the “Materials and Methods” section, while keeping the effect size (mean values of statistics) vector the same as in Setting 1 (within the range of 2.3 to 2.4).
Again, effect sizes were all non-zero. In this second set of simulations, uniformly distributed perturbations (in the range 0 to 5) were added through U, which made the pairwise correlations range from 0.14 to 0.98. Table 2 summarizes the results and once again illustrates the ceiling feature of TQ power. However, the power of the statistic DOT now starts to climb with L, and the proposed test based on DOT eventually becomes more powerful than the one based on TQ. This phenomenon can be explained by examining the eigenvectors of the correlation matrix in Setting 1. When eigenvectors are written in the form of the Helmert eigenvectors, the first contributing DOT statistic is formed as the mean of the original (non-transformed) statistics. The rest of the contributing statistics are weighted sums of the original statistics with weights given by the entries of the (2, …, L) Helmert eigenvectors. However, the structure of each such vector is such that its entries add up to zero (and may contain zeros as well). Thus, when the means are very similar (as in Setting 1), there is cancellation of individual terms when the sum is formed. Moreover, note that although the average noncentrality value does not increase with L, the DOT-test still gains power with L.
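The R expression quoted for the asymptotic TQ power in Setting 1 can be mirrored in Python using only the standard library. The sketch below relies on the fact that a one-degree-of-freedom noncentral chi-square variable is the square of a normal variable with mean sqrt(ncp) and unit variance; the function names are hypothetical, and the noncentrality mean²/ρ is taken directly from the quoted R call.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_1df_power(ncp, alpha=0.05):
    """P(noncentral chi-square with 1 df and noncentrality ncp > critical value)."""
    # Square root of the central 1-df chi-square critical value at level alpha.
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = ncp ** 0.5
    # P(|N(shift, 1)| > crit): upper tail plus lower tail.
    return (1.0 - _STD_NORMAL.cdf(crit - shift)) + _STD_NORMAL.cdf(-crit - shift)


def tq_asymptotic_power(mean_effect, rho, alpha=0.05):
    """Large-L TQ power approximation with noncentrality mean_effect^2 / rho."""
    return chi2_1df_power(mean_effect ** 2 / rho, alpha)
```

For example, tq_asymptotic_power(2.35, 0.7) evaluates to approximately 0.802 and tq_asymptotic_power(2.35, 0.3) to approximately 0.99, in line with the values quoted in the text.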

Table 2

Power comparison of TQ, DOT, and ACAT, assuming very similar effect sizes but heterogeneous LD structure.

Number of SNPs (L)	TQ Empiric.	TQ Theor.	TQ Approx.	DOT Empiric.	DOT Theor.	ACAT	γ̄
500	0.729	0.730	0.726	0.973	0.973	0.793	0.251
300	0.731	0.730	0.726	0.883	0.883	0.791	0.256
200	0.731	0.730	0.726	0.810	0.811	0.789	0.281
100	0.730	0.731	0.726	0.599	0.599	0.786	0.295
50	0.732	0.733	0.728	0.577	0.576	0.782	0.418
30	0.736	0.735	0.729	0.504	0.502	0.778	0.488
20	0.737	0.737	0.731	0.541	0.540	0.776	0.661

This setting is analogous to the equicorrelation scenario in Setting 1, except that the mean values of statistics were lowered: in Setting 1, the range of the mean values was 2.3 to 2.4, while here the range was set to vary uniformly between 1 and 2.3, and effect sizes were all non-zero. Thus, the maximum effect size was lower than that in the previous simulations but the heterogeneity among effect sizes was higher. We emphasize again that while the equicorrelation assumption is unrealistic, it serves as a very useful benchmark scenario that highlights the power behavior and features of the statistics TQ and DOT and allows one to introduce departures from equicorrelation in a controlled manner. Table 3 presents the results. The “Approx.” column in this table was removed and replaced by power values based on a “P-value” approximation to the distribution of TQ as in Eq (16). This switch highlights the idea that both the power and the P-value for the TQ test can be reliably estimated from the one-degree-of-freedom chi-squared approximation. Importantly, Table 3 demonstrates that the power of the DOT-test reaches 100% as L increases (despite the fact that effect sizes were lower than in the previous settings), while the power of the TQ-test stays in the range 51.9 to 52.6%.

Table 3

Power comparison of TQ, DOT, and ACAT, assuming heterogeneity in effect sizes but equicorrelated LD.

Number of SNPs (L)	TQ Empiric.	TQ Theor.	TQ P-approx.	DOT Empiric.	DOT Theor.	ACAT	γ̄
500	0.525	0.525	0.526	1.000	1.000	0.626	0.479
300	0.526	0.525	0.526	1.000	0.999	0.624	0.486
200	0.526	0.525	0.524	0.993	0.993	0.622	0.494
100	0.525	0.524	0.524	0.919	0.920	0.616	0.518
50	0.522	0.523	0.522	0.762	0.762	0.607	0.566
30	0.521	0.521	0.521	0.648	0.648	0.599	0.630
20	0.519	0.519	0.520	0.578	0.579	0.592	0.709

This setting is similar to the scenario in Setting 2, except that we allowed higher heterogeneity in pairwise LD values. Effect sizes were all non-zero. LD was constructed as a perturbation of the equicorrelation matrix (as described in “Materials and Methods”), with U set to be a random sequence on the interval from -5 to 5. This resulted in LD values ranging from -0.93 to 0.99. The effect sizes (mean values of statistics) were sampled randomly within each simulation from the interval (-0.15, 0.15). Table 4 presents the results and shows that in this setting, the power of DOT is dramatically higher than that of TQ and ACAT. In fact, power values for the TQ and ACAT tests barely exceed the type-I error, while the power of the decorrelation method increases with L, eventually exceeding 90%.

Table 4

Power comparison of TQ, DOT, and ACAT with effect sizes randomly sampled from -0.15 to 0.15 and heterogeneous LD.

Number of SNPs (L)	TQ Empiric.	TQ Theor.	TQ P-approx.	DOT Empiric.	DOT Theor.	ACAT	γ̄
500	0.0500	0.0503	0.0508	0.9226	0.9222	0.0564	0.2118
300	0.0506	0.0503	0.0509	0.7688	0.7689	0.0570	0.2107
200	0.0504	0.0503	0.0508	0.5970	0.5967	0.0570	0.2025
100	0.0504	0.0503	0.0509	0.3040	0.3038	0.0568	0.1655
50	0.0502	0.0503	0.0508	0.3074	0.3070	0.0555	0.2397
30	0.0505	0.0503	0.0507	0.1485	0.1487	0.0562	0.1527
20	0.0501	0.0503	0.0508	0.1191	0.1189	0.0557	0.1399

In these sets of simulations we used biologically realistic patterns of LD. Also, rather than specifying mean values of association statistics directly, we utilized a regression model for the effect sizes, as described in Eqs (1) and (2). Details of these simulations are given in “LD patterns from the 1000 Genome Project” in “Materials and Methods.” We re-iterate that when association of SNPs with a trait is present (under the alternative hypothesis), the correlation among statistics is not equal to LD, because it also has to incorporate effect sizes, as illustrated by Eq (5). This point is important if one wants to simulate statistics directly from the MVN distribution rather than computing them based on simulated data followed by regression. The results are presented in Table 5. Columns labeled “Regr.” represent scenarios, in which data were generated and statistics were computed. Columns labeled “MVN” represent scenarios, in which statistics were simulated directly. The rows of Table 5 show power values for three different α-levels. We expected the power values in “Regr.” and “MVN” columns to match, and they do, highlighting another utility of our analytical derivation of the distribution of the test statistic under the alternative hypothesis. That is, using our results, one can significantly reduce computational and programming burden in genetic simulations. Also note that power values in Table 5 do not decrease as α-level becomes smaller (Settings 6 and 7). This is due to the fact that we deliberately discarded effect size and LD configurations where power was expected to be too low, because we wanted to assure a good range of power values across methods.

Table 5

Power comparison of TQ, DOT, and ACAT using realistic LD patterns from 1000 Genomes project.

Setting (α-level)	TQ Theor.	TQ Approx.	TQ Regr.	TQ MVN	DOT Theor.	DOT Regr.	DOT MVN	ACAT
Setting 5 (α = 10⁻³)	0.34	0.34	0.34	0.34	0.60	0.60	0.60	0.40
Setting 6 (α = 10⁻⁴)	0.42	0.42	0.42	0.43	0.77	0.77	0.77	0.43
Setting 7 (α = 10⁻⁷)	0.24	0.24	0.24	0.24	0.76	0.76	0.76	0.18

As in previous simulations, power values of TQ and ACAT are similar. The power approximation by Eq (17) remains close to the predicted theoretical power, as well as to empirically estimated powers. We also observed that the power of the decorrelation test, DOT, is substantially higher than the power of either TQ or ACAT. Patterns of LD and effect sizes in Settings 1–4 are not necessarily realistic biologically; however, they serve as benchmark scenarios that help to understand and highlight differences in the respective statistical power of the methods. Simulations for Settings 1–4 were performed at the 5% α-level based on 2 × 10⁶ evaluations. Settings 5–7 used realistic patterns of LD derived from the 1000 Genomes Project data. Test sizes varied from 0.001 to 10⁻⁷, with at least 10,000 simulations for power estimates. Type-I error rates were well controlled for TQ and DOT. However, as noted by Liu et al., because the ACAT P-value is approximate, the null distribution of its statistic is evaluated under independence, and we found that at the nominal 5% α-level, the type-I error for ACAT was somewhat higher and could reach 7% for some correlation settings. Nonetheless, the advantage of ACAT is that the approximation improves as the α-level becomes smaller.

Simulations assuming that the correlation matrix is estimated using external data

When only summary statistics are available, the correlation matrix Σ can be estimated from a reference panel of genotyped individuals. However, the type-I error of tests based on both TQ and DOT may potentially be affected by substituting the sample estimate with an estimate obtained from external data. To study the effect of this mis-specification on the type-I error, we conducted a separate set of simulations. In these experiments, we again utilized LD structures derived from the 1000 Genomes Project data. Reference panels for these simulations were obtained as follows. Each LD matrix derived from real data was assumed to represent the population matrix. Next, a sample was drawn, and the corresponding sample LD matrix was calculated. That matrix should have been used for calculations of the gene-based test statistics. Instead, we drew a separate sample of size N, assuming the same population LD matrix. In the calculation of the tests, that sample correlation matrix was used in place of the correct one. The type-I error rates, given in Tables 6–8, show that both ACAT and TQ have close to the nominal type-I error rates, but the error rate for the decorrelation method (DOT) can be inflated, unless the sample size of the reference panel is 50 to 100 times larger than the number of SNPs (L). For the statistic DOT, the type-I error rates appear to be more inflated at smaller α-levels, such as 10⁻⁷. Power values for TQ are not shown; however, they closely followed the predicted theoretical power for the scenarios where the same data are used for both LD estimation and computation of association statistics. There was only a 1 to 2% drop in power when the size of the panel was only 2 to 5 times larger than L.
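The reference-panel estimate of LD discussed above amounts to a sample correlation matrix computed from 0/1/2 minor-allele counts. A minimal sketch follows; the function name and the row-per-individual layout of the genotype table are assumptions made for illustration, and a panel used in practice would need many more individuals than SNPs, as the simulation results indicate.

```python
import math


def sample_ld_matrix(genotypes):
    """Sample correlation matrix of SNP codes.

    genotypes: list of rows, one per individual in the reference panel,
               each row a list of 0/1/2 minor-allele counts per SNP.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(n_snps)]
    centered = [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]
    # Sum of squared deviations for each SNP.
    ss = [sum(r[j] ** 2 for r in centered) for j in range(n_snps)]
    if any(v == 0.0 for v in ss):
        raise ValueError("monomorphic SNP: correlation is undefined")
    ld = [[0.0] * n_snps for _ in range(n_snps)]
    for j in range(n_snps):
        for k in range(j, n_snps):
            cross = sum(r[j] * r[k] for r in centered)
            ld[j][k] = ld[k][j] = cross / math.sqrt(ss[j] * ss[k])
    return ld
```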

Table 6

Type-I error rates (α = 10⁻³) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

Sample size	TQ	DOT	ACAT
N = 5L	1 × 10⁻³	3 × 10⁻³	1 × 10⁻³
N = 10L	1 × 10⁻³	3 × 10⁻³	1 × 10⁻³
N = 50L	1 × 10⁻³	2 × 10⁻³	1 × 10⁻³
N = 100L	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻³

Table 8

Type-I error rates (α = 10⁻⁷) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

Sample size	TQ	DOT	ACAT
N = 5L	2 × 10⁻⁷	3 × 10⁻⁴	1 × 10⁻⁷
N = 10L	2 × 10⁻⁷	2 × 10⁻⁴	1 × 10⁻⁷
N = 50L	2 × 10⁻⁷	2 × 10⁻⁴	1 × 10⁻⁷
N = 100L	2 × 10⁻⁷	1 × 10⁻⁴	1 × 10⁻⁷

Table 7

Type-I error rates (α = 10⁻⁴) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

Combining breast cancer association statistics within candidate genes

We applied our decorrelation method to a family-based GWAS study of breast cancer [27, 28]. The data set comprised complete trios, i.e., families where genotypes of both parents and the affected offspring were available. With complete trios, previously reported statistics become equivalent to statistics from the transmission-disequilibrium test, and the correlation among them is expected to follow the LD among SNPs [8]. We selected four candidate genes (TOX3, ESR1, FGFR2 and RAD51B), for which Shi et al. [27] and O’Brien et al. [28] replicated several previously reported risk SNPs in relation to breast cancer. For the joint association, we restricted our analysis to blocks of SNPs surrounding breast cancer risk variants that were previously reported in the literature. Specifically, we selected TOX3 rs4784220 [29], ESR1 rs3020314 [30, 31], FGFR2 rs2981579 [29], and RAD51B rs999737 [32-34], and then included blocks of SNPs around these ‘anchor’ risk variants with an LD correlation of at least 0.25. These blocks included 13 SNPs around rs4784220, 36 SNPs around rs3020314, 18 SNPs around rs2981579, and 30 SNPs around rs999737. As an illustration, Fig 1 displays the 81 SNP P-values that were available for the ESR1 gene. The vertical dashed line marks the position of the ‘anchor’ rs3020314, the red dots mark the 36 SNPs within the LD block of rs3020314, and the LD matrix displays the sample correlation matrix among those 36 SNPs. Once SNP blocks were identified for each gene, we applied four combination methods to assess their association with breast cancer.

Fig 1

Overview of DOT method in application to breast cancer data.

We compute a gene-level score by first decorrelating SNP P-values using the order-invariant matrix H and then calculating the sum of independent chi-squared statistics. We utilize our DOT method to obtain a gene-level P-value. In the breast cancer data application, we chose an anchor SNP, that is, a SNP that has previously been reported as a risk variant (highlighted by a vertical dashed line), and then combined SNPs in an LD block with the anchor SNP by DOT. SNP-level P-values highlighted in red are those in moderate to high LD with the anchor SNP.

Table 9 presents the joint association analysis results. The first row of Table 9 shows P-values for the association between the LD block of 13 SNPs in the TOX3 region and breast cancer, derived from 1277 Caucasian triads. All methods conclude a statistically significant link, with our decorrelation method giving the smallest P-value. The third row of Table 9 shows joint association P-values for the LD block of 18 SNPs in FGFR2. Three out of four methods conclude an association at the 5% level, with the DOT approach, once again, providing the most significant result. We note that the last column of Table 9 gives the Bonferroni-style adjustment, which is expected to be more conservative than the combination tests. Thus, it is not surprising that, out of the four methods considered, the Bonferroni method failed to conclude an association. Lastly, the second and the fourth rows of Table 9 provide joint association P-values for the LD blocks in ESR1 and RAD51B, respectively. For both ESR1 and RAD51B, our decorrelation approach was the only one that concluded a statistically significant association between the SNP set in those genes and breast cancer.
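The gene-level computation summarized above (decorrelate the SNP-level statistics, sum their squares, refer the sum to a chi-square distribution with L degrees of freedom) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the source decorrelates with an order-invariant matrix H, whereas this sketch substitutes a Cholesky factor to stay within the standard library. Any transformation that whitens Z yields the same sum of squares, Z'Σ⁻¹Z, but the individual decorrelated components here differ from the paper's X. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def cholesky(sigma):
    """Lower-triangular C with C C^T = sigma (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i][j] - sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                c[i][j] = math.sqrt(s)
            else:
                c[i][j] = s / c[j][j]
    return c


def whiten(z, ld):
    """Solve C x = z by forward substitution, so x has identity covariance."""
    c = cholesky(ld)
    x = []
    for i in range(len(z)):
        s = z[i] - sum(c[i][k] * x[k] for k in range(i))
        x.append(s / c[i][i])
    return x


def chi2_sf(x, df):
    """Upper-tail probability of a central chi-square (closed forms for integer df)."""
    if x <= 0.0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2.0) / j
            total += term
        return math.exp(-x / 2.0) * total
    nd = NormalDist()
    root = math.sqrt(x)
    term, total = root, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return 2.0 * nd.cdf(-root) + 2.0 * nd.pdf(root) * total


def dot_pvalue(z, ld):
    """Return (sum of squared decorrelated statistics, chi-square P-value with L df)."""
    x = whiten(z, ld)
    stat = sum(v * v for v in x)
    return stat, chi2_sf(stat, len(z))
```

As a small worked case, take two SNPs with LD correlation 0.5. For Z = (2, 2) the decorrelated sum of squares is 16/3, while for Z = (2, -2) it is 16, even though the plain sum of squares is 8 in both cases. This is the kind of heterogeneity between effects and LD that the decorrelated statistic rewards.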

Table 9

Breast cancer candidate gene association P-values.

Gene	TQ	DOT	ACAT	min(P) × L
TOX3/rs4784220 [29] (L = 13)	0.0005	0.0004	0.001	0.001
ESR1/rs3020314 [30, 31] (L = 36)	0.20	0.0001	0.19	0.96
FGFR2/rs2981579 [29] (L = 18)	0.01	0.003	0.01	0.07
RAD51B/rs999737 [32–34] (L = 30)	0.56	0.009	0.76	1

Table 10 details a list of top SNPs that are associated with breast cancer within the selected candidate genes. The top ranked SNPs were identified by considering the top three components in the linear combination $\mathrm{DOT} = \sum_{i=1}^{L} X_i^2$, where X’s are the decorrelated summary statistics. Once the highest three values of $X_i^2$ were identified for each gene, we considered the individual components of $X_i = \sum_{j} h_{ij} Z_j$, which are formed as a linear combination of the original statistics weighted by the elements of matrix H. The top individual components $h_{ij} Z_j$ (with the same sign as $X_i$) corresponded to the individual SNPs presented in Table 10.
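The ranking procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `top_contributors`, its arguments, and the cutoff `n_snps` (how many SNPs to report per component) are assumptions made for the example, since the text only states that SNPs with large same-sign terms were reported.

```python
import numpy as np

def top_contributors(z, Sigma, snp_ids, n_components=3, n_snps=3):
    """Rank SNPs by their contribution to the largest decorrelated components.

    z        : vector of SNP-level association statistics (Z-scores)
    Sigma    : their correlation (LD) matrix
    snp_ids  : labels for the SNPs, in the same order as z
    n_snps   : hypothetical cutoff on how many SNPs to list per component
    """
    z = np.asarray(z, dtype=float)
    lam, E = np.linalg.eigh(Sigma)                 # Sigma = E diag(lam) E'
    H = E @ np.diag(1.0 / np.sqrt(lam)) @ E.T      # symmetric decorrelating matrix
    x = H @ z                                      # decorrelated statistics
    # Indices of the largest squared decorrelated statistics X_i^2.
    order = np.argsort(x ** 2)[::-1][:n_components]
    results = []
    for i in order:
        terms = H[i] * z                           # individual terms h_ij * Z_j of X_i
        same_sign = np.where(np.sign(terms) == np.sign(x[i]))[0]
        ranked = same_sign[np.argsort(np.abs(terms[same_sign]))[::-1]][:n_snps]
        results.append((int(i), float(x[i] ** 2), [snp_ids[j] for j in ranked]))
    return results
```

Each returned tuple holds the component index, its squared value, and the SNPs whose terms push that component furthest in its own direction. As the Discussion notes, a large weight may point to a proxy in LD with a causal variant rather than to the causal variant itself.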

Table 10

Breast cancer SNPs identified by DOT in the analysis of GWAS data.

Gene	Number of SNPs in analysis (L)	rs number	Reference
TOX3	13	rs4784220	This SNP was previously reported in the literature to be associated with breast cancer [29, 35].
		rs8046979	This SNP was also linked to breast cancer [29].
		rs43143	A new association with susceptibility to breast cancer.
ESR1	36	rs2347867	This SNP was previously reported to be involved in breast cancer risk [36, 37].
		rs985191	This SNP was previously reported to be associated with endocrine therapy efficacy in breast cancer [38], as well as with the overall breast cancer risk [39].
		rs3003921	A new association with susceptibility to breast cancer. This SNP was previously linked to the effectiveness of androgen deprivation therapy among prostate cancer patients [40].
		rs985695	A new association with susceptibility to breast cancer.
		rs2982689	A new association with susceptibility to breast cancer.
		rs3020424	A new association with susceptibility to breast cancer.
		rs926777	A new association with susceptibility to breast cancer.
FGFR2	18	rs1219648	This SNP was previously reported to be associated with premenopausal breast cancer [41] and the overall breast cancer risk [42–45].
		rs2860197	This SNP was previously suggested to have an association with breast cancer [46].
		rs2981582	This SNP was previously reported in the literature to be associated with breast cancer [43, 47–49].
		rs3135730	This SNP was previously suggested to have an interaction between oral contraceptive use and breast cancer [50].
		rs2981427	A new association with susceptibility to breast cancer.
RAD51B	30	rs999737	This SNP was previously reported in the literature to be associated with breast cancer [32–34, 51, 52].
		rs8016149	This SNP was previously suggested to have an association with breast cancer [53].
		rs1023529	This SNP has been patented as one of susceptibility variants of breast cancer [54].
		rs2189517	This SNP was shown to be associated with breast cancer in a Chinese population [55].
		rs7359088	A new association with susceptibility to breast cancer.

For the LD block in the TOX3 gene, the top three individual X’s in the DOT statistic were each formed by assigning a very large weight to a single SNP: the largest value, X(1), was formed by assigning a large weight to the rs4784220 statistic; the second largest value, X(2), by assigning a large weight to the rs8046979 statistic; and the third largest value, X(3), by assigning a large weight to the rs43143 statistic. The first few rows of Table 10 detail these results and identify rs43143 as a new possible association with breast cancer. For the LD block in the ESR1 gene, the top X’s were quite different. Specifically, the largest value, X(1), was formed as a linear combination of 6 SNPs that were all assigned large weights. These 6 SNPs were rs2982689/rs3020424/rs985695/rs2347867/rs3003921/rs985191. The second highest linear combination, X(2), was formed by assigning high weights to 5 of the 6 SNPs listed above: rs2982689/rs3020424/rs985695/rs2347867/rs3003921. We note that X(1) and X(2) had opposite signs, which is why it was possible for the same set of SNPs to be prioritized. Finally, the third largest value, X(3), also prioritized the same set of SNPs, with the single new addition of rs926777. Table 10 provides a detailed discussion of these SNPs and identifies rs3003921/rs985695/rs2982689/rs3020424 and rs926777 as new possible associations with breast cancer. Finally, for the LD blocks in FGFR2 and RAD51B, we repeated the procedure detailed above and also identified top-ranking SNPs. Table 10 reviews these results and points to FGFR2 rs2981427 and RAD51B rs7359088 as two additional newly found associations.

Combining cleft lip association statistics within candidate genes

To further validate the utility of DOT, we applied it to summary statistics of a recent GWAS of cleft lip with and without cleft palate [56]. Summary statistics were based on the transmission-disequilibrium test on autosomal SNPs in 1908 case-parent trios of European and Asian ancestry. We selected four genetic regions (ABCA4, chr. 8q24, IRF6, and MAFB) that were prioritized by Beaty et al. [56] for gene-based analysis. Anchor SNPs were chosen based on significant risk markers previously reported in the literature. Specifically, rs560426 was chosen as an anchor for the ABCA4 region [57] and formed an anchor block of L = 30 SNPs; rs987525 for chr. 8q24 [58] with L = 29 SNPs in a block; rs10863790 for IRF6 [59] with L = 6 SNPs in a block; and rs13041247 for MAFB [60] with L = 14 SNPs in a block. Table 11 provides a summary of gene-based P-values and indicates that all four combination methods concluded significant associations. Results in Table 11 can also be viewed as a gauge of the relative power of the four combination methods. As such, Table 11 confirms that DOT may result in smaller P-values than those of competitors: DOT gave the smallest P-value in three of the four regions, with MAFB being the exception.

Table 11

Cleft lip candidate gene association P-values.

Gene	TQ	DOT	ACAT	min(P) × L
ABCA4/rs560426 [57] (L = 30)	8.9 × 10⁻⁸	1.3 × 10⁻¹³	7.2 × 10⁻¹¹	7.2 × 10⁻¹¹
chr. 8q24/rs987525 [58] (L = 29)	1.0 × 10⁻⁹	8.7 × 10⁻²²	4.7 × 10⁻¹⁵	3.2 × 10⁻¹⁵
IRF6/rs10863790 [59] (L = 6)	4.7 × 10⁻⁹	1.8 × 10⁻¹⁹	2.1 × 10⁻¹⁴	2.1 × 10⁻¹⁴
MAFB/rs13041247 [60] (L = 14)	1.5 × 10⁻⁸	2.9 × 10⁻⁸	2.4 × 10⁻¹¹	3.6 × 10⁻¹¹

Table 12 details a list of top SNPs that were associated with non-syndromic cleft lip with or without cleft palate within the four genetic regions. For the LD block around rs560426 in the ABCA4 gene, X(1) was formed by assigning large weights to two SNPs (rs4847196/rs563429), both of which were previously considered in association with cleft lip but were found to be not statistically significant [56]. The second highest DOT linear combination, X(2), prioritized the same two SNPs (rs4847196/rs563429), thus reinforcing the idea that these two markers may be genuinely associated with cleft lip. The third highest linear combination, X(3), was formed by assigning high weights to rs2275035 and rs546550, the former of which was recently identified to be associated with orofacial clefting [61], while the latter may be a new association with cleft lip.

Table 12

Cleft SNPs identified by DOT in the analysis of GWAS data.

Gene	Number of SNPs in analysis (L)	rs number	Reference
ABCA4	30	rs4847196	This SNP was previously studied in connection to cleft lip [56] but the association was found to be not statistically significant.
		rs563429	This SNP was also previously considered in association with cleft lip [56] but found to be not statistically significant.
		rs2275035	Was recently identified to be associated with orofacial clefting [61].
		rs546550	A new association with susceptibility to cleft lip. This SNP was previously suggested to be linked to esophageal cancer [62].
chr. 8q24	29	rs987525	One of the top results was the anchor SNP [58].
		rs882083	Was previously suggested to be associated with cleft lip [56, 58].
		rs1157136	Was previously suggested to be associated with cleft lip in Brazilian population [63].
		rs12548036	Was previously studied in connection to susceptibility to cleft lip in Japanese population [64] but the association was found to be not statistically significant.
		rs1530300	Was previously suggested to be associated with cleft lip in Brazilian population [57] and Brazilian population with high African ancestry [65].
		rs12547241	A new association with susceptibility to cleft lip.
IRF6	6	rs10863790	One of the top contributions was the anchor SNP [59].
		rs861020	Was previously reported to be associated with cleft lip [59, 66, 67].
		rs2236906	Was considered to be associated with cleft lip in a Kenya African Cohort [68] and in general population [69].
		rs2073485	Was reported to be associated with cleft lip in Western China [70] and Taiwanese population [71].
MAFB	14	rs11696257	Was previously reported to be associated with cleft lip [56, 72].
		rs6102085	Was previously reported to be associated with cleft lip in Han Chinese population [73].
		rs6065259	Was previously reported to be associated with cleft lip in a population in Heilongjiang Province, northern China [74].
		rs6102074	Was previously reported to be associated with cleft lip in Han Chinese population [73, 75].

For the LD block in the chr. 8q24 region, X(1) was formed by assigning a large weight to the anchor SNP (rs987525). X(2) prioritized two SNPs: rs882083, which was already suggested to be associated with cleft lip [56, 58], and rs12547241, which may be a new risk marker. Finally, X(3) prioritized a set of three SNPs (rs1157136/rs12548036/rs1530300), all of which were previously studied in connection to cleft lip [57, 63–65]. For the last two LD blocks considered (IRF6 and MAFB genes), Table 12 details a list of top SNP contributors to the DOT statistic. In brief, all of the prioritized SNPs were previously reported in association with cleft lip.

Discussion

In this research, we have proposed a new powerful decorrelation-based approach (DOT) for combining SNP-level summary statistics (or, equivalently, P-values) and derived its theoretical power properties. To the best of our knowledge, we were the first to derive analytical properties of the traditional approach, TQ (e.g., as implemented in VEGAS), as well as of the DOT, with the help of new theory that incorporates effect sizes of SNPs into mean values of association statistics and correlations among them. Through extensive simulation studies, we have demonstrated that our decorrelation approach is a powerful addition to the tools available for studying genetic susceptibility to disease. Our analysis of breast cancer and cleft lip data illustrates unique properties of DOT. Our results revealed novel potential associations within candidate genes that would not have been found by previously proposed methods. These novel SNPs were identified by examining the top three linear-combination contributors to the overall value of the DOT statistic. We note that the top contributions may give large weights to genetic variants that are truly associated with the outcome or to SNPs in high positive LD with true causal variants. Caution is needed when interpreting such results because our method cannot distinguish between causal and proxy associations. Further studies would be needed to confirm these findings. The most important feature of the proposed method is that it may provide a substantial power boost across diverse settings, where the power gain is amplified by heterogeneity of effect sizes and by increased diversity between pairwise LD values. Genetic architecture of complex traits is far from being homogeneous, making our method applicable in various settings. We have developed new theory to explain this unexpected and remarkable boost in power. 
This theory allows one to predict the behavior of the tests in simulations with high accuracy and to explain unexpected scenarios, where the decorrelation method may give dramatically higher power compared to the traditional approach. Yet, there are important precautions to the decorrelation approach. When reference panel data are used to provide the LD information and, more generally, correlation estimates for all predictors, including SNPs and covariates, $\hat{\Sigma}$, the sample size of the external data should be several times larger than the number of predictors. Ideally, the same data set should be used to obtain association statistics, as well as $\hat{\Sigma}$. Nevertheless, association statistics and $\hat{\Sigma}$ are compact summaries of data and are much more easily transferred between separate research groups than raw data, due to privacy considerations and the potentially large size of the raw data sets. Also, caution is needed if missing data are present in the original data set because the estimate ($\hat{\Sigma}$) may no longer reflect the sample correlation between predictors. Imputation of missing values is a suitable solution, if missing values are independent of the outcome. With the usage of reference panel data, the type-I error inflation for the statistic DOT can be affected by many factors, and this statistic is expected to be sensitive not only to the size of a reference panel, but also to population variations in LD, especially for highly correlated blocks of SNPs. Overall, it appears to be difficult to give specific recommendations, except that the reference panel size has to be at least 50 times larger than the number of SNPs to be combined. Therefore, we recommend limiting applications of the decorrelation method to situations where the LD matrix is obtained from the same data set as the summary statistics. Note that all pairwise LD values can be obtained from sample haplotype frequencies of SNPs, thus the LD matrix can be reconstructed. 
The utility of this approach remains to be investigated. In particular, one concern is that the correlation between the SNP values reflects the composite disequilibrium values [76], while frequencies of sample haplotypes are often reported following likelihood maximization, e.g., by the EM algorithm. An important issue that still remains to be investigated is a systematic analysis of the performance of our method utilizing real genome-wide data. Such an analysis would allow a more thorough assessment of both the type-I error rate and the power to detect genetic regions already implicated in susceptibility to disease. In our simulations, the recently proposed method ACAT and the test based on the distribution of the sum of correlated association statistics (VEGAS, or TQ) had similar power. In many situations, the power of these two tests was substantially lower than that of the DOT. The main advantage of ACAT is that it does not require any LD information. Our theory and simulations also revealed previously unknown robustness of the TQ method with respect to LD mis-specification: the method is valid and remains nearly as powerful when the sample LD matrix is substituted by a single value, summarizing the extent of all pairwise correlations. TQ also remains valid when the LD summary is obtained from a representative reference panel. We stress again that, compared to ACAT and TQ, our method’s limitation is that, in order to avoid possible bias, the LD information and the summary statistics should ideally come from the same data set and missing genotypes should be imputed prior to its application. In general, one should avoid utilization of external data as a source of LD information, as well as high rates of unimputed missing genotypes. Although not pursued here, a possible way to improve robustness of the DOT is to merge it with ACAT, that is, decorrelate the summary statistics first, convert the results to P-values and then combine them with ACAT.
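The DOT and ACAT merger suggested at the end of the Discussion was not pursued in the paper, so the following Python sketch is only one possible reading of that suggestion. The function names `acat` and `dot_then_acat` are hypothetical, and equal ACAT weights are an assumption. The ACAT combination uses the standard Cauchy-transform form, in which each P-value is mapped to tan((0.5 − p)π) and the weighted sum is referred to a standard Cauchy distribution.

```python
import numpy as np
from scipy.stats import chi2

def acat(pvals, weights=None):
    """Cauchy combination of P-values (equal weights unless given)."""
    p = np.asarray(pvals, dtype=float)
    if weights is None:
        w = np.full(p.shape, 1.0 / p.size)
    else:
        w = np.asarray(weights, dtype=float) / np.sum(weights)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))      # weighted Cauchy statistic
    return 0.5 - np.arctan(t) / np.pi              # upper tail of standard Cauchy

def dot_then_acat(z, Sigma):
    """Decorrelate statistics, convert to P-values, then combine with ACAT."""
    z = np.asarray(z, dtype=float)
    lam, E = np.linalg.eigh(Sigma)
    H = E @ np.diag(1.0 / np.sqrt(lam)) @ E.T      # symmetric decorrelating matrix
    x = H @ z                                      # decorrelated statistics
    p = chi2.sf(x ** 2, df=1)                      # one P-value per decorrelated statistic
    return acat(p)
```

Whether this hybrid actually improves robustness to LD mis-specification is exactly the open question raised in the text; the sketch only shows the order of operations.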

Materials and methods

Genetic association tests based on summary statistics are often presented as a weighted sum [2, 4]. Let $w_i$ denote the weight assigned to the i-th individual statistic. The weighted statistics can then be defined as $Y = WZ$ with $Z \sim \mathrm{MVN}(\mu, \Sigma)$ and $Y \sim \mathrm{MVN}(\mu_Y, \Sigma_Y)$, where $\mu_Y = W\mu$, $\Sigma_Y = W\Sigma W$, and $W = \mathrm{diag}(w_1, \dots, w_L)$. The statistics $Z_i^2$ are marginally distributed as one degree of freedom chi-square variables with noncentralities $\mu_i^2$. The overall statistic is then typically defined as $TQ = \sum_{i=1}^{L} Y_i^2 = Y'Y$.
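As a rough illustration of how the significance of such a sum can be evaluated, the sketch below approximates the null distribution of TQ by Monte Carlo, using the fact (derived later in this section) that TQ is distributed as a weighted sum of one degree of freedom chi-squares with weights equal to the eigenvalues of WΣW. The function name `tq_pvalue_mc`, the number of draws, and the use of Monte Carlo in place of an exact numerical method are all assumptions of this example, not the procedure used in the paper.

```python
import numpy as np

def tq_pvalue_mc(z, Sigma, weights=None, n_draws=200_000, seed=0):
    """Monte Carlo P-value for TQ = Y'Y with Y = W Z under the null hypothesis."""
    z = np.asarray(z, dtype=float)
    w = np.ones_like(z) if weights is None else np.asarray(weights, dtype=float)
    y = w * z                                       # weighted statistics Y = W Z
    Sigma_y = np.asarray(Sigma) * np.outer(w, w)    # W Sigma W for diagonal W
    lam = np.clip(np.linalg.eigvalsh(Sigma_y), 0.0, None)
    rng = np.random.default_rng(seed)
    # Null draws of sum_i lam_i * chi2_1.
    null_draws = rng.chisquare(1, size=(n_draws, lam.size)) @ lam
    tq = float(y @ y)
    p = (1 + np.sum(null_draws >= tq)) / (n_draws + 1)
    return tq, p
```

Monte Carlo resolution is limited to roughly 1/n_draws, so P-values as small as those in Table 11 would require an exact or saddlepoint-type evaluation of the weighted chi-square tail instead.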

Joint distribution of association summary statistics

In this section, we derive the parameters $\mu$ and $\Sigma$ of the joint MVN distribution of summary statistics. Under the null hypothesis, when none of the SNPs are associated with an outcome, $\mu = 0$. If individual SNP models do not include covariates, Σ equals the LD matrix, i.e., the correlation matrix between the SNP values coded as 0, 1, or 2, reflecting the number of minor alleles in a genotype. In the presence of covariates, Σ is a Schur complement of the submatrix of the matrix of all predictor variables [6]. That is, the estimated correlation between association statistics can be obtained by inverting the covariance or correlation matrix of all predictors, selecting the SNP submatrix, inverting it back, and standardizing the result to correlation. Under the alternative hypothesis, when some SNPs are associated with a trait y, let $\beta_j$ be the regression coefficient for the j-th SNP. Then, a typical linear model that determines the trait value is defined as:

$$y = \sum_{j=1}^{L} \beta_j\,\mathrm{SNP}_j + \epsilon, \qquad (1)$$

where ϵ ∼ N(0, 1). The mean value of the summary statistics (i.e., noncentralities) can be expressed as:

[Eq (2)]

where $\Sigma_j$ is the j-th column of Σ, $b_j = \mathrm{cor}(y, \mathrm{SNP}_j)$ and N is the sample size. An intuitive explanation of Eq (2) can be gained by considering the case of independent predictors, i.e., Σ = I. If both the outcome and the set of predictors are standardized, then $b_j = \beta_j$, which is a standardized regression coefficient. We note that Eq (2) is valid outside of the linear model settings. For example, consider a latent variable model, where the continuous unobserved (latent) variable y is linear in predictors according to Eq (1), and the observed variable (disease status) equals 1 whenever y > l and 0 otherwise, where l is some threshold. 
When such a binary outcome is analyzed by logistic regression, a good approximation to the noncentrality values will be:

[Eq (3)]

If error terms ϵ are assumed to be normally distributed, the reduction in correlation due to dichotomization, by the factor d, can be expressed through ϕ(⋅) and Φ(⋅), the probability density and the cumulative distribution functions of the standard normal distribution [77]. Under association, surprisingly, the correlation matrix between statistics is no longer Σ. Let $\sigma_{ij}$ be the i, j-th element of Σ, and ρ be correlations between predictors and the outcome. By using the multivariate delta method, we derived the i, j-th element of the correlation matrix as follows:

[Eq (5)]

Details of the derivation of these equations are given in [78]. An alternative derivation of the asymptotic covariance that includes the first two terms of Eq (5) has been given by Reshef et al. [79], assuming Gaussian genotypes, an assumption justifiable provided that there is a lower bound for minor allele frequency relative to sample size. Note that when some of the SNP pairs (i, j) are associated, summary statistics may become correlated even if there is no LD between the SNPs, due to the last term, $-b_i b_j$, in Eq (5). Eqs (2), (3), (4) and (5) allow one to study power properties of the methods based on sums of association statistics, as well as to design realistic simulation experiments, where summary statistics can be sampled directly from the MVN distribution under the alternative hypothesis. That is, given effect sizes and the correlation matrix among predictors, statistics can be immediately sampled from the MVN distribution. This approach avoids both the data-generating step and the subsequent computation of summary statistics from that data, leading to a substantial gain in computation time. In certain situations, the difference in speed can be dramatic. For example, it is not trivial to simulate discrete (genotype) data given a specific LD matrix. 
Current state-of-the-art methods tend to be slow, because they rely on ad hoc iterative techniques, such as generation of multiple random “proposal” data sets to fit the target correlation matrix [80]. The simulation experiments presented here were based on effect sizes specified via the linear model (Eq 1). However, we verified (not presented here) the validity of the proposed theory assuming logistic, probit, and Poisson regression models. We also note that Conneely et al. presented theoretical arguments supporting the validity of the MVN joint distribution of summary statistics under no association for a broad class of generalized regression models [6].
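The covariate adjustment described earlier in this section (invert the correlation matrix of all predictors, select the SNP submatrix, invert it back, and standardize to correlation) can be sketched in Python as below. The function name `stat_correlation_with_covariates` and the argument `snp_idx` are hypothetical names chosen for this example; the steps follow the verbal description in the text.

```python
import numpy as np

def stat_correlation_with_covariates(R_all, snp_idx):
    """Correlation among SNP statistics when models include covariates.

    R_all   : covariance or correlation matrix of all predictors (SNPs + covariates)
    snp_idx : positions of the SNPs within R_all
    """
    R_all = np.asarray(R_all, dtype=float)
    precision = np.linalg.inv(R_all)                   # step 1: invert
    sub = precision[np.ix_(snp_idx, snp_idx)]          # step 2: select SNP submatrix
    schur = np.linalg.inv(sub)                         # step 3: invert back (Schur complement)
    scale = 1.0 / np.sqrt(np.diag(schur))
    return schur * np.outer(scale, scale)              # step 4: standardize to correlation
```

When `snp_idx` covers every predictor (no covariates), the function returns the input correlation matrix unchanged, which matches the statement that Σ then equals the LD matrix.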

Distribution of sums of association summary statistics

As we noted at the beginning of the “Materials and Methods” section, weighted sums of summary statistics can be re-expressed as unweighted sums, where the mean and the correlation parameters are modified to absorb the weights. The distribution of $TQ = Y'Y = \sum_{i=1}^{L} Y_i^2$, with $Y \sim \mathrm{MVN}(\mu_Y, \Sigma_Y)$, follows the weighted sum of independent one degree of freedom non-central chi-square random variables. Although this result is standard, the components of this weighted sum depend on the joint distribution of association summary statistics under the alternative hypothesis, and this distribution has not been previously derived. In the previous section, we provided the components of $\mu_Y$ and $\Sigma_Y$ that determine the weights and the noncentralities of the chi-squares. Therefore,

$$\Pr(TQ > t) = \Pr\left(\sum_{i=1}^{L} \lambda_i\,\chi^2_1(\gamma_i) > t\right),$$

where the weights, $\lambda_i$, are the eigenvalues of $\Sigma_Y$ and $\gamma$, with elements $\gamma_i = (e_i'\mu_Y)^2/\lambda_i$, is the vector of non-centrality parameters. The columns $e_i$ of the matrix E are orthogonalized and normalized eigenvectors of $\Sigma_Y$. The P-value for the statistic TQ = Y′Y is obtained by setting $\gamma$ to zero and then calculating this tail probability at the observed value TQ = t. Note that the elements in $\Sigma_Y$, and therefore the eigenvectors, the eigenvalues λ, and the noncentralities explicitly depend on the β-coefficients through Eqs (2) and (5). Our decorrelation approach uses a symmetric orthogonal transformation of the vector of statistics Y to a new vector X, with the new joint statistic based on the sum of squared elements of X, $\mathrm{DOT} = \sum_{i=1}^{L} X_i^2$. The orthogonal transformation is defined as follows. Let $D = \mathrm{diag}(1/\sqrt{\lambda_i})$ and define X = H Y, where H = E D E′. The squared values, $X_i^2$, are one degree of freedom independent chi-square variables, thus DOT = X′X is a chi-square random variable with L degrees of freedom and noncentrality value of:

$$\mu_Y' H' H \mu_Y = \mu_Y' \Sigma_Y^{-1} \mu_Y.$$

The cumulative distribution of the new test statistic is thus

$$\Pr(\mathrm{DOT} \le x) = F_{\chi^2_L}\left(x;\ \mu_Y' \Sigma_Y^{-1} \mu_Y\right),$$

the noncentral chi-square distribution function with L degrees of freedom. There are many ways to choose an orthogonal transformation, but a valid one for our purposes needs to have the following “invariance to order” property. Suppose we sample an equicorrelated MVN vector Y with a common correlation ρ for all pairs of variables. 
Before decorrelating the vector, we permute its values to a different order. A permutation in this example is a legitimate operation, because an equicorrelation structure does not suggest a particular order of Y values. After an orthogonal transformation of Y to X, the order of X entries may change due to permutation but their values should remain the same. Moreover, for the method to be useful in practice, we need the invariance to hold for a more general class of statistics than a simple sum of chi-squares, $\sum_{i=1}^{L} X_i^2$. For example, the Rank Truncated Product (RTP) is a powerful P-value combination method [12] that emphasizes small P-values: the RTP statistic $T_{RTP}$ is the product of the k smallest P-values, k < L, or equivalently, $-\ln T_{RTP} = \sum_{i=1}^{k} -\ln P_{(i)}$, where $P_{(1)} \le P_{(2)} \le \dots \le P_{(L)}$. Note that −ln(P) is no longer a one degree of freedom chi-square variable. Since DOT produces a set of independent one degree of freedom chi-squares, to use it with RTP, one can convert the set of chi-squares to P-values and take the product of the k smallest values, which is the RTP statistic. The “invariance to order” requirement implies that the value of the DOT statistic should not change due to a permutation of (equicorrelated) values in Y. Not all orthogonal transformations meet the invariance to order criterion. It can be easily verified that neither the inverse Cholesky factor ($C^{-1}$) transformation, $X = C^{-1} Y$, nor another commonly used transformation, $X = \Lambda^{-1/2} E' Y$ (where Λ is the diagonal matrix of eigenvalues and E is the matrix of eigenvectors), has the invariance to order property, except in the special case of the sum of L chi-squared variables, $\sum_{i=1}^{L} X_i^2$. To clarify, we call this statistic “the special case,” because, for example, in the case of RTP with k = L, the statistic is no longer the sum of one degree of freedom chi-squares. Moreover, some transformations of equicorrelated data to independence, such as the Helmert transformation, may change values of X depending on the order of values in Y, even in a special equicorrelation case of ρ = 0 (i.e., when variables in Y are independent). 
The proposed H, as defined above, has the invariance to order property and can also be used with P-value transformations other than that to the one degree of freedom chi-square.
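The invariance to order property can be checked numerically. The Python sketch below builds the symmetric transformation H = E D E′ with D = diag(1/√λ), compares it with the inverse Cholesky factor on an equicorrelated example, and evaluates the RTP statistic on the decorrelated values. The function names (`decorrelate_dot`, `decorrelate_cholesky`, `dot_pvalue`, `rtp`) are hypothetical, chosen only for this illustration.

```python
import numpy as np
from scipy.stats import chi2

def decorrelate_dot(y, Sigma):
    """X = H y with H = E diag(1/sqrt(lam)) E', the symmetric inverse square root."""
    lam, E = np.linalg.eigh(Sigma)
    H = E @ np.diag(1.0 / np.sqrt(lam)) @ E.T
    return H @ np.asarray(y, dtype=float)

def decorrelate_cholesky(y, Sigma):
    """X = C^{-1} y, where Sigma = C C' and C is the lower Cholesky factor."""
    C = np.linalg.cholesky(Sigma)
    return np.linalg.solve(C, np.asarray(y, dtype=float))

def dot_pvalue(y, Sigma):
    """P-value of DOT = X'X, referred to a central chi-square with L df."""
    x = decorrelate_dot(y, Sigma)
    return chi2.sf(x @ x, df=len(x))

def rtp(x, k):
    """Rank truncated product: product of the k smallest one-df P-values."""
    p = chi2.sf(np.asarray(x) ** 2, df=1)
    return float(np.prod(np.sort(p)[:k]))

# Equicorrelated example: permuting y permutes the entries of H y but leaves
# their values unchanged, whereas C^{-1} y changes values.
L, rho = 4, 0.5
Sigma = (1 - rho) * np.eye(L) + rho * np.ones((L, L))
y = np.array([2.0, -1.0, 0.5, 1.5])
perm = np.array([3, 2, 1, 0])
same_values_H = np.allclose(np.sort(decorrelate_dot(y, Sigma)),
                            np.sort(decorrelate_dot(y[perm], Sigma)))
same_values_C = np.allclose(np.sort(decorrelate_cholesky(y, Sigma)),
                            np.sort(decorrelate_cholesky(y[perm], Sigma)))
```

Both transformations give the same sum of squares, which is why the difference between them only shows up for statistics such as RTP that depend on the individual decorrelated values.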

Theoretical analysis of power

For exploration of power properties, it is useful to first consider the equicorrelation case, because in this case it is possible to derive illustrative equations that relate power to: (1) the number of SNPs, L; (2) the common correlation value for every pair of SNPs, ρ; and (3) the mean values of association statistics, $\mu$. In the equicorrelation case, the correlation matrix can be expressed as $R = (1-\rho) I + \rho \mathbf{1}\mathbf{1}'$. The eigenvalue vector of $R$ has length L but only two distinct values, λ = {1 + ρ(L − 1), 1 − ρ, …, 1 − ρ}. For the decorrelated statistic DOT, we derived a simple form of the L noncentralities by utilizing the Helmert orthogonal eigenvectors [81, 82] as follows:

$$\delta_1 = \frac{L\bar{\mu}^2}{1 + (L-1)\rho}, \qquad \delta_i = \frac{\left(\sum_{j=1}^{i-1}\mu_j - (i-1)\mu_i\right)^2}{i(i-1)(1-\rho)}, \quad i = 2, \dots, L,$$

where $\bar{\mu}$ is the average of the values in $\mu$. Next, let

$$\sum_{i=2}^{L} \delta_i = \frac{(L-1)\,\bar{d}}{2(1-\rho)},$$

where $\bar{d}$ is the average of $d_{ij} = (\mu_i - \mu_j)^2$ over all pairs of $\mu_i$ and $\mu_j$ such that i < j. The values in d are the pairwise squared differences in the standardized effect values as captured by the vector $\mu$. This representation yields the noncentrality of DOT as a function of the common correlation and the mean standardized effect size as:

$$\sum_{i=1}^{L}\delta_i = \frac{L\bar{\mu}^2}{1 + (L-1)\rho} + \frac{(L-1)\,\bar{d}}{2(1-\rho)}. \qquad (13)$$

Note that as L increases, the first term in Eq (13) approaches $\bar{\mu}^2/\rho$, while the sum of the remaining noncentralities, $\sum_{i=2}^{L}\delta_i$, increases linearly with L, as long as the average of the squared effect size differences, $\bar{d}$, does not depend on L. Thus, the noncentrality of the decorrelated statistic DOT is expected to steadily increase with L and become approximately $\bar{\mu}^2/\rho + (L-1)\bar{d}/(2(1-\rho))$. Next, we consider the distribution of the statistic TQ = Y′Y. Note that $\sum_{i}\gamma_i = \sum_{i}\delta_i$, where γ’s are the noncentralities for TQ and δ’s are the noncentralities of DOT. In the equicorrelation case, the distribution of TQ reduces to the weighted sum of two chi-square variables, because there are only two distinct eigenvalues that correspond to $R$, namely:

$$TQ \sim [1 + (L-1)\rho]\;\chi^2_1(\gamma_1) + (1-\rho)\;\chi^2_{L-1}\left(\sum_{i=2}^{L}\gamma_i\right). \qquad (15)$$

The second term in Eq (15), scaled by L − 1, approaches a constant as L increases ($1-\rho$ under the null hypothesis). 
Therefore, under the null hypothesis, the distribution of the quadratic form Y′Y can be well approximated by the location-scale transformation of the one degree of freedom chi-squared random variable:

$$Y'Y \approx [1 + (L-1)\rho]\;\chi^2_1 + (L-1)(1-\rho),$$

so that the critical value at level α is approximately $[1 + (L-1)\rho]\,\chi^2_{1,1-\alpha} + (L-1)(1-\rho)$, where $\chi^2_{1,1-\alpha}$ is the 1 − α quantile of the one degree of freedom chi-square distribution. To summarize, we just showed that the distribution of the decorrelated set of variables gains in the total noncentrality with L, while the distribution of the sum Y′Y depends heavily only on the noncentrality of the first term, $\gamma_1$. The approximate power of the test based on the statistic TQ = Y′Y can be computed as $1 - \Psi(t)$, where t is the corresponding critical value, and Ψ(⋅) is a one degree of freedom chi-square CDF with the noncentrality $L\bar{\mu}^2/((L-1)\rho + 1)$, evaluated at t. The ceiling noncentrality value γ*, as L → ∞, is thus $\gamma^* = \bar{\mu}^2/\rho$.

Let us re-emphasize the point that a test based on the distribution of the TQ statistic is expected to be less powerful than DOT in the presence of heterogeneity among effect sizes. Heterogeneity in LD will contribute to the difference in power. Starting with an equicorrelation model, with the correlation matrix $R = (1-\rho) I + \rho \mathbf{1}\mathbf{1}'$, we can introduce perturbations to the common value, ρ > 0, by adding noise derived from a rank-one matrix U U′, where U is a vector of random numbers. Specifically, perturbations can be added as $B = R + UU'$. Next, B should be standardized to correlation as $\mathrm{diag}(B)^{-1/2}\, B\, \mathrm{diag}(B)^{-1/2}$. When elements in U are close to zero, the matrix B deviates from $R$ by only a small jiggle around ρ. Matrix B provides a way to construct random correlation matrices in a controlled manner, where the degree of departure from the equicorrelation is controlled via the range of the elements in U. The utility of B is that it represents a perturbation of $R$, and we expect our power results under the equicorrelation case to hold approximately, at least for small jiggles around ρ. Nevertheless, it turns out that even for a more general correlation structure, our power approximations still hold, which we show via extensive simulation studies.
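The equicorrelation results can be verified numerically. The Python sketch below assumes the closed form implied by the two distinct eigenvalues of the equicorrelation matrix, namely a first term Lμ̄²/(1 + (L − 1)ρ) plus a remainder (L − 1)d̄/(2(1 − ρ)), and compares it with the direct quadratic form μ′R⁻¹μ. It also constructs the perturbed correlation matrix B. All function names are hypothetical, and the vector of means `mu` used for illustration is arbitrary rather than taken from the paper.

```python
import numpy as np
from scipy.stats import chi2, ncx2

def equicorr(L, rho):
    """Equicorrelation matrix R = (1 - rho) I + rho 11'."""
    return (1 - rho) * np.eye(L) + rho * np.ones((L, L))

def dot_noncentrality_terms(mu, rho):
    """Closed-form split of the DOT noncentrality under equicorrelation."""
    mu = np.asarray(mu, dtype=float)
    L = mu.size
    i, j = np.triu_indices(L, k=1)
    d_bar = np.mean((mu[i] - mu[j]) ** 2)           # average squared pairwise difference
    first = L * mu.mean() ** 2 / (1 + (L - 1) * rho)
    rest = (L - 1) * d_bar / (2 * (1 - rho))
    return first, rest

def dot_power(mu, R, alpha=0.05):
    """Power of DOT: noncentral chi-square with L df and ncp = mu' R^{-1} mu."""
    mu = np.asarray(mu, dtype=float)
    ncp = mu @ np.linalg.solve(R, mu)
    crit = chi2.ppf(1 - alpha, df=mu.size)
    return float(ncx2.sf(crit, df=mu.size, nc=ncp))

def perturb(R, U):
    """B = R + UU', then standardized back to a correlation matrix."""
    B = np.asarray(R, dtype=float) + np.outer(U, U)
    s = 1.0 / np.sqrt(np.diag(B))
    return B * np.outer(s, s)
```

With heterogeneous means, the `rest` term grows roughly linearly in L while `first` is bounded by μ̄²/ρ, which is the ceiling behaviour of the leading TQ noncentrality described above.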

LD patterns from the 1000 Genomes Project

In a separate set of simulation experiments, we utilized realistic LD patterns using data from the 1000 Genomes Project [83]. For every simulation experiment, we selected a random set of consecutive SNPs from a chromosome 17 region that spanned over 100 Kb and included SNPs from the gene FGF11 to the gene NDEL1. There was no particular reason for choosing this chromosome, but we expect our results to be generalizable to other regions of the genome in the sense that the LD structure among SNPs on chromosome 17 is representative of LD throughout the genome. Perhaps more important, and a potential limitation of our simulations, is the choice of the association model. That is, the model assumed high heterogeneity in effect sizes, and statistics were combined for proxy SNPs only (those SNPs with zero effect sizes). Each stretch of consecutive SNPs contained from 10 to 200 SNPs with the minimum allele frequency of 0.025. A random portion of SNPs in every set carried no effect on the outcome on its own, and we considered these SNPs to be proxies for causal variants due to LD. The median LD correlation varied from approximately -0.6 to 0.98 between random stretches of SNPs. The number of proxy SNPs varied from 3 to 197 across simulations. The sample size was also set to be random and varied from 500 to 3000 across simulations. Effect sizes for causal variants were modeled by β-coefficients, as given by Eq (1), and drawn randomly from the interval [-0.4, 0.4]. Different combinations of the number of causal SNPs, their individual effect sizes and LD patterns among them resulted in the total proportion of phenotypic variance explained (i.e., the multiple correlation coefficient) varying from $10^{-5}$% (fifth percentile) to 7% (ninety-fifth percentile) with the mean value of 2.5% and the median value of 1%. Summary statistics were sampled from the MVN distribution with parameters given by Eqs (2) and (4). 
To check the validity of our approach of sampling the summary statistics directly, we first conducted a separate set of extensive simulation experiments, in which power and type-I error rates were obtained by simulating individual data and then TQ and DOT statistics were computed by running the actual regression analysis. We confirmed excellent agreement between the two approaches, thus most of the subsequent simulations were conducted by sampling the summary statistics directly (these results are not shown here). 25 Oct 2019 Dear Dr Zaykin, Thank you very much for submitting your manuscript 'DOT: Gene-set analysis by combining decorrelated association statistics' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts. 
To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Jennifer Listgarten
Associate Editor
PLOS Computational Biology

Thomas Lengauer
Methods Editor
PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: In this paper, the authors present a new summary-statistics-based method for testing a group of common SNPs in aggregate for association to a phenotype. Unlike previous approaches, the authors' test statistic explicitly (and exactly) removes correlation between the individual SNPs' summary statistics.

I generally like this paper and appreciate the authors' precision and rigor in deriving and presenting their method. Their theoretical results concerning the power of their test as well as others are also a valuable contribution. So I generally feel this is a very solid contribution to the field. In the long-term I would suggest that the authors consider applications of their framework beyond set-testing since my impression is that the growing number of highly significant associations between *individual* SNPs and phenotypes will eventually cause set-testing to decline as an approach in the common-variant realm.
But this is beyond the scope of this paper and for now there remains a substantial community of users of set tests who could benefit from the approach described by the authors.

Regarding the technical substance of the paper, I have the following major comments:

- I'm unclear on the phenomenon whereby TQ tests don't experience an increase in power as more SNPs are added to the model, e.g., in Setting 1. Looking at the authors' model, in which the variance of the environmental noise, epsilon, is set at 1, it would seem that the more SNPs I add to the model with non-trivial effects, the more phenotypic variance is produced by the genetics. In the limit of infinite SNPs and constant-magnitude environmental noise then, the phenotype should be deterministically set by genotype. It would seem unintuitive that in this situation the TQ tests wouldn't have full power. What am I missing? Are the authors scaling something somewhere?

- Relatedly, it would help if the authors included in their methods section more detailed descriptions of their simulation set ups especially including sample size and proportion of phenotypic variance explained by genotype for each simulation (including the simulations with real genotypes).

- I don't know if the proportion of variance explained by genotype is high in the authors' simulations. But if it is, do they expect their results to generalize to settings where this is not the case? For real traits, any one set of tens to hundreds of contiguous SNPs typically only explains a very small proportion (on the order of 1%, usually even less than that) of phenotypic variance, so I'd be interested to see if this is the case in the simulations here. Sometimes it's okay to simulate small sections of genome explaining high proportions of phenotypic variance as long as sample size is lowered in some corresponding way, but if this is the case here the authors should explain and perhaps use their theory to justify.
- How do the authors expect their statistic to behave in the presence of near-perfect LD? It seems they don't regularize their LD matrix, which surprised me. I would be interested to see power results under a simulation setting where two SNPs, only one of which is causal and contains 75% of the causal signal in the locus, have a) 99% correlation and b) 100% correlation.

- For the simulations with real genotypes, how was the 100kb region on chromosome 17 chosen? Do the authors expect the simulation results to generalize to other regions of the genome as well? If they are unsure, is it computationally feasible to do simulations where random sets of contiguous SNPs are chosen from the whole genome?

- How were the genes ESR1, FGFR2, RAD51B, and TOX3 chosen by the authors for demonstration of their method? Does this set include all the genes found in the Min et al paper to have association with breast cancer? Would it be possible to test a larger set of genes chosen more systematically so that readers can have a sense for whether the authors' approach should in general be preferred over other approaches? Or perhaps to test a few genes chosen by authors of other set testing methods papers?

- Do the authors think it would make sense to compare (either in simulation or in practice) to the gene-level test in de Leeuw 2016 PLOS Comp Bio since that method also provides a way to test the SNPs surrounding an individual gene for association while accounting for correlation between variants in order to boost power? Relatedly: ACAT seems to be a method intended primarily for testing of rare variants in sequence data; could it be that this makes it an inappropriate comparison point?

- I liked the way the authors argued for their particular choice of pseudoinverse by suggesting that exchangeability of SNPs should be preserved by this operation. Kudos!
I also have the following minor comments:

- It seems that the claims about the scaling of power as a function of L are for fixed rho > 0, because when rho=0 the tests considered are equivalent. The authors may want to clarify this.

- In the definition of r_ij on page 2, should there be a square-root in the denominator?

- On page 3 there is a typo in "This general idea is straightforward and HAVE been used..." (emphasis mine)

- What was the sample size of the breast cancer data set that the authors analyzed?

- In Equations 4 and 5, rho_ij appears on both sides of the equations.

- The derivation of the covariance matrix of the vector of summary statistics can be carried out without the delta method but under the assumption of Gaussian genotypes (which is justifiable for large sample size and MAF bounded away from zero). See Proposition 2 in the supplement of Reshef et al 2018 Nat Genet. The authors may wish to comment on whether these two derivations give different results and, if so, why.

- For the results in Table 6: 1) which set of genotypes were the phenotypes simulated from? 2) Which set of genotypes was used as the reference panel? The only genotypes I saw mentioned were 1000 Genomes, but two distinct sets of genotypes are required for the described analysis.

Reviewer #2: Zaykin et al propose DOT, a new method for Gene Based Association Testing. There is demand for a gene (or set-based) method, so a method that improves upon previous methods would be of much interest and (with easy to use software) could become highly used. Zaykin et al perform many simulations to show that DOT has the potential to improve on a state-of-the-art method, VEGAS (and also ACAT, a method I am not familiar with). They also have a real data example, but this is very limited. While I am not convinced from this draft alone, I believe that by including an extra simulation method, and a more convincing application, DOT could be a useful addition to the field.
Major points

Reading the method (and apologies that I did not understand all the details), DOT appears similar to methods which first compute principal components for each gene (i.e., eigen-decompose the SNP-SNP correlation matrix), then regress the phenotype on these (consider the following paper, or derivatives: https://onlinelibrary.wiley.com/doi/pdf/10.1002/gepi.20219). Thus I require convincing that this method is different to, or an improvement on, those.

The format of the paper makes it challenging to read. Usually methods would come before results. However, if the journal requires such a style, then you must give some brief details at the start of results.

I consider there to be insufficient detail of the simulations. For example, I can't see sample size, and rho was hard to find. Is it the case for all simulations that all L SNPs are assigned effects, or just the first one?

It is good you compare with VEGAS (TQ?). But to my knowledge, the most common methods are SCAT or MAGMA, and my preferred is Fast-LMM-Set, so I would ideally like at least one of these considered (or a statement with justification that these are very similar to VEGAS).

The application is very limited. While I appreciate there is justification for the choice, unfortunately it looks odd to consider only four handpicked loci, rather than perform a genome-wide analysis.

I believe you require odds ratios for the SNPs in Table 8 (ideally from multi-SNP analysis and perhaps those from single-SNP analysis).

Minor Points

I applaud the range of simulations, and also the consideration of situations where DOT is not well-suited.

I also like the insight into how DOT has the potential to gain power (when there is a wide spectrum of effect sizes, which is thought likely to be the case with complex traits).

In the simulations, it is hard to understand the effect sizes.
Can you instead report in terms of heritability, ideally both (average) phenotypic variance explained by the gene/region, and (average) variance explained by the most significant individual SNP?

The tables (and I think figures) require captions. In general, these should give a full description (or if the same, say "see Table 1... etc"), rather than relying on the user to parse through the main text.

Good that a GitHub page is provided with software (although I have not tested it).

Please provide a summary of run time for a decent-sized analysis.

Very Minor Points

Intro; "It is important to distinguish situations ...": I suggest you replace the second "in which" with "from those" or something similar.

I would prefer if you provided more thresholds when testing the false positive rates (e.g., show not just alpha 1e-4, but also say 0.05, maybe a few others, in the supplement if necessary).

It is good you can accommodate covariates, but is this feature used in the application?

Signed Doug Speed

Reviewer #3: In this manuscript, the authors combined single-SNP summary statistics in order to conduct joint analysis of a set of SNPs without accessing original genotype-phenotype datasets. To develop an efficient overall summary statistic, the authors used a decorrelation trick to simplify the correlation structure of the vector of single-SNP summary statistics. The latter are correlated by construction. Thus, by rotating this vector over the eigenvectors of its corresponding correlation matrix, one can simplify its correlation structure. Although the decorrelation trick for a response vector is not a new concept (it has been used for the kinship matrix several times in linear mixed models in the presence of familial data, e.g. FastLMM), the theoretical and analytical development of the DOT p-values in this manuscript is relevant in the context of summary-statistic association.

Major and minor comments are detailed in a PDF file attached to this review.
Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: Yes: Doug Speed
Reviewer #3: No

Submitted filename: PlosComBioRepot.pdf

8 Jan 2020

Submitted filename: response_to_reviewers.pdf

25 Feb 2020

Dear Dr Zaykin,

Thank you very much for submitting your manuscript "DOT: Gene-set analysis by combining decorrelated association statistics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations, and in particular those of reviewer #1.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jennifer Listgarten
Associate Editor
PLOS Computational Biology

Thomas Lengauer
Methods Editor
PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Overall the authors have addressed my theoretical and methods-related concerns quite well in this revision. However, I still have serious reservations about the authors' analysis of real data, which analyses a very small set of genes that were not chosen systematically. I previously wrote: "Would it be possible to test a larger set of genes chosen more systematically so that readers can have a sense for whether the authors’ approach should in general be preferred over other approaches?
Or perhaps to test a few genes chosen by authors of other set testing methods papers?" The authors did not perform this analysis, and so I still do not know whether their method is more powerful than existing methods beyond the very small set of genes they have analyzed. (The addition in revision of a second phenotype, cleft lip, analyzed in the same way as the first phenotype did not give me a better global sense for why people should use this method.)

My understanding of what the authors have shown is that:

a) DOT assigns lower p-values than other methods do to the 4 selected breast cancer genes. This seems weak to me first because lower p-values don't necessarily correspond to higher power (a method can give very low p-values on 1% of alternatives but fail to reject the null the rest of the time), and second because these genes have already been prioritized by other methods, suggesting that their connection to breast cancer is not a new discovery enabled by DOT. For example, these genes seem from the text to harbor previously reported risk SNPs. Am I missing something?

b) DOT can point at new SNPs associated with breast cancer and cleft lip at these known loci (Tables 10 and 12). But the authors also state (appropriately) that since these results don't come with p-values they should be interpreted with caution, and they also state that they cannot conclude that these SNPs are causal but rather only additional proxy SNPs. So I'm unsure what we can confidently learn from these results.

I personally don't find (a) or (b) to be strong reasons that practitioners should use DOT.

Overall, I see two ways forward:

1. The authors can carry out a systematic analysis of the performance of their method on real data. For example, they could run the method on a larger set of genes (e.g., all protein coding genes, or all genes expressed in breast tissue, or a set of genes benchmarked in other set testing papers).
This would allow the authors to say things like "in a systematic analysis, our method identified X genes to be in loci that are significantly associated with breast cancer, while competing methods identified only Y such genes." I think this would make a much stronger case for the use of this method. And if it's not true, then that is important for potential users to know even if it doesn't preclude publication of the paper.

2. Alternatively, recognizing they have performed extensive revisions already, the authors can add a statement explaining that the genome-wide performance of their method is yet-uncharacterized and would be important to assess in future work.

I suppose it would be okay to publish the paper in case 2, but my opinion is that I would be less excited about it. Not answering the central question of whether DOT is more powerful than other methods in practice on real data is not consistent with the otherwise high level of statistical rigor in this potentially interesting paper.

Minor comments:

- Just above Table 1, you have a typo: "the column labeled \hat\gamma provide the average noncentrality value" ("provide" should be "provides")

- In the sentence “Different combinations of sample size, the number of causal SNPs, their individual effect sizes and LD patterns among them, resulted in total proportion of phenotypic variance explained...", whose addition I appreciate in this revision, sample size should not be enumerated as one of the parameters that affects the total proportion of phenotypic variance explained.

- On page 10, you cite "Min et al. [27, 28]" but neither of refs. 27 or 28 has Min as the last name of a first author in your bibliography.

- In your response to R1.1.6, you state that eqns 22 and 27 in Reshef et al. 2018 are derived under the null, but this is not true: Eq 22 defines the computation of summary statistics from data (regardless of model) and Equation 27 includes a parameter beta which can be non-zero.
A question therefore remains about the relationship between your derivation and the derivation that assumes Gaussian genotypes. (Fine if you want to drop this issue.)

Reviewer #2: The authors have made a careful response and I am happy with the changes.

Reviewer #3: No additional comments

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: None

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: Yes: Doug Speed
Reviewer #3: No

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript.
Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

6 Mar 2020

Submitted filename: response_to_reviewers.pdf

23 Mar 2020

Dear Dr Zaykin,

We are pleased to inform you that your manuscript 'DOT: Gene-set analysis by combining decorrelated association statistics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Jennifer Listgarten
Associate Editor
PLOS Computational Biology

Thomas Lengauer
Methods Editor
PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: I thank the authors for their revision, and I am happy to recommend acceptance given the clarifications the authors made about their analysis of real data. Setting aside this one point of disagreement, I feel this is very high quality work and I commend the authors on their valuable contribution to the field.

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Do you want your identity to be public for this peer review?

Reviewer #1: No

6 Apr 2020

PCOMPBIOL-D-19-01433R2

DOT: Gene-set analysis by combining decorrelated association statistics

Dear Dr Zaykin,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Table 7

Type-I error rates (α = 10⁻⁴) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

Sample size	TQ	DOT	ACAT
N = 5L	9 × 10⁻⁵	5 × 10⁻⁴	1 × 10⁻⁴
N = 10L	9 × 10⁻⁵	4 × 10⁻⁴	1 × 10⁻⁴
N = 50L	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁴
N = 100L	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁴

71 in total

1. Fine scale mapping of the breast cancer 16q12 locus.

Authors: Miriam S Udler; Shahana Ahmed; Catherine S Healey; Kerstin Meyer; Jeffrey Struewing; Melanie Maranian; Erika M Kwon; Jinghui Zhang; Jonathan Tyrer; Eric Karlins; Radka Platte; Bolot Kalmyrzaev; Ed Dicks; Helen Field; Ana-Teresa Maia; Radhika Prathalingam; Andrew Teschendorff; Stewart McArthur; David R Doody; Robert Luben; Carlos Caldas; Leslie Bernstein; Laurence K Kolonel; Brian E Henderson; Anna H Wu; Loic Le Marchand; Giske Ursin; Michael F Press; Annika Lindblom; Sara Margolin; Chen-Yang Shen; Show-Lin Yang; Chia-Ni Hsiung; Daehee Kang; Keun-Young Yoo; Dong-Young Noh; Sei-Hyun Ahn; Kathleen E Malone; Christopher A Haiman; Paul D Pharoah; Bruce A J Ponder; Elaine A Ostrander; Douglas F Easton; Alison M Dunning
Journal: Hum Mol Genet Date: 2010-03-23 Impact factor: 6.150

2. Fine mapping of 14q24.1 breast cancer susceptibility locus.

Authors: Phoebe Lee; Yi-Ping Fu; Jonine D Figueroa; Ludmila Prokunina-Olsson; Jesus Gonzalez-Bosquet; Peter Kraft; Zhaoming Wang; Kevin B Jacobs; Meredith Yeager; Marie-Josèphe Horner; Susan E Hankinson; Amy Hutchinson; Nilanjan Chatterjee; Montserrat Garcia-Closas; Regina G Ziegler; Christine D Berg; Saundra S Buys; Catherine A McCarty; Heather Spencer Feigelson; Michael J Thun; Ryan Diver; Ross Prentice; Rebecca Jackson; Charles Kooperberg; Rowan Chlebowski; Jolanta Lissowska; Beata Peplonska; Louise A Brinton; Margaret Tucker; Joseph F Fraumeni; Robert N Hoover; Gilles Thomas; David J Hunter; Stephen J Chanock
Journal: Hum Genet Date: 2011-09-30 Impact factor: 4.132

3. Simulating Ordinal Data.

Authors: Pier Alda Ferrari; Alessandro Barbiero
Journal: Multivariate Behav Res Date: 2012-07 Impact factor: 5.923

4. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.

Authors: Bingshan Li; Suzanne M Leal
Journal: Am J Hum Genet Date: 2008-08-07 Impact factor: 11.025

5. Analyzing 395,793 samples shows significant association between rs999737 polymorphism and breast cancer.

Authors: Haiying Dong; Zhiying Gao; Chengchong Li; Junping Wang; Ming Jin; Hua Rong; Yingcai Niu; Jicheng Liu
Journal: Tumour Biol Date: 2014-04-12

6. Genetic risk factors for nonsyndromic cleft lip with or without cleft palate in a Mesoamerican population: Evidence for IRF6 and variants at 8q24 and 10q25.

Authors: Augusto Rojas-Martinez; Heiko Reutter; Oscar Chacon-Camacho; Rafael B R Leon-Cachon; Sergio G Munoz-Jimenez; Stefanie Nowak; Jessica Becker; Ruth Herberz; Kerstin U Ludwig; Mario Paredes-Zenteno; Abelardo Arizpe-Cantú; Susanne Raeder; Stefan Herms; Rocio Ortiz-Lopez; Michael Knapp; Per Hoffmann; Markus M Nöthen; Elisabeth Mangold
Journal: Birth Defects Res A Clin Mol Teratol Date: 2010-07

7. Association between IRF6 and nonsyndromic cleft lip with or without cleft palate in four populations.

Authors: Ji Wan Park; Iain McIntosh; Jacqueline B Hetmanski; Ethylin Wang Jabs; Craig A Vander Kolk; Yah-Huei Wu-Chou; Philip K Chen; Samuel S Chong; Vincent Yeow; Sun Ha Jee; Beyoung Yun Park; M Daniele Fallin; Roxann Ingersoll; Alan F Scott; Terri H Beaty
Journal: Genet Med Date: 2007-04 Impact factor: 8.822

8. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1).

Authors: Gilles Thomas; Kevin B Jacobs; Peter Kraft; Meredith Yeager; Sholom Wacholder; David G Cox; Susan E Hankinson; Amy Hutchinson; Zhaoming Wang; Kai Yu; Nilanjan Chatterjee; Montserrat Garcia-Closas; Jesus Gonzalez-Bosquet; Ludmila Prokunina-Olsson; Nick Orr; Walter C Willett; Graham A Colditz; Regina G Ziegler; Christine D Berg; Saundra S Buys; Catherine A McCarty; Heather Spencer Feigelson; Eugenia E Calle; Michael J Thun; Ryan Diver; Ross Prentice; Rebecca Jackson; Charles Kooperberg; Rowan Chlebowski; Jolanta Lissowska; Beata Peplonska; Louise A Brinton; Alice Sigurdson; Michele Doody; Parveen Bhatti; Bruce H Alexander; Julie Buring; I-Min Lee; Lars J Vatten; Kristian Hveem; Merethe Kumle; Richard B Hayes; Margaret Tucker; Daniela S Gerhard; Joseph F Fraumeni; Robert N Hoover; Stephen J Chanock; David J Hunter
Journal: Nat Genet Date: 2009-03-29 Impact factor: 38.330

9. Identification of common non-coding variants at 1p22 that are functional for non-syndromic orofacial clefting.

Authors: Huan Liu; Elizabeth J Leslie; Jenna C Carlson; Terri H Beaty; Mary L Marazita; Andrew C Lidral; Robert A Cornell
Journal: Nat Commun Date: 2017-03-13 Impact factor: 14.919

10. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk.

Authors: Yakir A Reshef; Hilary K Finucane; David R Kelley; Alexander Gusev; Dylan Kotliar; Jacob C Ulirsch; Farhad Hormozdiari; Joseph Nasser; Luke O'Connor; Bryce van de Geijn; Po-Ru Loh; Sharon R Grossman; Gaurav Bhatia; Steven Gazal; Pier Francesco Palamara; Luca Pinello; Nick Patterson; Ryan P Adams; Alkes L Price
Journal: Nat Genet Date: 2018-09-03 Impact factor: 38.330

2 in total

1. A comprehensive comparison of multilocus association methods with summary statistics in genome-wide association studies.

Authors: Zhonghe Shao; Ting Wang; Jiahao Qiao; Yuchen Zhang; Shuiping Huang; Ping Zeng
Journal: BMC Bioinformatics Date: 2022-08-30 Impact factor: 3.307

2. A flexible summary statistics-based colocalization method with application to the mucin cystic fibrosis lung disease modifier locus.

Authors: Fan Wang; Naim Panjwani; Cheng Wang; Lei Sun; Lisa J Strug
Journal: Am J Hum Genet Date: 2022-01-21 Impact factor: 11.025

2 in total